Step 1: Scientific Dataset Preparation and Labeling
Scientific Dataset Selection: We specifically selected complex scientific-domain queries from the publicly available Nectar dataset. Nectar provides a diverse collection of scientifically rigorous questions with detailed model-generated responses, including over 191K expert-level responses from GPT-4 and related large models.
Data Labeling Approach (1–3 Rating Scale):
To best capture Mistral’s capability to respond accurately to each scientific query, we adopted a simplified yet robust rating scale:
- Score 3 (High Quality): Mistral provides scientifically accurate, detailed, and comprehensive answers.
- Score 2 (Moderate Quality): Mistral answers adequately but lacks complete depth or nuance.
- Score 1 (Low Quality): Mistral significantly struggles, indicating that the complexity and rigor of the query necessitate routing to the OpenAI o3-mini-high model.
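The rubric above maps directly to training targets for the router. Below is a minimal sketch of how a labeled query could be turned into a (prompt, target) pair for fine-tuning; the record fields and prompt wording are illustrative assumptions, not the exact format of the Nectar data or our training prompts:

```python
from dataclasses import dataclass

# Hypothetical record format; field names are illustrative, not Nectar's schema.
@dataclass
class LabeledQuery:
    query: str
    score: int  # 1-3 per the rubric above

    def needs_escalation(self) -> bool:
        # Score 1 means Mistral struggles, so the query is routed to o3-mini-high.
        return self.score == 1

def to_router_example(item: LabeledQuery) -> dict:
    """Convert a labeled query into a (prompt, target) pair for router training."""
    prompt = (
        "Rate how well Mistral-8x7B would answer the following scientific "
        "query on a 1-3 scale (3 = high quality, 1 = low quality).\n\n"
        f"Query: {item.query}\nRating:"
    )
    return {"prompt": prompt, "target": str(item.score)}

example = to_router_example(LabeledQuery("Derive the Euler-Lagrange equation.", 2))
```

Because the target is a single rating token, the same example format works for any instruction-tuned base model.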
Step 2: Fine-Tuning Our Custom LLaMA-3.1 Router
Router Model Selection: We chose a specialized LLaMA-3.1 8B model to serve as our router. Its compact size keeps routing decisions fast and cost-effective at inference time.
Training Process:
- The router was trained exclusively on scientific queries, explicitly fine-tuned to predict Mistral’s response quality rating (1–3) based solely on the query text.
- Training queries were presented in structured, instruction-driven prompts clearly guiding the model to recognize query complexity and the scientific depth required.
- We ensured balanced representation across rating scores, preventing bias toward either model and ensuring robust, generalizable performance.
- We conducted full-parameter fine-tuning on GPU resources, ensuring efficient, high-quality convergence.
- We validated strong predictive performance on scientific queries after training.
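The class-balancing step described above can be sketched as a simple downsampling pass; this is a minimal illustration, assuming each training example carries its 1–3 rating in a `score` field:

```python
import random
from collections import defaultdict

def balance_by_score(examples, seed=0):
    """Downsample so each rating (1-3) appears equally often.

    `examples` is a list of dicts with a "score" key. This is a sketch of
    the balancing idea, not the exact pipeline we used.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["score"]].append(ex)
    # Cap every class at the size of the rarest one.
    n = min(len(bucket) for bucket in buckets.values())
    balanced = []
    for score in sorted(buckets):
        balanced.extend(rng.sample(buckets[score], n))
    rng.shuffle(balanced)
    return balanced
```

Downsampling trades data volume for balance; oversampling the rare classes is an equally valid alternative when labeled data is scarce.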
Step 3: Router Evaluation and Optimization
We thoroughly evaluated the trained router on challenging scientific benchmarks (e.g., GSM8K and other quantitative reasoning tasks) to assess how accurately it directs queries while maintaining high accuracy at reduced cost.
Evaluation Highlights:
- Our router effectively identified when Mistral-8x7B could reliably handle queries, reducing the need for expensive calls to OpenAI o3-mini-high.
- Compared to a random baseline, our scientifically-specialized router provided substantial cost savings while maintaining scientific rigor:
- Up to 70% cost reduction on general scientific benchmarks.
- Approximately 40% cost reduction on challenging quantitative tasks.
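Putting the pieces together, the routing rule and its cost accounting can be sketched as below; the per-query costs are illustrative placeholders, not actual API prices:

```python
def route(predicted_score: int) -> str:
    """Route a query based on the router's predicted 1-3 rating."""
    # Scores 2-3: Mistral is adequate; score 1: escalate to the premium model.
    return "mistral-8x7b" if predicted_score >= 2 else "o3-mini-high"

def cost_saving(predicted_scores, cheap_cost=1.0, premium_cost=10.0):
    """Fractional cost reduction vs. sending every query to the premium model.

    The cost arguments are hypothetical relative prices for illustration.
    """
    routed = sum(
        cheap_cost if route(s) == "mistral-8x7b" else premium_cost
        for s in predicted_scores
    )
    baseline = premium_cost * len(predicted_scores)
    return 1 - routed / baseline
```

Under these assumed prices, routing three of four queries to Mistral already cuts cost by roughly two thirds, which is the mechanism behind the savings reported above.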