Description
What happened?
I have been using MIPROv2 to optimize the basic instruction of an agent. The optimization uses GPT-4o-mini as the teacher model and a quantized model (served via vLLM) as the student.
The optimization improves the `exact_match` metric, which compares the model's predicted speaker role to the correct next speaker role.
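For reference, the metric (`exact_match_router` in the optimizer call below) is a simple exact match on the role. A minimal sketch; the real implementation may normalize case or whitespace differently:

```python
def exact_match_router(example, prediction, trace=None):
    # Exact match between the predicted next-speaker role and the gold role.
    # Minimal sketch; normalization details may differ from my actual metric.
    return prediction.selected_role.strip() == example.selected_role.strip()
```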
Here are the results before and after optimization, measured with `dspy.Evaluate`:
Before optimization: 0.190
After MIPROv2 optimization: 0.238
However, I noticed a significant performance drop when I extracted the instructions and examples into a structured Markdown prompt and called the model directly without DSPy. The results for 20 test questions are:
Instruction (non-optimized): 47%
Instruction (optimized) + examples: 28.5%
System prompt generated via dspy.inspect_history(): 5% (does not follow the expected format)
Steps to reproduce
The signature I am using (called with `dspy.Predict`):
```python
import dspy
from typing import Literal

class RouterSignature(dspy.Signature):
    """Read the conversation and select the next role from roles_list to play. Only return the role."""

    roles = dspy.InputField(desc="available roles")
    roles_list = dspy.InputField()
    conversation = dspy.InputField()
    selected_role: Literal["..."] = dspy.OutputField()
```
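For reference, the predictor is invoked roughly like this (a sketch continuing from the snippet above; the LM setup and field values are placeholders for my actual configuration):

```python
# Sketch of the call path; model name and port are placeholders.
vllm_lm = dspy.LM(
    "openai/kaitchup/Llama-3.2-3B-Instruct-gptqmodel-4bit",
    api_base="http://localhost:port/v1",
    api_key="fake-key",
)
dspy.configure(lm=vllm_lm)

router = dspy.Predict(RouterSignature)
prediction = router(
    roles="...",         # description of the available roles
    roles_list="...",    # candidate roles for the next turn
    conversation="...",  # conversation history so far
)
print(prediction.selected_role)
```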
MIPROv2 params:
```python
optimization_model_kwargs = dict(
    prompt_model=openai_lm,
    task_model=vllm,
    teacher_settings=dict(lm=openai_lm),
)
optimizer = MIPROv2Optimizer(
    metric=exact_match_router,
    max_bootstrapped_demos=2,
    max_labeled_demos=5,
    optimization_model_kwargs=optimization_model_kwargs,
)
```
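For context, the underlying DSPy call is roughly equivalent to the following (a sketch using `dspy.MIPROv2` directly; argument names follow DSPy's API and may not match my wrapper exactly, and the trainset construction is omitted):

```python
from dspy.teleprompt import MIPROv2

# Roughly what the wrapper does under the hood; trainset construction omitted.
teleprompter = MIPROv2(
    metric=exact_match_router,
    prompt_model=openai_lm,
    task_model=vllm,
    teacher_settings=dict(lm=openai_lm),
    max_bootstrapped_demos=2,
    max_labeled_demos=5,
)
optimized_router = teleprompter.compile(
    dspy.Predict(RouterSignature),
    trainset=trainset,  # list of dspy.Example objects
)
```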
- Extract the optimized instruction and examples into a structured Markdown prompt.
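In code, the extraction looks roughly like this (a sketch: `optimized_router` is the compiled program from the previous step, and the Markdown assembly is my own formatting, not something DSPy emits):

```python
# Pull the optimized instruction and the selected demos off the compiled
# predictor (attribute names per dspy.Predict), then assemble Markdown by hand.
optimized_instruction = optimized_router.signature.instructions
demos = optimized_router.demos  # dspy.Example objects chosen by MIPROv2

example_lines = [
    f"- conversation: {demo.conversation}\n  selected_role: {demo.selected_role}"
    for demo in demos
]
OPTIMIZED_SYS_PROMPT = optimized_instruction + "\n\n## Examples\n" + "\n".join(example_lines)
```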
- Call the model manually using:
```python
import requests
import json

def query_vllm(roles, available_roles, conversation):
    api_url = "http://localhost:port/v1/chat/completions"

    # Build the user message from the same fields the DSPy signature uses.
    user_prompt = (
        f"roles: {roles}\n"
        f"roles_list: {available_roles}\n"
        f"conversation: {conversation}"
    )

    payload = {
        "model": "kaitchup/Llama-3.2-3B-Instruct-gptqmodel-4bit",
        "messages": [
            {"role": "system", "content": OPTIMIZED_SYS_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,
        "max_tokens": 500,
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer fake-key",  # vLLM doesn't validate the key
    }

    response = requests.post(api_url, headers=headers, data=json.dumps(payload))
    if response.status_code == 200:
        result = response.json()
        role_selection = result["choices"][0]["message"]["content"]
        return role_selection.strip()
    else:
        print(f"Error: {response.status_code}")
        return None
```
- Compare performance.
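The comparison itself is a plain exact-match accuracy over the 20 test questions, along these lines (illustrative; `testset` is a list of dicts holding the signature fields plus the gold role):

```python
# Illustrative accuracy loop; `testset` is a list of dicts with the
# signature fields plus the gold next-speaker role.
correct = 0
for example in testset:
    predicted = query_vllm(
        example["roles"], example["roles_list"], example["conversation"]
    )
    if predicted is not None and predicted == example["selected_role"]:
        correct += 1

print(f"Accuracy: {correct / len(testset):.1%}")
```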
Expected Behavior:
I expected the optimized instruction (when manually prompted) to perform at least as well as the non-optimized instruction.
Observed Behavior:
Instead, performance drops significantly. This suggests that DSPy’s optimization process is doing something beyond just modifying the instruction text.
Questions:
- Could there be additional factors in DSPy that contribute to its improved performance?
- Is there a recommended way to extract and reuse DSPy-optimized prompts while maintaining their effectiveness outside DSPy?
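To make the second question concrete, the kind of reuse I have in mind looks like this (a sketch; it assumes `dspy.LM` records the exact chat messages of each call under `lm.history[-1]["messages"]`, which I have not verified is the recommended approach):

```python
import requests

# Replay the exact messages DSPy sent, verbatim, against the same vLLM
# endpoint, to separate DSPy's prompt formatting from the instruction and
# examples themselves. Variable names are illustrative.
last_call = vllm_lm.history[-1]
payload = {
    "model": "kaitchup/Llama-3.2-3B-Instruct-gptqmodel-4bit",
    "messages": last_call["messages"],
    "temperature": 0.0,
    "max_tokens": 500,
}
response = requests.post(
    "http://localhost:port/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake-key"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])
```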
DSPy version
2.6.10