
Extracting DSPy-optimized prompts correctly #8042

Closed
@Francisca266

Description

What happened?

I have been using MIPROv2 to optimize a basic instruction for an agent. The optimization uses GPT-4o-mini as the teacher model and a quantized model (served via vLLM) as the student.

The optimization targets the exact_match metric, which compares the model's predicted speaker role to the correct next speaker role.
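For context, the metric is essentially a string comparison; here is a sketch of its logic (the selected_role field name matches the signature shown under Steps to reproduce):

```python
def exact_match_router(example, prediction, trace=None):
    # True when the predicted next-speaker role matches the gold role exactly.
    return prediction.selected_role.strip() == example.selected_role.strip()
```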

Here are the results before and after optimization, measured with dspy.Evaluate:

  • Before optimization: 0.190
  • After MIPROv2 optimization: 0.238
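These numbers come from an evaluation run along these lines (a sketch; devset construction is omitted and the program variables are illustrative):

```python
# Sketch of the evaluation run; devset construction omitted.
evaluate = dspy.Evaluate(devset=devset, metric=exact_match_router, display_progress=True)

evaluate(router)            # un-optimized dspy.Predict program -> 0.190
evaluate(optimized_router)  # program returned by MIPROv2 compile -> 0.238
```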

However, I noticed a significant performance drop when I extracted the instruction and examples into a structured Markdown prompt and called the model directly, without DSPy. The results on 20 test questions are:

  • Instruction (non-optimized): 47%
  • Instruction (optimized) + examples: 28.5%
  • System prompt generated via dspy.inspect_history(): 5% (the output does not follow the expected format)
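The third case used the full prompt that DSPy itself renders, captured roughly like this (argument values are placeholders):

```python
# Run the optimized program once, then dump the last prompt DSPy actually sent.
optimized_router(roles="...", roles_list="...", conversation="...")
dspy.inspect_history(n=1)  # prints the rendered system/user messages
```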

Steps to reproduce

The signature I am using (I call it with dspy.Predict):

```python
from typing import Literal

import dspy


class RouterSignature(dspy.Signature):
    """Read the conversation and select the next role from roles_list to play. Only return the role."""

    roles = dspy.InputField(desc="available roles")
    roles_list = dspy.InputField()
    conversation = dspy.InputField()
    selected_role: Literal["..."] = dspy.OutputField()
```
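A sketch of how the signature is called through dspy.Predict against the vLLM server (the model name and port mirror the manual request further below):

```python
# Sketch: point DSPy at the vLLM OpenAI-compatible endpoint and call the predictor.
vllm = dspy.LM(
    "openai/kaitchup/Llama-3.2-3B-Instruct-gptqmodel-4bit",
    api_base="http://localhost:port/v1",  # placeholder port
    api_key="fake-key",                   # vLLM doesn't validate the key
)
dspy.configure(lm=vllm)

router = dspy.Predict(RouterSignature)
prediction = router(roles="...", roles_list="...", conversation="...")
print(prediction.selected_role)
```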

MIPROv2 params:
```python
optimization_model_kwargs = dict(
    prompt_model=openai_lm,
    task_model=vllm,
    teacher_settings=dict(lm=openai_lm),
)
optimizer = MIPROv2Optimizer(
    metric=exact_match_router,
    max_bootstrapped_demos=2,
    max_labeled_demos=5,
    optimization_model_kwargs=optimization_model_kwargs,
)
```
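The compile step then looks roughly like this (a sketch of the standard MIPROv2 workflow; trainset construction is omitted and the save path is illustrative):

```python
# Sketch of the compile step; trainset construction omitted.
optimized_router = optimizer.compile(
    dspy.Predict(RouterSignature),
    trainset=trainset,
)
optimized_router.save("optimized_router.json")  # illustrative path, used for later extraction
```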

  1. Extract the optimized instruction and examples into a Markdown prompt.

  2. Call the model manually using:
```python
import requests
import json


def query_vllm(roles, available_roles, conversation):
    api_url = "http://localhost:port/v1/chat/completions"  # placeholder port

    # user_prompt is assembled elsewhere from roles, available_roles, and conversation
    payload = {
        "model": "kaitchup/Llama-3.2-3B-Instruct-gptqmodel-4bit",
        "messages": [
            {"role": "system", "content": OPTIMIZED_SYS_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,
        "max_tokens": 500,
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer fake-key",  # vLLM doesn't validate the key
    }

    response = requests.post(api_url, headers=headers, data=json.dumps(payload))

    if response.status_code == 200:
        result = response.json()
        role_selection = result["choices"][0]["message"]["content"]
        return role_selection.strip()
    else:
        print(f"Error: {response.status_code}")
        return None
```

  3. Compare performance of the DSPy evaluation against the manual calls (see the sketch below).
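The comparison loop for step 3 is essentially the following (a sketch; test_set is a list of 20 dicts with the same fields as the signature):

```python
# Sketch of the manual comparison loop; `test_set` is illustrative.
correct = 0
for example in test_set:
    predicted = query_vllm(example["roles"], example["roles_list"], example["conversation"])
    if predicted is not None and predicted == example["selected_role"]:
        correct += 1

print(f"Manual accuracy: {correct / len(test_set):.1%}")
```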

Expected Behavior:

I expected the optimized instruction (when manually prompted) to perform at least as well as the non-optimized instruction.

Observed Behavior:

Instead, performance drops significantly. This suggests that DSPy’s optimization process is doing something beyond just modifying the instruction text.

Questions:

  • Could there be additional factors in DSPy that contribute to its improved performance?
  • Is there a recommended way to extract and reuse DSPy-optimized prompts while maintaining their effectiveness outside DSPy?

DSPy version

2.6.10
