nbalepur commented:

Re-adding the perplexity fixes because I forgot to merge them 🥲

SQA Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/sqa/task.py@sqa \
    --display plain \
    -T with_search_tools=false \
    -T simplified_eval=true \
    -T assess_jointly=true \
    --max-connections 16 \
    --max-samples 4 \
    --model ${generation_model} \
    --solver agent_baselines/solvers/sqa/formatted_perplexity.py@formatted_solver \
    -T sentence_wise_cit_eval=false \
    -T all_at_once=true \
    -T scorer_model='google/gemini-2.5-flash-preview-05-20' \
    -T split=${split} \
    -S search_context_size=high \
    -S require_snippets=false \
    -S reasoning_effort=high
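
For reference, the `-T` flags here are task parameters (eval configuration such as the scorer model and split), while the `-S` flags are forwarded to the solver, so search_context_size, require_snippets, and reasoning_effort go to the Perplexity solver rather than the task.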

SQA Results:

global_avg/mean: 0.673           global_avg/stderr: 0.00455
ingredient_recall/mean: 0.924    ingredient_recall/stderr: 0.0107
answer_precision/mean: 0.94      answer_precision/stderr: 0.0108
citation_precision/mean: 0.462   citation_precision/stderr: 0.00338
citation_recall/mean: 0.367      citation_recall/stderr: 0.00751
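
Assuming global_avg is the unweighted mean of the four component metrics (my assumption, not something stated in the task docs), the numbers are consistent: (0.924 + 0.94 + 0.462 + 0.367) / 4 ≈ 0.673.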

LitQA2 Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/labbench/litqa2/task.py@litqa2_test \
        --display plain \
        --solver agent_baselines/solvers/sqa/perplexity_base.py@perplexity_solver \
        --model ${generation_model} \
        -T with_search_tools=false \
        -T with_native_search_tools=false \
        -S search_context_size=high

LitQA2 Results:

score_litqa2/precision: 0.9    score_litqa2/coverage: 0.8
is_correct/accuracy: 0.72      is_correct/stderr: 0.0522
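
If accuracy counts a question as correct only when it is both answered and answered correctly (again, my assumption about the scorer), it should equal precision × coverage, and the numbers line up: 0.9 × 0.8 = 0.72.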

Very similar to the results here: allenai/asta-bench#92
