nbalepur commented:

Re-adding the perplexity fixes because I forgot to merge them 🥲

SQA Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/sqa/task.py@sqa \
    --display plain \
    -T with_search_tools=false \
    -T simplified_eval=true \
    -T assess_jointly=true \
    --max-connections 16 \
    --max-samples 4 \
    --model ${generation_model} \
    --solver agent_baselines/solvers/sqa/formatted_perplexity.py@formatted_solver \
    -T sentence_wise_cit_eval=false \
    -T all_at_once=true \
    -T scorer_model='google/gemini-2.5-flash-preview-05-20' \
    -T split=${split} \
    -S search_context_size=high \
    -S require_snippets=false \
    -S reasoning_effort=high
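
For reference, the `-T` flags here are task parameters (eval configuration such as the scorer model and split), while the `-S` flags are forwarded to the solver, so search_context_size, require_snippets, and reasoning_effort go to the Perplexity solver rather than the task.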

SQA Results:

global_avg/mean: 0.673           global_avg/stderr: 0.00455
ingredient_recall/mean: 0.924    ingredient_recall/stderr: 0.0107
answer_precision/mean: 0.94      answer_precision/stderr: 0.0108
citation_precision/mean: 0.462   citation_precision/stderr: 0.00338
citation_recall/mean: 0.367      citation_recall/stderr: 0.00751
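
Assuming global_avg is the unweighted mean of the four component metrics (my assumption, not something stated in the task docs), the numbers are consistent: (0.924 + 0.94 + 0.462 + 0.367) / 4 ≈ 0.673.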

LitQA2 Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/labbench/litqa2/task.py@litqa2_test \
        --display plain \
        --solver agent_baselines/solvers/sqa/perplexity_base.py@perplexity_solver \
        --model ${generation_model} \
        -T with_search_tools=false \
        -T with_native_search_tools=false \
        -S search_context_size=high

LitQA2 Results:

score_litqa2/precision: 0.9    score_litqa2/coverage: 0.8
is_correct/accuracy: 0.72      is_correct/stderr: 0.0522
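
If accuracy counts a question as correct only when it is both answered and answered correctly (again, my assumption about the scorer), it should equal precision × coverage, and the numbers line up: 0.9 × 0.8 = 0.72.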

Very similar to the results here: allenai/asta-bench#92
