
Fix pdf parsing bug #938


Merged · 10 commits merged into Future-House:main on Jun 13, 2025

Conversation

markenki (Contributor) commented:

The line

text = page.get_text("text", sort=True)

in readers.py doesn't respect multiple columns. For example, applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research". The first half of that line comes from the first column, while the second half comes from the second column.

Replacing that line of code with

# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks)

extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.
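For context, here is a minimal standalone sketch of the proposed extraction, assuming PyMuPDF (the path is illustrative):

import pymupdf  # PyMuPDF; older releases import as `fitz`

doc = pymupdf.open("tests/stub_data/pasa.pdf")  # illustrative path
page = doc[0]

# Extract text blocks in content-stream order rather than position-sorted
# plain text, so multi-column pages keep each column's reading order
blocks = page.get_text("blocks", sort=False)

# Index 4 of each block tuple holds the block's text
text = "\n".join(block[4] for block in blocks)
print(text.splitlines()[0])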

Copilot AI (Contributor) left a comment:

Pull Request Overview

This pull request fixes a PDF parsing bug by updating the text extraction logic to correctly handle multi-column layouts.

  • Replaces the use of page.get_text("text", sort=True) with a blocks-based extraction
  • Concatenates text blocks in the order provided by the PDF parser
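For reference, each entry PyMuPDF returns from get_text("blocks") is a 7-tuple, which is why the reviewed code indexes block[4] for the text and guards on tuple length. A small sketch of the assumed shape (illustrative path):

import pymupdf

page = pymupdf.open("tests/stub_data/pasa.pdf")[0]  # illustrative path
first_block = page.get_text("blocks", sort=False)[0]

# A block tuple has the shape:
#   (x0, y0, x1, y1, text, block_no, block_type)
# so index 4 is the block's text; block_type is 0 for text, 1 for images
x0, y0, x1, y1, text, block_no, block_type = first_block
print(block_no, block_type, text[:40])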

Comment on lines 43 to 47
# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks if len(block) > 4)
Collaborator commented:

What do we lose with sort=False? I am wondering why we had sort=True originally (it predates my time at FutureHouse)

markenki (Contributor Author) commented:

The problem wasn't sort=True. The problem was getting "text" rather than "blocks".

Collaborator commented:

If the problem isn't sort=False, can you revert to sort=True there? Just to keep diff smaller

markenki (Contributor Author) commented:

Using sort=False retains the correct order of blocks in two-column PDFs (as well as one-column PDFs).
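To make the comparison concrete, a sketch (assuming PyMuPDF and the pasa.pdf fixture) contrasting the two extraction modes:

import pymupdf

page = pymupdf.open("tests/stub_data/pasa.pdf")[0]  # illustrative path

# sort=True reorders output by bounding-box position (top-to-bottom,
# left-to-right), which is what interleaved the two columns in "text" mode
old_text = page.get_text("text", sort=True)

# sort=False keeps blocks in content-stream order, one column at a time
new_text = "\n".join(b[4] for b in page.get_text("blocks", sort=False))

# Compare the abstract region of each output to see the difference
print(old_text[:300])
print(new_text[:300])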

Collaborator commented:

Sounds good, thanks!

jamesbraza (Collaborator) left a comment:

We'll need the lint workflow and tests to pass. Our CI doesn't work for contributors, since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.

That being said, it looks like the failing tests (locally for me) are:

  • test_get_directory_index
  • test_get_directory_index_w_manifest

Can you:

  1. Get these to pass (adjusting the assertions)
  2. Expand them to account for pasa.pdf
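(As an aside, a hypothetical way to run just these two tests locally, assuming pytest and the repo's tests/ layout:)

import sys

import pytest

# -k substring-matches both test_get_directory_index and
# test_get_directory_index_w_manifest
sys.exit(pytest.main(["tests/", "-k", "test_get_directory_index"]))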


markenki (Contributor Author) commented:

Thanks, @jamesbraza. I fixed the failing unit tests.

markenki (Contributor Author) commented:

@jamesbraza, could you kick off the workflow, please? Thanks!

jamesbraza (Collaborator) commented:

Before this change:

Page 1:

            PaSa: An LLM Agent for Comprehensive Academic Paper Search

                     Yichen He∗1  Guanhua Huang∗1   Peiyuan Feng1  Yuan Lin†1
                             Yuchen Zhang1  Hang Li1  Weinan E2

                               1ByteDance Research   2Peking University

                        {hyc,huangguanhua,fpy,linyuan.0}@bytedance.com,
               {zhangyuchen.zyc,lihang.lh}@bytedance.com, [email protected]

                               Demo: https://pasa-agent.ai


2025                    Paper Search
Jan

                          Abstract                 1  Introduction17

            We introduce PaSa, an advanced Paper Search      Academic paper search lies at the core of research
                  agent powered by large language models. PaSa       yet represents a particularly challenging informa-
                can autonomously make a series of decisions,        tion retrieval task.  It requires long-tail special-
                  including invoking search tools, reading pa-[cs.IR]                                                            ized knowledge, comprehensive survey-level cover-
                    pers, and selecting relevant references, to ul-
                                                                   age, and the ability to address fine-grained queries.
                  timately obtain comprehensive and accurate
                                                          For instance, consider the query: "Which stud-                    results for complex scholarly queries. We op-
                   timize PaSa using reinforcement learning with        ies have focused on non-stationary reinforcement
                 a synthetic dataset, AutoScholarQuery, which       learning using value-based methods, specifically
                   includes 35k fine-grained academic queries and      UCB-based algorithms?" While widely used aca-
                  corresponding papers sourced from top-tier AI       demic search systems like Google Scholar are effec-
                 conference publications. Additionally, we de-                                                                          tive for general queries, they often fall short when
                 velop RealScholarQuery, a benchmark collect-
                                                                 addressing these complex queries (Gusenbauer and
                  ing real-world academic queries to assess PaSa
                                                     Haddaway, 2020). Consequently, researchers fre-                performance in more realistic scenarios. De-
                    spite being trained on synthetic data, PaSa sig-       quently spend substantial time conducting litera-
                    nificantly outperforms existing baselines on        ture surveys (Kingsley et al., 2011; Gusenbauer
                 RealScholarQuery, including Google, Google      and Haddaway, 2021).arXiv:2501.10120v1            Scholar, Google with GPT-4 for paraphrased                                                     The advancements in large language models
                   queries, chatGPT (search-enabled GPT-4o),
                                                    (LLMs) (OpenAI, 2023; Anthropic, 2024; Gemini,
                GPT-o1, and PaSa-GPT-4o (PaSa implemented
                                                          2023; Yang et al., 2024) have inspired numerous               by prompting GPT-4o). Notably, PaSa-7B sur-
                  passes the best Google-based baseline, Google        studies leveraging LLMs to enhance information
                 with GPT-4o, by 37.78% in recall@20 and         retrieval, particularly by refining or reformulating
               39.90% in recall@50.  It also exceeds PaSa-       search queries to improve retrieval quality (Alaofi
               GPT-4o by 30.36% in recall and 4.25% in pre-        et al., 2023; Li et al., 2023; Ma et al., 2023; Peng
                     cision. Model, datasets, and code are available                                                                         et al., 2024).  In academic search, however, the
                     at https://github.com/bytedance/pasa.
                                                              process goes beyond simple retrieval. Human re-
                                                                   searchers not only use search tools, but also engage
                                                                       in deeper activities, such as reading relevant papers
                 ∗Equal contribution.                            and checking citations, to perform comprehensive
                  †Corresponding author.                          and accurate literature surveys.


                                                    1

Page 2:

                         Crawler                       Paper Queue     User Query                  Selector
                                 [Search]

 User Query




                                 [Expand]                      [Stop]



                                                                                                         Select / Drop


Figure 1: Architecture of PaSa. The system consists of two LLM agents, Crawler and Selector. The Crawler
processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool,
expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the
paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in
the user query.


  In this paper, we introduce PaSa, a novel paper     Although AutoScholarQuery  only  provides
search agent designed to mimic human behavior   query and paper answers, without demonstrating
for comprehensive and accurate academic paper    the path by which scientists collect the papers, we
searches. As illustrated in Figure 1, PaSa con-   can utilize them to perform RL training to improve
sists of two LLM agents: the Crawler and the Se-   PaSa. In addition, we design a new session-level
lector. For a given user query, the Crawler can  PPO (Proximal Policy Optimization (Schulman
autonomously collect relevant papers by utilizing    et al., 2017)) training method to address the unique
search tools or extracting citations from the current    challenges of the paper search task: 1) sparse re-
paper, which are then added to a growing paper   ward: The papers in AutoScholarQuery are col-
queue. The Crawler iteratively processes each pa-   lected via citations, making it a smaller subset of
per in the paper queue, navigating citation networks    the actual qualified paper set. 2) long trajectories:
to discover increasingly relevant papers. The Selec-   The complete trajectory of the Crawler may involve
tor carefully reads each paper in the paper queue to   hundreds of papers, which is too long to directly
determine whether it meets the requirements of the    input into the LLM context.
user query. We optimize PaSa within the AGILE, a
                                           To evaluate PaSa, besides the test set of Au-reinforcement learning (RL) framework for LLM
                                                   toScholarQuery, we also develop a benchmark, Re-agents (Feng et al., 2024).
                                                   alScholarQuery. It contains 50 real-world academic   Effective training requires high-quality academic
                                                   queries with annotated relevant papers, to assesssearch data. Fortunately, human scientists have al-
                                           PaSa in real-world scenarios. We compare PaSaready created a vast amount of high-quality aca-
                                                with several baselines including Google, Googledemic papers, which contain extensive surveys on
                                                    Scholar, Google paired with GPT-4o for para-a wide range of research topics. We build a syn-
                                                phrased queries, chatGPT (search-enabled GPT-thetic but high-quality academic search dataset,
                                                     4o), GPT-o1 and PaSa-GPT-4o (PaSa agent real-AutoScholarQuery, which collects fine-grained
                                                     ized by prompting GPT-4o). Our experiments showscholar queries and their corresponding relevant
                                                           that PaSa-7b significantly outperforms all baselines.papers from the related work sections of papers
                                                         Specifically, for AutoScholarQuery test set, PaSa-published at ICLR 2023 1, ICML 2023 2, NeurIPS
                                           7b achieves a 34.05% improvement in Recall@202023 3, ACL 2024 4, and CVPR 2024 5.  Au-
                                            and a 39.36% improvement in Recall@50 com-toScholarQuery includes 33,511 / 1,000 / 1,000
                                                  pared to Google with GPT-4o, the strongest Google-query-paper pairs in the training / development /
                                               based baseline. PaSa-7b surpasses PaSa-GPT-4otest split.
                                          by 11.12% in recall, with similar precision. For
   1https://iclr.cc/Conferences/2023               RealScholarQuery, PaSa-7b outperforms Google
   2https://icml.cc/Conferences/2023
                                                  with GPT-4o by 37.78% in Recall@20 and 39.90%   3https://neurips.cc/Conferences/2023
   4https://2024.aclweb.org/                           in Recall@50. PaSa-7b surpasses PaSa-GPT-4o by
   5https://cvpr.thecvf.com/Conferences/2024       30.36% in recall and 4.25% in precision.


                                         2

After this change:

Page 1:

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Yichen He∗1
Guanhua Huang∗1
Peiyuan Feng1
Yuan Lin†1

Yuchen Zhang1
Hang Li1
Weinan E2

1ByteDance Research
2Peking University

{hyc,huangguanhua,fpy,linyuan.0}@bytedance.com,
{zhangyuchen.zyc,lihang.lh}@bytedance.com, [email protected]

Demo: https://pasa-agent.ai

Paper Search

Abstract

We introduce PaSa, an advanced Paper Search
agent powered by large language models. PaSa
can autonomously make a series of decisions,
including invoking search tools, reading pa-
pers, and selecting relevant references, to ul-
timately obtain comprehensive and accurate
results for complex scholarly queries. We op-
timize PaSa using reinforcement learning with
a synthetic dataset, AutoScholarQuery, which
includes 35k fine-grained academic queries and
corresponding papers sourced from top-tier AI
conference publications. Additionally, we de-
velop RealScholarQuery, a benchmark collect-
ing real-world academic queries to assess PaSa
performance in more realistic scenarios. De-
spite being trained on synthetic data, PaSa sig-
nificantly outperforms existing baselines on
RealScholarQuery, including Google, Google
Scholar, Google with GPT-4 for paraphrased
queries, chatGPT (search-enabled GPT-4o),
GPT-o1, and PaSa-GPT-4o (PaSa implemented
by prompting GPT-4o). Notably, PaSa-7B sur-
passes the best Google-based baseline, Google
with GPT-4o, by 37.78% in recall@20 and
39.90% in recall@50. It also exceeds PaSa-
GPT-4o by 30.36% in recall and 4.25% in pre-
cision. Model, datasets, and code are available
at https://github.com/bytedance/pasa.

∗Equal contribution.
†Corresponding author.

1
Introduction

Academic paper search lies at the core of research
yet represents a particularly challenging informa-
tion retrieval task. It requires long-tail special-
ized knowledge, comprehensive survey-level cover-
age, and the ability to address fine-grained queries.
For instance, consider the query: "Which stud-
ies have focused on non-stationary reinforcement
learning using value-based methods, specifically
UCB-based algorithms?" While widely used aca-
demic search systems like Google Scholar are effec-
tive for general queries, they often fall short when
addressing these complex queries (Gusenbauer and
Haddaway, 2020). Consequently, researchers fre-
quently spend substantial time conducting litera-
ture surveys (Kingsley et al., 2011; Gusenbauer
and Haddaway, 2021).

The advancements in large language models
(LLMs) (OpenAI, 2023; Anthropic, 2024; Gemini,
2023; Yang et al., 2024) have inspired numerous
studies leveraging LLMs to enhance information
retrieval, particularly by refining or reformulating
search queries to improve retrieval quality (Alaofi
et al., 2023; Li et al., 2023; Ma et al., 2023; Peng
et al., 2024). In academic search, however, the
process goes beyond simple retrieval. Human re-
searchers not only use search tools, but also engage
in deeper activities, such as reading relevant papers
and checking citations, to perform comprehensive
and accurate literature surveys.

1

arXiv:2501.10120v1  [cs.IR]  17 Jan 2025

Page 2:

Paper Queue
Crawler
User Query
Selector

User Query

Select / Drop

[Search]

[Expand]
[Stop]

Figure 1: Architecture of PaSa. The system consists of two LLM agents, Crawler and Selector. The Crawler
processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool,
expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the
paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in
the user query.

In this paper, we introduce PaSa, a novel paper
search agent designed to mimic human behavior
for comprehensive and accurate academic paper
searches. As illustrated in Figure 1, PaSa con-
sists of two LLM agents: the Crawler and the Se-
lector. For a given user query, the Crawler can
autonomously collect relevant papers by utilizing
search tools or extracting citations from the current
paper, which are then added to a growing paper
queue. The Crawler iteratively processes each pa-
per in the paper queue, navigating citation networks
to discover increasingly relevant papers. The Selec-
tor carefully reads each paper in the paper queue to
determine whether it meets the requirements of the
user query. We optimize PaSa within the AGILE, a
reinforcement learning (RL) framework for LLM
agents (Feng et al., 2024).
Effective training requires high-quality academic
search data. Fortunately, human scientists have al-
ready created a vast amount of high-quality aca-
demic papers, which contain extensive surveys on
a wide range of research topics. We build a syn-
thetic but high-quality academic search dataset,
AutoScholarQuery, which collects fine-grained
scholar queries and their corresponding relevant
papers from the related work sections of papers
published at ICLR 2023 1, ICML 2023 2, NeurIPS
2023 3, ACL 2024 4, and CVPR 2024 5.
Au-
toScholarQuery includes 33,511 / 1,000 / 1,000
query-paper pairs in the training / development /
test split.

1https://iclr.cc/Conferences/2023
2https://icml.cc/Conferences/2023
3https://neurips.cc/Conferences/2023
4https://2024.aclweb.org/
5https://cvpr.thecvf.com/Conferences/2024

Although AutoScholarQuery only provides
query and paper answers, without demonstrating
the path by which scientists collect the papers, we
can utilize them to perform RL training to improve
PaSa. In addition, we design a new session-level
PPO (Proximal Policy Optimization (Schulman
et al., 2017)) training method to address the unique
challenges of the paper search task: 1) sparse re-
ward: The papers in AutoScholarQuery are col-
lected via citations, making it a smaller subset of
the actual qualified paper set. 2) long trajectories:
The complete trajectory of the Crawler may involve
hundreds of papers, which is too long to directly
input into the LLM context.

To evaluate PaSa, besides the test set of Au-
toScholarQuery, we also develop a benchmark, Re-
alScholarQuery. It contains 50 real-world academic
queries with annotated relevant papers, to assess
PaSa in real-world scenarios. We compare PaSa
with several baselines including Google, Google
Scholar, Google paired with GPT-4o for para-
phrased queries, chatGPT (search-enabled GPT-
4o), GPT-o1 and PaSa-GPT-4o (PaSa agent real-
ized by prompting GPT-4o). Our experiments show
that PaSa-7b significantly outperforms all baselines.
Specifically, for AutoScholarQuery test set, PaSa-
7b achieves a 34.05% improvement in Recall@20
and a 39.36% improvement in Recall@50 com-
pared to Google with GPT-4o, the strongest Google-
based baseline. PaSa-7b surpasses PaSa-GPT-4o
by 11.12% in recall, with similar precision. For
RealScholarQuery, PaSa-7b outperforms Google
with GPT-4o by 37.78% in Recall@20 and 39.90%
in Recall@50. PaSa-7b surpasses PaSa-GPT-4o by
30.36% in recall and 4.25% in precision.

2

Seems like an improvement to me ❤️

jamesbraza (Collaborator) left a comment:

Nice work @markenki, thanks for doing this

@@ -39,7 +41,25 @@ def parse_pdf_to_pages(
     f" {file.page_count} for the PDF at path {path}, likely this PDF"
     " file is corrupt."
 ) from exc
-text = page.get_text("text", sort=True)
+
+if os.environ.get("PQA_USE_BLOCK_PARSING", "0").lower() in ENABLED_LOOKUP:
Collaborator commented:

@markenki we are making this opt-in for now so we don't break people (including ourselves at FutureHouse). If our next baseline shows this doesn't harm performance, we'll make it the default behavior.

Collaborator commented:

Just made a setting for this over the env var; see the README.md's settings table for the tl;dr on this.
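For illustration, a hedged sketch of enabling the opt-in path via the env var from the diff above; the exact contents of ENABLED_LOOKUP are an assumption, and the settings-based toggle mentioned above is the documented route:

import os

# Hypothetical truthy-string set consistent with the diff's check; the
# actual ENABLED_LOOKUP constant in paper-qa may differ
ENABLED_LOOKUP = {"1", "true", "yes", "on"}

# Opt in to blocks-based PDF parsing before parsing any documents
os.environ["PQA_USE_BLOCK_PARSING"] = "1"

assert os.environ.get("PQA_USE_BLOCK_PARSING", "0").lower() in ENABLED_LOOKUP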

jamesbraza force-pushed the main branch 2 times, most recently from a67d39b to cec0d64, on June 13, 2025 21:10
mskarlin (Collaborator) left a comment:

nice -- thanks all!

jamesbraza merged commit efa41d4 into Future-House:main on Jun 13, 2025 (3 of 5 checks passed).
Labels: bug (Something isn't working) · lgtm (approved by a maintainer) · size:M (changes 30-99 lines, ignoring generated files)