
Fix pdf parsing bug #938


Merged · 10 commits merged into Future-House:main on Jun 13, 2025

Conversation

markenki (Contributor) commented:

The line

text = page.get_text("text", sort=True)

in readers.py doesn't respect multiple columns. For example, applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research". The first half of that line comes from the first column, while the second half comes from the second column.

Replacing that line of code with

# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks)

extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.
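For context, here is a minimal standalone sketch of the proposed extraction, assuming PyMuPDF (the path is illustrative):

import pymupdf  # PyMuPDF; older releases import as `fitz`

doc = pymupdf.open("tests/stub_data/pasa.pdf")  # illustrative path
page = doc[0]

# Extract text blocks in content-stream order rather than position-sorted
# plain text, so multi-column pages keep each column's reading order
blocks = page.get_text("blocks", sort=False)

# Index 4 of each block tuple holds the block's text
text = "\n".join(block[4] for block in blocks)
print(text.splitlines()[0])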

Copilot AI (Contributor) left a comment:

Pull Request Overview

This pull request fixes a PDF parsing bug by updating the text extraction logic to correctly handle multi-column layouts.

  • Replaces the use of page.get_text("text", sort=True) with a blocks-based extraction
  • Concatenates text blocks in the order provided by the PDF parser
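For reference, each entry PyMuPDF returns from get_text("blocks") is a 7-tuple, which is why the reviewed code indexes block[4] for the text and guards on tuple length. A small sketch of the assumed shape (illustrative path):

import pymupdf

page = pymupdf.open("tests/stub_data/pasa.pdf")[0]  # illustrative path
first_block = page.get_text("blocks", sort=False)[0]

# A block tuple has the shape:
#   (x0, y0, x1, y1, text, block_no, block_type)
# so index 4 is the block's text; block_type is 0 for text, 1 for images
x0, y0, x1, y1, text, block_no, block_type = first_block
print(block_no, block_type, text[:40])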

Comment on lines 43 to 47
# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks if len(block) > 4)
Collaborator commented:

What do we lose with sort=False? I am wondering why we had sort=True originally (it predates my time at FutureHouse)

markenki (Contributor Author) commented:

The problem wasn't sort=True. The problem was getting "text" rather than "blocks".

Collaborator commented:

If the problem isn't sort=False, can you revert to sort=True there? Just to keep diff smaller

markenki (Contributor Author) commented:

Using sort=False retains the correct order of blocks in two-column PDFs (as well as one-column PDFs).
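To make the comparison concrete, a sketch (assuming PyMuPDF and the pasa.pdf fixture) contrasting the two extraction modes:

import pymupdf

page = pymupdf.open("tests/stub_data/pasa.pdf")[0]  # illustrative path

# sort=True reorders output by bounding-box position (top-to-bottom,
# left-to-right), which is what interleaved the two columns in "text" mode
old_text = page.get_text("text", sort=True)

# sort=False keeps blocks in content-stream order, one column at a time
new_text = "\n".join(b[4] for b in page.get_text("blocks", sort=False))

# Compare the abstract region of each output to see the difference
print(old_text[:300])
print(new_text[:300])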

Collaborator commented:

Sounds good, thanks!

jamesbraza (Collaborator) left a comment:

We'll need the lint workflow and tests to pass. Our CI doesn't work for contributors, since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.

That being said, it looks like the failing tests (locally for me) are:

  • test_get_directory_index
  • test_get_directory_index_w_manifest

Can you:

  1. Get these to pass (adjusting the assertions)
  2. Expand them to account for pasa.pdf
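(As an aside, a hypothetical way to run just these two tests locally, assuming pytest and the repo's tests/ layout:)

import sys

import pytest

# -k substring-matches both test_get_directory_index and
# test_get_directory_index_w_manifest
sys.exit(pytest.main(["tests/", "-k", "test_get_directory_index"]))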


markenki (Contributor Author) commented:

Thanks, @jamesbraza. I fixed the failing unit tests.

markenki (Contributor Author) commented:

@jamesbraza, could you kick off the workflow, please? Thanks!

jamesbraza (Collaborator) commented:

Before this change:

Page 1:

            PaSa: An LLM Agent for Comprehensive Academic Paper Search

                     Yichen He∗1  Guanhua Huang∗1   Peiyuan Feng1  Yuan Lin†1
                             Yuchen Zhang1  Hang Li1  Weinan E2

                               1ByteDance Research   2Peking University

                        {hyc,huangguanhua,fpy,linyuan.0}@bytedance.com,
               {zhangyuchen.zyc,lihang.lh}@bytedance.com, [email protected]

                               Demo: https://pasa-agent.ai


2025                    Paper Search
Jan

                          Abstract                 1  Introduction17

            We introduce PaSa, an advanced Paper Search      Academic paper search lies at the core of research
                  agent powered by large language models. PaSa       yet represents a particularly challenging informa-
                can autonomously make a series of decisions,        tion retrieval task.  It requires long-tail special-
                  including invoking search tools, reading pa-[cs.IR]                                                            ized knowledge, comprehensive survey-level cover-
                    pers, and selecting relevant references, to ul-
                                                                   age, and the ability to address fine-grained queries.
                  timately obtain comprehensive and accurate
                                                          For instance, consider the query: "Which stud-                    results for complex scholarly queries. We op-
                   timize PaSa using reinforcement learning with        ies have focused on non-stationary reinforcement
                 a synthetic dataset, AutoScholarQuery, which       learning using value-based methods, specifically
                   includes 35k fine-grained academic queries and      UCB-based algorithms?" While widely used aca-
                  corresponding papers sourced from top-tier AI       demic search systems like Google Scholar are effec-
                 conference publications. Additionally, we de-                                                                          tive for general queries, they often fall short when
                 velop RealScholarQuery, a benchmark collect-
                                                                 addressing these complex queries (Gusenbauer and
                  ing real-world academic queries to assess PaSa
                                                     Haddaway, 2020). Consequently, researchers fre-                performance in more realistic scenarios. De-
                    spite being trained on synthetic data, PaSa sig-       quently spend substantial time conducting litera-
                    nificantly outperforms existing baselines on        ture surveys (Kingsley et al., 2011; Gusenbauer
                 RealScholarQuery, including Google, Google      and Haddaway, 2021).arXiv:2501.10120v1            Scholar, Google with GPT-4 for paraphrased                                                     The advancements in large language models
                   queries, chatGPT (search-enabled GPT-4o),
                                                    (LLMs) (OpenAI, 2023; Anthropic, 2024; Gemini,
                GPT-o1, and PaSa-GPT-4o (PaSa implemented
                                                          2023; Yang et al., 2024) have inspired numerous               by prompting GPT-4o). Notably, PaSa-7B sur-
                  passes the best Google-based baseline, Google        studies leveraging LLMs to enhance information
                 with GPT-4o, by 37.78% in recall@20 and         retrieval, particularly by refining or reformulating
               39.90% in recall@50.  It also exceeds PaSa-       search queries to improve retrieval quality (Alaofi
               GPT-4o by 30.36% in recall and 4.25% in pre-        et al., 2023; Li et al., 2023; Ma et al., 2023; Peng
                     cision. Model, datasets, and code are available                                                                         et al., 2024).  In academic search, however, the
                     at https://github.com/bytedance/pasa.
                                                              process goes beyond simple retrieval. Human re-
                                                                   searchers not only use search tools, but also engage
                                                                       in deeper activities, such as reading relevant papers
                 ∗Equal contribution.                            and checking citations, to perform comprehensive
                  †Corresponding author.                          and accurate literature surveys.


                                                    1

Page 2:

                         Crawler                       Paper Queue     User Query                  Selector
                                 [Search]

 User Query




                                 [Expand]                      [Stop]



                                                                                                         Select / Drop


Figure 1: Architecture of PaSa. The system consists of two LLM agents, Crawler and Selector. The Crawler
processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool,
expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the
paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in
the user query.


  In this paper, we introduce PaSa, a novel paper     Although AutoScholarQuery  only  provides
search agent designed to mimic human behavior   query and paper answers, without demonstrating
for comprehensive and accurate academic paper    the path by which scientists collect the papers, we
searches. As illustrated in Figure 1, PaSa con-   can utilize them to perform RL training to improve
sists of two LLM agents: the Crawler and the Se-   PaSa. In addition, we design a new session-level
lector. For a given user query, the Crawler can  PPO (Proximal Policy Optimization (Schulman
autonomously collect relevant papers by utilizing    et al., 2017)) training method to address the unique
search tools or extracting citations from the current    challenges of the paper search task: 1) sparse re-
paper, which are then added to a growing paper   ward: The papers in AutoScholarQuery are col-
queue. The Crawler iteratively processes each pa-   lected via citations, making it a smaller subset of
per in the paper queue, navigating citation networks    the actual qualified paper set. 2) long trajectories:
to discover increasingly relevant papers. The Selec-   The complete trajectory of the Crawler may involve
tor carefully reads each paper in the paper queue to   hundreds of papers, which is too long to directly
determine whether it meets the requirements of the    input into the LLM context.
user query. We optimize PaSa within the AGILE, a
                                           To evaluate PaSa, besides the test set of Au-reinforcement learning (RL) framework for LLM
                                                   toScholarQuery, we also develop a benchmark, Re-agents (Feng et al., 2024).
                                                   alScholarQuery. It contains 50 real-world academic   Effective training requires high-quality academic
                                                   queries with annotated relevant papers, to assesssearch data. Fortunately, human scientists have al-
                                           PaSa in real-world scenarios. We compare PaSaready created a vast amount of high-quality aca-
                                                with several baselines including Google, Googledemic papers, which contain extensive surveys on
                                                    Scholar, Google paired with GPT-4o for para-a wide range of research topics. We build a syn-
                                                phrased queries, chatGPT (search-enabled GPT-thetic but high-quality academic search dataset,
                                                     4o), GPT-o1 and PaSa-GPT-4o (PaSa agent real-AutoScholarQuery, which collects fine-grained
                                                     ized by prompting GPT-4o). Our experiments showscholar queries and their corresponding relevant
                                                           that PaSa-7b significantly outperforms all baselines.papers from the related work sections of papers
                                                         Specifically, for AutoScholarQuery test set, PaSa-published at ICLR 2023 1, ICML 2023 2, NeurIPS
                                           7b achieves a 34.05% improvement in Recall@202023 3, ACL 2024 4, and CVPR 2024 5.  Au-
                                            and a 39.36% improvement in Recall@50 com-toScholarQuery includes 33,511 / 1,000 / 1,000
                                                  pared to Google with GPT-4o, the strongest Google-query-paper pairs in the training / development /
                                               based baseline. PaSa-7b surpasses PaSa-GPT-4otest split.
                                          by 11.12% in recall, with similar precision. For
   1https://iclr.cc/Conferences/2023               RealScholarQuery, PaSa-7b outperforms Google
   2https://icml.cc/Conferences/2023
                                                  with GPT-4o by 37.78% in Recall@20 and 39.90%   3https://neurips.cc/Conferences/2023
   4https://2024.aclweb.org/                           in Recall@50. PaSa-7b surpasses PaSa-GPT-4o by
   5https://cvpr.thecvf.com/Conferences/2024       30.36% in recall and 4.25% in precision.


                                         2

After this change:

Page 1:

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Yichen He∗1
Guanhua Huang∗1
Peiyuan Feng1
Yuan Lin†1

Yuchen Zhang1
Hang Li1
Weinan E2

1ByteDance Research
2Peking University

{hyc,huangguanhua,fpy,linyuan.0}@bytedance.com,
{zhangyuchen.zyc,lihang.lh}@bytedance.com, [email protected]

Demo: https://pasa-agent.ai

Paper Search

Abstract

We introduce PaSa, an advanced Paper Search
agent powered by large language models. PaSa
can autonomously make a series of decisions,
including invoking search tools, reading pa-
pers, and selecting relevant references, to ul-
timately obtain comprehensive and accurate
results for complex scholarly queries. We op-
timize PaSa using reinforcement learning with
a synthetic dataset, AutoScholarQuery, which
includes 35k fine-grained academic queries and
corresponding papers sourced from top-tier AI
conference publications. Additionally, we de-
velop RealScholarQuery, a benchmark collect-
ing real-world academic queries to assess PaSa
performance in more realistic scenarios. De-
spite being trained on synthetic data, PaSa sig-
nificantly outperforms existing baselines on
RealScholarQuery, including Google, Google
Scholar, Google with GPT-4 for paraphrased
queries, chatGPT (search-enabled GPT-4o),
GPT-o1, and PaSa-GPT-4o (PaSa implemented
by prompting GPT-4o). Notably, PaSa-7B sur-
passes the best Google-based baseline, Google
with GPT-4o, by 37.78% in recall@20 and
39.90% in recall@50. It also exceeds PaSa-
GPT-4o by 30.36% in recall and 4.25% in pre-
cision. Model, datasets, and code are available
at https://github.com/bytedance/pasa.

∗Equal contribution.
†Corresponding author.

1
Introduction

Academic paper search lies at the core of research
yet represents a particularly challenging informa-
tion retrieval task. It requires long-tail special-
ized knowledge, comprehensive survey-level cover-
age, and the ability to address fine-grained queries.
For instance, consider the query: "Which stud-
ies have focused on non-stationary reinforcement
learning using value-based methods, specifically
UCB-based algorithms?" While widely used aca-
demic search systems like Google Scholar are effec-
tive for general queries, they often fall short when
addressing these complex queries (Gusenbauer and
Haddaway, 2020). Consequently, researchers fre-
quently spend substantial time conducting litera-
ture surveys (Kingsley et al., 2011; Gusenbauer
and Haddaway, 2021).

The advancements in large language models
(LLMs) (OpenAI, 2023; Anthropic, 2024; Gemini,
2023; Yang et al., 2024) have inspired numerous
studies leveraging LLMs to enhance information
retrieval, particularly by refining or reformulating
search queries to improve retrieval quality (Alaofi
et al., 2023; Li et al., 2023; Ma et al., 2023; Peng
et al., 2024). In academic search, however, the
process goes beyond simple retrieval. Human re-
searchers not only use search tools, but also engage
in deeper activities, such as reading relevant papers
and checking citations, to perform comprehensive
and accurate literature surveys.

1

arXiv:2501.10120v1  [cs.IR]  17 Jan 2025

Page 2:

Paper Queue
Crawler
User Query
Selector

User Query

Select / Drop

[Search]

[Expand]
[Stop]

Figure 1: Architecture of PaSa. The system consists of two LLM agents, Crawler and Selector. The Crawler
processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool,
expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the
paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in
the user query.

In this paper, we introduce PaSa, a novel paper
search agent designed to mimic human behavior
for comprehensive and accurate academic paper
searches. As illustrated in Figure 1, PaSa con-
sists of two LLM agents: the Crawler and the Se-
lector. For a given user query, the Crawler can
autonomously collect relevant papers by utilizing
search tools or extracting citations from the current
paper, which are then added to a growing paper
queue. The Crawler iteratively processes each pa-
per in the paper queue, navigating citation networks
to discover increasingly relevant papers. The Selec-
tor carefully reads each paper in the paper queue to
determine whether it meets the requirements of the
user query. We optimize PaSa within the AGILE, a
reinforcement learning (RL) framework for LLM
agents (Feng et al., 2024).
Effective training requires high-quality academic
search data. Fortunately, human scientists have al-
ready created a vast amount of high-quality aca-
demic papers, which contain extensive surveys on
a wide range of research topics. We build a syn-
thetic but high-quality academic search dataset,
AutoScholarQuery, which collects fine-grained
scholar queries and their corresponding relevant
papers from the related work sections of papers
published at ICLR 2023 1, ICML 2023 2, NeurIPS
2023 3, ACL 2024 4, and CVPR 2024 5.
Au-
toScholarQuery includes 33,511 / 1,000 / 1,000
query-paper pairs in the training / development /
test split.

1https://iclr.cc/Conferences/2023
2https://icml.cc/Conferences/2023
3https://neurips.cc/Conferences/2023
4https://2024.aclweb.org/
5https://cvpr.thecvf.com/Conferences/2024

Although AutoScholarQuery only provides
query and paper answers, without demonstrating
the path by which scientists collect the papers, we
can utilize them to perform RL training to improve
PaSa. In addition, we design a new session-level
PPO (Proximal Policy Optimization (Schulman
et al., 2017)) training method to address the unique
challenges of the paper search task: 1) sparse re-
ward: The papers in AutoScholarQuery are col-
lected via citations, making it a smaller subset of
the actual qualified paper set. 2) long trajectories:
The complete trajectory of the Crawler may involve
hundreds of papers, which is too long to directly
input into the LLM context.

To evaluate PaSa, besides the test set of Au-
toScholarQuery, we also develop a benchmark, Re-
alScholarQuery. It contains 50 real-world academic
queries with annotated relevant papers, to assess
PaSa in real-world scenarios. We compare PaSa
with several baselines including Google, Google
Scholar, Google paired with GPT-4o for para-
phrased queries, chatGPT (search-enabled GPT-
4o), GPT-o1 and PaSa-GPT-4o (PaSa agent real-
ized by prompting GPT-4o). Our experiments show
that PaSa-7b significantly outperforms all baselines.
Specifically, for AutoScholarQuery test set, PaSa-
7b achieves a 34.05% improvement in Recall@20
and a 39.36% improvement in Recall@50 com-
pared to Google with GPT-4o, the strongest Google-
based baseline. PaSa-7b surpasses PaSa-GPT-4o
by 11.12% in recall, with similar precision. For
RealScholarQuery, PaSa-7b outperforms Google
with GPT-4o by 37.78% in Recall@20 and 39.90%
in Recall@50. PaSa-7b surpasses PaSa-GPT-4o by
30.36% in recall and 4.25% in precision.

2

Seems like an improvement to me ❤️

jamesbraza (Collaborator) left a comment:

Nice work @markenki, thanks for doing this

@@ -39,7 +41,25 @@ def parse_pdf_to_pages(
     f" {file.page_count} for the PDF at path {path}, likely this PDF"
     " file is corrupt."
 ) from exc
-text = page.get_text("text", sort=True)
+
+if os.environ.get("PQA_USE_BLOCK_PARSING", "0").lower() in ENABLED_LOOKUP:
Collaborator commented:

@markenki we are making this opt-in for now so we don't break people (including ourselves at FutureHouse). If our next baseline shows this doesn't harm performance, we'll make it the default behavior.

Collaborator commented:

Just made a setting for this over the env var; see the README.md's settings table for the tl;dr on this.
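For illustration, a hedged sketch of enabling the opt-in path via the env var from the diff above; the exact contents of ENABLED_LOOKUP are an assumption, and the settings-based toggle mentioned above is the documented route:

import os

# Hypothetical truthy-string set consistent with the diff's check; the
# actual ENABLED_LOOKUP constant in paper-qa may differ
ENABLED_LOOKUP = {"1", "true", "yes", "on"}

# Opt in to blocks-based PDF parsing before parsing any documents
os.environ["PQA_USE_BLOCK_PARSING"] = "1"

assert os.environ.get("PQA_USE_BLOCK_PARSING", "0").lower() in ENABLED_LOOKUP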

jamesbraza force-pushed the main branch 2 times, most recently from a67d39b to cec0d64, on June 13, 2025 21:10
mskarlin (Collaborator) left a comment:

nice -- thanks all!

jamesbraza merged commit efa41d4 into Future-House:main on Jun 13, 2025 (3 of 5 checks passed).
Labels: bug (Something isn't working) · lgtm (approved by a maintainer) · size:M (changes 30-99 lines, ignoring generated files)