-
Notifications
You must be signed in to change notification settings - Fork 745
Fix pdf parsing bug #938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix pdf parsing bug #938
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request fixes a PDF parsing bug by updating the text extraction logic to correctly handle multi-column layouts.
- Replaces the use of page.get_text("text", sort=True) with a blocks-based extraction
- Concatenates text blocks in the order provided by the PDF parser
paperqa/readers.py
Outdated
# Extract text blocks, which are already in the correct order, from the page | ||
blocks = page.get_text("blocks", sort=False) | ||
|
||
# Concatenate text blocks into a single string | ||
text = "\n".join(block[4] for block in blocks if len(block) > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we lose with sort=False
? I am wondering why we had sort=True
originally (it predates my time at FutureHouse)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem wasn't sort=True. The problem was getting "text" rather than "blocks".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem isn't sort=False
, can you revert to sort=True
there? Just to keep diff smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using sort=False
retains the correct order of blocks in two-column pdfs (as well as one-column pdfs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need lint
workflow and tests to pass. Our CI doesn't work for contributors since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.
That being said, looks like failing tests (locally for me) are:
test_get_directory_index
test_get_directory_index_w_manifest
Can you:
- Get these to pass (adjusting the assertions)
- Expand them to account for
pasa.pdf
paperqa/readers.py
Outdated
# Extract text blocks, which are already in the correct order, from the page | ||
blocks = page.get_text("blocks", sort=False) | ||
|
||
# Concatenate text blocks into a single string | ||
text = "\n".join(block[4] for block in blocks if len(block) > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem isn't sort=False
, can you revert to sort=True
there? Just to keep diff smaller
Thanks, @jamesbraza. I fixed the failing unit tests. |
@jamesbraza could you kick off the workflow, please. Thanks! |
Before this change: Page 1:
Page 2:
After this change Page 1:
Page 2:
Seems like an improvement to me ❤️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @markenki, thanks for doing this
paperqa/readers.py
Outdated
@@ -39,7 +41,25 @@ def parse_pdf_to_pages( | |||
f" {file.page_count} for the PDF at path {path}, likely this PDF" | |||
" file is corrupt." | |||
) from exc | |||
text = page.get_text("text", sort=True) | |||
|
|||
if os.environ.get("PQA_USE_BLOCK_PARSING", "0").lower() in ENABLED_LOOKUP: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markenki we are making this opt-in for now to not break people (including ourselves at FutureHouse). If upon our next baseline this doesn't harm performance, we'll make this the default behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just made a setting for this over env-var, see the README.md
's settings table for the tl;dr on this
Co-authored-by: Copilot <[email protected]>
a67d39b
to
cec0d64
Compare
ParsingSetting too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice -- thanks all!
The line
text = page.get_text("text", sort=True)
in
readers.py
doesn't respect multiple columns. For example, applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research" but the first half of that comes from the first column while the second half comes from the second column.Replacing that line of code with
extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.