Skip to content

ENH: Automatically preserve links in added pages #3298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

larsga
Copy link
Contributor

@larsga larsga commented May 27, 2025

Here is a draft implementation of the first stage of the issue #3290 implementation. It handles links in pages added via add_page and insert_page, but it doesn't handle pages merged into those pages before adding.

Does this look OK?

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

@larsga larsga force-pushed the issue-3290 branch 2 times, most recently from d0b2c8a to 460139e Compare May 27, 2025 13:59
Copy link

codecov bot commented May 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.71%. Comparing base (ae7a064) to head (09ea9b0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3298      +/-   ##
==========================================
+ Coverage   96.69%   96.71%   +0.02%     
==========================================
  Files          53       54       +1     
  Lines        9023     9084      +61     
  Branches     1674     1685      +11     
==========================================
+ Hits         8725     8786      +61     
  Misses        176      176              
  Partials      122      122              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@larsga larsga force-pushed the issue-3290 branch 2 times, most recently from fb8c123 to 7e394e4 Compare May 27, 2025 14:10
@stefan6419846
Copy link
Collaborator

Thanks for the PR.

Does this look OK?

At first sight it looks okay, but unless there are specific aspects to talk about, I tend to prefer a proper review once the automated checks were successful.

It seems like CI is still failing due to coverage, typing and code style issues - this is something to consider in a second step. Nevertheless, regarding the code style, I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

In theory, each release could break the code of some user when doing special stuff. As this fixes shortcomings of the current implementation without removing anything, I do not see the need to introduce a deprecation period for this at the moment.

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

What exactly have you tried and what has been the result?

@larsga
Copy link
Contributor Author

larsga commented May 27, 2025

I tend to prefer a proper review once the automated checks were successful.

Yeah, sorry. I had to cook dinner, and now I have a meeting. I didn't intend to leave the build broken like this. The trouble is I can't get the right version of ruff on my laptop right now, so all the checks end up being done in CI, which is slow. Anyway, I will sort all this out.

I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

Will do.

I do not see the need to introduce a deprecation period for this at the moment.

Ack! 👍

What exactly have you tried and what has been the result?

Mainly that I couldn't get the reference lookups to work, but I take this to mean that they should work. I'll work on this a bit more.

@larsga larsga force-pushed the issue-3290 branch 14 times, most recently from eaad222 to 7274240 Compare May 28, 2025 11:22
@larsga
Copy link
Contributor Author

larsga commented May 28, 2025

Now it should finally be ready for review.

Note that I changed the type of the Destination.page property. As far as I can tell it's been wrong all the time. I certainly don't get an int back when I reference it. I get an IndirectObject, and that's also what the docs say should be passed in as the value. I was forced to do this to get type checking to accept my code -- let me know if you want this change separated out.

Once this PR is merged I'll look at handling links in merged-in pages, but because of upcoming holiday that will take a while.

@stefan6419846
Copy link
Collaborator

Thanks. I will try to have a look at this as soon as possible - this might take some time as well.

Regarding the broken type hints, there is a corresponding issue as well: #3233.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I just had a first look at the changes and added some small comments. As general notes:

  • Using abbreviations in the names and docstrings should be avoided. It is completely fine to use "reference" instead of "ref" for example to improve clarity and avoid having to deprecate stuff later on.
  • In the type hints, please use type1, type2 instead of type1,type2. I have marked some cases, but not all.
  • Instead of nesting functions and bloating the already large modules, consider moving the corresponding functionality to the new submodule.

@@ -209,6 +212,11 @@ def __init__(
"""The PDF file identifier,
defined by the ID in the PDF file's trailer dictionary."""

self._unresolved_links: list[tuple[RefLink,RefLink]] = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._unresolved_links: list[tuple[RefLink,RefLink]] = []
self._unresolved_links: list[tuple[RefLink, RefLink]] = []

@@ -209,6 +212,11 @@ def __init__(
"""The PDF file identifier,
defined by the ID in the PDF file's trailer dictionary."""

self._unresolved_links: list[tuple[RefLink,RefLink]] = []
"Tracks links in pages added to the writer for resolving later."
self._merged_in_pages: Dict[Optional[IndirectObject],Optional[IndirectObject]] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._merged_in_pages: Dict[Optional[IndirectObject],Optional[IndirectObject]] = {}
self._merged_in_pages: Dict[Optional[IndirectObject], Optional[IndirectObject]] = {}

@@ -482,12 +490,47 @@ def _add_page(
]
except Exception:
pass

def _extract_links(new_page: PageObject, old_page: PageObject) -> List[Tuple[RefLink,RefLink]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of nesting functions, could we please move them to the new module as well? As far as I can see, they already depend on the parameters only.



@pytest.mark.enable_socket
def test_named_ref_to_page_thats_gone(pdf_file_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_named_ref_to_page_thats_gone(pdf_file_path):
def test_named_ref_to_page_that_is_gone(pdf_file_path):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants