ENH: Automatically preserve links in added pages #3298

larsga · 2025-05-27T13:47:52Z

Here is a draft implementation of the first stage of issue #3290. It handles links in pages added via add_page and insert_page, but it does not yet handle pages that were merged into those pages before they were added.

Does this look OK?

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?
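
To make the trade-off concrete, here is a minimal sketch comparing a linear list search with a dictionary lookup. `Ref` is a hypothetical stand-in for an indirect reference, not pypdf's `IndirectObject`, and the sketch assumes a reference can be hashed on its object number and generation. Whether that holds for the real class is exactly the open question here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (not pypdf's class)."""

    idnum: int
    generation: int = 0


def find_by_search(pairs: List[Tuple[Ref, Ref]], old: Ref) -> Optional[Ref]:
    # Linear scan: every lookup walks the list, so it costs O(n).
    for candidate, new in pairs:
        if candidate == old:
            return new
    return None


def build_index(pairs: List[Tuple[Ref, Ref]]) -> Dict[Ref, Ref]:
    # Built once; each later lookup is O(1) on average.
    return dict(pairs)


pairs = [(Ref(3), Ref(10)), (Ref(4), Ref(11)), (Ref(7), Ref(12))]
index = build_index(pairs)
```

Both approaches return the same mapping from old to new reference; the dictionary only pays off if equal references also hash equally, which a frozen dataclass guarantees for the stand-in above.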

codecov · 2025-05-27T14:07:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.71%. Comparing base (ae7a064) to head (09ea9b0).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3298      +/-   ##
==========================================
+ Coverage   96.69%   96.71%   +0.02%     
==========================================
  Files          53       54       +1     
  Lines        9023     9084      +61     
  Branches     1674     1685      +11     
==========================================
+ Hits         8725     8786      +61     
  Misses        176      176              
  Partials      122      122


stefan6419846 · 2025-05-27T15:42:35Z

Thanks for the PR.

> Does this look OK?

At first sight it looks okay, but unless there are specific aspects to talk about, I tend to prefer a proper review once the automated checks have passed.

It seems like CI is still failing due to coverage, typing and code style issues - this is something to consider in a second step. Nevertheless, regarding the code style, I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

> I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

In theory, each release could break the code of some user when doing special stuff. As this fixes shortcomings of the current implementation without removing anything, I do not see the need to introduce a deprecation period for this at the moment.

> At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

What exactly have you tried and what has been the result?

larsga · 2025-05-27T15:58:41Z

> I tend to prefer a proper review once the automated checks have passed.

Yeah, sorry. I had to cook dinner, and now I have a meeting. I didn't intend to leave the build broken like this. The trouble is I can't get the right version of ruff on my laptop right now, so all the checks end up being done in CI, which is slow. Anyway, I will sort all this out.

> I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

Will do.

> I do not see the need to introduce a deprecation period for this at the moment.

Ack! 👍

> What exactly have you tried and what has been the result?

Mainly that I couldn't get the reference lookups to work, but I take this to mean that they should work. I'll work on this a bit more.

larsga · 2025-05-28T13:20:16Z

Now it should finally be ready for review.

Note that I changed the type of the Destination.page property. As far as I can tell it has been wrong all along. I certainly don't get an int back when I access it. I get an IndirectObject, and that's also what the docs say should be passed in as the value. I was forced to do this to get type checking to accept my code -- let me know if you want this change separated out.

Once this PR is merged I'll look at handling links in merged-in pages, but because of an upcoming holiday that will take a while.

stefan6419846 · 2025-05-28T13:26:22Z

Thanks. I will try to have a look at this as soon as possible - this might take some time as well.

Regarding the broken type hints, there is a corresponding issue as well: #3233.

stefan6419846

Thanks for the PR. I just had a first look at the changes and added some small comments. As general notes:

- Using abbreviations in the names and docstrings should be avoided. It is completely fine to use "reference" instead of "ref", for example, to improve clarity and avoid having to deprecate stuff later on.
- In the type hints, please put a space after each comma: type1, type2 instead of type1,type2. I have marked some cases, but not all.
- Instead of nesting functions and bloating the already large modules, consider moving the corresponding functionality to the new submodule.

stefan6419846 · 2025-06-04T08:51:26Z

pypdf/_writer.py

@@ -209,6 +212,11 @@ def __init__(
        """The PDF file identifier,
        defined by the ID in the PDF file's trailer dictionary."""

+        self._unresolved_links: list[tuple[RefLink,RefLink]] = []


Suggested change

-        self._unresolved_links: list[tuple[RefLink,RefLink]] = []
+        self._unresolved_links: list[tuple[RefLink, RefLink]] = []

stefan6419846 · 2025-06-04T08:52:04Z

pypdf/_writer.py

@@ -209,6 +212,11 @@ def __init__(
        """The PDF file identifier,
        defined by the ID in the PDF file's trailer dictionary."""

+        self._unresolved_links: list[tuple[RefLink,RefLink]] = []
+        "Tracks links in pages added to the writer for resolving later."
+        self._merged_in_pages: Dict[Optional[IndirectObject],Optional[IndirectObject]] = {}


Suggested change

-        self._merged_in_pages: Dict[Optional[IndirectObject],Optional[IndirectObject]] = {}
+        self._merged_in_pages: Dict[Optional[IndirectObject], Optional[IndirectObject]] = {}

stefan6419846 · 2025-06-04T08:54:52Z

pypdf/_writer.py

@@ -482,12 +490,47 @@ def _add_page(
            ]
        except Exception:
            pass
+
+        def _extract_links(new_page: PageObject, old_page: PageObject) -> List[Tuple[RefLink,RefLink]]:


Instead of nesting functions, could we please move them to the new module as well? As far as I can see, they already depend on the parameters only.
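
As a rough sketch of this suggestion: a helper that reads only its arguments can be lifted out of the method without changing its behaviour. Everything below is hypothetical. Pages are modelled as plain dictionaries with an "/Annots" list, and `extract_links` only mirrors the shape of `_extract_links` in the diff, not its actual body.

```python
from typing import Any, Dict, List, Tuple

Page = Dict[str, Any]  # stand-in for PageObject, for illustration only


def extract_links(new_page: Page, old_page: Page) -> List[Tuple[Any, Any]]:
    """Pair each link annotation on the new page with its original.

    Lives at module level because it uses nothing but its parameters.
    """
    new_links = [a for a in new_page.get("/Annots", []) if a.get("/Subtype") == "/Link"]
    old_links = [a for a in old_page.get("/Annots", []) if a.get("/Subtype") == "/Link"]
    return list(zip(new_links, old_links))


class Writer:
    """Hypothetical writer that delegates instead of nesting the helper."""

    def __init__(self) -> None:
        self.unresolved_links: List[Tuple[Any, Any]] = []

    def add_page(self, new_page: Page, old_page: Page) -> None:
        # Record the link pairs now so they can be resolved later.
        self.unresolved_links.extend(extract_links(new_page, old_page))
```

A module-level helper like this can also be unit-tested directly, without constructing a writer.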

stefan6419846 · 2025-06-04T08:57:11Z

tests/test_merger.py

+
+
+@pytest.mark.enable_socket
+def test_named_ref_to_page_thats_gone(pdf_file_path):


Suggested change

-def test_named_ref_to_page_thats_gone(pdf_file_path):
+def test_named_ref_to_page_that_is_gone(pdf_file_path):

larsga force-pushed the issue-3290 branch 2 times, most recently from d0b2c8a to 460139e Compare May 27, 2025 13:59

larsga force-pushed the issue-3290 branch 2 times, most recently from fb8c123 to 7e394e4 Compare May 27, 2025 14:10

larsga force-pushed the issue-3290 branch 14 times, most recently from eaad222 to 7274240 Compare May 28, 2025 11:22

ENH: Automatically preserve links in added pages

09ea9b0

larsga force-pushed the issue-3290 branch from 7274240 to 09ea9b0 Compare May 28, 2025 13:05

stefan6419846 requested changes Jun 4, 2025
