Allow for non-mutated user specified doc-ids #977

mskarlin · 2025-06-19T19:24:58Z

This was a deep bug. During round-trip serialization of a DocDetails object, if the user had no doi field from which to derive a deterministic doc_id, the doc_id mutated on serializing/deserializing (during validation). This prevented users from ever setting their own doc_id value.

This round-trip serialization happens during the gather_evidence step, where the existing doc_id for each DocDetails object in Docs would diverge from the doc_id of the associated Context objects that were generated. So on each call to gather_evidence, if a DocDetails did not have a DOI, a random doc_id was created for each Text.doc object within each Context. This PR fixes this behavior.
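To make the failure mode concrete, here is a minimal sketch that reproduces it without paperqa. Everything in it is an assumption for illustration: the validator is reduced to a plain function over a dict, `populate_doc_id_buggy`, `populate_doc_id_fixed`, and `round_trip` are hypothetical names, and `encode_id` is a stand-in hash rather than the library's implementation.

```python
import hashlib
import json
import uuid
from typing import Any, Callable


def encode_id(value: str, maxsize: int = 16) -> str:
    """Stand-in deterministic ID: a truncated MD5 hex digest (an assumption)."""
    return hashlib.md5(value.encode()).hexdigest()[:maxsize]


def populate_doc_id_buggy(data: dict[str, Any]) -> dict[str, Any]:
    """Assumed pre-fix behavior: doc_id is regenerated on every validation."""
    data = dict(data)
    if data.get("doi"):
        data["doi"] = data["doi"].lower()
        data["doc_id"] = encode_id(data["doi"])
    else:
        # No DOI: a fresh random ID each time, overwriting any existing value
        data["doc_id"] = encode_id(str(uuid.uuid4()))
    return data


def populate_doc_id_fixed(data: dict[str, Any]) -> dict[str, Any]:
    """Assumed post-fix behavior: only fill doc_id when it is missing or empty."""
    data = dict(data)
    if data.get("doi"):
        data["doi"] = data["doi"].lower()
    if not data.get("doc_id"):
        data["doc_id"] = encode_id(data.get("doi") or str(uuid.uuid4()))
    return data


def round_trip(
    validate: Callable[[dict[str, Any]], dict[str, Any]], data: dict[str, Any]
) -> tuple[str, str]:
    """Validate, serialize to JSON, deserialize, and validate again."""
    first = validate(data)
    second = validate(json.loads(json.dumps(first)))
    return first["doc_id"], second["doc_id"]


buggy_ids = round_trip(populate_doc_id_buggy, {"doc_id": "my-id"})
fixed_ids = round_trip(populate_doc_id_fixed, {"doc_id": "my-id"})
```

With the buggy variant the two IDs differ from each other and from "my-id"; with the fixed variant both stay "my-id", which is the property the new round-trip test is meant to guard.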

As an aside, my tests were failing because the RetractionDataPostProcessor test file (tests/stub_data/test_retractions.csv) was more than 30 days past its creation date. This triggered a cache break that tried to rebuild the file, which in turn triggered a network call that broke the VCR cassette cache. I turned this behavior off by default, as I don't think it is needed under normal circumstances.
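A rough sketch of that kind of age-gated cache check, with the age check off by default, might look like the following. This is not the RetractionDataPostProcessor code: `needs_refresh`, `max_age_days`, and `check_age` are hypothetical names, and the file's modification time is used as a stand-in for its creation date.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def needs_refresh(path: Path, max_age_days: int = 30, check_age: bool = False) -> bool:
    """Return True when the cached file should be rebuilt.

    A missing file always needs a rebuild. Age is only considered when
    check_age is True, so by default an existing file never triggers a
    rebuild (and therefore never triggers a network call).
    """
    if not path.exists():
        return True
    if not check_age:
        return False
    # Assumption: modification time approximates the creation date
    age_seconds = time.time() - path.stat().st_mtime
    return age_seconds > max_age_days * SECONDS_PER_DAY
```

Keeping the age check opt-in means a stale stub file in a test suite cannot silently cause a download that a recorded cassette has no response for.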

Copilot

Pull Request Overview

This PR fixes a bug where user‐specified doc_ids were being overwritten during round-trip serialization of DocDetails objects. Key changes include:

Updating the DocDetails validation in paperqa/types.py to preserve user‐defined doc_ids.
Enhancing test coverage with a new test_docdetails_doc_id_roundtrip to ensure serialization does not mutate doc_ids.
Modifying clients in retractions and journal_quality to explicitly pass through the doc_id and dockey fields.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
tests/test_paperqa.py	Added tests to verify consistent doc_id behavior upon serialization.
paperqa/types.py	Updated the doc_id population logic to honor user-specified values.
paperqa/clients/retractions.py	Adjusted _process to preserve doc_id/dockey when merging DocDetails.
paperqa/clients/journal_quality.py	Updated merging of DocDetails to preserve doc_id/dockey.

Comments suppressed due to low confidence (1)

tests/test_paperqa.py:1576

The assertion message contradicts the condition being tested. Update the error message to indicate that the doc_id should match the specified test value.

    ), "DocDetails with doc_id should not match test_specified_doc_id"

jamesbraza · 2025-06-19T19:35:51Z

paperqa/clients/journal_quality.py

+            doc_id=doc_details.doc_id,  # ensure doc_id is preserved
+            dockey=doc_details.dockey,  # ensure dockey is preserved


Perhaps let's improve these comments a bit; imo in their current form they just mirror the code.

A good comment should mention why; perhaps explain why the word "preservation" was used:

    DocDetails(
        # Propagate doc_id and dockey to preserve them
        # so DocDetails construction doesn't generate new ones
        doc_id=doc_details.doc_id,
        dockey=doc_details.dockey,

Or maybe remove the comment entirely

jamesbraza · 2025-06-19T19:39:57Z

paperqa/types.py

@@ -431,14 +431,15 @@ def lowercase_doi_and_populate_doc_id(cls, data: dict[str, Any]) -> dict[str, An
                if doi.startswith(url_prefix_to_remove):
                    doi = doi.replace(url_prefix_to_remove, "")
            data["doi"] = doi.lower()
-            data["doc_id"] = encode_id(doi.lower())
-        else:
+            if "doc_id" not in data or not data["doc_id"]:  # keep user defined doc_ids


if not data.get("doc_id") may be simpler; double negatives are sometimes harder to comprehend.
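As a quick check of that suggestion, the sketch below compares the two conditions on the inputs that matter. The helper names `needs_doc_id_verbose` and `needs_doc_id_simple` are hypothetical and exist only for this comparison.

```python
from typing import Any


def needs_doc_id_verbose(data: dict[str, Any]) -> bool:
    """Condition as written in the diff."""
    return "doc_id" not in data or not data["doc_id"]


def needs_doc_id_simple(data: dict[str, Any]) -> bool:
    """Suggested simplification using dict.get."""
    return not data.get("doc_id")


# Missing key, None, empty string, and a user-specified value
cases: list[dict[str, Any]] = [{}, {"doc_id": None}, {"doc_id": ""}, {"doc_id": "my-id"}]
results = [(needs_doc_id_verbose(c), needs_doc_id_simple(c)) for c in cases]
```

Both forms agree on every case: a doc_id is generated only when the key is absent or its value is falsy, and a user-specified value is left alone.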

allow for non-mutated user specified doc-ids

1acd7e6

mskarlin requested review from jamesbraza, nadolskit and Copilot June 19, 2025 19:24

Copilot AI reviewed Jun 19, 2025

dosubot bot added the size:L and bug labels Jun 19, 2025

Update tests/test_paperqa.py

5454833

Co-authored-by: Copilot <[email protected]>

nadolskit approved these changes Jun 19, 2025

mskarlin merged commit 5f08886 into main Jun 19, 2025
5 checks passed

mskarlin deleted the allow-user-specified-doc-ids branch June 19, 2025 19:37

jamesbraza approved these changes Jun 19, 2025

dosubot bot added the lgtm label Jun 19, 2025
