Skip to content

Allow for non-mutated user specified doc-ids #977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 19, 2025

Conversation

mskarlin
Copy link
Collaborator

This was a deep bug. On round-trip serialization of a DocDetails object, if a user doesn't have a doi field to create a deterministic doc_id, then when serializing/deserializing (during validation), the doc_id was mutating. This prevented users from ever setting their own doc_id value.

This round trip serialization happens during the gather_evidence step, when the existing doc_id for each DocDetails object in Docs would then diverge from the doc_id for the associated Context objects generated. So for each call to gather_evidence, if a DocDetails did not have a DOI, a random doc_id was created for each Text.doc object within each Context. This PR fixes this behavior.

As an aside, my tests were failing because the RetractionDataPostProcessor test file (tests/stub_data/test_retractions.csv) was more than 30 days since the creation date. So this triggered a cache break which tried to rebuild the file. This then triggered a network call which broke the VCR cassette cache. I turned this behavior off by default, as I don't think this is needed under normal circumstances.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where user‐specified doc_ids were being overwritten during round-trip serialization of DocDetails objects. Key changes include:

  • Updating the DocDetails validation in paperqa/types.py to preserve user‐defined doc_ids.
  • Enhancing test coverage with a new test_docdetails_doc_id_roundtrip to ensure serialization does not mutate doc_ids.
  • Modifying clients in retractions and journal_quality to explicitly pass through the doc_id and dockey fields.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/test_paperqa.py Added tests to verify consistent doc_id behavior upon serialization.
paperqa/types.py Updated the doc_id population logic to honor user-specified values.
paperqa/clients/retractions.py Adjusted _process to preserve doc_id/dockey when merging DocDetails.
paperqa/clients/journal_quality.py Updated merging of DocDetails to preserve doc_id/dockey.
Comments suppressed due to low confidence (1)

tests/test_paperqa.py:1576

  • The assertion message contradicts the condition being tested. Update the error message to indicate that the doc_id should match the specified test value.
    ), "DocDetails with doc_id should not match test_specified_doc_id"

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jun 19, 2025
@mskarlin mskarlin merged commit 5f08886 into main Jun 19, 2025
5 checks passed
@mskarlin mskarlin deleted the allow-user-specified-doc-ids branch June 19, 2025 19:37
Comment on lines +48 to +49
doc_id=doc_details.doc_id, # ensure doc_id is preserved
dockey=doc_details.dockey, # ensure dockey is preserved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps let's improve these comment a bit, imo in their current form they're just mirroring the code

A good comment should mention why -- perhaps explain why was the word "preservation" used

 DocDetails(
    # Propagate doc_id and dockey to preserve them
    # so DocDetails construction doesn't generate new ones
    doc_id=doc_details.doc_id,
    dockey=doc_details.dockey,

Or maybe remove the comment entirely

@@ -431,14 +431,15 @@ def lowercase_doi_and_populate_doc_id(cls, data: dict[str, Any]) -> dict[str, An
if doi.startswith(url_prefix_to_remove):
doi = doi.replace(url_prefix_to_remove, "")
data["doi"] = doi.lower()
data["doc_id"] = encode_id(doi.lower())
else:
if "doc_id" not in data or not data["doc_id"]: # keep user defined doc_ids
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not data.get("doc_id") may be simpler, sometimes the double negatives are harder to comprehend

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants