-
Notifications
You must be signed in to change notification settings - Fork 745
Allow for non-mutated user specified doc-ids #977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where user‐specified doc_ids were being overwritten during round-trip serialization of DocDetails objects. Key changes include:
- Updating the DocDetails validation in paperqa/types.py to preserve user‐defined doc_ids.
- Enhancing test coverage with a new test_docdetails_doc_id_roundtrip to ensure serialization does not mutate doc_ids.
- Modifying clients in retractions and journal_quality to explicitly pass through the doc_id and dockey fields.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/test_paperqa.py | Added tests to verify consistent doc_id behavior upon serialization. |
paperqa/types.py | Updated the doc_id population logic to honor user-specified values. |
paperqa/clients/retractions.py | Adjusted _process to preserve doc_id/dockey when merging DocDetails. |
paperqa/clients/journal_quality.py | Updated merging of DocDetails to preserve doc_id/dockey. |
Comments suppressed due to low confidence (1)
tests/test_paperqa.py:1576
- The assertion message contradicts the condition being tested. Update the error message to indicate that the doc_id should match the specified test value.
), "DocDetails with doc_id should not match test_specified_doc_id"
Co-authored-by: Copilot <[email protected]>
doc_id=doc_details.doc_id, # ensure doc_id is preserved | ||
dockey=doc_details.dockey, # ensure dockey is preserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps let's improve these comment a bit, imo in their current form they're just mirroring the code
A good comment should mention why -- perhaps explain why was the word "preservation" used
DocDetails(
# Propagate doc_id and dockey to preserve them
# so DocDetails construction doesn't generate new ones
doc_id=doc_details.doc_id,
dockey=doc_details.dockey,
Or maybe remove the comment entirely
@@ -431,14 +431,15 @@ def lowercase_doi_and_populate_doc_id(cls, data: dict[str, Any]) -> dict[str, An | |||
if doi.startswith(url_prefix_to_remove): | |||
doi = doi.replace(url_prefix_to_remove, "") | |||
data["doi"] = doi.lower() | |||
data["doc_id"] = encode_id(doi.lower()) | |||
else: | |||
if "doc_id" not in data or not data["doc_id"]: # keep user defined doc_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not data.get("doc_id")
may be simpler, sometimes the double negatives are harder to comprehend
This was a deep bug. On round-trip serialization of a
DocDetails
object, if a user doesn't have adoi
field to create a deterministicdoc_id
, then when serializing/deserializing (during validation), thedoc_id
was mutating. This prevented users from ever setting their owndoc_id
value.This round trip serialization happens during the
gather_evidence
step, when the existingdoc_id
for eachDocDetails
object inDocs
would then diverge from thedoc_id
for the associatedContext
objects generated. So for each call togather_evidence
, if aDocDetails
did not have a DOI, a randomdoc_id
was created for eachText.doc
object within eachContext
. This PR fixes this behavior.As an aside, my tests were failing because the
RetractionDataPostProcessor
test file (tests/stub_data/test_retractions.csv
) was more than 30 days since the creation date. So this triggered a cache break which tried to rebuild the file. This then triggered a network call which broke the VCR cassette cache. I turned this behavior off by default, as I don't think this is needed under normal circumstances.