-
-
Notifications
You must be signed in to change notification settings - Fork 78
Include Similar Documents in Title Generation Prompt for Improved Naming Consistency #660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
When disabling the tag generation using AUTO_GENERATE_TAGS=false there was the possibility for tags to become null, in case newTagIDs wasn't initialized yet.
WalkthroughIntroduces retrieval of similar documents via a new client method and integrates their titles into the LLM title-suggestion template. Updates Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User as Caller
participant App as App
participant Client as PaperlessClient
participant LLM as LLM
User->>App: generateDocumentSuggestions(ctx, documentID, content, originalTitle)
App->>App: getSuggestedTitle(ctx, documentID, content, originalTitle)
App->>Client: GetSimilarDocuments(documentID, max=5)
alt similar docs retrieved
Client-->>App: []Document (titles)
App->>App: Build template data incl. SimilarDocumentTitles
else fetch fails
Client--x App: error
App->>App: Proceed with empty SimilarDocumentTitles
end
App->>LLM: Prompt with content, originalTitle, SimilarDocumentTitles?
LLM-->>App: suggestedTitle
App-->>User: suggestedTitle
rect rgb(235, 245, 255)
note over Client,App: New/changed flow: similar-doc retrieval and prompt enrichment
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
paperless.go (1)
541-550
: Avoid unintended tag clearing; do case-insensitive mapping and skip update when no IDs resolveCase-sensitive lookup can fail to resolve intended tags (e.g., “Invoice” vs “invoice”), resulting in sending an empty slice and clearing all tags. Map tags case-insensitively and only apply an empty slice when an explicit clear is intended (finalTagNames is empty).
if !hasSameTags(originalDoc.Tags, finalTagNames) { originalFields["tags"] = originalDoc.Tags - var newTagIDs []int = []int{} + // Case-insensitive lookup to avoid misses due to casing + ci := make(map[string]int, len(availableTags)) + for name, id := range availableTags { + ci[strings.ToLower(name)] = id + } + var newTagIDs []int = []int{} for _, tagName := range finalTagNames { - if tagID, exists := availableTags[tagName]; exists { + if tagID, exists := ci[strings.ToLower(tagName)]; exists { newTagIDs = append(newTagIDs, tagID) } } - updatedFields["tags"] = newTagIDs + // If user provided non-empty names but none resolved, skip to avoid accidental clearing + if len(finalTagNames) > 0 && len(newTagIDs) == 0 { + log.Warnf("No tag IDs resolved for document %d; skipping tags update to avoid clearing.", documentID) + } else { + updatedFields["tags"] = newTagIDs // empty slice [] clears intentionally when finalTagNames is empty + } }
🧹 Nitpick comments (9)
default_prompts/title_prompt.tmpl (1)
6-6
: Optional copy tweak to guide format matchingEmphasize matching date/number formats from exemplars.
-{{end}}Please try to be consistent with the naming pattern of these similar documents if they provide informative titles.{{end}} +{{end}}Please match the naming pattern (including any date/number formats) used in these similar documents when they provide informative titles.{{end}}types.go (1)
164-165
: Interface extension looks good; confirm implementers are updatedNew method reads well. Please ensure all external mocks/adapters implement it to avoid compile breaks. Consider renaming
count
→maxResults
for clarity in call sites.paperless_test.go (2)
613-681
: Stabilize env/global state in testsUse t.Setenv for env isolation and also restore global vars manualTag/autoTag to prevent cross-test leakage.
- originalManualTag := os.Getenv("MANUAL_TAG") - originalAutoTag := os.Getenv("AUTO_TAG") - defer func() { - os.Setenv("MANUAL_TAG", originalManualTag) - os.Setenv("AUTO_TAG", originalAutoTag) - }() + originalManualTag := os.Getenv("MANUAL_TAG") + originalAutoTag := os.Getenv("AUTO_TAG") + origManualTagVar, origAutoTagVar := manualTag, autoTag + defer func() { + _ = os.Setenv("MANUAL_TAG", originalManualTag) + _ = os.Setenv("AUTO_TAG", originalAutoTag) + manualTag, autoTag = origManualTagVar, origAutoTagVar + }() - // Set the tag values and reinitialize the global variables - os.Setenv("MANUAL_TAG", "paperless-gpt") - os.Setenv("AUTO_TAG", "paperless-gpt-auto") + // Set the tag values and reinitialize the global variables + t.Setenv("MANUAL_TAG", "paperless-gpt") + t.Setenv("AUTO_TAG", "paperless-gpt-auto") manualTag = "paperless-gpt" autoTag = "paperless-gpt-auto"
682-745
: Apply the same env/global isolation hereMirror the approach from the previous test to avoid leaking state.
- originalManualTag := os.Getenv("MANUAL_TAG") - originalAutoTag := os.Getenv("AUTO_TAG") - defer func() { - os.Setenv("MANUAL_TAG", originalManualTag) - os.Setenv("AUTO_TAG", originalAutoTag) - }() + originalManualTag := os.Getenv("MANUAL_TAG") + originalAutoTag := os.Getenv("AUTO_TAG") + origManualTagVar, origAutoTagVar := manualTag, autoTag + defer func() { + _ = os.Setenv("MANUAL_TAG", originalManualTag) + _ = os.Setenv("AUTO_TAG", originalAutoTag) + manualTag, autoTag = origManualTagVar, origAutoTagVar + }() - // Set the tag values and reinitialize the global variables - os.Setenv("MANUAL_TAG", "paperless-gpt") - os.Setenv("AUTO_TAG", "paperless-gpt-auto") + // Set the tag values and reinitialize the global variables + t.Setenv("MANUAL_TAG", "paperless-gpt") + t.Setenv("AUTO_TAG", "paperless-gpt-auto") manualTag = "paperless-gpt" autoTag = "paperless-gpt-auto"app_llm_test.go (2)
365-370
: Add compile-time interface conformance checkHelps catch future ClientInterface changes early.
type mockPaperlessClient struct { CustomFields []CustomField SimilarDocuments []Document Error error } + +// Ensure mock implements ClientInterface at compile time +var _ ClientInterface = (*mockPaperlessClient)(nil)
499-551
: Great: validates prompt inclusion of exemplars and final titleConsider also adding one similar doc whose title equals originalTitle to assert it is excluded from the exemplar list.
paperless.go (3)
399-405
: Clamp and validate maxResults to avoid 0/negative and overly large requestsPrevents page_size=0 and huge payloads on accidental inputs.
func (client *PaperlessClient) GetSimilarDocuments(ctx context.Context, documentID int, maxResults int) ([]Document, error) { + // Defensive clamp + if maxResults <= 0 { + maxResults = 5 + } + if maxResults > 50 { + maxResults = 50 + }
414-418
: Build the query with url.Values to avoid brittle string concatenationSafer and handles encoding automatically.
- path := fmt.Sprintf("api/documents/?ordering=-score&truncate_content=true&more_like_id=%d&page_size=%d", documentID, maxResults) - if len(excludeTagIDs) > 0 { - path += "&tags__id__none=" + strings.Join(excludeTagIDs, ",") - } + q := url.Values{} + q.Set("ordering", "-score") + q.Set("truncate_content", "true") + q.Set("more_like_id", strconv.Itoa(documentID)) + q.Set("page_size", strconv.Itoa(maxResults)) + if len(excludeTagIDs) > 0 { + q.Set("tags__id__none", strings.Join(excludeTagIDs, ",")) + } + path := "api/documents/?" + q.Encode()
452-492
: Replace nested scans with O(1) reverse lookups for tags and correspondentsCurrent loops are O(R * (T + C)). Build id→name maps once and map in O(1).
allCorrespondents, err := client.GetAllCorrespondents(ctx) if err != nil { return nil, err } + // Reverse lookups for O(1) ID→name mapping + tagIDToName := make(map[int]string, len(allTags)) + for name, id := range allTags { + tagIDToName[id] = name + } + corrIDToName := make(map[int]string, len(allCorrespondents)) + for name, id := range allCorrespondents { + corrIDToName[id] = name + } + documents := make([]Document, 0, len(documentsResponse.Results)) for _, result := range documentsResponse.Results { // Skip the document itself if it appears in the results if result.ID == documentID { continue } - tagNames := make([]string, len(result.Tags)) - for i, resultTagID := range result.Tags { - for tagName, tagID := range allTags { - if resultTagID == tagID { - tagNames[i] = tagName - break - } - } - } + tagNames := make([]string, 0, len(result.Tags)) + for _, tagID := range result.Tags { + if name, ok := tagIDToName[tagID]; ok { + tagNames = append(tagNames, name) + } + } correspondentName := "" if result.Correspondent != 0 { - for name, id := range allCorrespondents { - if result.Correspondent == id { - correspondentName = name - break - } - } + if name, ok := corrIDToName[result.Correspondent]; ok { + correspondentName = name + } } documents = append(documents, Document{ ID: result.ID, Title: result.Title,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
app_llm.go
(2 hunks)app_llm_test.go
(7 hunks)default_prompts/title_prompt.tmpl
(1 hunks)paperless.go
(2 hunks)paperless_test.go
(2 hunks)types.go
(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
types.go (1)
web-app/src/DocumentProcessor.tsx (1)
Document
(9-15)
paperless_test.go (1)
types.go (2)
GetDocumentApiResponseResult
(21-45)GetDocumentsApiResponse
(11-17)
paperless.go (1)
types.go (3)
Document
(77-87)GetDocumentsApiResponse
(11-17)Correspondent
(122-138)
app_llm.go (1)
main.go (1)
App
(109-127)
app_llm_test.go (3)
main.go (1)
App
(109-127)paperless.go (1)
CustomField
(42-46)types.go (1)
Document
(77-87)
🔇 Additional comments (14)
default_prompts/title_prompt.tmpl (1)
4-6
: Nice addition: contextual exemplars for title consistencyConditional block and list rendering look correct and concise.
paperless_test.go (4)
481-539
: Good coverage of happy-path retrieval and query verificationAssertions validate ordering, truncation, more_like_id, and page_size; result mapping checks are clear.
540-570
: LGTM: no-results path exercised cleanlyVerifies empty slice without errors.
571-595
: LGTM: server error path validatedError assertions and message prefix check are appropriate.
596-612
: LGTM: tag API error path validatedCovers prerequisite failure before documents call.
app_llm.go (2)
192-196
: Template data wiring looks correctIncludes Language, Content, Title, and SimilarDocumentTitles; works with token budgeting.
468-469
: Call-site update LGTMPassing documentID through maintains the new signature without changing behavior otherwise.
app_llm_test.go (6)
105-111
: Wiring mock client into App is correctPrevents nil Client in title generation tests.
162-163
: Updated call-site compiles against new signatureDocument ID plumbed through as intended.
272-279
: Same here: proper App initialization for title testsClient provided; avoids panics in getSuggestedTitle.
414-419
: Mock GetSimilarDocuments implementation is fineSimple passthrough suits the tests.
553-590
: LGTM: prompt omission when no similar docsCovers the empty-slice branch well.
592-629
: LGTM: resilient to lookup failureEnsures feature degrades gracefully and still returns a title.
paperless.go (1)
543-543
: LGTM: initialize to empty slice, not nil, for predictable PATCH semanticsEnsures the API receives [] rather than null when clearing tags.
Please confirm the Paperless API treats "tags": [] as “clear all tags” and rejects "tags": null. If null is tolerated, behavior should still be consistent across endpoints.
func (app *App) getSuggestedTitle(ctx context.Context, documentID int, content string, originalTitle string, logger *logrus.Entry) (string, error) { | ||
likelyLanguage := getLikelyLanguage() | ||
|
||
// Fetch similar documents to help with title consistency | ||
var similarDocumentTitles []string | ||
similarDocs, err := app.Client.GetSimilarDocuments(ctx, documentID, 5) // Get up to 5 similar documents | ||
if err != nil { | ||
// Log the error but don't fail the title generation | ||
logger.Debugf("Failed to fetch similar documents for title consistency: %v", err) | ||
} else { | ||
// Extract titles from similar documents | ||
for _, doc := range similarDocs { | ||
if doc.Title != "" && doc.Title != originalTitle { | ||
similarDocumentTitles = append(similarDocumentTitles, doc.Title) | ||
} | ||
} | ||
logger.Debugf("Found %d similar documents for title consistency", len(similarDocumentTitles)) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nil-pointer risk when Client is unset; also dedup similar titles
If App.Client is nil, this will panic. Guard it and deduplicate titles to reduce prompt noise.
- // Fetch similar documents to help with title consistency
- var similarDocumentTitles []string
- similarDocs, err := app.Client.GetSimilarDocuments(ctx, documentID, 5) // Get up to 5 similar documents
- if err != nil {
- // Log the error but don't fail the title generation
- logger.Debugf("Failed to fetch similar documents for title consistency: %v", err)
- } else {
- // Extract titles from similar documents
- for _, doc := range similarDocs {
- if doc.Title != "" && doc.Title != originalTitle {
- similarDocumentTitles = append(similarDocumentTitles, doc.Title)
- }
- }
- logger.Debugf("Found %d similar documents for title consistency", len(similarDocumentTitles))
- }
+ // Fetch similar documents to help with title consistency
+ var similarDocumentTitles []string
+ if app.Client != nil {
+ similarDocs, err := app.Client.GetSimilarDocuments(ctx, documentID, 5) // Get up to 5 similar documents
+ if err != nil {
+ // Log the error but don't fail the title generation
+ logger.Debugf("Failed to fetch similar documents for title consistency: %v", err)
+ } else {
+ seen := make(map[string]struct{}, len(similarDocs))
+ for _, doc := range similarDocs {
+ title := strings.TrimSpace(doc.Title)
+ if title == "" || title == originalTitle {
+ continue
+ }
+ if _, ok := seen[title]; ok {
+ continue
+ }
+ seen[title] = struct{}{}
+ similarDocumentTitles = append(similarDocumentTitles, title)
+ }
+ logger.Debugf("Found %d similar documents for title consistency", len(similarDocumentTitles))
+ }
+ } else {
+ logger.Debug("Paperless client is nil; skipping similar document lookup for title consistency")
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (app *App) getSuggestedTitle(ctx context.Context, documentID int, content string, originalTitle string, logger *logrus.Entry) (string, error) { | |
likelyLanguage := getLikelyLanguage() | |
// Fetch similar documents to help with title consistency | |
var similarDocumentTitles []string | |
similarDocs, err := app.Client.GetSimilarDocuments(ctx, documentID, 5) // Get up to 5 similar documents | |
if err != nil { | |
// Log the error but don't fail the title generation | |
logger.Debugf("Failed to fetch similar documents for title consistency: %v", err) | |
} else { | |
// Extract titles from similar documents | |
for _, doc := range similarDocs { | |
if doc.Title != "" && doc.Title != originalTitle { | |
similarDocumentTitles = append(similarDocumentTitles, doc.Title) | |
} | |
} | |
logger.Debugf("Found %d similar documents for title consistency", len(similarDocumentTitles)) | |
} | |
func (app *App) getSuggestedTitle( | |
ctx context.Context, | |
documentID int, | |
content string, | |
originalTitle string, | |
logger *logrus.Entry, | |
) (string, error) { | |
likelyLanguage := getLikelyLanguage() | |
// Fetch similar documents to help with title consistency | |
var similarDocumentTitles []string | |
if app.Client != nil { | |
similarDocs, err := app.Client.GetSimilarDocuments(ctx, documentID, 5) // Get up to 5 similar documents | |
if err != nil { | |
// Log the error but don't fail the title generation | |
logger.Debugf("Failed to fetch similar documents for title consistency: %v", err) | |
} else { | |
seen := make(map[string]struct{}, len(similarDocs)) | |
for _, doc := range similarDocs { | |
title := strings.TrimSpace(doc.Title) | |
if title == "" || title == originalTitle { | |
continue | |
} | |
if _, ok := seen[title]; ok { | |
continue | |
} | |
seen[title] = struct{}{} | |
similarDocumentTitles = append(similarDocumentTitles, title) | |
} | |
logger.Debugf("Found %d similar documents for title consistency", len(similarDocumentTitles)) | |
} | |
} else { | |
logger.Debug("Paperless client is nil; skipping similar document lookup for title consistency") | |
} | |
// ... remainder of getSuggestedTitle ... | |
} |
🤖 Prompt for AI Agents
In app_llm.go around lines 168 to 186, the code risks a nil-pointer panic if
App.Client is unset and may append duplicate similar titles; first guard the
call by checking if app.Client != nil before calling GetSimilarDocuments and log
a debug/warn when the client is nil and skip fetching, and second deduplicate
the collected similarDocumentTitles (e.g., use a map to ensure each title is
only added once) before using or logging them to reduce prompt noise.
var excludeTagIDs []string | ||
for tagName, tagID := range allTags { | ||
if tagName == manualTag || tagName == autoTag { | ||
excludeTagIDs = append(excludeTagIDs, fmt.Sprintf("%d", tagID)) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Confirm filter semantics for tags__id__none and ordering with more_like_id
Please verify that:
- tags__id__none excludes any document having any of the provided tag IDs.
- ordering=-score is effective when using more_like_id (or unnecessary because similarity already sorts by score).
- truncate_content is supported on this endpoint.
🌐 Web query:
Check Paperless‑ngx documents API for:
1) Whether the query parameter "tags__id__none" is supported and its semantics.
2) Whether "more_like_id" results are sortable or already sorted by a "score" and whether "ordering=-score" is valid.
3) Whether "truncate_content" is a valid query parameter on /api/documents/.
💡 Result:
Short answers with sources:
-
tags__id__none — Not documented / not a supported query parameter. The official API docs list available filters (and examples) but do not mention a tags__id__none operator; issues show tag-filtering is done via the usual field lookups and that tag-exclusion has been a separate bug, implying no special "none" operator. (docs.paperless-ngx.com)
-
more_like_id / score / ordering — full‑text and "more like" results are returned with a search_hit containing score, rank and highlights, and the docs state full‑text results are always sorted by search score (i.e. best match first). That implies you cannot rely on (or override) sorting by an API ordering like ordering=-score for full‑text/more_like queries. (docs.paperless-ngx.com)
-
truncate_content on /api/documents/ — Not documented. The official REST API reference for /api/documents/ does not list a truncate_content query parameter, so it should not be considered a supported parameter. (docs.paperless-ngx.com)
If you want, I can: a) show exact example queries to get documents with/without tags using supported filters, or b) try a live request against a specific paperless‑ngx instance to confirm behavior.
Citations:
- 1: https://docs.paperless-ngx.com/api/?utm_source=openai
- 2: https://docs.paperless-ngx.com/api/?utm_source=openai
- 3: https://docs.paperless-ngx.com/api/?utm_source=openai
Remove unsupported API parameters from document queries
- The
tags__id__none
filter isn’t supported by the/api/documents/
endpoint—drop it or implement exclusion client-side. - Full-text queries (including
more_like_id
) are intrinsically sorted by score;ordering=-score
is ignored and should be removed. truncate_content
isn’t a valid query parameter—remove it.
🤖 Prompt for AI Agents
In paperless.go around lines 407 to 413, the code builds query parameters using
unsupported API filters: remove any use of the tags__id__none parameter, drop
ordering=-score when full-text/more_like_id is used, and remove truncate_content
from the request; instead pass only supported parameters to /api/documents/ and
implement exclusion client-side by collecting manual/auto tag IDs (as you
already do) and filtering the returned documents to exclude any document whose
tag IDs contain those IDs; ensure you no longer send tags__id__none,
ordering=-score, or truncate_content in the HTTP query.
Motivation
Since I was not entirely happy with the consistency of the generated titles of similar documents (like monthly invoices etc.), I wanted to give the llms some context on how similar documents were named.
Implementation
Therefore I added a feature where paperless-gpt uses the similarity search api of paperless-ngx in order to obtain the 5 closest documents that are currently not marked for processing by paperless-gpt (in order to not get unprocessed titles). Then it extracts the titles from those documents and provides them as a reference.
In my experience this improves the quality and consistency of titles significantly.
I had this running for a few days now already, maybe you'll find it interesting and check it out if you like the proposal
Summary by CodeRabbit
New Features
Tests