Closed
Description
There is a regression in 84a4e1a: When passing multiple pages for an image-only input fileGrp, e.g. `-g phys_0001,phys_0007 -I OCR-D-IMG`, the logic that tries to prevent mixing derived images with original images is now falsely triggered:
core/ocrd/ocrd/processor/base.py
Lines 118 to 125 in edf31fa
ret = self.workspace.mets.find_all_files(
    fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
if self.page_id and len(ret) > 1:
    raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
        "for page '%s'" % self.page_id if self.page_id else '',
        self.input_file_grp
        ))
return ret
The problem is that `self.page_id` here is actually a list (formatted in comma-join notation).
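To illustrate why the check misfires, here is a minimal sketch that reproduces only the condition from the snippet above. The function name `old_check_fires` and the file IDs are hypothetical, and the file list is a plain stand-in for what `find_all_files` would return, assuming one image per requested page.

```python
def old_check_fires(page_id, files):
    """Mirror the condition `self.page_id and len(ret) > 1` from the snippet."""
    return bool(page_id) and len(files) > 1


# Value as passed via -g: one string naming two pages.
page_id = "phys_0001,phys_0007"
requested_pages = page_id.split(",")

# Hypothetical result: exactly one image file per requested page.
files = ["OCR-D-IMG_0001", "OCR-D-IMG_0007"]

# The string is non-empty and there are two files, so the check fires
# even though no page has more than one image.
print(requested_pages, old_check_fires(page_id, files))
```

With a single page and a single file, the same condition does not fire, which is why the problem only shows up for multi-page selections.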
So the correct way of ensuring that no single page gets multiple image file results is by
- either disallowing `find_all_files` to aggregate them like this (which is probably valid in other contexts, though)
- or going through its result `ret` and checking whether any of its `pageId`s repeat:
page_ids = [file.pageId for file in ret]
if len(page_ids) != len(set(page_ids)):
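As a self-contained sketch of the second option, the duplicate check could also report which pages repeat. Everything here is an assumption for illustration: `ImageFile` is a hypothetical stand-in that models only the `pageId` attribute of the real file objects, and the helper names and error message are not taken from the code base.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects returned by find_all_files;
# only the pageId attribute used in the issue is modelled here.
ImageFile = namedtuple("ImageFile", ["ID", "pageId"])


def repeated_page_ids(files):
    """Return the sorted page IDs that occur more than once in files."""
    counts = Counter(f.pageId for f in files)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


def check_one_image_per_page(files, file_grp):
    """Raise ValueError if any page has more than one image file."""
    repeated = repeated_page_ids(files)
    if repeated:
        # Assumed message wording, loosely modelled on the existing error.
        raise ValueError(
            "No PAGE-XML in fileGrp '%s' but multiple images for page(s) %s."
            % (file_grp, ", ".join(repeated)))
    return files


# Two pages with one image each: passes and returns the list unchanged.
ok = [ImageFile("IMG_1", "phys_0001"), ImageFile("IMG_7", "phys_0007")]
check_one_image_per_page(ok, "OCR-D-IMG")
```

Counting per `pageId` rather than comparing `len(ret)` against 1 keeps multi-page selections valid while still rejecting a fileGrp that mixes several images for the same page.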