Processor.input_files broke for pageId selector lists #622

@bertsky

There is a regression in 84a4e1a: When passing multiple pages for an image-only input fileGrp, e.g. -g phys_0001,phys_0007 -I OCR-D-IMG, now the logic that tries to prevent mixing derived images with original images is falsely triggered:

    ret = self.workspace.mets.find_all_files(
        fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
    if self.page_id and len(ret) > 1:
        raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
            "for page '%s'" % self.page_id if self.page_id else '',
            self.input_file_grp
        ))
    return ret

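The problem is that self.page_id here is actually a list of page IDs (formatted in comma-join notation), not a single ID. The query therefore legitimately returns one image per requested page, and len(ret) > 1 no longer implies that any single page has several images.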

So the correct way of ensuring that no single page gets multiple image file results is by

  • either disallowing find_all_files to aggregate them like this (which is probably valid in other contexts, though)
  • or going through its result ret and checking whether any of its pageIds repeat:
page_ids = [file.pageId for file in ret]
if len(page_ids) != len(set(page_ids)):
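
The second option could look roughly like the following. This is a minimal, self-contained sketch and not the actual fix: `ImageFile` is a hypothetical stand-in for the file objects returned by find_all_files (only its pageId attribute is relied on), and the helper names `find_repeated_page_ids` and `check_one_image_per_page` are made up for illustration.

```python
from collections import Counter
from typing import List, NamedTuple, Sequence


class ImageFile(NamedTuple):
    """Hypothetical stand-in for a METS file entry; only pageId matters here."""
    ID: str
    pageId: str


def find_repeated_page_ids(files: Sequence[ImageFile]) -> List[str]:
    """Return the sorted page IDs that occur more than once among the files."""
    counts = Counter(f.pageId for f in files)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


def check_one_image_per_page(files: Sequence[ImageFile], file_grp: str) -> List[ImageFile]:
    """Raise ValueError only if some page has several images; else return the files."""
    repeated = find_repeated_page_ids(files)
    if repeated:
        raise ValueError(
            "No PAGE-XML in fileGrp '%s' but multiple images for page(s) %s." % (
                file_grp, ", ".join(repeated)))
    return list(files)


# Two different pages with one image each: accepted.
files = [ImageFile("IMG_0001", "phys_0001"), ImageFile("IMG_0007", "phys_0007")]
print(len(check_one_image_per_page(files, "OCR-D-IMG")))
```

Unlike the len(ret) > 1 test, this accepts a selection such as phys_0001,phys_0007 as long as each page contributes exactly one image, and it can name the offending pages in the error message.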
