performance of input_files on large workspaces

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that comes with it. In effect, we are now nearly as slow as before #482 on workspaces with many pages and fileGrps.

Here's a typical scenario:
1. ocrd-cis-ocropy-dewarp creates an image file for each text line and references it under the pageId and fileGrp it belongs to, with an image mimetype (for me, this creates 29,000 files in a workspace with 500 pages)
2. ocrd-tesserocr-recognize runs afterwards and queries its `self.input_files`:
https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd/ocrd/processor/base.py#L294-L299
3. this searches through all `mets:file` entries, matching them for `fileGrp` (which is reasonably fast, it only gets a little inefficient when additionally filtering by `pageId`):
https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L176-L208
4. Then in line 298 (and again further below) it queries `OcrdFile.pageId`:
https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_file.py#L116-L122
5. This in turn needs to **repeatedly** query the whole structMap via XPath (which on a workspace with 500 files and 25 fileGrps and 200.000 takes about 0.2sec per file, i.e. needs more than 1h just for the computation of `input_files`):
https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L434-L441
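
As a rough sanity check of the "more than 1h" estimate, here is the back-of-the-envelope arithmetic. It assumes that the 29,000 files from step 1 each trigger one structMap query at about 0.2 s; pairing these two figures is my assumption, and the variable names are only illustrative.

```python
# Figures quoted in the scenario above; combining them this way is an assumption.
n_files = 29_000        # image files referenced in the input fileGrp (step 1)
secs_per_lookup = 0.2   # approximate cost of one structMap XPath query (step 5)

# One lookup per file, so the total grows linearly with the number of files.
total_secs = n_files * secs_per_lookup
total_hours = total_secs / 3600

print(f"{total_secs:.0f} s, i.e. about {total_hours:.1f} h")
```

Since each query itself scans the whole structMap, the real cost grows with both the number of files and the number of `mets:fptr` entries.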

Cosmetic changes like turning `OcrdFile.pageId` into a `functools.cached_property` won't help here; the problem is bigger. METS, with its mutually related fileGrp and pageId mappings, is inherently expensive to parse. I know we decided in the past against in-memory representations like dicts, because they looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory vs. CPU time tradeoff carefully yet, considering the necessity of pageId/mimetype filtering? Is there any existing code that attempts to cache fileGrp and pageId mappings, to avoid re-parsing the METS again and again, which I could tinker with?
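
To make the tradeoff concrete, here is a minimal sketch of the kind of in-memory mapping I have in mind: walk the physical structMap once and keep two dicts, so that each later pageId lookup is a dict access instead of an XPath query. This is not existing OCR-D code. The function name `build_page_index` and the tiny METS fragment are made up for illustration, and it uses the stdlib `xml.etree.ElementTree` instead of the lxml tree that `OcrdMets` holds.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}

# Tiny hand-written METS fragment, for illustration only.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
        <mets:fptr FILEID="OCR-D-SEG_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="OCR-D-IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

def build_page_index(root):
    """Hypothetical helper: walk the physical structMap once and
    return (file ID -> page ID, page ID -> list of file IDs)."""
    file_to_page = {}
    page_to_files = defaultdict(list)
    pages_path = ("mets:structMap[@TYPE='PHYSICAL']"
                  "/mets:div[@TYPE='physSequence']"
                  "/mets:div[@TYPE='page']")
    for page in root.findall(pages_path, NS):
        page_id = page.get("ID")
        for fptr in page.findall("mets:fptr", NS):
            file_id = fptr.get("FILEID")
            file_to_page[file_id] = page_id
            page_to_files[page_id].append(file_id)
    return file_to_page, dict(page_to_files)

root = ET.fromstring(METS_XML)
file_to_page, page_to_files = build_page_index(root)

# Each lookup is now a dict access rather than a scan of the structMap.
print(file_to_page.get("OCR-D-SEG_0001"))
print(page_to_files["PHYS_0002"])
```

The memory cost is one dict entry per `mets:fptr`. The harder part is presumably keeping such an index in sync whenever files are added to or removed from the METS, which this sketch does not attempt.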

	for file_ in sorted(self.workspace.mets.find_all_files(
	        pageId=self.page_id, fileGrp=ifg, mimetype=mimetype),
	                    # sort by MIME type so PAGE comes before images
	                    key=lambda file_: file_.mimetype):
	    if not file_.pageId:
	        continue

	for cand in self._tree.getroot().xpath('//mets:file', namespaces=NS):
	    if ID:
	        if ID.startswith(REGEX_PREFIX):
	            if not fullmatch(ID[REGEX_PREFIX_LEN:], cand.get('ID')): continue
	        else:
	            if not ID == cand.get('ID'): continue

	    if pageId is not None and cand.get('ID') not in pageId:
	        continue

	    if fileGrp:
	        if fileGrp.startswith(REGEX_PREFIX):
	            if not fullmatch(fileGrp[REGEX_PREFIX_LEN:], cand.getparent().get('USE')): continue
	        else:
	            if cand.getparent().get('USE') != fileGrp: continue

	    if mimetype:
	        if mimetype.startswith(REGEX_PREFIX):
	            if not fullmatch(mimetype[REGEX_PREFIX_LEN:], cand.get('MIMETYPE') or ''): continue
	        else:
	            if cand.get('MIMETYPE') != mimetype: continue

	    if url:
	        cand_locat = cand.find('mets:FLocat', namespaces=NS)
	        if cand_locat is None:
	            continue
	        cand_url = cand_locat.get('{%s}href' % NS['xlink'])
	        if url.startswith(REGEX_PREFIX):
	            if not fullmatch(url[REGEX_PREFIX_LEN:], cand_url): continue
	        else:
	            if cand_url != url: continue

	    f = OcrdFile(cand, mets=self)

	def pageId(self):
	    """
	    Get the ``@ID`` of the physical ``mets:structMap`` entry corresponding to this ``mets:file`` (physical page manifestation).
	    """
	    if self.mets is None:
	        raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
	    return self.mets.get_physical_page_for_file(self)

	def get_physical_page_for_file(self, ocrd_file):
	    """
	    Get the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
	    corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
	    """
	    ret = self._tree.getroot().xpath(
	        '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@ID' %
	        ocrd_file.ID, namespaces=NS)

performance of input_files on large workspaces #723
