By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.
Here's a typical scenario:
- ocrd-cis-ocropy-dewarp creates an image file for each text line, and references it under the pageId and fileGrp which it belongs to – under an image mimetype (this creates 29.000 files for me in a workspace with 500 pages)
- ocrd-tesserocr-recognize runs afterwards and queries its self.input_files:

  core/ocrd/ocrd/processor/base.py
  Lines 294 to 299 in 9069a65

      for file_ in sorted(self.workspace.mets.find_all_files(
              pageId=self.page_id, fileGrp=ifg, mimetype=mimetype),
                          # sort by MIME type so PAGE comes before images
                          key=lambda file_: file_.mimetype):
          if not file_.pageId:
              continue

- This searches through all mets:file entries, matching them for fileGrp (which is reasonably fast, it only gets a little inefficient when additionally filtering by pageId):

  core/ocrd_models/ocrd_models/ocrd_mets.py
  Lines 176 to 208 in 9069a65

      for cand in self._tree.getroot().xpath('//mets:file', namespaces=NS):
          if ID:
              if ID.startswith(REGEX_PREFIX):
                  if not fullmatch(ID[REGEX_PREFIX_LEN:], cand.get('ID')):
                      continue
              else:
                  if not ID == cand.get('ID'):
                      continue

          if pageId is not None and cand.get('ID') not in pageId:
              continue

          if fileGrp:
              if fileGrp.startswith(REGEX_PREFIX):
                  if not fullmatch(fileGrp[REGEX_PREFIX_LEN:], cand.getparent().get('USE')):
                      continue
              else:
                  if cand.getparent().get('USE') != fileGrp:
                      continue

          if mimetype:
              if mimetype.startswith(REGEX_PREFIX):
                  if not fullmatch(mimetype[REGEX_PREFIX_LEN:], cand.get('MIMETYPE') or ''):
                      continue
              else:
                  if cand.get('MIMETYPE') != mimetype:
                      continue

          if url:
              cand_locat = cand.find('mets:FLocat', namespaces=NS)
              if cand_locat is None:
                  continue
              cand_url = cand_locat.get('{%s}href' % NS['xlink'])
              if url.startswith(REGEX_PREFIX):
                  if not fullmatch(url[REGEX_PREFIX_LEN:], cand_url):
                      continue
              else:
                  if cand_url != url:
                      continue

          f = OcrdFile(cand, mets=self)

- Then in line 298 (and again further below) it queries OcrdFile.pageId:

  core/ocrd_models/ocrd_models/ocrd_file.py
  Lines 116 to 122 in 9069a65

      def pageId(self):
          """
          Get the ``@ID`` of the physical ``mets:structMap`` entry corresponding to this ``mets:file`` (physical page manifestation).
          """
          if self.mets is None:
              raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
          return self.mets.get_physical_page_for_file(self)

- This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 pages, 25 fileGrps and 200.000 files takes about 0.2sec per file, i.e. needs more than 1h just for the computation of input_files):

  core/ocrd_models/ocrd_models/ocrd_mets.py
  Lines 434 to 441 in 9069a65

      def get_physical_page_for_file(self, ocrd_file):
          """
          Get the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
          corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
          """
          ret = self._tree.getroot().xpath(
              '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@ID' %
              ocrd_file.ID, namespaces=NS)

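As a rough check of the "more than 1h" figure, here is a back-of-the-envelope estimate. It uses the 29.000 files and 0.2 sec per lookup reported above, and assumes one pageId lookup per file in the input fileGrp (the loop may in fact query it more than once). The function name estimated_hours is made up for this illustration.

```python
def estimated_hours(n_files: int, seconds_per_lookup: float,
                    lookups_per_file: int = 1) -> float:
    """Total time spent on per-file page lookups, in hours."""
    return n_files * lookups_per_file * seconds_per_lookup / 3600


# Figures from the report: ~29.000 files, ~0.2 sec per structMap query.
hours_once = estimated_hours(29_000, 0.2)
# Assumption: a second lookup per file "further below" doubles the cost.
hours_twice = estimated_hours(29_000, 0.2, lookups_per_file=2)
print(f"{hours_once:.1f} h with one lookup, {hours_twice:.1f} h with two")
```

The cost grows linearly with the number of files in the fileGrp, and each lookup itself scans the whole structMap, so the overall behaviour is roughly quadratic in workspace size.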
Cosmetic changes like turning OcrdFile.pageId into a functools.cached_property won't help here; the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory vs. CPU time tradeoff carefully yet (also considering the necessity for pageId/mimetype filtering)? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tinker with?
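To make the kind of cache I have in mind concrete, here is a minimal sketch, not existing OCR-D code. It walks the physical structMap once and keeps two dicts, so that each later pageId lookup is a dict access instead of an XPath query. The class name PageIndex and the sample METS are invented for illustration, and it uses the standard library's xml.etree.ElementTree instead of lxml only to stay self-contained.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}

# Tiny made-up METS document, just enough to exercise the index.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff"/>
      <mets:file ID="IMG_0002" MIMETYPE="image/tiff"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG">
      <mets:file ID="SEG_0001" MIMETYPE="application/vnd.prima.page+xml"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="SEG_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

PAGE_PATH = ("mets:structMap[@TYPE='PHYSICAL']"
             "/mets:div[@TYPE='physSequence']"
             "/mets:div[@TYPE='page']")


class PageIndex:
    """Hypothetical helper: file ID <-> page ID maps, built in a single pass."""

    def __init__(self, root):
        self.page_of_file = {}                  # file ID -> page ID
        self.files_of_page = defaultdict(list)  # page ID -> [file IDs]
        for page in root.findall(PAGE_PATH, NS):
            page_id = page.get("ID")
            for fptr in page.findall("mets:fptr", NS):
                file_id = fptr.get("FILEID")
                self.page_of_file[file_id] = page_id
                self.files_of_page[page_id].append(file_id)


root = ET.fromstring(METS_XML)
index = PageIndex(root)
print(index.page_of_file.get("SEG_0001"))
print(index.files_of_page["PHYS_0002"])
```

The memory cost is one dict entry per mets:fptr. The harder part, which this sketch leaves out, is keeping the index in sync whenever files are added to or removed from the METS.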