performance of input_files on large workspaces #723

@bertsky

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.

Here's a typical scenario:

  1. ocrd-cis-ocropy-dewarp creates an image file for each text line and references it under the pageId and fileGrp it belongs to, with an image mimetype (this creates 29,000 files for me in a workspace with 500 pages)
  2. ocrd-tesserocr-recognize runs afterwards and queries its self.input_files:
    for file_ in sorted(self.workspace.mets.find_all_files(
            pageId=self.page_id, fileGrp=ifg, mimetype=mimetype),
                        # sort by MIME type so PAGE comes before images
                        key=lambda file_: file_.mimetype):
        if not file_.pageId:
            continue
  3. This searches through all mets:file entries, matching them against fileGrp (which is reasonably fast; it only gets a little inefficient when additionally filtering by pageId):
    for cand in self._tree.getroot().xpath('//mets:file', namespaces=NS):
        if ID:
            if ID.startswith(REGEX_PREFIX):
                if not fullmatch(ID[REGEX_PREFIX_LEN:], cand.get('ID')): continue
            else:
                if not ID == cand.get('ID'): continue
        if pageId is not None and cand.get('ID') not in pageId:
            continue
        if fileGrp:
            if fileGrp.startswith(REGEX_PREFIX):
                if not fullmatch(fileGrp[REGEX_PREFIX_LEN:], cand.getparent().get('USE')): continue
            else:
                if cand.getparent().get('USE') != fileGrp: continue
        if mimetype:
            if mimetype.startswith(REGEX_PREFIX):
                if not fullmatch(mimetype[REGEX_PREFIX_LEN:], cand.get('MIMETYPE') or ''): continue
            else:
                if cand.get('MIMETYPE') != mimetype: continue
        if url:
            cand_locat = cand.find('mets:FLocat', namespaces=NS)
            if cand_locat is None:
                continue
            cand_url = cand_locat.get('{%s}href' % NS['xlink'])
            if url.startswith(REGEX_PREFIX):
                if not fullmatch(url[REGEX_PREFIX_LEN:], cand_url): continue
            else:
                if cand_url != url: continue
        f = OcrdFile(cand, mets=self)
  4. Then in line 298 (and again further below) it queries OcrdFile.pageId:
    def pageId(self):
        """
        Get the ``@ID`` of the physical ``mets:structMap`` entry corresponding to this ``mets:file`` (physical page manifestation).
        """
        if self.mets is None:
            raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
        return self.mets.get_physical_page_for_file(self)
  5. This in turn has to query the whole structMap via XPath once for every file (which on a workspace with 500 files and 25 fileGrps and 200.000 takes about 0.2 s per file, i.e. needs more than 1 h just to compute input_files):
    def get_physical_page_for_file(self, ocrd_file):
        """
        Get the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
        corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
        """
        ret = self._tree.getroot().xpath(
            '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@ID' %
            ocrd_file.ID, namespaces=NS)
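
The cost pattern of steps 4 and 5 can be reproduced in isolation. The following is only a rough sketch, not the actual OCR-D code: it uses the standard library's xml.etree.ElementTree instead of lxml, and `make_struct_map` and `page_for_file_by_scan` are hypothetical helpers invented for the illustration. It counts how many mets:fptr elements get inspected when every file looks up its page by scanning the structMap on its own.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}


def make_struct_map(n_pages, files_per_page):
    """Build a synthetic physical structMap (made-up test data)."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", TYPE="PHYSICAL")
    seq = ET.SubElement(struct_map, f"{{{METS_NS}}}div", TYPE="physSequence")
    for p in range(n_pages):
        page = ET.SubElement(seq, f"{{{METS_NS}}}div", TYPE="page", ID=f"P{p:04d}")
        for f in range(files_per_page):
            ET.SubElement(page, f"{{{METS_NS}}}fptr", FILEID=f"F{p:04d}_{f:03d}")
    return struct_map


def page_for_file_by_scan(struct_map, file_id):
    """Per-file lookup: walk the page divs until one references file_id.

    Returns (page_id, number_of_fptr_elements_inspected).
    """
    inspected = 0
    for page in struct_map.iterfind("mets:div/mets:div[@TYPE='page']", NS):
        for fptr in page.iterfind("mets:fptr", NS):
            inspected += 1
            if fptr.get("FILEID") == file_id:
                return page.get("ID"), inspected
    return None, inspected


struct_map = make_struct_map(n_pages=20, files_per_page=10)
file_ids = [fptr.get("FILEID") for fptr in struct_map.iter(f"{{{METS_NS}}}fptr")]

# One independent scan per file, as input_files effectively does via pageId
total = sum(page_for_file_by_scan(struct_map, fid)[1] for fid in file_ids)
print(len(file_ids), "files,", total, "fptr elements inspected")
```

With n files the scans add up to n(n+1)/2 inspected elements here, so the work grows quadratically with the number of files. The absolute numbers say nothing about lxml's XPath engine; only the growth pattern is the point.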

Cosmetic changes like turning OcrdFile.pageId into a functools.cached_property won't help here; the problem is bigger. METS, with its mutually related fileGrp and pageId mappings, is inherently expensive to parse. I know we decided in the past against in-memory representations like dicts, because they looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory vs. CPU time tradeoff carefully yet, considering the necessity for pageId/mimetype filtering? Is there any existing code that attempts to cache the fileGrp and pageId mappings, to avoid reparsing the METS again and again, which I could tinker with?
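
To make the question more concrete, here is a minimal sketch of what such a dict-based cache might look like. Everything in it is an assumption for discussion: `MetsIndex`, its attributes and its `find` method are hypothetical and not part of OcrdMets, the sample METS is made up, and the standard library's xml.etree.ElementTree stands in for lxml. The idea is simply to walk the structMap and the fileSec once each, and answer later lookups from dicts.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up sample data, just large enough to exercise both mappings
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff"/>
      <mets:file ID="IMG_0002" MIMETYPE="image/tiff"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG">
      <mets:file ID="SEG_0001" MIMETYPE="application/vnd.prima.page+xml"/>
      <mets:file ID="SEG_0001_line1" MIMETYPE="image/png"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="SEG_0001"/>
        <mets:fptr FILEID="SEG_0001_line1"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


class MetsIndex:
    """Hypothetical in-memory index, built in one pass over the METS tree."""

    def __init__(self, root):
        self.page_of_file = {}   # file ID -> physical page ID
        self.files_of_grp = {}   # fileGrp USE -> list of file IDs
        pages = root.iterfind(
            "mets:structMap[@TYPE='PHYSICAL']/mets:div/mets:div[@TYPE='page']", NS)
        for page in pages:
            for fptr in page.iterfind("mets:fptr", NS):
                self.page_of_file[fptr.get("FILEID")] = page.get("ID")
        for grp in root.iterfind("mets:fileSec/mets:fileGrp", NS):
            ids = self.files_of_grp.setdefault(grp.get("USE"), [])
            for file_ in grp.iterfind("mets:file", NS):
                ids.append(file_.get("ID"))

    def find(self, file_grp, page_id=None):
        """Return file IDs in file_grp, optionally restricted to one page."""
        ids = self.files_of_grp.get(file_grp, [])
        if page_id is None:
            return list(ids)
        return [i for i in ids if self.page_of_file.get(i) == page_id]


index = MetsIndex(ET.fromstring(SAMPLE_METS))
print(index.page_of_file["SEG_0001_line1"])
print(index.find("OCR-D-SEG", page_id="PHYS_0001"))
```

The price is the one raised above: the dicts hold one entry per file for as long as the METS is open, and they would have to be updated or rebuilt whenever files are added to or removed from the workspace.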
