
mwxml appears to incrementally consume memory while iterating over dump #23

@whpac

It appears that for every item or page in the dump, mwxml consumes roughly 100 additional bytes of RAM. That is not much per item, but when operating on dumps from WMF wikis it adds up to a considerable amount.

For example, I used mwxml to iterate over the plwiki-20241201-pages-logging.xml.gz dump (which contains ca. 14M items). It required approx. 1.2 GB of RAM. As a result, a Toolforge job that was given 512 MB of memory (the default) was terminated for running out of memory.
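As a rough sanity check on those figures, the arithmetic can be sketched as follows. It assumes a flat 100 bytes per item and treats the 512 MB limit as 512 MiB. Both are simplifications: the baseline memory of the interpreter is ignored, and the variable names are purely illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted above.
# Assumptions: a flat 100 bytes retained per item, limit interpreted as 512 MiB.
MIB = 2**20
bytes_per_item = 100
items_in_dump = 14_000_000
limit_bytes = 512 * MIB

# Memory retained after iterating over the whole dump
estimated_total_mib = items_in_dump * bytes_per_item / MIB

# Number of items that fit before the limit is exhausted
items_until_limit = limit_bytes // bytes_per_item

print(f'Estimated growth for the full dump: {estimated_total_mib:.0f} MiB')
print(f'Items processed before reaching the limit: {items_until_limit}')
```

Under these assumptions the job exhausts its allowance after a bit more than 5 million items, well before the end of the dump, and the estimated total is in the same ballpark as the 1.2 GB observed.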

I've observed similar behavior when working with the dump of pages (stub-meta-history.xml.gz), where memory usage grew by about 100 bytes for every page in the file. However, I did not read the revisions, which might have changed the behavior.

The code below only reads the dump file without performing any operation on it. In the long run this should need only constant memory, since the data is streamed and not accumulated. However, as the output shows, the script ends up consuming a substantial amount of memory.

The current state is workable for most applications. However, it may be inefficient or infeasible when working with dumps of large WMF wikis.

import gzip
import mwxml
import psutil

# Using logging1 (partial dump) for shorter execution time
dump = mwxml.Dump.from_file(gzip.open('/public/dumps/public/plwiki/20241201/plwiki-20241201-pages-logging1.xml.gz'))
proc = psutil.Process()

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB')

i = 0
for log_item in dump.log_items:
    i += 1

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB; {i} iterations')

Output:

Memory: 16.58 MB
Memory: 290.90 MB; 3246470 iterations
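The per-item cost implied by this output can be worked out directly. The snippet below is only a sketch of that arithmetic: the three numbers are copied from the output above, and the variable names are illustrative. It attributes all RSS growth to the loop, which is an assumption, since RSS also reflects allocator behavior and anything else the process does.

```python
# Figures copied from the output above.
MIB = 2**20
mem_before_mib = 16.58
mem_after_mib = 290.90
iterations = 3_246_470

# RSS growth over the whole loop, converted back to bytes
growth_bytes = (mem_after_mib - mem_before_mib) * MIB

# Average growth attributed to each log item
bytes_per_item = growth_bytes / iterations
print(f'{bytes_per_item:.1f} bytes per log item')
```

This comes out at a little under 90 bytes per log item, in line with the estimate of roughly 100 bytes given at the top.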
