mwxml appears to incrementally consume memory while iterating over dump #23

whpac opened this issue Jan 7, 2025 · 0 comments

It appears that mwxml consumes an additional ~100 bytes of memory for every item or page in the dump. That is not much per se, but when operating on dumps from WMF wikis, it adds up to a considerable amount.

For example, I was just using mwxml to iterate over the plwiki-20241201-pages-logging.xml.gz dump (which contains ca. 14M items). It required approx. 1.2 GB of RAM; indeed, a Toolforge job given 512 MB of memory (the default) was terminated because it ran out of memory.

I've observed similar behavior when working with the dump of pages (stub-meta-history.xml.gz), where memory usage grew by about 100 bytes for every page in the file (I did not read the revisions, though, which might have changed the behavior).

The code below only reads the dump file without performing any other operation. In the long run, this should require constant memory, since the data is streamed rather than accumulated. However, as the output shows, the script ends up consuming a substantial amount of memory.

The current state is acceptable for most applications, but it may be inefficient or infeasible when working with dumps of large WMF wikis.

import gzip
import mwxml
import psutil

# Using logging1 (partial dump) for shorter execution time
dump = mwxml.Dump.from_file(gzip.open('/public/dumps/public/plwiki/20241201/plwiki-20241201-pages-logging1.xml.gz'))
proc = psutil.Process()

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB')

i = 0
for log_item in dump.log_items:
    i += 1

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB; {i} iterations')

Output:

Memory: 16.58 MB
Memory: 290.90 MB; 3246470 iterations
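To narrow down where the memory accumulates, comparing two tracemalloc snapshots around the iteration may help. The sketch below is a minimal, hypothetical diagnostic: it uses a stand-in generator instead of dump.log_items (so it runs without mwxml or a dump file), and the consume() helper is an assumption, not part of mwxml's API.

```python
import tracemalloc

def consume(items):
    # Drain an iterator, counting items without keeping any references,
    # mimicking the repro loop over dump.log_items.
    n = 0
    for _ in items:
        n += 1
    return n

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for `dump.log_items`; replacing it with the real iterator
# would attribute any growth to the allocating source lines.
count = consume(x for x in range(100_000))

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Source lines that allocated the most memory between the snapshots.
for stat in after.compare_to(before, 'lineno')[:5]:
    print(stat)
print(f'{count} iterations')
```

Running the same pattern around the real dump.log_items loop should show whether the retained allocations come from mwxml itself or from the underlying XML parser.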