mwxml appears to incrementally consume memory while iterating over dump #23

whpac opened this issue Jan 7, 2025 · 0 comments

It appears that mwxml consumes an additional ~100 bytes of memory for every item or page in the dump. That is not much per se, but when operating on dumps from WMF wikis, it adds up to a considerable amount.

For example, I was just using mwxml to iterate over the plwiki-20241201-pages-logging.xml.gz dump (which contains ca. 14M items). It required approx. 1.2 GB of RAM; indeed, a Toolforge job given 512 MB of memory (the default) was terminated because it ran out of memory.

I've observed similar behavior when working with the dump of pages (stub-meta-history.xml.gz), where memory usage grew by about 100 bytes for every page in the file (I did not read the revisions, though, which might have changed the behavior).

The code below only reads the dump file without performing any other operation. In the long run, this should require constant memory, since the data is streamed rather than accumulated. However, as the output shows, the script ends up consuming a substantial amount of memory.

The current state is acceptable for most applications, but it may be inefficient or infeasible when working with dumps of large WMF wikis.

import gzip
import mwxml
import psutil

# Using logging1 (partial dump) for shorter execution time
dump = mwxml.Dump.from_file(gzip.open('/public/dumps/public/plwiki/20241201/plwiki-20241201-pages-logging1.xml.gz'))
proc = psutil.Process()

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB')

i = 0
for log_item in dump.log_items:
    i += 1

mem_used = proc.memory_info().rss / (2**20)
print(f'Memory: {mem_used:.2f} MB; {i} iterations')

Output:

Memory: 16.58 MB
Memory: 290.90 MB; 3246470 iterations
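To narrow down where the memory accumulates, comparing two tracemalloc snapshots around the iteration may help. The sketch below is a minimal, hypothetical diagnostic: it uses a stand-in generator instead of dump.log_items (so it runs without mwxml or a dump file), and the consume() helper is an assumption, not part of mwxml's API.

```python
import tracemalloc

def consume(items):
    # Drain an iterator, counting items without keeping any references,
    # mimicking the repro loop over dump.log_items.
    n = 0
    for _ in items:
        n += 1
    return n

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for `dump.log_items`; replacing it with the real iterator
# would attribute any growth to the allocating source lines.
count = consume(x for x in range(100_000))

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Source lines that allocated the most memory between the snapshots.
for stat in after.compare_to(before, 'lineno')[:5]:
    print(stat)
print(f'{count} iterations')
```

Running the same pattern around the real dump.log_items loop should show whether the retained allocations come from mwxml itself or from the underlying XML parser.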