It appears that mwxml consumes an additional ~100 bytes of RAM for every item or page in the dump. That is not much per se, but when operating on dumps of WMF wikis it adds up to a considerable amount.
For example, I was just using mwxml to iterate over the plwiki-20241201-pages-logging.xml.gz dump (which contains ca. 14M items). It required approx. 1.2 GB of RAM; indeed, a Toolforge job given 512 MB of memory (the default) was terminated because it ran out of memory.
I've observed similar behavior when working with the pages dump (stub-meta-history.xml.gz), where memory usage grew by about 100 bytes for every page in the file (I did not read the pages' revisions, however, which could have affected the behavior).
The code below only reads the dump file without performing any other operation. Since the data is streamed rather than accumulated, this should run with constant memory requirements in the long run. However, as can be seen in the output, the amount of memory consumed by the script is substantial by the end.
The current state is workable for most applications; however, it may be inefficient or even infeasible when working with dumps of large WMF wikis.
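(The original script was not preserved here; the following is a minimal sketch of the kind of read-only pass described above. The dump path, the progress interval, and the memory reporting via `resource.getrusage` are illustrative choices, not part of the original report.)

```python
import gzip
import resource

import mwxml

# Illustrative path; any stub-meta-history (or similar) dump applies.
DUMP_PATH = "plwiki-20241201-stub-meta-history.xml.gz"


def peak_rss_mb() -> float:
    """Peak resident set size of this process (ru_maxrss is in KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


with gzip.open(DUMP_PATH, "rb") as f:
    dump = mwxml.Dump.from_file(f)
    for i, page in enumerate(dump, start=1):
        # No work is done with the page and its revisions are never read;
        # the loop only streams through the dump.
        if i % 1_000_000 == 0:
            print(f"{i:>10,} pages seen, peak RSS: {peak_rss_mb():.0f} MB")

print(f"Finished. Peak RSS: {peak_rss_mb():.0f} MB")
```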
Output: