Chunked independent synthetics preprocessing #133
Merged
For independent synthetics, the data source we use for the model is identical to the source data except that the primary key (PK) and foreign key (FK) columns are removed. Now that we have source data stored on disk instead of in memory, this small code block seemed annoying: read the whole dataframe (minus key columns) into memory only to write it right back out to CSV? Blegh.
This change at least limits the memory usage by reading and writing in chunks.
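For readers who want the shape of the idea without opening the diff, here is a minimal sketch of a chunked copy, assuming pandas is the dataframe library in play. The function name `drop_key_columns_chunked`, its parameters, and the default chunk size are hypothetical and are not taken from this PR's code.

```python
import pandas as pd


def drop_key_columns_chunked(source_path, dest_path, key_columns, chunksize=10_000):
    """Copy a CSV to dest_path without key_columns, one chunk at a time.

    Hypothetical helper for illustration; not the code changed in this PR.
    """
    keys = set(key_columns)

    # Read only the header row to work out which columns to keep.
    header = pd.read_csv(source_path, nrows=0).columns
    keep = [col for col in header if col not in keys]

    # Write the header once, then append each chunk without a header.
    pd.DataFrame(columns=keep).to_csv(dest_path, index=False)
    for chunk in pd.read_csv(source_path, usecols=keep, chunksize=chunksize):
        chunk.to_csv(dest_path, mode="a", header=False, index=False)
```

Only one chunk of rows is held in memory at a time, so peak usage depends on `chunksize` rather than on the size of the file.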
I extracted these two basic implementations and ran an experiment with tracemalloc using a 400MB CSV file. Using the previous approach, traced memory peaked at 977,813,835 bytes (about 978 MB). When using chunks, traced memory peaked at 232,618,553 bytes (about 233 MB), roughly a quarter of the previous peak.
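A measurement along these lines can be sketched with the standard library alone. The helper name `peak_traced_bytes` is hypothetical, and this is not the exact script used for the numbers above.

```python
import tracemalloc


def peak_traced_bytes(func, *args, **kwargs):
    """Run func and return the peak memory traced by tracemalloc, in bytes.

    Hypothetical helper for illustration; not the script used in this PR.
    """
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) sizes in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling it once per implementation on the same input file gives directly comparable peaks. Note that tracemalloc reports memory it traces through Python's allocators, which is not the same thing as the resident size of the process.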
There may be some future improvements to this (and other areas of the code) leveraging Dask, but that requires more research and testing. (In fact, a naive read-all-and-write with Dask performs worse than both of these approaches, perhaps due to the overhead of Dask launching multiple workers. Dask would probably prove more beneficial with an even larger source file, one that could not fit into memory all at once.) The changes here seem like a quick enough win to be worth landing on their own.