
Chunked independent synthetics preprocessing #133

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

mikeknep
Contributor

For independent synthetics, the data source we use for the model is identical to the source data except that PK and FK columns are removed. Now that we store source data on disk instead of in memory, this small code block seemed annoying: read the whole dataframe (minus key columns) into memory only to write it right back out to CSV? Blegh.

This change at least limits the memory usage by reading and writing in chunks.
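For readers who want the shape of the idea, here is a minimal sketch of chunked read-and-write with pandas. It is not the code from this PR: the function name `write_without_key_columns`, its parameters, and the default chunk size are all hypothetical, chosen for illustration only.

```python
import pandas as pd


def write_without_key_columns(source_path, dest_path, key_columns, chunk_size=10_000):
    """Copy a CSV to dest_path, dropping key_columns, one chunk at a time.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Read only the header row to learn the column names cheaply.
    all_columns = pd.read_csv(source_path, nrows=0).columns
    excluded = set(key_columns)
    keep = [col for col in all_columns if col not in excluded]

    # Write the header once, so the output is valid even with zero data rows.
    pd.DataFrame(columns=keep).to_csv(dest_path, index=False)

    # Stream the remaining columns through memory chunk_size rows at a time,
    # appending each chunk to the destination file.
    reader = pd.read_csv(source_path, usecols=keep, chunksize=chunk_size)
    for chunk in reader:
        chunk.to_csv(dest_path, mode="a", header=False, index=False, columns=keep)
```

With this shape, peak memory is bounded by roughly one chunk rather than the whole file, at the cost of one extra pass over the header.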

I extracted these two basic implementations and ran an experiment with tracemalloc using a 400MB CSV file. With the previous approach, traced memory peaked at 977,813,835 bytes (roughly 978 MB). With chunks, traced memory peaked at 232,618,553 bytes (roughly 233 MB).
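A comparison like this can be reproduced with a small wrapper around the standard library's `tracemalloc`. The sketch below is an assumption about how such a measurement might be set up, not the script used for the numbers above; `peak_traced_bytes` is a hypothetical name.

```python
import tracemalloc


def peak_traced_bytes(func, *args, **kwargs):
    """Run func and return the peak memory (in bytes) traced while it ran.

    Hypothetical helper for illustration. Note that it stops tracing on exit,
    so it should not be nested inside another tracemalloc session.
    """
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        # get_traced_memory() returns (current, peak), both in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling it once per implementation on the same input file gives two peak values that can be compared directly. tracemalloc only sees allocations made through Python's memory allocators, so the figures are best read as relative rather than as total process memory.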

There may be some future improvements to this (and other areas of the code) leveraging Dask, but that requires more research and testing. (In fact, a naive read-all-and-write with Dask performs worse than both these approaches, perhaps due to the overhead of Dask launching multiple workers? Dask would probably prove more beneficial with an even larger source file, one that could not fit into memory all at once.) The changes here seem like a quick enough win worth landing on their own.

Contributor

tylersbray left a comment


lgtm

mikeknep merged commit 81677d2 into main Jul 13, 2023
mikeknep deleted the independent-preprocessing branch July 13, 2023 14:17