
Source on disk #130

Merged: 5 commits into main from source-on-disk on Jul 10, 2023

Conversation

mikeknep (Contributor):

Instead of storing tables' source data as Pandas DataFrames in memory, store it as CSV files on disk. Users can now provide either a DataFrame or a string/Path to a CSV file when calling add_table.
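
A minimal sketch of the two call styles (the `RelationalData` import path and the exact `add_table` parameter names are assumptions based on this description, not taken from the diff):

```python
import pandas as pd
from pathlib import Path

# Assumed import path; adjust to wherever RelationalData lives in your install.
from gretel_trainer.relational import RelationalData

rd = RelationalData()

# Option 1: pass an in-memory DataFrame, as before.
events_df = pd.read_csv("events.csv")
rd.add_table(name="events", primary_key="id", data=events_df)

# Option 2 (new): pass a str or Path to a CSV on disk; the table's source
# data then stays on disk instead of being held as a DataFrame.
rd.add_table(name="users", primary_key="id", data=Path("users.csv"))
```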

I did some initial profiling using the Airline db from relational fit, which self-reports as being 455MB. I used tracemalloc to measure both the "resting memory footprint" at certain checkpoints and the peak memory size between those checkpoints. The table compares the currently released code (main / Trainer 0.9.0) with the code on this branch.
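
The harness itself isn't included in this PR; here is a minimal sketch of how tracemalloc can produce a current/peak pair per checkpoint (`reset_peak` requires Python 3.9+):

```python
import tracemalloc

tracemalloc.start()

# ... run one checkpoint, e.g. rd = conn.extract() ...

# current = standing footprint after the checkpoint completes;
# peak = high-water mark since tracing started or since the last reset.
current, peak = tracemalloc.get_traced_memory()
print(f"current={current:,} peak={peak:,}")

# Reset the peak so the next checkpoint's spike is measured in isolation.
tracemalloc.reset_peak()
```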

Using the first row and the main branch as an example, the way to read this table is:

While running on the main branch, after instantiating a Connector, the standing memory load was 1,132,735 bytes. While that method was running, memory spiked as high as 1,134,936 bytes.

All values are bytes.

| Checkpoint | Main current | Main peak | Branch current | Branch peak |
| --- | --- | --- | --- | --- |
| `conn = Connector.from_conn_str(...)` | 1,132,735 | 1,134,936 | 1,132,561 | 1,134,762 |
| `rd = conn.extract()` | 321,629,892 | 1,270,347,471 | 4,313,712 | 1,270,054,369 |
| `mt = MultiTable(rd)` | 322,222,376 | 923,687,442 | 4,902,519 | 5,140,588 |
| `mt.train_transforms(...)` | 323,666,151 | 628,013,664 | 6,285,388 | 6,328,929 |

Some observations:

  • The general "standing pressure" is much lower, since when nothing else is actively going on we're only holding file pointers in memory.
  • The largest memory spike is during extraction; there is no difference between this branch and main.
  • Since transforms models train on unaltered source data, we don't need to load the data into memory at all (we now just pass along the file pointer as the data source; see the sketch below), which is why that bottom-right corner stays so nice and small.
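
A minimal sketch of that pattern; the class and method names here are illustrative, not the actual Trainer internals:

```python
from dataclasses import dataclass
from pathlib import Path

import pandas as pd

@dataclass
class TableSource:
    # Only the path lives in memory between operations.
    path: Path

    def as_path(self) -> Path:
        # Consumers that accept a CSV file (e.g. transforms training on
        # unaltered source data) take the pointer directly; no DataFrame
        # is ever materialized.
        return self.path

    def as_dataframe(self) -> pd.DataFrame:
        # Loaded on demand only when an operation truly needs rows in
        # memory, and eligible for GC as soon as the caller is done.
        return pd.read_csv(self.path)
```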

mikeknep (Contributor, Author) commented Jun 28, 2023:

Another, more extensive run-through with tracemalloc, this time using the (smaller) 10.3 MB SFScores database:

All values are bytes.

| Checkpoint | Main current | Main peak | Branch current | Branch peak |
| --- | --- | --- | --- | --- |
| `conn = Connector.from_conn_str(...)` | 1,151,096 | 1,161,228 | 1,150,812 | 1,160,944 |
| `rd = conn.extract()` | 8,315,849 | 26,839,407 | 2,827,917 | 26,779,260 |
| `mt = MultiTable(rd)` | 8,909,101 | 15,948,522 | 3,412,425 | 3,699,777 |
| `mt.classify(...)` | 9,074,523 | 14,618,511 | 3,569,290 | 4,063,625 |
| `mt.train_transforms(...)` | 9,327,326 | 14,820,846 | 3,808,074 | 3,854,207 |
| `mt.run_transforms(...)` | 14,409,295 | 23,160,397 | 8,947,244 | 14,633,737 |
| `mt.train_synthetics(...)` | 18,806,941 | 19,540,049 | 13,326,966 | 14,741,821 |
| `mt.generate()` | 31,055,466 | 63,558,885 | 25,620,783 | 62,624,282 |

The substantial increases after run_transforms and generate are almost surely because we're currently stashing the final output data from those actions as DataFrames on the MultiTable object (mt.[transform|synthetic]_output_tables). We plan to deprecate that property and instead expose pointers to the final output files on disk; that will come in a separate PR.

The end of the generate method is also where we do table joining (to assemble the datasets we send to Evaluate for our "cross-table SQS" scores); there are plans to investigate ways to optimize that specifically. (If we were using the ancestral strategy, this joining would take place during train_synthetics.)
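
For a sense of why that step is memory-hungry, here is a minimal sketch of that kind of cross-table join in pandas; the table and key names are invented for illustration, and the real joining logic in Trainer is not shown in this PR:

```python
import pandas as pd

# Hypothetical parent/child output tables with a foreign-key relationship.
users = pd.read_csv("synth_users.csv")    # primary key: id
events = pd.read_csv("synth_events.csv")  # foreign key: user_id

# Joining child to parent produces the denormalized dataset a cross-table
# quality score would be computed on. Each join materializes a frame that
# can be much larger than either input, hence the memory cost.
joined = events.merge(users, left_on="user_id", right_on="id", suffixes=("", "_user"))
```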

tylersbray (Contributor) left a comment:
I made it through all the changes... It looks like mostly adding the new param for the data source, updating the test fixtures, etc. It's nice that this change actually simplifies some of the dumping to CSV that was already taking place, and the data in your attached tables is pretty promising. LGTM.

gracecvking (Contributor) left a comment:
LGTM too

pimlock (Contributor) left a comment:
Awesome work! A few minor nits/potential tweaks.

@mikeknep mikeknep merged commit 2b23892 into main Jul 10, 2023
@mikeknep mikeknep deleted the source-on-disk branch July 10, 2023 14:54