
Source on disk #130

Merged: 5 commits into main from source-on-disk on Jul 10, 2023

Conversation

mikeknep (Contributor):

Instead of storing tables' source data as Pandas DataFrames in memory, store it as CSV files on disk. Users can now provide either a DataFrame or a string/Path to a CSV file when calling add_table.
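
A minimal sketch of the two call styles (the `RelationalData` import path and the exact `add_table` parameter names are assumptions based on this description, not taken from the diff):

```python
import pandas as pd
from pathlib import Path

# Assumed import path; adjust to wherever RelationalData lives in your install.
from gretel_trainer.relational import RelationalData

rd = RelationalData()

# Option 1: pass an in-memory DataFrame, as before.
events_df = pd.read_csv("events.csv")
rd.add_table(name="events", primary_key="id", data=events_df)

# Option 2 (new): pass a str or Path to a CSV on disk; the table's source
# data then stays on disk instead of being held as a DataFrame.
rd.add_table(name="users", primary_key="id", data=Path("users.csv"))
```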

I did some initial profiling using the Airline db from relational fit, which self-reports as being 455MB. I used tracemalloc to measure both the "resting memory footprint" at certain checkpoints and the peak memory size between those checkpoints. The table compares the currently released code (main / Trainer 0.9.0) with the code on this branch.
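
The harness itself isn't included in this PR; here is a minimal sketch of how tracemalloc can produce a current/peak pair per checkpoint (`reset_peak` requires Python 3.9+):

```python
import tracemalloc

tracemalloc.start()

# ... run one checkpoint, e.g. rd = conn.extract() ...

# current = standing footprint after the checkpoint completes;
# peak = high-water mark since tracing started or since the last reset.
current, peak = tracemalloc.get_traced_memory()
print(f"current={current:,} peak={peak:,}")

# Reset the peak so the next checkpoint's spike is measured in isolation.
tracemalloc.reset_peak()
```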

Using the first row and the main branch as an example, the way to read this table is:

While running on the main branch, after instantiating a Connector, the standing memory load was 1,132,735 bytes. While that method was running, memory spiked as high as 1,134,936 bytes.

All values are bytes.

| Checkpoint | Main current | Main peak | Branch current | Branch peak |
| --- | --- | --- | --- | --- |
| `conn = Connector.from_conn_str(...)` | 1,132,735 | 1,134,936 | 1,132,561 | 1,134,762 |
| `rd = conn.extract()` | 321,629,892 | 1,270,347,471 | 4,313,712 | 1,270,054,369 |
| `mt = MultiTable(rd)` | 322,222,376 | 923,687,442 | 4,902,519 | 5,140,588 |
| `mt.train_transforms(...)` | 323,666,151 | 628,013,664 | 6,285,388 | 6,328,929 |

Some observations:

  • The general "standing pressure" is much lower, since when nothing else is actively going on we're only holding file pointers in memory.
  • The largest memory spike is during extraction; there is no difference between this branch and main.
  • Since transforms models train on unaltered source data, we don't need to load the data into memory at all (we now just pass along the file pointer as the data source; see the sketch below), which is why that bottom-right corner stays so nice and small.
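
A minimal sketch of that pattern; the class and method names here are illustrative, not the actual Trainer internals:

```python
from dataclasses import dataclass
from pathlib import Path

import pandas as pd

@dataclass
class TableSource:
    # Only the path lives in memory between operations.
    path: Path

    def as_path(self) -> Path:
        # Consumers that accept a CSV file (e.g. transforms training on
        # unaltered source data) take the pointer directly; no DataFrame
        # is ever materialized.
        return self.path

    def as_dataframe(self) -> pd.DataFrame:
        # Loaded on demand only when an operation truly needs rows in
        # memory, and eligible for GC as soon as the caller is done.
        return pd.read_csv(self.path)
```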

mikeknep (Contributor, Author) commented Jun 28, 2023:

Another, more extensive run-through with tracemalloc, this time using the (smaller) 10.3 MB SFScores database:

All values are bytes.

| Checkpoint | Main current | Main peak | Branch current | Branch peak |
| --- | --- | --- | --- | --- |
| `conn = Connector.from_conn_str(...)` | 1,151,096 | 1,161,228 | 1,150,812 | 1,160,944 |
| `rd = conn.extract()` | 8,315,849 | 26,839,407 | 2,827,917 | 26,779,260 |
| `mt = MultiTable(rd)` | 8,909,101 | 15,948,522 | 3,412,425 | 3,699,777 |
| `mt.classify(...)` | 9,074,523 | 14,618,511 | 3,569,290 | 4,063,625 |
| `mt.train_transforms(...)` | 9,327,326 | 14,820,846 | 3,808,074 | 3,854,207 |
| `mt.run_transforms(...)` | 14,409,295 | 23,160,397 | 8,947,244 | 14,633,737 |
| `mt.train_synthetics(...)` | 18,806,941 | 19,540,049 | 13,326,966 | 14,741,821 |
| `mt.generate()` | 31,055,466 | 63,558,885 | 25,620,783 | 62,624,282 |

The substantial increases after run_transforms and generate are almost surely because we're currently stashing the final output data from those actions as DataFrames on the MultiTable object (mt.[transform|synthetic]_output_tables). We plan to deprecate that property and instead expose pointers to the final output files on disk; that will come in a separate PR.

The end of the generate method is also where we do table joining (to assemble the datasets we send to Evaluate for our "cross-table SQS" scores); there are plans to investigate ways to optimize that specifically. (If we were using the ancestral strategy, this joining would take place during train_synthetics.)
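
For a sense of why that step is memory-hungry, here is a minimal sketch of that kind of cross-table join in pandas; the table and key names are invented for illustration, and the real joining logic in Trainer is not shown in this PR:

```python
import pandas as pd

# Hypothetical parent/child output tables with a foreign-key relationship.
users = pd.read_csv("synth_users.csv")    # primary key: id
events = pd.read_csv("synth_events.csv")  # foreign key: user_id

# Joining child to parent produces the denormalized dataset a cross-table
# quality score would be computed on. Each join materializes a frame that
# can be much larger than either input, hence the memory cost.
joined = events.merge(users, left_on="user_id", right_on="id", suffixes=("", "_user"))
```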

tylersbray (Contributor) left a comment:
I made it through all the changes... It looks like mostly adding the new param for the data source, updating the test fixtures, etc. It's nice that this change actually simplifies some of the dumping to CSV that was already taking place, and the data in your attached tables is pretty promising. LGTM.

gracecvking (Contributor) left a comment:
LGTM too

pimlock (Contributor) left a comment:
Awesome work! A few minor nits/potential tweaks.

@mikeknep mikeknep merged commit 2b23892 into main Jul 10, 2023
@mikeknep mikeknep deleted the source-on-disk branch July 10, 2023 14:54