Source on disk #130
Conversation
Another larger run-through with tracemalloc, using the (smaller) 10.3MB SFScores database:
I made it through all the changes... Looks like it's mostly adding the new param for the data source, updating the test fixtures, etc. It's nice that this change actually simplifies some of the dumping to CSV that was already taking place, and the data in your attached tables is pretty promising. LGTM.
LGTM too
Awesome work! A few minor nits/potential tweaks.
Instead of storing tables' source data as Pandas DataFrames in memory, store them as CSV files on disk. Users can now provide either a DataFrame or a string/Path to a CSV when calling `add_table`.
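For illustration, here's a minimal sketch of both call styles, assuming a `RelationalData.add_table` method with `name`/`primary_key`/`data` parameters; the table names, keys, and file paths are hypothetical and the exact signature may differ:

```python
from pathlib import Path

import pandas as pd
from gretel_trainer.relational import RelationalData  # assumed import path

rd = RelationalData()

# Existing behavior: hand add_table an in-memory DataFrame
users_df = pd.read_csv("users.csv")
rd.add_table(name="users", primary_key="id", data=users_df)

# New in this PR: hand add_table a Path (or plain string) pointing at a CSV,
# so the source data can stay on disk instead of living in memory
rd.add_table(name="orders", primary_key="id", data=Path("orders.csv"))
rd.add_table(name="events", primary_key="id", data="events.csv")
```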
I did some initial profiling using the Airline db from relational fit, which self-reports as being 455MB. I used tracemalloc to measure both the "resting memory footprint" at certain checkpoints, as well as the peak memory usage between those checkpoints. The table compares the currently released code on main / Trainer 0.9.0 with the code on this branch.
Using the first row and main branch as an example, the way to read this table is:
conn = Connector.from_conn_str(...)  # connect to the source database
rd = conn.extract()                  # extract the source tables
mt = MultiTable(rd)                  # set up the MultiTable experiment from the extracted data
mt.train_transforms(...)             # train transform models
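For context, a rough sketch of how resting/peak numbers like these could be captured around each step with tracemalloc; the checkpoint helper and labels below are illustrative, not necessarily the exact harness used for the tables above:

```python
import tracemalloc

from gretel_trainer.relational import Connector, MultiTable  # assumed import path


def checkpoint(label: str) -> None:
    # current = "resting" footprint right now; peak = high-water mark since the last reset
    current, peak = tracemalloc.get_traced_memory()
    print(f"{label}: current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")
    tracemalloc.reset_peak()  # Python 3.9+; next checkpoint then reports only its own interval


tracemalloc.start()

conn = Connector.from_conn_str(...)  # connection string elided, as above
checkpoint("connect")

rd = conn.extract()
checkpoint("extract")

mt = MultiTable(rd)
checkpoint("MultiTable init")

mt.train_transforms(...)  # same call as above, arguments elided
checkpoint("train_transforms")
```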
Some observations: