Slow parsing of seed file #874
Comments
see also #867
Here's what some quick profiler work shows (I only wrapped […]):
Looks like most of the time is spent parsing dates and times and in the agate TypeTester. One way to make this faster (~3x speedup on this test data) would be to allow users to specify the types of their csv columns, but that sounds potentially quite difficult and error-prone.
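For context, a rough sketch of what that could look like with agate used directly (this is not dbt's actual code path; the file name and column names are placeholders):

```python
import time

import agate

path = "retail_calendar.csv"  # placeholder: the attached data saved as a csv

# What happens today: agate's TypeTester tries Boolean, Number, Date,
# DateTime, etc. against every column, which means a lot of speculative
# date/time parsing per cell.
start = time.time()
inferred = agate.Table.from_csv(path)
print("inferred:", round(time.time() - start, 2), "s", inferred.column_types)

# The idea floated above: let the user pin column types so the tester
# doesn't have to guess (column names here are made up).
forced = {
    "week_start_date": agate.Date(),
    "week_end_date": agate.Date(),
    "fiscal_week_number": agate.Number(),
}
tester = agate.TypeTester(force=forced, limit=100)  # sample only 100 rows for the rest
start = time.time()
explicit = agate.Table.from_csv(path, column_types=tester)
print("explicit:", round(time.time() - start, 2), "s", explicit.column_types)
```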
Would something like this help?
Also, I think there's already a way to specify types in dbt seeds? Seems familiar, anyway.
Switching out csv parsers would be a bit involved. Also, that parser is fast in part because it exclusively supports a subset of valid ISO 8601/RFC 3339 dates, which a lot of user data is not... for example, the sample data you provided :) The general case of parsing user-provided csv files is pretty tough; it's a bit of an under-specified format. I don't believe there's any way to specify column types for seeds when invoking `dbt seed`.
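To make that trade-off concrete, here's a toy comparison (not dbt code; assumes python-dateutil is installed) between a strict ISO 8601 parser and a tolerant general-purpose one:

```python
from datetime import datetime

from dateutil import parser as dateutil_parser  # tolerant, general-purpose parser

iso_value = "2018-07-01"   # conforms to ISO 8601
messy_value = "7/1/18"     # common in hand-made csv files, not ISO 8601

# Strict parsing is fast, but only because it refuses anything off-spec.
print(datetime.fromisoformat(iso_value))

# A general-purpose parser copes with the messy value, at a real speed
# cost once it's being called for every cell in a large file.
print(dateutil_parser.parse(messy_value))

try:
    datetime.fromisoformat(messy_value)
except ValueError:
    print("strict parser rejects non-ISO dates like", messy_value)
```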
Is there any chance of reopening this? I've been testing out seeds and have found that some files can take up to 4 minutes to parse (~45 MB), even if the full schema is provided. If not, I would suggest removing this section from the docs: […]
I haven't dived into the code, but I suspect the inference is always done (or at least isn't skipped in my version, 0.12.1) and then overwritten.
hey @mayansalama - I think you're very right about that -- I can update the docs accordingly. To clarify: dbt has two different notions of "column types" for a seed file:
1) the types agate infers while parsing the csv file -- this inference always runs, and there's currently no way to configure it, and
2) the column types of the table dbt creates in your warehouse, which you can override with the column_types seed config (see the sketch below).
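For the second of those, the column_types seed config in dbt_project.yml looks roughly like this (project, seed, and column names below are placeholders; exact nesting and the newer +column_types spelling vary by dbt version):

```yaml
# dbt_project.yml (excerpt) -- names below are placeholders
seeds:
  my_project:
    retail_calendar:
      column_types:
        week_start_date: date
        fiscal_week_number: integer
```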
Generally, loading data into a warehouse is hard work! dbt seed does a good job for tiny datasets with uncomplicated schemas. For 45 MB of data, you'll probably want to use a dedicated tool that exists to solve exactly this type of problem. Once you do that, you'll be able to use dbt to transform that data as you would any other dataset! Hope this helps, and thanks for the docs pointer :)
Makes sense :) I agree that using dbt for a full ingestion pipeline is underkill, but for my purposes (a demo) it was very convenient! For the actual use case I'm thinking of (CI/CD with containerised Postgres to test a pipeline), small data sizes fit the bill fine! Cheers mate
dbt takes about 14 seconds to parse the attached CSV file (only 5k lines long).
This file contains a lot of dates; I'm not sure if that's related.
retail_calendar.txt
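For anyone wanting to reproduce the measurement, a rough profiling sketch along the lines of the earlier comment (it profiles agate's csv load directly rather than the full dbt seed command; the file name mirrors the attachment):

```python
import cProfile
import pstats

import agate

def load_seed():
    # agate doesn't care about the file extension; this is the attached file.
    return agate.Table.from_csv("retail_calendar.txt")

cProfile.run("load_seed()", "seed_profile.stats")
stats = pstats.Stats("seed_profile.stats")
# The top entries should show whether date/datetime casting and the
# TypeTester dominate, as described in the comments above.
stats.sort_stats("cumulative").print_stats(15)
```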