
Slow parsing of seed file #874

Closed

joevandyk opened this issue Jul 23, 2018 · 10 comments

@joevandyk
Contributor

joevandyk commented Jul 23, 2018

dbt takes about 14 seconds to parse the attached CSV file (only 5k lines long).

2018-07-23 16:30:14,064: Parsing /Users/joe/projects/crowdcow_dbt/data/retail_calendar.csv
2018-07-23 16:30:28,241: Parsing seed.crowdcow_dbt.retail_calendar

This file contains a lot of dates, not sure if that's related.

retail_calendar.txt

@drewbanin
Contributor

see also #867

@beckjake
Contributor

Here's what some quick profiler work shows (I only wrapped run_flat because that was obviously the problem area, and I snipped out all the cruft at the bottom):

         21752731 function calls (21363515 primitive calls) in 18.556 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:219(run_flat)
        1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:176(run_from_graph)
        1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/runner.py:170(compile)
        1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/compilation.py:299(compile)
        1    0.000    0.000   17.442   17.442 /Users/jake/src/fishtown/dbt/dbt/loader.py:13(load_all)
        6    0.000    0.000   17.438    2.906 /Users/jake/src/fishtown/dbt/dbt/loader.py:32(load_all)
        2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/loader.py:180(load_project)
        2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:44(load_and_parse)
        1    0.000    0.000   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:17(parse_seed_file)
        1    0.001    0.001   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/clients/agate_helper.py:40(from_csv)
        1    0.002    0.002   16.524   16.524 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/from_csv.py:8(from_csv)
        1    0.054    0.054   16.512   16.512 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/__init__.py:79(__init__)
    92032    1.079    0.000   12.326    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1940(nlp)
        1    0.140    0.140   12.057   12.057 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/type_tester.py:73(run)
   127145    0.078    0.000   11.839    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/base.py:21(test)
   122702    1.021    0.000   10.350    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1779(parse)
    61353    0.209    0.000    8.523    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date.py:47(cast)
    30679    0.199    0.000    7.228    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date_time.py:53(cast)

Looks like most of the time is spent parsing dates and times in the agate TypeTester. One way to make this faster (~3x speedup on this test data) would be to allow users to specify the types of their CSV columns (roughly the idea sketched below), but that sounds potentially quite difficult and error-prone.
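
For illustration, a minimal sketch of that idea using agate directly; the file path and the schema below are hypothetical, and this is not dbt's actual code path:

    import agate

    # Passing explicit column_types skips the TypeTester's per-cell
    # Date/DateTime inference that dominates the profile above.
    # The schema here is made up to match a calendar-like seed.
    column_types = [agate.Date(), agate.Number(), agate.Number(), agate.Text()]

    table = agate.Table.from_csv(
        'data/retail_calendar.csv',
        column_types=column_types,  # no inference when types are supplied
    )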

@joevandyk
Contributor Author

Would something like this help?
https://github.com/closeio/ciso8601/blob/master/README.rst

@joevandyk
Contributor Author

Also, I think there's already a way to specify column types for dbt seeds? It seems familiar, anyway.

@beckjake
Contributor

Switching out CSV parsers would be a bit involved, as dbt makes pretty heavy use of agate. It's possible we could have some other parser do the ingest and then call into agate with the result, but that's quite a lot of work - I think we'd gain more by implementing #867. Is there a use case where faster seed parsing matters enough that it would beat not parsing at all?

Also, that parser is fast in part because it exclusively supports a subset of valid ISO 8601/RFC 3339 dates, which a lot of user data is not... for example, the sample data you provided :) The general case of parsing user-provided CSV files is pretty tough; it's a bit of an under-specified format.
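
A quick sketch of that restriction; the exact behaviour on non-ISO input depends on the ciso8601 version:

    import ciso8601

    # ISO 8601 / RFC 3339 strings parse very quickly:
    ciso8601.parse_datetime('2018-07-23T16:30:14')

    # ...but common spreadsheet-style dates are not supported at all
    # (older ciso8601 releases return None; newer ones raise ValueError):
    ciso8601.parse_datetime('07/23/2018')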

I don't believe there's any way to specify column types for seeds when invoking dbt. Certainly not in the path this goes through, given the stuff I did to test it without the TypeTester.

@drewbanin
Contributor

drewbanin commented Jul 24, 2018

totally agree @beckjake -- I think the move is to implement #867. Even if we could make this faster, we shouldn't be reading the files at all in this example!

@drewbanin
Contributor

I don't think there's much we can/should do on the type-inference front. Going to close this along with #867 now that #1046 is merged. Seeds may still be slow, but the slowness should now be confined to invocations of dbt seed!

@mayansalama

Is there any chance of reopening this? I've been testing out seeds and have found that some files can take up to 4 minutes to parse (~45MB), even when the full schema is provided. If not, I'd suggest removing this section from the docs:

In addition to configuring schemas, the column types of a seed file can also be configured in the dbt_project.yml file. If this configuration is not provided, dbt will try to infer column types from a sample of the seed data.

I haven't dug into the code, but I suspect the inference is always performed and only overwritten afterwards (or the override isn't applied at all in my version, 0.12.1).

@drewbanin
Contributor

hey @mayansalama - I think you're very right about that -- I can update the docs accordingly.

To clarify: dbt has two different notions of "column types" for a seed file

  1. a Python data type (like int, float, str, etc). This is inferred by the dataframe library we use, called agate.
  2. a SQL data type (like varchar, bigint, numeric, etc). This is generated from the type inferred in step 1 and varies by database. This value can be overridden with the column_types config (sketched below).
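
For reference, a sketch of that column_types override in dbt_project.yml; the project, seed, and column names here are made up:

    # dbt_project.yml -- hypothetical project and seed names
    seeds:
      my_project:
        retail_calendar:
          column_types:
            day: date          # SQL types, per point 2 above
            week_number: integer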

Generally, loading data into a warehouse is hard work! dbt seed does a good job for tiny datasets with uncomplicated schemas. For 45MB of data, you'll probably want to use a dedicated tool that exists to solve exactly this type of problem. Once you do that, you'll be able to use dbt to transform that data as you would any other dataset!

Hope this helps, and thanks for the docs pointer :)

@mayansalama

Makes sense :)

I agree that using dbt for a full ingestion pipeline is underkill; for my purposes (a demo), though, it was very convenient! For the actual use case I'm thinking of (CI/CD with containerised Postgres to test a pipeline), small data sizes fit the bill fine!

Cheers mate
