Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better hint sniffing #57

Merged
merged 38 commits into from
May 16, 2020
Merged

Better hint sniffing #57

merged 38 commits into from
May 16, 2020

Conversation

vinceatbluelabs
Copy link
Contributor

@vinceatbluelabs vinceatbluelabs commented May 7, 2020

The Pandas hint inference is actually not doing that much for us. The following things aren't inferred at all by it:

  • 'escape'
  • 'doublequote'
  • 'quotechar',
  • 'encoding'
  • 'header-row'

This is confusing, as our current code nominally pulls that info from a pandas object - but in practice, those things are either pulled from what we originally passed in to read_csv() or set to default values - https://github.com/pandas-dev/pandas/blob/e9b019b653d37146f9095bb0522525b3a8d9e386/pandas/io/parsers.py#L2253

It turns out that Python itself ships with an csv.Sniffer class that can do a lot of this. There are some limitations - it only works with some common record terminators and it only operates on decompressed files, but those are things we can work around.

The net effect is a lot better hint sniffing overall, which shows in the component test results here (sample CSV need fewer initial hints to be able to understand the files).

@vinceatbluelabs vinceatbluelabs requested a review from cwegrzyn May 13, 2020 18:10


logger = logging.getLogger(__name__)

HINT_INFERENCE_SAMPLING_SIZE_BYTES = 1024
Copy link
Contributor Author

@vinceatbluelabs vinceatbluelabs May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you think this magic number feels important enough to add in as a ProcessingInstruction. My gut is that the default value is small enough not to matter performance-wise and big enough that we're not likely to get better information out of the hint sniffers by tweaking it higher.

This could potentially be something we offer config via the mechanism in #47 - though if it does make sense as a parameter, I'm not sure it makes sense as a system-wide parameter...

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review May 13, 2020 18:14
@vinceatbluelabs vinceatbluelabs removed the request for review from cwegrzyn May 13, 2020 18:14
@vinceatbluelabs
Copy link
Contributor Author

(whoops, need to figure out the integration test failure!)

@vinceatbluelabs vinceatbluelabs requested a review from cwegrzyn May 13, 2020 19:58
@vinceatbluelabs vinceatbluelabs merged commit ff786a8 into master May 16, 2020
@vinceatbluelabs vinceatbluelabs deleted the better_hint_sniffing branch May 16, 2020 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants