-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better hint sniffing #57
Conversation
tests/unit/resources/hint_sniffing/delimited-tsv-with-header.json
Outdated
Show resolved
Hide resolved
|
||
|
||
logger = logging.getLogger(__name__) | ||
|
||
HINT_INFERENCE_SAMPLING_SIZE_BYTES = 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you think this magic number feels important enough to add in as a ProcessingInstruction. My gut is that the default value is small enough not to matter performance-wise and big enough that we're not likely to get better information out of the hint sniffers by tweaking it higher.
This could potentially be something we offer config via the mechanism in #47 - though if it does make sense as a parameter, I'm not sure it makes sense as a system-wide parameter...
(whoops, need to figure out the integration test failure!) |
The Pandas hint inference is actually not doing that much for us. The following things aren't inferred at all by it:
This is confusing, as our current code nominally pulls that info from a pandas object - but in practice, those things are either pulled from what we originally passed in to read_csv() or set to default values - https://github.com/pandas-dev/pandas/blob/e9b019b653d37146f9095bb0522525b3a8d9e386/pandas/io/parsers.py#L2253
It turns out that Python itself ships with an csv.Sniffer class that can do a lot of this. There are some limitations - it only works with some common record terminators and it only operates on decompressed files, but those are things we can work around.
The net effect is a lot better hint sniffing overall, which shows in the component test results here (sample CSV need fewer initial hints to be able to understand the files).