-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG Incorrect delimiter in qsv sniff #1719
Comments
I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues. But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs. It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored. FYI, my intent with Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable? |
I'll take a look on it -- We've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So immediate issue is patched around. So, I'm not good with rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, delim is None. This for the "most common" case (csv with "), this regex (somewhat simplified) is hopefully looking for The delimiter count here is only going to ever pick up a valid delimiter if there's a What if we look at a sample of likely delimiters. |
So having a look at what python's csv.sniff is doing, the It also includes a plain |
Thanks for digging into this @EricSoroos ! Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go! Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to just rewrite qsv-sniffer to just do a straight port of python's csv sniffer... |
It doesn't look like it -- it just looks like a frequency based check. Viterbi looks like a general constrain satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data. |
Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code. Would be nice if we can co-maintain it as there's really nothing like python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but its not a general library that can be used separately without bringing in a lot of polars crates. |
Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff |
Perhaps, the best way to "fix" this is to NOT use the qsv-sniffer crate and just use the workhorse csv crate instead. The new approach will just attempt to open the candidate file as a CSV file, and get the metadata that way. |
Hi, @EricSoroos and @jqnatividad. A much more compact and easy-to-understand way is described in a research paper, also having a Python implementation. The approach can be easily implemented over whatever built-in CSV parser. |
Hi @ws-garcia , thanks for sharing your paper and the reference python implementation. I'll have to dig into the paper and your python code to understand it better. |
Describe the bug
QSV sniff from 0.102, 0.112, and 0.125 is incorrectly determining the delimiter for a specific file.
To Reproduce
incorrect-delimiter.csv
Expected behavior
Comma as the delimiter. Columns the same as
qsv stats
orqsv headers
.Desktop (please complete the following information):
The text was updated successfully, but these errors were encountered: