Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG Incorrect delimiter in qsv sniff #1719

Open
EricSoroos opened this issue Apr 3, 2024 · 10 comments
Open

BUG Incorrect delimiter in qsv sniff #1719

EricSoroos opened this issue Apr 3, 2024 · 10 comments
Labels
bug Something isn't working enhancement New feature or request. Once marked with this label, its in the backlog.

Comments

@EricSoroos
Copy link
Contributor

EricSoroos commented Apr 3, 2024

Describe the bug
QSV sniff from 0.102, 0.112, and 0.125 is incorrectly determining the delimiter for a specific file.

To Reproduce
incorrect-delimiter.csv

$ qsv sniff file.csv
Path: ...file.csv
Sniff Timestamp: 2024-04-03T15:41:05.706837+00:00
Last Modified: 2024-04-03T15:18:29+00:00
Delimiter: E
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Detected Mime Type: text/plain
Detected Kind: Other
Retrieved Size (bytes): 4,375
File Size (bytes): 4,375
Sampled Records: 52
Estimated: false
Num Records: 52
Avg Record Len (bytes): 76
Num Fields: 2
Stats Types: false
Fields:
    0:  Text  Quarter,Completions N
    1:  Text  W,Commencements,Completions > Commencements ,%YoY Completions ,%YoY Commencements ,YoY Completions ,YoY Commencements ,%QoQ Completions  ,%QoQ Commencements ,QoQ Completions ,QoQ Commencements 

Expected behavior
Comma as the delimiter. Columns the same as qsv stats or qsv headers.

$ qsv headers incorrect-delimiter.csv 
1   Quarter
2   Completions NEW
3   Commencements
4   Completions > Commencements 
5   %YoY Completions 
6   %YoY Commencements 
7   YoY Completions 
8   YoY Commencements 
9   %QoQ Completions  
10  %QoQ Commencements 
11  QoQ Completions 
12  QoQ Commencements 
$ qsv stats incorrect-delimiter.csv 
field,type,sum,min,max,range,min_length,max_length,mean,stddev,variance,nullcount,sparsity
Quarter,String,,Q1 12,Q4 23,,5,7,,,,0,0
Completions NEW,Integer,64693,106,3799,3693,3,4,1244.0962,956.6243,915130.0484,0,0
Commencements,String,,"1,855",898,,2,5,,,,0,0
Completions > Commencements ,String,,FALSE,TRUE,,4,5,,,,0,0
%YoY Completions ,String,,-0.60%,92.80%,,0,7,,,,4,0.0769
%YoY Commencements ,String,,-11.20%,88.00%,,0,7,,,,4,0.0769
YoY Completions ,Float,12016.1609,-885.9696684,1487.868126,2373.8378,0,12,250.3367,462.0062,213449.7579,4,0.0769
YoY Commencements ,Integer,13255,-2823,3859,6682,0,5,276.1458,999.0011,998003.2912,4,0.0769
%QoQ Completions  ,String,,-0.50%,9.26%,,0,7,,,,1,0.0192
%QoQ Commencements ,String,,-0.10%,91.50%,,0,8,,,,1,0.0192
QoQ Completions ,Float,3661.8907,-982.0637068,819.2936393,1801.3573,0,12,71.8018,334.6705,112004.3264,1,0.0192
QoQ Commencements ,Integer,3936,-2064,4618,6682,0,5,77.1765,923.7335,853283.5963,1,0.0192

Desktop (please complete the following information):

  • OS: Ubuntu and MacOS
  • qsv Version 102, 112 125
@EricSoroos EricSoroos changed the title BUG BUG Incorrect delimiter in qsv sniff Apr 3, 2024
@jqnatividad jqnatividad added the bug Something isn't working label Apr 3, 2024
@jqnatividad
Copy link
Collaborator

sniff uses the qsv-sniffer crate, which in turn, is a fork of the csv-sniffer crate.

I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues.

But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs.

It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored.

FYI, my intent with qsv sniff is to help power a next-gen harvester, that's why I've added all the other extra stuff (mime-type sniffing, range requests sniffing, etc.).

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

@EricSoroos
Copy link
Contributor Author

I'll take a look on it -- We've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So immediate issue is patched around.

So, I'm not good with rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, delim is None. This for the "most common" case (csv with "), this regex (somewhat simplified) is hopefully looking for "[ ]*?,[ ]*". If there are no quotes in the csv, then I'm unclear what it's ever going to match, but apparently it was coming up with E?

The delimiter count here is only going to ever pick up a valid delimiter if there's a quote delim quote pattern, which doesn't look like it's going to happen with this test file, and isn't necessarily going to be consistent unless it's a 'quote all' type of file. I suspect if the initial guess of the delimiter was good, then we'd probably get the Viterbi confirming the choice.

What if we look at a sample of likely delimiters. ,\t;|:, and ran counts per line. even in the ignorance of quotes, there should be one that has a roughly consistent number of occurrences per line, with the min value likely to be approximately the number of columns. It would at least rule out ones that don't appear, or have large variations in the number of occurrences per line. That would probably work better for then figuring out a quote, because quotes should be either [quote]{2}, or [quote][delim] or [delim][quote], or it's not a valid quoting for the file.

@EricSoroos
Copy link
Contributor Author

So having a look at what python's csv.sniff is doing, the _guess_quote_and_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L273 is very similar, but covers all 4 possible quote patterns for a quoted field, not trying to find the possible cases for a delimiter between quotes.

It also includes a plain _guess_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L349 that's essentially what I was suggesting in the last para as a fallback to when guessing the quote and delimiter at the same time doesn't work.

@jqnatividad
Copy link
Collaborator

Thanks for digging into this @EricSoroos !

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to just rewrite qsv-sniffer to just do a straight port of python's csv sniffer...

@EricSoroos
Copy link
Contributor Author

It doesn't look like it -- it just looks like a frequency based check. Viterbi looks like a general constrain satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data.

@jqnatividad
Copy link
Collaborator

jqnatividad commented Apr 5, 2024

Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code.

Would be nice if we can co-maintain it as there's really nothing like python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but its not a general library that can be used separately without bringing in a lot of polars crates.

@jqnatividad
Copy link
Collaborator

Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff

jqnatividad/qsv-sniffer#14

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Apr 5, 2024
@jqnatividad
Copy link
Collaborator

Perhaps, the best way to "fix" this is to NOT use the qsv-sniffer crate and just use the workhorse csv crate instead.

The new approach will just attempt to open the candidate file as a CSV file, and get the metadata that way.

@ws-garcia
Copy link

Hi, @EricSoroos and @jqnatividad. A much more compact and easy-to-understand way is described in a research paper, also having a Python implementation. The approach can be easily implemented over whatever built-in CSV parser.

@jqnatividad
Copy link
Collaborator

Hi @ws-garcia , thanks for sharing your paper and the reference python implementation.

I'll have to dig into the paper and your python code to understand it better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request. Once marked with this label, its in the backlog.
Projects
None yet
Development

No branches or pull requests

3 participants