A wrong CSV mapped as valid? #1050

Closed
aborruso opened this issue Jul 6, 2022 · 7 comments

aborruso commented Jul 6, 2022

Hi,
I have this input CSV, which is malformed: two of its fields have no field name.

But if I run mlr --ifs ";" --csv check input.csv, I get no error.

Is this expected?

Thank you

johnkerl commented Jul 6, 2022

@aborruso as of #225 and #794 we have a field-name deduplicator.

$ cat x
a,b,b,a,a,b
1,2,3,4,5,6

$ mlr --csv check x

$ mlr --c2x cat x
a   1
b   2
b_2 3
a_2 4
a_3 5
b_3 6

In your example the first column named "" is left as-is while the second is deduplicated to "_2":

$ mlr --ifs semicolon --csv check input.csv

$ mlr --ifs semicolon --c2x head -n 1 input.csv
Period   w_44_2021
Zone_101 30.547875563
Zone_102 29.647857121
Zone_103 30.258906489
Zone_104 29.084443251
Zone_105 30.322373863
Zone_106 30.201716127
Zone_107 30.110808549
Zone_108 30.050539392
Zone_109 30.336472375
Zone_110 29.972881168
Zone_111 30.564672832
Zone_112 29.876034876
Zone_113 29.395016018
Zone_114 30.680595852
Zone_115 30.79686536
Zone_116 30.440053095
Zone_117 30.310936498
Zone_201
Zone_202
Zone_203
Zone_204
Zone_205

_2

We could prohibit "" in field names entirely, or have mlr check complain about it, but in my experience with CSVs exported from R/pandas/numpy, it's quite common for the first field name to be "".
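
To make the behavior above concrete, here is a minimal Python sketch of a field-name deduplicator that reproduces the two outputs shown in this comment. It is not Miller's actual implementation: the function name `dedupe_field_names` and the `sep` parameter are hypothetical, and it assumes the rule is simply "keep the first occurrence, suffix later ones with `_2`, `_3`, ...". It does not handle the case where a generated name such as `a_2` collides with a name already present in the header.

```python
from collections import Counter


def dedupe_field_names(names, sep="_"):
    """Keep the first occurrence of each name; suffix later ones with _2, _3, ..."""
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        count = seen[name]
        # The first occurrence is left as-is, including the empty name "".
        result.append(name if count == 1 else f"{name}{sep}{count}")
    return result


print(dedupe_field_names(["a", "b", "b", "a", "a", "b"]))
# ['a', 'b', 'b_2', 'a_2', 'a_3', 'b_3']
print(dedupe_field_names(["Period", "", ""]))
# ['Period', '', '_2']
```

Under this rule the first empty name is not a duplicate of anything, so it stays empty, and only the second one picks up the `_2` suffix.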

aborruso commented Jul 7, 2022

Hi @johnkerl, thank you.

In my opinion, if 3 out of 20 columns have no names, these are not duplicate fields but columns with no name.
Moreover, here, even after removing duplicates you still have a field with no name (the first).

CSV is a format that often contains a lot of ugliness, and I think it's better to prohibit "" in field names entirely.

Moreover, check is currently not RFC 4180 compliant:

There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This
header will contain names corresponding to the fields in the file and should contain the same number of fields as the
records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header"
parameter of this MIME type).

It would be great to be able to use check (perhaps via an option) to run an RFC 4180 compliance check. Yesterday I wanted to use it for exactly that.
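
As an illustration of the one concrete rule in the RFC 4180 passage quoted above (the header "should contain the same number of fields as the records in the rest of the file"), here is a minimal Python sketch using the standard `csv` module. The function name `check_field_counts` is hypothetical and this is not a feature of `mlr check`. It tests only field counts, and it reports blank lines as zero-field records.

```python
import csv
import io


def check_field_counts(text, delimiter=","):
    """Return (line_number, found, expected) for every record whose
    field count differs from the header's field count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to compare
    expected = len(header)
    problems = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num counts the source lines read so far
            problems.append((reader.line_num, len(row), expected))
    return problems


# Hypothetical sample data: the second record is one field short.
sample = "a;b;c\n1;2;3\n4;5\n"
print(check_field_counts(sample, delimiter=";"))
# [(3, 2, 3)]
```

A file whose header has empty names but the right number of fields passes this check, because the quoted sentence constrains only the count.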

johnkerl commented Jul 7, 2022

@aborruso thanks! I would mention though that in the RFC I don't see a requirement that the column names be non-empty (or even unique).

aborruso commented Jul 7, 2022

ok, thank you

@aborruso aborruso closed this as completed Jul 7, 2022
@johnkerl johnkerl reopened this Jul 7, 2022

johnkerl commented Jul 7, 2022

Not to say we can't add a warning for it in mlr check!!! :)
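
For readers wondering what such a warning could look like, here is a minimal Python sketch that flags empty header names without rejecting the file. It is only an assumption about the idea, not what `mlr check` does: the function name `empty_header_positions`, the warning text, and the sample data are all hypothetical.

```python
import csv
import io


def empty_header_positions(text, delimiter=","):
    """Return the 1-based positions of header fields whose name is empty."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    return [i for i, name in enumerate(header, start=1) if name == ""]


# Hypothetical sample shaped like the file in this issue:
# the last two header cells are empty.
sample = "Period;Zone_101;;\nw_44_2021;30.5;;\n"
for pos in empty_header_positions(sample, delimiter=";"):
    print(f"warning: header field {pos} has an empty name")
# warning: header field 3 has an empty name
# warning: header field 4 has an empty name
```

Reporting a warning rather than an error keeps CSVs with an unnamed first column (common in R/pandas/numpy exports, as noted earlier in the thread) usable.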

johnkerl commented

@aborruso what do you think of #1330?

aborruso commented

> @aborruso what do you think of #1330?

It's good! Thank you very much.
