default splitter in dlmread #1852

Closed
JeffBezanson opened this issue Dec 30, 2012 · 10 comments
Labels
needs decision A decision on this change is needed

Comments

@JeffBezanson
Member

Somebody brought this up on the list. dlmwrite uses ',' by default, but dlmread calls split(line) by default, which splits on whitespace and also discards empty fields. If a delimiter is specified, it uses split(line, dlm, true) instead. I don't remember why we did this, but perhaps the default should just be split(line, ',', true). The handling of empty fields can be changed, but if so it should change for both cases.
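To make the mismatch concrete, here is a minimal sketch in current Julia syntax. It assumes the `keepempty` keyword of today's `split`, which plays the role of the positional `true` in `split(line, dlm, true)` above; the sample line is made up for illustration.

```julia
# Hypothetical sample line with an empty middle field.
line = "1,,3"

# No delimiter: split on whitespace and drop empty fields.
# The line contains no whitespace, so it comes back as a single field.
ws_fields = split(line)

# Explicit delimiter, empty fields kept.
csv_fields = split(line, ','; keepempty=true)

# Explicit delimiter, empty fields discarded.
csv_dense = split(line, ','; keepempty=false)

println(ws_fields)
println(csv_fields)
println(csv_dense)
```

A file written with ',' and read back with the whitespace default would therefore end up with one field per line, which is the inconsistency reported here.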

@HarlanH
Contributor

HarlanH commented Jan 5, 2013

I definitely think that dlmread should default to comma separation to match dlmwrite, and I tend to think that empty fields should not be discarded in either case.

@StefanKarpinski
Member

I think reading space-separated input and writing output using tabs would be sensible. If you want to read csv, you should use a real csv reader.

@HarlanH
Contributor

HarlanH commented Jan 5, 2013

OK. Worth noting here that DataFrames.jl has a souped-up CSV reader that now supports escaped quotes, newlines in quoted fields, UTF-8, etc. It might be worth replacing the existing csvread (currently just dlmread with different defaults) with the non-DF-specific parts of that code at some point. read_separated_text() returns an Array of strings.

https://github.com/HarlanH/DataFrames.jl/blob/master/src/io.jl
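As a rough illustration of why plain splitting is not enough for quoted fields, the sketch below compares `split` with a deliberately tiny quote-aware splitter. The helper `split_quoted` and the sample record are hypothetical and are not taken from the DataFrames.jl code; the helper does not handle escaped quotes or newlines inside fields.

```julia
# Hypothetical CSV record: the second field contains a comma inside quotes.
record = "1,\"Smith, John\",3"

# Naive splitting ignores the quotes and cuts the quoted field in two.
naive = split(record, ',')

# Minimal quote-aware splitter, for illustration only.
function split_quoted(s::AbstractString, dlm::Char=',')
    fields = String[]
    buf = IOBuffer()
    inquote = false
    for c in s
        if c == '"'
            inquote = !inquote                  # toggle quoted state, drop the quote
        elseif c == dlm && !inquote
            push!(fields, String(take!(buf)))   # delimiter outside quotes ends a field
        else
            print(buf, c)
        end
    end
    push!(fields, String(take!(buf)))           # last field
    return fields
end

quoted = split_quoted(record)

println(length(naive))
println(quoted)
```

The naive version yields four pieces for a three-field record, while the quote-aware version keeps "Smith, John" together.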

@johnmyleswhite
Member

+1 for @HarlanH's idea. If our read_separated_text() function were used to replace csvread, it would:

  • Give Julia a much more robust default reader for delimited data
  • Increase the number of eyeballs looking at that code, which would help the performance of DataFrames

@JeffBezanson
Member Author

Is there a clear difference between TSV and CSV, e.g. that TSV is just tab-separated numbers, but CSV might include various other formatting? We might want to have a separate function for simple delimited data if "real csv" is complex.
In any case, using your parsing code sounds good to me. Feel free to replace csvread with it or incorporate it there.

@HarlanH
Contributor

HarlanH commented Jan 5, 2013

My understanding is that it's all a holy mess with no standards and no consistency. It's not as simple as TSV is simple and CSV is complex. But we could pretend it is, with dlmread/write being just whitespace-separation and no quotes, and csvread (aka read_separated_text) being arbitrarily complex. Will start an issue to discuss some details...

@johnmyleswhite
Member

I think delimited files are like HTML: there are standards, but lots of buggy examples that systems like Excel have evolved to accept. I think that TSV can be used with all of the complexity of CSV, but that this is required less often because of the relatively low frequency with which a field contains a tab that would have to be escaped.

Let's open another issue. I'll prepare a pull request for read_separated_text() before the end of the day.

@HarlanH
Contributor

HarlanH commented Jan 5, 2013

#1893.

Sounds like the resolution to the original issue is as Stefan suggests: dlmread splits on whitespace, and dlmwrite defaults to tabs.
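A small round-trip sketch of that resolution follows. It is written against the `DelimitedFiles` standard library of current Julia, assuming its `writedlm` and `readdlm` are the present-day counterparts of dlmwrite and dlmread; the matrix is made-up sample data.

```julia
using DelimitedFiles

# Made-up sample data.
A = [1.0 2.0; 3.0 4.0]

# Write with the default delimiter (a tab) into an in-memory buffer.
buf = IOBuffer()
writedlm(buf, A)
text = String(take!(buf))

# Read back with no delimiter given, so fields are split on whitespace.
B = readdlm(IOBuffer(text))

println(repr(text))
println(B == A)
```

Because a tab counts as whitespace, output written with the tab default reads back without naming a delimiter on either side.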

@johnmyleswhite
Member

For future reference: the IANA standard for TSV disallows tabs within fields. See http://www.iana.org/assignments/media-types/text/tab-separated-values

@StefanKarpinski
Member

Presumably newlines are also not allowed within fields, although it doesn't say so.
