Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of space-delimited data #30

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

Shimuuar
Copy link
Contributor

@Shimuuar Shimuuar commented Jan 2, 2013

Encoding and decoding for all APIs together with tests and documentations. Previously dicussed at #14

Benchmarks don't indicate any performance regression for CSV.

@tibbe
Copy link
Collaborator

tibbe commented Jan 2, 2013

I had a very brief look at the patch on my phone. Do we really need new
top-level functions? Can't we just extend DecodeOptions?

@Shimuuar
Copy link
Contributor Author

Shimuuar commented Jan 3, 2013

I didn't give it much thought. But it look like a good idea. The only
question is how to extend DecodeOptions. `decDelimiter' only make sense
for CSV data.

data DecodeOptions = SpaceDelim | CSV Word8

Other option is to use record and ignore custom delimiter

data DecodeOptions = DO { decDelim :: Word8, spaceDelim :: Bool }

@Shimuuar
Copy link
Contributor Author

Shimuuar commented Jan 3, 2013

Maybe flag for skipping header should go to the DecodeOptions too

@tibbe
Copy link
Collaborator

tibbe commented Jan 4, 2013

Could you describe the grammar and escaping rules for space delimited data? Can there be multiple spaces between columns? Can there be other space characters than ASCII spaces?

I think we should be able to support this by adding fields to DecodeOptions for recognizing the separators, escaping the right characters, etc. Python does it this way. I don't think we need new top-level functions, except a convenient spaceDelimDecodeOptions :: DecodeOptions constant.

@Shimuuar
Copy link
Contributor Author

Shimuuar commented Jan 4, 2013

AFAIK there is no standard for space-delimited data (or however it called). And if such standard does exist no one cares and just do whatever he like. So it's important to be permissive. Fields are separated by one or more space or tab or any mixture of them. Also leading and trailing spaces should be dropped. Following data have both leading spaces, multiple spaces as separator and trailing spaces

   1 a   1.22
 123 bcd 3.3
  88 c   0.4   ← invisible trailng space
1078 d   0.4

Escaping rules are complicated matter. I work mostly with numbers so I don't know waht escaping schemes are used. I assumed CSV escaping.

Here is grammar written as haskell-like pseudocode. Hope it's undestandable

row   = many ws *> field `sepBy` many1 ws <* many ws
field = csvEscapedField <|> many1 notWS
ws    = ' ' <|> '\t'

@tibbe
Copy link
Collaborator

tibbe commented Jan 5, 2013

Would it be enough to change decDelimiter to a Parser ByteString? If we did that, you could parse space-delimited data using the current code by setting decDelimiter = many (' ' <|> '\t'). We just need to make sure this doesn't kill performance.

We would also need to add a decStripWhitespace option.

Check out the Python (http://docs.python.org/3/library/csv.html) and Go (http://golang.org/pkg/encoding/csv/) CSV modules for ideas. They already support customizations like these.

@Shimuuar
Copy link
Contributor Author

Shimuuar commented Jan 7, 2013

Maybe. There could be some subtle moments. I need to think about it

@Shimuuar
Copy link
Contributor Author

Here is grammar for both CSV and space-delimited data.

CSV grammar

   file        = [header CRLF] record *(CRLF record) [CRLF]
   header      = name  *(COMMA name)
   record      = field *(COMMA field)
   name        = field
   field       = escaped | non-escaped
   escaped     = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA

   COMMA    = ',' or 1-byte delimiter
   DQUOTE   = "
   CRLF     = CR LF | LF
   TEXTDATA = ^[ DQUOTE COMMA CR LF ]
   LF       = %x0A
   CR       = %x0D

Space delimited grammar

   file        = [header CRLF] record *(CRLF record) [CRLF]
   header      = *WS name  *(+WS name)  *WS
   record      = *WS field *(+WS field) *WS
   name        = field
   field       = escaped | non-escaped
   escaped     = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
   non-escaped = +TEXTDATA

   DQUOTE   = "
   CRLF     = CR LF | LF
   TEXTDATA = ^[ DQUOTE WS CR LF ]
   LF       = %x0A
   CR       = %x0D
   WS       = %x20 | %x09

There are three differences: different separators, diffirent field parsers and space-delimited parser drops leading/trailing spaces. Changes in the field parser are nessesary. First it need to stop on both space and tab. And unescaped field must be at least one character. It's important otherwise we have ambigoius grammar. For example line "a " could be parsed as:

record 0WS (non-escaped "a") [] 1WS
record 0WS (non-escaped "a") [1WS (non-escaped "")] 0WS

I hope it's clear.

I've looked at both python and go libraries. It look like none could be used to parse data using grammar above. It look like only way to push grammar selection into option is to either enumerate grammars or by setting field and header parsers.

@tibbe
Copy link
Collaborator

tibbe commented Jan 15, 2013

I thought about this a bit today. I didn't get much further than breaking down the changes into a diff:

-header      = name  *(COMMA name)
-record      = field *(COMMA field)
+header      = *WS name  *(+WS name)  *WS
+record      = *WS field *(+WS field) *WS
-non-escaped = *TEXTDATA
+non-escaped = +TEXTDATA
-TEXTDATA = ^[ DQUOTE COMMA CR LF ]
+TEXTDATA = ^[ DQUOTE WS CR LF ]

I made some initial attempts in adding more field to DecodeOptions, but it didn't work out.

@Shimuuar
Copy link
Contributor Author

I think that grammars are different enough that it's difficult to unify them
using options. Best I can think up:

data DecOpt = CSV Word8 | SpaceDelim

But then we lose record update syntax. It is possible to put everything to
record:

data DecOpt = DecOpt
  { isSpaceDelim :: Bool
  , csvSeparator :: Word8
  }

In this case fields have different meaning depending on values of other field.
csvSeparator doesn't mean anything if isSpaceDelim is True.

There is also type class approach. It could be extended to handle any CSV
like format. But is there any other?

class DecOpt a where
  -- Return parsers for header and ordinary record
  toDecOpt :: a -> (Parser,Parser)

data CsvOpt = CsvOpt Word8
instance DecOpt where
  toDecOpt (Csv d) = (header d, record d)


data SpaceDelim = SpaceDelim
instance DecOpt SpaceDelim where
  toDecOpt _ = (headerTable, recordTable)

@tibbe
Copy link
Collaborator

tibbe commented Jan 16, 2013

Let me start with the constraints I'm working with:

  • The more general we make the library (e.g. provide more top-level combinators) the more difficult it becomes for users to grasp). For example, if we double the number of top-level decode and encode functions, there's more stuff there for the users to understand.
  • Certain kinds of additions lead to a doubling of the number of functions in the API. For example, supporting files with and without headers (decode and decodeByName) led to such a doubling. I decided to try to only add new top-level functions if the return value differs (as in the case of decode and decodeByName) and handle any format differences using the options records.
  • At some point we'll just be writing another generic parser library and we'll end up in the Turing tarpit, where everything can be done using cassava, but nothing can be done well and/or easily. I'm trying to to constrain the problem domain we're working in to prevent that from happening. We don't want to end up with decodeOptions :: CsvOpts ... | SpaceDelimOpts ... | XmlOpts .... ;)

At the extreme, the user could just provide a Parser (Vector (Vector ByteString)) and we could offer a decodeUsing :: FromRecord a => Parser (Vector (Vector ByteString)) -> ByteString -> Vector a. That would be the most general (although not the most efficient, as we always need to construct the vector of vector of byte strings in memory).

I'd like to avoid this if possible, especially if it's not needed. I went and looked at the output formats of Matlab, Octave, and Excel and they all use a single separator (space or tab), with the exception of the Excel Formatted Text output, which uses several spaces.

I went ahead and checked the current field parser and it actually uses this grammar:

TEXTDATA = ^[ DQUOTE <escape-char> CR LF ]

so it almost does what your grammer specifies. If we made decDelimiter a predicate function it should do exactly what your grammar does (i.e. disallow any kind of whitespace).

At first I thought we could change the parsing code for separators from word8 (decDelimiter opts) to many (word8 (decDelimiter opts)), but that doesn't work as this CSV data would parse correctly:

a,b,,c

Hence we'd have to change decDelimiter to be of type Parser () as I mentioned before.

@Shimuuar
Copy link
Contributor Author

Well I dn't like type-classes idea either. There isn't enough CSV-like formats to justify such generality.

I don't think that switching decDelimiter to Parser () is good idea. Parser for unescaped fields require predicate on delimiter character. Since we need predicate anyway we can just add flag to choose between one character and one or more characters as delimiter. In the same way it's possible tyo add flag for dropping initial/trailing whitespaces.

data DecOpt = DecOpt
  { decDelim         :: Word8 → Bool
  , oneOrMoreDelim   :: Bool
  , stripDelimOnEdge :: Bool
  } 

Then default options are:

csvOpts = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> == ' ' || == '\t') True True

@Shimuuar
Copy link
Contributor Author

I almost forgot about this

So does following design for decode/encode options seems reasonable to you? It does manage to unify CSV and space delimited data .

data DecOpt = DecOpt
  { decDelim         :: Word8 → Bool 
    -- Predicate for the separator character
  , oneOrMoreDelim   :: Bool
    -- Whether consecutive separator characters should be treated as single separator
  , stripDelimOnEdge :: Bool
    -- Thether separators on start/end of line should be dropped
  }

data EncOpt = EncOpt
  { encDelim :: Word8
  , extraCharsToEscape :: Word8 -> Bool
    -- Any other characters which sould be escaped.
  }

@tibbe
Copy link
Collaborator

tibbe commented Feb 22, 2013

Have you tried to implement it to make sure it actually works? :)

@Shimuuar
Copy link
Contributor Author

Not yet. Being naturally lazy I like to avoid obviosly wrong ideas as early as possible. But it's simple enumeration of differences in grammar so there shouldn't be any problems. Most of the pieces are already implemented and need only reshuffling.

@tibbe
Copy link
Collaborator

tibbe commented Feb 22, 2013

I'll look into it.

@tibbe
Copy link
Collaborator

tibbe commented Feb 23, 2013

I'm working on this on the https://github.com/tibbe/cassava/tree/space-delim branch.

@Shimuuar
Copy link
Contributor Author

I've implemented correct trimming of spaces. It turned out to be tricky and I had to rewrite record parser.

Correctly leading/trailing spaces

It's hard to strip whitespaces correctly because
 a) It's valid part of field for CSV so
    "a,b,c " -> ["a","b","c "]
 b) If we're using spaces as delimiter we get spurious empty field
    at the end fo the line
    "a b c " -> ["a","b","c",""]

Only reliable way to strip them is to read whole line, strip spaces
and parse stripped line.

I'm working on space2 branch in my repo.

@Shimuuar
Copy link
Contributor Author

Implementation is mostly complete. There are no tests for encoding yet and few performance regressions too.

@tibbe
Copy link
Collaborator

tibbe commented Mar 7, 2013

Only reliable way to strip them is to read whole line, strip spaces and parse stripped line.

I tried as well and came to the same conclusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants