Module ecsv

Erlang NIF CSV parser and writer.

Data Types


callback_fun(CallBackStateType) = fun((Message::callback_message(), CallBackState0::CallBackStateType) -> CallBackState::CallBackStateType)

Callback function used for processing parsed data in parse_stream/4 and parse_stream/5.

Note callback_message() contains Rows in reverse order.
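For illustration, a minimal callback that just counts rows could look like this (a sketch, not part of the module); it must handle both message tags:

```erlang
%% Hypothetical callback for parse_stream/4,5: counts parsed rows.
%% It must accept both {rows, RevRows} and the final {eof, RevRows};
%% row order does not matter for counting, so the reversal is harmless.
Count = fun({rows, Rs}, N) -> N + length(Rs);
           ({eof, Rs}, N) -> N + length(Rs)
        end.
```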


callback_message() = {eof | rows, RevRows::rows()}

callback_state() = any()

input() = eof | binary()

line() = [atom() | number() | iolist()]

option() = strict | null | all_lines | strict_finish | {delimiter, byte()} | {quote, byte()}

options() = [option()]

See parser_init/1 for details about option().


reader_fun(ReaderStateType) = fun((ReaderState0::ReaderStateType) -> {Input::input(), ReaderState::ReaderStateType})

Reader function which feeds data to parse_stream/4 and parse_stream/5.

Note function has to return eof as the last Input value or parse_stream/4 and parse_stream/5 will never finish otherwise.
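For illustration, a reader that feeds binaries from a list could be sketched as follows (hypothetical, not part of the module); it returns eof once the list is exhausted, as required:

```erlang
%% Hypothetical reader: the reader state is a list of binaries.
%% Once the list is empty, eof is returned so parse_stream/4,5 finish.
ListReader = fun([]) -> {eof, []};
                ([B | Bs]) -> {B, Bs}
             end.
```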


reader_state() = any()

row() = tuple()

rows() = [row()]

state() = any()

Internal parser state, which is an NIF resource type. Note its value is mutable, so strictly speaking it doesn't have to be passed from call to call. Nevertheless, the whole API is designed and used internally as if the state were immutable, which allows writing a pure Erlang implementation with exactly the same API. So please do not abuse this feature, because it could become incompatible in the future. On the other hand, the current implementation doesn't allow restarting parsing from the last correct state after an error.

After parsing has finished, all functions return the state in a condition which allows starting another parse with the same settings. That means parse_step/2 and parse_raw/3 after Input = eof, and parse_stream/4 and parse_stream/5 always.
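Assuming parse_step/2 returns {ok, Rows, State} in the same shape as parse_raw/3, the state reuse described above can be sketched as:

```erlang
%% Sketch: after Input = eof the returned state can start a fresh
%% parse with the same settings (assumes the {ok, Rows, State} shape).
S0 = ecsv:parser_init([]),
{ok, _Rows1, S1} = ecsv:parse_step(<<"a,b">>, S0),
{ok, _Rows2, S2} = ecsv:parse_step(eof, S1),
%% S2 can now be used for another document with the same options:
{ok, _Rows3, _S3} = ecsv:parse_step(<<"c,d\n">>, S2).
```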

Function Index

accumulator/0Return simple accumulator callback function.
block_chopper/1Return simple binary reader function.
default_block_size/0Default block size used by parser API functions.
file_reader/0Return file reader function.
parse/1Equivalent to parse(Bin, []).
parse/2Parse CSV data in binary and return rows in order.
parse_raw/3Parse Input and accumulate result.
parse_step/2Parse Input and return rows in order.
parse_stream/4Equivalent to parse_stream(Reader, ReaderState0, CallBack, CallbackState0, []).
parse_stream/5Parse stream produced by Reader and process by CallBack.
parser_init/1Initialise parser state.
write/1Equivalent to write_lines([Line]).
write_lines/1Experimental CSV writer.

Function Details

accumulator/0


accumulator() -> CallBack

Return simple accumulator callback function.

The callback function (see callback_fun()) reverses the rows in reaction to the {eof, _} callback message, so the returned final state is in order when used with parse_stream/4 and parse_stream/5.

Returned callback is equivalent to

  Accumulator = fun({eof, Rs}, Acc) -> lists:reverse(Rs ++ Acc);
                   ({rows, Rs}, Acc) -> Rs ++ Acc
                end

If efficiency is a concern (even though only new rows are appended to the accumulator), consider using parse_raw/3 directly.

See also: parse_raw/3, parse_stream/4, parse_stream/5.

block_chopper/1


block_chopper(BlockSize) -> Reader
  • BlockSize = pos_integer()
  • Reader = reader_fun(State)
  • State = binary()

Return simple binary reader function.

The function comes in handy when you already have the whole CSV data but would like to use parse_stream/4 or parse_stream/5 with a custom callback function working on chunks of the size defined by BlockSize.

See also: parse_stream/4, parse_stream/5.

default_block_size/0


default_block_size() -> 20480

Default block size used by parser API functions.

See also: file_reader/0, parse_raw/3.

file_reader/0


file_reader() -> Reader

Return file reader function.

Returned reader function reads file:io_device() using file:read/2 calls with default block size.

See also: default_block_size/0, parse_stream/4, parse_stream/5.

parse/1


parse(Bin) -> Rows

Equivalent to parse(Bin, []).

parse/2


parse(Bin, Opts) -> Rows

Parse CSV data in binary and return rows in order.
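An illustrative call (a sketch of the expected shape, with rows returned as tuples of binaries):

```erlang
1> ecsv:parse(<<"a,b\nc,d\n">>).
[{<<"a">>,<<"b">>},{<<"c">>,<<"d">>}]
```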

parse_raw/3


parse_raw(Input, State0, Acc) -> Result

Parse Input and accumulate the result.

This is a low-level parsing function which allows writing your own iterative parsing functions like parse_stream/5. Note it returns newly parsed rows in reverse order with the Acc content appended. All other parser functions use this function internally.

  1> {ok, R1, S1} = ecsv:parse_raw(<<"foo\nbar">>, ecsv:parser_init([]), []).
  {ok,[{<<"foo">>}],<<>>}
  2> {ok, R2, S2} = ecsv:parse_raw(<<"\nbaz\nquux">>, S1, R1).
  {ok,[{<<"baz">>},{<<"bar">>},{<<"foo">>}],<<>>}
  3> ecsv:parse_raw(eof, S2, R2).
  {ok,[{<<"quux">>},{<<"baz">>},{<<"bar">>},{<<"foo">>}],<<>>}

The function chops the Input binary into blocks of default_block_size/0; parsing one block should take 10-15% of a scheduler timeslice on a decent 2.6 GHz CPU, which keeps the VM responsive. You should not call the underlying NIF function directly.

parse_step/2


parse_step(Input, State0) -> Result

Parse Input and return rows in order.

This function allows writing a simple parsing loop over chunked data. It requires an initialised parser state and returns rows in order. The call with eof is necessary if the line terminator after the last row is missing. Use parse_raw/3 if the order of rows is not important to you, or if you want to accumulate rows and call lists:reverse/1 afterwards.

See also: parser_init/1, parse_raw/3.
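Such a loop can be sketched as follows (assumes parse_step/2 returns {ok, Rows, State} in the same shape as parse_raw/3; parse_chunks/1 is a hypothetical helper):

```erlang
%% Hypothetical helper: parse a list of binary chunks, keeping rows
%% in order. The trailing eof flushes a possibly unterminated last row.
parse_chunks(Chunks) ->
    parse_chunks(Chunks ++ [eof], ecsv:parser_init([]), []).

parse_chunks([], _State, Acc) ->
    Acc;
parse_chunks([Chunk | Rest], State0, Acc) ->
    {ok, Rows, State} = ecsv:parse_step(Chunk, State0),
    parse_chunks(Rest, State, Acc ++ Rows).
```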

parse_stream/4


parse_stream(Reader, ReaderState0, CallBack, CallbackState0) -> Result
  • Reader = reader_fun(ReaderStateType)
  • ReaderState0 = ReaderStateType
  • CallBack = callback_fun(CallBackStateType)
  • CallbackState0 = CallBackStateType
  • Result = {ReaderState, CallbackState, State}
  • ReaderState = ReaderStateType
  • CallbackState = CallBackStateType
  • State = state()

Equivalent to parse_stream(Reader, ReaderState0, CallBack, CallbackState0, []).

parse_stream/5


parse_stream(Reader, ReaderState0, CallBack, CallbackState0, StateOrOpts) -> Result
  • Reader = reader_fun(ReaderStateType)
  • ReaderState0 = ReaderStateType
  • CallBack = callback_fun(CallBackStateType)
  • CallbackState0 = CallBackStateType
  • StateOrOpts = state() | options()
  • Result = {ReaderState, CallbackState, State}
  • ReaderState = ReaderStateType
  • CallbackState = CallBackStateType
  • State = state()

Parse stream produced by Reader and process by CallBack.

The function parses Input from Reader (see reader_fun()) and feeds the result into CallBack (see callback_fun()).

The code

  {ok, Bin} = file:read_file("test/FL_insurance_sample.csv"),
  Rows = ecsv:parse(Bin).

produces the same result as

  {ok, FH} = file:open("test/FL_insurance_sample.csv", [read, raw, binary]),
  try ecsv:parse_stream(ecsv:file_reader(), FH, ecsv:accumulator(), []) of
      {_, Rows, _} -> Rows
  after file:close(FH)
  end.

or

  {ok, Bin} = file:read_file("test/FL_insurance_sample.csv"),
  BC = ecsv:block_chopper(ecsv:default_block_size()),
  {_, Rows, _} = ecsv:parse_stream(BC, Bin, ecsv:accumulator(), []).

But using parse_stream/4 or parse_stream/5 allows stream processing. For example

  Counter = fun({_, Rs}, {Fs, Ls}) ->
                {Fs + lists:sum([tuple_size(X) || X <- Rs]),
                 Ls + length(Rs)}
            end,
  {ok, FH2} = file:open("test/FL_insurance_sample.csv", [read, raw, binary]),
  try ecsv:parse_stream(ecsv:file_reader(), FH2, Counter, {0, 0}) of
      {_, {NumberOfFields, NumberOfRows} = Result, _} -> Result
  after file:close(FH2)
  end.

will be far more efficient than reading all rows into memory for big data files.

parser_init/1


parser_init(Opts) -> State

Initialise parser state.

Return a State for parsing CSV using the given Opts (see options()). See state() for more details about State behaviour.

strict
Force strict quoting rules.
null
An unquoted empty field is returned as the atom null. Compare
  1> ecsv:parse(<<"a,,b,\"\",c">>, []).
  [{<<"a">>,<<>>,<<"b">>,<<>>,<<"c">>}]
  2> ecsv:parse(<<"a,,b,\"\",c">>, [null]).
  [{<<"a">>,null,<<"b">>,<<>>,<<"c">>}]
all_lines
Return all rows, even empty ones. Compare
  1> ecsv:parse(<<"a\n\nb\n\n">>,[]).
  [{<<"a">>},{<<"b">>}]
  2> ecsv:parse(<<"a\n\nb\n\n">>,[all_lines]).
  [{<<"a">>},{},{<<"b">>},{}]
strict_finish
Force strict quoting rules only for last field of last row with missing line terminator.
{delimiter, D}
Define an alternative delimiter character. See {quote, Q} for an example.
{quote, Q}
Define an alternative quotation character. Compare
  1> ecsv:parse(<<"'a,\",';'b,\"\",''c';\"">>, [strict]).
  [{<<"'a">>,<<",';'b,\",''c';">>}]
  2> ecsv:parse(<<"'a,\",';'b,\"\",''c';\"">>, [strict, {delimiter, $;}, {quote, $'}]).
  [{<<"a,\",">>,<<"b,\"\",'c">>,<<"\"">>}]

write/1


write(Line) -> Result
  • Line = line()
  • Result = iolist()

Equivalent to write_lines([Line]).

write_lines/1


write_lines(Lines) -> Result
  • Lines = [line()]
  • Result = iolist()

Experimental CSV writer.

The function writes binaries, iolists, atoms, integers and floats. Fields are quoted, and quotes escaped, as needed.
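An illustrative use (a sketch; the exact quoting of the output follows the writer's rules):

```erlang
%% The embedded comma forces quoting of the third field; the result
%% is an iolist suitable for file:write_file/2 or a socket.
IO = ecsv:write_lines([[foo, 42, <<"a,b">>],
                       [3.14, <<"plain">>]]),
ok = file:write_file("out.csv", IO).
```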