Erlang NIF CSV parser and writer.
callback_fun(CallBackStateType) = fun((Message::callback_message(), CallBackState0::CallBackStateType) -> CallBackState::CallBackStateType)
Callback function used for processing parsed data in parse_stream/4 and parse_stream/5.
Note callback_message() contains Rows in reverse order.
callback_message() = {eof | rows, RevRows::rows()}
callback_state() = any()
input() = eof | binary()
line() = [atom() | number() | iolist()]
option() = strict | null | all_lines | strict_finish | {delimiter, byte()} | {quote, byte()}
options() = [option()]
See parser_init/1 for details about option().
reader_fun(ReaderStateType) = fun((ReaderState0::ReaderStateType) -> {Input::input(), ReaderState::ReaderStateType})
Reader function which feeds data to parse_stream/4 and parse_stream/5.
Note that the function has to return eof as the last Input value; otherwise parse_stream/4 and parse_stream/5 will never finish.
reader_state() = any()
row() = tuple()
rows() = [row()]
state() = any()
Internal parser state, which is a NIF resource type. Note its value is not immutable, so it does not strictly have to be passed from call to call. However, the whole API is designed and used internally as if the state were immutable, which allows writing a pure Erlang implementation with exactly the same API. So please do not abuse this feature, because it could become incompatible in the future. On the other hand, the current implementation does not allow restarting parsing from the last correct state after an error.
After parsing has finished, all functions return a state which allows starting another parsing run with the same settings. This applies to parse_step/2 and parse_raw/3 after Input = eof, and to parse_stream/4 and parse_stream/5 always.
accumulator/0 | Return simple accumulator callback function.
block_chopper/1 | Return simple binary reader function.
default_block_size/0 | Default block size used by parser API functions.
file_reader/0 | Return file reader function.
parse/1 | Equivalent to parse(Bin, []).
parse/2 | Parse CSV data in binary and return rows in order.
parse_raw/3 | Parse Input and accumulate result.
parse_step/2 | Parse Input and return rows in order.
parse_stream/4 | Equivalent to parse_stream(Reader, ReaderState0, CallBack, CallbackState0, []).
parse_stream/5 | Parse stream produced by Reader and processed by CallBack.
parser_init/1 | Initialise parser state.
write/1 | Equivalent to write_lines([Line]).
write_lines/1 | Experimental CSV writer.
accumulator() -> CallBack
CallBack = fun((Input, RevRows) -> Rows)
Input = callback_message()
RevRows = rows()
Rows = rows()
Return simple accumulator callback function.
The callback function (see callback_fun()) reverses rows as a reaction to the {eof, _} callback message, so the returned final state is in order when used with parse_stream/4 and parse_stream/5.
Returned callback is equivalent to
Accumulator = fun({eof, Rs}, Acc) -> lists:reverse(Rs ++ Acc);
({rows, Rs}, Acc) -> Rs ++ Acc
end
If efficiency is a concern (even though only new rows are appended to the accumulator), consider using parse_raw/3 directly.
See also: parse_raw/3, parse_stream/4, parse_stream/5.
block_chopper(BlockSize) -> Reader
BlockSize = pos_integer()
Reader = reader_fun(State)
State = binary()
Return simple binary reader function.
The function comes in handy when you already have the whole CSV data in a binary but would like to use parse_stream/4 or parse_stream/5 with a custom callback function working on chunks of the size defined by BlockSize.
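A reader built by block_chopper/1 behaves roughly like the following sketch (illustrative only; the actual ecsv implementation may differ):

```erlang
%% Illustrative sketch of block_chopper/1; not the actual implementation.
%% The reader state is the remaining binary.
block_chopper(BlockSize) when is_integer(BlockSize), BlockSize > 0 ->
    fun(<<Block:BlockSize/binary, Rest/binary>>) ->
            {Block, Rest};      % a full block is available
       (<<>>) ->
            {eof, <<>>};        % input exhausted, signal end of stream
       (Bin) when is_binary(Bin) ->
            {Bin, <<>>}         % the last, short block
    end.
```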
See also: parse_stream/4, parse_stream/5.
default_block_size() -> 20480
Default block size used by the parser API functions.
See also: file_reader/0, parse_raw/3.
file_reader() -> Reader
Reader = reader_fun(FH)
FH = file:io_device()
Return file reader function.
The returned reader function reads from a file:io_device() using file:read/2 calls with the default block size.
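Such a reader can be sketched as below (illustrative only; the actual ecsv implementation may differ, e.g. in how read errors are handled):

```erlang
%% Illustrative sketch of file_reader/0; not the actual implementation.
%% The reader state is the file handle itself.
file_reader() ->
    fun(FH) ->
            case file:read(FH, ecsv:default_block_size()) of
                {ok, Data} -> {Data, FH};  % hand the next chunk to the parser
                eof        -> {eof, FH}    % signal end of stream
            end
    end.
```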
See also: default_block_size/0, parse_stream/4, parse_stream/5.
parse(Bin) -> Rows
Bin = binary()
Rows = rows()
Equivalent to parse(Bin, []).
parse(Bin, Opts) -> Rows
Bin = binary()
Opts = options()
Rows = rows()
Parse CSV data in binary and return rows in order.
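For example, given the semantics shown elsewhere in this document (each row is a tuple of binary fields, rows returned in input order):

```erlang
1> ecsv:parse(<<"a,b\nc,d\n">>).
[{<<"a">>,<<"b">>},{<<"c">>,<<"d">>}]
```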
parse_raw(Input, State0, Acc) -> Result
Input = input()
State0 = state()
Acc = rows()
Result = {ok, Acc, State} | {error, Acc, Reason}
Acc = rows()
State = state()
Reason = any()
Parse Input and accumulate the result.
This is a low-level parsing function which allows writing your own iterative parsing functions like parse_stream/5. Note that it returns newly parsed rows in reverse order, with the Acc content appended. All other parser functions use this function internally.
1> {ok, R1, S1} = ecsv:parse_raw(<<"foo\nbar">>, ecsv:parser_init([]), []).
{ok,[{<<"foo">>}],<<>>}
2> {ok, R2, S2} = ecsv:parse_raw(<<"\nbaz\nquux">>, S1, R1).
{ok,[{<<"baz">>},{<<"bar">>},{<<"foo">>}],<<>>}
3> ecsv:parse_raw(eof, S2, R2).
{ok,[{<<"quux">>},{<<"baz">>},{<<"bar">>},{<<"foo">>}],<<>>}
The function chops the Input binary into chunks of default_block_size/0, so each call should take 10-15% of a scheduler timeslice on a decent 2.6GHz CPU and keep the VM responsive. You should not call the NIF function directly.
parse_step(Input, State0) -> Result
Parse Input and return rows in order.
This function allows writing a simple parsing loop over chunked data. It requires an initialised parser state and returns rows in order. The call with eof is necessary if the line terminator after the last row is missing. Use parse_raw/3 if the order of rows is not important for you, or if you want to accumulate rows and use lists:reverse/1 afterwards.
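A minimal chunked-parsing loop might look like the sketch below. It assumes parse_step/2 returns {ok, Rows, State} in the same shape as parse_raw/3; verify the actual return value against your ecsv version:

```erlang
%% Hypothetical helper; assumes ecsv:parse_step/2 returns {ok, Rows, State}.
parse_chunks(Chunks) ->
    Step = fun(Chunk, {Acc, St}) ->
                   {ok, Rows, St1} = ecsv:parse_step(Chunk, St),
                   {Acc ++ Rows, St1}
           end,
    %% The final eof call flushes a possibly unterminated last row.
    {AllRows, _State} = lists:foldl(Step, {[], ecsv:parser_init([])},
                                    Chunks ++ [eof]),
    AllRows.
```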
See also: parser_init/1, parse_raw/3.
parse_stream(Reader, ReaderState0, CallBack, CallbackState0) -> Result
Reader = reader_fun(ReaderStateType)
ReaderState0 = ReaderStateType
CallBack = callback_fun(CallBackStateType)
CallbackState0 = CallBackStateType
Result = {ReaderState, CallbackState, State}
ReaderState = ReaderStateType
CallbackState = CallBackStateType
State = state()
Equivalent to parse_stream(Reader, ReaderState0, CallBack, CallbackState0, []).
parse_stream(Reader, ReaderState0, CallBack, CallbackState0, StateOrOpts) -> Result
Reader = reader_fun(ReaderStateType)
ReaderState0 = ReaderStateType
CallBack = callback_fun(CallBackStateType)
CallbackState0 = CallBackStateType
StateOrOpts = state() | options()
Result = {ReaderState, CallbackState, State}
ReaderState = ReaderStateType
CallbackState = CallBackStateType
State = state()
Parse stream produced by Reader and processed by CallBack.
The function parses Input from Reader (see reader_fun()) and feeds the result into CallBack (see callback_fun()).
The code
{ok, Bin} = file:read_file("test/FL_insurance_sample.csv"),
Rows = ecsv:parse(Bin).
leads to the same result as
{ok, FH} = file:open("test/FL_insurance_sample.csv", [read, raw, binary]),
try ecsv:parse_stream(ecsv:file_reader(), FH, ecsv:accumulator(), []) of
{_, Rows, _} -> Rows
after file:close(FH)
end.
or
{ok, Bin} = file:read_file("test/FL_insurance_sample.csv"),
BC = ecsv:block_chopper(ecsv:default_block_size()),
{_, Rows, _} = ecsv:parse_stream(BC, Bin, ecsv:accumulator(), []).
But using parse_stream/4,5 allows stream processing. For example,
Counter = fun({_, Rs}, {Fs, Ls}) ->
{Fs + lists:sum([tuple_size(X) || X <- Rs]),
Ls + length(Rs)}
end,
{ok, FH2} = file:open("test/FL_insurance_sample.csv", [read, raw, binary]),
try ecsv:parse_stream(ecsv:file_reader(), FH2, Counter, {0, 0}) of
{_, {NumberOfFields, NumberOfRows} = Result, _} -> Result
after file:close(FH2)
end.
will be way more efficient than reading all rows into memory for big data files.
parser_init(Opts) -> State
Opts = options()
State = state()
Initialise parser state.
Returns State for parsing CSV using the given Opts. See state() for more details about State behaviour.
The following options are recognised:

strict

null
Return null instead of an empty binary for empty unquoted fields; quoted empty fields ("") stay <<>>. Compare
1> ecsv:parse(<<"a,,b,\"\",c">>, []).
[{<<"a">>,<<>>,<<"b">>,<<>>,<<"c">>}]
2> ecsv:parse(<<"a,,b,\"\",c">>, [null]).
[{<<"a">>,null,<<"b">>,<<>>,<<"c">>}]

all_lines
Return empty lines as empty tuples instead of skipping them:
1> ecsv:parse(<<"a\n\nb\n\n">>,[]).
[{<<"a">>},{<<"b">>}]
2> ecsv:parse(<<"a\n\nb\n\n">>,[all_lines]).
[{<<"a">>},{},{<<"b">>},{}]

strict_finish

{delimiter, D}
Use the byte D as the field delimiter instead of the default $,.

{quote, Q}
Use the byte Q as the quote character instead of the default $", for example
1> ecsv:parse(<<"'a,\",';'b,\"\",''c';\"">>, [strict]).
[{<<"'a">>,<<",';'b,\",''c';">>}]
2> ecsv:parse(<<"'a,\",';'b,\"\",''c';\"">>, [strict, {delimiter, $;}, {quote, $'}]).
[{<<"a,\",">>,<<"b,\"\",'c">>,<<"\"">>}]
write(Line) -> Result
Line = line()
Result = iolist()
Equivalent to write_lines([Line]).
write_lines(Lines) -> Result
Lines = [line()]
Result = iolist()
Experimental CSV writer.
Function writes binaries, iolists, atoms, integers and floats. Fields are quoted and quotes escaped as needed.
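A hypothetical usage sketch follows; the exact quoting and number formatting depend on the writer implementation, and the file name is only an example:

```erlang
%% line() = [atom() | number() | iolist()], so mixed field types are fine.
Lines = [[<<"name">>, <<"qty">>, <<"note">>],
         [widget, 2, <<"a,\"b\"">>]],   % the last field needs quoting/escaping
ok = file:write_file("out.csv", ecsv:write_lines(Lines)).
```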