Allow specifying options in load/write functions #108

Open
jornfranke opened this issue Feb 6, 2024 · 3 comments
Comments

@jornfranke
Contributor

Currently, data can only be loaded/written with specific default options, without any way to customize them.

For instance, in loadCsv or writeCsv I cannot specify a separator other than ";". There are many other options relevant for reading/writing CSV files.

loadParquet/writeParquet do not allow specifying options such as compression (cf. here).

I propose adding functions, e.g. loadCsvWithParameters, that take as input the path and a Map<String,String>, which would allow specifying any option. I am not exactly sure how one can pass a Map<String,String> initialized with data in VTL. Alternatively, one could simply provide a "config" Dataset.class, i.e. a dataset with two columns (key, value).
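
For illustration, here is a minimal sketch of what such kernel-side helpers could look like, assuming a Spark-backed implementation; the names loadCsvWithParameters and writeParquetWithParameters and their signatures are hypothetical, not existing Trevas or kernel API:

```java
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIo {

    // Hypothetical helper: load a CSV file, applying any user-supplied
    // reader options (sep, header, quote, ...); unspecified options
    // fall back to Spark's defaults.
    public static Dataset<Row> loadCsvWithParameters(SparkSession spark,
                                                     String path,
                                                     Map<String, String> options) {
        return spark.read()
                .options(options) // e.g. {"sep": ",", "header": "true"}
                .csv(path);
    }

    // Same idea for writing, e.g. {"compression": "gzip"} for Parquet.
    public static void writeParquetWithParameters(Dataset<Row> ds,
                                                  String path,
                                                  Map<String, String> options) {
        ds.write()
                .options(options)
                .parquet(path);
    }
}
```

The open question above remains how to build the Map<String,String> from a VTL script; this sketch only covers the Java side once the options have reached the kernel.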

@NicoLaval
Collaborator

Relevant point.
For now, the configuration is hardcoded with simple defaults, which is certainly not satisfactory.

Defining the Map in the VTL script seems to be impossible.

I can imagine two ways to enable custom configuration:

  • pass it through environment variables (but that would fix the configuration for the whole Jupyter instance, which is not ideal)
  • define many load / write functions with different option combinations (which seems tedious)

I suggest we discuss this during our next call.

@NicoLaval
Collaborator

Hi @jornfranke,

We discussed the possibilities with @hadrienk:

  • using a UDO (user-defined operator, not yet supported by the Trevas engine) to wrap the loadCSV Java function defined and provided in the Kernel:
define operator read(url string, sep string default ";", delimiter string default "'", header boolean default true)
returns dataset is
readCSV(url, sep, delimiter, header)
end operator;

And instantiate with read("path", _, _, false); for instance.

At this point, parameters in VTL are positional only, which would make for an ugly syntax if we wanted to expose many options as parameters (see this issue posted on the VTL TF repo).

  • using URL parameters, e.g. loadCSV("path?header=false");, and handling them directly in the loadCSV Java function provided in the Jupyter Kernel (see the sketch below).
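
Purely as an illustration of this second option, a rough sketch of how the Java side could split such a path into a bare path plus a map of reader options; the class and method names are made up for this example and are not Trevas' actual loadCSV implementation:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class CsvPathOptions {

    // Split "data.csv?header=false&sep=," into the bare path and a map
    // of options ({"header": "false", "sep": ","}).
    public static Map<String, String> parseOptions(String pathWithQuery) {
        Map<String, String> options = new HashMap<>();
        int idx = pathWithQuery.indexOf('?');
        if (idx < 0) {
            return options;
        }
        for (String pair : pathWithQuery.substring(idx + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split("=", 2);
            options.put(decode(kv[0]), kv.length > 1 ? decode(kv[1]) : "");
        }
        return options;
    }

    // Return the path without its query part, to be passed to the reader.
    public static String stripOptions(String pathWithQuery) {
        int idx = pathWithQuery.indexOf('?');
        return idx < 0 ? pathWithQuery : pathWithQuery.substring(0, idx);
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this approach the existing loadCSV signature would stay unchanged; callers who need no options keep writing loadCSV("path").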

What do you think?

@jornfranke
Contributor Author

Thanks for the feedback. We implemented a workaround internally and will look into the second option you propose. The first option would probably make sense once Trevas supports it.
