API discussion #1187

st-pasha · 2018-07-20T22:56:44Z

This issue is for the general discussion of the datatable's API. It should only be closed when the discussion has stabilized, and the majority of the suggested syntax either implemented or delegated into separate issues.

First, as a general principle, datatable is a sibling of R's data.table, and aims to mimic its API / algorithms whenever possible and reasonable. At the same time, many of the design choices that went into data.table stem from the functionality of base R; such functionality may be awkward when transferred into Python directly. So some kind of balanced approach is needed. Finally, it must be acknowledged that R gives much more freedom in syntactic expression to the user, which means many of the constructs used in data.table are simply not possible in Python.

Main syntax

The cornerstone of data.table's API is the following syntactic form:

                                 DT[i, j, by, ...]

where ... denotes extra options. Here i and j are positional arguments, denoting the rows and columns selectors respectively (alternatively, j is often called the "what to do" argument, as it can specify arbitrary calculations over the columns). The by argument may also be positional, but more commonly it is used in named form (i.e. by=...), especially considering that it is frequently replaced with keyby=... which is another mode of grouping.

This syntax is good, and we want to generally retain it, however, there is a big caveat: Python does not support named parameters in square-brackets selectors. There is PEP-472 to add such support. The PEP dates back to 2014 and was on "standards track" for Py3.6, however, today Py3.7 is ~~almost~~ already out, and the proposal was not implemented yet. So don't get your hopes too high...

Given all this a considerable amount of thought, I come up with the following suggested primary syntax for datatable:

                                 DT[i, j]
                                 DT[i, j, by(...)]
                                 DT[i, j, join(...)]
                                 etc.

Thus, the simplest form uses DT[i, j], which is perfectly natural for indexing a 2-dimensional table of data. However, the grouping argument, if present, must be "named" via function by(). The function by() may accept multiple columns or column expressions, and also have its own parameters. For example, such parameters could be method = "fast"|"sorted"|"keep_order"|"rle" to choose the algorithm for grouping, add_cols = True|False whether to automatically add key columns to the resulting frame, skip_na = False|True whether an NA-valued group is dropped, filter=<expr> to remove certain groups based on a custom logic, and so on.

Likewise, the generic syntax to perform a join is the join() verb: DT[i, j, join(X, on=..., nomatch=..., mode=...)]. We can support the data.table's syntax DT[X] too, but I suspect it won't be very useful without the support of extra arguments such as on=, mult=, etc. Another point of distinction is that unlike DT[X], the expression DT[:, :, join(X)] will perform a left-outer-join with default params.

This takes care of most of the arguments to [.data.table. The arguments that do not fall into either by() or join() family are: nomatch, which, with and verbose. Out of these, with is not needed since in Python the mode with=TRUE does not work anyway, so we have to use f.* expressions. The verbose and nomatch parameters can be handled as global options. The which parameter is very awkward: a much cleaner approach is to have a special .WHICH symbol to be used in j.

f.* symbols

As mentioned above, the data.table's syntax DT[, A] to refer to column "A" cannot work in Python: A will be interpreted as variable from the outer scope, not as column "A" in DT. Of course, DT[:, "A"] is ok in Python, but then you cannot do expressions such as DT[:, "A" / "B"]. Presumably, you could put the entire expression into a string DT[:, "A / B"], but even this has its limitations.

Instead, we opted out for the f.* syntax: the f refers to the "frame currently being operated upon", and then f.A or f["A"] is the column "A" in that frame. The constant repetition of f. is somewhat tedious, but it has its own advantages too:

it is easy to refer to a column whose name is in a variable: f[var];
similarly, you can refer to a column whose name is not a valid identifier: f["Purchase price"];
it is possible to distinguish between the columns of the current frame and the columns of the joined frame, the latter will use prefix g;
data.table occasionally uses a similar approach by saying x.col or i.col;
the columnar expression(s) can be saved in a variable and then reused later.

In-place frame updates

In data.table the syntax for this is DT[i, col1:=expr]. This is nice, but there is no ":=" operator in Python (at least until PEP-572, but even that would not be overloadable). Instead, we currently implement the following syntax for updates: DT[i, col] = expr. This works fine in small use cases but quickly becomes unreadable in larger ones. Consider: DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3] -- which column name gets assigned which expression? Or DT[:, col, join(X, ...), by(z)] = expr -- the column name and the expression are so far from each other that it becomes unclear what kind of assignment takes place.
One way to deal with this problem is to introduce a special syntax for updates:

DT.update[i, {colA: expr1, colB: expr2}, ...]

or alternatively

DT[i, update(colA=expr1, colB=expr2), ...]

Arbitrary group expressions

One of the most powerful features of data.table is the ability to perform arbitrary calculations with subsets of the target frame corresponding to each group. This is done via .SD special symbol: the j part of the DT[i, j, by] form can be an arbitrary function of .SD -- as long it creates a list (or a data.table) as a result.

A similar functionality can be achieved in Python datatable via a special function apply() (or do()) which can be used in place of j expression. This function may take either one or two arguments, and produce either a list, or list-of-lists, or a Frame, or None.

the function will be called once for each group in the source frame, or once for each row if no by() clause was given;
a single-argument function will be given the subframe of the data corresponding to each group (i.e. .SD);
a two-arguments function will be given the key value as the first arg, and the subset of data as the second arg;
if the function returns None, that value is ignored; if it returns a list/tuple, that value is converted into a 1-row frame; if it returns a list-of-lists, it is converted into a frame (each list element becomes a column);
at the end, all produced frames are rbind-ed together (or combined according to the combine option).

The apply() function may have options to control its behavior: sdcols - same as .SDcols in R, combine="rbind"|"cbind"|"list", etc.

Please share your thoughts / comments / suggestions.

The text was updated successfully, but these errors were encountered:

arnocandel · 2018-07-20T23:31:48Z

Thank you! Maybe we could assemble a representative side-by-side comparison between R and Python API to give us a better feel of what it would look like?

st-pasha · 2018-07-27T23:58:53Z

Pivoting

Pivot functionality comes from early spreadsheet applications, Lotus Improv and Excel. In a nutshell, the functionality involves designating a "rows" column, a "columns" column, and a "data" column, plus an aggregator function. Then for each distinct (rows, columns) combination, we compute aggregator(data). Thus, pivoting involves:

selecting columns R, C and V from the target frame;
grouping the target frame by (R, C);
for each group computing agg(V);
converting the produced frame into "wide" format with values of C becoming the columns.

Thus, I can think of 2 ways how to express this in datatable syntax:

Emphasizing the way how people usually think about pivot tables
```
DT.pivot[f.R, f.C, agg(f.V)]
```
This draws upon the traditional form DT[i, j] where i corresponds to rows, and j to columns.
Emphasizing what pivoting actually is:
```
DT.pivot[:, agg(f.V), by(f.R, f.C)]
```
This requires some more cognitive processing, but is more flexible: one can add a rows filter, and use extra parameters of the by function to fine-tune grouping.

Note: same function is called dcast() in data.table, we may choose to use that verb instead.

jangorecki · 2018-08-07T13:48:17Z

for update api second one looks much better

DT[i, update(colA=expr1, colB=expr2), ...]

it even corresponds to R's data.table:

DT[i, `:=`(colA=expr1, colB=expr2), ...]

so only LHS := RHS is not possible in python.

calling apply(.SD, something) for every row if by is not specified will differ from R's dt. We have to explicitly use by=1:nrow(DT) to achieve that, and there are good reasons for that. But R is language of vectors so fun(vec1, vec2, ..., vecN) might be more natural here than in python, where for(i in 0:(nrow(DT)-1)) fun(vec1[i], vec2[i], ..., vecN[i]) might be desired.

- Method `Frame.__call__()` is marked deprecated, and re-implemented to use the `DT[i, j, ...]` interface; - Tests that were using `DT(...)` construct were rewritten in terms of `DT[i, j, ...]`; - A lot of code that existed to support old `DT(...)` function has been removed; - The extra arguments in `DT[i, j, ...]` can now be `None`, in which case they will be ignored. This can be used to include/exclude nodes in `DT[i, j, ...]` based on certain condition. This PR also implements the main principles of design-doc #1187, so I will be closing that as well.

Viktor-Demin · 2019-03-06T20:39:48Z

DT[i, update(colA=expr1, colB=expr2), ...]
This looks like very good update option

DT.pivot[:, agg(f.V), by(f.R, f.C)]
and this is good option for pivoting.

But would be great to see how similar code looks in data.table and in pandas.

jangorecki · 2019-03-07T05:26:30Z

@Cheburusska in R data.table pivot is made by dcast, related issue to use it inside [ Rdatatable/data.table#1714
we use R formula feature in dcast API, not sure if python has something similar.

st-pasha added the design-doc Generic discussion / roadmap how some major new functionality can be implemented label Jul 20, 2018

st-pasha self-assigned this Jul 20, 2018

st-pasha mentioned this issue Jul 30, 2018

Implement inner/outer joins for non-keyed frames #1080

Open

st-pasha mentioned this issue Sep 18, 2018

Implement by() function, as described in API design-doc #1300

Closed

st-pasha mentioned this issue Jan 22, 2019

Deprecate DT(rows=, select=, ...) #1579

Merged

st-pasha closed this as completed in #1579 Jan 22, 2019

st-pasha mentioned this issue Mar 6, 2019

Assigning of new or reassigning existing columns in j part, similar to ':=' operator in R data.table #1700

Closed

st-pasha mentioned this issue Sep 3, 2019

Pivot table support. #1981

Open

st-pasha added this to the Release 0.8.0 milestone Jan 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API discussion #1187

API discussion #1187

st-pasha commented Jul 20, 2018 •

edited

Loading

arnocandel commented Jul 20, 2018

st-pasha commented Jul 27, 2018

jangorecki commented Aug 7, 2018 •

edited

Loading

Viktor-Demin commented Mar 6, 2019

jangorecki commented Mar 7, 2019 •

edited

Loading

API discussion #1187

API discussion #1187

Comments

st-pasha commented Jul 20, 2018 • edited Loading

Main syntax

f.* symbols

In-place frame updates

Arbitrary group expressions

arnocandel commented Jul 20, 2018

st-pasha commented Jul 27, 2018

Pivoting

jangorecki commented Aug 7, 2018 • edited Loading

Viktor-Demin commented Mar 6, 2019

jangorecki commented Mar 7, 2019 • edited Loading

st-pasha commented Jul 20, 2018 •

edited

Loading

jangorecki commented Aug 7, 2018 •

edited

Loading

jangorecki commented Mar 7, 2019 •

edited

Loading