Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API discussion #1187

Closed
st-pasha opened this issue Jul 20, 2018 · 5 comments · Fixed by #1579
Closed

API discussion #1187

st-pasha opened this issue Jul 20, 2018 · 5 comments · Fixed by #1579
Assignees
Labels
design-doc Generic discussion / roadmap how some major new functionality can be implemented
Milestone

Comments

@st-pasha
Copy link
Contributor

st-pasha commented Jul 20, 2018

This issue is for the general discussion of the datatable's API. It should only be closed when the discussion has stabilized, and the majority of the suggested syntax either implemented or delegated into separate issues.

First, as a general principle, datatable is a sibling of R's data.table, and aims to mimic its API / algorithms whenever possible and reasonable. At the same time, many of the design choices that went into data.table stem from the functionality of base R; such functionality may be awkward when transferred into Python directly. So some kind of balanced approach is needed. Finally, it must be acknowledged that R gives much more freedom in syntactic expression to the user, which means many of the constructs used in data.table are simply not possible in Python.

Main syntax

The cornerstone of data.table's API is the following syntactic form:

                                 DT[i, j, by, ...]

where ... denotes extra options. Here i and j are positional arguments, denoting the rows and columns selectors respectively (alternatively, j is often called the "what to do" argument, as it can specify arbitrary calculations over the columns). The by argument may also be positional, but more commonly it is used in named form (i.e. by=...), especially considering that it is frequently replaced with keyby=... which is another mode of grouping.

This syntax is good, and we want to generally retain it, however, there is a big caveat: Python does not support named parameters in square-brackets selectors. There is PEP-472 to add such support. The PEP dates back to 2014 and was on "standards track" for Py3.6, however, today Py3.7 is almost already out, and the proposal was not implemented yet. So don't get your hopes too high...

Given all this a considerable amount of thought, I come up with the following suggested primary syntax for datatable:

                                 DT[i, j]
                                 DT[i, j, by(...)]
                                 DT[i, j, join(...)]
                                 etc.

Thus, the simplest form uses DT[i, j], which is perfectly natural for indexing a 2-dimensional table of data. However, the grouping argument, if present, must be "named" via function by(). The function by() may accept multiple columns or column expressions, and also have its own parameters. For example, such parameters could be method = "fast"|"sorted"|"keep_order"|"rle" to choose the algorithm for grouping, add_cols = True|False whether to automatically add key columns to the resulting frame, skip_na = False|True whether an NA-valued group is dropped, filter=<expr> to remove certain groups based on a custom logic, and so on.

Likewise, the generic syntax to perform a join is the join() verb: DT[i, j, join(X, on=..., nomatch=..., mode=...)]. We can support the data.table's syntax DT[X] too, but I suspect it won't be very useful without the support of extra arguments such as on=, mult=, etc. Another point of distinction is that unlike DT[X], the expression DT[:, :, join(X)] will perform a left-outer-join with default params.

This takes care of most of the arguments to [.data.table. The arguments that do not fall into either by() or join() family are: nomatch, which, with and verbose. Out of these, with is not needed since in Python the mode with=TRUE does not work anyway, so we have to use f.* expressions. The verbose and nomatch parameters can be handled as global options. The which parameter is very awkward: a much cleaner approach is to have a special .WHICH symbol to be used in j.

f.* symbols

As mentioned above, the data.table's syntax DT[, A] to refer to column "A" cannot work in Python: A will be interpreted as variable from the outer scope, not as column "A" in DT. Of course, DT[:, "A"] is ok in Python, but then you cannot do expressions such as DT[:, "A" / "B"]. Presumably, you could put the entire expression into a string DT[:, "A / B"], but even this has its limitations.

Instead, we opted out for the f.* syntax: the f refers to the "frame currently being operated upon", and then f.A or f["A"] is the column "A" in that frame. The constant repetition of f. is somewhat tedious, but it has its own advantages too:

  • it is easy to refer to a column whose name is in a variable: f[var];
  • similarly, you can refer to a column whose name is not a valid identifier: f["Purchase price"];
  • it is possible to distinguish between the columns of the current frame and the columns of the joined frame, the latter will use prefix g;
  • data.table occasionally uses a similar approach by saying x.col or i.col;
  • the columnar expression(s) can be saved in a variable and then reused later.

In-place frame updates

In data.table the syntax for this is DT[i, col1:=expr]. This is nice, but there is no ":=" operator in Python (at least until PEP-572, but even that would not be overloadable). Instead, we currently implement the following syntax for updates: DT[i, col] = expr. This works fine in small use cases but quickly becomes unreadable in larger ones. Consider: DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3] -- which column name gets assigned which expression? Or DT[:, col, join(X, ...), by(z)] = expr -- the column name and the expression are so far from each other that it becomes unclear what kind of assignment takes place.
One way to deal with this problem is to introduce a special syntax for updates:

DT.update[i, {colA: expr1, colB: expr2}, ...]

or alternatively

DT[i, update(colA=expr1, colB=expr2), ...]

Arbitrary group expressions

One of the most powerful features of data.table is the ability to perform arbitrary calculations with subsets of the target frame corresponding to each group. This is done via .SD special symbol: the j part of the DT[i, j, by] form can be an arbitrary function of .SD -- as long it creates a list (or a data.table) as a result.

A similar functionality can be achieved in Python datatable via a special function apply() (or do()) which can be used in place of j expression. This function may take either one or two arguments, and produce either a list, or list-of-lists, or a Frame, or None.

  • the function will be called once for each group in the source frame, or once for each row if no by() clause was given;
  • a single-argument function will be given the subframe of the data corresponding to each group (i.e. .SD);
  • a two-arguments function will be given the key value as the first arg, and the subset of data as the second arg;
  • if the function returns None, that value is ignored; if it returns a list/tuple, that value is converted into a 1-row frame; if it returns a list-of-lists, it is converted into a frame (each list element becomes a column);
  • at the end, all produced frames are rbind-ed together (or combined according to the combine option).

The apply() function may have options to control its behavior: sdcols - same as .SDcols in R, combine="rbind"|"cbind"|"list", etc.


Please share your thoughts / comments / suggestions.

@st-pasha st-pasha added the design-doc Generic discussion / roadmap how some major new functionality can be implemented label Jul 20, 2018
@st-pasha st-pasha self-assigned this Jul 20, 2018
@arnocandel
Copy link
Member

Thank you! Maybe we could assemble a representative side-by-side comparison between R and Python API to give us a better feel of what it would look like?

@st-pasha
Copy link
Contributor Author

Pivoting

Pivot functionality comes from early spreadsheet applications, Lotus Improv and Excel. In a nutshell, the functionality involves designating a "rows" column, a "columns" column, and a "data" column, plus an aggregator function. Then for each distinct (rows, columns) combination, we compute aggregator(data). Thus, pivoting involves:

  • selecting columns R, C and V from the target frame;
  • grouping the target frame by (R, C);
  • for each group computing agg(V);
  • converting the produced frame into "wide" format with values of C becoming the columns.

Thus, I can think of 2 ways how to express this in datatable syntax:

  1. Emphasizing the way how people usually think about pivot tables

    DT.pivot[f.R, f.C, agg(f.V)]
    

    This draws upon the traditional form DT[i, j] where i corresponds to rows, and j to columns.

  2. Emphasizing what pivoting actually is:

    DT.pivot[:, agg(f.V), by(f.R, f.C)]
    

    This requires some more cognitive processing, but is more flexible: one can add a rows filter, and use extra parameters of the by function to fine-tune grouping.

Note: same function is called dcast() in data.table, we may choose to use that verb instead.

@jangorecki
Copy link
Contributor

jangorecki commented Aug 7, 2018

  • for update api second one looks much better
DT[i, update(colA=expr1, colB=expr2), ...]

it even corresponds to R's data.table:

DT[i, `:=`(colA=expr1, colB=expr2), ...]

so only LHS := RHS is not possible in python.

  • calling apply(.SD, something) for every row if by is not specified will differ from R's dt. We have to explicitly use by=1:nrow(DT) to achieve that, and there are good reasons for that. But R is language of vectors so fun(vec1, vec2, ..., vecN) might be more natural here than in python, where for(i in 0:(nrow(DT)-1)) fun(vec1[i], vec2[i], ..., vecN[i]) might be desired.

st-pasha added a commit that referenced this issue Jan 22, 2019
- Method `Frame.__call__()` is marked deprecated, and re-implemented to use the `DT[i, j, ...]` interface;
- Tests that were using `DT(...)` construct were rewritten in terms of `DT[i, j, ...]`;
- A lot of code that existed to support old `DT(...)` function has been removed;
- The extra arguments in `DT[i, j, ...]` can now be `None`, in which case they will be ignored. This can be used to include/exclude nodes in `DT[i, j, ...]` based on certain condition.

This PR also implements the main principles of design-doc #1187, so I will be closing that as well.
@Viktor-Demin
Copy link

DT[i, update(colA=expr1, colB=expr2), ...]
This looks like very good update option

DT.pivot[:, agg(f.V), by(f.R, f.C)]
and this is good option for pivoting.

But would be great to see how similar code looks in data.table and in pandas.

@jangorecki
Copy link
Contributor

jangorecki commented Mar 7, 2019

@Cheburusska in R data.table pivot is made by dcast, related issue to use it inside [ Rdatatable/data.table#1714
we use R formula feature in dcast API, not sure if python has something similar.

@st-pasha st-pasha added this to the Release 0.8.0 milestone Jan 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design-doc Generic discussion / roadmap how some major new functionality can be implemented
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants