API discussion #1187
Thank you! Maybe we could assemble a representative side-by-side comparison between the R and Python APIs to give us a better feel for what it would look like?
Pivoting

Pivot functionality comes from early spreadsheet applications, Lotus Improv and Excel. In a nutshell, the functionality involves designating a "rows" column, a "columns" column, and a "data" column, plus an aggregator function. Then for each distinct (rows, columns) combination, we compute aggregator(data). Thus, pivoting involves:
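The computation described above (one `aggregator(data)` value per distinct (rows, columns) pair) can be sketched in plain Python. This is only an illustration of the idea, not `datatable` code: the `pivot` function, its parameter names, and the sample `sales` records are all made up for this example.

```python
from collections import defaultdict

def pivot(records, rows, columns, data, aggregator):
    """Hypothetical helper: aggregate `data` per distinct (rows, columns) pair."""
    buckets = defaultdict(list)
    for rec in records:
        # Collect the data values that share the same (rows, columns) key
        buckets[(rec[rows], rec[columns])].append(rec[data])
    # Reduce every bucket to a single value with the aggregator
    return {key: aggregator(values) for key, values in buckets.items()}

# Made-up sample data
sales = [
    {"region": "east", "year": 2017, "amount": 10},
    {"region": "east", "year": 2017, "amount": 5},
    {"region": "east", "year": 2018, "amount": 7},
    {"region": "west", "year": 2017, "amount": 3},
]

result = pivot(sales, rows="region", columns="year", data="amount", aggregator=sum)
```

Here `result` maps each (region, year) pair to the summed amount; laying those pairs out as a grid with regions as rows and years as columns gives the familiar spreadsheet pivot table.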
Thus, I can think of 2 ways to express this in
Note: same function is called
it even corresponds to R's data.table:
so only
- Method `Frame.__call__()` is marked deprecated, and re-implemented to use the `DT[i, j, ...]` interface;
- Tests that were using `DT(...)` construct were rewritten in terms of `DT[i, j, ...]`;
- A lot of code that existed to support old `DT(...)` function has been removed;
- The extra arguments in `DT[i, j, ...]` can now be `None`, in which case they will be ignored. This can be used to include/exclude nodes in `DT[i, j, ...]` based on certain condition.

This PR also implements the main principles of design-doc #1187, so I will be closing that as well.
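For readers unfamiliar with how `DT[i, j, ...]` reaches Python code: everything inside the brackets arrives in `__getitem__` as one tuple, and keyword arguments are not allowed there, which is why the extras have to be wrapped in marker objects. The sketch below is a toy model of that mechanism and of the "`None` extras are ignored" rule. `ToyFrame` and the `by` / `join` classes here are hypothetical stand-ins written for this example, not the actual `datatable` implementation.

```python
class by:
    """Hypothetical marker that 'names' the grouping argument."""
    def __init__(self, *cols):
        self.cols = cols

class join:
    """Hypothetical marker that 'names' the join argument."""
    def __init__(self, frame, on=None):
        self.frame = frame
        self.on = on

class ToyFrame:
    """Toy stand-in for a frame: it only reports how it was indexed."""
    def __getitem__(self, item):
        # DT[i, j, ...] arrives as a tuple; DT[i] arrives as a bare object
        if not isinstance(item, tuple):
            item = (item,)
        parsed = {
            "i": item[0],
            "j": item[1] if len(item) > 1 else slice(None),
            "by": None,
            "join": None,
        }
        for extra in item[2:]:
            if extra is None:
                continue  # None extras are skipped
            if isinstance(extra, by):
                parsed["by"] = extra.cols
            elif isinstance(extra, join):
                parsed["join"] = extra
            else:
                raise TypeError(f"unexpected argument: {extra!r}")
        return parsed

DT = ToyFrame()
group_by_a = False

q1 = DT[:, "B", by("A")]
# Conditionally include the grouping node without changing the call shape
q2 = DT[:, "B", by("A") if group_by_a else None]
```

In this toy model `q1` carries the grouping columns while `q2` does not, so a clause can be switched on or off with an ordinary conditional expression.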
But it would be great to see how similar code looks in data.table and in pandas.
@Cheburusska in R data.table pivot is made by
This issue is for the general discussion of the `datatable`'s API. It should only be closed when the discussion has stabilized, and the majority of the suggested syntax either implemented or delegated into separate issues.

First, as a general principle, `datatable` is a sibling of R's `data.table`, and aims to mimic its API / algorithms whenever possible and reasonable. At the same time, many of the design choices that went into `data.table` stem from the functionality of base R; such functionality may be awkward when transferred into Python directly. So some kind of balanced approach is needed. Finally, it must be acknowledged that R gives much more freedom in syntactic expression to the user, which means many of the constructs used in `data.table` are simply not possible in Python.

Main syntax
The cornerstone of data.table's API is the following syntactic form: `DT[i, j, by, ...]`, where `...` denotes extra options. Here `i` and `j` are positional arguments, denoting the rows and columns selectors respectively (alternatively, `j` is often called the "what to do" argument, as it can specify arbitrary calculations over the columns). The `by` argument may also be positional, but more commonly it is used in named form (i.e. `by=...`), especially considering that it is frequently replaced with `keyby=...`, which is another mode of grouping.

This syntax is good, and we want to generally retain it. However, there is a big caveat: Python does not support named parameters in square-bracket selectors. There is PEP-472 to add such support. The PEP dates back to 2014 and was on "standards track" for Py3.6; however, today Py3.7 is already out, and the proposal has not been implemented yet. So don't get your hopes too high...

Having given all this a considerable amount of thought, I came up with the following suggested primary syntax for `datatable`:

Thus, the simplest form uses `DT[i, j]`, which is perfectly natural for indexing a 2-dimensional table of data. However, the grouping argument, if present, must be "named" via the function `by()`. The function `by()` may accept multiple columns or column expressions, and also have its own parameters. For example, such parameters could be `method = "fast"|"sorted"|"keep_order"|"rle"` to choose the algorithm for grouping, `add_cols = True|False` whether to automatically add key columns to the resulting frame, `skip_na = False|True` whether an NA-valued group is dropped, `filter=<expr>` to remove certain groups based on a custom logic, and so on.

Likewise, the generic syntax to perform a join is the `join()` verb: `DT[i, j, join(X, on=..., nomatch=..., mode=...)]`. We can support the data.table's syntax `DT[X]` too, but I suspect it won't be very useful without the support of extra arguments such as `on=`, `mult=`, etc. Another point of distinction is that unlike `DT[X]`, the expression `DT[:, :, join(X)]` will perform a left-outer-join with default params.

This takes care of most of the arguments to `[.data.table`. The arguments that do not fall into either the `by()` or `join()` family are: `nomatch`, `which`, `with` and `verbose`. Out of these, `with` is not needed since in Python the mode `with=TRUE` does not work anyway, so we have to use `f.*` expressions. The `verbose` and `nomatch` parameters can be handled as global options. The `which` parameter is very awkward: a much cleaner approach is to have a special `.WHICH` symbol to be used in `j`.

f.* symbols
As mentioned above, the data.table's syntax `DT[, A]` to refer to column "A" cannot work in Python: `A` will be interpreted as a variable from the outer scope, not as column "A" in DT. Of course, `DT[:, "A"]` is ok in Python, but then you cannot do expressions such as `DT[:, "A" / "B"]`. Presumably, you could put the entire expression into a string `DT[:, "A / B"]`, but even this has its limitations.

Instead, we opted for the `f.*` syntax: the `f` refers to the "frame currently being operated upon", and then `f.A` or `f["A"]` is the column "A" in that frame. The constant repetition of `f.` is somewhat tedious, but it has its own advantages too:
- `f[var]`;
- `f["Purchase price"]`;
- `g`;
- `data.table` occasionally uses a similar approach by saying `x.col` or `i.col`;

In-place frame updates
In data.table the syntax for this is `DT[i, col1:=expr]`. This is nice, but there is no ":=" operator in Python (at least until PEP-572, but even that would not be overloadable). Instead, we currently implement the following syntax for updates: `DT[i, col] = expr`. This works fine in small use cases but quickly becomes unreadable in larger ones. Consider: `DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3]` -- which column name gets assigned which expression? Or `DT[:, col, join(X, ...), by(z)] = expr` -- the column name and the expression are so far from each other that it becomes unclear what kind of assignment takes place.

One way to deal with this problem is to introduce a special syntax for updates:
or alternatively
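For context, the `DT[i, col] = expr` form that this section says is implemented today maps onto Python's `__setitem__` protocol, with `f.*` symbols building a deferred expression that is evaluated row by row. The toy model below illustrates that combination only; it does not show either of the proposed special update syntaxes. The names `Expr`, `_FrameProxy`, `_lift` and `ToyFrame` are hypothetical and exist only in this sketch, and only slice row selectors and the `/` operator are handled.

```python
class Expr:
    """Deferred computation over one row (a dict of column name -> value)."""
    def __init__(self, fn):
        self.fn = fn

    def __truediv__(self, other):
        other = _lift(other)
        return Expr(lambda row: self.fn(row) / other.fn(row))

def _lift(value):
    """Wrap plain constants so they can be combined with Expr objects."""
    return value if isinstance(value, Expr) else Expr(lambda row: value)

class _FrameProxy:
    """Stand-in for `f`: attribute or item access yields a column expression."""
    def __getattr__(self, name):
        return Expr(lambda row: row[name])

    def __getitem__(self, name):
        return Expr(lambda row: row[name])

f = _FrameProxy()

class ToyFrame:
    """Column-oriented toy frame supporting DT[i, col] = expr."""
    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        self.nrows = len(next(iter(self.columns.values())))

    def __setitem__(self, key, expr):
        i, col = key
        selected = range(self.nrows)[i]  # this sketch accepts slices only
        expr = _lift(expr)
        # Evaluate on the old values first, so expr may reference `col` itself
        new_values = [
            expr.fn({name: values[r] for name, values in self.columns.items()})
            for r in selected
        ]
        target = self.columns.setdefault(col, [None] * self.nrows)
        for r, value in zip(selected, new_values):
            target[r] = value

DT = ToyFrame({"A": [10, 20, 30], "B": [2, 4, 5]})
DT[:, "C"] = f.A / f.B          # create a new column from two existing ones
DT[1:, "A"] = f["A"] / 10       # update part of an existing column in place
```

Because `f.A / f.B` is just an object, nothing is computed until `__setitem__` evaluates it against the selected rows.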
Arbitrary group expressions
One of the most powerful features of data.table is the ability to perform arbitrary calculations with subsets of the target frame corresponding to each group. This is done via the `.SD` special symbol: the `j` part of the `DT[i, j, by]` form can be an arbitrary function of `.SD` -- as long as it creates a list (or a data.table) as a result.

A similar functionality can be achieved in Python `datatable` via a special function `apply()` (or `do()`) which can be used in place of the `j` expression. This function may take either one or two arguments, and produce either a list, or list-of-lists, or a Frame, or None.
- `by()` clause was given;
- `.SD`);
- `None`, that value is ignored; if it returns a list/tuple, that value is converted into a 1-row frame; if it returns a list-of-lists, it is converted into a frame (each list element becomes a column);
- `combine` option).

The `apply()` function may have options to control its behavior: `sdcols` - same as `.SDcols` in R, `combine="rbind"|"cbind"|"list"`, etc.

Please share your thoughts / comments / suggestions.
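To make the `apply()` idea above easier to discuss, here is a small plain-Python model of the per-group behavior it describes: the user function is called on each group's sub-frame, a `None` result is ignored, a flat list becomes one row, and a list-of-lists is read as columns. The function `group_apply`, its one-argument callback convention, and the row-wise stacking of results (roughly the `"rbind"` mode) are assumptions made for this sketch, not a settled API.

```python
def group_apply(frame, key, fn):
    """Hypothetical helper: call fn on each group's sub-frame, stack results row-wise."""
    # Map each distinct key value to the row indices that belong to it
    groups = {}
    for r, k in enumerate(frame[key]):
        groups.setdefault(k, []).append(r)

    out_rows = []
    for idx in groups.values():
        # The sub-frame plays the role of .SD for this group
        sub = {name: [values[r] for r in idx] for name, values in frame.items()}
        result = fn(sub)
        if result is None:
            continue  # None: the group contributes nothing
        if result and isinstance(result[0], (list, tuple)):
            out_rows.extend(zip(*result))   # list-of-lists: each inner list is a column
        else:
            out_rows.append(tuple(result))  # flat list/tuple: a single row
    return out_rows

# Made-up sample frame, stored column by column
frame = {"g": ["a", "a", "b"], "x": [1, 2, 3]}

def summarize(sd):
    if len(sd["x"]) < 2:
        return None                      # skip singleton groups
    return [sd["g"][0], sum(sd["x"])]    # one summary row per group

def scaled(sd):
    return [sd["g"], [v * 10 for v in sd["x"]]]  # two columns, same row count

summary_rows = group_apply(frame, "g", summarize)
scaled_rows = group_apply(frame, "g", scaled)
```

With this data, `summarize` yields a single row for group "a" (group "b" is dropped because it returns `None`), while `scaled` yields one output row per input row.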