Combination of range() and count() inside update() leads to TypeError #2892

ghost · 2021-03-12T11:13:05Z

Hi there I want to enumerate the rows in each group.

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

df['a'] = range(df.nrows) # works fine
df[:, update(b=count()), by('x')] # works fine
df[:, update(c=range(count())), by('x')] # TypeError

The last line of code leads to the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
TypeError: 'Expr' object cannot be interpreted as an integer

Is there a way fix or circumvent this error to get the desired output (only column 'c' is of importance)?:

   | x    a   b   c
-- + --  --  --  --
 0 | b    0   3   0
 1 | b    1   3   1
 2 | b    2   3   2
 3 | a    3   3   0
 4 | a    4   3   1
 5 | a    5   3   2
 6 | c    6   3   0
 7 | c    7   3   1
 8 | c    8   3   2
[9 rows x 4 columns]

(I asked the same on SO here: https://stackoverflow.com/questions/66585631/python-datatable-assign-new-column-by-group-to-enumerate-elements)

Thank you for your help :)

The text was updated successfully, but these errors were encountered:

st-pasha · 2021-03-12T20:54:53Z

Thanks Peter for the question.
The reason range(count()) doesn't work is because range is a pure-python function that expects its argument to be an integer. In this case, count() is not an integer - it is an expression that eventually evaluates to an integer within each group of a DT[i,j,by] call. However, as range() expects a plain integer, it throws an error here.

Unfortunately, I can't think of a way to implement what you're asking -- it may require creating a new function for this task.

st-pasha · 2021-03-12T21:30:13Z

I've checked some other packages, and here's how they do it:

pandas: cumcount();
data.table: 1:.N or seq_len(.N), both of which are R equivalents of python's range(). However, it works in R because they have lazy evaluation semantics built in;
tidyr: cumsum()
SQL Server: ROW_NUMBER() OVER(PARTITION BY)
Stata: _n variable

ghost · 2021-03-12T21:33:36Z

Thank you @st-pasha for your explanation very much. I have a way to get the desired output, however the solution isn't elegant at all. Anyway here it is:

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

values = dt.unique(df["x"]).to_list()[0]
for value in values:
    df[f.x == value, 'c'] = range(df[f.x == value, :].nrows)

Result:

   | x    c
-- + --  --
 0 | b    0
 1 | b    1
 2 | b    2
 3 | a    0
 4 | a    1
 5 | a    2
 6 | c    0
 7 | c    1
 8 | c    2

samukweku · 2021-03-13T00:22:31Z

Yeah, there isn't any clean way without implementing a function. There was a question about this on SO, with a hack

samukweku · 2021-03-15T06:10:46Z

@st-pasha in terms of API design for something like this, I would suggest maybe one function that can be appended sort of to existing functions, similar to what is available in numpy with ufunc.accumulate

Something like dt.accumulate(dt.sum(f[:])). a similar functionality exists in numpy as reduce.

Not sure if it makes sense or is doable; it just might help to not have so many functions, while still providing the functionality.

Another option would be the user defined function suggested in #1960. That would allow users make use of functions in numpy (which are quite fast and written in C) or other libraries.

st-pasha · 2021-03-15T20:11:53Z

In the simplest case when there are no groups, dt.sum(f[:]) produces a single row. Thus, dt.accumulate(<one row>) is not going to work. What you want here instead is (accumulate(sum))(f[:]), i.e. a function that augments the existing reduce operator. The other problem is that specific cum* operators can be written more efficiently than what a generic function accumulate can achieve.

samukweku · 2021-11-04T09:40:55Z

I want to take a stab at this, I am thinking of first implementing a cumsum (which is needed to do the cumcount).

samukweku · 2021-11-11T07:03:13Z

my knowledge of C++ shows through this; I'll unassign myself; hopefully someone else can pick this up

samukweku · 2022-07-10T16:52:00Z

this should be resolved by #3279

WIP for #3279 Closes #2892

samukweku self-assigned this Nov 4, 2021

samukweku removed their assignment Nov 11, 2021

samukweku mentioned this issue Nov 11, 2021

[WIP] Cumsum #3201

Closed

samukweku mentioned this issue Apr 7, 2022

[ENH] Implement cumsum() function #3257

Merged

samukweku mentioned this issue Jun 30, 2022

Implement cumulative functions #3279

Closed

8 tasks

samukweku mentioned this issue Jul 11, 2022

[ENH] Add cumcount() and ngroup() functions #3310

Merged

oleksiyskononenko closed this as completed in #3310 Jul 26, 2022

oleksiyskononenko pushed a commit that referenced this issue Jul 26, 2022

[ENH] Add cumcount() and ngroup() functions (#3310)

db26c27

WIP for #3279 Closes #2892

samukweku added a commit that referenced this issue Aug 4, 2022

[ENH] Add cumcount() and ngroup() functions (#3310)

7deebb9

WIP for #3279 Closes #2892

samukweku added a commit that referenced this issue Aug 10, 2022

[ENH] Add cumcount() and ngroup() functions (#3310)

201c2a6

WIP for #3279 Closes #2892

st-pasha added this to the Release 1.1.0 milestone Nov 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combination of range() and count() inside update() leads to TypeError #2892

Combination of range() and count() inside update() leads to TypeError #2892

ghost commented Mar 12, 2021 •

edited by ghost

Loading

st-pasha commented Mar 12, 2021

st-pasha commented Mar 12, 2021

ghost commented Mar 12, 2021

samukweku commented Mar 13, 2021

samukweku commented Mar 15, 2021 •

edited

Loading

st-pasha commented Mar 15, 2021

samukweku commented Nov 4, 2021

samukweku commented Nov 11, 2021

samukweku commented Jul 10, 2022

Combination of range() and count() inside update() leads to TypeError #2892

Combination of range() and count() inside update() leads to TypeError #2892

Comments

ghost commented Mar 12, 2021 • edited by ghost Loading

st-pasha commented Mar 12, 2021

st-pasha commented Mar 12, 2021

ghost commented Mar 12, 2021

samukweku commented Mar 13, 2021

samukweku commented Mar 15, 2021 • edited Loading

st-pasha commented Mar 15, 2021

samukweku commented Nov 4, 2021

samukweku commented Nov 11, 2021

samukweku commented Jul 10, 2022

ghost commented Mar 12, 2021 •

edited by ghost

Loading

samukweku commented Mar 15, 2021 •

edited

Loading