Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combination of range() and count() inside update() leads to TypeError #2892

Closed
ghost opened this issue Mar 12, 2021 · 9 comments · Fixed by #3310
Closed

Combination of range() and count() inside update() leads to TypeError #2892

ghost opened this issue Mar 12, 2021 · 9 comments · Fixed by #3310
Milestone

Comments

@ghost
Copy link

ghost commented Mar 12, 2021

Hi there I want to enumerate the rows in each group.

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

df['a'] = range(df.nrows) # works fine
df[:, update(b=count()), by('x')] # works fine
df[:, update(c=range(count())), by('x')] # TypeError

The last line of code leads to the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
TypeError: 'Expr' object cannot be interpreted as an integer

Is there a way fix or circumvent this error to get the desired output (only column 'c' is of importance)?:

   | x    a   b   c
-- + --  --  --  --
 0 | b    0   3   0
 1 | b    1   3   1
 2 | b    2   3   2
 3 | a    3   3   0
 4 | a    4   3   1
 5 | a    5   3   2
 6 | c    6   3   0
 7 | c    7   3   1
 8 | c    8   3   2
[9 rows x 4 columns]

(I asked the same on SO here: https://stackoverflow.com/questions/66585631/python-datatable-assign-new-column-by-group-to-enumerate-elements)

Thank you for your help :)

@st-pasha
Copy link
Contributor

Thanks Peter for the question.
The reason range(count()) doesn't work is because range is a pure-python function that expects its argument to be an integer. In this case, count() is not an integer - it is an expression that eventually evaluates to an integer within each group of a DT[i,j,by] call. However, as range() expects a plain integer, it throws an error here.

Unfortunately, I can't think of a way to implement what you're asking -- it may require creating a new function for this task.

@st-pasha
Copy link
Contributor

I've checked some other packages, and here's how they do it:

@ghost
Copy link
Author

ghost commented Mar 12, 2021

Thank you @st-pasha for your explanation very much. I have a way to get the desired output, however the solution isn't elegant at all. Anyway here it is:

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

values = dt.unique(df["x"]).to_list()[0]
for value in values:
    df[f.x == value, 'c'] = range(df[f.x == value, :].nrows)

Result:

   | x    c
-- + --  --
 0 | b    0
 1 | b    1
 2 | b    2
 3 | a    0
 4 | a    1
 5 | a    2
 6 | c    0
 7 | c    1
 8 | c    2

@samukweku
Copy link
Contributor

Yeah, there isn't any clean way without implementing a function. There was a question about this on SO, with a hack

@samukweku
Copy link
Contributor

samukweku commented Mar 15, 2021

@st-pasha in terms of API design for something like this, I would suggest maybe one function that can be appended sort of to existing functions, similar to what is available in numpy with ufunc.accumulate

Something like dt.accumulate(dt.sum(f[:])). a similar functionality exists in numpy as reduce.

Not sure if it makes sense or is doable; it just might help to not have so many functions, while still providing the functionality.

Another option would be the user defined function suggested in #1960. That would allow users make use of functions in numpy (which are quite fast and written in C) or other libraries.

@st-pasha
Copy link
Contributor

In the simplest case when there are no groups, dt.sum(f[:]) produces a single row. Thus, dt.accumulate(<one row>) is not going to work. What you want here instead is (accumulate(sum))(f[:]), i.e. a function that augments the existing reduce operator. The other problem is that specific cum* operators can be written more efficiently than what a generic function accumulate can achieve.

@samukweku
Copy link
Contributor

I want to take a stab at this, I am thinking of first implementing a cumsum (which is needed to do the cumcount).

@samukweku samukweku self-assigned this Nov 4, 2021
@samukweku
Copy link
Contributor

my knowledge of C++ shows through this; I'll unassign myself; hopefully someone else can pick this up

@samukweku
Copy link
Contributor

this should be resolved by #3279

oleksiyskononenko pushed a commit that referenced this issue Jul 26, 2022
samukweku added a commit that referenced this issue Aug 4, 2022
samukweku added a commit that referenced this issue Aug 10, 2022
@st-pasha st-pasha added this to the Release 1.1.0 milestone Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants