Decile calculation with "ntile"... #161

Closed
coforfe opened this issue Dec 11, 2022 · 7 comments

coforfe commented Dec 11, 2022

Hi,

Thanks for your excellent package to port R (dplyr) flow of processing to Python. I have been using other alternatives, and yours is the one that offers the most extensive and equivalent to what is possible now with dplyr.

I have an issue with how ntile() calculates the different groups for a vector of probabilities ("p2").

This is the output of that calculation.

from datar.all import f, select, mutate, ntile, count
kk = trainnonFun >> select(f.p2) >> mutate(decil=ntile(f.p2, n=10))
kk
            p2      decil
      <float64> <category>
6535   0.971462         10
7523   0.971462         10
48441  0.970154         10
48417  0.970154         10
...         ...        ...
13971  0.970154         10
38140  0.409739          1
13400  0.409739          1
45999  0.405575          1
26150  0.372226          1
29939  0.357850          1

But when you count how many values fall in each bucket, the result looks strange:

pp = kk >> count(f.decil)
pp
      decil       n
  <category> <int64>
0          1       7
1          2     542
2          3    1361
3          4     924
4          5    1240
5          6    1655
6          7    3080
7          8    2647
8          9    1571
9         10    1345

The bucket sizes are very uneven, ranging from 7 to 3080 rows.

For the sake of reproducibility, this file contains that dataframe with the probabilities and the calculated decile.

For now, I am calculating the deciles with the pandas qcut() function, which gives the expected output, with a much more balanced number of elements in each bucket.
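
As a rough illustration of the qcut() workaround described above, here is a minimal sketch on synthetic data. The `p2` values are randomly generated stand-ins (an assumption), not the data from the attached file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "p2" column (assumption: not the real data).
rng = np.random.default_rng(0)
p2 = pd.Series(rng.beta(5, 2, size=1000), name="p2")

# qcut splits at quantiles, so each of the 10 buckets gets about the same
# number of rows. labels=False yields codes 0..9, shifted here to 1..10.
decil = pd.qcut(p2, q=10, labels=False) + 1

# Number of rows per decile, ordered by decile number.
counts = decil.value_counts().sort_index()
print(counts)
```

With continuous values and no ties, the bucket sizes should differ by at most one row.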

Thanks again,
Carlos.

Owner

pwwang commented Dec 12, 2022

Hi, could you show the versions of datar:

from datar import get_versions
get_versions()

Author

coforfe commented Dec 12, 2022

Hi,
Yes, this is what I get.

>>> from datar import get_versions
>>> get_versions()
python      : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
datar       : 0.10.2
simplug     : 0.2.1
executing   : 1.2.0
pipda       : 0.10.0
datar-numpy : 0.0.0
numpy       : 1.23.4
datar-pandas: 0.0.0
pandas      : 1.5.2

Thanks!
Carlos.

pwwang added a commit to pwwang/datar-pandas that referenced this issue Dec 12, 2022
Owner

pwwang commented Dec 12, 2022

This is a nice catch!
The ntile implementation should use pd.qcut instead of pd.cut.
It should be fixed in datar-pandas v0.1.1.
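
To illustrate the difference on made-up data: pd.cut splits the value range into equal-width intervals, while pd.qcut splits at quantiles so that bucket sizes come out roughly equal. This is only a sketch, not the datar-pandas source, and the skewed sample below is an assumption standing in for p2.

```python
import numpy as np
import pandas as pd

# Skewed synthetic sample (assumption), loosely resembling probabilities near 1.
rng = np.random.default_rng(0)
x = pd.Series(rng.beta(8, 2, size=1000))

# Equal-width bins: the dense region of the data ends up in a few buckets.
equal_width = pd.cut(x, bins=10, labels=False) + 1

# Quantile bins: edges follow the data, so buckets hold similar row counts.
equal_count = pd.qcut(x, q=10, labels=False) + 1

width_counts = equal_width.value_counts().sort_index()
quant_counts = equal_count.value_counts().sort_index()

print("pd.cut bucket sizes: ", width_counts.tolist())
print("pd.qcut bucket sizes:", quant_counts.tolist())
```

On skewed data like this, the pd.cut sizes vary widely while the pd.qcut sizes stay close to each other, which matches the symptom reported in this issue.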

Try updating datar by:

pip install -U datar[pandas]

and then run get_versions() again to confirm that datar-pandas v0.1.1 is installed.
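
If get_versions() is not handy, the installed version can also be read with the standard library. A minimal sketch: `installed_version` is a hypothetical helper name, and `datar-pandas` is the distribution name as it appears in the get_versions() output earlier in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None if the package is not installed in the current environment.
print("datar-pandas:", installed_version("datar-pandas"))
```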

Owner

pwwang commented Dec 12, 2022

By the way, thanks for the compliments:

Thanks for your excellent package to port R (dplyr) flow of processing to Python. I have been using other alternatives, and yours is the one that offers the most extensive and equivalent to what is possible now with dplyr.

Do you mind if I put it as a testimonial in the README file?

pwwang added the bug label Dec 12, 2022
pwwang self-assigned this Dec 12, 2022
Author

coforfe commented Dec 12, 2022

Thanks a lot for your quick fix!

No, I do not mind at all.
Thanks to you.
Carlos.

Owner

pwwang commented Dec 12, 2022

Thanks!

Please confirm if this is fixed and feel free to close it if so.
Feel free to open new issues if you have other questions.

Author

coforfe commented Dec 12, 2022

Thanks,

Yes, I have just updated datar following your instructions and now the problem is fixed.

       decil       n
  <category> <int64>
0          1    1438
1          2    1437
2          3    1437
3          4    1437
4          5    1437
5          6    1439
6          7    1435
7          8    1585
8          9    1293
9         10    1434
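
The counts above are balanced but not identical (for example 1585 versus 1293). One possible cause, not confirmed in this thread, is tied values: quantile binning keeps equal values in the same bucket, whereas binning on row ranks splits ties. A small sketch on toy data (the values are hypothetical, not taken from the issue's file):

```python
import pandas as pd

# Toy data with a run of tied values (hypothetical).
x = pd.Series([1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 8, 9])

# Binning on the values: all four 4s must share one bucket.
by_value = pd.qcut(x, q=3, labels=False) + 1

# Binning on first-come ranks: ties get distinct ranks, so they can be split.
by_rank = pd.qcut(x.rank(method="first"), q=3, labels=False) + 1

value_sizes = by_value.value_counts().sort_index().tolist()
rank_sizes = by_rank.value_counts().sort_index().tolist()

print(value_sizes)  # [7, 1, 4]
print(rank_sizes)   # [4, 4, 4]
```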

Thanks again,
Carlos.

coforfe closed this as completed Dec 12, 2022