group function #815

Closed
wants to merge 16 commits into from


matthieugomez
Contributor

The group function combines multiple columns of a DataFrame into a single PooledDataArray column. A typical use case is creating groups based on several variables before fitting a model.
The function essentially rewraps code from groupby. I'm not sure whether this functionality already exists.

@tshort
Contributor

tshort commented Jun 16, 2015

Needs tests and docs...

@matthieugomez matthieugomez changed the title Group function poolall function Jun 17, 2015
@matthieugomez
Contributor Author

I have added tests and docs. I have also added a skipna option: should observations where some column equals NA be grouped into an NA group? The option defaults to true, which returns NA for any observation where one of the original columns equals NA. This is generally what the user wants (especially when using poolall before fitting models). I don't know the usual name for NA options in Julia, so tell me if skipna does not sound right.
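To make the skipna behaviour described above concrete, here is a minimal sketch in plain Julia. It is not the code in this PR: `group_codes` is a hypothetical name, Base's `missing` stands in for NA, and a plain vector of codes stands in for a PooledDataArray.

```julia
# Hypothetical sketch, not the code in this PR: assign one integer code per
# distinct combination of values across the given columns.
function group_codes(cols::AbstractVector...; skipna::Bool = true)
    n = length(first(cols))
    codes = Vector{Union{Missing, UInt32}}(undef, n)
    seen = Dict{Any, UInt32}()          # row tuple => group code
    for i in 1:n
        key = map(c -> c[i], cols)      # tuple of this row's values
        if skipna && any(ismissing, key)
            codes[i] = missing          # row has a missing value: no group
        else
            # First time this combination is seen, give it the next code.
            codes[i] = get!(seen, key, UInt32(length(seen) + 1))
        end
    end
    return codes
end

a = [1, 1, 2, missing, 2]
b = ["x", "x", "y", "y", "y"]

with_skip    = group_codes(a, b)                  # row 4 gets `missing`
without_skip = group_codes(a, b; skipna = false)  # row 4 gets its own group
```

With `skipna = true` the fourth row, whose first column is missing, receives no group. With `skipna = false` the combination `(missing, "y")` is treated as a group of its own.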

Group using Uint32
remove poolall
change group in docs
Change name in test + add refs type test
@matthieugomez matthieugomez changed the title poolall function group function Jun 17, 2015
add == for test
@matthieugomez
Contributor Author

I have updated my commit to fix a bug I spotted in groupby, which gives incorrect groups in the following case:

   df = DataFrame(v1 = pool(1:1000), v2 = pool(fill(1, 1000))) 
   groupby(df, [:v1, :v2])

Grouping by v1 and v2 should give 1000 groups, but only 255 groups are created. This happens because the element type of refs for the last column of df is Uint8, and this type is not promoted when grouping by v1 and v2, even though the number of groups exceeds 255. I now promote this vector to Uint32 in all cases for intermediate computations, and I have added a test to catch this case.
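The failure mode can be illustrated without DataFrames. The sketch below is an assumption about the general mechanism, not the actual groupby code: `combine_refs` is a hypothetical helper that merges two vectors of 1-based reference codes while keeping all arithmetic in a given unsigned type. It is written with current Julia spelling (`UInt8`, `UInt32`) and is not meant to reproduce the exact group count reported above.

```julia
# Hypothetical helper, not the DataFrames implementation: combine two
# vectors of 1-based reference codes into one code per row.
function combine_refs(::Type{T}, refs1, refs2, nlevels2) where {T<:Unsigned}
    # `% T` truncates to T, so every operation below wraps at typemax(T).
    [((r1 - 1) % T) * (nlevels2 % T) + (r2 % T) for (r1, r2) in zip(refs1, refs2)]
end

refs1 = collect(1:1000)   # 1000 distinct levels, like v1
refs2 = fill(1, 1000)     # a single level, like v2

narrow = combine_refs(UInt8, refs1, refs2, 1)    # codes wrap around
wide   = combine_refs(UInt32, refs1, refs2, 1)   # codes stay distinct

n_narrow = length(unique(narrow))
n_wide   = length(unique(wide))
```

Here `n_wide` equals 1000, while `n_narrow` cannot exceed 256 because distinct combinations collide once the codes wrap. Promoting to a wider type before combining, as the fix does, avoids the collisions.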

@matthieugomez
Contributor Author

I think it's good on my end now.

@matthieugomez matthieugomez mentioned this pull request Aug 22, 2015
2 participants