group function #815

matthieugomez · 2015-06-16T16:09:56Z

The group function combines multiple columns of a dataframe into a single PooledDataArray column. One use case is creating groups based on multiple variables before fitting a model.
The function essentially rewraps code from groupby. I'm not sure whether this functionality already exists.

tshort · 2015-06-16T23:37:40Z

Needs tests and docs...

matthieugomez · 2015-06-17T01:31:03Z

I have added tests and documentation. I have also added a skipna option: should observations with NA in any of the columns be grouped into an NA group? The option defaults to true, which returns NA for any observation where one of the original columns is NA. This is generally what the user wants (especially when using poolall before fitting models). I don't know the usual name for NA options in Julia, so tell me if skipna does not sound right.
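For readers following along, the behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `group_codes` is a hypothetical name, plain vectors stand in for dataframe columns, `missing` stands in for NA, and the result is a plain vector of integer codes rather than a PooledDataArray.

```julia
# Hypothetical helper (not the PR's implementation): assign one integer code
# per distinct combination of values across the given columns.
# `missing` is used here as a stand-in for NA.
function group_codes(cols::AbstractVector...; skipna::Bool = true)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(ArgumentError("columns must have equal length"))
    seen = Dict{Any,Int}()                        # combination => group code
    codes = Vector{Union{Missing,Int}}(undef, n)
    for i in 1:n
        key = map(c -> c[i], cols)                # tuple of the i-th row's values
        if skipna && any(ismissing, key)
            codes[i] = missing                    # row is not assigned to any group
        else
            # Reuse the existing code for this combination, or assign the next one.
            codes[i] = get!(seen, key, length(seen) + 1)
        end
    end
    return codes
end

v1 = [1, 1, 2, missing]
v2 = ["a", "a", "b", "b"]

with_skip = group_codes(v1, v2)                   # last row stays missing
no_skip   = group_codes(v1, v2; skipna = false)   # last row gets its own group
```

With `skipna = true`, the row containing a missing value gets no group; with `skipna = false`, the combination `(missing, "b")` is treated as a group like any other.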

matthieugomez · 2015-06-17T23:12:49Z

I have updated my commit to fix a bug I spotted in groupby, which produces incorrect groups in the following case:

    df = DataFrame(v1 = pool(1:1000), v2 = pool(fill(1, 1000)))
    groupby(df, [:v1, :v2])

Grouping by v1 and v2 should give 1000 groups, but only 255 groups are created. This happens because the refs of the last column of df have element type Uint8, and this type is not promoted when grouping by v1 and v2, even though the number of groups exceeds 255. I now promote this vector to Uint32 in all cases for intermediate computations, and I have added a test that catches this case.
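The overflow can be illustrated without DataFrames. The sketch below is a simplified stand-in for the grouping arithmetic, written for current Julia (where the types are spelled `UInt8` and `UInt32`); `combine_refs` is a hypothetical name, and the `% T` truncation is an assumption used to mimic an accumulator that is too narrow for the combined code. It does not reproduce how the original code handles NA, so the exact group count differs from the 255 reported above.

```julia
# Combine two columns of integer reference codes into one code per row,
# storing the result in the unsigned integer type T.
# Simplified illustration only, not the DataFrames source.
function combine_refs(::Type{T}, refs1, refs2, nlevels2) where {T<:Unsigned}
    # `% T` truncates the combined code to T, i.e. it wraps modulo 2^(bits in T).
    return [((Int(r1) - 1) * nlevels2 + Int(r2)) % T for (r1, r2) in zip(refs1, refs2)]
end

refs1 = UInt16.(1:1000)          # 1000 levels need more than 8 bits
refs2 = fill(UInt8(1), 1000)     # a single level fits in 8 bits

narrow = combine_refs(UInt8, refs1, refs2, 1)    # accumulator too small
wide   = combine_refs(UInt32, refs1, refs2, 1)   # promoted accumulator

n_narrow = length(unique(narrow))   # distinct codes collapse after wrapping
n_wide   = length(unique(wide))     # one code per distinct combination
```

With the 8-bit accumulator, distinct combinations collide because their codes wrap around; with the wider type, all 1000 combinations keep distinct codes.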

matthieugomez · 2015-06-19T22:20:04Z

I think it's good on my end now.

matthieugomez and others added 4 commits June 16, 2015 12:04

Create a pooled data array from multiple columns

3eabe79

export

42fc0ee

don’t use dropunused

bc54c4b

add ; for julia 0.4

2058036

test + doc + rename + skipna option

b863e1f

matthieugomez changed the title ~~Group function~~ poolall function Jun 17, 2015

change default option

152782c

matthieugomez added 4 commits June 17, 2015 18:51

Update grouping.jl

0fde93d

Group using Uint32

Update DataFrames.jl

9cfc221

remove poolall

Update pooling.md

6eef51e

change group in docs

Update grouping.jl

c2234f1

Change name in test + add refs type test

matthieugomez changed the title ~~poolall function~~ group function Jun 17, 2015

Update grouping.jl

e96d340

add == for test

garborg mentioned this pull request Jun 18, 2015

Anti join not working as intended for joins on multiple keys? #821

Closed

100 -> 1000

a038c51

matthieugomez added 4 commits June 22, 2015 09:12

use Uint64 rather than Uint32

3cb5ae5

simplify group

873f959

factorize wo/ missing

80b2c20

RefArray -> DataArrays.RefArray

3f60b64

matthieugomez closed this Aug 22, 2015

matthieugomez mentioned this pull request Aug 22, 2015

Avoid overflow #862

Closed
