A better summary function #1514
Comments
I’m not sure |
Of course this has been done before. I used a modified version of The function is only a couple of lines, giving a smaller output and eliminating the need to load another package. By the same logic the |
I think the main challenge is to think through what makes sense for logical, character, factor, and date/time variables, and how many different columns you end up needing. @rpruim any thoughts? Does mosaic have something like this? |
Here is my opinion on this. (My code is not perfect. :-))
So I would suggest the same fields as in the current function, perhaps with the addition of an NA count column if any NAs exist. |
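As a rough illustration of the "NA count column only if any exist" idea, here is a minimal base R sketch. The helper name `describe_sketch` and the field set (`n`, `mean`, `min`, `max`, `n_na`) are assumptions made for this example; they are not the fields of the function discussed in this thread.

```r
# Hypothetical helper: one row per numeric column, with an NA count column
# that is added only when at least one value is missing.
describe_sketch <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  out <- data.frame(
    variable = names(num),
    n        = vapply(num, function(x) sum(!is.na(x)), integer(1)),
    mean     = vapply(num, mean, numeric(1), na.rm = TRUE),
    min      = vapply(num, min, numeric(1), na.rm = TRUE),
    max      = vapply(num, max, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
  n_missing <- vapply(num, function(x) sum(is.na(x)), integer(1))
  # Only show the NA column when it carries information.
  if (any(n_missing > 0)) out$n_na <- unname(n_missing)
  out
}

describe_sketch(airquality)  # has missing values, so n_na appears
describe_sketch(mtcars)      # no missing values, so n_na is omitted
```

The trade-off is that the set of output columns then depends on the data, which is convenient interactively but less predictable in scripts.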
We do provide
dfapply(KidsFeet, favstats)
## $birthmonth
##  min Q1 median Q3 max     mean      sd  n missing
##    1  3      6  9  12 6.102564 3.36229 39       0
##
## $birthyear
##  min Q1 median Q3 max     mean        sd  n missing
##   87 88     88 88  88 87.82051 0.3887764 39       0
##
## $length
##   min Q1 median   Q3  max     mean       sd  n missing
##  21.6 24   24.5 25.6 27.5 24.72308 1.317586 39       0
##
## $width
##  min   Q1 median   Q3 max     mean        sd  n missing
##  7.9 8.65      9 9.35 9.8 8.992308 0.5095843 39       0
This makes it possible to do this:
do.call(rbind, dfapply(KidsFeet, favstats))
##             min    Q1 median    Q3  max      mean        sd  n missing
## birthmonth  1.0  3.00    6.0  9.00 12.0  6.102564 3.3622899 39       0
## birthyear  87.0 88.00   88.0 88.00 88.0 87.820513 0.3887764 39       0
## length     21.6 24.00   24.5 25.60 27.5 24.723077 1.3175858 39       0
## width       7.9  8.65    9.0  9.35  9.8  8.992308 0.5095843 39       0
(@hadley: using If dfapply( data, inspect, select = TRUE) to get a list of summaries, which could perhaps be wrapped up into something for a better display. Tabular displays are tricky, however, since the various data types should have different summaries. (I do not like applying standard numerical summaries to factors. I'd rather know things like how many levels, proportions of most common levels, etc.) So other than One option would be to have separate tables for each variable type. |
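For readers without mosaic installed, the "list of one-row summaries, then `do.call(rbind, ...)`" pattern can be sketched in base R. `fav_sketch` is a hypothetical stand-in written for this example (it is not mosaic's `favstats`, and its quantile method may differ), and the built-in `iris` data replaces `KidsFeet`.

```r
# Hypothetical stand-in for a favstats-style summary of one numeric vector.
# Returns a one-row data frame so that rows can be stacked with rbind().
fav_sketch <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  data.frame(
    min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
    n = sum(!is.na(x)), missing = sum(is.na(x))
  )
}

numeric_cols <- Filter(is.numeric, iris)          # drop the factor column
per_column   <- lapply(numeric_cols, fav_sketch)  # named list of one-row data frames
stacked      <- do.call(rbind, per_column)        # list names become row names
print(stacked)
```

Because every element has the same columns, `rbind` can stack them, and the variable names end up as row names, which is the behaviour the tabular output above relies on.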
@hadley, is there a fundamental reason why something like
data %>%
  summarise(favstats(variable))
can't be made to work? I'm imagining a use case where the function ( |
@rpruim see #154 - I'm not fundamentally opposed to it, we just haven't worked out a nice interface for it (and whether it's different enough from summarise to be its own verb). The decision to drop row names in |
Here's a proof of concept
inspect(Births78)
##
## categorical variables:
##   name   class levels missing   n distribution
## 1 wday ordered      7       0 365 Sun (14.5%), Mon (14.2%), Tues (14.2%) ...
##
## quantitative variables:
##        name   class  min   Q1 median   Q3   max mean  sd   n missing
## 1    births integer 7135 8554   9218 9705 10711 9132 818 365       0
## 2 dayofyear integer    1   92    183  274   365  183 106 365       0
##
## time variables:
##   name   class      first       last min_diff max_diff missing   n
## 1 date POSIXct 1978-01-01 1978-12-31        1        1       0 365
See ProjectMOSAIC/mosaic#544 for further developments. |
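The "separate tables per variable type" approach can be sketched in a few lines of base R. This is not mosaic's `inspect`; `inspect_sketch` and its column choices are assumptions for illustration, it handles only categorical and numeric columns, and it assumes each categorical column has at least one non-missing value.

```r
# Hypothetical helper: split columns by type and summarise each group
# with statistics that make sense for that type.
inspect_sketch <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  is_num <- vapply(df, is.numeric, logical(1))

  categorical <- do.call(rbind, lapply(names(df)[is_cat], function(nm) {
    x <- df[[nm]]
    tab <- sort(table(x), decreasing = TRUE)  # counts, most common first
    data.frame(
      name = nm, class = class(x)[1], levels = length(tab),
      n = sum(!is.na(x)), missing = sum(is.na(x)),
      most_common = names(tab)[1],
      share = round(100 * tab[[1]] / sum(tab), 1)  # percent of non-missing values
    )
  }))

  quantitative <- do.call(rbind, lapply(names(df)[is_num], function(nm) {
    x <- df[[nm]]
    data.frame(
      name = nm, class = class(x)[1],
      min = min(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
      max = max(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE),
      n = sum(!is.na(x)), missing = sum(is.na(x))
    )
  }))

  list(categorical = categorical, quantitative = quantitative)
}

res <- inspect_sketch(iris)
print(res$categorical)
print(res$quantitative)
```

Returning a list of data frames keeps each table rectangular, at the cost of not having a single object with one row per variable.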
Here is a modified version of the
|
I would also suggest adding a column to display variable labels, or even another column to display whether a variable has value labels (from the
# devtools::install_github("larmarange/labelled")
# devtools::install_github("jjchern/meda")
> library(meda)
> nlsw88 = haven::read_dta("http://www.stata-press.com/data/r13/nlsw88.dta")
>
> cb(nlsw88) # `cb` is short for `codebook`, and it shows summary statistics of a data frame
Source: local data frame [17 x 8]
   var             obs unique    mean std.dev   min     max var_label
   (chr)         (int)  (int)   (dbl)   (dbl) (dbl)   (dbl) (chr)
 1 idcode         2246   2246 2612.65 1480.86  1.00 5159.00 NLS id
 2 age            2246     13   39.15    3.06 34.00   46.00 age in current year
 3 race           2246      3    1.28    0.48  1.00    3.00 race
 4 married        2246      2    0.64    0.48  0.00    1.00 married
 5 never_married  2246      2    0.10    0.31  0.00    1.00 never married
 6 grade          2244     16   13.10    2.52  0.00   18.00 current grade completed
 7 collgrad       2246      2    0.24    0.43  0.00    1.00 college graduate
 8 south          2246      2    0.42    0.49  0.00    1.00 lives in south
 9 smsa           2246      2    0.70    0.46  0.00    1.00 lives in SMSA
10 c_city         2246      2    0.29    0.45  0.00    1.00 lives in central city
11 industry       2232     12    8.19    3.01  1.00   12.00 industry
12 occupation     2237     13    4.64    3.41  1.00   13.00 occupation
13 union          1878      2    0.25    0.43  0.00    1.00 union worker
14 wage           2246    967    7.77    5.76  1.00   40.75 hourly wage
15 hours          2242     62   37.22   10.51  1.00   80.00 usual hours worked
16 ttl_exp        2246   1546   12.53    4.61  0.12   28.88 total work experience
17 tenure         2231    259    5.98    5.51  0.00   25.92 job tenure (years)
>
> d(nlsw88) # `d` is short for `describe`, and it shows variable labels and whether value labels exist for certain variables
Source: local data frame [17 x 6]
   var           type  class val_label label                     head
   (chr)         (chr) (chr) (lgl)     (chr)                     (chr)
 1 idcode        int   int   FALSE     NLS id                    1 2 3 4 6...
 2 age           int   int   FALSE     age in current year       37 37 42 43 42...
 3 race          int   lbl   TRUE      race                      2 2 2 1 1...
 4 married       int   lbl   TRUE      married                   0 0 0 1 1...
 5 never_married int   int   FALSE     never married             0 0 1 0 0...
 6 grade         int   int   FALSE     current grade complete... 12 12 12 17 12...
 7 collgrad      int   lbl   TRUE      college graduate          0 0 0 1 0...
 8 south         int   int   FALSE     lives in south            0 0 0 0 0...
 9 smsa          int   lbl   TRUE      lives in SMSA             1 1 1 1 1...
10 c_city        int   int   FALSE     lives in central city...  0 1 1 0 0...
11 industry      int   lbl   TRUE      industry                  5 4 4 11 4...
12 occupation    int   lbl   TRUE      occupation                6 5 3 13 6...
13 union         int   lbl   TRUE      union worker              1 1 NA 1 0...
14 wage          dbl   nmr   FALSE     hourly wage               11.73912525177 6.40096...
15 hours         int   int   FALSE     usual hours worked        48 40 40 42 48...
16 ttl_exp       dbl   nmr   FALSE     total work experience...  10.3333339691162 13.62...
17 tenure        dbl   nmr   FALSE     job tenure (years)        5.33333349227905 5.25 ...
>
> # Note that there's a value label for the variable "race", thus we can check out the values
> labelled::val_labels(nlsw88$race)
white black other
    1     2     3
(Sorry for the command prompts and not using |
Not a fan of cryptic 1- and 2-letter function names. If everyone did that, we would have even more name collisions than we already have. Regarding the display of labels, I'm not sure they belong here as they tend to be long and so steal space from the other things we want to see. But I'm still deciding just what output to include.
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, :
‘min’ not meaningful for factors |
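The error above is what base R raises when `min()` is applied to an unordered factor. A small sketch of how a summary function might guard against it follows; `safe_min` is a hypothetical helper for this example, not part of any package mentioned here.

```r
f <- factor(c("a", "b", "a"))

# Calling min() on an unordered factor raises the error shown above;
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(min(f), error = function(e) conditionMessage(e))
print(msg)

# Hypothetical guard: return NA for anything that is not numeric.
safe_min <- function(x) {
  if (is.numeric(x)) min(x, na.rm = TRUE) else NA_real_
}

safe_min(f)           # NA, no error
safe_min(c(3, 1, 2))  # 1
```

Checking the type before computing keeps a single summary table usable for mixed data frames, though it fills non-numeric rows with NA rather than with statistics suited to factors.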
The default output should be as simple as possible, I think. Things like labels, quartiles and medians are good to have at times, but to me they are optional. Standard deviation is also something that I think should be optional. It is necessary in printed output (as are labels), but when working with data I look for things like strange means or many NAs; I don't think the standard deviation is something I look at. But as I said, this is only my opinion... |
I think there's enough difference of opinion here to suggest that it's best such a function live in another package. |
I agree that short function names are really bad. Coming from Stata, I've gotten used to short commands and seeing quick results. I'll fix the error for factors. I also agree that the default output should be as simple as possible. Ultimately, I would really like to see something in the |
"As simple as possible, but no simpler." (Einstein) Sacrificing reasonableness or usefulness for simplicity is probably not a good trade. Computing means of factors is mostly silly. Treating everything as numeric is not a good direction. So I think you are left with separate treatment for (groups of) types of variables, or a summary that includes only things that always make sense (type, class, n, missing, etc) much like As @hadley suggests, there may be multiple takes on what such a summary should include and how it should be formatted. Indeed there are already a number of these floating around -- including one more now that I've created one in One advantage of having in |
Thanks for discussing the issue. I hope that the proof of concept of @rpruim is included in the mosaic package. Even though his suggestion is not as good as mine :-), it is already much better than the alternatives. |
If anyone arrives here in 2017, there is also the skimr package: |
I think that dplyr would benefit from having a function summarizing the data frame variables. It is surprising that the R base package has nothing better than the `summary` function to provide an overview of a data frame. In dplyr one can look at the data with, for example, `glimpse` or `head`, but a concise display of key summary statistics would make data management easier. Summary statistics can provide more information than the raw data. For example, one way to see that a join does not work is to look at the number of NA values.

I have a small function `describe()` available here that shows what I mean. It takes the same arguments as `dplyr::select()`, and produces summary statistics as a data frame.

The output is not only useful for looking at the data in R. It is also close to what is often shown as Table 1 in journal articles (at least in applied econometrics).
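To make the join example concrete, here is a minimal sketch of spotting a failed join through NA counts. The two data frames are made up for illustration, and base R's `merge()` is used so the snippet has no dependencies; `dplyr::left_join(orders, customers, by = "id")` would play the same role.

```r
orders <- data.frame(id = c(1, 2, 3, 4), amount = c(10, 20, 30, 40))
# Hypothetical lookup table whose keys only partly match orders$id.
customers <- data.frame(id = c(1, 2, 30, 40), region = c("N", "S", "E", "W"))

# Left join: keep every row of `orders`, fill unmatched rows with NA.
joined <- merge(orders, customers, by = "id", all.x = TRUE)

# Count missing values per column after the join.
na_counts <- colSums(is.na(joined))
print(na_counts)
```

A nonzero count in a column that had no missing values before the join (here `region`) signals keys that failed to match, which is hard to see from `head()` alone.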