New naming scheme for the missing diagnostics / summary functions #38

njtierney · 2017-01-05T00:23:38Z

Currently I'm finding it a bit hard to remember which function gives which summary of the missing data.

I am moving towards the format miss_type_value/fun, because it makes more sense to me when tabbing through functions.

miss_* = I want to explore missing values

miss_case_* = I want to explore missing cases

miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing
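To make the three descriptions above concrete, here is a rough base-R sketch of what each kind of function could compute. These are hypothetical stand-ins, not naniar's implementation: the `sketch_` prefix, the column names, and the function bodies are all assumptions made for illustration.

```r
# Hypothetical base-R stand-ins, not naniar's code.
# The sketch_ prefix marks them as illustrative only.

# Percentage of cases (rows) that contain at least one missing value
sketch_miss_case_pct <- function(data) {
  mean(!complete.cases(data)) * 100
}

# Number and percentage of missing values in each case
sketch_miss_case_summary <- function(data) {
  n_missing <- unname(rowSums(is.na(data)))
  data.frame(
    case = seq_len(nrow(data)),
    n_missing = n_missing,
    percent = n_missing / ncol(data) * 100
  )
}

# Tabulation: how many cases have 0, 1, 2, ... missing values
sketch_miss_case_table <- function(data) {
  counts <- table(rowSums(is.na(data)))
  data.frame(
    n_missing_in_case = as.integer(names(counts)),
    n_cases = as.integer(counts),
    percent = as.numeric(counts) / nrow(data) * 100
  )
}

sketch_miss_case_pct(airquality)
head(sketch_miss_case_summary(airquality))
sketch_miss_case_table(airquality)
```

With a shared `miss_case_` prefix, typing the prefix and pressing tab would list all three side by side, which is the point of the proposed scheme.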

This is more consistent and easier to reason about. I will not be providing .Deprecated for these functions: naniar is still in its early days, so the renaming shouldn't break much analysis code, and any breakage is easy to fix.

percent_missing_case()  --> miss_case_pct
percent_missing_var()   --> miss_var_pct
percent_missing_df()    --> miss_df_pct

summary_missing_case()  --> miss_case_summary
summary_missing_var()   --> miss_var_summary

table_missing_case()   --> miss_case_table
table_missing_var()    --> miss_var_table


seasmith · 2017-01-05T19:11:08Z

I'd like to add two points to the discussion of names in general (as far as things stand in v 0.0.4.9000).

I feel that the table_missing_* functions feel more like aggregated summaries and the summary_missing_* functions feel more like mutated tables. My instinctual feeling on those particular function names is that they should be reversed; as though the summary_* functions are aggregations of the data and the table_* functions are aggregations of the initial summaries.
Perhaps one way to clear up the confusion would be to reorder the columns by subject (variable; missing; percent) in the table_missing_* and the summary_missing_* functions.

# Current respective column name ordering for:
  # [1] `table_missing_var()`
  # [2] `summary_missing_var()`:

#> [1] "n_missing_in_var"  "n_vars"        "percent"         
#> [2] "variable"          "n_missing"     "percent"

# Reordering by subject:

#> [1] "n_vars"    "n_missing_in_var"  "percent"         
#> [2] "variable"  "n_missing"         "percent"

njtierney · 2017-01-05T23:57:50Z

Thanks for your comments, @seasmith, good to get another opinion on this.

The table_missing_* functions were originally given the prefix table because I needed to distinguish between the two sets of functions, and thought that they behave more like the table command in R, which performs counts and cross tabulations.

I'm not entirely convinced that reordering the columns makes things easier to understand, although I do like the consistency.

library(naniar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Same table as table_missing_var(), with the columns reordered by subject
table_missing_var_2 <- function(x){
    table_missing_var(x) %>%
        select(n_vars,
               n_missing_in_var,
               percent)
}

table_missing_var(airquality)
#> # A tibble: 3 × 3
#>   n_missing_in_var n_vars  percent
#>              <int>  <int>    <dbl>
#> 1                0      4 66.66667
#> 2                7      1 16.66667
#> 3               37      1 16.66667

summary_missing_var(airquality)
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000


table_missing_var_2(airquality)
#> # A tibble: 3 × 3
#>   n_vars n_missing_in_var  percent
#>    <int>            <int>    <dbl>
#> 1      4                0 66.66667
#> 2      1                7 16.66667
#> 3      1               37 16.66667

summary_missing_var(airquality)
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000

Do you have any strong opinion about the renaming I proposed?

percent_missing_case()  --> miss_case_pct
percent_missing_var()   --> miss_var_pct
percent_missing_df()    --> miss_df_pct

summary_missing_case()  --> miss_case_summary
summary_missing_var()   --> miss_var_summary

table_missing_case()   --> miss_case_table
table_missing_var()    --> miss_var_table

Thanks again for your input, much appreciated!

seasmith · 2017-01-07T03:29:57Z

I like the new naming scheme. Ought to help make tab completion a lot easier.

njtierney · 2017-01-08T09:38:25Z

OK great, just prepping the release for this now. I'm going to go ahead and use .Deprecated after all, because it is best practice and I want to get used to doing something like this.
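A minimal sketch of what such a deprecation shim could look like, using base R's `.Deprecated()`. The body of `miss_case_pct()` here is a stand-in assumed for illustration, not the package's actual code.

```r
# Stand-in for the new function; the body is an assumption for illustration
miss_case_pct <- function(data) {
  mean(!complete.cases(data)) * 100
}

# Old name kept as a thin wrapper: it warns, then delegates to the new name
percent_missing_case <- function(data) {
  .Deprecated("miss_case_pct")
  miss_case_pct(data)
}

# Still returns a result, but signals a deprecation warning pointing
# the caller at miss_case_pct()
percent_missing_case(airquality)
```

Keeping the old names as wrappers like this means existing analysis code keeps running while users are nudged towards the miss_* names.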

njtierney · 2017-06-15T05:42:18Z

Thank you again @seasmith for your input! :)

njtierney closed this as completed Jun 15, 2017
