New naming scheme for the missing diagnostics / summary functions #38

Closed
njtierney opened this issue Jan 5, 2017 · 5 comments

Comments

@njtierney
Owner

Currently I'm finding it a bit hard to remember which function gives which summary of the missing data.

I am moving towards the format miss_type_value/fun, because it makes more sense to me when tabbing through functions.

miss_* = I want to explore missing values

miss_case_* = I want to explore missing cases

miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing
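
To make the three miss_case_* ideas concrete, here is a minimal base-R sketch of what each one is meant to compute. The `_sketch` function names and the output column names are hypothetical, chosen for illustration; this is not naniar's implementation.

```r
# Hypothetical base-R sketches of the three ideas; not naniar's code.

# Percentage of cases (rows) that contain at least one missing value
miss_case_pct_sketch <- function(df) {
  mean(!stats::complete.cases(df)) * 100
}

# Number and percentage of missing values in each case (row)
miss_case_summary_sketch <- function(df) {
  n_missing <- unname(rowSums(is.na(df)))
  data.frame(
    case = seq_len(nrow(df)),
    n_missing = n_missing,
    percent = n_missing / ncol(df) * 100
  )
}

# Tabulation: how many cases have 0, 1, 2, ... missing values
miss_case_table_sketch <- function(df) {
  counts <- table(rowSums(is.na(df)))
  data.frame(
    n_missing_in_case = as.integer(names(counts)),
    n_cases = as.integer(counts),
    percent = as.numeric(counts) / nrow(df) * 100
  )
}

miss_case_pct_sketch(airquality)
head(miss_case_summary_sketch(airquality))
miss_case_table_sketch(airquality)
```

All three start from the same row-wise view of `is.na()`, which is why a shared miss_case_ prefix groups them naturally under tab completion.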

This is more consistent and easier to reason about. I will not be using .Deprecated for these functions: naniar is still in its early days, and renaming these functions shouldn't break much analysis code, and any breakage is easy to fix.

percent_missing_case()  --> miss_case_pct
percent_missing_var()   --> miss_var_pct
percent_missing_df()    --> miss_df_pct

summary_missing_case()  --> miss_case_summary
summary_missing_var()   --> miss_var_summary

table_missing_case()    --> miss_case_table
table_missing_var()     --> miss_var_table
@seasmith
Contributor

seasmith commented Jan 5, 2017

I'd like to add two points to the discussion of names in general (as far as things stand in v 0.0.4.9000).

  1. To me, the table_missing_* functions feel more like aggregated summaries, and the summary_missing_* functions feel more like mutated tables. My instinct is that those names should be reversed, as though the summary_* functions were aggregations of the data and the table_* functions were aggregations of the initial summaries.

  2. Perhaps one way to clear up confusion would be to reorder the column names by subject (variable; missing; percent) in the table_missing_* and the summary_missing_* functions.

# Current respective column name ordering for:
  # [1] `table_missing_var()`
  # [2] `summary_missing_var()`:

#> [1] "n_missing_in_var"  "n_vars"        "percent"         
#> [2] "variable"          "n_missing"     "percent"

# Reordering by subject:

#> [1] "n_vars"    "n_missing_in_var"  "percent"         
#> [2] "variable"  "n_missing"         "percent"

@njtierney
Owner Author

Thanks for your comments, @seasmith, good to get another opinion on this.

The table_missing_* functions were originally named with the prefix table because I needed to distinguish between the two sets of functions, and I thought that they behave more like the table function in R, which performs counts and cross tabulations.
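
As a rough illustration of that analogy (a base-R sketch, not how the package functions are implemented), tabulating the per-variable missing counts with `table()` gives the same kind of count-of-counts. The variable name `n_missing_in_var` is reused here only for readability.

```r
# Count the missing values in each variable (column) of airquality
n_missing_in_var <- colSums(is.na(airquality))

# Tabulate those counts: how many variables have each number of missings
table(n_missing_in_var)
```

For airquality this reports four variables with no missing values and one variable each with 7 and 37 missing values.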

I'm not entirely convinced that reordering the columns makes things easier to understand, although I do like the consistency.

library(naniar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Same as table_missing_var(), but with the columns reordered by subject
table_missing_var_2 <- function(x){
    table_missing_var(x) %>%
        select(n_vars,
               n_missing_in_var,
               percent)
}

table_missing_var(airquality)
#> # A tibble: 3 × 3
#>   n_missing_in_var n_vars  percent
#>              <int>  <int>    <dbl>
#> 1                0      4 66.66667
#> 2                7      1 16.66667
#> 3               37      1 16.66667

summary_missing_var(airquality)
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000


table_missing_var_2(airquality)
#> # A tibble: 3 × 3
#>   n_vars n_missing_in_var  percent
#>    <int>            <int>    <dbl>
#> 1      4                0 66.66667
#> 2      1                7 16.66667
#> 3      1               37 16.66667

summary_missing_var(airquality)
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000

Do you have any strong opinion about the renaming I proposed?

percent_missing_case()  --> miss_case_pct
percent_missing_var()   --> miss_var_pct
percent_missing_df()    --> miss_df_pct

summary_missing_case()  --> miss_case_summary
summary_missing_var()   --> miss_var_summary

table_missing_case()    --> miss_case_table
table_missing_var()     --> miss_var_table

Thanks again for your input, much appreciated!

@seasmith
Contributor

seasmith commented Jan 7, 2017

I like the new naming scheme. Ought to help make tab completion a lot easier.

@njtierney
Owner Author

OK great, just prepping the release for this now. I'm going to go ahead and use .Deprecated after all, because it is best practice and I want to get used to doing something like this.
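
For readers unfamiliar with the pattern, here is a minimal sketch of how an old name can be kept as a deprecated wrapper using base R's `.Deprecated()`. The body of `miss_var_summary()` below is a placeholder written for this example, not the package's actual code.

```r
# Placeholder for the new function; an assumption, not naniar's implementation.
miss_var_summary <- function(data) {
  n_missing <- colSums(is.na(data))
  data.frame(
    variable = names(n_missing),
    n_missing = unname(n_missing),
    percent = unname(n_missing) / nrow(data) * 100
  )
}

# The old name stays available, signals a deprecation warning,
# and forwards to the new name.
summary_missing_var <- function(data) {
  .Deprecated("miss_var_summary")
  miss_var_summary(data)
}

# Warns that summary_missing_var is deprecated, then returns the same result
summary_missing_var(airquality)
```

Keeping the wrapper for a release or two lets existing scripts keep running while pointing users at the new name.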

@njtierney
Owner Author

Thank you again @seasmith for your input! :)
