Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix build against dev dplyr #353

Closed
kriemo opened this issue Mar 20, 2020 · 7 comments
Closed

fix build against dev dplyr #353

kriemo opened this issue Mar 20, 2020 · 7 comments
Assignees

Comments

@kriemo
Copy link
Member

kriemo commented Mar 20, 2020

> 
> ### ** Examples
> 
> genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
> 
> x <- bed_random(genome, seed = 1010486)
> y <- bed_random(genome, seed = 9203911)
> 
> bed_absdist(x, y, genome)
Error: Join columns must be uniqueProblem at position 2
Backtrace:1. └─valr::bed_absdist(x, y, genome)
 2.   ├─dplyr::inner_join(genome, ref_points, by = c("chrom", groups_xy))
 3.   └─dplyr:::inner_join.data.frame(...)
 4.     └─dplyr:::join_mutate(...)
 5.       └─dplyr:::join_cols(...)
 6.         └─dplyr:::standardise_join_by(by, x_names = x_names, y_names = y_names)
 7.           └─dplyr:::check_join_vars(by$x, x_names)
Execution halted
checking tests ...

 ERROR
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 424 | SKIPPED: 3 | WARNINGS: 2 | FAILED: 15 ]
  1. Error: absdist calculation is correct (@test_absdist.r#22) 
  2. Error: self absdist is 0 (@test_absdist.r#35) 
  3. Error: x ivls without matching y-ivls chroms are reported with absdist = NA (@test_absdist.r#56) 
  4. Error: ensure that absdist is calculated with respect to input tbls issue#108 (@test_absdist.r#87) 
  5. Error: old dataframe groupings (dplyr v. < 0.7.9.900) are tolerated (@test_groups.r#87) 
  6. Failure: unmatched groups are included when invert = TRUE (@test_intersect.r#329) 
  7. Failure: book-ended intervals are not reported (@test_map.r#104) 
  8. Failure: basic partition works (bedops partition1 test) (@test_partition.r#36) 
  9. Failure: extended partition works (bedops partition2 test) (@test_partition.r#119) 
  1. ...
  
  Error: testthat unit tests failed
  Execution halted
@kriemo
Copy link
Member Author

kriemo commented Mar 20, 2020

Most of the errors are related to testthat:test_equal no longer considering a trbl_interval() object equal to a tibble::tribble(), due to differing class attributes. Substituting testthat::expect_equivalent() gets around this issue. The alternative fix is mentioned in #274, but that would be a big breaking change that I am not in favor of at this point.

library(testthat)
library(valr)
x <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 2000,
  "chr1", 1000, 400
)

pred <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 400,
  "chr1", 1000, 2000
)

res <- bed_sort(x)
expect_equal(res, pred)
#> Error: `res` not equal to `pred`.
#> Attributes: < Length mismatch: comparison on first 2 components >
#> Attributes: < Component "class": Lengths (4, 3) differ (string compare on first 3) >
#> Attributes: < Component "class": 3 string mismatches >
class(res)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"
class(pred)
#> [1] "tbl_df"     "tbl"        "data.frame"
pred == res
#>      chrom start  end
#> [1,]  TRUE  TRUE TRUE
#> [2,]  TRUE  TRUE TRUE
expect_equivalent(res, pred)

Created on 2020-03-20 by the reprex package (v0.3.0)

@kriemo
Copy link
Member Author

kriemo commented Mar 21, 2020

arrange has a performance regression (tidyverse/dplyr#4962) in the dev version of dplyr which will slow down many of valr's functions. The regression will likely be fixed before being released as v1.0.0. However testing on my end suggests that just using base R order with the radix sorting method is faster than CRAN dplyr, (or dev dplyr) by ~25-40%. I suggest we modify bed_sort to use base R unless arrange is fundamentally rewritten in v1.0.0.

Dev dplyr performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.99.9002'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 6.6252349 6.6252349
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2678227 0.2678227
#>       mean    median       uq      max neval cld
#>  6.6921549 6.6921549 6.759075 6.759075     2   b
#>  0.3317684 0.3317684 0.395714 0.395714     2  a

Created on 2020-03-21 by the reprex package (v0.3.0)

v 0.8.5 performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.5'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 0.4096381 0.4096381
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2582281 0.2582281
#>       mean    median        uq       max neval cld
#>  0.4190501 0.4190501 0.4284621 0.4284621     2   b
#>  0.2600742 0.2600742 0.2619203 0.2619203     2  a

Created on 2020-03-21 by the reprex package (v0.3.0)

@kriemo
Copy link
Member Author

kriemo commented Mar 22, 2020

There is also a performance hit with summarize that will not be addressed in v1.0.0. This is really noticeable when using n() within summary functions (e.g. bed_map or bed_merge). We can substitute length() to regain some of the performance losses. The benchmarks vignette should be updated to use bed_map(x, y, .n = length(end)) instead of calling n().

tidyverse/dplyr#5017

library(valr)
library(bench)
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e6
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 1010487
y <- bed_random(genome, n = n, seed = seed_y)

mark(bed_map(x, y, .n = n()),
     bed_map(x, y, .n = length(end)),
     iterations = 2)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bed_map(x, y, .n = n())             6.6s    7.06s     0.142     449MB     6.30
#> 2 bed_map(x, y, .n = length(end))    1.79s    1.85s     0.539     461MB     4.85

Created on 2020-03-22 by the reprex package (v0.3.0)

@jayhesselberth jayhesselberth mentioned this issue Mar 24, 2020
21 tasks
@kriemo
Copy link
Member Author

kriemo commented Apr 2, 2020

New upstream changes are causing multiple test errors due to our custom tbl_ivl and tbl_gnm classes not playing well with bind_rows() . The changes don't seem to be stable yet, so will hold off on trying to fix.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Error: No common type for `..1` <tbl_df<
#>   chrom: character
#>   start: double
#>   end  : double
#> >> and `..2` <tbl_ivl<
#>   chrom: character
#>   start: double
#>   end  : double
#> >>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

Created on 2020-04-02 by the reprex package (v0.3.0)

@kriemo
Copy link
Member Author

kriemo commented Apr 21, 2020

Output from newest vctrs. We will need to define some custom functions for vctrs (see r-lib/vctrs#982) in order to keep our tbl_ivl and tbl_gnm tibble subclasses compatible with dplyr.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Warning: Can't combine <tbl_df> and <tbl_ivl>.
#> ℹ Convert all inputs to the same class to avoid this warning.
#> ℹ See <https://vctrs.r-lib.org/reference/faq-warning-convert-inputs.html>.
#> ℹ Falling back to <data.frame>.
#> Error: Can't convert <tbl_ivl> to <data.frame>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

Created on 2020-04-21 by the reprex package (v0.3.0)

@jayhesselberth
Copy link
Member

We could consider dropping tbl_ivl and tbl_gnm entirely, just using tbl_df.

I think the main utility of the tbl_ivl and tbl_gnm classes is to provide a handy way to ensure some basic checks have been done. But we could just run these checks at the beginning of each function instead of checking for the classes.

@kriemo
Copy link
Member Author

kriemo commented Apr 22, 2020

That's a good idea and will likely make maintenance easier.

@kriemo kriemo self-assigned this Apr 22, 2020
This was referenced May 4, 2020
@jayhesselberth jayhesselberth mentioned this issue May 6, 2020
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants