fix build against dev dplyr #353

kriemo · 2020-03-20T17:49:17Z

> 
> ### ** Examples
> 
> genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
> 
> x <- bed_random(genome, seed = 1010486)
> y <- bed_random(genome, seed = 9203911)
> 
> bed_absdist(x, y, genome)
Error: Join columns must be unique
✖ Problem at position 2
Backtrace:
    █
 1. └─valr::bed_absdist(x, y, genome)
 2.   ├─dplyr::inner_join(genome, ref_points, by = c("chrom", groups_xy))
 3.   └─dplyr:::inner_join.data.frame(...)
 4.     └─dplyr:::join_mutate(...)
 5.       └─dplyr:::join_cols(...)
 6.         └─dplyr:::standardise_join_by(by, x_names = x_names, y_names = y_names)
 7.           └─dplyr:::check_join_vars(by$x, x_names)
Execution halted

checking tests ...

 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 424 | SKIPPED: 3 | WARNINGS: 2 | FAILED: 15 ]
  1. Error: absdist calculation is correct (@test_absdist.r#22) 
  2. Error: self absdist is 0 (@test_absdist.r#35) 
  3. Error: x ivls without matching y-ivls chroms are reported with absdist = NA (@test_absdist.r#56) 
  4. Error: ensure that absdist is calculated with respect to input tbls issue#108 (@test_absdist.r#87) 
  5. Error: old dataframe groupings (dplyr v. < 0.7.9.900) are tolerated (@test_groups.r#87) 
  6. Failure: unmatched groups are included when invert = TRUE (@test_intersect.r#329) 
  7. Failure: book-ended intervals are not reported (@test_map.r#104) 
  8. Failure: basic partition works (bedops partition1 test) (@test_partition.r#36) 
  9. Failure: extended partition works (bedops partition2 test) (@test_partition.r#119) 
  1. ...
  
  Error: testthat unit tests failed
  Execution halted

kriemo · 2020-03-20T20:46:51Z

Most of the errors are related to testthat:test_equal no longer considering a trbl_interval() object equal to a tibble::tribble(), due to differing class attributes. Substituting testthat::expect_equivalent() gets around this issue. The alternative fix is mentioned in #274, but that would be a big breaking change that I am not in favor of at this point.

library(testthat)
library(valr)
x <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 2000,
  "chr1", 1000, 400
)

pred <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 400,
  "chr1", 1000, 2000
)

res <- bed_sort(x)
expect_equal(res, pred)
#> Error: `res` not equal to `pred`.
#> Attributes: < Length mismatch: comparison on first 2 components >
#> Attributes: < Component "class": Lengths (4, 3) differ (string compare on first 3) >
#> Attributes: < Component "class": 3 string mismatches >
class(res)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"
class(pred)
#> [1] "tbl_df"     "tbl"        "data.frame"
pred == res
#>      chrom start  end
#> [1,]  TRUE  TRUE TRUE
#> [2,]  TRUE  TRUE TRUE
expect_equivalent(res, pred)

^{Created on 2020-03-20 by the reprex package (v0.3.0)}

kriemo · 2020-03-21T19:10:03Z

arrange has a performance regression (tidyverse/dplyr#4962) in the dev version of dplyr which will slow down many of valr's functions. The regression will likely be fixed before being released as v1.0.0. However testing on my end suggests that just using base R order with the radix sorting method is faster than CRAN dplyr, (or dev dplyr) by ~25-40%. I suggest we modify bed_sort to use base R unless arrange is fundamentally rewritten in v1.0.0.

Dev dplyr performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.99.9002'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 6.6252349 6.6252349
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2678227 0.2678227
#>       mean    median       uq      max neval cld
#>  6.6921549 6.6921549 6.759075 6.759075     2   b
#>  0.3317684 0.3317684 0.395714 0.395714     2  a

^{Created on 2020-03-21 by the reprex package (v0.3.0)}

v 0.8.5 performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.5'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 0.4096381 0.4096381
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2582281 0.2582281
#>       mean    median        uq       max neval cld
#>  0.4190501 0.4190501 0.4284621 0.4284621     2   b
#>  0.2600742 0.2600742 0.2619203 0.2619203     2  a

^{Created on 2020-03-21 by the reprex package (v0.3.0)}

kriemo · 2020-03-22T17:09:55Z

There is also a performance hit with summarize that will not be addressed in v1.0.0. This is really noticeable when using n() within summary functions (e.g. bed_map or bed_merge). We can substitute length() to regain some of the performance losses. The benchmarks vignette should be updated to use bed_map(x, y, .n = length(end)) instead of calling n().

tidyverse/dplyr#5017

library(valr)
library(bench)
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e6
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 1010487
y <- bed_random(genome, n = n, seed = seed_y)

mark(bed_map(x, y, .n = n()),
     bed_map(x, y, .n = length(end)),
     iterations = 2)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bed_map(x, y, .n = n())             6.6s    7.06s     0.142     449MB     6.30
#> 2 bed_map(x, y, .n = length(end))    1.79s    1.85s     0.539     461MB     4.85

^{Created on 2020-03-22 by the reprex package (v0.3.0)}

kriemo · 2020-04-02T14:36:33Z

New upstream changes are causing multiple test errors due to our custom tbl_ivl and tbl_gnm classes not playing well with bind_rows() . The changes don't seem to be stable yet, so will hold off on trying to fix.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Error: No common type for `..1` <tbl_df<
#>   chrom: character
#>   start: double
#>   end  : double
#> >> and `..2` <tbl_ivl<
#>   chrom: character
#>   start: double
#>   end  : double
#> >>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

^{Created on 2020-04-02 by the reprex package (v0.3.0)}

kriemo · 2020-04-21T23:46:13Z

Output from newest vctrs. We will need to define some custom functions for vctrs (see r-lib/vctrs#982) in order to keep our tbl_ivl and tbl_gnm tibble subclasses compatible with dplyr.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Warning: Can't combine <tbl_df> and <tbl_ivl>.
#> ℹ Convert all inputs to the same class to avoid this warning.
#> ℹ See <https://vctrs.r-lib.org/reference/faq-warning-convert-inputs.html>.
#> ℹ Falling back to <data.frame>.
#> Error: Can't convert <tbl_ivl> to <data.frame>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

^{Created on 2020-04-21 by the reprex package (v0.3.0)}

jayhesselberth · 2020-04-22T12:04:25Z

We could consider dropping tbl_ivl and tbl_gnm entirely, just using tbl_df.

I think the main utility of the tbl_ivl and tbl_gnm classes is to provide a handy way to ensure some basic checks have been done. But we could just run these checks at the beginning of each function instead of checking for the classes.

kriemo · 2020-04-22T12:54:00Z

That's a good idea and will likely make maintenance easier.

jayhesselberth mentioned this issue Mar 24, 2020

Release valr 0.6.0 #359

Closed

21 tasks

kriemo self-assigned this Apr 22, 2020

This was referenced May 4, 2020

dplyr v. 1.0.0 compatibility #354

Merged

Changes for dplyr 1.0.0 #363

Closed

jayhesselberth closed this as completed in 978f3fb May 4, 2020

jayhesselberth mentioned this issue May 6, 2020

Release valr 0.6.1 #367

Closed

21 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix build against dev dplyr #353

fix build against dev dplyr #353

kriemo commented Mar 20, 2020

kriemo commented Mar 20, 2020

kriemo commented Mar 21, 2020

kriemo commented Mar 22, 2020

kriemo commented Apr 2, 2020

kriemo commented Apr 21, 2020

jayhesselberth commented Apr 22, 2020

kriemo commented Apr 22, 2020 •

edited

Loading

fix build against dev dplyr #353

fix build against dev dplyr #353

Comments

kriemo commented Mar 20, 2020

kriemo commented Mar 20, 2020

kriemo commented Mar 21, 2020

kriemo commented Mar 22, 2020

kriemo commented Apr 2, 2020

kriemo commented Apr 21, 2020

jayhesselberth commented Apr 22, 2020

kriemo commented Apr 22, 2020 • edited Loading

kriemo commented Apr 22, 2020 •

edited

Loading