You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, this is done in a kludgy way (and I find it hard to remember what is done) and at the time I wrote didn't really know much about NSE/environments. This an attempt of expressing all of the accumulated technical debt and propose some fixes.
There are a few kludges:
The main engine behind doing NSE is the overscope_ranges() generic, which essentially wraps S4Vectors::as.env() then calls a rlang::new_data_mask() for creating a "tidy-evaluation" environment with the bottom environment being the input ranges and top the environment being input metadata. On the other end for some, when a verb is evaluated each quosure has their environment updated so all the generics from BiocGenerics, IRanges and S4Vectors are bound to the top level of the quosures environment. This is super wasteful and the wrong place to put them.
Grouping in the mask. Currently, we apply a transform where we create Lists from our groups and then evaluate our quosures with respect to the List. This works fairly well since most generics dispatch on the lists and provides major speedups for basic ops like mean etc. However, in the cases where there is not a generic, an operation will throw an error (in the case of summarise) or will be wasteful by creating multiple masks (like mutate). There should be checks in place that attempts to dispatch on the list, if that fails then convert the quosure to a function(?) then use an appropriate *apply function to call it. This seems to be way tricky to implement, so maybe there's a better approach? cc @lawremi
Lazy binding: Currently, functions like mutate can assign symbols into the mask, so computed columns can be computed on. This should be done lazily via delayedAssign.
The n() function. This is confusing for people as there's a name clash with dplyr::n() which will fail and causes certain functions like dplyr::count() etc not work. The n() function is also wasteful on grouping, we can really quickly compute the group counts directly from the indices. The n() function doesn't need to be exported but could be bound to the mask.
Proposed Fixes:
Overscope ranges should be renamed to reflect what it actually does, creates a mask for the ranges.
The generics should be bound to the top environment of the mask not each quosure
A .data pronoun should be installed in the mask, to enable programming NSE interfaces on top of plyranges
Refactor n to be bound to the mask, it should look outside it's scope to quickly compute the result rather than via environment look ups.
Better fail-safe mechanisms for grouping transformations that don't work.
Other ideas:
There's a tension between thinking of a Ranges as columns consisting of start, end, width and as a single Vector like object. Sometimes, you want both and this can make the API clunky see #47 and also the count_overlaps functions. Could there be a lazy binding to the Ranges in the mask? This could facilitate doing range-wise arithmetic all within mutate (without needing to use anchor first).
The text was updated successfully, but these errors were encountered:
Currently, this is done in a kludgy way (and I find it hard to remember what is done) and at the time I wrote didn't really know much about NSE/environments. This an attempt of expressing all of the accumulated technical debt and propose some fixes.
There are a few kludges:
The main engine behind doing NSE is the
overscope_ranges()
generic, which essentially wrapsS4Vectors::as.env()
then calls arlang::new_data_mask()
for creating a "tidy-evaluation" environment with the bottom environment being the input ranges and top the environment being input metadata. On the other end for some, when a verb is evaluated each quosure has their environment updated so all the generics fromBiocGenerics
,IRanges
andS4Vectors
are bound to the top level of the quosures environment. This is super wasteful and the wrong place to put them.Grouping in the mask. Currently, we apply a transform where we create Lists from our groups and then evaluate our quosures with respect to the List. This works fairly well since most generics dispatch on the lists and provides major speedups for basic ops like mean etc. However, in the cases where there is not a generic, an operation will throw an error (in the case of summarise) or will be wasteful by creating multiple masks (like mutate). There should be checks in place that attempts to dispatch on the list, if that fails then convert the quosure to a function(?) then use an appropriate *apply function to call it. This seems to be way tricky to implement, so maybe there's a better approach? cc @lawremi
Lazy binding: Currently, functions like mutate can assign symbols into the mask, so computed columns can be computed on. This should be done lazily via delayedAssign.
The
n()
function. This is confusing for people as there's a name clash withdplyr::n()
which will fail and causes certain functions likedplyr::count()
etc not work. Then()
function is also wasteful on grouping, we can really quickly compute the group counts directly from the indices. Then()
function doesn't need to be exported but could be bound to the mask.Proposed Fixes:
.data
pronoun should be installed in the mask, to enable programming NSE interfaces on top of plyrangesOther ideas:
There's a tension between thinking of a Ranges as columns consisting of start, end, width and as a single Vector like object. Sometimes, you want both and this can make the API clunky see #47 and also the count_overlaps functions. Could there be a lazy binding to the Ranges in the mask? This could facilitate doing range-wise arithmetic all within mutate (without needing to use anchor first).
The text was updated successfully, but these errors were encountered: