Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching top level expressions #578

Merged
merged 37 commits into from
Feb 5, 2020

Conversation

lorenzwalthert
Copy link
Collaborator

@lorenzwalthert lorenzwalthert commented Jan 19, 2020

The cache is currently invalidated if any expression is modified. If we manage to cache top level expressions separately, we can still use the cache for all non-modified expressions, which would should improve speed very often. Closes #570.

  • Do only add variables block and is_cached (if at all) after than
    get_parse_data (but before parse_transform_serialize_r_block because when we
    do block processing this might be too late) to not drag along
    transformers everywhere totally unrelated. When processing blocks, make sure
    to cache individual values where they are created.

  • process in block, not every expression.

  • expressions on the same line must be processed together, i.e. when separated
    by semi-colon or comment, they belong to the same block.

  • think about the interaction with create_tokens and that each column should
    potentially exist on every level, not just top level.

  • the text attibute of a top level nest is not used to create a hash for
    caching, but is used to return text in case no styling is needed because the
    value is cached.

  • resolve line break problem.

  • Add tests. Both low-level and high-level.

  • compute_parse_data_nested() should not compute the parse data nested for expressions that are cached already because for them, we just return text of the top level expression.

  • Put all consecutive non-cached expressions into one block to speed up things. Maybe this does not help as much as expected. Need more comprehensive benchmarks.

  • nice to have: Cache after styling each expression separately, so styling a
    file with ten identical expressions will use cache for 9 of them.
    Does not seem to work because we must have blocks first.

Note: The diff is so large because all trees are rewritten as we use includeText = TRUE in getParseData(). Otherwise it would be 5k lower.

@lorenzwalthert lorenzwalthert force-pushed the caching-top-level-expr branch 2 times, most recently from 0fef796 to 4a45b62 Compare January 19, 2020 14:36
@codecov-io
Copy link

codecov-io commented Jan 19, 2020

Codecov Report

Merging #578 into master will increase coverage by 0.29%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #578      +/-   ##
==========================================
+ Coverage   90.65%   90.95%   +0.29%     
==========================================
  Files          46       47       +1     
  Lines        2044     2111      +67     
==========================================
+ Hits         1853     1920      +67     
  Misses        191      191
Impacted Files Coverage Δ
R/nested-to-tree.R 100% <ø> (ø) ⬆️
R/relevel.R 96.92% <ø> (ø) ⬆️
R/initialize.R 97.05% <ø> (ø) ⬆️
R/transform-files.R 100% <100%> (ø) ⬆️
R/utils-cache.R 100% <100%> (ø) ⬆️
R/token-create.R 95.74% <100%> (+0.18%) ⬆️
R/serialize.R 100% <100%> (ø) ⬆️
R/utils.R 82.75% <100%> (+1.98%) ⬆️
R/nest.R 100% <100%> (ø) ⬆️
R/parse.R 85.55% <100%> (ø) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b7c4d9b...4f6a890. Read the comment docs.

@lorenzwalthert lorenzwalthert force-pushed the caching-top-level-expr branch 2 times, most recently from b6745e5 to 3efe926 Compare February 4, 2020 20:48
Not really active on expression level, just some infrastructure and code running. Must now process each block separately, i.e. parse_transform_serialize and parse_transform_serialize_block from branch caching@48a9a6
has not resolved following problems:

- Cache each expression separately, processing all
  at once. Currently, process each expression separaetly,
  which takes very long. Probable problem: need to
  generate cache per expression before it is merged with
  other expressions or parse generated code at the very
  end to get expressions again (probably cheap but
  cognitive overhead)

Infrastructure:

- Can't remember why get_parse_data should also add
  the variable is_cached (and drag along transformers.
… instead of get_parse_data()

more coherent conceptually and less dragging of arguments
consistently formatted and all formatting comes from initializer (that is in style guide), nothing hardcoded in styler infra.
Because the line break that is applied to the first token with apply_stylerignore may contain a line break, but since all blocks are separated by a line break there would be an extra line break added.
…look different. This is responsible for

153 files changed, 5204 insertions(+), 4856 deletions(-).

We include the text so we can, if we use the cache, return text. Previously, this was not needed because we did not process by expressions, so the input text was available at the time we checked the cache (function returned from make_transformer).
…ens, so there is no error when parser version < 2 in relocate_eq_sign when a newly created parse table is bound together with the previous one.
…ize_r_block to enable processing by blocks of expressions but caching by expression (but not implemented in practive with cache_find_block()
probably not faster, but much more complicated

includes quite slow creation of text of individual expressions with getParseData. parse(deparse()) was not used because it also styles the code at the same time (indention of function declaration for example). Could not find quick way to turn that off and make it return as is.
@lorenzwalthert lorenzwalthert mentioned this pull request Feb 4, 2020
R/initialize.R Outdated




Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the semantic difference between 1 empty line, 3 empty lines (as in line 22) and 5 empty lines?

R/nest.R Outdated
#' Drop all children of a top level expression that are cached
#'
#' Note that we do cache top-level comments. Because package code has a lot of
#' roxygen comments and each of them is a top level expresion, so checking is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "so" is wrong since the sentence starts with "Because".

R/nest.R Outdated
#' [parse_transform_serialize_r_block()], we simply return `text` for the top
#' level token. For that
#' reason, the nested parse table can, at the rows where these expressions are
#' located, be shallow, i.e. it does not have to contain a children, because it
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "a children" -> "children"

R/nest.R Outdated
#' level token. For that
#' reason, the nested parse table can, at the rows where these expressions are
#' located, be shallow, i.e. it does not have to contain a children, because it
#' will neighter be transformerd nor serialized anytime. This function drop all
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "neighter" -> "neither"
typo: "transformerd" -> "transformed"
typo: "drop" -> "drops"

#' Every expression is an expression itself, Expressions on same line are in
#' same block.
#' Multiple expressions can sit on one row, e.g. in line comment and commands
#' seperated with ";". This creates a problem when processing each expression
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "seperated" -> "separated"

#' @param pd A top level nest.
find_blank_lines_to_next_block <- function(pd) {
block_boundary <- pd$block != lag(pd$block, default = 0)
# TODO everywhere: block is not ambiguous. use cache block since we also have
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "not ambiguous" -> "ambiguous"

R/utils-cache.R Outdated
@@ -13,11 +13,32 @@ hash_standardize <- function(text) {
list()
}

#' Check if text is cached
#'
#' This boilds down to check if the hash exists at the caching dir as a file.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "boilds" -> "boils"

@lorenzwalthert
Copy link
Collaborator Author

Thanks @rillig, I will have a look.

@lorenzwalthert lorenzwalthert marked this pull request as ready for review February 5, 2020 19:09
@lorenzwalthert lorenzwalthert merged commit a87198f into r-lib:master Feb 5, 2020
@lorenzwalthert lorenzwalthert deleted the caching-top-level-expr branch February 11, 2020 12:35
@lorenzwalthert lorenzwalthert restored the caching-top-level-expr branch November 28, 2020 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache on expression level
3 participants