word count with books #1

Open
topepo opened this issue Apr 17, 2023 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)


topepo commented Apr 17, 2023

Thanks for making this.

If I want to use it with a book, how would that happen? It looks like the yml that you have goes into the individual documents (I think).

Is there a way to put the options into _quarto.yml and get either a per-chapter count or one for the entire book?


topepo commented Apr 18, 2023

I wrote some code to compute and aggregate the counts per markdown file:

# Computes the word count of each md file and appends results to tibble
# Assumes that the lua filters from https://github.com/andrewheiss/quarto-wordcount
# are in a path that is accessible.

get_word_count <- function() {
  require(purrr)
  md_files <- list.files(pattern = "\\.md$")
  md_stubs <- gsub("\\.md$", "", md_files)

  res <-
    map(md_stubs, file_word_count) %>%
    map2_dfr(md_stubs, parse_output)
  res
}

# ------------------------------------------------------------------------------
# helpers

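# Runs a system command to get the outputs
#' @param x the file name stub
#' @param path The location of the lua filters
file_word_count <- function(x, path = "word_counts") {
  # pandoc needs an output target; the counts themselves are printed to stdout
  file_path <- tempfile(fileext = ".html")
  on.exit(unlink(file_path), add = TRUE)
  # shQuote() keeps file names that contain spaces from breaking the command
  cmd_text <- glue::glue(
    "pandoc {shQuote(paste0(x, '.md'))} --output {shQuote(file_path)} ",
    "--lua-filter {shQuote(file.path(path, 'wordcount.lua'))} --citeproc"
  )
  system(cmd_text, intern = TRUE)
}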

# operates on all results
#' @param x the results of the filter
#' @param file_name the file name to add to the results
parse_output <- function(x, file_name) {
  require(purrr)
  has_numbers <- grepl("[[:digit:]]", x)
  x <- x[has_numbers]
  res <- purrr::map_dfr(x, extract_count)
  res$file <- file_name
  res
}

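# converts filter results to tibble
#' @param x a single line of filter output, e.g. "14 total words"
extract_count <- function(x) {
  # drop any leading list marker (e.g. "- ") so the count is the first token
  x <- sub("^[^[:digit:]]*", "", x)
  split_up <- strsplit(x, " ")[[1]]
  count <- as.integer(split_up[1])
  desc <- paste(split_up[-1], collapse = " ")
  tibble::tibble(count = count, type = desc)
}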
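
To go from the per-file rows to one figure for the entire book, the result can be summed by `type`. A minimal sketch in base R, assuming the result has the `count`, `type`, and `file` columns built above; the numbers below are invented placeholders, not real filter output:

```r
# Placeholder per-file results shaped like the output of get_word_count();
# the counts are made up for illustration only.
per_file <- data.frame(
  count = c(10L, 12L, 20L, 25L),
  type  = c("words in text body", "total words",
            "words in text body", "total words"),
  file  = c("index", "index", "intro", "intro"),
  stringsAsFactors = FALSE
)

# One row per count type, summed across every chapter file
book_totals <- aggregate(count ~ type, data = per_file, FUN = sum)
book_totals
```

Keeping the `file` column in the per-file table means the same data still gives per-chapter counts without rerunning pandoc.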

@andrewheiss
Owner

Sorry for the delay here! I ended up making a bunch of other major changes to the filter and I think I can figure out a solution here now

The filter works on individual documents, since it converts each document to a pandoc AST and finds the word count from that. With things like Quarto books and websites, Quarto renders each of the files separately (and would get a separate word count for each) and then does whatever magic it uses to stitch them all together into a single document. In the case of HTML output, I'm 99% sure that the separate documents are never combined into a single AST, since both books and websites are, um, websites

With PDF and Word output, though, there might be one unified AST prior to converting to the final output—I'll need to check that

In any case, even if PDF/Word have a single combined AST to work with, it still might be easier to do something like the purrr::map(file_word_count) approach you did, since HTML doesn't use one single file. Perhaps some function that captures the output from each file, or that builds a tidy CSV as it renders and then can read from that CSV, or something along those lines?
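
The CSV idea might look roughly like this on the R side. It is only a sketch under assumptions: `log_counts()`, the `file`/`count`/`type` column layout, and the counts themselves are hypothetical, and something (the filter or a wrapper script) would still have to call it once per rendered file.

```r
# Append one file's count to a CSV log, writing the header only on first use.
# log_counts() and the column layout are hypothetical names for this sketch.
log_counts <- function(file, count, type, path) {
  row <- data.frame(file = file, count = count, type = type)
  write.table(row, path, sep = ",", row.names = FALSE,
              col.names = !file.exists(path), append = file.exists(path))
}

log_path <- tempfile(fileext = ".csv")
log_counts("index", 10L, "total words", log_path)  # invented value
log_counts("intro", 20L, "total words", log_path)  # invented value

# After rendering, read the log back and total it
all_counts <- read.csv(log_path)
sum(all_counts$count)
```

Appending rows as each file finishes keeps the rendering step and the aggregation step independent, so the same log works whether the output is HTML or PDF.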

@andrewheiss
Owner

k cool, so in exploring this more, it looks like neither HTML nor PDF output uses a full single combined AST. They all render everything to individual markdown files and then (1) for HTML books, each file gets converted to individual HTML files and (2) for PDF books, Quarto/pandoc somehow merges them into one .tex file and then passes that through LaTeX. I'm assuming Word, typst, markdown, and others do something similar.

When the word counting filter is included in _quarto.yml (like here from the default book template you get from RStudio's new project dialog):

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
  pdf:
    documentclass: scrreprt

…it counts each of the individual .qmd files separately


I think it should be fairly straightforward to make Lua output those counts to a (temporary?) file and then aggregate them at the end

@andrewheiss
Owner

> I think it should be fairly straightforward to make Lua output those counts…

lol nope.

With this content in index.qmd:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

…if the word count filter is run on its own just on index.qmd, there are 14 words:

Overall totals:
-----------------------------
- 14 total words
- 14 words in body and notes

Section totals:
-----------------------------
- 14 words in text body

When it's run as a whole book, though, index.qmd suddenly has 32 words.

Overall totals:
-----------------------------
- 32 total words
- 32 words in body and notes

Section totals:
-----------------------------
- 32 words in text body

I don't know where they're coming from either.

If I keep the intermediate md files:

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    keep-md: true
    citeproc: false
    count-code-blocks: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua

…the resulting intermediate index.html.md still only has 14 words in it:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

Quarto's collection of book-related Lua filters is doing something extra behind the scenes that I can't track down, so the extra 18 words must be added after this intermediate file is written.
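
As a sanity check that doesn't depend on Quarto or the filter, a rough whitespace-based count of the index.qmd content can be done in base R. This is a sketch, not the filter's algorithm: `rough_word_count()` is a hypothetical helper, and its rules (drop attribute blocks like `{.unnumbered}` and tokens with no letters or digits) are assumptions that happen to suit this small file.

```r
# Rough word count: split on whitespace, then discard non-word tokens.
rough_word_count <- function(lines) {
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  # drop pandoc attribute blocks such as {.unnumbered}
  tokens <- tokens[!grepl("^\\{.*\\}$", tokens)]
  # drop tokens with no letters or digits, e.g. the "#" heading marker
  tokens <- tokens[grepl("[[:alnum:]]", tokens)]
  length(tokens)
}

index_md <- c(
  "# Preface {.unnumbered}",
  "",
  "This is a Quarto book.",
  "",
  "To learn more about Quarto books visit <https://quarto.org/docs/books>."
)

rough_word_count(index_md)
#> [1] 14
```

This agrees with the standalone count of 14, which fits the observation that the extra words are not in the intermediate markdown itself.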

andrewheiss added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Jun 4, 2024