word count with books #1

topepo · 2023-04-17T16:42:41Z

Thanks for making this.

If I want to use it with a book, how would that happen? It looks like the yml that you have goes into the individual documents (I think).

Is there a way to put the options into _quarto.yml and get either a per-chapter count or one for the entire book?


topepo · 2023-04-18T15:07:11Z

I wrote some code to compute and aggregate the counts per markdown file:

# Computes the word count of each md file and appends results to tibble
# Assumes that the lua filters from https://github.com/andrewheiss/quarto-wordcount
# are in a path that is accessible.

get_word_count <- function() {
  require(purrr)
  md_files <- list.files(pattern = "\\.md$")
  md_stubs <- gsub("\\.md$", "", md_files)

  res <-
    map(md_stubs, file_word_count) %>%
    map2_dfr(md_stubs, parse_output)
  res
}

# ------------------------------------------------------------------------------
# helpers

# Runs a system command to get the outputs
#' @param x the file name stub
#' @param path The location of the lua filters
file_word_count <- function(x, path = "word_counts") {
  require(glue)
  file_path <- tempfile()
  file_path <- paste0(file_path, ".html")
  cmd_text <- glue::glue("pandoc {x}.md --output {file_path} --lua-filter {path}/wordcount.lua --citeproc")
  system(cmd_text, intern = TRUE)
}

# operates on all results
#' @param x the results of the filter
#' @param file_name the file name to add to the results
parse_output <- function(x, file_name) {
  require(purrr)
  has_numbers <- grepl("[[:digit:]]", x)
  x <- x[has_numbers]
  res <- purrr::map_dfr(x, extract_count)
  res$file <- file_name
  res
}

# converts one line of filter output to a one-row tibble
#' @param x a single line of filter output (a count followed by its description)
extract_count <- function(x) {
  require(tibble)
  split_up <- strsplit(x, " ")[[1]]
  count <- as.integer(split_up[1])
  desc <- paste(split_up[-1], collapse = " ")
  tibble::tibble(count = count, type = desc)
}
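To get from the per-file results to the two things asked about at the top of the thread (a per-chapter count and a whole-book count), something like the following might work. It is only a sketch in base R: the `counts` data frame is made-up illustrative data in the shape `get_word_count()` returns (`count`, `type`, `file`), and `summarise_counts()` is a hypothetical helper, not part of the extension.

```r
# Made-up example data in the shape returned by get_word_count():
# one row per (file, type) with columns count, type, file.
counts <- data.frame(
  count = c(14L, 14L, 250L, 230L),
  type  = c("total words", "words in text body",
            "total words", "words in text body"),
  file  = c("index", "index", "intro", "intro"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-chapter counts plus whole-book totals by type.
summarise_counts <- function(x) {
  list(
    per_chapter = x[order(x$file, x$type), c("file", "type", "count")],
    book_total  = aggregate(count ~ type, data = x, FUN = sum)
  )
}

res <- summarise_counts(counts)
res$per_chapter
res$book_total
```

The `type` labels depend on whatever the filter prints, so they are treated as opaque strings here rather than matched against a fixed list.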

andrewheiss · 2024-06-03T20:58:18Z

Sorry for the delay here! I ended up making a bunch of other major changes to the filter and I think I can figure out a solution here now

The filter works on individual documents, since it converts each document to a pandoc AST and finds the word count from that. With things like Quarto books and websites, Quarto renders each of the files separately (and would get a separate word count for each) and then does whatever magic it uses to stitch them all together into a single document. In the case of HTML output, I'm 99% sure that the separate documents are never combined into a single AST, since both books and websites are, um, websites

With PDF and Word output, though, there might be one unified AST prior to converting to the final output—I'll need to check that

In any case, even if PDF/Word have a single combined AST to work with, it still might be easier to do something like the purrr::map(file_word_count) approach you did, since HTML doesn't use one single file. Perhaps some function that captures the output from each file, or that builds a tidy CSV as it renders and then can read from that CSV, or something along those lines?
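As a rough illustration of the "tidy CSV" idea, here is a base R sketch. Everything in it is an assumption for illustration: `log_counts()` and `book_totals()` are hypothetical helpers, the column names are invented, and the numbers are made up. It does not show how the filter itself would write the file during a render.

```r
# Hypothetical helper: append one file's count to a running CSV log.
log_counts <- function(file, type, count, csv_path) {
  row <- data.frame(file = file, type = type, count = count,
                    stringsAsFactors = FALSE)
  first_write <- !file.exists(csv_path)
  # Write the header only when creating the file; append afterwards.
  write.table(row, csv_path, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write,
              qmethod = "double")
  invisible(csv_path)
}

# Hypothetical helper: read the log back and total counts by type.
book_totals <- function(csv_path) {
  logged <- read.csv(csv_path, stringsAsFactors = FALSE)
  aggregate(count ~ type, data = logged, FUN = sum)
}

# Usage with made-up numbers for two chapters.
csv_path <- tempfile(fileext = ".csv")
log_counts("index", "total words", 14L, csv_path)
log_counts("intro", "total words", 250L, csv_path)
totals <- book_totals(csv_path)
totals
```

A real version would also need to clear the log at the start of each render so counts from an earlier run are not added in again.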

andrewheiss · 2024-06-04T01:15:35Z

k cool, so in exploring this more, it looks like neither HTML nor PDF output uses a full single combined AST. They all render everything to individual markdown files and then (1) for HTML books, each file gets converted to individual HTML files and (2) for PDF books, Quarto/pandoc somehow merges them into one .tex file and then passes that through LaTeX. I'm assuming Word, typst, markdown, and others do something similar.

When the word counting filter is included in _quarto.yml (like here from the default book template you get from RStudio's new project dialog):

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
  pdf:
    documentclass: scrreprt

…it counts each of the individual .qmd files separately

I think it should be fairly straightforward to make Lua output those counts to a (temporary?) file and then aggregate them at the end

andrewheiss · 2024-06-04T01:50:27Z

"I think it should be fairly straightforward to make Lua output those counts…"

lol nope.

With this content in index.qmd:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

…if the word count filter is run on its own just on index.qmd, there are 14 words:

Overall totals:
-----------------------------
- 14 total words
- 14 words in body and notes

Section totals:
-----------------------------
- 14 words in text body

When it's run as a whole book, though, index.qmd suddenly has 32 words.

Overall totals:
-----------------------------
- 32 total words
- 32 words in body and notes

Section totals:
-----------------------------
- 32 words in text body

I don't know where they're coming from either.
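To make the gap explicit, the two reports above can be parsed and compared line by line. This is a sketch in base R that assumes the report lines keep the `- <number> <description>` shape shown above; `parse_report()` is a hypothetical helper, not part of the extension.

```r
# Report text copied from the two runs shown above.
standalone <- c("Overall totals:", "-----------------------------",
                "- 14 total words", "- 14 words in body and notes",
                "", "Section totals:", "-----------------------------",
                "- 14 words in text body")
book <- c("Overall totals:", "-----------------------------",
          "- 32 total words", "- 32 words in body and notes",
          "", "Section totals:", "-----------------------------",
          "- 32 words in text body")

# Hypothetical helper: pull "- <n> <description>" lines out of a report.
parse_report <- function(lines) {
  pattern <- "^- ([0-9]+) (.*)$"
  hits <- grepl(pattern, lines)
  data.frame(
    type  = sub(pattern, "\\2", lines[hits]),
    count = as.integer(sub(pattern, "\\1", lines[hits])),
    stringsAsFactors = FALSE
  )
}

# Line up the two runs by description and compute the difference.
compare <- merge(parse_report(standalone), parse_report(book),
                 by = "type", suffixes = c("_standalone", "_book"))
compare$extra <- compare$count_book - compare$count_standalone
compare
```

Every category differs by the same 18 words, which suggests the extra text is being counted as ordinary body text rather than as notes or references.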

If I keep the intermediate md files:

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    keep-md: true
    citeproc: false
    count-code-blocks: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua

…the resulting intermediate index.html.md still only has 14 words in it:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.
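As a sanity check on the 14, a crude whitespace count of that markdown can be done in base R. This is not how the filter counts (it walks the pandoc AST); `naive_word_count()` is a hypothetical helper that just strips the heading marker and the `{.unnumbered}` attribute and splits on whitespace, treating the URL as one word.

```r
# Contents of the intermediate markdown as shown above.
md <- c("# Preface {.unnumbered}",
        "",
        "This is a Quarto book.",
        "",
        "To learn more about Quarto books visit <https://quarto.org/docs/books>.")

# Hypothetical helper: crude whitespace-based word count.
naive_word_count <- function(lines) {
  lines <- sub("^#+\\s*", "", lines)       # drop heading markers
  lines <- gsub("\\{[^}]*\\}", "", lines)  # drop attribute blocks like {.unnumbered}
  words <- unlist(strsplit(lines, "\\s+"))
  sum(nzchar(words))                       # ignore empty tokens
}

n_words <- naive_word_count(md)
n_words
```

This gives 14 (1 for the heading, 5 for the first sentence, 8 for the second), so the standalone figure matches the visible text and the extra 18 words in the book run are not in this file.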

Quarto's collection of book-related Lua filters is doing something extra behind the scenes that I can't track down.

andrewheiss mentioned this issue Apr 20, 2023

Separate word counts for main text, references, appendix #2

Closed

andrewheiss added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels Jun 4, 2024
