Large projects might cause "Argument list too long" error on pre-render scripts #10828

Closed
cscheid opened this issue Sep 17, 2024 Discussed in #10823 · 4 comments · Fixed by #11098

@cscheid
Collaborator

cscheid commented Sep 17, 2024

Discussed in #10823

Originally posted by Analect September 17, 2024

Description

I'm hitting this problem with a private GitLab-hosted repo containing roughly 1500 documents that are rendered with Quarto. I can't share the setup, but the approach is similar to what is shown publicly here: I scrape some document metadata and push it to files in a data folder, which are then published as resources in the rendered docs. I'm experimenting with alternative ways to work with the document metadata and wanted to use Quarto's pre-render capability.

project:
  type: website
  resources:
    - "data/**/*"
    - "package/**/*"
    - "coi-serviceworker.min.js"
  pre-render:
    - scripts/metadata-scrape.py
    - scripts/load_data_kuzu.py
  render:
    - "*.qmd"
    ...

The gitlab-runner doing the rendering is based on this Docker image, a Debian 12 OS with Quarto 1.5.57 on board. If I comment out the pre-render scripts, things run fine. However, when I enable scripts/metadata-scrape.py on the GitLab repo (similar in pattern to this scripts/metadata-scrape.py, only longer to handle custom metadata), I get this "Argument list too long" error. Can you shed any light on why this might happen when Quarto handles pre-render scripts?

[Image not preserved]

Per this, I tried to raise the command-line buffer limit with ulimit -s 65536, both on the VM running this dockerized gitlab-runner and in .gitlab-ci.yml so that it is applied within the runner itself (see image above), but neither has helped.

@cscheid cscheid added bug Something isn't working project-scripts labels Sep 17, 2024
@cscheid cscheid added this to the v1.6 milestone Sep 17, 2024
@cscheid cscheid self-assigned this Sep 17, 2024
@cscheid
Collaborator Author

cscheid commented Sep 17, 2024

The only way I can see this happening is that we're passing the list of input files as the QUARTO_PROJECT_INPUT_FILES env variable, and that is triggering the error (even though the error talks about the argument list being too long instead of an env variable being too long).

I'm not sure how to fix this in a backwards-compatible way. For large projects, we need to pass the path of a temporary file that contains the list of input files; but if we do that, we'll break the many existing pre-render scripts that work just fine.
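The environment-variable hypothesis is consistent with how Linux behaves: execve fails with E2BIG (reported as "Argument list too long") when either the arguments or the environment are too large, and a single argument or environment string is capped at 32 pages (128 KiB with 4 KiB pages) regardless of ulimit -s. The following is a minimal Python sketch for checking this on a given machine. It is not Quarto code; DEMO_INPUT_FILES is a made-up variable name standing in for QUARTO_PROJECT_INPUT_FILES, and the two sizes are arbitrary values on either side of the 128 KiB cap.

```python
import errno
import os
import subprocess
import sys
from typing import Optional


def spawn_with_env_value(size: int) -> Optional[int]:
    """Start a trivial child process with one environment variable of `size` bytes.

    Returns None if the child started, or the errno reported while spawning.
    """
    env = dict(os.environ)
    # DEMO_INPUT_FILES is a made-up name standing in for QUARTO_PROJECT_INPUT_FILES.
    env["DEMO_INPUT_FILES"] = "x" * size
    try:
        subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    for size in (1_000, 200_000):
        err = spawn_with_env_value(size)
        label = "ok" if err is None else errno.errorcode.get(err, str(err))
        print(f"{size} bytes: {label}")
```

On a typical x86-64 Linux host the larger value is expected to fail with E2BIG even after raising the stack limit, which would explain why ulimit -s did not help. Other platforms apply different limits, so the result may differ there.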

@Analect

Analect commented Sep 18, 2024

Found these on GitLab. Perhaps they're relevant.

> I'm not sure how to fix this in a backwards-compatible way.

One suggested fix in the second link above is to use a file-type variable. Maybe providing both a regular variable and a file-type variable would let users who hit the "Argument list too long" problem switch to the file-type variable.

@Analect

Analect commented Sep 18, 2024

Also, for the avoidance of doubt: if I disable the pre-render section of _quarto.yml, the render proceeds as shown below (file names and folders are fictitious). I do feel I'm at the upper end of the file count that can be handled, though, since I sometimes bump up against this problem, which I know you are addressing separately.

$ quarto render --output-dir public
WARN: The file /xxx/xxx/xxx.qmd contains a theme property which is being ignored. Website projects do not support per document themes since all pages within a website share the website's theme.
WARN: The file /xxx/xxx/xxx.qmd contains a theme property which is being ignored. Website projects do not support per document themes since all pages within a website share the website's theme.
....
[   1/1516] docs/folder1/folder2/folder3/folder4/01_file.ipynb
[   2/1516] docs/folder1/folder2/folder3/folder4/02_file.ipynb
[   3/1516] docs/folder1/folder2/folder3/folder4/03_file.ipynb
[   4/1516] docs/folder1/folder2/folder3/folder4/04_file.ipynb
...

I'm not sure what form QUARTO_PROJECT_INPUT_FILES takes or how it is influenced by whether pre-render scripts are enabled. For further colour, one of my scripts generates an array of file paths whose metadata it processes; it looks something like what's depicted below. I'm not sure whether QUARTO_PROJECT_INPUT_FILES contains richer data than file paths.

[docs/folder1/folder2/folder3/folder4/01_file.ipynb, docs/folder1/folder2/folder3/folder4/02_file.ipynb, docs/folder1/folder2/folder3/folder4/03_file.ipynb, docs/folder1/folder2/folder3/folder4/04_file.ipynb ..., docs/folder1/folder2/folder3/folder4/1516_file.ipynb]

In my case, sys.getsizeof reports 12728 for that array (see getsizeof below), and its length is 1564.

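import sys
print(sys.getsizeof(file_paths))  # bytes used by the list object itself, not by the path strings it holds
print(len(file_paths))  # number of paths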
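Note that sys.getsizeof on a list counts only the list object and its element pointers, so 12728 understates how much text would end up in an environment variable. A rough sketch of a more relevant measurement follows. It assumes the paths are joined with newlines, which is how the Quarto documentation describes QUARTO_PROJECT_INPUT_FILES as far as I know; the sample paths are the fictitious ones from this comment, and env_value_bytes is a hypothetical helper name.

```python
import sys
from typing import Iterable


def env_value_bytes(paths: Iterable[str], sep: str = "\n") -> int:
    """Size in bytes of the paths once joined into a single UTF-8 string."""
    return len(sep.join(paths).encode("utf-8"))


# Fictitious paths shaped like the ones in the comment above.
file_paths = [
    f"docs/folder1/folder2/folder3/folder4/{i:02d}_file.ipynb"
    for i in range(1, 1565)
]

print(sys.getsizeof(file_paths))    # the list object only
print(env_value_bytes(file_paths))  # the text that would be passed along
```

The second number is the one to compare against the per-string limit. Longer or absolute paths increase it in proportion.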

@cscheid
Collaborator Author

cscheid commented Oct 17, 2024

The following PR will resolve this problem. Unfortunately, I couldn't come up with a way in which this would be automatically addressed. Instead, Quarto will look for the following environment variables:

  • QUARTO_USE_FILE_FOR_PROJECT_INPUT_FILES
  • QUARTO_USE_FILE_FOR_PROJECT_OUTPUT_FILES

Either of these environment variables should be set to a filename in a directory that Quarto has write permissions to. When they are set, Quarto will write the list of files to the file pointed to in the environment variable, rather than setting QUARTO_PROJECT_INPUT_FILES and QUARTO_PROJECT_OUTPUT_FILES, respectively.
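A pre-render script that should work in both modes could look roughly like the sketch below. It is only an illustration: read_input_files is a hypothetical helper, and the code assumes that both the file and the environment variable hold one path per line, which the description above does not spell out.

```python
import os
from typing import List, Mapping


def read_input_files(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the project's input files from the list file if set, else from the env variable."""
    list_file = env.get("QUARTO_USE_FILE_FOR_PROJECT_INPUT_FILES")
    if list_file and os.path.exists(list_file):
        # Assumption: Quarto wrote one input path per line to this file.
        with open(list_file, encoding="utf-8") as fh:
            raw = fh.read()
    else:
        # Fallback: the original behaviour, assumed to be newline-separated.
        raw = env.get("QUARTO_PROJECT_INPUT_FILES", "")
    return [line for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    for path in read_input_files():
        print(path)
```

To opt in, the variable would be set before rendering, for example QUARTO_USE_FILE_FOR_PROJECT_INPUT_FILES=/tmp/quarto-inputs.txt quarto render, where the path is a placeholder for any location Quarto can write to.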

cscheid added a commit to quarto-dev/quarto-web that referenced this issue Oct 17, 2024
* add docs for the fix described in quarto-dev/quarto-cli#11098

* put docs in the right place