s3fs IO errors (was: Error received during crawl) #772

Open
Neo-Oli opened this issue Feb 18, 2025 · 2 comments

Neo-Oli commented Feb 18, 2025

Hi, I have an issue that I am hoping you can help me with.

I am trying to archive a rather large site. Because the resulting archives quickly filled up my VPS's storage, I mounted an S3 space with s3fs. But when I run the crawler, I get this error:

{"timestamp":"2025-02-12T01:19:35.728Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","page":"https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22page%22:%22https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:35.729Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:36.732Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:40.834Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2025-02-12T01:19:41.333Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1252,"total":1252,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2025-02-12T01:19:41.340Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2025-02-12T01:19:41.341Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2025-02-12T01:19:41.432Z","logLevel":"info","context":"general","message":"Num WARC Files: 39","details":{}}
node:events:496
      throw er; // Unhandled 'error' event
      ^

Error: EIO: i/o error, close
Emitted 'error' event on WriteStream instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -5,
  code: 'EIO',
  syscall: 'close'
}

Node.js v20.11.1

I have tried it many times and it always fails with the same error. The s3fs mount stays stable during that time, so it isn't just a network disconnect. Do you have any idea what could cause this issue?

ikreymer (Member) commented

We haven't had a chance to test with s3fs, so we can't really help there specifically. However, Browsertrix Crawler has native support for uploading to S3-compatible storage. For security, the S3 settings are provided only via environment variables:

docker run \
  -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
  -e STORE_ACCESS_KEY=<access key> \
  -e STORE_SECRET_KEY=<secret key> \
  -e STORE_PATH=<optional prefix>/ \
  ... crawl --generateWACZ

This will upload the WACZ to https://s3-endpoint.example.com/bucket/<optional prefix>/ (the prefix is not required).

We have some docs on this, but they should be extended to include this example:
https://crawler.docs.browsertrix.com/user-guide/common-options/#uploading-crawl-outputs-to-s3-compatible-storage

A working example can also be found in the tests:
https://github.com/webrecorder/browsertrix-crawler/blob/main/tests/upload-wacz.test.js

You can set --sizeLimit on the crawl so that it uploads the WACZ to S3 and exits once the limit is reached, and then run it from a script that restarts the crawler in this way. (We use it this way in the Browsertrix app with Kubernetes.)
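
For illustration, a minimal restart loop could look like the sketch below. This is only a sketch under assumptions: the crawl URL, image invocation, 10 GB limit, and the exit-code check are placeholders for this example, not values from this issue; a real script would need to distinguish "crawl fully finished" from "stopped at the size limit" using the crawler's documented exit codes.

#!/bin/bash
# Sketch only: re-run the crawler each time it stops after hitting --sizeLimit
# and uploading its WACZ. URL, limit, and the loop condition are placeholders.
while true; do
  docker run \
    -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
    -e STORE_ACCESS_KEY="$STORE_ACCESS_KEY" \
    -e STORE_SECRET_KEY="$STORE_SECRET_KEY" \
    -v "$PWD/crawls:/crawls" \
    webrecorder/browsertrix-crawler crawl \
      --url https://www.example.com/ \
      --generateWACZ \
      --sizeLimit 10000000000
  status=$?
  # Placeholder check: stop once the crawler reports a clean, complete run;
  # adjust this to the crawler's actual exit-code conventions.
  [ "$status" -eq 0 ] && break
done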

At this time, only uploading WACZs is supported, and the WACZ upload should stream directly to S3 without requiring any additional local disk space. Hope this helps!

ikreymer changed the title from "Error received during crawl" to "s3fs IO errors (was: Error received during crawl)" on Feb 20, 2025

Neo-Oli commented Feb 26, 2025

> At this time, only uploading WACZs is supported, and the WACZ upload should stream directly to S3 without requiring any additional local disk space. Hope this helps!

@ikreymer The problem is that only the S3 bucket contains the previous crawls; the crawler server has no previous data. As far as I am aware, only the latest crawl would be included in the WACZ file if we did it this way. Is that correct?

I did find a solution for us: using rclone (mounted with --vfs-cache-mode writes, and adding --diskUtilization 0 to browsertrix-crawler) instead of s3fs, I am able to generate the WACZ directly on the S3 bucket.
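
For reference, that setup looks roughly like the sketch below; the rclone remote name, bucket, mount point, and crawl URL are assumptions for illustration, and only --vfs-cache-mode writes and --diskUtilization 0 come from this thread.

# Sketch only: mount the S3 bucket with rclone, then point the crawler's
# /crawls volume at the mount. Remote, bucket, paths, and URL are placeholders.
rclone mount s3remote:my-bucket /mnt/crawls \
  --vfs-cache-mode writes \
  --daemon

docker run \
  -v /mnt/crawls:/crawls \
  webrecorder/browsertrix-crawler crawl \
    --url https://www.example.com/ \
    --generateWACZ \
    --diskUtilization 0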
