s3fs IO errors (was: Error received during crawl) #772

Open
Neo-Oli opened this issue Feb 18, 2025 · 2 comments

Neo-Oli commented Feb 18, 2025

Hi, I have an issue that I am hoping you can help me with.

I am trying to archive a rather large site. Because the resulting archives quickly filled up my VPS's storage, I mounted an S3 space with s3fs. But when I run the crawler, I get this error:

{"timestamp":"2025-02-12T01:19:35.728Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","page":"https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22page%22:%22https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:35.729Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:36.732Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:40.834Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2025-02-12T01:19:41.333Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1252,"total":1252,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2025-02-12T01:19:41.340Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2025-02-12T01:19:41.341Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2025-02-12T01:19:41.432Z","logLevel":"info","context":"general","message":"Num WARC Files: 39","details":{}}
node:events:496
      throw er; // Unhandled 'error' event
      ^

Error: EIO: i/o error, close
Emitted 'error' event on WriteStream instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -5,
  code: 'EIO',
  syscall: 'close'
}

Node.js v20.11.1

I have tried it many times and it always fails with the same error. The s3fs mount stays stable during that time, so it isn't just a network disconnect. Do you have any idea what could cause this issue?

ikreymer (Member) commented

We haven't had a chance to test with s3fs, so we can't really help there specifically. However, Browsertrix Crawler has native support for uploading to S3-compatible storage. For security, the S3 settings are provided only via environment variables:

docker run \
  -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
  -e STORE_ACCESS_KEY=<access key> \
  -e STORE_SECRET_KEY=<secret key> \
  -e STORE_PATH=<optional prefix>/ \
  ... crawl --generateWACZ

This will upload the WACZ to https://s3-endpoint.example.com/bucket/<optional prefix>/ (the prefix is not required).

We have some docs on this, but they should be extended to include this example:
https://crawler.docs.browsertrix.com/user-guide/common-options/#uploading-crawl-outputs-to-s3-compatible-storage

A working example can also be found in the tests:
https://github.com/webrecorder/browsertrix-crawler/blob/main/tests/upload-wacz.test.js

You can set --sizeLimit on the crawl so that it uploads the WACZ to S3 and exits once the limit is reached, and then run it from a script that restarts the crawler in this way. (We use it this way in the Browsertrix app with Kubernetes.)
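
For illustration, a minimal restart loop could look like the sketch below. This is only a sketch under assumptions: the crawl URL, image invocation, 10 GB limit, and the exit-code check are placeholders for this example, not values from this issue; a real script would need to distinguish "crawl fully finished" from "stopped at the size limit" using the crawler's documented exit codes.

#!/bin/bash
# Sketch only: re-run the crawler each time it stops after hitting --sizeLimit
# and uploading its WACZ. URL, limit, and the loop condition are placeholders.
while true; do
  docker run \
    -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
    -e STORE_ACCESS_KEY="$STORE_ACCESS_KEY" \
    -e STORE_SECRET_KEY="$STORE_SECRET_KEY" \
    -v "$PWD/crawls:/crawls" \
    webrecorder/browsertrix-crawler crawl \
      --url https://www.example.com/ \
      --generateWACZ \
      --sizeLimit 10000000000
  status=$?
  # Placeholder check: stop once the crawler reports a clean, complete run;
  # adjust this to the crawler's actual exit-code conventions.
  [ "$status" -eq 0 ] && break
done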

At this time, only uploading WACZs is supported, and the WACZ upload should stream directly to S3 without requiring any additional local disk space. Hope this helps!

ikreymer changed the title from "Error received during crawl" to "s3fs IO errors (was: Error received during crawl)" on Feb 20, 2025

Neo-Oli commented Feb 26, 2025

> At this time, only uploading WACZs is supported, and the WACZ upload should stream directly to S3 without requiring any additional local disk space. Hope this helps!

@ikreymer The problem is that only the S3 bucket contains the previous crawls; the crawler server has no previous data. As far as I am aware, only the latest crawl would be included in the WACZ file if we did it this way. Is that correct?

I did find a solution for us: using rclone (mounted with --vfs-cache-mode writes, and adding --diskUtilization 0 to browsertrix-crawler) instead of s3fs, I am able to generate the WACZ directly on the S3 bucket.
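
For reference, that setup looks roughly like the sketch below; the rclone remote name, bucket, mount point, and crawl URL are assumptions for illustration, and only --vfs-cache-mode writes and --diskUtilization 0 come from this thread.

# Sketch only: mount the S3 bucket with rclone, then point the crawler's
# /crawls volume at the mount. Remote, bucket, paths, and URL are placeholders.
rclone mount s3remote:my-bucket /mnt/crawls \
  --vfs-cache-mode writes \
  --daemon

docker run \
  -v /mnt/crawls:/crawls \
  webrecorder/browsertrix-crawler crawl \
    --url https://www.example.com/ \
    --generateWACZ \
    --diskUtilization 0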
