I've been running ao3downloader the past week, starting from a day after the DDoS attacks. I figured I'd share a little of what I learned so it's all in one place. Note that I use Linux, and some of the tricks I'm using may not function on Windows.
When the protections are set high, the script doesn't run at all. Fortunately, most of the time they don't seem to be quite that high, and it does work. However, it may end abruptly if they're bumped up, and files may occasionally fail to download - without any error telling you so. Since I have modified the script to change the default filename - and added disambiguation for when filenames are identical - attempting to redownload from the full list would produce tons of duplicate files. Therefore I want to avoid unnecessary calls to AO3's servers whenever possible.
To make sure I don't lose files to silent download failures, this is what I do:
First, I download the entire list of links from the URL to a text file. I usually do links with metadata, but links only is all that's really needed; I just need to end up with a single txt file containing all the links at some point.
If the fandom is over a thousand fics, I break the text file up into blocks of a thousand at a time using the split command in the terminal. (This reduces the chance of having to manually remove a bunch of links from the file to resume downloads.)
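For anyone who hasn't used split before, an invocation along these lines should do it (all_links.txt and the chunk_ prefix are placeholder names, adjust to whatever you're using):

```bash
# Break the full link list into files of 1000 lines each:
# chunk_aa, chunk_ab, chunk_ac, ... in the current directory.
split -l 1000 all_links.txt chunk_
```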
I download several file formats, including html. Having html as one of the formats is the key here.
When a batch of 1000 is finished, I take note of the number of files in the folder. If it's exactly a multiple of 1000 (plus the extra "images" folder), then I know everything downloaded, yay! If not, I make a note that I'll need to check the entire lot, which I can do immediately or defer (I generally wait until the entire fandom has been run through so I only have to compare once). If a fic has only partially downloaded (one format but not all), I might manually grab the missing format or two to make sure every fic I grabbed has all of the formats I want for it.
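A quick way to get that file count without opening a file manager, assuming you're inside the batch folder (the -maxdepth/-type flags skip the "images" folder and its contents so they don't throw off the count):

```bash
# Count only regular files directly inside the current folder,
# ignoring the "images" directory and everything in it.
find . -maxdepth 1 -type f | wc -l
```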
If I'm done downloading the fandom, but know I have missing files, then I use the terminal to run the following command while in the folder of ALL of the files for that fandom to pull the links to each fic that was downloaded (a friend helped me with this command, lol): find -name '*.html' | while IFS= read -r fn; do grep -Po -m 1 '<a href="\Khttp://archiveofourown\.org/works/\d+(?=")' "$fn"; done > ../output.txt
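The same command, just broken across lines with comments so it's easier to read (behavior is unchanged):

```bash
# For each downloaded .html file, grab the first embedded AO3 work link
# and collect all of them into output.txt one level up.
find -name '*.html' | while IFS= read -r fn; do
  grep -Po -m 1 '<a href="\Khttp://archiveofourown\.org/works/\d+(?=")' "$fn"
done > ../output.txt
```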
Next, I open the output file and do a search-and-replace to swap every instance of "http://" with "https://". This is because AO3's downloaded files have an http link embedded in them, but the link list from the site itself uses https, so the two lists won't match otherwise.
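If you'd rather not do the replace by hand in an editor, sed can handle it in one line (this edits the file in place; adjust the path if output.txt isn't in your current directory):

```bash
# Swap the http:// prefix for https:// on every line so the extracted
# links match the scheme used in the original link list.
sed -i 's|http://|https://|g' output.txt
```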
Now I run the comm command (terminal again!) comparing the downloaded list (output.txt) against the full list of everything that should have downloaded (the original txt file of all the links), suppressing every line except those unique to the full list. This gets me the list of fics that didn't download, which I can now run through ao3downloader and combine with the rest to get the full batch.
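A sketch of that comparison, assuming the original full list is called all_links.txt and the extracted list is output.txt (comm needs sorted input, hence the sort calls; -13 hides the lines unique to output.txt and the lines the two files share, leaving only what's missing):

```bash
# Anything unique to the full list never downloaded - send it to missing.txt
# so it can be fed back into ao3downloader.
comm -13 <(sort output.txt) <(sort all_links.txt) > missing.txt
```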
AO3's servers currently request 300-second breaks every 9-15 downloads, so you won't get anything too quickly, but it can at least be left unattended while one accomplishes other tasks, which makes it worth it. Here's to hoping the stricter rate limits will not be necessary soon!