Kitchen-sink bugfixes
- fixes #117: distinguish between actual series bookmarks and bookmarked works that happen to be in a series
- fixes #102: properly identify links to individual chapters of a work as "work" links
- fixes #23: exception when work text contains the string "This work could have adult content"
- fixes a rare issue when retry-after is non-positive
- ✨ add tests ✨
nianeyna committed Jan 21, 2024
1 parent e826fc6 commit a910039
Showing 19 changed files with 6,059 additions and 60 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,5 +1,4 @@
venv/
.vscode/
__pycache__/
*.pyc
downloads/
12 changes: 2 additions & 10 deletions README.md
@@ -1,7 +1,3 @@
## Update about the current DDoS attack on AO3 (July 16, 2023)

The script seems to be working again based on my testing. It's slower than usual due to the rate limit adjustments (you'll see the "ao3 has requested a break" message more often, and the breaks will be longer). Other than that though, things look pretty stable for now. Happy downloading!

## What is this?

This is a program intended to help you download fanfiction from the [Archive of Our Own](https://archiveofourown.org/) in bulk. This program is primarily intended to work with links to the Archive of Our Own itself, but has a secondary function of downloading any [Pinboard](https://pinboard.in/) bookmarks that link to the Archive of Our Own. You can ignore the Pinboard functionality if you don't know what Pinboard is or don't use Pinboard.
@@ -32,7 +28,7 @@ As of January 17, 2023 I have changed how file names are generated (again). All

## Instructions

1. install [python](https://www.python.org/downloads/). make sure to install version 3.9.0 or later. see [announcements](#announcements) for the most recent version of python that is confirmed to work with the script - when in doubt, install that version.
1. install python [from this link](https://www.python.org/downloads/release/python-3114/). **do not install the latest version of python**, or a version of python lower than 3.9.0.
2. download the repository as a zip file. the "repository" means the folder containing the code.
- if you are reading this on [github](https://github.com/nianeyna/ao3downloader), you can download the repository by clicking on the "Code" button in github and selecting "Download ZIP"
- if you are reading this on [my website](https://nianeyna.dev/ao3downloader/), you can download the repository by clicking the button at the top of the page that says "Click to Download"
@@ -76,7 +72,7 @@ As of January 17, 2023 I have changed how file names are generated (again). All
- **IMPORTANT**: some of your input choices are saved in a file called <!--CHECK-->settings.json<!--SETTINGS_FILE_NAME--> (in the same folder as ao3downloader.py). In some cases you will not be able to change these choices unless you clear your settings by deleting <!--CHECK-->settings.json<!--SETTINGS_FILE_NAME--> (or editing it, if you are comfortable with json). In addition, please note that saved settings include passwords and keys and are saved in plain text. **Use appropriate caution with this file.**
- **The purpose of entering your ao3 login information** is to download archive-locked works or anything else that is not visible when you are not logged in. If you don't care about that, there is no need to enter your login information.
- **Ao3 limits the number of requests** a single user can make to the site in a given time period. When this limit is reached, the script will pause for the amount of time (usually a few minutes) that Ao3 requests. When this happens, the start time, end time, and length of the pause in seconds will be printed to the console. If you try to access Ao3 from your browser during this period, you will see a "Retry later" message. Don't be alarmed by this - it's normal, and you aren't in trouble. Simply wait for the specified amount of time and then refresh the page. Other than during these required pauses, you can use Ao3 as normal while the script is running.
- **If you choose to '<!--CHECK-->get works from series links<!--AO3_PROMPT_SERIES-->'** then if the script encounters a work that is part of a series, it will also download the entire series that the work is a part of. This can _dramatically_ extend the amount of time the script takes to run. If you don't want this, choose 'n' when you get this prompt. (Note that this will cause the program to ignore _all_ series links, including e.g. series that you have bookmarked.)
- **If you choose to '<!--CHECK-->get works from all encountered series links<!--AO3_PROMPT_SERIES-->'** then if the script encounters a work that is part of a series, it will also download the entire series that the work is a part of. This can _dramatically_ extend the amount of time the script takes to run. If you don't want this, choose 'n' when you get this prompt. (Series that you have bookmarked directly will always be fully downloaded, regardless of what you choose here.)
- **If you choose to '<!--CHECK-->download embedded images<!--AO3_PROMPT_IMAGES-->'** the script will look for image links on all works it downloads and attempt to save those images to an '<!--CHECK-->images<!--IMAGE_FOLDER_NAME-->' subfolder. Images will be titled with the name of the fic + 'imgxxx' to distinguish them.
- Note that this feature does not encode any association between the downloaded images and the fic file aside from the file name.
- Most file formats will include embedded image files anyway, regardless of whether you choose this option. I have confirmed this for PDF, EPUB, MOBI, and AZW3 file formats. (If you saw me contradict this in an earlier version of this readme... no you didn't)
@@ -99,10 +95,6 @@ As of January 17, 2023 I have changed how file names are generated (again). All
- With the exception of series links, if you enter a link to an ao3 page that contains links to works or series, but does not support multiple pages of results, the script will loop infinitely. Most notably, this applies to user dashboard pages. If this happens, you can close the window to get out of the loop.
- When downloading missing fics from series, if you are logged in, and the downloader finds a link to a series that is inaccessible because you do not have permission to access the series page, the downloader will download all of the works linked on your user dashboard page, instead. Yes... really.
- Works that contain certain archive messages in either the work text or the tags may cause unexpected behavior. These problem phrases are:
- <!--CHECK-->Error 404<!--AO3_DELETED-->
- <!--CHECK-->This work could have adult content.<!--AO3_EXPLICIT-->
- <!--CHECK-->This work is only available to registered users of the Archive<!--AO3_LOCKED-->
## Troubleshooting
19 changes: 9 additions & 10 deletions ao3downloader/ao3.py
@@ -68,15 +68,15 @@ def get_work_links(self, link: str, metadata: bool) -> dict[str, dict]:

def get_work_links_recursive(self, links_list: dict[str, dict], link: str, visited_series: list[str], metadata: bool, soup: BeautifulSoup=None) -> None:

if parse_text.is_work(link, internal=False):
if parse_text.is_work(link):
if link not in links_list:
if metadata:
work_metadata = parse_soup.get_work_metadata(soup, link)
links_list[link] = work_metadata
else:
links_list[link] = None
elif parse_text.is_series(link, internal=False):
if self.series and link not in visited_series:
elif parse_text.is_series(link):
if link not in visited_series:
visited_series.append(link)
series_soup = self.repo.get_soup(link)
series_soup = self.proceed(series_soup)
@@ -87,7 +87,7 @@ def get_work_links_recursive(self, links_list: dict[str, dict], link: str, visit
while True:
self.fileops.write_log({'starting': link})
thesoup = self.repo.get_soup(link)
urls = parse_soup.get_work_and_series_urls(thesoup)
urls = parse_soup.get_work_and_series_urls(thesoup, self.series)
if len(urls) == 0: break
for url in urls:
self.get_work_links_recursive(links_list, url, visited_series, metadata, thesoup)
@@ -104,18 +104,17 @@ def download_recursive(self, link: str, log: dict, visited: list[str]) -> None:
if link in visited: return
visited.append(link)

if parse_text.is_series(link, internal=False):
if self.series:
log = {}
self.download_series(link, log, visited)
elif parse_text.is_work(link, internal=False):
if parse_text.is_work(link):
log = {}
self.download_work(link, log, None)
elif parse_text.is_series(link):
log = {}
self.download_series(link, log, visited)
elif strings.AO3_BASE_URL in link:
while True:
self.fileops.write_log({'starting': link})
thesoup = self.repo.get_soup(link)
urls = parse_soup.get_work_and_series_urls(thesoup)
urls = parse_soup.get_work_and_series_urls(thesoup, self.series)
if len(urls) == 0: break
for url in urls:
self.download_recursive(url, log, visited)
73 changes: 49 additions & 24 deletions ao3downloader/parse_soup.py
@@ -1,7 +1,8 @@
import re
import traceback
from typing import Any

from bs4 import BeautifulSoup
from bs4 import BeautifulSoup, ResultSet

from ao3downloader import parse_text, strings
from ao3downloader.exceptions import DownloadException, ProceedException
@@ -96,33 +97,57 @@ def get_series_info(soup: BeautifulSoup) -> dict:
def get_work_urls(soup: BeautifulSoup) -> list[str]:
"""Get all links to ao3 works on a page"""

work_urls = []
return list(dict.fromkeys(list(
map(lambda w: get_full_work_url(w.get('href')),
filter(lambda a : a.get('href') and parse_text.is_work(a.get('href')),
soup.find_all('a'))))))

# get links to all works on the page
all_links = soup.find_all('a')
for link in all_links:
href = link.get('href')
if href and parse_text.is_work(href):
url = strings.AO3_BASE_URL + href
work_urls.append(url)

return work_urls
def get_full_work_url(url: str) -> str:
"""Get full ao3 work url from partial url"""

work_number = parse_text.get_work_number(url)
return strings.AO3_BASE_URL + url.split(work_number)[0] + work_number

def get_work_and_series_urls(soup: BeautifulSoup) -> list[str]:
"""Get all links to ao3 works or series on a page"""

urls = []
def get_series_urls(soup: BeautifulSoup, get_all: bool) -> list[str]:
"""Get all links to ao3 series on a page"""

# get links to all works on the page
all_links = soup.find_all('a')
for link in all_links:
href = link.get('href')
if href and (parse_text.is_work(href) or parse_text.is_series(href)):
url = strings.AO3_BASE_URL + href
urls.append(url)
bookmarks = None if get_all else soup.find_all('li', class_='bookmark')

return list(dict.fromkeys(list(
map(lambda w: get_full_series_url(w.get('href')),
filter(lambda a : is_series(a, get_all, bookmarks),
soup.find_all('a'))))))


def is_series(element: Any, get_all: bool, bookmarks: ResultSet[Any]) -> bool:

series_number = parse_text.get_series_number(element.get('href'))

# it's not a series at all, so return false
if not series_number: return False

# it is a series and we want all of them, so return true
if get_all: return True

return urls
# check the bookmarks list to see if this is a series, and return true if it is
return len(list(filter(lambda x: f'series-{series_number}' in x.get('class'), bookmarks))) > 0


def get_full_series_url(url: str) -> str:
"""Get full ao3 series url from partial url"""

series_number = parse_text.get_series_number(url)
return strings.AO3_BASE_URL + url.split(series_number)[0] + series_number


def get_work_and_series_urls(soup: BeautifulSoup, get_all: bool=False) -> list[str]:
"""Get all links to ao3 works or series on a page"""

work_urls = get_work_urls(soup)
series_urls = get_series_urls(soup, get_all)
return work_urls + series_urls
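The refactored URL collectors rely on `dict.fromkeys` for order-preserving deduplication and on rebuilding each partial href into a canonical full URL. A hedged sketch of that pattern (the hrefs and the `get_full_work_url` helper body here are illustrative assumptions, not the repository code):

```python
# dict.fromkeys preserves first-seen order (guaranteed since Python 3.7)
# while dropping duplicates, so chapter links and repeated links on a page
# collapse to one canonical work URL each.

AO3_BASE_URL = 'https://archiveofourown.org'

def get_full_work_url(href: str) -> str:
    # keep everything up to and including the work number, dropping
    # trailing segments like '/chapters/456'
    number = href.split('/works/')[1].split('/')[0].split('#')[0]
    prefix = href.split(number)[0]
    return AO3_BASE_URL + prefix + number

hrefs = ['/works/123', '/works/123/chapters/456', '/works/789', '/works/123']
urls = list(dict.fromkeys(get_full_work_url(h) for h in hrefs))
# urls == ['https://archiveofourown.org/works/123',
#          'https://archiveofourown.org/works/789']
```

Normalizing chapter links down to their work URL before deduplication is what lets fix #102 treat `/works/123/chapters/456` as a link to work 123 without downloading it twice.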


def get_proceed_link(soup: BeautifulSoup) -> str:
@@ -225,15 +250,15 @@ def get_current_chapters(soup: BeautifulSoup) -> str:


def is_locked(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_LOCKED)
return soup.find('div', id='main', class_='sessions-new') is not None


def is_deleted(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_DELETED)
return soup.find('div', id='main', class_='error-404') is not None


def is_explicit(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_EXPLICIT)
return soup.find('p', class_='caution') is not None
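This hunk is fix #23: instead of searching page text for archive messages (which misfires when a fic itself contains a phrase like "Error 404"), the checks now inspect the page structure AO3 renders for each state. A sketch with minimal stand-in HTML (the snippets are assumptions, not real AO3 markup):

```python
# Structural detection: match on the element AO3 emits for each state,
# not on strings that could legitimately appear inside a work's text.
from bs4 import BeautifulSoup

def is_deleted(soup: BeautifulSoup) -> bool:
    return soup.find('div', id='main', class_='error-404') is not None

def is_explicit(soup: BeautifulSoup) -> bool:
    return soup.find('p', class_='caution') is not None

error_page = BeautifulSoup('<div id="main" class="error-404"></div>', 'html.parser')
fic_page = BeautifulSoup('<p>Error 404 appears in the fic text</p>', 'html.parser')
print(is_deleted(error_page))  # True
print(is_deleted(fic_page))    # False: the phrase alone no longer triggers it
```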


def is_failed_login(soup: BeautifulSoup) -> bool:
25 changes: 18 additions & 7 deletions ao3downloader/parse_text.py
@@ -1,5 +1,4 @@
import datetime
import re

from ao3downloader import strings

@@ -25,15 +24,27 @@ def get_file_type(filetype: str) -> str:


def get_work_number(link: str) -> str:
return link[link.find('/works/'):][7:]
return get_digits_after('/works/', link)


def is_work(link: str, internal: bool=True) -> bool:
return (link.startswith('/') or not internal) and re.compile(strings.AO3_WORK).match(link)
def get_series_number(link: str) -> str:
return get_digits_after('/series/', link)


def is_series(link: str, internal: bool=True) -> bool:
return (link.startswith('/') or not internal) and re.compile(strings.AO3_SERIES).match(link)
def is_work(link: str) -> bool:
return get_work_number(link) != None


def is_series(link: str) -> bool:
return get_series_number(link) != None


def get_digits_after(test: str, url: str) -> str:
index = str.find(url, test)
if index == -1: return None
digits = get_num_from_link(url, index + len(test))
if not digits or len(digits) == 0: return None
return digits


def get_next_page(link: str) -> str:
@@ -62,7 +73,7 @@ def get_page_number(link: str) -> int:


def get_num_from_link(link: str, start: int) -> str:
end = start + 1
end = start
while end < len(link) and str.isdigit(link[start:end+1]):
end = end + 1
return link[start:end]
2 changes: 1 addition & 1 deletion ao3downloader/parse_xml.py
@@ -9,7 +9,7 @@ def get_bookmark_list(bookmark_xml: ET.Element, exclude_toread: bool) -> list[di
attributes = child.attrib
# only include valid ao3 links
link = attributes['href']
if 'archiveofourown.org' in link and (parse_text.is_work(link, internal=False) or parse_text.is_series(link, internal=False)):
if 'archiveofourown.org' in link and (parse_text.is_work(link) or parse_text.is_series(link)):
# if exclude_toread is true, only include read bookmarks
if exclude_toread:
if not 'toread' in attributes:
1 change: 1 addition & 0 deletions ao3downloader/repo.py
@@ -63,6 +63,7 @@ def my_get(self, url: str) -> requests.Response:
pause_time = int(response.headers['retry-after'])
except:
pause_time = 300 # default to 5 minutes in case there was a problem getting retry-after
if pause_time <= 0: pause_time = 300 # default to 5 minutes if retry-after is an invalid value
now = datetime.datetime.now()
later = now + datetime.timedelta(0, pause_time)
print(strings.MESSAGE_TOO_MANY_REQUESTS.format(pause_time, now.strftime('%H:%M:%S'), later.strftime('%H:%M:%S')))
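The one-line guard added here covers the commit's "rare issue when retry-after is non-positive": a zero or negative header value would otherwise produce a useless (or nonsensical) pause. A minimal sketch of the hardened logic, with a plain dict standing in for `requests.Response.headers` (the function name is an assumption for illustration):

```python
# Parse the Retry-After header, falling back to 5 minutes when it is
# missing, malformed, or non-positive.

DEFAULT_PAUSE = 300  # seconds

def get_pause_time(headers: dict) -> int:
    try:
        pause_time = int(headers['retry-after'])
    except (KeyError, ValueError):  # header absent or not an integer
        pause_time = DEFAULT_PAUSE
    if pause_time <= 0:  # guard against zero or negative values
        pause_time = DEFAULT_PAUSE
    return pause_time

print(get_pause_time({'retry-after': '120'}))  # 120
print(get_pause_time({'retry-after': '-5'}))   # 300
```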
8 changes: 1 addition & 7 deletions ao3downloader/strings.py
@@ -62,7 +62,7 @@
AO3_PROMPT_LAST_PAGE = 'do you want to start downloading from the page you stopped on last time? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_PAGES = 'please enter page number to stop on. enter 0 to download all pages.'
AO3_PROMPT_IMAGES = 'do you want to download embedded images? (will be saved separately) ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_SERIES = 'do you want to get works from series links? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_SERIES = 'do you want to get works from all encountered series links? (bookmarked series will always be downloaded, regardless of this option) ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_METADATA = 'do you want to include work metadata? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_FILE_INPUT = 'please enter complete file path (including file extension) to file containing links to download (must be a text file with one link on each line)'
AO3_INFO_LOGIN = 'logging in'
@@ -117,12 +117,6 @@
AO3_BASE_URL = 'https://archiveofourown.org'
AO3_LOGIN_URL = 'https://archiveofourown.org/users/login'

AO3_WORK = r'.*\/works\/\d+$'
AO3_SERIES = r'.*\/series\/\d+$'

AO3_LOCKED = 'This work is only available to registered users of the Archive'
AO3_DELETED = 'Error 404'
AO3_EXPLICIT = 'This work could have adult content.'
AO3_FAILED_LOGIN = 'The password or user name you entered doesn\'t match our records.'
AO3_PROCEED = 'Yes, Continue'
AO3_MARK_READ = 'Mark as Read'
Binary file modified requirements.txt
Binary file not shown.
Empty file added test/__init__.py
Empty file.
41 changes: 41 additions & 0 deletions test/__snapshots__/test_parse_soup.ambr
@@ -0,0 +1,41 @@
# serializer version: 1
# name: test_get_series_urls_all
list([
'https://archiveofourown.org/series/2065602',
'https://archiveofourown.org/series/3738976',
'https://archiveofourown.org/series/2627935',
'https://archiveofourown.org/series/3108957',
'https://archiveofourown.org/series/3078150',
'https://archiveofourown.org/series/3078153',
'https://archiveofourown.org/series/3196530',
'https://archiveofourown.org/series/2643097',
])
# ---
# name: test_get_series_urls_bookmarks
list([
'https://archiveofourown.org/series/2065602',
'https://archiveofourown.org/series/3738976',
])
# ---
# name: test_get_work_urls
list([
'https://archiveofourown.org/works/34816549',
'https://archiveofourown.org/works/35778589',
'https://archiveofourown.org/works/41655369',
'https://archiveofourown.org/works/34763164',
'https://archiveofourown.org/works/41214669',
'https://archiveofourown.org/works/41822007',
'https://archiveofourown.org/works/342122',
'https://archiveofourown.org/works/26958667',
'https://archiveofourown.org/works/33658237',
'https://archiveofourown.org/works/36398359',
'https://archiveofourown.org/works/34702543',
'https://archiveofourown.org/works/35369560',
'https://archiveofourown.org/works/18623245',
'https://archiveofourown.org/works/34348333',
'https://archiveofourown.org/works/28032981',
'https://archiveofourown.org/works/24412372',
'https://archiveofourown.org/works/557020',
'https://archiveofourown.org/works/28968675',
])
# ---
2,798 changes: 2,798 additions & 0 deletions test/fixtures/bookmarks.html

Large diffs are not rendered by default.

