Kitchen-sink bugfixes
- fixes #117: distinguish between actual series bookmarks and bookmarked works that happen to be in a series
- fixes #102: properly identify links to individual chapters of a work as "work" links
- fixes #23: exception when work text contains the string "This work could have adult content"
- fixes a rare issue when retry-after is non-positive
- ✨ add tests ✨
nianeyna committed Jan 21, 2024
1 parent e826fc6 commit a910039
Showing 19 changed files with 6,059 additions and 60 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,5 +1,4 @@
venv/
.vscode/
__pycache__/
*.pyc
downloads/
12 changes: 2 additions & 10 deletions README.md
@@ -1,7 +1,3 @@
## Update about the current DDoS attack on AO3 (July 16, 2023)

The script seems to be working again based on my testing. It's slower than usual due to the rate limit adjustments (you'll see the "ao3 has requested a break" message more often, and the breaks will be longer). Other than that though, things look pretty stable for now. Happy downloading!

## What is this?

This is a program intended to help you download fanfiction from the [Archive of Our Own](https://archiveofourown.org/) in bulk. This program is primarily intended to work with links to the Archive of Our Own itself, but has a secondary function of downloading any [Pinboard](https://pinboard.in/) bookmarks that link to the Archive of Our Own. You can ignore the Pinboard functionality if you don't know what Pinboard is or don't use Pinboard.
@@ -32,7 +28,7 @@ As of January 17, 2023 I have changed how file names are generated (again). All

## Instructions

1. install [python](https://www.python.org/downloads/). make sure to install version 3.9.0 or later. see [announcements](#announcements) for the most recent version of python that is confirmed to work with the script - when in doubt, install that version.
1. install python [from this link](https://www.python.org/downloads/release/python-3114/). **do not install the latest version of python**, or a version of python lower than 3.9.0.
2. download the repository as a zip file. the "repository" means the folder containing the code.
- if you are reading this on [github](https://github.com/nianeyna/ao3downloader), you can download the repository by clicking on the "Code" button in github and selecting "Download ZIP"
- if you are reading this on [my website](https://nianeyna.dev/ao3downloader/), you can download the repository by clicking the button at the top of the page that says "Click to Download"
@@ -76,7 +72,7 @@ As of January 17, 2023 I have changed how file names are generated (again). All
- **IMPORTANT**: some of your input choices are saved in a file called <!--CHECK-->settings.json<!--SETTINGS_FILE_NAME--> (in the same folder as ao3downloader.py). In some cases you will not be able to change these choices unless you clear your settings by deleting <!--CHECK-->settings.json<!--SETTINGS_FILE_NAME--> (or editing it, if you are comfortable with json). In addition, please note that saved settings include passwords and keys and are saved in plain text. **Use appropriate caution with this file.**
- **The purpose of entering your ao3 login information** is to download archive-locked works or anything else that is not visible when you are not logged in. If you don't care about that, there is no need to enter your login information.
- **Ao3 limits the number of requests** a single user can make to the site in a given time period. When this limit is reached, the script will pause for the amount of time (usually a few minutes) that Ao3 requests. When this happens, the start time, end time, and length of the pause in seconds will be printed to the console. If you try to access Ao3 from your browser during this period, you will see a "Retry later" message. Don't be alarmed by this - it's normal, and you aren't in trouble. Simply wait for the specified amount of time and then refresh the page. Other than during these required pauses, you can use Ao3 as normal while the script is running.
- **If you choose to '<!--CHECK-->get works from series links<!--AO3_PROMPT_SERIES-->'** then if the script encounters a work that is part of a series, it will also download the entire series that the work is a part of. This can _dramatically_ extend the amount of time the script takes to run. If you don't want this, choose 'n' when you get this prompt. (Note that this will cause the program to ignore _all_ series links, including e.g. series that you have bookmarked.)
- **If you choose to '<!--CHECK-->get works from all encountered series links<!--AO3_PROMPT_SERIES-->'** then if the script encounters a work that is part of a series, it will also download the entire series that the work is a part of. This can _dramatically_ extend the amount of time the script takes to run. If you don't want this, choose 'n' when you get this prompt. (Series that you have bookmarked directly will always be fully downloaded, regardless of what you choose here.)
- **If you choose to '<!--CHECK-->download embedded images<!--AO3_PROMPT_IMAGES-->'** the script will look for image links on all works it downloads and attempt to save those images to an '<!--CHECK-->images<!--IMAGE_FOLDER_NAME-->' subfolder. Images will be titled with the name of the fic + 'imgxxx' to distinguish them.
- Note that this feature does not encode any association between the downloaded images and the fic file aside from the file name.
- Most file formats will include embedded image files anyway, regardless of whether you choose this option. I have confirmed this for PDF, EPUB, MOBI, and AZW3 file formats. (If you saw me contradict this in an earlier version of this readme... no you didn't)
@@ -99,10 +95,6 @@ As of January 17, 2023 I have changed how file names are generated (again). All
- With the exception of series links, if you enter a link to an ao3 page that contains links to works or series, but does not support multiple pages of results, the script will loop infinitely. Most notably, this applies to user dashboard pages. If this happens, you can close the window to get out of the loop.
- When downloading missing fics from series, if you are logged in, and the downloader finds a link to a series that is inaccessible because you do not have permission to access the series page, the downloader will download all of the works linked on your user dashboard page, instead. Yes... really.
- Works that contain certain archive messages in either the work text or the tags may cause unexpected behavior. These problem phrases are:
- <!--CHECK-->Error 404<!--AO3_DELETED-->
- <!--CHECK-->This work could have adult content.<!--AO3_EXPLICIT-->
- <!--CHECK-->This work is only available to registered users of the Archive<!--AO3_LOCKED-->
## Troubleshooting
19 changes: 9 additions & 10 deletions ao3downloader/ao3.py
@@ -68,15 +68,15 @@ def get_work_links(self, link: str, metadata: bool) -> dict[str, dict]:

def get_work_links_recursive(self, links_list: dict[str, dict], link: str, visited_series: list[str], metadata: bool, soup: BeautifulSoup=None) -> None:

if parse_text.is_work(link, internal=False):
if parse_text.is_work(link):
if link not in links_list:
if metadata:
work_metadata = parse_soup.get_work_metadata(soup, link)
links_list[link] = work_metadata
else:
links_list[link] = None
elif parse_text.is_series(link, internal=False):
if self.series and link not in visited_series:
elif parse_text.is_series(link):
if link not in visited_series:
visited_series.append(link)
series_soup = self.repo.get_soup(link)
series_soup = self.proceed(series_soup)
@@ -87,7 +87,7 @@ def get_work_links_recursive(self, links_list: dict[str, dict], link: str, visit
while True:
self.fileops.write_log({'starting': link})
thesoup = self.repo.get_soup(link)
urls = parse_soup.get_work_and_series_urls(thesoup)
urls = parse_soup.get_work_and_series_urls(thesoup, self.series)
if len(urls) == 0: break
for url in urls:
self.get_work_links_recursive(links_list, url, visited_series, metadata, thesoup)
@@ -104,18 +104,17 @@ def download_recursive(self, link: str, log: dict, visited: list[str]) -> None:
if link in visited: return
visited.append(link)

if parse_text.is_series(link, internal=False):
if self.series:
log = {}
self.download_series(link, log, visited)
elif parse_text.is_work(link, internal=False):
if parse_text.is_work(link):
log = {}
self.download_work(link, log, None)
elif parse_text.is_series(link):
log = {}
self.download_series(link, log, visited)
elif strings.AO3_BASE_URL in link:
while True:
self.fileops.write_log({'starting': link})
thesoup = self.repo.get_soup(link)
urls = parse_soup.get_work_and_series_urls(thesoup)
urls = parse_soup.get_work_and_series_urls(thesoup, self.series)
if len(urls) == 0: break
for url in urls:
self.download_recursive(url, log, visited)
73 changes: 49 additions & 24 deletions ao3downloader/parse_soup.py
@@ -1,7 +1,8 @@
import re
import traceback
from typing import Any

from bs4 import BeautifulSoup
from bs4 import BeautifulSoup, ResultSet

from ao3downloader import parse_text, strings
from ao3downloader.exceptions import DownloadException, ProceedException
@@ -96,33 +97,57 @@ def get_series_info(soup: BeautifulSoup) -> dict:
def get_work_urls(soup: BeautifulSoup) -> list[str]:
"""Get all links to ao3 works on a page"""

work_urls = []
return list(dict.fromkeys(list(
map(lambda w: get_full_work_url(w.get('href')),
filter(lambda a : a.get('href') and parse_text.is_work(a.get('href')),
soup.find_all('a'))))))

# get links to all works on the page
all_links = soup.find_all('a')
for link in all_links:
href = link.get('href')
if href and parse_text.is_work(href):
url = strings.AO3_BASE_URL + href
work_urls.append(url)

return work_urls
def get_full_work_url(url: str) -> str:
"""Get full ao3 work url from partial url"""

work_number = parse_text.get_work_number(url)
return strings.AO3_BASE_URL + url.split(work_number)[0] + work_number

def get_work_and_series_urls(soup: BeautifulSoup) -> list[str]:
"""Get all links to ao3 works or series on a page"""

urls = []
def get_series_urls(soup: BeautifulSoup, get_all: bool) -> list[str]:
"""Get all links to ao3 series on a page"""

# get links to all works on the page
all_links = soup.find_all('a')
for link in all_links:
href = link.get('href')
if href and (parse_text.is_work(href) or parse_text.is_series(href)):
url = strings.AO3_BASE_URL + href
urls.append(url)
bookmarks = None if get_all else soup.find_all('li', class_='bookmark')

return list(dict.fromkeys(list(
map(lambda w: get_full_series_url(w.get('href')),
filter(lambda a : is_series(a, get_all, bookmarks),
soup.find_all('a'))))))


def is_series(element: Any, get_all: bool, bookmarks: ResultSet[Any]) -> bool:

series_number = parse_text.get_series_number(element.get('href'))

# it's not a series at all, so return false
if not series_number: return False

# it is a series and we want all of them, so return true
if get_all: return True

return urls
# check the bookmarks list to see if this is a series, and return true if it is
return len(list(filter(lambda x: f'series-{series_number}' in x.get('class'), bookmarks))) > 0


def get_full_series_url(url: str) -> str:
"""Get full ao3 series url from partial url"""

series_number = parse_text.get_series_number(url)
return strings.AO3_BASE_URL + url.split(series_number)[0] + series_number


def get_work_and_series_urls(soup: BeautifulSoup, get_all: bool=False) -> list[str]:
"""Get all links to ao3 works or series on a page"""

work_urls = get_work_urls(soup)
series_urls = get_series_urls(soup, get_all)
return work_urls + series_urls
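The refactored URL collectors rely on `dict.fromkeys` for order-preserving deduplication and on rebuilding each partial href into a canonical full URL. A hedged sketch of that pattern (the hrefs and the `get_full_work_url` helper body here are illustrative assumptions, not the repository code):

```python
# dict.fromkeys preserves first-seen order (guaranteed since Python 3.7)
# while dropping duplicates, so chapter links and repeated links on a page
# collapse to one canonical work URL each.

AO3_BASE_URL = 'https://archiveofourown.org'

def get_full_work_url(href: str) -> str:
    # keep everything up to and including the work number, dropping
    # trailing segments like '/chapters/456'
    number = href.split('/works/')[1].split('/')[0].split('#')[0]
    prefix = href.split(number)[0]
    return AO3_BASE_URL + prefix + number

hrefs = ['/works/123', '/works/123/chapters/456', '/works/789', '/works/123']
urls = list(dict.fromkeys(get_full_work_url(h) for h in hrefs))
# urls == ['https://archiveofourown.org/works/123',
#          'https://archiveofourown.org/works/789']
```

Normalizing chapter links down to their work URL before deduplication is what lets fix #102 treat `/works/123/chapters/456` as a link to work 123 without downloading it twice.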


def get_proceed_link(soup: BeautifulSoup) -> str:
@@ -225,15 +250,15 @@ def get_current_chapters(soup: BeautifulSoup) -> str:


def is_locked(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_LOCKED)
return soup.find('div', id='main', class_='sessions-new') is not None


def is_deleted(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_DELETED)
return soup.find('div', id='main', class_='error-404') is not None


def is_explicit(soup: BeautifulSoup) -> bool:
return string_exists(soup, strings.AO3_EXPLICIT)
return soup.find('p', class_='caution') is not None
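This hunk is fix #23: instead of searching page text for archive messages (which misfires when a fic itself contains a phrase like "Error 404"), the checks now inspect the page structure AO3 renders for each state. A sketch with minimal stand-in HTML (the snippets are assumptions, not real AO3 markup):

```python
# Structural detection: match on the element AO3 emits for each state,
# not on strings that could legitimately appear inside a work's text.
from bs4 import BeautifulSoup

def is_deleted(soup: BeautifulSoup) -> bool:
    return soup.find('div', id='main', class_='error-404') is not None

def is_explicit(soup: BeautifulSoup) -> bool:
    return soup.find('p', class_='caution') is not None

error_page = BeautifulSoup('<div id="main" class="error-404"></div>', 'html.parser')
fic_page = BeautifulSoup('<p>Error 404 appears in the fic text</p>', 'html.parser')
print(is_deleted(error_page))  # True
print(is_deleted(fic_page))    # False: the phrase alone no longer triggers it
```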


def is_failed_login(soup: BeautifulSoup) -> bool:
25 changes: 18 additions & 7 deletions ao3downloader/parse_text.py
@@ -1,5 +1,4 @@
import datetime
import re

from ao3downloader import strings

@@ -25,15 +24,27 @@ def get_file_type(filetype: str) -> str:


def get_work_number(link: str) -> str:
return link[link.find('/works/'):][7:]
return get_digits_after('/works/', link)


def is_work(link: str, internal: bool=True) -> bool:
return (link.startswith('/') or not internal) and re.compile(strings.AO3_WORK).match(link)
def get_series_number(link: str) -> str:
return get_digits_after('/series/', link)


def is_series(link: str, internal: bool=True) -> bool:
return (link.startswith('/') or not internal) and re.compile(strings.AO3_SERIES).match(link)
def is_work(link: str) -> bool:
return get_work_number(link) != None


def is_series(link: str) -> bool:
return get_series_number(link) != None


def get_digits_after(test: str, url: str) -> str:
index = str.find(url, test)
if index == -1: return None
digits = get_num_from_link(url, index + len(test))
if not digits or len(digits) == 0: return None
return digits


def get_next_page(link: str) -> str:
@@ -62,7 +73,7 @@ def get_page_number(link: str) -> int:


def get_num_from_link(link: str, start: int) -> str:
end = start + 1
end = start
while end < len(link) and str.isdigit(link[start:end+1]):
end = end + 1
return link[start:end]
2 changes: 1 addition & 1 deletion ao3downloader/parse_xml.py
@@ -9,7 +9,7 @@ def get_bookmark_list(bookmark_xml: ET.Element, exclude_toread: bool) -> list[di
attributes = child.attrib
# only include valid ao3 links
link = attributes['href']
if 'archiveofourown.org' in link and (parse_text.is_work(link, internal=False) or parse_text.is_series(link, internal=False)):
if 'archiveofourown.org' in link and (parse_text.is_work(link) or parse_text.is_series(link)):
# if exclude_toread is true, only include read bookmarks
if exclude_toread:
if not 'toread' in attributes:
1 change: 1 addition & 0 deletions ao3downloader/repo.py
@@ -63,6 +63,7 @@ def my_get(self, url: str) -> requests.Response:
pause_time = int(response.headers['retry-after'])
except:
pause_time = 300 # default to 5 minutes in case there was a problem getting retry-after
if pause_time <= 0: pause_time = 300 # default to 5 minutes if retry-after is an invalid value
now = datetime.datetime.now()
later = now + datetime.timedelta(0, pause_time)
print(strings.MESSAGE_TOO_MANY_REQUESTS.format(pause_time, now.strftime('%H:%M:%S'), later.strftime('%H:%M:%S')))
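The one-line guard added here covers the commit's "rare issue when retry-after is non-positive": a zero or negative header value would otherwise produce a useless (or nonsensical) pause. A minimal sketch of the hardened logic, with a plain dict standing in for `requests.Response.headers` (the function name is an assumption for illustration):

```python
# Parse the Retry-After header, falling back to 5 minutes when it is
# missing, malformed, or non-positive.

DEFAULT_PAUSE = 300  # seconds

def get_pause_time(headers: dict) -> int:
    try:
        pause_time = int(headers['retry-after'])
    except (KeyError, ValueError):  # header absent or not an integer
        pause_time = DEFAULT_PAUSE
    if pause_time <= 0:  # guard against zero or negative values
        pause_time = DEFAULT_PAUSE
    return pause_time

print(get_pause_time({'retry-after': '120'}))  # 120
print(get_pause_time({'retry-after': '-5'}))   # 300
```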
8 changes: 1 addition & 7 deletions ao3downloader/strings.py
@@ -62,7 +62,7 @@
AO3_PROMPT_LAST_PAGE = 'do you want to start downloading from the page you stopped on last time? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_PAGES = 'please enter page number to stop on. enter 0 to download all pages.'
AO3_PROMPT_IMAGES = 'do you want to download embedded images? (will be saved separately) ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_SERIES = 'do you want to get works from series links? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_SERIES = 'do you want to get works from all encountered series links? (bookmarked series will always be downloaded, regardless of this option) ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_METADATA = 'do you want to include work metadata? ({}/{})'.format(PROMPT_YES, PROMPT_NO)
AO3_PROMPT_FILE_INPUT = 'please enter complete file path (including file extension) to file containing links to download (must be a text file with one link on each line)'
AO3_INFO_LOGIN = 'logging in'
@@ -117,12 +117,6 @@
AO3_BASE_URL = 'https://archiveofourown.org'
AO3_LOGIN_URL = 'https://archiveofourown.org/users/login'

AO3_WORK = r'.*\/works\/\d+$'
AO3_SERIES = r'.*\/series\/\d+$'

AO3_LOCKED = 'This work is only available to registered users of the Archive'
AO3_DELETED = 'Error 404'
AO3_EXPLICIT = 'This work could have adult content.'
AO3_FAILED_LOGIN = 'The password or user name you entered doesn\'t match our records.'
AO3_PROCEED = 'Yes, Continue'
AO3_MARK_READ = 'Mark as Read'
Binary file modified requirements.txt
Binary file not shown.
Empty file added test/__init__.py
Empty file.
41 changes: 41 additions & 0 deletions test/__snapshots__/test_parse_soup.ambr
@@ -0,0 +1,41 @@
# serializer version: 1
# name: test_get_series_urls_all
list([
'https://archiveofourown.org/series/2065602',
'https://archiveofourown.org/series/3738976',
'https://archiveofourown.org/series/2627935',
'https://archiveofourown.org/series/3108957',
'https://archiveofourown.org/series/3078150',
'https://archiveofourown.org/series/3078153',
'https://archiveofourown.org/series/3196530',
'https://archiveofourown.org/series/2643097',
])
# ---
# name: test_get_series_urls_bookmarks
list([
'https://archiveofourown.org/series/2065602',
'https://archiveofourown.org/series/3738976',
])
# ---
# name: test_get_work_urls
list([
'https://archiveofourown.org/works/34816549',
'https://archiveofourown.org/works/35778589',
'https://archiveofourown.org/works/41655369',
'https://archiveofourown.org/works/34763164',
'https://archiveofourown.org/works/41214669',
'https://archiveofourown.org/works/41822007',
'https://archiveofourown.org/works/342122',
'https://archiveofourown.org/works/26958667',
'https://archiveofourown.org/works/33658237',
'https://archiveofourown.org/works/36398359',
'https://archiveofourown.org/works/34702543',
'https://archiveofourown.org/works/35369560',
'https://archiveofourown.org/works/18623245',
'https://archiveofourown.org/works/34348333',
'https://archiveofourown.org/works/28032981',
'https://archiveofourown.org/works/24412372',
'https://archiveofourown.org/works/557020',
'https://archiveofourown.org/works/28968675',
])
# ---
2,798 changes: 2,798 additions & 0 deletions test/fixtures/bookmarks.html

Large diffs are not rendered by default.

