Fix EvilAngel performerByURL -> Refactor Algolia scraping #2177

nrg101 · 2025-01-24T02:41:25Z

Overview

I started this as an attempt to fix the performerByURL scraping for EvilAngel, but it turned into the long overdue effort to overhaul the Algolia script.

Scraper type(s)

Outstanding tasks

implement search match scoring/comparison
implement studio name determination logic
handle multiple sites searching (is this ever needed? I'm going to say no at this point)
match by file info (e.g. duration, resolution, whatever)

Examples to test

performerByURL

performerByName

Create Performer > Scrape with... > EvilAngel > Performer Name = Ariel

select Ariel Demure from the results

performerByFragment

do the Create Performer search action above

see the scraped performer

go to an existing performer that has scenes at Evil Angel > Edit > Scrape with... > EvilAngel

select the performer from the results
see the additional/new/different scraped data

sceneByURL

movieByURL

galleryByURL

Short description

Problem

Recently, many Algolia-based sites have closed the free access to pages like:

/en/videos
/en/pornstars
/en/movies
/en/video/evilangel/TS-SOPHIA-MONTESINO-Spunky-Anal-Date/256714
/en/movie/Transgressive-25/126353
/en/pornstar/view/Brittney-Kade/92399

This means you can no longer browse videos, performers and movies on sites like evilangel.com, genderxfilms.com, and a whole load of other sites. This is especially annoying for performer scraping, as that is not implemented in the current Algolia.py.

Solution

There is actually a full Python client for the Algolia API, and all that's needed is fetching the appId and apiKey, and setting the host and referer headers. By referring to that client's docs, the current Algolia.py, and the Aylo API script, I've cobbled together a working:

performerByURL -> lookup performer by URL (the ID at the end)
performerByName -> searches for up to 20 performers matching a text string
performerByFragment -> looks up performer from one of the search results from performerByName
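
As a rough illustration of the performerByURL lookup listed above, here is a minimal sketch. It assumes the performer ID is the last path segment of the URL and that the search goes through Algolia's REST query endpoint rather than the Python client. The index name `all_actors`, the `actor_id` filter attribute, the app ID and key values, and both helper names are assumptions for illustration, not taken from AlgoliaAPI.py.

```python
from typing import Optional
from urllib.parse import urlencode, urlparse


def performer_id_from_url(url: str) -> Optional[str]:
    """Return the trailing numeric ID from a performer URL, or None."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return last_segment if last_segment.isdigit() else None


def build_query_kwargs(app_id: str, api_key: str, site: str,
                       index: str, filters: str) -> dict:
    """Build keyword arguments for an HTTP POST to Algolia's query endpoint.

    Nothing is sent here; the result could be passed to an HTTP client,
    e.g. requests.post(**kwargs).
    """
    return {
        "url": f"https://{app_id}-dsn.algolia.net/1/indexes/{index}/query",
        "headers": {
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            # the site the credentials were fetched from
            "Referer": site,
        },
        # Algolia accepts search parameters as a URL-encoded string
        "json": {"params": urlencode({"filters": filters, "hitsPerPage": 20})},
    }


performer_id = performer_id_from_url(
    "https://www.evilangel.com/en/pornstar/view/Brittney-Kade/92399"
)
kwargs = build_query_kwargs(
    "EXAMPLEAPPID",            # placeholder appId
    "example-api-key",         # placeholder apiKey
    "https://www.evilangel.com",
    "all_actors",              # hypothetical index name
    f"actor_id:{performer_id}",  # hypothetical filter attribute
)
```

Keeping the request construction separate from the network call makes it straightforward to check the URL and headers without hitting the API.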

The current Algolia.py is a whole load of jank taped together and is long overdue an overhaul. Rather than trying to refactor it in-place, I've decided to make a new script called AlgoliaAPI.py, so that each site scraper can be migrated over individually.

The good parts of the existing Algolia.py should be included now.

nrg101 · 2025-01-24T18:14:40Z

I've done a first pass of implementing all the scrapers for EvilAngel.yml with the new AlgoliaAPI.py.

There are some TODOs for handling multiple sites, and doing some form of results score matching (e.g. galleryByFragment) when the operation finds multiple API results, but the operation only returns a single scraper result.
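
One possible shape for the results score matching mentioned above, sketched with the standard library's `difflib`. This is not the match ratio logic from the existing Algolia.py; the `title` key, the `0.6` threshold, and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def title_ratio(a: str, b: str) -> float:
    """Case-insensitive similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_hit(local_title: str, api_hits: list, minimum: float = 0.6) -> Optional[dict]:
    """Return the API hit whose title is closest to local_title.

    Returns None when there are no hits or the best score is below `minimum`,
    so the caller returns no result rather than a poor match.
    """
    if not api_hits:
        return None
    scored = [(title_ratio(local_title, hit["title"]), hit) for hit in api_hits]
    score, hit = max(scored, key=lambda pair: pair[0])
    return hit if score >= minimum else None


hits = [{"title": "Transgressive 24"}, {"title": "Transgressive 25"}]
match = best_hit("transgressive 25", hits)
```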

nrg101 · 2025-01-24T18:25:58Z

@Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...

Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.:

galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py
anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with
handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site

I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

ltgorman · 2025-01-24T22:48:59Z

@Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...

Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.:

galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py

anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with

handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site

I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

Implementing markers would be nice.

Maista6969

I've been wanting to rewrite Algolia for a long time, thank you for contributing this! I certainly agree that the old Algolia has become pretty crufty and I think this is an excellent start on the Road to Refactor 😁

The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py
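
A minimal sketch of that extra layer of indirection, assuming the shared module accepts an optional post-processing callback from a site-specific script. The function names, the dictionary shapes, and the `studio_name` key are hypothetical; the "Transfixed Muses" to "Transfixed" mapping is borrowed from a commit message later in this PR purely as an example.

```python
from typing import Callable, Optional

# --- generic API module (hypothetical interface) ---

def to_scraped_scene(api_hit: dict,
                     postprocess: Optional[Callable[[dict, dict], dict]] = None) -> dict:
    """Convert an API hit to a scraped scene, then apply any site-specific fixes."""
    scene = {
        "title": api_hit.get("title"),
        "studio": {"name": api_hit.get("studio_name")},
    }
    if postprocess is not None:
        scene = postprocess(scene, api_hit)
    return scene


# --- site-specific script (hypothetical) ---

STUDIO_REMAP = {"Transfixed Muses": "Transfixed"}


def remap_studio(scene: dict, api_hit: dict) -> dict:
    """Replace studio names that need special handling for this site."""
    name = scene["studio"]["name"]
    scene["studio"]["name"] = STUDIO_REMAP.get(name, name)
    return scene


scene = to_scraped_scene(
    {"title": "Example", "studio_name": "Transfixed Muses"},
    postprocess=remap_studio,
)
```

Sites that need no special handling would simply omit the callback and use the general API results unchanged.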

scrapers/Algolia/AlgoliaAPI.py

nrg101 · 2025-01-25T01:18:26Z

The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py

I did wonder what else would be needed, like apart from the studio remapping stuff...

I thought that subclassing or importing functions from a "base" script (similar to the Aylo implementation) might be more complex than it's worth (to give flexibility that ultimately isn't needed).

With that in mind, I had a think about how the scraper configuration YAML could be used for something like the studio name mapping, and came up with a possible solution like this:

import ast

# the API hit dictionary
api_hit = {
    'studio_name': 'Enid Blyton',
    'serie_name': 'Groovy Gang',
    'channel_name': 'Happy Joy',
    'sitename': 'thisthing',
    'segment': 'something',
}

# these could come in via the `args["extra"]` list of strings
conditions_and_values_to_assign = [
    "api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']",
    "api_hit['segment'] == 'something' => 'a fixed value'",
    "api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']",
]

for condition_and_value_to_assign in conditions_and_values_to_assign:
    condition, value_to_assign = condition_and_value_to_assign.split(' => ')
    # Parsing and evaluating the condition
    parsed_condition = ast.parse(condition, mode='eval')
    if eval(compile(parsed_condition, filename="", mode="eval")):
        new_variable = eval(value_to_assign)
    else:
        new_variable = 'default_value'

    print(condition_and_value_to_assign)
    print(new_variable)
    print()

When run, this outputs:

api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']
Happy Joy

api_hit['segment'] == 'something' => 'a fixed value'
a fixed value

api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']
default_value

I'm not super excited about the use of eval, but it could be a solution for the studio mapping logic
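
For comparison, a sketch of how the same kind of mapping could be expressed without `eval`, by restricting rules to a simple equality test. The rule syntax (`field=value => result`, with `@name` meaning "copy that field from the hit") and the `apply_rules` helper are made up for this illustration. Unlike the loop above, which evaluates every rule independently, this version returns the value of the first rule that matches.

```python
def apply_rules(api_hit: dict, rules: list, default: str = "default_value") -> str:
    """Return the result of the first rule whose condition matches the hit."""
    for rule in rules:
        condition, _, value = rule.partition(" => ")
        field, _, expected = condition.partition("=")
        if api_hit.get(field.strip()) == expected.strip():
            value = value.strip()
            # "@name" copies another field from the hit; anything else is a literal
            if value.startswith("@"):
                return api_hit.get(value[1:], default)
            return value
    return default


api_hit = {
    "studio_name": "Enid Blyton",
    "serie_name": "Groovy Gang",
    "channel_name": "Happy Joy",
    "sitename": "thisthing",
    "segment": "something",
}

rules = [
    "studio_name=Not A Match => @serie_name",
    "studio_name=Enid Blyton => @channel_name",
    "segment=something => a fixed value",
]
studio = apply_rules(api_hit, rules)
```

Because the rule strings are only ever split and compared, nothing from the YAML is executed as code.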

Maista6969 · 2025-01-25T01:25:25Z

I see what you mean here, but I feel like dynamically evaluating code from a YAML file is even more complex than just having one Python script that calls another Python script 😅

For most sites we might not even need any special handling, see for example the True Amateurs scraper which can just use the general API results and so doesn't have a separate Python script 🙂

edit: whoops originally linked to the wrong scraper here, not Trans Angels but True Amateurs

nrg101 · 2025-01-25T01:32:38Z

Implementing markers would be nice.

You'll have to enlighten me, as in:

what is a marker?
what in the Algolia API provides data for markers?
how do markers get saved/persisted?

I can't see how any of the scraped models provide any marker feature.

Maista6969 · 2025-01-25T01:42:08Z

Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped

It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers

I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

ltgorman · 2025-01-25T02:07:50Z

Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped

It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers

I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

If you look at the Adult Time JSON, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

Maista6969 · 2025-01-25T12:13:50Z

If you look at the Adult Time JSON, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

Thanks, I wasn't aware that they provided these :) I'll make a note of it for when we expand the use of this to AdultTime and the other sites that can use this API 👍

stg-annon · 2025-01-25T17:17:33Z

Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they are integrated as something we can pass to Stash like any other scraped metadata. But if you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag, and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be the ability to hook a post-scrape update.

Maista6969 · 2025-01-25T17:30:00Z

Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they are integrated as something we can pass to Stash like any other scraped metadata. But if you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag, and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be the ability to hook a post-scrape update.

Scrapers can't register hooks, but I see your point in that we could maintain a separate plugin for this 👍

nrg101 · 2025-01-27T00:02:48Z

Implemented studio name determination for:

studios listed in EvilAngel.yml
TransPlaytime, as those scenes have evilangel.com URLs

nrg101 · 2025-01-30T01:48:16Z

file metadata (duration, file size) is now used in the match scoring
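
To illustrate how file metadata can feed into a match score, here is a small sketch that blends a title similarity score with how close the file's duration is to the duration reported by the API. The linear falloff, the 60-second tolerance, the 0.7 weighting, and the function names are assumptions for illustration, not the values used in this PR.

```python
def duration_score(file_seconds: float, api_seconds: float,
                   tolerance: float = 60.0) -> float:
    """1.0 for identical durations, falling linearly to 0.0 at `tolerance` seconds apart."""
    difference = abs(file_seconds - api_seconds)
    return max(0.0, 1.0 - difference / tolerance)


def combined_score(title_score: float, file_seconds: float, api_seconds: float,
                   title_weight: float = 0.7) -> float:
    """Weighted blend of a title score (0.0 to 1.0) and the duration score."""
    return (title_weight * title_score
            + (1.0 - title_weight) * duration_score(file_seconds, api_seconds))


# a perfect title match whose duration is 30 seconds off
score = combined_score(1.0, file_seconds=1800, api_seconds=1830)
```

File size could be folded in the same way, as another term with its own weight.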

nrg101 · 2025-01-30T16:18:06Z

Putting back to draft while I move some of the Adult Time studios from their own scraper (e.g. All Girl Massage, Fantasy Massage, etc.) to the new AdultTime scraper

nrg101 added 2 commits January 23, 2025 18:17

fix: start refactor to standard Algolia Python client

f4416d5

add performerByName and performerByFragment

fd42f14

nrg101 marked this pull request as draft January 24, 2025 02:41

nrg101 added 8 commits January 24, 2025 03:00

use guess_nationality; remove unused imports; tidy typings

f50820d

make the homepage url and headers better

8de32ba

refactor sceneByURL

e621444

refactor sceneByURL

9e06b45

refactor sceneByFragment and sceneByQueryFragment

2767b57

refactor galleryByURL

d0ab149

refactor galleryByFragment

1423b03

refactor movieByURL

b6bcec0

Maista6969 reviewed Jan 25, 2025

nrg101 added 2 commits January 25, 2025 01:43

move AlgoliaAPI.py to its own folder

0d0fa04

ensure requests package is installed

e434681

nrg101 added 2 commits January 25, 2025 03:16

make suggested changes

6298dec

make EvilAngel.py

d2112e0

implement postprocess; tidy comments/typings

8884d25

reference correct package

1a2c9be

nrg101 marked this pull request as ready for review January 30, 2025 01:48

nrg101 requested a review from Maista6969 January 30, 2025 01:48

nrg101 added 18 commits January 30, 2025 02:08

map Transfixed Muses to Transfixed

7e91418

fix gallery scraping

ed6b3ba

change GenderXFilms to use AlgoliaAPI

05ad8a4

add studio logic for Devil's Film

ff33ad0

fix argument variable name

fdc2d5c

add video URL for devilsfilm gallery

b3a4456

add studio name logic for ASMR Fantasy

4ebfe05

make Transfixed gallery work with photoset URL

65bab27

add galleryByURL for oopsie.com

ff0bb04

fix galleryByURL

58bce66

add galleryByFragment to GenderXFilms

df3d0fa

use db gallery folder file count in match ratio evaluation

4afecd4

fix bug for studio override when no channels prop

d83860b

add zip support for galleryByFragment

b87c964

fix function call missing argument

a5f6d04

fix function call missing argument

70de3cf

support Devil's Tgirls better

6469343

move sites to FantasyMassage (Network)

a27380f

nrg101 marked this pull request as draft January 30, 2025 16:16

nrg101 added 5 commits January 30, 2025 17:19

move sites

8905958

migrate TabooHeat

c8fac9a

remove straggler

d797fea

migrate Gangbang Creampie

b616bd5

migrate sites

f47d03b

nrg101 marked this pull request as ready for review January 30, 2025 18:17

add extra logic for TransgressiveXXX studio name

36d84ec
