Fix EvilAngel performerByURL -> Refactor Algolia scraping #2177

Open
wants to merge 70 commits into
base: master


nrg101
Contributor

@nrg101 nrg101 commented Jan 24, 2025

Overview

I started this as an attempt to fix the performerByURL scraping for EvilAngel, but it turned into the long overdue effort to overhaul the Algolia script.

Scraper type(s)

  • performerByName
  • performerByFragment
  • performerByURL
  • sceneByName
  • sceneByQueryFragment
  • sceneByFragment
  • sceneByURL
  • movieByURL
  • galleryByFragment
  • galleryByURL

Outstanding tasks

  • implement search match scoring/comparison
  • implement studio name determination logic
  • handle multiple sites searching (is this ever needed? I'm going to say no at this point)
  • match by file info (e.g. duration, resolution, whatever)

Examples to test

performerByURL

performerByName

Create Performer > Scrape with... > EvilAngel > Performer Name = Ariel

  • select Ariel Demure from the results

performerByFragment

do the Create Performer search action above

  • see the scraped performer

go to an existing performer that has scenes at Evil Angel > Edit > Scrape with... > EvilAngel

  • select the performer from the results
  • see the additional/new/different scraped data

sceneByURL

movieByURL

galleryByURL

Short description

Problem

Recently, many Algolia-based sites have closed the free access to pages like:

  • /en/videos
  • /en/pornstars
  • /en/movies
  • /en/video/evilangel/TS-SOPHIA-MONTESINO-Spunky-Anal-Date/256714
  • /en/movie/Transgressive-25/126353
  • /en/pornstar/view/Brittney-Kade/92399

This means you can no longer browse videos, performers and movies on sites like evilangel.com, genderxfilms.com, and a whole load of other sites. This is especially annoying for performer scraping, as that is not implemented in the current Algolia.py.

Solution

There is actually a full Python client for the Algolia API, and all that's needed is fetching the appId and apiKey, and setting the host and referer headers. By referring to that client's docs, the current Algolia.py, and the Aylo API script, I've cobbled together working versions of:

  • performerByURL -> lookup performer by URL (the ID at the end)
  • performerByName -> searches for up to 20 performers matching a text string
  • performerByFragment -> looks up performer from one of the search results from performerByName
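As a rough illustration of the lookups listed above, here is a minimal stdlib-only sketch that extracts the trailing ID from a URL and describes a search request against Algolia's REST query endpoint. It is not the code from this PR: the index name `all_actors`, the `actor_id` filter attribute, the placeholder credentials, and the `example.com` site are all assumptions made for the example, and the request is only built, never sent.

```python
import json
import re
from urllib.parse import urlencode, urlparse


def id_from_url(url):
    """Return the trailing numeric ID of a URL path, or None if there is none."""
    match = re.search(r"/(\d+)/?$", urlparse(url).path)
    return match.group(1) if match else None


def build_search_request(app_id, api_key, site, index, **params):
    """Describe (but do not send) a POST to Algolia's index query endpoint."""
    return {
        "url": f"https://{app_id}-dsn.algolia.net/1/indexes/{index}/query",
        "headers": {
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            # The Referer value is an assumption based on the description above
            "Referer": f"https://www.{site}/",
            "Content-Type": "application/json",
        },
        # Search parameters travel as one URL-encoded string under "params"
        "body": json.dumps({"params": urlencode(params)}),
    }


# performerByURL: look up by the ID at the end of the URL (hypothetical URL)
performer_id = id_from_url("https://www.example.com/en/pornstar/view/Some-Name/12345")
by_id = build_search_request(
    "APPID", "KEY", "example.com", "all_actors",
    filters=f"actor_id:{performer_id}",  # hypothetical attribute name
)

# performerByName: free-text search capped at 20 results
by_name = build_search_request(
    "APPID", "KEY", "example.com", "all_actors",
    query="Ariel", hitsPerPage=20,
)
```

Keeping the request description separate from the HTTP call makes it easy to inspect what would be sent before pointing it at a real site.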

The current Algolia.py is a whole load of jank taped together and is long overdue an overhaul. Rather than trying to refactor it in-place, I've decided to make a new script called AlgoliaAPI.py, so that each site scraper can be migrated over individually.

The good parts of the existing Algolia.py should be included now.

@nrg101 nrg101 marked this pull request as draft January 24, 2025 02:41
@nrg101
Contributor Author

nrg101 commented Jan 24, 2025

I've done a first pass of implementing all the scrapers for EvilAngel.yml with the new AlgoliaAPI.py.

There are some TODOs for handling multiple sites, and doing some form of results score matching (e.g. galleryByFragment) when the operation finds multiple API results, but the operation only returns a single scraper result.

@nrg101
Contributor Author

nrg101 commented Jan 24, 2025

@Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...

Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.

  • galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py
  • anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with
  • handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site

I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

@ltgorman
Contributor

@Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...

Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.

  • galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py
  • anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with
  • handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site

I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

Implementing markers would be nice.

Collaborator

@Maista6969 Maista6969 left a comment


I've been wanting to rewrite Algolia for a long time, thank you for contributing this! I certainly agree that the old Algolia has become pretty crufty and I think this is an excellent start on the Road to Refactor 😁

The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py

scrapers/Algolia/AlgoliaAPI.py Outdated Show resolved Hide resolved
@nrg101
Contributor Author

nrg101 commented Jan 25, 2025

The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py

I did wonder what else would be needed, like apart from the studio remapping stuff...

I thought that subclassing or importing functions from a "base" script (similar to the Aylo implementation) might be more complex than it's worth (to give flexibility that ultimately isn't needed).

With that in mind, I had a think about how the scraper configuration YAML could be used for something like the studio name mapping, and came up with a possible solution like this:

import ast

# the API hit dictionary
api_hit = {
    'studio_name': 'Enid Blyton',
    'serie_name': 'Groovy Gang',
    'channel_name': 'Happy Joy',
    'sitename': 'thisthing',
    'segment': 'something',
}

# these could come in via the `args["extra"]` list of strings
conditions_and_values_to_assign = [
    "api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']",
    "api_hit['segment'] == 'something' => 'a fixed value'",
    "api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']",
]

for condition_and_value_to_assign in conditions_and_values_to_assign:
    condition, value_to_assign = condition_and_value_to_assign.split(' => ')
    # Parsing and evaluating the condition
    parsed_condition = ast.parse(condition, mode='eval')
    if eval(compile(parsed_condition, filename="", mode="eval")):
        new_variable = eval(value_to_assign)
    else:
        new_variable = 'default_value'

    print(condition_and_value_to_assign)
    print(new_variable)
    print()

When run, this outputs:

api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']
Happy Joy

api_hit['segment'] == 'something' => 'a fixed value'
a fixed value

api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']
default_value

I'm not super excited about the use of eval, but it could be a solution for the studio mapping logic
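For comparison, here is a hedged sketch of how the same rule strings could be handled without `eval`. It assumes the rules are restricted to the exact shape shown above (`api_hit['key'] == 'literal' => api_hit['key']` or a quoted literal) and uses `ast.literal_eval`, which only accepts literals. The helper names `resolve` and `apply_rules` are made up for this example, and unlike the loop above it returns the value of the first matching rule rather than printing a result per rule.

```python
import ast
import re

# Matches only references of the form api_hit['some_key'] (single-quoted key)
KEY_REF = re.compile(r"^api_hit\['([^']*)'\]$")


def resolve(expr, api_hit):
    """Resolve an api_hit['key'] reference or a quoted/numeric literal."""
    expr = expr.strip()
    match = KEY_REF.match(expr)
    if match:
        return api_hit.get(match.group(1))
    # literal_eval rejects calls, attribute access and other non-literal code
    return ast.literal_eval(expr)


def apply_rules(rules, api_hit, default=None):
    """Return the value of the first rule whose condition holds, else default."""
    for rule in rules:
        condition, _, value = rule.partition(" => ")
        left, _, right = condition.partition(" == ")
        if resolve(left, api_hit) == resolve(right, api_hit):
            return resolve(value, api_hit)
    return default


api_hit = {
    "studio_name": "Enid Blyton",
    "serie_name": "Groovy Gang",
    "channel_name": "Happy Joy",
    "sitename": "thisthing",
    "segment": "something",
}
rules = [
    "api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']",
    "api_hit['segment'] == 'something' => 'a fixed value'",
    "api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']",
]
studio = apply_rules(rules, api_hit, default="default_value")
```

The trade-off is a much narrower rule language: literals containing ` == ` or ` => ` would be split incorrectly, and malformed rules raise an exception instead of being evaluated.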

@Maista6969
Collaborator

Maista6969 commented Jan 25, 2025

I see what you mean here, but I feel like dynamically evaluating code from a YAML file is even more complex than just having one Python script that calls another Python script 😅

For most sites we might not even need any special handling, see for example the True Amateurs scraper which can just use the general API results and so doesn't have a separate Python script 🙂

edit: whoops originally linked to the wrong scraper here, not Trans Angels but True Amateurs
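The "one Python script that calls another" layering could be sketched roughly as follows. This is an illustration only, not the structure of the actual scripts: `generic_scene` stands in for whatever the shared API module returns, the override table and fallback order are invented for the example, and the field names are taken from the sample `api_hit` earlier in this thread.

```python
def generic_scene(api_hit):
    """Stand-in for the shared module: map an API hit to a scraped scene."""
    return {
        "title": api_hit.get("title"),
        "studio": {"name": api_hit.get("studio_name")},
    }


# Site-specific overrides live in ordinary Python rather than in YAML strings.
# The entries here are hypothetical.
STUDIO_OVERRIDES = {"thisthing": "This Thing"}


def determine_studio(api_hit):
    """Pick a studio name: explicit override first, then series, then studio."""
    sitename = api_hit.get("sitename")
    if sitename in STUDIO_OVERRIDES:
        return STUDIO_OVERRIDES[sitename]
    return api_hit.get("serie_name") or api_hit.get("studio_name")


def site_scene(api_hit):
    """Site-specific layer: reuse the generic result, then fix up the studio."""
    scene = generic_scene(api_hit)
    scene["studio"] = {"name": determine_studio(api_hit)}
    return scene


overridden = site_scene({"title": "A", "studio_name": "Enid Blyton", "sitename": "thisthing"})
fallback = site_scene({"title": "B", "studio_name": "Enid Blyton", "serie_name": "Groovy Gang"})
```

Sites that need no special handling would skip the second layer entirely and use the generic result as-is.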

@nrg101
Contributor Author

nrg101 commented Jan 25, 2025

Implementing markers would be nice.

You'll have to enlighten me, as in:

  • what is a marker?
  • what in the Algolia API provides data for markers?
  • how do markers get saved/persisted?

I can't see how any of the scraped models provide any marker feature.

@Maista6969
Collaborator

Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped

It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers

I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

@ltgorman
Contributor

Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped

It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers

I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

If you look at the Adult Time JSON, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

@Maista6969
Collaborator

If you look at the Adult Time JSON, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

Thanks, I wasn't aware that they provided these :) I'll make a note of it for when we expand the use of this to AdultTime and the other sites that can use this API 👍

@stg-annon
Contributor

Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they are integrated as something we can pass to Stash like any other scraped metadata. But if you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be a hook for post-scrape updates.

@Maista6969
Collaborator

Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they are integrated as something we can pass to Stash like any other scraped metadata. But if you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be a hook for post-scrape updates.

Scrapers can't register hooks, but I see your point in that we could maintain a separate plugin for this 👍

@nrg101
Contributor Author

nrg101 commented Jan 27, 2025

Implemented studio name determination for:

  • studios listed in EvilAngel.yml
  • TransPlaytime, as those scenes have evilangel.com URLs

@nrg101
Contributor Author

nrg101 commented Jan 30, 2025

file metadata (duration, file size) is now used in the match scoring
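To make the idea concrete, here is a minimal sketch of scoring several API hits against what is known locally and returning the single best one. It is not the scoring used in this PR: the hit field names `title` and `length` (seconds), the 60-second tolerance, and the equal weighting of title and duration are all assumptions for the example; a file-size term could be added in the same way.

```python
from difflib import SequenceMatcher


def score_hit(hit, title, duration=None):
    """Score one API hit: title similarity plus an optional duration bonus."""
    score = SequenceMatcher(None, title.lower(), hit.get("title", "").lower()).ratio()
    if duration and hit.get("length"):
        diff = abs(hit["length"] - duration)
        # Full bonus for an exact match, fading to zero at 60 seconds apart
        score += max(0.0, 1.0 - diff / 60.0)
    return score


def best_match(hits, title, duration=None):
    """Return the highest-scoring hit, or None when there are no hits."""
    return max(hits, key=lambda hit: score_hit(hit, title, duration), default=None)


hits = [
    {"title": "Alpha Part 1", "length": 1800},
    {"title": "Alpha Part 2", "length": 2400},
]
# The titles are equally close to the query, so the file duration decides
best = best_match(hits, "Alpha Part", duration=2395)
```

When the text fields tie, as they often do for multi-part releases, the file metadata is what separates the candidates.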

@nrg101 nrg101 marked this pull request as ready for review January 30, 2025 01:48
@nrg101 nrg101 requested a review from Maista6969 January 30, 2025 01:48
@nrg101 nrg101 marked this pull request as draft January 30, 2025 16:16
@nrg101
Contributor Author

nrg101 commented Jan 30, 2025

Putting back to draft while I move some of the Adult Time studios from their own scraper (e.g. All Girl Massage, Fantasy Massage, etc.) to the new AdultTime scraper

@nrg101 nrg101 marked this pull request as ready for review January 30, 2025 18:17
4 participants