Removing non-alphanumeric characters from all searches doesn't work for some indexers #1225

Closed
xelra opened this issue Apr 6, 2016 · 41 comments

Comments

@xelra

xelra commented Apr 6, 2016

When searching, especially for anime, cleanTitle is not what is needed.

It should (maybe additionally to not break other search APIs) search for the exact scene title. This would be especially helpful for anime.

Nyaa fixed their search API and now properly returns results for '. It needs to be substituted with %27 though.
Here is the now successful search for JoJo's Bizarre Adventure:
http://www.nyaa.se/?page=search&cats=1_37&filter=1&term=JoJo%27s+Bizarre+Adventure
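As a rough illustration of the substitution described above, the sketch below percent-encodes the title with Python's standard library. Only the base URL and its parameters come from the link above; the helper name `nyaa_search_url` is made up for this example.

```python
from urllib.parse import quote_plus

# Base URL and fixed parameters copied from the working search link above.
NYAA_SEARCH = "http://www.nyaa.se/?page=search&cats=1_37&filter=1&term="


def nyaa_search_url(title: str) -> str:
    """Build a search URL that keeps the apostrophe (hypothetical helper)."""
    # quote_plus turns spaces into '+' and percent-encodes "'" as %27.
    return NYAA_SEARCH + quote_plus(title)


print(nyaa_search_url("JoJo's Bizarre Adventure"))
```

The printed URL should match the working link above character for character.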

Here is the hastebin that Taloth made about how Sonarr currently searches:
http://hastebin.com/qotubuxeme.vhdl

@thezoggy

goes with: #542 ?

markus101 changed the title from "Manual search uses cleanTitle" to "Removing single quotes from all searches doesn't work for some indexers" Apr 11, 2016
@markus101
Member

I think this one just needed a better name. We're using a modified version of the scene names, which works for a lot of indexers, but some are special 😄 We just need a way to modify them later in the process to allow customization for certain indexers.

@Taloth
Member

Taloth commented Apr 11, 2016

We should move the cleanup logic out of the searchcriteria and into the RequestGenerator.

@markus101
Member

Similar issue with Knight's & Magic: usenet indexers have it as Knights & Magic and Nyaa.si has Knight's & Magic, but we replace & with and when searching.

@kvloover

Is it perhaps possible to allow users to change the name that's being searched?
Just a simple override that they can tailor to their indexer setup and the show that is not working?

@arebokert

I think it would be great if it was possible to manually modify how the search is done. I know for some torrent trackers you sometimes need to include double quotes around the search term to get a proper result if the show name contains spaces for example.

@FabioCastilho

Knight's & Magic
Has anyone found a solution to this problem? Unfortunately, the version I use here is named with "&", and only "AND" is found.

@AzzieDev

AzzieDev commented Apr 25, 2018

markus101 can there be an exception so that shows with a ' in the title are also searched without it, and titles with an ampersand (&) are searched both with and without it, as well as with it replaced by "and"?

Such as, The Handmaid's Tale, Will & Grace, >> The Handmaids Tale, Will Grace, Will and Grace?
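To make the request concrete, here is a small sketch of how such title variants could be generated. It is only an illustration of the idea in this comment, not anything Sonarr does; the function name `title_variants` is invented for the example.

```python
def title_variants(title: str) -> list[str]:
    """Return the title plus looser spellings, without duplicates (illustrative only)."""
    no_apostrophe = title.replace("'", "")
    candidates = [
        title,                              # exact title
        no_apostrophe,                      # apostrophe dropped
        no_apostrophe.replace("&", "and"),  # '&' spelled out
        no_apostrophe.replace("&", " "),    # '&' dropped
    ]
    seen, result = set(), []
    for candidate in candidates:
        normalized = " ".join(candidate.split())  # collapse doubled spaces
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


print(title_variants("The Handmaid's Tale"))
print(title_variants("Will & Grace"))
```

Each variant would cost one extra search request, so a title without symbols still produces a single query.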

RSS caught The Handmaid's Tale, but manual search doesn't.

@Sportsmaniac13

Running into the same issue as kat with The Handmaid's Tale -- Jackett catches both, but manual search only catches versions without the apostrophe. Any way we could have it search for both, or perhaps specify which on a per-show basis?

@ghost

ghost commented May 20, 2018

This happens for me using Torznab through Jackett. Debug log:

18-5-20 03:25:33.0|Info|NzbSearchService|Searching 1 indexers for [The Handmaid's Tale : S02E05]
18-5-20 03:25:33.0|Debug|Torznab|Downloading Feed [MY_SERVER]/jackett/api/v2.0/indexers/privatehd/results/torznab/api?t=tvsearch&cat=2000,2010,2030,2040,3000,5000,5030,5040&extended=1&apikey=(removed)&offset=0&limit=100&q=Handmaids%20Tale&season=2&ep=5
18-5-20 03:25:33.5|Debug|NzbSearchService|Total of 0 reports were found for [The Handmaid's Tale : S02E05] from 1 indexers

The Handmaid's Tale gets converted to &q=Handmaids%20Tale

Manual search on Jackett for "Handmaid's Tale" works but not "Handmaids Tale", so the above causes the indexer to return no results.
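For anyone trying to reproduce this outside Sonarr, the sketch below approximates the conversion visible in the log. It is a guess based only on the logged `q=` value (leading "The" dropped, apostrophe removed, spaces percent-encoded) and is not Sonarr's actual cleanup code; `cleaned_query` is a made-up name.

```python
import re
from urllib.parse import quote


def cleaned_query(title: str) -> str:
    """Approximate the cleanup seen in the log; NOT Sonarr's actual code."""
    without_article = re.sub(r"^the\s+", "", title, flags=re.IGNORECASE)
    without_apostrophe = without_article.replace("'", "")
    return quote(without_apostrophe)  # spaces become %20


# Cleaned form, as it appears in the debug log.
print("q=" + cleaned_query("The Handmaid's Tale"))
# Form that kept the apostrophe, which is what the manual Jackett search used.
print("q=" + quote("Handmaid's Tale"))
```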

@AzzieDev

roman-22 So Jackett is part of the problem... Didn't even think of that part.

@ghost

ghost commented May 20, 2018

I think each indexer's search handles it differently. TL recognises both "Handmaids Tale" and "Handmaid's Tale" whereas PHD (used in above log) needs the apostrophe.

Because Jackett is not given the apostrophe and is only passed "Handmaids Tale" I don't think there's a way for Jackett to solve the problem.

Either Sonarr needs to pass the apostrophe to Jackett, or the indexer needs to adapt their search engine to allow looser matches to be found.

@ivanbeldad

I opened another issue (#2644) with a similar problem. Because mine was closed immediately and this one has more attention, I'll add my opinion here.

In my case the problem isn't only Sonarr's automatic removal of the single quote; it also removes "the" from any series such as The Handmaid's Tale, leaving "handmaids tale", which is far from correct and complicates the way indexers work.

@markus101 said that they cannot let indexers sanitize because they don't always do it, but I don't think that is a reason to do things wrong.

If indexers are not sanitizing, that is not Sonarr's problem, it is the indexer's problem. I don't understand why one application should do things that it shouldn't because external applications don't work otherwise.

I wrote one indexer for Jackett and fixed another one to make it Sonarr compliant, and in my case I face the problem that Sonarr is "making up" titles that don't match reality, so I cannot know the real one.

Removing apostrophes or removing "the" from titles before sending them to the indexer is out of Sonarr's scope.

@AzzieDev

AzzieDev commented Jul 5, 2018

> If indexers are not sanitizing, that is not Sonarr's problem, it is the indexer's problem. I don't understand why one application should do things that it shouldn't because external applications don't work otherwise.

This is NOT the fault or problem of indexers. Sonarr picks up results IN RSS but not in searches. The implementation could be added with tweaks to the search algorithm for titles by also searching for results with special characters stripped in Sonarr. In fact, releases are actually meant to be untouched (including their filenames) due to standards of release groups and the Scene which make sure files do not contain special characters for the sake of compatibility and consistency.

As per removing "the" from titles and adding them to search results, this occurs but is seemingly harmless in and of itself. It does not appear to be injuring RSS snatches.

@ivanbeldad

ivanbeldad commented Jul 5, 2018

@kat953162 I think you didn't understand my point. I'm saying that sanitizing the title is the indexer's problem, not Sonarr's. Obviously RSS works, because Sonarr doesn't send any query.

The origin of the problem is Sonarr: it is removing characters from titles that it shouldn't. They say they do it because indexers don't sanitize, so they have to. Wrong. If indexers are badly implemented, that is the indexer's problem, not Sonarr's. Sonarr should do things right, because otherwise it is far more complicated to implement an indexer that needs the removed parts.

I think there are two harmless solutions to this problem that would not affect any implemented indexer:

  • For sanitized titles, run two queries, one with the original title and the other with the sanitized one (Disadvantage: lower performance).
  • Add an indexer toggle to enable/disable Sonarr sanitization (Disadvantage: harder setup and interface overload).
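The first proposal could look roughly like the sketch below: a second request is sent only when sanitizing actually changes the title, and results are merged by GUID. All names here (`search_with_fallback`, `fake_indexer`, the `guid` key) are assumptions for the example and do not come from Sonarr.

```python
from typing import Callable, Dict, List

Release = Dict[str, str]


def search_with_fallback(
    title: str,
    sanitize: Callable[[str], str],
    search: Callable[[str], List[Release]],
) -> List[Release]:
    """Query the original title and, if it differs, the sanitized one."""
    terms = [title]
    sanitized = sanitize(title)
    if sanitized != title:
        terms.append(sanitized)  # second request only when the title changes

    merged: Dict[str, Release] = {}
    for term in terms:
        for release in search(term):
            merged.setdefault(release["guid"], release)  # drop duplicates
    return list(merged.values())


# Stand-in indexer that only matches the apostrophe form (made up for the demo).
def fake_indexer(term: str) -> List[Release]:
    if term == "Handmaid's Tale":
        return [{"guid": "1", "title": "The.Handmaid's.Tale.S02E05"}]
    return []


results = search_with_fallback(
    "Handmaid's Tale", lambda t: t.replace("'", ""), fake_indexer
)
print(len(results))
```

The performance cost mentioned above is limited to titles that contain removable characters; everything else still results in one request.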

@AzzieDev

AzzieDev commented Jul 5, 2018

The indexers have protocols to follow. Higher-level indexers are not going to rename releases if a group uses a certain title. It's not "sanitization" if the indexer adds extra special characters to a title. The Scene will not suddenly start allowing characters other than A-Z, a-z, 0-9, periods, and dashes as per the rules, and other release groups generally follow these standards as well but with flexibility.

Sonarr needs to perform queries with special characters removed in order to capture a full set of results. It will then notice items that would have an apostrophe or other special symbols (ampersands) removed from the title. It will only have a slightly lower speed (not performance) for titles with symbols, which are not too common, and it is far better than not having the results at all from the query in the first place. Instead of making one API request, it would be making two for titles with symbols, which isn't a big deal.

@Taloth
Member

Taloth commented Jul 5, 2018

The majority of indexers with newznab use sphinx indexing and are usually configured to strip special characters like that during the indexing process. It would be nice if they did the same for the api query, but they often don't.
It isn't realistic to demand those indexers fix it when we're talking about 90%+ of our supported indexers; the world simply doesn't work that way. Otherwise the same could be said for 99% of the indexers in Jackett since they have neither an api nor proper indexing capabilities.
Furthermore, most indexers are actually true indexers in the sense that they accept tvdbid/tvmazeid query parameters. The q= parameter is a keyword search, not a title search, and is purely intended as a fallback in case the id-based search fails.
Regarding 'Expanse', we can leave out 'the' because we'd be getting both results on any decent indexer. (In fact, most indexers will happily return all the relevant results based on tvdbid.)
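The id-first, keyword-second order described here can be sketched as follows. The parameter names `t`, `q`, `season`, `ep` and `tvdbid` are the ones that appear in this thread; the function name, the `supported` set and the id value `12345` are placeholders for illustration, not real data.

```python
from typing import Dict, List, Optional, Set


def tvsearch_requests(
    supported: Set[str],
    tvdb_id: Optional[int],
    keywords: str,
    season: int,
    episode: int,
) -> List[Dict[str, str]]:
    """Return request parameter sets in the order they would be tried."""
    base = {"t": "tvsearch", "season": str(season), "ep": str(episode)}
    requests = []
    if tvdb_id is not None and "tvdbid" in supported:
        requests.append({**base, "tvdbid": str(tvdb_id)})  # id-based search first
    requests.append({**base, "q": keywords})  # keyword search as the fallback
    return requests


# 12345 is a placeholder id, not the real TVDB id of any show.
for params in tvsearch_requests({"q", "season", "ep", "tvdbid"}, 12345, "Handmaids Tale", 2, 5):
    print(params)
```

An indexer that does not accept `tvdbid` only ever sees the keyword request, which is where the title cleanup matters.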

It doesn't make sense to demand Sonarr queries by unmodified titles simply because you desire it so and break it for all the sphinx indexers in the process.
To be frank, I much prefer to support indexers that have a proper api instead of breaking them in order to support a site that does not have an api and actually has a cryptominer on their site. I mean, wtf...

Currently there is no way for newznab/torznab indexers to convey their keyword format in the t=caps capabilities, otherwise we could use that, so it's not possible to implement different behavior depending on the site. At least, not at this time.
For Mejor, an alternative would be to index all the series titles in jackett in a cache (~26 site queries). That would arguably make the whole thing faster and more reliable in the process.

@ivanbeldad

I didn't say break anything. None of my proposals will break even one indexer currently working.

@curiositycasualty

Maybe since Jackett is a "known good" indexer, we could create a new indexer "type" (vs. "Torznab") for it that would support passing escaped queries?

markus101 changed the title from "Removing single quotes from all searches doesn't work for some indexers" to "Removing non-alphanumeric characters from all searches doesn't work for some indexers" Feb 3, 2019
@lps-rocks

> An option (advanced or otherwise) inevitably means that the user is able to misconfigure it. And in this case it's easily overlooked because it only becomes apparent when the user notices releases missing from the manual results. So it's something to be avoided. With per-indexer control I meant in the code, so that we can override the query title logic for specific indexers as required.

A hard coded list of sites / indexers in the code will require unnecessary administrative overhead on the code base.

Trying to idiot proof the program by removing user choice is infuriating to me and if I do write a patch it will include a toggle option and possibly an option to specify what characters can remain. Some trackers allow a few but not all special characters in a search.

I've been on the opposing end no less than a few dozen times and I hate when a developer tries to single handedly be smarter than the user. If it's an advanced option it's on the user to mess with it. If the user misconfigures it, that's their problem. With adequate documentation and UI design the user should be able to figure out exactly what it is they're tinkering with.

> Anyway, doing a query for both variations of such names should not increase the load on the indexer in any significant manner. And has the advantage of 'simply working' regardless of user configuration.

I disagree. It's avoidable and for the same reasoning above, it should be up to the user since it's their account on the indexer the additional load will show up under.

@xelra
Author

xelra commented May 20, 2019

@lps-rocks I think you did not understand what @Taloth tried to explain.

There's just no place where a user-editable setting would make sense. The indexers are already hardcoded in Sonarr. The only option you have to add other indexers is via custom newznab or custom torznab.

If someone has a newznab or torznab api, they can provide the capabilities of their api to Sonarr. Then Sonarr can adjust its queries accordingly.
So where exactly is the user supposed to meddle with this?

Once such capabilities are in Sonarr, it simply needs to be added to the few indexers that are integrated. And since Jackett is just another torznab indexer, if the user sets up custom indexers through Jackett, they can just set the api capabilities from there.

@lps-rocks

lps-rocks commented May 20, 2019 via email

@Taloth
Member

Taloth commented May 21, 2019

Newznab indexers mostly use sphinx as search engine, and thus the query titles were formatted for that.

Back in the day we proposed and succeeded in getting the supportedParams attribute added to the newznab-specification caps response, allowing indexers to specify which query parameters they supported. This was first introduced in torznab specifically for Jackett. And after tvrage disappeared, it was successfully proposed to the actual newznab specification and their codebase. It was an improvement because it allowed the indexers to specify what they supported and clients to act accordingly. All without requiring the user to fiddle with configuration.
This is no different, I can imagine that a 'searchFormat' parameter could be added, defaulting to 'sphinx'. Which Jackett for example can set to 'raw', to prevent Sonarr from doing any cleanup whatsoever.
For non newznab/torznab indexers we'll need to find out what cleanup is required.
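A client-side reading of such a caps response might look like the sketch below. The `supportedParams` attribute is the one mentioned above; the `searchFormat` attribute and its 'sphinx' default are only the proposal from this comment, and the sample XML is made up for the example rather than taken from a real indexer.

```python
import xml.etree.ElementTree as ET

# Made-up caps response; 'searchFormat' is the attribute proposed in this comment.
CAPS_XML = """
<caps>
  <searching>
    <tv-search available="yes" supportedParams="q,season,ep" searchFormat="raw"/>
  </searching>
</caps>
"""


def read_tv_search_caps(xml_text: str) -> dict:
    """Extract supported parameters and the (hypothetical) search format."""
    node = ET.fromstring(xml_text.strip()).find("./searching/tv-search")
    if node is None:
        return {"supported_params": [], "search_format": "sphinx"}
    raw_params = node.get("supportedParams", "")
    params = [p.strip() for p in raw_params.split(",") if p.strip()]
    return {
        "supported_params": params,
        # Indexers that do not send the attribute keep today's behaviour.
        "search_format": node.get("searchFormat", "sphinx"),
    }


print(read_tv_search_caps(CAPS_XML))
```

Defaulting to 'sphinx' when the attribute is absent means existing indexers would not need to change anything.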

So the decision logic that determines which QueryTitle cleanups are needed has to be moved to RequestGenerator so that the indexer capabilities can be taken into account.
Then likely two new QueryTitle cleanup formats need to be added: a 'raw' format that does no cleanup, and an 'unknown' format, which queries for multiple formats.
Then the Jackett devs need to be contacted to see if they're interested in collaborating for a change in the torznab capabilities to include the appropriate value, so Sonarr can adjust accordingly. Given our history together I don't expect that to be an issue at all. However, if they use the 'raw' format, then they will need to do the entire cleanup themselves, which is likely the best approach given the variation of indexers they support.
For the nyaa.se indexer in Sonarr, we will need to try a few formats and see what format is required for their full text search. Useful here would be to come up with a test set of titles that usually go wrong.
It's also possible that sphinx already needs multiple titles to be queried, but I guess we'll only find that out by trying those titles.
AnimeTosho and NyaaPantsu are probably the interesting ones because they are not proxies like Jackett, but do not use the newznab codebase. So we need to find out what formats apply to them. If we can come up with a sensible caps attribute then I again would expect them to be amenable to adding it to their site.

I discussed this with markus and he also does not want to add a user setting for this behavior. The correct format should be automatically determined, but if that is not possible or inconclusive then both titles should be queried instead.
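The decision described in this comment could be sketched like this: 'sphinx' sends the cleaned title, 'raw' sends the title untouched, and 'unknown' sends both. The cleanup rules in `sphinx_clean` are a simplified assumption (apostrophes dropped, & replaced with and, other symbols turned into spaces) and are not Sonarr's real QueryTitle logic.

```python
import re


def sphinx_clean(title: str) -> str:
    """Simplified stand-in for the existing cleanup (assumed rules)."""
    cleaned = title.replace("'", "").replace("&", "and")
    cleaned = re.sub(r"[^0-9A-Za-z ]+", " ", cleaned)  # other symbols become spaces
    return " ".join(cleaned.split())


def query_titles(title: str, search_format: str) -> list[str]:
    """Pick the titles to query for a given indexer search format."""
    cleaned = sphinx_clean(title)
    if search_format == "raw":
        return [title]      # indexer does its own cleanup
    if search_format == "sphinx":
        return [cleaned]    # current behaviour
    # 'unknown' (or anything inconclusive): query both, skipping duplicates
    return [title] if cleaned == title else [title, cleaned]


for fmt in ("sphinx", "raw", "unknown"):
    print(fmt, query_titles("Knight's & Magic", fmt))
```

No user setting is involved: the format comes from the indexer, and the fallback is to query both titles.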

@Webreaper

Webreaper commented Dec 15, 2019

Feels like a short-term fix is to query both the sanitized and non-sanitized titles, and that makes a lot of sense. I also agree that it would be worth asking the Jackett devs to support a "sanitized" field. They could have that stored in the Jackett DB, which would solve the problem for all feeds everywhere and reduce the need to double-query on Sonarr.

This change would solve/fix a lot of manual matches/searches that I have to do.

@Bazzu85

Bazzu85 commented Mar 31, 2021

Hi guys,
I have a similar problem.
A multi-episode release like "Show.S02E09-10" is converted to "Show S02E09 10", so Sonarr can't recognize that it's a multi-episode release.
can something be done?
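A toy example of why the hyphen matters: the pattern below is a deliberately simple illustration, not Sonarr's parser, and only recognises an episode range when the hyphen is still present.

```python
import re

# Illustrative pattern only; a real release-name parser is far more elaborate.
EPISODE = re.compile(
    r"S(?P<season>\d{2})E(?P<first>\d{2})(?:-(?P<last>\d{2}))?", re.IGNORECASE
)


def episodes(release_name: str) -> list[int]:
    """Return the episode numbers a release name covers, per the toy pattern."""
    match = EPISODE.search(release_name)
    if match is None:
        return []
    first = int(match.group("first"))
    last = int(match.group("last") or first)
    return list(range(first, last + 1))


print(episodes("Show.S02E09-10"))  # hyphen intact: both episodes
print(episodes("Show S02E09 10"))  # hyphen stripped: only the first episode
```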

@ian-g-holm-intel

How has this not been fixed yet? Is there really no workaround for sites like nyaa.si that require the apostrophe for shows with titles containing apostrophes?

@kekal

kekal commented May 23, 2021

As far as I understand, there is no tangible movement in Jackett on this issue.
Will put a reference here Jackett/Jackett#8246

@kekal

kekal commented Aug 17, 2021

I will add one more case to the piggy bank.
https://www.thetvdb.com/series/brooklyn-nine-nine cannot be found through Jackett
image

markus101 added a commit that referenced this issue Oct 5, 2021
@bakerboy448
Contributor

Ref - Radarr/Radarr#4502

1cbcad6 helped lay the groundwork for some of this, and once trackers indicate they support/need RawSearch that should significantly alleviate this issue

YGG Torrents in Prowlarr has marked the tracker as supporting RawSearch - Once Jackett supports the parameter they can update their definitions as well

Other Indexer definitions will need similar updates

Doesn't fully resolve it, but should help for some.

@aniro

aniro commented Feb 20, 2022

Same issue for Rutracker and NoNaMe Club(L) trackers
Prowlarr/Sonarr strip quotes from titles like "Man Who Wasn't There 2001" which leads to 0 results on these trackers:

2022-02-20 22:55:09.0|Info|ReleaseSearchService|Searching indexer(s): [RuTracker, NoNaMe ClubL] for Term: [Man Who Wasnt There 2001], Offset: 0, Limit: 0, Categories: []
2022-02-20 22:55:09.1|Debug|RuTracker|Downloading Feed https://rutracker.org/forum/tracker.php?nm=Man+Who+Wasnt+There+2001
2022-02-20 22:55:09.1|Info|Cardigann|Adding request: https://nnmclub.to/forum/tracker.php
2022-02-20 22:55:09.1|Debug|Cardigann|Downloading Feed https://nnmclub.to/forum/tracker.php: f[]=-1&o=1&s=2&tm=-1&shf=1&sha=1&ta=-1&sns=-1&sds=-1&nm=Man%20Who%20Wasnt%20There%202001&pn=&submit=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA
2022-02-20 22:55:09.4|Debug|Cardigann|Parsing
2022-02-20 22:55:09.4|Debug|Cardigann|Got 0 releases
2022-02-20 22:55:09.5|Debug|ReleaseSearchService|Total of 0 reports were found for Term: [Man Who Wasnt There 2001], Offset: 0, Limit: 0, Categories: [] from 2 indexer(s)
2022-02-20 22:55:09.5|Debug|Api|[GET] /api/v1/search?query=Man%20Who%20Wasnt%20There%202001&indexerIds=2&indexerIds=1&type=search: 200.OK (521 ms)

Manual search using correct title from prowlarr works fine:

2022-02-20 22:55:14.5|Info|ReleaseSearchService|Searching indexer(s): [RuTracker, NoNaMe ClubL] for Term: [Man Who Wasn't There 2001], Offset: 0, Limit: 0, Categories: []
2022-02-20 22:55:14.6|Debug|RuTracker|Downloading Feed https://rutracker.org/forum/tracker.php?nm=Man+Who+Wasn%27t+There+2001
2022-02-20 22:55:14.6|Info|Cardigann|Adding request: https://nnmclub.to/forum/tracker.php
2022-02-20 22:55:14.6|Debug|Cardigann|Downloading Feed https://nnmclub.to/forum/tracker.php: f[]=-1&o=1&s=2&tm=-1&shf=1&sha=1&ta=-1&sns=-1&sds=-1&nm=Man%20Who%20Wasn%27t%20There%202001&pn=&submit=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA
2022-02-20 22:55:14.9|Debug|Cardigann|Parsing
2022-02-20 22:55:14.9|Debug|Cardigann|Got 7 releases
2022-02-20 22:55:15.0|Debug|ReleaseSearchService|Total of 29 reports were found for Term: [Man Who Wasn't There 2001], Offset: 0, Limit: 0, Categories: [] from 2 indexer(s)
2022-02-20 22:55:15.0|Debug|Api|[GET] /api/v1/search?query=Man%20Who%20Wasn%27t%20There%202001&indexerIds=2&indexerIds=1&type=search: 200.OK (503 ms)
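To compare the two requests without reading the encoded URLs by eye, the search term can be decoded from the logged RuTracker URLs. The two URLs are copied from the logs above; the helper name `search_term` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit

# Request URLs copied from the two log excerpts above.
failed = "https://rutracker.org/forum/tracker.php?nm=Man+Who+Wasnt+There+2001"
worked = "https://rutracker.org/forum/tracker.php?nm=Man+Who+Wasn%27t+There+2001"


def search_term(url: str) -> str:
    """Decode the 'nm' query parameter that carries the search term."""
    return parse_qs(urlsplit(url).query)["nm"][0]


print(search_term(failed))
print(search_term(worked))
```

The only difference between the two decoded terms is the apostrophe.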

@bakerboy448
Contributor

Jackett will have RawSearch support shortly
Jackett/Jackett#13409

  • also includes rawsearch for RuTracker

Indexers that require RawSearch simply need to be reported to Jackett and then will be pulled to Prowlarr when updated or can be reported directly to Prowlarr

Believe this should effectively resolve this issue then.

Prowlarr RuTracker commit - Prowlarr/Prowlarr@bc50fd9

@markus101
Member

Indexers through Jackett and Prowlarr that report raw search capabilities are handled correctly, which solves this issue in the majority of cases. Other newznab/torznab indexers can do the same if required.

github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2023