feat(starr): expand dual audio regex #1979

Draft
wants to merge 8 commits into
base: master
from

Conversation


@adapowers adapowers commented Jun 17, 2024

Pull Request

Purpose

Two common conventions for indicating dual audio releases are not currently captured by the regex:

  • The release group VARYG appends it directly to their name: DUAL-VARYG
  • Other release groups will use DUAL {Resolution} or {Resolution} DUAL

I attempted to add a pattern which was flexible, while having a low risk of false positives.

Approach

The following patterns have been added into the regex:

(?-i)DUAL-VARYG(?i)

  • (?-i) disables case-insensitivity for the rest of the pattern
  • DUAL-VARYG matches literally
  • (?i) turns case-insensitivity back on for the rest of the pattern

dual[ ._-]?(\d{3,4}p|ultrahd|4k)

  • dual[ ._-]? matches dual (case insensitive), optionally followed by one common separator character
  • \d{3,4}p|ultrahd|4k matches any 3-4 digits followed by p (1080p, 720p, etc.) or 4K or UltraHD

(\d{3,4}p|ultrahd|4k)[ ._-]?dual

  • Same match as above, but in reverse order: 1080p.DUAL, 4K UltraHD DUAL, etc.

Notes:

  • I did not choose to mess with case sensitivity lightly; as dual is a dictionary word, we want to prevent matching on any uppercase, hyphen-delimited filename that contains it. This is the same reason I chose to encode the release group directly. Without both, there could be too many false positives.
  • In general, this approach attempts to leverage the fact that DUAL (as a single word indicating dual audio) is often placed directly before or after the video resolution.
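
As a rough illustration of how the three additions behave, here is a minimal sketch in Python. The use of Python's `re` module is an assumption made for the example (the *arr apps evaluate custom format regexes with their own engine), and the release names are made up. Python does not accept a bare `(?i)` toggle in the middle of a pattern, so the scoped group `(?-i:...)` stands in for the `(?-i)...(?i)` pair.

```python
import re

# Python adaptation of the three additions described above.
# (?-i:...) keeps DUAL-VARYG case-sensitive; everything else ignores case.
DUAL_ADDITIONS = re.compile(
    r"(?-i:DUAL-VARYG)"
    r"|dual[ ._-]?(\d{3,4}p|ultrahd|4k)"
    r"|(\d{3,4}p|ultrahd|4k)[ ._-]?dual",
    re.IGNORECASE,
)


def has_dual_marker(title: str) -> bool:
    """Return True if any of the three patterns matches somewhere in the title."""
    return DUAL_ADDITIONS.search(title) is not None


# Hypothetical release names, invented for illustration only.
samples = [
    "Some.Show.S01E01.1080p.WEB-DL.AAC2.0.H.264.DUAL-VARYG",  # literal group suffix
    "Some.Show.S01E01.DUAL.1080p.WEB.H264-GRP",               # DUAL before resolution
    "Some.Show.S01E01.720p_dual.WEB.H264-GRP",                # resolution before dual
    "Some.Show.S01E01.1080p.WEB.H264.dual-varyg",             # lowercase suffix
    "Some.Show.S01E01.1080p.WEB-DL.DUAL.AAC2.0-GRP",          # DUAL not next to resolution
]

for title in samples:
    print(has_dual_marker(title), title)
```

Under these assumptions the first three names match and the last two do not: the lowercase suffix fails the case-sensitive branch, and a DUAL that is not adjacent to the resolution is not covered by either of the other two.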

Regex

https://regex101.com/r/p1Rt67/6

Open Questions and Pre-Merge TODOs

Requirements

@github-actions github-actions bot added the Area: Sonarr, Area: Radarr, Area: Backend, and Area: Starr Custom Formats labels Jun 17, 2024
@adapowers adapowers changed the title from "fix(starr anime) add regex for DUAL-VARYG pattern" to "fix(starr anime): add regex for DUAL-VARYG pattern" Jun 17, 2024
@adapowers adapowers changed the title from "fix(starr anime): add regex for DUAL-VARYG pattern" to "fix(starr anime): add dual audio regex for DUAL-VARYG pattern" Jun 17, 2024
@adapowers adapowers changed the title from "fix(starr anime): add dual audio regex for DUAL-VARYG pattern" to "fix(starr): add dual audio regex for DUAL-VARYG pattern" Jun 17, 2024
@adapowers adapowers changed the title from "fix(starr): add dual audio regex for DUAL-VARYG pattern" to "feat(starr): expand dual audio regex" Jun 17, 2024
@adapowers
Author

adapowers commented Jun 17, 2024

Another thought: it would be nice if this could also be captured in the {custom formats} for renaming, to better track which files have dual audio and which don't. But that would require it having a shorter name like DA or Dual, which I imagine we wouldn't want. Any thoughts, besides simply changing the name on our own instances?

@adapowers
Author

Ah, never mind. I realized that's already accounted for in the {MediaInfo AudioLanguages} piece of the recommended anime naming scheme.

@bakerboy448 bakerboy448 requested a review from a team June 17, 2024 10:39
@FonduemangVI
Contributor

@rg9400 could I get you to take a look at this, given the regex changes?

@nuxencs nuxencs force-pushed the fix/add-varyg-anime-dual-audio branch from 5783172 to 97a0780 Compare June 28, 2024 14:22
Contributor

@zakkarry zakkarry left a comment

could probably be improved further but this seems to do the same job.

not sure why the varyg is case sensitive, that seems like it's unnecessary - when would this be lowercase and not be a match?

docs/json/radarr/cf/anime-dual-audio.json (Outdated)
docs/json/sonarr/cf/anime-dual-audio.json (Outdated)
@adapowers
Author

Fair enough—I think I was just trying to be super safe, but you make a good point. And I appreciate the additional refactor! Changes accepted.

@zakkarry zakkarry dismissed their stale review June 29, 2024 05:40

still needs review from anime team - my changes have been committed

@nuxencs nuxencs force-pushed the fix/add-varyg-anime-dual-audio branch from 370a931 to 41b551e Compare July 9, 2024 14:18
@rg9400
Contributor

rg9400 commented Jul 17, 2024

Sorry, was on vacation for the last few weeks. Can you share some test cases against this regex? The VARYG format has Dual-Audio in their naming near the end which matches even if DUAL-VARYG does not. The dual regex you shared matches scene naming, but anime is rarely consistent, so it's not always the case that the resolution follows DUAL. For example, it's not matching NanDesuKa's format: Helck.S01E16.1080p.HIDI.WEB-DL.DUAL.AAC2.0.H.264-NanDesuKa.mkv. Another random example: [OhDeer] Shikanoko Nokonoko Koshitantan - 01 (WEB 1080p Multi Audio) | (Dual) (My Deer Friend Nokotan). I get that if we are too lenient, it can match episode and anime titles that use the word dual, but I am not sure how many additional releases it is capturing.

The regex itself looks good, but sometimes it's hard to identify potential issues, so just trying to do a bit of due diligence to validate the changes are solving a problem.
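
The two release names quoted above can be checked against the resolution-anchored patterns from the PR description. This is a sketch in Python's `re` (an assumption for illustration; the version under review may already have differed from the description after the accepted refactor).

```python
import re

# The two resolution-anchored additions from the PR description.
RESOLUTION_DUAL = re.compile(
    r"dual[ ._-]?(\d{3,4}p|ultrahd|4k)|(\d{3,4}p|ultrahd|4k)[ ._-]?dual",
    re.IGNORECASE,
)

# Release names quoted in the review comment above.
TITLES = [
    "Helck.S01E16.1080p.HIDI.WEB-DL.DUAL.AAC2.0.H.264-NanDesuKa.mkv",
    "[OhDeer] Shikanoko Nokonoko Koshitantan - 01 (WEB 1080p Multi Audio) | (Dual) (My Deer Friend Nokotan)",
]

# Collect the titles that neither pattern catches.
missed = [t for t in TITLES if RESOLUTION_DUAL.search(t) is None]

for title in missed:
    print("not matched:", title)
```

Both names end up in `missed`: in the first, DUAL sits between WEB-DL and the audio codec; in the second, Dual is wrapped in parentheses with no resolution next to it.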

@github-actions github-actions bot added the Status: Conflicted label Sep 6, 2024
@FonduemangVI
Contributor

@rg9400 given your recent changes is this still valid/needed?

@github-actions github-actions bot added and removed the Status: Conflicted label Sep 6, 2024
@adapowers
Author

adapowers commented Nov 25, 2024

Hi! My apologies for leaving and then resurrecting this PR after so long.

Everyone in this thread gave a lot of great feedback, and I've been giving it a ton of consideration. I'm trying to approach this matter very thoughtfully, as my goal (perhaps lofty) is to "solve" dual audio anime for the foreseeable future. To that end, I'd love feedback and recommendations on the different approaches I'm about to lay out.

TLDR:

  • Medium coverage: https://regex101.com/r/8mXQiS/1 (~2.4x slower than current, catches most cases with very low false positive risk)
  • High coverage (2025-proof): https://regex101.com/r/eAyxRy/1 (~5x slower than current, catches all cases with low false positive risk, has zone for easy exception adding, accounts for upcoming 2025 anime that would cause trouble)
  • High coverage (risky): https://regex101.com/r/36M8BD/1 (~4.3x slower than current, does not account for upcoming titles)

If that kind of performance hit is not a concern (given we're still talking a couple millisecs here), I advocate for high coverage. If it is, or there are other concerns, medium will still be a huge improvement.

Improvements over current regex

(All)

  • Matches Multi-Audio convention
  • Matches DUAL-VARYG convention
  • Matches language codes (EN|JA|ZH|KO) in any order and with others in between, with only slight false positive risk
  • Adds support for full language names (English + Japanese, etc.) with separator variations (any order, but must be adjacent)

(High coverage only)

  • Matches increasingly common release convention of .DUAL. (and similar) as only indication of dual audio

Responding to previous comments

  • "These test cases aren't very thorough" - Yes, and I can't believe I let that fly. I've collected a ton of real-world examples this time.
  • "Dual doesn't always show up in the same place in the filename" - Yes, that was naive of me. It's not the right approach.
  • "VARYG releases also include some other mention of dual audio" - Not necessarily. The filenames I've seen are somewhat inconsistent, despite their group being one of the most consistent in dual audio releases generally, and there are even some dual VARYG releases that just say -VARYG (and which we can't do anything about).
  • @zakkarry's replacement of (literal space) with \s (all whitespace) - Accepted with thanks, but I'm also assuming that *arr only ever tests releases as single strings, since using \s on a multi-line list (with newlines at the end) could give unwanted matches.
  • "Capturing dual as a single word is probably okay" - I hope so, maybe not, but we'll get to that.
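
The concern about \s and multi-line input can be seen in a small sketch. It uses Python's `re` and an invented two-line paste, both of which are assumptions for illustration; it says nothing about how the *arr apps actually feed titles to the regex.

```python
import re

LITERAL_SPACE = re.compile(r"dual audio", re.IGNORECASE)
ANY_WHITESPACE = re.compile(r"dual\saudio", re.IGNORECASE)

# A single release title: both variants agree.
single_title = "[Group] Some Show - 01 (1080p Dual Audio)"

# Hypothetical multi-line paste where one title ends in DUAL and the next
# line happens to start with Audio. \s also matches the newline between them.
pasted_list = "Show.A.S01E01.1080p.DUAL\nAudio.Commentary.Special.720p"

print(bool(LITERAL_SPACE.search(single_title)), bool(ANY_WHITESPACE.search(single_title)))
print(bool(LITERAL_SPACE.search(pasted_list)), bool(ANY_WHITESPACE.search(pasted_list)))
```

On the single title both patterns match. On the pasted list only the \s variant matches, because it bridges the line break. That is harmless as long as each release is tested as its own string.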

The problem as I understand it currently

  • Dual audio titles have to be matched on the release and *arr naming side.

    • Otherwise, scores might change a lot between search and import, which can lead to very unpredictable and inconsistent upgrading behavior.
    • Certain trackers (like AnimeBytes) enforce their own conventions, as well—though thankfully they're covered by the variance in release names themselves.
  • Dual audio can be indicated by any of the following conventions (assume all typical variations in separator chars):

    • (Dual)
    • Dual Audio
    • Multi-Audio
    • DUAL-VARYG
    • EN+JA (and similar, sometimes even mixed in with other languages: EN+DE+FR+JA)
    • Full languages: [English + Japanese] (can be + or & or neither, with or without brackets/parens)
    • And (the bane of my existence): simply DUAL (as a single delimited word at any point in the release title).
  • Release naming is very inconsistent, both within and between scene/P2P groups.

    • (Ironically, one of the conventions many have standardized on is the aforementioned .DUAL.).

Approaches and tradeoffs

It's possible to maximize coverage for all cases, with some tradeoffs: namely, possibility of false positives, and performance (relative to current regex) if we then try to control for them. Here are the options.

MEDIUM COVERAGE: Match as many cases EXCEPT for the word DUAL

https://regex101.com/r/8mXQiS/1

This leaves a LOT of current releases on the table, but minimizes false positives.
Catches:

  • (Dual)
  • Dual Audio
  • Multi-Audio
  • DUAL-VARYG
  • All language cases

HIGH COVERAGE: Match on DUAL

What if we just bite the bullet, and catch all language cases and multi-audio, then match on DUAL except for at the start of a title?

What's the worst that could happen?

I checked AniDB and AnimeBytes for every existing anime with the word "dual" in the title itself, and did some additional research on srrDB and PreDB (where I also found some of the other conventions incorporated in this version).

The good news: Except for cases where "dual" is in the middle like "Synduality Noir" (easily controlled for), it's actually just:

  • Dual! Parallel Trouble Adventure (1999)
  • Armitage III Dual Matrix (movie) (2001)

We can take care of the first by requiring a separator before dual, so it never matches at the very start of a title. The second just requires a negative lookbehind to make sure dual isn't preceded by III.

https://regex101.com/r/36M8BD/1

The bad news: The popular video game franchise Guilty Gear is getting an anime release in 2025 called GUILTY GEAR STRIVE: DUAL RULERS. Assuming likely release titles, this could create a ton of false positives. We now have to add strive and strive: to our negative lookbehind.

https://regex101.com/r/eAyxRy/1

I personally think this tradeoff (and the occasional need to update it when a new, very popular title comes out) is worth catching effectively all releases that use dual or multi to denote dual-audio. But, others may not agree.

Technicals

Medium coverage

(dual|multi)[ ._-](audio|varyg)|[([]dual[)\]]
  • dual or multi, common separators, and audio or varyg. Or, dual in parens/brackets.
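
A quick way to see what the medium-coverage pattern does and does not catch, sketched in Python's `re` (an assumption for illustration). The opening bracket inside the character class is escaped here for readability; it matches the same characters. Names marked hypothetical are invented.

```python
import re

MEDIUM = re.compile(
    r"(dual|multi)[ ._-](audio|varyg)|[(\[]dual[)\]]",
    re.IGNORECASE,
)


def is_medium_match(title: str) -> bool:
    """True if the medium-coverage pattern matches anywhere in the title."""
    return MEDIUM.search(title) is not None


caught = [
    "[Group] Some Show - 01 [1080p][Dual Audio]",          # hypothetical
    "Some.Show.S01E01.1080p.WEB-DL.Multi-Audio.H264-GRP",  # hypothetical
    "Some.Show.S01E01.1080p.WEB-DL.AAC2.0.DUAL-VARYG",     # hypothetical
    "[OhDeer] Shikanoko Nokonoko Koshitantan - 01 (WEB 1080p Multi Audio) | (Dual) (My Deer Friend Nokotan)",
]
not_caught = [
    "Helck.S01E16.1080p.HIDI.WEB-DL.DUAL.AAC2.0.H.264-NanDesuKa.mkv",  # bare .DUAL.
    "Some.Show.S01E01.1080p.WEB-DL.Multi-Subs.H264-GRP",               # hypothetical
]

for title in caught + not_caught:
    print(is_medium_match(title), title)
```

The bare .DUAL. convention is the case this variant deliberately leaves out.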

High coverage (full)

(?<!strive|strive:|iii)([ ._\[(-]dual|multi[ ._-]audio)
  • A common separator (including paren/bracket) and dual (except when preceded by strive, strive:, or iii), OR multi (common separator) audio.
  • Requiring the leading separator ensures that dual is never matched at the very start of a title or in the middle of a word (e.g. Synduality).
  • This matches on multi-audio, but not multi-sub.
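
The high-coverage pattern can be exercised against the titles discussed in this comment. The sketch below is a Python adaptation, which is an assumption for illustration: Python's `re` only accepts fixed-width lookbehinds, so the single `(?<!strive|strive:|iii)` is split into three separate lookbehinds that together reject the same prefixes. The dotted and bracketed release names are guesses at plausible naming, not real releases.

```python
import re

# Three fixed-width lookbehinds replace the one variable-width lookbehind.
HIGH_COVERAGE = re.compile(
    r"(?<!strive)(?<!strive:)(?<!iii)([ ._\[(-]dual|multi[ ._-]audio)",
    re.IGNORECASE,
)


def is_high_match(title: str) -> bool:
    """True if the high-coverage pattern matches anywhere in the title."""
    return HIGH_COVERAGE.search(title) is not None


should_match = [
    "Helck.S01E16.1080p.HIDI.WEB-DL.DUAL.AAC2.0.H.264-NanDesuKa.mkv",
    "Some.Show.S01E01.1080p.WEB-DL.Multi-Audio.H264-GRP",  # hypothetical
]
should_not_match = [
    "Synduality Noir - 01 [1080p]",                  # dual in the middle of a word
    "Dual! Parallel Trouble Adventure - 01",         # dual at the start of the title
    "Armitage III Dual Matrix (2001) [1080p]",       # preceded by III
    "GUILTY GEAR STRIVE: DUAL RULERS - 01 [1080p]",  # preceded by strive:
    "Guilty.Gear.Strive.Dual.Rulers.S01E01.1080p",   # preceded by strive
    "Some.Show.S01E01.1080p.WEB-DL.Multi-Subs.GRP",  # hypothetical, multi-sub
]

for title in should_match + should_not_match:
    print(is_high_match(title), title)
```

Each new title that contains dual as its own word would need another entry in the lookbehind list, which is the maintenance cost described above.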

Language matching

(EN|JA|ZH|KO)(?= ?\+ ?.*?(EN|JA|ZH|KO))
  • This uses a lookahead that starts at a matched language code and checks to see if it's followed by a +, anything else, and then another language code.
  • False positive risk: One language and a matching substring elsewhere, e.g. [EN+FR]-JaKo (fake release group).
  • False positive risk: Dual audio, but not English, e.g. [JA+KO].
  • The assumption is that this would only be running on anime-specific profiles, which would lower the risk much further than general cases.
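
The lookahead and its two false positive risks can be reproduced with a short sketch in Python's `re` (an assumption for illustration). The snippet closes the lookahead group with a final `)`, which the pattern needs in order to compile, and assumes case-insensitive matching, since the JaKo example only triggers when case is ignored. All release names are hypothetical.

```python
import re

LANG_CODES = re.compile(
    r"(EN|JA|ZH|KO)(?= ?\+ ?.*?(EN|JA|ZH|KO))",
    re.IGNORECASE,
)


def has_code_pair(title: str) -> bool:
    """True if a language code is followed by '+' and, later on, another code."""
    return LANG_CODES.search(title) is not None


intended = [
    "[Group] Some Show - 01 [1080p][EN+JA]",
    "[Group] Some Show - 01 [1080p][EN+DE+FR+JA]",
]
false_positives = [
    "Some.Show.S01E01.1080p.[EN+FR]-JaKo",    # second code found in the group name
    "[Group] Some Show - 01 [1080p][JA+KO]",  # two languages, neither is English
]
no_match = [
    "[Group] Some Show - 01 [1080p][EN+FR]",  # only one listed code
]

for title in intended + false_positives + no_match:
    print(has_code_pair(title), title)
```

Because the lookahead scans to the end of the title, anything after the + can supply the second code. That is where the JaKo-style false positive comes from.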
(English|Japanese|Chinese|Korean) ?[ ._\+&-] ?(English|Japanese|Chinese|Korean)
  • No lookahead here, because it's too risky. Just a full language name, plus common separators, plus another full language name.
  • False positive risk: Dual audio, but not English, e.g. [Japanese + Korean].
  • The assumption is that this would only be running on anime-specific profiles, which would lower the risk much further than general cases.
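
A similar sketch for the full-language-name pattern, again in Python's `re` with case-insensitive matching assumed and hypothetical release names. It shows the adjacency requirement: two names must be separated by at most one separator character plus optional spaces.

```python
import re

FULL_NAMES = re.compile(
    r"(English|Japanese|Chinese|Korean) ?[ ._\+&-] ?(English|Japanese|Chinese|Korean)",
    re.IGNORECASE,
)


def has_name_pair(title: str) -> bool:
    """True if two of the listed language names appear next to each other."""
    return FULL_NAMES.search(title) is not None


adjacent = [
    "[Group] Some Show - 01 [1080p][English + Japanese]",
    "[Group] Some Show - 01 (1080p Japanese & English)",
    "Some.Show.S01E01.1080p.English.Japanese.WEB",
    "[Group] Some Show - 01 [1080p][Japanese + Korean]",  # the non-English case
]
not_adjacent = [
    "[Group] Some Show - 01 [1080p][English, German, Japanese]",
    "[Group] Some Show - 01 [1080p][English]",
]

for title in adjacent + not_adjacent:
    print(has_name_pair(title), title)
```

The comma-separated list is not matched because a comma is not in the separator class and German sits between the two listed names.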

Conclusion and request

No, it's not perfect. But, whichever way we go, it's a huge improvement. I'd like to defer to you, dear maintainers, on your desired coverage/performance/false positive tradeoff of these options, and after incorporating any feedback, hopefully we can move forward on one of them. :)

@zakkarry
Contributor

In regards to the \s replacement, just to follow up: given that we are talking about other languages, I think it's probably best to proceed with caution regarding literals in the case of whitespace. There are a variety of character sets that COULD be used, and if you truly want this to be "universal", there is a chance (particularly, as always, with anime and entirely different alphabets) that a title contains whitespace other than what we are used to seeing in English. Is it likely? Probably not THAT likely... but given that it makes no real difference for the normal ASCII space (character 32), and that it's all single lines, I think it would be best to go with \s in this case (and potentially any others subject to different character sets).

It's also much less of a concern for trailing spaces to be used, in my opinion, than for different languages to have different whitespace usage.

Just my opinion, though.
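
To make the character-set point concrete, here is a minimal sketch using the ideographic space (U+3000), a whitespace character common in Japanese text. It uses Python's `re`, where \s on text strings covers Unicode whitespace. Whether a real release title ever contains this character, and how the *arr regex engine treats it, are assumptions not verified here.

```python
import re

IDEOGRAPHIC_SPACE = "\u3000"  # full-width space used in CJK text

# Hypothetical title where the separator is not an ASCII space.
title = "[Group] Some Show - 01 (1080p Dual" + IDEOGRAPHIC_SPACE + "Audio)"

literal_hit = re.search(r"dual audio", title, re.IGNORECASE)
whitespace_hit = re.search(r"dual\saudio", title, re.IGNORECASE)

print("literal space matches:", literal_hit is not None)
print("\\s matches:", whitespace_hit is not None)
```

Here the literal-space pattern misses the title while the \s version catches it. For ordinary ASCII spaces the two behave the same.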

@adapowers
Author

Heard and appreciated! Whatever the ultimate PR winds up being, I'll make sure to replace spaces with \s.

@rg9400
Contributor

rg9400 commented Nov 26, 2024

@adapowers I have this flagged to look at, but I'll need to see when I find time due to the holidays. Prima facie, I think we can certainly expand it to handle some of the cases you've highlighted. I've jotted down some quick thoughts from reading your post, but I haven't gone over it in detail or spent time with the regex; I will do that later.

Do you see lots of results with [English + Japanese]? I only included the [EN+JA] cases because that's what the Arrs rename the file to include if multiple languages are detected.

The other sort of thing I noticed was your reasoning for the Dual tradeoff. I think it makes sense but is something I am wary of in terms of maintenance. The other potential concern could be episode titles since it could match on those, though I think most anime groups do not include them (though P2P do sometimes).

Also, I do not think we should match on [JA+KO] because there are certain "dual" releases such as for Tower of God that would then match. This can be easily controlled by creating two versions, one with EN first and one with it at the end. The caveat here is that a group might mark those releases as dual audio regardless based on differing conventions. But I do not think it hurts to control here.

@zakkarry
Contributor

Given the number of regex steps, and the substantially increased processing time that every individual multi-language case takes, I'm not entirely sure the benefits outweigh the cost of covering every single language variation in one regex, and the precedent that every language is covered is only tech debt.

I would recommend taking a rather lax approach to what you include and maybe write an additional CF or 2 for users that need them. Or a brief "this is how to add your own language"

Bundling everything into one only introduces complexity for maintaining, cost for processing, and precedent to cover every language in the future. None of that sounds particularly great to me, but I will defer to you all as anime isn't my thing to begin with.

@adapowers
Author

adapowers commented Nov 26, 2024

I appreciate both your notes, and will respond meaningfully ASAP. To ask a Q now though, @zakkarry, in response to something you said: is there precedent for a (TRaSH-synced) "advanced" version of a CF with qualified tradeoffs? My hopeful goal is to improve everyone's *arr experience, else I'd just keep my CF for myself and unsync from the guide's. :) Overall, I know we have to minimize debt+maximize compatibility to keep the UX bar high, just knowing what other options exist could be helpful for deciding how to split this out.

(And to your other point: anime naming is a monster, and we could use all the CPU cycles in the world trying to account for it; it's likely always going to be a little more expensive compared to other CFs for even decent coverage, but obviously cost:benefit is always a relevant factor.)

@zakkarry
Contributor

I don't think that DUAL would ever be considered a default CF, but I don't use the sync'ing software and use mostly custom CF's and scores I wrote and manage to begin with, so I wouldn't be the person to ask about the feasibility or anything of that.

It was just what I would suggest if all things were equal.

@rg9400
Contributor

rg9400 commented Nov 27, 2024

@adapowers been going through your medium regex for now (I know that we have to make a decision on the riskier ones, will get to that in a bit). Some notes on it below, but otherwise, I think that is easy enough to adopt without much concern.

  1. I am okay with the first and second conditions in this regex
  2. I think we should force EN to be on one or the other side of the language code groupings. My lazy way to do that is to write two conditions, one with it at the front, the other at the end like in the current version. The current version is also very closely tied to our existing file rename (versus actual release titles), so I think we can control this a bit more if we want to expand it to capture cases where other languages come in between. I am a bit wary of the lack of boundaries (our file rename always has the languages between [ and ]) or the white spaces. We really only need to capture [EN+DE+IT+JA] as well (with any permutation of number of languages) but not much else.
  3. Like the above, for the full language condition, we should keep English in the first or last group.
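
One possible reading of points 2 and 3, sketched in Python's `re`: the codes must sit between [ and ], EN must be the first or the last code, and at least one other code must be present. This is a hypothetical pattern written for illustration, not the regex from the guide or from this PR. Treating any two-letter code as a language is an assumption.

```python
import re

# Hypothetical: bracket-bounded list of two-letter codes joined by '+',
# with EN required at the front or at the end.
EN_ANCHORED = re.compile(
    r"\[(?:EN(?:\+[A-Z]{2})+|(?:[A-Z]{2}\+)+EN)\]",
    re.IGNORECASE,
)


def has_en_anchored_codes(name: str) -> bool:
    """True if a bracketed code list starts or ends with EN."""
    return EN_ANCHORED.search(name) is not None


accepted = ["Show - 01 [EN+JA]", "Show - 01 [JA+EN]", "Show - 01 [EN+DE+IT+JA]"]
rejected = ["Show - 01 [JA+KO]", "Show - 01 [EN]", "Show - 01 EN+JA"]

for name in accepted + rejected:
    print(has_en_anchored_codes(name), name)
```

With the brackets as boundaries, a code list without English such as [JA+KO] no longer matches, and neither does an unbracketed EN+JA.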

For the more aggressive regex, I have verified it works, but we just have to make a call here. Mainly P2P groups use this format, and the guide isn't super oriented around them (though they do get scored). On the other hand, your regex is controlling for the edge cases we can predict so far, though it seems the processing time and maintenance will be the tradeoff (as well as those edge cases we cannot predict such as episode titles which might be included in some p2p releases). I am still thinking through this, but we can fix the above since it is consistent across all the formats in the meantime.

@bakerboy448 bakerboy448 added the Status: Waiting for User label Dec 3, 2024
@bakerboy448 bakerboy448 marked this pull request as draft December 3, 2024 22:28
@bakerboy448 bakerboy448 added the Do Not Merge label Dec 3, 2024
Labels
Area: Backend, Area: Radarr, Area: Sonarr, Area: Starr Custom Formats, Do Not Merge, Status: Conflicted, Status: Waiting for User
5 participants