Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(synced-lyrics): romanization #2790

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

ArjixWasTaken
Copy link
Contributor

@ArjixWasTaken ArjixWasTaken commented Dec 28, 2024

Started working on romaji support!

TODO:

  • Languages:
    • Japanese
    • Mandarin
    • Korean
  • UI
    • synced lyrics
    • plain lyrics (e.g. lyrics genius)

Preview:
image


Related discussion #2382
Blocked by #2788

@ArjixWasTaken ArjixWasTaken marked this pull request as draft December 28, 2024 00:43
@ArjixWasTaken ArjixWasTaken changed the title feat(synced-lyrics): init romanization! feat(synced-lyrics): romanization Dec 28, 2024
@ArjixWasTaken

This comment was marked as resolved.

@ArjixWasTaken
Copy link
Contributor Author

ArjixWasTaken commented Dec 29, 2024

This brings an issue to the surface, should I run only one romanizer and not all of them?
Like, Japanese has many chinese characters, that are often pronounced in different ways, like the example above, where de became teki

If I run the pinyin romanization first, it will affect japanese lyrics, but if I run the japanese romanization first, it will affect chinese lyrics.....

I'll decide what to do about this later, for now I'll ignore this issue

@ArjixWasTaken
Copy link
Contributor Author

Ok, for now I ended up with the following solution to the problem described earlier:

  • if a line contains hiragana or katakana, it is japanese, if it only contains kanji and doesn't use hiragana/katakana at all, it is chinese

That fixed the issues I encountered in chinese and japanese lyrics, unfortunately, if the lyrics are entirely in kanji, I can't do much about it 😔

More creative solutions are always welcome!

@JellyBrick
Copy link
Collaborator

JellyBrick commented Dec 29, 2024

Ok, for now I ended up with the following solution to the problem described earlier:

  • if a line contains hiragana or katakana, it is japanese, if it only contains kanji and doesn't use hiragana/katakana at all, it is chinese

That fixed the issues I encountered in chinese and japanese lyrics, unfortunately, if the lyrics are entirely in kanji, I can't do much about it 😔

More creative solutions are always welcome!

You can use lingua for language detection

https://github.com/pemistahl/lingua-js

or
https://github.com/dachev/node-cld
https://github.com/nitotm/efficient-language-detector-js

@ArjixWasTaken
Copy link
Contributor Author

That sounds a little bit overkill, and honestly, I doubt it would work unless it knows the difference between Chinese and Japanese grammar, because a lot of kanji characters are 1:1 the same as Chinese characters

@JellyBrick
Copy link
Collaborator

If a sentence contains shinjitai (新字体), it can be considered Japanese (although there are cases where it uses kanji (旧字体), but in those cases I think it's mostly distinguishable by the presence of hiragana and katakana)

@JellyBrick
Copy link
Collaborator

JellyBrick commented Dec 29, 2024

Also, if a sentence contains Korean hanja (한자), it can be considered Korean.

@ArjixWasTaken
Copy link
Contributor Author

ArjixWasTaken commented Dec 29, 2024

Hmm, I think it should be safe to assume that if at least one line contains katakana/hiragana, then we are dealing with japanese for all lines.

If no japanese/korean is detected anywhere, that's when I should use the pinyin library, I think that should work fine for most cases?

I really don't want to add more dependencies than needed

@ArjixWasTaken
Copy link
Contributor Author

ArjixWasTaken commented Dec 29, 2024

It should be much more reliable now, I doubt there are many japanese songs that are exclusively in kanji, if at least one line contains one character that is hiragana/katagana, then all lines are treated as japanese

otherwise, it is treated as chinese

@ArjixWasTaken
Copy link
Contributor Author

ArjixWasTaken commented Dec 29, 2024

@JellyBrick es-hangul doesn't have a method to test if a given text contains hangul, so I found some regex to do the job
the regex doesn't have to be perfect, don't overthink it, we just want to know if at least one line contains a hangul character or not, to determine what romanizer to use, I know korean unicode is complex, but this really should be enough


image

@kimjammer
Copy link
Contributor

Got an edge case for you, some songs are in 3 languages. Currently it looks like only one romanizer can be active for a song. In this particular song, Japanese and Korean don't appear on lines together so you could just pick the romanizer line by line, but I'm not sure if there are songs out there that do.
image
We love CJK 😃😃😃.
Anyhow, looking forward to this feature.
(I'm also going to assume you've noticed that the fancy font size isn't being applied correctly)

@ArjixWasTaken
Copy link
Contributor Author

ArjixWasTaken commented Dec 29, 2024

Got an edge case for you, some songs are in 3 languages. Currently it looks like only one romanizer can be active for a song. In this particular song, Japanese and Korean don't appear on lines together so you could just pick the romanizer line by line, but I'm not sure if there are songs out there that do. image We love CJK 😃😃😃. Anyhow, looking forward to this feature. (I'm also going to assume you've noticed that the fancy font size isn't being applied correctly)

detecting line-by-line for korean+japanese should be fine, but add chinese to the mix and (╯°□°)╯︵ ┻━┻

@ArjixWasTaken
Copy link
Contributor Author

@kimjammer could you send a list of 2-3 songs that have lyrics and contain chinese+korean+japanese?

@kimjammer
Copy link
Contributor

kimjammer commented Dec 29, 2024

By 3 languages I meant Japanese + English + Korean, but I think it would be good to have a list of mixed language songs for testing so:
JE, KE = Japanese and English on same line, Korean and English on same line, Japanese and Korean NOT in same line

J, K, JE, KE: LIT - STAYC
J, K, JE, KE, JK, JKE: Bad Girl for You - EXID
K, C, KE: HIGHLIGHT - SEVENTEEN
C, CK, CE: HWAA (Chinese Ver.) - (G)I-DLE
C, J, K, CE, KE: From Home - NCT U

I don't think any song has Chinese and Japanese on the same line
Good luck!

I totally understand if you just draw the line somewhere, there's probably not too many of these super multilingual songs.

@ArjixWasTaken
Copy link
Contributor Author

@kimjammer can you confirm if this is correct?

https://music.youtube.com/watch?v=WRWg4EA_dwQ&t=161
image

@kimjammer
Copy link
Contributor

I pulled the pr to test it and I can confirm that the top 4 songs look right to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants