
feat: consider using wtpsplit for better sentence split (especially for splitting Japanese, Chinese and Korean) #2

Closed
DoodleBears opened this issue Jun 28, 2024 · 3 comments


DoodleBears commented Jun 28, 2024

segment-any-text: wtpsplit

wtpsplit does machine-learning-based splitting instead of rule-based splitting:

  • SaT gives better results
  • WtP gives better speed (WtP with the wtp-bert-mini model is good enough for splitting Chinese and Japanese)

DoodleBears commented Jun 28, 2024

  1. Split the text first (using WtP)
  2. Combine the substrings back together by detected language (using langdetect and fast-langdetect)
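The recombination step can be sketched as below. This is a hypothetical stand-in: guess_lang is a naive Unicode-script heuristic used only so the example is self-contained, whereas the actual code (in the gist linked at the end of this comment) uses langdetect and fast-langdetect, and its lang_concat takes different arguments:

```python
import unicodedata

def guess_lang(s: str) -> str:
    # Naive script-based guess; a stand-in for a real language detector.
    scripts = set()
    for ch in s:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name:
            scripts.add("ja")
        elif "HANGUL" in name:
            scripts.add("ko")
        elif "CJK" in name:
            scripts.add("zh")
        elif ch.isascii():
            scripts.add("en")
    # Kana or hangul anywhere outweighs shared CJK ideographs.
    for lang in ("ja", "ko", "zh", "en"):
        if lang in scripts:
            return lang
    return "und"

def lang_concat(substrings: list[str]) -> list[tuple[str, str]]:
    # Merge adjacent substrings that share a detected language.
    merged: list[tuple[str, str]] = []
    for sub in substrings:
        lang = guess_lang(sub)
        if merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + sub)
        else:
            merged.append((lang, sub))
    return merged

for lang, seg in lang_concat(["我的名字是", "田中さんです。"]):
    print(f"{lang}: {seg}")
```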

texts = [
    "我是 VGroupChatBot,一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和Brainstorming等情况下。你好我的名字是西野くまですmy name is bob很高兴认识你どうぞよろしくお願いいたします「こんにちは」是什么意思。",
    "你好,我的名字是西野くまです。I am from Tokyo, 日本の首都。今天的天气非常好,sky is clear and sunny。おはようございます、皆さん!我们一起来学习吧。Learning languages can be fun and exciting。昨日はとても忙しかったので、今日は少しリラックスしたいです。Let's take a break and enjoy some coffee。中文、日本語、and English are three distinct languages, each with its own unique charm。希望我们能一起进步,一起成长。Let's keep studying and improving our language skills together. ありがとう!",
    "你好,今日はどこへ行きますか?",
    "我的名字是田中さんです。",
    "我喜欢吃寿司和拉面、おいしいです。",
    "我喜欢吃寿司和拉面おいしいです。",
    "今天の天気はとてもいいですね。",
    "我在学习日本語、少し難しいです。",
    "我在学习日本語少し難しいです。",
    "日语真是おもしろい啊",
    "你喜欢看アニメ吗?",
    "我想去日本旅行、特に京都に行きたいです。",
    "昨天見た映画はとても感動的でした。" "我朋友是日本人、彼はとても優しいです。",
    "我们一起去カラオケ吧、楽しそうです。",
    "你今天吃了什么、朝ごはんは何ですか?",
    "我的家在北京、でも、仕事で東京に住んでいます。",
    "我喜欢读书、本を読むのが好きです。",
    "这个周末、一緒に公園へ行きましょうか?",
    "你的猫很可爱、あなたの猫はかわいいです。",
    "我在学做日本料理、日本料理を作るのを習っています。",
    "你会说几种语言、何ヶ国語話せますか?",
    "我昨天看了一本书、その本はとても面白かったです。",
    "我们一起去逛街、買い物に行きましょう。",
    "你最近好吗、最近どうですか?",
    "我在学做日本料理와 한국 요리、日本料理を作るのを習っています。",
    "你会说几种语言、何ヶ国語話せますか?몇 개 언어를 할 수 있어요?",
    "我昨天看了一本书、その本はとても面白かったです。어제 책을 읽었는데, 정말 재미있었어요。",
    "我们一起去逛街와 쇼핑、買い物に行きましょう。쇼핑하러 가요。",
    "你最近好吗、最近どうですか?요즘 어떻게 지내요?",
]

# Assumes a loaded WtP model and a lang_concat helper; both are
# defined in the gist linked below, roughly:
#   from wtpsplit import WtP
#   wtp = WtP("wtp-bert-mini")
for text in texts:
    substr = wtp.split(text, threshold=5e-5)
    lang_concat(text, substr, True)
zh|0: 我是 
en|1: VGroupChatBot
zh|2: ,一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和Brainstorming等情况下。你好我的名字是西野
ja|3: くまです
en|4: my name is bob
zh|5: 很高兴认识你
ja|6: どうぞよろしくお願いいたします「こんにちは」
zh|7: 是什么意思。
------------
zh|0: 你好,我的名字是西野
ja|1: くまです。
en|2: I am from Tokyo,
ja|3: 日本の
zh|4: 首都。今天的天气非常好,
en|5: sky is clear and sunny。
ja|6: おはようございます、皆さん!
zh|7: 我们一起来学习吧。
en|8: Learning languages can be fun and exciting。
ja|9: 昨日はとても忙しかったので、今日は少しリラックスしたいです。
en|10: Let's take a break and enjoy some coffee。
zh|11: 中文、
ja|12: 日本語、
en|13: and English are three distinct languages, each with its own unique charm。
zh|14: 希望我们能一起进步,一起成长。
en|15: Let's keep studying and improving our language skills together.
ja|16: ありがとう!
------------
zh|0: 你好,今
ja|1: 日はどこへ行きますか?
------------
zh|0: 我的名字是
ja|1: 田中さんです。
------------
zh|0: 我喜欢吃寿司和拉面、
ja|1: おいしいです。
------------
zh|0: 我喜欢吃寿司和拉面
ja|1: おいしいです。
------------
ja|0: 今天の天気はとてもいいですね。
------------
zh|0: 我在学习
ja|1: 日本語、少し難しいです。
------------
zh|0: 我在学习
ja|1: 日本語少し難しいです。
------------
zh|0: 日语真是
ja|1: おもしろい啊
------------
zh|0: 你喜欢看
ja|1: アニメ
zh|2: 吗?
------------
zh|0: 我想去日本旅行、
ja|1: 特に京都に行きたいです。
------------
ja|0: 昨天見た映画はとても感動的でした。
zh|1: 我朋友是日本人、
ja|2: 彼はとても優しいです。
------------
zh|0: 我们一起去
ja|1: カラオケ吧、楽しそうです。
------------
zh|0: 你今天吃了什么、
ja|1: 朝ごはんは何ですか?
------------
zh|0: 我的家在北京、
ja|1: でも、仕事で東京に住んでいます。
------------
zh|0: 我喜欢读书、
ja|1: 本を読むのが好きです。
------------
zh|0: 这个周末、
ja|1: 一緒に公園へ行きましょうか?
------------
zh|0: 你的猫很可爱、
ja|1: あなたの猫はかわいいです。
------------
zh|0: 我在学做日本料理、
ja|1: 日本料理を作るのを習っています。
------------
zh|0: 你会说几种语言、
ja|1: 何ヶ国語話せますか?
------------
zh|0: 我昨天看了一本书、
ja|1: その本はとても面白かったです。
------------
zh|0: 我们一起去逛街、
ja|1: 買い物に行きましょう。
------------
zh|0: 你最近好吗、最近
ja|1: どうですか?
------------
zh|0: 我在学做日本
ko|1: 料理와 한국 요리
ja|2: 、日本料理を作るのを習っています。
------------
zh|0: 你会说几种语言、
ja|1: 何ヶ国語話せますか?
ko|2: 몇 개 언어를 할 수 있어요?
------------
zh|0: 我昨天看了一本书、
ja|1: その本はとても面白かったです。
ko|2: 어제 책을 읽었는데, 정말 재미있었어요。
------------
zh|0: 我们一起去逛街
ko|1: 와 쇼핑
zh|2: 、
ja|3: 買い物に行きましょう。
ko|4: 쇼핑하러 가요。
------------
zh|0: 你最近好吗、最近
ja|1: どうですか?
ko|2: 요즘 어떻게 지내요?
------------

Full code: https://gist.github.com/DoodleBears/e266c756e70a91cd42b4a5757d7ff2c2

@DoodleBears DoodleBears changed the title feat: consider using wtpsplit for better sentence split (especially for Japanese and Chinese) feat: consider using wtpsplit for better sentence split (especially for splitting Japanese, Chinese and Korean) Jun 28, 2024

@sudoskys sudoskys reopened this Jul 6, 2024
@sudoskys sudoskys closed this as completed Jul 6, 2024

neutron-nerve bot commented Jul 6, 2024

Report on Feature Implementation: Use of wtpsplit for Improved Sentence Splitting

Issue Description

This issue proposed improving the sentence-splitting mechanism for Chinese, Japanese, and Korean text in the LlmKira/fast-langdetect repository by replacing the existing rule-based method with a machine-learning-based approach using wtpsplit.

Proposed Solution

The proposal was to use wtpsplit, specifically the wtp-bert-mini model, for sentence splitting. The rationale for the switch was better accuracy at acceptable speed:

  • Accuracy: wtpsplit's machine-learning-based splitting produces better segmentation than rule-based methods.
  • Performance: the wtp-bert-mini model is fast enough for practical use, particularly when handling Chinese and Japanese text.

Implementation Details

The implementation involved two main steps:

  1. Splitting texts: using wtpsplit to segment each input text at a predefined probability threshold.
  2. Recombining segments: using langdetect and fast-langdetect to merge adjacent segments that share a detected language.

The Python code snippet provided in the issue demonstrates this logic:

texts = [ ... ]  # Sample text data in various languages

for text in texts:
    substr = wtp.split(text, threshold=5e-5)
    lang_concat(text, substr, True)

The output of the concatenated texts showed language-tagged segments for better readability and processing, achieving the goal of language-specific sentence segmentation.
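The lang|index: text lines shown in the issue can be produced with a trivial formatter once segments are tagged; a sketch, with the segment list copied from one of the examples above:

```python
# Language-tagged segments, as produced by the split-then-recombine step
segments = [("zh", "你好,今"), ("ja", "日はどこへ行きますか?")]

# Render each segment as "lang|index: text", the format used in the
# example output in this thread
lines = [f"{lang}|{i}: {text}" for i, (lang, text) in enumerate(segments)]
print("\n".join(lines))
```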

Outcome and Conclusion

The new sentence-splitting mechanism was tested against multiple multilingual texts and correctly split and tagged sentences across Chinese, Japanese, and Korean, including sentences that switch language mid-clause.

The feature has been merged and the issue closed, moving the project toward better multilingual text handling.

Acknowledgment

We extend our thanks to @DoodleBears for the detailed proposal and implementation and to @sudoskys for managing the closure of the issue.



sudoskys added a commit that referenced this issue Jul 6, 2024
#2

- Add contributor information
- Set language to English
- Enable automatic issue labeling, title formatting, and report closing with issues
sudoskys added a commit that referenced this issue Jul 6, 2024
⬆️ feat: bump version to 0.2.0 in pyproject.toml

#2