Text-Processing-and-Lang-Detection-for-Vietnamese

This repository provides tools for text processing and language detection specifically tailored for Vietnamese. The project leverages the underthesea library for comprehensive text processing tasks and fastText for efficient language detection.

Introduction to Underthesea

Underthesea is an NLP toolkit for Vietnamese language processing. It offers a wide range of functionality, but this project uses only the following:

Sentence Segmentation: Breaking down text into individual sentences.
Word Tokenization: Splitting sentences into words.
POS Tagging: Assigning parts of speech to each word.
Named Entity Recognition (NER): Identifying and classifying entities in text.
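
The four steps listed above can be chained in a few lines. The following is a minimal sketch, not this repository's own code: the `analyze` helper and the shape of its return value are hypothetical, while `sent_tokenize`, `word_tokenize`, `pos_tag`, and `ner` are the functions underthesea exposes for these tasks.

```python
def analyze(text):
    """Run the four underthesea steps on `text`, one result dict per sentence.

    `analyze` is a hypothetical helper written for this README.
    """
    # Imported inside the function because importing underthesea can be slow.
    from underthesea import sent_tokenize, word_tokenize, pos_tag, ner

    results = []
    for sentence in sent_tokenize(text):      # sentence segmentation
        results.append(
            {
                "sentence": sentence,
                "words": word_tokenize(sentence),  # word tokenization
                "pos": pos_tag(sentence),          # (word, tag) pairs
                "entities": ner(sentence),         # tuples ending in an NER tag
            }
        )
    return results
```

Calling `analyze("Tôi sống ở Hà Nội. Tôi thích phở.")` should return one dictionary per sentence; the exact tags depend on the installed underthesea version and its models.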

Note: Underthesea has compatibility issues with Python 3.12 and above. Before installing it, set up Python 3.11 or an earlier version.

    pip install underthesea
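
To catch the Python version issue noted above early, a script could check the interpreter before importing underthesea. This is only a sketch: the `python_supported` helper is hypothetical, and the `(3, 12)` cutoff is taken from the note above rather than from underthesea's own metadata.

```python
import sys


def python_supported(version_info=sys.version_info, first_unsupported=(3, 12)):
    """Return True if the interpreter is older than `first_unsupported`.

    The (3, 12) default mirrors the note in this README (an assumption,
    not something read from the underthesea package).
    """
    return tuple(version_info[:2]) < first_unsupported
```

A script might call `python_supported()` at startup and exit with a clear message when it returns False.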

Introduction to fastText

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It is particularly known for its efficiency and accuracy in language detection tasks.

This project uses fastText's pretrained language identification model, lid.176.bin, which recognizes 176 languages.

Download the model on Windows:

    curl -O https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Download the model on Linux:

    wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
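
Once lid.176.bin is downloaded, language detection could look roughly like the sketch below. It assumes the `fasttext` Python package is installed and that the model file sits in the working directory. The helper names `load_lid_model` and `detect_language` are hypothetical and not taken from this repository; `fasttext.load_model` and `model.predict` are the library calls they wrap.

```python
LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def load_lid_model(path="lid.176.bin"):
    """Load the language identification model (path is an assumption)."""
    import fasttext  # imported here so the helpers below work without it

    return fasttext.load_model(path)


def detect_language(model, text, k=1):
    """Return the top-k (language_code, probability) pairs for `text`."""
    # predict() handles one line at a time, so collapse newlines and extra spaces.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=k)
    return [
        (label.removeprefix(LABEL_PREFIX), float(prob))
        for label, prob in zip(labels, probs)
    ]
```

For example, `detect_language(load_lid_model(), "Xin chào các bạn")` is expected to return a single pair whose code is `vi` for Vietnamese text, with a probability that depends on the input.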

