This project focuses on developing advanced text processing and language detection capabilities specifically for the Vietnamese language. Our primary goal is to create an AI system capable of real-time interaction with users in the context of real estate.

hangtantai/Text-Processing-and-Lang-Detection-for-Vietnamese

Text-Processing-and-Lang-Detection-for-Vietnamese

This repository provides tools for text processing and language detection specifically tailored for Vietnamese. The project leverages the underthesea library for comprehensive text processing tasks and fastText for efficient language detection.

Introduction to Underthesea

Underthesea is an NLP toolkit for Vietnamese language processing. It offers a wide range of functionality, but this project uses only the following:

  • Sentence Segmentation: Breaking down text into individual sentences.
  • Word Tokenization: Splitting sentences into words.
  • POS Tagging: Assigning parts of speech to each word.
  • Named Entity Recognition (NER): Identifying and classifying entities in text.

Note: Underthesea has compatibility issues with Python 3.12 and above. Set up Python 3.11 or an older version before installing it.

    pip install underthesea
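The four tasks listed above can be chained roughly as follows. This is a minimal sketch, not the project's actual pipeline: the function name `analyze_text` and the shape of the returned dictionaries are hypothetical, and it assumes the `sent_tokenize`, `word_tokenize`, `pos_tag`, and `ner` functions that underthesea exports.

```python
def analyze_text(text):
    """Run the four underthesea steps on every sentence of `text`.

    `analyze_text` and the dictionary keys are illustrative names,
    not part of underthesea or of this repository.
    """
    # Imported lazily so this module still loads where underthesea is absent.
    from underthesea import sent_tokenize, word_tokenize, pos_tag, ner

    results = []
    # Step 1: sentence segmentation.
    for sentence in sent_tokenize(text):
        results.append({
            "sentence": sentence,
            # Step 2: word tokenization (multi-syllable words stay together).
            "words": word_tokenize(sentence),
            # Step 3: part-of-speech tagging.
            "pos_tags": pos_tag(sentence),
            # Step 4: named entity recognition.
            "entities": ner(sentence),
        })
    return results
```

Calling `analyze_text("Tôi muốn thuê một căn hộ ở Hà Nội. Giá thuê là bao nhiêu?")` should return one dictionary per detected sentence. The exact tags depend on the installed underthesea version and its bundled models.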

Introduction to fastText

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It is particularly known for its efficiency and accuracy in language detection tasks.

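This project uses fastText's pretrained language identification model, lid.176.bin. Download it before running language detection.

On Windows (in PowerShell, call curl.exe explicitly so the built-in curl alias is not used):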

    curl -O https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

On Linux:

    wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
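Once the model file is downloaded, language detection could look roughly like the sketch below. It assumes the `fasttext` Python package is installed (the README does not say how) and that `lid.176.bin` sits in the working directory. The helper names `load_model`, `parse_label`, `detect_language`, and `is_vietnamese`, as well as the 0.5 threshold, are hypothetical choices for illustration.

```python
def load_model(path="lid.176.bin"):
    """Load the language identification model (path is an assumption)."""
    # Imported lazily so the helpers below work without fasttext installed.
    import fasttext
    return fasttext.load_model(path)


def parse_label(label):
    """Strip fastText's '__label__' prefix, e.g. '__label__vi' becomes 'vi'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def detect_language(model, text):
    """Return (language_code, confidence) for the top prediction."""
    # predict() works on a single line, so collapse newlines and extra spaces.
    single_line = " ".join(text.split())
    labels, probabilities = model.predict(single_line, k=1)
    return parse_label(labels[0]), float(probabilities[0])


def is_vietnamese(model, text, threshold=0.5):
    """True when the top label is 'vi' with at least `threshold` confidence."""
    language, confidence = detect_language(model, text)
    return language == "vi" and confidence >= threshold
```

A typical call would be `detect_language(load_model(), "Tôi muốn thuê một căn hộ ở Hà Nội.")`, which returns the predicted language code together with its confidence score. Passing the model as an argument keeps it loaded once and reused across requests, which matters for real-time use.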
