NLP: Clean raw and messy text with regular expressions

Background

This code is part of a larger NLP machine learning project. Before a machine learning model can be trained, fine-tuned, or used for predictions, the raw input text must be cleaned so that it can be tokenized.

General Info

The code cleans and preprocesses raw, messy text, mostly with regular expressions (Python re package, Python 3.11). "main.py" contains a messy sample text to be cleaned. The methods of the class CleanText in "funcs/clean.py" transform the parts of the text that are matched by the compiled re patterns defined in "funcs/re_patterns.py". Both the patterns and the methods can be adjusted to specific needs. Besides the re package, some built-in Python string functions (such as "maketrans") are used.
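The division of labor described above (compiled patterns in one place, class methods that apply them, plus a "maketrans" table) could look roughly like the following minimal sketch. The pattern names, the class name CleanTextSketch, and its methods are assumptions made for illustration; they are not taken from "funcs/clean.py" or "funcs/re_patterns.py".

```python
import re

# Hypothetical patterns, in the spirit of "funcs/re_patterns.py".
URL_PATTERN = re.compile(r"https?://\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")

# Translation table that maps typographic quotes to plain ASCII quotes.
QUOTE_TABLE = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


class CleanTextSketch:
    """Illustrative stand-in, not the repository's CleanText class."""

    def remove_html_tags(self, text: str) -> str:
        return HTML_TAG_PATTERN.sub(" ", text)

    def remove_urls(self, text: str) -> str:
        return URL_PATTERN.sub(" ", text)

    def normalize_quotes(self, text: str) -> str:
        return text.translate(QUOTE_TABLE)

    def collapse_whitespace(self, text: str) -> str:
        return WHITESPACE_PATTERN.sub(" ", text).strip()

    def clean(self, text: str) -> str:
        # Order matters: strip markup and URLs first, tidy whitespace last.
        steps = (
            self.remove_html_tags,
            self.remove_urls,
            self.normalize_quotes,
            self.collapse_whitespace,
        )
        for step in steps:
            text = step(text)
        return text
```

Compiling each pattern once at module level and reusing it in the methods keeps the expressions in a single place, so a pattern can be changed without touching the cleaning logic.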

Setup

The main.py script contains sample text in the variable "messy_text".

To clean your own text, open main.py, paste the text into the variable "messy_text", and run the script. The cleaned text is printed.
To change what gets cleaned, adjust the class methods in "funcs/clean.py" and the regular expressions in "funcs/re_patterns.py".
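As a rough illustration of what adjusting the regular expressions can involve, the sketch below keeps the cleaning rules in a list and extends that list with one extra pattern. Only the variable name "messy_text" comes from this README; RULES, apply_rules, and the patterns themselves are hypothetical and do not reflect the repository's actual code.

```python
import re

# Hypothetical rule list: each entry pairs a compiled pattern with its replacement.
RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), " "),  # e-mail addresses
    (re.compile(r"#\w+"), " "),                          # hashtags
    (re.compile(r"\s+"), " "),                           # collapse whitespace; keep last
]


def apply_rules(text, rules=RULES):
    """Apply every (pattern, replacement) pair in order and trim the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip()


messy_text = "Contact  me at jane.doe@example.com   #nlp #regex\n\tthanks!"
print(apply_rules(messy_text))

# Adjusting to specific needs: insert a custom rule before the whitespace rule.
custom_rules = RULES[:-1] + [(re.compile(r"\d+"), "<NUM>")] + RULES[-1:]
print(apply_rules("Order 66  shipped", custom_rules))
```

The whitespace rule stays last so that gaps left behind by earlier substitutions are collapsed in a single pass.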


Repository: rainergo/NLP-CleanText
