This GitHub repository contains a versatile text processing toolkit for natural language processing (NLP) tasks in Python. It provides a set of functions to preprocess and analyze textual data, including removing stopwords, lemmatization, calculating word frequencies, performing part-of-speech tagging, and more. The toolkit is designed to help streamline the text data preparation process for various NLP projects, making it easier to clean, analyze, and extract valuable insights from text data.
- Stopword Removal: Remove common stopwords from text data to focus on meaningful content.
- Lemmatization: Reduce words to their base forms, particularly useful for reducing word variations.
- Word Frequency Calculation: Count the frequency of each word in the text.
- Part-of-Speech Tagging: Assign parts of speech to words in the text.
- Punctuation Removal: Eliminate punctuation marks from the text for cleaner analysis.
- Word Frequency Sorting: Sort words by frequency in descending order.
- Data Export to Excel: Save the processed data to an Excel file for further analysis and visualization.

Sample Usage: The repository includes a sample text and a Python script demonstrating how to use the toolkit. Users can input their own text data and apply the provided functions to perform the text processing operations listed above. The processed data can then be exported to an Excel file for further analysis.
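To make a few of these operations concrete, here is a minimal sketch of punctuation removal, stopword removal, and frequency counting with sorting. It uses only the Python standard library and is not the repository's actual code: the function names and the tiny `STOPWORDS` set are assumptions made for illustration, and the real script relies on NLTK and spaCy for fuller stopword lists, lemmatization, and part-of-speech tagging.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# a real run would use a fuller list such as the ones shipped with NLTK or spaCy.
STOPWORDS = {"a", "an", "the", "is", "of", "over", "that", "to", "and"}


def remove_punctuation(text: str) -> str:
    """Strip ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs sorted by count in descending order."""
    tokens = remove_punctuation(text.lower()).split()
    counts = Counter(remove_stopwords(tokens))
    return counts.most_common()


if __name__ == "__main__":
    sample = "The quick brown fox jumped over the lazy dog. The dog slept."
    for word, count in word_frequencies(sample):
        print(word, count)
```

Splitting on whitespace after lowercasing is a deliberate simplification; a tokenizer from NLTK or spaCy would handle contractions and quotation marks more carefully.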
- Clone or download the repository to your local environment.
- Modify the Excel file path on line 71 of the script so it points to where the output should be saved.
- Add your sample text to the Python script.
- Run the script to see the text processing functions in action.
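The export step in the list above can be sketched as follows. This is a hedged illustration, not the script's actual code: `frequency_rows`, `export_to_excel`, and the column names are hypothetical, and the only library calls assumed are pandas' `DataFrame` constructor and `DataFrame.to_excel`. Writing `.xlsx` files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path


def frequency_rows(freqs: dict[str, int]) -> list[dict[str, object]]:
    """Turn a word -> count mapping into rows sorted by count, descending."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [{"word": word, "frequency": count} for word, count in ordered]


def export_to_excel(freqs: dict[str, int], output_path: Path) -> None:
    """Write the frequency table to an Excel file at output_path."""
    # Imported here so frequency_rows stays usable without pandas installed.
    import pandas as pd

    # index=False keeps the DataFrame's row numbers out of the spreadsheet.
    pd.DataFrame(frequency_rows(freqs)).to_excel(output_path, index=False)
```

Calling `export_to_excel({"dog": 2, "fox": 1}, Path("word_frequencies.xlsx"))` would write a two-column sheet; the file name here is a placeholder for whatever path you set in the script.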
The code in this repository builds upon various open-source Python libraries, including NLTK, spaCy, and pandas.
Sample text used: "The quick brown fox jumped over the lazy dog. A computer program is a collection of code that performs a specific task. Paris, often referred to as the 'City of Love' and the 'City of Lights,' is a captivating metropolis that needs no introduction. The sun sets behind the mountains, casting a warm golden hue across the tranquil lake."
Output in the Excel file: