Topic Modeling of Hebrew Tweets

This repository documents the process of topic modeling on Hebrew tweets, including data cleaning, topic extraction, classification, and validation. Below is a detailed overview of each step and the associated files.

Overview

EDA: Explored the data with simple processing steps and plots to gather initial insights.
Data Preparation: Cleaned the data by removing frequently occurring words and performed lemmatization to normalize words to their base forms.
Topic Modeling: Applied language models to generate a list of topics from the cleaned data.
Classification: Classified several hundred tweets into the generated topics, creating a dataset of tweets and their associated topics.
Validation: Used various validation methods to assess the accuracy of topic representation.
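
To illustrate the frequent-word removal mentioned under Data Preparation, here is a minimal sketch. It is not the code from this repository: the function name `remove_frequent_words`, the `top_n` cutoff, and the English placeholder tweets are all assumptions made for illustration, and the lemmatization step is left out because it depends on a Hebrew-specific tool.

```python
from collections import Counter


def remove_frequent_words(tweets, top_n=2):
    """Drop the top_n most common tokens in the corpus from every tweet.

    Hypothetical helper: whitespace tokenization and a fixed top_n cutoff
    are simplifying assumptions, not the repository's actual logic.
    """
    tokenized = [tweet.split() for tweet in tweets]
    # Count every token across the whole corpus.
    counts = Counter(token for tokens in tokenized for token in tokens)
    frequent = {word for word, _ in counts.most_common(top_n)}
    # Rebuild each tweet without the frequent tokens.
    return [
        " ".join(token for token in tokens if token not in frequent)
        for tokens in tokenized
    ]


# Placeholder tweets (English stand-ins for the Hebrew data).
sample = [
    "the game was great",
    "the weather was bad",
    "the election results",
]
cleaned = remove_frequent_words(sample, top_n=2)
print(cleaned)
```

In practice the cutoff would likely be chosen by inspecting the word-frequency plots from the EDA step rather than fixed in advance.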

Improvement Suggestions

- Use a language model trained specifically for Hebrew to improve topic representation.
- Apply embeddings directly to the tweets to find similar ones without relying solely on classification.
- Fine-tune BERT or a similar model for more effective topic extraction and event identification.
- And much more :-)
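
The embedding idea above could look roughly like the following sketch. It only shows the similarity step: the vectors are hand-written placeholders, and the helpers `cosine_similarity` and `most_similar` are hypothetical names, not part of this repository. Real embeddings would come from whichever Hebrew-capable model is chosen.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def most_similar(query_index, embeddings):
    """Return (index, score) of the tweet closest to the query tweet."""
    scores = [
        (i, cosine_similarity(embeddings[query_index], vector))
        for i, vector in enumerate(embeddings)
        if i != query_index
    ]
    return max(scores, key=lambda pair: pair[1])


# Placeholder 3-dimensional "embeddings", one per tweet (assumed values).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
best_index, best_score = most_similar(0, embeddings)
print(best_index, round(best_score, 3))
```

Comparing tweets this way needs no predefined topic list, which is why it could complement the classification step rather than replace it.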

Files and Notebooks

1. Data Analysis and Modeling Notebook

Filename: twitter_topic_modeling.ipynb
Description: This notebook contains the exploratory data analysis and the topic generation.

2. Topic Modeling Script

Filename: main.py
Description: This Python script generates the topics from the tweets.

3. Topic Validation Notebook

Filename: Topic_Validation.ipynb
Description: This notebook performs validation of the topic modeling process and evaluates the accuracy of topic representation.

Note: I ran the notebooks on Google Colab with an A100 GPU.
