🎯 Real Time Social Media Content Retrieval System

Project Description

The Real Time Social Media Content Retrieval System is a platform for retrieving LinkedIn posts in real time based on user queries. Users enter a query, and the system returns the relevant posts. Users can fetch multiple posts and choose how many similar results to retrieve from the database. Although currently limited to LinkedIn posts, the system can be extended to other social media platforms so that users can find similar posts across channels.

Limitations

Currently, the system supports live retrieval of LinkedIn posts only. To add another social media platform, fetch its data and store it as JSON files in the data folder, using the following format:

{
    "Name": "<account_name>",
    "Posts": {
        "<Post_ID>": {
            "text": "<fetched_data>",
            "post_owner": "<account_name>",
            "source": "<social media platform name, e.g. LinkedIn>"
        }
    }
}
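
As an illustration of the format above, the following minimal sketch builds one account record and writes it to a JSON file. The function names, the example account, and the choice to name the file after the account are assumptions made for this example; they are not taken from the project code.

```python
import json
from pathlib import Path


def build_account_record(account_name, posts, source):
    """Return a dict in the JSON layout described above.

    `posts` maps a post ID to the post text. The function name and its
    arguments are illustrative assumptions, not part of the project.
    """
    return {
        "Name": account_name,
        "Posts": {
            post_id: {"text": text, "post_owner": account_name, "source": source}
            for post_id, text in posts.items()
        },
    }


def save_account_record(record, data_dir):
    """Write the record to <data_dir>/<Name>.json and return the file path.

    Naming the file after the account is an assumption of this sketch.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{record['Name']}.json"
    path.write_text(json.dumps(record, indent=4, ensure_ascii=False), encoding="utf-8")
    return path


# Hypothetical data from a platform other than LinkedIn.
record = build_account_record(
    "example_account",
    {"post_1": "First post text", "post_2": "Second post text"},
    source="Twitter",
)
```

Calling save_account_record(record, "data") would then place example_account.json in the data folder.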

Demo

Demo Video Link

Technologies Used

The project utilizes several technologies to create a streamlined pipeline:

Bytewax: Stream-processing framework used to build the data pipeline.
Qdrant: Serves as the vector database; it is written in Rust for fast data processing.
Pydantic: Used for data validation and data models.
Streamlit: Provides a simple user interface for the system, written in Python.
Selenium: Automates the browser workflow from Python.
BeautifulSoup: Scrapes the data from the HTML pages.

Installation

To run this project on your machine, follow these steps:

Create a virtual environment:

python3 -m venv venv

Activate the environment:

Windows:

venv\Scripts\activate

macOS and Linux:

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Ensure Docker is installed and run the Qdrant container:

sudo docker run -d -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant

Run the Streamlit app:

streamlit run app.py

Access the UI:

Open your web browser and navigate to localhost:8501 to start using the Real-Time Social Media Content Retrieval System.
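
If the app cannot reach the database, it can help to confirm that something is listening on the port published by the Docker command above. The following is a minimal sketch using only the Python standard library; the helper name is an assumption, and the check only tests that the TCP port accepts connections, not that Qdrant itself is healthy.

```python
import socket


def is_port_open(host="localhost", port=6333, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Port 6333 is reachable.")
    else:
        print("Nothing is listening on port 6333; is the Qdrant container running?")
```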

How to Use

To use this app, follow the steps below:

Step 1: Provide LinkedIn Credentials

Enter your LinkedIn username and password, along with the username of the account whose posts you want to fetch.

Step 2: Fetch LinkedIn Posts

Click on the "Fetch Details" button.
Wait while the app automatically opens LinkedIn and fetches the posts.

Step 3: Migrate Data to the Vector Database

If you have already fetched data and stored it in the data folder in JSON format, you can migrate it directly.
Ensure that your custom data follows the JSON format described under Limitations.
Use the provided migration tool or script to migrate the data to the vector database (Qdrant), following any instructions that accompany the tool.
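
This README does not show the migration script itself. As a rough sketch of the first half of such a migration, the code below reads every JSON file in a folder and flattens it into one record per post. The function name and the record keys are assumptions; embedding the text and uploading it to Qdrant are deliberately left out.

```python
import json
from pathlib import Path


def load_posts(data_dir):
    """Read every *.json file in `data_dir` and return one flat record per post.

    Files are expected to follow the format described under Limitations.
    """
    records = []
    for path in sorted(Path(data_dir).glob("*.json")):
        account = json.loads(path.read_text(encoding="utf-8"))
        for post_id, post in account.get("Posts", {}).items():
            records.append(
                {
                    "post_id": post_id,
                    "text": post["text"],
                    # Fall back to the account name if the post has no owner.
                    "post_owner": post.get("post_owner", account.get("Name")),
                    "source": post.get("source"),
                }
            )
    return records
```

Running load_posts("data") before a migration is also a quick way to check that custom files parse and contain the expected keys.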

Step 4: Search in the Database

Once the migration is completed successfully, you can start searching in the database.
Access the user interface of the application.
From the left-side panel, select the number of results you want to fetch from the database.
Enter your query in the search bar and initiate the search.
The application will retrieve and display relevant posts from the database based on your query.
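
To make the "number of results" setting concrete, here is a toy sketch of a top-k similarity search over a few stored vectors. In the real system Qdrant performs the search, and the vectors come from an embedding model; the brute-force loop, the cosine metric, the function names, and the two-dimensional vectors below are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (post_id, score) pairs most similar to query_vector."""
    scored = [
        (post_id, cosine_similarity(query_vector, vector))
        for post_id, vector in stored.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for three stored posts.
stored_vectors = {
    "post_a": [1.0, 0.0],
    "post_b": [0.0, 1.0],
    "post_c": [1.0, 1.0],
}
results = top_k([1.0, 0.0], stored_vectors, k=2)
```

Here k plays the role of the value chosen in the left-side panel: a larger k returns more posts, ordered from most to least similar.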

Contribution

Developers are welcome to contribute to this project. Here's how:

Fork the repository.
Create a new branch for your feature or bug fix.
Make your changes and keep the code clean.
Write tests for your changes (if applicable).
Commit your changes with a descriptive message.
Push your branch to your forked repository.
Create a pull request with a detailed description of your changes.

Contact

For any suggestions, comments, or inquiries, please contact bhikadiyamanthan@gmail.com or reach out via LinkedIn: https://www.linkedin.com/in/manthanbhikadiya/. Your input is highly appreciated and will help make this project more useful.

Special Mention

Many thanks to Paul Iusztin for generously providing the code and an efficient pipeline for the retrieval system. This project wouldn't have been possible without his contribution. I strongly encourage everyone to subscribe to his newsletter.
Github Repo: https://github.com/decodingml/articles-code/tree/main/articles/large_language_models/real_time_retrieval_system_for_social_media_data
Newsletter: https://decodingml.substack.com/
