This project is an AI-powered web scraper that uses Langchain, GroqCloud, and Llama3.1 to scrape, parse, and analyze content from web pages. The app is built with Streamlit, containerized with Docker, and deployed on Vercel for ease of access.
- Uses Langchain for conversational and generative AI capabilities.
- Integrates with GroqCloud and the Llama3.1 model for powerful language understanding.
- Scrapes and parses web content based on user input on Streamlit.
- Deployed using Docker for containerized execution on Vercel.
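The scrape, parse, and analyze flow listed above could look roughly like the sketch below. It is not the project's actual parse.py: the function names, the 6000-character chunk size, the prompt wording, and the model ID `llama-3.1-8b-instant` are all assumptions made for illustration. Third-party imports are kept inside the functions that need them, so the chunking helper can be tried on its own.

```python
def clean_html(html: str) -> str:
    """Drop <script>/<style> tags and return the visible text, one line per row."""
    from bs4 import BeautifulSoup  # beautifulsoup4, listed in requirements.txt

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text(separator="\n")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def split_into_chunks(text: str, max_length: int = 6000) -> list[str]:
    """Cut text into pieces of at most max_length characters (assumed size)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


def parse_with_groq(chunks: list[str], instruction: str) -> str:
    """Ask the model to extract `instruction` from each chunk and join the answers."""
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_groq import ChatGroq  # reads GROQ_API_KEY from the environment

    # Hypothetical prompt; the real project may phrase this differently.
    prompt = ChatPromptTemplate.from_template(
        "Extract only the information matching this description: {instruction}\n\n"
        "Text:\n{content}"
    )
    chain = prompt | ChatGroq(model="llama-3.1-8b-instant")  # assumed model ID
    answers = [
        chain.invoke({"content": chunk, "instruction": instruction}).content
        for chunk in chunks
    ]
    return "\n".join(answers)
```

Chunking keeps each request within the model's context window; a page that fits in one chunk results in a single model call.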
.
├── Dockerfile # Docker setup for the app
├── main.py # The main entry point of the Streamlit app
├── parse.py # Contains the logic for handling user input and parsing web content
├── requirements.txt # Python dependencies
├── vercel.json # Configuration for deployment on Vercel
└── .env # Environment variables (API keys, etc.)
To run this project, you need the following dependencies, listed in requirements.txt:
groq
streamlit
langchain
langchain_core
langchain_groq
selenium
beautifulsoup4
lxml
html5lib
python-dotenv
The project uses environment variables stored in a .env file. Ensure you add your keys for GroqCloud and other services as needed.
Example .env file:
GROQ_API_KEY=your_groq_api_key
or
HUGGINGFACEHUB_API_TOKEN=your_HuggingFace_API_TOKEN
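As a minimal sketch of how the app might read these keys at startup: the helper names `load_env_file` and `resolve_api_key` are hypothetical, and preferring the Groq key over the Hugging Face token is an assumption, not something the project states. `load_dotenv` comes from python-dotenv, which is listed in requirements.txt.

```python
import os


def load_env_file(path: str = ".env") -> bool:
    """Load variables from a .env file into os.environ; True if any were set."""
    from dotenv import load_dotenv  # python-dotenv

    return load_dotenv(path)


def resolve_api_key(env=None):
    """Return (provider, key) for the first configured provider.

    Checks GROQ_API_KEY first, then HUGGINGFACEHUB_API_TOKEN (assumed order).
    """
    env = os.environ if env is None else env
    candidates = (
        ("groq", "GROQ_API_KEY"),
        ("huggingface", "HUGGINGFACEHUB_API_TOKEN"),
    )
    for provider, name in candidates:
        value = env.get(name, "").strip()
        if value:
            return provider, value
    raise RuntimeError(
        "Set GROQ_API_KEY or HUGGINGFACEHUB_API_TOKEN in your .env file"
    )
```

Calling `load_env_file()` once and then `resolve_api_key()` fails early with a readable message when neither key is present, rather than at the first model request.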
- Clone the repository:
  git clone https://github.com/your-repo/ai-web-scraper.git
  cd ai-web-scraper
- Set up the environment: create a .env file with your API keys as shown above.
- Install the dependencies:
  pip install -r requirements.txt
- Run the Streamlit app:
  streamlit run main.py
To containerize the app, a Dockerfile is provided. This Dockerfile installs all necessary dependencies, sets up ChromeDriver for scraping, and runs the Streamlit app.
docker build -t ai-web-scraper .
docker run -p 8501:8501 ai-web-scraper
The app will be accessible at http://localhost:8501 on the host machine, because the -p 8501:8501 flag publishes the container's port 8501 to the host.
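To confirm the container is serving the app after `docker run`, a small check like the one below can be used. It is a sketch: the function name `is_app_up` is hypothetical, and the `/_stcore/health` path assumes a recent Streamlit release that exposes that health endpoint.

```python
import urllib.request


def is_app_up(url: str = "http://localhost:8501/_stcore/health",
              timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    print("app is up" if is_app_up() else "app is not reachable")
```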
This project is deployed on Vercel using Streamlit and Docker.