Data Extraction and Pipeline Project

This repository contains the code and documentation for a data extraction and pipeline project. The project extracts data from several kinds of sources, transforms it, and loads it into different databases. An overview follows.

Overview

Extracted data from several kinds of sources, including APIs, CSV files, and JSON files.
Stored the extracted data in different databases:
- Structured data was stored in SQL databases, including SQL Server Express and MySQL.
- Semi-structured data was stored in MongoDB.
Built a pipeline that retrieves the data from these sources, applies transformations such as sentiment analysis on news data using a pretrained model, and loads the results into a local staging database.
Used PostgreSQL as the local staging database for the transformed data.
Used PySpark to design and implement the data processing pipeline.
Moved the data from the local data warehouse to a cloud-based service, Azure SQL.
Used Power BI to build visualizations and dashboards for analyzing the data.
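
The extract, transform, and load steps listed above can be sketched in a few lines. This is a minimal illustration and not the project's actual code: it uses only the Python standard library, SQLite stands in for the PostgreSQL staging database, and a toy word-count scorer stands in for the pretrained sentiment model. All function names, the word lists, and the news_staging table are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical word lists: a toy stand-in for the pretrained sentiment model.
POSITIVE = {"gain", "growth", "record", "strong"}
NEGATIVE = {"loss", "decline", "weak", "crash"}


def extract_csv(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Parse a JSON array of objects into a list of dict rows."""
    return json.loads(text)


def score_sentiment(headline):
    """Return 'positive', 'negative', or 'neutral' from simple word counts."""
    words = headline.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def transform(rows):
    """Attach a sentiment label to each news row."""
    return [
        {
            "id": int(r["id"]),
            "headline": r["headline"],
            "sentiment": score_sentiment(r["headline"]),
        }
        for r in rows
    ]


def load(conn, rows):
    """Write transformed rows into the (hypothetical) staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news_staging "
        "(id INTEGER PRIMARY KEY, headline TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO news_staging (id, headline, sentiment) "
        "VALUES (:id, :headline, :sentiment)",
        rows,
    )
    conn.commit()


def run_pipeline(csv_text, json_text, conn):
    """Extract from both inputs, transform, load, and return the row count."""
    rows = extract_csv(csv_text) + extract_json(json_text)
    load(conn, transform(rows))
    return conn.execute("SELECT COUNT(*) FROM news_staging").fetchone()[0]
```

Keeping extract, transform, and load as separate functions means each stage can be replaced independently, for example by swapping the toy scorer for a real model or the SQLite connection for a PostgreSQL one.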

Data Flow Diagram

The data flow diagram is stored as DataFlow.png in the repository root.

Project Structure

The project is structured as follows:

models/: Contains pretrained models used for sentiment analysis.
docs/: Contains project documentation.
visualizations/: Contains visualizations and dashboards created using Power BI.

Usage

To run the data extraction and pipeline:

1. Install the required dependencies listed in requirements.txt.
2. Edit conf.yaml to match your environment.
3. Run the main script (main.py) to execute the different components of the pipeline.
4. Open the visualizations and dashboards in the visualizations/ directory with Power BI.
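
Before running the pipeline, it can help to check that the configuration is complete. The sketch below is an assumption-heavy illustration: the layout of conf.yaml is not documented here, so the section names (mysql, mongodb, postgres) and the keys (host, port) are hypothetical placeholders. Loading the file relies on the PyYAML package, which is assumed to be installed.

```python
REQUIRED_SECTIONS = ("mysql", "mongodb", "postgres")  # hypothetical section names
REQUIRED_KEYS = ("host", "port")  # hypothetical keys expected in each section


def load_config(path="conf.yaml"):
    """Read the YAML file into a dict. Requires the PyYAML package."""
    import yaml  # imported here so validate_config works without PyYAML

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for section in REQUIRED_SECTIONS:
        settings = config.get(section)
        if not isinstance(settings, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in REQUIRED_KEYS:
            if key not in settings:
                problems.append(f"missing key: {section}.{key}")
    return problems
```

Calling validate_config(load_config()) and printing any returned problems before the pipeline starts surfaces configuration mistakes early, instead of as a connection error partway through a run.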

Feel free to contribute by submitting bug fixes, enhancements, or additional features.

License

This project is licensed under the MIT License.

mehassanhmood/BigData-Analytics
