This is a portfolio project for the Codecademy Data Science Career Path. It analyzes a dataset of US medical insurance costs.
The project can be viewed on Kaggle here.
This project was completed using Python and the following libraries:
- Pandas
- Matplotlib
The first section of the project imports the dataset, explores the data types, and computes basic statistics for each column. The code uses a try-except block so that the dataset loads from either the Kaggle or the GitHub working directory.
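The loading step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two file paths and the function names `load_dataset` and `summarize` are hypothetical, chosen only to show the try-except fallback pattern.

```python
import pandas as pd

# Hypothetical locations; the paths the project really uses are not stated here.
KAGGLE_PATH = "/kaggle/input/insurance/insurance.csv"
LOCAL_PATH = "insurance.csv"


def load_dataset(primary=KAGGLE_PATH, fallback=LOCAL_PATH):
    """Read the CSV from the primary path, or from the fallback if it is missing."""
    try:
        return pd.read_csv(primary)
    except FileNotFoundError:
        # The primary path does not exist in this environment, so try the other one.
        return pd.read_csv(fallback)


def summarize(df):
    """Return the dtype of each column and basic statistics for every column."""
    return df.dtypes, df.describe(include="all")
```

Calling `dtypes, stats = summarize(load_dataset())` would then give the data types and summary statistics in one step, whichever environment the notebook runs in.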
The second section of the project analyzes the dataset using various visualizations to explore the relationships between the variables. This includes histograms, scatterplots, and bar charts.
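As a rough sketch of the three chart types mentioned above, the snippet below draws one of each with Matplotlib. The column names `charges`, `bmi`, and `region` are assumptions about the dataset, and `plot_overview` is a hypothetical helper, not a function from the project.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_overview(df):
    """Draw a histogram, a scatterplot, and a bar chart side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Histogram: distribution of insurance charges (assumed column name).
    axes[0].hist(df["charges"], bins=20)
    axes[0].set_xlabel("charges")
    axes[0].set_ylabel("count")

    # Scatterplot: BMI against charges (assumed column names).
    axes[1].scatter(df["bmi"], df["charges"], s=10)
    axes[1].set_xlabel("bmi")
    axes[1].set_ylabel("charges")

    # Bar chart: mean charges per region (assumed column name).
    mean_by_region = df.groupby("region")["charges"].mean()
    axes[2].bar(mean_by_region.index, mean_by_region.values)
    axes[2].set_xlabel("region")
    axes[2].set_ylabel("mean charges")

    fig.tight_layout()
    return fig
```

In a notebook, `plot_overview(df)` followed by `plt.show()` would display the figure.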
The third section of the project studies the correlation between variables using the Pearson correlation coefficient, to identify which pairs of variables are most strongly correlated.
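One way to rank variable pairs by Pearson correlation with Pandas is sketched below. The helper name `strongest_correlations` is hypothetical, and restricting the calculation to numeric columns is an assumption of this sketch, not necessarily how the project handles categorical variables.

```python
from itertools import combinations

import pandas as pd


def strongest_correlations(df):
    """Return each pair of numeric columns with its Pearson r, strongest first."""
    # Assumption: only numeric columns are compared; others are ignored.
    corr = df.select_dtypes("number").corr(method="pearson")

    # List every unordered pair once, skipping self-correlations.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_a", "var_b", "pearson_r"])

    # Sort by absolute value so strong negative correlations rank high too.
    pairs = pairs.sort_values("pearson_r", key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)
```

Sorting by absolute value matters here because a coefficient near -1 indicates as strong a linear relationship as one near +1.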
The final section of the project looks at potential areas of bias in the dataset. This includes looking at the distribution of individuals in different regions and the impact of smoking on insurance costs.
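The two checks described above could look something like the following. The column names `region`, `smoker`, and `charges` are assumptions about the dataset, and both function names are hypothetical.

```python
import pandas as pd


def region_shares(df):
    """Fraction of individuals in each region (assumed column name 'region')."""
    return df["region"].value_counts(normalize=True)


def mean_charges_by_smoker(df):
    """Mean charges for each smoker category (assumed columns 'smoker', 'charges')."""
    return df.groupby("smoker")["charges"].mean()
```

Region shares that are far from equal would suggest that some regions are over- or under-represented, and a large gap between the smoker groups would indicate how much smoking status drives the charges.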
Feel free to check out the project on Kaggle and provide any feedback or suggestions!