This project applies common data cleaning and transformation techniques to a student performance dataset. The goal is to prepare the data for further analysis or machine learning modeling.
The dataset was obtained from Kaggle: https://www.kaggle.com/datasets/haseebindata/student-performance-predictions and contains information about students, including their attendance rate, study hours, previous grades, extracurricular activities, parental support, and final grades.
- Handling Missing Values
  - Although the original dataset contained no missing values, the script includes a step that drops any rows with missing values using `df.dropna()`. This keeps the script robust in case future datasets have missing data.
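As a minimal sketch of the missing-value step, the snippet below counts missing values per column and then drops incomplete rows. The sample rows are hypothetical and only the column names come from the dataset description; this is an illustration, not the project's actual script.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows (not taken from the Kaggle dataset).
df = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "StudyHoursPerWeek": [10.0, np.nan, 15.0],
    "FinalGrade": [80.0, 75.0, 90.0],
})

# Count missing values per column before removing anything.
missing_counts = df.isna().sum()

# Drop every row that contains at least one missing value.
df_clean = df.dropna()
```

Here `df_clean` keeps only the rows for students 1 and 3. Dropping rows is the simplest option; it discards data, so it suits cases where few rows are affected.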
- Removing Duplicates
  - Duplicate rows, if any, are removed based on the unique identifier `StudentID`. This prevents any bias or inaccuracies in the analysis due to redundant data.
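The duplicate-removal step could look roughly like the sketch below. It assumes that the first occurrence of each `StudentID` is the one to keep, which the description above does not specify; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; StudentID 2 appears twice.
df = pd.DataFrame({
    "StudentID": [1, 2, 2, 3],
    "FinalGrade": [80, 75, 75, 90],
})

# Keep the first row for each StudentID and drop later repeats.
# keep="first" is an assumption; it is also the pandas default.
df_unique = df.drop_duplicates(subset="StudentID", keep="first")
```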
- Normalization
  - The `AttendanceRate`, `StudyHoursPerWeek`, and `PreviousGrade` columns are normalized using `MinMaxScaler` from scikit-learn. This scales the features to a range of 0 to 1, making them comparable and potentially improving the performance of machine learning algorithms.
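Min-max scaling maps each value x to (x - min) / (max - min), computed per column. A minimal sketch with `MinMaxScaler` follows; the sample values and the `columns_to_scale` variable name are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample values for the three numeric columns.
df = pd.DataFrame({
    "AttendanceRate": [60.0, 80.0, 100.0],
    "StudyHoursPerWeek": [5.0, 10.0, 20.0],
    "PreviousGrade": [50.0, 70.0, 90.0],
})

columns_to_scale = ["AttendanceRate", "StudyHoursPerWeek", "PreviousGrade"]

# Fit the scaler on these columns and replace them with values in [0, 1].
scaler = MinMaxScaler()
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1. When the data is later split for modeling, a common practice is to fit the scaler on the training portion only and reuse it on the test portion.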
- Encoding Categorical Variables
  - The categorical variables `Gender` and `ParentalSupport` are transformed into numerical representations using one-hot encoding. This enables their inclusion in the mathematical computations required for modeling.
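One way to one-hot encode these columns is `pd.get_dummies`, sketched below. Whether the script uses this function or scikit-learn's `OneHotEncoder` is not stated here, so treat the choice as an assumption; the category values are also hypothetical.

```python
import pandas as pd

# Hypothetical category values; the real dataset may use different labels.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "ParentalSupport": ["High", "Low", "Medium"],
})

# Replace each categorical column with one indicator column per category.
df_encoded = pd.get_dummies(df, columns=["Gender", "ParentalSupport"])
```

Each original column is replaced by indicator columns named after the source column and category, such as `Gender_Female` or `ParentalSupport_High`.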
- Feature Engineering
  - A new feature, `GradeImprovement`, is calculated by subtracting `PreviousGrade` from `FinalGrade`. This derived feature provides direct insight into the academic progress of each student.
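The derived feature amounts to a single column subtraction, as in this sketch with hypothetical grades. It assumes both columns are on the same unscaled grade scale at the point where the feature is computed.

```python
import pandas as pd

# Hypothetical grades on the same (unscaled) scale.
df = pd.DataFrame({
    "PreviousGrade": [70, 85, 90],
    "FinalGrade": [78, 85, 84],
})

# Positive values mean the student improved; negative values mean a decline.
df["GradeImprovement"] = df["FinalGrade"] - df["PreviousGrade"]
```

For these sample rows the new column holds 8, 0, and -6.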
The Python script utilizes the Pandas library for data manipulation and scikit-learn for preprocessing.