This machine learning project has two separate goals:
- Predicting the airline ticket price (regression problem).
- Classifying the ticket price range into 4 categories: cheap, moderate, expensive, very expensive.
Both tasks rely on 10 features, including date, airline, ch code (airline code), num code, time taken, stop, arrival time, type, and route.
Data format: comma-separated values (CSV) file.
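The four price categories can be derived from the numeric price by binning. The sketch below is only an illustration: the source does not say how the category boundaries were chosen, so quartile-based bins (`pd.qcut`) are an assumption, and `price_to_category` is a hypothetical helper name.

```python
import pandas as pd

# Category names taken from the project description.
PRICE_LABELS = ["cheap", "moderate", "expensive", "very expensive"]


def price_to_category(prices: pd.Series) -> pd.Series:
    """Bin prices into 4 ordered categories.

    Assumption: quartile boundaries are used, so each category holds
    roughly a quarter of the tickets. The real project may use fixed
    price thresholds instead.
    """
    return pd.qcut(prices, q=4, labels=PRICE_LABELS)


prices = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000])
categories = price_to_category(prices)
print(categories.value_counts())
```

Quartile bins keep the classes balanced, which makes accuracy easier to interpret; fixed thresholds would be the better choice if the categories need a stable business meaning.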
- Data Analysis
- Preprocessing
- Modeling
- Testing
- Models Analysis (not done yet)
- Deployment on Heroku
Project can be found at: https://github.com/NourKamaly/AirlineTicketPricePrediction
Programming Languages: Python 3.9, JavaScript
Markup Languages: HTML, CSS
Tools used: Power BI
Frameworks: Bootstrap
Libraries used: NumPy, pandas, dataprep, matplotlib, SciPy, seaborn, TensorFlow, XGBoost, scikit-learn, joblib, Flask.
Analysis was done using Power BI and answered 20 questions about the data.
- Due to the presence of the date feature, the data was handled as a time series forecasting problem
- Data was sorted (mergesort) by month, day, flight departure hour, and flight departure minute to prevent data leakage when splitting the data into the training and validation sets.
- Features extracted: weekday of flight, flight day, flight month, and distance between the source and destination cities.
- Feature balancing applied to the airline feature, as some categories had relatively low frequency
- Outlier detection using the interquartile range on the label (price)
- Feature engineering applied to the other features
- Feature selection using the p-value
- Data transformation using the discrete cosine transform, as this is time series data and we suspected that it may be periodic
Multiple encoders were used, which resulted in 3 different datasets; training was done on each of them separately.
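Several of the preprocessing steps above can be sketched in a few lines of pandas and SciPy. This is a minimal illustration, not the project's actual code: the column names (`date`, `dep_hour`, `dep_minute`, `price`), the 80/20 split, the 1.5 IQR multiplier, and the choice of an orthonormal type-II DCT are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.fft import dct


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive weekday, day, and month from a datetime `date` column."""
    out = df.copy()
    out["weekday"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["day"] = out["date"].dt.day
    out["month"] = out["date"].dt.month
    return out


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Sort rows by time and split without shuffling.

    Every training row is at least as early as every validation row,
    so the model never sees the future. `kind="mergesort"` mirrors the
    description above; pandas documents `kind` as applying only to
    single-column sorts, so here it mainly records the intent.
    """
    ordered = df.sort_values(
        ["month", "day", "dep_hour", "dep_minute"], kind="mergesort"
    ).reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]


def remove_price_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose price lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["price"].between(q1 - k * iqr, q3 + k * iqr)]


def dct_transform(values) -> np.ndarray:
    """Orthonormal type-II DCT along the time axis (rows)."""
    return dct(np.asarray(values, dtype=float), type=2, norm="ortho", axis=0)
```

In a real pipeline the outlier bounds would normally be computed on the training portion only, so that validation prices do not influence which training rows are kept.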
10 models and one ensemble were tried for regression:
- eXtreme Gradient Boosting Regressor
- Poisson Regressor
- Histogram Gradient Boosting Regressor
- Linear Regression
- Light Gradient Boosting Machine Regressor
- Gradient Boosting Regressor
- Extra Tree Regressor
- Bagging Regressor
- Decision Tree Regressor
- Random Forest
- A bagging ensemble learning model (simple averaging) made with: HistGradientBoostingRegressor, LGBMRegressor, ExtraTreesRegressor, BaggingRegressor, RandomForestRegressor
The ensemble model and Random Forest achieved the two best R2 scores on the regression test set:
Ensemble model R2 score: 0.982
Random Forest R2 score: 0.980
9 models were tried for classification:
- AdaBoost
- Gradient Boosting Classifier
- Bagging Classifier
- Random Forest
- eXtreme Gradient Boosting Classifier
- Decision Tree Classifier
- Histogram Gradient Boosting Classifier
- Extra Tree Classifier
- Stacking ensemble model consisting of Random Forest, Bagging Classifier, and Extra Tree Classifier (the best-performing models)
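A stacking ensemble over the three base models named above could look like the sketch below, using scikit-learn's `StackingClassifier`. Several details are assumptions: the source does not name the meta-learner (logistic regression is used here), "extra tree classifier" is taken to mean `ExtraTreesClassifier`, and the hyperparameters and synthetic 4-class data are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def build_stacking_classifier(random_state: int = 0) -> StackingClassifier:
    """Stack RF, bagging, and extra-trees under a logistic-regression meta-learner.

    The meta-learner and all hyperparameters are assumptions.
    """
    base_models = [
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ("bagging", BaggingClassifier(n_estimators=20, random_state=random_state)),
        ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=random_state)),
    ]
    # cv=5: the meta-learner is trained on out-of-fold base predictions,
    # which limits overfitting to the base models' training-set outputs.
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Synthetic stand-in with 4 classes, matching the 4 price categories.
X_clf, y_clf = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)
stack = build_stacking_classifier().fit(X_clf[:320], y_clf[:320])
predictions = stack.predict(X_clf[320:])
print("Accuracy on held-out rows:", accuracy_score(y_clf[320:], predictions))
```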
Deployment was done using HTML, CSS, JavaScript, and Bootstrap for the interface; the backend was built with Flask.
The Website Link: https://airline-ticket-prediction-app.herokuapp.com/
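As a rough idea of how a Flask backend can serve predictions to such an interface, here is a minimal sketch. Everything specific in it is hypothetical: the `/predict` route, the JSON payload shape, and the `predict_price` stub, which stands in for a trained model (for example one loaded with joblib) and always returns 0.0.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_price(features: dict) -> float:
    """Hypothetical stub; a real backend would call a trained model here."""
    return 0.0


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    features = request.get_json(silent=True)
    if not isinstance(features, dict):
        return jsonify(error="expected a JSON object of features"), 400
    return jsonify(price=predict_price(features))
```

Calling `app.run()` starts Flask's development server locally; the front end would then POST the form values as JSON to the route and display the returned price.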