Online course completion prediction
-
Introduction
Online learning platforms have revolutionized education by making a wide range of courses accessible to people worldwide. However, a significant challenge these platforms face is the high dropout rate of students. Understanding and predicting which students are likely to complete a course can help educators and administrators design better intervention strategies, personalize learning experiences, and ultimately improve student retention and success rates.
-
Objective
The objective of this project is to train and deploy a machine learning model that predicts whether a student will complete an online course. By accurately predicting course completion, the model can assist online learning platforms in identifying at-risk students early and implementing measures to support them throughout their learning journey.
-
Data Source
The data for this project was sourced from kaggle.com and includes the following:
a) UserID: Unique identifier of a student.
b) CourseCategory: Category of the course a student enrolled in. The categories are Arts, Science, Programming and Health.
c) TimeSpentOnCourse: Total amount of time, in hours, a student spent on a course.
d) NumberOfVideosWatched: Number of course videos a student has watched.
e) NumberOfQuizzesTaken: Number of quizzes a student has taken for a course.
f) QuizScores: Average of the scores a student obtained across all quizzes taken.
g) CompletionRate: How much of a course a student has completed.
h) DeviceType: Type of device used to take the course. Two types were considered in this dataset - mobile (0) or laptop (1).
i) CourseCompletion: A boolean indicating whether a student completed a course (1) or not (0).
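For orientation, a minimal sketch of loading the dataset with pandas and checking that the expected fields are present. The file path is a placeholder; the actual CSV name from Kaggle may differ.

```python
import pandas as pd

# Placeholder path; the actual CSV name from Kaggle may differ.
df = pd.read_csv("data/online_course_engagement.csv")

expected_columns = [
    "UserID", "CourseCategory", "TimeSpentOnCourse", "NumberOfVideosWatched",
    "NumberOfQuizzesTaken", "QuizScores", "CompletionRate", "DeviceType",
    "CourseCompletion",
]

# Quick sanity check that all expected fields are present.
missing = set(expected_columns) - set(df.columns)
assert not missing, f"Missing columns: {missing}"

print(df[expected_columns].head())
```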
- Methodology
a) Data Preprocessing: Cleaning and transforming the raw data, handling missing values, and encoding categorical variables.
b) Exploratory Data Analysis (EDA): Analyzing the data to identify patterns, correlations, and insights that will inform feature engineering and model selection.
c) Feature Engineering: Creating new features that capture important aspects of student behavior and engagement.
d) Model Selection: Evaluating different machine learning algorithms (e.g., logistic regression, linear regression, Lasso) to find the best fit for the problem.
e) Model Training and Evaluation: Training the selected model on historical data and evaluating its performance using appropriate metrics (e.g., mean squared error).
f) Hyperparameter Tuning: Optimizing the model’s hyperparameters to improve its performance.
g) Deployment: Deploying the model to a web service and setting up a near real-time prediction system.
h) Monitoring: Continuously monitoring the model’s performance to ensure its accuracy and relevance over time.
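As a rough illustration of steps (a) to (e), here is a minimal scikit-learn sketch using logistic regression; the actual pipeline lives in train_model.py and may use different features, models and metrics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/online_course_engagement.csv")  # placeholder path

# Assumed feature split, matching the fields the web service accepts.
categorical = ["CourseCategory", "DeviceType"]
numerical = ["TimeSpentOnCourse", "NumberOfVideosWatched", "NumberOfQuizzesTaken"]
X = df[categorical + numerical]
y = df["CourseCompletion"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode categorical variables; numerical features pass through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))])

pipeline.fit(X_train, y_train)
print("Validation ROC AUC:", roc_auc_score(y_val, pipeline.predict_proba(X_val)[:, 1]))
```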
- Expected Outcomes
** Improved Retention Rates: By predicting and addressing factors leading to dropouts, an online platform can retain more students.
** Personalized Learning: Tailoring the learning experience based on the predicted needs and behaviors of students.
** Informed Interventions: Enabling educators to proactively engage with at-risk students and provide targeted support.
** Data-Driven Insights: Gaining a deeper understanding of the factors that contribute to course completion, which can inform future course design and teaching strategies.
There are 5 subdirectories in the root folder of the project.
** mlpipeline - contains the directories and files needed to pull the dataset, save the data, and train a model using a pipeline and workflow orchestration tool.
** model-deployment - contains all the logic to deploy the model to a web service, apply linting and formatting, and test it.
** infrastructure - contains the Terraform directories and files to automate provisioning of the resources needed to host the web service on the AWS Cloud platform.
** monitoring - contains the script for daily monitoring of the model performance.
** .github - contains the GitHub CI/CD pipeline configuration to automate the cloud deployment of the web service and the needed infrastructure.
mlpipeline, model-deployment and monitoring have separate pipenv environments.
NOTE: Everything is running locally at the moment. However, provision has been made for a transfer to the AWS cloud.
The following tools were used:
** Prefect for workflow orchestration
** MLflow for experiment tracking and model registry
** Evidently for monitoring
** Flask as a web service framework for model deployment
** Docker for packaging the deployment as a container
** Kubernetes for orchestrating the container deployment
** Postgres for storing monitoring data
** Grafana for visualization, observability and alerting
** Terraform as the Infrastructure as Code tool
** GitHub Actions for the CI/CD pipeline
To run things locally
MLPipeline Phase
- Ensure python and pipenv are installed
- Change directory to mlpipeline
- Create the following directories - data, monitoring_data - to save the appropriate data for model training and monitoring
- Run the command pipenv install --python <your-python-version> to install all the dependencies
- Start the MLflow server using the command mlflow server --backend-store-uri sqlite:///backend.db. This will start MLflow using an SQLite database as the backend store, with the database named backend.db in this case. The --default-artifact-root will be the local filesystem in this case.
- Access the MLflow UI at http://localhost:5000
- Prefect is the workflow orchestration tool of choice in this project.
- To start Prefect, and view flows when we train the model, run the command prefect server start
- Access the Prefect UI at http://localhost:4200
- To deploy the pipeline to run on a scheduled basis (weekly in this case), run python mlpipeline-deploy.py. This file contains the configuration for deploying the mlpipeline found in train_model.py to Prefect (a sketch of such a deployment script is shown after this list).
- Then run prefect deployment run <name-of-the-flow>/<name-of-the-deployment>, e.g. prefect deployment run run-pipeline/mlpipeline
- Start the Prefect agent to run the pipeline flow using the prefect agent start --pool "default-agent-pool" --work-queue "mlops" command
- For testing, you can trigger a run manually on the Prefect UI.
- Go to the MLflow UI to see that a model has been trained and registered in the model registry, and that metrics, including those for monitoring using Evidently, have been saved.
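A minimal sketch of what mlpipeline-deploy.py could contain, assuming Prefect 2.x and a flow function run_pipeline defined in train_model.py; the actual script may differ.

```python
from datetime import timedelta

from prefect.deployments import Deployment
from prefect.server.schemas.schedules import IntervalSchedule

from train_model import run_pipeline  # the @flow-decorated function ("run-pipeline")

# Build a deployment named "mlpipeline" for the "run-pipeline" flow,
# scheduled to run weekly on the "mlops" work queue.
deployment = Deployment.build_from_flow(
    flow=run_pipeline,
    name="mlpipeline",
    work_queue_name="mlops",
    schedule=IntervalSchedule(interval=timedelta(weeks=1)),
)

if __name__ == "__main__":
    deployment.apply()
```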
Model Deployment Phase
- Change directory to model-deployment
- Run the command pipenv install --python <your-python-version> to install all the dependencies
- The main code is found in predict.py
- Run the web service using the python predict.py command.
- Using Thunder Client or Postman, you can send a request to the service at http://localhost:9696 and see the expected response (a sample client request is sketched after this list).
- Sample data that can be sent is: { "UserID": 6001, "CourseCategory": "Science", "DeviceType": 1, "TimeSpentOnCourse": 42.238989, "NumberOfVideosWatched": 8, "NumberOfQuizzesTaken": 7 }
- The response will contain the prediction as CourseCompletion and the RUN_ID of the model we used.
- Further unit tests, integration tests, linting and formatting can be done, all in one go, using the make command, e.g. make -d. This will run all the commands present in the Makefile found in this directory.
- Note that the integration testing involves starting the Docker containers as specified in the docker compose file present in this directory. Do update the docker compose file with the values of the model RUN_ID and other details. Ensure the MLflow server is still running so the web service can connect to it.
- You can access the Postgres database via Adminer - http://localhost:8080 - to see the data that has been written to it, and visualize and perform some analytics with the creation of dashboards using Grafana - http://localhost:3000
- Next, create a local Kubernetes cluster using Docker Desktop, Minikube or any other tool (I used Docker Desktop)
- Ensure the Docker daemon has been started.
- Update the manifest files found in the k8s-local folder to contain real values, e.g. the image name and tag for the web service deployment and the Postgres password (base64 encoded)
- Run kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.3.0/deploy/static/provider/cloud/deploy.yaml in the k8s cluster to install the ingress controller for traffic routing.
- In the ingress.yaml file, use an FQDN of your choice (mine is model.pred.com)
- Add this domain name to the list of domain names in the hosts file and map it to localhost, e.g. 127.0.0.1 model.pred.com
- Apply the manifest files with kubectl apply -f k8s-local
- Test that the web service is up and running by sending requests again, to http://model.pred.com/web, using Postman or Thunder Client
- Access the other services using their subdomain names
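To illustrate the request/response exchange, here is a minimal client sketch using the requests library. The /predict route is an assumption; check predict.py for the actual endpoint.

```python
import requests

student = {
    "UserID": 6001,
    "CourseCategory": "Science",
    "DeviceType": 1,
    "TimeSpentOnCourse": 42.238989,
    "NumberOfVideosWatched": 8,
    "NumberOfQuizzesTaken": 7,
}

# The "/predict" route is an assumption; check predict.py for the actual path.
response = requests.post("http://localhost:9696/predict", json=student, timeout=10)
response.raise_for_status()

result = response.json()
print("CourseCompletion:", result["CourseCompletion"])
print("Model RUN_ID:", result["RUN_ID"])
```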
Monitoring Phase
- Change directory to monitoring
- Run the command pipenv install --python <your-python-version> to install all the dependencies
- This also requires prefect for workflow orchestration.
- The main code is in the daily_monitoring.py file
- The workflow is scheduled to run daily
- It involves pulling the previous day's data saved in the PostgreSQL database and, using Evidently, deriving some data quality metrics and saving them into a different table, metrics, in the database (see the sketch after this list).
- Running this workflow involves the same steps as in the MLPipeline phase above.
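A minimal sketch of the Evidently step described above, assuming Evidently's Report API and hypothetical reference/current DataFrames; daily_monitoring.py wires this into Prefect and PostgreSQL.

```python
import pandas as pd
from evidently.metric_preset import DataQualityPreset
from evidently.report import Report

# Hypothetical stand-ins: in daily_monitoring.py the reference data would come
# from the training set and the current data from the previous day's rows in
# the PostgreSQL database.
reference_data = pd.read_csv("monitoring_data/reference.csv")
current_data = pd.read_csv("monitoring_data/current.csv")

# Build a data quality report comparing the latest data against the reference.
report = Report(metrics=[DataQualityPreset()])
report.run(reference_data=reference_data, current_data=current_data)

# The resulting metrics dict is what would be written to the "metrics" table.
metrics = report.as_dict()
print(metrics["metrics"][0]["result"])
```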
To run things in the cloud - AWS
MLPipeline Phase
- Provision an EC2 instance on AWS and attach a role to it that allows writing to an S3 bucket
- Allow SSH access on port 22 from your IP address
- SSH into the instance and copy the mlpipeline directory to it.
- Install pipenv, and run pipenv install --python <your-python-version>
- Create an S3 bucket and a Postgres RDS instance. Allow ingress on port 5432 from the instance's security group to the database.
- Start the MLflow server with mlflow server --backend-store-uri postgresql://<USER>:<PASSWORD>@<HOST>:<PORT>/<DATABASE_NAME> --default-artifact-root s3://<BUCKET_NAME> (a sketch of pointing the training code at this server follows this list)
- Start the Prefect server as in the local MLPipeline phase above
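Once the server is up, the training code only needs to point at it. A minimal sketch, with the tracking URI and experiment name as placeholders/assumptions:

```python
import os

import mlflow

# Placeholder URI: use the EC2 instance's public DNS (or an attached domain).
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://<EC2_PUBLIC_DNS>:5000"))
mlflow.set_experiment("course-completion")  # experiment name is an assumption

with mlflow.start_run() as run:
    # Params/metrics go to the Postgres RDS backend store; artifacts logged in
    # this run go to s3://<BUCKET_NAME> via --default-artifact-root.
    mlflow.log_param("example_param", "value")
    print("RUN_ID:", run.info.run_id)
```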
Model Deployment Phase
- The model deployment phase under this section involves using GitHub Actions as the CI/CD tool of choice
- The configuration files for the CI/CD pipeline are located in the .github directory, which is present at the root directory
- The flow involves:
** Create a new git branch
** Make changes to the content of the infrastructure or model-deployment directories in the new branch
** Trigger the pre-commit hook when a commit is made (The pre-commit configuration file should ideally be in the root directory containing the .github directory. However, in this case, it is in model-deployment, since that is the directory with a pipenv environment containing the pre-commit package. Hence, this stage of quality checks is missed in this project)
** Push changes to GitHub and make a pull request to the main branch
** This triggers the CI pipeline to run tests and quality checks on the repository, as well as the terraform plan command
** Once this pull request is merged into the main branch, CD is triggered to run terraform plan again and, if successful, to apply any changes to the infrastructure
** The infrastructure configured in the Terraform files includes an EKS cluster, ECR, and a PostgreSQL database
** The PostgreSQL database is used to save the data that will be used for monitoring
** Visualization, analytics and dashboard creation will be done using Grafana Cloud
** Simply create an account on Grafana Cloud and connect the PostgreSQL database as a data source
** After provisioning the infrastructure, the CD pipeline will build a Docker image of the web service and push it to ECR
** The Kubernetes manifest files will be automatically updated with new values and re-applied to the Kubernetes cluster
Monitoring Phase
- This involves the same steps as in the local environment
- The EC2 instance created earlier for experiment tracking and prefect deployment will also be used to run this workflow.