Machine Learning Experiment for ETA service #356
Discussion on 07182020
More resources: https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/team_discussion_for_06182020.md
Briefly go through #356:
New York Taxi duration
Flight ETA Via SparkML
Flight ETA Via GCloud AI Platform
Draft machine learning diagram, related to #357 (comment)
Context Diagram
Notes: the
Container Diagram
To do
Component Diagram
Training data/Test data sample format
ETA service query format
Notes:
Subtask of #355
We plan to build a machine learning model based on users' GPS trace data. Here I record some experiments and proofs of concept for understanding the problem set.
I have done several experiments to get familiar with ML; here I record the 3 that I feel are most closely related:
Data and characteristics determine the upper limit of machine learning, and models and algorithms just approach this upper limit.
New York City Taxi Trip Duration from Kaggle
applied machine learning engineer
The Python notebook provided by the website helps generate live statistics, and we could also download the Docker image and deploy it on another cloud (Kaggle Python docker image, hub, instruction).
Data source
https://www.kaggle.com/c/nyc-taxi-trip-duration/data
OSRM features
For route and gps traces, we could do spatial index mapping to generate a list of spatial index boxes that represent the route (more info, Google S2).
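The spatial index mapping idea can be sketched as below. This is a minimal illustration that uses a fixed-size latitude/longitude grid as a stand-in for Google S2 cells; it does not use the S2 library. The function names `cell_id` and `route_to_cells` and the `cell_deg` parameter are hypothetical and not from the original experiment.

```python
import math


def cell_id(lat, lon, cell_deg=0.01):
    """Map a coordinate to the (row, col) index of a fixed-size grid cell.

    Assumption: a plain lat/lon grid, used here only as a stand-in for S2 cells.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def route_to_cells(points, cell_deg=0.01):
    """Convert an ordered list of (lat, lon) points into the ordered list of
    grid cells the route passes through, collapsing consecutive duplicates."""
    cells = []
    for lat, lon in points:
        cell = cell_id(lat, lon, cell_deg)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


# Example trace: three GPS points in Manhattan (illustrative values).
trace = [(40.7580, -73.9855), (40.7585, -73.9851), (40.7614, -73.9776)]
route_cells = route_to_cells(trace)
```

The resulting cell list can then be used as a categorical feature for the route, in place of raw coordinates.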
Weather feature
I think the weather feature is crawled from an open data website; you can find related data for this Kaggle competition here. For more information, see 6.1 Weather reports.
Feature extracting
PCA to transform longitude and latitude, which helps decision tree splits
Distance
Datetime
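A minimal sketch of the three feature groups above, using only the standard library. The PCA step here is a 2-D rotation onto the principal axes, the distance is the haversine great-circle distance, and the datetime features are hour, weekday and month. All function names are hypothetical, and the actual Kaggle kernels may derive more features than shown.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def pca_rotate(points):
    """Rotate 2-D points (e.g. lon/lat pairs) onto their principal axes.

    This is 2-D PCA without scaling: center the data, then rotate by the
    angle of the dominant eigenvector of the covariance matrix.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [((x - mean_x) * c + (y - mean_y) * s,
             -(x - mean_x) * s + (y - mean_y) * c) for x, y in points]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def datetime_features(ts):
    """Split a pickup timestamp into simple calendar features."""
    return {"hour": ts.hour, "weekday": ts.weekday(), "month": ts.month}


pickup = datetime(2016, 3, 14, 17, 24, 55)
features = datetime_features(pickup)
```

Rotating the coordinates matters for tree models because a street grid that runs diagonally in raw longitude/latitude becomes axis-aligned after the rotation, so a single split can separate it.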
Training
XGBoost
Parameter Tuning
Most parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), it has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. (XGBoost Parameters)
Try different parameters
This takes an extremely large amount of resources and time.
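To illustrate why trying different parameters is so expensive, the sketch below enumerates a small grid over a few XGBoost parameters. The parameter names are real XGBoost parameters, but the candidate values and the fold count are assumptions chosen for illustration, and no model is actually trained here.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned recommendations.
param_grid = {
    "max_depth": [4, 6, 8, 10],
    "eta": [0.05, 0.1, 0.3],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "min_child_weight": [1, 5, 10],
}


def iter_param_sets(grid):
    """Yield one dict per combination of parameter values (full grid search)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(iter_param_sets(param_grid))

# Each candidate would be trained once per cross-validation fold.
n_folds = 5  # assumed fold count
total_training_runs = len(candidates) * n_folds
```

Even this modest grid multiplies out to hundreds of combinations, and every one of them is a full training run per fold, which is why random search or a coarse-to-fine search is often used instead of an exhaustive grid.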
Cross Validation
http://blog.mrtz.org/2015/03/09/competition.html
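As a rough illustration of cross validation, the sketch below splits sample indices into k folds so that every sample is used for validation exactly once. It is a standard-library stand-in for what a library helper would do; the function name `k_fold_indices` and its defaults are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross validation.

    Indices are shuffled once with a fixed seed so the split is reproducible.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for fold_no, fold in enumerate(folds) if fold_no != i for j in fold]
        yield train, valid


splits = list(k_fold_indices(10, k=5))
```

Averaging the validation score over all folds gives an estimate that depends less on one lucky split, which is the concern behind relying on a single held-out leaderboard score.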
Flight Delay Estimation(gcloud)
Cloud Dataproc is easy to develop on and easy to scale. It launches a pre-built container image that contains tensorflow, python3, etc.
The pub/sub system could simulate live streaming with batch data.
Dataflow, Cloud Bigtable, and Data Studio help a lot with building a streaming system, which is discussed further in Streaming experiment for ETA service #357.
apache beam to aggregate data from pub/sub -> record the result as csv -> load the data into cloud bigtable -> trigger training with checkpoint; more info in Streaming experiment for ETA service #357.
Input Data
Fixed origin and destination
y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
// the machine learning algorithm predicts the probability that the flight is on time
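The labeling rule above can be written as a small helper. This is a minimal sketch; the function name `on_time_label` and the sample delays are hypothetical, while the 15-minute threshold comes from the rule itself.

```python
ON_TIME_THRESHOLD_MIN = 15  # threshold taken from the labeling rule above


def on_time_label(arrival_delay_min):
    """Return 1 if the flight counts as on time (delay < 15 minutes), else 0."""
    return 1 if arrival_delay_min < ON_TIME_THRESHOLD_MIN else 0


# Illustrative arrival delays in minutes (negative means early arrival).
sample_delays = [-5.0, 0.0, 14.9, 15.0, 42.0]
sample_labels = [on_time_label(d) for d in sample_delays]
```

With this encoding a classifier's output for class 1 can be read directly as the probability that the flight is on time.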
Logistic Regression via Spark
more info
After recording all data into csv, we could load the data into a dataframe or rdd (difference), then generate a dataframe that contains the result of feature engineering, then call:
train
prediction
evaluation
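The train, prediction and evaluation steps can be sketched with a tiny logistic regression written in plain Python. This is a stand-in for illustration only and does not use the Spark API; the single feature (departure delay in units of 10 minutes), the toy data, and the function names are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(model, x):
    """Probability that the flight is on time (y = 1)."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def evaluate_accuracy(model, rows, labels, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum(int(predict_proba(model, x) >= threshold) == y
               for x, y in zip(rows, labels))
    return hits / len(rows)


# Toy data (assumed): departure delay / 10 minutes -> on-time label.
rows = [[-0.5], [0.0], [0.5], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = train(rows, labels)
accuracy = evaluate_accuracy(model, rows, labels)
```

In the Spark version the same three steps would operate on the feature-engineered dataframe, and evaluation should use held-out data rather than the training rows used in this toy example.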
Tensorflow via Cloud Dataproc
More info
Why Wide & Deep helps
source
How to implement Wide & Deep
How TensorFlow scales
more info
How to scale gcloud ai platform
Flight Delay Estimation(open source stack)
Classification
more info
How to improve prediction model and how to evaluate
more info