Machine Learning Experiment for ETA service #356

Open · CodeBear801 opened this issue Jun 17, 2020 · 2 comments
Labels: Ideas (Ideas for long-term discussion), Prototype (Proof of concept)
CodeBear801 commented Jun 17, 2020

Subtask of #355

We plan to build a machine learning model based on users' GPS trace data. This issue records some experiments and proofs of concept for understanding the problem space.

I have done several experiments to get familiar with ML; here I record three of them that I feel are highly related:

  • New York City Taxi Trip Duration from Kaggle
  • Flight Delay Estimation (gcloud)
  • Flight Delay Estimation (open source stack)

Data and features determine the upper limit of machine learning; models and algorithms merely approach this upper limit.

New York City Taxi Trip Duration from Kaggle

Data source

https://www.kaggle.com/c/nyc-taxi-trip-duration/data

|   | id        | vendor_id | pickup_datetime     | dropoff_datetime    | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
| 0 | id2875421 | 2         | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1               | -73.982155       | 40.767937       | -73.964630        | 40.765602        | N                  | 455           |
| 1 | id2377394 | 1         | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1               | -73.980415       | 40.738564       | -73.999481        | 40.731152        | N                  | 663           |
  • No GPS traces
  • The scenario is fixed to New York, meaning both training data and test data exist in NY
  • 1,458,644 trip records in train.csv, and 625,134 trip records in test.csv
    • If we cluster origin and destination points, each cluster pair (orig-dest location pair) is covered by many records
      image

OSRM features

| id        | total_distance | total_travel_time | number_of_steps |
| id2875421 | 2009.1         | 164.9             | 5               |
| id2377394 | 2513.2         | 332.0             | 6               |
| id3504673 | 1779.4         | 235.8             | 4               |
  • The OSRM route is calculated from the orig/dest points; it yields distance, duration, and number of steps to represent the route
    • When we have GPS traces, we could do spatial index mapping to generate a list of spatial index cells representing the route (a sketch follows this list)
      image
      more info, Google S2
    • Or, we could do map matching: try to snap points to a list of navigable edges in the graph, then extract more features (more info)
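
A minimal sketch of the spatial-index mapping idea, assuming the s2sphere Python library (one of several S2 bindings); the function name and cell level are illustrative, not the project's actual code:

import s2sphere

def trace_to_cells(points, level=16):
    # map a GPS trace [(lat, lon), ...] to a deduplicated list of S2 cell tokens
    cells = []
    for lat, lon in points:
        cell = s2sphere.CellId.from_lat_lng(
            s2sphere.LatLng.from_degrees(lat, lon)).parent(level)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return [c.to_token() for c in cells]

# e.g. the two pickup points from the Kaggle sample above
print(trace_to_cells([(40.767937, -73.982155), (40.738564, -73.980415)]))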

Weather feature

I think the weather feature is crawled from an open-data website; you can find related data for this Kaggle competition here. For more information, go here -> 6.1 Weather reports

Feature extraction

  • PCA to transform longitude and latitude, which helps decision-tree splits (see the sketch after this list)
  • Distance
  • Normalized datetime
  • Speed
  • Clustering origin and destination
  • Temporal and geospatial aggregation
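
A minimal sketch of the PCA coordinate rotation, assuming pandas/scikit-learn and the Kaggle column names above (a common trick from public kernels, not necessarily the exact code used):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

train = pd.read_csv('train.csv')

# fit one rotation on all pickup and dropoff coordinates together
coords = np.vstack([train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values])
pca = PCA(n_components=2).fit(coords)

# the rotated axes roughly align with Manhattan's street grid, so the
# axis-parallel splits of a decision tree match the city layout better
train[['pickup_pca0', 'pickup_pca1']] = pca.transform(
    train[['pickup_latitude', 'pickup_longitude']].values)
train[['dropoff_pca0', 'dropoff_pca1']] = pca.transform(
    train[['dropoff_latitude', 'dropoff_longitude']].values)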

Training

XGBoost

import xgboost as xgb

# dtrain/dvalid: DMatrix objects built from the engineered features above
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster': 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}
model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)

Parameter Tuning

Most of the parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to become more complicated (e.g. more depth), it has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. XGBoost Parameters

Try with different parameters

# build a grid of parameter combinations; the random search samples from it
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5,  2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA, 
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA, 
                                         'nthread': -1, 'booster' : 'gbtree', 'eval_metric': 'rmse',
                                         'silent': 1, 'objective': 'reg:linear'})

Exhausting this grid takes an extremely large amount of resources and time; in practice only a random subset is tried, as sketched below.
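
A sketch of how the grid is then consumed: sample a few random combinations instead of training all 5×3×3×5×5×5 = 5,625 of them (reusing dtrain/watchlist from the training snippet above):

import random

for pars in random.sample(xgb_pars, 10):
    model = xgb.train(pars, dtrain, 60, watchlist,
                      early_stopping_rounds=50, maximize=False, verbose_eval=False)
    # best_score is the validation rmse at the early-stopping point
    print(model.best_score, pars)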

Cross Validation

http://blog.mrtz.org/2015/03/09/competition.html
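
A minimal sketch with xgboost's built-in k-fold helper (best_pars is a hypothetical name for the winning dict from the search above; fold count and rounds are illustrative):

cv = xgb.cv(best_pars, dtrain, num_boost_round=200, nfold=5,
            early_stopping_rounds=20, seed=42)
# the mean validation error across folds is a more stable estimate
# than a single train/valid split
print('cv rmse: %.4f' % cv['test-rmse-mean'].iloc[-1])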

Flight Delay Estimation (gcloud)

  • Keyword: SparkML, Logistic Regression, Tensorflow, Wide-and-Deep, Cloud Dataproc
  • My experiment: notes
  • Summary:
    • Cloud Dataproc makes development easy and is easy to scale. It launches a pre-built container image which contains tensorflow, python3, etc.
    • Google's Pub/Sub system can simulate live streaming with batch data
    • Dataflow, Cloud Bigtable, and Data Studio help a lot with building the streaming system, which is discussed more in Streaming experiment for ETA service #357
    • During testing, we use batch data (like one month's flight data) as input to the machine learning pipeline
    • In the live streaming system, Apache Beam is used to aggregate data from Pub/Sub -> record results as CSV -> load data into Cloud Bigtable -> trigger training with a checkpoint (a Beam sketch follows this list); more info in Streaming experiment for ETA service #357
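
A minimal Apache Beam sketch of that Pub/Sub aggregation step, assuming the Python SDK; the topic and bucket names are illustrative, and the Bigtable load and training trigger are omitted:

import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# read flight events from Pub/Sub, window them, and write one text shard per window
opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/flights')
     | 'decode' >> beam.Map(lambda b: b.decode('utf-8'))
     | 'window' >> beam.WindowInto(window.FixedWindows(60 * 60))
     | 'write' >> fileio.WriteToFiles(path='gs://my-bucket/flights/'))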

Input Data

Columns: FL_DATE, UNIQUE_CARRIER, AIRLINE_ID, CARRIER, FL_NUM, ORIGIN_AIRPORT_ID, ORIGIN_AIRPORT_SEQ_ID, ORIGIN_CITY_MARKET_ID, ORIGIN, DEST_AIRPORT_ID, DEST_AIRPORT_SEQ_ID, DEST_CITY_MARKET_ID, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, TAXI_OUT, WHEELS_OFF, WHEELS_ON, TAXI_IN, CRS_ARR_TIME, ARR_TIME, ARR_DELAY, CANCELLED, CANCELLATION_CODE, DIVERTED, DISTANCE, DEP_AIRPORT_LAT, DEP_AIRPORT_LON, DEP_AIRPORT_TZOFFSET, ARR_AIRPORT_LAT, ARR_AIRPORT_LON, ARR_AIRPORT_TZOFFSET, EVENT, NOTIFY_TIME

y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
# the machine learning algorithm predicts the probability that the flight is on time

Logistic Regression via Spark

more info

After recording all data into CSV, we can load it into a DataFrame or RDD (difference), generate a DataFrame containing the results of feature engineering, and then call train:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

examples = traindata.rdd.map(udf)  # udf builds a LabeledPoint(label, features) per row
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)

prediction

from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(xxx)  # xxx: the chosen decision threshold
lrmodel.predict(features)  # features: vector of independent variables

evaluation

def eval(labelpred):
    # labelpred: RDD of (label, predicted-probability) pairs,
    # split at the 0.7 decision threshold
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()

    # guard against division by zero on empty buckets
    cancel_denom = cancel.count() or 1
    nocancel_denom = nocancel.count() or 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel) / cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel) / nocancel_denom}

Tensorflow via Cloud Dataproc

More info

  • Why Wide & Deep helps
    image
    source

  • How to implement wide and deep (see the sketch after this list)
    image

  • How TensorFlow scales
    more info

  • How to scale GCloud AI Platform
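
A minimal sketch of the wide-and-deep wiring using the TF estimator API (feature names and sizes are illustrative assumptions, not the experiment's exact columns):

import tensorflow as tf

# wide (linear) side: sparse and crossed categorical features memorize
# specific routes; the deep side generalizes via embeddings and numeric inputs
origin = tf.feature_column.categorical_column_with_hash_bucket('origin', 1000)
dest = tf.feature_column.categorical_column_with_hash_bucket('dest', 1000)
route = tf.feature_column.crossed_column([origin, dest], 10000)

deep_cols = [tf.feature_column.numeric_column('dep_delay'),
             tf.feature_column.numeric_column('taxi_out'),
             tf.feature_column.numeric_column('distance'),
             tf.feature_column.embedding_column(origin, 8),
             tf.feature_column.embedding_column(dest, 8)]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[origin, dest, route],
    dnn_feature_columns=deep_cols,
    dnn_hidden_units=[64, 32])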

Flight Delay Estimation (open source stack)

  • Keyword: SparkML, Scikit-Learn, MongoDB, Kafka
  • My experiment: note
  • Summary
    • During development, I built all dependencies and Python connectors into Docker images
    • For the development stage, each Docker image needs to be configured and the dependencies verified to work together
    • To scale, you need a Docker orchestration tool such as K8S
    • If the environment is set up well, local development is similar to developing on a public cloud like gcloud or aws, but harder to manage

Classification

more info

How to improve the prediction model and how to evaluate it

more info
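
For flavor, a minimal scikit-learn version of the same on-time classification (the CSV name and feature subset are illustrative; columns follow the flight schema above):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv('flights_features.csv')  # hypothetical dump of the engineered features
y = (df['ARR_DELAY'] < 15).astype(int)    # 1 = on time, matching the labels above
X = df[['DEP_DELAY', 'TAXI_OUT', 'DISTANCE']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))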

CodeBear801 added the Ideas (Ideas for long-term discussion) and Prototype (Proof of concept) labels on Jun 17, 2020

CodeBear801 commented Jun 19, 2020

Discussion on 06/18/2020

more resources: https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/team_discussion_for_06182020.md

Brief walk-through: #356

New York Taxi duration

Flight ETA Via SparkML

https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/making_predictions_sample.ipynb

  • Topics
    • Data input, target
    • Feature extraction with Spark
    • Classification with Spark (how to decide buckets)
    • Evaluation
    • Why Parquet
    • Why DataFrame, not RDD
    • Optimization (optional)

Flight ETA Via GCloud AI Platform

  • Topics
    • ENV setup
    • Feature extraction via Tensorflow
    • Training and evaluation
    • Wide & Deep Learning (optional)
    • Scale (how to scale with GCloud, how TensorFlow scales)


CodeBear801 commented Jun 22, 2020

Draft machine learning diagram, related to #357 (comment)

Context Diagram

image

(click for large image)

Note: the input here is the output component of the flow in #357 (comment)

Container Diagram

To do

Component Diagram

level 1
image

level 2
image

Training data/Test data sample format

trace_id, userid, start_position, end_position, duration, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...

ETA service query format

userid, start_position, end_position, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...
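
A hedged sketch of the query record as a Python dataclass (all types are assumptions; the trailing "..." fields are left out):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EtaQuery:
    userid: str
    start_position: Tuple[float, float]  # (lat, lon)
    end_position: Tuple[float, float]
    distance: float
    osrm_legs: int
    avg_speed: float
    osrm_distance: float
    osrm_duration: float
    osrm_edge_list: List[int]
    spatial_index_cell_list: List[str]   # e.g. S2 cell tokens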

Notes:

  • The Training Model accepts bounded data (e.g. the past 3 months) and unbounded data (e.g. live data)

  • Testing will mainly be based on bounded data, which could be a CSV containing GPS trace data; each line follows the Training data/Test data sample format above

  • Features which need heavy calculation will be moved to steps prior to Model Training, such as:

    • OSRM route calculation
    • Map Matching
    • Spatial index mapping for trace points
    • Live traffic/historical speed injection for spatial cells
    • Weather extraction
  • The Model Training component is supposed to generate the following features (just for example; whether they are needed still requires evaluation):

    • PCA for origin and destination
    • Clustering
    • Time formatting
    • Great-circle distance between origin and destination (see the haversine sketch at the end of these notes)
  • For how to generate unbounded data based on users' traces, please go to Streaming experiment for ETA service #357 (comment)

  • Different models for different purposes

    • If we want to estimate ETA for a specific user's usual route (say, a daily commute, detected by another service and passed to the ETA service with a flag), it should be one model per user per pattern
    • If we want to estimate ETA for a generic user's random query, we need a single model for all such requests
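
A minimal sketch of the great-circle (haversine) feature mentioned above:

import math

def great_circle_distance(lat1, lon1, lat2, lon2):
    # haversine distance in meters between two lat/lon points given in degrees
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. the first Kaggle trip above, pickup -> dropoff
print(great_circle_distance(40.767937, -73.982155, 40.765602, -73.964630))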
