Solution of team TSE to NeurIPS2022-Traffic4cast Challenge
The packages required to run the scripts are listed in requirements.txt.
In addition, the official t4c package has to be installed in advance.
pip install -r requirements.txt
The scripts used for data imputation, data preparation, feature extraction and model training & prediction are included in
. Before running the scripts, please configure the paths in config.json
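As a quick sanity check before running anything, the configured paths can be verified up front. The sketch below assumes only that config.json is a flat JSON object whose string values are file-system paths; the actual keys are not listed here, and the helper name `missing_paths` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def missing_paths(config_file):
    """Return {key: path} for every string entry whose path does not exist.

    Assumption: the config is a flat JSON object and its string values
    are paths. Non-string values (numbers, booleans, ...) are ignored.
    """
    with open(config_file, "r", encoding="utf-8") as f:
        config = json.load(f)
    return {
        key: value
        for key, value in config.items()
        if isinstance(value, str) and not Path(value).exists()
    }
```

Calling `missing_paths("config.json")` from the repository root should return an empty dict once all paths are set correctly.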
The model checkpoints are included in the folder processed/checkpoints.
| Checkpoints | Description |
| --- | --- |
| lgb_1+nr2_model_london.pkl | London model with Manhattan and normed Euclidean distance |
| lgb_1+nr2_model_madrid.pkl | Madrid model with Manhattan and normed Euclidean distance |
| lgb_1+nr2_model_melbourne.pkl | Melbourne model with Manhattan and normed Euclidean distance |
| lgb_1+p2_model_london.pkl | London model with Manhattan and Euclidean distance |
| lgb_1+p2_model_madrid.pkl | Madrid model with Manhattan and Euclidean distance |
| lgb_1+p2_model_melbourne.pkl | Melbourne model with Manhattan and Euclidean distance |
| lgb_full_missing_model_london.pkl | London model for samples with high missing rate |
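The checkpoint file names follow the pattern `lgb_<variant>_model_<city>.pkl`. A minimal loading sketch is given below, assuming the files are ordinary pickle files (the `.pkl` extension suggests this, but it is not stated here) and that the library used to train the models is importable when unpickling. The helper names `checkpoint_path` and `load_checkpoint` are hypothetical.

```python
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path("processed/checkpoints")  # folder named in this README


def checkpoint_path(variant, city, root=CHECKPOINT_DIR):
    """Build a file name such as lgb_1+nr2_model_london.pkl."""
    return Path(root) / f"lgb_{variant}_model_{city}.pkl"


def load_checkpoint(variant, city, root=CHECKPOINT_DIR):
    """Unpickle one checkpoint.

    Assumption: the file is a plain pickle. Only unpickle files you
    trust, since unpickling can execute arbitrary code.
    """
    with open(checkpoint_path(variant, city, root), "rb") as f:
        return pickle.load(f)
```

For example, `load_checkpoint("1+p2", "madrid")` would read `processed/checkpoints/lgb_1+p2_model_madrid.pkl`.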
The feature engineering code is included in the folder src/feature_extraction.
Please note that, before running the code in this folder to extract features, the scripts in the src/preparation
folder must be run first to prepare all required inputs. Those scripts should be run as follows.
: restructure the raw loop dataset, the imputed loop dataset, and the y labels in the eta
: calculate the missing rate of loop data for each observation (time step)
: construct the support set and train set for the observations with high missing rate in loop data in London
: process the speed data, which will be used for extracting speed-based features.
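The missing-rate step in the list above can be illustrated with a short sketch. It assumes that a missing loop reading is encoded as `None` or NaN, and that an observation counts as "high missing rate" when its rate exceeds a threshold. Both the encoding and the threshold value are assumptions, and the function names are hypothetical.

```python
import math


def is_missing(value):
    """Assumption: missing readings are stored as None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate(readings):
    """Fraction of loop counters without a value at one time step."""
    if not readings:
        return 1.0  # no counters at all is treated as fully missing
    return sum(is_missing(v) for v in readings) / len(readings)


def high_missing_steps(observations, threshold):
    """Keys of the time steps whose missing rate exceeds the threshold.

    observations: dict mapping a time-step key to its list of readings.
    """
    return [
        key
        for key, readings in observations.items()
        if missing_rate(readings) > threshold
    ]
```

The time steps returned by `high_missing_steps` would be the candidates for a separate high-missing-rate model such as the London one listed among the checkpoints.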
- Number of nodes involved in the supersegment (SG)
- Length of SG
- Number of oneway edges in the SG
- Statistics of the speed limits of edges in the SG: mean, std, 25, 50, 75 percentiles, min, max
- Haversine distance between the origin and destination (OD) of the SG
- For SG $i$: shortest/design travel time $= \sum_{j \in SG_i} \frac{\text{length}_j}{\text{MaxSpeed}_j}$
- Statistics of the $y$ values of all samples under consideration (all nn)
- Percentage of $(-\infty, 1800]$, $(1800, 2400]$ and $(2400, \infty)$ in the y query set for each SG
- Sum, mean, std of loop counts (at nodes) within SG
- Number of loops with values (at each interval)
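Three of the features listed above can be sketched in a few lines: the haversine distance, the shortest/design travel time, and the share of $y$ values per interval. The sketch assumes edge lengths in metres, speed limits in km/h and coordinates in decimal degrees. These units and all function names are assumptions for illustration and may differ from the scripts in this repository.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def design_travel_time(edges):
    """Sum of length_j / MaxSpeed_j over the edges of one SG, in seconds.

    edges: iterable of (length_m, max_speed_kmh) pairs (assumed units).
    """
    return sum(length / (speed / 3.6) for length, speed in edges)


def y_bin_shares(y_values, low=1800, high=2400):
    """Shares of y in (-inf, low], (low, high] and (high, inf)."""
    n = len(y_values)
    if n == 0:
        return (0.0, 0.0, 0.0)
    first = sum(y <= low for y in y_values)
    second = sum(low < y <= high for y in y_values)
    return (first / n, second / n, (n - first - second) / n)
```

With these units, a single 1000 m edge limited to 36 km/h yields a design travel time of about 100 seconds.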
Free flow speed and median speed of an SG are defined as the mean free flow speed and mean median speed of the edges involved.
- Mean, std of the free flow speed, median speed of $k$ nearest neighbors
- Mean, std of the edge volume classes percentage/distribution of $k$ nearest neighbors
- Statistics of $y$ values of the $k$ nearest neighbors: mean, std, 25, 50, 75 percentiles, min, max
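The nearest-neighbor features above can be sketched as follows. The sketch uses plain Manhattan and Euclidean distances between feature vectors; which vectors the repository compares and how the "normed" Euclidean variant is normalised are not described here, so both are left out. Percentiles are omitted for brevity, and all function names are hypothetical.

```python
import math
import statistics


def manhattan(a, b):
    """L1 distance between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_indices(query, support, k, distance=manhattan):
    """Indices of the k support vectors closest to the query."""
    order = sorted(range(len(support)),
                   key=lambda i: distance(query, support[i]))
    return order[:k]


def knn_y_stats(query, support, support_y, k, distance=manhattan):
    """Summary statistics of y over the k nearest support samples."""
    ys = [support_y[i] for i in knn_indices(query, support, k, distance)]
    return {
        "mean": statistics.fmean(ys),
        "std": statistics.pstdev(ys),  # population standard deviation
        "median": statistics.median(ys),
        "min": min(ys),
        "max": max(ys),
    }
```

Passing `distance=euclidean` switches the neighbor search to the L2 metric without changing the rest of the computation.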
We also combine several of the aforementioned features (difference, sum, quotient) to construct more powerful features. This step is carried out in the model training script.
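A minimal sketch of such pairwise combinations is shown below. The feature names in the usage note and the choice of pairs are purely illustrative assumptions; the combinations actually used are defined in the model training script.

```python
def combine_features(row, pairs):
    """Add difference, sum and quotient features for each (a, b) pair.

    row: dict mapping feature names to numbers.
    pairs: iterable of (name_a, name_b) tuples (hypothetical names).
    A zero denominator yields NaN instead of raising an error.
    """
    out = dict(row)  # keep the original features untouched
    for a, b in pairs:
        out[f"{a}_minus_{b}"] = row[a] - row[b]
        out[f"{a}_plus_{b}"] = row[a] + row[b]
        out[f"{a}_div_{b}"] = row[a] / row[b] if row[b] != 0 else float("nan")
    return out
```

For instance, `combine_features({"knn_y_mean": 2000.0, "design_tt": 500.0}, [("knn_y_mean", "design_tt")])` adds a quotient that relates the neighbors' mean travel time to the design travel time.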
The accompanying technical report can be found in Traffic4cast_2022_TSE.pdf.
title = {Similarity-based Feature Extraction for Large-scale Sparse Traffic Forecasting},
author = {Wu, Xinhua and Lyu, Cheng and Lu, Qing-Long and Mahajan, Vishal},
year = 2022,
month = {Oct},
url = {},
language = {en}
This repository is based on the official repository of the competition NeurIPS2022-traffic4cast.