Use the sparse GPS data collected from NFTA buses to impute the traffic condition for the time points when no GPS data is available.
- Python 3.x
- PyTorch
- Data process
  - PeMS, used for validation; refer to References 2 (which uses the same data as DCRNN) for more details
    - Pick up the stations and get their info
      - We can use the selected sensors from DCRNN directly.
    - Download data
    - Transfer data to required format
  - NFTA (Refer to this repository for details)
    - Pick up the road segments and find a way to save them
    - Map raw GPS data to corresponding road segments and get the traffic condition from GPS
    - Transfer data to required format
- Implement existing missing data imputation methods
  - Tensor decomposition
  - Spatial-temporal based methods
  - etc.
- Develop new deep learning based model
- Integration with Nittec
  - Clarify the accepted data format
  - Data uploading process
  - Data updating frequency
Some raw PeMS data can be found here. Download and unzip the files, then put them under the folder data_raw/d[xx]/, where xx is the two-digit district ID.
Steps to download more raw data and sensor metadata from the official website:
- Register if you have not done so yet (it might take some time for a new account to be approved) and sign in
- Download by following these steps
  - Click on Data Clearinghouse at the bottom left of the homepage
  - To download data,
    - At the top of the page, select Station 5-Minute in the Type dropdown list and the target district (e.g., District 7) in the District dropdown list
    - Click the Submit button
    - In the table below the Submit button, click on the cell for the desired year and month
    - Download data from the Available Files table
  - To download the metadata file, choose Station Metadata in the Type dropdown list, and then select the desired District.
- Put the data and metadata files under the folder data_raw/d[xx]/, where xx is the two-digit district ID.
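The folder convention above can be sketched as a small helper. This is only an illustration: the function name district_dir and its root argument are hypothetical and are not part of this repository's scripts; the only thing taken from the text is the data_raw/d[xx]/ layout with a two-digit district ID.

```python
from pathlib import Path


def district_dir(district_id, root="data_raw"):
    """Return the folder for one district, e.g. data_raw/d07 for district 7.

    Hypothetical helper: only the data_raw/d[xx]/ layout comes from the README.
    """
    # Zero-pad the district ID to two digits, as the layout requires.
    return Path(root) / f"d{int(district_id):02d}"


folder = district_dir(7)
print(folder.as_posix())  # data_raw/d07
```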
Run the following commands under the root directory of this repository.
- Select sensors/stations based on some rules, and calculate the distance between each pair of sensors
$ python -m scripts.select_sensors
- Generate the distance matrix among the selected sensors, where elements smaller than a threshold are set to 0. Currently, 200 sensors are randomly selected.
$ python -m scripts.generate_adj_matrix
- (Run only when needed) Generate graphs for different time intervals
$ python -m scripts.generate_more_graphs
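The thresholding step can be illustrated with a minimal sketch. It assumes the Gaussian-kernel weighting used by DCRNN (weights are exp(-(d/sigma)^2), and weights below the threshold are set to 0); whether scripts.generate_adj_matrix does exactly this is an assumption, and the function name, the default threshold of 0.1, and the toy distances are all made up for the example.

```python
import numpy as np


def thresholded_adjacency(dist, threshold=0.1):
    """Turn pairwise distances into kernel weights and zero out the small ones.

    Assumption: Gaussian kernel as in DCRNN; the repository's script may differ.
    """
    dist = np.asarray(dist, dtype=float)
    # Use the spread of the finite distances as the kernel width.
    sigma = dist[np.isfinite(dist)].std()
    weights = np.exp(-np.square(dist / sigma))
    # Sparsify: weak connections (far-apart sensors) are dropped.
    weights[weights < threshold] = 0.0
    return weights


# Toy distances between three sensors (arbitrary units).
dist = np.array([
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 2.0],
    [5.0, 2.0, 0.0],
])
adj = thresholded_adjacency(dist, threshold=0.1)
print(adj)
```

In this toy case the two sensors that are 5 units apart end up disconnected, while the closer pairs keep a nonzero weight.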
Two methods to generate samples:
Method 1 (Tested and Recommended):
- Generate samples from raw data without preprocessing
$ python -m scripts.generate_data_samples_from_raw
Method 2:
- For each district, select data based on selected sensors and merge them together
$ python -m scripts.process_pems
- Generate samples
$ python -m scripts.generate_data_samples --source_data_filename=data_raw/d07/data.npz --output_dir=data/d07
The train, val, and test files have the following format:
x: (number of samples, input length, number of nodes, number of traffic measurements)
y: (number of samples, prediction length, number of nodes, number of traffic measurements)
mask_x: has the same shape as x; mask_x[idx] == 0 means the data point was missing during data collection, whereas mask_x[idx] == 2 means the missingness was manually added.
mask_y: has the same shape as y.
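To make the shapes concrete, here is a minimal sketch that slices a synthetic series into samples with the layout described above. It is not the repository's generation script: make_samples, the synthetic data, and the 20% hiding rate are hypothetical, and using 1 for observed entries is an assumption (the README only defines the values 0 and 2).

```python
import numpy as np

# Mask codes: 0 and 2 come from the README; 1 for "observed" is an assumption.
OBSERVED, MISSING_RAW, MISSING_ADDED = 1, 0, 2


def make_samples(data, mask, input_len, pred_len):
    """Slice (time, nodes, measurements) arrays into sliding-window samples."""
    num_samples = data.shape[0] - input_len - pred_len + 1
    if num_samples <= 0:
        raise ValueError("series is shorter than input_len + pred_len")
    out = {"x": [], "y": [], "mask_x": [], "mask_y": []}
    for start in range(num_samples):
        mid, end = start + input_len, start + input_len + pred_len
        out["x"].append(data[start:mid])
        out["y"].append(data[mid:end])
        out["mask_x"].append(mask[start:mid])
        out["mask_y"].append(mask[mid:end])
    return {name: np.stack(parts) for name, parts in out.items()}


rng = np.random.default_rng(0)
data = rng.random((20, 5, 2))            # 20 time steps, 5 nodes, 2 measurements
mask = np.full(data.shape, OBSERVED, dtype=np.int8)
mask[3, 1, :] = MISSING_RAW              # pretend node 1 dropped out at step 3

# Manually hide roughly 20% of the entries that were actually observed.
hide = (rng.random(data.shape) < 0.2) & (mask == OBSERVED)
mask[hide] = MISSING_ADDED

samples = make_samples(data, mask, input_len=6, pred_len=3)
print(samples["x"].shape, samples["y"].shape)  # (12, 6, 5, 2) (12, 3, 5, 2)
```

Keeping the two kinds of missingness apart in the mask lets a model be evaluated only on the manually hidden entries, where the true value is known.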
The train, val, and test data files are named in the format {mode}_{input length}_{predict length}_{missing rate}.npz, where mode is train, val, or test; input length is the length of the input sequence in number of time intervals; predict length is the length of the prediction sequence; and missing rate is the missing rate in the data samples.
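The naming scheme can be sketched with two small helpers. Both function names are hypothetical, and writing the missing rate as a plain decimal such as 0.4 is an assumption; the scripts in this repository may format it differently.

```python
MODES = ("train", "val", "test")


def sample_filename(mode, input_len, pred_len, missing_rate):
    """Build a name like train_12_12_0.4.npz (missing-rate formatting is assumed)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return f"{mode}_{input_len}_{pred_len}_{missing_rate}.npz"


def parse_sample_filename(name):
    """Split a sample file name back into its four fields."""
    if name.endswith(".npz"):
        name = name[: -len(".npz")]
    mode, input_len, pred_len, missing_rate = name.split("_")
    return {
        "mode": mode,
        "input_len": int(input_len),
        "pred_len": int(pred_len),
        "missing_rate": float(missing_rate),
    }


name = sample_filename("train", 12, 12, 0.4)
print(name)  # train_12_12_0.4.npz
info = parse_sample_filename(name)
```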
- KDD'14: Travel Time Estimation of a Path using Sparse Trajectories
- AAAI'20: GMAN: A Graph Multi-Attention Network for Traffic Prediction (GitHub)