Merge pull request #63 from Urban-Analytics-Technology-Platform/62-documentation-for-repo

initial documentation
Hussein-Mahfouz authored Nov 12, 2024
2 parents ca181c5 + d270a22 commit 4b4f553
Showing 4 changed files with 338 additions and 41 deletions.
165 changes: 161 additions & 4 deletions README.md
@@ -6,7 +6,39 @@

A package to create activity-based models (for transport demand modelling)

- [acbm](#acbm)
- [Motivation and Contribution](#motivation-and-contribution)
- [Installation](#installation)
- [How to Run the Pipeline](#how-to-run-the-pipeline)
- [Step 1: Prepare Data Inputs](#step-1-prepare-data-inputs)
- [Step 2: Setup your config.toml file](#step-2-setup-your-configtoml-file)
- [Step 3: Run the pipeline](#step-3-run-the-pipeline)
- [Future Work](#future-work)
  - [Generative Approaches to activity scheduling](#generative-approaches-to-activity-scheduling)
- [Location Choice](#location-choice)
- [Related Work](#related-work)
- [Synthetic Population Generation](#synthetic-population-generation)
- [Activity Generation](#activity-generation)
- [Deep Learning](#deep-learning)
- [Location Choice](#location-choice-1)
- [Primary Locations](#primary-locations)
- [Secondary Locations](#secondary-locations)
- [Entire Pipeline](#entire-pipeline)
- [Contributing](#contributing)
- [License](#license)


# Motivation and Contribution

Activity-based models have emerged as an alternative to traditional 4-step transport demand models. They provide a more detailed framework by modelling travel as a sequence of activities, accounting for when, how, and with whom individuals participate in them. They can integrate household interactions and spatial-temporal constraints, are well suited to modelling on-demand transport services (which are becoming increasingly common), and can examine the equity implications of different transport scenarios.

Despite being increasingly popular in research, adoption in industry has been slow. A couple of factors have contributed to this. The first is inertia and well-established guidelines on 4-step models. However, this is changing; in 2024, the UK Department for Transport released its first Transport Analysis Guidance on activity- and agent-based models (see [TAG unit M5-4 agent-based methods and activity-based demand modelling](https://www.gov.uk/government/publications/tag-unit-m5-4-agent-based-methods-and-activity-based-demand-modelling)). Other initiatives, such as the [European Association of Activity-Based Modeling](https://eaabm.org/), are also being established to increase the adoption of activity-based modelling and push research into practice.

Another factor is tool availability. Activity-based modelling involves many steps, including synthetic population generation, activity sequence generation, and (primary and secondary) location assignment. Many tools serve individual steps, but only a few run an entire, configurable pipeline, and those tend to be tailored to the data of specific countries (see [Related Work](#related-work) for a list of open-source tools).

To our knowledge, no open-source activity-based modelling pipeline exists for the UK. This repository allows researchers to run the entire pipeline for any region in the UK, with the output being a synthetic population with daily activity diaries and locations for each person. The pipeline is meant to be extensible, and we aim to plug in different approaches developed by others in the future.

# Installation

```bash
python -m pip install acbm
```

@@ -19,15 +51,140 @@

```bash
cd acbm
poetry install
```
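
A quick way to check that the installation worked is to import the package (a minimal sanity check; assumes the package is installed in your active environment):

```bash
python -c "import acbm"
```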

# How to Run the Pipeline

The pipeline is a series of scripts that are run in sequence to generate the activity-based model. A few external datasets are required. The data and config directories are structured as follows:

```md
├── config
│   ├── <your_config_1>.toml
│   ├── <your_config_2>.toml
├── data
│   ├── external
│   │   ├── boundaries
│   │   │   ├── MSOA_DEC_2021_EW_NC_v3.geojson
│   │   │   ├── oa_england.geojson
│   │   │   ├── study_area_zones.geojson
│   │   ├── census_2011_rural_urban.csv
│   │   ├── centroids
│   │   │   ├── LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv
│   │   │   └── Output_Areas_Dec_2011_PWC_2022.csv
│   │   ├── MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv
│   │   ├── nts
│   │   │   ├── filtered
│   │   │   │   ├── nts_households.parquet
│   │   │   │   ├── nts_individuals.parquet
│   │   │   │   └── nts_trips.parquet
│   │   │   └── UKDA-5340-tab
│   │   │   ├── 5340_file_information.rtf
│   │   │   ├── mrdoc
│   │   │   │   ├── excel
│   │   │   │   ├── pdf
│   │   │   │   ├── UKDA
│   │   │   │   └── ukda_data_dictionaries.zip
│   │   │   └── tab
│   │   │   ├── household_eul_2002-2022.tab
│   │   │   ├── individual_eul_2002-2022.tab
│   │   │   ├── psu_eul_2002-2022.tab
│   │   │   ├── trip_eul_2002-2022.tab
│   │   │   └── <other_nts_tables>.tab
│   │   ├── travel_times
│   │   │   ├── oa
│   │   │   │   └── travel_time_matrix.parquet
│   │   │   └── msoa
│   │   │       └── travel_time_matrix.parquet
│   │   ├── ODWP01EW_OA.zip
│   │   ├── ODWP15EW_MSOA_v1.zip
│   │   ├── spc_output
│   │   │   ├── <region>_people_hh.parquet (Generated in Script 1)
│   │   │   ├── <region>_people_tu.parquet (Generated in Script 1)
│   │   │   ├── raw
│   │   │   │   ├── <region>_households.parquet
│   │   │   │   ├── <region>_info_per_msoa.json
│   │   │   │   ├── <region>.pb
│   │   │   │   ├── <region>_people.parquet
│   │   │   │   ├── <region>_time_use_diaries.parquet
│   │   │   │   ├── <region>_venues.parquet
│   │   │   │   ├── README.md
│   ├── interim
│   │   ├── assigning (Generated in Script 3)
│   │   └── matching (Generated in Script 2)
│   └── processed
│       ├── acbm_<config_name>_<date>
│       │   ├── activities.csv
│       │   ├── households.csv
│       │   ├── legs.csv
│       │   ├── legs_with_locations.parquet
│       │   ├── people.csv
│       │   └── plans.xml
│       └── plots
│           ├── assigning
│           └── validation
```

## Step 1: Prepare Data Inputs

You need to populate the `data/external` directory with the required datasets. A guide on where to find or generate each dataset can be found in [data/external/README.md](data/external/README.md).
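
The expected folder structure can be created up front, for example (a sketch based on the tree shown above; adjust to the subdirectories you actually need):

```bash
mkdir -p data/external/boundaries \
         data/external/centroids \
         data/external/nts/filtered \
         data/external/nts/UKDA-5340-tab \
         data/external/travel_times/oa \
         data/external/travel_times/msoa \
         data/external/spc_output/raw
```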

## Step 2: Setup your config.toml file

You need to create a config file in the `config` directory. The config file is a TOML file that contains the parameters for the pipeline. A guide on how to set it up can be found in [config/README.md](config/README.md).
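
As a minimal illustration (the full annotated example lives in [config/README.md](config/README.md)), the `[parameters]` block of a config might look like this; the values below are placeholders:

```toml
[parameters]
seed = 0
region = "leeds"           # used to query POI data from OSM and to load SPC data
number_of_households = 5000
zone_id = "OA21CD"         # "OA21CD": OA level, "MSOA11CD": MSOA level
travel_times = true        # only if a travel time matrix exists at this level
boundary_geography = "OA"
```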

## Step 3: Run the pipeline

The scripts are listed in order of execution in the [scripts/run_pipeline.sh](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/run_pipeline.sh) bash script.

You can run the pipeline by executing the following command in the terminal from the base directory:

```bash
bash ./scripts/run_pipeline.sh config/<your_config_file>.toml
```

where your config file is the file you created in Step 2.

## Future Work

We aim to include different options for each step of the pipeline. Some hopes for the future include:

### Generative Approaches to activity scheduling
- [ ] Bayesian Network approach to generate activities
- [ ] Implement a Deep Learning approach to generate activities (see package below)

### Location Choice
- [ ] Workzone assignment: Plug in Neural Spatial Interaction Approach

## Related Work

There are a number of open-source tools for different parts of the activity-based modelling pipeline. Some of these include:

### Synthetic Population Generation

### Activity Generation

#### Deep Learning
- [caveat](https://github.com/fredshone/caveat)

### Location Choice

#### Primary Locations

- [GeNSIT](https://github.com/YannisZa/GeNSIT)

#### Secondary Locations
- [PAM](https://github.com/arup-group/pam/blob/main/examples/17_advanced_discretionary_locations.ipynb): PAM includes functionality for assigning discretionary (secondary) activity locations; see the linked example notebook.


### Entire Pipeline
- [Eqasim](https://github.com/eqasim-org/eqasim-java)
- [ActivitySim](https://activitysim.github.io/activitysim/v1.3.1/index.html)
- [PAM](https://github.com/arup-group/pam): PAM has functionality for different parts of the pipeline, but it is not clear how to use it to create an activity-based model for an entire population. Specifically, it does not yet have functionality for activity generation (e.g. statistical matching or generative approaches) or constrained primary location assignment.


# Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on how to contribute.

# License

Distributed under the terms of the [Apache license](LICENSE).

33 changes: 33 additions & 0 deletions config/README.md
@@ -0,0 +1,33 @@
The config.toml file has an explanation for each parameter. You can copy the toml file, give it a name that is relevant to your project, and modify the parameters as needed. An example toml file, with explanations of the parameters:

``` toml
[parameters]
seed = 0
region = "leeds" # this is used to query poi data from osm and to load in SPC data
number_of_households = 5000 # how many people from the SPC do we want to run the model for? Comment out if you want to run the analysis on the entire SPC population
zone_id = "OA21CD" # "OA21CD": OA level, "MSOA11CD": MSOA level
travel_times = true # Only set to true if you have travel time matrix at the level specified in boundary_geography
boundary_geography = "OA"

[matching]
# for optional and required columns, see the [iterative_match_categorical](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/ca181c54d7484ebe44706ff4b43c26286b22aceb/src/acbm/matching.py#L110) function
# Do not add any column not listed below. You can only move a column from optional to required (or vice versa)
required_columns = ["number_adults", "number_children"]
optional_columns = [
"number_cars",
"num_pension_age",
"rural_urban_2_categories",
"employment_status",
"tenure_status",
]
n_matches = 10 # What is the maximum number of NTS matches we want for each SPC household?

[work_assignment]
use_percentages = true # if true, optimization problem will try to minimize percentage difference at OD level (not absolute numbers). Recommended to set it to true
# weights to add for each objective in the optimization problem
weight_max_dev = 0.2
weight_total_dev = 0.8
max_zones = 8 # maximum number of feasible zones to include in the optimization problem (fewer zones make the problem smaller and faster, but at the cost of solution quality)


```
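
The two weights in `[work_assignment]` trade off the worst single origin-destination deviation against the total deviation. The sketch below is purely illustrative of how such weights could combine into one objective; it is not the actual implementation used in acbm:

```python
# Illustrative only: how weight_max_dev and weight_total_dev might combine
# per-OD-pair deviations into a single score to minimise. The real objective
# in the acbm optimisation code may differ.
def combined_objective(deviations, weight_max_dev=0.2, weight_total_dev=0.8):
    max_dev = max(deviations)      # worst single OD-pair deviation
    total_dev = sum(deviations)    # overall deviation across all OD pairs
    return weight_max_dev * max_dev + weight_total_dev * total_dev

# Example: percentage deviations for three OD pairs
print(combined_objective([0.05, 0.10, 0.02]))
```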
122 changes: 122 additions & 0 deletions data/external/README.md
@@ -0,0 +1,122 @@
This folder contains all external datasets necessary to run the pipeline. Some can be downloaded, while others need to be generated. This README provides a guide on where to find or generate each dataset. In the future, we aim to:
- host some of the datasets on the cloud, or download them directly where possible in the pipeline
- add dataset paths in the config file

## Folder Structure

The structure of the folder is as follows:

```md
.
├── data
│   ├── external
│   │   ├── boundaries
│   │   │   ├── MSOA_DEC_2021_EW_NC_v3.geojson
│   │   │   ├── oa_england.geojson
│   │   │   ├── study_area_zones.geojson
│   │   ├── census_2011_rural_urban.csv
│   │   ├── centroids
│   │   │   ├── LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv
│   │   │   ├── Output_Areas_Dec_2011_PWC_2022.csv
│   │   ├── MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv
│   │   ├── nts
│   │   │   ├── filtered
│   │   │   │   ├── nts_households.parquet
│   │   │   │   ├── nts_individuals.parquet
│   │   │   │   └── nts_trips.parquet
│   │   │   └── UKDA-5340-tab
│   │   │   ├── 5340_file_information.rtf
│   │   │   ├── mrdoc
│   │   │   │   ├── excel
│   │   │   │   ├── pdf
│   │   │   │   ├── UKDA
│   │   │   │   └── ukda_data_dictionaries.zip
│   │   │   └── tab
│   │   │   ├── household_eul_2002-2022.tab
│   │   │   ├── individual_eul_2002-2022.tab
│   │   │   ├── psu_eul_2002-2022.tab
│   │   │   ├── trip_eul_2002-2022.tab
│   │   │   └── <other_nts_tables>.tab
│   │   ├── travel_times
│   │   │   ├── oa
│   │   │   │   └── travel_time_matrix.parquet
│   │   │   └── msoa
│   │   │       └── travel_time_matrix.parquet
│   │   ├── ODWP01EW_OA.zip
│   │   ├── ODWP15EW_MSOA_v1.zip
│   │   ├── spc_output
│   │   │   ├── <region>_people_hh.parquet (Generated in Script 1)
│   │   │   ├── <region>_people_tu.parquet (Generated in Script 1)
│   │   │   ├── raw
│   │   │   │   ├── <region>_households.parquet
│   │   │   │   ├── <region>_info_per_msoa.json
│   │   │   │   ├── <region>.pb
│   │   │   │   ├── <region>_people.parquet
│   │   │   │   ├── <region>_time_use_diaries.parquet
│   │   │   │   ├── <region>_venues.parquet
│   │   │   │   ├── README.md

```

## Data Sources


`spc_output/`

Use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md)
to get a parquet file and convert it to JSON.

You have two options:
1. Slow and memory-hungry: download the `.pb` file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html) and load it with the Python package
2. Faster: run SPC to generate parquet outputs, and then load them using the SPC toolkit Python package. To generate the parquet files, you need to:
1. Clone [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs)
2. Run:
```shell
cargo run --release -- \
--rng-seed 0 \
--flat-output \
--year 2020 \
config/England/west-yorkshire.txt
```
and replace `west-yorkshire` and `2020` with your preferred option.
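
   Once the parquet outputs exist, they can be inspected with standard tools. A minimal sketch using pandas (assuming the `west-yorkshire` run above and that the flat outputs have been copied into `data/external/spc_output/raw/`):

```python
import pandas as pd

# Paths assume the SPC flat outputs were copied into data/external/spc_output/raw/
people = pd.read_parquet("data/external/spc_output/raw/west-yorkshire_people.parquet")
households = pd.read_parquet("data/external/spc_output/raw/west-yorkshire_households.parquet")

print(people.shape, households.shape)
print(people.columns.tolist()[:10])  # peek at the first few columns
```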

`boundaries/`

- `MSOA_DEC_2021_EW_NC_v3.geojson`: MSOA boundaries for England and Wales (2021). It can be downloaded from [Data-Gov-UK](https://www.data.gov.uk/dataset/9dffb396-2934-43fb-9777-1aef704138ac/middle-layer-super-output-areas-december-2021-names-and-codes-in-ew-v3). If this link is no longer valid, the layer is also available from other sources.
- `oa_england.geojson`: This is the OA boundaries for England (2021). The user can download it from [Data-Gov-UK](https://www.data.gov.uk/dataset/4d4e021d-fe98-4a0e-88e2-3ead84538537/output-areas-december-2021-boundaries-ew-bgc-v2)
- `study_area_zones.geojson`: This layer is the MSOAs / OAs in the study area. It is created in the pipeline (in [0_preprocess_inputs.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/0_preprocess_inputs.py)). The user does not have to worry about this file.

`centroids/`

- `LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv`: An LSOA 2011 population-weighted centroid layer for England and Wales
- `Output_Areas_Dec_2011_PWC_2022.csv`: An OA 2011 population-weighted centroid layer. It can be downloaded from [Data-Gov-UK](https://www.data.gov.uk/dataset/ba661484-ceff-4a1c-91d8-3c57d0f0a933/output-areas-december-2011-ew-population-weighted-centroids_)

`nts/`

`UKDA-5340-tab`:
- Download the UKDA-5340-tab data from the UK Data Service [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340):
  - Step 1: Create an account
  - Step 2: Create a project and request access to the data. We use the `National Travel Survey, 2002-2023` dataset (SN: 5340)
  - Step 3: Download the TAB file format

`travel_times/`

- OPTIONAL dataset - if it does not exist, it will be generated in the pipeline. Matrices are added under the `oa/` or `msoa/` subdirectories (e.g. `oa/travel_time_matrix.parquet` or `msoa/travel_time_matrix.parquet`). Columns are:
  - `OA21CD_from` / `MSOA21CD_from`: origin zone code
  - `OA21CD_to` / `MSOA21CD_to`: destination zone code
  - `mode`: ['pt', 'car', 'walk', 'cycle']
  - `weekday`: 1, 0
  - `time_of_day`: ['morning', 'afternoon', 'evening', 'night']
  - `time`: travel time in minutes

There is an [open issue](https://github.com/Urban-Analytics-Technology-Platform/acbm/issues/20#issuecomment-2317037441) on generating travel times, which the user can use as a starting point if they wish to generate the travel time matrix. In the future, we aim to add a script to generate the travel time matrix.
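
If you do wish to supply your own matrix, here is a minimal sketch of a file with the expected columns (OA level shown; the zone codes and values below are toy examples only):

```python
import pandas as pd

# Toy OA-level travel time matrix with the columns described above
toy = pd.DataFrame(
    {
        "OA21CD_from": ["E00000001", "E00000001"],
        "OA21CD_to": ["E00000002", "E00000003"],
        "mode": ["car", "pt"],
        "weekday": [1, 1],
        "time_of_day": ["morning", "morning"],
        "time": [12.5, 25.0],  # travel time in minutes
    }
)
toy.to_parquet("data/external/travel_times/oa/travel_time_matrix.parquet")
```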

Other datasets (to be placed in the root of the `external` folder):

- `ODWP01EW_OA.zip` & `ODWP15EW_MSOA_v1.zip`: These are commuting matrices from the census. They can be found on the WICID data service. Go to [this link](https://wicid.ukdataservice.ac.uk/flowdata/cider/wicid/downloads.php) and search for `ODWP01EW_OA` and `ODWP15EW_MSOA` under the 2021 Census
- `MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv`: This is the lookup table between MSOA 2011 and MSOA 2021. It can be downloaded from [Data-Gov-UK](https://www.data.gov.uk/dataset/da36cac8-51c4-4d68-a4a9-37ac47d2a4ba/msoa-2011-to-msoa-2021-to-local-authority-district-2022-exact-fit-lookup-for-ew-v2)
- `census_2011_rural_urban.csv`: OA level rural-urban classification. It can be downloaded from [ONS](https://geoportal.statistics.gov.uk/datasets/53360acabd1e4567bc4b8d35081b36ff/about). The classification is based on the 2011 Census, and the categories are: 'Urban major conurbation', 'Urban minor conurbation', 'Urban city and town', 'Urban city and town in a sparse setting', 'Rural town and fringe', 'Rural town and fringe in a sparse setting', 'Rural village', 'Rural village in a sparse setting', 'Rural hamlets and isolated dwellings', 'Rural hamlets and isolated dwellings in a sparse setting'
