This is my lab GitHub repo :)
Below we enumerate what is currently in this repository:
- Defines a Python API to fetch AQS data (```datafetcher.py```).
- Provides a sample notebook to determine the best site to sample from in Los Angeles County (```lab_notebook.ipynb```).
- Provides a sample notebook to fetch relevant AQS data (```lab_notebook.ipynb```).
- Provides a sample notebook to explore relevant CEDS emissions data (```hemco_data_exploration.ipynb```).
- Provides a Python script to generate a 2018 dataset for Los Angeles-North Main Street (```generate.py```), including AQS and CEDS data.
- Provides a sample notebook to run a Random Forest model on the above dataset (```lab_notebook_2.ipynb```).
Please note that I use a virtual environment to manage the modules used for the models and data processing. To begin using this repo, run the following commands from the root of this directory:
$ python3 -m venv venv # Creates a virtual environment
$ source venv/bin/activate # Activates virtual environment
$ python3 -m pip install --upgrade pip
$ python3 -m pip install -r requirements.txt # Install requirements
You must also create a .env file and populate it with your email and key for the AQS API. Your .env file should look like:
EMAIL="[email protected]"
KEY="example"
Please refer to ```lab_notebook.ipynb``` for examples of how to use the DataFetcher class. It has three primary purposes: (1) finding the site with the most data in a particular county/state; (2) fetching AQS data; and (3) fetching CEDS data. Important functions have detailed docstrings.
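The outline below sketches how those three purposes map onto the class; the method names are illustrative only, so refer to the notebook and the docstrings in ```datafetcher.py``` for the actual API:

```python
from datafetcher import DataFetcher

fetcher = DataFetcher()

# The calls below are hypothetical placeholders, not the real method names.
# (1) find the site with the most critical data in a given county/state
# best_site = fetcher.find_best_site(state="06", county="037")

# (2) fetch AQS data for that site and period
# aqs_df = fetcher.fetch_aqs(site=best_site, year=2018)

# (3) fetch CEDS emissions data for the same location and period
# ceds_df = fetcher.fetch_ceds(site=best_site, year=2018)
```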
We can't realistically get data for an entire year, for every site, for every code, for multiple years. If we want an idea of what critical data was available over the span of decades, we need to do some sampling. As a first pass we sample 1 random day every 5 years starting in 2000. Then we get the top 5 sites with the most critical data for a given 5-year period; the top site happens to be Los Angeles-North Main Street no matter what date range we pick.
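One possible reading of that sampling scheme, as a standalone sketch (not taken from the repo; the notebook implements its own version):

```python
import random
from datetime import date, timedelta

def sample_days(start_year=2000, end_year=2020, step=5, seed=0):
    """Pick one random day from each `step`-year window starting at `start_year`."""
    random.seed(seed)
    days = []
    for year in range(start_year, end_year + 1, step):
        window_start = date(year, 1, 1)
        window_end = date(min(year + step - 1, end_year), 12, 31)
        span = (window_end - window_start).days + 1
        days.append(window_start + timedelta(days=random.randrange(span)))
    return days

print(sample_days())  # one random day each from 2000-2004, 2005-2009, ...
```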
We then rank the top 5 based on the availability of PAMS_VOC data. We only need to check the date that resulted in the most available data from the previous query.
However, we eventually want to find more relevant sites to train on, so we could replicate the above logic to find the best sites across multiple states, etc. Even better, we could find a clever way to aggregate data across neighboring sites.
Running the ```generate.py``` script will create 3 datasets in the ```data/clean/Los_Angeles-North_Main_Street/2018``` directory (the structure is ```data/clean/<site>/<year>```). The core dataset contains CRITERIA and MET data, the vocs dataset contains VOC data, and the emissions dataset contains CEDS data. Use the following command:
$ python3 generate.py
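Once generated, the datasets can be loaded with pandas. The filenames below are hypothetical; inspect the output directory for the actual names and formats:

```python
import pandas as pd

# Hypothetical filenames -- check data/clean/Los_Angeles-North_Main_Street/2018/
# after running generate.py for the real names and formats.
base = "data/clean/Los_Angeles-North_Main_Street/2018"
core = pd.read_csv(f"{base}/core.csv")            # CRITERIA + MET data
vocs = pd.read_csv(f"{base}/vocs.csv")            # VOC data
emissions = pd.read_csv(f"{base}/emissions.csv")  # CEDS data
```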