This is my lab GitHub repo :)
Below we enumerate what is currently in this repository:
- Defines a Python API to fetch AQS data (```datafetcher.py```).
- Provides a sample notebook to determine the best site to sample from in Los Angeles County (```lab_notebook.ipynb```).
- Provides a sample notebook to fetch relevant AQS data (```lab_notebook.ipynb```).
- Provides a sample notebook to explore relevant CEDS emissions data (```hemco_data_exploration.ipynb```).
- Provides a Python script to generate a 2018 dataset for Los Angeles-North Main Street (```generate.py```), including AQS and CEDS data.
- Provides a sample notebook to run a Random Forest model on the above dataset (```lab_notebook_2.ipynb```).
Please note that I use a virtual environment to manage the modules used for the models and data processing. To begin using this repo, run the following commands from the root of this directory:
$ python3 -m venv venv # Creates a virtual environment
$ source venv/bin/activate # Activates virtual environment
$ python3 -m pip install --upgrade pip
$ python3 -m pip install -r requirements.txt # Install requirements
You must also create a .env file and populate it with your email and key for the AQS API. Your .env file should look like:
EMAIL="[email protected]"
KEY="example"
Please refer to ```lab_notebook.ipynb``` for examples of how to use the DataFetcher class. It has three primary purposes: (1) finding the site with the most data in a particular county/state; (2) fetching AQS data; and (3) fetching CEDS data. Important functions have detailed docstrings.
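The outline below sketches how those three purposes map onto the class; the method names are illustrative only, so refer to the notebook and the docstrings in ```datafetcher.py``` for the actual API:

```python
from datafetcher import DataFetcher

fetcher = DataFetcher()

# The calls below are hypothetical placeholders, not the real method names.
# (1) find the site with the most critical data in a given county/state
# best_site = fetcher.find_best_site(state="06", county="037")

# (2) fetch AQS data for that site and period
# aqs_df = fetcher.fetch_aqs(site=best_site, year=2018)

# (3) fetch CEDS emissions data for the same location and period
# ceds_df = fetcher.fetch_ceds(site=best_site, year=2018)
```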
We can't realistically get data for an entire year, for every site, for every code, for multiple years. If we want an idea of what critical data was available over the span of decades, we need to do some sampling. As a first pass we sample 1 random day every 5 years starting in 2000. Then we get the top 5 sites with the most critical data for a given 5-year period; the top site happens to be Los Angeles-North Main Street no matter what date range we pick.
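One possible reading of that sampling scheme, as a standalone sketch (not taken from the repo; the notebook implements its own version):

```python
import random
from datetime import date, timedelta

def sample_days(start_year=2000, end_year=2020, step=5, seed=0):
    """Pick one random day from each `step`-year window starting at `start_year`."""
    random.seed(seed)
    days = []
    for year in range(start_year, end_year + 1, step):
        window_start = date(year, 1, 1)
        window_end = date(min(year + step - 1, end_year), 12, 31)
        span = (window_end - window_start).days + 1
        days.append(window_start + timedelta(days=random.randrange(span)))
    return days

print(sample_days())  # one random day each from 2000-2004, 2005-2009, ...
```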
We then rank the top 5 based on the availability of PAMS_VOC data. We only need to check the date that resulted in the most available data from the previous query.
However, we eventually want to find more relevant sites to train on, so we could replicate the above logic to find the best sites across multiple states, etc. Even better, we could find a clever way to aggregate data across neighboring sites.
Running the ```generate.py``` script will create 3 datasets in the ```data/clean/Los_Angeles-North_Main_Street/2018``` directory (the structure is ```data/clean/<site>/<year>```). The core dataset contains CRITERIA and MET data, the vocs dataset contains VOC data, and the emissions dataset contains CEDS data. Use the following command:
$ python3 generate.py
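Once generated, the datasets can be loaded with pandas. The filenames below are hypothetical; inspect the output directory for the actual names and formats:

```python
import pandas as pd

# Hypothetical filenames -- check data/clean/Los_Angeles-North_Main_Street/2018/
# after running generate.py for the real names and formats.
base = "data/clean/Los_Angeles-North_Main_Street/2018"
core = pd.read_csv(f"{base}/core.csv")            # CRITERIA + MET data
vocs = pd.read_csv(f"{base}/vocs.csv")            # VOC data
emissions = pd.read_csv(f"{base}/emissions.csv")  # CEDS data
```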