The Division of Drinking Water (DDW) at the California State Water Resources Control Board regulates 2866 Community Water Systems (CWS) throughout the state. Some of these CWS risk running out of water during the dry summer season. To address this problem, the Data Science Accelerator at the Office of Data and Innovation collaborated with the Division of Drinking Water to create a machine learning model that forecasts which CWS face the highest risk of running out of water. The model is intended to run in production on a monthly basis, producing forecasts for at-risk CWS within the subsequent ninety days.
Release v0.1.0 contains the results of the collaboration between DDW and ODI. For more up-to-date development on this project, see the Division of Drinking Water Github repository.
This repository includes data, code, environment files, and a file containing all the forecasts.
- Data - The repository /data includes all the data used for this model.
- Code - The Jupyter notebook titled
water.ipynb
includes all code to generate the machine learning model. - Environment Files - The environment files include all the Python packages necessary to run the code. To reproduce the original environment exactly, use the OSX-specific environment file called
environment.yml
. To use a platform-agnostic environment file, useenvironment_cross_compatible.yml
. - Forecast - The file
Forecast_2023.csv
includes all the CWS forecasted to experience some form of drought impact. - Docs - This folder contains two documents. One, called
Data_Quality.md
, lists all the candidate data sources for the machine learning model, describes any data quality and integrity issues with each source, and provides solutions to address these issues. The second document, calledFuture_Ideas.md
, lists research ideas, as well algorithms and tools, that might be useful to explore in the future. These documents may be useful for future roadmaps.
If you use any of this code or results, please consider citing our paper (TBD):
@article{Placeholder text}
The following steps describe exactly how to use the data, code, and envrionment file together to build and run the model using Jupyter Lab. This will generate a .csv file of drought-impacted CWSs ranked by probability (see the column Expected
).
The first two steps only need to be executed once.
- First, install conda. Conda is a tool that creates environments, or a collection of version-specific and OS-specific Python packages. This allows enables reproducibility. Using a different verson of a Python package than the original developer might produce differing results. It is best practice to use the same environment as the developer. Follow these installation instructions to install conda. Installing miniconda is sufficient for this purpose.
- Clone this Github repository using the command
git clone https://github.com/cagov/aae-dsa-water.git
. - For macOS, create the environment using the command
conda env create -f environment.yml
. This environment file will install OSX-specific packages, and will reproduce the original environment exactly. To create an enviroment file on a Linux or Windows machine, use the file calledenvironment_cross_compatible.yml
.
- Make sure your cloned version of the Github repository is up-to-date. Run the command
git pull origin main
. Note: If you are developing, make sure to commit your changes before pulling. - Activate the conda environment by using the command
conda activate water
. - Open Jupyter Lab by typing
jupyter-lab
in the terminal. - Navigate to the list of files and open
water.ipynb
. - Under the tab Run, select 'Run all cells'. This will produce a .csv file called
Forecast_2023.csv
.