MVP of dbt-based data publishing framework #1505

Merged
merged 35 commits into from
Jun 3, 2022
Changes from 34 commits
Commits
38c9c4b
upgrade to python 3.9 for Literal types
atvaccaro May 12, 2022
ca58fce
create initial dbt pydantic models and test a couple exposures
atvaccaro May 12, 2022
30e0932
add some useful libs
atvaccaro May 16, 2022
2e2ee0e
make ckan destinations uploadable and start on CLI app
atvaccaro May 16, 2022
cb5e1e4
couple more deps
atvaccaro May 17, 2022
534b695
update black
atvaccaro May 17, 2022
bd4db34
start saving to gcs when we publish something
atvaccaro May 17, 2022
b669289
use google default auth
atvaccaro May 17, 2022
3c11a6d
couple more deps
atvaccaro May 18, 2022
218c6c2
tweak ckan a bit, start on map tiles
atvaccaro May 18, 2022
0d04d62
log actual table queried
atvaccaro May 18, 2022
5eaccf2
add bq geo support plus swifter
atvaccaro May 18, 2022
aff621a
actually generate maptiles
atvaccaro May 18, 2022
57ce7ff
use latest-only shapes geo, and fix a bug
atvaccaro May 18, 2022
3383727
log tippecanoe args
atvaccaro May 18, 2022
fe600f3
wip on outputting metadata/dictionary from dbt docs
atvaccaro May 20, 2022
96d6d84
get these working! also fmt
atvaccaro May 23, 2022
76b4d13
use new metadata file format
atvaccaro May 23, 2022
d103102
get lists serializing properly
atvaccaro May 24, 2022
b7d3634
add gtfs schedule docs etc.
atvaccaro May 24, 2022
37aeb10
single ticks
atvaccaro May 24, 2022
b2d06e5
add og docs source and replace newlines in non-markdown descriptions
atvaccaro May 24, 2022
46060c2
make lint happy
atvaccaro May 24, 2022
6454865
start adding additional publishing docs
atvaccaro May 25, 2022
4744d17
make jb happy
atvaccaro May 25, 2022
bd4de91
more docs
atvaccaro May 25, 2022
32ca8a3
add ckan/geoportal note
atvaccaro May 25, 2022
9dc2a09
fix ref syntax
atvaccaro May 31, 2022
e53d092
fix this syntax too!
atvaccaro May 31, 2022
43766fb
adjust description to fit within 1024 bq limit
atvaccaro May 31, 2022
da741eb
add relationships test from agencyhub chat
atvaccaro Jun 1, 2022
b9de657
start addressing PR comments
atvaccaro Jun 2, 2022
556214d
address more comments
atvaccaro Jun 2, 2022
38fc2d4
address deploy comment
atvaccaro Jun 2, 2022
a9d8dd8
one more example
atvaccaro Jun 2, 2022
5 changes: 2 additions & 3 deletions docs/_toc.yml
@@ -51,10 +51,9 @@ parts:
 - file: analytics_examples/warehouse_tutorial
 - file: analytics_examples/new_tutorial
 - file: analytics_examples/sample-catalog-viewer
-- file: analytics_publishing/overview
+- file: publishing/overview
 sections:
-- file: analytics_publishing/how_to_publish
-- file: analytics_publishing/where_reports_live
+- glob: publishing/sections/*
 - caption: Developers
 chapters:
 - file: architecture/architecture_overview
150 changes: 0 additions & 150 deletions docs/analytics_publishing/how_to_publish.md

This file was deleted.

5 changes: 0 additions & 5 deletions docs/analytics_publishing/overview.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/analytics_publishing/where_reports_live.md

This file was deleted.

24 changes: 24 additions & 0 deletions docs/publishing/overview.md
@@ -0,0 +1,24 @@
(publish-analyses)=
# Where can I publish data?

Analysts have a variety of tools available to publish their final
deliverables. With iterative work, analysts can implement certain best
practices within these bounds to do as much as is programmatically practical.
The workflow will look different depending on these factors:

* Are visualizations static or interactive?
* Does the deliverable need to be updated at a specified frequency, or is it a one-off analysis?
* Is the deliverable format PDF, HTML, interactive dashboard, or a slide deck?

Analysts can string together a combination of these solutions. These options are
listed in increasing order of complexity and therefore capability.
* [Static visualizations](publishing-static-files) can be inserted directly
into slide decks (e.g. PNG) or emailed to stakeholders (e.g. HTML or PDF).
* HTML visualizations can be rendered in [GitHub Pages](publishing-github-pages)
and embedded as a URL into a slide deck.
* More advanced HTML-based reports can be hosted in the [analytics portfolio](publishing-analytics-portfolio-site),
which supports interactivity and notebook parameterization.
* Interactive dashboards should be hosted in [Metabase](publishing-metabase) to
share with external stakeholders.
* Structured data may be published to [CKAN](publishing-ckan) to facilitate
usage by analysts, researchers, or other stakeholders.
20 changes: 20 additions & 0 deletions docs/publishing/sections/1_publishing_principles.md
@@ -0,0 +1,20 @@
(publishing-principles)=
# Data Publishing Principles

## Follow prior art
The [California Open Data Publisher's Handbook](https://docs.data.ca.gov/california-open-data-publishers-handbook/)
is the inspiration for much of this process. Its sections include a
[pre-publishing checklist (including descriptions of ownership roles)](https://docs.data.ca.gov/california-open-data-publishers-handbook/1.-review-the-pre-publishing-checklist)
and [best practices for creating metadata](https://docs.data.ca.gov/california-open-data-publishers-handbook/3.-create-metadata-and-data-dictionary).

## Assume the data must stand on its own
Once data is published, we have little control over how it will be used
or who may rely on it. The documentation should reflect this; we
should include as much information as possible while maintaining
backreferences to the data's source.

## Publish the right amount of data
Pick an appropriate subset of the data to publish, based on volume, expected
usage, and refresh/update frequency. For example, GTFS Schedule is fairly low
volume and slow to change, so updating weekly or monthly is more than
sufficient.
74 changes: 74 additions & 0 deletions docs/publishing/sections/2_static_files.md
@@ -0,0 +1,74 @@
(publishing-static-files)=
# Static Visualizations

Static visualizations should be created in a Jupyter Notebook, saved locally
(in JupyterHub), and checked into GitHub. For example, you can save charts
as image files (such as PNGs) and commit them to the repository.

```python

# matplotlib or seaborn (seaborn draws onto matplotlib figures)
import matplotlib.pyplot as plt
import seaborn
plt.savefig("../my-visualization.png")

# altair (`chart` is an existing altair chart)
import altair as alt
import altair_saver
chart.save("../my-visualization.png")

# plotnine (`chart` is an existing ggplot object)
from plotnine import *
chart.save(filename="../my-visualization.png")
```

## Publishing Reports
Reports can be shared as HTML webpages or PDFs. Standalone HTML pages tend
to be self-contained and can be sent via email or similar.

A Jupyter Notebook can be converted to HTML with:

```python
import papermill as pm
import subprocess

OUTPUT_FILENAME = "sample-report"

pm.execute_notebook(
    # notebook to execute
    '../my-notebook.ipynb',
    # if needed, rename the notebook as something different
    # this will be the filename that is used when converting to HTML or PDF
    f'../{OUTPUT_FILENAME}.ipynb',
)

# shell out, run NB Convert
OUTPUT_FORMAT = 'html'
subprocess.run([
    "jupyter",
    "nbconvert",
    "--to",
    OUTPUT_FORMAT,
    "--no-input",
    "--no-prompt",
    f"../{OUTPUT_FILENAME}.ipynb",
])
```

A Jupyter Notebook can be converted to PDF for email distribution with:

```python
# Same as converting to HTML, but with a different output format.
# Reuses OUTPUT_FILENAME from the HTML example above.
import subprocess

# shell out, run NB Convert
OUTPUT_FORMAT = 'PDFviaHTML'
subprocess.run([
    "jupyter",
    "nbconvert",
    "--to",
    OUTPUT_FORMAT,
    "--no-input",
    "--no-prompt",
    f"../{OUTPUT_FILENAME}.ipynb",
])
```
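
The HTML and PDF conversions above differ only in the value passed to `--to`, so the call can be wrapped in a small helper. This is a minimal sketch rather than part of the framework; the function names `nbconvert_command` and `convert_notebook` are hypothetical, and `check=True` is an added assumption so that a failed conversion raises instead of passing silently.

```python
import subprocess


def nbconvert_command(notebook: str, output_format: str = "html") -> list[str]:
    """Build the nbconvert argument list used in the examples above."""
    return [
        "jupyter",
        "nbconvert",
        "--to",
        output_format,
        "--no-input",   # leave code cells out of the report
        "--no-prompt",  # drop the In[ ]/Out[ ] prompts
        notebook,
    ]


def convert_notebook(notebook: str, output_format: str = "html") -> None:
    """Run nbconvert; raises CalledProcessError if the conversion fails."""
    subprocess.run(nbconvert_command(notebook, output_format), check=True)
```

For example, `convert_notebook("../sample-report.ipynb", "PDFviaHTML")` would run the same command as the PDF snippet above.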
46 changes: 46 additions & 0 deletions docs/publishing/sections/3_github_pages.md
@@ -0,0 +1,46 @@
(publishing-github-pages)=
# HTML Visualizations

Visualizations that benefit from limited interactivity, such as displaying tooltips on hover, zooming in and out, or scrolling, can be rendered within GitHub Pages.

A `folium` map can be saved as a local HTML file and checked into GitHub (`ipyleaflet` must be rendered directly in the notebook). Many chart packages, including `altair`, `matplotlib`, and `plotly`, allow you to export as HTML.

```python
# altair (`chart` is an existing altair chart)
import altair as alt
chart.save("../my-visualization.html")

# matplotlib (embed the figure as a base64-encoded PNG)
import matplotlib.pyplot as plt
import base64
from io import BytesIO

fig = plt.figure()

tmpfile = BytesIO()
fig.savefig(tmpfile, format='png')
encoded = base64.b64encode(tmpfile.getvalue()).decode('utf-8')

html = 'Some html head' + '<img src=\'data:image/png;base64,{}\'>'.format(encoded) + 'Some more html'

with open('test.html', 'w') as f:
    f.write(html)

# plotly (`fig` is an existing plotly figure)
import plotly.express as px
fig.write_html("../my-visualization.html")

# folium (`fig` is an existing folium.Map)
import folium
fig.save("../my-visualization.html")
```
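
The matplotlib approach above works for any PNG, not only a figure still held in memory. The following sketch uses only the standard library to wrap PNG bytes in a complete, self-contained HTML page; the helper names `png_to_html` and `write_html` and the default title are assumptions made for illustration.

```python
import base64
from html import escape
from pathlib import Path


def png_to_html(png_bytes: bytes, title: str = "My visualization") -> str:
    """Return a standalone HTML page with the PNG inlined as a data URI."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    safe_title = escape(title)  # avoid breaking the markup with &, < or quotes
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{safe_title}</title></head>\n"
        f"<body><img src='data:image/png;base64,{encoded}' alt='{safe_title}'></body></html>\n"
    )


def write_html(png_path: str, html_path: str) -> None:
    """Read a saved PNG and write the matching self-contained HTML file."""
    page = png_to_html(Path(png_path).read_bytes())
    Path(html_path).write_text(page, encoding="utf-8")
```

Because the image is inlined, the resulting file has no external dependencies and can be committed or emailed on its own.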

## Use GitHub Pages to display these HTML pages
1. Go to the repo's [settings](https://github.com/cal-itp/data-analyses/settings).
1. Navigate to `Pages` on the left.
1. Change the branch that GitHub Pages is sourcing from `main` to `my-current-branch`.
1. Embed the URL into the slides. Example URL: https://docs.calitp.org/data-analyses/PROJECT-FOLDER/MY-VISUALIZATION.html
1. Once a PR is ready and merged, GitHub Pages can be changed back to source from `main`. The URL is preserved within the slide deck.
1. Note: If analysts working on different branches both want to display GitHub Pages, one of them needs to merge into `main`, the other needs to do a `git rebase`, and then they can choose `my-other-branch` as the GitHub Pages source.

Ex: [Service Density Map](https://docs.calitp.org/data-analyses/bus_service_increase/img/arrivals_pc_high.html)
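
Judging from the example links above, the published URL mirrors the file's path inside the repository. A small helper can make that mapping explicit when preparing links for a slide deck. This is a sketch under that assumption; `PAGES_BASE_URL` is copied from the examples above and `pages_url` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Base URL taken from the example links above.
PAGES_BASE_URL = "https://docs.calitp.org/data-analyses"


def pages_url(repo_relative_path: str) -> str:
    """Map a file path inside the repo to its GitHub Pages URL."""
    path = PurePosixPath(repo_relative_path)
    if path.is_absolute() or ".." in path.parts:
        raise ValueError("path must be relative to the repository root")
    # quote() percent-encodes spaces and similar characters but keeps "/"
    return f"{PAGES_BASE_URL}/{quote(path.as_posix())}"
```

Calling `pages_url("bus_service_increase/img/arrivals_pc_high.html")` reproduces the Service Density Map link above.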
11 changes: 11 additions & 0 deletions docs/publishing/sections/4_analytics_portfolio_site.md
@@ -0,0 +1,11 @@
(publishing-analytics-portfolio-site)=
# The Cal-ITP Analytics Portfolio

Depending on the complexity of your visualizations, you may want to produce
a full website composed of multiple notebooks and/or the same notebook run
across different sets of data (for example, one report per Caltrans district).
For these situations, the [Jupyter Book-based](https://jupyterbook.org/en/stable/intro.html)
[publishing framework](https://github.com/cal-itp/data-analyses/tree/main/portfolio)
present in the data-analyses repo is your friend.

You can find the Cal-ITP Analytics Portfolio at [analysis.calitp.org](https://analysis.calitp.org).
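
To illustrate what "the same notebook run across different sets of data" can look like, here is a rough sketch that executes one notebook per district with papermill. It is not the portfolio framework's own implementation; the parameter name `district`, the output directory, and the function names are all hypothetical, and the notebook is assumed to have a papermill `parameters` cell.

```python
from pathlib import Path


def plan_district_runs(notebook: str, districts: list[int], output_dir: str = "reports"):
    """Return one (output_path, parameters) pair per district."""
    stem = Path(notebook).stem
    return [
        (f"{output_dir}/{stem}__district_{d:02d}.ipynb", {"district": d})
        for d in districts
    ]


def run_district_reports(notebook: str, districts: list[int]) -> None:
    """Execute the notebook once per district with papermill."""
    import papermill as pm  # imported lazily so planning works without papermill

    for output_path, parameters in plan_district_runs(notebook, districts):
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        pm.execute_notebook(notebook, output_path, parameters=parameters)
```

Separating the planning step from execution makes it easy to check the output names before running anything.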
10 changes: 10 additions & 0 deletions docs/publishing/sections/5_metabase.md
@@ -0,0 +1,10 @@
(publishing-metabase)=
# Metabase

Interactive charts should be displayed in Metabase. Using Voila on Jupyter Notebooks works locally, but doesn't allow for sharing with external stakeholders. The data cleaning and processing should still be done within Python scripts or Jupyter notebooks. The processed dataset backing the dashboard should be exported to a Google Cloud Storage bucket.

Metabase can only source data from the data warehouse, so an [Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags) needs to be set up to copy the processed dataset into the warehouse. The dashboard visualizations can then be set up in Metabase, where they remain interactive and can be easily shared with external stakeholders.

Because the data processing lives in scripts and notebooks, any tweaks are easy to make there, and the visualizations in the dashboard stay up to date with little friction.

Ex: [Payments Dashboard](https://dashboards.calitp.org/dashboard/3-payments-performance-dashboard?transit_provider=mst)
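
The export step described above could look roughly like the sketch below, which uploads a processed file to a bucket with the `google-cloud-storage` client. The bucket name, the dated object layout, and both function names are assumptions for illustration only; the actual bucket and naming scheme should match whatever the Airflow DAG expects.

```python
from datetime import date


def dataset_object_name(dataset: str, run_date: date, extension: str = "parquet") -> str:
    """Build a dated object name so each export is kept separately."""
    return f"{dataset}/dt={run_date.isoformat()}/{dataset}.{extension}"


def upload_processed_dataset(local_path: str, bucket_name: str, object_name: str) -> str:
    """Upload a local file to GCS and return its gs:// URI."""
    from google.cloud import storage  # requires the google-cloud-storage package

    client = storage.Client()  # picks up application default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{object_name}"
```

A call might look like `upload_processed_dataset("payments.parquet", "my-hypothetical-bucket", dataset_object_name("payments", date.today()))`.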
8 changes: 8 additions & 0 deletions docs/publishing/sections/6_gcs.md
@@ -0,0 +1,8 @@
(publishing-gcs)=
# GCS

NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are
using the dbt exposure publishing framework, your data will already be saved in
GCS as part of the upload process.

TBD.