Open Source Data Science Modules
Install using the following command:
pip install git+https://github.com/bluelabsio/open-dslib
The package picks up database credentials from the environment. A reference to the database name - eg. DATABASE - is required, and all environment variables must be named accordingly:
- {DATABASE}_DB - refers to the name of the database to be queried
- {DATABASE}_HOST - refers to the database host
- {DATABASE}_PORT - refers to the database port
- {DATABASE}_USER - refers to the database user making the connection
- {DATABASE}_PW - refers to the password for the database user who is making the connection
- {DATABASE}_TYPE - refers to the database dialect
Crosstabs can be created in two ways:
- Passing all the crosstab parameters inline:
import dslib.assessment.crosstabs as xt
xtabs = xt.CrossTabs(
database_name = '{DATABASE}',
universe="(SELECT * FROM schema.table_name LIMIT 100000)",
metrics=[
"count(*)",
"COUNT(*)::FLOAT/NULLIF(SUM(COUNT(*)) OVER(), 0) prop",
],
splits=[
"gender",
"race",
["gender", "race"] #(This can be used to cross two variables together)
]
)
xtabs.df
- Passing crosstab parameters through a yaml config:
import dslib.assessment.crosstabs as xt
xtabs = xt.CrossTabs(
database_name = '{DATABASE}',
yaml_file='example_config.yaml'
)
xtabs.df
The config should be formatted like:
universes:
'table1': (SELECT * FROM schema.table_name_1)
'table2': (SELECT * FROM schema.table_name_2)
metrics:
- count(*)
- COUNT(*)::FLOAT/NULLIF(SUM(COUNT(*)) OVER(), 0) prop
splits:
- gender
- race
- - gender
- race #(This can be used to cross two variables together)
When providing multiple universes, provide a name for each universe that it can be referenced by. Ensure that you do not use any reserved words that might have some meaning in the SQL dialect you are working with.
Examples of reserved words:
ALL
FULL
COLUMN
GROUP
You can find a list of reserved words for Amazon Redshift here.
Please do not provide any aliases for your universe in the sql that you enter, this will be handled by the tool itself.