This is a sample project for Databricks, generated via cookiecutter.
To use this project, you need Python 3.X and pip or conda for package management.
- Create a local Python environment with a tool of your choice. This example uses conda, but any environment management tool works:
conda create -n persuasion4good python=3.9
conda activate persuasion4good
- If you don't have a JDK installed on your local machine, install one (this example uses a conda-based installation):
conda install -c conda-forge openjdk=11.0.15
- Install the project locally (this also installs the dev requirements):
pip install -e ".[local,test]"
For unit testing, please use pytest:
pytest tests/unit --cov
Please check the directory tests/unit for more details on how to use unit tests.
In tests/unit/conftest.py you'll also find useful testing primitives, such as a local Spark instance with Delta support, local MLflow, and a DBUtils fixture.
There are two options for running integration tests:
- On an interactive cluster via dbx execute
- On a job cluster via dbx launch
For quicker startup of the job clusters we recommend using instance pools (AWS, Azure, GCP).
For an integration test on an interactive cluster, use the following command:
dbx execute <workflow-name> --cluster-name=<name of interactive cluster>
To execute a task inside a multitask job, use the following command:
dbx execute <workflow-name> \
--cluster-name=<name of interactive cluster> \
--job=<name of the job to test> \
--task=<task-key-from-job-definition>
For a test on a job cluster, deploy the job assets and then launch a run from them:
dbx deploy <workflow-name> --assets-only
dbx launch <workflow-name> --from-assets --trace
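If you want to script the job-cluster test, for example from a local helper, the two commands above can be wrapped as in the minimal sketch below. The function names are hypothetical and not part of dbx; the sketch only reuses the flags shown above and assumes dbx is available on your PATH.

```python
import subprocess
from typing import List


def integration_test_commands(workflow: str) -> List[List[str]]:
    """Return the deploy and launch invocations shown above, in order."""
    return [
        ["dbx", "deploy", workflow, "--assets-only"],
        ["dbx", "launch", workflow, "--from-assets", "--trace"],
    ]


def run_integration_test(workflow: str) -> None:
    """Run deploy, then launch; stop at the first failing command."""
    for cmd in integration_test_commands(workflow):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed deploy prevents the launch step from running.
        subprocess.run(cmd, check=True)
```

Calling run_integration_test with a workflow name from conf/deployment.yml runs both steps in sequence.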
- dbx expects that the cluster for interactive execution supports the %pip and %conda magic commands.
- Please configure your workflow (and the tasks inside it) in the conf/deployment.yml file.
- To execute the code interactively, provide either --cluster-id or --cluster-name.
dbx execute <workflow-name> \
--cluster-name="<some-cluster-name>"
Multiple users can also use the same cluster for development. Libraries are isolated per user execution context.
To start working with your notebooks from Repos, do the following:
- Add your Git provider token to your user settings in Databricks
- Add your repository to Repos. This can be done via the UI or via the CLI command below:
databricks repos create --url <your repo URL> --provider <your-provider>
This command will create your personal repository under /Repos/<username>/persuasion4good.
- Use git_source in your job definition as described in the dbx documentation
Please set the following secrets or environment variables for your CI provider:
DATABRICKS_HOST
DATABRICKS_TOKEN
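A CI job that lacks either variable tends to fail late with an authentication error. As a minimal sketch, a pre-flight check like the one below could run as the first CI step; the helper name is hypothetical and not part of this project or of dbx.

```python
import os
from typing import List, Mapping

# The two variables named above.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_ci_environment() -> None:
    """Raise early with a clear message if the CI secrets are not configured."""
    missing = missing_databricks_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```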
- To trigger the CI pipeline, push your code to the repository. If the CI provider is configured correctly, this triggers the general testing pipeline.
- To trigger the release pipeline, get the current version from the persuasion4good/__init__.py file and tag the current code version:
git tag -a v<your-project-version> -m "Release tag for version <your-project-version>"
git push origin --tags
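To avoid copying the version by hand, the tag commands can be derived from the package file. The sketch below assumes that persuasion4good/__init__.py contains a line of the form `__version__ = "..."`, which the text above does not confirm; the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumption: the version is declared as  __version__ = "x.y.z"
VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_text: str) -> str:
    """Extract the version string from the text of an __init__.py file."""
    match = VERSION_RE.search(init_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_commands(version: str) -> List[List[str]]:
    """Build the two git commands shown above for the given version."""
    return [
        ["git", "tag", "-a", f"v{version}", "-m", f"Release tag for version {version}"],
        ["git", "push", "origin", "--tags"],
    ]


def release_commands(init_path: str = "persuasion4good/__init__.py") -> List[List[str]]:
    """Read the version from init_path and return the matching git commands."""
    return tag_commands(read_version(Path(init_path).read_text()))
```

Each returned command can be passed to subprocess.run(cmd, check=True) from the repository root.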