Skip to content

Commit

Permalink
Merge pull request #161 from fivetran/feature/unstructured-data
Browse files Browse the repository at this point in the history
Feature/unstructured data
  • Loading branch information
fivetran-catfritz authored Sep 3, 2024
2 parents d9e5b19 + 1b68e18 commit 03bf529
Show file tree
Hide file tree
Showing 15 changed files with 259 additions and 44 deletions.
2 changes: 1 addition & 1 deletion .buildkite/scripts/run_models.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ dbt deps
dbt seed --target "$db" --full-refresh
dbt run --target "$db" --full-refresh
dbt test --target "$db"
dbt run --vars '{using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
dbt run --vars '{zendesk__unstructured_enabled: true, using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
dbt test --target "$db"

dbt run-operation fivetran_utils.drop_schemas_automation --target "$db"
16 changes: 16 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,27 @@
# dbt_zendesk v0.17.0

## New model ([#161](https://github.com/fivetran/dbt_zendesk/pull/161))
- Addition of the `zendesk__document` model, designed to structure Zendesk textual data for vectorization and integration into NLP workflows. The model outputs a table with:
- `document_id`: Corresponding to the `ticket_id`
- `chunk_index`: For text segmentation
- `chunk`: The text chunk itself
- `chunk_tokens_approximate`: Approximate token count for each segment
- This model is currently disabled by default. You may enable it by setting the `zendesk__unstructured_enabled` variable as `true` in your `dbt_project.yml`.
- This model was developed to limit the chunk sizes to approximately 5000 tokens for use with OpenAI, however you can change this limit by setting the variable `zendesk_max_tokens` in your `dbt_project.yml`.
- See the README section [Enabling the unstructured document model for NLP](https://github.com/fivetran/dbt_zendesk/blob/main/README.md#enabling-the-unstructured-document-model-for-nlp) for more information.

## Breaking Changes (Full refresh required after upgrading)
- Incremental models running on BigQuery have had the `partition_by` logic adjusted to include a granularity of a month. This change only impacts BigQuery warehouses and was applied to avoid the common `too many partitions` error some users have experienced when partitioning by day. Therefore, adjusting the partition to a month granularity will decrease the number of partitions created and allow for more performant querying and incremental loads. This change was applied to the following models ([#165](https://github.com/fivetran/dbt_zendesk/pull/165)):
- `int_zendesk__field_calendar_spine`
- `int_zendesk__field_history_pivot`
- `zendesk__ticket_field_history`

- In the [dbt_zendesk_source v0.12.0 release](https://github.com/fivetran/dbt_zendesk_source/releases/tag/v0.12.0), the field `_fivetran_deleted` was added to the following models for use in `zendesk__document` model ([#161](https://github.com/fivetran/dbt_zendesk/pull/161)):
- `stg_zendesk__ticket`
- `stg_zendesk__ticket_comment`
- `stg_zendesk__user`
- If you have already added `_fivetran_deleted` as a passthrough column via the `zendesk__ticket_passthrough_columns` or `zendesk__user_passthrough_columns` variable, you will need to remove or alias this field from the variable to avoid duplicate column errors.

## Bug Fixes
- Fixed an issue in the `zendesk__sla_policies` model where tickets that were opened and solved outside of scheduled hours were not being reported, specifically for the metrics `requester_wait_time` and `agent_work_time`.
- Resolved by adjusting the join logic in models `int_zendesk__agent_work_time_business_hours` and `int_zendesk__requester_wait_time_business_hours`. ([#164](https://github.com/fivetran/dbt_zendesk/pull/164), [#156](https://github.com/fivetran/dbt_zendesk/pull/156))
Expand Down
26 changes: 23 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,8 @@ The following table provides a detailed list of final tables materialized within
| [zendesk__ticket_summary](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_summary) | A single record table containing Zendesk Support ticket and user summary metrics. |
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status. |
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. |
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.
| [zendesk__document](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__document) | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |

Many of the above reports are now configurable for [visualization via Streamlit](https://github.com/fivetran/streamlit_zendesk). Check out some [sample reports here](https://fivetran-zendesk.streamlit.app/).

Expand Down Expand Up @@ -91,7 +92,26 @@ vars:

### (Optional) Step 5: Additional configurations

#### Add passthrough columns
#### Enabling the unstructured document model for NLP
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. For definitions and more information, refer to [zendesk__document](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__document) in our dbt docs.

By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:

```yml
vars:
zendesk__unstructured_enabled: true # false by default.
```

##### Customizing Chunk Size for Vectorization

The `zendesk__document` model was developed to limit approximate chunk sizes to 7,500 tokens, optimized for OpenAI models. However, you can adjust this limit by setting the `max_tokens` variable in your `dbt_project.yml`:

```yml
vars:
zendesk_max_tokens: 5000 # Default value
```

### Add passthrough columns
This package includes all source columns defined in the macros folder. You can add more columns from the `TICKET`, `USER`, and `ORGANIZATION` tables using our pass-through column variables.

These variables allow for the pass-through fields to be aliased (`alias`) and casted (`transform_sql`) if desired, but not required. Datatype casting is configured via a sql snippet within the `transform_sql` key. You may add the desired sql while omitting the `as field_name` at the end and your custom pass-through fields will be casted accordingly. Use the below format for declaring the respective pass-through variables:
Expand Down Expand Up @@ -211,7 +231,7 @@ This dbt package is dependent on the following dbt packages. These dependencies
```yml
packages:
- package: fivetran/zendesk_source
version: [">=0.11.0", "<0.12.0"]
version: [">=0.12.0", "<0.13.0"]
- package: fivetran/fivetran_utils
version: [">=0.4.0", "<0.5.0"]
Expand Down
3 changes: 3 additions & 0 deletions dbt_project.yml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,9 @@ models:
ticket_history:
+schema: zendesk_intermediate
+materialized: ephemeral
unstructured:
+schema: zendesk_unstructured
+materialized: table
utils:
+materialized: ephemeral
vars:
Expand Down
Loading

0 comments on commit 03bf529

Please sign in to comment.