Feature/unstructured data #161
{%- endmacro %}

{% macro default__count_tokens(column_name) %}
    {{ dbt.length(column_name) }} / 4
Based on this doc and internal discussion, an approximate token count is appropriate here.
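For readers who want to see the arithmetic behind the macro, here is a minimal Python sketch of the same heuristic (character length divided by 4). The function name `approx_token_count` and the `chars_per_token` parameter are hypothetical and not part of the package; this is an illustration, not the package's implementation.

```python
def approx_token_count(text: str, chars_per_token: int = 4) -> float:
    """Approximate a token count as character length / chars_per_token.

    Mirrors the `length(column) / 4` heuristic in the macro above.
    The name and signature are illustrative assumptions.
    """
    return len(text) / chars_per_token


if __name__ == "__main__":
    sample = "a" * 40
    print(approx_token_count(sample))
```

The estimate is deliberately rough; it avoids running a real tokenizer inside the warehouse.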
# - package: fivetran/zendesk_source
#   version: [">=0.12.0", "<0.13.0"]
- git: https://github.com/fivetran/dbt_zendesk_source.git
  revision: feature/unstructured-data
  warn-unpinned: false
Suggested change:
- # - package: fivetran/zendesk_source
- #   version: [">=0.12.0", "<0.13.0"]
- - git: https://github.com/fivetran/dbt_zendesk_source.git
-   revision: feature/unstructured-data
-   warn-unpinned: false
+ - package: fivetran/zendesk_source
+   version: [">=0.12.0", "<0.13.0"]
ticket_comment_id,
ticket_id,
comment_time,
case when comment_tokens > {{ var('max_tokens', 7500) }} then left(comment_markdown, {{ var('max_tokens', 7500) }} * 4) -- approximate 4 characters per token
Using `left()` instead of `substring()` since it's easier to deal with across warehouses and does the same thing here, as we're just truncating.
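The truncation rule in the diff line above can be sketched in Python as follows. This assumes that comments under the limit pass through unchanged (the `else` branch of the `case` statement is not shown in the diff), and the names `truncate_comment`, `MAX_TOKENS`, and `CHARS_PER_TOKEN` are hypothetical, chosen only for illustration.

```python
MAX_TOKENS = 7500      # default of the `max_tokens` var in the diff
CHARS_PER_TOKEN = 4    # approximation used by the package


def truncate_comment(comment: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep only the leftmost max_tokens * 4 characters of long comments.

    Assumption: comments at or under the limit are returned as-is,
    since the diff does not show the else branch.
    """
    approx_tokens = len(comment) / CHARS_PER_TOKEN
    if approx_tokens > max_tokens:
        # Equivalent to SQL left(comment, max_tokens * 4)
        return comment[: max_tokens * CHARS_PER_TOKEN]
    return comment


if __name__ == "__main__":
    long_comment = "x" * 40_000
    print(len(truncate_comment(long_comment)))
```

Taking a prefix is all that is needed here, which is why `left()` and `substring()` are interchangeable for this purpose.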
unstructured:
  +schema: zendesk_unstructured
  +materialized: table
Let's sync on this default schema
From our discussion, leaving this as-is so it is separated.
@fivetran-joemarkiewicz Thank you for reviewing! I also updated the changelog with the changes from the source.
@fivetran-catfritz great work on this PR. Just a few final comments and suggestions before approval. Let me know if you have any questions!
README.md
Outdated
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status. |
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. |
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported. |
| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |
@elanfivetran this won't be available in Quickstart yet, but it will be displayed in the UI via this table. Is that okay?
@fivetran-catfritz would you be able to edit this to be a hyperlink to the package docs so users can see the table structure and documentation?
@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.
README.md
Outdated
@@ -91,6 +92,23 @@ vars:

## (Optional) Step 5: Additional configurations

### Enabling the unstructured document model for NLP
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:
Similar to the note above: can we include the dbt docs link to the `zendesk__document` table here so curious users can inspect the structure and documentation of the table?
Also related to my manifest question.
@@ -24,6 +24,9 @@ models:
ticket_history:
  +schema: zendesk_intermediate
  +materialized: ephemeral
unstructured:
  +schema: zendesk_unstructured
  +materialized: table
Do we need all of these models to be materialized as tables? I definitely see the `zendesk__document` needing to be, but do all the intermediate models as well?
I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.
Due to the nature of the end result, I think it makes sense to keep these as tables to help with the query load as best as we can. We can always adjust this default materialization based off feedback.
integration_tests/dbt_project.yml
Outdated
@@ -1,7 +1,7 @@
config-version: 2

name: 'zendesk_integration_tests'
- version: '0.16.0'
+ version: '0.17.0'
Can we add the variable configuration `zendesk__unstructured_enabled: true` in the vars config here so it can be enabled when generating the docs?
@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?
Co-authored-by: Joe Markiewicz <[email protected]>
Thanks @fivetran-joemarkiewicz. I applied your suggestions but also had a couple more questions!
@fivetran-joemarkiewicz I updated the README to anticipate `zendesk__document` being included in the docs and added the links. I will merge into the release branch if it looks good, and will regenerate the docs there.
LGTM!
PR Overview
This PR will address the following Issue/Feature:
This PR will result in the following new package version:
Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:
PR Checklist
Basic Validation
Please acknowledge that you have successfully performed the following commands locally:
&& dbt test
dbt run (if incremental models are present) && dbt test
Before marking this PR as "ready for review" the following have been applied:
Detailed Validation
Please share any and all of your validation steps:
If you had to summarize this PR in an emoji, which would it be?
💃