Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/unstructured data #161

Merged
merged 18 commits into from
Sep 3, 2024

Conversation

fivetran-catfritz
Copy link
Contributor

@fivetran-catfritz fivetran-catfritz commented Aug 13, 2024

PR Overview

This PR will address the following Issue/Feature:

  • internal ticket

This PR will result in the following new package version:

  • 0.17.0 since we're adding a new model

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

New model!

  • Addition of the zendesk__document model, designed to structure Zendesk textual data for vectorization and integration into NLP workflows. The model outputs a table with:
    • document_id: Corresponding to the ticket_id
    • chunk_index: For text segmentation
    • chunk: The text chunk itself
    • chunk_tokens_approximate: Approximate token count for each segment
  • This model is currently disabled by default. You may enable it by setting the zendesk__unstructured_enabled variable as true in your dbt_project.yml.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run –full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

  • see internal ticket

If you had to summarize this PR in an emoji, which would it be?

💃

@fivetran-catfritz fivetran-catfritz self-assigned this Aug 13, 2024
{%- endmacro %}

{% macro default__count_tokens(column_name) %}
{{ dbt.length(column_name) }} / 4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on this doc and internal discussion, approximate count of tokens is appropriate.

Comment on lines +2 to +6
# - package: fivetran/zendesk_source
# version: [">=0.12.0", "<0.13.0"]
- git: https://github.com/fivetran/dbt_zendesk_source.git
revision: feature/unstructured-data
warn-unpinned: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# - package: fivetran/zendesk_source
# version: [">=0.12.0", "<0.13.0"]
- git: https://github.com/fivetran/dbt_zendesk_source.git
revision: feature/unstructured-data
warn-unpinned: false
- package: fivetran/zendesk_source
version: [">=0.12.0", "<0.13.0"]

ticket_comment_id,
ticket_id,
comment_time,
case when comment_tokens > {{ var('max_tokens', 7500) }} then left(comment_markdown, {{ var('max_tokens', 7500) }} * 4) -- approximate 4 characters per token
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using left() instead of substring() since it's easier to deal with across warehouses and does the same thing since we're just truncating.

Comment on lines +27 to +29
unstructured:
+schema: zendesk_unstructured
+materialized: table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's sync on this default schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our discussion, leaving this as-is so it is separated.

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz Thank you for reviewing! I also updated the changelog with the changes from the source.

Comment on lines +27 to +29
unstructured:
+schema: zendesk_unstructured
+materialized: table
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our discussion, leaving this as-is so it is separated.

@fivetran-catfritz fivetran-catfritz changed the base branch from main to release/v0.17.0 August 29, 2024 23:02
Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz great work on this PR. Just a few final comments and suggestions before approval. Let me know if you have any questions!

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
README.md Outdated
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status. |
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. |
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.
| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elanfivetran this won't be available yet in Quickstart, yet it will be displayed in the UI via this table. Is that okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz would you be able to edit this to be the hyperlink to the package docs so users can see the table structure and documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.

README.md Outdated
@@ -91,6 +92,23 @@ vars:

## (Optional) Step 5: Additional configurations

### Enabling the unstructured document model for NLP
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar note from above, can we include the dbt docs link to the table zendesk__document here so curious users can go and inspect the structure and documentation of the table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also related to my manifest question.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@@ -24,6 +24,9 @@ models:
ticket_history:
+schema: zendesk_intermediate
+materialized: ephemeral
unstructured:
+schema: zendesk_unstructured
+materialized: table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all of these models to be materialized as tables? I definitely see the zendesk__document needing to be, but do all the intermediate models as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the nature of the end result, I think it makes sense to keep these as tables to help with the query load as best as we can. We can always adjust this default materialization based off feedback.

@@ -1,7 +1,7 @@
config-version: 2

name: 'zendesk_integration_tests'
version: '0.16.0'
version: '0.17.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the variable configuration zendesk__unstructured_enabled: true in the vars config here so it can be enabled when generating the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fivetran-joemarkiewicz. I applied your suggestions but also had a couple more questions!

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@@ -1,7 +1,7 @@
config-version: 2

name: 'zendesk_integration_tests'
version: '0.16.0'
version: '0.17.0'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?

README.md Outdated
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status. |
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. |
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.
| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.

README.md Outdated
@@ -91,6 +92,23 @@ vars:

## (Optional) Step 5: Additional configurations

### Enabling the unstructured document model for NLP
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also related to my manifest question.

@@ -24,6 +24,9 @@ models:
ticket_history:
+schema: zendesk_intermediate
+materialized: ephemeral
unstructured:
+schema: zendesk_unstructured
+materialized: table
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz I updated the readme to anticipate zendesk__document being included in the docs and add in the links and will merge into the release branch if it looks good! I will regen the docs in the release branch.

Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@fivetran-catfritz fivetran-catfritz merged commit 03bf529 into release/v0.17.0 Sep 3, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants