#126: Add documentation for token for private models in prediction UDFs #127

Merged: 5 commits, Sep 8, 2023
9 changes: 6 additions & 3 deletions doc/changes/changes_0.5.0.md
@@ -1,11 +1,13 @@
-# Transformers Extension 0.5.0, released YYYY-MM-DD
+# Transformers Extension 0.5.0, released 2023-09-08

-Code name: TBD
+Code name: Support for transformer 4.31


## Summary

-TBD
+This release makes the extension compatible with Huggingface transformers v.4.31.0 and their new model cache format.
+Furthermore, it makes the deployment scripts compatible with Exasol v8 by enabling encryption
+and allows the user to configure the TLS verification.

### Features

@@ -33,6 +35,7 @@ TBD
### Documentation

- #93: Added the Developer Guide
+- #126: Add documentation for token for private models in prediction UDFs



108 changes: 74 additions & 34 deletions doc/user_guide/user_guide.md
@@ -52,7 +52,7 @@ The extension provides two types of UDFs:
```buildoutcfg
http[s]://<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKET_NAME>/<PATH_IN_BUCKET>;<BUCKETFS_NAME>
```
-- A valid token is required to download private models from the Huggingface hub.
+- A valid token is required to download private models from the Huggingface hub and run prediction on them.
To avoid exposing such sensitive information, you can use Exasol Connection
objects. As seen in the example below, a token can be specified in the
password part of the Exasol connection object:
@@ -128,7 +128,7 @@ language container file released in GitHub Releases section.
(see [the latest release](https://github.com/exasol/transformers-extension/releases/latest)).
- Before installing the language container, these parts must be combined using the following command::
```shell
-cat language_container_part_* > language_container.tar.gz
+ls -v language_container_part_* | xargs cat > language_container.tar.gz
```
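
The switch from a bare glob to `ls -v` in this hunk presumably guards against the parts being concatenated in lexicographic rather than numeric order. A minimal sketch of the difference, assuming GNU `ls` and hypothetical part files with unpadded numeric suffixes (the real release asset names may differ):

```shell
# Scratch directory with 12 hypothetical part files; each holds its own index.
workdir=$(mktemp -d)
cd "$workdir"
for i in $(seq 1 12); do
  printf '%s\n' "$i" > "language_container_part_$i"
done

# A bare glob expands lexicographically, so part_10..part_12 land before part_2.
glob_order=$(cat language_container_part_* | paste -s -d ' ' -)

# ls -v (GNU coreutils) sorts embedded numbers naturally before xargs hands the names to cat.
natural_order=$(ls -v language_container_part_* | xargs cat | paste -s -d ' ' -)

echo "glob:    $glob_order"     # glob:    1 10 11 12 2 3 4 5 6 7 8 9
echo "natural: $natural_order"  # natural: 1 2 3 4 5 6 7 8 9 10 11 12
```

If the suffixes are zero-padded to a fixed width, both orders coincide and the `ls -v` form acts only as a safeguard.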

##### Install Language Container
@@ -265,6 +265,7 @@ classes of the specified model. An example usage is given below:
SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data
@@ -273,7 +274,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
- Parameters:
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
-- ```bucketfs_conn```: The BucketFS connection name
+- ```bucketfs_conn```: The BucketFS connection name
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -285,10 +290,10 @@ this UDF. In case of any error during model loading or prediction, these new
columns are set to `null` and column _ERROR_MESSAGE_ is set
to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL | SCORE | ERROR_MESSAGE |
-| ------------- | ------- | ---------- | --------- |---------| ----- |----------------|
-| conn_name | dir/ | model_name | text | label_1 | 0.75 | None |
-| ... | ... | ... | ... | ... | ... | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL | SCORE | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------| --------- |---------| ----- |----------------|
+| conn_name | token_conn_name | dir/ | model_name | text | label_1 | 0.75 | None |
+| ... | ... | ... | ... | ... | ... | ... | ... |


### Sequence Classification for Text Pair UDF
@@ -298,6 +303,7 @@ determine if two sequences are paraphrases of each other. An example usage is gi
SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
first_text,
@@ -307,7 +313,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
- Parameters:
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
-- ```bucketfs_conn```: The BucketFS connection name
+- ```bucketfs_conn```: The BucketFS connection name
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -329,6 +339,7 @@ An example usage is given below:
SELECT TE_QUESTION_ANSWERING_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
question,
@@ -340,6 +351,10 @@ SELECT TE_QUESTION_ANSWERING_UDF(
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
- ```bucketfs_conn```: The BucketFS connection name
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -354,11 +369,11 @@ If `top_k` > 1, each input row is repeated for each answer. In case of any error
during model loading or prediction, these new columns are set to `null` and column _ERROR_MESSAGE_ is set
to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | QUESTION | CONTEXT | TOP_K | ANSWER | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |------------|-----------| ----- |----------| ----- |------| ------------- |
-| conn_name | dir/ | model_name | question_1 | context_1 | 2 | answer_1 | 0.75 | 1 | None |
-| conn_name | dir/ | model_name | question_2 | context_1 | 2 | answer_2 | 0.70 | 2 | None |
-| ... | ... | ... | ... | ... | ... | ... | ... | .. | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | QUESTION | CONTEXT | TOP_K | ANSWER | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|------------|-----------| ----- |----------| ----- |------| ------------- |
+| conn_name | token_conn_name | dir/ | model_name | question_1 | context_1 | 2 | answer_1 | 0.75 | 1 | None |
+| conn_name | token_conn_name | dir/ | model_name | question_2 | context_1 | 2 | answer_2 | 0.70 | 2 | None |
+| ... | ... | ... | ... | ... | ... | ... | ... | ... | .. | ... |


### Masked Language Modelling UDF
@@ -369,6 +384,7 @@ this UDF is ```<mask>```. An example usage is given below:
SELECT TE_FILLING_MASK_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data,
@@ -379,7 +395,11 @@ SELECT TE_FILLING_MASK_UDF(
- Parameters:
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
-- ```bucketfs_conn```: The BucketFS connection name
+- ```bucketfs_conn```: The BucketFS connection name
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -393,11 +413,11 @@ If `top_k` > 1, each input row is repeated for each prediction. In case of any
error during model loading or prediction, these new columns are set to `null`
and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | TOP_K | FILLED_TEXT | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |---------------| ----- |---------------| ----- |------|---------------|
-| conn_name | dir/ | model_name | text `<mask>` | 2 | text filled_1 | 0.75 | 1 | None |
-| conn_name | dir/ | model_name | text `<mask>` | 2 | text filled_2 | 0.70 | 2 | None |
-| ... | ... | ... | ... | ... | ... | ... | ... | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | TOP_K | FILLED_TEXT | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|---------------| ----- |---------------| ----- |------|---------------|
+| conn_name | token_conn_name | dir/ | model_name | text `<mask>` | 2 | text filled_1 | 0.75 | 1 | None |
+| conn_name | token_conn_name | dir/ | model_name | text `<mask>` | 2 | text filled_2 | 0.70 | 2 | None |
+| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


### Text Generation UDF
@@ -409,6 +429,7 @@ An example usage is given below:
SELECT TE_TEXT_GENERATION_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data,
@@ -419,7 +440,11 @@ SELECT TE_TEXT_GENERATION_UDF(
- Parameters:
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
-- ```bucketfs_conn```: The BucketFS connection name.
+- ```bucketfs_conn```: The BucketFS connection name.
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -443,6 +468,7 @@ There are two popular subtasks of token classification:
SELECT TE_TOKEN_CLASSIFICATION_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data,
@@ -453,6 +479,10 @@ SELECT TE_TOKEN_CLASSIFICATION_UDF(
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
- ```bucketfs_conn```: The BucketFS connection name.
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -469,10 +499,10 @@ In case of any error during model loading or prediction, these new
columns are set to `null`, and column _ERROR_MESSAGE_ is set
to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
-| conn_name | dir/ | model_name | text | simple | 0 | 4 | text | noun | 0.75 | None |
-| ... | ... | ... | ... | ... | ... | ... | ... | .. | ... | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
+| conn_name | token_conn_name | dir/ | model_name | text | simple | 0 | 4 | text | noun | 0.75 | None |
+| ... | ... | ... | ... | ... | ... | ... | ... | ... | .. | ... | ... |



@@ -483,6 +513,7 @@ This UDF translates a given text from one language to another.
SELECT TE_TRANSLATION_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data,
@@ -495,7 +526,11 @@ SELECT TE_TRANSLATION_UDF(
- Parameters:
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
-- ```bucketfs_conn```: The BucketFS connection name.
+- ```bucketfs_conn```: The BucketFS connection name.
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -511,10 +546,10 @@ combined with the inputs used when calling this UDF. In case of any error during
model loading or prediction, these new columns are set to `null`, and
column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|-----------------|-----------------|------------| ---------------- |---------------|
-| conn_name | dir/ | model_name | context | English | German | 100 | kontext | None |
-| ... | ... | ... | ... | ... | ... | ... | ... | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|-----------------|-----------------|------------| ---------------- |---------------|
+| conn_name | token_conn_name | dir/ | model_name | context | English | German | 100 | kontext | None |
+| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


### Zero-Shot Text Classification UDF
@@ -526,6 +561,7 @@ string, and generate probability scores prediction for each label.
SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
device_id,
bucketfs_conn,
+token_conn,
sub_dir,
model_name,
text_data,
@@ -537,6 +573,10 @@ SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
- ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise,
you can provide NULL for this parameter.
- ```bucketfs_conn```: The BucketFS connection name.
+- ```token_conn```: The connection name containing the token required for
+private models. You can use NULL for public models. For details
+on how to create a connection object with token information, please check
+[here](#getting-started).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
@@ -549,8 +589,8 @@ columns, combined with the inputs used when calling this UDF. In case of any
error during model loading or prediction, these new columns are set to `null`,
and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:

-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|------------------|--------|-------|------|---------------|
-| conn_name | dir/ | model_name | text | label1,label2.. | label1 | 0.75 | 1 | None |
-| conn_name | dir/ | model_name | text | label1,label2.. | label2 | 0.70 | 2 | None |
-| ... | ... | ... | ... | ... | ... | ... | .. | ... |
+| BUCKETFS_CONN | TOKEN_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|------------------|--------|-------|------|---------------|
+| conn_name | token_conn_name | dir/ | model_name | text | label1,label2.. | label1 | 0.75 | 1 | None |
+| conn_name | token_conn_name | dir/ | model_name | text | label1,label2.. | label2 | 0.70 | 2 | None |
+| ... | ... | ... | ... | ... | ... | ... | ... | .. | ... |