#126: Add documentation for token for private models in prediction UDFs

exasol · Sep 8, 2023 · commit f4c40bb · 1 parent 8b57c76
Showing 1 changed file with 74 additions and 34 deletions.
diff --git a/doc/user_guide/user_guide.md b/doc/user_guide/user_guide.md
@@ -52,7 +52,7 @@ The extension provides two types of UDFs:
   ```buildoutcfg
     http[s]://<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKET_NAME>/<PATH_IN_BUCKET>;<BUCKETFS_NAME>
   ```
-  - A valid token is required to download private models from the Huggingface hub. 
+  - A valid token is required to download private models from the Huggingface hub and run prediction on them. 
   To avoid exposing such sensitive information, you can use Exasol Connection 
   objects. As seen in the example below, a token can be specified in the 
   password part of the Exasol connection object:
@@ -128,7 +128,7 @@ language container file released in GitHub Releases section.
 (see [the latest release](https://github.com/exasol/transformers-extension/releases/latest)).
 - Before installing the language container, these parts must be combined using the following command::
 ```shell
-cat language_container_part_* > language_container.tar.gz
+ls -v language_container_part_* | xargs cat > language_container.tar.gz
 ```
 
 ##### Install Language Container
@@ -265,6 +265,7 @@ classes of the specified  model. An example usage is given below:
 SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data
@@ -273,7 +274,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -285,10 +290,10 @@ this UDF. In case of any error during model loading or prediction, these new
 columns are set to `null` and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL   | SCORE | ERROR_MESSAGE  |
-| ------------- | ------- | ---------- | --------- |---------| ----- |----------------|
-| conn_name     | dir/    | model_name | text      | label_1 | 0.75  | None           |          
-| ...           | ...     | ...        | ...       | ...     | ...   | ...            |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL   | SCORE | ERROR_MESSAGE  |
+| ------------- |-----------------|---------|------------| --------- |---------| ----- |----------------|
+| conn_name     | token_conn_name | dir/    | model_name | text      | label_1 | 0.75  | None           |          
+| ...           | ...             | ...     | ...        | ...       | ...     | ...   | ...            |
 
 
 ### Sequence Classification for Text Pair UDF
@@ -298,6 +303,7 @@ determine if two sequences are paraphrases of each other. An example usage is gi
 SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     first_text,
@@ -307,7 +313,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -329,6 +339,7 @@ An example usage is given below:
 SELECT TE_QUESTION_ANSWERING_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     question,
@@ -340,6 +351,10 @@ SELECT TE_QUESTION_ANSWERING_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -354,11 +369,11 @@ If `top_k` > 1, each input row is repeated for each answer. In case of any error
 during model loading or prediction, these new columns are set to `null` and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | QUESTION   | CONTEXT   | TOP_K | ANSWER   | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |------------|-----------| ----- |----------| ----- |------| ------------- |
-| conn_name     | dir/    | model_name | question_1 | context_1 | 2     | answer_1 | 0.75  | 1    | None          |
-| conn_name     | dir/    | model_name | question_2 | context_1 | 2     | answer_2 | 0.70  | 2    | None          |
-| ...           | ...     | ...        | ...        | ...       | ...   | ...      | ...   | ..   | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | QUESTION   | CONTEXT   | TOP_K | ANSWER   | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|------------|-----------| ----- |----------| ----- |------| ------------- |
+| conn_name     | token_conn_name | dir/    | model_name | question_1 | context_1 | 2     | answer_1 | 0.75  | 1    | None          |
+| conn_name     | token_conn_name | dir/    | model_name | question_2 | context_1 | 2     | answer_2 | 0.70  | 2    | None          |
+| ...           | ...             | ...     | ...        | ...        | ...       | ...   | ...      | ...   | ..   | ...           |
 
 
 ### Masked Language Modelling UDF
@@ -369,6 +384,7 @@ this UDF is ```<mask>```. An example usage is given below:
 SELECT TE_FILLING_MASK_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -379,7 +395,11 @@ SELECT TE_FILLING_MASK_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -393,11 +413,11 @@ If `top_k` > 1, each input row is repeated for each prediction. In case of any
 error during model loading or prediction, these new columns are set to `null` 
 and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA     | TOP_K | FILLED_TEXT   | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |---------------| ----- |---------------| ----- |------|---------------|
-| conn_name     | dir/    | model_name | text `<mask>` | 2     | text filled_1 | 0.75  |   1  | None          |
-| conn_name     | dir/    | model_name | text `<mask>` | 2     | text filled_2 | 0.70  |   2  | None          |
-| ...           | ...     | ...        | ...           | ...   | ...           | ...   |  ... | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA     | TOP_K | FILLED_TEXT   | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|---------------| ----- |---------------| ----- |------|---------------|
+| conn_name     | token_conn_name | dir/    | model_name | text `<mask>` | 2     | text filled_1 | 0.75  |   1  | None          |
+| conn_name     | token_conn_name | dir/    | model_name | text `<mask>` | 2     | text filled_2 | 0.70  |   2  | None          |
+| ...           | ...             | ...     | ...        | ...           | ...   | ...           | ...   |  ... | ...           |
 
 
 ### Text Generation UDF
@@ -409,6 +429,7 @@ An example usage is given below:
 SELECT TE_TEXT_GENERATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -419,7 +440,11 @@ SELECT TE_TEXT_GENERATION_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```bucketfs_conn```: The BucketFS connection name.
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -443,6 +468,7 @@ There are two popular subtasks of token classification:
 SELECT TE_TOKEN_CLASSIFICATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -453,6 +479,10 @@ SELECT TE_TOKEN_CLASSIFICATION_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -469,10 +499,10 @@ In case of any error during model loading or prediction, these new
 columns are set to `null`, and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
-| conn_name     | dir/    | model_name | text      | simple               | 0         | 4       | text | noun   | 0.75  | None          |
-| ...           | ...     | ...        | ...       | ...                  | ...       | ...     | ...  | ..     | ...   | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
+| conn_name     | token_conn_name | dir/    | model_name | text      | simple               | 0         | 4       | text | noun   | 0.75  | None          |
+| ...           | ...             | ...     | ...        | ...       | ...                  | ...       | ...     | ...  | ..     | ...   | ...           |
 
 
 
@@ -483,6 +513,7 @@ This UDF translates a given text from one language to another.
 SELECT TE_TRANSLATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -495,7 +526,11 @@ SELECT TE_TRANSLATION_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```bucketfs_conn```: The BucketFS connection name.
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -511,10 +546,10 @@ combined with the inputs used when calling this UDF. In case of any error during
 model loading or prediction, these new columns are set to `null`, and 
 column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|-----------------|-----------------|------------| ---------------- |---------------|
-| conn_name     | dir/    | model_name | context   | English         | German          | 100        | kontext          | None          |
-| ...           | ...     | ...        | ...       | ...             | ...             | ...        | ...              | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|-----------------|-----------------|------------| ---------------- |---------------|
+| conn_name     | token_conn_name | dir/    | model_name | context   | English         | German          | 100        | kontext          | None          |
+| ...           | ...             | ...     | ...        | ...       | ...             | ...             | ...        | ...              | ...           |
 
 
 ### Zero-Shot Text Classification UDF
@@ -526,6 +561,7 @@ string, and generate probability scores prediction for each label.
 SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -537,6 +573,10 @@ SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -549,8 +589,8 @@ columns, combined with the inputs used when calling this UDF. In case of any
 error during model loading or prediction, these new  columns are set to `null`, 
 and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL  | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|------------------|--------|-------|------|---------------|
-| conn_name     | dir/    | model_name | text      | label1,label2..  | label1 | 0.75  | 1    | None          |
-| conn_name     | dir/    | model_name | text      | label1,label2..  | label2 | 0.70  | 2    | None          |
-| ...           | ...     | ...        | ...       | ...              | ...    | ...   | ..   | ...           |  
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL  | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|------------------|--------|-------|------|---------------|
+| conn_name     | token_conn_name | dir/    | model_name | text      | label1,label2..  | label1 | 0.75  | 1    | None          |
+| conn_name     | token_conn_name | dir/    | model_name | text      | label1,label2..  | label2 | 0.70  | 2    | None          |
+| ...           | ...             | ...     | ...        | ...       | ...              | ...    | ...   | ..   | ...           |
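
A note on the hunk at `@@ -128,7 +128,7 @@`: the switch from a plain glob to `ls -v ... | xargs cat` presumably guards against the parts being concatenated in lexicographic rather than numeric order. The sketch below illustrates the difference under two assumptions the commit does not state: the parts carry unpadded numeric suffixes (the file names and contents here are hypothetical dummies), and `ls` is the GNU coreutils version, where `-v` sorts numbers within names naturally.

```shell
#!/bin/sh
# Sketch only: dummy parts with hypothetical unpadded numeric suffixes.
# Assumes GNU coreutils `ls`, where -v sorts embedded numbers naturally.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create 10 dummy parts; each file contains only its own index.
i=1
while [ "$i" -le 10 ]; do
    printf '%s\n' "$i" > "language_container_part_$i"
    i=$((i + 1))
done

# Old command: the glob expands in lexicographic order,
# so part_10 is concatenated before part_2.
cat language_container_part_* > glob_order.txt

# New command: ls -v lists part_2 before part_10.
ls -v language_container_part_* | xargs cat > natural_order.txt

# Join the recorded indices on one line to make the order visible.
glob_result=$(paste -s -d ' ' glob_order.txt)
natural_result=$(paste -s -d ' ' natural_order.txt)
echo "glob:    $glob_result"
echo "natural: $natural_result"
```

If the real parts use zero-padded or alphabetic suffixes, both commands should produce the same order, so the new command is harmless in that case as well.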