
Documentation improvements #199

Merged
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,6 +3,9 @@
An Exasol extension to use state-of-the-art pretrained machine learning models
via the [transformers api](https://github.com/huggingface/transformers).

This extension is built and tested for Linux, and does not have Windows or macOS support.
It might work regardless, but proceed at your own risk.


## Table of Contents

1 change: 0 additions & 1 deletion buildspec.yml
@@ -22,6 +22,5 @@ phases:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s unit_tests
- poetry run nox -s start_database
- poetry run nox -s integration_tests
4 changes: 4 additions & 0 deletions doc/changes/changes_1.0.0.md
@@ -11,6 +11,10 @@ T.B.D

- #146: Integrated new download and load functions using save_pretrained

### Documentation

- #133: Improved user and developer documentation with additional information

### Refactorings


26 changes: 26 additions & 0 deletions doc/developer_guide/developer_guide.md
@@ -27,6 +27,12 @@ and install it as follows:
pip install <path/wheel-filename.whl> --extra-index-url https://download.pytorch.org/whl/cpu
```

### Check wheel installation

The wheel is placed in `transformers-extension/dist`. After updating and building a new release,
multiple wheels may accumulate there. This leads to problems, so check for and delete old wheels if necessary.
You may also need to check
`transformers-extension/language_container/exasol_transformers_extension_container/flavor_base/release/dist` for the same reason.
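The manual check described above can also be sketched as a small script. This is a hypothetical helper, not part of the extension; it simply lists every wheel in a `dist` directory except the newest one, so you know which files to delete:

```python
from pathlib import Path


def stale_wheels(dist_dir: str) -> list:
    """Return all wheel files in dist_dir except the newest one
    (newest by file modification time).

    Illustrative helper only -- it mirrors the manual cleanup
    step described above.
    """
    wheels = sorted(Path(dist_dir).glob("*.whl"),
                    key=lambda p: p.stat().st_mtime)
    return [str(p) for p in wheels[:-1]]  # everything but the newest


if __name__ == "__main__":
    # Hypothetical path; adjust to your checkout.
    for old in stale_wheels("transformers-extension/dist"):
        print("stale wheel:", old)
```

Run it over both `dist` directories mentioned above before rebuilding a release.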

### Run Tests
All unit and integration tests can be run within the Poetry environment created
@@ -42,6 +48,8 @@ Start a test database and run integration tests:
poetry run nox -s integration_tests
```

You can find more information regarding the tests in the [Tests](#tests) section below.

## Add Transformer Tasks
In the transformers-extension library, the 8 most popular NLP tasks provided by
[Transformers API](https://huggingface.co/docs/transformers/index) have already
@@ -126,3 +134,21 @@ Currently, the CodeBuild project is managed manually and is triggered with a webhook.
For this, our aws-ci user is added to this repository. The webhook can be configured in the AWS CodeBuild
project directly.
The CodeBuild project also uses our DockerHub user for the build. For this, it has access to the AWS SecretsManager.


#### 3. Release download test

After you do a release of the project, you may want to trigger the [SLC Download Test](https://github.com/exasol/transformers-extension/blob/8f57d1f0ca3f95a2d3edc9b84e8dd779aa6093d8/tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py#L117)
to make sure the new SLC is uploaded and correctly named.
**This is especially important if the naming convention of the SLC was changed!**
* testfile: [tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py](../../tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py)
* test_name: test_language_container_deployer_cli_by_downloading_container

Also, during a release, the version used in this test should be updated from time to time,
so we actually use the correct SLC for the test. Do this
[here](https://github.com/exasol/transformers-extension/blob/8f57d1f0ca3f95a2d3edc9b84e8dd779aa6093d8/tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py#L128).

## Good to know

* Hugging Face models consist of two parts: the model and the tokenizer.
  Most of our functions deal with both parts.
165 changes: 122 additions & 43 deletions doc/user_guide/user_guide.md
@@ -20,6 +20,7 @@ model and perform prediction. These are the supported tasks:

## Table of Contents

- [Introduction](#introduction)
- [Getting Started](#getting-started)
- [Setup](#setup)
- [Model Downloader UDF](#model-downloader-udf)
@@ -33,7 +34,21 @@ model and perform prediction. These are the supported tasks:
7. [Text Translation UDF](#text-translation-udf)
8. [Zero-Shot Text Classification](#zero-shot-text-classification-udf)

## Introduction

This Exasol extension provides UDFs for interacting with Hugging Face's Transformers API in order to use
pre-trained models on an Exasol cluster.

User-defined functions (UDFs for short) are scripts in various programming languages that can be
executed in the Exasol database. They give users more flexibility in data processing.
In this extension we provide multiple UDFs for you to use on your Exasol database.
You can find more detailed documentation on UDFs
[here](https://docs.exasol.com/db/latest/database_concepts/udf_scripts.htm).

UDFs and the necessary [Script Language Container](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/adding_new_packages_script_languages.htm)
are stored in Exasol's file system BucketFS, and we also use it to store the Hugging Face
models on the Exasol cluster. More information on BucketFS can be found
[here](https://docs.exasol.com/db/latest/database_concepts/bucketfs/bucketfs.htm).

## Getting Started
- Exasol DB
@@ -47,9 +62,12 @@ model and perform prediction. These are the supported tasks:
CREATE OR REPLACE CONNECTION <BUCKETFS_CONNECTION_NAME>
TO '<BUCKETFS_ADDRESS>'
USER '<BUCKETFS_USER>'
IDENTIFIED BY '<BUCKETFS_PASS>'
IDENTIFIED BY '<BUCKETFS_PASSWORD>'
```

- The `BUCKETFS_ADDRESS` looks like the following:

**Note:** The `<PATH_IN_BUCKET>` cannot be empty.
```buildoutcfg
http[s]://<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKET_NAME>/<PATH_IN_BUCKET>;<BUCKETFS_NAME>
```
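The address format above can be assembled programmatically. The following sketch is a hypothetical helper (not part of the extension) that builds a `BUCKETFS_ADDRESS` from its components and enforces the rule that `<PATH_IN_BUCKET>` must not be empty:

```python
def build_bucketfs_address(host, port, bucket, path_in_bucket,
                           bucketfs_name, use_https=False):
    """Assemble a BucketFS address of the form
    http[s]://HOST:PORT/BUCKET/PATH_IN_BUCKET;BUCKETFS_NAME.

    Illustrative helper only; it also enforces the note above
    that PATH_IN_BUCKET cannot be empty.
    """
    if not path_in_bucket:
        raise ValueError("PATH_IN_BUCKET cannot be empty")
    scheme = "https" if use_https else "http"
    return (f"{scheme}://{host}:{port}/{bucket}/"
            f"{path_in_bucket.strip('/')};{bucketfs_name}")


# Example with made-up values:
# build_bucketfs_address("localhost", 2580, "default", "my_path", "bfsdefault")
# -> "http://localhost:2580/default/my_path;bfsdefault"
```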
@@ -65,31 +83,58 @@ model and perform prediction. These are the supported tasks:
- For more information please check the [Create Connection in Exasol](https://docs.exasol.com/sql/create_connection.htm?Highlight=connection) document.

## Setup
### The Python Package
#### Download The Python Wheel Package
- The latest version of the python package of this extension can be
downloaded from the [GitHUb Release](https://github.com/exasol/transformers-extension/releases/latest).
### Install the Python Package

There are multiple ways to install the Python package: you can install it with pip,
download the wheel from GitHub, or build the project yourself.
Additionally, you will need a Script Language Container; find the how-to below.

#### Pip

The Transformers Extension is published on [PyPI](https://pypi.org/project/exasol-transformers-extension/).

You can install it with:

```shell
pip install exasol-transformers-extension
```


#### Download and Install the Python Wheel Package

You can also get the wheel from a GitHub release.
- The latest version of the Python package of this extension can be
downloaded from the [GitHub Release](https://github.com/exasol/transformers-extension/releases/latest).
Please download the following built archive:
```buildoutcfg
exasol_transformers_extension-<version-number>-py3-none-any.whl
```
If you need to use a version < 0.5.0, the build archive is called `transformers_extension.whl`.


#### Install The Python Wheel Package
Install the packaged transformers-extension project as follows:
Then install the packaged transformers-extension project as follows:
```shell
pip install <path/wheel-filename.whl> --extra-index-url https://download.pytorch.org/whl/cpu
pip install <path/wheel-filename.whl>
```

#### Build the project yourself

In order to build the Transformers Extension yourself, you need to have the [Poetry](https://python-poetry.org/)
(>= 1.1.11) package manager installed. Clone the GitHub repository, then install and build
`transformers-extension` as follows:
```bash
poetry install
poetry build
```

### The Pre-built Language Container

This extension requires the installation of the language container for this
extension to run. It can be installed in two ways: Quick and Customized
installations
This extension requires a Script Language Container to be installed in the Exasol database in order to run.
The Script Language Container provides the required programming language and
necessary dependencies in the Exasol database so that the UDF scripts can be executed.
It can be installed in two ways: quick and customized installation:

#### Quick Installation
The language container is downloaded and installed by executing the
The Language Container is downloaded and installed by executing the
deployment script below with the desired version. Make sure the version matches your installed version of the
Transformers Extension Package. See [the latest release](https://github.com/exasol/transformers-extension/releases) on GitHub.

@@ -111,10 +156,17 @@ Transformers Extension Package. See [the latest release](https://github.com/exas
--ssl-cert-path <ssl-cert-path> \
--use-ssl-cert-validation
```

**Note:** The `PATH_IN_BUCKET` cannot be empty.

The `--ssl-cert-path` option is needed if your certificate is not in the OS truststore.
This certificate is basically a list of trusted CAs. It is needed for the validation of the server's
certificate by the client.
The option `--use-ssl-cert-validation` is the default; you can disable it with `--no-use-ssl-cert-validation`.
Use caution if you want to turn certificate validation off, as it potentially lowers the security of your
database connection.
This is not to be confused with the client's own certificate, which may or may not include the private key.
In the latter case the key may be provided as a separate file.

By default, the above command will upload and activate the language container at the System level.
The latter requires you to have the System Privileges, as it will attempt to change DB system settings.
@@ -173,25 +225,32 @@ There are two ways to install the language container: (1) using a python script
--language-alias <LANGUAGE_ALIAS> \
--container-file <path/to/language_container_name.tar.gz>
```
Please note, that all considerations described in the Quick Installation
Please note: all considerations described in the Quick Installation
section are still applicable.

**Note:** The `--path-in-bucket` cannot be empty.


2. *Manual Installation*

In the manual installation, the pre-built container should be firstly
uploaded into BucketFS. In order to do that, you can use
In the manual installation, the pre-built container should be
uploaded into BucketFS first. In order to do that, you can use
either a [http(s) client](https://docs.exasol.com/database_concepts/bucketfs/file_access.htm)
or the [bucketfs-client](https://github.com/exasol/bucketfs-client).
The following command uploads a given container into BucketFS through curl
command, an http(s) client:
The following command uploads a given container into BucketFS through
the http(s) client curl:

```shell
curl -vX PUT -T \
"<CONTAINER_FILE>"
"http://w:<BUCKETFS_WRITE_PASSWORD>@<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKETFS_NAME>/<PATH_IN_BUCKET><CONTAINER_FILE>"
```

Please note that specifying the password on command line will make your shell record the password in the history. To avoid leaking your password please consider to set an environment variable. The following examples sets environment variable `BUCKETFS_WRITE_PASSWORD`:
Please note that specifying the password on the command line will make your shell
record the password in the history. To avoid leaking your password, please
consider setting an environment variable. The following example sets the
environment variable `BUCKETFS_WRITE_PASSWORD`:

```shell
read -sp "password: " BUCKETFS_WRITE_PASSWORD
```
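If you prefer Python over curl, the same upload can be sketched as follows. This is an illustrative helper, not part of the extension: the URL layout mirrors the curl example above, `requests` is a third-party package (`pip install requests`) assumed here, and `getpass` keeps the password out of your shell history:

```python
import getpass


def bucketfs_put_url(host, port, bucketfs_name, path_in_bucket,
                     container_file):
    """Build the PUT URL used by the curl example above.

    Illustrative helper; host/bucket names are placeholders.
    """
    return (f"http://{host}:{port}/{bucketfs_name}/"
            f"{path_in_bucket.strip('/')}/{container_file}")


if __name__ == "__main__":
    import requests  # third-party: pip install requests

    # Prompt instead of putting the password on the command line.
    password = getpass.getpass("BucketFS write password: ")
    url = bucketfs_put_url("localhost", 2580, "bfsdefault", "my_path",
                           "language_container.tar.gz")
    with open("language_container.tar.gz", "rb") as f:
        response = requests.put(url, data=f, auth=("w", password))
    response.raise_for_status()
```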
@@ -203,36 +262,53 @@ There are two ways to install the language container: (1) using a python script

```sql
ALTER SESSION SET SCRIPT_LANGUAGES=\
<ALIAS>=localzmq+protobuf:///<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/?\
lang=<LANGUAGE>#buckets/<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/\
PYTHON3_TE=localzmq+protobuf:///<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/?\
lang=_python_#buckets/<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/\
exaudf/exaudfclient_py3
```

In project transformer-extensions replace `<ALIAS>` by `_PYTHON3_TE_` and `<LANGUAGE>` by `_python_`.
For more details please check [Adding New Packages to Existing Script Languages](https://docs.exasol.com/database_concepts/udf_scripts/adding_new_packages_script_languages.htm).


### Deployment
- Deploy all necessary scripts installed in the previous step to the specified
`SCHEMA` in Exasol DB with the same `LANGUAGE_ALIAS` using the following python cli command:

Next you need to deploy all necessary scripts installed in the previous step to the specified
`SCHEMA` in your Exasol DB with the same `LANGUAGE_ALIAS` using the following Python CLI command:
```buildoutcfg
python -m exasol_transformers_extension.deploy scripts
--dsn <DB_HOST:DB_PORT> \
--db-user <DB_USER> \
--db-pass <DB_PASSWORD> \
--schema <SCHEMA> \
--language-alias <LANGUAGE_ALIAS>
--language-alias PYTHON3_TE
```
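If you script your deployments, the CLI invocation above can be built programmatically. The helper below is hypothetical (not part of the extension); it only assembles the argv list with the option names shown above, which you can then pass to `subprocess.run`:

```python
import sys


def build_deploy_command(dsn, db_user, db_password, schema,
                         language_alias="PYTHON3_TE"):
    """Return the argv list for the script-deployment CLI shown above.

    Hypothetical convenience wrapper; the option names mirror the
    command in the documentation.
    """
    return [
        sys.executable, "-m", "exasol_transformers_extension.deploy",
        "scripts",
        "--dsn", dsn,
        "--db-user", db_user,
        "--db-pass", db_password,
        "--schema", schema,
        "--language-alias", language_alias,
    ]


# Usage sketch (requires the package to be installed and a reachable DB):
# import subprocess
# subprocess.run(build_deploy_command("localhost:8563", "sys", "pw", "TE"),
#                check=True)
```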

## Store Models in BucketFS
Before you can use pre-trained models, the models must be stored in the
BucketFS. We provide two different ways to load transformers models
into BucketFS:
into the BucketFS. You may either use the Model Downloader UDF to download a Hugging Face
transformers model directly from within the Exasol database, or you can download the model to your local
file system and upload it to the database using the Model Uploader Script.
The Model Downloader UDF is the simpler option, but if you do not want to connect your Exasol database
directly to the internet, the Model Uploader Script is the option for you.

Note that the extension currently only supports the `PyTorch` framework.
Please make sure that the selected models are in the `PyTorch` model library section.

### 1. Model Downloader UDF
Using the `TE_MODEL_DOWNLOADER_UDF` below, you can download the desired model
from the Hugging Face Hub and upload it to BucketFS.
This requires the Exasol database to have internet access, since the UDF
downloads the model from Hugging Face into the database without saving it anywhere else in between.
If you are using the Exasol DockerDB or an Exasol version 8 setup via
[c4](https://docs.exasol.com/db/latest/administration/on-premise/admin_interface/c4.htm),
internet access is not available by default, and you need to specify a name server.
For example, setting `nameserver = 8.8.8.8` makes the database use Google's DNS.
You will need to use [ConfD](https://docs.exasol.com/db/latest/confd/confd.htm) to do this;
you can use the [general_settings](https://docs.exasol.com/db/latest/confd/jobs/general_settings.htm) command.
If you are using the [Integration Test Docker Environment](https://github.com/exasol/integration-test-docker-environment),
you can just set the name server parameter like this: `--nameserver 8.8.8.8`.

Once you have internet access, invoke the UDF like this:

```sql
SELECT TE_MODEL_DOWNLOADER_UDF(
@@ -244,18 +320,15 @@ SELECT TE_MODEL_DOWNLOADER_UDF(
```
- Parameters:
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
details of the models on the [huggingface models page](https://huggingface.co/models).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```bucketfs_conn```: The BucketFS connection name.
- ```token_conn```: The connection name containing the token required for
private models. You can use empty string ('') for public models. For details
on how to create a connection object with token information, please check
[here](#getting-started).

Note that the extension currently only supports the `PyTorch` framework.
Please make sure that the selected models are in the `Pytorch` model library section.




### 2. Model Uploader Script
You can invoke the Python script as shown below, which loads the transformer
models from the local filesystem into BucketFS:
@@ -274,24 +347,30 @@ models from the local filesystem into BucketFS:
--local-model-path <MODEL_PATH>
```

*Note*: The options --local-model-path needs to point to a path which contains the model and its tokenizer.
**Note:** The `--path-in-bucket` cannot be empty.
**Note:** The option `--local-model-path` needs to point to a path which contains the model and its tokenizer.
These should have been saved using transformers [save_pretrained](https://huggingface.co/docs/transformers/v4.32.1/en/installation#fetch-models-and-tokenizers-to-use-offline)
function to ensure proper loading by the Transformers Extension UDFs.
You can download the model using python lke this:
You can download the model using Python like this:

```python
for model_factory in [transformers.AutoModel, transformers.AutoTokenizer]:
# download the model an tokenizer from huggingface
model = model_factory.from_pretrained(model_name, cache_dir=<your cache path> / <huggingface model name>)
# save the downloaded model using the save_pretrained fuction
model.save_pretrained(<save_path> / "pretrained" / <model_name>)
# download the model and tokenizer from Hugging Face
model = model_factory.from_pretrained(model_name)
# save the downloaded model using the save_pretrained function
model_save_path = <save_path> / "pretrained" / <model_name>
model.save_pretrained(model_save_path)
```
And then upload it using exasol_transformers_extension.upload_model script where ```local-model-path = <save_path> / "pretrained" / <model_name>```
***Note:*** Hugging Face models consist of two parts, the model and the tokenizer.
Make sure to download and save both into the same save directory so the upload model script uploads them together.
And then upload it using the `exasol_transformers_extension.upload_model` script, where `--local-model-path = <save_path> / "pretrained" / <model_name>`.
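The path convention above can be sketched in code. Both helpers below are hypothetical (not part of the extension): one builds the `<save_path> / "pretrained" / <model_name>` layout expected by the upload script, the other assembles a partial argv list for it (only `--local-model-path` is shown; the other required options from the example above are omitted):

```python
import sys
from pathlib import Path


def pretrained_model_path(save_path: str, model_name: str) -> Path:
    """Build the save-path layout described above:
    <save_path>/pretrained/<model_name>. Illustrative helper."""
    return Path(save_path) / "pretrained" / model_name


def build_upload_command(model_path: Path) -> list:
    """Partial argv sketch for the upload_model script; add the
    remaining options (dsn, user, schema, ...) from the example above."""
    return [sys.executable, "-m",
            "exasol_transformers_extension.upload_model",
            "--local-model-path", str(model_path)]


# Usage sketch with a made-up cache directory and model name:
# path = pretrained_model_path("/tmp/models", "bert-base-uncased")
# subprocess.run(build_upload_command(path) + other_options, check=True)
```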


## Prediction UDFs
We provided 7 prediction UDFs, each performing an NLP task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
These tasks cache the model downloaded to BucketFS and make an inference using the cached models with user-supplied inputs.
## Using Prediction UDFs
We provide 7 prediction UDFs in this Transformers Extension, each performing an NLP
task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
These tasks use the model downloaded to BucketFS and run inference using
the user-supplied inputs.

### Sequence Classification for Single Text UDF
This UDF classifies the given single text according to a given number of
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,7 +1,7 @@
[tool.poetry]
name = "exasol-transformers-extension"
version = "1.0.0"
description = "An Exasol extension to use state-of-the-art pretrained machine learning models via the transformers api."
description = "An Exasol extension for using state-of-the-art pretrained machine learning models via the Hugging Face Transformers API."

authors = [
"Umit Buyuksahin <[email protected]>",
@@ -125,7 +125,7 @@ def test_language_container_deployer_cli_by_downloading_container(
schema = test_name
language_alias = f"PYTHON3_TE_{test_name.upper()}"
container_path = None
version = "0.9.0"
version = "0.10.0"
create_schema(pyexasol_connection, schema)
dsn = f"{exasol_config.host}:{exasol_config.port}"
with revert_language_settings(pyexasol_connection):