
Documentation improvements #199

Merged
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,6 +3,9 @@
An Exasol extension to use state-of-the-art pretrained machine learning models
via the [transformers api](https://github.com/huggingface/transformers).

This extension is built and tested for Linux, and does not have Windows or macOS support.
It might work regardless, but proceed at your own risk.


## Table of Contents

1 change: 0 additions & 1 deletion buildspec.yml
@@ -22,6 +22,5 @@ phases:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s unit_tests
- poetry run nox -s start_database
- poetry run nox -s integration_tests
4 changes: 4 additions & 0 deletions doc/changes/changes_1.0.0.md
@@ -11,6 +11,10 @@ T.B.D

- #146: Integrated new download and load functions using save_pretrained

### Documentation

- #133: Improved user and developer documentation with additional information

### Refactorings


26 changes: 26 additions & 0 deletions doc/developer_guide/developer_guide.md
@@ -27,6 +27,12 @@ and install it as follows:
pip install <path/wheel-filename.whl> --extra-index-url https://download.pytorch.org/whl/cpu
```

### Check wheel installation

The wheel is placed in `transformers-extension/dist`. After updating and building a new release,
multiple wheels may accumulate there. This leads to problems, so check for and delete old wheels if necessary.
You may also need to check
`transformers-extension/language_container/exasol_transformers_extension_container/flavor_base/release/dist` for the same reason.
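The manual check described above can also be sketched as a small script. This is a hypothetical helper, not part of the extension; it simply lists every wheel in a `dist` directory except the newest one, so you know which files to delete:

```python
from pathlib import Path


def stale_wheels(dist_dir: str) -> list:
    """Return all wheel files in dist_dir except the newest one
    (newest by file modification time).

    Illustrative helper only -- it mirrors the manual cleanup
    step described above.
    """
    wheels = sorted(Path(dist_dir).glob("*.whl"),
                    key=lambda p: p.stat().st_mtime)
    return [str(p) for p in wheels[:-1]]  # everything but the newest


if __name__ == "__main__":
    # Hypothetical path; adjust to your checkout.
    for old in stale_wheels("transformers-extension/dist"):
        print("stale wheel:", old)
```

Run it over both `dist` directories mentioned above before rebuilding a release.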

### Run Tests
All unit and integration tests can be run within the Poetry environment created
@@ -42,6 +48,8 @@ Start a test database and run integration tests:
poetry run nox -s integration_tests
```

You can find more information regarding the tests in the [Tests](#tests) section below.

## Add Transformer Tasks
In the transformers-extension library, the 8 most popular NLP tasks provided by
[Transformers API](https://huggingface.co/docs/transformers/index) have already
@@ -126,3 +134,21 @@ Currently, the CodeBuild project is managed manually and is triggered with a webhook.
For this, our aws-ci user is added to this repository. The webhook can be configured in the AWS CodeBuild
project directly.
The CodeBuild project also uses our DockerHub user for the build. For this, it has access to the AWS SecretsManager.


#### 3. Release download test

After you do a release of the project, you may want to trigger the [SLC Download Test](https://github.com/exasol/transformers-extension/blob/8f57d1f0ca3f95a2d3edc9b84e8dd779aa6093d8/tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py#L117)
to make sure the new SLC is uploaded and correctly named.
**This is especially important if the naming convention of the SLC was changed!**
* testfile: [tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py](../../tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py)
* test_name: test_language_container_deployer_cli_by_downloading_container

Also, during a release, the version used in this test should be updated from time to time,
so we actually use the correct SLC for the test. Do this
[here](https://github.com/exasol/transformers-extension/blob/8f57d1f0ca3f95a2d3edc9b84e8dd779aa6093d8/tests/integration_tests/with_db/deployment/test_language_container_deployer_cli.py#L128).

## Good to know

* Hugging Face models consist of two parts: the model and the tokenizer.
  Most of our functions deal with both parts.
165 changes: 122 additions & 43 deletions doc/user_guide/user_guide.md
@@ -20,6 +20,7 @@ model and perform prediction. These are the supported tasks:

## Table of Contents

- [Introduction](#introduction)
- [Getting Started](#getting-started)
- [Setup](#setup)
- [Model Downloader UDF](#model-downloader-udf)
@@ -33,7 +34,21 @@ model and perform prediction. These are the supported tasks:
7. [Text Translation UDF](#text-translation-udf)
8. [Zero-Shot Text Classification](#zero-shot-text-classification-udf)

## Introduction

This Exasol extension provides UDFs for interacting with Hugging Face's Transformers API in order to use
pre-trained models on an Exasol cluster.

User-defined functions (UDFs for short) are scripts in various programming languages that can be
executed in the Exasol database. They give users more flexibility in data processing.
In this extension we provide multiple UDFs for you to use on your Exasol database.
You can find more detailed documentation on UDFs
[here](https://docs.exasol.com/db/latest/database_concepts/udf_scripts.htm).

UDFs and the necessary [Script Language Container](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/adding_new_packages_script_languages.htm)
are stored in Exasol's file system BucketFS, and we also use it to store the Hugging Face
models on the Exasol cluster. More information on BucketFS can be found
[here](https://docs.exasol.com/db/latest/database_concepts/bucketfs/bucketfs.htm).

## Getting Started
- Exasol DB
@@ -47,9 +62,12 @@ model and perform prediction. These are the supported tasks:
CREATE OR REPLACE CONNECTION <BUCKETFS_CONNECTION_NAME>
TO '<BUCKETFS_ADDRESS>'
USER '<BUCKETFS_USER>'
IDENTIFIED BY '<BUCKETFS_PASS>'
IDENTIFIED BY '<BUCKETFS_PASSWORD>'
```

- The `BUCKETFS_ADDRESS` looks like the following:

**Note:** The `<PATH_IN_BUCKET>` cannot be empty.
```buildoutcfg
http[s]://<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKET_NAME>/<PATH_IN_BUCKET>;<BUCKETFS_NAME>
```
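The address format above can be assembled programmatically. The following sketch is a hypothetical helper (not part of the extension) that builds a `BUCKETFS_ADDRESS` from its components and enforces the rule that `<PATH_IN_BUCKET>` must not be empty:

```python
def build_bucketfs_address(host, port, bucket, path_in_bucket,
                           bucketfs_name, use_https=False):
    """Assemble a BucketFS address of the form
    http[s]://HOST:PORT/BUCKET/PATH_IN_BUCKET;BUCKETFS_NAME.

    Illustrative helper only; it also enforces the note above
    that PATH_IN_BUCKET cannot be empty.
    """
    if not path_in_bucket:
        raise ValueError("PATH_IN_BUCKET cannot be empty")
    scheme = "https" if use_https else "http"
    return (f"{scheme}://{host}:{port}/{bucket}/"
            f"{path_in_bucket.strip('/')};{bucketfs_name}")


# Example with made-up values:
# build_bucketfs_address("localhost", 2580, "default", "my_path", "bfsdefault")
# -> "http://localhost:2580/default/my_path;bfsdefault"
```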
@@ -65,31 +83,58 @@ model and perform prediction. These are the supported tasks:
- For more information please check the [Create Connection in Exasol](https://docs.exasol.com/sql/create_connection.htm?Highlight=connection) document.

## Setup
### The Python Package
#### Download The Python Wheel Package
- The latest version of the python package of this extension can be
downloaded from the [GitHUb Release](https://github.com/exasol/transformers-extension/releases/latest).
### Install the Python Package

There are multiple ways to install the Python package: you can install it with pip,
download the wheel from GitHub, or build the project yourself.
Additionally, you will need a Script Language Container; find the how-to below.

#### Pip

The Transformers Extension is published on [PyPI](https://pypi.org/project/exasol-transformers-extension/).

You can install it with:

```shell
pip install exasol-transformers-extension
```


#### Download and Install the Python Wheel Package

You can also get the wheel from a GitHub release.
- The latest version of the Python package of this extension can be
downloaded from the [GitHub Release](https://github.com/exasol/transformers-extension/releases/latest).
Please download the following built archive:
```buildoutcfg
exasol_transformers_extension-<version-number>-py3-none-any.whl
```
If you need to use a version < 0.5.0, the build archive is called `transformers_extension.whl`.


#### Install The Python Wheel Package
Install the packaged transformers-extension project as follows:
Then install the packaged transformers-extension project as follows:
```shell
pip install <path/wheel-filename.whl> --extra-index-url https://download.pytorch.org/whl/cpu
pip install <path/wheel-filename.whl>
```

#### Build the project yourself

In order to build the Transformers Extension yourself, you need to have the [Poetry](https://python-poetry.org/)
(>= 1.1.11) package manager installed. Clone the GitHub repository, then install and build
`transformers-extension` as follows:
```bash
poetry install
poetry build
```

### The Pre-built Language Container

This extension requires the installation of the language container for this
extension to run. It can be installed in two ways: Quick and Customized
installations
This extension requires a Script Language Container to be installed in the Exasol database in order to run.
The Script Language Container provides the required programming language and
necessary dependencies in the Exasol database so that the UDF scripts can be executed.
It can be installed in two ways: quick and customized installation:

#### Quick Installation
The language container is downloaded and installed by executing the
The Language Container is downloaded and installed by executing the
deployment script below with the desired version. Make sure the version matches your installed version of the
Transformers Extension Package. See [the latest release](https://github.com/exasol/transformers-extension/releases) on GitHub.

@@ -111,10 +156,17 @@ Transformers Extension Package. See [the latest release](https://github.com/exas
--ssl-cert-path <ssl-cert-path> \
--use-ssl-cert-validation
```

**Note:** The `PATH_IN_BUCKET` cannot be empty.

The `--ssl-cert-path` option is needed if your certificate is not in the OS truststore.
This certificate is basically a list of trusted CAs. It is needed for the validation of the server's
certificate by the client.
The option `--use-ssl-cert-validation` is the default; you can disable it with `--no-use-ssl-cert-validation`.
Use caution if you want to turn certificate validation off, as it potentially lowers the security of your
database connection.
This is not to be confused with the client's own certificate, which may or may not include the private key.
In the latter case the key may be provided as a separate file.

By default, the above command will upload and activate the language container at the System level.
The latter requires you to have the System Privileges, as it will attempt to change DB system settings.
@@ -173,25 +225,32 @@ There are two ways to install the language container: (1) using a python script
--language-alias <LANGUAGE_ALIAS> \
--container-file <path/to/language_container_name.tar.gz>
```
Please note, that all considerations described in the Quick Installation
Please note: all considerations described in the Quick Installation
section are still applicable.

**Note:** The `--path-in-bucket` cannot be empty.


2. *Manual Installation*

In the manual installation, the pre-built container should be firstly
uploaded into BucketFS. In order to do that, you can use
In the manual installation, the pre-built container should be
uploaded into BucketFS first. In order to do that, you can use
either a [http(s) client](https://docs.exasol.com/database_concepts/bucketfs/file_access.htm)
or the [bucketfs-client](https://github.com/exasol/bucketfs-client).
The following command uploads a given container into BucketFS through curl
command, an http(s) client:
The following command uploads a given container into BucketFS through
the http(s) client curl:

```shell
curl -vX PUT -T \
"<CONTAINER_FILE>"
"http://w:<BUCKETFS_WRITE_PASSWORD>@<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKETFS_NAME>/<PATH_IN_BUCKET><CONTAINER_FILE>"
```

Please note that specifying the password on command line will make your shell record the password in the history. To avoid leaking your password please consider to set an environment variable. The following examples sets environment variable `BUCKETFS_WRITE_PASSWORD`:
Please note that specifying the password on the command line will make your shell
record the password in the history. To avoid leaking your password, please
consider setting an environment variable. The following example sets the
environment variable `BUCKETFS_WRITE_PASSWORD`:

```shell
read -sp "password: " BUCKETFS_WRITE_PASSWORD
```
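If you prefer Python over curl, the same upload can be sketched as follows. This is an illustrative helper, not part of the extension: the URL layout mirrors the curl example above, `requests` is a third-party package (`pip install requests`) assumed here, and `getpass` keeps the password out of your shell history:

```python
import getpass


def bucketfs_put_url(host, port, bucketfs_name, path_in_bucket,
                     container_file):
    """Build the PUT URL used by the curl example above.

    Illustrative helper; host/bucket names are placeholders.
    """
    return (f"http://{host}:{port}/{bucketfs_name}/"
            f"{path_in_bucket.strip('/')}/{container_file}")


if __name__ == "__main__":
    import requests  # third-party: pip install requests

    # Prompt instead of putting the password on the command line.
    password = getpass.getpass("BucketFS write password: ")
    url = bucketfs_put_url("localhost", 2580, "bfsdefault", "my_path",
                           "language_container.tar.gz")
    with open("language_container.tar.gz", "rb") as f:
        response = requests.put(url, data=f, auth=("w", password))
    response.raise_for_status()
```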
@@ -203,36 +262,53 @@ There are two ways to install the language container: (1) using a python script

```sql
ALTER SESSION SET SCRIPT_LANGUAGES=\
<ALIAS>=localzmq+protobuf:///<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/?\
lang=<LANGUAGE>#buckets/<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/\
PYTHON3_TE=localzmq+protobuf:///<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/?\
lang=_python_#buckets/<BUCKETFS_NAME>/<BUCKET_NAME>/<PATH_IN_BUCKET><CONTAINER_NAME>/\
exaudf/exaudfclient_py3
```

In project transformer-extensions replace `<ALIAS>` by `_PYTHON3_TE_` and `<LANGUAGE>` by `_python_`.
For more details please check [Adding New Packages to Existing Script Languages](https://docs.exasol.com/database_concepts/udf_scripts/adding_new_packages_script_languages.htm).


### Deployment
- Deploy all necessary scripts installed in the previous step to the specified
`SCHEMA` in Exasol DB with the same `LANGUAGE_ALIAS` using the following python cli command:

Next you need to deploy all necessary scripts installed in the previous step to the specified
`SCHEMA` in your Exasol DB with the same `LANGUAGE_ALIAS` using the following Python CLI command:
```buildoutcfg
python -m exasol_transformers_extension.deploy scripts
--dsn <DB_HOST:DB_PORT> \
--db-user <DB_USER> \
--db-pass <DB_PASSWORD> \
--schema <SCHEMA> \
--language-alias <LANGUAGE_ALIAS>
--language-alias PYTHON3_TE
```
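If you script your deployments, the CLI invocation above can be built programmatically. The helper below is hypothetical (not part of the extension); it only assembles the argv list with the option names shown above, which you can then pass to `subprocess.run`:

```python
import sys


def build_deploy_command(dsn, db_user, db_password, schema,
                         language_alias="PYTHON3_TE"):
    """Return the argv list for the script-deployment CLI shown above.

    Hypothetical convenience wrapper; the option names mirror the
    command in the documentation.
    """
    return [
        sys.executable, "-m", "exasol_transformers_extension.deploy",
        "scripts",
        "--dsn", dsn,
        "--db-user", db_user,
        "--db-pass", db_password,
        "--schema", schema,
        "--language-alias", language_alias,
    ]


# Usage sketch (requires the package to be installed and a reachable DB):
# import subprocess
# subprocess.run(build_deploy_command("localhost:8563", "sys", "pw", "TE"),
#                check=True)
```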

## Store Models in BucketFS
Before you can use pre-trained models, the models must be stored in the
BucketFS. We provide two different ways to load transformers models
into BucketFS:
into the BucketFS. You may either use the Model Downloader UDF to download a Hugging Face
transformers model directly from within the Exasol database, or you can download the model to your local
file system and upload it to the database using the Model Uploader Script.
The Model Downloader UDF is the simpler option, but if you do not want to connect your Exasol database
directly to the internet, the Model Uploader Script is the option for you.

Note that the extension currently only supports the `PyTorch` framework.
Please make sure that the selected models are in the `PyTorch` model library section.

### 1. Model Downloader UDF
Using the `TE_MODEL_DOWNLOADER_UDF` below, you can download the desired model
from the Hugging Face Hub and upload it to BucketFS.
This requires the Exasol database to have internet access, since the UDF
downloads the model from Hugging Face into the database without saving it anywhere else in between.
If you are using the Exasol DockerDB or an Exasol version 8 setup via
[c4](https://docs.exasol.com/db/latest/administration/on-premise/admin_interface/c4.htm),
internet access is not available by default, and you need to specify a name server.
For example, setting `nameserver = 8.8.8.8` makes the database use Google's DNS.
You will need to use [ConfD](https://docs.exasol.com/db/latest/confd/confd.htm) to do this;
you can use the [general_settings](https://docs.exasol.com/db/latest/confd/jobs/general_settings.htm) command.
If you are using the [Integration Test Docker Environment](https://github.com/exasol/integration-test-docker-environment),
you can just set the name server parameter like this: `--nameserver 8.8.8.8`.

Once you have internet access, invoke the UDF like this:

```sql
SELECT TE_MODEL_DOWNLOADER_UDF(
@@ -244,18 +320,15 @@ SELECT TE_MODEL_DOWNLOADER_UDF(
```
- Parameters:
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models in [huggingface models page](https://huggingface.co/models).
details of the models on the [huggingface models page](https://huggingface.co/models).
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```bucketfs_conn```: The BucketFS connection name.
- ```token_conn```: The connection name containing the token required for
private models. You can use empty string ('') for public models. For details
on how to create a connection object with token information, please check
[here](#getting-started).

Note that the extension currently only supports the `PyTorch` framework.
Please make sure that the selected models are in the `Pytorch` model library section.




### 2. Model Uploader Script
You can invoke the Python script as shown below, which loads the transformer
models from the local filesystem into BucketFS:
@@ -274,24 +347,30 @@ models from the local filesystem into BucketFS:
--local-model-path <MODEL_PATH>
```

*Note*: The options --local-model-path needs to point to a path which contains the model and its tokenizer.
**Note:** The `--path-in-bucket` cannot be empty.
**Note:** The option `--local-model-path` needs to point to a path which contains the model and its tokenizer.
These should have been saved using transformers [save_pretrained](https://huggingface.co/docs/transformers/v4.32.1/en/installation#fetch-models-and-tokenizers-to-use-offline)
function to ensure proper loading by the Transformers Extension UDFs.
You can download the model using python lke this:
You can download the model using Python like this:

```python
for model_factory in [transformers.AutoModel, transformers.AutoTokenizer]:
# download the model an tokenizer from huggingface
model = model_factory.from_pretrained(model_name, cache_dir=<your cache path> / <huggingface model name>)
# save the downloaded model using the save_pretrained fuction
model.save_pretrained(<save_path> / "pretrained" / <model_name>)
# download the model and tokenizer from Hugging Face
model = model_factory.from_pretrained(model_name)
# save the downloaded model using the save_pretrained function
model_save_path = <save_path> / "pretrained" / <model_name>
model.save_pretrained(model_save_path)
```
And then upload it using exasol_transformers_extension.upload_model script where ```local-model-path = <save_path> / "pretrained" / <model_name>```
***Note:*** Hugging Face models consist of two parts, the model and the tokenizer.
Make sure to download and save both into the same save directory so the upload model script uploads them together.
And then upload it using the `exasol_transformers_extension.upload_model` script, where `--local-model-path = <save_path> / "pretrained" / <model_name>`.
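The path convention above can be sketched in code. Both helpers below are hypothetical (not part of the extension): one builds the `<save_path> / "pretrained" / <model_name>` layout expected by the upload script, the other assembles a partial argv list for it (only `--local-model-path` is shown; the other required options from the example above are omitted):

```python
import sys
from pathlib import Path


def pretrained_model_path(save_path: str, model_name: str) -> Path:
    """Build the save-path layout described above:
    <save_path>/pretrained/<model_name>. Illustrative helper."""
    return Path(save_path) / "pretrained" / model_name


def build_upload_command(model_path: Path) -> list:
    """Partial argv sketch for the upload_model script; add the
    remaining options (dsn, user, schema, ...) from the example above."""
    return [sys.executable, "-m",
            "exasol_transformers_extension.upload_model",
            "--local-model-path", str(model_path)]


# Usage sketch with a made-up cache directory and model name:
# path = pretrained_model_path("/tmp/models", "bert-base-uncased")
# subprocess.run(build_upload_command(path) + other_options, check=True)
```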


## Prediction UDFs
We provided 7 prediction UDFs, each performing an NLP task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
These tasks cache the model downloaded to BucketFS and make an inference using the cached models with user-supplied inputs.
## Using Prediction UDFs
We provide 7 prediction UDFs in this Transformers Extension, each performing an NLP
task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
These tasks use the model downloaded to BucketFS and run inference using
the user-supplied inputs.

### Sequence Classification for Single Text UDF
This UDF classifies the given single text according to a given number of
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,7 +1,7 @@
[tool.poetry]
name = "exasol-transformers-extension"
version = "1.0.0"
description = "An Exasol extension to use state-of-the-art pretrained machine learning models via the transformers api."
description = "An Exasol extension for using state-of-the-art pretrained machine learning models via the Hugging Face Transformers API."

authors = [
"Umit Buyuksahin <[email protected]>",
@@ -125,7 +125,7 @@ def test_language_container_deployer_cli_by_downloading_container(
schema = test_name
language_alias = f"PYTHON3_TE_{test_name.upper()}"
container_path = None
version = "0.9.0"
version = "0.10.0"
create_schema(pyexasol_connection, schema)
dsn = f"{exasol_config.host}:{exasol_config.port}"
with revert_language_settings(pyexasol_connection):