Refactoring/#146 use new model save functionality #186

Merged
20 commits
07e6f38
switched use to huggingface transfer save pretrained version
MarleneKress79789 Jan 31, 2024
a6d7709
switched use to huggingface transfer save pretrained version
MarleneKress79789 Jan 31, 2024
a6690a7
chnanged to load local model
MarleneKress79789 Feb 5, 2024
1a8d4a5
Merge remote-tracking branch 'origin/refactoring/#146_use_new_model_s…
MarleneKress79789 Feb 5, 2024
3bf4121
[CodeBuild] change bucketfs upload
MarleneKress79789 Feb 6, 2024
3451f58
[CodeBuild] change local bucketfs upload
MarleneKress79789 Feb 8, 2024
5a52093
fix local bucketfs model upload and some cleanup
MarleneKress79789 Feb 9, 2024
93efc77
[CodeBuild] removed download sample model fixture because of duplication
MarleneKress79789 Feb 9, 2024
fb944bc
documentation
MarleneKress79789 Feb 9, 2024
2281e9a
removed todo
MarleneKress79789 Feb 9, 2024
9b36191
Apply suggestions from code review
MarleneKress79789 Feb 20, 2024
1b420a8
review findings
MarleneKress79789 Feb 20, 2024
d4cd004
Update doc/user_guide/user_guide.md
MarleneKress79789 Feb 20, 2024
2b1abe9
simplify upload_model_to_local_bucketfs in model_fixture.py [CodeBuild]
tkilias Feb 22, 2024
a9d9ad4
simplify implementation and improve naming in model_fixture.py [CodeB…
tkilias Feb 22, 2024
04ac3cb
Fix recursion issue in prepare_model_in_local_bucketfs [CodeBuild]
tkilias Mar 6, 2024
14a1939
Remove .replace("-", "_") from model names [CodeBuild]
tkilias Mar 6, 2024
9fb310e
[CodeBuild] Fix change files
MarleneKress79789 Mar 6, 2024
a0a0271
Merge branch 'dev_storage_format_change' into refactoring/#146_use_ne…
MarleneKress79789 Mar 6, 2024
c882cc2
[CodeBuild]
MarleneKress79789 Mar 6, 2024
1 change: 1 addition & 0 deletions doc/changes/changelog.md
@@ -1,5 +1,6 @@
# Changelog

* [1.0.0](changes_1.0.0.md)
* [0.10.0](changes_0.10.0.md)
* [0.9.2](changes_0.9.2.md)
* [0.9.1](changes_0.9.1.md)
1 change: 0 additions & 1 deletion doc/changes/changes_0.10.0.md
@@ -20,4 +20,3 @@ Deploying SLC under Windows, releasing to PyPi.

### Security


17 changes: 17 additions & 0 deletions doc/changes/changes_1.0.0.md
@@ -0,0 +1,17 @@
# Transformers Extension 1.0.0, T.B.D

Code name: T.B.D


## Summary
T.B.D


### Features

- #146: Integrated new download and load functions using save_pretrained

### Refactorings


### Security
16 changes: 15 additions & 1 deletion doc/user_guide/user_guide.md
@@ -229,9 +229,10 @@ Before you can use pre-trained models, the models must be stored in the
BucketFS. We provide two different ways to load transformers models
into BucketFS:


### 1. Model Downloader UDF
Using the `TE_MODEL_DOWNLOADER_UDF` below, you can download the desired model
from the huggingface hub and upload it to bucketfs.
from the huggingface hub and upload it to BucketFS.

```sql
SELECT TE_MODEL_DOWNLOADER_UDF(
@@ -274,6 +275,19 @@ models from the local filesystem into BucketFS:
```

*Note*: The option `--local-model-path` must point to a directory that contains both the model and its tokenizer.
Both should have been saved with the transformers [save_pretrained](https://huggingface.co/docs/transformers/v4.32.1/en/installation#fetch-models-and-tokenizers-to-use-offline)
function so that the Transformers Extension UDFs can load them properly.
You can download and save the model with Python like this:

```python
from pathlib import Path
import transformers

model_name = "<huggingface model name>"
save_path = Path("<save_path>") / "pretrained" / model_name
for model_factory in [transformers.AutoModel, transformers.AutoTokenizer]:
    # download the model and tokenizer from Hugging Face
    model = model_factory.from_pretrained(model_name, cache_dir=Path("<your cache path>") / model_name)
    # save the downloaded model using the save_pretrained function
    model.save_pretrained(save_path)
```
Then upload it using the `exasol_transformers_extension.upload_model` script, with `--local-model-path` set to `<save_path> / "pretrained" / <model_name>`.


## Prediction UDFs
We provided 7 prediction UDFs, each performing an NLP task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
54 changes: 29 additions & 25 deletions exasol_transformers_extension/udfs/models/base_model_udf.py
@@ -1,33 +1,44 @@
import os
from abc import abstractmethod, ABC
from typing import Iterator, List, Any
import torch
import traceback
import pandas as pd
import numpy as np
import transformers

from exasol_transformers_extension.deployment import constants
from exasol_transformers_extension.utils import device_management, \
bucketfs_operations, dataframe_operations
from exasol_transformers_extension.utils.load_model import LoadModel
from exasol_transformers_extension.utils.load_local_model import LoadLocalModel
from exasol_transformers_extension.utils.model_factory_protocol import ModelFactoryProtocol


class BaseModelUDF(ABC):
"""
This base class should be extended by each UDF class containing model logic.
This class contains common operations for all prediction UDFs. The following
methods should be implemented specifically for each UDF class:
This class contains common operations for all prediction UDFs:
- accesses data part-by-part based on predefined batch size
- manages the script cache
- manages the model cache
- reads the corresponding model from BucketFS into cache
- creates model pipeline through transformer api
- manages the creation of predictions and the preparation of results.

Additionally, the following
methods should be implemented specifically for each UDF class:
- create_dataframes_from_predictions
- extract_unique_param_based_dataframes
- execute_prediction
- append_predictions_to_input_dataframe

"""
def __init__(self,
exa,
batch_size,
pipeline,
base_model,
tokenizer,
task_name):
batch_size: int,
pipeline: transformers.Pipeline,
base_model: ModelFactoryProtocol,
tokenizer: ModelFactoryProtocol,
task_name: str):
self.exa = exa
self.batch_size = batch_size
self.pipeline = pipeline
@@ -59,11 +70,11 @@ def create_model_loader(self):
"""
Creates the model_loader.
"""
self.model_loader = LoadModel(self.pipeline,
self.base_model,
self.tokenizer,
self.task_name,
self.device)
self.model_loader = LoadLocalModel(pipeline_factory=self.pipeline,
base_model_factory=self.base_model,
tokenizer_factory=self.tokenizer,
task_name=self.task_name,
device=self.device)

def get_predictions_from_batch(self, batch_df: pd.DataFrame) -> pd.DataFrame:
"""
@@ -180,17 +191,11 @@ def check_cache(self, model_df: pd.DataFrame) -> None:
token_conn = model_df["token_conn"].iloc[0]

current_model_key = (bucketfs_conn, sub_dir, model_name, token_conn)
if self.model_loader.last_loaded_model_key != current_model_key:
if self.model_loader.loaded_model_key != current_model_key:
self.set_cache_dir(model_name, bucketfs_conn, sub_dir)
self.model_loader.clear_device_memory()
if token_conn:
token_conn_obj = self.exa.get_connection(token_conn)
else:
token_conn_obj = None
self.last_created_pipeline = self.model_loader.load_models(model_name,
current_model_key,
self.cache_dir,
token_conn_obj)
self.last_created_pipeline = self.model_loader.load_models(self.cache_dir,
current_model_key)

def set_cache_dir(
self, model_name: str, bucketfs_conn_name: str,
@@ -206,11 +211,10 @@ def set_cache_dir(
bucketfs_operations.create_bucketfs_location_from_conn_object(
self.exa.get_connection(bucketfs_conn_name))

model_path = bucketfs_operations.get_model_path(sub_dir, model_name)
model_path = bucketfs_operations.get_model_path_with_pretrained(sub_dir, model_name)
self.cache_dir = bucketfs_operations.get_local_bucketfs_path(
bucketfs_location=bucketfs_location, model_path=str(model_path))


def get_prediction(self, model_df: pd.DataFrame) -> pd.DataFrame:
"""
Perform prediction of the given model and preparation of the prediction
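The reload rule in `check_cache` can be illustrated in isolation. The sketch below is not the extension's code: `FakeLocalModelLoader` is a hypothetical stand-in for `LoadLocalModel`, and it assumes the loader records `loaded_model_key` itself inside `load_models`, which this diff does not show.

```python
from typing import Any, Optional, Tuple

# (bucketfs_conn, sub_dir, model_name, token_conn), as built in check_cache
ModelKey = Tuple[str, str, str, Optional[str]]


class FakeLocalModelLoader:
    """Hypothetical stand-in for LoadLocalModel; it only records load calls."""

    def __init__(self) -> None:
        self.loaded_model_key: Optional[ModelKey] = None
        self.load_count = 0

    def clear_device_memory(self) -> None:
        # The real loader would release the previous model here.
        pass

    def load_models(self, model_path: str, current_model_key: ModelKey) -> Any:
        # Assumption: the loader remembers which key it loaded last.
        self.load_count += 1
        self.loaded_model_key = current_model_key
        return f"pipeline for {model_path}"


def check_cache(loader: FakeLocalModelLoader, key: ModelKey,
                model_path: str, last_pipeline: Any = None) -> Any:
    """Reload only when the requested key differs from the loaded one."""
    if loader.loaded_model_key != key:
        loader.clear_device_memory()
        return loader.load_models(model_path, key)
    return last_pipeline
```

Calling `check_cache` twice with the same key triggers a single load; changing any element of the key, for example the model name, triggers a reload.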
29 changes: 21 additions & 8 deletions exasol_transformers_extension/udfs/models/model_downloader_udf.py
@@ -4,17 +4,29 @@
from exasol_bucketfs_utils_python.bucketfs_factory import BucketFSFactory

from exasol_transformers_extension.utils import bucketfs_operations
from exasol_transformers_extension.utils.huggingface_hub_bucketfs_model_transfer import ModelFactoryProtocol, \
HuggingFaceHubBucketFSModelTransferFactory
from exasol_transformers_extension.utils.model_factory_protocol import ModelFactoryProtocol
from exasol_transformers_extension.utils.huggingface_hub_bucketfs_model_transfer_sp import \
HuggingFaceHubBucketFSModelTransferSPFactory


class ModelDownloaderUDF:
"""
UDF which downloads a pretrained model from Huggingface using Huggingface's transformers API,
and uploads it to the BucketFS, from where it can then be loaded without accessing Huggingface again.
Must be called with the following Input Parameter:

model_name | sub_dir | bfs_conn | token_conn
---------------------------------------------------------------------------------------------------
name of Huggingface model | directory to save model | BucketFS connection | name of token connection

returns <sub_dir/model_name> , <path of model BucketFS>
"""
def __init__(self,
exa,
base_model_factory: ModelFactoryProtocol = transformers.AutoModel,
tokenizer_factory: ModelFactoryProtocol = transformers.AutoTokenizer,
huggingface_hub_bucketfs_model_transfer: HuggingFaceHubBucketFSModelTransferFactory =
HuggingFaceHubBucketFSModelTransferFactory(),
huggingface_hub_bucketfs_model_transfer: HuggingFaceHubBucketFSModelTransferSPFactory =
HuggingFaceHubBucketFSModelTransferSPFactory(),
bucketfs_factory: BucketFSFactory = BucketFSFactory()):
self._exa = exa
self._base_model_factory = base_model_factory
@@ -31,10 +43,10 @@ def run(self, ctx) -> None:

def _download_model(self, ctx) -> Tuple[str, str]:
# parameters
model_name = ctx.model_name
sub_dir = ctx.sub_dir
bfs_conn = ctx.bfs_conn
token_conn = ctx.token_conn
model_name = ctx.model_name # name of Huggingface model
sub_dir = ctx.sub_dir # directory to save model
bfs_conn = ctx.bfs_conn # BucketFS connection
token_conn = ctx.token_conn # name of token connection

# extract token from the connection if token connection name is given.
# note that, token is required for private models. It doesn't matter
@@ -64,6 +76,7 @@ def _download_model(self, ctx) -> Tuple[str, str]:
) as downloader:
for model in [self._base_model_factory, self._tokenizer_factory]:
downloader.download_from_huggingface_hub(model)
# upload model files to BucketFS
model_tar_file_path = downloader.upload_to_bucketfs()

return str(model_path), str(model_tar_file_path)
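The download step treats the model class and the tokenizer class uniformly as "factories" with a `from_pretrained` method whose result offers `save_pretrained`. A minimal sketch of that pattern follows. The real `ModelFactoryProtocol` is not shown in this diff, so `FactoryLike`, `FakeFactory`, `FakeArtifact` and `download_and_save` are hypothetical names, and the fakes write placeholder files instead of contacting the Hugging Face Hub.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Protocol


class SavableLike(Protocol):
    def save_pretrained(self, save_directory: Path) -> None: ...


class FactoryLike(Protocol):
    """Hypothetical shape of a model/tokenizer factory."""
    def from_pretrained(self, model_name: str, cache_dir: Path, **kwargs) -> SavableLike: ...


class FakeArtifact:
    def __init__(self, file_name: str) -> None:
        self._file_name = file_name

    def save_pretrained(self, save_directory: Path) -> None:
        # Write a placeholder file where a real model or tokenizer would save itself.
        save_directory = Path(save_directory)
        save_directory.mkdir(parents=True, exist_ok=True)
        (save_directory / self._file_name).write_text("{}")


class FakeFactory:
    def __init__(self, file_name: str) -> None:
        self._file_name = file_name

    def from_pretrained(self, model_name: str, cache_dir: Path, **kwargs) -> FakeArtifact:
        return FakeArtifact(self._file_name)


def download_and_save(model_name: str, factories: Iterable[FactoryLike], tmpdir: Path) -> Path:
    """Mirror the layout from the diff: <tmpdir>/cache and <tmpdir>/pretrained/<model_name>."""
    target = tmpdir / "pretrained" / model_name
    for factory in factories:
        artifact = factory.from_pretrained(model_name, cache_dir=tmpdir / "cache")
        artifact.save_pretrained(target)
    return target


def demo_download() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        factories = [FakeFactory("config.json"), FakeFactory("tokenizer_config.json")]
        saved = download_and_save("my-model", factories, Path(tmp))
        return sorted(p.name for p in saved.iterdir())
```

Because both factories save into the same directory, the upload step only has to pack one directory that holds the model and the tokenizer together.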
7 changes: 6 additions & 1 deletion exasol_transformers_extension/upload_model.py
@@ -33,13 +33,18 @@ def main(
model_name: str,
sub_dir: str,
local_model_path: str):
"""
Script for uploading locally saved model files to BucketFS. Files should have been saved locally
using Transformers save_pretrained function. This ensures proper loading from the BucketFS later
"""
# create bucketfs location
bucketfs_location = bucketfs_operations.create_bucketfs_location(
bucketfs_name, bucketfs_host, bucketfs_port, bucketfs_use_https,
bucketfs_user, bucketfs_password, bucket, path_in_bucket)

# upload the downloaded model files into bucketfs
upload_path = bucketfs_operations.get_model_path(sub_dir, model_name)
upload_path = bucketfs_operations.get_model_path_with_pretrained(sub_dir, model_name)

bucketfs_operations.upload_model_files_to_bucketfs(
local_model_path, upload_path, bucketfs_location)

@@ -1,5 +1,6 @@
from pathlib import Path

from exasol_bucketfs_utils_python.abstract_bucketfs_location import AbstractBucketFSLocation
from exasol_bucketfs_utils_python.bucketfs_location import BucketFSLocation

from exasol_transformers_extension.utils import bucketfs_operations
13 changes: 10 additions & 3 deletions exasol_transformers_extension/utils/bucketfs_operations.py
@@ -41,10 +41,13 @@ def create_bucketfs_location(
def upload_model_files_to_bucketfs(
tmpdir_name: str, model_path: Path,
bucketfs_location: AbstractBucketFSLocation) -> Path:
"""
uploads model in tmpdir_name to model_path in bucketfs_location
"""
with tempfile.TemporaryFile() as fileobj:
create_tar_of_directory(Path(tmpdir_name), fileobj)
model_tar_file = model_path.with_suffix(".tar.gz")
return upload_file_to_bucketfs_with_retry(bucketfs_location, fileobj, model_tar_file)
model_upload_tar_file_path = model_path.with_suffix(".tar.gz")
return upload_file_to_bucketfs_with_retry(bucketfs_location, fileobj, model_upload_tar_file_path)
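The upload helper packs the model directory into a temporary file before sending it to BucketFS. The body of `create_tar_of_directory` is not part of this diff, so the following is only a sketch of how such a step could look with the standard library; `tar_directory_into`, `list_tar_members` and `demo_tar` are hypothetical names, and gzip compression is an assumption based on the `.tar.gz` suffix.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import IO, List


def tar_directory_into(directory: Path, fileobj: IO[bytes]) -> None:
    """Write a gzipped tar of the entries in `directory` into `fileobj`."""
    with tarfile.open(fileobj=fileobj, mode="w:gz") as tar:
        for entry in sorted(directory.iterdir()):
            # Store entries relative to the directory, not with absolute paths.
            tar.add(entry, arcname=entry.name)
    # Rewind so that an uploader can read the archive from the start.
    fileobj.seek(0)


def list_tar_members(fileobj: IO[bytes]) -> List[str]:
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return sorted(tar.getnames())


def demo_tar() -> List[str]:
    with tempfile.TemporaryDirectory() as tmpdir_name:
        model_dir = Path(tmpdir_name)
        (model_dir / "config.json").write_text("{}")
        (model_dir / "tokenizer_config.json").write_text("{}")
        with tempfile.TemporaryFile() as fileobj:
            tar_directory_into(model_dir, fileobj)
            return list_tar_members(fileobj)
```

Using a `TemporaryFile` keeps the archive off the final path until the upload succeeds, and the file is removed automatically when the `with` block ends.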


@retry(wait=wait_fixed(2), stop=stop_after_attempt(10))
@@ -69,4 +72,8 @@ def get_local_bucketfs_path(


def get_model_path(sub_dir: str, model_name: str) -> Path:
return Path(sub_dir, model_name.replace('-', '_'))
return Path(sub_dir, model_name)


def get_model_path_with_pretrained(sub_dir: str, model_name: str) -> Path:
return Path(sub_dir, model_name, "pretrained" , model_name)
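To make the new storage layout concrete, the sketch below repeats `get_model_path_with_pretrained` from this diff and adds a hypothetical helper, `tar_file_path`, that mirrors the `with_suffix(".tar.gz")` call in `upload_model_files_to_bucketfs`. It also shows a `pathlib` behaviour that may be worth checking for model names containing a dot: `with_suffix` replaces an existing suffix rather than appending.

```python
from pathlib import Path


def get_model_path_with_pretrained(sub_dir: str, model_name: str) -> Path:
    # Same expression as in the diff: <sub_dir>/<model_name>/pretrained/<model_name>
    return Path(sub_dir, model_name, "pretrained", model_name)


def tar_file_path(model_path: Path) -> Path:
    # Hypothetical helper mirroring model_path.with_suffix(".tar.gz").
    return model_path.with_suffix(".tar.gz")


plain = get_model_path_with_pretrained("sub", "bert-base-uncased")
print(plain.as_posix())
# sub/bert-base-uncased/pretrained/bert-base-uncased
print(tar_file_path(plain).name)
# bert-base-uncased.tar.gz

dotted = get_model_path_with_pretrained("sub", "t5-v1.1")
print(tar_file_path(dotted).name)
# t5-v1.tar.gz  (the ".1" is treated as a suffix and replaced)
```

Hyphens in the model name are kept as-is, in line with the removal of `.replace('-', '_')` in this change.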

This file was deleted.

@@ -1,24 +1,22 @@
from pathlib import Path

from exasol_bucketfs_utils_python.abstract_bucketfs_location import AbstractBucketFSLocation
from exasol_bucketfs_utils_python.bucketfs_location import BucketFSLocation

from exasol_transformers_extension.utils.model_factory_protocol import ModelFactoryProtocol
from exasol_transformers_extension.utils.bucketfs_model_uploader import BucketFSModelUploaderFactory
from exasol_transformers_extension.utils.temporary_directory_factory import TemporaryDirectoryFactory





class HuggingFaceHubBucketFSModelTransferSP:
"""
Class for downloading a model using the Huggingface Transformers API, and loading it into the BucketFS
using save_pretrained.
Class for downloading a model using the Huggingface Transformers API, saving it locally using
transformers save_pretrained, and loading the saved model files into the BucketFS.

:bucketfs_location: BucketFSLocation the model should be loaded to
:model_name: Name of the model to be downloaded using Huggingface Transformers API
:model_path: Path the model will be loaded into the BucketFS at
:token: Huggingface token, only needed for private models
:bucketfs_location: BucketFSLocation the model should be loaded to
:model_name: Name of the model to be downloaded using Huggingface Transformers API
:model_path: Path the model will be loaded into the BucketFS at
:token: Huggingface token, only needed for private models
:temporary_directory_factory: Optional. Default is TemporaryDirectoryFactory. Mainly change for testing.
:bucketfs_model_uploader_factory: Optional. Default is BucketFSModelUploaderFactory. Mainly change for testing.
"""
@@ -50,9 +48,10 @@ def __exit__(self, exc_type, exc_val, exc_tb):
def download_from_huggingface_hub(self, model_factory: ModelFactoryProtocol):
"""
Download a model from HuggingFace Hub into a temporary directory and save it with save_pretrained
in temporary directory / pretrained .
in temporary directory / pretrained / model_name.
"""
model = model_factory.from_pretrained(self._model_name, cache_dir=self._tmpdir_name / "cache", use_auth_token=self._token)
model = model_factory.from_pretrained(self._model_name, cache_dir=self._tmpdir_name / "cache",
use_auth_token=self._token)
model.save_pretrained(self._tmpdir_name / "pretrained" / self._model_name)

def upload_to_bucketfs(self) -> Path:
@@ -61,7 +60,7 @@ def upload_to_bucketfs(self) -> Path:

returns: Path of the uploaded model in the BucketFS
"""
return self._bucketfs_model_uploader.upload_directory(self._tmpdir_name / "pretrained" / self._model_name)
return self._bucketfs_model_uploader.upload_directory(self._tmpdir_name / "pretrained" / self._model_name) #todo should we do replace(-,_) here to?
Collaborator:

where did we replace it before?

Collaborator Author:

here:

def get_model_path(sub_dir: str, model_name: str) -> Path:

and then consequently here:

model_params.base_model, tmpdir / model_params.sub_dir / model_params.base_model.replace("-", "_")):

looks like there was concern about the path in bucketfs containing "-". but i dont know if that is still valid.



class HuggingFaceHubBucketFSModelTransferSPFactory: