enable embedding finetuning #639

Merged · 8 commits · Sep 8, 2024
46 changes: 45 additions & 1 deletion comps/finetuning/README.md
@@ -99,6 +99,8 @@ For reranking and embedding models finetuning, the training file [toy_finetune_d

## 3.2 Create fine-tuning job

### 3.2.1 Instruction Tuning

After a training file like `alpaca_data.json` is uploaded, use the following command to launch a finetuning job with `meta-llama/Llama-2-7b-chat-hf` as the base model:

```bash
@@ -112,6 +114,8 @@ curl http://${your_ip}:8015/v1/fine_tuning/jobs \
}'
```
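
For reference, the same job can be created programmatically. The sketch below uses Python `requests`; the host (`localhost`) and the exact payload fields are assumptions, since the curl body is collapsed in this view and the payload is modeled on the embedding example in 3.2.3:

```python
# Sketch only: create an instruction-tuning job via the OpenAI-compatible endpoint.
# Host and payload fields are assumptions mirroring the curl examples in this README.
import requests

resp = requests.post(
    "http://localhost:8015/v1/fine_tuning/jobs",  # replace localhost with ${your_ip}
    json={
        "training_file": "alpaca_data.json",
        "model": "meta-llama/Llama-2-7b-chat-hf",
    },
    timeout=30,
)
print(resp.json())  # the created fine-tuning job, including its id
```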

### 3.2.2 Reranking Model Training

Use the following command to launch a finetuning job for a reranking model such as `BAAI/bge-reranker-large`:

```bash
@@ -129,6 +133,46 @@ curl http://${your_ip}:8015/v1/fine_tuning/jobs \
}'
```
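
The full request body is collapsed in this diff view. As a rough sketch, the same reranker job can be created from Python; the payload below is an assumption modeled on the embedding example in 3.2.3, with the task switched to `"rerank"`:

```python
# Sketch only: create a reranker finetuning job.
# The payload is an assumption modeled on the embedding example below.
import requests

resp = requests.post(
    "http://localhost:8015/v1/fine_tuning/jobs",  # replace localhost with ${your_ip}
    json={
        "training_file": "toy_finetune_data.jsonl",
        "model": "BAAI/bge-reranker-large",
        "General": {"task": "rerank", "lora_config": None},
    },
    timeout=30,
)
print(resp.json())
```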

### 3.2.3 Embedding Model Training

Use the following command to launch a finetuning job for an embedding model such as `BAAI/bge-base-en-v1.5`:

```bash
# create a finetuning job
curl http://${your_ip}:8015/v1/fine_tuning/jobs \
-X POST \
-H "Content-Type: application/json" \
-d '{
"training_file": "toy_finetune_data.jsonl",
"model": "BAAI/bge-base-en-v1.5",
"General":{
"task":"embedding",
"lora_config":null
}
}'


# When training on Gaudi2, set "padding": "max_length" and make "query_max_len" equal to "passage_max_len" so that shapes stay static during training. For example:
curl http://${your_ip}:8015/v1/fine_tuning/jobs \
-X POST \
-H "Content-Type: application/json" \
-d '{
"training_file": "toy_finetune_data.jsonl",
"model": "BAAI/bge-base-en-v1.5",
"General":{
"task":"embedding",
"lora_config":null
},
"Dataset":{
"query_max_len":128,
"passage_max_len":128,
"padding":"max_length"
}
}'


```
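
For reference, each line of `toy_finetune_data.jsonl` is a JSON record with a `query`, a list of positive passages `pos`, and a list of negative passages `neg`; this is the shape consumed by `TrainDatasetForEmbedding` in `data_process.py` below. A minimal sketch that writes one such record (the texts are placeholders):

```python
# Sketch: one training record in the query/pos/neg format read by TrainDatasetForEmbedding.
import json

record = {
    "query": "what is deep learning?",
    "pos": ["Deep learning is a subset of machine learning based on neural networks."],
    "neg": [
        "Bread is baked from flour, water, and yeast.",
        "Paris is the capital of France.",
    ],
}

with open("toy_finetune_data.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```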

## 3.3 Manage fine-tuning job

The commands below show how to list finetuning jobs, retrieve a finetuning job, cancel a finetuning job, and list the checkpoints of a finetuning job.
@@ -149,4 +193,4 @@ curl http://${your_ip}:8015/v1/finetune/list_checkpoints -X POST -H "Content-Typ
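
The curl bodies are collapsed in this view. As a rough Python sketch of the same management calls, note that both the GET listing endpoint and the job-id field name in the `list_checkpoints` body are assumptions based on the OpenAI-style API referenced in Section 4:

```python
# Sketch only: job management via the OpenAI-style endpoints.
# The listing endpoint and the job-id field name are assumptions; check the service's API schema.
import requests

base = "http://localhost:8015"  # replace localhost with ${your_ip}

jobs = requests.get(f"{base}/v1/fine_tuning/jobs", timeout=30).json()  # assumed OpenAI-style listing
job_id = jobs["data"][0]["id"]                                         # assumed OpenAI-style response shape

ckpts = requests.post(
    f"{base}/v1/finetune/list_checkpoints",
    headers={"Content-Type": "application/json"},
    json={"fine_tuning_job_id": job_id},  # assumed request body
    timeout=30,
)
print(ckpts.json())
```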

## 🚀4. Descriptions for Finetuning parameters

We utilize [OpenAI finetuning parameters](https://platform.openai.com/docs/api-reference/fine-tuning) and extend it with more customizable parameters.
We utilize [OpenAI finetuning parameters](https://platform.openai.com/docs/api-reference/fine-tuning) and extend them with more customizable parameters; see the definitions in [finetune_config](https://github.com/opea-project/GenAIComps/blob/main/comps/finetuning/finetune_config.py).
29 changes: 27 additions & 2 deletions comps/finetuning/finetune_config.py
@@ -5,7 +5,7 @@

from typing import List, Optional, Union

from pydantic import BaseModel, validator
from pydantic import BaseModel, Field, validator

from comps.cores.proto.api_protocol import FineTuningJobsRequest

@@ -74,13 +74,29 @@ class DatasetConfig(BaseModel):
    truncation_side: str = "right"
    max_seq_length: int = 512
    truncation: bool = True
    padding: bool = True
    padding: Union[bool, str] = True
    mask_input: bool = True
    mask_response: bool = True
    data_preprocess_type: str = "neural_chat"
    max_train_samples: int = 0
    max_eval_samples: int = 0
    train_group_size: int = 8
    query_max_len: int = Field(
        default=128,
        description=(
            "The maximum total input sequence length after tokenization for query. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        ),
    )
    passage_max_len: int = Field(
        default=128,
        description=(
            "The maximum total input sequence length after tokenization for passage. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        ),
    )
    query_instruction_for_retrieval: Optional[str] = Field(default=None, description="instruction for query")
    passage_instruction_for_retrieval: Optional[str] = Field(default=None, description="instruction for passage")


class RayResourceConfig(BaseModel):
@@ -89,6 +105,14 @@ class RayResourceConfig(BaseModel):
    HPU: int = 0


class EmbeddingTrainingConfig(BaseModel):
    negatives_cross_device: bool = Field(default=False, description="share negatives across devices")
    temperature: Optional[float] = Field(default=0.02)
    sentence_pooling_method: str = Field(default="cls", description="the pooling method, should be cls or mean")
    normalized: bool = Field(default=True)
    use_inbatch_neg: bool = Field(default=True, description="use passages in the same batch as negatives")


class TrainingConfig(BaseModel):
    optimizer: str = "adamw_torch"
    batch_size: int = 2
@@ -106,6 +130,7 @@ class TrainingConfig(BaseModel):
    gradient_accumulation_steps: int = 1
    logging_steps: int = 10
    deepspeed_config_file: str = ""
    embedding_training_config: Optional[EmbeddingTrainingConfig] = EmbeddingTrainingConfig()

    @validator("device")
    def check_device(cls, v: str):
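
As a quick illustration of the new schema (a sketch only; the values are arbitrary), the embedding-specific options can be constructed directly with the Pydantic models defined in this file and are passed to training under `Training.embedding_training_config`:

```python
# Sketch: instantiating the new embedding-training options (values are illustrative).
from comps.finetuning.finetune_config import EmbeddingTrainingConfig

emb_cfg = EmbeddingTrainingConfig(
    temperature=0.05,                # contrastive-loss temperature
    sentence_pooling_method="mean",  # "cls" or "mean"
    use_inbatch_neg=True,            # use other passages in the batch as negatives
)
print(emb_cfg.dict())
# These options ride along in the job request under Training.embedding_training_config.
```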
71 changes: 71 additions & 0 deletions comps/finetuning/llm_on_ray/finetune/data_process.py
@@ -246,3 +246,74 @@ def __call__(self, features) -> Tuple[Dict[str, torch.Tensor], Dict[str, torch.T
        if isinstance(features[0], list):
            features = sum(features, [])
        return super().__call__(features)


class TrainDatasetForEmbedding(Dataset):
    def __init__(self, dataset, args, tokenizer):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.args = args
        self.total_len = len(self.dataset)

    def __len__(self):
        return self.total_len

    def __getitem__(self, item) -> Tuple[str, List[str]]:
        query = self.dataset[item]["query"]
        if self.args["query_instruction_for_retrieval"] is not None:
            query = self.args["query_instruction_for_retrieval"] + query

        passages = []

        assert isinstance(self.dataset[item]["pos"], list)
        pos = random.choice(self.dataset[item]["pos"])
        passages.append(pos)

        train_group_size = self.args.get("train_group_size", 8)
        if len(self.dataset[item]["neg"]) < train_group_size - 1:
            num = math.ceil((train_group_size - 1) / len(self.dataset[item]["neg"]))
            negs = random.sample(self.dataset[item]["neg"] * num, train_group_size - 1)
        else:
            negs = random.sample(self.dataset[item]["neg"], train_group_size - 1)
        passages.extend(negs)

        if self.args["passage_instruction_for_retrieval"] is not None:
            passages = [self.args["passage_instruction_for_retrieval"] + p for p in passages]
        return query, passages


@dataclass
class EmbedCollator(DataCollatorWithPadding):
    """Wrapper that converts a List[Tuple[encode_qry, encode_psg]] into List[qry], List[psg]
    and passes each batch separately to the actual collator.

    Abstracts out data details for the model.
    """

    query_max_len: int = 32
    passage_max_len: int = 128

    def __call__(self, features):
        query = [f[0] for f in features]
        passage = [f[1] for f in features]

        if isinstance(query[0], list):
            query = sum(query, [])
        if isinstance(passage[0], list):
            passage = sum(passage, [])

        q_collated = self.tokenizer(
            query,
            padding=self.padding,
            truncation=True,
            max_length=self.query_max_len,
            return_tensors="pt",
        )
        d_collated = self.tokenizer(
            passage,
            padding=self.padding,
            truncation=True,
            max_length=self.passage_max_len,
            return_tensors="pt",
        )
        return {"query": q_collated, "passage": d_collated}
42 changes: 33 additions & 9 deletions comps/finetuning/llm_on_ray/finetune/finetune.py
@@ -27,8 +27,14 @@
from comps import CustomLogger
from comps.finetuning.finetune_config import FinetuneConfig
from comps.finetuning.llm_on_ray import common
from comps.finetuning.llm_on_ray.finetune.data_process import DataProcessor, GroupCollator, TrainDatasetForCE
from comps.finetuning.llm_on_ray.finetune.modeling import CrossEncoder
from comps.finetuning.llm_on_ray.finetune.data_process import (
    DataProcessor,
    EmbedCollator,
    GroupCollator,
    TrainDatasetForCE,
    TrainDatasetForEmbedding,
)
from comps.finetuning.llm_on_ray.finetune.modeling import BiEncoderModel, CrossEncoder

logger = CustomLogger("llm_on_ray/finetune")

@@ -244,7 +250,8 @@ def group_texts(examples):
dataset["train"] = TrainDatasetForCE(dataset["train"], config["Dataset"], tokenizer)
return dataset
elif task == "embedding":
pass
dataset["train"] = TrainDatasetForEmbedding(dataset["train"], config["Dataset"], tokenizer)
return dataset
else:
raise NotImplementedError(f"Unsupported task {task}, only support instruction_tuning, rerank, embedding now.")

@@ -258,7 +265,12 @@ def prepare_data_collator(config: Dict, tokenizer):
elif task == "rerank":
return GroupCollator(tokenizer)
elif task == "embedding":
pass
return EmbedCollator(
tokenizer=tokenizer,
padding=config["Dataset"]["padding"],
query_max_len=config["Dataset"]["query_max_len"],
passage_max_len=config["Dataset"]["passage_max_len"],
)
else:
raise NotImplementedError(f"Unsupported task {task}, only support instruction_tuning, rerank, embedding now.")

@@ -268,24 +280,36 @@ def load_model(config: Dict):
    model_dtype = convert_dtype(config["Training"].get("mixed_precision", "no"))
    model_config = config["General"].get("config", {})
    task = config["General"].get("task", "instruction_tuning")
    training_args = convert_to_training_args(TrainingArguments, config)
    if task == "instruction_tuning":
        model = transformers.AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=model_dtype, **model_config)

        lora_config = config["General"].get("lora_config", None)
        if lora_config:
            peft_config = LoraConfig(**lora_config)
            model = get_peft_model(model, peft_config)
    elif task == "rerank":
        model = CrossEncoder.from_pretrained(
            config["Dataset"],
            training_args,
            config["Dataset"].get("train_group_size", 8),
            config["Training"]["batch_size"],
            model_name,
            from_tf=bool(".ckpt" in model_name),
            config=model_config,
        )
    elif task == "embedding":
        pass
        should_concat = False
        if (
            config["Dataset"]["query_max_len"] == config["Dataset"]["passage_max_len"]
            and config["Dataset"]["padding"] == "max_length"
        ):
            should_concat = True
        if config["Training"]["device"] == "hpu" and not should_concat:
            raise ValueError("please set query_max_len==passage_max_len and padding='max_length' for hpu.")

        if config["Training"].get("embedding_training_config", None) is not None:
            model = BiEncoderModel(
                model_name=model_name, should_concat=should_concat, **config["Training"]["embedding_training_config"]
            )
        else:
            model = BiEncoderModel(model_name=model_name, should_concat=should_concat)
    else:
        raise NotImplementedError(f"Unsupported task {task}, only support instruction_tuning, rerank, embedding now.")

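
To make the wiring above concrete, here is a sketch of how `load_model` ends up constructing the bi-encoder for the embedding task. The keyword arguments simply mirror the `EmbeddingTrainingConfig` fields splatted in the call above; `modeling.py` is not part of this diff, so treat the call as illustrative:

```python
# Sketch: how the embedding task's model is built from the config values above.
# BiEncoderModel's full signature lives in modeling.py (not shown in this PR view).
from comps.finetuning.llm_on_ray.finetune.modeling import BiEncoderModel

model = BiEncoderModel(
    model_name="BAAI/bge-base-en-v1.5",
    should_concat=True,            # query_max_len == passage_max_len and padding == "max_length"
    negatives_cross_device=False,  # remaining kwargs are EmbeddingTrainingConfig defaults
    temperature=0.02,
    sentence_pooling_method="cls",
    normalized=True,
    use_inbatch_neg=True,
)
```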