[mthreads] deepspeed llama2 #354

Merged · 11 commits · Dec 21, 2023
21 changes: 21 additions & 0 deletions inference/benchmarks/bertLarge/README.md
@@ -58,6 +58,25 @@ bert_reference_results_text_md5.txt

- XTCL 2.1

#### 2.3 Iluvatar CoreX MR-100

- ##### Hardware environment
    - Machine and accelerator model: MR-100

- ##### Software environment
   - OS version: Ubuntu 20.04
   - OS kernel version: 5.15.0-89-generic
   - Accelerator driver version: 3.2.0
   - Docker version: 24.0.4
   - Dependency versions:
      - torch-1.13.1+corex.3.2.1
      - onnxsim

- Inference toolkit

   - IXRT: ixrt-0.8.0+corex.3.2.1


### 4. Benchmark results (BERT-Large)

* Metric list
@@ -83,3 +102,5 @@ bert_reference_results_text_md5.txt
| tensorrt | fp16 | 32 | 1283.9 | 257.3 | 260.4 | 408.3 | 418.1 | 45.3% | 0.600/0.638 | 17.4/40.0 |
| tensorrt | fp32 | 32 | 1868.8 | 150.4 | 152.2 | 190.4 | 194.1 | 42.0% | 0.638/0.638 | 16.9/40.0 |
| kunlunxin_xtcl | W32A16 | 32 | / | / | / | / | / | / | 0.638/0.638 | / |
| iluvatar_ixrt | fp16 | 32 | / | / | / | / | / | / | 0.599/0.638 | / |

@@ -0,0 +1,2 @@
transformers
onnxsim
@@ -0,0 +1,5 @@
ixrt_tmp_path: iluvatar_tmp/bertLarge.trt
compiler: ixrt
# no_validation: true
has_dynamic_axis: false
torchtrt_full_compile: true
@@ -2,7 +2,7 @@

>Contact email: [email protected]

ixrt-0.7.0+corex.latest.version-cp310-cp310-linux_x86_64.whl
ixrt-0.8.0+corex.latest.version-cp310-cp310-linux_x86_64.whl

torchvision-0.14.1+corex.3.2.1.20231006.892-cp310-cp310-linux_x86_64.whl

26 changes: 15 additions & 11 deletions inference/inference_engine/iluvatar/ixrt.py
@@ -9,7 +9,6 @@
import time
import subprocess


class InferModel:

class HostDeviceMem(object):
@@ -66,27 +65,32 @@ def __init__(self, config, onnx_path, model):

def build_engine(self, config, onnx_path):
if config.exist_compiler_path is None:
trt_path = config.log_dir + "/" + config.ixrt_tmp_path
ixrt_path = config.log_dir + "/" + config.ixrt_tmp_path

dir_trt_path = os.path.dirname(trt_path)
dir_trt_path = os.path.dirname(ixrt_path)
os.makedirs(dir_trt_path, exist_ok=True)

time.sleep(10)

trtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + trt_path
onnxsim_cmd = f"onnxsim {onnx_path} {onnx_path}"

onnxsim_cmd = subprocess.Popen(onnxsim_cmd, shell=True)
onnxsim_cmd.wait()

ixrtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + ixrt_path
if config.fp16:
trtexec_cmd += " --precision fp16"
ixrtexec_cmd += " --precision fp16"
if config.has_dynamic_axis:
trtexec_cmd += " --minShapes=" + config.minShapes
trtexec_cmd += " --optShapes=" + config.optShapes
trtexec_cmd += " --maxShapes=" + config.maxShapes
ixrtexec_cmd += " --minShapes=" + config.minShapes
ixrtexec_cmd += " --optShapes=" + config.optShapes
ixrtexec_cmd += " --maxShapes=" + config.maxShapes

p = subprocess.Popen(trtexec_cmd, shell=True)
p = subprocess.Popen(ixrtexec_cmd, shell=True)
p.wait()
else:
trt_path = config.exist_compiler_path
ixrt_path = config.exist_compiler_path

with open(trt_path, "rb") as f:
with open(ixrt_path, "rb") as f:
return self.runtime.deserialize_cuda_engine(f.read())

def allocate_buffers(self, engine):
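As a rough illustration of the compile flow `build_engine` now implements (not taken from the PR itself; the paths and the `shapes` argument below are assumptions), the steps reduce to simplifying the ONNX model with onnxsim and then invoking ixrtexec:

```python
import os
import subprocess


def compile_with_ixrt(onnx_path, engine_path, fp16=True, shapes=None):
    """Sketch of the onnxsim + ixrtexec flow used by build_engine above."""
    out_dir = os.path.dirname(engine_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Simplify the ONNX graph in place before handing it to the compiler.
    subprocess.run(f"onnxsim {onnx_path} {onnx_path}", shell=True, check=True)

    cmd = f"ixrtexec --onnx={onnx_path} --save_engine={engine_path}"
    if fp16:
        cmd += " --precision fp16"
    if shapes is not None:
        # shapes is assumed to be a dict of min/opt/max shape strings.
        cmd += (f" --minShapes={shapes['min']}"
                f" --optShapes={shapes['opt']}"
                f" --maxShapes={shapes['max']}")
    subprocess.run(cmd, shell=True, check=True)


if __name__ == "__main__":
    # Illustrative paths only; bertLarge.onnx must already exist locally.
    compile_with_ixrt("bertLarge.onnx", "iluvatar_tmp/bertLarge.trt")
```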
6 changes: 4 additions & 2 deletions training/benchmarks/aquila2_7b/flagscale/README.md
@@ -4,7 +4,9 @@ Aquila2 is an open-source language model from the Beijing Academy of Artificial Intelligence (BAAI), including base language ...

## Model configuration and tokenizer preparation

This test sample is a pretraining case and requires downloading a tokenizer; the download link is https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer. Create a tokenizer directory under data_dir and download the three files from that link into it.
This test sample is a pretraining case and requires downloading a tokenizer; the download link is https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer

The tokenizer must be taken from commit ed55532 of the FlagScale repository. Create a tokenizer directory under data_dir and download the three files from the link above into that directory.

## Data preparation

@@ -14,4 +16,4 @@ https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin

https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx

Place the two files above under data_dir.
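As a minimal illustration (not part of the original README), the two demo files can be fetched into data_dir with a short script; the local data_dir path below is an assumption to adjust:

```python
import os
import urllib.request

data_dir = "/path/to/data_dir"  # assumption: set this to your actual data_dir
urls = [
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin",
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx",
]

os.makedirs(data_dir, exist_ok=True)
for url in urls:
    target = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(target):  # skip files that were already downloaded
        urllib.request.urlretrieve(url, target)
```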
16 changes: 1 addition & 15 deletions training/benchmarks/bert_hf/pytorch/train/trainer.py
@@ -82,21 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
dist_pytorch.barrier(self.config.vendor)
pure_start_time = time.time()

if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(input_ids=input_ids, labels=labels)
loss = output.loss

scaler.scale(loss).backward()
if step % self.config.gradient_accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
else:
output = model(input_ids=input_ids, labels=labels)
loss = output.loss
loss.backward()
if step % self.config.gradient_accumulation_steps == 0:
optimizer.step()
loss = self.adapter.train_one_step(model, (input_ids, labels), optimizer, step, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
21 changes: 21 additions & 0 deletions training/benchmarks/bert_hf/pytorch/train/trainer_adapter.py
@@ -41,3 +41,24 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_one_step(model, batch_data, optimizer, cur_step, scaler=None):
    """Run one training step; uses AMP when a GradScaler is provided."""
    input_ids, labels = batch_data
    if scaler:
        with torch.cuda.amp.autocast(enabled=True):
            output = model(input_ids=input_ids, labels=labels)
            loss = output.loss

        scaler.scale(loss).backward()
        if cur_step % config.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
    else:
        output = model(input_ids=input_ids, labels=labels)
        loss = output.loss
        loss.backward()
        if cur_step % config.gradient_accumulation_steps == 0:
            optimizer.step()

    return loss
19 changes: 19 additions & 0 deletions training/benchmarks/driver/dist_pytorch.py
@@ -149,6 +149,8 @@ def barrier(vendor="nvidia"):
if torch.distributed.is_available() and torch.distributed.is_initialized():
if vendor == "kunlunxin":
torch.distributed.barrier()
elif vendor == "mthreads":
torch.distributed.barrier()
else:
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
@@ -172,6 +174,23 @@ def init_dist_training_env(config):
rank=rank,
world_size=world_size)
config.n_device = torch.distributed.get_world_size()
elif config.vendor == "mthreads":
import torch_musa
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("musa")
config.n_device = 1
else:
torch.musa.set_device(config.local_rank)
host_addr_full = 'tcp://' + os.environ[
"MASTER_ADDR"] + ':' + os.environ["MASTER_PORT"]
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.distributed.init_process_group(backend=config.dist_backend,
init_method=host_addr_full,
rank=rank,
world_size=world_size)
config.device = torch.device("musa", config.local_rank)
config.n_device = torch.distributed.get_world_size()
else: # nvidia
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("cuda")
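As a minimal illustration of the device handling this branch relies on (not part of the diff): importing torch_musa makes the "musa" device usable through the usual PyTorch tensor APIs. Single-process sketch with arbitrary tensor shapes:

```python
import torch

try:
    import torch_musa  # noqa: F401 -- importing it enables torch.device("musa")
    device = torch.device("musa")
except ImportError:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(2, 3).to(device)   # tensors move to MUSA the same way as to CUDA
print(x.device, (x @ x.t()).shape)
```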
6 changes: 6 additions & 0 deletions training/benchmarks/driver/helper.py
@@ -74,6 +74,12 @@ def set_seed(self, seed: int, vendor: str = None):
elif lower_vendor == "ascend":
import mindspore
mindspore.set_seed(seed)
elif lower_vendor == "mthreads":
import torch
import torch_musa
torch.manual_seed(seed)
torch.musa.manual_seed(seed)
torch.musa.manual_seed_all(seed)
else:
# TODO: extend here with seed setup for other vendors
pass
28 changes: 21 additions & 7 deletions training/benchmarks/llama2_7b/deepspeed/run_pretraining.py
@@ -10,6 +10,11 @@
from importlib import import_module

import torch
try:
    import torch_musa  # enables the "musa" device on Moore Threads GPUs
    DEVICE = 'musa'
except ImportError:
    DEVICE = 'cuda'
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

@@ -54,29 +59,32 @@

def train(model_engine, dataloader):
model_engine.train()
device = torch.device(f"{DEVICE}:{args.local_rank}")
ave_loss = 0.0
for step, data in enumerate(dataloader):

fake_data = torch.tensor(data).long()
input_ids = fake_data.to(args.local_rank)
labels = fake_data.to(args.local_rank)
input_ids = fake_data.to(device)
labels = fake_data.to(device)
loss = model_engine(input_ids=input_ids, labels=labels).loss
model_engine.backward(loss)
model_engine.step()

ave_loss += loss
if step % 10 == 0 and args.local_rank == 0:
if step > 0 and step % 10 == 0 and args.local_rank == 0:
print('Step {}/{}, Loss: {}'.format(step, len(dataloader),
ave_loss / 10))
ave_loss = 0.0


def get_deepspeed_engine(args, model_config_dir, flashattn):
def get_deepspeed_engine(args, model_config_dir):
with deepspeed.zero.Init(config_dict_or_path=args.deepspeed_config,
enabled=True,
mem_efficient_linear=False,
mpu=None):
model = get_llama_model(model_config_dir, flashattn)
model = get_llama_model(model_config_dir, args.flashattn)
if args.gradient_checkpointing_enable:
model.gradient_checkpointing_enable()

model_engine, _, _, _ = deepspeed.initialize(
args=args, model=model, model_parameters=model.parameters())
@@ -107,10 +115,12 @@ def get_metric(texts):
theoryflops = getattr(module, 'theoryflops')
epochs = getattr(module, 'epochs')
flashattn = getattr(module, 'flashattn')
gradient_checkpointing_enable = getattr(module, 'gradient_checkpointing_enable', False)
args.flashattn = flashattn
args.gradient_checkpointing_enable = gradient_checkpointing_enable

deepspeed.init_distributed()
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"),
flashattn)
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"))
dataset = get_llama_dataset(args, seqlength, datafilename)

logger = logging.getLogger("DeepSpeed")
@@ -138,4 +148,8 @@ def get_metric(texts):
chip_tps = whole_tps / args.nproc * args.nnodes
print("System tokens per second: ", whole_tps)
print("Tokens/p/s: ", chip_tps)

TFLOPS = int(theoryflops/1000000000000)
print("Theory TFLOPS: ", TFLOPS)
print("Tokens/TFLOPS: ", chip_tps / TFLOPS)
print("MFU: ", chip_tps * 7000000000.0 * 6 / theoryflops)
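As a worked illustration of the new prints (not part of the diff): achieved FLOPS are estimated as tokens/s per chip × 6 FLOPs per parameter per token × 7B parameters, and MFU is that value divided by the chip's peak FLOPS. The numbers below are assumptions for demonstration only:

```python
# Illustrative numbers; theoryflops and chip_tps come from the benchmark run.
theoryflops = 98e12              # assumed peak FLOPS of one chip (98 TFLOPS)
chip_tps = 1500.0                # assumed tokens processed per second per chip
params = 7_000_000_000           # LLaMA2-7B parameter count

achieved_flops = chip_tps * params * 6        # ~6 FLOPs per parameter per token
print("Tokens/TFLOPS:", chip_tps / (theoryflops / 1e12))
print("MFU:", achieved_flops / theoryflops)   # ~0.643 with the numbers above
```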
17 changes: 1 addition & 16 deletions training/benchmarks/resnet50/pytorch/train/trainer.py
@@ -82,22 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
pure_start_time = time.time()
optimizer.zero_grad()

images, target = batch
if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(images)
loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
output = model(images)

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(output, target)
loss.backward()
optimizer.step()
loss = self.adapter.train_step(model, batch, optimizer, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
20 changes: 20 additions & 0 deletions training/benchmarks/resnet50/pytorch/train/trainer_adapter.py
@@ -41,3 +41,23 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_step(model, batch, optimizer, scaler=None):
    """train one step"""
    images, target = batch
    criterion = torch.nn.CrossEntropyLoss()
    if scaler:
        with torch.cuda.amp.autocast(enabled=True):
            output = model(images)
            loss = criterion(output, target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        output = model(images)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

    return loss
2 changes: 1 addition & 1 deletion training/iluvatar/iluvatar_monitor.py
@@ -231,7 +231,7 @@ def get_system_info():
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Model:;"
cmd = cmd + r"ixsmi -L;"
cmd = cmd + r"export PATH=/usr/local/corex/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/corex/lib; ixsmi -L;"
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Driver version:;"
3 changes: 2 additions & 1 deletion training/iluvatar/mobilenetv2-pytorch/README.md
@@ -40,7 +40,8 @@

| Configuration | precision | fix_hp | e2e_time | p_whole | p_train | p_core | acc | mem |
| --------------------- | --------- | -------------- | -------- | ------- | ------- | ------ | ------ | ----------- |
| BI-V100 single node, 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 103759 | 3520 | 3604 | 3651 | 68.61% | 21.6 / 32.0 |
| BI-V100 single node, 8 cards (1x8) | fp32 | / | 174534 | 1857 | 1876 | 1885 | 68.52% | 3.6/32.0 |
| BI-V100 single node, 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 87559 | 4390 | 4543 | 4625 | 61.92% | 21.6 / 32.0 |
| BI-V100 single node, 1 card (1x1) | fp32 | bs=256,lr=0.72 | / | 624 | 632 | 633 | / | 21.4 / 32.0 |
| BI-V100 two nodes, 8 cards each (2x8) | fp32 | bs=256,lr=0.72 | / | 6835 | 7058 | 7219 | / | 22.2 / 32.0 |

@@ -1,5 +1,5 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 256
train_batch_size = 32
eval_batch_size = 32
