Distillation is a popular approach to network compression that transfers knowledge from a large model to a smaller one while aiming to preserve accuracy. Because smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device). The graph below shows the distillation workflow: the teacher model takes the same input that is fed into the student model and produces an output that carries the teacher's knowledge, which is then used to instruct the student model.
Intel® Neural Compressor supports Knowledge Distillation and Intermediate Layer Knowledge Distillation algorithms.
Knowledge distillation was proposed in "Distilling the Knowledge in a Neural Network". It uses the logits (the input to softmax in classification tasks) of the teacher and student models to minimize the difference between their predicted class distributions. This is done by minimizing a loss function that measures the distance between the teacher's and the student's logits, or between the softmax distributions computed from them.
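As a rough illustration, such a loss can be written generically as $L_{KD} = D(z_t, z_s)$, where $z_t$ and $z_s$ are the teacher and student logits and $D$ is a distance measure. This notation is an assumption made for the sketch, not necessarily the exact form used by Neural Compressor. The Python sketch below picks one common choice for $D$: the Kullback-Leibler divergence between temperature-softened softmax distributions. The function names and the `temperature` parameter are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [4.0, 1.0, 0.5]
print(kd_loss(teacher, teacher))          # identical logits give zero loss
print(kd_loss(teacher, [1.0, 4.0, 0.5]))  # disagreeing logits give a positive loss
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's relative ranking of the less likely classes.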
The teacher model contains more information than its logits alone. For example, the output features of the teacher model's intermediate layers are often used to guide the student model, as in "Patient Knowledge Distillation for BERT Model Compression" and "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". The general loss function for this approach sums, over selected pairs of teacher and student layers, a distance between the corresponding output features.
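A minimal sketch of such a layer-wise loss follows, under stated assumptions: the distance is mean squared error, each student feature is lifted to the teacher's width by a linear projection (needed only when the two widths differ), and the layer mapping is chosen by hand. All names, shapes, and values are hypothetical and are not taken from Neural Compressor.

```python
def linear(matrix, vector):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def intermediate_kd_loss(teacher_feats, student_feats, layer_pairs, projections):
    """Sum the feature distance over the mapped (teacher, student) layer pairs.

    projections[i] lifts the student feature of pair i to the teacher's
    feature dimension; None stands for the identity transform.
    """
    total = 0.0
    for (t_idx, s_idx), proj in zip(layer_pairs, projections):
        s_feat = student_feats[s_idx]
        if proj is not None:
            s_feat = linear(proj, s_feat)
        total += mse(teacher_feats[t_idx], s_feat)
    return total


# Hypothetical features: 4 teacher layers of width 3, 2 student layers of width 2.
teacher_feats = [[0.1, 0.2, 0.3], [1.0, 0.0, 1.0], [0.5, 0.5, 0.5], [2.0, 1.0, 0.0]]
student_feats = [[1.0, 0.0], [2.0, 1.0]]
layer_pairs = [(1, 0), (3, 1)]                      # (teacher layer, student layer)
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # maps width 2 to width 3
loss = intermediate_kd_loss(teacher_feats, student_feats, layer_pairs,
                            [projection, projection])
print(loss)
```

In practice the projections would be trainable parameters learned together with the student; they are fixed here only to keep the example short.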
Self-distillation is a one-stage training method in which the teacher and student models are trained together. It attaches several attention modules and shallow classifiers at different depths of the neural network and distills knowledge from the deepest classifier to the shallower classifiers. Unlike conventional knowledge distillation methods, where the knowledge of the teacher model is transferred to a separate student model, self-distillation can be regarded as knowledge transfer within the same model, from the deeper layers to the shallower layers.
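The idea of the deepest classifier teaching the shallower ones can be sketched as below. This is a simplified assumption, not the paper's or Neural Compressor's exact objective: each shallow classifier mixes a cross-entropy term on the true label with a KL term toward the deepest classifier's prediction, and the mixing weight `alpha` is hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(classifier_logits, label, alpha=0.5):
    """classifier_logits is ordered from the shallowest to the deepest classifier."""
    deepest = softmax(classifier_logits[-1])
    # The deepest classifier is supervised by the label only.
    loss = cross_entropy(classifier_logits[-1], label)
    for logits in classifier_logits[:-1]:
        hard = cross_entropy(logits, label)             # supervision from the label
        soft = kl_divergence(deepest, softmax(logits))  # supervision from the deepest classifier
        loss += (1.0 - alpha) * hard + alpha * soft
    return loss


agreeing = self_distillation_loss([[2.0, 0.0], [2.0, 0.0]], label=0)
disagreeing = self_distillation_loss([[0.0, 2.0], [2.0, 0.0]], label=0)
print(agreeing, disagreeing)  # the disagreeing shallow classifier is penalized more
```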
The additional classifiers in self-distillation allow the neural network to work in a dynamic manner, which can lead to much higher inference acceleration.
Architecture from the paper "Self-Distillation: Towards Efficient and Compact Neural Networks"
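One way the additional classifiers can make inference dynamic is confidence-based early exit: classifiers are evaluated from shallow to deep, and inference stops as soon as one of them is confident enough. The sketch below illustrates that idea only. The `threshold` value and the toy stage functions are assumptions, and each stage is computed independently here, whereas a real network would reuse the shared backbone features.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(stages, features, threshold=0.9):
    """Run classifiers from shallow to deep and stop at the first confident one.

    stages is a list of callables, each mapping the input to class logits.
    Returns (predicted_class, index_of_exit_stage).
    """
    for index, stage in enumerate(stages):
        probs = softmax(stage(features))
        confidence = max(probs)
        # The last stage always answers, even when it is below the threshold.
        if confidence >= threshold or index == len(stages) - 1:
            return probs.index(confidence), index


# Toy stages: the deeper classifier produces sharper logits for the same input.
stages = [
    lambda x: [0.2 * x, 0.0],  # shallow classifier, weak evidence
    lambda x: [2.0 * x, 0.0],  # deeper classifier, stronger evidence
]
print(early_exit_predict(stages, 5.0))   # shallow stage is unsure, so the deeper one answers
print(early_exit_predict(stages, 20.0))  # shallow stage is already confident
```

Easy inputs then cost only the shallow part of the network, which is where the acceleration comes from.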
Distillation Algorithm | PyTorch | TensorFlow |
---|---|---|
Knowledge Distillation | ✔ | ✔ |
Intermediate Layer Knowledge Distillation | ✔ | Will be supported |
Self Distillation | ✔ | ✖ |
The simplest launcher code, for the case where the training behavior is defined in a user-defined YAML file:
from neural_compressor.experimental import Distillation, common
distiller = Distillation('/path/to/user/yaml')
distiller.student_model = student_model
distiller.teacher_model = teacher_model
model = distiller.fit()
The Distillation class also accepts a DistillationConf object as its argument.
from neural_compressor.experimental import Distillation, common
from neural_compressor.conf.config import DistillationConf
conf = DistillationConf('/path/to/user/yaml')
distiller = Distillation(conf)
distiller.student_model = student_model
distiller.teacher_model = teacher_model
model = distiller.fit()
Users can pass customized training/evaluation functions to Distillation for flexible scenarios. In this case, the distillation process is driven by pre-defined hooks in Neural Compressor, which users need to call inside the training function.
Neural Compressor defines several hooks for users to call:
- on_train_begin() : Hook executed before training begins
- on_after_compute_loss(input, student_output, student_loss) : Hook executed after each batch inference of student model
- on_epoch_end() : Hook executed at each epoch end
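To make the calling order of these hooks concrete, here is a self-contained toy sketch. `ToyDistiller` is a hypothetical stand-in that only mimics the hook names; it is not Neural Compressor's implementation, and the way it blends the student loss with a squared-error distillation term (including the `distill_weight` parameter and the extra `teacher_output` argument) is an assumption for illustration.

```python
class ToyDistiller:
    """Hypothetical stand-in that mimics the hook names, not the library's code."""

    def __init__(self, distill_weight=0.5):
        self.distill_weight = distill_weight
        self.calls = []  # records the order in which hooks are invoked

    def on_train_begin(self):
        self.calls.append('on_train_begin')

    def on_after_compute_loss(self, input, student_output, student_loss, teacher_output):
        self.calls.append('on_after_compute_loss')
        # Squared error between the outputs plays the role of the distillation term.
        distill = sum((s - t) ** 2 for s, t in zip(student_output, teacher_output))
        w = self.distill_weight
        return (1.0 - w) * student_loss + w * distill

    def on_epoch_end(self):
        self.calls.append('on_epoch_end')


def train(distiller, batches, epochs=2):
    """Skeleton training loop showing where each hook is called."""
    distiller.on_train_begin()                      # once, before any batch
    losses = []
    for _ in range(epochs):
        for inputs, student_out, teacher_out, student_loss in batches:
            loss = distiller.on_after_compute_loss(  # once per batch
                inputs, student_out, student_loss, teacher_out)
            losses.append(loss)                     # a real loop would backpropagate here
        distiller.on_epoch_end()                    # once per epoch
    return losses


toy = ToyDistiller()
toy_losses = train(toy, [([1.0], [0.5, 0.5], [1.0, 0.0], 0.4)], epochs=2)
print(toy.calls)
print(toy_losses)
```

The key point is that the training loop backpropagates the loss returned by `on_after_compute_loss` rather than the raw student loss.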
The following snippet, taken from the BlendCNN distillation example, shows how to use the hooks in a user-supplied training function:
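from tqdm import tqdm

# epochs, iters, train_dataloader, criterion, optimizer and distiller are
# expected to be defined by the surrounding script.
def train_func(model):
    distiller.on_train_begin()
    for nepoch in range(epochs):
        model.train()
        cnt = 0
        loss_sum = 0.
        iter_bar = tqdm(train_dataloader, desc='Iter (loss=X.XXX)')
        for batch in iter_bar:
            # Each batch carries precomputed teacher logits alongside the inputs.
            teacher_logits, input_ids, segment_ids, input_mask, target = batch
            cnt += 1
            output = model(input_ids, segment_ids, input_mask)
            loss = criterion(output, target)
            # The hook returns the loss to backpropagate, with distillation taken into account.
            loss = distiller.on_after_compute_loss(
                {'input_ids':input_ids, 'segment_ids':segment_ids, 'input_mask':input_mask},
                output,
                loss,
                teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss_sum += loss.item()  # accumulate so the average below is meaningful
            if cnt >= iters:
                break
        print('Average Loss: {}'.format(loss_sum / cnt))
        distiller.on_epoch_end()
...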
In this case, the launcher code looks like the following:
from neural_compressor.experimental import Distillation, common
from neural_compressor.experimental.common.criterion import PyTorchKnowledgeDistillationLoss
distiller = Distillation(args.config)
distiller.student_model = model
distiller.teacher_model = teacher
distiller.criterion = PyTorchKnowledgeDistillationLoss()
distiller.train_func = train_func
model = distiller.fit()