
enable serialize prepacked weights into data file #22256

Merged: 45 commits merged into main on Oct 25, 2024

Conversation

frank-dong-ms (Contributor)

Description

part of #21448
This change is intended to save CPU memory during model load for inference.
It adds a session option, save_prepacked_constant_initializers. With save_prepacked_constant_initializers turned on:

  1. Optimize the model with an inference session; prepacked external initializers are saved into the data file.
  2. Load the optimized model and the external data file containing the prepacked initializers; no prepacking is needed.
  3. Run inference with the optimized model and data file (a minimal usage sketch follows this list).
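For illustration, a minimal Python sketch of this two-phase flow. It assumes the new option is exposed as a session config entry; the exact key name ("session.save_prepacked_constant_initializers" below) and the file paths are assumptions for illustration, not taken from this PR.

```python
import onnxruntime as ort

# --- Phase 1: optimize once and serialize prepacked initializers ----------
so = ort.SessionOptions()
so.optimized_model_filepath = "model_optimized.onnx"  # placeholder output path
# Route the optimized model's large initializers into an external data file.
so.add_session_config_entry(
    "session.optimized_model_external_initializers_file_name",
    "model_optimized.onnx.data",  # placeholder data file
)
# Assumed spelling of the option added by this PR; check the merged code for
# the actual session option / config key.
so.add_session_config_entry("session.save_prepacked_constant_initializers", "1")
ort.InferenceSession("model.onnx", so)  # writes the optimized model + data file

# --- Phase 2: later loads reuse the serialized prepacked weights ----------
# Prepacked blobs come straight from the mmap'ed data file, so no prepacking
# work (and no extra heap copy) happens at session-creation time.
sess = ort.InferenceSession("model_optimized.onnx")
# outputs = sess.run(None, {"input_ids": ...})
```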

Tested with the model Phi-3-mini-instruct-onnx.

With ORT 1.12.0:

![image](https://github.com/user-attachments/assets/3c0337be-f340-4bb7-8f9f-30f3552072ef)

With this change:

![image](https://github.com/user-attachments/assets/23282990-2e1e-4a1f-92de-afa8ed7e6a43)

Peak memory usage dropped from 5.438 GB to 2.726 GB.
This change takes advantage of the fact that ORT loads external initializers with mmap on CPU. Prepacking uses extra memory on the heap, so omitting the prepack step saves that memory (roughly the same size as the external initializers).
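For reference, one way to reproduce this kind of peak-memory comparison on Linux (a rough sketch; the model path is a placeholder, and the script would be run once against the original model and once against the optimized model produced above):

```python
import resource
import sys

import onnxruntime as ort


def peak_rss_gb() -> float:
    # On Linux, ru_maxrss reports the process's peak resident set size in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)


model_path = sys.argv[1] if len(sys.argv) > 1 else "model_optimized.onnx"
sess = ort.InferenceSession(model_path)
print(f"peak RSS after session creation: {peak_rss_gb():.3f} GB")
```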

Next step:
Change all CPU kernels that implement the PrePack method and test them properly. This will be done in the next PR.

Motivation and Context

```python
import os

import numpy as np
import onnx
```
Check notice: Code scanning / CodeQL

Module is imported with 'import' and 'import from' (Note, test)

Module 'onnx' is imported with both 'import' and 'import from'.
Module 'onnxruntime.test.onnx' is imported with both 'import' and 'import from'.
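For context, a small made-up example of the pattern this CodeQL note flags (not the actual test file):

```python
# Mixing import styles for the same module triggers the note:
import onnx
from onnx import TensorProto  # 'onnx' is now imported with both 'import' and 'import from'

# Using one style consistently avoids it, e.g.:
#   import onnx
#   elem_type = onnx.TensorProto.FLOAT
```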
@yuslepukhin requested a review from tianleiwu on October 23, 2024 16:59

@yuslepukhin previously approved these changes on Oct 23, 2024
@github-actions bot left a comment:

You can commit the suggested changes from lintrunner.

Files with review comments (outdated, resolved):
- onnxruntime/core/framework/session_state.cc
- onnxruntime/core/framework/tensorprotoutils.cc
- onnxruntime/core/framework/tensorprotoutils.h
- onnxruntime/core/framework/tensorprotoutils.h
@frank-dong-ms dismissed github-actions[bot]'s stale review on October 24, 2024 05:04:

can't re-request review so dismiss to unblock

@yuslepukhin (Member) left a comment:

:shipit:

@frank-dong-ms merged commit c5b6be0 into main on Oct 25, 2024
90 checks passed
@frank-dong-ms deleted the frdong/prepack_1 branch on October 25, 2024 05:24
yuslepukhin added a commit that referenced this pull request Nov 9, 2024
yuslepukhin added a commit that referenced this pull request Nov 11, 2024
…22788)

This reverts commit c5b6be0.

### Description
Revert

### Motivation and Context
This needs a simpler and more robust approach.
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
guschmue pushed a commit that referenced this pull request Dec 2, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024