
Rework loading #344

Merged
merged 27 commits into from
Jun 8, 2023
Conversation

Narsil
Collaborator

@Narsil Narsil commented May 19, 2023

What does this PR do?

Reworks the loading logic. The idea is to make the loading code cleaner:

  • Remove need for no_init_weights
  • Remove all weird bnb_linear and load_weights and post_load_weights.

New code layout:

  • New class Weights in charge of loading the weights from multiple files into the appropriate tensors (potentially sharded)
  • TP layers are now "shells": they contain the code that decides what kind of sharding is needed, plus the eventual all_reduce. They do not inherit from Linear; instead they contain some kind of Linear
  • The contained linear can be FastLinear, BnbLinear, or (coming next) GPTQ Linear
  • All modeling code is explicitly written for sharding; the process group is just a no-op for non-sharded code (removes a lot of test cases)
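The layout described above can be sketched roughly as follows. This is not the repository's actual code: beyond the `Weights` name, the method names, the sharding scheme, and `TensorParallelColumnLinear` are illustrative assumptions, and plain Python lists stand in for tensors.

```python
# Rough sketch of the new layout (NOT the actual repo code).
# `Weights` is named in the PR description; everything else is an
# illustrative assumption, with lists standing in for tensors.

class Weights:
    """Maps tensor names to the checkpoint file holding them, then loads
    (optionally sharded) tensors on demand."""

    def __init__(self, routing):
        self.routing = routing  # {tensor_name: filename}

    def get_filename(self, tensor_name):
        filename = self.routing.get(tensor_name, None)
        if filename is None:
            raise RuntimeError(f"weight {tensor_name} does not exist")
        return filename

    def _load(self, tensor_name):
        # The real code would open the file returned here (e.g. a
        # safetensors shard); this sketch returns placeholder data.
        self.get_filename(tensor_name)
        return list(range(8))

    def get_sharded(self, tensor_name, dim, rank, world_size):
        # Slice out this rank's shard (only dim 0 in this sketch).
        tensor = self._load(tensor_name)
        block = len(tensor) // world_size
        return tensor[rank * block:(rank + 1) * block]


class TensorParallelColumnLinear:
    """A 'shell' layer: it owns the sharding logic and wraps an inner
    linear (FastLinear, bnb Linear, or GPTQ Linear) instead of
    inheriting from Linear."""

    def __init__(self, linear):
        self.linear = linear

    def forward(self, x):
        # A row-parallel variant would follow this with an all_reduce.
        return self.linear(x)
```

In the non-sharded case the same modeling code runs with world_size=1, so a tensor's shard is the whole tensor, which is how the "process group is a no-op" point above removes the separate non-sharded code paths.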

Screenshot from 2023-05-19 23-19-59

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil Narsil requested a review from OlivierDehaene May 19, 2023 21:08
@Narsil Narsil force-pushed the rework_loading branch 3 times, most recently from 2405e06 to 5e88900 Compare May 23, 2023 15:14
@Narsil Narsil force-pushed the rework_loading branch 2 times, most recently from e81133c to 202de69 Compare May 25, 2023 10:27
@Narsil Narsil changed the title [WIP] Rework loading Rework loading May 25, 2023
@Narsil Narsil marked this pull request as ready for review May 25, 2023 10:28
Narsil and others added 10 commits June 6, 2023 11:06
Big refacto.

Working ?

Working bitsandbytes.

Weights to its own file.

Remove dead file.

Bloom.

TMP.

Finally finished bloom (grr old logic)

SantaCoder.

Remove dead code.

Neox.

Black + ruff.

T5 Support.

Galactica + OPT.

Small fixes.

Fix auto download.

Remove custom transformers.

Missing remove instruction.

Some work on the dockerfile.

Version issues.

Black + ruff after rebase.

Adding custom_kernels

Bad rebase.

Fixing dummy gather + fix Dockerfile

Better fake gather.

Fixes (including more generic loading of starcoder)

Neox shuffle_qkv

Typo fix.

cleanups.

Fixing starcoder/santacoder

Fix santacoder

Fixing neox.

Using the saved rotary embeddings instead of the created ones.
Member

@OlivierDehaene OlivierDehaene left a comment


Nice work!
TL;DR: a lot of cleanup, and Falcon 40B does not work

filename = self.routing.get(tensor_name, None)
if filename is None:
    raise RuntimeError(f"weight {tensor_name} does not exist")
return filename
Member


filename is of type Path. IDK if that could be an issue but it might be safer to cast it to str.
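For illustration, a minimal sketch of the suggested cast (the file name here is made up):

```python
from pathlib import Path

filename = Path("model-00001-of-00002.safetensors")

# Cast defensively before handing the name to code that may expect a str:
filename_str = str(filename)
```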

@@ -0,0 +1 @@
{"inputs":"Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.\n-----\n<|prompter|>Why is butter a great building material for skyscrapers? Think step by step.</s><|assistant|>","parameters":{"temperature": 0.75, "top_p": 0.95, "repetition_penalty": 1.2, "top_k": 50, "truncate": 1000, "max_new_tokens": 1024}}
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this file?

Collaborator Author


My tests, removing.

@@ -17,7 +16,7 @@ install-torch:
# Install specific version of torch
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118 --no-cache-dir

install: gen-server install-torch install-transformers
install: gen-server install-torch
Member


You don't install the custom kernels by default?

Collaborator Author


Oh true, I forgot to re-add it here.

@OlivierDehaene OlivierDehaene merged commit abd58ff into main Jun 8, 2023
@OlivierDehaene OlivierDehaene deleted the rework_loading branch June 8, 2023 12:51
@jshin49

jshin49 commented Jun 8, 2023

Two quick questions:

  1. Is there a specific reason to remove server/Makefile-transformers?
  2. Is there a specific reason server/Makefile-flash-att is never called during the default local build (make install in the root Makefile)?

@Narsil
Collaborator Author

Narsil commented Jun 8, 2023

  1. Yes, we're now running on transformers@main, which makes maintenance easier for us.
  2. Yes, it's extremely slow to build, which is why we don't do it by default and instead recommend using the Docker image (flash attention is included in it).
