
[Feat] return hidden states #3364

Open · wants to merge 30 commits into main

Conversation

@Jackmin801 commented Feb 7, 2025

Motivation

This PR adds a return_hidden_states argument to ServerArgs, which makes the results contain the last-layer hidden states in output["meta_info"]["hidden_states"].
These hidden states are useful, for example, for verifying computations (e.g. https://arxiv.org/abs/2501.16007).

Modifications

  • Add return_hidden_states to ServerArgs
  • Change the logic that determines capture_hidden_mode to accommodate return_hidden_states
  • Modify the scheduler's process_batch_results to save the hidden states to the Req
  • Add return_hidden_states and hidden_states to the necessary dataclasses
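The capture-mode selection in the second bullet can be sketched as a small standalone function (a toy illustration, not SGLang's code: the real CaptureHiddenMode enum and spec_info object live in SGLang, and here spec_info is simplified to the mode value itself):

```python
# Toy sketch of the capture_hidden_mode selection described above.
# CaptureHiddenMode and spec_info are simplified stand-ins for SGLang's types.
from enum import Enum, auto

class CaptureHiddenMode(Enum):
    NULL = auto()   # do not capture hidden states
    FULL = auto()   # capture hidden states for all positions

def resolve_capture_hidden_mode(return_hidden_states, spec_info=None):
    # The server-level flag takes precedence; otherwise defer to the
    # speculative-decoding info, falling back to NULL.
    if return_hidden_states:
        return CaptureHiddenMode.FULL
    if spec_info is not None:
        return spec_info
    return CaptureHiddenMode.NULL
```

This mirrors the precedence order in the diff discussed below: the server argument wins, speculative decoding comes next, and NULL is the default.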

Script used to test changes

# launch the offline engine
from transformers import AutoTokenizer

import sglang as sgl

def main():
    MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    llm = sgl.Engine(
        model_path=MODEL_NAME,
        skip_tokenizer_init=True,
        disable_cuda_graph=False,
        return_hidden_states=True,  # enable the new flag so hidden states are returned
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

    input_ids = tokenizer(prompts).input_ids
    # skip_tokenizer_init=True, so pass token ids rather than raw text
    outputs = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
    for input_id, output in zip(input_ids, outputs):
        print("===============================")
        print(input_id)
        print(output)
        print()
        if "token_ids" in output:
            print(input_id, output["token_ids"], len(input_id), len(output["token_ids"]))
        else:
            print(output["text"])
        if "hidden_states" in output["meta_info"]:
            print(
                [i.shape for i in output["meta_info"]["hidden_states"]],
                len(output["meta_info"]["hidden_states"]),
            )

if __name__ == "__main__":
    main()

Checklist

  • Format your code according to the Code Formatting with Pre-Commit.
  • Add unit tests as outlined in the Running Unit Tests.
  • Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

@zhaochenyang20 (Collaborator)

This is good to see. But could you update our docs to demonstrate the usage and add unit tests for your feature?

docs/backend/server_arguments.md (outdated)
test/srt/test_srt_engine.py (outdated)
@zhaochenyang20 (Collaborator)

Thanks. I will try to get someone familiar with hidden states to help.

@zhaochenyang20 (Collaborator) left a comment

Great! Also, could you add the API to the server as well, not only the engine, like how we do for update_weights_from_dist? You can refer to the Engine API and Server / HTTPS API docs.

Comment on lines +182 to +193
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm.shutdown()"
]
},
{
"cell_type": "markdown",
"metadata": {},
@zhaochenyang20 (Collaborator) commented Feb 8, 2025

Thanks. I think we should not change the docs of the offline API. Instead, we should change this:

https://docs.sglang.ai/backend/native_api.html

Also, I think the best way to do this is not to add a serving argument, but rather to make a new native API instead, just like:

@app.post("/update_weights_from_distributed")

And this:

def update_weights_from_distributed(self, name: str, dtype, shape):

This would be much easier to use and would not require launching a specific engine, which costs a lot of time in the docs CI.

@zhaochenyang20 (Collaborator)

Also, update the beginning of the native API docs.

https://docs.sglang.ai/backend/native_api.html

Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:

/generate (text generation model)

/get_model_info

/get_server_info

/health

/health_generate

/flush_cache

/update_weights

/encode (embedding model)

/classify (reward model)

We mainly use requests to test these APIs in the following examples. You can also use curl.
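For instance, a hypothetical client call to the /generate endpoint with requests might look like the sketch below. The server URL, the payload fields, and where hidden states would surface in the response are assumptions based on this PR's discussion, not the final API:

```python
# Hypothetical sketch of calling the native /generate API with requests.
# Payload field names follow the documented /generate examples; the hidden
# states location in the response is an assumption based on this PR.
import requests

def build_generate_payload(prompt: str, max_new_tokens: int = 10) -> dict:
    # Build the JSON body for a single-prompt /generate request.
    return {
        "text": prompt,
        "sampling_params": {"temperature": 0.8, "max_new_tokens": max_new_tokens},
    }

if __name__ == "__main__":
    # Requires a running SGLang server, e.g. on localhost:30000.
    resp = requests.post(
        "http://localhost:30000/generate",
        json=build_generate_payload("The capital of France is"),
    )
    out = resp.json()
    print(out["text"])
    # If the server were started with return_hidden_states, the hidden
    # states would appear under out["meta_info"]["hidden_states"].
```

The same request can of course be issued with curl by posting the equivalent JSON body.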

@zhaochenyang20 (Collaborator)

You can add a separate test file as test_hidden_state.py, but also add it in https://github.com/sgl-project/sglang/blob/main/test/srt/run_suite.py

Just like test_update_weights_from_disk in the run_suite.

@zhaochenyang20 (Collaborator) left a comment

The test looks good.

Comment on lines +351 to +357
CaptureHiddenMode.FULL
if self.model_runner.server_args.return_hidden_states
else (
spec_info.capture_hidden_mode
if spec_info
else CaptureHiddenMode.NULL
)
@Jackmin801 (Author)

@zhaochenyang20 What I meant is here. It seems it is necessary for capture_hidden_mode to be known at engine init time. Otherwise, the decode CUDA graph will not contain the return-hidden-states logic, and this can't be changed by sampling args.
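The constraint described here can be illustrated with a toy Python analogue of graph capture (an illustration of the general idea only, not SGLang code): once the decode step is "captured", the hidden-state branch is fixed inside the captured callable, and later calls simply replay it.

```python
# Toy analogue (not SGLang code) of why a captured CUDA graph must know the
# capture_hidden_mode at capture time: the flag is baked into the captured
# callable, so per-request sampling args cannot change it afterwards.

def capture_decode_graph(return_hidden_states: bool):
    # "Capture": build the fixed sequence of ops once, with the flag baked in.
    def replay(token: int) -> dict:
        out = {"next_token": token + 1}   # stand-in for one decode step
        if return_hidden_states:          # this branch was fixed at capture
            out["hidden_states"] = [0.0]  # stand-in for copying hidden states
        return out
    return replay

graph_without = capture_decode_graph(return_hidden_states=False)
graph_with = capture_decode_graph(return_hidden_states=True)
```

Replaying graph_without can never produce hidden states, no matter what a later request asks for, which is why the flag has to be a server argument rather than a sampling parameter.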
