fix!: update eval-tasks -> benchmarks #1032

yanxi0830 · 2025-02-10T17:38:27Z

What does this PR do?

Update /eval-tasks to /benchmarks
⚠️ Remove the differentiation between app vs. benchmark eval task configs. Now we only have BenchmarkConfig. The overloaded benchmark is confusing and does not add any value. Backward compatibility is kept because the "type" field is not used anywhere.

Test Plan

This change is backward compatible
Run notebook test with

pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb

raghotham · 2025-02-10T17:42:38Z

Why not just move the code over to /eval? Also, isn't this a breaking change? We should probably deprecate the eval-task methods first.

yanxi0830 · 2025-02-10T18:14:15Z

Why not just move the code over to /eval? Also, isn't this a breaking change? We should probably deprecate the eval-task methods first.

The /eval-tasks endpoint is more like /models (which is for CRUD on resources), so I am keeping it in a separate file, like the rest of the resource APIs.

llama_stack/distribution/routers/routing_tables.py

leseb

Please update CHANGELOG.md to reflect the breaking change, thanks!

yanxi0830 · 2025-02-11T17:51:49Z

Please update CHANGELOG.md to reflect the breaking change, thanks!

@leseb We removed the CHANGELOG.md in favour of releases. Wondering if you have a suggested template on where to put these breaking changes?

hardikjshah · 2025-02-12T01:56:30Z

Let's rename tasks to benchmarks as discussed offline.

terrytangyuan · 2025-02-12T02:10:54Z

We can update the PR title with a "fix!: xxx" prefix to indicate breaking changes.

yanxi0830 · 2025-02-12T19:17:47Z

Plan for deprecation:

0.1.4 --> keep backward compat (/eval-tasks and /benchmarks both work; /eval-tasks is marked as deprecated)
0.1.5 --> full deprecation (/eval-tasks no longer works)
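
The two-release plan above can be illustrated with a small route table in which the old path stays registered as an alias for one release and emits a deprecation warning. This is only a sketch: `ROUTES`, `DEPRECATED_ALIASES`, `dispatch`, and the `/v1/benchmarks` path are hypothetical names chosen for illustration, not the server's actual routing code.

```python
import warnings


def list_benchmarks():
    """Hypothetical handler standing in for the real benchmark listing logic."""
    return {"data": []}


# Canonical routes (new naming).
ROUTES = {"/v1/benchmarks": list_benchmarks}

# Old paths kept as aliases during the deprecation release. Deleting an entry
# here corresponds to the later step where the old path no longer works.
DEPRECATED_ALIASES = {"/v1/eval-tasks": "/v1/benchmarks"}


def dispatch(path):
    """Resolve a path to its handler, warning when a deprecated alias is used."""
    if path in DEPRECATED_ALIASES:
        new_path = DEPRECATED_ALIASES[path]
        warnings.warn(
            f"{path} is deprecated; use {new_path} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        path = new_path
    return ROUTES[path]()
```

With this shape, existing clients keep working unchanged while the warning tells them which path to move to.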

cc @terrytangyuan @ashwinb @hardikjshah @raghotham

terrytangyuan · 2025-02-12T19:20:37Z

Kinda related but not 100% sure if considered breaking: #1023. It might be a good idea to change /vector-dbs to something else as well to be consistent.

ashwinb · 2025-02-12T19:22:50Z

I actually want to combine /vector-dbs and /vector-io into one API (same thing about shields and safety)… We had chatted about this unification a while ago @raghotham.

terrytangyuan · 2025-02-13T15:13:50Z

I actually want to combine /vector-dbs and /vector-io into one API (same thing about shields and safety)… We had chatted about this unification a while ago @raghotham.

Sure. My PR #1023 only modifies the Python classes. We can probably work on merging those two APIs separately

raghotham · 2025-02-13T15:26:28Z

Let us please come up with a plan for all APIs (will inference and models be merged as well?). Also, it would be good to consider how we would handle things when we eventually add a notion of namespacing for resources (like projects) to help with access control, etc.

hardikjshah

Looks good. When do you want to update the eval notebook to use the new APIs?

hardikjshah · 2025-02-13T16:15:41Z

llama_stack/apis/datatypes.py

@@ -28,7 +28,7 @@ class Api(Enum):
    vector_dbs = "vector_dbs"
    datasets = "datasets"
    scoring_functions = "scoring_functions"
-    eval_tasks = "eval_tasks"


Should we not leave this as is while we are in the deprecation period?

hardikjshah · 2025-02-13T16:19:30Z

llama_stack/distribution/ui/page/distribution/eval_tasks.py

@@ -8,12 +8,12 @@
 from modules.api import llama_stack_api


-def eval_tasks():


nit: rename this file to benchmarks.py

hardikjshah · 2025-02-13T16:21:31Z

llama_stack/providers/tests/eval/test_eval.py

            eval_stack[Api.eval],
-            eval_stack[Api.eval_tasks],
+            eval_stack[Api.benchmarks],


Can we keep one test with the old eval_task APIs, so that we know they still work?

The Api enum is used for the server implementation only. I prefer that (1) we keep the server-side internal naming consistent by moving everything to "benchmarks"; (2) we keep backward compatibility by still supporting the /eval-tasks endpoint but marking it deprecated, so that the client SDK still works without any updates; and (3) we fully deprecate the old naming as soon as possible, in the next release. It is confusing and error-prone to have these two conventions living in the codebase for long, and people writing new eval providers should not be confused and frustrated by future refactors.

yanxi0830 · 2025-02-13T16:56:04Z

Looks good. When do you want to update the eval notebook to use the new APIs?

The eval notebook with the old API works with no client update. I think we should update the notebook to use the new APIs after the client update and package release.

raghotham · 2025-02-13T17:09:24Z

llama_stack/apis/eval/eval.py

    type: Literal["benchmark"] = "benchmark"
    eval_candidate: EvalCandidate
-    num_examples: Optional[int] = Field(


num_examples no longer needed?

No, this comes from the "Remove differentiation between app v.s. benchmark eval task config. Now we only have BenchmarkConfig". num_examples is still being kept in BenchmarkConfig.

raghotham · 2025-02-13T17:10:09Z

llama_stack/apis/eval/eval.py

        self,
        task_id: str,
        input_rows: List[Dict[str, Any]],
        scoring_functions: List[str],
-        task_config: EvalTaskConfig,
+        task_config: BenchmarkConfig,


Suggested change

task_config: BenchmarkConfig,

benchmark_config: BenchmarkConfig,

This is the deprecated function, so its parameter name must stay the same; otherwise we would break backward compatibility.
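
To see why the parameter name matters, consider a caller that passes the config by keyword. The sketch below uses simplified stand-in functions, not the real eval API methods, and assumes that existing clients call the deprecated method with `task_config=`.

```python
def evaluate_rows_deprecated(task_id, input_rows, scoring_functions, task_config):
    """Stand-in for the deprecated method: keeps the old keyword name."""
    return {"task_id": task_id, "config": task_config}


def evaluate_rows_renamed(task_id, input_rows, scoring_functions, benchmark_config):
    """Stand-in showing what the suggested rename would do to the signature."""
    return {"task_id": task_id, "config": benchmark_config}


# An existing client call that passes the config by keyword (illustrative values).
old_call = dict(
    task_id="my-task",
    input_rows=[{"input": "2+2"}],
    scoring_functions=["basic::equality"],
    task_config={"type": "benchmark"},
)

# The deprecated signature still accepts the old call unchanged.
result = evaluate_rows_deprecated(**old_call)

try:
    evaluate_rows_renamed(**old_call)
    rename_is_compatible = True
except TypeError:
    # Python rejects the unknown keyword `task_config`.
    rename_is_compatible = False
```

The same reasoning applies to request bodies sent over HTTP: a client that still sends the old field name would no longer match a handler expecting the new one.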

# What does this PR do?

Support listing all for `llama stack list-providers`.

For ease of reading, sort the output rows by type.

Before the change.

```
llama stack list-providers
usage: llama stack list-providers [-h] {inference,safety,agents,vector_io,datasetio,scoring,eval,post_training,tool_runtime,telemetry}
llama stack list-providers: error: the following arguments are required: api
```

After the change.

```
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| API Type | Provider Type | PIP Package Dependencies |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| agents | inline::meta-reference | matplotlib,pillow,pandas,scikit-learn,aiosqlite,psycopg2-binary,redis |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| datasetio | inline::localfs | pandas |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| datasetio | remote::huggingface | datasets |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| eval | inline::meta-reference | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | inline::meta-reference | accelerate,blobfile,fairscale,torch,torchvision,transformers,zmq,lm-format- |
| | | enforcer,sentence-transformers |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | inline::meta-reference-quantized | accelerate,blobfile,fairscale,torch,torchvision,transformers,zmq,lm-format- |
| | | enforcer,sentence-transformers,fbgemm-gpu,torchao==0.5.0 |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | inline::sentence-transformers | sentence-transformers |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | inline::vllm | vllm |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::bedrock | boto3 |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::cerebras | cerebras_cloud_sdk |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::databricks | openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::fireworks | fireworks-ai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::groq | groq |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::hf::endpoint | huggingface_hub,aiohttp |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::hf::serverless | huggingface_hub,aiohttp |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::nvidia | openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::ollama | ollama,aiohttp |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::runpod | openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::sambanova | openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::tgi | huggingface_hub,aiohttp |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::together | together |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| inference | remote::vllm | openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| post_training | inline::torchtune | torch,torchtune==0.5.0,torchao==0.8.0,numpy |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| safety | inline::code-scanner | codeshield |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| safety | inline::llama-guard | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| safety | inline::meta-reference | transformers,torch --index-url https://download.pytorch.org/whl/cpu |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| safety | inline::prompt-guard | transformers,torch --index-url https://download.pytorch.org/whl/cpu |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| safety | remote::bedrock | boto3 |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| scoring | inline::basic | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| scoring | inline::braintrust | autoevals,openai |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| scoring | inline::llm-as-judge | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| telemetry | inline::meta-reference | opentelemetry-sdk,opentelemetry-exporter-otlp-proto-http |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | inline::code-interpreter | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | inline::rag-runtime | |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | remote::bing-search | requests |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | remote::brave-search | requests |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | remote::model-context-protocol | mcp |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | remote::tavily-search | requests |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| tool_runtime | remote::wolfram-alpha | requests |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | inline::chromadb | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,chromadb |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | inline::faiss | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,faiss-cpu |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | inline::meta-reference | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,faiss-cpu |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | remote::chromadb | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,chromadb- |
| | | client |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | remote::pgvector | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no- |
| | | deps,psycopg2-binary |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | remote::qdrant | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,qdrant- |
| | | client |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
| vector_io | remote::weaviate | blobfile,chardet,pypdf,tqdm,numpy,scikit- |
| | | learn,scipy,nltk,sentencepiece,transformers,torch torchvision --index-url |
| | | https://download.pytorch.org/whl/cpu,sentence-transformers --no-deps,weaviate- |
| | | client |
+---------------+----------------------------------+----------------------------------------------------------------------------------+
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Manually.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <[email protected]>
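
One way to get the behavior described for `list-providers` (the API argument becomes optional, and rows are sorted by type) is an optional positional argument in `argparse`. The sketch below is an assumption about the approach, not the CLI's actual code: the `PROVIDERS` registry, `build_parser`, and `rows_for` are hypothetical names, and the registry holds only a few entries taken from the table above.

```python
import argparse

# Hypothetical registry: API name -> list of (provider_type, pip_packages).
PROVIDERS = {
    "scoring": [
        ("inline::braintrust", ["autoevals", "openai"]),
        ("inline::basic", []),
    ],
    "eval": [("inline::meta-reference", [])],
}


def build_parser():
    parser = argparse.ArgumentParser(prog="list-providers")
    # nargs="?" makes the positional optional; None then means "all APIs".
    parser.add_argument("api", nargs="?", choices=sorted(PROVIDERS), default=None)
    return parser


def rows_for(api):
    """Return (api, provider_type, packages) rows, sorted by API then provider."""
    apis = [api] if api else sorted(PROVIDERS)
    return [
        (name, provider_type, ",".join(packages))
        for name in apis
        for provider_type, packages in sorted(PROVIDERS[name])
    ]
```

Calling `rows_for(build_parser().parse_args([]).api)` lists every API, while passing an API name restricts the output to that API.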

# What does this PR do?

This adds a note to ensure pull requests follow the conventional commits format, along with a link to that format, in CONTRIBUTING.md. One of the pull-request checks enforces PR titles that match this format, so it's good to be upfront about this expectation before a new developer opens a PR.

Signed-off-by: Ben Browning <[email protected]>

# What does this PR do?

The remote-vllm provider was not passing logprobs options from CompletionRequest or ChatCompletionRequests through to the OpenAI client parameters. I manually verified this, as well as observed this provider failing `TestInference::test_completion_logprobs`. This was filed as issue #1073.

This fixes that by passing the `logprobs.top_k` value through to the parameters we pass into the OpenAI client.

Additionally, this fixes a bug in `test_text_inference.py` where it mistakenly assumed chunk.delta were of type `ContentDelta` for completion requests. The deltas are of type `ContentDelta` for chat completion requests, but for basic completion requests the deltas are of type string. This test was likely failing for other providers that did properly support logprobs because of this latter issue in the test, which was hit while fixing the above issue with the remote-vllm provider.

(Closes #1073)

## Test Plan

First, you need a vllm running. I ran one locally like this:

```
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8001 --enable-auto-tool-choice --tool-call-parser llama3_json
```

Next, run test_text_inference.py against this vllm using the remote vllm provider like this:

```
VLLM_URL="http://localhost:8001/v1" python -m pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py --providers "inference=vllm_remote"
```

Before my change, the test failed with this error:

```
llama_stack/providers/tests/inference/test_text_inference.py:155: in test_completion_logprobs
    assert 1 <= len(response.logprobs) <= 5
E   TypeError: object of type 'NoneType' has no len()
```

After my change, the test passes.

[//]: # (## Documentation)

Signed-off-by: Ben Browning <[email protected]>
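
The two fixes described in this PR can be sketched roughly as follows. The request classes here are trimmed-down, hypothetical stand-ins; the `logprobs` key is an assumption modeled on the integer `logprobs` parameter of the OpenAI completions API; and the `.text` attribute on chat deltas is likewise assumed for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LogProbConfig:
    """Hypothetical stand-in for the request's logprobs option."""
    top_k: int = 0


@dataclass
class CompletionRequest:
    """Hypothetical, trimmed-down request type."""
    model: str
    content: str
    logprobs: Optional[LogProbConfig] = None


def build_openai_params(request: CompletionRequest) -> Dict[str, Any]:
    """Build client parameters, forwarding logprobs.top_k when it is set."""
    params: Dict[str, Any] = {"model": request.model, "prompt": request.content}
    if request.logprobs is not None:
        # Without this step the option is silently dropped and the response
        # comes back with no logprobs at all.
        params["logprobs"] = request.logprobs.top_k
    return params


def delta_text(delta: Any) -> str:
    """Completion deltas are plain strings; chat deltas are objects (assumed `.text`)."""
    return delta if isinstance(delta, str) else delta.text
```

Handling both delta shapes in one helper keeps the test from assuming the chat-style type for basic completion responses.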

# What does this PR do?

This commit enhances the signal handling mechanism in the server by improving the `handle_signal` (previously handle_sigint) function. It now properly retrieves the signal name, ensuring clearer logging when a termination signal is received. Additionally, it cancels all running tasks and waits for their completion before stopping the event loop, allowing for a more graceful shutdown. Support for handling SIGTERM has also been added alongside SIGINT.

Before the changes, handle_sigint used asyncio.run(run_shutdown()). However, asyncio.run() is meant to start a new event loop, and calling it inside an existing one (like when running Uvicorn) raises an error. The fix replaces asyncio.run(run_shutdown()) with an async function scheduled on the existing loop using loop.create_task(shutdown()). This ensures that the shutdown coroutine runs within the current event loop instead of trying to create a new one.

Furthermore, this commit updates the project dependencies. `fastapi` and `uvicorn` have been added to the development dependencies in `pyproject.toml` and `uv.lock`, ensuring that the necessary packages are available for development and execution.

Closes: #1043 Signed-off-by: Sébastien Han <[email protected]> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Run a server and send SIGINT: ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml Using config file: llama_stack/templates/ollama/run.yaml Run configuration: apis: - agents - datasetio - eval - inference - safety - scoring - telemetry - tool_runtime - vector_io container_image: null datasets: [] eval_tasks: [] image_name: ollama metadata_store: db_path: /Users/leseb/.llama/distributions/ollama/registry.db namespace: null type: sqlite models: - metadata: {} model_id: meta-llama/Llama-3.2-3B-Instruct model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType - llm provider_id: ollama provider_model_id: null - metadata: embedding_dimension: 384 model_id: all-MiniLM-L6-v2 model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType - embedding provider_id: sentence-transformers provider_model_id: null providers: agents: - config: persistence_store: db_path: /Users/leseb/.llama/distributions/ollama/agents_store.db namespace: null type: sqlite provider_id: meta-reference provider_type: inline::meta-reference datasetio: - config: {} provider_id: huggingface provider_type: remote::huggingface - config: {} provider_id: localfs provider_type: inline::localfs eval: - config: {} provider_id: meta-reference provider_type: inline::meta-reference inference: - config: url: http://localhost:11434 provider_id: ollama provider_type: remote::ollama - config: {} provider_id: sentence-transformers provider_type: inline::sentence-transformers safety: - config: {} provider_id: llama-guard provider_type: inline::llama-guard scoring: - config: {} provider_id: basic provider_type: inline::basic - config: {} provider_id: llm-as-judge provider_type: inline::llm-as-judge - config: 
openai_api_key: '********' provider_id: braintrust provider_type: inline::braintrust telemetry: - config: service_name: llama-stack sinks: console,sqlite sqlite_db_path: /Users/leseb/.llama/distributions/ollama/trace_store.db provider_id: meta-reference provider_type: inline::meta-reference tool_runtime: - config: api_key: '********' max_results: 3 provider_id: brave-search provider_type: remote::brave-search - config: api_key: '********' max_results: 3 provider_id: tavily-search provider_type: remote::tavily-search - config: {} provider_id: code-interpreter provider_type: inline::code-interpreter - config: {} provider_id: rag-runtime provider_type: inline::rag-runtime vector_io: - config: kvstore: db_path: /Users/leseb/.llama/distributions/ollama/faiss_store.db namespace: null type: sqlite provider_id: faiss provider_type: inline::faiss scoring_fns: [] server: port: 8321 tls_certfile: null tls_keyfile: null shields: [] tool_groups: - args: null mcp_endpoint: null provider_id: tavily-search toolgroup_id: builtin::websearch - args: null mcp_endpoint: null provider_id: rag-runtime toolgroup_id: builtin::rag - args: null mcp_endpoint: null provider_id: code-interpreter toolgroup_id: builtin::code_interpreter vector_dbs: [] version: '2' INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:213: Resolved 31 providers INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => ollama INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => sentence-transformers INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: models => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inference => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-vector_io => faiss INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-safety => llama-guard INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: 
shields => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: safety => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_dbs => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_io => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => brave-search INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => tavily-search INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => code-interpreter INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => rag-runtime INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_groups => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_runtime => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: agents => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => huggingface INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => localfs INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasets => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasetio => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: telemetry => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => basic INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => llm-as-judge INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => braintrust INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring_functions => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring => __autorouted__ 
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-eval => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval_tasks => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inspect => __builtin__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:216: INFO 2025-02-12 10:21:03,723 llama_stack.providers.remote.inference.ollama.ollama:148: checking connectivity to Ollama at `http://localhost:11434`... INFO 2025-02-12 10:21:03,734 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK" INFO 2025-02-12 10:21:03,843 faiss.loader:148: Loading faiss. INFO 2025-02-12 10:21:03,865 faiss.loader:150: Successfully loaded faiss. INFO 2025-02-12 10:21:03,868 faiss:173: Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. Warning: `bwrap` is not available. Code interpreter tool will not work correctly. INFO 2025-02-12 10:21:04,315 datasets:54: PyTorch version 2.6.0 available. INFO 2025-02-12 10:21:04,556 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK" INFO 2025-02-12 10:21:04,557 llama_stack.providers.utils.inference.embedding_mixin:42: Loading sentence transformer for all-MiniLM-L6-v2... 
INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:210: Use pytorch device_name: mps INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:218: Load pretrained SentenceTransformer: all-MiniLM-L6-v2 INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: all-MiniLM-L6-v2 served by sentence-transformers INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: meta-llama/Llama-3.2-3B-Instruct served by ollama INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::equality served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::regex_parser_multiple_choice_answer served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::subset_of served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-correctness served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-relevancy served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-similarity served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-entity-recall served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-precision served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-recall served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-relevancy served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::factuality served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::faithfulness served by braintrust INFO 
2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::405b-simpleqa served by llm-as-judge
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::base served by llm-as-judge
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::code_interpreter served by code-interpreter
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::rag served by rag-runtime
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::websearch served by tavily-search
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:106:
Serving API eval
 POST /v1/eval/tasks/{task_id}/evaluations
 DELETE /v1/eval/tasks/{task_id}/jobs/{job_id}
 GET /v1/eval/tasks/{task_id}/jobs/{job_id}/result
 GET /v1/eval/tasks/{task_id}/jobs/{job_id}
 POST /v1/eval/tasks/{task_id}/jobs
Serving API agents
 POST /v1/agents
 POST /v1/agents/{agent_id}/session
 POST /v1/agents/{agent_id}/session/{session_id}/turn
 DELETE /v1/agents/{agent_id}
 DELETE /v1/agents/{agent_id}/session/{session_id}
 GET /v1/agents/{agent_id}/session/{session_id}
 GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}/step/{step_id}
 GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}
Serving API scoring_functions
 GET /v1/scoring-functions/{scoring_fn_id}
 GET /v1/scoring-functions
 POST /v1/scoring-functions
Serving API safety
 POST /v1/safety/run-shield
Serving API inspect
 GET /v1/health
 GET /v1/inspect/providers
 GET /v1/inspect/routes
 GET /v1/version
Serving API tool_runtime
 POST /v1/tool-runtime/invoke
 GET /v1/tool-runtime/list-tools
 POST /v1/tool-runtime/rag-tool/insert
 POST /v1/tool-runtime/rag-tool/query
Serving API datasetio
 POST /v1/datasetio/rows
 GET /v1/datasetio/rows
Serving API shields
 GET /v1/shields/{identifier}
 GET /v1/shields
 POST /v1/shields
Serving API eval_tasks
 GET /v1/eval-tasks/{eval_task_id}
 GET /v1/eval-tasks
 POST /v1/eval-tasks
Serving API models
 GET /v1/models/{model_id}
 GET /v1/models
 POST /v1/models
 DELETE /v1/models/{model_id}
Serving API datasets
 GET /v1/datasets/{dataset_id}
 GET /v1/datasets
 POST /v1/datasets
 DELETE /v1/datasets/{dataset_id}
Serving API vector_io
 POST /v1/vector-io/insert
 POST /v1/vector-io/query
Serving API inference
 POST /v1/inference/chat-completion
 POST /v1/inference/completion
 POST /v1/inference/embeddings
Serving API tool_groups
 GET /v1/tools/{tool_name}
 GET /v1/toolgroups/{toolgroup_id}
 GET /v1/toolgroups
 GET /v1/tools
 POST /v1/toolgroups
 DELETE /v1/toolgroups/{toolgroup_id}
Serving API vector_dbs
 GET /v1/vector-dbs/{vector_db_id}
 GET /v1/vector-dbs
 POST /v1/vector-dbs
 DELETE /v1/vector-dbs/{vector_db_id}
Serving API scoring
 POST /v1/scoring/score
 POST /v1/scoring/score-batch
Serving API telemetry
 GET /v1/telemetry/traces/{trace_id}/spans/{span_id}
 GET /v1/telemetry/spans/{span_id}/tree
 GET /v1/telemetry/traces/{trace_id}
 POST /v1/telemetry/events
 GET /v1/telemetry/spans
 GET /v1/telemetry/traces
 POST /v1/telemetry/spans/export

Listening on ['::', '0.0.0.0']:5001
INFO: Started server process [65372]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
^CINFO: Shutting down
INFO: Finished server process [65372]
Received signal SIGINT (2). Exiting gracefully...
INFO 2025-02-12 10:21:11,215 __main__:151: Shutting down ModelsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down InferenceRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ShieldsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down SafetyRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorDBsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorIORouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolGroupsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolRuntimeRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down MetaReferenceAgentsImpl
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetIORouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down TelemetryAdapter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringFunctionsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalTasksRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DistributionInspectImpl
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

Signed-off-by: Sébastien Han <[email protected]>

# What does this PR do?

[Provide a short summary of what this PR does and why. Link to relevant issues if applicable.]

Since the subcommands use `MODEL_ID`, it would be better to use it in `model list` as well, so that it is easy to find.

```
$ llama model verify-download --help
usage: llama model verify-download [-h] --model-id MODEL_ID <<

$ llama model describe --help
usage: llama model describe [-h] -m MODEL_ID <<

$ llama download --help
  --model-id MODEL_ID  See `llama model list` or `llama model list --show-all` for the list of available models

before:
$ llama model list
+-----------------------------------------+-----------------------------------------------------+----------------+
| Model Descriptor                        | Hugging Face Repo                                   | Context Length |
+-----------------------------------------+-----------------------------------------------------+----------------+

after:
$ llama model list
+-----------------------------------------+-----------------------------------------------------+----------------+
| Model Descriptor                        | Model ID                                            | Context Length |
+-----------------------------------------+-----------------------------------------------------+----------------+
| Llama3.1-8B                             | meta-llama/Llama-3.1-8B                             | 128K           |
+-----------------------------------------+-----------------------------------------------------+----------------+
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

[Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <[email protected]>
Co-authored-by: reidliu <[email protected]>

hardikjshah · 2025-02-13T23:51:30Z

Can you just make sure both notebooks work e2e before merging, just to double confirm since a bunch of changes have come in after the initial test.

# What does this PR do?

- Update `/eval-tasks` to `/benchmarks`
- ⚠️ Remove the differentiation between `app` vs. `benchmark` eval task configs. Now we only have `BenchmarkConfig`. The overloaded `benchmark` is confusing and does not add any value. Backward compatibility is kept because the "type" is not used anywhere.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

- This change is backward compatible
- Run notebook test with

```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

<img width="846" alt="image" src="https://github.com/user-attachments/assets/d2fc06a7-593a-444f-bc1f-10ab9b0c843d" />

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Ihar Hrachyshka <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
Signed-off-by: Sébastien Han <[email protected]>
Signed-off-by: reidliu <[email protected]>
Co-authored-by: Ihar Hrachyshka <[email protected]>
Co-authored-by: Ben Browning <[email protected]>
Co-authored-by: Sébastien Han <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
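
For readers who want a concrete picture of the compatibility window discussed in this PR (both `/eval-tasks` and `/benchmarks` working, with the former marked as deprecated), here is a minimal sketch of one way a legacy path could be mapped onto the new one. This is not the code in this PR: `resolve_benchmark_path` and the two prefix constants are hypothetical names chosen for illustration, and the `/v1` prefix is assumed from the route listing in the server log.

```python
import warnings

# Assumed prefixes: the /v1 prefix is taken from the server log output;
# the constant names themselves are hypothetical.
LEGACY_PREFIX = "/v1/eval-tasks"
NEW_PREFIX = "/v1/benchmarks"


def resolve_benchmark_path(path: str) -> str:
    """Map a legacy /v1/eval-tasks path onto /v1/benchmarks.

    Hypothetical helper, not part of llama-stack. Paths that do not
    belong to the legacy resource are returned unchanged.
    """
    # Match the prefix exactly or followed by "/", so that an unrelated
    # path such as /v1/eval-tasks-foo is left alone.
    if path == LEGACY_PREFIX or path.startswith(LEGACY_PREFIX + "/"):
        warnings.warn(
            f"{LEGACY_PREFIX} is deprecated; use {NEW_PREFIX} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return NEW_PREFIX + path[len(LEGACY_PREFIX):]
    return path
```

A caller on the old path keeps working during the window but sees a `DeprecationWarning`; once the legacy endpoint is removed, the branch can simply be deleted.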

update eval-tasks -> eval/task

f1844a8

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 10, 2025

yanxi0830 marked this pull request as ready for review February 10, 2025 17:39

yanxi0830 requested review from ashwinb, hardikjshah, dltn, raghotham, dineshyv, vladimirivic, sixianyi0721, ehhuang and terrytangyuan as code owners February 10, 2025 17:39

yanxi0830 added 2 commits February 10, 2025 09:41

update eval_task_id -> task_id

5fe3ddb

openapi

b11c38e

yanxi0830 added 3 commits February 10, 2025 10:47

fix path

e013b90

deprecation in OpenAPI spec

79e7253

deprecation

65ffcdd

terrytangyuan reviewed Feb 11, 2025

llama_stack/distribution/routers/routing_tables.py (outdated, resolved)

leseb reviewed Feb 11, 2025

yanxi0830 changed the title ~~fix: update eval-tasks -> eval/task~~ fix!: update eval-tasks -> benchmark Feb 12, 2025

yanxi0830 added 2 commits February 12, 2025 20:29

naming update

9a8f402

replace

b20742f

yanxi0830 changed the title ~~chore: update eval-tasks -> benchmarks~~ fix!: update eval-tasks -> benchmarks Feb 13, 2025

hardikjshah reviewed Feb 13, 2025

raghotham reviewed Feb 13, 2025

booxter and others added 17 commits February 13, 2025 09:50

update

b8a612e

chore: Link to Groq docs in the warning message for preview model (#1060)

0e426d3

This should be `llama-3.2-3b` instead of `llama-3.2-3b-instruct`.

deprecation in OpenAPI spec

ceff631

update

9ce00ed

openapi

39980dc

update

139d5bd

Merge branch 'main' into eval_task_api_update

327be2f

update

e183ec9

Merge branch 'main' into eval_task_api_update

8ae5970

Merge branch 'main' into eval_task_api_update

cda598d

compeltely remove eval_task

c56db9e

precommit

b0ad0c1

hardikjshah approved these changes Feb 13, 2025

yanxi0830 merged commit 8b655e3 into main Feb 14, 2025
3 checks passed

yanxi0830 deleted the eval_task_api_update branch February 14, 2025 00:41

yanxi0830 mentioned this pull request Feb 14, 2025

Fully deprecate /eval-tasks endpoint #1088

Closed

2 tasks

		@@ -8,12 +8,12 @@
		from modules.api import llama_stack_api


		def eval_tasks():

	task_config: BenchmarkConfig,
	benchmark_config: BenchmarkConfig,
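
The fragment above appears to show the `task_config` parameter being renamed to `benchmark_config`. As a hedged illustration of how such a keyword rename can stay backward compatible for a release, the sketch below accepts both names and warns on the old one. `run_eval` is a hypothetical stand-in, not the llama-stack signature, and a plain `dict` stands in for `BenchmarkConfig`.

```python
import warnings
from typing import Any, Optional


def run_eval(
    benchmark_id: str,
    benchmark_config: Optional[dict[str, Any]] = None,
    *,
    task_config: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical entry point that accepts the old and new keyword names."""
    if task_config is not None:
        # Refuse ambiguous calls rather than silently picking one value.
        if benchmark_config is not None:
            raise TypeError("pass benchmark_config or task_config, not both")
        warnings.warn(
            "task_config is deprecated; use benchmark_config",
            DeprecationWarning,
            stacklevel=2,
        )
        benchmark_config = task_config
    if benchmark_config is None:
        raise TypeError("benchmark_config is required")
    # A real implementation would start an eval job here; the sketch
    # just echoes the resolved arguments.
    return {"benchmark_id": benchmark_id, "config": benchmark_config}
```

Making the old name keyword-only keeps positional callers on the new parameter, so only code that spells out `task_config=` hits the deprecation path.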
