planning: /chat/completions Documentation and Roadmap #1582

Closed · 2 tasks done
dan-menlo opened this issue Oct 30, 2024 · 3 comments

dan-menlo commented Oct 30, 2024

Goal

  • Our /chat/completions should have parameters similar to OpenAI
    • Check if our implementation is aligned
    • Create planning: roadmap issues for anything not yet supported
    • Goal: communicate that we are committed to OpenAI compatibility

Tasklist

  • Itemize Roadmap issues (once Dan approves, we will create Roadmap issues)
  • Update Swaggerfile

dan-menlo commented:

@nguyenhoangthuan99 - Please transfer https://github.com/janhq/internal/issues/160 to this issue (can be public)

nguyenhoangthuan99 commented Oct 30, 2024

API reference: https://platform.openai.com/docs/api-reference/chat/create

Fields from the /v1/chat/completions API that are not yet supported:

  • store (boolean or null, optional, defaults to false): Whether or not to store the output of this chat completion request for use in model distillation or evals products. To support this, we need an architecture for saving the output of users' chat completion requests (e.g. MinIO for object storage and Postgres for the database).
  • metadata (object or null, optional): Developer-defined tags and values used for filtering completions in the dashboard.
    This also requires logic to save results to the DB so the user can query them later.
  • logit_bias (map, optional, defaults to null): Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. → Need to confirm whether llama.cpp supports this; it would be a nice-to-have feature (see the logit_bias sketch after this list).
    Issue: feat: [support logit_bias for OpenAI API compatible] cortex.llamacpp#263
  • logprobs (boolean or null, optional, defaults to false): Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This feature is partially supported; we need to update cortex.llamacpp to return logprobs in both stream and non-stream modes (a request sketch follows this list).
    Issue: feat: [support log prob like OpenAI API] cortex.llamacpp#262

The logprobs result should look like this:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1702685778,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          },
          {
            "token": "!",
            "logprob": -0.02380986,
            "bytes": [
              33
            ],
            "top_logprobs": [
              {
                "token": "!",
                "logprob": -0.02380986,
                "bytes": [33]
              },
              {
                "token": " there",
                "logprob": -3.787621,
                "bytes": [32, 116, 104, 101, 114, 101]
              }
            ]
          },
          {
            "token": " How",
            "logprob": -0.000054669687,
            "bytes": [32, 72, 111, 119],
            "top_logprobs": [
              {
                "token": " How",
                "logprob": -0.000054669687,
                "bytes": [32, 72, 111, 119]
              },
              {
                "token": "<|end|>",
                "logprob": -10.953937,
                "bytes": null
              }
            ]
          },
          {
            "token": " can",
            "logprob": -0.015801601,
            "bytes": [32, 99, 97, 110],
            "top_logprobs": [
              {
                "token": " can",
                "logprob": -0.015801601,
                "bytes": [32, 99, 97, 110]
              },
              {
                "token": " may",
                "logprob": -4.161023,
                "bytes": [32, 109, 97, 121]
              }
            ]
          },
          {
            "token": " I",
            "logprob": -3.7697225e-6,
            "bytes": [
              32,
              73
            ],
            "top_logprobs": [
              {
                "token": " I",
                "logprob": -3.7697225e-6,
                "bytes": [32, 73]
              },
              {
                "token": " assist",
                "logprob": -13.596657,
                "bytes": [32, 97, 115, 115, 105, 115, 116]
              }
            ]
          },
          {
            "token": " assist",
            "logprob": -0.04571125,
            "bytes": [32, 97, 115, 115, 105, 115, 116],
            "top_logprobs": [
              {
                "token": " assist",
                "logprob": -0.04571125,
                "bytes": [32, 97, 115, 115, 105, 115, 116]
              },
              {
                "token": " help",
                "logprob": -3.1089056,
                "bytes": [32, 104, 101, 108, 112]
              }
            ]
          },
          {
            "token": " you",
            "logprob": -5.4385737e-6,
            "bytes": [32, 121, 111, 117],
            "top_logprobs": [
              {
                "token": " you",
                "logprob": -5.4385737e-6,
                "bytes": [32, 121, 111, 117]
              },
              {
                "token": " today",
                "logprob": -12.807695,
                "bytes": [32, 116, 111, 100, 97, 121]
              }
            ]
          },
          {
            "token": " today",
            "logprob": -0.0040071653,
            "bytes": [32, 116, 111, 100, 97, 121],
            "top_logprobs": [
              {
                "token": " today",
                "logprob": -0.0040071653,
                "bytes": [32, 116, 111, 100, 97, 121]
              },
              {
                "token": "?",
                "logprob": -5.5247097,
                "bytes": [63]
              }
            ]
          },
          {
            "token": "?",
            "logprob": -0.0008108172,
            "bytes": [63],
            "top_logprobs": [
              {
                "token": "?",
                "logprob": -0.0008108172,
                "bytes": [63]
              },
              {
                "token": "?\n",
                "logprob": -7.184561,
                "bytes": [63, 10]
              }
            ]
          }
        ]
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 9,
    "total_tokens": 18,
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  },
  "system_fingerprint": null
}
  • n (integer or null, optional, defaults to 1): How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs. → Need to check whether llama.cpp supports this option (the request sketch after this list also sets n).
    Issue: feat: [support return multiple choices] cortex.llamacpp#264

  • service_tier (string or null, optional, defaults to auto):
    Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:
    If set to 'auto', and the Project is Scale tier enabled, the system will utilize scale tier credits until they are exhausted.
    If set to 'auto', and the Project is not Scale tier enabled, the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
    If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
    When not set, the default behavior is 'auto'.
    When this parameter is set, the response body will include the service_tier utilized.

  • stream_options (object or null, optional, defaults to null): Options for the streaming response. Only set this when you set stream: true. → Need to update cortex.llamacpp to support this (see the streaming sketch after this list).
    Issue: feat: [support stream_option for OpenAI API compatible] cortex.llamacpp#265

  • modalities and audio: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-modalities. We need a roadmap item for supporting audio as a modality.

  • user: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-user.
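
For reference, here is a minimal sketch of the mechanism logit_bias describes (bias added to the raw logits before sampling). The vocabulary, token IDs, and logit values are made up for illustration; this is not cortex.llamacpp's actual implementation:

import math

def apply_logit_bias(logits, logit_bias):
    # Add each bias to the corresponding token's raw logit before sampling.
    # A bias near -100 effectively bans a token; near +100 forces it.
    biased = dict(logits)
    for token_id, bias in logit_bias.items():
        if token_id in biased:
            biased[token_id] += bias
    return biased

def softmax(logits):
    # Convert logits to a probability distribution (numerically stable).
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Made-up token IDs and logits for illustration only.
raw = {15339: 2.1, 13347: 1.8, 9906: 0.4}
probs = softmax(apply_logit_bias(raw, {15339: -1, 13347: -100}))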
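
A sketch of the client-side request the logprobs and n fields enable, using the openai Python client against a local OpenAI-compatible server. The base URL, API key, and model name are placeholders, not confirmed cortex defaults:

from openai import OpenAI

# Placeholder endpoint and model; adjust to your local server.
client = OpenAI(base_url="http://localhost:39281/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,    # return per-token log probabilities
    top_logprobs=2,   # include the 2 most likely alternatives per position
    n=1,              # number of choices to generate
)
for item in resp.choices[0].logprobs.content:
    print(item.token, item.logprob)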
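
And a streaming sketch showing where stream_options fits, reusing the client above. With include_usage, OpenAI's documented behavior is that the final chunk carries the token usage and an empty choices list:

stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # populated only on the final chunk
        print()
        print(chunk.usage)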

nguyenhoangthuan99 commented:

The following fields cannot be supported directly in cortex.cpp and need a roadmap item in the enterprise version:

  • store (boolean or null, optional, defaults to false): Whether or not to store the output of this chat completion request for use in model distillation or evals products. To support this, we need an architecture for saving the output of users' chat completion requests, e.g. MinIO for object storage and Postgres for the database (see the persistence sketch after this list).

  • metadata (object or null, optional): Developer-defined tags and values used for filtering completions in the dashboard.
    This also requires logic to save results to the DB so the user can query them later.

  • service_tier (string or null, optional, defaults to auto):
    Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:
    If set to 'auto', and the Project is Scale tier enabled, the system will utilize scale tier credits until they are exhausted.
    If set to 'auto', and the Project is not Scale tier enabled, the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
    If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
    When not set, the default behavior is 'auto'.
    When this parameter is set, the response body will include the service_tier utilized.

  • modalities and audio: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-modalities. We need a roadmap item for supporting audio as a modality.

  • user: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-user.

  • Linked documentation to this issue in PR Feat/api docs #1589
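
As a rough illustration of the persistence that store and metadata call for, here is a minimal sketch using SQLite in place of the suggested MinIO + Postgres stack. The schema and function name are hypothetical:

import json
import sqlite3

# Hypothetical schema; a production design would split large payloads
# into object storage (MinIO) and keep queryable fields in Postgres.
db = sqlite3.connect("completions.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS completions (
        id       TEXT PRIMARY KEY,
        created  INTEGER,
        request  TEXT,
        response TEXT,
        metadata TEXT
    )
""")

def store_completion(request, response, metadata=None):
    # Persist a completion and its developer-defined metadata tags
    # when the request sets store=true, so it can be queried later.
    db.execute(
        "INSERT INTO completions VALUES (?, ?, ?, ?, ?)",
        (
            response["id"],
            response["created"],
            json.dumps(request),
            json.dumps(response),
            json.dumps(metadata or {}),
        ),
    )
    db.commit()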
