[OpenAI] Add Whisper (Azure#27109)
### Packages impacted by this PR
@azure/openai

### Issues associated with this PR
None for Whisper, but this PR includes a rudimentary fix for Azure#26953.

### Describe the problem that is addressed by this PR
Adds support for speech to text capabilities. See the changelog entry
and the samples for more details about the addition.

A few notes:
- Bring Your Own Data tests are skipped because the deployment for the new API version doesn't support the feature yet; support is expected soon.
- `@azure/core-rest-pipeline`'s `formDataPolicy` doesn't support file uploads. I added a custom version of the policy in openai that supports file uploads and uses an actively maintained third-party library (see the sketch after this list).
- Adds a fix for Azure#26953 that doesn't rely on core changes (see the changes in the `src/api/getSSE.ts` and `src/api/getSSE.browser.ts` files). A better fix is in Azure#27000, but that one is still under review.
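
The custom policy builds on the `formdata-node` and `form-data-encoder` packages added as dependencies in this PR. Below is a minimal sketch of the approach, assuming the standard pipeline-policy shape and a `request.formData` map; it illustrates the idea and is not the exact code in this PR:

```js
// Sketch of a form-data policy with file-upload support (illustrative only).
const { FormData, File } = require("formdata-node");
const { FormDataEncoder } = require("form-data-encoder");
const { Readable } = require("stream");

const customFormDataPolicy = {
  name: "customFormDataPolicy",
  async sendRequest(request, next) {
    if (request.formData) {
      const form = new FormData();
      for (const [name, value] of Object.entries(request.formData)) {
        // Wrap raw bytes in a File so the encoder emits a proper file part.
        form.append(name, value instanceof Uint8Array ? new File([value], "file") : value);
      }
      const encoder = new FormDataEncoder(form);
      request.headers.set("Content-Type", encoder.contentType);
      // The encoder yields the multipart body as an async iterable.
      request.body = () => Readable.from(encoder.encode());
      request.formData = undefined;
    }
    return next(request);
  },
};
```

Delegating the multipart encoding to `form-data-encoder` keeps boundary generation and content-type handling out of the SDK itself.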

### What are the possible designs available to address the problem? If there is more than one possible design, why was the one in this PR chosen?
N/A

### Are there test cases added in this PR? _(If not, why?)_
Yes

### Provide a list of related PRs _(if any)_
N/A

### Command used to generate this PR _(Applicable only to SDK release request PRs)_

### Checklists
- [x] Added impacted package name to the issue description
- [ ] Does this PR need any fixes in the SDK Generator? _(If so, create an Issue in the [Autorest/typescript](https://github.com/Azure/autorest.typescript) repository and link it here)_
- [x] Added a changelog (if necessary)

---------

Co-authored-by: Minh-Anh Phan <[email protected]>
deyaaeldeen and minhanh-phan authored Sep 19, 2023
1 parent 245548f commit 4e1b3ec
Showing 47 changed files with 1,783 additions and 197 deletions.
38 changes: 31 additions & 7 deletions common/config/rush/pnpm-lock.yaml


13 changes: 7 additions & 6 deletions sdk/openai/openai/CHANGELOG.md
@@ -1,17 +1,18 @@
# Release History

## 1.0.0-beta.6 (Unreleased)
## 1.0.0-beta.6 (2023-09-21)

### Features Added

### Breaking Changes
- Introduces speech to text and translation capabilities for a wide variety of audio file formats.
- Adds `getAudioTranscription` and `getAudioTranslation` methods for transcribing and translating audio files. The result can be either a simple JSON structure with just a `text` field or a more detailed JSON structure containing the text alongside additional information. In addition, VTT (Web Video Text Tracks), SRT (SubRip Text), and plain text formats are also supported. The type of the result depends on the `format` parameter, if specified; otherwise, a simple JSON output is assumed. The methods can take an optional text prompt as input to guide the model's style or to continue a previous audio segment. The language of the prompt should match that of the audio file.
- The available model at the time of this release supports the following list of audio file formats: m4a, mp3, wav, ogg, flac, webm, mp4, mpga, mpeg, and oga.

### Bugs Fixed

- Return `usage` information when available.
- Return `error` information in `ContentFilterResults` when available.

### Other Changes
- Returns `usage` information when available.
- Fixes a bug where errors weren't properly being thrown from the streaming methods.
- Returns `error` information in `ContentFilterResults` when available.

## 1.0.0-beta.5 (2023-08-25)

68 changes: 61 additions & 7 deletions sdk/openai/openai/README.md
@@ -6,10 +6,12 @@ non-Azure OpenAI inference endpoint, making it a great choice for even non-Azure

Use the client library for Azure OpenAI to:

* [Create a completion for text][msdocs_openai_completion]
* [Create a chat completion with ChatGPT][msdocs_openai_chat_completion]
* [Create a completion for text][get_completions_sample]
* [Create a chat completion with ChatGPT][list_chat_completion_sample]
* [Create a text embedding for comparisons][msdocs_openai_embedding]
* [Use your own data with Azure OpenAI][msdocs_openai_custom_data]
* [Use your own data with Azure OpenAI][byod_sample]
* [Generate images][get_images_sample]
* [Transcribe and translate audio files][transcribe_audio_sample]

Azure OpenAI is a managed service that allows developers to deploy, tune, and generate content from OpenAI models on Azure resources.

@@ -20,6 +22,7 @@ Check out the following examples:
- [Summarize Text](#summarize-text-with-completion)
- [Generate Images](#generate-images-with-dall-e-image-generation-models)
- [Analyze Business Data](#analyze-business-data)
- [Transcribe and translate audio files](#transcribe-and-translate-audio-files)

Key links:

@@ -140,6 +143,10 @@ async function main(){
console.log(choice.text);
}
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```

## Examples
@@ -179,6 +186,10 @@ async function main(){
}
}
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
### Generate Multiple Completions With Subscription Key
@@ -212,6 +223,10 @@ async function main(){
console.log(`Chatbot: ${completion}`);
}
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
### Summarize Text with Completion
@@ -254,6 +269,9 @@ async function main(){
console.log(`Summarization: ${completion}`);
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
### Generate images with DALL-E image generation models
@@ -276,6 +294,10 @@ async function main() {
console.log(`Image generation result URL: ${image.url}`);
}
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
### Analyze Business Data
@@ -285,7 +307,7 @@ This example generates chat responses to input chat questions about your business data.
```javascript
const { OpenAIClient } = require("@azure/openai");
const { DefaultAzureCredential } = require("@azure/identity")
const { DefaultAzureCredential } = require("@azure/identity");

async function main(){
const endpoint = "https://myaccount.openai.azure.com/";
@@ -323,6 +345,36 @@ async function main(){
}
}
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
### Transcribe and translate audio files
The speech to text and translation capabilities of Azure OpenAI can be used to transcribe and translate a wide variety of audio file formats. The following example shows how to use the `getAudioTranscription` method to transcribe audio into the language the audio is spoken in. You can also translate and transcribe the audio into English using the `getAudioTranslation` method.
The audio file can be loaded into memory using the Node.js file system APIs. In the browser, the file can be loaded using the `FileReader` API, and the output of the `arrayBuffer` instance method can be passed to the `getAudioTranscription` method.
```js
const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");
const fs = require("fs/promises");

async function main() {
console.log("== Transcribe Audio Sample ==");

const client = new OpenAIClient(endpoint, new AzureKeyCredential(azureApiKey));
const deploymentName = "whisper-deployment";
const audio = await fs.readFile("< path to an audio file >");
const result = await client.getAudioTranscription(deploymentName, audio);

console.log(`Transcription: ${result.text}`);
}

main().catch((err) => {
console.error("The sample encountered an error:", err);
});
```
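Translation works the same way. The following sketch mirrors the transcription sample above (as there, `endpoint` and `azureApiKey` are placeholders you would supply):
```js
const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");
const fs = require("fs/promises");

async function main() {
  console.log("== Translate Audio Sample ==");

  const client = new OpenAIClient(endpoint, new AzureKeyCredential(azureApiKey));
  const deploymentName = "whisper-deployment";
  const audio = await fs.readFile("< path to an audio file >");
  // Unlike transcription, translation always produces English text.
  const result = await client.getAudioTranslation(deploymentName, audio);

  console.log(`Translation: ${result.text}`);
}

main().catch((err) => {
  console.error("The sample encountered an error:", err);
});
```
Both methods also accept an optional result format argument (`"json"`, `"verbose_json"`, `"text"`, `"srt"`, or `"vtt"`) that controls the shape of the returned result.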
## Troubleshooting
@@ -340,9 +392,11 @@ setLogLevel("info");
For more detailed instructions on how to enable logs, you can look at the [@azure/logger package docs](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/core/logger).
<!-- LINKS -->
[msdocs_openai_completion]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/completions.js
[msdocs_openai_chat_completion]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/listChatCompletions.js
[msdocs_openai_custom_data]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples-dev/bringYourOwnData.ts
[get_completions_sample]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/completions.js
[list_chat_completion_sample]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/listChatCompletions.js
[byod_sample]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/bringYourOwnData.js
[get_images_sample]: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/v1-beta/javascript/getImages.js
[transcribe_audio_sample]: https://github.com/Azure/azure-sdk-for-js/tree/openai/add-whisper/sdk/openai/openai/samples-dev/audioTranscription.ts
[msdocs_openai_embedding]: https://learn.microsoft.com/azure/cognitive-services/openai/concepts/understand-embeddings
[azure_openai_completions_docs]: https://learn.microsoft.com/azure/cognitive-services/openai/how-to/completions
[defaultazurecredential]: https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/identity/identity#defaultazurecredential
2 changes: 1 addition & 1 deletion sdk/openai/openai/assets.json
@@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "js",
"TagPrefix": "js/openai/openai",
"Tag": "js/openai/openai_353545d522"
"Tag": "js/openai/openai_85d9317957"
}
Binary file added sdk/openai/openai/assets/audio/countdown.flac
Binary file added sdk/openai/openai/assets/audio/countdown.m4a
Binary file added sdk/openai/openai/assets/audio/countdown.mp3
Binary file added sdk/openai/openai/assets/audio/countdown.mp4
Binary file added sdk/openai/openai/assets/audio/countdown.mpeg
Binary file added sdk/openai/openai/assets/audio/countdown.mpga
Binary file added sdk/openai/openai/assets/audio/countdown.oga
Binary file added sdk/openai/openai/assets/audio/countdown.ogg
Binary file added sdk/openai/openai/assets/audio/countdown.wav
Binary file added sdk/openai/openai/assets/audio/countdown.webm
3 changes: 3 additions & 0 deletions sdk/openai/openai/package.json
@@ -7,6 +7,7 @@
"module": "dist-esm/src/index.js",
"browser": {
"./dist-esm/src/api/getSSEs.js": "./dist-esm/src/api/getSSEs.browser.js",
"./dist-esm/src/api/policies/formDataPolicy.js": "./dist-esm/src/api/policies/formDataPolicy.browser.js",
"./dist-esm/test/public/utils/getImageDimensions.js": "./dist-esm/test/public/utils/getImageDimensions.browser.js"
},
"type": "module",
@@ -136,6 +137,8 @@
"@azure/core-lro": "^2.5.3",
"@azure/core-rest-pipeline": "^1.10.2",
"@azure/logger": "^1.0.3",
"formdata-node": "^4.0.0",
"form-data-encoder": "1.7.2",
"tslib": "^2.4.0"
},
"//sampleConfiguration": {
71 changes: 71 additions & 0 deletions sdk/openai/openai/review/openai.api.md
@@ -11,6 +11,58 @@ import { KeyCredential } from '@azure/core-auth';
import { OperationOptions } from '@azure-rest/core-client';
import { TokenCredential } from '@azure/core-auth';

// @public
export type AudioResult<ResponseFormat extends AudioResultFormat> = {
json: AudioResultSimpleJson;
verbose_json: AudioResultVerboseJson;
vtt: string;
srt: string;
text: string;
}[ResponseFormat];

// @public
export type AudioResultFormat =
/** This format will return a JSON structure containing a single \"text\" field with the transcription. */
"json"
/** This format will return a JSON structure enriched with additional information alongside the transcription. */
| "verbose_json"
/** This will make the response return the transcription as plain text. */
| "text"
/** The transcription will be provided in SRT format (SubRip Text) as plain text. */
| "srt"
/** The transcription will be provided in VTT format (Web Video Text Tracks) as plain text. */
| "vtt";

// @public
export interface AudioResultSimpleJson {
text: string;
}

// @public
export interface AudioResultVerboseJson extends AudioResultSimpleJson {
duration: number;
language: string;
segments: AudioSegment[];
task: AudioTranscriptionTask;
}

// @public
export interface AudioSegment {
avgLogprob: number;
compressionRatio: number;
end: number;
id: number;
noSpeechProb: number;
seek: number;
start: number;
temperature: number;
text: string;
tokens: number[];
}

// @public
export type AudioTranscriptionTask = string;

// @public
export interface AzureChatExtensionConfiguration {
parameters: Record<string, any>;
@@ -184,6 +236,21 @@ export interface FunctionName {
name: string;
}

// @public
export interface GetAudioTranscriptionOptions extends OperationOptions {
language?: string;
model?: string;
prompt?: string;
temperature?: number;
}

// @public
export interface GetAudioTranslationOptions extends OperationOptions {
model?: string;
prompt?: string;
temperature?: number;
}
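
To illustrate how these types compose, here is a sketch (placeholder endpoint, key, deployment name, and file path; not part of the API review itself). Passing `"verbose_json"` selects the `AudioResultVerboseJson` branch of `AudioResult`, which exposes the `AudioSegment` details listed above:

```js
const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");
const fs = require("fs/promises");

async function main() {
  const client = new OpenAIClient(
    "https://myaccount.openai.azure.com/",
    new AzureKeyCredential("<api key>")
  );
  const audio = await fs.readFile("< path to an audio file >");
  // "verbose_json" narrows the result type to AudioResultVerboseJson.
  const result = await client.getAudioTranscription("whisper-deployment", audio, "verbose_json", {
    temperature: 0,
  });
  console.log(`Language: ${result.language}, duration: ${result.duration}s`);
  for (const segment of result.segments) {
    console.log(`[${segment.start}s - ${segment.end}s] ${segment.text}`);
  }
}

main().catch((err) => {
  console.error("The sample encountered an error:", err);
});
```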

// @public
export interface GetChatCompletionsOptions extends OperationOptions {
azureExtensionOptions?: AzureExtensionsOptions;
@@ -261,6 +328,10 @@ export class OpenAIClient {
constructor(endpoint: string, credential: KeyCredential, options?: OpenAIClientOptions);
constructor(endpoint: string, credential: TokenCredential, options?: OpenAIClientOptions);
constructor(openAiApiKey: KeyCredential, options?: OpenAIClientOptions);
getAudioTranscription(deploymentName: string, fileContent: Uint8Array, options?: GetAudioTranscriptionOptions): Promise<AudioResultSimpleJson>;
getAudioTranscription<Format extends AudioResultFormat>(deploymentName: string, fileContent: Uint8Array, format: Format, options?: GetAudioTranscriptionOptions): Promise<AudioResult<Format>>;
getAudioTranslation(deploymentName: string, fileContent: Uint8Array, options?: GetAudioTranslationOptions): Promise<AudioResultSimpleJson>;
getAudioTranslation<Format extends AudioResultFormat>(deploymentName: string, fileContent: Uint8Array, format: Format, options?: GetAudioTranslationOptions): Promise<AudioResult<Format>>;
getChatCompletions(deploymentName: string, messages: ChatMessage[], options?: GetChatCompletionsOptions): Promise<ChatCompletions>;
getCompletions(deploymentName: string, prompt: string[], options?: GetCompletionsOptions): Promise<Completions>;
getEmbeddings(deploymentName: string, input: string[], options?: GetEmbeddingsOptions): Promise<Embeddings>;