
roadmap: Jan supports Local Voice Mode w/ Ichijo #3488

Open · 6 tasks · imtuyethan opened this issue Aug 28, 2024 · 2 comments
Labels: category: multimodal · P2: nice to have · type: epic

imtuyethan (Contributor) commented Aug 28, 2024

Goal

Blocked on Python: https://github.com/janhq/jan-internal/issues/8

Overview

  • Jan has been training Ichigo, which augments Llama 3.1 with native voice understanding
  • This can be seen as an open-source Siri (or ChatGPT's Advanced Voice Mode)

Goals

  • Jan should have a built-in Voice Mode that users can talk to in continuous conversation
  • Voice Mode should utilize Jan's existing Threads and Messages architecture
  • cortex.cpp should add APIs that can elegantly handle audio input
  • [TBD] Jan can TTS the model's output (we may KIV this in favor of an s2s focus)

Tasklist

  • Finalize architecture to support Audio Encodings
  • Finalize API methods to add to cortex.cpp for Audio input
  • Cortex supports Realtime API
  • Cortex supports Python

Lifecycle

Step 1: Jan captures input

  • UI: Jan has a "Voice Mode", similar to ChatGPT's Voice Mode
  • Jan's Voice Mode should utilize the existing Threads and Messages architecture
  • Audio input should be captured as a Message attachment (OpenAI spec); see the capture sketch after this list
  • Jan should call cortex.cpp methods to encode audio inputs
  • Jan should call the cortex.cpp equivalent of /chat/completions with the encoded audio input
  • [TBD] Jan should call cortex.cpp to transcribe the audio input (for the Thread view)
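
A minimal sketch of the capture step in Jan's renderer, using the standard MediaRecorder API. The `onVoiceMessage` callback is a hypothetical hook for attaching the recorded Blob to the current Thread's Message; nothing here is finalized implementation:

```ts
// Sketch only: captures one voice message via the Web MediaRecorder API.
// onVoiceMessage is a hypothetical hook that would attach the Blob to the
// current Thread's Message as an attachment (OpenAI spec-style).
async function recordVoiceMessage(onVoiceMessage: (audio: Blob) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
    onVoiceMessage(new Blob(chunks, { type: "audio/webm" }));
  };

  recorder.start();
  return () => recorder.stop(); // caller invokes this to end the recording
}
```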

Questions:

  • Should we transcribe audio input, and display it in Thread mode?
  • Should we use Message attachments? (or what alternative is there?)

Step 2: cortex.cpp

We will likely have to add API methods beyond the current OpenAI API standard:

  • Currently, the /audio API does not support completions
  • However, OpenAI already has a voice mode, and it is likely only a matter of time before they standardize the API
  • Our design philosophy should be minimalist; we can then adopt their standard when they release it

From my POV, this is how cortex.cpp should handle it:

Option 1: /audio/completions

  • We create /audio/completions as an alternative to /chat/completions
  • This takes care of the full process: encoding audio inputs, the inference request, and potentially transcription
  • I would prefer this to be purely in C++, with no back-and-forth between JS and C++ (a request sketch follows this list)
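
To make the option concrete, here is a hedged request sketch. The endpoint, field names, response shape, model id, and port are all assumptions, not a finalized cortex.cpp API:

```ts
// Sketch: one hypothetical /audio/completions call that would handle
// encoding, inference, and (optionally) input transcription server-side.
async function audioCompletion(audioBase64: string) {
  const res = await fetch("http://127.0.0.1:39281/v1/audio/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ichigo-llama3.1",               // hypothetical model id
      audio: { data: audioBase64, format: "wav" },
      transcribe_input: true,                 // optional, for the Thread view
    }),
  });
  return res.json(); // assumed shape: { text: ..., input_transcript: ... }
}
```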

Option 2: /audio/encodings -> /chat/completions

  • We break it up into two calls and re-use the /chat/completions endpoint, since the encoded audio input is just tokens
  • I am personally not sure whether "encodings" is the right term, or whether we should just re-use the "embeddings" API

Either way, cortex.cpp should eventually expose an /audio/encodings endpoint (sketched after this list):

  • Allows us to generate audio encodings
  • Allows developers to define the encoder (e.g. WhisperVQ, SNAC, etc.)
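
A hedged sketch of the two-call flow: the /audio/encodings payload and response shape are assumptions, and only /chat/completions mirrors the existing OpenAI-style endpoint:

```ts
// Sketch: Option 2 as two calls. Step 1 is a hypothetical /audio/encodings
// endpoint with a developer-selectable encoder; step 2 re-uses the existing
// /chat/completions endpoint, since encoded audio is just tokens.
async function audioViaEncodings(audioBase64: string) {
  const base = "http://127.0.0.1:39281/v1";   // assumed local cortex.cpp port

  const enc = await fetch(`${base}/audio/encodings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      audio: { data: audioBase64, format: "wav" },
      encoder: "whisper-vq",                  // or "snac", etc. (assumed names)
    }),
  }).then((r) => r.json());                   // assumed: { tokens: "<|sound_...|>" }

  return fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ichigo-llama3.1",               // hypothetical model id
      messages: [{ role: "user", content: enc.tokens }],
    }),
  }).then((r) => r.json());
}
```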

Step 3: Jan generates TTS (TBD)

  • cortex.cpp should generate the inference output
  • We should allow the user to have a full voice experience (i.e. speech-to-speech, s2s)

Option 1: TTS

  • I prefer this option, as users can pick their preferred voice, and it's fairly clean (see the request sketch after this list)
  • However: would this mean that we need to unload Jade for Whisper, and then re-load it?
  • This could be solved by Jade being able to do TTS natively
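
For illustration, the TTS call could mirror the shape of OpenAI's /audio/speech endpoint; whether cortex.cpp would expose such an endpoint, and the model/voice names used, are assumptions:

```ts
// Sketch: TTS request in the style of OpenAI's /audio/speech endpoint.
// Endpoint availability, model id, and voice name are all assumed.
async function speak(text: string): Promise<Blob> {
  const res = await fetch("http://127.0.0.1:39281/v1/audio/speech", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tts-local", voice: "alloy", input: text }),
  });
  return res.blob(); // audio bytes, playable via an <audio> element
}
```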

Option 2: s2s

  • Personally, I think s2s is still in the research phase, and we should KIV it for a future release

Reference

Old Specs: https://www.notion.so/jan-ai/Jan-supports-Llama-3s-646f8c359a6c4c77a40ec822509b09e5?pvs=4
@imtuyethan added this to Menlo Aug 28, 2024
@imtuyethan converted this from a draft issue Aug 28, 2024
@dan-menlo changed the title from "Jan supports Llama-3s" to "epic: Jan supports Llama-3s" Aug 28, 2024
@imtuyethan moved this from Scheduled to Planning in Menlo Aug 30, 2024
@dan-menlo changed the title from "epic: Jan supports Llama-3s" to "epic: Jan and Cortex support Llama-3s" Aug 30, 2024
@imtuyethan added the "type: feature request" and "type: epic" labels and removed the "type: feature request" label Aug 30, 2024
@freelerobot added the "P2: nice to have" label Sep 5, 2024
@dan-menlo changed the title from "epic: Jan and Cortex support Llama-3s" to "epic: Jan has Voice support" Sep 11, 2024
@freelerobot changed the title from "epic: Jan has Voice support" to "epic: Jan supports voice models" Sep 18, 2024
@freelerobot changed the title from "epic: Jan supports voice models" to "epic: Jan supports Jan voice models" Sep 18, 2024
@dan-menlo changed the title from "epic: Jan supports Jan voice models" to "epic: Jan supports running Jan's Voice models locally" Sep 19, 2024
@dan-menlo changed the title from "epic: Jan supports running Jan's Voice models locally" to "epic: Jan supports local Voice Mode using Jade models" Sep 19, 2024
@dan-menlo changed the title from "epic: Jan supports local Voice Mode using Jade models" to "epic: Jan supports Local Voice Mode using Jade models" Sep 19, 2024
freelerobot (Contributor) commented:
Question @dan-homebrew: Is Voice Mode continuous input? i.e.

  1. Users turn on listen mode, and the model listens in an ongoing way, detecting pauses in speech
  2. Users press record and stop recording, Telegram voice-message style

(1) will likely not work right now

dan-menlo (Contributor) commented:

> Question @dan-homebrew: Is Voice Mode continuous input? i.e.
>
>   1. Users turn on listen mode, and the model listens in an ongoing way, detecting pauses in speech
>   2. Users press record and stop recording, Telegram voice-message style
>
> (1) will likely not work right now

I think (2) is the first step, and we work incrementally towards (1).

There might be simple workarounds to get to (1) at the full-stack engineering level (vs. the model level); see the sketch below.
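
One such workaround could be a purely client-side pause detector: watch microphone energy with the Web Audio API and end the utterance after a stretch of silence. A rough sketch, with thresholds that are guesses and would need tuning:

```ts
// Sketch: a client-side pause detector as a possible full-stack workaround.
// Watches input energy via the Web Audio API and calls onUtteranceEnd after
// ~800 ms of silence. RMS threshold and timeout are arbitrary starting points.
function detectPauses(stream: MediaStream, onUtteranceEnd: () => void) {
  const ctx = new AudioContext(); // note: may require a prior user gesture
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);

  const buf = new Uint8Array(analyser.fftSize);
  let lastVoice = performance.now();

  const tick = () => {
    analyser.getByteTimeDomainData(buf);
    // Rough RMS energy of the current frame (samples are centered at 128).
    const rms = Math.sqrt(
      buf.reduce((s, v) => s + (v - 128) ** 2, 0) / buf.length
    );
    if (rms > 4) lastVoice = performance.now();       // voice detected
    if (performance.now() - lastVoice > 800) onUtteranceEnd();
    else requestAnimationFrame(tick);                 // keep polling
  };
  requestAnimationFrame(tick);
}
```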

@dan-menlo changed the title from "epic: Jan supports Local Voice Mode using Jade models" to "epic: Jan supports Local Voice Mode w/ Ichijo" Sep 23, 2024
@dan-menlo moved this from Planning to Scheduled in Menlo Sep 26, 2024
@freelerobot added the "category: multimodal" label Oct 14, 2024
@dan-menlo changed the title from "epic: Jan supports Local Voice Mode w/ Ichijo" to "roadmap: Jan supports Local Voice Mode w/ Ichijo" Nov 29, 2024
@dan-menlo moved this from Scheduled to In Progress in Menlo Jan 2, 2025
@imtuyethan modified the milestone: v0.5.16 Jan 2, 2025
@imtuyethan modified the milestones: v0.5.15, v0.5.16 Jan 21, 2025