
roadmap: Jan supports Local Voice Mode w/ Ichijo #3488

Open · 6 tasks · imtuyethan opened this issue Aug 28, 2024 · 2 comments
Labels: category: multimodal · P2: nice to have · type: epic

imtuyethan (Contributor) commented Aug 28, 2024

Goal

Blocked on Python: https://github.com/janhq/jan-internal/issues/8

Overview

  • Jan has been training Ichigo, which augments Llama 3.1 with native voice understanding
  • This can be seen as an open-source Siri (or ChatGPT's Advanced Voice Mode)

Goals

  • Jan should have a built-in Voice Mode that users can talk to in continuous conversation
  • Voice Mode should utilize Jan's existing Threads and Messages architecture
  • cortex.cpp should add APIs that can elegantly handle audio input
  • [TBD] Jan can TTS the model's output (we may KIV this in favor of an s2s focus)

Tasklist

  • Finalize architecture to support Audio Encodings
  • Finalize API methods to add to cortex.cpp for Audio input
  • Cortex supports Realtime API
  • Cortex supports Python

Lifecycle

Step 1: Jan captures input

  • UI: Jan has a "Voice Mode", similar to ChatGPT's Voice Mode
  • Jan's Voice Mode should utilize the existing Threads and Messages architecture
  • Audio input should be captured as a Message attachment (OpenAI spec); see the capture sketch after this list
  • Jan should call cortex.cpp methods to encode audio inputs
  • Jan should call the cortex.cpp equivalent of /chat/completions with the encoded audio input
  • [TBD] Jan should call cortex.cpp to transcribe the audio input (for the Thread view)
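
A minimal sketch of the capture step in Jan's renderer, using the standard MediaRecorder API. The `onVoiceMessage` callback is a hypothetical hook for attaching the recorded Blob to the current Thread's Message; nothing here is finalized implementation:

```ts
// Sketch only: captures one voice message via the Web MediaRecorder API.
// onVoiceMessage is a hypothetical hook that would attach the Blob to the
// current Thread's Message as an attachment (OpenAI spec-style).
async function recordVoiceMessage(onVoiceMessage: (audio: Blob) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
    onVoiceMessage(new Blob(chunks, { type: "audio/webm" }));
  };

  recorder.start();
  return () => recorder.stop(); // caller invokes this to end the recording
}
```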

Questions:

  • Should we transcribe audio input, and display it in Thread mode?
  • Should we use Message attachments? (or what alternative is there?)

Step 2: cortex.cpp

We will likely have to add API methods beyond the current OpenAI API standard:

  • Currently, the /audio API does not support completions
  • However, OpenAI already has a voice mode, and it is likely only a matter of time before they standardize the API
  • Our design philosophy should be minimalist; we can then adopt their standard when they release it

From my POV, this is how cortex.cpp should handle it:

Option 1: /audio/completions

  • We create /audio/completions as an alternative to /chat/completions
  • This takes care of the full process: encoding audio inputs, the inference request, and potentially transcription
  • I would prefer this to be purely in C++, with no back-and-forth between JS and C++ (a request sketch follows this list)
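
To make the option concrete, here is a hedged request sketch. The endpoint, field names, response shape, model id, and port are all assumptions, not a finalized cortex.cpp API:

```ts
// Sketch: one hypothetical /audio/completions call that would handle
// encoding, inference, and (optionally) input transcription server-side.
async function audioCompletion(audioBase64: string) {
  const res = await fetch("http://127.0.0.1:39281/v1/audio/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ichigo-llama3.1",               // hypothetical model id
      audio: { data: audioBase64, format: "wav" },
      transcribe_input: true,                 // optional, for the Thread view
    }),
  });
  return res.json(); // assumed shape: { text: ..., input_transcript: ... }
}
```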

Option 2: /audio/encodings -> /chat/completions

  • We break it up into two calls and re-use the /chat/completions endpoint, since the encoded audio input is just tokens
  • I am personally not sure whether "encodings" is the right term, or whether we should just re-use the "embeddings" API

Either way, cortex.cpp should eventually expose an /audio/encodings endpoint (sketched after this list):

  • Allows us to generate audio encodings
  • Allows developers to define the encoder (e.g. WhisperVQ, SNAC, etc.)
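
A hedged sketch of the two-call flow: the /audio/encodings payload and response shape are assumptions, and only /chat/completions mirrors the existing OpenAI-style endpoint:

```ts
// Sketch: Option 2 as two calls. Step 1 is a hypothetical /audio/encodings
// endpoint with a developer-selectable encoder; step 2 re-uses the existing
// /chat/completions endpoint, since encoded audio is just tokens.
async function audioViaEncodings(audioBase64: string) {
  const base = "http://127.0.0.1:39281/v1";   // assumed local cortex.cpp port

  const enc = await fetch(`${base}/audio/encodings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      audio: { data: audioBase64, format: "wav" },
      encoder: "whisper-vq",                  // or "snac", etc. (assumed names)
    }),
  }).then((r) => r.json());                   // assumed: { tokens: "<|sound_...|>" }

  return fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ichigo-llama3.1",               // hypothetical model id
      messages: [{ role: "user", content: enc.tokens }],
    }),
  }).then((r) => r.json());
}
```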

Step 3: Jan generates TTS (TBD)

  • cortex.cpp should generate the inference output
  • We should allow the user to have a full voice experience (i.e. speech-to-speech, s2s)

Option 1: TTS

  • I prefer this option, as users can pick their preferred voice, and it's fairly clean (see the request sketch after this list)
  • However: would this mean that we need to unload Jade for Whisper, and then re-load it?
  • This could be solved by Jade being able to do TTS natively
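
For illustration, the TTS call could mirror the shape of OpenAI's /audio/speech endpoint; whether cortex.cpp would expose such an endpoint, and the model/voice names used, are assumptions:

```ts
// Sketch: TTS request in the style of OpenAI's /audio/speech endpoint.
// Endpoint availability, model id, and voice name are all assumed.
async function speak(text: string): Promise<Blob> {
  const res = await fetch("http://127.0.0.1:39281/v1/audio/speech", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tts-local", voice: "alloy", input: text }),
  });
  return res.blob(); // audio bytes, playable via an <audio> element
}
```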

Option 2: s2s

  • Personally, I think s2s is still in the research phase, and we should KIV it for a future release

Reference

Old Specs: https://www.notion.so/jan-ai/Jan-supports-Llama-3s-646f8c359a6c4c77a40ec822509b09e5?pvs=4
@imtuyethan added this to Menlo Aug 28, 2024
@imtuyethan converted this from a draft issue Aug 28, 2024
@dan-menlo changed the title from "Jan supports Llama-3s" to "epic: Jan supports Llama-3s" Aug 28, 2024
@imtuyethan moved this from Scheduled to Planning in Menlo Aug 30, 2024
@dan-menlo changed the title from "epic: Jan supports Llama-3s" to "epic: Jan and Cortex support Llama-3s" Aug 30, 2024
@imtuyethan added the "type: feature request" and "type: epic" labels and removed the "type: feature request" label Aug 30, 2024
@freelerobot added the "P2: nice to have" label Sep 5, 2024
@dan-menlo changed the title from "epic: Jan and Cortex support Llama-3s" to "epic: Jan has Voice support" Sep 11, 2024
@freelerobot changed the title from "epic: Jan has Voice support" to "epic: Jan supports voice models" Sep 18, 2024
@freelerobot changed the title from "epic: Jan supports voice models" to "epic: Jan supports Jan voice models" Sep 18, 2024
@dan-menlo changed the title from "epic: Jan supports Jan voice models" to "epic: Jan supports running Jan's Voice models locally" Sep 19, 2024
@dan-menlo changed the title from "epic: Jan supports running Jan's Voice models locally" to "epic: Jan supports local Voice Mode using Jade models" Sep 19, 2024
@dan-menlo changed the title from "epic: Jan supports local Voice Mode using Jade models" to "epic: Jan supports Local Voice Mode using Jade models" Sep 19, 2024
freelerobot (Contributor) commented:
Question @dan-homebrew: Is Voice Mode continuous input? i.e.

  1. Users turn on listen mode, and the model listens in an ongoing way, detecting pauses in speech
  2. Users press record and stop recording, Telegram voice-message style

(1) will likely not work right now

dan-menlo (Contributor) commented:

> Question @dan-homebrew: Is Voice Mode continuous input? i.e.
>
>   1. Users turn on listen mode, and the model listens in an ongoing way, detecting pauses in speech
>   2. Users press record and stop recording, Telegram voice-message style
>
> (1) will likely not work right now

I think (2) is the first step, and we work incrementally towards (1).

There might be simple workarounds to get to (1) at the full-stack engineering level (vs. the model level); see the sketch below.
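
One such workaround could be a purely client-side pause detector: watch microphone energy with the Web Audio API and end the utterance after a stretch of silence. A rough sketch, with thresholds that are guesses and would need tuning:

```ts
// Sketch: a client-side pause detector as a possible full-stack workaround.
// Watches input energy via the Web Audio API and calls onUtteranceEnd after
// ~800 ms of silence. RMS threshold and timeout are arbitrary starting points.
function detectPauses(stream: MediaStream, onUtteranceEnd: () => void) {
  const ctx = new AudioContext(); // note: may require a prior user gesture
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);

  const buf = new Uint8Array(analyser.fftSize);
  let lastVoice = performance.now();

  const tick = () => {
    analyser.getByteTimeDomainData(buf);
    // Rough RMS energy of the current frame (samples are centered at 128).
    const rms = Math.sqrt(
      buf.reduce((s, v) => s + (v - 128) ** 2, 0) / buf.length
    );
    if (rms > 4) lastVoice = performance.now();       // voice detected
    if (performance.now() - lastVoice > 800) onUtteranceEnd();
    else requestAnimationFrame(tick);                 // keep polling
  };
  requestAnimationFrame(tick);
}
```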

@dan-menlo changed the title from "epic: Jan supports Local Voice Mode using Jade models" to "epic: Jan supports Local Voice Mode w/ Ichijo" Sep 23, 2024
@dan-menlo moved this from Planning to Scheduled in Menlo Sep 26, 2024
@freelerobot added the "category: multimodal" label Oct 14, 2024
@dan-menlo changed the title from "epic: Jan supports Local Voice Mode w/ Ichijo" to "roadmap: Jan supports Local Voice Mode w/ Ichijo" Nov 29, 2024
@dan-menlo moved this from Scheduled to In Progress in Menlo Jan 2, 2025
@imtuyethan modified the milestone: v0.5.16 Jan 2, 2025
@imtuyethan modified the milestones: v0.5.15, v0.5.16 Jan 21, 2025