
[Feature Request]: Voice Input and Audio Response #1877

Closed
1 task done
paresh2806 opened this issue Aug 8, 2024 · 7 comments

@paresh2806
Contributor

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

Summary:
Add voice input and audio response capabilities to the chat application.

Description:

  1. Voice Input:

    • Allow users to speak instead of typing their messages.
    • Implement a speech-to-text functionality that transcribes the spoken words.
    • The transcribed text will be passed to the chat completion API call, similar to the current text input method.

    Voice Input Example

  2. Audio Response:

    • Provide users with the option to listen to the responses.
    • Implement a text-to-speech functionality that converts the text responses into audio.
    • Users can choose to play the audio response instead of reading the text.

    Audio Response Example

Benefits:

  • Improves accessibility and convenience.
  • Enhances user experience with multiple interaction methods.

Implementation:

  • Integrate speech-to-text and text-to-speech APIs (see the sketch after this list).
  • Add UI elements for recording and playing audio.
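
As an aside, modern browsers already expose both directions natively, which is one quick way to prototype this before wiring up server-side STT/TTS models. A minimal sketch, assuming a Chromium-based browser (SpeechRecognition is still prefixed there, and support varies across browsers):

// Sketch only: browser-native speech-to-text and text-to-speech.
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

export function listenOnce(onText: (text: string) => void): void {
  const recognition = new Recognition();
  recognition.lang = "en-US";
  recognition.interimResults = false;
  // Fires once a final transcript is available; pass it to the chat input.
  recognition.onresult = (event: any) => onText(event.results[0][0].transcript);
  recognition.start();
}

export function speak(text: string): void {
  // Queue the answer text on the browser's built-in speech synthesizer.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}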

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

@KevinHuSh mentioned this issue Sep 12, 2024 (59 tasks)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
### What problem does this PR solve?

feat: Supports text output and sound output #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…to the tab of the conversation #1877 (#2440)

### What problem does this PR solve?

feat: After the voice in the new conversation window is played, jump to
the tab of the conversation #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…llowed to be turned on #1877 (#2446)

### What problem does this PR solve?

feat: If the tts model is not set, the Text to Speech switch is not
allowed to be turned on #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…ly message when the answer is empty #1877 (#2447)

### What problem does this PR solve?

feat: When voice is turned on, the page will not display an empty reply
message when the answer is empty #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
@peizimo

peizimo commented Oct 22, 2024

At the moment the application doesn't support voice input. If I want to implement this function myself, how should I go about it? Thank you!

@cike8899
Contributor

You can implement the voice input function by modifying the message-input file, using the Web Audio API. Alternatively, you can use a third-party library such as react-mic.
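
For illustration, a minimal recorder sketch using the standard MediaRecorder API (the browser primitive that libraries like react-mic wrap). The function names here are hypothetical, not existing ragflow code:

// Sketch: capture microphone audio and hand the finished clip to a callback.
let recorder: MediaRecorder | null = null;
let chunks: Blob[] = [];

export async function startRecording(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  chunks = [];
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
}

export function stopRecording(onDone: (audio: Blob) => void): void {
  if (!recorder) return;
  const r = recorder;
  r.onstop = () => {
    // The container format is browser-dependent (often audio/webm).
    onDone(new Blob(chunks, { type: r.mimeType }));
    // Release the microphone.
    r.stream.getTracks().forEach((t) => t.stop());
  };
  r.stop();
}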

@peizimo

peizimo commented Oct 23, 2024

Thank you for your reply. We have a development plan: when will the audio input function be implemented? In the meantime, could you tell me the path of the message-input file? Good luck to you!

@peizimo

peizimo commented Oct 24, 2024

> You can implement the voice input function by modifying the message-input file, using the Web Audio API. Alternatively, you can use a third-party library such as react-mic.

Do I just need to add files related to the Web Audio API or react-mic under the ragflow/web/src/component/message-input path, or are other changes needed? Could you send me the relevant files or a link? Looking forward to your reply!

@paresh2806
Contributor Author

paresh2806 commented Oct 30, 2024

Follow-Up on voice input

Hi @peizimo and @cike8899,

I’ve been following the discussions regarding the implementation of voice input and audio responses in the chat application. Given the recent updates and the plan to integrate voice input, I’d like to share a potential approach for the audio transcription aspect.

I have implemented a similar feature in my project. The front end records an audio file (in WAV format) using "@types/react-mic": "^12.4.6", and I duplicated the base /completion API endpoint as /completion_with_transcribe, which accepts audio input and processes it as soon as the mic button is released on the client side. Here's a code snippet illustrating my modifications:

# Imports needed to run this snippet (ragflow-internal names such as `manager`,
# ConversationService, DialogService, chat, get_data_error_result,
# get_json_result and server_error_response come from the existing code base):
import json
import os
from copy import deepcopy

from flask import Response, jsonify, request
from flask_login import login_required
from groq import Groq
from pydub import AudioSegment
from werkzeug.utils import secure_filename


@manager.route('/completion_with_transcribe', methods=['POST'])
@login_required
def completion_with_transcribe():
    audio_file = request.files.get('file')
    transcription_text = None

    if audio_file:
        print(f"Received file: {audio_file.filename}")

        # Secure the filename and save it temporarily
        filename = secure_filename(audio_file.filename)
        temp_path = os.path.join('/tmp', filename)
        audio_file.save(temp_path)

        # Convert the audio file to MP3 format
        try:
            audio = AudioSegment.from_file(temp_path)
            mp3_path = os.path.join('/tmp', f"{os.path.splitext(filename)[0]}.mp3")
            audio.export(mp3_path, format="mp3")
            print(f"File converted to MP3: {mp3_path}")

        except Exception as e:
            print(f"Error converting file to MP3: {e}")
            return jsonify({"error": "Failed to convert audio to MP3"}), 500

        # Transcription logic using Groq client
        try:
            client = Groq(api_key="YOUR_API_KEY_HERE")

            with open(mp3_path, "rb") as file:
                transcription = client.audio.transcriptions.create(
                    file=(mp3_path, file.read()),
                    model="whisper-large-v3",
                    response_format="verbose_json",
                )
                transcription_text = transcription.text
                print(f"Transcription: {transcription_text}")

        except Exception as e:
            print(f"Error during transcription: {e}")
            return jsonify({"error": "Failed to transcribe audio"}), 500

        # Clean up temporary files
        os.remove(temp_path)
        os.remove(mp3_path)

    if not transcription_text:
        return jsonify({"error": "Failed to transcribe audio"}), 500

    # Access JSON data from the form
    req_data = request.form.get('data')
    if not req_data:
        return jsonify({"error": "No JSON data provided"}), 400

    req = json.loads(req_data)
    print('completion------->', req)

    msg = []
    for m in req["messages"]:
        if m["role"] == "system":
            continue
        if m["role"] == "assistant" and not msg:
            continue
        msg.append({"role": m["role"], "content": m["content"]})

    # Add the transcription as a new user message if available
    if transcription_text:
        msg.append({"role": "user", "content": transcription_text})

    try:
        e, conv = ConversationService.get_by_id(req["conversation_id"])
        if not e:
            return get_data_error_result(retmsg="Conversation not found!")
        conv.message.append(deepcopy(msg[-1]))
        e, dia = DialogService.get_by_id(conv.dialog_id)
        if not e:
            return get_data_error_result(retmsg="Dialog not found!")

        del req["conversation_id"]
        del req["messages"]

        if not conv.reference:
            conv.reference = []
        conv.message.append({"role": "assistant", "content": ""})
        conv.reference.append({"chunks": [], "doc_aggs": []})

        def fillin_conv(ans):
            nonlocal conv
            if not conv.reference:
                conv.reference.append(ans["reference"])
            else:
                conv.reference[-1] = ans["reference"]
            conv.message[-1] = {"role": "assistant", "content": ans["answer"]}

        def stream():
            nonlocal dia, msg, req, conv
            try:
                for ans in chat(dia, msg, True, **req):
                    fillin_conv(ans)
                    yield "data:" + json.dumps({"retcode": 0, "retmsg": "", "data": ans}, ensure_ascii=False) + "\n\n"
                ConversationService.update_by_id(conv.id, conv.to_dict())
            except Exception as e:
                yield "data:" + json.dumps({"retcode": 500, "retmsg": str(e), "data": {"answer": "**ERROR**: " + str(e), "reference": []}}, ensure_ascii=False) + "\n\n"
            yield "data:" + json.dumps({"retcode": 0, "retmsg": "", "data": True}, ensure_ascii=False) + "\n\n"

        if req.get("stream", True):
            resp = Response(stream(), mimetype="text/event-stream")
            resp.headers.add_header("Cache-control", "no-cache")
            resp.headers.add_header("Connection", "keep-alive")
            resp.headers.add_header("X-Accel-Buffering", "no")
            resp.headers.add_header("Content-Type", "text/event-stream; charset=utf-8")
            return resp
        else:
            answer = None
            for ans in chat(dia, msg, **req):
                answer = ans
                fillin_conv(ans)
                ConversationService.update_by_id(conv.id, conv.to_dict())
                break
            return get_json_result(data=answer)
    except Exception as e:
        return server_error_response(e)

Hope this helps. Let me know if I can help further!
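
To connect this with the front-end question above: as written, the endpoint expects a multipart form with the audio under file and the usual completion payload JSON-encoded under data. A hypothetical upload helper along these lines could call it (illustrative only; the route prefix depends on where manager mounts the blueprint, and parsing of the streamed response is elided):

// Sketch: POST the recorded audio plus the completion payload to the
// /completion_with_transcribe endpoint shown above.
export async function sendVoiceMessage(
  audio: Blob,
  conversationId: string,
  messages: { role: string; content: string }[],
): Promise<Response> {
  const form = new FormData();
  // The Flask handler reads request.files.get('file') ...
  form.append("file", audio, "recording.wav");
  // ... and request.form.get('data') for the JSON payload.
  form.append(
    "data",
    JSON.stringify({ conversation_id: conversationId, messages, stream: true }),
  );
  // With stream=true the response is a server-sent-event stream of
  // "data: {...}\n\n" frames, so the caller should read it incrementally.
  // Assumed mount point; adjust to wherever the blueprint is registered.
  return fetch("/v1/conversation/completion_with_transcribe", {
    method: "POST",
    body: form,
  });
}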

@peizimo

peizimo commented Oct 30, 2024

> @manager.route('/completion_with_transcribe', methods=['POST'])

Thank you for your reply. I will test this code as soon as possible and give you feedback. Best wishes!

@peizimo

peizimo commented Nov 4, 2024

> Follow-Up on voice input
>
> Hi @peizimo and @cike8899,
>
> [@paresh2806's comment above, including the /completion_with_transcribe snippet, quoted in full]

Does this code need to go in a web/src/component/message-input folder on its own? And should a recording component be added to the application interface? Thanks, looking forward to your reply!
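
For illustration only (this is not existing ragflow code, and the imports refer to the hypothetical helpers sketched earlier in this thread), a push-to-talk button along these lines could live under message-input and tie the recorder to the upload helper:

// Sketch: a push-to-talk button wiring the recorder to the upload helper.
import React, { useState } from "react";
import { startRecording, stopRecording } from "./recorder"; // hypothetical module
import { sendVoiceMessage } from "./voice-api"; // hypothetical module

export function RecordButton(props: {
  conversationId: string;
  messages: { role: string; content: string }[];
}) {
  const [recording, setRecording] = useState(false);

  const toggle = async () => {
    if (!recording) {
      await startRecording();
      setRecording(true);
    } else {
      // Hand the finished clip to the transcription endpoint.
      stopRecording((audio) =>
        void sendVoiceMessage(audio, props.conversationId, props.messages),
      );
      setRecording(false);
    }
  };

  return (
    <button type="button" onClick={toggle}>
      {recording ? "Stop" : "Record"}
    </button>
  );
}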

Halfknow pushed commits to Halfknow/ragflow that referenced this issue Nov 11, 2024 (infiniflow#2436, infiniflow#2440, infiniflow#2446, infiniflow#2447)