
[Feature Request]: Voice Input and Audio Response #1877

Closed
1 task done
paresh2806 opened this issue Aug 8, 2024 · 7 comments

@paresh2806
Contributor

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

Summary:
Add voice input and audio response capabilities to the chat application.

Description:

  1. Voice Input:

    • Allow users to speak instead of typing their messages.
    • Implement a speech-to-text functionality that transcribes the spoken words.
    • The transcribed text will be passed to the chat completion API call, similar to the current text input method.

    Voice Input Example

  2. Audio Response:

    • Provide users with the option to listen to the responses.
    • Implement a text-to-speech functionality that converts the text responses into audio.
    • Users can choose to play the audio response instead of reading the text.

    Audio Response Example

Benefits:

  • Improves accessibility and convenience.
  • Enhances user experience with multiple interaction methods.

Implementation:

  • Integrate speech-to-text and text-to-speech APIs (see the sketch after this list).
  • Add UI elements for recording and playing audio.
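
As an aside, modern browsers already expose both directions natively, which is one quick way to prototype this before wiring up server-side STT/TTS models. A minimal sketch, assuming a Chromium-based browser (SpeechRecognition is still prefixed there, and support varies across browsers):

// Sketch only: browser-native speech-to-text and text-to-speech.
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

export function listenOnce(onText: (text: string) => void): void {
  const recognition = new Recognition();
  recognition.lang = "en-US";
  recognition.interimResults = false;
  // Fires once a final transcript is available; pass it to the chat input.
  recognition.onresult = (event: any) => onText(event.results[0][0].transcript);
  recognition.start();
}

export function speak(text: string): void {
  // Queue the answer text on the browser's built-in speech synthesizer.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}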

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

@KevinHuSh mentioned this issue Sep 12, 2024 (59 tasks)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
### What problem does this PR solve?

feat: Supports text output and sound output #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…to the tab of the conversation #1877 (#2440)

### What problem does this PR solve?

feat: After the voice in the new conversation window is played, jump to
the tab of the conversation #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…llowed to be turned on #1877 (#2446)

### What problem does this PR solve?

feat: If the tts model is not set, the Text to Speech switch is not
allowed to be turned on #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
cike8899 added a commit to cike8899/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
…ly message when the answer is empty #1877 (#2447)

### What problem does this PR solve?

feat: When voice is turned on, the page will not display an empty reply
message when the answer is empty #1877

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
@peizimo

peizimo commented Oct 22, 2024

At the moment the application doesn't support voice input. If I want to implement this function myself, how should I go about it? Thank you!

@cike8899
Contributor

You can implement the voice input function by modifying the message-input file, using the Web Audio API. Alternatively, you can use a third-party library such as react-mic.
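
For illustration, a minimal recorder sketch using the standard MediaRecorder API (the browser primitive that libraries like react-mic wrap). The function names here are hypothetical, not existing ragflow code:

// Sketch: capture microphone audio and hand the finished clip to a callback.
let recorder: MediaRecorder | null = null;
let chunks: Blob[] = [];

export async function startRecording(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  chunks = [];
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
}

export function stopRecording(onDone: (audio: Blob) => void): void {
  if (!recorder) return;
  const r = recorder;
  r.onstop = () => {
    // The container format is browser-dependent (often audio/webm).
    onDone(new Blob(chunks, { type: r.mimeType }));
    // Release the microphone.
    r.stream.getTracks().forEach((t) => t.stop());
  };
  r.stop();
}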

@peizimo

peizimo commented Oct 23, 2024

Thank you for your reply. We have a development plan: when will the audio input function be implemented? In the meantime, could you tell me the path of the message-input file? Good luck to you!

@peizimo

peizimo commented Oct 24, 2024

> You can implement the voice input function by modifying the message-input file, using the Web Audio API. Alternatively, you can use a third-party library such as react-mic.

Do I just need to add files related to the Web Audio API or react-mic under the ragflow/web/src/component/message-input path, or are other changes needed? Could you send me the relevant files or a link? Looking forward to your reply!

@paresh2806
Contributor Author

paresh2806 commented Oct 30, 2024

Follow-Up on voice input

Hi @peizimo and @cike8899,

I’ve been following the discussions regarding the implementation of voice input and audio responses in the chat application. Given the recent updates and the plan to integrate voice input, I’d like to share a potential approach for the audio transcription aspect.

I have implemented a similar feature in my project. The front end records an audio file (in WAV format) using "@types/react-mic": "^12.4.6", and I duplicated the base /completion API endpoint as /completion_with_transcribe, which accepts audio input and processes it as soon as the mic button is released on the client side. Here's a code snippet illustrating my modifications:

# Imports needed to run this snippet (ragflow-internal names such as `manager`,
# ConversationService, DialogService, chat, get_data_error_result,
# get_json_result and server_error_response come from the existing code base):
import json
import os
from copy import deepcopy

from flask import Response, jsonify, request
from flask_login import login_required
from groq import Groq
from pydub import AudioSegment
from werkzeug.utils import secure_filename


@manager.route('/completion_with_transcribe', methods=['POST'])
@login_required
def completion_with_transcribe():
    audio_file = request.files.get('file')
    transcription_text = None

    if audio_file:
        print(f"Received file: {audio_file.filename}")

        # Secure the filename and save it temporarily
        filename = secure_filename(audio_file.filename)
        temp_path = os.path.join('/tmp', filename)
        audio_file.save(temp_path)

        # Convert the audio file to MP3 format
        try:
            audio = AudioSegment.from_file(temp_path)
            mp3_path = os.path.join('/tmp', f"{os.path.splitext(filename)[0]}.mp3")
            audio.export(mp3_path, format="mp3")
            print(f"File converted to MP3: {mp3_path}")

        except Exception as e:
            print(f"Error converting file to MP3: {e}")
            return jsonify({"error": "Failed to convert audio to MP3"}), 500

        # Transcription logic using Groq client
        try:
            client = Groq(api_key="YOUR_API_KEY_HERE")

            with open(mp3_path, "rb") as file:
                transcription = client.audio.transcriptions.create(
                    file=(mp3_path, file.read()),
                    model="whisper-large-v3",
                    response_format="verbose_json",
                )
                transcription_text = transcription.text
                print(f"Transcription: {transcription_text}")

        except Exception as e:
            print(f"Error during transcription: {e}")
            return jsonify({"error": "Failed to transcribe audio"}), 500

        # Clean up temporary files
        os.remove(temp_path)
        os.remove(mp3_path)

    if not transcription_text:
        return jsonify({"error": "Failed to transcribe audio"}), 500

    # Access JSON data from the form
    req_data = request.form.get('data')
    if not req_data:
        return jsonify({"error": "No JSON data provided"}), 400

    req = json.loads(req_data)
    print('completion------->', req)

    msg = []
    for m in req["messages"]:
        if m["role"] == "system":
            continue
        if m["role"] == "assistant" and not msg:
            continue
        msg.append({"role": m["role"], "content": m["content"]})

    # Add the transcription as a new user message if available
    if transcription_text:
        msg.append({"role": "user", "content": transcription_text})

    try:
        e, conv = ConversationService.get_by_id(req["conversation_id"])
        if not e:
            return get_data_error_result(retmsg="Conversation not found!")
        conv.message.append(deepcopy(msg[-1]))
        e, dia = DialogService.get_by_id(conv.dialog_id)
        if not e:
            return get_data_error_result(retmsg="Dialog not found!")

        del req["conversation_id"]
        del req["messages"]

        if not conv.reference:
            conv.reference = []
        conv.message.append({"role": "assistant", "content": ""})
        conv.reference.append({"chunks": [], "doc_aggs": []})

        def fillin_conv(ans):
            nonlocal conv
            if not conv.reference:
                conv.reference.append(ans["reference"])
            else:
                conv.reference[-1] = ans["reference"]
            conv.message[-1] = {"role": "assistant", "content": ans["answer"]}

        def stream():
            nonlocal dia, msg, req, conv
            try:
                for ans in chat(dia, msg, True, **req):
                    fillin_conv(ans)
                    yield "data:" + json.dumps({"retcode": 0, "retmsg": "", "data": ans}, ensure_ascii=False) + "\n\n"
                ConversationService.update_by_id(conv.id, conv.to_dict())
            except Exception as e:
                yield "data:" + json.dumps({"retcode": 500, "retmsg": str(e), "data": {"answer": "**ERROR**: " + str(e), "reference": []}}, ensure_ascii=False) + "\n\n"
            yield "data:" + json.dumps({"retcode": 0, "retmsg": "", "data": True}, ensure_ascii=False) + "\n\n"

        if req.get("stream", True):
            resp = Response(stream(), mimetype="text/event-stream")
            resp.headers.add_header("Cache-control", "no-cache")
            resp.headers.add_header("Connection", "keep-alive")
            resp.headers.add_header("X-Accel-Buffering", "no")
            resp.headers.add_header("Content-Type", "text/event-stream; charset=utf-8")
            return resp
        else:
            answer = None
            for ans in chat(dia, msg, **req):
                answer = ans
                fillin_conv(ans)
                ConversationService.update_by_id(conv.id, conv.to_dict())
                break
            return get_json_result(data=answer)
    except Exception as e:
        return server_error_response(e)

Hope this helps. Let me know if I can help further!
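
To connect this with the front-end question above: as written, the endpoint expects a multipart form with the audio under file and the usual completion payload JSON-encoded under data. A hypothetical upload helper along these lines could call it (illustrative only; the route prefix depends on where manager mounts the blueprint, and parsing of the streamed response is elided):

// Sketch: POST the recorded audio plus the completion payload to the
// /completion_with_transcribe endpoint shown above.
export async function sendVoiceMessage(
  audio: Blob,
  conversationId: string,
  messages: { role: string; content: string }[],
): Promise<Response> {
  const form = new FormData();
  // The Flask handler reads request.files.get('file') ...
  form.append("file", audio, "recording.wav");
  // ... and request.form.get('data') for the JSON payload.
  form.append(
    "data",
    JSON.stringify({ conversation_id: conversationId, messages, stream: true }),
  );
  // With stream=true the response is a server-sent-event stream of
  // "data: {...}\n\n" frames, so the caller should read it incrementally.
  // Assumed mount point; adjust to wherever the blueprint is registered.
  return fetch("/v1/conversation/completion_with_transcribe", {
    method: "POST",
    body: form,
  });
}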

@peizimo

peizimo commented Oct 30, 2024

> @manager.route('/completion_with_transcribe', methods=['POST'])

Thank you for your reply. I will test this code as soon as possible and give you feedback. Best wishes!

@peizimo

peizimo commented Nov 4, 2024

> Follow-Up on voice input
>
> Hi @peizimo and @cike8899,
>
> [@paresh2806's comment above, including the /completion_with_transcribe snippet, quoted in full]

Does this code need to go in a web/src/component/message-input folder on its own? And should a recording component be added to the application interface? Thanks, looking forward to your reply!
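
For illustration only (this is not existing ragflow code, and the imports refer to the hypothetical helpers sketched earlier in this thread), a push-to-talk button along these lines could live under message-input and tie the recorder to the upload helper:

// Sketch: a push-to-talk button wiring the recorder to the upload helper.
import React, { useState } from "react";
import { startRecording, stopRecording } from "./recorder"; // hypothetical module
import { sendVoiceMessage } from "./voice-api"; // hypothetical module

export function RecordButton(props: {
  conversationId: string;
  messages: { role: string; content: string }[];
}) {
  const [recording, setRecording] = useState(false);

  const toggle = async () => {
    if (!recording) {
      await startRecording();
      setRecording(true);
    } else {
      // Hand the finished clip to the transcription endpoint.
      stopRecording((audio) =>
        void sendVoiceMessage(audio, props.conversationId, props.messages),
      );
      setRecording(false);
    }
  };

  return (
    <button type="button" onClick={toggle}>
      {recording ? "Stop" : "Record"}
    </button>
  );
}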

Halfknow pushed commits to Halfknow/ragflow that referenced this issue Nov 11, 2024 (infiniflow#2436, infiniflow#2440, infiniflow#2446, infiniflow#2447)