Develop a Basic Voice-to-Voice Translation Application Using Python

In this article, we will create a simple voice-to-voice translator application using Python. We will use Gradio for the interface, AssemblyAI for voice recognition, the translate library for text translation, and ElevenLabs for text-to-speech conversion.

Preparing the Environment

Before starting, make sure you have the required API keys and libraries. Create a .env file in your project folder and save the API keys in it. Do not forget to add .env to .gitignore if you plan to push your code to GitHub. To obtain the keys, sign up at ElevenLabs and AssemblyAI.

ELEVENLABS_API_KEY=your_elevenlabs_api_key
ASSEMBLYAI_API_KEY=your_assemblyai_api_key
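
As noted above, keep this file out of version control. A one-line entry in .gitignore is enough:

# .gitignore
.env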

Next, you also need to install the required libraries:

pip install gradio assemblyai translate elevenlabs python-dotenv

Importing Libraries and Loading Configuration

First, we import the necessary libraries and load the API keys from the .env file.

import uuid
import gradio as gr
import assemblyai as aai
from pathlib import Path
from dotenv import dotenv_values
from translate import Translator
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

We use gradio for the web interface, assemblyai for audio-to-text transcription, translate for text translation, and elevenlabs for text-to-speech conversion. We use uuid to generate unique file names, Path for file path handling, and dotenv_values to load the .env file.

After importing all the necessary libraries, we load the API keys from the .env file we created earlier:

config = dotenv_values(".env")

ELEVENLABS_API_KEY = config["ELEVENLABS_API_KEY"]
ASSEMBLYAI_API_KEY = config["ASSEMBLYAI_API_KEY"]

Audio Transcription

We use AssemblyAI to transcribe the audio file.

def audio_transcription(audio_file):
    aai.settings.api_key = ASSEMBLYAI_API_KEY
    transcriber = aai.Transcriber()
    transcription = transcriber.transcribe(audio_file)
    return transcription

First, we set the API key for AssemblyAI via aai.settings.api_key. Then we initialize a transcriber with aai.Transcriber() and transcribe the provided audio file with its transcribe method. The returned transcript object carries both the text and a status field, which we check for errors later.
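
Since the translators below assume English input, it can help to pin the transcription language rather than rely on detection defaults. Here is a minimal sketch, assuming the assemblyai SDK's TranscriptionConfig and its language_code parameter:

def audio_transcription_en(audio_file):
    # Sketch: pin the source language explicitly (assumes
    # aai.TranscriptionConfig from the assemblyai SDK).
    aai.settings.api_key = ASSEMBLYAI_API_KEY
    config = aai.TranscriptionConfig(language_code="en")
    transcriber = aai.Transcriber(config=config)
    return transcriber.transcribe(audio_file)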

Text Translation

In this function, we translate the transcribed text into Spanish, Arabic, and Japanese.

def text_translation(text):
    translator_es = Translator(from_lang="en", to_lang="es")
    es_text = translator_es.translate(text)
    translator_ar = Translator(from_lang="en", to_lang="ar")
    ar_text = translator_ar.translate(text)
    translator_ja = Translator(from_lang="en", to_lang="ja")
    ja_text = translator_ja.translate(text)
    return es_text, ar_text, ja_text

First, we initialize translators for each target language using Translator. Next, we use the translate method to translate the text from English to the target languages.
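
The three near-identical blocks also collapse nicely into a loop. A behavior-equivalent sketch, with the target codes pulled into a tuple:

def text_translation_compact(text, targets=("es", "ar", "ja")):
    # One Translator per target language; returns the same
    # (es, ar, ja) tuple as the explicit version above.
    return tuple(
        Translator(from_lang="en", to_lang=lang).translate(text)
        for lang in targets
    )

One caveat: the translate library's default backend (MyMemory) limits each request to roughly 500 characters, so very long transcripts may need to be split before translating.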

Text to Speech

Finally, we use ElevenLabs to convert the translated text into speech.

def text_to_speech(text: str) -> str:
    client = ElevenLabs(api_key=ELEVENLABS_API_KEY)
    response = client.text_to_speech.convert(
        voice_id="pNInz6obpgDQGcFmaJgB",
        optimize_streaming_latency="0",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_multilingual_v2",
        voice_settings=VoiceSettings(
            stability=0.5,
            similarity_boost=0.8,
            style=0.5,
            use_speaker_boost=True,
        ),
    )

    save_file_path = f"{uuid.uuid4()}.mp3"
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"{save_file_path}: A new audio file was saved successfully!")
    return save_file_path

First, we create an ElevenLabs client with our API key. Next, we call the text_to_speech.convert method to synthesize speech with the specified voice, model, and voice settings. The response is streamed in chunks, which we write to an MP3 file named with a fresh uuid.uuid4() so successive outputs never collide, and the function returns the saved file's path.
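
The hard-coded voice_id points at one of ElevenLabs' pre-made voices (Adam). To pick a different one, you can list the voices available to your account; a quick sketch, assuming the SDK's voices.get_all() endpoint:

client = ElevenLabs(api_key=ELEVENLABS_API_KEY)
for voice in client.voices.get_all().voices:
    print(voice.voice_id, voice.name)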

The voice_to_voice Function

This function is the core of our project. It transcribes the input audio, translates it into several languages, and converts the translations back into audio.

def voice_to_voice(audio_file):
    transcription_response = audio_transcription(audio_file)

    if transcription_response.status == aai.TranscriptStatus.error:
        raise gr.Error(transcription_response.error)
    else:
        text = transcription_response.text

    es_translation, ar_translation, ja_translation = text_translation(text)

    es_audio_path = text_to_speech(es_translation)
    ar_audio_path = text_to_speech(ar_translation)
    ja_audio_path = text_to_speech(ja_translation)

    es_path = Path(es_audio_path)
    ar_path = Path(ar_audio_path)
    ja_path = Path(ja_audio_path)

    return es_path, ar_path, ja_path

The voice_to_voice function takes an audio file as a parameter and handles the entire pipeline: transcription, translation, and text-to-speech conversion. The transcription_response variable stores the result of the transcription step; if AssemblyAI reports an error status, we raise a gr.Error so Gradio displays the message in the UI. Otherwise, the transcribed text is translated into Spanish, Arabic, and Japanese, and each translation is converted back into speech.
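
You can sanity-check the whole pipeline without launching the UI by calling the function directly. The file name here is hypothetical; any short English recording will do:

# sample.wav is a hypothetical local recording in English
es_path, ar_path, ja_path = voice_to_voice("sample.wav")
print(es_path, ar_path, ja_path)  # paths of the three generated MP3 files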

Creating the User Interface

For this project, we use Gradio to create a simple, user-friendly interface. Here, users can input audio through their microphone and get the translated audio output.

audio_input = gr.Audio(
    sources=["microphone"],
    type="filepath"
)

demo = gr.Interface(
    fn=voice_to_voice,
    inputs=audio_input,
    outputs=[gr.Audio(label="Spanish"), gr.Audio(label="Arabic"), gr.Audio(label="Japanese")],
    title="Voice-to-Voice Translator",
    description="Record in English; hear the translation in Spanish, Arabic, and Japanese.",
)

if __name__ == "__main__":
    demo.launch()

First, we define the audio input with gr.Audio, restricting the source to the microphone and asking Gradio to hand our function a file path. Then gr.Interface wires voice_to_voice to that input and to the three labeled audio outputs.
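
By default, launch() serves the app locally and prints a http://127.0.0.1 URL. Gradio can also create a temporary public link via its built-in share flag, which is handy for quick demos:

demo.launch(share=True)  # generates a temporary public URL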

Full main.py file

import gradio as gr
import assemblyai as aai
from translate import Translator
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs
import uuid
from pathlib import Path
from dotenv import dotenv_values

config = dotenv_values(".env")

ELEVENLABS_API_KEY = config["ELEVENLABS_API_KEY"]
ASSEMBLYAI_API_KEY = config["ASSEMBLYAI_API_KEY"]


def voice_to_voice(audio_file):
    transcription_response = audio_transcription(audio_file)

    if transcription_response.status == aai.TranscriptStatus.error:
        raise gr.Error(transcription_response.error)
    else:
        text = transcription_response.text

    es_translation, ar_translation, ja_translation = text_translation(text)

    es_audio_path = text_to_speech(es_translation)
    ar_audio_path = text_to_speech(ar_translation)
    ja_audio_path = text_to_speech(ja_translation)

    es_path = Path(es_audio_path)
    ar_path = Path(ar_audio_path)
    ja_path = Path(ja_audio_path)

    return es_path, ar_path, ja_path

def audio_transcription(audio_file):

    aai.settings.api_key = ASSEMBLYAI_API_KEY

    transcriber = aai.Transcriber()
    transcription = transcriber.transcribe(audio_file)

    return transcription

def text_translation(text):
    translator_es = Translator(from_lang="en", to_lang="es")
    es_text = translator_es.translate(text)

    translator_ar = Translator(from_lang="en", to_lang="ar")
    ar_text = translator_ar.translate(text)

    translator_ja = Translator(from_lang="en", to_lang="ja")
    ja_text = translator_ja.translate(text)

    return es_text, ar_text, ja_text

def text_to_speech(text: str) -> str:
    client = ElevenLabs(api_key=ELEVENLABS_API_KEY)
    # Calling the text_to_speech conversion API with detailed parameters
    response = client.text_to_speech.convert(
        voice_id="pNInz6obpgDQGcFmaJgB", # Adam pre-made voice
        optimize_streaming_latency="0",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_multilingual_v2", # multilingual model, so all three target languages are voiced correctly
        voice_settings=VoiceSettings(
            stability=0.5,
            similarity_boost=0.8,
            style=0.5,
            use_speaker_boost=True,
        ),
    )

    # Generating a unique file name for the output MP3 file
    save_file_path = f"{uuid.uuid4()}.mp3"

    # Writing the audio to a file
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)

    print(f"{save_file_path}: A new audio file was saved successfully!")

    # Return the path of the saved audio file
    return save_file_path

audio_input = gr.Audio(
    sources=["microphone"],
    type="filepath"
)

demo = gr.Interface(
    fn=voice_to_voice,
    inputs=audio_input,
    outputs=[gr.Audio(label="Spanish"), gr.Audio(label="Arabic"), gr.Audio(label="Japanese")],
    title="Voice-to-Voice Translator",
    description="Record in English; hear the translation in Spanish, Arabic, and Japanese.",
)

if __name__ == "__main__":
    demo.launch()

Credit: https://blog.gopenai.com/create-a-simple-voice-to-voice-translation-app-with-python-83310c633a20
