Audio-capable models such as GPT-4o can handle audio modalities. With Sudo, you can call OpenAI-compatible audio models to produce spoken output and optionally send audio input alongside a text prompt.
Only certain providers currently support audio input/output with Chat Completions in this format (notably OpenAI). Before using audio, confirm your chosen model provider’s API accepts modalities and input_audio as shown here, and adjust parameters accordingly.
## Audio Output from Text
Generate spoken audio from a text prompt using an audio-capable model.
```typescript
import { Sudo } from "sudo-ai";

const sudo = new Sudo({
  serverURL: "https://sudoapp.dev/api",
  apiKey: process.env.SUDO_API_KEY ?? "",
});

async function textToSpeech() {
  const response = await sudo.router.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" },
    messages: [
      {
        role: "user",
        content: "Is a golden retriever a good family dog?",
      },
    ],
  });

  // The response may include both text and audio. Inspect the structure as providers evolve.
  console.log("Text:", response.choices[0].message.content);
  // Optionally check for audio fields if present in your provider’s response shape
  // console.log("Audio (base64):", response.choices[0].message.audio?.data);
}

textToSpeech();
```
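If your provider returns audio as base64 (as hinted by the `message.audio?.data` field above), you can decode it and write it to a playable file. This is a minimal sketch; the exact response field names depend on your provider's response shape:

```typescript
import * as fs from "fs";

// Decode a base64-encoded audio payload and write it to disk.
// Pass the base64 string from your provider's response, e.g.
// response.choices[0].message.audio?.data (field name may vary by provider).
function saveBase64Audio(base64Data: string, outputPath: string): void {
  const buffer = Buffer.from(base64Data, "base64");
  fs.writeFileSync(outputPath, buffer);
}
```

Pair the output extension with the `format` you requested (here, `"wav"`), so the file plays correctly in standard audio players.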
## Audio Input + Text
Send user audio together with a text prompt. Encode your audio file as base64 and include it using type: "input_audio".
```typescript
import { Sudo } from "sudo-ai";
import * as fs from "fs";
import * as path from "path";

const sudo = new Sudo({
  serverURL: "https://sudoapp.dev/api",
  apiKey: process.env.SUDO_API_KEY ?? "",
});

function encodeAudioToBase64(audioPath: string): { data: string; format: string } {
  const ext = path.extname(audioPath).toLowerCase();
  const format = ext.replace(".", "") || "wav"; // e.g. .wav -> wav
  const audioBuffer = fs.readFileSync(audioPath);
  return { data: audioBuffer.toString("base64"), format };
}

async function analyzeAudioWithText() {
  const { data, format } = encodeAudioToBase64("path/to/recording.wav");

  const response = await sudo.router.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" },
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What is in this recording?" },
          {
            type: "input_audio",
            inputAudio: {
              data,
              format,
            },
          },
        ],
      },
    ],
  });

  console.log("Text:", response.choices[0].message.content);
}

analyzeAudioWithText();
```
## Tips
- Prefer lossless or high-quality input formats for best transcription/understanding.
- Keep requests small; very long audio can increase latency and cost.
- Verify your chosen model supports audio input/output and the requested format.
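To act on the "keep requests small" tip, you can reject oversized files before spending time base64-encoding them. The 20 MB threshold below is an illustrative assumption; check your provider's documented request-size limits:

```typescript
import * as fs from "fs";

// Illustrative limit only — substitute your provider's actual maximum.
const MAX_AUDIO_BYTES = 20 * 1024 * 1024;

// Throw before encoding if the audio file exceeds the size threshold.
// Base64 encoding inflates payloads by ~33%, so checking the raw file
// size first avoids wasted work on requests that would be rejected.
function assertAudioSize(audioPath: string): void {
  const { size } = fs.statSync(audioPath);
  if (size > MAX_AUDIO_BYTES) {
    throw new Error(
      `Audio file is ${size} bytes, which exceeds the ${MAX_AUDIO_BYTES}-byte limit.`
    );
  }
}
```

Call this before `encodeAudioToBase64` so oversized recordings fail fast with a clear error instead of a provider-side rejection.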