Models such as GPT-4o support audio as both an input and an output modality. With Sudo, you can call OpenAI-compatible audio models to generate spoken output and, optionally, send audio input alongside text.
Only certain providers currently support audio input/output with Chat Completions in this format (notably OpenAI). Before using audio, confirm your chosen model provider’s API accepts modalities and input_audio as shown here, and adjust parameters accordingly.

Audio Output from Text

Generate spoken audio from a text prompt using an audio-capable model.
TypeScript
import { Sudo } from "sudo-ai";

const sudo = new Sudo({
  serverURL: "https://sudoapp.dev/api",
  apiKey: process.env.SUDO_API_KEY ?? "",
});

async function textToSpeech() {
  const response = await sudo.router.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" },
    messages: [
      {
        role: "user",
        content: "Is a golden retriever a good family dog?",
      },
    ],
  });

  // The response may include both text and audio. Depending on the provider,
  // the text may arrive in message.content or as a transcript alongside the
  // audio, so inspect the structure as providers evolve.
  console.log("Text:", response.choices[0].message.content);
  // Optionally check for audio fields if present in your provider’s response shape
  // console.log("Audio (base64):", response.choices[0].message.audio?.data);
  // console.log("Transcript:", response.choices[0].message.audio?.transcript);
}

textToSpeech();

Audio Input + Text

Send user audio together with a text prompt. Encode your audio file as base64 and include it using type: "input_audio".
TypeScript
import { Sudo } from "sudo-ai";
import * as fs from "fs";
import * as path from "path";

const sudo = new Sudo({
  serverURL: "https://sudoapp.dev/api",
  apiKey: process.env.SUDO_API_KEY ?? "",
});

function encodeAudioToBase64(audioPath: string): { data: string; format: string } {
  const ext = path.extname(audioPath).toLowerCase();
  const format = ext.replace(".", "") || "wav"; // e.g. .wav -> wav
  const audioBuffer = fs.readFileSync(audioPath);
  return { data: audioBuffer.toString("base64"), format };
}

async function analyzeAudioWithText() {
  const { data, format } = encodeAudioToBase64("path/to/recording.wav");

  const response = await sudo.router.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    // Output audio settings; still required when "audio" is listed in modalities
    audio: { voice: "alloy", format: "wav" },
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What is in this recording?" },
          {
            type: "input_audio",
            inputAudio: {
              data,
              format,
            },
          },
        ],
      },
    ],
  });

  console.log("Text:", response.choices[0].message.content);
}

analyzeAudioWithText();

Tips

  • Prefer lossless or high-quality input formats for best transcription/understanding.
  • Keep requests small; very long audio can increase latency and cost.
  • Verify your chosen model supports audio input/output and the requested format.
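The last tip can be enforced up front with a small guard before encoding. A sketch, assuming a hypothetical allow-list; confirm which formats your chosen provider and model actually accept:

```typescript
import * as path from "path";

// Hypothetical allow-list — adjust to the formats your provider documents.
const SUPPORTED_INPUT_FORMATS = new Set(["wav", "mp3"]);

// Derive the format from the file extension and fail fast if it is unsupported,
// rather than paying for a request that the API will reject.
export function assertSupportedAudioFormat(audioPath: string): string {
  const format = path.extname(audioPath).toLowerCase().replace(".", "");
  if (!SUPPORTED_INPUT_FORMATS.has(format)) {
    throw new Error(
      `Unsupported audio format "${format}"; expected one of: ${[...SUPPORTED_INPUT_FORMATS].join(", ")}`
    );
  }
  return format;
}
```

Calling it just before `encodeAudioToBase64` keeps the validation in one place.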