Push to Talk

Push to talk (PTT) lets you control exactly when Phonic listens to the user’s audio, instead of relying on automatic turn-taking detection. It is useful for walkie-talkie-style interfaces, button-driven UIs, or any scenario where you want explicit control over turn-taking.

How it works

In the default mode, Phonic automatically detects when the user starts and stops speaking. With push to talk enabled, you control this by sending unmute and mute messages over the WebSocket:

  1. Unmute: signals that the user is about to speak. Phonic interrupts the assistant if it is still speaking and starts listening.
  2. Mute: signals that the user has finished speaking. Phonic stops listening and triggers the assistant’s reply.
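The two messages behave like a two-state switch. The sketch below models that switch as a small controller that turns talk-button presses and releases into unmute and mute messages, and drops redundant events such as a second press before a release. The class name `PushToTalkController` and its `press()`/`release()` methods are hypothetical. `socket` is assumed to be any object with a WebSocket-style `send(string)` method.

```javascript
// Hypothetical controller: maps talk-button events to Phonic's unmute/mute messages.
// `socket` is assumed to expose a WebSocket-style send(string) method.
class PushToTalkController {
  constructor(socket) {
    this.socket = socket;
    this.talking = false; // start muted: Phonic is not listening yet
  }

  // User pressed the talk button: ask Phonic to start listening.
  // Returns true if a message was sent.
  press() {
    if (this.talking) return false; // already unmuted; ignore duplicate presses
    this.talking = true;
    this.socket.send(JSON.stringify({ type: "unmute" }));
    return true;
  }

  // User released the talk button: Phonic stops listening and replies.
  // Returns true if a message was sent.
  release() {
    if (!this.talking) return false; // never unmuted; nothing to end
    this.talking = false;
    this.socket.send(JSON.stringify({ type: "mute" }));
    return true;
  }
}
```

Keeping this state on the client means each turn produces exactly one unmute/mute pair, even if the UI fires duplicate events.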

You must continue sending audio_chunk messages at all times, even while muted. Send either the real microphone audio or silent frames (zero-filled PCM). If you stop sending audio entirely, the connection may time out and the assistant will not respond.
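One way to meet this requirement is to drive the stream from a fixed-interval timer that always sends exactly one frame per tick. Each tick sends the next microphone frame if one is queued, and a silent frame of the same size otherwise. This is a sketch under stated assumptions. The names `createAudioPump`, `pushMicFrame`, and `micQueue` are hypothetical, and all frames are assumed to share one byte length and one PCM format, which must match whatever you actually stream to Phonic.

```javascript
// Hypothetical audio pump: emits exactly one audio_chunk per tick so the stream never stalls.
// Assumes every frame is 16-bit PCM with the same byte length (frameBytes).
function createAudioPump(socket, frameBytes) {
  const silentBase64 = Buffer.alloc(frameBytes).toString("base64"); // zero-filled PCM = silence
  const micQueue = []; // Buffers pushed by your microphone capture code

  // Send one frame: queued microphone audio if available, silence otherwise.
  function tick() {
    const frame = micQueue.shift();
    const audio = frame ? frame.toString("base64") : silentBase64;
    socket.send(JSON.stringify({ type: "audio_chunk", audio }));
  }

  return {
    pushMicFrame: (buf) => micQueue.push(buf),
    tick,
    // Start sending on a timer; returns a function that stops it.
    start(intervalMs) {
      const id = setInterval(tick, intervalMs);
      return () => clearInterval(id);
    },
  };
}
```

If your capture callback already fires at a steady rate, you could send from it directly and fall back to silence only when no microphone audio is available. The timer here mainly illustrates that the stream must never stop.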

Setup

1. Enable push to talk in your config message

Set push_to_talk to true in the initial config message:

```json
{
  "type": "config",
  "agent": "my-agent",
  "push_to_talk": true
}
```
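In a JavaScript client, the config message is typically the first thing sent once the socket opens. The helpers below are a hypothetical sketch. `buildConfigMessage` and `sendConfigOnOpen` are not part of any Phonic SDK, and they include only the fields shown in this guide (`type`, `agent`, `push_to_talk`). The `socket` is assumed to support `addEventListener("open", …)` and `send(string)`, as the browser `WebSocket` does.

```javascript
// Hypothetical helper: serializes the initial config message with push to talk enabled.
// Only the fields shown in this guide are included; "my-agent" is a placeholder agent name.
function buildConfigMessage(agent) {
  return JSON.stringify({
    type: "config",
    agent,
    push_to_talk: true, // turn-taking is driven by unmute/mute messages
  });
}

// Send the config as soon as the connection opens, before any audio_chunk messages.
function sendConfigOnOpen(socket, agent) {
  socket.addEventListener("open", () => {
    socket.send(buildConfigMessage(agent));
  });
}
```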

2. Send mute and unmute messages during the conversation

When the user presses the talk button, send an unmute message:

```json
{
  "type": "unmute"
}
```

When the user releases the button, send a mute message:

```json
{
  "type": "mute"
}
```
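If the talk button is a keyboard key, watch out for key auto-repeat. Holding a key fires repeated keydown events, and each one would otherwise send another unmute. The sketch below binds the space bar, which is an arbitrary choice. It skips repeats using the standard `KeyboardEvent.repeat` property. `bindSpaceToTalk` is a hypothetical name. `target` is any DOM `EventTarget` such as `window`, and `socket` is any object with a WebSocket-style `send`.

```javascript
// Hypothetical binding: hold Space to talk (unmute), release it to finish (mute).
function bindSpaceToTalk(target, socket) {
  let talking = false;

  function onKeyDown(event) {
    // Ignore other keys and the auto-repeated keydown events fired while Space is held.
    if (event.code !== "Space" || event.repeat || talking) return;
    talking = true;
    socket.send(JSON.stringify({ type: "unmute" }));
  }

  function onKeyUp(event) {
    if (event.code !== "Space" || !talking) return;
    talking = false;
    socket.send(JSON.stringify({ type: "mute" }));
  }

  target.addEventListener("keydown", onKeyDown);
  target.addEventListener("keyup", onKeyUp);

  // Return a cleanup function that removes both listeners.
  return () => {
    target.removeEventListener("keydown", onKeyDown);
    target.removeEventListener("keyup", onKeyUp);
  };
}
```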

3. Keep streaming audio while muted

Even when the user is muted, you must keep sending audio_chunk messages. You can send either:

  • Silent frames: zero-filled PCM data in the same format and chunk size you normally send
  • Real microphone audio: the speech model ignores audio received while the user is muted, so streaming the live microphone is also fine
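
```javascript
// Generate a silent audio chunk (example: 100 ms of mono 16-bit PCM at 44100 Hz)
const SILENT_CHUNK_SAMPLES = 4410; // 44100 samples/s * 0.1 s
const silentBuffer = Buffer.alloc(SILENT_CHUNK_SAMPLES * 2); // 2 bytes per 16-bit sample, zero-filled
const silentBase64 = silentBuffer.toString("base64");

// Illustrative flag: set to false when you send unmute, true when you send mute
let isMuted = true;

// Send silent chunks on an interval while muted
const silenceInterval = setInterval(() => {
  if (!isMuted) return; // while unmuted, send real microphone audio instead
  phonicSocket.send(JSON.stringify({
    type: "audio_chunk",
    audio: silentBase64,
  }));
}, 100);

// Stop the timer when the conversation ends:
// clearInterval(silenceInterval);
```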
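The chunk size in the example follows from the audio format: bytes = sample rate × duration × channels × 2 for 16-bit samples. For example, 44100 × 0.1 s × 1 × 2 = 8820 bytes. A helper that does this arithmetic makes it easier to match whatever format you actually stream. `makeSilentChunk` is a hypothetical helper. The fixed 16-bit sample size is an assumption carried over from the example, where zero-valued samples are silence.

```javascript
// Hypothetical helper: builds a base64 silent chunk for signed 16-bit PCM.
// For signed 16-bit PCM, silence is all-zero samples, so a zero-filled buffer is enough.
function makeSilentChunk(sampleRate, durationMs, channels = 1) {
  const BYTES_PER_SAMPLE = 2; // 16-bit samples
  const samplesPerChannel = Math.round((sampleRate * durationMs) / 1000);
  const byteLength = samplesPerChannel * channels * BYTES_PER_SAMPLE;
  return Buffer.alloc(byteLength).toString("base64"); // Buffer.alloc zero-fills
}
```

For instance, `makeSilentChunk(44100, 100)` decodes to 8820 zero bytes, the same size as `silentBuffer` in the example.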

Example flow

Here is a typical sequence of messages for a push-to-talk conversation:

| Step | Direction | Message | Description |
|------|-----------|---------|-------------|
| 1 | Client → Server | config | Connect with push_to_talk: true |
| 2 | Client → Server | audio_chunk | Stream silent frames (user hasn’t spoken yet) |
| 3 | Client → Server | unmute | User presses talk button |
| 4 | Client → Server | audio_chunk | Stream real microphone audio |
| 5 | Client → Server | mute | User releases talk button |
| 6 | Server → Client | audio_chunk | Assistant responds with audio |
| 7 | Client → Server | audio_chunk | Continue streaming silent frames while assistant speaks |
| 8 | Client → Server | unmute | User presses talk button again (interrupts assistant if still speaking) |
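The client-side steps of this sequence can be scripted against a stand-in socket that records what it is sent. This is a convenient way to unit-test push-to-talk logic without a live connection. Everything here is hypothetical test scaffolding. `RecordingSocket` only mimics a WebSocket `send(string)`, the audio payloads are short placeholder buffers rather than real microphone frames, and the server's audio reply (step 6) is not modeled.

```javascript
// Hypothetical stand-in for a WebSocket that records every message sent to it.
class RecordingSocket {
  constructor() {
    this.sent = [];
  }
  send(data) {
    this.sent.push(JSON.parse(data));
  }
}

// Replays the client → server steps of the example flow.
function runExampleFlow(socket) {
  const silence = Buffer.alloc(8).toString("base64"); // placeholder silent frame
  const speech = Buffer.from([1, 2, 3, 4, 5, 6, 7, 8]).toString("base64"); // placeholder mic frame
  const send = (msg) => socket.send(JSON.stringify(msg));

  send({ type: "config", agent: "my-agent", push_to_talk: true }); // step 1: connect
  send({ type: "audio_chunk", audio: silence });                    // step 2: idle, muted
  send({ type: "unmute" });                                         // step 3: button pressed
  send({ type: "audio_chunk", audio: speech });                     // step 4: user speaks
  send({ type: "mute" });                                           // step 5: button released
  // step 6 (server → client audio) is not modeled here
  send({ type: "audio_chunk", audio: silence });                    // step 7: keep streaming
  send({ type: "unmute" });                                         // step 8: next turn
}
```

Asserting on `socket.sent` lets a test check that each press and release produced exactly one unmute/mute pair, and that audio kept flowing between them.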

See the WebSocket API reference for full details on the mute and unmute message schemas.