Note: speech.started and speech.completed are optional to ingest, and are not strictly bracketed around their utterance's binary frames. See Event ordering for details.

Overview
Bluejay supports real-time, bidirectional voice over WebSocket using CHIRP, a transport-only protocol for exchanging raw audio and optional control events between Bluejay and your server. CHIRP (Conversational Handoff for Inter‑agent Realtime Protocol) is a real-time agent-to-agent (A2A) communication protocol that lets conversational agents interact over WebSockets through a standard medium of communication. The bare minimum integration is accepting a WebSocket connection, validating auth, and exchanging raw PCM binary frames. The three text event types (`speech.started`, `speech.completed`, `session.error`) are all optional on your side: Bluejay sends them automatically, but your server can safely ignore every text frame and still have a fully working integration.
When to use WebSocket
Use a WebSocket integration when:

- Your agent does not have a phone number.
- You already have a real-time audio pipeline and want to connect it directly to Bluejay.
- You need bidirectional streaming with lower latency than telephony.
- Your system handles raw PCM audio natively.
Quick reference
- Format: 16 kHz mono `pcm_s16le` in WebSocket binary frames.
- Minimum server: accept WS, validate Basic auth, read binary, write binary.
- Text events are optional from your side. Bluejay emits `speech.started`/`speech.completed` around Digital Human utterances (not strictly bracketed; see Event ordering).
- Barge-in: send `speech.started` while the Digital Human is speaking.
- Hang up: close the WebSocket with code `1000`.
Sample server implementations
Minimal servers that Bluejay can connect to. Each one accepts the WS upgrade, validates Basic auth, and echoes received audio back. Replace the echo with your actual audio source and sink.

- Basic (audio only)
- With text events

The minimum integration: it handles binary audio frames and ignores everything else.
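As a sketch of that minimum, here is a Python echo server built on the third-party `websockets` package. The handler name, port, and echo behavior are illustrative, and the exact `serve` signature varies across `websockets` versions; auth validation is omitted here and shown in the connection-lifecycle section.

```python
import asyncio

def frames_to_send(message):
    """Per-message logic: echo binary audio, ignore all text events."""
    if isinstance(message, bytes):
        if len(message) % 2 != 0:
            return []        # odd-length pcm_s16le frame: samples are 2 bytes, drop it
        return [message]     # echo; replace with your real audio source and sink
    return []                # speech.started / speech.completed / session.error: safe to ignore

async def handler(websocket):
    async for message in websocket:
        for frame in frames_to_send(message):
            await websocket.send(frame)

async def main():
    # Imported here so frames_to_send() stays usable without the package installed.
    import websockets
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

# To run: asyncio.run(main())
```

The per-message logic is kept in a plain function so the WebSocket wiring stays a thin shell around it.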
Message formats
CHIRP uses two WebSocket frame types: binary for audio (required) and text for control events (optional). A minimum integration only needs binary.

- Binary (audio)
- speech.started (optional)
- speech.completed (optional)
- session.error (optional)
Each binary frame is raw audio samples. Both sides send and receive them.
| Property | Value | Notes |
|---|---|---|
| Encoding | pcm_s16le | Signed 16-bit little-endian PCM |
| Sample rate | 16 000 Hz | Industry standard for speech AI |
| Channels | 1 (mono) | Voice is single-channel |
| Frame size | Recommended 20 ms (640 bytes). Any even length OK. | Must be even (samples are 2 bytes) |
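The recommended 640-byte frame follows directly from the format above. A short sketch of the arithmetic (names are illustrative):

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz mono
BYTES_PER_SAMPLE = 2      # pcm_s16le: signed 16-bit samples are 2 bytes

def frame_bytes(duration_ms: int) -> int:
    """Size in bytes of a mono pcm_s16le frame of the given duration."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE

def is_valid_frame(frame: bytes) -> bool:
    """Any even, non-empty length is acceptable."""
    return len(frame) > 0 and len(frame) % 2 == 0

# frame_bytes(20) == 640: 320 samples x 2 bytes
```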
Error handling
| Scenario | Bluejay behavior |
|---|---|
| HTTP 401/403 on upgrade | test_result.status = REJECTED |
| Host unreachable / TCP refused / TLS failure | 3 retries with backoff, then INCOMPLETED |
| Upgrade accepted but closed immediately | INCOMPLETED, close code logged |
| Protocol violation (bad JSON, missing field, odd-length audio) | session.error sent, frame dropped, session continues |
| Your server closes gracefully | Session torn down; status set by call completion |
| Network drop or crash | INCOMPLETED, session ended |
If you send a `session.error`, Bluejay logs it and surfaces it in `test_result.metadata`. The connection stays open unless you also close it.
| Close code | Meaning |
|---|---|
| 1000 | Normal closure |
| 1008 | Policy violation (typically auth rejected) |
| 1011 | Internal error on sender’s side |
How it works
Connection lifecycle
Bluejay opens the WebSocket
Bluejay dials the URL you configured on your Agent (e.g. `wss://your-host/voice`) and sends an HTTP upgrade with an `Authorization: Basic <base64(user:pass)>` header.

Your server accepts or rejects
- Valid credentials: respond with `HTTP 101 Switching Protocols` and the WS is live.
- Invalid credentials: respond with `HTTP 401`. Bluejay marks the run as `REJECTED`.
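Credential checking is ordinary HTTP Basic auth on the upgrade request. A minimal validator sketch; the expected credentials and function name are placeholders, and wiring it into the 101/401 response depends on your WebSocket framework:

```python
import base64
from hmac import compare_digest
from typing import Optional

EXPECTED_USER = "user"  # placeholder: use your configured credentials
EXPECTED_PASS = "pass"

def authorized(header_value: Optional[str]) -> bool:
    """Validate an `Authorization: Basic <base64(user:pass)>` header."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header_value[len("Basic "):]).decode()
        user, _, password = decoded.partition(":")
    except Exception:
        return False
    # compare_digest avoids leaking credential contents via timing
    return compare_digest(user, EXPECTED_USER) and compare_digest(password, EXPECTED_PASS)
```

On a `False` result, reject the upgrade with `HTTP 401` so Bluejay marks the run `REJECTED` rather than retrying.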
Exchange audio
- When you want to send audio, send it as a WebSocket binary frame of raw `pcm_s16le` at 16 kHz mono.
- When you receive a binary frame from Bluejay, play it back as raw `pcm_s16le` at 16 kHz mono. That is the Digital Human’s voice.
(Optional) React to text events
Bluejay emits `speech.started` / `speech.completed` text frames around every utterance. You can ignore them or use them to drive UI. They are not strictly bracketed around their audio; see Event ordering.

Event ordering
speech.started and speech.completed are derived from voice-activity detection, which runs on an independent clock from the audio reader. They are not a precise timestamp for when an utterance’s binary audio starts or ends on the wire.
In practice, you should expect this skew:
| Event | Can be offset by | From |
|---|---|---|
| speech.started | up to ~200 ms late | The first binary frame of the turn |
| speech.completed | up to ~200 ms early | The last binary frame of the turn |
- Do not assume binary frames arrive only between `speech.started` and `speech.completed` for a given `utterance_id`. A few frames on either side are normal.
- Do not use `speech.completed` as a hard signal to stop playing audio. Continue playing binary frames as they arrive; end-of-utterance on the wire is when frames stop (or a new `speech.started` with a different `utterance_id` arrives).
- Do use `utterance_id` to group audio with its speech events for logging, UI, or analytics, but treat the association as best-effort around the boundaries.
- Minimum-integration servers (binary only) are unaffected; they never see these events.
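One possible sketch of that best-effort grouping. It assumes the events are JSON text frames carrying `type` and `utterance_id` fields (the payload shape beyond those fields is an assumption), and it deliberately keeps attributing frames to the last started utterance after `speech.completed`, to absorb the ~200 ms skew:

```python
import json

class UtteranceTracker:
    """Best-effort association of binary frames with utterance_ids."""

    def __init__(self):
        self.current_id = None  # utterance the VAD most recently started
        self.frames = {}        # utterance_id -> list of binary frames

    def on_text(self, message: str):
        # Assumed shape: {"type": "...", "utterance_id": "..."}
        event = json.loads(message)
        if event["type"] == "speech.started":
            self.current_id = event["utterance_id"]
        elif event["type"] == "speech.completed":
            # Keep current_id: trailing frames of this utterance may still arrive,
            # since speech.completed can run up to ~200 ms early.
            pass

    def on_binary(self, frame: bytes):
        # Frames before the first speech.started have no utterance to join.
        if self.current_id is not None:
            self.frames.setdefault(self.current_id, []).append(frame)
```

Use the grouping for logging and analytics only; playback should follow the raw binary frames, as noted above.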
Sample session
Barge-in
If you run your own VAD or push-to-talk, send `speech.started` while the Digital Human is mid-utterance to interrupt it. There is no separate interrupt message; `speech.started` is the interrupt.
| You send speech.started while… | Bluejay’s behavior |
|---|---|
| Digital Human is mid-utterance | Stops sending audio, interrupts the Digital Human, emits speech.completed for the canceled utterance, listens |
| Digital Human is idle | Informational. Bluejay picks up user audio from binary frames regardless. |
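A barge-in is just a `speech.started` text frame sent from your side. A hedged sketch of building one; the payload fields beyond the event type are assumptions, not a confirmed schema:

```python
import json
import uuid

def barge_in_message() -> str:
    """Build the interrupt: a speech.started text frame.

    The utterance_id field here is an assumed convention for correlating
    your own speech events; adjust to the payload your integration uses.
    """
    return json.dumps({
        "type": "speech.started",
        "utterance_id": str(uuid.uuid4()),
    })
```

Send it the moment your VAD or push-to-talk fires (e.g. `await websocket.send(barge_in_message())`), then follow it with the user's binary audio frames.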