Note: speech.started and speech.completed are optional to ingest, and are not strictly bracketed around their utterance's binary frames. See Event ordering for details.

Overview

Bluejay supports real-time, bidirectional voice over WebSocket using CHIRP (Conversational Handoff for Inter-agent Realtime Protocol), a transport-only protocol for exchanging raw audio and optional control events between Bluejay and your server. CHIRP is a real-time agent-to-agent (A2A) communication protocol that lets conversational agents interact over WebSockets using a standard medium of communication. The bare-minimum integration accepts a WebSocket connection, validates auth, and exchanges raw PCM binary frames. The three text event types (speech.started, speech.completed, session.error) are all optional on your side: Bluejay sends them automatically, but your server can safely ignore every text frame and still have a fully working integration.

When to use WebSocket

Use a WebSocket integration when:
  • Your agent does not have a phone number.
  • You already have a real-time audio pipeline and want to connect it directly to Bluejay.
  • You need bidirectional streaming with lower latency than telephony.
  • Your system handles raw PCM audio natively.
For other connection types, check out our other integrations, or connect by adding your agent's phone number.

Quick reference

  • Format: 16 kHz mono pcm_s16le in WebSocket binary frames.
  • Minimum server: accept WS, validate Basic auth, read binary, write binary.
  • Text events are optional from your side. Bluejay emits speech.started / speech.completed around Digital Human utterances — not strictly bracketed, see Event ordering.
  • Barge-in: send speech.started while the Digital Human is speaking.
  • Hang up: close the WebSocket with code 1000.

Sample server implementations

A minimal server that Bluejay can connect to. It accepts the WS upgrade, validates Basic auth, and echoes received audio back. Replace the echo with your actual audio source and sink.
This is the minimum integration: it handles binary audio frames and ignores everything else.
# pip install websockets
import asyncio, base64, os
from websockets.asyncio.server import serve

USER, PASS = os.environ["CHIRP_USER"], os.environ["CHIRP_PASS"]
EXPECTED = "Basic " + base64.b64encode(f"{USER}:{PASS}".encode()).decode()

async def handler(ws):
    if ws.request.headers.get("Authorization") != EXPECTED:
        await ws.close(1008, "unauthorized"); return

    async for msg in ws:
        if isinstance(msg, bytes):  # binary = audio; optional text events are ignored
            await ws.send(msg)      # echo back; swap in your real audio sink/source

async def main():
    async with serve(handler, "0.0.0.0", 8080):
        await asyncio.Future()

asyncio.run(main())

Message formats

CHIRP uses two WebSocket frame types: binary for audio (required) and text for control events (optional). A minimum integration only needs binary.
Each binary frame is raw audio samples. Both sides send and receive them.
| Property | Value | Notes |
|---|---|---|
| Encoding | pcm_s16le | Signed 16-bit little-endian PCM |
| Sample rate | 16 kHz | Industry standard for speech AI |
| Channels | 1 (mono) | Voice is single-channel |
| Frame size | Recommended 20 ms (640 bytes); any even length is OK | Must be even (samples are 2 bytes) |
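As a sanity check of the numbers above, the following sketch builds one 20 ms frame in pure Python; the `tone_frame` helper is illustrative, not part of CHIRP:

```python
import math, struct

SAMPLE_RATE = 16_000                                  # 16 kHz mono, per the table above
FRAME_MS = 20                                         # recommended frame duration
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000    # 320 samples

def tone_frame(freq_hz: float = 440.0, amplitude: float = 0.3) -> bytes:
    """Build one 20 ms pcm_s16le frame containing a sine tone."""
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(SAMPLES_PER_FRAME)
    )
    return struct.pack(f"<{SAMPLES_PER_FRAME}h", *samples)  # little-endian int16

frame = tone_frame()
assert len(frame) == 640      # 320 samples x 2 bytes each
assert len(frame) % 2 == 0    # frames must be an even number of bytes
```

Each such frame is sent as one WebSocket binary frame; 50 of them per second sustains real-time audio.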

Error handling

| Scenario | Bluejay behavior |
|---|---|
| HTTP 401/403 on upgrade | test_result.status = REJECTED |
| Host unreachable / TCP refused / TLS failure | 3 retries with backoff, then INCOMPLETED |
| Upgrade accepted but closed immediately | INCOMPLETED, close code logged |
| Protocol violation (bad JSON, missing field, odd-length audio) | session.error sent, frame dropped, session continues |
| Your server closes gracefully | Session torn down; status set by call completion |
| Network drop or crash | INCOMPLETED, session ended |
If you send a session.error, Bluejay logs it and surfaces it in test_result.metadata. The connection stays open unless you also close it.
| Close code | Meaning |
|---|---|
| 1000 | Normal closure |
| 1008 | Policy violation (typically auth rejected) |
| 1011 | Internal error on the sender's side |
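Your server can apply the same triage to incoming frames before acting on them. A minimal sketch; `classify_frame` and its category names are hypothetical, chosen to mirror the protocol-violation row in the table above:

```python
import json

def classify_frame(msg) -> str:
    """Triage an incoming WebSocket frame: audio, a named text event, or a violation."""
    if isinstance(msg, (bytes, bytearray)):
        # Odd-length binary frames cannot be pcm_s16le samples (2 bytes each).
        return "audio" if len(msg) % 2 == 0 else "violation"
    try:
        evt = json.loads(msg)
        return evt.get("type", "violation")   # text frames must carry a "type"
    except (ValueError, AttributeError):
        return "violation"                    # not JSON, or not a JSON object

assert classify_frame(b"\x00" * 640) == "audio"
assert classify_frame(b"\x00" * 641) == "violation"   # odd-length audio
assert classify_frame('{"type":"session.error"}') == "session.error"
assert classify_frame("not json") == "violation"
```

Dropping a bad frame and continuing, rather than closing, matches how Bluejay itself treats protocol violations.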

How it works

Connection lifecycle

1

Bluejay opens the WebSocket

Bluejay dials the URL you configured on your Agent (e.g. wss://your-host/voice) and sends an HTTP upgrade with an Authorization: Basic <base64(user:pass)> header.
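The header value can be reproduced with the standard library. A short sketch; `basic_auth_header` is an illustrative helper, and the credentials are the placeholder pair from the sample session below:

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    # Value of the Authorization header Bluejay sends on the upgrade request:
    # "Basic " + base64("user:pass").
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

assert basic_auth_header("tomas", "tomas") == "Basic dG9tYXM6dG9tYXM="
```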
2

Your server accepts or rejects

  • Valid credentials: respond with HTTP 101 Switching Protocols and the WS is live.
  • Invalid credentials: respond with HTTP 401. Bluejay marks the run as REJECTED.
3

Exchange audio

  • When you want to send audio, send it as a WebSocket binary frame of raw pcm_s16le at 16 kHz mono.
  • When you receive a binary frame from Bluejay, play it back as raw pcm_s16le at 16 kHz mono. That is the Digital Human’s voice.
4

(Optional) React to text events

Bluejay emits speech.started / speech.completed text frames around every utterance. You can ignore them or use them to drive UI. They are not strictly bracketed around their audio — see Event ordering.
5

Hang up

Close with code 1000 for a normal end. Bluejay will also close 1000 when the test run completes.

Event ordering

speech.started and speech.completed are derived from voice-activity detection, which runs on an independent clock from the audio reader. They are not a precise timestamp for when an utterance’s binary audio starts or ends on the wire. In practice, you should expect this skew:
| Event | Can be offset by | From |
|---|---|---|
| speech.started | up to ~200 ms late | The first binary frame of the turn |
| speech.completed | up to ~200 ms early | The last binary frame of the turn |
Why: VAD needs an analysis window before it can declare that speech has started, and it declares end-of-speech before the underlying audio stream buffer fully drains.
What this means for your server:
  • Do not assume binary frames arrive only between speech.started and speech.completed for a given utterance_id. A few frames on either side are normal.
  • Do not use speech.completed as a hard signal to stop playing audio. Continue playing binary frames as they arrive; end-of-utterance on the wire is when frames stop (or a new speech.started with a different utterance_id arrives).
  • Do use utterance_id to group audio with its speech events for logging, UI, or analytics, but treat the association as best-effort around the boundaries.
  • Minimum-integration servers (binary only) are unaffected — they never see these events.
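If you do consume these events, best-effort grouping can look like the sketch below; `UtteranceLog` and its attribution policy are illustrative, not part of CHIRP:

```python
import json
from collections import defaultdict

class UtteranceLog:
    """Group binary frames with speech events by utterance_id, best-effort."""

    def __init__(self):
        self.frames = defaultdict(list)   # utterance_id -> received frames
        self.current = None               # most recent utterance_id seen

    def on_message(self, msg):
        if isinstance(msg, bytes):
            # Attribute to the current utterance even after speech.completed:
            # trailing frames are normal (see the skew table above).
            self.frames[self.current].append(msg)
        else:
            evt = json.loads(msg)
            if evt["type"] == "speech.started":
                self.current = evt["data"]["utterance_id"]

log = UtteranceLog()
log.on_message(b"\x00\x00")   # early frame: arrives before speech.started
log.on_message('{"type":"speech.started","data":{"utterance_id":"agent-u1"}}')
log.on_message(b"\x01\x00")
log.on_message('{"type":"speech.completed","data":{"utterance_id":"agent-u1"}}')
log.on_message(b"\x02\x00")   # trailing frame: still attributed to agent-u1
assert len(log.frames["agent-u1"]) == 2
assert len(log.frames[None]) == 1   # the early frame had no utterance yet
```

Note the early frame is misattributed by design: without a timestamped protocol field there is no exact answer at the boundaries, which is why the association is best-effort.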

Sample session

# 1. Handshake
[Bluejay  -> Server]  GET /voice HTTP/1.1
                      Authorization: Basic dG9tYXM6dG9tYXM=
                      Upgrade: websocket
[Server   -> Bluejay] HTTP/1.1 101 Switching Protocols

# 2. User speaks
[Server   -> Bluejay] binary <640 B: 20 ms of audio @ 16k mono>
[Server   -> Bluejay] binary <640 B>
[Server   -> Bluejay] binary <640 B>

# 3. Digital Human responds.
#    Note: speech.started arrives AFTER the first few audio frames (VAD analysis window),
#    and speech.completed arrives BEFORE the last few audio frames (VAD end-of-speech
#    declared before the audio buffer drains). See "Event ordering".
[Bluejay  -> Server]  binary <640 B: Digital Human audio>
[Bluejay  -> Server]  binary <640 B>
[Bluejay  -> Server]  text   {"type":"speech.started","data":{"utterance_id":"agent-u1"}, ...}
[Bluejay  -> Server]  binary <640 B>
[Bluejay  -> Server]  binary <640 B>
[Bluejay  -> Server]  text   {"type":"speech.completed","data":{"utterance_id":"agent-u1"}, ...}
[Bluejay  -> Server]  binary <640 B: trailing audio for agent-u1>
[Bluejay  -> Server]  binary <640 B: trailing audio for agent-u1>

# 4. Hang up
[Server   -> Bluejay] WS close 1000 "call ended"

Barge-in

If you run your own VAD or push-to-talk, send speech.started while the Digital Human is mid-utterance to interrupt it. There is no separate interrupt message; speech.started is the interrupt.
| You send speech.started while… | Bluejay's behavior |
|---|---|
| Digital Human is mid-utterance | Stops sending audio, interrupts the Digital Human, emits speech.completed for the canceled utterance, and listens |
| Digital Human is idle | Informational only; Bluejay picks up user audio from binary frames regardless |
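Since speech.started is the interrupt, client-side barge-in boils down to sending one text frame before streaming the user's audio. A minimal sketch; the payload shape beyond type and data.utterance_id, and the id scheme, are assumptions based on the sample session:

```python
import json, uuid

def barge_in_message() -> str:
    # speech.started doubles as the interrupt signal; there is no separate
    # interrupt message. Field names follow the sample session above; the
    # "user-…" utterance_id scheme here is made up for illustration.
    return json.dumps({
        "type": "speech.started",
        "data": {"utterance_id": f"user-{uuid.uuid4().hex[:8]}"},
    })

# Usage (inside an async handler): await ws.send(barge_in_message()),
# then start streaming the user's binary pcm_s16le frames.
msg = json.loads(barge_in_message())
assert msg["type"] == "speech.started"
```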