We provide two integration modes: Audio-Only for devices like IoT products, and Audio+Image for devices that support image uploads, such as smartphones and AR glasses.
WebSocket Connection
Obtaining ASSISTANT_ID and API_KEY
Simply follow the instructions in the README.
Connection URL:
Audio-Only Mode:
wss://assistant-audio-prod.dvcbot.net/ws?assistant_id=<ASSISTANT_ID>
Audio+Image Mode:
wss://assistant-audio-prod.dvcbot.net/ws?vision=1&assistant_id=<ASSISTANT_ID>
Specifying Subprotocols:
['proto', '<API_KEY>']
Note: The client specifies one or more desired subprotocols by including the Sec-WebSocket-Protocol header in the handshake request.
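For concreteness, the URL and subprotocol list can be assembled as below. This is a minimal sketch: `ASSISTANT_ID` and `API_KEY` are placeholders for your own credentials, and the commented-out connect call assumes the Python `websockets` library used in the samples later in this document.

```python
# Minimal sketch of assembling the handshake parameters.
# ASSISTANT_ID and API_KEY are placeholders for your own credentials.
ASSISTANT_ID = "your_assistant_id_here"
API_KEY = "your_api_key_here"

url = f"wss://assistant-audio-prod.dvcbot.net/ws?vision=1&assistant_id={ASSISTANT_ID}"
subprotocols = ["proto", API_KEY]

# The subprotocol list becomes the handshake's Sec-WebSocket-Protocol
# header, e.g. "proto, your_api_key_here".
header_value = ", ".join(subprotocols)

# With the `websockets` library, the connection would be opened as:
# async with websockets.connect(url, subprotocols=subprotocols) as ws: ...
```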
Handling Frames in Packets
The DaVinci Voice Engine (DVE) server handles packet serialization and deserialization using protobuf. The packets sent and received by the client must therefore also follow protobuf's encoding and decoding rules, so that the data is correctly structured and understood by both server and client during communication.
Packet Types
The client can send and receive different types of frames, distinguished by protobuf's oneof feature.
The frames the client can receive are TextFrame and AudioRawFrame.
The frames the client can send depend on the mode:
Audio-Only Mode: the client can only send AudioRawFrame.
Audio+Image Mode: the client can send AudioRawFrame combined with ImageRawFrame.
The structure definitions are as follows:
Frame
TextFrame: Used for transmitting non-speech information, such as commands to interrupt and stop audio playback.

message TextFrame {
  uint64 id = 1;
  string name = 2;
  string text = 3;
}

AudioRawFrame: Contains the actual audio data that is to be played out.

message AudioRawFrame {
  uint64 id = 1;
  string name = 2;
  bytes audio = 3;
  uint32 sample_rate = 4;
  uint32 num_channels = 5;
}

ImageRawFrame (Audio+Image Mode Applicable): Contains image data that can be sent alongside audio data.

message ImageRawFrame {
  uint64 id = 1;
  string name = 2;
  bytes image = 3;
  repeated uint32 size = 4;  // Width, Height
  string format = 5;         // e.g., "JPEG", "PNG"
}
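The sample code later in this document accesses these messages through a top-level Frame wrapper (frame.audio, frame.text, frame.image). Its authoritative definition lives in frames.proto; based on the oneof feature mentioned above and the leading bytes listed in the Appendix (b'\n' = field 1, b'\x12' = field 2), a plausible sketch is:

```protobuf
// Hypothetical sketch of the top-level wrapper; the authoritative
// definition is in frames.proto. The text/audio field numbers are
// inferred from the Appendix; the image field number is an assumption.
message Frame {
  oneof frame {
    TextFrame text = 1;
    AudioRawFrame audio = 2;
    ImageRawFrame image = 3;
  }
}
```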
Decode/Encode w/ ProtoBuf
Protocol buffers support generated code in C++, C#, Dart, Go, Java, Kotlin, Objective-C, Python, and Ruby. With proto3, you can also work with PHP.
Please refer to the official protobuf tutorials for common programming languages.
If you are unable to use protobuf due to memory constraints or other issues, please refer to the Appendix.
Protobuf in 3 steps
1. Define message formats in a .proto file. Please refer to frames.proto of DaVinci Voice Engine.
2. Compile your .proto file to get the generated source code.
3. Include/import the generated source code and use the protocol buffer API to encode (serialize) and decode (deserialize) messages.
Example
Python
from protobuf import frames_pb2
from google.protobuf.json_format import MessageToJson

# Encode (serialize)
fake_audio_data = b'\x00\x01\x02\x03'
frame = frames_pb2.Frame()
frame.audio.audio = fake_audio_data
frame.audio.sample_rate = 16000
frame.audio.num_channels = 1
serialized_bytestring = frame.SerializeToString()

# Decode (deserialize)
frame = frames_pb2.Frame()
frame.ParseFromString(serialized_bytestring)
json_frame = MessageToJson(frame)

# Check the type of frame.
if frame.HasField('audio'):
    pass
elif frame.HasField('text'):
    pass
Detailed Frame Processing and Control Actions
Handle Audio Data
When an AudioRawFrame is received, it contains five fields within the frame, as shown in the example below:
Frame: {
  audio: {
    id: 1533535,                  // sequence number
    name: "AudioRawFrame#1152093",
    audio: "\x00\x01\x02\x03",    // Audio data with WAV headers
    sampleRate: 16000,
    numChannels: 1
  }
}
The audio field
AudioRawFrame received from the server: the audio field (i.e., Frame.audio.audio) includes WAV headers. To access the raw PCM data, start extracting from the 44th byte onwards.
AudioRawFrame being sent to the server: do not include the WAV header in the audio field.
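The 44-byte offset corresponds to a standard PCM WAV header (RIFF + fmt + data chunk headers). A quick way to sanity-check the slicing, using only the Python standard library:

```python
import io
import struct
import wave

# Build a small WAV in memory (16 kHz, mono, 16-bit) — the same format
# the engine uses — then show that the raw PCM starts at byte 44.
pcm = struct.pack('<4h', 0, 1, 2, 3)  # four 16-bit samples

buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(pcm)

wav_bytes = buf.getvalue()
raw_pcm = wav_bytes[44:]  # skip the 44-byte header, as described above
assert raw_pcm == pcm
assert wav_bytes[:4] == b'RIFF'
```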
Handle Interruptions
When a TextFrame is received, it consists of three fields within the frame, as illustrated in the example below:
Frame: {
  text: {
    id: 832297,              // sequence number
    name: "TextFrame#15143",
    text: "__interrupt__"    // Interruption signal
  }
}
If the text in the Frame is __interrupt__, it indicates that the user has interrupted the assistant's response. In this case, you should clear the client-side audio buffer or stop audio playback immediately.
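One way to act on the signal is to keep a client-side playback queue and drain it when __interrupt__ arrives. A minimal sketch — the queue and handler names are illustrative, not part of the API:

```python
import queue

# Client-side playback buffer: chunks waiting to go to the audio device.
playback_buffer: "queue.Queue[bytes]" = queue.Queue()

def handle_text_frame(text: str) -> None:
    """Drop all pending audio when the server signals an interruption."""
    if text == '__interrupt__':
        while not playback_buffer.empty():
            playback_buffer.get_nowait()

# Usage: buffer two chunks, then simulate an interruption.
playback_buffer.put(b'\x00\x01')
playback_buffer.put(b'\x02\x03')
handle_text_frame('__interrupt__')
assert playback_buffer.empty()
```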
Send Image (Audio+Image Mode Applicable)
To send an image alongside audio data, the image (in PNG or JPG format) must be read into bytes and then packed into a protobuf message for serialization. See the Python sample code for sending an image.
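The ImageRawFrame size field expects [Width, Height]. For PNG input these can be read straight from the IHDR chunk, without an imaging library. A small sketch — the helper name is ours, not part of the API:

```python
import struct

def png_dimensions(data: bytes) -> tuple[int, int]:
    """Read (width, height) from a PNG byte string via its IHDR chunk."""
    assert data[:8] == b'\x89PNG\r\n\x1a\n', "not a PNG"
    # IHDR is always the first chunk; its payload starts at byte 16 with
    # width and height as big-endian unsigned 32-bit integers.
    return struct.unpack('>II', data[16:24])

# Demo on a hand-built PNG signature + IHDR chunk (640x480).
sig = b'\x89PNG\r\n\x1a\n'
ihdr = (struct.pack('>I', 13) + b'IHDR'
        + struct.pack('>II', 640, 480)   # width, height
        + b'\x08\x02\x00\x00\x00'        # bit depth, colour type, etc.
        + b'\x00\x00\x00\x00')           # CRC placeholder
width, height = png_dimensions(sig + ihdr)
```

The resulting values would populate frame.image.size ([width, height]) and frame.image.format ("png") before serialization.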
Sample Code
Audio-Only Mode
Python (w/o interruption handling)
install pyaudio, websockets
import the generated source code by ProtoBuf
Check the source code or the example below
import asyncio

import pyaudio
import websockets
import frames_pb2
from loguru import logger

# Configure the logger to output to the 'packet.log' file.
logger.add('packet.log')

# Configure the audio stream.
FRAMES_PER_BUFFER = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

# Initialize the PyAudio instance.
p = pyaudio.PyAudio()

# Open the audio stream.
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    output=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

# Configure the assistant.
assistant_id = "your_assistant_id_here"
assistant_api_key = "your_api_key_here"
endpoint = f'wss://assistant-audio-prod.dvcbot.net/ws?assistant_id={assistant_id}'  # Websocket endpoint.


async def send(websocket):
    """Coroutine to send audio data to the server via a websocket."""
    while True:
        try:
            # Read audio data from the stream.
            data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
            frame = frames_pb2.Frame()
            frame.audio.audio = data
            frame.audio.sample_rate = RATE
            frame.audio.num_channels = CHANNELS
            logger.debug(f"Sent frame={frame}")
            logger.debug(f"Sent bytestring={frame.SerializeToString()}")
            # Send the audio data to the server.
            await websocket.send(frame.SerializeToString())
        except websockets.exceptions.ConnectionClosedError as e:
            logger.error(e)
            assert e.code == 4008
            break
        except Exception as e:
            logger.error("Not a websocket 4008 error: {}", e)
            raise
        await asyncio.sleep(0.01)


async def receive(websocket):
    """Coroutine to receive and process messages from the server via a websocket."""
    async for message in websocket:
        logger.debug(f"Received bytestring={message}")
        # Deserialize the message using Protobuf.
        frame = frames_pb2.Frame()
        frame_length = frame.ParseFromString(message)
        logger.debug(f"Received frame={frame}")
        logger.debug(f"Received frame length={frame_length}")
        # Check the type of frame.
        if frame.HasField('audio'):
            # Play the received audio data.
            audio_data = frame.audio.audio
            logger.debug(f"Received audio_data={audio_data}")
            logger.debug(f"Type of audio data={type(audio_data)}")
            stream.write(audio_data[44:])  # Ignore the WAV header in the audio data.
        elif frame.HasField('text'):
            # If an interrupt is received, stop and restart the audio stream to clear the buffer.
            if frame.text.text == '__interrupt__':
                logger.debug(f"Received text={frame.text.text}")
                logger.debug(f"Type of text data={type(frame.text.text)}")
                # stream.stop_stream()
                # stream.start_stream()
        # Allow the event loop to handle other tasks.
        await asyncio.sleep(0.01)


async def send_receive():
    """Main coroutine to establish a websocket connection and handle sending/receiving."""
    logger.info(f'Connecting to the websocket at URL: {endpoint}')
    async with websockets.connect(endpoint, subprotocols=['proto', assistant_api_key], ssl=True) as ws:
        await asyncio.sleep(0.1)
        logger.info("Beginning to send messages...")
        # Create tasks for sending and receiving data.
        send_task = asyncio.create_task(send(ws))
        receive_task = asyncio.create_task(receive(ws))
        # Run both tasks concurrently.
        await asyncio.gather(send_task, receive_task)


# Run the main coroutine to start the program.
asyncio.run(send_receive())
HTML code w/ embedded JavaScript.
Check this file
Audio+Image Mode
Python (w/o interruption handling)
install pyaudio, websockets, aiofiles
import the generated source code by ProtoBuf (frames_pb2.py)
Check the source code or the example below
import asyncio
import threading

import aiofiles
import pyaudio
import websockets
import frames_pb2
from loguru import logger

# Configure the logger to output to the 'packet.log' file.
logger.add('packet.log')

# Configure the audio stream.
FRAMES_PER_BUFFER = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

# Initialize the PyAudio instance.
p = pyaudio.PyAudio()

# Open the audio stream.
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    output=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

# Configure the assistant.
assistant_id = "your_assistant_id_here"
assistant_api_key = "your_api_key_here"
endpoint = f'wss://assistant-audio-stag.dvcbot.net/ws?vision=1&assistant_id={assistant_id}'


async def send_audio(websocket):
    """Coroutine to send audio data to the server via a websocket."""
    while True:
        try:
            data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
            frame = frames_pb2.Frame()
            frame.audio.audio = data
            frame.audio.sample_rate = RATE
            frame.audio.num_channels = CHANNELS
            await websocket.send(frame.SerializeToString())
        except websockets.exceptions.ConnectionClosedError as e:
            logger.error(e)
            break
        except Exception as e:
            logger.error("Error sending audio data: {}", e)
            break
        await asyncio.sleep(0.01)


async def send_image(websocket, image_path):
    """Coroutine to send image data to the server via a websocket."""
    async with aiofiles.open(image_path, 'rb') as img_file:
        image_data = await img_file.read()
    frame = frames_pb2.Frame()
    frame.image.image = image_data
    frame.image.size.extend([640, 480])  # Assuming the image size is 640x480.
    frame.image.format = "png"  # or "jpg"
    logger.info("Sending image...")
    await websocket.send(frame.SerializeToString())
    logger.info("Image sent successfully.")


async def receive(websocket):
    """Coroutine to receive and process messages from the server via a websocket."""
    async for message in websocket:
        frame = frames_pb2.Frame()
        frame.ParseFromString(message)
        if frame.HasField('audio'):
            stream.write(frame.audio.audio[44:])  # Ignore the WAV header in the audio data.
        elif frame.HasField('text'):
            if frame.text.text == '__interrupt__':
                # Handle interrupt if necessary.
                pass


def image_upload_trigger(websocket, loop):
    """Function to wait for user input and trigger image upload."""
    while True:
        image_path = input("Please enter the image path and press Enter to upload (or type 'exit' to quit):\n")
        if image_path.lower() == 'exit':
            break
        if image_path:
            # Schedule the coroutine on the websocket's event loop from this thread.
            asyncio.run_coroutine_threadsafe(send_image(websocket, image_path), loop)


async def main():
    """Main function to run the program."""
    async with websockets.connect(endpoint, subprotocols=['proto', assistant_api_key], ssl=True) as ws:
        logger.info('WebSocket connection established.')
        loop = asyncio.get_running_loop()
        # Start audio send/receive in asyncio tasks.
        send_task = asyncio.create_task(send_audio(ws))
        receive_task = asyncio.create_task(receive(ws))
        # Start a separate thread to wait for user input and upload images.
        upload_thread = threading.Thread(target=image_upload_trigger, args=(ws, loop), daemon=True)
        upload_thread.start()
        await asyncio.gather(send_task, receive_task)


# Run the main function to start the program.
asyncio.run(main())
HTML code w/ embedded JavaScript.
Check this file
Appendix: Decode/Encode w/o ProtoBuf
Decoding:
Determine the frame type by inspecting the first byte, and the frame length by inspecting the following bytes.
(Illustration of the AudioRawFrame and TextFrame wire layouts omitted.)
Encoding:
The bytestring sent to the server must begin with the byte(s) indicating the frame type:
AudioRawFrame: the bytestring starts with b'\x12'
TextFrame: the bytestring starts with b'\n'
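As a concrete illustration of the byte layout above, here is a hand-rolled encoder for an audio Frame in plain Python, with no protobuf dependency. Field numbers come from the message definitions earlier in this document; the leading b'\x12' is the tag of the audio submessage (field 2, wire type 2), as stated above. This is a sketch for constrained clients, not the authoritative implementation:

```python
def encode_varint(value: int) -> bytes:
    """Protobuf base-128 varint encoding."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_audio_frame(audio: bytes, sample_rate: int, num_channels: int) -> bytes:
    """Encode Frame{audio: AudioRawFrame{audio, sample_rate, num_channels}}."""
    # AudioRawFrame fields: audio=3 (bytes), sample_rate=4, num_channels=5 (varints).
    body = (
        b'\x1a' + encode_varint(len(audio)) + audio  # field 3, wire type 2
        + b'\x20' + encode_varint(sample_rate)       # field 4, wire type 0
        + b'\x28' + encode_varint(num_channels)      # field 5, wire type 0
    )
    # Wrap in Frame: the audio submessage is field 2, wire type 2 -> tag 0x12.
    return b'\x12' + encode_varint(len(body)) + body

def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a varint at pos; return (value, new_pos)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

# Round-trip check: the encoded frame starts with b'\x12' as described above,
# followed by a varint holding the submessage length.
frame = encode_audio_frame(b'\x00\x01\x02\x03', 16000, 1)
assert frame[0:1] == b'\x12'
length, pos = decode_varint(frame, 1)
assert length == len(frame) - pos
```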