DaVinci Voice Engine (with Image)

Open network access for app service ports (*.prod.ingress and *.stag.ingress) during deployment.

WebSocket Connection

Obtaining ASSISTANT_ID and API_KEY

Follow the instructions in the README.

Connection URL:

Audio+Image Mode:

URL: wss://assistant-audio-prod.dvcbot.net/ws?vision=1&assistant_id=<ASSISTANT_ID>

Specifying Subprotocols:

  • Subprotocols: ['proto', 'API_KEY']

  • Note. The client specifies one or more desired subprotocols by including the Sec-WebSocket-Protocol header in the handshake request.
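The connection settings above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the third-party websockets package (listed under Sample Code); build_url and connect are hypothetical helper names, and the API key is passed as the second subprotocol, as in the list above.

```python
from urllib.parse import urlencode

# Audio+Image endpoint taken from the "Connection URL" section above.
BASE_URL = "wss://assistant-audio-prod.dvcbot.net/ws"


def build_url(assistant_id: str) -> str:
    """Build the Audio+Image Mode URL (vision=1) for the given assistant."""
    query = urlencode({"vision": 1, "assistant_id": assistant_id})
    return f"{BASE_URL}?{query}"


async def connect(assistant_id: str, api_key: str) -> None:
    """Open the WebSocket and print the size of the first message received."""
    # Imported here so build_url() can be used without the dependency.
    import websockets  # third-party: pip install websockets

    async with websockets.connect(
        build_url(assistant_id),
        # Sent in the Sec-WebSocket-Protocol header of the handshake request.
        subprotocols=["proto", api_key],
    ) as ws:
        message = await ws.recv()  # protobuf-encoded bytes from the server
        print(f"received {len(message)} bytes")
```

To try it, call asyncio.run(connect(ASSISTANT_ID, API_KEY)) with the values obtained from the README.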

Handling Frames in Packets

The DaVinci Voice Engine (DVE) server serializes and deserializes packets with protobuf, so every packet the client sends or receives must also be encoded and decoded with protobuf. This ensures that both sides agree on the structure of the data during communication.

Packet Types

The client can send and receive different types of frames, distinguished by protobuf's oneof feature.

  • The frames for receiving are TextFrame and AudioRawFrame.

  • The frames for sending

    • Audio-only Mode

      The client can only send AudioRawFrame.

    • Audio+Image Mode

      The client can send AudioRawFrame combined with ImageRawFrame.

The structure definitions are as follows:

  • Frame

    • TextFrame: Used for transmitting non-speech information, such as commands to interrupt and stop audio playback.

      message TextFrame { uint64 id = 1; string name = 2; string text = 3; }
    • AudioRawFrame: Contains the actual audio data that is to be played out.

      message AudioRawFrame { uint64 id = 1; string name = 2; bytes audio = 3; uint32 sample_rate = 4; uint32 num_channels = 5; }
    • ImageRawFrame (Audio+Image Mode only): Contains image data that can be sent alongside audio data.

      message ImageRawFrame {
        uint64 id = 1;
        string name = 2;
        bytes image = 3;
        repeated uint32 size = 4;  // Width, Height
        string format = 5;         // e.g., "JPEG", "PNG"
      }

Decode/Encode w/ ProtoBuf

Protocol buffers support generated code in C++, C#, Dart, Go, Java, Kotlin, Objective-C, Python, and Ruby. With proto3, you can also work with PHP.

  • Please refer to the following links for the tutorials of common programming languages.

    • C++

    • C#

    • Go

    • Java

    • Kotlin

    • Python

    • C (nanopb). Note: Google's Protocol Buffers does not natively support C. However, nanopb is a popular alternative that provides a lightweight implementation of Protocol Buffers for C.

  • If you are unable to use protobuf due to memory constraints or other issues, please refer to the Appendix.

Protobuf in 3 steps

  1. Define message formats in a .proto file.

    • Please refer to the .proto file of DaVinci Voice Engine.

  2. Compile your .proto file to get the generated source code.

    • pre-compiled generated source code (ready-to-use)

  3. Include/import the generated source code with protocol buffer API to encode (serialize) and decode (deserialize) messages.

    • Example

      • Python

Detailed Frame Processing and Control Actions

Handle Audio Data

When an AudioRawFrame is received, it contains five fields: id, name, audio, sample_rate, and num_channels.

  • The audio field

    • AudioRawFrame received from the server: The audio field (i.e., Frame.audio.audio) includes a WAV header. To access the raw PCM data, skip the 44-byte header and read from byte offset 44 onward.

    • AudioRawFrame sent to the server: Do not include a WAV header in the audio field.

      Wave_format.png
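The header handling described above might look like the following in Python. This is a sketch under the assumption that the server uses the canonical 44-byte PCM WAV header; strip_wav_header and make_wav are hypothetical helper names, and make_wav exists only to build a test input with the standard-library wave module.

```python
import io
import wave

WAV_HEADER_SIZE = 44  # canonical RIFF/WAVE header for PCM audio


def strip_wav_header(audio: bytes) -> bytes:
    """Return raw PCM from the audio field of a received AudioRawFrame."""
    # Only strip when the RIFF/WAVE magic is present; otherwise assume raw PCM.
    if audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        return audio[WAV_HEADER_SIZE:]
    return audio


def make_wav(pcm: bytes, sample_rate: int = 16000, num_channels: int = 1) -> bytes:
    """Wrap 16-bit PCM in a WAV container (demonstration input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(num_channels)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


pcm = b"\x01\x00\x02\x00"            # two 16-bit samples
received = make_wav(pcm)             # what a received audio field would hold
playable = strip_wav_header(received)
```

Audio sent to the server goes the other way: pass the raw PCM bytes directly, without wrapping them as make_wav does.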

       

Handle Interruptions

When a TextFrame is received, it consists of three fields: id, name, and text.

If the text in the Frame is __interrupt__, it indicates that the user has interrupted the assistant's response. In this case, you should clear the client-side audio buffer or stop audio playback immediately.
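One way to implement this on the client is to keep received audio in a queue and empty it when the interrupt marker arrives. The sketch below is illustrative only: PlaybackBuffer and handle_text are hypothetical names, and the text argument stands for the text field of a decoded TextFrame.

```python
from collections import deque
from typing import Optional

INTERRUPT_TEXT = "__interrupt__"  # marker described above


class PlaybackBuffer:
    """Queue of PCM chunks waiting to be played by the audio output."""

    def __init__(self) -> None:
        self._chunks: deque = deque()

    def push(self, pcm: bytes) -> None:
        self._chunks.append(pcm)

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> None:
        self._chunks.clear()

    def __len__(self) -> int:
        return len(self._chunks)


def handle_text(text: str, buffer: PlaybackBuffer) -> bool:
    """Drop all pending audio on an interrupt; return True if one was handled."""
    if text == INTERRUPT_TEXT:
        buffer.clear()
        return True
    return False


buffer = PlaybackBuffer()
buffer.push(b"\x00\x01")
buffer.push(b"\x02\x03")
ignored = handle_text("hello", buffer)            # ordinary text, queue untouched
pending_before = len(buffer)
interrupted = handle_text("__interrupt__", buffer)  # queue is emptied
```

If the audio device is mid-chunk when the interrupt arrives, the playback stream itself also needs to be stopped; that part depends on the audio library in use.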

Send Image (Audio+Image Mode Applicable)

To send an image alongside audio data, the image (in PNG or JPG format) must be encoded into bytes and then packed into a protobuf message for serialization. See the Python sample code for sending an image.
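As a rough illustration of the first half of that process, the sketch below prepares the image, size, and format values of an ImageRawFrame from PNG bytes. It is an assumption-laden example: png_size and image_frame_fields are hypothetical helpers, only PNG is handled (the width and height are read from the IHDR chunk), and the final packing step is left to the generated protobuf code.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk of a PNG image."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR payload starts at offset 16: two big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def image_frame_fields(data: bytes) -> dict:
    """Collect the values for the image, size, and format fields."""
    width, height = png_size(data)
    return {"image": data, "size": [width, height], "format": "PNG"}


# Minimal PNG prefix (signature + IHDR length/type + dimensions) for the demo.
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
fields = image_frame_fields(sample)
```

With the generated module imported, these values can be passed as keyword arguments, for example frames_pb2.ImageRawFrame(**fields). How the ImageRawFrame is placed inside the outer Frame is not shown in this document, so check the .proto file for that field name.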

Sample Code

Audio-Only Mode

Python (w/o interruption handling)

  • install pyaudio, websockets

  • import the generated source code by ProtoBuf ( )

HTML code w/ embedded JavaScript.

Audio+Image Mode

Python (w/o interruption handling)

  • install pyaudio, websockets, aiofiles

  • import the generated source code by ProtoBuf (frames_pb2.py)

  • Check Source code or the example below

HTML code w/ embedded JavaScript.

Appendix: Decode/Encode w/o ProtoBuf

  • Decoding:

    • Determine the frame type by inspecting the first byte, and the frame length by inspecting the following bytes. (Refer to the illustration below)

      • AudioRawFrame

        • The whole frame

          pb_series_audiorawframe.png
      • TextFrame

        • The whole frame

  • Encoding:

    • The encoded bytestring must begin with the bytes that identify the frame type to the server.

      • AudioRawFrame

        • The bytestring starts with b'\x12'

      • TextFrame

        • The bytestring starts with b'\n'
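The decoding and encoding rules above can be sketched in pure Python, without the protobuf library. This is a minimal sketch, not a full protobuf implementation. The inner field numbers come from the message definitions in this document, the outer tags are the b'\n' and b'\x12' bytes listed above, and it assumes the server accepts frames that omit id and name (proto3 treats missing fields as defaults). All function names are hypothetical.

```python
from typing import Dict, Tuple, Union

FRAME_TEXT_TAG = 0x0A   # b'\n'   : outer field 1, length-delimited
FRAME_AUDIO_TAG = 0x12  # b'\x12' : outer field 2, length-delimited


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at buf[pos:]; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def _length_delimited(tag: int, payload: bytes) -> bytes:
    return bytes([tag]) + encode_varint(len(payload)) + payload


def encode_audio_frame(pcm: bytes, sample_rate: int, num_channels: int) -> bytes:
    """Serialize an AudioRawFrame wrapped in the outer Frame."""
    inner = (
        _length_delimited(0x1A, pcm)               # audio = 3 (bytes)
        + b"\x20" + encode_varint(sample_rate)     # sample_rate = 4 (varint)
        + b"\x28" + encode_varint(num_channels)    # num_channels = 5 (varint)
    )
    return _length_delimited(FRAME_AUDIO_TAG, inner)


def decode_frame(data: bytes) -> Tuple[str, Dict[int, Union[int, bytes]]]:
    """Return (frame kind, {field number: value}) for a received packet."""
    kind = {FRAME_TEXT_TAG: "text", FRAME_AUDIO_TAG: "audio"}.get(data[0])
    if kind is None:
        raise ValueError(f"unknown frame tag: {data[0]:#04x}")
    length, pos = decode_varint(data, 1)  # frame length follows the tag byte
    end = pos + length
    fields: Dict[int, Union[int, bytes]] = {}
    while pos < end:
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint: id, sample_rate, num_channels
            fields[number], pos = decode_varint(data, pos)
        elif wire_type == 2:  # length-delimited: name, text, audio
            size, pos = decode_varint(data, pos)
            fields[number] = data[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"unsupported wire type: {wire_type}")
    return kind, fields


packet = encode_audio_frame(b"\x01\x02", 16000, 1)
kind, fields = decode_frame(packet)
```

In the returned dictionary, field 3 holds the audio bytes for an AudioRawFrame or the UTF-8 text for a TextFrame, so an interrupt can be detected with fields.get(3) == b"__interrupt__" when kind is "text".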