Client-side Integration Documentation Ver.1
Integration is straightforward across both web browsers and edge devices equipped with a DSP, enabling voice control of intelligent assistants.
Note (for MediaTek internal integration): We provide two integration modes: Audio-Only for devices such as IoT products, and Audio+Image for devices that support image uploads, such as smartphones and AR glasses.
Notes for MediaTek DaVinci Integration
Please be aware of domain and network access restrictions for internal usage.
OA laptops: OA laptops can access the service, which allows testing within the company's network. If you encounter an SSL error, check this certificate.
OA servers: Direct access from OA servers to the service is currently restricted due to network limitations, which prevents connecting to Azure's STT/TTS services for real-time processing. (Previously, batch processing was achievable, but real-time processing is not supported at this time.)
Note: Image content will be provided on 9/30 (Mon).
...
1. WebSocket Connection
i. Obtaining ASSISTANT_ID and API_KEY
Simply follow the instructions in the README.
ii. Connection URL
Audio-Only Mode:
wss://assistant-audio-prod.dvcbot.net/ws?assistant_id=<ASSISTANT_ID>
Audio+Image Mode:
wss://assistant-audio-stag.dvcbot.net/ws?vision=1&assistant_id=<ASSISTANT_ID>
iii. Specifying Subprotocols
Subprotocols: ['proto', 'API_KEY']
Note: The client specifies one or more desired subprotocols by including the Sec-WebSocket-Protocol header in the handshake request.
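As a quick sanity check, the subprotocol list is typically serialized into a single comma-separated Sec-WebSocket-Protocol header during the handshake. A minimal sketch, using a placeholder API key (not a real credential):

```python
# Placeholder values; substitute your real API key from the README.
API_KEY = "your_api_key_here"
subprotocols = ["proto", API_KEY]

# WebSocket clients typically serialize the subprotocol list as one
# comma-separated Sec-WebSocket-Protocol header in the handshake request.
header_value = ", ".join(subprotocols)
print(f"Sec-WebSocket-Protocol: {header_value}")

# With the Python websockets library (used in the sample code in this
# document), the list is passed directly to the connect call (sketch):
#   async with websockets.connect(url, subprotocols=subprotocols) as ws:
#       ...
```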
2. Handling Frames in Packets
The DaVinci Voice Engine (DVE) server-side handles packet serialization and deserialization using protobuf. Therefore, the packets sent and received by the client must also comply with protobuf's encoding and decoding methods. This ensures that the data is correctly structured and understood by both the server and client during communication.
i. Packet Types
The client can send and receive different types of frames, distinguished by protobuf's oneof feature. Each received frame is either a TextFrame or an AudioRawFrame.

Frames for receiving: TextFrame and AudioRawFrame.
Frames for sending:
Audio-Only Mode: The client can only send AudioRawFrame.
Audio+Image Mode: The client can send AudioRawFrame combined with ImageRawFrame.
The structure definitions are as follows:
Frame
TextFrame: Used for transmitting non-speech information, such as commands to interrupt and stop audio playback.

```proto
message TextFrame {
  uint64 id = 1;
  string name = 2;
  string text = 3;
}
```

AudioRawFrame: Contains the actual audio data that is to be played out.

```proto
message AudioRawFrame {
  uint64 id = 1;
  string name = 2;
  bytes audio = 3;
  uint32 sample_rate = 4;
  uint32 num_channels = 5;
}
```
ii. Decode/Encode w/ ProtoBuf
Protocol buffers support generated code in C++, C#, Dart, Go, Java, Kotlin, Objective-C, Python, and Ruby. With proto3, you can also work with PHP.
Please refer to the following links for the tutorials of common programming languages.
If you are unable to use protobuf due to memory constraints or other issues, please refer to the Appendix.
Protobuf in 3 steps
1. Define message formats in a .proto file. Please refer to frames.proto of DaVinci Voice Engine.
2. Compile your .proto file to get the generated source code.
3. Include/import the generated source code and use the protocol buffer API to encode (serialize) and decode (deserialize) messages.
Example
Python
```py
from protobuf import frames_pb2
from google.protobuf.json_format import MessageToJson

# Encode (serialize)
fake_audio_data = b'\x00\x01\x02\x03'
frame = frames_pb2.Frame()
frame.audio.audio = fake_audio_data
frame.audio.sample_rate = 16000
frame.audio.num_channels = 1
serialized_bytestring = frame.SerializeToString()

# Decode (deserialize)
frame = frames_pb2.Frame()
frame.ParseFromString(serialized_bytestring)
json_frame = MessageToJson(frame)

# Check the type of frame.
if frame.HasField('audio'):
    pass
elif frame.HasField('text'):
    pass
```
ImageRawFrame (Audio+Image Mode only): Contains image data that can be sent alongside audio data.

```proto
message ImageRawFrame {
  uint64 id = 1;
  string name = 2;
  bytes image = 3;
  repeated uint32 size = 4;  // Width, Height
  string format = 5;         // e.g., "JPEG", "PNG"
}
```
Detailed Frame Processing and Control Actions
Handle Audio Data
When an AudioRawFrame is received, it contains five fields within the frame, as shown in the example below:
```
Frame {
  audio: {
    id: 1533535,                    // sequence number
    name: "AudioRawFrame#1152093",
    audio: "\x00\x01\x02\x03",      // audio data with WAV header
    sampleRate: 16000,
    numChannels: 1
  }
}
```
Field of audio
AudioRawFrame received from the server: The audio field (i.e., Frame.audio.audio) includes a 44-byte WAV header. To access the raw PCM data, start extracting from the 44th byte onwards.
AudioRawFrame sent to the server: Do not include a WAV header in the audio field.
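The receive-side rule above can be sketched as a small helper; the 44-byte offset assumes the standard PCM WAV header described in this document:

```python
WAV_HEADER_BYTES = 44  # standard PCM WAV header length, per the note above

def extract_pcm(audio_field: bytes) -> bytes:
    """Strip the WAV header from Frame.audio.audio to get raw PCM samples."""
    return audio_field[WAV_HEADER_BYTES:]

# Example: a fake payload with a 44-byte header followed by 4 PCM bytes.
fake_payload = b"\x00" * 44 + b"\x01\x02\x03\x04"
print(extract_pcm(fake_payload))  # b'\x01\x02\x03\x04'
```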
Handle Interruptions
When a TextFrame is received, it consists of three fields within the frame, as illustrated in the example below:
```
Frame {
  text: {
    id: 832297,              // sequence number
    name: "TextFrame#15143",
    text: "__interrupt__"    // interruption signal
  }
}
```
If the text in the Frame is __interrupt__, it indicates that the user has interrupted the assistant's response. In this case, you should clear the client-side audio buffer or stop audio playback immediately.
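One way to act on the interrupt, sketched under the assumption that playback is fed from a client-side queue (rather than PyAudio's internal buffer, which the sample code clears by stopping and restarting the stream):

```python
import queue

# Audio chunks waiting to be played; a playback thread would consume this.
playback_queue: "queue.Queue[bytes]" = queue.Queue()

def handle_text_frame(text: str) -> None:
    """Drop any queued audio when the server signals an interruption."""
    if text == "__interrupt__":
        while not playback_queue.empty():
            playback_queue.get_nowait()

# Example: two chunks are queued, then an interrupt arrives.
playback_queue.put(b"\x00\x01")
playback_queue.put(b"\x02\x03")
handle_text_frame("__interrupt__")
print(playback_queue.empty())  # True
```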
Send Image (Audio+Image Mode Applicable)
To send an image alongside audio data, the image (in PNG or JPG format) must be encoded into bytes and packed into a protobuf message for serialization. See the Python sample code for sending an image.
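The Audio+Image sample code hard-codes the image size; reading the width and height from the PNG's IHDR chunk avoids that. A sketch, with the protobuf packing shown in comments since it depends on the generated frames_pb2 module:

```python
import struct

def png_size(png_bytes: bytes) -> "tuple[int, int]":
    """Read width and height from a PNG's IHDR chunk (bytes 16-23)."""
    if png_bytes[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", png_bytes[16:24])
    return width, height

# Packing into the protobuf message would then look like (sketch):
#   frame = frames_pb2.Frame()
#   frame.image.image = png_bytes
#   frame.image.size.extend(png_size(png_bytes))
#   frame.image.format = "PNG"
#   await websocket.send(frame.SerializeToString())
```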
3. Sample Code
Audio-Only Mode
i. Python (w/o interruption handling)
Install pyaudio and websockets.
Import the generated ProtoBuf source code (frames_pb2.py). Check the source code or the example below.
```py
import asyncio

import pyaudio
import websockets

import frames_pb2
from loguru import logger

# Configure the logger to output to the 'packet.log' file.
logger.add('packet.log')

# Configure the audio stream.
FRAMES_PER_BUFFER = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

# Initialize the PyAudio instance.
p = pyaudio.PyAudio()

# Open the audio stream.
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    output=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

# Configure the assistant.
assistant_id = "your_assistant_id_here"
assistant_api_key = "your_api_key_here"
endpoint = f'wss://assistant-audio-prod.dvcbot.net/ws?assistant_id={assistant_id}'  # Websocket endpoint.


async def send(websocket):
    """Coroutine to send audio data to the server via a websocket."""
    while True:
        try:
            # Read audio data from the stream.
            data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
            frame = frames_pb2.Frame()
            frame.audio.audio = data
            frame.audio.sample_rate = RATE
            frame.audio.num_channels = CHANNELS
            logger.debug(f"Sent frame={frame}")
            # Send the audio data to the server.
            await websocket.send(frame.SerializeToString())
        except websockets.exceptions.ConnectionClosedError as e:
            logger.error(e)
            assert e.code == 4008
            break
        except Exception as e:
            logger.error("Not a websocket 4008 error", e)
            assert False
        await asyncio.sleep(0.01)


async def receive(websocket):
    """Coroutine to receive and process messages from the server via a websocket."""
    async for message in websocket:
        logger.debug(f"Received bytestring={message}")
        # Deserialize the message using Protobuf.
        frame = frames_pb2.Frame()
        frame.ParseFromString(message)
        logger.debug(f"Received frame={frame}")
        # Check the type of frame.
        if frame.HasField('audio'):
            # Play the received audio data.
            audio_data = frame.audio.audio
            logger.debug(f"Received audio_data={audio_data}")
            stream.write(audio_data[44:])  # Ignore the WAV header in the audio data.
        elif frame.HasField('text'):
            # If an interrupt is received, stop and restart the audio stream to clear the buffer.
            if frame.text.text == '__interrupt__':
                logger.debug(f"Received text={frame.text.text}")
                # stream.stop_stream()
                # stream.start_stream()
        # Allow the event loop to handle other tasks.
        await asyncio.sleep(0.01)


async def send_receive():
    """Main coroutine to establish a websocket connection and handle sending/receiving."""
    logger.info(f'Connecting to the websocket at URL: {endpoint}')
    async with websockets.connect(endpoint, subprotocols=['proto', assistant_api_key], ssl=True) as ws:
        await asyncio.sleep(0.1)
        logger.info("Beginning to send messages...")
        # Create tasks for sending and receiving data.
        send_task = asyncio.create_task(send(ws))
        receive_task = asyncio.create_task(receive(ws))
        # Run both tasks concurrently.
        await asyncio.gather(send_task, receive_task)


# Run the main coroutine to start the program.
asyncio.run(send_receive())
```
ii. HTML code w/ embedded JavaScript.
Check this file
Audio+Image Mode
i. Python (w/o interruption handling)
Install pyaudio, websockets, and aiofiles.
import the generated source code by ProtoBuf (frames_pb2.py)
Check Source code or the example below
```py
import asyncio
import threading

import aiofiles
import pyaudio
import websockets

import frames_pb2
from loguru import logger

# Configure the logger to output to the 'packet.log' file.
logger.add('packet.log')

# Configure the audio stream.
FRAMES_PER_BUFFER = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

# Initialize the PyAudio instance.
p = pyaudio.PyAudio()

# Open the audio stream.
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    output=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

# Configure the assistant.
assistant_id = "your_assistant_id_here"
assistant_api_key = "your_api_key_here"
endpoint = f'wss://assistant-audio-stag.dvcbot.net/ws?vision=1&assistant_id={assistant_id}'


async def send_audio(websocket):
    """Coroutine to send audio data to the server via a websocket."""
    while True:
        try:
            data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
            frame = frames_pb2.Frame()
            frame.audio.audio = data
            frame.audio.sample_rate = RATE
            frame.audio.num_channels = CHANNELS
            await websocket.send(frame.SerializeToString())
        except websockets.exceptions.ConnectionClosedError as e:
            logger.error(e)
            break
        except Exception as e:
            logger.error("Error sending audio data", e)
            break
        await asyncio.sleep(0.01)


async def send_image(websocket, image_path):
    """Coroutine to send image data to the server via a websocket."""
    async with aiofiles.open(image_path, 'rb') as img_file:
        image_data = await img_file.read()
        frame = frames_pb2.Frame()
        frame.image.image = image_data
        frame.image.size.extend([640, 480])  # Assuming the image size is 640x480
        frame.image.format = "png"  # or jpg
        logger.info("Sending image...")
        await websocket.send(frame.SerializeToString())
        logger.info("Image sent successfully.")


async def receive(websocket):
    """Coroutine to receive and process messages from the server via a websocket."""
    async for message in websocket:
        frame = frames_pb2.Frame()
        frame.ParseFromString(message)
        if frame.HasField('audio'):
            stream.write(frame.audio.audio[44:])  # Ignore the WAV header in the audio data.
        elif frame.HasField('text'):
            if frame.text.text == '__interrupt__':
                # Handle interrupt if necessary.
                pass


def image_upload_trigger(websocket):
    """Function to wait for user input and trigger image upload."""
    while True:
        image_path = input("Please enter the image path and press Enter to upload (or type 'exit' to quit):\n")
        if image_path.lower() == 'exit':
            break
        if image_path:
            asyncio.run(send_image(websocket, image_path))


async def main():
    """Main function to run the program."""
    async with websockets.connect(endpoint, subprotocols=['proto', assistant_api_key], ssl=True) as ws:
        logger.info('WebSocket connection established.')
        # Start audio send/receive in asyncio tasks.
        send_task = asyncio.create_task(send_audio(ws))
        receive_task = asyncio.create_task(receive(ws))
        # Start a separate thread to wait for user input and trigger image upload.
        upload_thread = threading.Thread(target=image_upload_trigger, args=(ws,))
        upload_thread.start()
        await asyncio.gather(send_task, receive_task)
        upload_thread.join()


# Run the main function to start the program.
asyncio.run(main())
```
ii. HTML code w/ embedded JavaScript.
Check this file
Appendix: Decode/Encode w/o ProtoBuf
Decoding:
Determine the frame type by inspecting the first byte, and the frame length by inspecting the bytes that follow.
(Illustration: byte layout of a whole AudioRawFrame and a whole TextFrame.)
Encoding:
The bytestring sent to the server must begin with the byte that indicates the frame type:
AudioRawFrame: the bytestring starts with b'\x12'
TextFrame: the bytestring starts with b'\n'
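These leading bytes are consistent with standard protobuf wire-format tags for the Frame message's oneof fields (b'\n' = 0x0A = field 1, length-delimited, for TextFrame; b'\x12' = 0x12 = field 2 for AudioRawFrame), each followed by the submessage length as a base-128 varint. A sketch of hand-rolling the encoding without a protobuf library, assuming the field numbers from the message definitions shown earlier:

```python
def encode_varint(value: int) -> bytes:
    """Protobuf base-128 varint: 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            out.append(bits | 0x80)  # more bytes follow
        else:
            out.append(bits)
            return bytes(out)

def encode_audio_frame(audio: bytes, sample_rate: int, num_channels: int) -> bytes:
    """Hand-encode Frame{audio: AudioRawFrame{...}} without protobuf."""
    # AudioRawFrame fields: audio=3 (bytes), sample_rate=4, num_channels=5.
    body = (b'\x1a' + encode_varint(len(audio)) + audio   # tag (3<<3)|2 = 0x1a
            + b'\x20' + encode_varint(sample_rate)        # tag (4<<3)|0 = 0x20
            + b'\x28' + encode_varint(num_channels))      # tag (5<<3)|0 = 0x28
    # Outer Frame: audio submessage is field 2, wire type 2 -> tag 0x12.
    return b'\x12' + encode_varint(len(body)) + body

def frame_type(packet: bytes) -> str:
    """Determine the frame type from the leading tag byte."""
    return {0x0A: 'text', 0x12: 'audio'}[packet[0]]

packet = encode_audio_frame(b'\x00\x01\x02\x03', 16000, 1)
print(frame_type(packet))  # audio
```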
...