Speech to Text Spec ver.1

Provide a real-time speech to text (STT) service and its software development kit (SDK).
Either on-premise server or cloud server is acceptable.
Network communication should be secure.
A Python library for connecting to the STT service should be implemented.

Prerequisite

Python 3.10+

Design

In this version, the library only needs to support English voice dictation. The class constructors and methods below are required for our development.

A Python class VoiceDictation, similar to SpeechRecognizer under azure.cognitiveservices.speech.
A Python class DictationConfig that stores development configurations.
A Python class AudioStream that stores (or sends) the bytes written to it.

class VoiceDictation(
    config: DictationConfig,
    stream: AudioStream,
    autodetect_languages: list[str] = ["en-US"] # reserved
)
(method) def on_recognized(callback: Callable[[str], None]) -> None
(method) def start_recognition() -> None
(method) def stop_recognition() -> None

The on_recognized method should pass the recognized sentence as str to the callback function for further text processing. The callback function is triggered after every sentence is recognized.
VoiceDictation should handle recognition asynchronously (use start_recognition and stop_recognition to control).
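
The callback and start/stop behavior described above can be sketched with a local stand-in. VoiceDictationStub is a hypothetical name, and its canned sentence list replaces the real service; only the three method names come from this spec. It illustrates that start_recognition returns immediately while the callback fires once per sentence on a background thread.

```python
import threading
from typing import Callable, Optional


class VoiceDictationStub:
    """Hypothetical stand-in using the method names from this spec; no network I/O."""

    def __init__(self, sentences: list) -> None:
        self._sentences = list(sentences)  # canned "recognition results" (assumption)
        self._callback: Optional[Callable[[str], None]] = None
        self._stop = threading.Event()
        self._worker: Optional[threading.Thread] = None

    def on_recognized(self, callback: Callable[[str], None]) -> None:
        # Remember the function to call for every recognized sentence.
        self._callback = callback

    def start_recognition(self) -> None:
        # Returns immediately; the work happens on a background thread.
        self._stop.clear()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def stop_recognition(self) -> None:
        # Ask the worker to stop, then wait until it has finished.
        self._stop.set()
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def _run(self) -> None:
        for sentence in self._sentences:
            if self._stop.is_set():
                break
            if self._callback is not None:
                self._callback(sentence)  # one call per recognized sentence
```

Because the callback runs on the worker thread in this sketch, callers that touch shared state from it would need their own synchronization.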

class DictationConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/v1"
)
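
One way to tie the configuration to the "network communication should be secure" requirement is to validate the endpoint scheme when the config is created. The sketch below is an assumption, not part of the spec: rejecting every scheme other than wss is a policy chosen for illustration, and DictationConfigSketch is a hypothetical name.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DictationConfigSketch:
    """Hypothetical config holder with the two fields named in this spec."""

    api_key: str = ""
    endpoint: str = "wss://localhost/voice/v1"

    def __post_init__(self) -> None:
        # Assumption: only TLS-protected WebSocket endpoints are accepted.
        scheme = urlsplit(self.endpoint).scheme
        if scheme != "wss":
            raise ValueError(f"endpoint must use wss://, got {scheme!r}")
```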

# PCM format only.
# The defaults are what we currently need.
class AudioStream(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
(method) def write(buffer: bytes) -> None
(method) def close() -> None

A single write call should accept a bytes object containing multiple frames (samples), and write may be called repeatedly until close is called.
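
To make the frame arithmetic concrete, here is a minimal in-memory sketch. With the default format, one frame is 2 bytes (16 bits, 1 channel), so one second of audio is 32,000 bytes. AudioStreamSketch, the frame-alignment check, the None sentinel, and the drain helper are all assumptions made for illustration; the spec only requires write and close.

```python
import queue


class AudioStreamSketch:
    """Hypothetical in-memory AudioStream: queues PCM chunks for a consumer."""

    def __init__(self, samples_per_second: int = 16_000,
                 bits_per_sample: int = 16, channels: int = 1) -> None:
        self.samples_per_second = samples_per_second
        # Bytes in one frame (one sample for every channel).
        self.frame_size = (bits_per_sample // 8) * channels
        self._chunks = queue.Queue()
        self._closed = False

    def write(self, buffer: bytes) -> None:
        if self._closed:
            raise ValueError("write() called after close()")
        # Assumption: partial frames are rejected rather than buffered.
        if len(buffer) % self.frame_size != 0:
            raise ValueError("buffer must hold a whole number of frames")
        self._chunks.put(buffer)

    def close(self) -> None:
        self._closed = True
        self._chunks.put(None)  # sentinel: no more audio follows

    def drain(self) -> bytes:
        """Demo helper: collect queued chunks up to the close sentinel."""
        parts = []
        while True:
            try:
                chunk = self._chunks.get_nowait()
            except queue.Empty:
                break
            if chunk is None:
                break
            parts.append(chunk)
        return b"".join(parts)
```

A 20 ms chunk at the default format is 320 frames, or 640 bytes, which is a common size to pass to write in one call.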

Reference

azure-cognitiveservices-speech ~1.40.0

Text to Speech Spec ver.1

Provide a real-time text to speech (TTS) service and its software development kit (SDK).
Either on-premise server or cloud server is acceptable.
Network communication should be secure.
A Python library for connecting to the TTS service should be implemented.

Prerequisite

Python 3.10+

Design

In this version, the library only needs to support English voice synthesis. The class constructors and methods below are required for our development.

A Python class VoiceSynthesis, similar to SpeechSynthesizer under azure.cognitiveservices.speech.
A Python class VoiceClientConfig that stores development configurations.
A Python class AudioFormat that specifies the audio format.
A Python class VoiceSynthesisResult that stores the result status, the audio data, and any necessary message.
A Python enum class VoiceSynthesisResultStatus that specifies the result status.

class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat
)
(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult

The SSML string (ref1, ref2) passed to speak_ssml should contain at least two tags: voice and prosody. For instance:

<speak>
    <voice language="en-US" name="en-US-BigBigNeural">
        <prosody rate="1.05">
            This is a sample sentence.
        </prosody>
    </voice>
</speak>
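
A caller could assemble such a string with the standard library instead of string concatenation, which also takes care of escaping. The sketch below mirrors the sample above; build_ssml and has_required_tags are hypothetical helper names, not part of the required API.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, voice_name: str = "en-US-BigBigNeural",
               language: str = "en-US", rate: str = "1.05") -> str:
    """Build a speak/voice/prosody document shaped like the sample above."""
    speak = ET.Element("speak")
    voice = ET.SubElement(speak, "voice", {"language": language, "name": voice_name})
    prosody = ET.SubElement(voice, "prosody", {"rate": rate})
    prosody.text = text  # special characters such as & are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


def has_required_tags(ssml: str) -> bool:
    """Check that a voice tag containing a prosody tag is present under speak."""
    root = ET.fromstring(ssml)
    voice = root.find("voice")
    return root.tag == "speak" and voice is not None and voice.find("prosody") is not None
```

A check like has_required_tags could run on the client before sending, so malformed input fails early rather than after a network round trip.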

class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1"
)

# PCM format only.
# The defaults are what we currently need.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)

class VoiceSynthesisResult
(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str

The audio_data should contain raw PCM samples only, without any audio file header (for example, no WAV/RIFF header).
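
Since audio_data carries no header, the consumer has to supply the format itself. As a hedged illustration, the helpers below compute the duration of a headerless buffer and wrap it in a WAV container for playback or debugging, using Python's standard wave module. The function names are hypothetical and the parameters default to the AudioFormat defaults above.

```python
import io
import wave


def pcm_duration_seconds(audio_data: bytes, samples_per_second: int = 16_000,
                         bits_per_sample: int = 16, channels: int = 1) -> float:
    """Duration of headerless PCM, derived purely from its length and format."""
    frame_size = (bits_per_sample // 8) * channels
    return len(audio_data) / frame_size / samples_per_second


def pcm_to_wav_bytes(audio_data: bytes, samples_per_second: int = 16_000,
                     bits_per_sample: int = 16, channels: int = 1) -> bytes:
    """Prepend a WAV header so ordinary audio tools can open the data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits_per_sample // 8)
        wav.setframerate(samples_per_second)
        wav.writeframes(audio_data)
    return buf.getvalue()
```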

class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
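
A minimal sketch of how a caller might branch on the status follows. The enum values are taken from this spec; modeling VoiceSynthesisResult as a frozen dataclass, the audio_or_raise helper, and the choice to raise RuntimeError on Canceled are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2


@dataclass(frozen=True)
class VoiceSynthesisResult:
    """Assumed shape: the three properties listed in this spec."""

    status: VoiceSynthesisResultStatus
    audio_data: bytes = b""
    message: str = ""


def audio_or_raise(result: VoiceSynthesisResult) -> bytes:
    """Return the PCM bytes on success; surface the message otherwise."""
    if result.status is VoiceSynthesisResultStatus.SynthesizingCompleted:
        return result.audio_data
    raise RuntimeError(f"synthesis canceled: {result.message}")
```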

Reference

Azure AI Speech voices
