STT & TTS model Spec

Speech to Text Spec ver.1

Provide a real-time speech-to-text (STT) service and its software development kit (SDK).
The service may run on either an on-premise server or a cloud server.
Network communication should be secure.
A Python library for connecting to the STT service should be implemented.

Prerequisite

  • Python 3.10+

Design

In this version, the library only needs to support English voice dictation. The class constructors and methods below are required for our development.

  • A Python class VoiceDictation, similar to SpeechRecognizer in azure.cognitiveservices.speech.

  • A Python class DictationConfig that stores development configurations.

  • A Python class AudioStream that stores (or sends) the bytes written to it.

class VoiceDictation(
    config: DictationConfig,
    stream: AudioStream,
    autodetect_languages: list[str] = ["en-US"] # reserved
)
(method) def on_recognized(callback: Callable[[str], None]) -> None
(method) def start_recognition() -> None
(method) def stop_recognition() -> None

The on_recognized method registers a callback function; each recognized sentence should be passed to it as a str for further text processing. The callback is triggered once after each sentence is recognized.
VoiceDictation should handle recognition asynchronously; start_recognition and stop_recognition control it.
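
The callback flow described above can be sketched as follows. This is only an illustration under stated assumptions: FakeVoiceDictation is a hypothetical stand-in that exposes the three methods from this spec but emits canned sentences from a worker thread instead of contacting an STT service, and collect_transcript is a hypothetical helper, not part of the required API.

```python
import threading
from typing import Callable, Optional


class FakeVoiceDictation:
    """Hypothetical stand-in for VoiceDictation (not the real library).

    It delivers canned sentences from a worker thread so that the
    on_recognized / start_recognition / stop_recognition flow can be shown.
    """

    def __init__(self, sentences: list[str]) -> None:
        self._sentences = sentences
        self._callback: Optional[Callable[[str], None]] = None
        self._thread: Optional[threading.Thread] = None
        self._stop = threading.Event()

    def on_recognized(self, callback: Callable[[str], None]) -> None:
        # Register the function that receives each recognized sentence.
        self._callback = callback

    def start_recognition(self) -> None:
        # Returns immediately; recognition runs in the background.
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop_recognition(self) -> None:
        # Ask the worker to stop and wait until it has finished.
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def _run(self) -> None:
        for sentence in self._sentences:
            if self._stop.is_set():
                break
            if self._callback is not None:
                self._callback(sentence)


def collect_transcript(sentences: list[str], timeout: float = 2.0) -> list[str]:
    """Run the stand-in and gather every sentence passed to the callback."""
    received: list[str] = []
    all_done = threading.Event()

    def handle_sentence(text: str) -> None:
        received.append(text)
        if len(received) >= len(sentences):
            all_done.set()

    dictation = FakeVoiceDictation(sentences)
    dictation.on_recognized(handle_sentence)
    dictation.start_recognition()
    if sentences:
        all_done.wait(timeout)  # the caller is free to do other work here
    dictation.stop_recognition()
    return received
```

Because the callback runs on a worker thread in this sketch, any state it shares with the caller needs to be thread-safe; whether the real library calls back on a separate thread is not specified here.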

class DictationConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/v1"
)
# PCM format only
# The defaults are the values we currently need.
class AudioStream(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
(method) def write(buffer: bytes) -> None
(method) def close() -> None

It should be possible to write multiple frames (samples) to an audio stream object as a single bytes object.
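
As a rough illustration of writing several frames per call, the sketch below computes how many bytes one chunk of PCM audio occupies with the default format and feeds a buffer to a stream chunk by chunk. The 20 ms chunk duration, the helpers chunk_size_bytes and iter_chunks, and the InMemoryAudioStream stand-in (including its getvalue method) are assumptions made for this example only.

```python
from typing import Iterator


def chunk_size_bytes(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
    chunk_ms: int = 20,  # assumed chunk duration, not defined by this spec
) -> int:
    """Number of bytes occupied by chunk_ms milliseconds of PCM audio."""
    samples = samples_per_second * chunk_ms // 1000
    return samples * (bits_per_sample // 8) * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


class InMemoryAudioStream:
    """Hypothetical stand-in exposing write() and close() as in this spec."""

    def __init__(self) -> None:
        self._buffer = bytearray()
        self._closed = False

    def write(self, buffer: bytes) -> None:
        if self._closed:
            raise ValueError("write to closed stream")
        self._buffer.extend(buffer)

    def close(self) -> None:
        self._closed = True

    def getvalue(self) -> bytes:
        # Demo-only helper so the written bytes can be inspected.
        return bytes(self._buffer)


# One second of silence in the default format, written in 20 ms chunks.
one_second_of_silence = bytes(chunk_size_bytes(chunk_ms=1000))
stream = InMemoryAudioStream()
chunks_written = 0
for chunk in iter_chunks(one_second_of_silence, chunk_size_bytes()):
    stream.write(chunk)
    chunks_written += 1
stream.close()
```

With the defaults, one 20 ms chunk holds 320 samples of 2 bytes each, that is 640 bytes.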

Reference

Text to Speech Spec ver.1

Provide a real-time text-to-speech (TTS) service and its software development kit (SDK).
The service may run on either an on-premise server or a cloud server.
Network communication should be secure.
A Python library for connecting to the TTS service should be implemented.

Prerequisite

  • Python 3.10+

Design

In this version, the library only needs to support English voice synthesis. The class constructors and methods below are required for our development.

  • A Python class VoiceSynthesis, similar to SpeechSynthesizer in azure.cognitiveservices.speech.

  • A Python class VoiceClientConfig that stores development configurations.

  • A Python class AudioFormat that specifies the audio format.

  • A Python class VoiceSynthesisResult that stores the result status, the audio data, and any necessary message.

  • A Python enum class VoiceSynthesisResultStatus that specifies the result status.

class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat
)
(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult

The SSML string (ref1, ref2) passed to speak_ssml should contain at least two tags: voice and prosody. For instance:

<speak>
    <voice language="en-US" name="en-US-BigBigNeural">
        <prosody rate="1.05">
            This is a sample sentence.
        </prosody>
    </voice>
</speak>
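
One possible way to produce such a string is sketched below with the standard library's xml.etree.ElementTree, which escapes special characters in the sentence automatically. The helper name build_ssml and its parameters are hypothetical; the tag and attribute names are copied from the sample above.

```python
import xml.etree.ElementTree as ET


def build_ssml(
    text: str,
    voice_name: str = "en-US-BigBigNeural",  # voice name taken from the sample
    language: str = "en-US",
    rate: str = "1.05",
) -> str:
    """Build a speak/voice/prosody document like the sample above."""
    speak = ET.Element("speak")
    voice = ET.SubElement(speak, "voice", {"language": language, "name": voice_name})
    prosody = ET.SubElement(voice, "prosody", {"rate": rate})
    prosody.text = text  # "&", "<" and ">" are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Tom & Jerry say hello.")
```

The resulting string could then be passed to speak_ssml. Building the document as a tree rather than by string formatting avoids malformed SSML when the sentence contains markup characters.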
class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1"
)
# PCM format only
# The defaults are the values we currently need.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
class VoiceSynthesisResult
(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str

The audio_data should not contain any audio file header (for example, a WAV header).
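
Since audio_data carries no file header, a caller that wants a playable file has to add one. The sketch below wraps header-less PCM bytes in a WAV container using the standard wave module. The function name pcm_to_wav_bytes is hypothetical, and the default parameters assume the AudioFormat defaults listed above.

```python
import io
import wave


def pcm_to_wav_bytes(
    audio_data: bytes,
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
) -> bytes:
    """Wrap header-less PCM bytes in a WAV container and return the result."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(bits_per_sample // 8)  # sample width in bytes
        wav_file.setframerate(samples_per_second)
        wav_file.writeframes(audio_data)
    return buffer.getvalue()


# Example: one second of silence in the default format.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 16_000)
```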

class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
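
A caller would typically branch on the status before using the audio. The following is a minimal sketch under assumptions: the enum is repeated from the definition above so that the example runs on its own, FakeSynthesisResult is a hypothetical stand-in for VoiceSynthesisResult, and raising RuntimeError on cancellation is one possible policy rather than a requirement of this spec.

```python
from dataclasses import dataclass
from enum import Enum


class VoiceSynthesisResultStatus(Enum):
    # Repeated from the spec above so the sketch is self-contained.
    SynthesizingCompleted = 1
    Canceled = 2


@dataclass
class FakeSynthesisResult:
    """Hypothetical stand-in with the three properties of VoiceSynthesisResult."""

    status: VoiceSynthesisResultStatus
    audio_data: bytes = b""
    message: str = ""


def audio_or_raise(result: FakeSynthesisResult) -> bytes:
    """Return the PCM bytes on success, otherwise raise with the message."""
    if result.status is VoiceSynthesisResultStatus.SynthesizingCompleted:
        return result.audio_data
    raise RuntimeError(f"synthesis canceled: {result.message}")
```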

Reference