Speech to Text Spec ver.1

Provide a real-time speech to text (STT) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
Should implement a python library for connecting to the STT service.

Prerequisite

  • python 3.10+

Design

In this version, the library supports English voice dictation only. The class constructors and methods below are necessary for our development.

...

An audio stream object should accept multiple frames (samples) written to it as bytes.
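
The write-frames-as-bytes behaviour above can be sketched with a minimal in-memory stream. The class name `PushAudioStream` and its methods are assumptions for illustration, not part of this spec:

```python
import io

# Hypothetical sketch of the audio-stream behaviour described above.
# The name PushAudioStream and its methods are illustrative only.
class PushAudioStream:
    def __init__(self) -> None:
        self._buffer = io.BytesIO()
        self._closed = False

    def write(self, frames: bytes) -> None:
        # Accept any number of frames (samples) per call, as raw bytes.
        if self._closed:
            raise ValueError("stream is closed")
        self._buffer.write(frames)

    def close(self) -> None:
        self._closed = True

    def read_all(self) -> bytes:
        return self._buffer.getvalue()

stream = PushAudioStream()
stream.write(b"\x00\x01" * 160)   # first chunk of 16-bit samples
stream.write(b"\x02\x03" * 160)   # subsequent chunk
stream.close()
```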

Reference

Text to Speech Spec ver.1

Provide a real-time text to speech (TTS) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
Should implement a python library for connecting to the TTS service.

Prerequisite

  • python 3.10+

Design

In this version, the library supports English voice synthesis only. The class constructors and methods below are necessary for our development.

  • A python class VoiceSynthesis similar to SpeechSynthesizer under azure.cognitiveservices.speech.

  • A python class VoiceClientConfig can store development configurations.

  • A python class AudioFormat can specify the audio format.

  • A python class VoiceSynthesisResult can store the result status, audio data, and necessary message.

  • A python class VoiceSynthesisResultStatus is an enum class to specify the result status.

```py
class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat
)
(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult
```

The SSML string (ref1, ref2) passed to the method speak_ssml should contain at least two tags: voice and prosody. For instance:

```xml
<speak>
    <voice language="en-US" name="en-US-BigBigNeural">
        <prosody rate="1.05">
            This is a sample sentence.
        </prosody>
    </voice>
</speak>
```
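
A caller can check this requirement before invoking speak_ssml by parsing the SSML and confirming that both required tags are present. This is only an illustration of the rule; the voice name en-US-BigBigNeural is the placeholder from the spec:

```python
import xml.etree.ElementTree as ET

# The example SSML from the spec, collapsed to one string.
ssml = (
    '<speak>'
    '<voice language="en-US" name="en-US-BigBigNeural">'
    '<prosody rate="1.05">This is a sample sentence.</prosody>'
    '</voice>'
    '</speak>'
)

root = ET.fromstring(ssml)
# Collect the tag names that appear anywhere in the document.
required = {"voice", "prosody"}
present = {elem.tag for elem in root.iter()} & required
```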
```py
class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1"
)
```
```py
# PCM format only
# The defaults are what we need currently.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
```
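
With these defaults (16 kHz, 16-bit, mono PCM), one second of headerless audio occupies samples_per_second × bits_per_sample / 8 × channels bytes. A minimal stand-in dataclass, used here only to illustrate that byte rate:

```python
from dataclasses import dataclass

# Stand-in mirroring the AudioFormat defaults above, for illustration.
@dataclass
class AudioFormat:
    samples_per_second: int = 16_000
    bits_per_sample: int = 16
    channels: int = 1

    @property
    def bytes_per_second(self) -> int:
        # Raw PCM byte rate implied by the format.
        return self.samples_per_second * self.bits_per_sample // 8 * self.channels

fmt = AudioFormat()
```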
```py
class VoiceSynthesisResult
(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str
```

The audio_data should not contain any audio file header.
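
Because audio_data carries no header, a caller that wants a playable file must add one itself. A minimal sketch using the standard-library wave module with the default AudioFormat parameters; the silence buffer here stands in for a real synthesis result:

```python
import io
import wave

# Stand-in for VoiceSynthesisResult.audio_data: raw 16-bit mono PCM
# at 16 kHz with no file header, per the spec above.
audio_data = b"\x00\x00" * 16_000   # one second of silence

buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)         # channels
    wav.setsampwidth(2)         # bits_per_sample // 8
    wav.setframerate(16_000)    # samples_per_second
    wav.writeframes(audio_data)

wav_bytes = buf.getvalue()      # playable WAV: header + PCM payload
```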

```py
class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
```
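
A typical caller would branch on this status before using audio_data. A sketch assuming the classes above; the handle function and its error message are invented for illustration:

```python
from enum import Enum

class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2

# Hypothetical result handling; in real use, status, audio_data and
# message would come from a VoiceSynthesisResult returned by speak_ssml().
def handle(status: VoiceSynthesisResultStatus,
           audio_data: bytes, message: str) -> int:
    if status is VoiceSynthesisResultStatus.SynthesizingCompleted:
        return len(audio_data)   # e.g. hand the PCM to a player
    # Canceled: surface the service's message instead of the audio.
    raise RuntimeError(f"synthesis canceled: {message}")

n = handle(VoiceSynthesisResultStatus.SynthesizingCompleted, b"\x00" * 64, "")
```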

Reference