Speech to Text Spec ver.1

Provide a real-time speech to text (STT) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
Should implement a python library for connecting to the STT service.

Prerequisite

  • python 3.10+

Design

In this version, the library supports English voice dictation only. The class constructors and methods below are necessary for our development.

...

An audio stream object should accept multiple frames (samples) written to it as bytes.
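
The write-frames-as-bytes behaviour above can be sketched with a minimal in-memory stream. The class name `PushAudioStream` and its methods are assumptions for illustration, not part of this spec:

```python
import io

# Hypothetical sketch of the audio-stream behaviour described above.
# The name PushAudioStream and its methods are illustrative only.
class PushAudioStream:
    def __init__(self) -> None:
        self._buffer = io.BytesIO()
        self._closed = False

    def write(self, frames: bytes) -> None:
        # Accept any number of frames (samples) per call, as raw bytes.
        if self._closed:
            raise ValueError("stream is closed")
        self._buffer.write(frames)

    def close(self) -> None:
        self._closed = True

    def read_all(self) -> bytes:
        return self._buffer.getvalue()

stream = PushAudioStream()
stream.write(b"\x00\x01" * 160)   # first chunk of 16-bit samples
stream.write(b"\x02\x03" * 160)   # subsequent chunk
stream.close()
```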

Reference

Text to Speech Spec ver.1

Provide a real-time text to speech (TTS) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
Should implement a python library for connecting to the TTS service.

Prerequisite

  • python 3.10+

Design

In this version, the library supports English voice synthesis only. The class constructors and methods below are necessary for our development.

  • A python class VoiceSynthesis similar to SpeechSynthesizer under azure.cognitiveservices.speech.

  • A python class VoiceClientConfig can store development configurations.

  • A python class AudioFormat can specify the audio format.

  • A python class VoiceSynthesisResult can store the result status, audio data, and necessary message.

  • A python class VoiceSynthesisResultStatus is an enum class to specify the result status.

```py
class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat
)
(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult
```

The SSML string (ref1, ref2) passed to the method speak_ssml should contain at least two tags: voice and prosody. For instance:

```xml
<speak>
    <voice language="en-US" name="en-US-BigBigNeural">
        <prosody rate="1.05">
            This is a sample sentence.
        </prosody>
    </voice>
</speak>
```
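
A caller can check this requirement before invoking speak_ssml by parsing the SSML and confirming that both required tags are present. This is only an illustration of the rule; the voice name en-US-BigBigNeural is the placeholder from the spec:

```python
import xml.etree.ElementTree as ET

# The example SSML from the spec, collapsed to one string.
ssml = (
    '<speak>'
    '<voice language="en-US" name="en-US-BigBigNeural">'
    '<prosody rate="1.05">This is a sample sentence.</prosody>'
    '</voice>'
    '</speak>'
)

root = ET.fromstring(ssml)
# Collect the tag names that appear anywhere in the document.
required = {"voice", "prosody"}
present = {elem.tag for elem in root.iter()} & required
```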
```py
class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1"
)
```
```py
# PCM format only
# The defaults are what we need currently.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
```
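
With these defaults (16 kHz, 16-bit, mono PCM), one second of headerless audio occupies samples_per_second × bits_per_sample / 8 × channels bytes. A minimal stand-in dataclass, used here only to illustrate that byte rate:

```python
from dataclasses import dataclass

# Stand-in mirroring the AudioFormat defaults above, for illustration.
@dataclass
class AudioFormat:
    samples_per_second: int = 16_000
    bits_per_sample: int = 16
    channels: int = 1

    @property
    def bytes_per_second(self) -> int:
        # Raw PCM byte rate implied by the format.
        return self.samples_per_second * self.bits_per_sample // 8 * self.channels

fmt = AudioFormat()
```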
```py
class VoiceSynthesisResult
(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str
```

The audio_data should not contain any audio file header.
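
Because audio_data carries no header, a caller that wants a playable file must add one itself. A minimal sketch using the standard-library wave module with the default AudioFormat parameters; the silence buffer here stands in for a real synthesis result:

```python
import io
import wave

# Stand-in for VoiceSynthesisResult.audio_data: raw 16-bit mono PCM
# at 16 kHz with no file header, per the spec above.
audio_data = b"\x00\x00" * 16_000   # one second of silence

buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)         # channels
    wav.setsampwidth(2)         # bits_per_sample // 8
    wav.setframerate(16_000)    # samples_per_second
    wav.writeframes(audio_data)

wav_bytes = buf.getvalue()      # playable WAV: header + PCM payload
```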

```py
class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
```
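
A typical caller would branch on this status before using audio_data. A sketch assuming the classes above; the handle function and its error message are invented for illustration:

```python
from enum import Enum

class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2

# Hypothetical result handling; in real use, status, audio_data and
# message would come from a VoiceSynthesisResult returned by speak_ssml().
def handle(status: VoiceSynthesisResultStatus,
           audio_data: bytes, message: str) -> int:
    if status is VoiceSynthesisResultStatus.SynthesizingCompleted:
        return len(audio_data)   # e.g. hand the PCM to a player
    # Canceled: surface the service's message instead of the audio.
    raise RuntimeError(f"synthesis canceled: {message}")

n = handle(VoiceSynthesisResultStatus.SynthesizingCompleted, b"\x00" * 64, "")
```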

Reference