Speech to Text Spec ver.1
Provide a real-time speech to text (STT) service and its software development kit (SDK).
Either an on-premise or a cloud server is acceptable.
Network communication should be secure.
A Python library should be implemented for connecting to the STT service.
Prerequisite
Python 3.10+
Design
In this version, the library supports English voice dictation only. The class constructors and methods below are necessary for our development.
...
It should be possible to write multiple frames (samples) to an audio stream object as bytes.
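As a rough illustration of that requirement, the sketch below shows a minimal buffering stream that accepts repeated byte writes. The class name PushAudioStream and its write/close methods are assumptions for illustration only, not names defined by this spec.

```python
import io

# Hypothetical sketch of the audio stream object described above.
# The name PushAudioStream and its methods are illustrative assumptions.
class PushAudioStream:
    def __init__(self) -> None:
        self._buffer = io.BytesIO()
        self._closed = False

    def write(self, frames: bytes) -> int:
        """Append one or more frames (samples) as raw bytes."""
        if self._closed:
            raise ValueError("stream is closed")
        return self._buffer.write(frames)

    def close(self) -> None:
        """Signal that no more audio will be written."""
        self._closed = True

stream = PushAudioStream()
stream.write(b"\x00\x01" * 160)   # first chunk of 16-bit samples
stream.write(b"\x00\x02" * 160)   # the stream accepts multiple writes
stream.close()
```

A real SDK would forward these bytes to the recognizer over the network instead of buffering them locally.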
Reference
azure-cognitiveservices-speech ~1.40.0
Text to Speech Spec ver.1
Provide a real-time text to speech (TTS) service and its software development kit (SDK).
Either an on-premise or a cloud server is acceptable.
Network communication should be secure.
A Python library should be implemented for connecting to the TTS service.
Prerequisite
Python 3.10+
Design
In this version, the library supports English voice synthesis only. The class constructors and methods below are necessary for our development.
- A Python class VoiceSynthesis, similar to SpeechSynthesizer under azure.cognitiveservices.speech.
- A Python class VoiceClientConfig that can store development configurations.
- A Python class AudioFormat that can specify the audio format.
- A Python class VoiceSynthesisResult that can store the result status, audio data, and any necessary message.
- A Python class VoiceSynthesisResultStatus, an enum class to specify the result status.
```
class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat,
)

(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult
```
The SSML string (ref1, ref2) passed to the method speak_ssml should contain at least the two tags voice and prosody. For instance:
```xml
<speak>
  <voice language="en-US" name="en-US-BigBigNeural">
    <prosody rate="1.05">
      This is a sample sentence.
    </prosody>
  </voice>
</speak>
```
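Callers will typically assemble the SSML string programmatically. The helper below is only a sketch (build_ssml is not part of this spec); it escapes the spoken text so characters such as < or & cannot break the markup, and reuses the voice and prosody values from the example above.

```python
from xml.sax.saxutils import escape

def build_ssml(text: str,
               voice: str = "en-US-BigBigNeural",
               rate: str = "1.05") -> str:
    # Escape user-provided text before embedding it in the SSML markup.
    return (
        "<speak>"
        f'<voice language="en-US" name="{voice}">'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        "</voice>"
        "</speak>"
    )

ssml = build_ssml("This is a sample sentence.")
```

The escaping step matters whenever the text comes from end users; raw f-string interpolation would produce invalid SSML for inputs containing markup characters.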
```
class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1",
)
```
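Since the spec requires network communication to be secure, a client could validate the configured endpoint's scheme before connecting. The helper below is an illustrative assumption, not part of the spec; note the default endpoint already uses wss (secure WebSocket).

```python
from urllib.parse import urlparse

def is_secure_endpoint(endpoint: str) -> bool:
    # wss:// (secure WebSocket) and https:// are encrypted transports;
    # plain ws:// or http:// would violate the security requirement.
    return urlparse(endpoint).scheme in ("wss", "https")

print(is_secure_endpoint("wss://localhost/voice/synthesis/v1"))  # True
print(is_secure_endpoint("ws://localhost/voice/synthesis/v1"))   # False
```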
```
# PCM format only
# The defaults are what we need currently.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
```
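With the defaults above (16 kHz, 16-bit, mono), the raw PCM data rate works out as follows; this is plain arithmetic from the spec values, useful for sizing buffers:

```python
samples_per_second = 16_000
bits_per_sample = 16
channels = 1

# 16,000 samples/s * 2 bytes/sample * 1 channel
bytes_per_second = samples_per_second * bits_per_sample // 8 * channels
print(bytes_per_second)  # 32000 bytes of raw PCM per second of audio
```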
```
class VoiceSynthesisResult

(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str
```
The audio_data should not contain any audio file header.
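Because audio_data is headerless raw PCM, a consumer that wants a playable file must add a container itself. One way, using Python's standard wave module with the AudioFormat defaults above (the function name pcm_to_wav is illustrative, not part of the spec):

```python
import io
import wave

def pcm_to_wav(audio_data: bytes,
               samples_per_second: int = 16_000,
               bits_per_sample: int = 16,
               channels: int = 1) -> bytes:
    # Wrap headerless PCM bytes in a WAV (RIFF) container.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits_per_sample // 8)
        wav.setframerate(samples_per_second)
        wav.writeframes(audio_data)
    return buf.getvalue()

wav_bytes = pcm_to_wav(b"\x00\x00" * 16_000)  # one second of silence
```

Conversely, a consumer streaming to a live audio device can feed audio_data to the device as-is, which is why the spec omits the header.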
```
class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
```
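A caller would typically branch on this status after speak_ssml returns. The usage sketch below builds a stand-in result object, since the real VoiceSynthesis client is not implemented here; only the enum matches the spec verbatim.

```python
from enum import Enum

class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2

# Stand-in for a VoiceSynthesisResult; a real one would come from speak_ssml().
class FakeResult:
    status = VoiceSynthesisResultStatus.SynthesizingCompleted
    audio_data = b"\x00\x00" * 160
    message = ""

result = FakeResult()
if result.status is VoiceSynthesisResultStatus.SynthesizingCompleted:
    audio = result.audio_data          # raw PCM, ready to play or save
else:
    raise RuntimeError(f"synthesis canceled: {result.message}")
```

On Canceled, the message property is where the service would explain the failure, mirroring how cancellation details are surfaced in the Azure SDK the design references.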