Speech to Text Spec ver.1
Provide a real-time speech-to-text (STT) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
A Python library should be implemented for connecting to the STT service.
Prerequisite
Python 3.10+
Design
In this version, the library only needs to support English voice dictation. The class constructors and methods below are necessary for our development.
- A Python class `VoiceDictation`, similar to `SpeechRecognizer` under `azure.cognitiveservices.speech`.
- A Python class `DictationConfig` that can store development configurations.
- A Python class `AudioStream` that can store (or send) bytes to somewhere after a `bytes` object is written to it.
```python
class VoiceDictation(
    config: DictationConfig,
    stream: AudioStream,
    autodetect_languages: list[str] = ["en-US"]  # reserved
)

(method) def on_recognized(callback: Callable[[str], None]) -> None
(method) def start_recognition() -> None
(method) def stop_recognition() -> None
```
The `on_recognized` method should pass the recognized sentence as a `str` to the callback function for further text processing. The callback function is triggered after each sentence is recognized.

`VoiceDictation` should handle recognition asynchronously (use `start_recognition` and `stop_recognition` to control it).
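For illustration, a minimal sketch of the intended calling sequence. This is not part of the spec; the handler name is an assumption, and `DictationConfig` and `AudioStream` (specified below) are constructed with their defaults:

```python
# Illustrative usage sketch (assumption, not normative spec).
def handle_sentence(text: str) -> None:
    # Invoked once per recognized sentence, receiving it as str.
    print("Recognized:", text)

dictation = VoiceDictation(config=DictationConfig(), stream=AudioStream())
dictation.on_recognized(handle_sentence)
dictation.start_recognition()  # returns immediately; recognition runs asynchronously
# ... write audio into the stream while recognition is running ...
dictation.stop_recognition()
```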
```python
class DictationConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/v1"
)
```
```python
# PCM format only.
# The defaults are what we currently need.
class AudioStream(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)

(method) def write(buffer: bytes) -> None
(method) def close() -> None
```
An audio stream object should accept multiple frames (samples) written to it as `bytes`.
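As a sketch, frames can be written in fixed-size chunks; the file name and chunk size below are assumptions, not part of the spec:

```python
# Illustrative sketch: writing multiple PCM frames as bytes, chunk by chunk.
CHUNK_BYTES = 3_200  # 100 ms of 16 kHz, 16-bit, mono PCM

stream = AudioStream()  # defaults: 16 kHz, 16-bit, mono
with open("speech.pcm", "rb") as pcm_file:  # hypothetical headerless PCM capture
    while chunk := pcm_file.read(CHUNK_BYTES):
        stream.write(chunk)  # each write may carry many frames (samples)
stream.close()  # signal end of audio
```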
Reference
azure-cognitiveservices-speech ~1.40.0
Text to Speech Spec ver.1
Provide a real-time text-to-speech (TTS) service and its software development kit (SDK).
Either an on-premise server or a cloud server is acceptable.
Network communication should be secure.
A Python library should be implemented for connecting to the TTS service.
Prerequisite
Python 3.10+
Design
In this version, the library only needs to support English voice synthesis. The class constructors and methods below are necessary for our development.
- A Python class `VoiceSynthesis`, similar to `SpeechSynthesizer` under `azure.cognitiveservices.speech`.
- A Python class `VoiceClientConfig` that can store development configurations.
- A Python class `AudioFormat` that can specify the audio format.
- A Python class `VoiceSynthesisResult` that can store the result status, audio data, and any necessary message.
- A Python class `VoiceSynthesisResultStatus`, an enum class that specifies the result status.
```python
class VoiceSynthesis(
    config: VoiceClientConfig,
    format: AudioFormat
)

(method) def speak_ssml(ssml: str) -> VoiceSynthesisResult
```
The SSML string (ref1, ref2) passed to the method `speak_ssml` should contain at least two tags: `voice` and `prosody`. For instance:
```xml
<speak>
  <voice language="en-US" name="en-US-BigBigNeural">
    <prosody rate="1.05">
      This is a sample sentence.
    </prosody>
  </voice>
</speak>
```
```python
class VoiceClientConfig(
    api_key: str = "",
    endpoint: str = "wss://localhost/voice/synthesis/v1"
)
```
```python
# PCM format only.
# The defaults are what we currently need.
class AudioFormat(
    samples_per_second: int = 16_000,
    bits_per_sample: int = 16,
    channels: int = 1,
)
```
```python
class VoiceSynthesisResult

(property) status: VoiceSynthesisResultStatus
(property) audio_data: bytes
(property) message: str
```
The `audio_data` should not contain any audio file header; it holds raw PCM bytes only.
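Since the bytes are headerless, a consumer that wants a playable file must add a container itself. A minimal sketch using Python's standard `wave` module, assuming the `AudioFormat` defaults above (the function name is illustrative, not part of the spec):

```python
import wave

def save_pcm_as_wav(audio_data: bytes, path: str) -> None:
    """Wrap headerless 16 kHz, 16-bit, mono PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)          # channels
        wav_file.setsampwidth(2)          # bits_per_sample // 8
        wav_file.setframerate(16_000)     # samples_per_second
        wav_file.writeframes(audio_data)  # raw PCM frames
```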
```python
class VoiceSynthesisResultStatus(Enum):
    SynthesizingCompleted = 1
    Canceled = 2
```
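Putting the pieces together, a minimal end-to-end sketch against the classes specified above. This is not normative: the API key placeholder is illustrative, `ssml` is the `<speak>` example above, and `save_pcm_as_wav` is the helper sketched earlier:

```python
# Illustrative end-to-end sketch (assumption, not normative spec).
config = VoiceClientConfig(api_key="YOUR_API_KEY")  # endpoint keeps its default
fmt = AudioFormat()  # defaults: 16 kHz, 16-bit, mono PCM

synthesizer = VoiceSynthesis(config=config, format=fmt)
result = synthesizer.speak_ssml(ssml)  # ssml: the <speak> example above

if result.status is VoiceSynthesisResultStatus.SynthesizingCompleted:
    save_pcm_as_wav(result.audio_data, "output.wav")  # helper sketched above
else:  # VoiceSynthesisResultStatus.Canceled
    print("Synthesis canceled:", result.message)
```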