The following scripts support the production of an aural presentation of the document via speech synthesis, or “text-to-speech” (TTS):
The result is either an audio-only version of the document, or the combination of a text layer and an audio layer synchronized with each other.
The way text-to-speech is configured is common to all scripts. Speech engines are configured with user properties, and the exact aural rendering of the document (TTS voices, pronunciations, speech pitch, speech rates, speech levels, etc.) is controlled with TTS configuration files, CSS style sheets and PLS lexicons.
TTS configuration files may be specified either “statically”, through the org.daisy.pipeline.tts.config user property, or “dynamically”, through an optional script input.
The TTS configuration file format is as follows:
<config>
  <voice engine="acapela" name="claire" lang="fr" gender="female-adult" priority="12"/>
  ...
</config>
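As a quick sanity check before handing a configuration file to the Pipeline, the voice entries it declares can be listed with a few lines of Python. This is only a sketch: the function name `list_voices` is hypothetical, and it assumes that the elements are in no namespace and carry the attributes shown in the example above.

```python
import xml.etree.ElementTree as ET

def list_voices(config_xml):
    """Return the <voice> entries of a TTS configuration as dictionaries."""
    # Assumption: <config> and <voice> are not namespaced, as in the example.
    root = ET.fromstring(config_xml.strip())
    voices = []
    for element in root.iter("voice"):
        entry = dict(element.attrib)
        # The priority is stored as text in the XML; convert it to a number.
        if "priority" in entry:
            entry["priority"] = int(entry["priority"])
        voices.append(entry)
    return voices

sample = """
<config>
  <voice engine="acapela" name="claire" lang="fr" gender="female-adult" priority="12"/>
</config>
"""
voices = list_voices(sample)
```

Here `voices` holds one dictionary per declared voice, which makes it easy to spot a misspelled engine or a missing priority.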
The audio encoder LAME must be installed. In addition, one of the following text-to-speech processors must be installed.
For Unix users,
For Windows users,
For MacOS users,
The following online text-to-speech processor can also be used by all users, provided they have an account with the service and an appropriate plan or license:
It is strongly recommended to install eSpeak in any case, as it supports a very wide range of languages.
The audio encoder and the TTS processors are configured with user properties. The following properties are available:
org.daisy.pipeline.tts.audio.tmpdir
org.daisy.pipeline.tts.mp3.bitrate
org.daisy.pipeline.tts.maxmem
org.daisy.pipeline.tts.threads.number
org.daisy.pipeline.tts.threads.encoding.number
org.daisy.pipeline.tts.threads.speaking.number
org.daisy.pipeline.tts.threads.each.memlimit
org.daisy.pipeline.tts.encoding.speed
org.daisy.pipeline.tts.acapela.samplerate
org.daisy.pipeline.tts.acapela.threads.reserved
org.daisy.pipeline.tts.acapela.speed
org.daisy.pipeline.tts.acapela.servers
org.daisy.pipeline.tts.acapela.priority
org.daisy.pipeline.tts.espeak.path
org.daisy.pipeline.tts.espeak.priority
org.daisy.pipeline.tts.osxspeech.path
org.daisy.pipeline.tts.osxspeech.priority
org.daisy.pipeline.tts.sapi.samplerate
org.daisy.pipeline.tts.sapi.bytespersample
org.daisy.pipeline.tts.sapi.priority
org.daisy.pipeline.tts.qfrency.path
org.daisy.pipeline.tts.qfrency.address
org.daisy.pipeline.tts.qfrency.priority
org.daisy.pipeline.tts.google.apikey
org.daisy.pipeline.tts.google.samplerate
org.daisy.pipeline.tts.google.priority
org.daisy.pipeline.tts.azure.key
org.daisy.pipeline.tts.azure.region
org.daisy.pipeline.tts.azure.threads
org.daisy.pipeline.tts.azure.priority
org.daisy.pipeline.tts.cereproc.server
org.daisy.pipeline.tts.cereproc.port
org.daisy.pipeline.tts.cereproc.client
org.daisy.pipeline.tts.cereproc.priority
org.daisy.pipeline.tts.cereproc.dnn.port
org.daisy.pipeline.tts.cereproc.dnn.priority
org.daisy.pipeline.tts.lame.path
org.daisy.pipeline.tts.lame.cli.options
The text-to-speech voices and prosody can be configured with aural CSS. To do so, attach CSS style sheets to the source document. Style sheets can be linked (using an ‘xml-stylesheet’ processing instruction or a ‘link’ element), embedded (using a ‘style’ element) and/or inlined (using ‘style’ attributes). Below is an example of an aural CSS style sheet:
p {
  volume: soft;
  voice-family: female;
}
The CSS properties that are supported by DAISY Pipeline are a subset of Aural CSS 2.1 (and partly inspired by CSS 3 Speech):
voice-family
speak
volume
pitch
speech-rate
pitch-range
speak-numeral
pause-before, pause-after and pause
cue-before, cue-after and cue
voice-family is a comma-separated list of voice characteristics that place conditions on the voice selection. It is inspired by (but not the same as) the specification of the voice-family property in CSS 3.
If a full voice name is provided, e.g. “acapela, alice”, this voice will be selected regardless of the document language. If this voice is not available, a fallback voice is chosen that matches the characteristics of the requested voice: same language, same engine, same gender. If none is available, the Pipeline broadens its search by relaxing the criteria: first the gender, then the engine.
If no voice name is provided, e.g. “acapela”, “female” or “female, old”, the selection algorithm considers only the voices that match the current language. It starts by looking for a voice with the specified gender that is supplied by the specified engine, and broadens the search to any gender if that yields no results. If no voice matches the engine either, language is the only criterion.
When multiple voices match the criteria, the algorithm chooses the voice with the highest priority. Each voice has a default priority, which can be overridden via the “voice” entries of the configuration file, as follows:
<config>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>
Note that this is also a convenient way to add voices that are not natively supported by the Pipeline. In the example above, Todd is now a registered voice and, as such, can be selected automatically by the Pipeline when the document is written in English.
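The fallback order described above for the case where no voice name is given can be sketched in Python as follows. This is an illustration of the documented behaviour, not the Pipeline's actual implementation: the `Voice` class, the `select_voice` function and the sample voices (including the “en-demo” voice and its priority) are hypothetical, and the full-voice-name case is not covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Voice:
    engine: str
    name: str
    lang: str
    gender: str
    priority: int

def select_voice(voices, lang, engine=None, gender=None):
    """Pick a voice for `lang`, relaxing first the gender, then the engine."""
    # Only voices that match the current language are ever considered.
    same_lang = [v for v in voices if v.lang == lang]

    def engine_ok(v):
        return engine is None or v.engine == engine

    def gender_ok(v):
        return gender is None or v.gender == gender

    attempts = [
        lambda v: engine_ok(v) and gender_ok(v),  # engine and gender
        engine_ok,                                # any gender
        lambda v: True,                           # language only
    ]
    for accepts in attempts:
        candidates = [v for v in same_lang if accepts(v)]
        if candidates:
            # Among the matching voices, the highest priority wins.
            return max(candidates, key=lambda v: v.priority)
    return None

# Hypothetical voice registry, for illustration only.
registry = [
    Voice("sapi", "Microsoft Todd", "en", "male-adult", 100),
    Voice("espeak", "en-demo", "en", "male-adult", 5),
    Voice("acapela", "claire", "fr", "female-adult", 12),
]
choice = select_voice(registry, "en", engine="espeak", gender="female-adult")
```

In this example no English eSpeak voice is female, so the gender is relaxed and `choice` is the “en-demo” voice. Asking for an English Acapela voice would fall through to the language-only step and return Todd, the English voice with the highest priority.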
Voice names for AT&T, eSpeak and Acapela can be found in their respective documentation. For Windows users, SAPI voices are listed in the system settings (Start > All Control Panel Items > Speech Recognition > Advanced Speech Options). You will also need to know the value of the “engine” attribute, which must be one of the following:
If in doubt, engines and voice names can be retrieved from the server’s log, in which all the voices are enumerated:
Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna
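When the log is long, the engine and voice names can be pulled out of it with a small script. The sketch below assumes that log entries look exactly like the excerpt above and that voice names contain no apostrophes; the function name `voices_in_log` is hypothetical.

```python
import re

# Matches entries such as {engine:'sapi', name:'NTMNTTS Voice (Male)'}
VOICE_ENTRY = re.compile(r"\{engine:'([^']*)', name:'([^']*)'\}")

def voices_in_log(log_text):
    """Return the (engine, name) pairs found in a log excerpt."""
    return VOICE_ENTRY.findall(log_text)

log_excerpt = """Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna"""
pairs = voices_in_log(log_excerpt)
```

Each pair gives the values to use for the “engine” and “name” attributes of a “voice” entry.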
PLS lexicons allow you to define custom pronunciations of words. They are meant to help TTS processors deal with ambiguous abbreviations and the pronunciation of proper names. When a word is defined in a lexicon, the processor uses the provided pronunciation in place of the default rendering.
Lexicons are configured using the “lexicon” elements in the configuration file. If the “href” attribute is missing, the pipeline will read the lexicons inside the config nodes, as in this example:
<config>
  <lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
           alphabet="ipa" xml:lang="en">
    <lexeme>
      ...
    </lexeme>
  </lexicon>
</config>
The syntax of a PLS lexicon is defined in Pronunciation Lexicon Specification Version 1.0, extended with regular expression matching. To enable regular expression matching, add the “regex” attribute, as follows:
<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
         alphabet="ipa" xml:lang="en">
  <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
  </lexeme>
</lexicon>
The regex feature works only with alias-based substitutions. The regex syntax used is that from XQuery 1.0 and XPath 2.0.
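To see what such a regex lexeme does to a piece of text, the substitution can be tried out in Python. This is only an approximation: Python's `re` module stands in for the XPath 2.0 regex dialect (the two differ in some details), and the function name `apply_regex_alias` is hypothetical.

```python
import re

def apply_regex_alias(text, grapheme, alias):
    """Replace each match of `grapheme` in `text` with the expanded `alias`."""
    pattern = re.compile(grapheme)

    def expand(match):
        # Expand $1, $2, ... in the alias with the captured groups.
        return re.sub(
            r"\$(\d+)",
            lambda ref: match.group(int(ref.group(1))) or "",
            alias,
        )

    return pattern.sub(expand, text)

spoken = apply_regex_alias(
    "See pages 10-12.",
    "([0-9]+)-([0-9]+)",
    "between $1 and $2",
)
```

With the lexeme from the example above, `spoken` becomes “See pages between 10 and 12.”.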
Whether or not the regex attribute is set to “true”, the grapheme matching can be made more accurate by specifying the “positive-lookahead” and “negative-lookahead” attributes:
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <lexeme>
    <grapheme positive-lookahead="[ ]+is">SB</grapheme>
    <alias>somebody</alias>
  </lexeme>
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
  <lexeme xml:lang="fr">
    <grapheme positive-lookahead="[ ]+[cC]ity">boston</grapheme>
    <phoneme>bɔstøn</phoneme>
  </lexeme>
</lexicon>
Graphemes with “positive-lookahead” will match if the beginning of what follows matches the “positive-lookahead” pattern. Graphemes with “negative-lookahead” will match if the beginning of what follows does not match the “negative-lookahead” pattern. The lookaheads are case-sensitive while the grapheme contents are not.
The lexemes are reorganized so as to be matched in this order:
Within these categories, lexemes are matched in the same order as they appear in the lexicons.
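The lookahead rules can be tried out with the following Python sketch, which handles alias-based lexemes only. It makes several assumptions that are not taken from the Pipeline itself: lexemes are simply tried in list order (the reorganization into categories is not modelled), graphemes are matched as whole words, and Python's `re` syntax stands in for the Pipeline's regex dialect. The names `Lexeme`, `to_pattern` and `apply_lexemes` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Lexeme(NamedTuple):
    grapheme: str
    alias: str
    positive_lookahead: Optional[str] = None
    negative_lookahead: Optional[str] = None

def to_pattern(lexeme):
    """Build a regex: case-insensitive grapheme, case-sensitive lookaheads."""
    # Assumption: graphemes match whole words only (\b on both sides).
    pattern = r"\b(?i:" + re.escape(lexeme.grapheme) + r")\b"
    if lexeme.positive_lookahead is not None:
        pattern += "(?=" + lexeme.positive_lookahead + ")"
    if lexeme.negative_lookahead is not None:
        pattern += "(?!" + lexeme.negative_lookahead + ")"
    return re.compile(pattern)

def apply_lexemes(text, lexemes):
    """At each position, apply the first lexeme (in list order) that matches."""
    compiled = [(to_pattern(lexeme), lexeme.alias) for lexeme in lexemes]
    out, pos = [], 0
    while pos < len(text):
        for pattern, alias in compiled:
            match = pattern.match(text, pos)
            if match and match.end() > pos:
                out.append(alias)
                pos = match.end()
                break
        else:
            # No lexeme matched here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)

lexemes = [
    Lexeme("SB", "somebody", positive_lookahead="[ ]+is"),
    Lexeme("SB", "should be"),
]
first = apply_lexemes("sb is late", lexemes)
second = apply_lexemes("it SB done", lexemes)
```

Here `first` is “somebody is late”, because the lookahead finds “ is” after the grapheme, while `second` is “it should be done”, because the first lexeme's lookahead fails and the second lexeme applies.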