Text-To-Speech Configuration

Several scripts support the production of an aural presentation of the document via speech synthesis, or “text-to-speech” (TTS).

The result is either an audio-only version of the document, or the combination of a text layer and an audio layer synchronized with each other.

The way text-to-speech is configured is common to all scripts. Speech engines are configured with user properties, and the exact aural rendering of the document (TTS voices, pronunciations, speech pitch, speech rates, speech levels, etc.) is controlled with TTS configuration files, CSS style sheets and PLS lexicons.

TTS configuration files may either be specified through the org.daisy.pipeline.tts.config user property, or through an optional script input.

org.daisy.pipeline.tts.config: File to load TTS configuration from; Defaults to the file “tts-default-config.xml” located in the “etc/” directory in the base directory of the Pipeline installation, or “/etc/opt/daisy-pipeline2/tts-default-config.xml” on Debian/Ubuntu.

The TTS configuration file format is as follows:

<config>
  <voice engine="acapela" name="claire" lang="fr" gender="female-adult" priority="12"/>
  ...
</config>

Engine configuration

The audio encoder LAME must be installed. In addition, one of the following text-to-speech processors must be installed.

For Unix users,

Acapela;
eSpeak.

For Windows users,

eSpeak;
SAPI with adequate voices.

For macOS users,

say.

The following online text-to-speech processor can also be used on all platforms, provided the user has an account for the service and an appropriate plan or license:

Google Cloud Text-To-Speech

Installing eSpeak is strongly recommended in any case, as it supports a very wide range of languages.

The audio encoder and the TTS processors are configured with user properties. The following properties are available:

Common settings

org.daisy.pipeline.tts.audio.tmpdir: Temporary directory used during audio synthesis; Defaults to “${java.io.tmpdir}”
org.daisy.pipeline.tts.mp3.bitrate: Bit rate of MP3 files
org.daisy.pipeline.tts.maxmem: Maximum amount of memory in Mb to be used by TTS and audio encoding; Defaults to 50% of the total amount of memory that the JVM will attempt to use, or 500 Mb if there is no such limit
org.daisy.pipeline.tts.threads.number: Number of threads for audio encoding and regular text-to-speech; Defaults to the number of processors available to the JVM
org.daisy.pipeline.tts.threads.encoding.number: Number of audio encoding threads; Defaults to “${org.daisy.pipeline.tts.threads.number}”
org.daisy.pipeline.tts.threads.speaking.number: Number of regular text-to-speech threads; Defaults to “${org.daisy.pipeline.tts.threads.number}”
org.daisy.pipeline.tts.threads.each.memlimit: Maximum amount of memory consumed by each text-to-speech thread (in Mb); Defaults to “20”
org.daisy.pipeline.tts.encoding.speed: Maximum number of seconds of encoded audio per second of encoding; Defaults to “2.0”

Acapela

org.daisy.pipeline.tts.acapela.samplerate: Sample rate (in Hz); Defaults to “22050”
org.daisy.pipeline.tts.acapela.threads.reserved: Number of reserved text-to-speech threads; Defaults to “3”
org.daisy.pipeline.tts.acapela.speed: Defaults to “300”
org.daisy.pipeline.tts.acapela.servers: Defaults to “localhost:0”
org.daisy.pipeline.tts.acapela.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “15”

eSpeak

org.daisy.pipeline.tts.espeak.path: Path to eSpeak executable; If not specified, Pipeline will automatically look for “espeak” in the directories specified by the “PATH” environment variable.
org.daisy.pipeline.tts.espeak.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “2”

macOS

org.daisy.pipeline.tts.osxspeech.path: Alternative path to the macOS command line program “say”; Defaults to “/usr/bin/say”
org.daisy.pipeline.tts.osxspeech.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “2”

SAPI

org.daisy.pipeline.tts.sapi.samplerate: Sample rate (in Hz); Defaults to “22050”; Only applies to legacy SAPI voices.; Cannot be overridden at runtime. The server must be restarted to change this property.
org.daisy.pipeline.tts.sapi.bytespersample: Defaults to “2”; Only applies to legacy SAPI voices.; Cannot be overridden at runtime. The server must be restarted to change this property.
org.daisy.pipeline.tts.sapi.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “7”

Qfrency

org.daisy.pipeline.tts.qfrency.path: Path to “synth” executable; If not specified, will automatically look for “synth” in the directories specified by the environment variable “PATH”.
org.daisy.pipeline.tts.qfrency.address: Address of the Qfrency server; Defaults to “localhost”
org.daisy.pipeline.tts.qfrency.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “2”

Google Cloud Text-to-Speech

org.daisy.pipeline.tts.google.apikey: API key to connect to Google Text-To-Speech service. See Authenticate using API keys.; Mandatory
org.daisy.pipeline.tts.google.samplerate: Sample rate (in Hz); Defaults to “22050”
org.daisy.pipeline.tts.google.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “15”

Microsoft Azure Cognitive Speech Services

org.daisy.pipeline.tts.azure.key: Access key of your Cognitive Speech resource; Mandatory
org.daisy.pipeline.tts.azure.region: Region of your Cognitive Speech resource; Mandatory
org.daisy.pipeline.tts.azure.threads: Number of reserved text-to-speech threads; Defaults to “2”
org.daisy.pipeline.tts.azure.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “15”

CereProc

org.daisy.pipeline.tts.cereproc.server: Host address of CereProc server; Defaults to “localhost”
org.daisy.pipeline.tts.cereproc.port: Port of CereProc server for regular voices; Mandatory
org.daisy.pipeline.tts.cereproc.client: Location of client program for communicating with CereProc server; Defaults to “/usr/bin/cspeechclient”
org.daisy.pipeline.tts.cereproc.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “15”
org.daisy.pipeline.tts.cereproc.dnn.port: Port of CereProc server for DNN voices; Mandatory
org.daisy.pipeline.tts.cereproc.dnn.priority: This engine is chosen over another engine that serves the same voice if this one has a higher priority.; Defaults to “15”

LAME encoder

org.daisy.pipeline.tts.lame.path: Path to LAME executable; If not specified, will automatically look for “lame” in the directories specified by the environment variable “PATH”.
org.daisy.pipeline.tts.lame.cli.options: Additional command line options passed to lame; Deprecated

CSS

The text-to-speech voices and prosody can be configured with aural CSS. To do so, attach CSS style sheets to the source document. Style sheets can be linked (using an ‘xml-stylesheet’ processing instruction or a ‘link’ element), embedded (using a ‘style’ element) and/or inlined (using ‘style’ attributes). Below is an example of an aural CSS style sheet:

p {
  volume: soft;
  voice-family: female;
}
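
The three ways of attaching a style sheet mentioned above might look as follows in an XHTML source document. This is only a minimal sketch: the file name “aural.css” and the document content are hypothetical, and any further attributes that a particular script may expect (a media type, for instance) are not shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Linked with a processing instruction; "aural.css" is a hypothetical file -->
<?xml-stylesheet href="aural.css" type="text/css"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Sample</title>
    <!-- Linked with a 'link' element -->
    <link rel="stylesheet" type="text/css" href="aural.css"/>
    <!-- Embedded with a 'style' element -->
    <style type="text/css">
      h1 { pause-after: 1s; }
    </style>
  </head>
  <body>
    <h1>Chapter 1</h1>
    <!-- Inlined with a 'style' attribute -->
    <p style="volume: soft; voice-family: female">A softly spoken paragraph.</p>
  </body>
</html>
```

In practice a document would normally use only one or two of these mechanisms; they are combined here purely for illustration.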

The CSS properties that are supported by DAISY Pipeline are a subset of Aural CSS 2.1 (and partly inspired by CSS 3 Speech):

voice-family: Used for selecting a voice based on gender, age, name and/or vendor.; See text below.
speak: “none” or “spell-out”; Used for preventing certain text from being rendered aurally, or for spelling text one letter at a time.
volume: A number, “silent”, “x-soft”, “soft”, “medium”, “loud” or “x-loud”; Used for controlling the loudness of the speech.
pitch: “x-low”, “low”, “medium”, “high” or “x-high”; Used for controlling the average pitch of the speech.
speech-rate: A number, “x-slow”, “slow”, “medium”, “fast” or “x-fast”; Used for controlling the rate of the speech in terms of words per minute.
pitch-range: A number; Used for controlling the variation in average pitch of the speech.
speak-numeral: “digits” or “continuous”; Used for speaking out numbers one digit at a time.
pause-before, pause-after and pause: A duration; Used for specifying silences with a certain duration before or after an element.
cue-before, cue-after and cue: A URL; Used for playing pre-recorded sound clips before or after an element.

voice-family specifies voice characteristics that place conditions on the voice selection. A prioritized list of component values, separated by commas, can be specified to indicate that they are alternatives.

If a full voice name is provided, e.g. 'Alice', this voice will be selected regardless of the document language. If this voice is not available, a fallback voice is chosen that matches the characteristics of the requested voice: same language, same engine, same gender. If none is available, Pipeline broadens its search by relaxing the criteria: first the gender, then the engine.

If no voice name is provided, e.g. female 'Acapela' or old female, the selection algorithm considers only the voices that match the current language. It starts by looking for a voice with the specified gender that is supplied by the specified engine, and broadens the search to any gender if that yields no results. If neither the gender nor the engine matches, language is the only criterion.
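
As an illustration of the two cases, the fragment below sets voice-family through inline “style” attributes. It is a sketch only: the voice name 'Alice' and the engine 'Acapela' are reused from the examples in the text, the paragraph content is elided, and the exact value syntax accepted by a given Pipeline version is an assumption.

```xml
<body xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <!-- Full voice name: requested regardless of the document language -->
  <p style="voice-family: 'Alice'">...</p>
  <!-- No voice name: only voices matching the current language (en) are considered -->
  <p style="voice-family: female 'Acapela'">...</p>
  <!-- Comma-separated alternatives, listed in order of preference -->
  <p style="voice-family: 'Alice', old female">...</p>
</body>
```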

When multiple voices match the criteria, the algorithm chooses the voice with the highest priority. Each voice has a default priority, which can be overridden via the “voice” entries of the configuration file, as follows:

<config>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>

Note that this is also a convenient way to add voices that are not natively supported by the Pipeline. In the example above, Todd is now a registered voice and, as such, can be selected automatically by the Pipeline when the document is written in English.

AT&T, eSpeak and Acapela voice names can be found in the corresponding product documentation. For Windows users, SAPI voices are enumerated in the system settings (Start > All Control Panel Items > Speech Recognition > Advanced Speech Options). You will also need to know the value of the “engine” attribute, which must be one of the following:

“att” for AT&T voices;
“espeak” for eSpeak voices;
“acapela” for Acapela voices;
“osx-speech” for Apple voices;
“sapi” for Microsoft voices or for any other voice installed to work with the SAPI engine, including some versions of AT&T and Acapela’s products.

In case of any doubt, engines and voice names can be retrieved from the server’s log in which all the voices are enumerated:

Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna
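
A voice found in the log can then be referenced from the configuration file. The sketch below reuses the engine and name of the 'alice' entry in the log excerpt; the “lang”, “gender” and “priority” values are assumptions chosen for illustration and should be adapted to the actual voice.

```xml
<config>
  <!-- engine and name copied from the log line {engine:'acapela', name:'alice'};
       lang, gender and priority are assumed values -->
  <voice engine="acapela" name="alice" lang="en" gender="female-adult" priority="20"/>
</config>
```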

PLS

PLS lexicons allow you to define custom pronunciations of words. They are meant to help TTS processors deal with ambiguous abbreviations and the pronunciation of proper names. When a word is defined in a lexicon, the processor uses the provided pronunciation in place of the default rendering.
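
For orientation, a small standalone lexicon covering both use cases could look like the sketch below. The entries are illustrative only: the abbreviation and the proper name are arbitrary choices, and the IPA transcription is one common rendering rather than an authoritative one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
         alphabet="ipa" xml:lang="en">
  <!-- Abbreviation expanded through an alias -->
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
  <!-- Proper name with an explicit (illustrative) IPA pronunciation -->
  <lexeme>
    <grapheme>Siobhan</grapheme>
    <phoneme>ʃɪˈvɔːn</phoneme>
  </lexeme>
</lexicon>
```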

Lexicons are configured using the “lexicon” elements in the configuration file. If the “href” attribute is missing, Pipeline reads the lexicon embedded directly in the configuration file, as in this example:

<config>
  <lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
           alphabet="ipa" xml:lang="en">
    <lexeme>
      ...
    </lexeme>
  </lexicon>
</config>
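
Conversely, when the “href” attribute is present, the lexicon is presumably loaded from the referenced file instead. A minimal sketch, assuming a hypothetical lexicon file “lexicons/names.pls”; how relative paths are resolved and whether a namespace declaration is required on the element are not stated here and are left as assumptions:

```xml
<config>
  <!-- "lexicons/names.pls" is a hypothetical path to an external PLS file -->
  <lexicon href="lexicons/names.pls"/>
</config>
```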

The syntax of a PLS lexicon is defined in Pronunciation Lexicon Specification Version 1.0, extended with regular expression matching. To enable regular expression matching, add the “regex” attribute, as follows:

<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
         alphabet="ipa" xml:lang="en">
  <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
  </lexeme>
</lexicon>

The regex feature works only with alias-based substitutions. The regex syntax used is that from XQuery 1.0 and XPath 2.0.

Whether or not the regex attribute is set to “true”, the grapheme matching can be made more accurate by specifying the “positive-lookahead” and “negative-lookahead” attributes:

<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <lexeme>
    <grapheme positive-lookahead="[ ]+is">SB</grapheme>
    <alias>somebody</alias>
  </lexeme>
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
  <lexeme xml:lang="fr">
    <grapheme positive-lookahead="[ ]+[cC]ity">boston</grapheme>
    <phoneme>bɔstøn</phoneme>
  </lexeme>
</lexicon>

Graphemes with “positive-lookahead” will match if the beginning of what follows matches the “positive-lookahead” pattern. Graphemes with “negative-lookahead” will match if the beginning of what follows does not match the “negative-lookahead” pattern. The lookaheads are case-sensitive while the grapheme contents are not.
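
The example above only uses “positive-lookahead”. A comparable sketch for “negative-lookahead” is given below; the abbreviation and its expansions are hypothetical, and the example relies on the lookahead being case-sensitive, as described above.

```xml
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <!-- "Dr" NOT followed by a capitalized word, e.g. "Elm Dr is closed" -->
  <lexeme>
    <grapheme negative-lookahead="[ ]+[A-Z]">Dr</grapheme>
    <alias>drive</alias>
  </lexeme>
  <!-- Fallback for the remaining occurrences, e.g. "Dr Smith" -->
  <lexeme>
    <grapheme>Dr</grapheme>
    <alias>doctor</alias>
  </lexeme>
</lexicon>
```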

The lexemes are reorganized so as to be matched in this order:

Graphemes with regex="false" come first, no matter if there is a lookahead or not;
Graphemes with regex="true" and no lookahead;
Graphemes with regex="true" and one or two lookaheads.

Within these categories, lexemes are matched in the same order as they appear in the lexicons.
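
To make the reordering concrete, the sketch below lists three lexemes in the reverse of their matching order; the comments indicate the category each one falls into according to the rules above. The patterns and aliases are hypothetical.

```xml
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <!-- Third category (matched last): regex="true" with a lookahead -->
  <lexeme regex="true">
    <grapheme positive-lookahead="[ ]+[0-9]">p\.</grapheme>
    <alias>page</alias>
  </lexeme>
  <!-- Second category: regex="true" without lookahead -->
  <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
  </lexeme>
  <!-- First category (matched first): no regex, with or without lookahead -->
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
</lexicon>
```

Because the lexemes are reorganized by category, their position in the file only matters relative to other lexemes of the same category.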