Text-To-Speech Configuration

The following scripts support the production of an aural presentation of the document via speech synthesis, or “text-to-speech” (TTS):

The result is either an audio-only version of the document, or the combination of a text layer and an audio layer synchronized with each other.

The way text-to-speech is configured is common to all scripts. Speech engines are configured with user properties, and the exact aural rendering of the document (TTS voices, pronunciations, speech pitch, speech rates, speech levels, etc.) is controlled with TTS configuration files, CSS style sheets and PLS lexicons.

TTS configuration files may either be specified “statically”, through the org.daisy.pipeline.tts.config user property, or “dynamically” through an optional script input.

org.daisy.pipeline.tts.config
File to load TTS configurations from at start-up
Defaults to the file “tts-default-config.xml” located in the “etc/” directory in the base directory of the Pipeline installation, or “/etc/opt/daisy-pipeline2/tts-default-config.xml” on Debian/Ubuntu.

The TTS configuration file format is as follows:

<config>
  <voice engine="acapela" name="claire" lang="fr" gender="female-adult" priority="12"/>
  ...
</config>
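
As a rough illustration of this format, the sketch below reads the voice entries of such a file with Python's standard xml.etree.ElementTree module. The function name read_voices and the sample data are illustrative only and are not part of the Pipeline; the sketch assumes the attributes shown in the sample above.

```python
import xml.etree.ElementTree as ET

# Sample configuration; both entries are illustrative data.
CONFIG = """\
<config>
  <voice engine="acapela" name="claire" lang="fr" gender="female-adult" priority="12"/>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>
"""

def read_voices(config_xml):
    """Return one dict per <voice> entry, converting the priority to int when present."""
    root = ET.fromstring(config_xml)
    voices = []
    for element in root.iter("voice"):
        entry = dict(element.attrib)
        if "priority" in entry:
            entry["priority"] = int(entry["priority"])
        voices.append(entry)
    return voices

for voice in read_voices(CONFIG):
    print(voice["engine"], voice["name"], voice["lang"], voice["priority"])
```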

Engine configuration

The audio encoder LAME must be installed. In addition, one of the following text-to-speech processors must be installed.

For Unix users,

For Windows users,

For MacOS users,

The following online text-to-speech processor can also be used by all users, provided they have an account for the service and an appropriate plan or license:

It is strongly recommended to install eSpeak anyway, as it can handle almost any language out there.

The audio encoder and the TTS processors are configured with user properties. The following properties are available:

Common settings

org.daisy.pipeline.tts.audio.tmpdir
Temporary directory used during audio synthesis
Defaults to “${java.io.tmpdir}”
org.daisy.pipeline.tts.mp3.bitrate
Bit rate of MP3 files
org.daisy.pipeline.tts.maxmem
Maximum amount of memory in MB to be used by TTS and audio encoding
Defaults to 50% of the total amount of memory that the JVM will attempt to use, or 500 MB if there is no such limit
org.daisy.pipeline.tts.threads.number
Number of threads for audio encoding and regular text-to-speech
Defaults to the number of processors available to the JVM
org.daisy.pipeline.tts.threads.encoding.number
Number of audio encoding threads
Defaults to “${org.daisy.pipeline.tts.threads.number}”
org.daisy.pipeline.tts.threads.speaking.number
Number of regular text-to-speech threads
Defaults to “${org.daisy.pipeline.tts.threads.number}”
org.daisy.pipeline.tts.threads.each.memlimit
Maximum amount of memory consumed by each text-to-speech thread (in MB)
Defaults to “20”
org.daisy.pipeline.tts.encoding.speed
Maximum number of seconds of encoded audio per second of encoding
Defaults to “2.0”
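
The way some of these defaults depend on each other can be sketched as follows. This is a minimal illustration of the defaulting rules stated above, not the Pipeline's actual code: the function names are hypothetical, and the number of processors and the JVM memory limit are passed in as plain arguments.

```python
def resolve_thread_counts(props, available_processors):
    """Apply the documented defaults for the three thread-count properties."""
    total = int(props.get("org.daisy.pipeline.tts.threads.number",
                          available_processors))
    # Both specific settings fall back to the general thread count.
    encoding = int(props.get("org.daisy.pipeline.tts.threads.encoding.number", total))
    speaking = int(props.get("org.daisy.pipeline.tts.threads.speaking.number", total))
    return {"encoding": encoding, "speaking": speaking}

def default_maxmem(jvm_max_memory=None):
    """Half of the JVM memory limit, or 500 when there is no such limit."""
    if jvm_max_memory is None:
        return 500
    return jvm_max_memory // 2

# Only the encoding thread count is set explicitly here.
props = {"org.daisy.pipeline.tts.threads.encoding.number": "2"}
print(resolve_thread_counts(props, available_processors=8))
print(default_maxmem(4096))
```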

Acapela

org.daisy.pipeline.tts.acapela.samplerate
Sample rate (in Hz)
Defaults to “22050”
org.daisy.pipeline.tts.acapela.threads.reserved
Number of reserved text-to-speech threads
Defaults to “3”
org.daisy.pipeline.tts.acapela.speed
Defaults to “300”
org.daisy.pipeline.tts.acapela.servers
Defaults to “localhost:0”
org.daisy.pipeline.tts.acapela.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “15”

eSpeak

org.daisy.pipeline.tts.espeak.path
Path to eSpeak executable
If not specified, Pipeline will automatically look for “espeak” in the directories specified by the “PATH” environment variable.
org.daisy.pipeline.tts.espeak.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “2”

Mac OS

org.daisy.pipeline.tts.osxspeech.path
Alternative path to OSX’s command line program “say”
Defaults to “/usr/bin/say”
org.daisy.pipeline.tts.osxspeech.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “2”

SAPI

org.daisy.pipeline.tts.sapi.samplerate
Sample rate (in Hz)
Defaults to “22050”
Only applies to legacy SAPI voices.
Cannot be overridden at runtime. The server must be restarted to change this property.
org.daisy.pipeline.tts.sapi.bytespersample
Defaults to “2”
Only applies to legacy SAPI voices.
Cannot be overridden at runtime. The server must be restarted to change this property.
org.daisy.pipeline.tts.sapi.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “7”

Qfrency

org.daisy.pipeline.tts.qfrency.path
Path to “synth” executable
If not specified, will automatically look for “synth” in the directories specified by the environment variable “PATH”.
org.daisy.pipeline.tts.qfrency.address
Address of the Qfrency server
Defaults to “localhost”
org.daisy.pipeline.tts.qfrency.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “2”

Google Cloud Text-to-speech

org.daisy.pipeline.tts.google.apikey
API key to connect to Google Text-To-Speech service. See Authenticate using API keys.
Mandatory
org.daisy.pipeline.tts.google.samplerate
Sample rate (in Hz)
Defaults to “22050”
org.daisy.pipeline.tts.google.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “15”

Microsoft Azure Cognitive Speech Services

org.daisy.pipeline.tts.azure.key
Access key of your Cognitive Speech resource
Mandatory
org.daisy.pipeline.tts.azure.region
Region of your Cognitive Speech resource
Mandatory
org.daisy.pipeline.tts.azure.threads
Number of reserved text-to-speech threads
Defaults to “2”
org.daisy.pipeline.tts.azure.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “15”

CereProc

org.daisy.pipeline.tts.cereproc.server
Host address of CereProc server
Defaults to “localhost”
org.daisy.pipeline.tts.cereproc.port
Port of CereProc server for regular voices
Mandatory
org.daisy.pipeline.tts.cereproc.client
Location of client program for communicating with CereProc server
Defaults to “/usr/bin/cspeechclient”
org.daisy.pipeline.tts.cereproc.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “15”
org.daisy.pipeline.tts.cereproc.dnn.port
Port of CereProc server for DNN voices
Mandatory
org.daisy.pipeline.tts.cereproc.dnn.priority
This engine is chosen over another engine that serves the same voice if this one has a higher priority.
Defaults to “15”

LAME encoder

org.daisy.pipeline.tts.lame.path
Path to LAME executable
If not specified, will automatically look for “lame” in the directories specified by the environment variable “PATH”.
org.daisy.pipeline.tts.lame.cli.options
Additional command line options passed to lame
Deprecated

CSS

The text-to-speech voices and prosody can be configured with aural CSS. To do so, attach CSS style sheets to the source document. Style sheets can be linked (using an ‘xml-stylesheet’ processing instruction or a ‘link’ element), embedded (using a ‘style’ element) and/or inlined (using ‘style’ attributes). Below is an example of an aural CSS style sheet:

p {
  volume: soft;
  voice-family: female;
}

The CSS properties that are supported by DAISY Pipeline are a subset of Aural CSS 2.1 (and partly inspired by CSS 3 Speech):

voice-family
Used for selecting a voice based on gender, age, name and/or vendor.
See text below.
speak
“none” or “spell-out”
Used for preventing certain text from being rendered aurally, or for spelling text one letter at a time.
volume
A number, “silent”, “x-soft”, “soft”, “medium”, “loud” or “x-loud”
Used for controlling the loudness of the speech.
pitch
“x-low”, “low”, “medium”, “high” or “x-high”
Used for controlling the average pitch of the speech.
speech-rate
A number, “x-slow”, “slow”, “medium”, “fast” or “x-fast”
Used for controlling the rate of the speech in terms of words per minute.
pitch-range
A number
Used for controlling the variation in average pitch of the speech.
speak-numeral
“digits” or “continuous”
Used for speaking out numbers one digit at a time.
pause-before, pause-after and pause
A duration
Used for specifying silences with a certain duration before or after an element.
cue-before, cue-after and cue
A URL
Used for playing pre-recorded sound clips before or after an element.

voice-family is a comma-separated list of voice characteristics that place conditions on the voice selection. It is inspired by (but not the same as) the specification of the voice-family property in CSS 3.

If a full voice name is provided, e.g. “acapela, alice”, this voice will be selected regardless of the document language. If this voice is not available, a fallback voice is chosen that shares the characteristics of the requested voice: same language, same engine, same gender. If none is available, the Pipeline broadens its search by relaxing the criteria: first the gender, then the engine.

If no voice name is provided, e.g. “acapela”, “female” or “female, old”, the selection algorithm only considers the voices that match the current language. It starts by looking for a voice with the specified gender that is supplied by the specified engine, and broadens the search to any gender if that yields no results. If no voice matches the engine either, language is the only criterion.

When multiple voices match the criteria, the algorithm chooses the voice with the highest priority. Each voice has a default priority, which can be overridden via the “voice” entries of the configuration file, as follows:

<config>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>

Note that this is also a convenient way to add voices that are not natively supported by the Pipeline. In the example above, Todd is now a registered voice and, as such, can be selected automatically by the Pipeline when the document is written in English.
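
The selection rules for the case where no voice name is given (restrict to the current language, relax the gender, then the engine, and pick the highest priority) can be sketched roughly as follows. This is a simplified illustration, not the Pipeline's implementation: the Voice class, the select_voice function and the sample voices are hypothetical, and gender values are compared as plain strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Voice:
    engine: str
    name: str
    lang: str
    gender: str
    priority: int

def select_voice(voices, lang, engine=None, gender=None):
    """Pick a voice for `lang`, relaxing the gender first and then the engine."""
    in_lang = [v for v in voices if v.lang == lang]
    attempts = [
        # 1. Requested engine and requested gender.
        lambda v: (engine is None or v.engine == engine)
                  and (gender is None or v.gender == gender),
        # 2. Gender relaxed: requested engine only.
        lambda v: engine is None or v.engine == engine,
        # 3. Engine relaxed as well: language is the only criterion.
        lambda v: True,
    ]
    for matches in attempts:
        candidates = [v for v in in_lang if matches(v)]
        if candidates:
            # Among the matching voices, the highest priority wins.
            return max(candidates, key=lambda v: v.priority)
    return None

# Illustrative voices; the eSpeak voice name is made up for this example.
voices = [
    Voice("acapela", "claire", "fr", "female-adult", 12),
    Voice("sapi", "Microsoft Todd", "en", "male-adult", 100),
    Voice("espeak", "english-sample", "en", "male-adult", 2),
]
print(select_voice(voices, "en"))
print(select_voice(voices, "en", engine="espeak", gender="female-adult"))
```

In the second call no English eSpeak voice has the requested gender, so the gender criterion is dropped and the eSpeak voice is still preferred over the higher-priority SAPI voice.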

AT&T, eSpeak and Acapela’s voice names can be found in their corresponding documentation. For Windows users, SAPI voices are enumerated in the system settings (Start > All Control Panel Items > Speech Recognition > Advanced Speech Options). You will also need to know the value of the “engine” attribute. The value of this attribute must be one of the following:

In case of any doubt, engines and voice names can be retrieved from the server’s log in which all the voices are enumerated:

Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna

PLS

PLS lexicons allow you to define custom pronunciations of words. They are meant to help TTS processors deal with ambiguous abbreviations and the pronunciation of proper names. When a word is defined in a lexicon, the processor will use the provided pronunciation in place of the default rendering.

Lexicons are configured using the “lexicon” elements in the configuration file. If the “href” attribute is missing, the pipeline will read the lexicons inside the config nodes, as in this example:

<config>
  <lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
           alphabet="ipa" xml:lang="en">
    <lexeme>
      ...
    </lexeme>
  </lexicon>
</config>

The syntax of a PLS lexicon is defined in Pronunciation Lexicon Specification Version 1.0, extended with regular expression matching. To enable regular expression matching, add the “regex” attribute, as follows:

<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0"
         alphabet="ipa" xml:lang="en">
  <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
  </lexeme>
</lexicon>

The regex feature works only with alias-based substitutions. The regex syntax is that of XQuery 1.0 and XPath 2.0.
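
To give an idea of what such a regex lexeme does to the text, here is a small sketch using Python's re module. It is only an approximation: Python's regex dialect is not the XPath 2.0 one (the two agree for this simple pattern), and the helper name apply_regex_alias is hypothetical. The “$1” style group references of the alias are translated to Python's own syntax.

```python
import re

def apply_regex_alias(text, grapheme, alias):
    """Replace every match of `grapheme` with `alias`, where $1, $2, ... refer to groups."""
    # Translate "$1" style references into Python's "\g<1>" form.
    python_alias = re.sub(r"\$(\d)", r"\\g<\1>", alias)
    return re.sub(grapheme, python_alias, text)

print(apply_regex_alias("pages 10-20", "([0-9]+)-([0-9]+)", "between $1 and $2"))
# prints: pages between 10 and 20
```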

Whether or not the regex attribute is set to “true”, the grapheme matching can be made more accurate by specifying the “positive-lookahead” and “negative-lookahead” attributes:

<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <lexeme>
    <grapheme positive-lookahead="[ ]+is">SB</grapheme>
    <alias>somebody</alias>
  </lexeme>
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
  <lexeme xml:lang="fr">
    <grapheme positive-lookahead="[ ]+[cC]ity">boston</grapheme>
    <phoneme>bɔstøn</phoneme>
  </lexeme>
</lexicon>

Graphemes with “positive-lookahead” will match if the beginning of what follows matches the “positive-lookahead” pattern. Graphemes with “negative-lookahead” will match if the beginning of what follows does not match the “negative-lookahead” pattern. The lookaheads are case-sensitive while the grapheme contents are not.
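
The lookahead and case-sensitivity rules can be mimicked with Python's re module, as in the sketch below. It assumes a non-regex grapheme (hence the escaping) and relies on Python's lookahead syntax, so it should be read as an illustration of the matching rules rather than as the Pipeline's implementation. The function name lexeme_pattern is hypothetical.

```python
import re

def lexeme_pattern(grapheme, positive_lookahead=None, negative_lookahead=None):
    """Build a pattern whose grapheme ignores case while the lookaheads do not."""
    # (?i:...) limits case-insensitivity to the grapheme itself.
    pattern = "(?i:" + re.escape(grapheme) + ")"
    if positive_lookahead is not None:
        pattern += "(?=" + positive_lookahead + ")"
    if negative_lookahead is not None:
        pattern += "(?!" + negative_lookahead + ")"
    return re.compile(pattern)

somebody = lexeme_pattern("SB", positive_lookahead="[ ]+is")
print(bool(somebody.search("SB is late")))   # grapheme followed by " is"
print(bool(somebody.search("sb is late")))   # grapheme case is ignored
print(bool(somebody.search("SB IS late")))   # lookahead case is not ignored
```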

The lexemes are reorganized so as to be matched in this order:

  1. Graphemes with regex="false" come first, no matter if there is a lookahead or not;
  2. Graphemes with regex="true" and no lookahead;
  3. Graphemes with regex="true" and one or two lookaheads.

Within these categories, lexemes are matched in the same order as they appear in the lexicons.
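
This reordering amounts to a stable sort on the three categories listed above. The sketch below models lexemes as plain dictionaries, which is an assumption made for the example; the function name match_order is hypothetical as well.

```python
def match_order(lexemes):
    """Return the lexemes in matching order, keeping lexicon order within a category."""
    def category(lexeme):
        if not lexeme.get("regex", False):
            return 0  # non-regex graphemes, with or without lookahead
        has_lookahead = ("positive-lookahead" in lexeme
                         or "negative-lookahead" in lexeme)
        return 2 if has_lookahead else 1
    # sorted() is stable, so the original order is preserved inside each category.
    return sorted(lexemes, key=category)

lexemes = [
    {"grapheme": "a", "regex": True, "positive-lookahead": "x"},
    {"grapheme": "b", "regex": True},
    {"grapheme": "c", "positive-lookahead": "y"},
    {"grapheme": "d", "regex": True},
    {"grapheme": "e"},
]
print([lexeme["grapheme"] for lexeme in match_order(lexemes)])
# prints: ['c', 'e', 'b', 'd', 'a']
```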