Int J Speech Technol (2010) 13: 75–84 DOI 10.1007/s10772-010-9072-2
The Bonn Open Synthesis System 3 Stefan Breuer · Wolfgang Hess
Received: 23 March 2010 / Accepted: 15 April 2010 / Published online: 20 May 2010 © Springer Science+Business Media, LLC 2010
Abstract The Bonn Open Synthesis System (BOSS) is an open-source software distribution for unit selection speech synthesis that aims to be easily extensible to new target languages and different applications. To achieve this flexibility, many aspects of the software have been changed in recent years, including the addition of a refined interface to synthesis modules and a stricter separation of language-specific and language-independent code. This article gives an overview of the architecture from a technical perspective and explains how it can be adapted for a particular purpose and voice. This is preceded by a short introduction to the unit selection paradigm in general and a section on the specifics of the approach taken by BOSS. A particular focus is placed on the extensions made for the integration of Polish, in the course of which some of the measures to increase flexibility were carried out. Further information on the application to Polish, with an emphasis on the linguistic, phonetic and acoustic aspects as well as the speech corpus used, can be found in the second part of this two-part article, "Polish unit selection speech synthesis with BOSS", also published in this issue of the Journal.

Keywords Speech synthesis · Unit selection · Polish · Phonetics · BOSS
S. Breuer now with Phonetics Arts Ltd., Cambridge, UK. S. Breuer () · W. Hess Institut für Kommunikationswissenschaften, Abteilung Sprache und Kommunikation, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany e-mail:
[email protected] W. Hess e-mail:
[email protected]
1 Overview

The Bonn Open Synthesis System (BOSS) (2010) is a cross-platform developer framework for the design of multilingual and multi-functional unit selection speech synthesis applications. It is completely written in C++ and can be freely obtained under an open source licence from the Sourceforge software development website (see References). The main goal of BOSS is to relieve researchers in the field of speech synthesis of the need to implement their own systems from scratch. To achieve this, it has been designed from the start to modularise synthesis tasks as much as possible. This entails the decision to make the front-end software (the client) an independent application that communicates with the core synthesis program (the server) via the network. The BOSS project was initiated in 2000 at the Institute for Communication Research and Phonetics (IKP, now the Department for Language, Speech and Communication at the Institute for Communication Science), University of Bonn, by Karlheinz Stöber (Stöber 2003), who provided the design and implementation of the base system, consisting of tools for the preparation of speech corpora, a utility library, a signal processing library, the core synthesis application, called Boss Server, including the unit selection algorithms and a transcription module, as well as a demonstration client for text-to-speech synthesis. BOSS was first introduced to the research community at Eurospeech 2001 (Klabbers et al. 2001). Since then it has been under constant development and a number of modules and tools have been added and extended, among them classes for decision tree-based grapheme-to-phoneme conversion, duration prediction, pitch and duration manipulation and soft concatenation of units. Much focus was also put on the redesign of the core architecture to provide for an easy exchange of
individual modules (Breuer 2009). Most of the development and testing was done for the synthesis of German, for which two unit selection speech corpora were created. The female "Lioba" voice was the German back-end of the speech-to-speech translation project Verbmobil (Stöber et al. 2000). As such, it is tuned for dialogue related to the task of fixing an appointment. Another large database was recorded for a directory enquiries application capable of reading out names and addresses (Breuer and Abresch 2003). BOSS has also been used for synthesising Dutch (Klabbers and Stöber 2001) and Ibibio (Bachmann and Breuer 2007). The adaptation to Polish is the subject of the second part of this paper, "Polish unit selection speech synthesis with BOSS", also published in this issue. In the following section we will give a short introduction to the history of unit selection speech synthesis with a particular focus on the genesis and general principles of the Verbmobil speech synthesiser, out of which BOSS developed conceptually. Afterwards we will describe the structure of the BOSS synthesis system and the details of its selection algorithm, give insight into the data formats employed by BOSS as we explain the process of setting up a new voice and, lastly, explore the future direction that the development of this project will take.
2 Unit selection speech synthesis

Most speech synthesis systems follow one of three paradigms on the acoustic level: parametric synthesis by rule, concatenative synthesis with predefined units, and synthesis by (non-uniform) unit selection. Parametric synthesis by rule, the oldest of the speech synthesis principles, uses a front-end model of the vocal tract with well-defined acoustic-phonetic parameters to model the segmental level of the output signal. The time course of these parameters is modeled by rule. A detailed discussion of this principle is found in Klatt (1987). In the late seventies systems arose which, although using the principle of parametric synthesis, concatenated (parameterized) small units of natural speech (phones, diphones, demisyllables or a mixture of them) and reached higher intelligibility than systems where everything was modeled by rule. Yet the breakthrough for concatenative synthesis came when it became possible to manipulate the acoustic correlates of prosody (i.e., pitch, amplitude, duration) without changing the segmental properties of the signal (see e.g. Hess 1992, for a survey). The well-known PSOLA method allowed for such manipulation; other methods like the hard cut of the LPC residual signal or harmonic-plus-noise modeling followed. All these methods operate directly on the speech signal and make its parameterization unnecessary. Compared to parametric modeling these systems sound more natural and their intelligibility is much increased. Yet they have their limitations: there is only one instance of each pre-defined unit in the corpus and, above all, prosody is completely generated by model. In particular it is the prosodic model that puts a limit to naturalness. Insofar as utterances by a human speaker are regarded as maximally natural and no model whatsoever can match the naturalness of human speech, we can improve the quality of synthetic speech by making it more "human", i.e., by using data that allow for preserving much of the natural quality of the original signal. Speech synthesis from a corpus with multiple instances of segmental units and some possibility of preserving natural prosody has shown potential for a large improvement in naturalness and even intelligibility of synthetic speech. The system described in this paper follows this principle. There are two origins for corpus-based concatenative speech synthesis:

• reproductive speech synthesis in dialog systems for limited domains (e.g., Klabbers 1997);
• CHATR and its forerunners (Sagisaka 1988; Campbell and Black 1996; Hunt and Black 1996): developments for overcoming some of the shortcomings of concatenative synthesis, with the rationale of replacing manipulation with unit selection.

Reproductive speech synthesis has been known and applied for a long time in simple announcement tasks (for instance, time announcement via telephone). Such systems have become widespread in the content-to-speech domain for applications that are prosodically restricted (weather news, news about traffic jams, inquiry systems). The availability of large amounts of computer memory and a dialogue module able to put strong constraints on the prosody of the synthetic target utterances have made this possible. Reproductive speech synthesis does not manipulate signals. Rather, it puts together appropriate chunks of speech (typically phrases) according to the slot-filling principle. Its domain is necessarily restricted: a target utterance whose components are not in the database cannot be synthesized. The CHATR approach, on the other hand, was developed from the idea that any signal manipulation whatsoever results in a degradation of the output and has thus to be avoided. Manipulation is replaced with selecting an instance of the target element from multiple instances in the database; the target element can be anything from a semiphone (half a phone) to a whole utterance. The database is a coherent corpus of speech from a single speaker. The advantages to be expected are:

• multiple instances of each unit will be available in different contexts;
• minimization of the number of concatenation points by using longer units;
• integration of natural prosody from the database is possible, thus avoiding many of the signal manipulations otherwise necessary.
Fig. 1 The principle of unit selection. Upper half: all instances of the units available for the target utterance are represented as a directed graph. Lower half: an optimal path through the graph is determined by minimizing the pertinent cost function: i is the unit's position or slot in the target utterance, j0 and j1 are candidate units for slot i1, and k0 is a candidate unit for slot i2. c(k0) is the summed-up unit cost of a candidate unit k0 in the ith position of the target utterance, uc(k0), plus the minimum of the accumulated path cost up to unit j, c(j), and the transition cost from j to k0, tc(j, k0). Example: Polish utterance "Dzień dobry!" ("Good morning", "hello!", literally translated: "Good day!") composed of word, syllable and phone elements. For more details, see text
The problem to be faced for an unrestricted domain is that of coverage: what to do with elements not available in the database (Möbius 2000)? On any level (above the level of a phone), we have to expect a large number of rare events which cannot be ignored because they are essential. If a target utterance is well represented in the corpus, we can expect to reach a quality close to that of reproductive synthesis; if not, the quality will be non-uniform and degraded in those parts where the corpus matches the target utterance badly. How does speech synthesis from a large corpus work? We will explain this using the immediate ancestor of the BOSS system described in this article as an example. The speech synthesis module (Stöber et al. 2000) created for the Verbmobil speech-to-speech translation project draws on both the principles of phrase-slot filling and unit selection synthesis. For each word of a given input utterance, it searches
the speech corpus for a fitting word unit. Subword units are used only when no suitable word instance is found in the corpus. Suitability is assessed by matching the units in the corpus against a set of linguistic constraints given by the unit's desired properties and position in the target utterance, the foremost of these being its phonetic transcription. Only exact matches are considered as candidates for selection. If none can be found, the search proceeds to the next lower level. Thus, depending on the outcome of this pre-selection step, different unit types can be used for different sections of the utterance, but only one fixed type of unit can be used for a certain stretch. In the two-word utterance in Fig. 1, these are the word type for the first target word, the phone type for the first syllable of the second word and the syllable type for its second syllable. Following the nomenclature established by Sagisaka (1988), though describing a slightly different notion, we call this non-uniform unit selection. Not all synthesisers have a pre-selection step to narrow down the types and number of candidates to select from and many perform selection only on a uniform set of diphones or phones. What they usually share, however, to be called unit selection systems, is the following common approach: From all the instances of the target utterance's words or subword units found in the corpus as a whole (or a subset of it as defined by pre-selection constraints) a table is constructed. The columns of the table represent the elements of the text to be synthesised and the rows represent the candidate units for each of these elements. This is converted into a directed graph (Fig. 1). For each unit in the graph two types of cost are calculated: unit(-intrinsic) costs, also called target costs, that describe quantitatively how well it matches the requirements of the target utterance, and transition costs, also called concatenation or join costs, that reflect how well the unit connects with each of its adjacent units in the graph. Typical unit costs include:

• coarticulation/phonetic context: phonetic context is described by the first phone of the current and the last phone of the preceding word (left context), and the corresponding phones on the right-hand side (right context);
• position within utterance (initial, phrase final, sentence final, medial); mainly to cope with phrase-final lengthening;
• unit duration (difference between duration predicted by the prosody module and actual duration);
• sentence mode (declarative, interrogative, progredient; for phrase-final position only);
• prosodic matching (several subcriteria).

Typical transition costs include:

• connectedness: two words have the property of being connected if they follow each other in the same carrier phrase in the database;
• spectral matching (e.g. by measuring differences between Mel Frequency Cepstrum coefficients and/or F0 at the boundaries of adjacent units).

Fig. 2 Pre-selection configurations for three levels with two sets of constraints each. The sets are set apart by an empty line. Each row in a set gives the name of the desired feature in the target utterance in the first column and the name of the corresponding feature in the corpus in the second column. As should normally be the case, these names are identical in all instances here. TKey is the transcription of the unit, Stress is the level of lexical stress, PMode and PInt describe the type and position of the phrase boundary, if any
All costs are weighted and added up as a sum of products. A dynamic programming algorithm is then employed to find an optimal path through this graph which minimizes the actual cost function.
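Written out, using the notation of the caption of Fig. 1, the weighted cost sums and the dynamic programming recursion take the following form. The weight symbols w and v are generic placeholders introduced here for illustration, not identifiers from the BOSS sources:

```latex
% Unit and transition costs as weighted sums of their subcriteria
% ("sum of products"); the weights are currently set manually in BOSS.
uc(k) = \sum_{m} w_m \, uc_m(k), \qquad
tc(j, k) = \sum_{n} v_n \, tc_n(j, k)

% Dynamic programming recursion (cf. the caption of Fig. 1): the
% accumulated path cost of candidate k_0 for slot i is its own unit cost
% plus the cheapest continuation from any candidate j of slot i-1.
c(k_0) = uc(k_0) + \min_{j} \bigl[\, c(j) + tc(j, k_0) \,\bigr]
```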
3 Unit selection in BOSS

The Bonn Open Synthesis System retains the Verbmobil synthesiser's ability to perform unit selection on different linguistic levels. In detail, these are the word, syllable, phone or, alternatively, half-phone (semiphone) levels. This property makes BOSS useful for applications in which simple phrase-slot filling using whole words is sufficient, such as phone-number reading in the aforementioned directory enquiries application (Sect. 1), as well as for fully fledged synthesis in an open domain using uniform selection of phone segments or half-phones, or any hybrid, non-uniform approach in between these two extremes. Any level can be turned on or off simply by means of a configuration setting. Unit selection in BOSS starts by pre-selecting potentially suitable units from the corpus, progressing from the highest configured level down to the lowest. The difference is that, for each unit level, we can define one or more sets of decreasingly strict constraints on the candidate units in a configuration file (Fig. 2). An initial search would, for example, aim to retrieve all units from the word database whose transcription, preceding context, following context, and position in the phrase exactly match the respective predicted properties of the required unit. If this lookup yields no units, another less restrictive search may be performed, in the simplest case with all constraints but the unit's transcription removed. This process continues until matching units are found or until no more queries are defined for this level. In the latter case, pre-selection proceeds to the next lower level, repeating the above procedure, but using a (potentially) different set of constraints. After obtaining a number of candidates for each target unit in the utterance, the next step is to assign unit costs to each candidate and transition costs to each join between neighbouring candidates and perform the actual unit selection. BOSS uses the standard minimisation technique for unit and transition costs as described in the previous section. The weights applied for each cost term currently have to be set manually. The implementation strictly separates the selection and dynamic programming functions from the cost functions, enabling developers to tailor the latter to their needs without changing the core algorithms. This is further elaborated on in Sect. 4.
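As a rough illustration of this procedure, the following sketch walks through the configured levels and their decreasingly strict constraint sets until a non-empty candidate list is found. All type and function names are invented for this example and do not correspond to identifiers in the BOSS sources; the database lookup is left as a stub.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative types only; the real BOSS classes differ.
struct TargetUnit    { std::map<std::string, std::string> features; }; // predicted features (TKey, Stress, ...)
struct Candidate     { std::map<std::string, std::string> features; }; // features of a corpus unit
struct ConstraintSet { std::vector<std::string> mustMatch; };          // features that must match exactly

// Stub for the voice-database query: return all corpus units of the given
// level whose constrained features equal the target's predicted values.
std::vector<Candidate> queryCandidates(const std::string& level,
                                       const ConstraintSet& cs,
                                       const TargetUnit& target) {
  (void)level; (void)cs; (void)target;
  return {};  // a real implementation would query the voice database here
}

// Pre-selection as described above: from the highest configured level down
// to the lowest, try decreasingly strict constraint sets until candidates
// are found.
std::vector<Candidate> preselect(
    const TargetUnit& target,
    const std::vector<std::string>& levels,  // e.g. {"word", "syllable", "phone"}
    const std::map<std::string, std::vector<ConstraintSet>>& constraintSets) {
  for (const std::string& level : levels) {
    for (const ConstraintSet& cs : constraintSets.at(level)) {
      std::vector<Candidate> candidates = queryCandidates(level, cs, target);
      if (!candidates.empty())
        return candidates;  // matching units found: stop searching
    }
    // no query succeeded on this level: fall through to the next lower level
  }
  return {};  // no candidates on any level
}
```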
4 The architecture of BOSS

4.1 Communication features

The Bonn Open Synthesis System is designed to be used as a client/server application over a network. The rationale behind this design choice is that:

1. Synthesis functionality can be made available on remote machines, even if they would otherwise not be able to run the BOSS synthesiser.
2. There is no need to distribute voice data and configurations to pure client machines.
3. The same server software can serve different types of synthesis applications, such as text-to-speech or content-to-speech, without any modifications.

This latter and probably most important property is possible because the server's input is a common data format across all applications, more precisely an XML vocabulary proprietary to BOSS that contains, at a minimum, the words
of the input utterance in a tokenised and normalised form. Depending on the nature of the application, the task of the client software is to create such server XML documents either from pure text or a more phonetically explicit representation and to send them to the synthesis server (Fig. 3).

Fig. 3 The communication and module structure of the Bonn Open Synthesis System

During the various stages of synthesis, this document is then further enriched to contain all the phonetic and acoustic information available for the target utterance to select the appropriate speech units. It is thus what Schröder and Breuer (2004) term an XML representation language as opposed to a pure input language that can only describe shallow linguistic detail of an utterance. At the first stage, however, the server XML document serves the same purpose as other input XML specifications for speech synthesis such as SSML, from which it can easily be converted as demonstrated by the same authors. For the previously employed example "Dzień dobry!" it would look like the following:

Example 1 The output of the client is the minimal server XML document.
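Since the original listing is not reproduced here, the following sketch only indicates what such a minimal document could look like. The attribute names Orth, PMode and PInt are those discussed in the text; the element names and the attribute values are placeholders and may differ from the XML vocabulary actually defined by the BOSS distribution.

```xml
<!-- Illustrative sketch only: element names and attribute values are
     placeholders; Orth, PMode and PInt are the attributes discussed in
     the text (their value sets are described in Part II). -->
<TEXT>
  <SENTENCE>
    <WORD Orth="Dzień"/>
    <WORD Orth="dobry" PMode="..." PInt="..."/>
  </SENTENCE>
</TEXT>
```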
This minimal server document can contain multiple sentences, each marked by a sentence-level XML tag; the sentences are in turn grouped by a superordinate document element. Each sentence contains at least one word tag. The linguistic information is encoded in the attributes of these tags. For instance, the orthographic representation of the first word in the example is defined by an attribute named Orth. No information is stored between the opening and closing tags, to ensure uniform access to all linguistic annotations via attribute names. PInt and PMode describe the nature of the phrase boundary (for details on phrase boundary classification, see Table 2 of Part II of this article).

4.2 Server and modules

Figure 3 gives a schematic overview of the BOSS architecture and data sources. The BOSS server software operates either in file or network mode. If supplied with a server XML file on the command line, it will start in file mode, try to synthesise an utterance from the input specification, write the output to a file and exit. In network mode, the server waits for input documents to be sent by BOSS clients, processes these and returns the synthesised utterance(s) to the sending party. The server takes its configuration either from a standard system location (/etc/boss3conf.xml on UNIX systems) or reads a file specified on the command line. BOSS configuration files are again XML files that commonly specify options for all components of the system. They can keep settings for several voice inventories to be loaded at startup and made available for synthesis by the server. Both the input server document and the configuration XML document are represented within the server and its
classes by the Document Object Model (DOM), an object-oriented approach to storing and accessing XML data in memory.

The main component of the server is the class BOSS_Synthesis, which is responsible for loading and running the modules that perform symbolic and acoustic preprocessing, unit selection and concatenation of waveforms in the order that is configured for each voice inventory in a settings file. It is thus called the module scheduler. BOSS_Synthesis also initialises the connection to the databases that store the annotation and segmentation information for the inventories used. Each module to be called by the scheduler must be derived from a virtual base class BOSS::Module which defines a common interface. In effect, each module can access and manipulate the representation of the target utterance and read from the annotation database and the global configuration file. Every input utterance submitted to the scheduler by the server is processed sentence by sentence. The current sentence element is checked for a set of attributes specifying the language and voice inventory to be used for synthesis. If none are present, the default voice as given by the configuration file is used. The scheduler then calls every module configured for the detected voice inventory and language to process the current sentence. This property of the scheduler means that BOSS can be configured as a multilingual system, a system generally capable of synthesising different target languages, as e.g. the Bell Labs synthesiser was for traditional, single-instance diphone concatenation (Sproat 1998).

The BOSS distribution comprises classes for the functionality most commonly found in unit selection systems. Apart from the selection method discussed in Sect. 3, it also contains classes for grapheme-to-phoneme conversion and duration prediction. While the supplied modules for transcription and duration are adapted to the synthesis of German, all language-dependent functionality is strictly separated from language-independent linguistic processing and general purpose classes. By this design, implementing a module for a different language may entail nothing more than copying the language-dependent part and modifying it accordingly. We will exemplify this later by referring to the structure of the duration module.

The first module to be called for a target sentence in BOSS will usually be the grapheme-to-phoneme transcription, as text tokenisation and normalisation have already been performed by the client. The German module takes up to three steps to obtain a transcription for every input word, these being lexicon lookup, morpheme decomposition and finally decision-tree based g2p using the algorithm by Daelemans and van den Bosch (1996), if all else fails. When a transcription has been found, the module extends the DOM representation with information on the syllabic and segmental structure of the sentence.
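The relationship between the scheduler and its modules, as described at the beginning of this section, can be sketched roughly as follows. Only the names BOSS_Synthesis and BOSS::Module are taken from the text; every signature and member below is an assumption made for illustration and will differ from the actual BOSS interface.

```cpp
#include <memory>
#include <vector>

// Rough, hypothetical sketch of the module/scheduler relationship.
namespace BOSS {
class Module {
 public:
  virtual ~Module() = default;
  // Each module may read the configuration and the voice database and
  // enrich the DOM representation of the current sentence.
  virtual void process(/* DOM sentence node, configuration, database */) = 0;
};
}  // namespace BOSS

class BOSS_Synthesis {  // the "module scheduler"
 public:
  void processSentence(/* DOM sentence node */) {
    // 1. Determine voice inventory and language from the sentence
    //    attributes, falling back to the configured default voice.
    // 2. Call every module configured for that voice, in order; each one
    //    enriches the same document (transcription, duration, selection,
    //    concatenation).
    for (const auto& module : modules_)
      module->process(/* sentence */);
  }
 private:
  std::vector<std::unique_ptr<BOSS::Module>> modules_;  // loaded at startup
};
```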
The following excerpt of the extended utterance shows this for our "Dzień dobry!" example:

Example 2 The target utterance representation for "Dzień dobry!" enriched with phonetic information. Each of the added syllable- and phone-level tags, as well as the word tags, now carries an attribute TKey that holds the transcription of the corresponding linguistic level. Phones and syllables have a Stress attribute that can be either 0 (unstressed), 1 (primary stress) or 2 (secondary stress).
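Again, only a sketch is given here instead of the original listing. The attributes TKey and Stress and the nesting of word, syllable and phone levels follow the description above; element names, transcriptions and the remaining structure are placeholders.

```xml
<!-- Illustrative sketch only: element names and transcription values are
     placeholders; TKey holds the transcription of each level and Stress
     the lexical stress level (0, 1 or 2). -->
<SENTENCE>
  <WORD Orth="Dzień" TKey="...">
    <SYLLABLE TKey="..." Stress="1">
      <PHONEME TKey="..." Stress="1"/>
      <!-- further phones of the syllable ... -->
    </SYLLABLE>
  </WORD>
  <WORD Orth="dobry" TKey="...">
    <!-- two syllables, each containing its phones ... -->
  </WORD>
</SENTENCE>
```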
While it would have been easy enough to build a new language module for the Polish version described in Part II of this article using the language-independent classes, we were fortunate to possess an existing rule-based transcription library for Polish. This being a closed-source binary, it was linked into the proprietary client software rather than wrapped into a module called by the scheduler. Hence the extended XML document shown in Example 2 would, in the Polish synthesiser, have been created before being passed to the server. As the minimal structure expected by the server is a subset of this format and, by design, BOSS modules do not overwrite information already present in a document, this setup does not pose a problem for the architecture.

As stated before, BOSS provides a set of classes to predict phone durations. These classes use decision trees as generated by the wagon software from the Edinburgh Speech Tools that accompany the Festival speech synthesiser (Black et al. 1999). To demonstrate the concept of separating language-independent code from language-specific and general purpose algorithms employed in BOSS, Fig. 4 shows a Unified Modelling Language (UML) diagram of the base class BOSS_Duration, the class BOSS_Duration_DE derived from it and the CART reader class to parse decision tree files. For any language, the functionality of a complete duration module based on this concept would be to
1. generate any features that are needed to predict the duration of a target phone and add them to the DOM representation,
2. retrieve the duration from the decision tree using these features, and
3. propagate the predicted value to all linguistic levels of the DOM by adding the Dur attribute to each.

Fig. 4 UML diagram for the classes comprising the German duration module. Each box contains the name of the class or struct in the uppermost segment, the member variables in the second from above and the methods in the bottom one. Visibility attributes prefix each variable and method: + (public), # (protected) and − (private)

While the base class loads the CART reader and implements the method add_durations() to store the result of a phone-based duration prediction in the XML server document, it does not implement the functions to generate any additional linguistic features, as these may be language-dependent. BOSS_Duration is therefore called task-specific and it is the responsibility of a language-specific subclass (BOSS_Duration_DE in Fig. 4) to provide such features. This subclass receives its name from the base class with the addition of an ISO country or language code. BOSS_CartReader, on the other hand, can be used for classification tasks other than duration prediction and so is kept completely separate from the other classes. As with any other task-specific class derived from BOSS::Module, the duration prediction base class is compiled into a library of its own and can be linked in to compile a language-specific derived class. Also, as all modules are dynamically loaded and initialised by the scheduler at startup, there is no need to change or recompile the BOSS server or any of its libraries to create a new language module. Code written for the synthesis of a particular language can thus be bundled separately from the BOSS distribution. This was done for the language-specific classes in the Polish adaptation project, including BOSS_Duration_PL, which uses significantly more features than the German version. Details on these features and their impact on Polish duration prediction can be found in Part II, Sect. 3 of this publication.
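The division of labour between the task-specific base class and a language-specific subclass might be sketched as follows. The class names BOSS_Duration, BOSS_Duration_DE and BOSS_CartReader and the method add_durations() are taken from the text and Fig. 4; all signatures, helper types and bodies are assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of the duration-module pattern described above.
class BOSS_CartReader {  // generic CART/decision-tree reader
 public:
  explicit BOSS_CartReader(const std::string& treeFile) { (void)treeFile; }
  double predict(const std::map<std::string, std::string>& features) const {
    (void)features;
    return 0.0;  // stub: a real reader would descend the wagon-built tree
  }
};

class BOSS_Duration /* : public BOSS::Module */ {
 public:
  explicit BOSS_Duration(const std::string& treeFile) : cart_(treeFile) {}
  virtual ~BOSS_Duration() = default;

  void process(/* DOM sentence */) {
    addFeatures();    // language-specific feature generation (subclass)
    add_durations();  // phone-wise tree lookup, implemented in the base class
  }

 protected:
  // Feature generation is potentially language-dependent and is therefore
  // left to subclasses such as BOSS_Duration_DE or BOSS_Duration_PL.
  virtual void addFeatures(/* DOM sentence */) = 0;

  // Query the decision tree for every phone and propagate the predicted
  // value to all linguistic levels by adding a Dur attribute to each.
  void add_durations(/* DOM sentence */) {}

  BOSS_CartReader cart_;
};

class BOSS_Duration_DE : public BOSS_Duration {
 public:
  using BOSS_Duration::BOSS_Duration;
 protected:
  void addFeatures(/* DOM sentence */) override {
    // German-specific features for the duration tree would be added here.
  }
};
```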
For both the German and Polish versions of BOSS, the unit selection module is called immediately after duration targets have been set. While none of the pre-selection or dynamic programming code in the class BOSS_Unitselection contains any language-dependent elements, a developer may choose to modify the unit and transition cost functions and the features used therein. For that reason, there are separate classes that represent candidate units with their features (BOSS_Node) and the cost functions (Cost), respectively. Both are task-specific base classes that language-specific classes inherit from. As with the modules in the scheduler, both are loaded dynamically for each language defined in the configuration file when the unit selection module is instantiated. There are thus no general transition or unit costs that BOSS uses across all language versions, but classes and functions for comparing e.g. Mel Cepstra and F0 values are provided and used in the German version. Please refer to Part II to learn about the features and metrics employed for Polish.

After unit selection has determined the units to be used for each part of the target specification, the last module to be called by the scheduler in any setup is the module to retrieve the speech signal associated with these units and concatenate them to form an utterance. This is the BOSS_Concat module. It provides only basic boundary smoothing using Hanning-windowed overlap-add and is itself derived from a class BOSS_ConMan that is designed to be subclassed and extended by different manipulation algorithms.
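One common form of such boundary smoothing is a raised-cosine (Hanning) cross-fade over the overlap region of two adjacent units, sketched below. This is a generic textbook formulation with invented names, not the code of BOSS_Concat.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-fade the last `overlap` samples of `left` with the first `overlap`
// samples of `right` using complementary Hanning-shaped ramps.
// Assumes overlap <= left.size() and overlap <= right.size().
std::vector<float> overlapAdd(const std::vector<float>& left,
                              const std::vector<float>& right,
                              std::size_t overlap) {
  const double pi = 3.14159265358979323846;
  std::vector<float> out(left.size() + right.size() - overlap);

  // Copy the parts outside the overlap region unchanged.
  std::copy(left.begin(), left.end() - overlap, out.begin());
  std::copy(right.begin() + overlap, right.end(), out.begin() + left.size());

  // Blend the overlap region: the left unit fades out while the right fades in.
  for (std::size_t n = 0; n < overlap; ++n) {
    double fadeIn = 0.5 * (1.0 - std::cos(pi * (n + 1) / (overlap + 1)));
    double fadeOut = 1.0 - fadeIn;
    out[left.size() - overlap + n] =
        static_cast<float>(fadeOut * left[left.size() - overlap + n] +
                           fadeIn * right[n]);
  }
  return out;
}
```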
5 Database creation

The prerequisite for building a new voice for BOSS is a speech corpus consisting of a header-less PCM file for each
utterance with accompanying segmental annotation containing markers for lexical stress and syllable, word and phrase boundaries. The annotation then has to be converted into the BOSS corpus XML format, which is almost completely analogous to the server representation XML, omitting and adding only a few attributes (cf. Fig. 5 for details).

Fig. 5 Conversion of corpus XML documents into a database representation. Tags and feature attributes in corpus XML documents are defined analogously to the server XML versions. Additional features are First and Last, to mark the start and end of a unit in samples, and, for manually annotated data, TReal, to describe the actual rendering of a unit where it differs from a desired or automatically generated transcription. See text for details on the conversion process

Even more so than for the server, using XML to describe the corpus data is convenient in that third-party editors and libraries for all major programming languages exist to parse and modify this format. This facilitates writing tools that add information to the basic annotation in a format that can be used by BOSS. However, for the standard features such as lexical stress, phrasing, left and right context as well as Mel Frequency Cepstrum Coefficients and F0 values, tools are shipped with the BOSS distribution to add these features as attributes to the units in the XML format. These have recently been integrated into a single application to facilitate the setup of voices in BOSS. If voice data is to be labelled manually, BOSS also provides a definition for a simple non-XML label file format called BLF (see Part II, Sect. 2.3 for a Polish example) that can be read and written by the audio editor wavesurfer (Sjölander and Beskow 2000) and for which a tool is supplied to convert it into corpus XML. For the Polish voice, an existing aligner was modified to output these BOSS Label Files. Details on the aligner and the annotation of the Polish corpus can be found in Part II of this article.

Although adding information to corpus files is facilitated by the XML format, the latter is neither a particularly good format to analyse the corpus as a whole nor to quickly retrieve unit annotations at runtime. Therefore the annotation is converted into data tables stored in a relational database server. This conversion is shown in Fig. 5.
For each linguistic/acoustic level annotated in the XML files a separate _data table is created to contain the features of all units of that type. More explicitly, there are five data tables, one each for the sentence, word, syllable, phone and half-phone type, with the rows representing individual units and the columns containing the various feature values of those units. Linking tables (_map) connect the levels using unit indices so that the association between e.g. a word and its constituent syllables is retained. The _data tables are accessed by the pre-selection process described in Sect. 3 to retrieve candidate units matching certain constraints. For the actual unit and transition cost calculations, each candidate table row is then converted into a BOSS_Node (see Sect. 4). The _map tables are used by the concatenation module (also Sect. 4) to find, in the sentence_data table, the filename of the sentence that contains each unit selected for synthesis. As all modules inherit methods to access the database, further tables may be created for the purposes of modules other than unit selection, such as the lexicon and phone class tables used by the German transcription module.
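The resulting table layout might look roughly like the following sketch. The column names are taken from the features mentioned in this article (TKey, Stress, PMode, PInt, First, Last) and the _data/_map naming scheme described above, but the actual schema of a BOSS voice database may be organised differently.

```sql
-- Illustrative sketch only; not the actual BOSS schema.
CREATE TABLE word_data (
  word_id  INT PRIMARY KEY,
  TKey     VARCHAR(128),  -- transcription of the unit
  Stress   INT,           -- level of lexical stress
  PMode    VARCHAR(16),   -- type of phrase boundary, if any
  PInt     VARCHAR(16),   -- position of the phrase boundary (see Part II)
  First    INT,           -- start of the unit in samples
  Last     INT            -- end of the unit in samples
);

-- Linking table connecting a word to its constituent syllables.
CREATE TABLE word_syllable_map (
  word_id     INT,
  syllable_id INT
);

-- A strict pre-selection query in the spirit of Fig. 2: all word candidates
-- whose transcription and prosodic features match the predicted target.
SELECT word_id, First, Last
  FROM word_data
 WHERE TKey = ? AND Stress = ? AND PMode = ? AND PInt = ?;
```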
6 Conclusion and future developments

A speech synthesis architecture has been presented that can be adapted to other languages and different applications with relatively small effort. Since its first extension from German to Dutch (Klabbers and Stöber 2001), it has undergone significant improvements in flexibility (Breuer 2009), some of which have been made in the course of the Polish project described in more detail in Part II of this paper. These improvements made it possible to extend BOSS by several modules and adapt it to the Benue-Congo language Ibibio (ISO language code: ibb) within the time frame of a master's thesis project (Bachmann and Breuer 2007).
In summary, the following minimal steps are necessary to use BOSS to create a unit selection synthesiser for a new language:

1. Provide a phonetically segmented corpus of speech and convert the annotation into BOSS corpus XML.
2. Add phonetic and acoustic details to the annotation and create a database using the tools provided.
3. Implement a client using the provided network classes and protocol.
4. Using the German version as a template, implement the language-specific parts of the duration and transcription modules, or alternatively supply this information in the client.
   (a) If necessary, provide a pronunciation lexicon and train the decision trees for grapheme-to-phoneme conversion using the provided MBL implementation and
   (b) train a duration CART tree using wagon.
5. Copy the German implementation of the candidate unit class BOSS_Node_DE and the cost functions Cost_DE and adapt, if desired.

To integrate a new German voice into the existing synthesiser, only steps 1, 2, 4(a) and 4(b) have to be performed. It is planned to further facilitate this process in the future by adding graphical tools for corpus and configuration creation. A likely candidate for a platform-independent GUI framework is Qt, which would also provide an XML DOM parser and connectivity for various database systems as well as other functionality now provided by a host of different libraries. The number of dependencies would thus be reduced, and alternatives to the MySQL database engine currently employed by BOSS, e.g. a more lightweight product such as sqlite for stand-alone use, would become possible without taking away the advantages that a shared server (MySQL or other) has for cooperative development.

Optimisation will be another topic for the near future. The core selection algorithm needs to be sped up to allow for faster non-uniform unit selection. While there are a number of ways in which this could be achieved, we would rather sacrifice size than flexibility and retain the ability for fast implementation of new ideas and voices in BOSS, even if this means it will never be a low-footprint system.

On the side of speech quality, the main route to explore will be the units used. For good reasons, most systems use diphones as base units and we are looking for ways to integrate these smoothly without losing our multi-level non-uniform selection capabilities. Another route that has been explored for some time is the single-level use of phone extensions for synthesis (phoxsy). These are multi-phone units consisting of clusters of spectrally highly variant phones (Breuer and Abresch 2004). They are currently being tested in fast unit selection synthesis for the visually impaired (Moers et al. 2007, 2010). Another area for improvement is the
generation of intonation targets and a corresponding manipulation of the output. Neither the German nor the Polish version of BOSS uses any sort of predicted F0 target values. A PSOLA implementation based on the BOSS_ConMan module described in Sect. 4 exists but is not part of the distribution for legal reasons. As an alternative, a Harmonic-plus-Noise implementation has been created which is yet to be integrated into BOSS (Rohde and Breuer 2005). With regard to transition costs, we have yet to investigate how we can best penalise energy differences between units.

Unit selection is a mature technology, but data sparsity will always define the upper bounds of the quality that can be attained, and much research has been conducted on alternatives such as HMM-based synthesis (Zen et al. 2007) in recent years. Breakthroughs have also been achieved in the quality of articulatory speech synthesisers (Birkholz and Jackèl 2003). However, the Bonn Open Synthesis System, although initially designed with unit selection in mind, is, in principle, flexible enough to function as a wrapper for any kind of synthesis. A project to implement this for a 3D-articulatory synthesiser is described in Birkholz et al. (2007).
References

Bachmann, A., & Breuer, S. (2007). Development of a BOSS unit selection module for tone languages. In SSW6-2007 (pp. 166–171).
Birkholz, P., & Jackèl, D. (2003). A three-dimensional model of the vocal tract for speech synthesis. In Proceedings of the 15th international congress of phonetic sciences (pp. 2597–2600), Barcelona, Spain.
Birkholz, P., Steiner, I., & Breuer, S. (2007). Control concepts for articulatory speech synthesis. In 6th ISCA workshop on speech synthesis (pp. 5–10), Bonn, Germany.
Black, A. W., Taylor, P., & Caley, R. (1999). The festival speech synthesis system: system documentation. CSTR, Edinburgh. Edition 1.4, for Festival version 1.4.0.
Bonn Open Synthesis System (BOSS) (2010). Project homepage: http://sourceforge.net/projects/boss-synth/.
Breuer, S. (2009). Multilinguale und multifunktionale Unit-Selection-Sprachsynthese: Designprinzipien für Architektur und Sprachbausteine. PhD thesis, Universität Bonn. http://hss.ulb.uni-bonn.de/diss_online/phil_fak/2009/breuer_stefan/breuer.htm.
Breuer, S., & Abresch, J. (2003). Unit selection speech synthesis for a directory enquiries service. In Proceedings of the ICPhS, Barcelona, Spain.
Breuer, S., & Abresch, J. (2004). Phoxsy: Multi-phone segments for unit selection speech synthesis. In Proceedings of the international conference on spoken language processing (ICSLP), Jeju.
Campbell, W. N., & Black, A. (1996). Prosody and the selection of source units for concatenation synthesis. In J. P. H. Van Santen, R. Sproat, J. Olive, & J. Hirschberg (Eds.), Progress in speech synthesis (pp. 279–291). New York: Springer.
Daelemans, W. M., & van den Bosch, A. P. J. (1996). Language-independent data-oriented grapheme-to-phoneme conversion. In J. van Santen, R. Sproat, J. Olive, & J. Hirschberg (Eds.), Progress in speech synthesis (pp. 77–89). New York: Springer.
Hess, W. (1992). Speech synthesis—a solved problem? In Signal processing VI, proceedings EUSIPCO, Brussels, Belgium.
Hunt, A., & Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of ICASSP (pp. 373–376).
Klabbers, E. (1997). High-quality speech output generation through advanced phrase concatenation. In Speech technology in the public telephone network: Where are we today?, Proceedings COST Telecom workshop, Rhodes, Greece.
Klabbers, E., & Stöber, K. (2001). Creation of speech corpora for the multilingual Bonn Open Synthesis system. In 4th ISCA tutorial and research workshop on speech synthesis, Pitlochry, Scotland.
Klabbers, E., Stöber, K., Veldhuis, R., Wagner, P., & Breuer, S. (2001). Speech synthesis development made easy: The Bonn Open Synthesis system. In Proceedings of EUROSPEECH, Aalborg, Denmark.
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737–793.
Möbius, B. (2000). Corpus-based speech synthesis: Methods and challenges. In W. Sendlmeier (Ed.), Forum Phoneticum: Vol. 69. Speech and signals: Aspects of speech synthesis and automatic speech recognition (pp. 79–96). Frankfurt a. M.: Hector.
Moers, D., Wagner, P., & Breuer, S. (2007). Assessing the adequate treatment of fast speech in unit selection speech synthesis systems for the visually impaired. In SSW6-2007 (pp. 282–287).
Moers, D., Wagner, P., Möbius, B., Müllers, F., & Jauk, I. (2010). Integrating a fast speech corpus in unit selection speech synthesis: Experiments on perception, segmentation and duration prediction. In Speech prosody 2010, satellite workshop on prosodic prominence: Perceptual and automatic identification, Chicago, IL.
Rohde, H., & Breuer, S. (2005). An HMM-synthesizer for BOSS. In Proceedings of the 16th conference on electronic speech signal processing (ESSP), Prague.
Sagisaka, Y. (1988). Speech synthesis by rule using an optimal selection of non-uniform synthesis units. In Proceedings IEEE ICASSP, New York, USA.
Schröder, M., & Breuer, S. (2004). XML representation languages as a way of interconnecting TTS modules. In Proceedings of the international conference on spoken language processing (ICSLP), Jeju.
Sjölander, K., & Beskow, J. (2000). Wavesurfer—an open source speech tool. In Proc. of ICSLP (Vol. 4, pp. 464–467), Beijing.
Sproat, R. (Ed.) (1998). Multilingual text-to-speech synthesis: The Bell Labs approach. Dordrecht: Kluwer Academic.
Stöber, K. (2003). Bestimmung und Auswahl von Zeitbereichseinheiten für die konkatenative Sprachsynthese. Frankfurt a. M.: Lang.
Stöber, K., Wagner, P., Helbig, J., Köster, S., Stall, D., Thomae, M., Blauert, J., Hess, W., Hoffmann, R., & Mangold, H. (2000). Speech synthesis using multilevel selection and concatenation of units from large speech corpora. In W. Wahlster (Ed.), Verbmobil: Foundations of speech-to-speech translation (pp. 519–536). Berlin: Springer.
Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. W., & Tokuda, K. (2007). The HMM-based speech synthesis system version 2.0. In Proc. of ISCA SSW6, Bonn, Germany.