EXPLORATION OF ACOUSTIC CORRELATES IN SPEAKER SELECTION FOR CONCATENATIVE SYNTHESIS

Ann K. Syrdal, Alistair Conkie, Yannis Stylianou

AT&T Labs - Research, Florham Park, NJ, USA

ABSTRACT

It is often difficult to determine the suitability of a speaker to serve as a model for concatenative text-to-speech synthesis. The perceived quality of a speaker's natural voice is not necessarily predictive of its (even relative) synthetic quality. The selection of female and male speakers on whom to base two synthetic voices for the new AT&T text-to-speech system was made empirically. Brief readings of identical text materials were recorded from pre-selected professional speakers (6 females and 9 males). Small-scale TTS systems were constructed with a minimal diphone inventory, suitable for synthesizing a limited number of test sentences. Synthesized sentences, and their naturally spoken references, were presented to listeners in a formal listening evaluation. Listeners rated each test sentence independently on intelligibility, naturalness, and pleasantness. A variety of acoustic measurements of the speakers were made in order to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.

1. INTRODUCTION

The suitability of a speaker to serve as a model for concatenative text-to-speech synthesis is often difficult to determine. The perceived quality of a speaker's natural voice is not necessarily predictive of its (even relative) synthetic quality, and many researchers have horror stories of time and effort wasted working on synthesizing what turned out to be the wrong speaker.

This paper briefly describes our procedures in empirically selecting speakers to serve as models for the new AT&T American English concatenative synthesis text-to-speech system by way of a formal listening test. In order to determine what acoustic characteristics are most predictive of good speakers for


