Automatic singing voice classification

Index:
Introduction
Database of singing voice sounds
Singing voice parameterization
Results of neural network-based automatic recognition
Experts' objectivity in vocal quality judgments
Application in Matlab environment
Demo download
Bibliography



Introduction

A parametric description is necessary in many applications related to automatic sound recognition. Such systems are well developed in the speech and musical instrument sound domains, and many existing applications automatically recognize speech content or the speaker, or retrieve musical information (MIR applications).
Singing and speech share a common voice production organ. However, singing is a form of artistic expression, and thus it requires additional parameters to be defined and extracted. The evaluation of these parameters is one of the tasks of the Ph.D. thesis entitled 'Expert system for objectivisation of judgements of singing voices', submitted by Pawel Zwan and supervised by Prof. Bozena Kostek in the Multimedia Systems Dept., Gdansk University of Technology. The very complicated biomechanics of the singing voice requires numerous features to be properly described, and such a parametric representation needs intelligent decision systems to perform the classification. In the above-mentioned Ph.D. thesis, artificial neural network (ANN) and k-NN decision systems are employed for the purpose of voice type/quality recognition. The systems are trained with sound samples, of which a large part (1440 samples) was recorded in the studio and another 1250 samples were edited from professional CD recordings. For every sound sample, a feature vector containing 331 parameters was extracted. The resulting parameters were divided into two groups: the so-called 'dedicated' parameters, designed by the authors specifically to characterize singing voices, and more general parameters that may be found in the rich literature on MIR and speech recognition.
In the Ph.D. thesis, the ability of the decision systems to automatically recognize singing voices is discussed by comparing the efficiency of ANNs and rough set-based systems (RSs) in two recognition categories: 'voice type' (classes: bass, baritone, tenor, alto, mezzo-soprano, soprano) and 'voice quality' (classes: amateur, semi-professional, professional).

Database of singing voice sounds

The singing voice database contains over 2690 sound samples, 1440 of which were recorded from 42 singers in a studio. Each vocalist recorded 5 vowels ('a', 'e', 'i', 'o', 'u') at several pitches belonging to his or her natural voice scale. The vocalists formed three groups: amateurs (singers of the Gdansk University of Technology Choir), semi-professionals (students of the Vocal Faculty of the Gdansk Academy of Music) and professionals (qualified vocalists, graduates of the Vocal Faculty of the Gdansk Academy of Music). The second group of samples was prepared on the basis of CD audio recordings of famous singers. The database of professionals needed to be extended because voice type recognition is possible only among professional voices; amateur voices do not differ much within the groups of male and female voices.

Singing voice parameterization

Singing is produced by the vibration of the human vocal cords and by resonances in the throat and head cavities. As a result of these resonances, formants appear in the spectrum of the produced sounds. Formants are not only related to the articulation that allows singers to produce different vowels; they also characterize timbre and voice type qualities. For example, the formant of the middle frequency band (3.5 kHz) is described in the literature as the 'singer's formant', and its relation to voice quality has been demonstrated. This concept is well recognized in the rich literature related to singing. Equally important, however, is the interaction between two factors, the glottal source and the resonance characteristics, which together shape the timbre and power of the outgoing vocal sound. The relation between them is not straightforward, but it can be simplified by assuming that the vocal tract filter is linear. Since in the proposed model there is an analogy to FIR filtering (the singing sound can be represented as a convolution of the glottal source with the impulse response of the vocal tract), singing voice parameters can be divided into two groups associated with those two factors. Some inverse filtering methods for deriving glottis parameters are presented in the literature, but they are inefficient due to phase problems; only the parameters of the vocal tract formants can be calculated directly from inverse filtering analysis, since they are defined in the frequency domain. Glottal parameters must therefore be parameterized by other methods, which will be shown later. Vocal tract parameters, on the other hand, can be derived from the warped-LPC method (further called WLPC analysis), yielding the frequencies and levels of the formants. WLPC analysis allows the low-frequency resolution to be controlled, which is crucial for precise extraction of formants independently of the sound pitch.
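The source-filter analogy described above can be illustrated with a discrete convolution: a frame of the singing sound is modeled as the glottal source sequence convolved with the vocal tract impulse response. The following pure-Python sketch uses hypothetical toy sequences (not real glottal or vocal tract data) purely to make the model concrete:

```python
def convolve(source, impulse_response):
    """Discrete linear convolution: y[n] = sum_k source[k] * h[n - k]."""
    n_out = len(source) + len(impulse_response) - 1
    y = [0.0] * n_out
    for k, s in enumerate(source):
        for m, h in enumerate(impulse_response):
            y[k + m] += s * h
    return y

# Toy glottal source: a sparse pulse train (one pulse per glottal cycle).
source = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
# Toy vocal tract impulse response: a short decaying resonance.
tract = [1.0, 0.6, 0.36]

# The resulting "singing sound" frame under the linear-filter assumption.
frame = convolve(source, tract)
```

Under the linearity assumption, each glottal pulse simply excites a copy of the vocal tract response, which is why formant parameters can be attributed to the filter while other parameters must be attributed to the source.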
Since WLPC analysis is applied to short signal frames, it can be performed for several parts of the analyzed sound. Each formant parameter is therefore represented by a vector describing its values in consecutive frames. The median values of those vectors represent the so-called static parameters, while the variances of the vector values form their dynamic representation. Some singing voice parameters must be calculated for the whole sound, not for single frames. Those parameters are defined on the basis of the fundamental frequency contour analysis and are related to vibrato and intonation. Vibrato is a modulation of the fundamental frequency that singers produce in order to change timbre, while intonation is their ability to produce sounds perceived as stable and precisely in tune. In Fig.1 a division of the singing voice parameters is presented.

Fig.1 Division of singing voice 'dedicated' parameters


Results of neural network-based automatic recognition

Since artificial neural networks are widely used in automatic sound recognition, an ANN classifier was tested. The ANN was a simple feed-forward, three-layer network with 100 neurons in the hidden layer and 3 or 6 neurons in the output layer, depending on the number of classes being recognized. Since a single feature vector contained 331 parameters in total, the input layer consisted of 331 neurons. Sounds from the database were divided into three groups: the first part of the samples (70%) was used for training, the second part (10%) for validation, and the third (20%) for testing. The network trained smoothly, and the validation error started to increase after approximately 3000 training cycles. Learning was stopped when the validation error had been increasing for 50 successive cycles. Tables 1 and 2 show the results for the two discussed categories, voice quality and voice type. The rows of the tables describe the recognized class, and the columns correspond to the ANN-based classification.
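The data split and the early-stopping rule described above can be sketched as follows; this is an illustrative reconstruction with hypothetical names, not the thesis code:

```python
def split_dataset(samples, train_frac=0.7, val_frac=0.1):
    """Split samples into training/validation/testing parts (70/10/20)."""
    n = len(samples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

def early_stopping_cycle(validation_errors, patience=50):
    """Return the cycle at which training stops: the first cycle after
    the validation error has risen for `patience` successive cycles,
    or the last cycle if that never happens."""
    rising = 0
    for cycle in range(1, len(validation_errors)):
        if validation_errors[cycle] > validation_errors[cycle - 1]:
            rising += 1
            if rising >= patience:
                return cycle
        else:
            rising = 0
    return len(validation_errors) - 1
```

Stopping on a sustained rise of the validation error, rather than on the first increase, prevents noise in the validation curve from terminating training prematurely.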

Table 1. Results of ANN singing voice quality category recognition
[%] amateur semi-professional professional
amateur 96.3 2.8 3.5
semi-professional 4.5 94.3 7
professional 3.5 1.1 89.5


Table 2. Results of ANN singing voice type category recognition
[%] bass baritone tenor alto mezzo soprano
bass 90.6 3.3 0 0 0 0
baritone 6.3 90 3.6 0 0 0
tenor 3.1 6.7 89.3 4 0 2.9
alto 0 0 7.1 80 0 0
mezzo 0 0 0 12 93.8 2.9
soprano 0 0 0 4 6.3 94.1

The classifier was tested on a total of 546 sounds in the voice quality category and 443 sounds in the voice type category. The average recognition rates amounted to 94.1% and 90%, respectively. Importantly, in most cases recognition errors occurred between 'neighboring' classes. For example, only 0.9% of professional voice samples were recognized as belonging to the amateur class, and no tenor samples were recognized as basses, mezzo-sopranos or baritones.

Experts' objectivity in vocal quality judgments

Automatic classification of singing sounds was only part of the work done. Further experiments proved that it is possible to train an automatic expert system to act similarly to human experts. The quality of each recorded vowel was judged by experts, who were singing teachers and professional vocalists. The sounds were assessed and assigned to one of 9 quality classes. The experts themselves were also checked as to the quality of their evaluation scores, by comparing the distribution of each expert's judgments with the average judgments of all 6 experts. To train an artificial classifier, the sound recordings of the 42 vocalists were randomly divided into training and testing sets, so that the testing set did not contain any recordings used in training. The average judgment of all 6 experts was used as the sound quality label. To validate the network, a so-called k-fold cross-validation method was employed, for which purpose 42 neural networks were trained. In each case, the network was trained using the sounds of 41 singers, while the sounds of the remaining singer were retained for testing. The recognition results were averaged over all 42 trained networks and compared with the experts' judgments using the Pearson criterion, which showed a strong similarity between them. The artificial expert system thus proved to be as efficient in judging voice quality as a professional singer or singing teacher.
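The leave-one-singer-out cross-validation and the Pearson comparison described above can be sketched as follows (illustrative code with hypothetical names, not the thesis implementation):

```python
import math

def leave_one_out_folds(singer_ids):
    """One fold per singer: train on all other singers' sounds and
    test on the held-out singer (42 networks for 42 singers)."""
    for test_id in singer_ids:
        yield [s for s in singer_ids if s != test_id], test_id

def pearson(x, y):
    """Pearson correlation coefficient between two score sequences,
    e.g. averaged network outputs vs. averaged expert judgments."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# 42 folds, each holding out exactly one singer for testing.
folds = list(leave_one_out_folds(range(42)))
```

Holding out all sounds of one singer per fold ensures that the network is never tested on a voice it has heard during training, which is the property the experiment relies on.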

Application in Matlab environment

In order to present the results, an application was created in the Matlab environment. It allows singing voice sounds to be analyzed. The analysis consists of pitch contour determination, power spectrum analysis and warped LPC analysis, and it enables automatic calculation of the parameters described in the Ph.D. thesis. The main view of the application interface is presented in Fig.2.


Fig.2 Application user interface

First, a singing voice sound file is opened for analysis in either a simplified or a detailed manner. The simplified analysis is faster but calculates only the static parameters, while the detailed option calculates all 331 parameters presented in the Ph.D. thesis.

Fig.3 Application user interface (a, b, c)

When the file is opened (Fig.3a) and the detailed analysis option is chosen, the sound parameter values are written to a text file (parameters.txt) and the dynamic analysis figures are presented (Fig.3b).

Fig.4 Dynamic analysis results (a, b)


When automatic classification is chosen, the analyzed sound is recognized by the two ANN networks trained and used in the experiments of the Ph.D. thesis. One of the networks performs automatic voice quality recognition, the second one voice type recognition. The recognition results are returned as the network outputs, and the 'winning neuron rule' is used as the classification criterion.
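The 'winning neuron rule' mentioned above simply takes the class whose output neuron has the highest activation. A minimal sketch (the output values shown are hypothetical, not taken from the application):

```python
def winning_neuron(outputs, class_labels):
    """Winning neuron rule: the class whose output neuron has the
    highest activation is the recognition result."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return class_labels[best]

quality_classes = ["amateur", "semi-professional", "professional"]
# Hypothetical activations of the three output neurons for one sound:
result = winning_neuron([0.08, 0.21, 0.71], quality_classes)
```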



Fig.5 Presentation of the ANN automatic recognition results

The application is written in the Matlab 7 environment. It requires the Matlab Neural Network Toolbox to perform automatic classification and the Matlab Data Acquisition Toolbox to allow real-time recording of audio files.


Demo download

The application demo has been prepared as a WMV movie file. To download the demo, use the following link: The demo movie


Bibliography

[1] BLOOTHOOFT G., The sound level of the singer's formant in professional singing, J. Acoust. Soc. Am., vol. 79, 2028-2032, 1986.
[2] CHILDERS D.G., LEE C.K., Vocal quality factors: analysis, synthesis, and perception, J. Acoust. Soc.Am., vol. 90, 2394-2410, 1991.
[3] CLEVELAND T., Acoustic properties of voice timbre types and their influence on voice classification, J. Acoust. Soc. Am., vol. 61, 1622-1629, 1977.
[4] DEJONCKERE P.H., OLEK M.P., Exactness of intervals in singing voice: A comparison between singing students and professional singers, Proc. 17th International Congress on Acoustics, vol. VIII, 120-121, Rome, 2001.
[5] DEMUTH H., BEALE M., Neural Network Toolbox for Matlab, The MathWorks Inc, USA 2001.
[6] DIAZ J.A., ROTHMAN H.B., Acoustic parameters for determining the differences between good and poor vibrato in singing, Proc. 17th International Congress on Acoustics, Rome, vol. VIII, 110-111, Rome, 2001.
[7] DZIUBIŃSKI M., KOSTEK B., Octave Error Immune and Instantaneous Pitch Detection Algorithm, Journal of New Music Research, vol. 34, 273-292, 2005.
[8] ESKENAZI L., CHILDERS D., HICKS D., Acoustic correlates of vocal quality, Journal of Speech and Hearing Research, vol. 33, 298-306, 1990.
[9] FRY D.B., Basis for the acoustical study of singing, J. Acoust. Soc. Am, vol. 28, 789-798, 1957.
[10] GERHARD D., Pitch extraction and Fundamental Frequency: History and Current Techniques, Technical Report TR-CS 2003-6, University of Regina Department of Computer Science, 2003.
[11] HÄRMÄ A., KARJALAINEN M., Matlab Toolbox for Warped DSP, http://www.acoustics.hut.fi/software/warp/, 2000.
[12] HERRERA P., BONADA J., Vibrato extraction and parameterization in the spectral modeling synthesis framework, Proc. COST G-6 Conference on Digital Audio Effects, Barcelona, Spain, 1998.
[13] ISHIZAKA K., FLANAGAN J. L., Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Tech. Journal, vol. 51, no. 6, 1233-1268, 1972.
[14] ISSHIKI N., Vocal intensity and air flow rate, Folia Phoniatrica, vol. 17, 92-104, 1965.
[15] JOLIVEAU E., SMITH J., WOLFE J., Vocal tract resonances in singing: the soprano voice, J. Acoust. Soc. America, 116, 2434-39, 2004.
[16] KIM, Y. E., Singing Voice Analysis, Synthesis, and Modeling, Handbook on Signal Processing for Acoustics, David Havelock, New York, Springer Verlag, 2005.
[17] KOSTEK B., CZYZEWSKI A., Representing Musical Instrument Sounds for Their Automatic Classification, J. Audio Eng. Soc., vol. 49, 768-785, 2001.
[18] KOSTEK B., ZWAN P., DZIUBIŃSKI M., Musical Sound Parameters Revisited, Proc. Music Acoustics Conference, 623-626, Stockholm, 2003.
[19] KOSTEK B., SZCZUKO P., ŻWAN P., DALKA P., Processing of Musical Data Employing Rough Sets and Artificial Neural Networks, Transactions on Rough Sets, Springer Verlag, Berlin, Heidelberg, New York, 112-133, 2005.
[20] KRUGER E., STRUBE H.W., Linear prediction on a warped frequency scale, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.36, no.9, 1529-1531, 1988.
[21] LAVER J., The phonetic description of voice quality, Cambridge University Press, Cambridge, 1980.
[22] MILLER D. G., Formant Tuning in a Professional Baritone, Journal of Voice, vol. 4, 231-237, 1990.
[23] RABINER L., On the use of autocorrelation analysis for pitch detection, IEEE Trans, ASSP, vol. 25, 24-33, 1977.
[24] ROTHENBERG M., Some relations between glottal air flow and vocal fold contact area. In National Institutes of Health, Proceedings of the Conference on the Assessment of Vocal Pathology, vol. 11, 88-96, 1979.
[25] ROTHMAN H.B., Why we don't like these singers, Proc. 17th International Congress on Acoustics, vol. VIII, 114-115, Rome, 2001.
[26] SCHUTTE H.K., MILLER D.G., Acoustic Details of Vibrato Cycle in Tenor High Notes, Journal of Voice, vol. 5, 217-231, 1990.
[27] SIKORA T., HYOUNG-GOOK K., MOREAU N, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] SUNDBERG J., The science of the singing voice, Northern Illinois University Press, Illinois, 1987.
[29] SUNDBERG J., Acoustics of the singing voice, Proc. 17th International Congress on Acoustics, Norway, 1995.
[30] TITZE I. R., The physics of small-amplitude oscillations of the vocal folds, J. Acoust. Soc. Am., vol. 83, 1536-1552, 1988.
[31] WOLF S.K., Quantitative studies on the singing voice, J. Acoust. Soc. Am., vol.6, 255-266, 1935.
[32] ZWAN P., Expert System for Automatic Classification and Quality Assessment of Singing Voices, Proc. 121st AES Convention, San Francisco, USA, 2006.
[33] ZWAN P., Expert system for objectivization of judgments of singing voices, Ph.D. Thesis (in Polish), Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdansk, Poland, 2007.
[34] www.ncvs.org - National Center for Voice and Speech, Colorado, singing voice tutorial.