Here is the PostScript for the entire document (180 pages). You can choose between the full 12 MB uncompressed file, or one of the compressed versions to suit your available decoding tools and network bandwidth. The file is PostScript Level 1 compatible, but certain pages may overload printers with less than about 8 MB of RAM. It has been tested with a QMS PS-1700, an HP LaserJet 4M, and the previewers Ghostscript and xpsview.
 2094234  Aug  9  1996  pdcasa.pdf    Acrobat PDF
12133055  May 14 13:06  pdcasa.ps     plain PostScript
 2778491  May 16 12:17  pdcasa.ps.Z   compressed with compress(1)
 2082358  May 16 12:16  pdcasa.ps.gz  compressed with gzip(1)
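If you have no gunzip or uncompress utility available, the .gz version can also be unpacked with a few lines of Python. This is only a convenience sketch; it assumes pdcasa.ps.gz has already been downloaded to the current directory.

    import gzip
    import shutil

    # Unpack pdcasa.ps.gz into the plain PostScript file pdcasa.ps.
    with gzip.open("pdcasa.ps.gz", "rb") as src, open("pdcasa.ps", "wb") as dst:
        shutil.copyfileobj(src, dst)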
If you have trouble with the electronic version and would like me to send you a paper copy, please email me at <dpwe@media.mit.edu>. If you try to print the PostScript and it fails on a specific printer, I would be grateful to know the details of that printer.
Submitted to the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering.
June 1996
The sound of a busy environment, such as a city street, gives rise to a perception of numerous distinct events in a human listener - the `auditory scene analysis' of the acoustic information. Recent advances in the understanding of this process from experimental psychoacoustics have led to several efforts to build a computer model capable of the same function. This work is known as `computational auditory scene analysis'.
The dominant approach to this problem has been to treat it as a sequence of modules, the output of one forming the input to the next. Sound is converted to its spectrum, cues are picked out, and representations of the cues are grouped into an abstract description of the initial input. This `data-driven' approach has some specific weaknesses in comparison to the auditory system: it will interpret a given sound in the same way regardless of its context, and it cannot `infer' the presence of a sound for which direct evidence is hidden by other components.
The `prediction-driven' approach is presented as an alternative, in which analysis is a process of reconciliation between the observed acoustic features and the predictions of an internal model of the sound-producing entities in the environment. In this way, predicted sound events will form part of the scene interpretation as long as they are consistent with the input sound, regardless of whether direct evidence is found. A blackboard-based implementation of this approach is described which analyzes dense, ambient sound examples into a vocabulary of noise clouds, transient clicks, and a correlogram-based representation of wide-band periodic energy called the weft.
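To make the reconciliation cycle concrete, the sketch below shows one highly simplified reading of such a loop in Python. It is not the thesis implementation: the single noise-cloud-like element type, the fixed threshold, and all the names are illustrative assumptions.

    import numpy as np

    def analyze(envelope, threshold=0.1):
        """Explain a time-frequency energy envelope, frame by frame.

        envelope: array of shape (frames, channels).
        Returns a list of hypothesized elements, each a dict holding an
        onset frame and a spectrum accounting for part of the input.
        """
        elements = []
        for t, observed in enumerate(envelope):
            # Prediction: the energy expected from all current elements.
            predicted = sum((e["spectrum"] for e in elements),
                            np.zeros_like(observed))
            # Reconciliation: energy the model fails to predict must be
            # explained by hypothesizing a new element; predicted events
            # persist as long as they stay consistent with the input.
            residual = observed - predicted
            if residual.max() > threshold:
                elements.append({"onset": t,
                                 "spectrum": np.maximum(residual, 0.0)})
        return elements

    # A steady noise band joined at frame 5 by a second, louder band:
    env = np.zeros((10, 4))
    env[:, 1] = 0.5
    env[5:, 2] = 0.8
    for e in analyze(env):
        print("element from frame", e["onset"], "spectrum", e["spectrum"])

In the actual system the predictions come from noise, click, and weft elements with their own parameter dynamics, and the reconciliation is managed by the blackboard engine rather than a fixed threshold.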
The system is assessed through experiments that firstly investigate subjects' perception of distinct events in ambient sound examples, and secondly collect quality judgments for sound events resynthesized by the system. Although the resyntheses were rated as far from perfect, there was good agreement between the events detected by the model and those reported by the listeners. In addition, the experimental procedure does not depend on special aspects of the algorithm (other than the generation of resyntheses), and is applicable to the assessment and comparison of other models of human auditory organization.
Thesis supervisor: Barry L. Vercoe
Title: Professor of Media Arts and Sciences
1 Introduction
  1.1 Auditory Scene Analysis for real scenes
  1.2 Modeling auditory organization - motivation and approach
  1.3 The prediction-driven model
  1.4 Applications
  1.5 Ideas to be investigated
  1.6 Specific goals
  1.7 Outline of this document
2 An overview of work in Computational Auditory Scene Analysis
  2.1 Introduction
    2.1.1 Scope
  2.2 Foundation: Auditory Scene Analysis
  2.3 Related work
    2.3.1 Sound models
    2.3.2 Music analysis
    2.3.3 Models of the cochlea and auditory periphery
    2.3.4 Speech processing and pre-processing
    2.3.5 Machine vision scene analysis systems
  2.4 The data-driven computational auditory scene analysis system
  2.5 A critique of data-driven systems
  2.6 Advances over the data-driven approach
    2.6.1 Weintraub's state-dependent model
    2.6.2 Blackboard systems
    2.6.3 The IPUS blackboard architecture
    2.6.4 Other innovations in control architectures
    2.6.5 Other `bottom-up' systems
    2.6.6 Alternate approaches to auditory information processing
    2.6.7 Neural network models
  2.7 Conclusions and challenges for the future
3 The prediction-driven approach
  3.1 Psychophysical motivation
  3.2 Central principles of the prediction-driven approach
  3.3 The prediction-driven architecture
  3.4 Discussion
  3.5 Conclusions
4 The implementation
  4.1 Implementation overview
    4.1.1 Main modules
    4.1.2 Overview of operation: prediction and reconciliation
  4.2 The front end
    4.2.1 Cochlear filterbank
    4.2.2 Time-frequency intensity envelope
    4.2.3 Correlogram
    4.2.4 Summary autocorrelation (periodogram)
    4.2.5 Other front-end processing
  4.3 Representational elements
    4.3.1 Noise clouds
    4.3.2 Transient (click) elements
    4.3.3 Weft (wideband periodic) elements
  4.4 The reconciliation engine
    4.4.1 The blackboard system
    4.4.2 Basic operation
    4.4.3 Differences from a traditional blackboard system
  4.5 Higher-level abstractions
5 Results and assessment
  5.1 Example analyses
    5.1.1 Bregman's alternating noise example
    5.1.2 A speech example
    5.1.3 Mixtures of voices
    5.1.4 Complex sound scenes: the "city-street ambience"
  5.2 Testing sound organization systems
    5.2.1 General considerations for assessment methods
    5.2.2 Design of the subjective listening tests
  5.3 Results of the listening tests
    5.3.1 The training trial
    5.3.2 The city-sound
    5.3.3 "Construction" sound example
    5.3.4 "Station" sound example
    5.3.5 The double-voice example
    5.3.6 Experiment part B: Rating of resyntheses
    5.3.7 Experiment part C: Ranking of resynthesis versions
  5.4 Summary of results
6 Summary and conclusions
  6.1 Summary
    6.1.1 What has been presented
    6.1.2 Future developments of the model
  6.2 Conclusions
    6.2.1 Reviewing the initial design choices
    6.2.2 Insights gained during the project
    6.2.3 A final comparison to real audition
  6.3 The future of Computational Auditory Scene Analysis
Appendix A: Derivation of the weft update equation
Appendix B: Sound examples
Appendix C: Computational environment
References