Computational auditory scene analysis

Bibliographic Details
Published in: Computer Speech & Language, 1994-10, Vol. 8 (4), p. 297-336
Main Authors: Brown, Guy J.; Cooke, Martin
Format: Article
Language: English
Summary: Although the ability of human listeners to perceptually segregate concurrent sounds is well documented in the literature, there have been few attempts to exploit this research in the design of computational systems for sound source segregation. In this paper, we present a segregation system that is consistent with psychological and physiological findings. The system is able to segregate speech from a variety of intrusive sounds, including other speech, with some success. The segregation system consists of four stages. Firstly, the auditory periphery is modelled by a bank of bandpass filters and a simulation of neuromechanical transduction by inner hair cells. In the second stage of the system, periodicities, frequency transitions, onsets and offsets in auditory nerve firing patterns are made explicit by separate auditory representations. These representations, termed auditory maps, are based on the known topographical organization of the higher auditory pathways. Information from the auditory maps is used to construct a symbolic description of the auditory scene. Specifically, the acoustic input is characterized as a collection of time-frequency elements, each of which describes the movement of a spectral peak in time and frequency. In the final stage of the system, a search strategy is employed that groups elements according to the similarity of their fundamental frequencies, onset times and offset times. Following the search, a waveform can be resynthesized from a group of elements so that segregation performance may be assessed by informal listening tests. The system has been evaluated using a database of voiced speech mixed with a variety of intrusive noises such as music, "office" noise and other speech. A technique for quantitative evaluation of the system is described, in which the signal-to-noise ratio (SNR) is compared before and after the segregation process. After segregation, an increase in SNR is obtained for each noise condition. Additionally, the performance of our system is significantly better than that of the frame-based segregation scheme described by Meddis and Hewitt (1992).
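
The first stage described in the summary (a bandpass filterbank followed by simulated hair-cell transduction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gammatone filter shape and ERB-rate channel spacing are assumptions typical of CASA front ends of this period, and the rectify-and-compress step is only a crude stand-in for a full inner-hair-cell model. All function names (`erb`, `gammatone_ir`, `peripheral_model`) are hypothetical.

import numpy as np
from scipy.signal import fftconvolve

def erb(f_hz):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of a gammatone filter centred at fc Hz."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth scaling for a 4th-order gammatone
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))  # unit-energy normalisation

def peripheral_model(x, fs, n_channels=32, f_lo=100.0, f_hi=4000.0):
    """Bandpass filterbank plus a crude stand-in for hair-cell transduction."""
    # Centre frequencies equally spaced on the ERB-rate scale.
    e = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    fcs = (10 ** (np.linspace(e(f_lo), e(f_hi), n_channels) / 21.4) - 1.0) * 1000.0 / 4.37
    rates = np.empty((n_channels, len(x)))
    for i, fc in enumerate(fcs):
        y = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        # Half-wave rectification and cube-root compression stand in for
        # neuromechanical transduction; the paper's hair-cell stage is richer.
        rates[i] = np.maximum(y, 0.0) ** (1.0 / 3.0)
    return fcs, rates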
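The second stage makes periodicities in the firing patterns explicit. A standard device for this in auditory modelling is a correlogram, a short-time autocorrelation of each channel's simulated firing rate; the sketch below assumes the `rates` array from the previous sketch and is illustrative only, with hypothetical window and lag settings.

import numpy as np

def correlogram_frame(rates, fs, start, win_s=0.020, max_lag_s=0.0125):
    """Short-time autocorrelation of each channel's firing rate beginning
    at sample index `start`. Peaks that align across channels at a common
    lag indicate a shared fundamental period."""
    n = int(win_s * fs)
    max_lag = int(max_lag_s * fs)
    assert start + n <= rates.shape[1], "window runs past the signal"
    acg = np.empty((rates.shape[0], max_lag))
    for i, ch in enumerate(rates[:, start:start + n]):
        full = np.correlate(ch, ch, mode="full")  # length 2n - 1
        acg[i] = full[n - 1:n - 1 + max_lag]      # lags 0 .. max_lag - 1
    return acg

# Hypothetical F0 estimate from the summary autocorrelation:
#   summary = acg.sum(axis=0)
#   lag = summary[20:].argmax() + 20   # skip very short lags
#   f0 = fs / lag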
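The final stage groups time-frequency elements by the similarity of their fundamental frequencies, onset times and offset times. The paper's actual search strategy is more elaborate; the following greedy, order-dependent pass merely illustrates similarity-based grouping. The `Element` fields, the tolerances, and the function name `group_elements` are all hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class Element:
    """One time-frequency element: a tracked spectral peak."""
    f0: float      # fundamental frequency attributed to the element (Hz)
    onset: float   # onset time (s)
    offset: float  # offset time (s)

def group_elements(elements: List[Element], f0_tol=10.0, t_tol=0.05):
    """Greedily assign each element to the first group whose mean F0 is
    close and whose mean onset or offset time is close; otherwise start
    a new group (i.e. a new putative source)."""
    groups: List[List[Element]] = []
    for e in sorted(elements, key=lambda el: el.onset):
        for g in groups:
            mean_f0 = sum(el.f0 for el in g) / len(g)
            mean_on = sum(el.onset for el in g) / len(g)
            mean_off = sum(el.offset for el in g) / len(g)
            if abs(e.f0 - mean_f0) <= f0_tol and (
                abs(e.onset - mean_on) <= t_tol or abs(e.offset - mean_off) <= t_tol
            ):
                g.append(e)
                break
        else:
            groups.append([e])  # no compatible group found
    return groups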
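The quantitative evaluation compares SNR before and after segregation. The dB computation itself is standard and shown below; since the summary does not spell out how the resynthesized output is split into target and residue, the after-segregation usage is given only as a hypothetical comment, with `resynth` an assumed helper.

import numpy as np

def snr_db(target, residue):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * np.log10(np.sum(np.square(target)) / np.sum(np.square(residue)))

# Hypothetical usage, given clean speech `s` and intrusion `n` (equal-length
# arrays) and a resynthesis function `resynth(group, x)` that filters a
# signal through a group's time-frequency elements:
#   snr_before = snr_db(s, n)
#   snr_after  = snr_db(resynth(group, s), resynth(group, n))
# A positive (snr_after - snr_before) indicates successful segregation.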
ISSN: 0885-2308 (print); 1095-8363 (electronic)
DOI: 10.1006/csla.1994.1016