Important Dates

  • April 1, 2012
    Full Paper Submission Deadline
  • June 8, 2012
    Notification of Paper Acceptance
  • June 16, 2012
    Grant Application Deadline
  • June 22, 2012
    Camera-ready Paper Due
  • June 30, 2012
    Early Registration Deadline
    Deadline for Presenters to Register
  • August 8, 2012
    Hotel and Standard Registration Deadline

Uncertainty Handling for Environment-Robust Speech Recognition

 

Abstract

In today's world, where mobile computing is more prevalent than at any time in history, automatic speech recognition (ASR) in environments with non-stationary noise remains a very challenging problem. The ubiquity of speech applications for hand-held devices, best exemplified by the recent success of the personal assistant Siri on the iPhone 4S, requires ASR systems to deal with a wide variety of acoustic environments. Furthermore, the short interaction times leave ASR systems very little information with which to adapt.

While speech enhancement is typically carried out in the short-time Fourier transform (STFT) domain, where speech corruption is easier to model, ASR typically operates on nonlinearly transformed features such as MFCC, which result in more compact features and models. In recent years, a breed of robust ASR methods has arisen that exploits the advantages of both the STFT and the nonlinear feature domains by employing the notion of uncertainty propagation/decoding. These techniques estimate an uncertain description of speech in the feature domain which accounts for the effect of the distortions in the STFT domain or the residual noise after speech enhancement. This uncertain description of the features is then used to dynamically compensate the ASR model and thus attain robust ASR with lower computational loads than classical model-based compensation. Furthermore, uncertainty propagation/decoding provides a formal framework allowing the incorporation of expertise from the speech enhancement field into robust ASR. The field is also currently expanding, with promising directions including model training with uncertain data and integration with multichannel and multi-modal algorithms.
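As a rough illustration of the dynamic compensation described above, the sketch below assumes that both the uncertain feature description and an acoustic-model state are diagonal-covariance Gaussians. Under that assumption, integrating the state likelihood over the feature uncertainty gives a Gaussian whose variance is the sum of the model variance and the feature variance. The function names are hypothetical and chosen for illustration; this is not code from the tutorial.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def uncertainty_decoding_loglik(feat_mean, feat_var, model_mean, model_var):
    """Log-likelihood of an uncertain feature under one Gaussian state.

    Assumption: the feature is described by N(feat_mean, feat_var) and the
    state by N(model_mean, model_var), both diagonal. Marginalizing over the
    feature then simply adds the two variances.
    """
    total_var = np.asarray(model_var, dtype=float) + np.asarray(feat_var, dtype=float)
    return log_gauss_diag(feat_mean, model_mean, total_var)

# A feature far from the state mean is penalized less when it is uncertain.
certain = uncertainty_decoding_loglik([3.0], [0.0], [0.0], [1.0])
uncertain = uncertainty_decoding_loglik([3.0], [4.0], [0.0], [1.0])
print(certain, uncertain)
```

With zero feature variance the expression reduces to ordinary decoding, so reliable frames are scored as usual while unreliable frames are scored against a broadened model.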

This tutorial will introduce the topic of uncertainty handling for robust ASR, review the latest trends and discuss future development directions. The tutorial will cover how an uncertain description of the speech features can be determined by exploiting STFT-domain information and how uncertainty can be integrated into an ASR model. Both feature-domain and STFT-domain methods to determine uncertainty will be considered. Regarding the feature domain, the latest developments around the ALGOQUIN model will be introduced. A general taxonomy of feature and model-domain approaches will also be provided. Regarding the STFT domain, the STFT-Uncertainty Propagation approach, integrating STFT speech enhancement and robust ASR, will be presented. Along with uncertainty decoding and propagation approaches, recent progress in extending the use of uncertainty for robust recognition to training with uncertain data will also be presented. Other novel approaches, such as improved Bayesian estimation of STFT uncertainties, will also be addressed. The tutorial will finish with an analysis of future perspectives.
 

Outline

Uncertainty Propagation/Decoding General Overview [R. F. Astudillo, 15min]

  • STFT-domain, log-feature-domain, acoustic modeling and decoding
  • Brief overview of uncertainty propagation/decoding and tutorial outline

Log-Feature and Model Domain Approaches to Uncertainty Handling in ASR [L. Deng, 1h]

  • ASR using a Bayesian decision rule with unreliable input features
  • Feature enhancement and uncertainty estimation, the ALGOQUIN model
  • ASR using a Bayesian decision rule with unreliable model parameters
  • Model and Feature Compensation: A taxonomy-oriented overview

Linear-STFT Domain Approaches to Uncertainty Handling in ASR [R. F. Astudillo, 1h]

  • Speech enhancement in the STFT domain and residual uncertainty
  • Approaches to uncertainty propagation, the complex Gaussian uncertainty model
  • Uncertainty propagation for various features including MFCC, RASTA-PLP and MLP
  • Integration of STFT speech enhancement and robust ASR
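The idea of propagating a complex Gaussian STFT uncertainty into a nonlinear feature domain can be sketched with plain Monte Carlo sampling. This is a generic stand-in chosen for brevity, not the propagation method presented in the tutorial. The sketch assumes each STFT bin is a circularly symmetric complex Gaussian whose mean is the enhanced estimate and whose variance is the residual uncertainty, and it uses the log-power spectrum as a simplified feature in place of a full MFCC chain. All function names are hypothetical.

```python
import numpy as np

def sample_complex_gaussian(mean, var, n_samples, rng):
    """Draw samples of circularly symmetric complex Gaussian STFT bins."""
    mean = np.asarray(mean, dtype=complex)
    # The variance is split equally between the real and imaginary parts.
    std = np.sqrt(np.asarray(var, dtype=float) / 2.0)
    shape = (n_samples,) + mean.shape
    noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    return mean + std * noise

def propagate_mc(mean, var, feature_fn, n_samples=20000, seed=0):
    """Estimate the feature-domain mean and variance by sampling."""
    rng = np.random.default_rng(seed)
    samples = sample_complex_gaussian(mean, var, n_samples, rng)
    feats = feature_fn(samples)
    return feats.mean(axis=0), feats.var(axis=0)

def log_power(stft):
    """Simplified nonlinear feature: log of the power spectrum."""
    return np.log(np.abs(stft) ** 2)

# Two STFT bins with different amounts of residual uncertainty.
stft_mean = np.array([1.0 + 1.0j, 0.5 - 0.2j])
stft_var = np.array([0.5, 0.05])
feat_mean, feat_var = propagate_mc(stft_mean, stft_var, log_power)
print(feat_mean, feat_var)
```

The resulting feature mean and variance form the kind of uncertain feature description that a decoder can then consume. Sampling is costly per frame, which is one reason closed-form or approximate propagation schemes are attractive in practice.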

Learning from Noisy Data [E. Vincent, 30min]

  • Bayesian uncertainty estimation for STFT-domain enhancement
  • Expectation maximization training of acoustic models with unreliable input features
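To make the notion of expectation maximization training with unreliable features concrete, the toy sketch below fits a single one-dimensional Gaussian to uncertain observations. It assumes each training feature is given as an estimate plus a known Gaussian uncertainty variance, which is a deliberate simplification of full acoustic-model training and not the algorithm presented in the tutorial. The function name is hypothetical.

```python
import numpy as np

def em_gaussian_uncertain(obs_mean, obs_var, n_iter=50):
    """Fit N(m, v) to features observed with per-sample Gaussian uncertainty.

    Assumption: sample n is an estimate obs_mean[n] of a clean feature,
    corrupted by zero-mean Gaussian error of known variance obs_var[n].
    """
    obs_mean = np.asarray(obs_mean, dtype=float)
    obs_var = np.asarray(obs_var, dtype=float)
    # Initialize from the raw estimates; the offset keeps the variance positive.
    m, v = obs_mean.mean(), obs_mean.var() + 1e-6
    for _ in range(n_iter):
        # E-step: posterior of each clean feature given the current model.
        gain = v / (v + obs_var)
        post_mean = m + gain * (obs_mean - m)
        post_var = gain * obs_var
        # M-step: re-estimate the model from the posterior moments.
        m = post_mean.mean()
        v = np.mean(post_var + (post_mean - m) ** 2)
    return m, v

# Ignoring the uncertainty would overestimate the variance of the clean data.
data = [1.0, 2.0, 3.0, 4.0]
print(em_gaussian_uncertain(data, [0.0] * 4))
print(em_gaussian_uncertain(data, [0.5] * 4, n_iter=200))
```

When all uncertainties are zero the procedure reduces to the usual sample mean and variance. With nonzero uncertainty, part of the observed spread is attributed to estimation error rather than to the clean features, so the learned variance shrinks.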

Wrap-up and perspectives [E. Vincent, 15min]

  • Exploitation of yet unused uncertainties
  • Integration with uncertainty in other modalities, e.g. video
  • Popularity and impact of uncertainty handling in the recent 2011 PASCAL CHiME Speech Separation and Recognition Challenge 

 

Short Biographies

Ramon Fernandez Astudillo

Spoken Language Laboratory, INESC-ID-Lisboa, Lisboa, Portugal
 
Ramon F. Astudillo obtained the industrial engineering degree with a specialization in electronics and automatic regulation at the Escuela Politecnica Superior de Ingenieria de Gijón (Spain) in 2005, completing the last two years of this degree with an Erasmus scholarship at the Technische Universität Berlin. In 2006 he worked as an intern at Peiker Acustic, researching model-based speech enhancement. In that same year he was awarded a La Caixa and German Academic Exchange Service (DAAD) scholarship for research towards the Ph.D. degree. He obtained the title with distinction from the Technische Universität Berlin in 2010 in the fields of speech processing and robust automatic speech recognition. Dr. Astudillo is currently a postdoctoral researcher at INESC-ID in Lisbon, working on both robust speech recognition and robust natural language processing for speech applications in a Bayesian setting. He is also an ISCA member and a reviewer for IEEE-TASLP/SPL as well as CSL.
 
 
Emmanuel Vincent
INRIA Rennes, 35042 Rennes cedex, France

Emmanuel Vincent is a Research Scientist with the French National Institute for Research in Computer Science and Control (INRIA, Rennes, France). He holds a PhD degree in signal processing from University Pierre et Marie Curie (Paris, France) and worked as a Research Assistant with the Centre for Digital Music at Queen Mary, University of London (London, U.K.) from 2004 to 2006. His research focuses on probabilistic machine learning for speech and audio signal processing, with application to real-world audio source localization and separation, noise-robust speech recognition and music information retrieval. He has authored more than 90 papers in these fields and currently serves as an Associate Editor for IEEE T-ASL and as a Guest Editor for the special issue of CSL on speech separation and recognition in multisource environments. He is also the Founding Chair of the annual Signal Separation Evaluation Campaign (SiSEC) and an organizer of the PASCAL ’CHiME’ Speech Separation and Recognition Challenge. His achievements have recently been honored by the 2012 SPIE ICA Unsupervised Learning Pioneer Award.

 
 
Li Deng 
Microsoft Research, Redmond, WA, USA
 

Li Deng joined the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, in 1989 as an Assistant Professor, where he became a Full Professor in 1996. In 1999, he joined Microsoft Research, Redmond, WA, where he is currently a Principal Researcher. Since 2000, he has also been an Affiliate Full Professor in the Department of Electrical Engineering, University of Washington, Seattle. Prior to Microsoft Research, he also worked or taught at the Massachusetts Institute of Technology (Cambridge, MA), ATR Interpreting Telecommunications Research Laboratories (Kyoto, Japan), Hong Kong University of Science and Technology, and Nortel (Canada). In the general areas of audio/speech/language processing, neural information processing, digital communication, and machine learning, he has published over 300 refereed papers and 3 books. He has been granted over 60 patents. He is a Fellow of the IEEE, the Acoustical Society of America and ISCA, and is an ISCA Distinguished Lecturer. He has received awards/honors bestowed by the IEEE, ISCA, ASA, Microsoft, and other organizations. He served on the Board of Governors of the IEEE Signal Processing Society (2008-2010). He served as Editor-in-Chief of the IEEE Signal Processing Magazine (2009-2011), which, according to the Thomson Reuters Journal Citation Reports released in June 2010 and 2011, ranked first in both years among all IEEE publications and all publications within the Electrical and Electronics Engineering category worldwide in terms of its impact factor. He currently serves as Editor-in-Chief of IEEE Trans. Audio, Speech, and Language Processing.

 

 
