Important Dates

  • April 1, 2012
    Full Paper Submission Deadline
  • June 8, 2012
    Notification of Paper Acceptance
  • June 16, 2012
    Grant Application Deadline
  • June 22, 2012
    Camera-ready Paper Due
  • June 30, 2012
    Early Registration Deadline
    Deadline for Presenters to Register
  • August 8, 2012
    Hotel and Standard Registration Deadline

Organizing Secretariat

Conference Solutions

Topic Models for Acoustic Processing of Speech

Overview

Topic models have recently been the center of a flurry of research activity and, in various forms, are the basis for many highly successful text and language tools (e.g. Google).  Even though this technology was initially developed for text applications and discrete data, it is not constrained to that domain; once extended to time series, it can be a formidable tool for many of the major speech signal problems, especially those involving mixtures.

 

Sounds, particularly speech, are typically characterized through spectro-temporal representations such as short-time Fourier transforms and Mel-spectral representations.  These representations naturally lend themselves to a histogram-based interpretation: the energy in any time-frequency bin of the signal is a scaled count of the quanta of energy at that frequency and time.  When abstracted, such a quanta-based representation becomes indistinguishable from the histogram-based characterizations that underlie topic models, and consequently much of the mathematical development behind topic models can also be employed to analyze and make useful inferences from the signals.  Under this model, various previously difficult problems such as signal de-noising, bandwidth expansion, analysis of mixed signals, signal prediction, signal tracking and de-reverberation can be cast as tractable inference of additive components.  This approach has become increasingly visible in the signal processing field and, to date, has contributed to solutions that are efficient and produce state-of-the-art results.
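To make the histogram interpretation concrete, the following is a minimal sketch (not the presenters' implementation) of a latent variable multinomial decomposition fitted by EM.  It assumes the asymmetric form P_t(f) = sum_z P_t(z) P(f|z), in which each frame t of a non-negative spectrogram-like matrix is treated as a histogram over frequency bins f.  The function names (`normalize`, `plsa`, `log_likelihood`) and the toy matrix `V` are illustrative assumptions, and plain Python lists are used to avoid dependencies.

```python
import math
import random


def normalize(values):
    """Scale non-negative numbers to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(V, n_topics, n_iter=100, seed=0):
    """Fit P_t(f) ~ sum_z P_t(z) P(f|z) to a non-negative matrix V[f][t] by EM."""
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    # basis[z][f] = P(f|z): one spectral "topic" per row.
    basis = [normalize([rng.random() + 0.1 for _ in range(n_bins)])
             for _ in range(n_topics)]
    # weights[t][z] = P_t(z): mixture weights of the topics in frame t.
    weights = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
               for _ in range(n_frames)]
    for _ in range(n_iter):
        new_basis = [[0.0] * n_bins for _ in range(n_topics)]
        new_weights = [[0.0] * n_topics for _ in range(n_frames)]
        for t in range(n_frames):
            for f in range(n_bins):
                if V[f][t] == 0:
                    continue
                # E-step: posterior P_t(z | f) for this time-frequency bin.
                post = normalize([weights[t][z] * basis[z][f]
                                  for z in range(n_topics)])
                # Accumulate the energy in the bin, split across topics.
                for z in range(n_topics):
                    share = V[f][t] * post[z]
                    new_basis[z][f] += share
                    new_weights[t][z] += share
        # M-step: renormalize the accumulated counts into distributions.
        basis = [normalize(row) for row in new_basis]
        weights = [normalize(row) for row in new_weights]
    return basis, weights


def log_likelihood(V, basis, weights):
    """Multinomial log-likelihood of V (up to a constant) under the model."""
    ll = 0.0
    for t in range(len(V[0])):
        for f in range(len(V)):
            if V[f][t] > 0:
                p = sum(weights[t][z] * basis[z][f] for z in range(len(basis)))
                ll += V[f][t] * math.log(max(p, 1e-300))
    return ll


# Toy "spectrogram": rows are frequency bins, columns are time frames.
# Frames 0 and 1 each contain one source; frame 2 is their sum.
V = [[4, 0, 4],
     [2, 0, 2],
     [0, 1, 1],
     [0, 3, 3]]
basis, weights = plsa(V, n_topics=2)
```

Each row of `basis` is a distribution over frequency bins and each row of `weights` is a distribution over topics, so the energy of a frame multiplied by `weights[t][z] * basis[z][f]` gives the additive contribution of topic z to bin f.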

 

In this tutorial we will describe how topic models and their signal-specific extensions can be used to analyze and process speech.  We will begin with the basics of latent variable multinomial decompositions, and work our way upwards through various higher-level models, their interpretations and extensions, and their relationship to other popular matrix decomposition techniques, to computer vision methods, and to the compressive sensing literature.  We will show how this field combines elements from machine learning and signal processing into hybrid approaches that yield novel solutions to some of the hardest problems in speech processing.  We will cover models that can be used very effectively for a large number of applications, including signal separation, signal de-noising, speech recognition, pitch tracking, de-reverberation, audio/visual object extraction, user-assisted audio selection and echo cancellation.

 

Because this is an emerging field that has not yet been exhaustively studied, we will have the opportunity to cover its theory from first principles.  As a result, the target audience of this tutorial can be very wide, and participants are not expected to have any prior experience in this area.  Even though our target participant is a signal-oriented researcher, this tutorial will also help machine-learning and text/language-processing researchers see how their expertise can be used for many speech processing problems.  We hope that this tutorial will help uncover some of the theoretical overlaps between these fields and foster cross-pollination between these two types of participants.

 

Outline

The Theory (Prof. Raj)

  • The Basics
    • Latent variable modeling
    • Multinomial decompositions
    • Topic models
  • From text to signals
    • Signal transforms as histograms
    • Bag-of-frequencies models
    • Bag-of-spectrogram models
    • Tensorial forms
  • Employing priors
    • The Dirichlet prior
    • The entropic prior
    • Other quadratic priors
    • Cross-entropic priors
  • Dynamic models
    • Embedding PLSA in HMMs
    • Latent temporal models
  • Topic modeling in context
    • Relationship to SVD, PCA and ICA
    • Relationship to NMF and NTF
    • Relationship to Compressive Sensing
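
As a brief illustration of the "Relationship to NMF" item, the following sketch shows why fitting the asymmetric multinomial model is equivalent to non-negative matrix factorization under the generalized Kullback-Leibler divergence.  The symbols V, W, H and N_t are notation assumed here for illustration and are not taken from the tutorial material.

```latex
% Model: each frame t is a histogram over frequency bins f
P_t(f) = \sum_z P_t(z)\, P(f \mid z), \qquad
\mathcal{L} = \sum_{f,t} V_{ft} \log P_t(f)

% Identify the factors, with N_t = \sum_f V_{ft} the energy of frame t
W_{fz} = P(f \mid z), \qquad H_{zt} = N_t\, P_t(z)
\quad\Longrightarrow\quad (WH)_{ft} = N_t\, P_t(f)

% Generalized KL divergence between V and WH
D(V \,\|\, WH) = \sum_{f,t} \left[ V_{ft} \log \frac{V_{ft}}{(WH)_{ft}}
                 - V_{ft} + (WH)_{ft} \right]

% Because \sum_f (WH)_{ft} = N_t = \sum_f V_{ft}, the linear terms cancel:
D(V \,\|\, WH) = \sum_{f,t} V_{ft} \log \frac{V_{ft}}{N_t} - \mathcal{L}
```

The first term on the last line does not depend on the model parameters, so maximizing the log-likelihood is the same as minimizing the divergence, given the column-normalized W and the frame-energy scaling of H assumed above.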

The Applications (Dr. Smaragdis)

  • Audio models
    • Low-rank dictionary learning
    • Overcomplete dictionaries
    • Convolutive dictionaries
  • Separation of monophonic audio mixtures
    • Additive properties of topic models
    • Separation of speech with known sound classes
    • Separation of speech with unknown classes
  • From separation to denoising
    • Denoising of known speakers
    • Denoising from known interference
  • Recognition
    • Recognizing sounds in mixtures
    • Temporal models
  • Temporal separation
    • Echo removal and dereverberation
    • Echo cancellation
  • Missing data
    • Recovering missing spectral data portions
    • Bandwidth expansion
    • Compression artifacts recovery
  • Pitch tracking
    • Pitch tracking of single sources
    • Pitch tracking of multiple mixed sources
  • User-interfaces
    • User-driven audio selection
  • Multimodality
    • Audio/visual analysis
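
The "Separation of speech with known sound classes" item can be illustrated with a minimal sketch, assuming the spectral bases P(f|z) of each source have already been learned from isolated training data.  Only the mixture weights of one frame are then estimated by EM, and the energy in each bin is divided between the sources in proportion to their modeled contributions.  The function names (`fit_weights`, `separate`) and the toy bases are hypothetical and serve only as an example, not as the method presented in the tutorial.

```python
def fit_weights(frame, bases, n_iter=100):
    """EM for the mixture weights P_t(z) of one frame, with bases P(f|z) fixed."""
    n_topics = len(bases)
    w = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        acc = [0.0] * n_topics
        for f, v in enumerate(frame):
            if v == 0:
                continue
            joint = [w[z] * bases[z][f] for z in range(n_topics)]
            total = sum(joint)
            if total == 0:
                continue  # no basis explains this bin; skip it
            for z in range(n_topics):
                acc[z] += v * joint[z] / total
        s = sum(acc)
        if s == 0:
            break
        w = [a / s for a in acc]
    return w


def separate(frame, bases_a, bases_b, n_iter=100):
    """Split one mixture frame into two source estimates using known bases."""
    bases = bases_a + bases_b
    w = fit_weights(frame, bases, n_iter)
    n_a = len(bases_a)
    est_a, est_b = [], []
    for f, v in enumerate(frame):
        part_a = sum(w[z] * bases[z][f] for z in range(n_a))
        part_b = sum(w[z] * bases[z][f] for z in range(n_a, len(bases)))
        total = part_a + part_b
        # Fraction of this bin attributed to source A (even split if unexplained).
        share_a = part_a / total if total > 0 else 0.5
        est_a.append(v * share_a)
        est_b.append(v * (1.0 - share_a))
    return est_a, est_b


# Hypothetical bases with non-overlapping spectra, and a mixture frame.
bases_a = [[2 / 3, 1 / 3, 0.0, 0.0]]
bases_b = [[0.0, 0.0, 0.25, 0.75]]
mixture = [4.0, 2.0, 1.0, 3.0]
est_a, est_b = separate(mixture, bases_a, bases_b)
```

With these non-overlapping toy bases, each bin is assigned entirely to one source.  When the bases overlap, the energy in a shared bin is split between the sources, and the two estimates still sum to the mixture by construction.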

Conclusions

  • Brief recapitulation
    • Current trends and ideas
    • The future of topic models in speech processing

 

Short Biographies

Dr. Bhiksha Raj,  Carnegie Mellon University's Language Technologies Institute, USA

Dr. Bhiksha Raj is an associate professor and non-tenured faculty chair at Carnegie Mellon University's Language Technologies Institute, and is also affiliated with the Electrical and Computer Engineering and Machine Learning departments at CMU.  Dr. Raj obtained his PhD from CMU in 2000 and was at Mitsubishi Electric Research Laboratories from 2001 to 2008.  Dr. Raj's chief research interests lie in robust automatic speech recognition, machine learning and associated topics.  Since 2005 he has also investigated topic models for signal processing, particularly in the context of modelling, enhancing and modifying speech signals, and has published several papers on the topic.

Contact information: 6705 Hillman Building, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, tel. +1-(412)-268-9826, email: bhiksha@cs.cmu.edu
 
Dr. Paris Smaragdis, University of Illinois at Urbana-Champaign, USA

Dr. Paris Smaragdis is an assistant professor of computer science and electrical and computer engineering at the University of Illinois at Urbana-Champaign.  Prior to that he was a senior research scientist at Adobe Systems Inc., and a research scientist with Mitsubishi Electric Research Labs (MERL).  Prof. Smaragdis obtained his Ph.D. at MIT in 2001, and was a postdoc there in 2002.  In 2006 Prof. Smaragdis' research accomplishments were recognized by the MIT Tech Review, which selected him as one of the top young innovators of the year.  Prof. Smaragdis' research interests are applications of machine learning to audio signal processing problems, machine listening, and computation and the arts.  Along with Prof. Raj, he has been very active in publishing on and popularizing topic models for signal processing applications, especially as they relate to common audio processing problems such as monophonic source separation, dereverberation, echo cancellation, music transcription and speech recognition.  Prof. Smaragdis is a senior member of the IEEE and a member of the MLSP-TC and AASP-TC.

Contact information: Siebel Center for Computer Science, 201 N. Goodwin Ave., Urbana, IL 61801, Office 3231.  Tel: +1-(217)-265-6893, email: paris@illinois.edu

 
