Topic Models for Acoustic Processing of Speech
Overview
Topic models have recently been at the center of a flurry of research activity and, in various forms, are the basis for many highly successful text and language tools (e.g. Google). Even though this technology was initially developed for text applications and discrete data, it is not constrained to that domain. Once extended to time series, it can be a formidable tool for dealing with many of the major speech signal problems, especially those involving mixtures.
Sounds, particularly speech, are typically characterized through spectro-temporal representations such as short-time Fourier transforms and Mel-spectral representations. These representations naturally lend themselves to a histogram-based interpretation: the energy in any time-frequency bin of the signal is a scaled count of the number of quanta of energy at that frequency at that time. When abstracted, such a quanta-based representation becomes indistinguishable from the histogram-based characterizations that underlie topic models. Consequently, much of the mathematical development that underlies topic models can also be employed to analyze, and make highly useful inferences from, the signals.
By employing this model, various previously difficult-to-handle problems such as signal de-noising, bandwidth expansion, analysis of mixed signals, signal prediction, signal tracking, and de-reverberation become tractable problems of inferring additive components. This approach has become increasingly visible in the signal processing field and, to date, has contributed to solutions that are very efficient and produce state-of-the-art results.
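To make the histogram interpretation concrete, the following is a minimal sketch of one latent variable multinomial decomposition, often called probabilistic latent component analysis (PLCA), fitted with EM. It is an illustration under stated assumptions, not the presenters' implementation: the function names `plca` and `kl_divergence`, the parameter names, and the synthetic "spectrogram" are all hypothetical. Each frame of a non-negative spectrogram V is treated as a histogram over frequency and modeled as P_t(f) = sum_z P(f|z) P_t(z).

```python
import numpy as np

def plca(V, n_components, n_iter=200, seed=0):
    """Fit P_t(f) = sum_z P(f|z) P_t(z) to a non-negative matrix V (freq x time).

    Hypothetical helper for illustration. Returns W (columns are P(f|z))
    and H (columns are P_t(z)).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    eps = 1e-12
    W = rng.random((n_freq, n_components))
    W /= W.sum(axis=0, keepdims=True)      # each column is a distribution over f
    H = rng.random((n_components, n_time))
    H /= H.sum(axis=0, keepdims=True)      # each column is a distribution over z
    for _ in range(n_iter):
        # E-step folded into the M-step: R holds V[f,t] / sum_z P(f|z) P_t(z)
        R = V / (W @ H + eps)
        W_new = W * (R @ H.T)              # sum_t V[f,t] P(z|f,t)
        H_new = H * (W.T @ R)              # sum_f V[f,t] P(z|f,t)
        W = W_new / (W_new.sum(axis=0, keepdims=True) + eps)
        H = H_new / (H_new.sum(axis=0, keepdims=True) + eps)
    return W, H

def kl_divergence(V, W, H):
    """Generalized KL divergence between V and the model scaled by frame energy."""
    eps = 1e-12
    V_hat = (W @ H) * V.sum(axis=0, keepdims=True)
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

# Synthetic example (an assumption, not real speech): two spectral shapes
# with disjoint frequency support, mixed with random per-frame weights.
rng = np.random.default_rng(1)
W_true = np.zeros((8, 2))
W_true[:4, 0] = [4.0, 3.0, 2.0, 1.0]
W_true[4:, 1] = [1.0, 2.0, 3.0, 4.0]
H_true = rng.random((2, 50)) + 0.1
V = W_true @ H_true

W1, H1 = plca(V, n_components=2, n_iter=1)
kl_before = kl_divergence(V, W1, H1)
W, H = plca(V, n_components=2, n_iter=200)
kl_after = kl_divergence(V, W, H)
print(kl_before, kl_after)
```

In this sketch the columns of W play the role of "topics" (spectral shapes) and the columns of H give the per-frame mixture weights, which is what makes additive mixtures of sources natural to express.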
In this tutorial we will describe how topic models and their signal-specific extensions can be used to analyze and process speech. We will begin with the basics of latent variable multinomial decompositions, and work our way upwards through various higher-level models, their interpretations and extensions, and their relationship to other popular matrix decomposition techniques, to computer vision methods, and to the compressive sensing literature. We will show how this field combines elements from machine learning and signal processing into hybrid approaches that yield novel solutions to some of the hardest problems in speech processing.
We will cover models that can be used very effectively for a large number of applications, including signal separation, signal de-noising, speech recognition, pitch tracking, de-reverberation, audio/visual object extraction, user-assisted audio selection, and echo cancellation.
Because this is an emerging field that has not yet been exhaustively studied, we will have the opportunity to cover its theory from first principles. As a result, the target audience of this tutorial can be very wide, and participants are not expected to have any prior experience in this area. Even though our target participant will be a signal-oriented researcher, this tutorial will also help machine-learning and text/language-processing researchers see how their expertise can be applied to many speech processing problems. We hope that this tutorial will help uncover some of the theoretical overlaps between these fields and foster cross-pollination between these two types of participants.
Outline
The Theory (Prof. Raj)
The Applications (Dr. Smaragdis)
Conclusions
Short Biographies
Dr. Bhiksha Raj, Carnegie Mellon University's Language Technologies Institute, USA
Dr. Bhiksha Raj is an associate professor and non-tenured faculty chair at Carnegie Mellon University's Language Technologies Institute, and is also affiliated with the Electrical and Computer Engineering and Machine Learning departments at CMU. Dr. Raj obtained his PhD from CMU in 2000 and was at Mitsubishi Electric Research Laboratories from 2001 to 2008. Dr. Raj's chief research interests lie in robust automatic speech recognition, machine learning and associated topics. Since 2005 he has also investigated topic models for signal processing, particularly in the context of modelling, enhancing and modifying speech signals, and has published several papers on the topic.
Contact information: 6705 Hillman Building, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, tel. +1-(412)-268-9826, email: bhiksha@cs.cmu.edu
Dr. Paris Smaragdis, University of Illinois at Urbana-Champaign, USA
Dr. Paris Smaragdis is an assistant professor of computer science and electrical and computer engineering at the University of Illinois at Urbana-Champaign. Prior to that he was a senior research scientist at Adobe Systems Inc., and a research scientist with Mitsubishi Electric Research Labs (MERL). Prof. Smaragdis obtained his Ph.D. at MIT in 2001, and was a postdoc there in 2002. In 2006 Prof. Smaragdis' research accomplishments were recognized by the MIT Tech Review, which selected him as one of the top young innovators of the year. Prof. Smaragdis' research interests are applications of machine learning to audio signal processing problems, machine listening, and computation and the arts. Along with Prof. Raj, he has been very active in publishing on and popularizing topic models for signal processing applications, especially as they relate to common audio processing problems such as monophonic source separation, dereverberation, echo cancellation, music transcription, and speech recognition. Prof. Smaragdis is a senior member of the IEEE and a member of the MLSP-TC and AASP-TC.
Contact information: Siebel Center for Computer Science, 201 N. Goodwin Ave. Urbana, IL 61801, Office 3231. Tel: +1-(217)-265-6893, email: paris@illinois.edu