This is my ICSI page; for the most part it has not been updated since 2008. From 2000 to 2008 I was a member of the ICSI Speech group, studying automatic speech recognition (ASR), that is, making computers turn speech into text. Although I have left ICSI, I will try to respond to queries about my work, so please feel free to contact me:
Multi-stream automatic speech recognition systems, which combine the decisions of an ensemble of classifiers, each with its own feature vector, are popular in the research literature. Past published work on feature selection for such systems has treated features in blocks. In my thesis, I performed feature selection at the level of individual features, using Ho's random subspace method and Tsymbal et al.'s hill-climbing method. The thesis can be downloaded here. For a shorter overview of my work, see the INTERSPEECH 2009 paper I co-authored, which can be downloaded from the ICSI Publications page. That paper also includes results from a tech report I wrote after the thesis.
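As an illustrative sketch only (not the exact procedure from the thesis or from Tsymbal et al.), hill-climbing feature selection over individual features can be written as a greedy loop that toggles one feature at a time. The `score` function here is a placeholder; in an ASR setting it would be something like held-out recognition accuracy for a classifier trained on the selected features:

```python
import random

def hill_climb_select(features, score, seed=None):
    """Greedy hill-climbing over individual features: starting from a
    random subset, repeatedly toggle any single feature whose inclusion
    or exclusion improves the score, until no toggle helps.

    `features` is a list of feature indices; `score` maps a frozenset of
    selected indices to a figure of merit (higher is better).
    """
    rng = random.Random(seed)
    current = frozenset(f for f in features if rng.random() < 0.5)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for f in features:
            # Toggle feature f in or out of the current subset.
            candidate = current - {f} if f in current else current | {f}
            s = score(candidate)
            if s > best:
                current, best = candidate, s
                improved = True
    return current, best
```

Each pass costs one score evaluation per feature, so with expensive scores (retraining a classifier) the number of individual features matters a great deal; this is one reason earlier work grouped features into blocks.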
During my thesis work, I created versions of the OGI ISOLET and OGI Numbers corpora that are degraded by background noise, using various noises and signal-to-noise ratios. Other researchers can exactly reproduce these noisy corpora given copies of the original corpora from OGI. I have also set up ASR systems for ISOLET and Numbers that are built from open-source components and are available to other researchers. See here for my ISOLET resources and here for my Numbers resources.
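The core operation in building such a noisy corpus is mixing each utterance with noise at a controlled signal-to-noise ratio. A minimal sketch (not the actual scripts used for the corpora above) in Python, where seeding the random generator is what makes the result exactly reproducible:

```python
import math
import random

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add `noise` to `speech` at a requested signal-to-noise ratio.

    The noise is looped to the speech length starting from a random
    offset, then scaled so that 10*log10(P_speech / P_noise) equals
    `snr_db`.  Passing a seeded `rng` makes the mix reproducible.
    """
    rng = rng or random.Random(0)
    start = rng.randrange(len(noise))
    looped = [noise[(start + i) % len(noise)] for i in range(len(speech))]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in looped) / len(looped)
    # Scale the noise so the power ratio hits the target SNR.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, looped)]
```

A real pipeline would also record the noise file, offset, and gain per utterance so that anyone with the original corpora can regenerate the degraded versions bit for bit.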
I have experimented with a mean subtraction algorithm for reverberation compensation that was developed as part of Carlos Avendano's thesis work at OGI. I redesigned the algorithm to produce time-domain output, making it much easier to integrate with existing ASR software. I then evaluated it using data from a corpus of spoken digit recordings collected (not by me) using tabletop microphones at ICSI. We published a paper on this at ASRU 2001. Please also see this page, which contains corrections, source code, audio files, and additional results that were not included in the paper, as well as a bibliography of related publications. We published further results at AVIOS 2002 and ICSLP 2002, and other research groups have published results since then using the time-domain output version of the algorithm.
I co-authored a EUROSPEECH 2003 paper with Docio and Morgan in which we compared the performance of different types of tabletop microphones and investigated the effectiveness of noise reduction.
Human speech recognition accuracy is often much higher than computer accuracy, even in tasks (like nonsense syllables) where semantic understanding plays no role. This has motivated work on building computer speech recognition systems whose signal processing is modeled on the human hearing system. I have co-authored papers on this topic with Werner Hemmert and others.
I helped Michael Kleinschmidt with his thesis work on the use of Gabor filters for speech recognition. The work has since been continued by Bernd Meyer and others. This page contains a bibliography, links to source code, and some information about unpublished results.
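For readers unfamiliar with them, a Gabor filter is a complex exponential under a Gaussian envelope; the spectro-temporal features used in this line of work are built from two-dimensional versions (the outer product of a spectral and a temporal filter of this form). A rough one-dimensional sketch, with illustrative parameters that are not Kleinschmidt's actual settings:

```python
import math

def gabor_filter(length, freq, sigma):
    """Sample a one-dimensional complex Gabor filter: a complex
    exponential at modulation frequency `freq` (cycles per sample)
    under a Gaussian envelope of width `sigma`, centered mid-filter.
    """
    c = (length - 1) / 2.0  # center of the Gaussian envelope
    return [
        math.exp(-((n - c) ** 2) / (2 * sigma ** 2))
        * complex(math.cos(2 * math.pi * freq * (n - c)),
                  math.sin(2 * math.pi * freq * (n - c)))
        for n in range(length)
    ]
```

Filtering a spectrogram patch with a bank of such filters at different modulation frequencies and widths yields features sensitive to spectro-temporal modulation patterns, which is the property that makes them interesting for speech recognition.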
Technologically, it is becoming increasingly simple to record and preserve the audio of meetings. Such recordings are more valuable if ASR-based speech indexing and search is possible, much as an archive of old email is more useful if it can be searched for particular keywords. There are also potential uses of ASR technology while a meeting is ongoing.
ICSI has been doing quite a bit of work in this area. My main contribution was to extend Transcriber to support multiple-talker transcription. The modified tool was used by a number of people in 2001. Transcriber and other tools have made progress since then, so I am not sure whether my version is still useful.
I also helped integrate noise reduction into ICSI's ASR system for meetings. The code can be found here. (I didn't write the core code; I just cleaned it up a bit and made it easier to use with meeting data.)
I have some ideas about adding reverberation to non-reverberant conversational speech training data (such as Switchboard), in order to increase the amount of reverberant training data available for meeting recognition. However, due to other priorities, I'm not working on this at the moment. If you are interested, please feel free to get in touch.
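The basic operation this idea rests on is convolving clean speech with a room impulse response (RIR). The toy direct convolution below is purely illustrative; a practical system would use FFT-based convolution and measured or simulated RIRs, and this sketch says nothing about which RIRs would suit meeting rooms:

```python
def add_reverb(dry, rir):
    """Convolve a dry signal with a room impulse response to simulate
    reverberation.  Output length is len(dry) + len(rir) - 1.  This is
    a direct O(N*M) loop; real use calls for FFT convolution.
    """
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, d in enumerate(dry):
        for j, h in enumerate(rir):
            out[i + j] += d * h
    return out
```

Applied to a corpus like Switchboard, each utterance would be convolved with an RIR drawn from some collection, turning non-reverberant conversational speech into additional reverberant training material.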