Sunday, February 13, 2011


The Microsoft Research Audio Video Indexing System (MAVIS) is a set of software components that use speech recognition technology to enable searching of digitized spoken content, whether it comes from meetings, conference calls, voice mails, presentations, online lectures, or Internet video.


As the role of multimedia continues to grow in the enterprise, in government, and on the Internet, the need for technologies that better enable discovery and search of such content becomes all the more important.

Microsoft Research has been working in the area of speech recognition for over two decades, and speech-recognition technology is integrated into a number of Microsoft products, such as Windows 7, Exchange 2010, and Office OneNote. Using the integrated speech-recognition technology in the Windows 7 operating system, users can dictate into applications like Microsoft Word or use speech to interact with Windows itself. Speech also powers directory services that mobile users can reach while on the go. Exchange 2010 now provides a rough transcript of incoming voicemails, and in Office OneNote users can search their speech recordings by keyword.

MAVIS adds to the list of Microsoft applications and services that use speech recognition. MAVIS is designed to enable searching of hundreds or even tens of thousands of hours of conversational speech with different speakers on different topics. The MAVIS UI, which is a set of ASPX pages, resembles a typical web-search UI, as illustrated below, but can be changed to suit different applications.

MAVIS comprises speech-recognition software components that run as a service on the Windows Azure Platform, full-text search components that run in SQL Server 2005/2008, sample ASPX pages for the UI, and client-side PowerShell and .NET tools. The MAVIS client-side tools make it easy to submit audio and video content to the speech-recognition application running in the Azure service using an RSS-formatted file, then retrieve the results and import them into SQL Server for full-text indexing, which enables the audio and video content to be searched just like other textual content.
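As a rough illustration of the submission step, the sketch below builds a minimal RSS 2.0 file describing media files to index. The exact feed schema MAVIS expects is not documented here, so the element names and the MIME type are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def build_submission_rss(channel_title, media_urls):
    """Build a minimal RSS 2.0 document listing media files to index.

    This only illustrates the general idea of describing audio/video
    items in an RSS feed -- one <item> with an <enclosure> per file;
    the real MAVIS feed schema may differ."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    for url in media_urls:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = url.rsplit("/", 1)[-1]
        # MIME type is assumed; a real feed would set it per file.
        ET.SubElement(item, "enclosure", url=url, type="audio/x-ms-wma")
    return ET.tostring(rss, encoding="unicode")

feed = build_submission_rss("Lecture archive",
                            ["http://example.com/talks/lecture01.wma"])
print(feed)
```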

MAVIS is currently a research project with a limited technical preview program. If you have deployed Microsoft SQL Server, have large speech archives, and are interested in the MAVIS technical preview program, contact us.

MAVIS Architecture

Speech-Recognition for Audio Indexing Backgrounder

There are two fundamentally different approaches to speech recognition for audio search: one referred to as phonetic indexing, and the other large-vocabulary continuous speech recognition (LVCSR).

  • Phonetic indexing is based on phonetic representations of the pronunciation of spoken terms and has no notion of words. It performs phonetic recognition during the indexing process; at search time, the query is translated into its phonetic spelling, which is then matched against the phonetic recognition result. Although this technique has the advantage of not depending on a preconfigured vocabulary, it is not appropriate for searching large audio archives of tens of thousands of hours because of the high probability of errors in phonetic recognition. It is, however, appropriate for relatively small amounts of audio, as might be the case when searching personal recordings of meetings or lectures. Microsoft has used this technique successfully to enable the “Audio Search” feature in Office OneNote 2007.
  • Large-vocabulary continuous speech recognition, or LVCSR, which is used in MAVIS, turns the audio signal into text using a preconfigured vocabulary and language model. The resulting text is then indexed using a text indexer. The LVCSR technique is appropriate for searching large audio archives, which can be tens of thousands of hours in length. The vocabulary can be configured to enable recognition of proper nouns such as names of people, places, or things.
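To make the contrast concrete, here is a toy sketch (the phone spellings and index contents are invented for illustration): phonetic search translates the query into its phonetic spelling and matches it against a stream of recognized phones, while LVCSR search looks the query up in a word index built from a fixed vocabulary.

```python
# Toy contrast between the two approaches on one utterance.
# The phone spellings below are invented for illustration only.

PHONE_DICT = {"seattle": "S IY AE T AX L", "meeting": "M IY T IH NG"}

# Phonetic indexing: the index stores recognized phones, not words.
phonetic_index = "DH AX S IY AE T AX L M IY T IH NG"

def phonetic_search(query):
    # Translate the query into its phonetic spelling, then match it
    # against the phone stream. No vocabulary is needed at indexing
    # time, but phone recognition itself is error-prone.
    return PHONE_DICT[query] in phonetic_index

# LVCSR: the index stores words drawn from a preconfigured vocabulary.
word_index = {"the", "seattle", "meeting"}

def lvcsr_search(query):
    return query in word_index

print(phonetic_search("seattle"), lvcsr_search("seattle"))  # True True
```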

Although LVCSR-based audio search systems can provide more accurate search results than phonetic-based systems, state-of-the-art LVCSR speech-recognition accuracy on conversational speech is still far from perfect. Researchers at MSR Asia have developed a more accurate technique called “Probabilistic Word-Lattice Indexing,” which takes into account how confident the recognizer is about each word, as well as which alternate recognition candidates were considered. It also preserves time stamps to allow direct navigation to keyword matches in the audio or video.

Probabilistic Word-Lattice Indexing

For conversational speech, typical speech recognizers can only achieve accuracy of about 60%. To improve the accuracy of speech search, Microsoft Research Asia developed a technique called “Probabilistic Word-Lattice Indexing,” which improves search accuracy in three ways:

  • Fewer false negatives Word lattices represent alternative recognition candidates that the recognizer also considered but that did not turn out to be the top-scoring candidate. Lattices make it possible to find (sub-)phrase and ‘AND’ matches where the individual words have low confidence, but the fact that they are queried together allows us to infer that they may still be correct.
  • Fewer false positives Lattices also provide a confidence score for each word match, which can be used to suppress low-confidence matches.
  • Time stamps Lattices, unlike plain text, retain the start times of spoken words, which is useful for navigation.

Word lattices accomplish this by representing the words that may have been spoken in a recording as a graph structure. Experiments show that indexing and searching this lattice structure instead of plain speech-to-text transcripts significantly improves document-retrieval accuracy for multi-word queries (30-60% for phrase queries, and over 200% for AND queries).
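A minimal sketch of the idea (the words, times, and confidences below are invented): each word alternative from the lattice is stored with its start time and posterior probability, and an AND query succeeds only if every query word clears a confidence threshold, with time stamps preserved for navigation.

```python
from collections import defaultdict

# A toy posting list built from a word lattice: each recognized word
# alternative is stored with its start time and confidence (posterior
# probability), rather than keeping only a single 1-best transcript.
lattice_index = defaultdict(list)
for word, start, conf in [
        ("audio",    1.2, 0.45),  # low-confidence alternative
        ("video",    1.2, 0.40),  # competing alternative, same slot
        ("indexing", 1.8, 0.90),
]:
    lattice_index[word].append((start, conf))

def and_match(query_words, threshold=0.3):
    """Return the best hit per word if every query word appears with
    confidence above the threshold (an AND query). Each hit keeps its
    time stamp for direct navigation into the recording."""
    hits = {}
    for w in query_words:
        postings = [p for p in lattice_index[w] if p[1] >= threshold]
        if not postings:
            return None  # one word missing -> no AND match
        hits[w] = max(postings, key=lambda p: p[1])
    return hits

# "audio" alone has low confidence, but querying it together with the
# high-confidence "indexing" lets us surface the match anyway.
print(and_match(["audio", "indexing"]))
```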

For more information about Microsoft Research Asia’s basic lattice method, see:

A challenge in implementing probabilistic word-lattice indexing is the size of the word lattices: raw lattices as obtained from the recognizer can contain hundreds of alternates for each spoken word. To address this challenge, MSR Asia has devised a technique referred to as Time-based Merging for Indexing, which brings the lattice size down to about 10× the size of a corresponding text-transcript index, orders of magnitude less than using raw lattices.
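The core idea, merging arcs that carry the same word at roughly the same time and summing their posteriors, can be sketched as follows. The bucketing scheme and numbers here are illustrative assumptions only; the published method differs in detail.

```python
def time_merge(arcs, tolerance=0.1):
    """Merge lattice arcs that carry the same word and start at
    approximately the same time, summing their posteriors.
    `arcs` is a list of (word, start_time, posterior) tuples."""
    merged = {}
    for word, start, posterior in sorted(arcs, key=lambda a: a[1]):
        # Bucket start times into windows of width `tolerance`, so
        # near-simultaneous duplicates of a word collapse to one entry.
        key = (word, round(start / tolerance))
        if key in merged:
            w, s, p = merged[key]
            merged[key] = (w, min(s, start), p + posterior)
        else:
            merged[key] = (word, start, posterior)
    return list(merged.values())

# Three low-posterior alternates of "lattice" around t=2.5s collapse
# into a single entry whose posterior is the sum of the alternates.
raw = [("lattice", 2.50, 0.30), ("lattice", 2.52, 0.25),
       ("lattice", 2.49, 0.20), ("indexing", 3.10, 0.85)]
print(time_merge(raw))
```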