Research Areas - Vision, Speech, and Natural Language Research in the Computer Science Department at Carnegie Mellon


CSD faculty: Jaime Carbonell, Mike Christel (ETC), Alexei Efros, Scott Fahlman, Carlos Guestrin, Alex Hauptmann, Takeo Kanade, Qifa Ke, John Lafferty, Tai-Sing Lee, Srinivas Narasimhan (RI), Raj Reddy, Alex Rudnicky, Eric Xing (ML/LTI)


Vision, Speech, and Natural Language are three core areas of Artificial Intelligence in which Carnegie Mellon Computer Science has had a continuing strong presence. Research in these areas has motivated and influenced research in other areas in CSD, including multi-processing, parallel processing, robotics, and learning. While the Robotics Institute (RI) and the Language Technologies Institute (LTI) were founded as focused research and educational organizations within SCS, CSD maintains sizable activities in Vision, Speech, and Natural Language.


1 Vision

Computer vision is about acquiring and interpreting the rich visual world around us. This is an exciting multi-disciplinary field of research with a wide spectrum of applications that can impact our daily lives. Today, cameras are ubiquitous, the amount of visual information (images and videos) generated is overwhelming, and automatic visual information processing has never been more important.

Although vision systems have so far enjoyed great success in controlled and structured environments, several substantial challenges must be overcome for these systems to be successful in unstructured, real-world situations. Since the inception of DARPA’s Image Understanding Program in the early 70s, Carnegie Mellon has been a recognized leader in computer vision research. There are seven CSD faculty members, and more than 20 across all of SCS, whose research spans various aspects of computer vision. Collectively, they conduct both basic theoretical investigations and end-to-end system building in a wide variety of areas. Some highlight areas are as follows:

Core Vision Capabilities. Since the 70s, the Carnegie Mellon vision group (Hebert (RI), Ikeuchi (currently Tokyo), Kanade, McKeown, Reddy, Shafer (currently Microsoft), and Witkin (currently Pixar)) has made fundamental contributions to core vision capabilities, including segmentation, motion tracking, structure from motion, color, stereo, and 3D modeling. The following three examples show that the tradition continues. In the Virtualized Reality project, Kanade pioneered the use of a very large number of cameras for modeling a dynamic scene (“3D Room”), now a popular practice, which led to the Matrix-like replay system “EyeVision” used in Super Bowl XXXV. Efros, a relatively new faculty member, is a leader of the “exemplar-based approach” to human action recognition, in which a collection of example videos is used for matching, rather than explicit modeling of motion. Another new faculty member, Narasimhan (with RI), is opening a new field of “bad weather vision,” in which physics-based modeling of imaging under fog, haze, and other bad weather conditions can turn those images into a source of new information, rather than a nuisance.

Cartography and Photo Interpretation. Automated analysis and interpretation of remotely sensed imagery, aerial or satellite, was one of the most important application areas of the DARPA Image Understanding Program, and its importance continues to grow. The Digital Mapping Laboratory group (Cochran, Harvey, McGlone, McKeown) has been the national focus of research in this area. They have focused on the use of knowledge-intensive techniques for the detailed analysis of remotely sensed imagery, with a broad scope of research ranging from low-level vision (e.g., stereo matching and scene registration), to systems for cartographic feature analysis (e.g., buildings, roads, airports), to information fusion using multiple cooperative methods, as well as a more recent emphasis on multi-spectral imagery and large-scale databases for advanced distributed simulation.

Biology-Motivated Vision. Understanding the computational principles and neural algorithms of adaptive biological visual systems is essential to building the next generation of intelligent machines, and it is a path to understanding ourselves. Lee and Lewicki (also with the Center for the Neural Basis of Cognition) are using probabilistic formulations to study various topics, such as the inference of 3D surface structures in the early visual areas, the analysis of motion in the dorsal visual stream, and sensory coding of natural images.

People Image Analysis. In recent years, there has been growing emphasis on interpreting images and videos of people and their activities for many applications, including surveillance and security, human-computer interfaces, and driver-assist technologies. CSD meets this challenge through strong research in face detection and recognition, facial expression analysis, gaze tracking, and marker-less human tracking (Kanade, with RI vision faculty (Baker, Matthews, Schneiderman)), and in recognition of daily human activities, such as eating and engaging in sports (Atkeson, Efros, Hodgins).

Convergence of Computer Vision and Graphics. Recently, the convergence of computer vision and graphics has presented new approaches and opportunities, including image-based rendering, and physics-based and data-driven modeling and animation. Building stronger ties between computer vision and graphics research in CSD has become a high priority. Two recent hires, Narasimhan and Efros, along with Hebert and Liu (RI), whose research interests are at the intersection of these two fields, put us in a unique and strong position in this new trend.

Collaborative Research and Opportunities in Vision. Traditionally, the CSD vision group has had strong collaborations with several research areas inside and outside of CSD, such as artificial intelligence, machine learning, medical imaging, computational neuroscience, mobile and field robotics, and the National Robotics Engineering Consortium. This collaborative spirit is unique among academic institutions and has resulted in many successful applications: mobile robot navigation, intelligent transportation systems, medical image understanding, Informedia, and vision-based human-computer interaction systems.

Computer vision in the 2000s is like computer graphics in the 1980s. As cameras are introduced everywhere from personal cell phones to regional surveillance systems, computer vision applications are about to explode. With strong faculty, proven research traditions, and a unique spirit of collaboration, the CSD vision group will be able to grasp the new opportunities and contribute to them.


2 Speech Recognition

Carnegie Mellon pioneered research in Speech Recognition in the 1970s, with the advent of the well-known DRAGON, HEARSAY, and HARPY speech recognizers. DRAGON was the foundation of Dragon Systems, Inc. and its well-known product NaturallySpeaking (R) in the 1990s. HEARSAY led to the invention of blackboard architectures for flexible control of multi-knowledge-source artificial intelligence systems. HARPY led to the invention of beam search for semi-optimization problems, which to this day remains a central tool in the AI repertoire and is used in virtually all modern speech recognition systems. The early Carnegie Mellon speech research was led by Allen Newell, Raj Reddy, Jim Baker, Bruce Lowerre, Rick Hayes-Roth, and Lee Erman.

In the 1980s, the Computer Science Department continued its research in speech recognition with the Angel System (Raj Reddy and Ron Cole), and later with the well-known Sphinx System (Kai-Fu Lee, Raj Reddy, Roni Rosenfeld, Xuedong Huang, et al.), which achieved top recognition accuracy in multiple DARPA and NIST evaluations. Sphinx served as the model prototype for Speech Recognition research in industry, including the efforts at Apple and Microsoft. Currently, Sphinx II is available as open source software for researchers worldwide, and has formed the core of dozens if not hundreds of research efforts.

In the 1990s, Speech Recognition research at Carnegie Mellon helped spawn the Language Technologies Institute (LTI), and it continues to this day as a strong collaborative effort between CSD and LTI. In addition to continued SPHINX innovations, the JANUS system for multi-lingual speech recognition was born (Alex Waibel, Tanja Schultz, Michael Finke, Monika Woszczyna, et al.), and it too pushed the state of the art, topping several NIST evaluations for accuracy. JANUS also served as the basis for the first-ever speech-to-speech machine translation system, and continues to date as the leading multilingual speech recognition project. Also beginning in the 1990s and continuing into the 2000s, Carnegie Mellon produced a top-performing spoken language system as part of the DARPA ATIS program (Ward, Reddy). Later work produced innovations in dialog systems (Rudnicky et al.) and in speech interfaces (Rosenfeld).

Research Thrusts. Current research in Speech Recognition is more tightly coupled with Natural Language Processing and Machine Translation, especially under the ambitious, newly launched DARPA GALE initiative, addressing challenges from the intelligence community as well as those from industry and NGOs. One interesting project produced the “Speechalator,” a PDA-based speech recognizer, machine-translation system, and text-to-speech synthesizer in the medical domain, targeted at worldwide relief operations in which relief workers need to communicate with the indigenous population about basic medicine. Such projects pose severe challenges not only in language technologies, but also in system design, integration, networking, user interfaces, and form-factor analysis.


3 Natural Language Processing

Initial work on Natural Language Processing (NLP) at Carnegie Mellon started in support of speech recognition, but quickly took on a life of its own with the arrival of Jaime Carbonell and Phil Hayes in 1979. Early efforts were focused on building robust NLP interfaces to databases and to interactive applications, which required parsing of imperfectly typed input (spelling errors, grammatical disfluencies, elided constituents, etc.). Several innovations in parsing, including Masaru Tomita’s generalized LR parsing algorithm, Carbonell and Hayes’s case-frame parsing, and Carbonell and Tomita’s unification-based parsing, put Carnegie Mellon on the map as one of the two or three top NLP research centers.

By the mid-1980s, the main focus of NLP activities shifted to Machine Translation (MT), spawning the Center for Machine Translation, which later expanded into the Language Technologies Institute. Whereas the focus of MT research is at the LTI, much of the algorithmic insight and many of the system-building methods come from CSD, with both departments collaborating closely, including cross-advised PhD students.

By the 1990s, Carnegie Mellon had become the leading academic institution in Machine Translation. Its contributions included the creation of interlingual knowledge-based MT, the most accurate method developed for restricted-domain MT, and its embodiment in the KANT system (Nyberg, Mitamura, Carbonell), whose most notable application was in translating Caterpillar literature (manuals, designs, notes, etc.) from English into various languages. The first speech-to-speech MT was demonstrated in the early 1990s (interlingual MT coupled with the JANUS system described above). MT research included English, French, Spanish, German, Japanese, Chinese, Arabic, Hindi, Korean, Italian, Russian, and several other languages.

Research Thrusts. Presently, most Carnegie Mellon MT efforts focus on open-domain MT, which requires statistical machine learning models and training from large-scale parallel (pre-translated) text. Example-Based MT (Brown, Carbonell) and Statistical MT (Vogel, et al.) have had a major impact on the field, with systems winning or placing high in multiple NIST MT evaluations. Additionally, some work focuses on transfer-rule learning for MT of minor or endangered languages (e.g., Mapudungun, Quechua), for which there are no significant parallel corpora for training EBMT or SMT systems.

Whereas MT remains the centerpiece of NLP research at Carnegie Mellon, other work focuses on the role of NLP in intelligent tutoring systems together with the Science of Learning Center (Koedinger, Penstein, Mitamura, Levin), and in interactive dialog systems (Rudnicky, Waibel).

Some projects transcend vision, speech, and NLP. For instance, the Informedia Project (Wactlar, Hauptmann) uses vision and face recognition, as well as closed captions and speech recognition, to index, retrieve, and summarize video archives.


