Research Areas - Machine Learning Interaction Research in the Computer Science Department at Carnegie Mellon
CSD faculty: Ziv Bar-Joseph (ML, Lane), Avrim Blum, Emma Brunskill, Jaime Carbonell, Alexei Efros, Scott Fahlman, Christos Faloutsos, Eugene Fink, Carlos Guestrin, John Lafferty, Chris Langmead, Roy Maxion, Tom Mitchell, Andrew Moore, Tuomas Sandholm, Manuela Veloso, Eric Xing (ML, LTI)
1 The machine learning problem
The scientific method can be described as the process of observing a phenomenon, forming a hypothesis, making predictions, and then iteratively revising the hypothesis before making further observations. The same general process may be used by people in the decision making steps required to carry out a multitude of tasks. The broad goal of machine learning is to automate this type of process, so that computer-automated predictions can make a task more efficient, accurate, or cost-effective than it would be using only human decision making.
Figure 1: The broad goal of machine learning is to automate the iterative process of observing data, formulating a model, and making predictions, so that the computer effectively “learns” to make more accurate predictions as more data become available.
At a high level, the core scientific problems in machine learning can be characterized by three inter-related challenges:
In sketching a picture of research in machine learning within the Computer Science Department, it will be convenient to refer to these basic challenges and discuss the different application areas that are the focus of current work in the department.
2 The growth of machine learning
Machine learning is a burgeoning field. Attendance has increased more than 50% during the last four years at the Neural Information Processing Systems (NIPS) conference, one of the premier conferences for machine learning research. The field is progressing rapidly, and an air of excitement among researchers is being created by the increasing scope of applications to which machine learning is relevant, and by the many technical advances that have been made in recent years. It is also an increasingly popular area for young people entering the field of computer science. Among the applicants to Carnegie Mellon’s Computer Science Ph.D. program in 2005, 22% indicated an interest in machine learning; 103 candidates applied this year to Carnegie Mellon’s new Ph.D. Program in Computational and Statistical Learning.
Carnegie Mellon is widely regarded as one of the world’s leading centers for machine learning research. The editorial boards of the top two journals in the field currently include seven Carnegie Mellon faculty members (Atkeson, Blum, Cohen, Lafferty, Mitchell, Moore, Touretzky) and machine learning conferences typically include many papers from Carnegie Mellon researchers. The scope of machine learning at Carnegie Mellon is broad, with significant research components not only in CSD, but also in RI, LTI, HCII, and ML. Carnegie Mellon has been a leading player since the beginning of the field; Pittsburgh was host to the first Machine Learning Workshop in 1980, and in the summer of 2006 Carnegie Mellon will host the International Conference on Machine Learning, which will provide an opportunity to showcase local research.
One reason machine learning is such a rapidly developing field lies in the fact that modern societies have entered the “data era”—an unprecedented investment is being made in the collection of data, with archives being formed on an enormous scale. Financial transactions and computer traces are now being logged, biological data are being collected using increasingly fast machines to scan genomes, hyperspectral satellite imagery is being stored on a massive scale, web documents are appearing at an explosive rate, and the list goes on. The development of effective ways of extracting useful information from these data stores is an overall challenge to computer science as a discipline, with the goal of providing a greater return on the huge investment that is being made in data. This goal drives much of machine learning research.
3 Research directions
Carnegie Mellon research in machine learning is being pursued in several different directions, from purely theoretical investigations to application-driven work in many domains. The following provides examples of significant efforts involving CSD faculty. We note, however, that the description is certainly not exhaustive, and that machine learning at Carnegie Mellon extends well outside the Computer Science Department. In many cases this work is indicative of trends in the larger field of AI, and has been sparked by work carried out at Carnegie Mellon.
One research trend in AI and machine learning addresses learning in games, where there are multiple learners with different interests. This work complements work in economics, which forms a longer tradition. Recent theoretical work of Tuomas Sandholm and his students shows how communication complexity, a fundamental notion from theoretical computer science, can be used to derive lower bounds on the required learning times to reach Nash equilibria or iterated dominance strategies for multi-agent systems. Preference elicitation and mechanism design in combinatorial auctions is also a topic of active interest (Blum, Sandholm).
Avrim Blum is a leader in the area of online learning, and has developed connections between fundamental problems in machine learning, online algorithms, and optimization. Examples of work in this direction include simple learning algorithms to solve basic optimization problems, online learning techniques to design adaptive data-structures and other online algorithms with improved guarantees, and the use of online learning for e-commerce applications.
Another trend is in the area of semi-supervised learning (Blum, Lafferty, Mitchell). In many traditional approaches to machine learning, a target function is estimated using labeled data, which can be thought of as examples given by a “teacher” to a “student.” Labeled examples are often, however, very time-consuming and expensive to obtain, as they require the efforts of human annotators, who must often be quite skilled, whereas unlabeled data are typically much easier to collect. To combine unlabeled data with labeled data, methods have been proposed to exploit the “manifold structure” of the data. In this line of work, Carnegie Mellon researchers have introduced an approach to semi-supervised learning that is based on a random field model defined on a weighted graph over the unlabeled and labeled data, where the weights are given in terms of a similarity function between instances. The optimal configuration is determined by linear programming, graph flow algorithms, or by computing harmonic functions on the data graph.
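As a rough illustration of the harmonic-function idea, the sketch below clamps the labeled nodes of a small weighted graph and repeatedly sets each unlabeled node to the weighted average of its neighbors. It is a minimal sketch under stated assumptions, not the researchers' implementation: the function name harmonic_labels, the dictionary-based graph format, the fixed iteration count, and the toy chain graph are all hypothetical choices made for this example.

```python
def harmonic_labels(weights, labeled, iterations=200):
    """Propagate labels over a weighted graph by repeated neighbor averaging.

    weights: dict mapping node -> dict of neighbor -> nonnegative similarity
    labeled: dict mapping node -> fixed label value in [0, 1]
    Returns a dict mapping every node to a score in [0, 1].
    """
    # Unlabeled nodes start at 0.5 (an arbitrary starting point).
    scores = {node: labeled.get(node, 0.5) for node in weights}
    for _ in range(iterations):
        for node, neighbors in weights.items():
            if node in labeled:
                continue  # labeled nodes stay clamped to their given value
            total = sum(neighbors.values())
            if total == 0:
                continue  # isolated node: nothing to average over
            scores[node] = sum(w * scores[nbr] for nbr, w in neighbors.items()) / total
    return scores


# Hypothetical toy data: a chain a - b - c - d with unit similarity weights,
# where only the two end points carry labels.
chain = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = harmonic_labels(chain, {"a": 1.0, "d": 0.0})
print(round(scores["b"], 3), round(scores["c"], 3))  # 0.667 0.333
```

On the chain, the unlabeled nodes end up interpolating between the two labeled end points, which is the behavior a harmonic function on the graph is expected to show.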
Astrostatistics is another new research direction in which Carnegie Mellon has become a leader. The field is an example of how new scientific challenges can arise from massive datasets, which can in turn drive novel machine learning research. In this case, the explosion of sky survey data has made nonparametric statistics and machine learning methods of great interest in astrophysics. The Pittsburgh Computational Astrostatistics group (PiCA, with the CSD component led by Andrew Moore) has developed efficient statistical algorithms for computing clusters, spatial correlation, and density estimates using large and high-dimensional data sets. The problems here include estimating the density of galaxies in a sky survey, or the power spectrum of temperature fluctuations in a cosmic microwave background survey.
Intrusion detection is the problem of monitoring networks to discover unauthorized usage or other suspect behavior. The problem is extremely difficult given only data in the form of computer traces, which have a natural sequential structure. Anomalous behavior can take many forms, including abuse of access by trusted users and automated attacks launched by outsiders; detection is further complicated by the multi-resolution nature of the data, where attacks can last from seconds to months. CSD faculty member Roy Maxion is a recognized leader in this area.
Structured prediction involves the annotation of data items having multiple components, with each component requiring a classification label. Such problems are challenging because the interaction between the components can be rich and complex. In text, speech, and image processing, for example, it is often useful to label individual words, sounds, or image patches with categories to enable higher level processing; but these labels can depend on one another in a highly complex manner. Conditional random fields and maximum margin Markov networks have been proposed by CSD faculty for modeling the interactions between labels in such problems using the tools of graphical models (Guestrin, Lafferty). These frameworks have been shown to give promising results in a number of domains where there is interaction between labels, including tagging, parsing and information extraction in natural language processing and the modeling of spatial dependencies in image processing.
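To make the dependence between labels concrete, the following sketch decodes the best label sequence for a chain of items, where each label has a per-position score and each adjacent pair of labels has an interaction score. This is the standard Viterbi dynamic program, of the kind commonly used for decoding in chain-structured models; it shows no training, and the function name viterbi, the label set, and all scores are hypothetical values chosen for illustration.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence for a chain.

    emission: list of dicts, emission[t][label] = score of label at position t
    transition: dict, transition[(prev, cur)] = score of an adjacent label pair
    """
    labels = list(emission[0])
    best = {lab: emission[0][lab] for lab in labels}  # best score ending in lab
    back = []  # back[t-1][cur] = best previous label for cur at position t
    for t in range(1, len(emission)):
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition[(p, cur)])
            new_best[cur] = best[prev] + transition[(prev, cur)] + emission[t][cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores: taken alone, position 1 slightly prefers "N",
# but the pairwise scores reward alternating labels.
emission = [{"N": 2.0, "V": 0.0}, {"N": 1.2, "V": 1.0}, {"N": 2.0, "V": 0.0}]
transition = {("N", "N"): -1.0, ("N", "V"): 1.0, ("V", "N"): 1.0, ("V", "V"): -1.0}
print(viterbi(emission, transition))  # ['N', 'V', 'N']
```

Labeling each position independently would give N, N, N here; accounting for the interaction between neighboring labels changes the middle label, which is the effect structured prediction methods are designed to capture.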
Tom Mitchell, one of the leaders in the machine learning community, has within the last several years invested in a major new research direction, centered on the problem of automatically mapping brain images onto cognitive models of the thought process. From labeled data collected by asking human subjects questions during fMRI scans, the scientific challenge is to solve the difficult inverse problem of inferring cognitive states from the raw fMRI data in new samples. The problem has a natural time series structure that is currently being modeled with hidden process models.
The department has recently hired four assistant professors whose research has major machine learning components. The work of Alyosha Efros (via Berkeley) is in the area of computer graphics and computer vision, using data-driven techniques for problems that are difficult to model parametrically. Work in his group is using large stores of visual information (digital photo albums, webcams, movies, etc.) for scene segmentation and recognition, image and video-based modeling and rendering, and other problems. Carlos Guestrin (via Stanford) is applying machine learning methods to wireless sensor networks, to develop efficient distributed algorithms for inference, learning and control. The challenges here include making the networks robust to losses and failures, and limiting communication and power requirements. The research of Ziv Bar-Joseph (via MIT) is in computational biology, in particular analyzing high-throughput biological datasets such as time series gene expression data and protein-DNA binding data. The work in his group has led to clustering algorithms to aid in high-level analysis of cell cycle networks. Chris Langmead (via Dartmouth) works in computational structural biology and systems biology, for example using machine learning methods to analyze NMR spectroscopic data in the design of efficient algorithms that can minimize wet-lab costs.
Eric Xing's research spans several areas in machine learning, statistics, molecular biology, and their intersections. The major theme of his current research is understanding and modeling the mechanism and evolution of living systems based on mathematical principles, and developing probabilistic inference and learning methods for both computational biology and generic intelligent systems applied to information retrieval and reasoning under uncertainty in open, dynamical environments.
4 Machine learning in large systems and fielded applications
Research at Carnegie Mellon is distinguished in its serious focus on applications and real systems. A notable example from machine learning is research from the Auton Lab (Moore), in collaboration with the University of Pittsburgh, which has led to the fielding of a system for early detection of disease outbreaks. The system monitors health care data for irregularities by comparing the distribution of recent data against a baseline distribution. Determining the baseline is difficult due to the presence of different trends in health care data, such as trends caused by the day of week and by seasonal variations in temperature and weather. The research led to Bayesian network methods to produce the baseline, which was part of a system called WSARE that has been run on actual Emergency Department data; this work has been supported by the DARPA BioALIRT program.
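The role of the baseline can be illustrated with a deliberately simplified stand-in: instead of the Bayesian network methods described above, the sketch below builds a separate baseline for each day of the week and flags recent counts that are unusually high for their weekday. It is not the WSARE method; the function name day_of_week_alerts, the threshold of three standard deviations, and the visit counts are hypothetical and chosen only to show why a single pooled baseline would be misleading.

```python
from statistics import mean, stdev


def day_of_week_alerts(history, recent, threshold=3.0):
    """Flag recent daily counts that are unusually high for their weekday.

    history: list of (weekday, count) pairs forming the baseline period
    recent: list of (weekday, count) pairs to check
    Returns the recent pairs whose count exceeds the weekday's baseline
    mean by more than `threshold` sample standard deviations.
    """
    by_day = {}
    for weekday, count in history:
        by_day.setdefault(weekday, []).append(count)
    alerts = []
    for weekday, count in recent:
        counts = by_day.get(weekday, [])
        if len(counts) < 2:
            continue  # not enough baseline data for this weekday
        mu, sigma = mean(counts), stdev(counts)
        if sigma > 0 and (count - mu) / sigma > threshold:
            alerts.append((weekday, count))
    return alerts


# Hypothetical visit counts: Mondays are routinely busier than Saturdays.
history = [("Mon", 40), ("Mon", 42), ("Mon", 38), ("Mon", 41),
           ("Sat", 20), ("Sat", 22), ("Sat", 19), ("Sat", 21)]
recent = [("Mon", 43), ("Sat", 35)]
print(day_of_week_alerts(history, recent))  # [('Sat', 35)]
```

A count of 35 would be unremarkable on a Monday in this toy data but stands out on a Saturday, which is why the baseline has to account for trends such as day of week before recent data are compared against it.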
Carnegie Mellon has received ongoing recognition for its robotic soccer research program, led by Manuela Veloso. The Carnegie Mellon team finished in fourth place in the Sony AIBO and small-size leagues in the recent RoboCup 2003 tournament. Robotic soccer provides a rich environment for machine learning that “improves with experience,” involving problem solving in complex domains with multiple agents, dynamic environments, the need for learning from feedback, real-time planning, and many other artificial intelligence issues.
Parallel to the RADAR project, the CALO project is a major effort funded by DARPA that involves several faculty in the department (led by Mitchell) in collaboration with machine learning researchers at a number of other universities. The project is centered around the development of personalized information agents that monitor a user’s work environment—email, computer files, presentations, calendar data, meeting transcripts, etc.—in order to assist the user in tasks such as scheduling meetings, managing email and “to do lists,” and otherwise helping the user to more effectively deal with the information overload that plagues so many busy people. An interesting aspect of the project is that DARPA has dictated that learning is to be a central focus. The system is required to “learn in the wild” in an on-going manner without direct intervention from the user, and also to be able to transfer learning between tasks and domains. The project is unprecedented in its scale and focus on machine learning in an application that has potential to have direct impact in both the workplace and the military.
5 Interface between computer science and statistics
A notable trend in machine learning research during the past several years is the increasing use of statistical methodology. Indeed, in the view of many researchers machine learning and statistics are one and the same discipline, with the former more focused on computational aspects, large data sets, and high dimensional data, while statistical theory has emerged as the most useful framework for developing and analyzing learning algorithms.
At Carnegie Mellon there has been significant interaction between computer science and statistics for many years, beginning at least in 1998 with the formation of the Center for Automated Learning and Discovery (CALD). This interaction between the disciplines is a key source of strength for machine learning at Carnegie Mellon, and there are now meaningful relationships between the departments. At other universities with strong computer science and statistics departments there appear to be much greater barriers to significant interaction. However, in the past two years several universities have hired junior faculty under joint appointments in computer science (or electrical and computer engineering) and statistics, and new joint Ph.D. programs are beginning to appear; this trend is expected to continue.