|
|
Statistical and probabilistic methods are beginning to pervade computer science, and they are already fundamental and essential in most areas of artificial intelligence and many other computational sciences, notably computational biology. The use of statistical and probabilistic methods in CS represents a paradigm shift that is "here to stay," but it is clear that many aspects of current approaches and methodology are inadequate, and that many important problems remain. The central focus of my research is machine learning—including algorithms, theory, and statistical methods for learning from data. The applications that I most often develop are in natural language processing and information retrieval. The broad direction of my research is the development of theoretically motivated methods for addressing practical problems, typically using ideas and techniques from statistics, probability, and information theory. Two representative research projects I am working on with colleagues and students are structured prediction and semisupervised learning. Structured Prediction: Sequential data--text, sound, event logs, biological sequences—arise naturally in a wide variety of scientific and engineering problems. More generally, many important types of data can be viewed as graphs connecting basic data elements--handwriting and other visual data, gene networks, and linked data structures like the Web. Important tasks involving structured data require the computation of a labeling for the nodes or the edges of the underlying graph. For example, part-of-speech tagging of natural language text can be seen as the labeling of nodes representing the successive words with linguistic labels. A good labeling will depend not just on individual nodes but also the contents and labels of nearby nodes, that is, the preceding and following words--thus, the labels are not independent. Recent work has led to significant advances in machine learning for structured data. This work, in both my research with collaborators and in work by other research groups, has begun to combine the strengths of graphical models with discriminative classification methods such as support vector machines (SVMs). We are beginning to see formal frameworks, mathematical guarantees in the form of convergence rates and generalization error bounds, and flexible methods involving kernels and feature combination methods that build upon similar advances made for standard independent classification tasks in the machine learning community over the past ten years. One framework that we have recently developed is based on conditionalrandom fields. A conditional random field (CRF) is a model thatassigns a joint probability distribution over labels that is globallyconditioned on the input, where the distribution respects theindependence relations encoded in a graph. In general, the labels arenot assumed to be independent, nor are the observations conditionallyindependent given the labels, as is assumed in generative models suchas hidden Markov models. The CRF framework has already been used toobtain promising results in a number of domains where there isinteraction between labels, including tagging, parsing and informationextraction in natural language processing, as well as the modeling ofspatial dependencies in image processing. Semisupervised Learning: Great strides have been made in the past ten years in machine learningmethods for large data sets. However, the primary advances have beenfor supervised classification or regression problems using algorithmictechniques. In this traditional approach to machine learning, a targetfunction is estimated using only labeled data, which can be thought ofas examples given by a "teacher" to the "student." Unfortunately,labeled examples are typically very time consuming and expensive toobtain, as they require the efforts of human annotators, who mustoften be quite skilled. As one example, protein shape classification,which stands out as one of the grand challenges of biological andcomputational science, requires months of expensive analysis by expertcrystallographers to obtain a single labeled example. The nextgeneration of machine learning algorithms should be able to learn notjust from explicit examples, but through a combination of labeled dataand large amounts of unlabeled data. In doing so, it should bepossible to achieve higher accuracies with less human effort. One of the emerging themes in recent research in semisupervisedlearning is that kernel methods, in particular those based ongraphical representations of unlabeled data, form a theoreticallyattractive and empirically promising set of techniques. The commonidea behind these methods is to express our prior beliefs about thecorrelations, or more generally, the similarities, between pairs ofpoints in data space in terms of a kernel function which is usuallythought of as implicitly mapping data to a high-dimensional Hilbertspace. We have recently been applying ideas from spectral graphtheory to develop families of kernels on graphs and other discretedata. This is important since graph-like structures occur in so manyguises, and the resulting kernels apply to the semisupervised learningproblem by viewing data points as nodes in a weighted graph. This line of work also connects to our work on structured predictionusing CRFs, as we have recently shown how CRFs can be extended to usekernels. As an example application, we have carried out experimentsusing kernel CRFs for protein secondary structure prediction, the taskof mapping primary sequences of amino acids onto a string of secondarystructure assignments, such as helix, sheet, or coil.
|
||||