|
|
Research Areas - Database Research in the Computer Science Department at Carnegie Mellon CSD faculty: Anastassia Ailamaki, Panos Chrysanthis (Pitt), Christos Faloutsos, Greg Ganger (ECE), Phil Gibbons (Intel), Mor Harchol-Balter, Alexandros Labrinidis (Pitt), Todd Mowry, Chris Olston, Howard Wactlar The main characteristic of our database group is its cross-disciplinarity, and the many collaborations, both within as well outside SCS. As a result, our group has been known for building bridges with other CS areas and other disciplines. The collaborations with computer architecture led to the creation of a whole new research sub-area, “cache-aware” database systems and the PAX storage method. Fractals and self-similarity from physics and geography led to the revolutionary, power-law discovery in computer network topology, as well as to more accurate modeling of disk traffic (the award-winning PQRS model below). Bridges with signal processing and statistics led to award-winning Fourier, wavelet and ICA methods for multimedia, sensor and video indexing. In detail, the database research has the following major focal points:
1 Database Engine and Performance The Carnegie Mellon Database Group has pioneered the area of architecture-conscious database systems. Spearheaded by Ailamaki, the work was the first to recognize the relationship between database software and the underlying hardware characteristics, opening a new field of research in the data management community. The work at Carnegie Mellon has attracted four best-paper awards at prestigious conferences, and has built bridges to other research groups such as the Parallel Data Lab (PDL and the Computer Architecture Lab (CALCM), and several scientific groups. The project has two major focal points: database architectures for new computers and devices, and database support for scientific applications. Next we elaborate on each of these directions.
Database architectures for new computers and devices Although database software architecture has fundamentally remained unchanged over the past two decades, hardware infrastructure has undergone significant changes. Hardware is becoming increasingly modular with chip multiprocessors, shared cache hierarchies with non-uniform memory access latencies, and simultaneous multithreading, whereas database software is monolithic; there is one database server handling all requests, and multifeaturism only makes the server larger. In addition, current database systems handle each request as a process or thread; context switching is done by the OS which does not exploit opportunities for data and instruction reuse across requests (as shown on the left of Figure 1 ). Ailamaki and her group have introduced StagedDB, a new database software architecture that allows for application-induced control of execution. Following the componentized programming paradigm, the database server is composed of multiple individual services, called stages. Each stage is responsible for one service and acts as an independent server, as shown on the right of Figure 1 . The benefits from using StagedDB include not only an order of magnitude throughput and response time improvements, but also easier maintainance, automatic configuration, and higher fault tolerance and expandability of the componentized system. Work along the same lines include instruction cache performance optimizations for on-line transaction processing applications, using STEPS, a novel technique that eliminates most instruction-related delays without increasing the size of the instruction cache. A related project is QPipe, a query execution engine that enables data and work sharing when executing multiple queries on the same database. Current plans include a hardware/software codesign to support such servers (in collaboration with Falsafi (ECE)). For data management software, the most important hardware changes involve the processor/memory speed gap, which keeps on increasing. As a result, memory systems become deeper every year using intermediate cache levels. In order to alleviate memory hierarchy related problems, Ailamaki pioneered the solution of “cache-aware” algorithms. The idea is to focus on data placement and algorithms for query processing, making them conscious of today’s computer microarchitectures. She invented PAX, a data placement technique thatseamlessly maximizes spatial locality in caches; the work received the prestigious Best Paper Award (VLDB 2001). In 2002, Ailamaki, Mowry and Gibbons started the Cache-Resident Data Bases (CRDB), a project that revisits database query algorithms and data structures for new processor and memory microarchitectures. The resulting algorithms achieved an order of magnitude improvements over the current state of the art and received a best paper award in ICDE 2004. Storage devices also play a critical role in performance. The Fates project (Ailamaki, Ganger) focuses exactly on the design and implementation of an automatically-tuned storage manager. Fates ensures optimal data layout at all levels of memory hierarchy (including disks or MEMS devices) and automatically tunes the database storage parameters without need for manual configuration (Best Paper Award in the VLDB PhD Workshop 2003). In addition, with Faloutsos and Brockwell (Statistics), we use fractals and machine learning techniques to predict storage device performance when no details about the device are known (”black box”). This work has received a Best Student Paper Award in Performance 2002. Finally, Ailamaki and Harchol-Balter are collaborating to improve performance by differential customer treatment. For example, an online shopping web site like Amazon would like to offer faster responses to certain ”big spending customers.” The project implements prioritization internally to the database and achieves an order of magnitude lower delays for high-priority customers, with very little penalty to low-priority customers, and no drop in throughput. Database support for scientific applications Scientific databases are particularly suited for the application of automated physical design techniques, because of their data volume and the complexity of the scientific workloads. Ailamaki leads the AutoPart project, that automatically designs appropriate schema for scientific data assuming prior knowledge of a representative workload. As mentioned in SCIENTIFIC-COMPUTING, we have applied AutoPart to real astronomy databases and we are now extending it to automate cache management in a federated virtual observatory database (joint project between Ailamaki and Burns, Thakar, and Szalay of Johns Hopkins University). Recently, we started collaborations with two other scientific groups, an environmental engineering group and an earthquake modeling and study group. The latter is led by O’Hallaron (winner of the 2004 Gordon Bell Supercomputing Award), Ailamaki and Ganger. The goal is to provide database indexing support for challenging earthquake studies using very large unstructured tetrahedral meshes, which may be amenable to better solutions than the traditional R-trees and z-ordering methods. See SCIENTIFIC-COMPUTING for more details. 2 Web Searching Olston leads the effort on the design of effective crawlers, i.e., ones that maintain the local index up-to-date. The results are elaborate algorithms to achieve the trade-off between speed and freshness, outperforming older algorithms. A related project focuses on creating a comprehensive historical archive of the evolving Web. The second major focus is on alternative user interfaces for Web search, with a project called WebTrails. The goal is to assist people with open-ended Web search tasks that are not amenable to one-stop keyword searching. An example is: “I want to find a suitable Thai curry recipe to cook tonight (I'll know ‘suitable' when I see it).” These types of search tasks are quite common, and require much user time and effort to zoom in on the related items (e.g., only recipes that are easy to make and not too spicy turn out to be suitable, and all else being equal I prefer to obtain recipes from sites I know and trust). WebTrails solves the problem, by highlighting links that it believes are most promising. Specifically, WebTrails obtains keywords from the user that express at least part of their goal (e.g., “Thai curry”) and returns matching URLs, but, in addition, WebTrails automatically highlights hyperlinks that lead to content matching the search keywords, serving to guide the exploratory process without severing the context of the user’s ongoing browsing session. The salient property of WebTrails is seamless integration keyword searching with exploratory browsing, two previously distinct modes of interaction that bring complementary strengths. Preliminary user-study results point to a two-fold improvement in search time compared with traditional searching and browsing interfaces, for certain search tasks. 3 Multimedia and Data Mining Fractals and Power Laws The following efforts span over a decade. Our goal is to find patterns in large datasets, like a collection of images or video clips, or one or more streams of data (e.g., temperatures from sensors), or patterns in a graph or social network. Typical queries are “find images that look like a sunset” or “learn what is the normal temperature pattern, and alert me when the temperature behaves abnormally”. What is unique to Carnegie Mellon is the use of fractals and power laws, to look for patterns and regularities that traditional methods will never find. For example, the usual assumptions of Poisson arrivals, the uniformity and independence assumptions, and the Gaussian assumption are often violated in real datasets. Multimedia, biological and spatial Databases (Faculty: Faloutsos, Wactlar, Ailamaki, O’Hallaron, Murphy) We have pioneered the GEMINI method for multimedia indexing, which allows fast similarity searching on large datasets. The idea is to extract a few features from each multimedia object (e.g., Fourier or Wavelet coefficients from time series, or color histograms, from images), and then use off-the-shelf database methods for organizing the resulting multi-dimensional points. The method is the typical approach used in multimedia systems, and received the best paper award in SIGMOD 94. Crucial for fast similarity search are the so called “spatial access methods”(SAMs). We have introduced several influential methods, like the R+trees, the Hilbert R-trees and the SR-trees. The first was honored with the “test of time” award in VLDB 1997; the second has inspired an indexing method for the award-winning QUAKE project for earthquake simulation (see SCIENTIFIC-COMPUTING). The last, SR-tree, is used for video retrieval in Informedia (see TECHNOLOGY-AND-SOCIETY). For multimedia and video indexing, we are bridging the gap between signaland video processing, and databases. We proposed using independent component analysis, an even more powerful method than principal component analysis and singular value decomposition, and we are achieving parameter-free segmentation, like foreground/background in still images and motion segmentation in motion capture databases. Our work received the best student paper award in PAKDD 2004. We have recently started working on biological image databases, with Murphy and Kovacevic (Biology). The goal is to use state of the art methods for similarity search in sub-cellular protein location images, as well as relevance feedback. Graphs, power laws and virus propagation (Faloutsos, Blelloch, Chenxi Wang (ECE)) What does a “natural” graph look like? How can we spot abnormalities in a socialor computer network? Contrary to unspoken assumptions of uniformity, we were the first to notice that the Internet topology obeys a Zipf-like distribution: some nodes have a high degree, while the vast majority of nodes are barely connected, with degree 1. Our findings appeared in SIGCOMM ’99, and have attracted significant interest since then. In his book Linked, Barabasi calls the paper “seminal;” citeseer ranks it in the top 10 most cited papers of 1999. Popular network generators and simulators (like INET, BRITE) are carefully designed to generate power-law graphs. With Blelloch and Kleinberg (visiting from Cornell), we are looking for additional such regularities, and we have developed a fractal-inspired method, RMAT, that naturally generates realistic graphs. In an award-winning paper (KDD’05), we reported that that (a) graph diameters shrink with time, and (b) the graphs densify over time at a surprising rate: , where and are the number of edges and nodes at time . Finally, we have discovered properties of virus propagation, and, specifically, that the extinction of a virus (technically, its “epidemic threshold”) depends only on the first eigenvalue of the adjacency matrix of the graph, a surprising result that includes many older efforts as special cases. Streams and sensors (Faloutsos, Olston, Gibbons, Brockwell (Statistics), Vanbriessen (EPP), Ailamaki) There are two goals in this project. The first is efficiency: How often should sensors transmit their information, so that the central site has an accurate enough picture? The tools include Lagrange multipliers to optimize the transmission rate and approximation methods, and involved strong collaboration with the IrisNet project at the Intel lablet. The second goal is to observe the typical traffic (e.g., temperature fluctuations), “learn” it, and thus spot outliers, and/or do forecasting. We are interested in single sequences, as well as multiple, co-evolving sequences. What makes the problem hard is that (a) each stream is semi-infinite, with no end in sight (b) the method has to work alone, with no human intervention and (c) the memory and CPU resources are limited. We used advanced signal processing methods (Wavelets), in conjunction with the so-called AutoRegression method, and derived AWSOM, a system that can spot arbitrary periodicities, automatically, in a streaming fashion. The paper was judged among the top 5 of VLDB’03, and was invited for fast-track journal publication. We are continuing the work, in collaboration with Gibbons, Brockwell (Stat) and Sakurai (NTT), on Berkeley Motes. One of the driving applications is water quality sensors, and specifically chlorine level in drinking water (Faloutsos, Vanbriessen and Ailamaki). Again, the goal is to study co-evolving streams of chlorine levels, and raise alerts whenever there are deviations from “’normal.” Another application is to study sequences of disk accesses, under the self-* project (Ganger and Ailamaki), that was mentioned earlier. The idea is to use fractals and self-similarity, to model bursty traffic on the disk. Our PQRS model was the first to capture all three characteristics of real traffic: burstiness on time, burstiness on space (track-id), and correlations/locality between space and time (best student paper award, PEVA 2002). 4 Strengths The main strengths of our database work are (a) the breadth and balance of topics and (b) the cross-disciplinarity. With respect to the former, we cover the full spectrum from database engine internals and performance; to data mining in streams, graphs, and multimedia; to web retrieval and interfaces. With respect to cross-disciplinarity, we have strong ties both within CS (architecture, storage, parallel and distributed systems, machine learning, video and language technologies) as well as outside CS (civil engineering, statistics, astrophysics, biology).
|
||||||