Research Areas - Computer Architecture Research in the Computer Science Department at Carnegie Mellon

 

CSD faculty: Randy Bryant, Kayvon Fatahalian, Seth Goldstein, James Hoe (ECE), Todd Mowry, Onur Mutlu (ECE), Dave O'Hallaron, Dan Siewiorek

1 Current Research in Computer Architecture

Computer architecture research seeks to improve existing computer systems as well as to develop new ones, in an effort to increase performance, improve reliability, or adapt to new computing environments. While the focus is on high-level hardware design, architects must be adept at the underlying technology, the system software (especially compilers and operating systems), and the properties of the key application programs. The Computer Architecture Laboratory at Carnegie Mellon (CALCM) brings together researchers interested in computer architecture to conduct interdisciplinary research across several of these connected areas.

Historically, CSD was a leader in developing major computer systems that embodied important advancements in computer architecture. These included C.mmp and Cm*, two of the original shared-memory multiprocessors, as well as Warp and iWarp, array processors targeted at what are now referred to as streaming applications. Each of these projects had a major impact on how computer systems are designed and constructed. C.mmp and Cm* demonstrated the potential for creating high-performance machines by assembling multiple commodity processors. This has become the standard method for building large-scale supercomputers. iWarp implemented a routing network in which a number of statically generated routing patterns could be programmed into the routing fabric. This idea ran counter to the dynamic routing schemes used in most processor interconnection networks, but it has reappeared in the scalar operand network developed by the MIT RAW project.

Our more recent efforts have involved system-building projects of more modest scales. Our development of wearable computers (Siewiorek) consisted of an integrated effort to build complete systems suitable for highly mobile use in environments where traditional I/O devices, such as keyboards and large displays, are not appropriate. This project addresses a wide range of issues in system design, from power management up to human factors concerns. We have developed over two dozen wearable computer systems, each addressing a different class of applications. These systems have been measured in the field to reduce task time by up to 70% over existing practices. Three of these systems have won awards in international design competitions, one system served with NATO troops in Bosnia, and technology developed in three systems formed the basis of a spin-off company whose software is used throughout the world to maintain US Navy F/A-18 fighter jets. In collaboration with three other research groups, the Carnegie Mellon wearable group spearheaded the formation of the International Symposium on Wearable Computers, whose first general chair was Dan Siewiorek, and the creation of the IEEE Computer Society Technical Committee on Wearable Information Systems with Dan Siewiorek as founding chair.

A second area has been to mitigate the effects of latency in the memory system, through improved cache design, improved memory management at the OS level, and better data structures and data structure layouts within applications, especially database programs (Mowry, Ailamaki (adjunct)). Addressing performance concerns at all levels (hardware, system software, application) in the system hierarchy is a hallmark of our systems research.
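
To make the application-level end of this work concrete, the sketch below contrasts an array-of-structures record layout with a structure-of-arrays (columnar) layout for a simple column scan, the kind of cache-conscious reorganization studied for database workloads. The record fields, sizes, and names are illustrative assumptions, not taken from any particular system we have built.

/* Hypothetical illustration: a column scan touches only the "price" field.
 * With an array-of-structures layout, every cache line fetched also carries
 * the unused fields; with a structure-of-arrays (columnar) layout, each
 * cache line is filled entirely with useful data, so far fewer lines miss. */
#include <stdio.h>

#define N 100000

/* Array-of-structures: one record per element, fields interleaved in memory. */
struct row { int id; double price; char pad[48]; };   /* 64 bytes per record */
static struct row table_aos[N];

/* Structure-of-arrays: each field stored contiguously. */
static double price_col[N];

double sum_aos(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)      /* roughly one cache line per record */
        s += table_aos[i].price;
    return s;
}

double sum_soa(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)      /* 8 doubles per 64-byte cache line */
        s += price_col[i];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_aos(), sum_soa());
    return 0;
}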

A final area is in adapting to fundamental changes in the underlying technology. In particular, circuit technology is due to change in fundamental ways within the next decade. Whether constructed via chemical self-assembly (nanotechnology) or by very fine etching of silicon (nanoscale CMOS), future circuits will consist of highly regular structures containing billions or trillions of low-quality (i.e., unreliable or highly variable) components with local interconnects. Conventional microprocessors, with their many specialized subsystems, their global communication, and their sequential operation, are not well matched to this technology. Instead, we are exploring the implementation of systems via programmable logic (Goldstein). This approach allows a high degree of physical regularity when encoding computations that are highly irregular. It also allows greater tolerance of unreliable components by routing around the faults.

 

 

2 Future Directions

A major challenge for computer architecture today is finding ways to exploit the essentially unbounded hardware resources that can be fabricated, especially when planning for systems that will be built 10-20 years from now. Even now, we can no longer improve microprocessor performance significantly by simply using more hardware to implement more exotic microarchitectures, the approach industry has pursued for the past 25 years.

Application-Specific Memory Performance Improvement (Ailamaki (adjunct), Falsafi (adjunct)). The ever-growing gap between memory system and processor performance underscores the need for improvements in the memory hierarchy. We believe that further progress requires adapting the memory hierarchy to the characteristics of particular applications.

STeMS: Spatio-Temporal Memory Streaming (Ailamaki (adjunct), Falsafi (adjunct)). The STeMS project targets novel memory systems that extract streams of data that are spatially correlated (i.e., references to data that are physically near one another) and/or temporally correlated (i.e., references that occur close together in time) and move them between processors and memory proactively to hide memory latency. Unlike conventional cache hierarchies, which move fixed-size blocks on demand, a streaming memory system gradually moves all correlated data just in time to hide the latency before the processor demands it. Our preliminary results indicate that memory streaming can eliminate 50%-70% of all hierarchy misses in commercial server workloads (e.g., OLTP, DSS, and Web). We are currently studying programming abstractions that allow software to specify stream information (when available) directly to hardware, obviating the need for hardware prediction.
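
The sketch below illustrates, in simplified software form, the temporal-correlation idea behind memory streaming: remember which miss followed each miss, and on a repeated miss walk that recorded chain ahead of the processor. The table size, the hash, the stream length, and the toy miss trace are all illustrative assumptions; the actual STeMS mechanisms are hardware predictors and are considerably more sophisticated.

/* Minimal software sketch of the temporal-correlation idea behind memory
 * streaming: remember which miss followed each miss last time, and on a
 * repeat miss, fetch the remembered successors ahead of the demand.
 * Table size, hash, and the toy miss trace are illustrative only. */
#include <stdio.h>

#define TABLE_SIZE 1024
#define STREAM_LEN 4

static unsigned long succ[TABLE_SIZE];   /* last observed successor of each miss */
static unsigned long prev_miss = 0;

static unsigned hash(unsigned long addr) { return (addr >> 6) % TABLE_SIZE; }

/* Called on each cache miss; returns how many prefetches it would issue. */
int on_miss(unsigned long addr) {
    int issued = 0;
    /* Learn: record that 'addr' followed the previous miss. */
    if (prev_miss) succ[hash(prev_miss)] = addr;
    prev_miss = addr;
    /* Predict: follow the recorded successor chain and "prefetch" it. */
    unsigned long next = succ[hash(addr)];
    for (int i = 0; i < STREAM_LEN && next; i++) {
        printf("prefetch 0x%lx\n", next);
        issued++;
        next = succ[hash(next)];
    }
    return issued;
}

int main(void) {
    /* A toy trace in which the same miss sequence repeats. */
    unsigned long trace[] = {0x1000, 0x8040, 0x2300, 0x9900,
                             0x1000, 0x8040, 0x2300, 0x9900};
    for (int i = 0; i < 8; i++) on_miss(trace[i]);
    return 0;
}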

Log-based Computer Architecture (Ailamaki (adjunct), Falsafi (adjunct), Mowry). Many applications, particularly in databases, could benefit from a complete logging of all transactions, but this would be prohibitively expensive when implemented in software. The ln Project, a joint project between Intel Research Pittsburgh and Carnegie Mellon, is investigating the use of execution logs in a computer architecture context. The idea is to create a log-based memory hierarchy to replace the traditional notion of memories, and then use the hardware-supported log structure towards two main goals. First, the log can be read by other threads or processors (nannies) to dynamically monitor the main processor’s execution (to improve performance, look for security problems, detect errors, etc.). Second, the log will be used to efficiently roll back execution in case a problem occurs. The project borrows high-level ideas from database and filesystem logging to solve problems that will exist in future multi-core architectures.
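
The sketch below conveys the flavor of log-based monitoring in plain software terms: the main computation appends each memory operation to a log, and a separate checker (a "nanny") later scans the log to flag loads from locations that were never written. The log format, the specific check, and all names are hypothetical; in the actual architecture, log capture and delivery would be supported in hardware.

/* Minimal sketch of the log-based monitoring idea (names and formats are
 * illustrative, not the project's actual interfaces): the main computation
 * appends each memory operation to a log, and a separate "nanny" pass scans
 * the log to flag loads from locations that were never written -- the kind
 * of check that is too expensive to perform inline in software. */
#include <stdio.h>

enum op { STORE, LOAD };
struct entry { enum op op; unsigned long addr; };

#define LOG_CAP 1024
static struct entry log_buf[LOG_CAP];
static int log_len = 0;

static void log_op(enum op op, unsigned long addr) {
    if (log_len < LOG_CAP) { log_buf[log_len].op = op; log_buf[log_len].addr = addr; log_len++; }
}

/* The nanny: replay the log and report loads with no preceding store. */
static void nanny_check(void) {
    unsigned long written[LOG_CAP]; int nw = 0;
    for (int i = 0; i < log_len; i++) {
        if (log_buf[i].op == STORE) { written[nw++] = log_buf[i].addr; continue; }
        int ok = 0;
        for (int j = 0; j < nw; j++) if (written[j] == log_buf[i].addr) ok = 1;
        if (!ok) printf("uninitialized load at 0x%lx (log entry %d)\n", log_buf[i].addr, i);
    }
}

int main(void) {
    log_op(STORE, 0x100);
    log_op(LOAD, 0x100);   /* fine: written before read */
    log_op(LOAD, 0x200);   /* flagged: never written   */
    nanny_check();
    return 0;
}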

TRUSS: Reliable, Scalable Server Architecture (Falsafi (adjunct), Hoe). Information processing and storage are becoming key pillars of a modern society's infrastructure. As such, server availability and reliability are increasingly critical aspects of computing. Unfortunately, the continued scaling of CMOS fabrication processes and circuits has resulted in an ever-growing reduction in chip reliability due to a number of error sources: (1) transient (soft) errors resulting from shrinking devices and their increased vulnerability to cosmic radiation, (2) increased device performance variability and the resulting gradual errors in computation, and (3) errors due to degradation and wear-out of devices. Consequently, while availability and reliability are becoming increasingly crucial, it is also ever more challenging to design, manufacture, and market reliable server platforms. To enable cost/performance scalability and to tolerate failures within and across chips, TRUSS integrates low-overhead distributed redundancy into distributed shared memory. TRUSS relies on lightweight error detection and recovery mechanisms to tolerate both permanent and (multi-bit) soft errors in processors, memory, and the interconnect. We recently proposed fingerprinting, which hashes machine state updates into small (e.g., 16-bit) signatures and allows efficient state comparison with near-perfect coverage between redundant processors distributed across the interconnect. Fingerprinting not only requires minimal logic per chip, but also bounds the detection and recovery window, thereby minimizing the recovery hardware. Variations of fingerprinting are already slated to appear in future commodity microprocessor products. We are also planning to build a proof-of-concept prototype of a four-node TRUSS system using a cluster of Berkeley's BEE2 prototyping boards.
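
The sketch below shows the fingerprinting idea in miniature: fold each architectural state update into a small CRC-16 signature and compare signatures between two redundant executions instead of shipping full state across the interconnect. The CRC polynomial, the update encoding, and the recovery action shown are illustrative assumptions rather than the actual TRUSS design.

/* Minimal sketch of the fingerprinting idea: fold each architectural state
 * update (e.g., a retired register or memory write) into a small CRC-16
 * signature, then compare signatures between two redundant executions
 * instead of shipping full state across the interconnect. The CRC choice
 * and the update format here are illustrative, not TRUSS's actual ones. */
#include <stdio.h>
#include <stdint.h>

static uint16_t crc16_update(uint16_t crc, uint8_t byte) {
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;   /* CRC-16/ARC */
    return crc;
}

/* Fold a 64-bit state update (address or value) into the running fingerprint. */
static uint16_t fingerprint_update(uint16_t fp, uint64_t update) {
    for (int i = 0; i < 8; i++)
        fp = crc16_update(fp, (uint8_t)(update >> (8 * i)));
    return fp;
}

int main(void) {
    uint64_t updates_a[] = {0x1000, 42, 0x1008, 7};   /* primary's retired writes  */
    uint64_t updates_b[] = {0x1000, 42, 0x1008, 9};   /* redundant copy, one upset */
    uint16_t fp_a = 0, fp_b = 0;
    for (int i = 0; i < 4; i++) {
        fp_a = fingerprint_update(fp_a, updates_a[i]);
        fp_b = fingerprint_update(fp_b, updates_b[i]);
    }
    if (fp_a != fp_b)
        printf("fingerprint mismatch (0x%04x vs 0x%04x): roll back to last checkpoint\n",
               (unsigned)fp_a, (unsigned)fp_b);
    return 0;
}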

SimFlex: Fast, Accurate, Flexible Simulation & Prototyping (Ailamaki (adjunct), Falsafi (adjunct), Hoe). Computer architects have long relied on software simulation to measure dynamic performance metrics (e.g., CPI) of a proposed design. Unfortunately, with the ever-growing size and complexity of modern microprocessor chips, detailed software simulators have become four or more orders of magnitude slower than their hardware counterparts. Full-system simulators that model all system effects (i.e., system software and peripherals) further increase the simulation overhead. For instance, measuring seconds of a TPC benchmark execution on a database system using a simulated multiprocessor server is simply infeasible because it can take years of simulation time. The SimFlex project targets fast, accurate, and flexible computer system evaluation. We have shown that a structurally decomposed simulation model enhances modeling flexibility and allows model accuracy to be traded off for speed on a per-component basis. We have showcased the use of rigorous statistical sampling as a technique that allows practical (e.g., hours) simulation turnaround when modeling the execution of commercial workloads on multiprocessors. Because the infrastructure is component-based with well-defined interfaces, we are also exploring accelerated prototyping of a subset of components with hardware description models running on FPGA boards that interface with the simulator. Our infrastructure is available for distribution in the architecture community and is already in use by many academic and industrial research groups.
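
The sketch below illustrates the statistical-sampling idea in its simplest form: rather than simulating a workload end to end in detail, measure CPI over many short, randomly chosen intervals and report a mean with a confidence interval. The sample values and the normal-approximation confidence interval are illustrative assumptions, not SimFlex's actual methodology.

/* Minimal sketch of sampling-based evaluation: estimate CPI from many short,
 * randomly chosen measurement intervals and report a confidence interval.
 * The sample values below are made up for illustration. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double cpi_samples[] = {2.10, 2.35, 1.98, 2.22, 2.41, 2.05, 2.17, 2.29,
                            2.12, 2.33, 2.01, 2.26, 2.38, 2.09, 2.19, 2.24};
    int n = sizeof(cpi_samples) / sizeof(cpi_samples[0]);

    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += cpi_samples[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (cpi_samples[i] - mean) * (cpi_samples[i] - mean);
    var /= (n - 1);

    /* 95% confidence interval assuming approximately normal sample means
     * (z = 1.96); a real study would use many more samples and check this. */
    double half_width = 1.96 * sqrt(var / n);
    printf("CPI = %.3f +/- %.3f (95%% confidence, %d samples)\n", mean, half_width, n);
    return 0;
}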

Wearable Computers (Siewiorek, Smailagic). Our hypothesis is that it is possible to combine portable, non-intrusive sensors with machine learning algorithms to infer user context, so that electronic devices such as cellphones and PDAs can be proactive and exhibit interaction behavior approaching that of an efficient, polite human assistant. We are employing unsupervised machine learning to combine real-time data from multiple sensors into an individualized behavior model. We are developing a novel prototype context-aware cellular telephone based on Intel's next-generation cellphone-capable platforms. The architecture must support varying security, privacy, and power-awareness strategies. We will evaluate the context-aware system by comparing results to user self-reports about disruptiveness and by conducting studies to characterize user performance improvements.
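
As a minimal illustration of the unsupervised-learning step, the sketch below clusters windows of sensor features with k-means so that recurring user contexts emerge without labels. The two features per window, the number of contexts, and the data values are purely illustrative assumptions, not the actual behavior model.

/* Minimal sketch of the unsupervised-learning step: cluster windows of
 * sensor features (here, two made-up features per window, e.g. motion
 * variance and ambient sound level) with k-means so that recurring user
 * contexts emerge without labels. Feature values and k are illustrative. */
#include <stdio.h>

#define N 8      /* sensor windows */
#define K 2      /* contexts to discover */
#define D 2      /* features per window */

int main(void) {
    double x[N][D] = {{0.10, 0.20}, {0.20, 0.10}, {0.15, 0.25}, {0.10, 0.15},
                      {0.90, 0.80}, {0.85, 0.90}, {0.95, 0.75}, {0.80, 0.85}};
    double c[K][D] = {{0.0, 0.0}, {1.0, 1.0}};   /* initial centroids */
    int label[N];

    for (int iter = 0; iter < 10; iter++) {
        /* Assignment step: attach each window to the nearest centroid. */
        for (int i = 0; i < N; i++) {
            double best = 1e9;
            for (int k = 0; k < K; k++) {
                double d = 0.0;
                for (int j = 0; j < D; j++) d += (x[i][j] - c[k][j]) * (x[i][j] - c[k][j]);
                if (d < best) { best = d; label[i] = k; }
            }
        }
        /* Update step: move each centroid to the mean of its windows. */
        for (int k = 0; k < K; k++) {
            double sum[D] = {0.0, 0.0}; int cnt = 0;
            for (int i = 0; i < N; i++) if (label[i] == k) {
                for (int j = 0; j < D; j++) sum[j] += x[i][j];
                cnt++;
            }
            if (cnt) for (int j = 0; j < D; j++) c[k][j] = sum[j] / cnt;
        }
    }
    for (int i = 0; i < N; i++) printf("window %d -> context %d\n", i, label[i]);
    return 0;
}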

Nanoscale Systems (Goldstein). Nanotechnology (or even future generations of CMOS technology) calls for a fundamental rethinking of how high-level programs are transformed into a form that can be “executed” by hardware. Traditionally, this has been done by compiling into a sequential machine language and then building hardware to interpret this language. Our current efforts attempt to perform a more direct mapping to programmable fabrics, yielding much higher degrees of parallelism and allowing for the high degrees of physical regularity and fault tolerance required by the technology.
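
The sketch below gives a toy picture of mapping a computation directly onto a regular fabric rather than compiling it into a sequential instruction stream: the operations of a small dataflow graph are placed onto an array of identical cells, and cells marked faulty are simply skipped, so the layout routes around defects. The fabric model, placement rule, and names are illustrative assumptions, far simpler than our actual mapping tools.

/* Toy sketch of direct mapping to a regular fabric: place the operations of
 * a small dataflow graph onto an array of identical cells, skipping cells
 * marked faulty so the layout routes around defects. The fabric model and
 * placement rule are illustrative only. */
#include <stdio.h>

#define FABRIC_CELLS 12

struct op { const char *name; };   /* a node of the dataflow graph */

int main(void) {
    /* Dataflow graph for (a + b) * (c - d), already in dependence order. */
    struct op graph[] = {{"load a"}, {"load b"}, {"add"},
                         {"load c"}, {"load d"}, {"sub"}, {"mul"}};
    int n_ops = sizeof(graph) / sizeof(graph[0]);

    /* Defect map discovered at configuration time (1 = unusable cell). */
    int faulty[FABRIC_CELLS] = {0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0};

    /* Place operations onto the next working cell, routing around faults. */
    int cell = 0;
    for (int i = 0; i < n_ops; i++) {
        while (cell < FABRIC_CELLS && faulty[cell]) cell++;
        if (cell >= FABRIC_CELLS) { printf("fabric too small\n"); return 1; }
        printf("cell %2d <- %s\n", cell, graph[i].name);
        cell++;
    }
    return 0;
}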

Such a mapping bears more resemblance to hardware synthesis than to traditional compilation. As a result, we feel it is important to automatically verify the validity of a mapping, and to be able to certify safety properties of the mapping. Toward this goal we are working with faculty involved in formal verification (Clarke) and certified code (Crary, Peter Lee).

Our research in nanoscale systems is at the top of a nanotechnology pyramid that includes collaborations with electrical engineering and chemistry, both at Carnegie Mellon and elsewhere (Penn State, HP Labs).

Radically New Systems (Mowry, Goldstein). The Claytronics project was initiated as an effort to imagine what could be done once a complete processing element could be shrunk down to a submillimeter scale. From this came the idea of programmable matter, consisting of millions of “catoms” (for “Claytronics atoms”), each with its own processor, wireless network interface, power source, actuators, and display. The actuators would provide adhesion and locomotion capabilities (e.g., through electrostatics), so that the catoms can move relative to one another, allowing the assembly to dynamically assume different shapes under the collective control of the processors. Programmable matter could be used to reconstruct three-dimensional objects at a remote location or to synthesize entirely new shapes and forms. The applications are nearly endless, including 3-D object rendering, electronic entertainment, dynamically configurable antennas, and even new ways to custom-furnish a building.

 


Figure 1: Prototype catoms for a two-dimensional, macro-scale version of Claytronics

 

From a research perspective, realizing programmable matter introduces a host of challenges. How can we construct the miniature electronics needed by catoms? What adhesion mechanisms should be used? How do catoms configure themselves as a network? How can catoms be located and identified? What programming notations are appropriate? How can distributed algorithms be designed that will achieve desired dynamic configurations? Figure 1 shows a pair of prototype catoms. These move in two dimensions and are several centimeters in diameter. The technology will become a compelling way to create programmable matter when we have 3-D catoms that are at most 1 mm in diameter.

Beyond the goal of building programmable matter, we see this effort as addressing several fundamental intellectual questions: Can we apply the principle of self-assembly at the system level? Can we develop a science of massively distributed programming that allows global behavior to be predicted from local computation? Can such a system derive the local computations needed to create a particular global effect?

 

3 Summary

Much academic research in computer architecture has become very incremental, seeking small (less than 10%) improvements in system performance (both speed and power) via improved cache designs, better branch prediction, and the like. As the descriptions above show, our research has a different style. We take on projects that address very long-term trends in technology and application needs, especially ones that draw together people from many different disciplines. We focus on high-risk/high-impact projects that build on the wide range of talent available at Carnegie Mellon.

 
