5th Master's Thesis Presentation - Yuchen Liang

— 4:30pm

Location:
In Person - ASA Conference Room, Gates Hillman 6115

Speaker:
YUCHEN LIANG , Master's Student
Computer Science Department
Carnegie Mellon University

https://liangyc.me/

Feedback-Driven Query Optimization: Design and Infrastructure

Query optimizers are critical components in database management systems (DBMSs) that turn a query that might otherwise take hours to run into one that completes in seconds. However, modern data stacks allow applications to generate data files (e.g., Parquet) outside the DBMS’s purview that lack statistical summaries. Without information about data distribution, optimizers fall back to ungrounded guesses when computing cardinality and cost estimates needed to select the best plan from an exponential number of candidates. 

This thesis investigates the design of an adaptive query optimization strategy that leverages query feedback to improve planning for future queries. We first study how an optimizer should memoize query plans during optimization so that it can associate past traces with future queries. To facilitate this investigation, we built a Cascades-style transformational search engine with a memo table that organizes equivalent sub-plans and detects duplicates. We measured the amount of reuse when maintaining the optimization state across queries. Our results showed that future queries can reuse the cached optimization information from previous queries when they exhibit similar subplan patterns. We then analyze runtime feedback collection methods, focusing on integration strategies and the resulting artifacts. 

Our study reveals a tension between the query coverage provided by runtime artifacts and the collection overhead they entail. Although most DBMSs can provide runtime row-count profiles at the operator level, the optimizer can only use this information for future queries that match the exact predicates. To address this limitation, we present a method that measures the true selectivity of each prefix in the conjunctive predicate during execution, which the optimizer can then leverage to improve planning for a broader range of queries. We implemented this method in PostgreSQL and measured the runtime performance overhead. Our experiments show that fine-grained instrumentation incurs a constant overhead regardless of the number of conjuncts. We conclude the thesis by discussing the design choices and trade-offs of integrating an extensible query optimizer service with existing DBMSs.

Thesis Committee
Andy Pavlo (Chair)
JIgnesh Patel

Additional Information 

For More Information:
amalloy@cs.cmu.edu


Add event to Google
Add event to iCal