Mahbod Afarin

Aug 8, 2024

Research

My research spans a diverse array of areas within computer science, encompassing hardware accelerators, graph processing, parallel algorithms, compiler optimizations, computer architecture, and GPU architecture. I focus on designing and optimizing systems to enhance computational efficiency and scalability, with particular attention to memory-intensive workloads such as graph analytics. My work involves leveraging hardware/software co-design approaches to accelerate graph processing, developing parallel algorithms to improve performance, and exploring compiler techniques to optimize code size and execution speed. Additionally, I investigate architectural innovations in GPUs and other computing platforms to push the boundaries of computational capabilities, ensuring that the systems I design meet the demands of modern applications. Here is the list of my research interests:

Hardware Accelerators
Graph Processing
Parallel Algorithms
Compiler Optimizations
Computer Architecture
GPU Architecture

1. Hardware Accelerators

Graph hardware accelerators are specialized processors or systems designed to optimize and accelerate the processing of graph-based computations. These computations are essential in various applications, such as social network analysis, recommendation systems, biological network analysis, and more. Traditional CPUs and GPUs, while powerful, are often not fully optimized for the irregular data structures and memory access patterns typical of graph processing. Graph hardware accelerators address these challenges by offering dedicated architectures that enhance memory bandwidth, reduce latency, and increase parallelism for graph traversal, node updates, and edge processing tasks. These accelerators can significantly boost the performance of graph analytics workloads, making them crucial for handling large-scale, dynamic graphs in real-time applications. We have designed and implemented two hardware accelerators for dynamic graph processing, streaming and evovling accelerators.

Streaming Graph Hardware Accelerator – JetStream is a hardware accelerator for evaluating queries over streaming graphs, capable of handling additions, deletions, and updates of edges. Streaming graphs are graphs that are constantly changing, i.e., new nodes and edges are added or removed over time. JetStream is an event-based accelerator for graph workloads designed to support streaming updates. It handles both accumulative and monotonic graph algorithms through an event-driven computation model that limits access to a smaller subset of graph vertices, efficiently reuses prior query results to eliminate redundancy, and optimizes the memory access pattern for enhanced memory bandwidth utilization. To the best of our knowledge, JetStream is the first graph accelerator to support streaming graphs, reducing computation time by 90% compared with cold-start computation using existing accelerators. Additionally, JetStream achieves approximately 18× speedup over the KickStarter and GraphBolt software frameworks at the large baseline batch sizes used by these systems, with significantly higher speedups at smaller batch sizes (Paper: JetStream).

Evolving Graph Hardware Accelerator – MEGA is a hardware accelerator designed for efficiently evaluating queries over evolving graphs. Evolving graphs are graphs that change over time. The problem is challenging because it requires the evaluation of a graph query on a sequence of graph snapshots over a time window, typically to track the progression of a property over time. MEGA leverages CommonGraph, a recently proposed software approach for incrementally processing evolving graphs that improves efficiency by avoiding the need to process expensive deletions, instead converting them into additions. MEGA supports incremental event-based streaming of edge additions and the concurrent execution of multiple snapshots to handle evolving graphs effectively. We propose Batch-Oriented Execution (BOE), a novel batch-update scheduling technique that activates snapshots sharing batches simultaneously to achieve both computational and data reuse. Additionally, we introduce optimizations that pack compatible batches together and pipeline batch processing for improved performance. To the best of our knowledge, MEGA is the first graph accelerator for evolving graphs that evaluates graph queries over multiple snapshots simultaneously. MEGA achieves speedups ranging from 24× to 120× over CommonGraph. It also outperforms JetStream, a state-of-the-art streaming graph accelerator, with speedups ranging from 4.08× to 5.98× (Paper: MEGA).

2. Graph Processing

This is the algorithmic perspective of my research, where we aim to improve the performance of graph processing for both static and dynamic graph processing, as well as enhance the performance of distributed graph processing. In static graph processing, the graph remains unchanged over time, while in evolving graph processing, the graph undergoes constant changes over a time window. Here is the list of projects in this area:

Static Graph Processing – When evaluating iterative graph queries over large graphs, systems face significant overheads due to repeated graph transfers across the memory hierarchy and redundant propagation of values over graph edges. A two-phase approach addresses these challenges by using a small proxy graph, called Core Graph (CG), alongside the large original graph. The first phase evaluates the query on the CG, incurring low overheads while producing mostly precise results, which are then used to bootstrap the second phase on the larger graph for fully precise results. Unlike prior methods that generate either large or imprecise proxy graphs, CG is both compact and highly accurate. It contains all vertices but only 10.7% of edges on average, yet achieves 94.5–99.9% precision for different queries. This is enabled by identifying a small subset of high-centrality edges critical for determining converged results across queries. Techniques for constructing effective CGs and optimizing the evaluation of remaining vertices yield substantial performance improvements, achieving speedups of up to 4.48× on GPU-based Subway, 13.62× on out-of-core GridGraph, and 9.31× on in-memory Ligra systems across various graph queries and datasets (Paper: Core Graph).

Dynamic (Evovling) Graph Processing – We address the challenge of graph analytics on evolving graphs, where a query must be applied to multiple snapshots over an extended time window. We propose CommonGraph, an efficient approach for processing queries on evolving graphs. Observing that edge deletions are significantly more expensive than additions, CommonGraph converts deletions into additions by identifying a common graph shared across all snapshots. Queries are first computed on this graph, and missing edges are incrementally added to reach specific snapshots, updating results as needed. CommonGraph further optimizes performance by sharing common additions among snapshots and eliminating the sequential dependency of traditional streaming approaches, enabling parallelism. Integrated into the KickStarter streaming framework, CommonGraph achieves 1.38× to 8.17× performance improvements across various benchmarks (Paper: Common Graph).

Distributed Graph Processing – Distributed graph analytics is widely used across various domains to analyze large real-world graphs. Many distributed frameworks have been developed to improve scalability by enabling the processing of massive graphs that cannot fit into a single machine’s memory. However, these frameworks often introduce message-passing overheads among multiple machines, leading to underutilized computing resources. To address this, we propose Expressway, a technique that identifies critical edges, or “Highways,” which significantly influence boundary vertex results. Expressway processes queries using only these Highways, drastically reducing the number of edges involved, allowing queries to be executed independently on each machine without message-passing overhead. The results from Highway-based queries are then used to initialize vertices, accelerating the convergence of graph algorithms. Our experiments demonstrate that integrating Expressway into state-of-the-art frameworks achieves up to a 4.08× speedup for single-query frameworks and up to a 4.04× speedup for batch-processing frameworks (Paper: Expressway).

3. Compiler Optimization

Code Size Reduction – This project is a collaborative effort with the compiler team at Google, initiated during my student researcher position at the company. The goal of the project is to reduce the code size of x86 and ARM binaries by folding identical basic blocks. Traditional code outlining techniques are limited to within a single function, but our new approach extends the scope across multiple functions, enabling the compiler to merge identical blocks even when they appear in different parts of a program. Additionally, traditional outliners have other limitations, which we aim to address through this project. The method leverages interprocedural analysis to identify identical sequences of instructions that perform the same computations. By folding these blocks, we achieve a reduction in code size without compromising performance, making it particularly beneficial for optimizing memory-constrained environments such as embedded systems and mobile platforms. This technique represents a significant evolution in identical code folding, enhancing the efficiency of code generation and aligning with LLVM’s objective of creating more space-efficient binaries. This is an ongoing project, and you will hear more about it soon.

4. GPU Architecture

Optimizing the Yield of SIMT Processor – As CMOS nanotechnology advances with smaller feature sizes and increased process variation, manufacturing anomalies in integrated circuits (ICs) are becoming more prevalent, particularly in the complex designs of special-purpose SIMT (Single Instruction, Multiple Threads) processors. Traditional methods like fault tolerance and redundancy are insufficient for significantly improving fabrication yield. This research proposes leveraging approximate computing, which exploits the error tolerance of applications like graphics and gaming, allowing the use of chips with minor defects. By demonstrating the effectiveness of this approach on SIMT processors, such as the Radeon HD 7970, the study shows that yield improvements of up to 0.17% are achievable, translating into substantial cost savings. Additionally, the introduction of low-area redundancy focused on critical instructions further enhances yield by 0.03%. Collectively, these techniques could save millions of dollars annually in SIMT processor manufacturing, offering a practical solution to the challenges posed by increasingly complex IC designs.