Jingwen's Homepage

Publications

  • System and Architecture for Deep Learning

[HPCA’25]
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

[HPCA’25]
MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type

[MICRO’22]
ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization
IEEE Micro Top Picks from Computer Architecture Conferences Honorable Mention

  • Resiliency and Efficiency

  • Cloud Computing

[ASPLOS’22]
Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services

[IPDPS’21]
AlphaR: Learning-Powered Resource Management for Irregular, Dynamic Microservice Graph

  • Misc.