Jung Ho Ahn

Accelerating Fully Homomorphic Encryption (FHE): Bridging the Gap Between Cryptography and Computer Architecture


My long-term research goal is to bridge the gap between the performance demands of emerging applications and the potential of massively parallel systems. For the last several years, one of my primary thrusts has been the acceleration of fully homomorphic encryption (FHE), with a specific focus on the Cheon-Kim-Kim-Song (CKKS) scheme. CKKS is uniquely suited for privacy-preserving machine learning because it natively supports approximate arithmetic on real and complex numbers.

However, FHE computations are notoriously slow: ciphertexts severely inflate data sizes, and simple arithmetic turns into complex modular polynomial operations. While application-specific integrated circuits (ASICs) may eventually dominate once FHE algorithms fully mature, I firmly believe that GPUs are highly compelling platforms today thanks to their programmability and massive parallel compute capabilities. My research traces a path from understanding fundamental bottlenecks on commercial hardware, to proposing custom ASIC architectures, to returning to co-design state-of-the-art GPU and memory systems, and ultimately repeating this cycle.

Phase 1: Demystifying FHE on Conventional Hardware (2020–2021)

My journey began with a rigorous, architecture-centric analysis of FHE workloads on existing CPUs and GPUs. In our IISWC 2020 paper, we tackled the Number Theoretic Transform (NTT), the most compute-intensive algorithm in FHE. We revealed that prior GPU implementations lacked a comprehensive analysis of the differences between the NTT and the standard Fast Fourier Transform (FFT), and we proposed optimizations such as on-the-fly generation of the roots of unity (the so-called twiddle factors) to maximize effective GPU memory bandwidth.
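To make the on-the-fly twiddle idea concrete, here is a minimal, illustrative radix-2 NTT in Python that computes each stage's root of unity with a modular exponentiation instead of streaming a precomputed table. The parameters (p = 257, n = 8, ω = 64) are toy values chosen only for this sketch; real FHE implementations use much larger NTT-friendly primes.

```python
def ntt(a, omega, p):
    """Iterative radix-2 Cooley-Tukey NTT over Z_p.

    Twiddle factors (powers of omega) are generated on the fly rather
    than loaded from a precomputed table, trading a little extra compute
    for reduced memory traffic -- the key trade-off on bandwidth-bound GPUs.
    """
    n = len(a)
    a = a[:]
    # Bit-reversal permutation (decimation in time).
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, p)  # on-the-fly root of unity
        for start in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * w % p
                a[start + k] = (u + v) % p
                a[start + k + length // 2] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    return a

def intt(a, omega, p):
    """Inverse NTT: forward NTT with omega^-1, then scale by n^-1."""
    n = len(a)
    inv_n = pow(n, -1, p)
    return [x * inv_n % p for x in ntt(a, pow(omega, -1, p), p)]
```

Exactly as with the FFT, pointwise multiplication in the NTT domain computes a cyclic convolution, but all arithmetic stays in exact modular integers, which is what FHE requires.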

Building on this, our 2021 IEEE Access paper provided an in-depth dissection of FHE multiplication (HMult). We parallelized its primary functions—such as NTT and Chinese Remainder Theorem (CRT) conversions—across multiple CPU cores and massive numbers of GPU threads, achieving speedups of 2.06× and 4.05×, respectively.
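The CRT-based parallelism above can be sketched in a few lines: a large coefficient is split into independent residues modulo small primes, each residue channel is processed on its own core or thread, and the result is reconstructed at the end. This is a minimal illustration with toy primes, not the library's actual code.

```python
from math import prod

def to_rns(x, primes):
    """Split x into its residue (RNS) representation: one independent
    channel per prime, which is what maps onto CPU cores / GPU threads."""
    return [x % q for q in primes]

def rns_mul(xs, ys, primes):
    """Multiplication is embarrassingly parallel across channels."""
    return [x * y % q for x, y, q in zip(xs, ys, primes)]

def from_rns(residues, primes):
    """Reconstruct x mod prod(primes) via the Chinese Remainder Theorem."""
    Q = prod(primes)
    x = 0
    for r, q in zip(residues, primes):
        Qi = Q // q
        x += r * Qi * pow(Qi, -1, q)
    return x % Q
```

Because each channel never needs the others during arithmetic, the RNS form removes all carry propagation between words, which is precisely what makes it so amenable to wide parallel hardware.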

However, the true bottleneck in FHE is bootstrapping, the process required to refresh ciphertext noise and enable unbounded computation. In our CHES 2021 paper (dubbed ‘100x’), we identified that bootstrapping is constrained by global memory bandwidth rather than by arithmetic throughput. By devising memory-centric optimizations, including extensive kernel fusion and careful selection of the decomposition number, we achieved over 100× faster bootstrapping on GPUs compared to a single-threaded CPU implementation.
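The intuition behind kernel fusion can be shown with a toy model: each unfused element-wise kernel must round-trip its intermediate result through global memory, while a fused kernel keeps it in registers. The hypothetical Python sketch below computes the same modular expression both ways and counts the extra array traffic the unfused version incurs for the intermediate alone.

```python
def unfused(a, b, c, q):
    """Two separate 'kernels': the intermediate t is written to and then
    re-read from (simulated) global memory."""
    t = [(x + y) % q for x, y in zip(a, b)]       # kernel 1: n writes of t
    out = [(x * z) % q for x, z in zip(t, c)]     # kernel 2: n reads of t
    extra_traffic = 2 * len(t)                    # elements moved just for t
    return out, extra_traffic

def fused(a, b, c, q):
    """One fused kernel: the intermediate never leaves registers."""
    return [((x + y) * z) % q for x, y, z in zip(a, b, c)], 0
```

In a bandwidth-bound regime, eliminating that intermediate traffic translates almost directly into speedup, which is why fusion was central to the 100× result.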

Phase 2: Proposing Custom ASIC Accelerators (2022–2024)

Insights from our GPU research made it clear that achieving further order-of-magnitude improvements would require custom hardware. Our ISCA 2022 paper, BTS, introduced an ASIC architecture designed specifically for bootstrapping. We balanced the design so that the massive computation of FHE operations overlaps with the latency of loading evaluation keys from memory.

Yet off-chip memory bandwidth remained a severe bottleneck because FHE generates massive amounts of single-use data. In our MICRO 2022 paper, ARK, we introduced an algorithm-architecture co-design that used runtime data generation and inter-operation key reuse to eliminate 88% of off-chip memory accesses.

We also challenged the conventional wisdom of using 64-bit word lengths in FHE hardware. In our ISCA 2023 paper, SHARP, we showed that a 36-bit word length provides sufficient precision for representative workloads while enabling a compact, hierarchical architecture, reducing chip area and power consumption by roughly half compared to prior monolithic ASICs.

Recognizing that large monolithic ASICs face severe manufacturing yield limits, our IEEE SEED 2024 paper, CiFHER, proposed a scalable multi-chip module (MCM) architecture. By using a resizable core structure and optimizing die-to-die communication, we demonstrated that chiplet-based designs can rival monolithic ASIC performance at a fraction of the cost.

Phase 3: Algorithm-Architecture Co-Design for Private AI (2023–2024)

To evaluate these systems, we deployed convolutional neural networks (CNNs) over FHE. In our 2024 IEEE Access paper, HyPHEN, we observed that the mismatch between 2D CNN feature maps and the 1D slot structure of ciphertexts caused excessive homomorphic rotations. We introduced a hybrid packing method and polynomial activation functions to drastically cut the memory footprint and rotation overhead. We advanced this further in our CCS 2024 paper, NeuJeans, by proposing Coefficients-in-Slot (CinS) encoding, which bypasses costly slot permutations and fuses convolutions directly with bootstrapping, enabling end-to-end ImageNet-scale CNN inference in seconds.
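The rotation overhead attacked here comes from how convolution maps onto packed ciphertext slots: in the common baseline, each kernel tap costs one homomorphic rotation plus a plaintext multiply-accumulate. A plaintext Python sketch of that mapping (no encryption, toy sizes):

```python
def rotate(v, k):
    """Cyclic slot rotation -- the expensive homomorphic primitive."""
    k %= len(v)
    return v[k:] + v[:k]

def conv_via_rotations(x, w):
    """1D cyclic correlation over packed slots, one rotation per kernel tap:
    y[i] = sum_k w[k] * x[(i + k) mod n]."""
    y = [0] * len(x)
    for k, wk in enumerate(w):
        r = rotate(x, k)  # one homomorphic rotation in the encrypted setting
        y = [yi + wk * ri for yi, ri in zip(y, r)]
    return y
```

Under this baseline, a 3×3 convolution can already cost up to nine rotations per ciphertext; cutting that rotation count is exactly what hybrid packing and CinS encoding are for.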

Phase 4: Breaking the Memory Wall on GPUs (2025–2026)

As FHE algorithms evolved, we brought our insights back to commodity hardware. In our HPCA 2025 paper, Anaheim, we debunked the assumption that NTT is the primary bottleneck on optimized GPUs. We found that simple element-wise operations had become the new memory wall due to limited off-chip DRAM bandwidth. Anaheim proposed a Processing-in-Memory (PIM) architecture and a software framework to offload these operations directly into DRAM, vastly improving energy efficiency.

Our efforts culminated in our ASPLOS 2026 paper, Cheddar, a swift, full-fledged GPU library for FHE. We designed a novel “25-30 prime system,” which uses primes of 25 to 30 bits to natively exploit the 32-bit integer datapath of modern GPUs and avoid expensive 64-bit emulation. Coupled with aggressive sequential and parallel kernel fusion, Cheddar outperforms custom FPGA designs and prior GPU libraries by up to 4.45×, completing a ResNet-20 inference in just 0.72 seconds on a single GPU.
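The arithmetic idea behind sub-32-bit primes can be illustrated with Montgomery multiplication at R = 2^32: with a prime of at most about 30 bits, every intermediate word fits the 32-bit lanes GPUs execute natively. This Python sketch simulates the 32-bit word boundaries with masking; the constants are standard Montgomery bookkeeping, not Cheddar's actual implementation.

```python
R = 1 << 32
MASK = R - 1

def mont_mul(a, b, q, q_neg_inv):
    """Montgomery product: returns a * b * R^-1 mod q.

    On a GPU this maps to 32-bit mul-lo / mul-hi instructions; here the
    masking stands in for the 32-bit word boundaries.
    """
    t = a * b                           # ~60-bit product = (hi, lo) word pair
    m = (t & MASK) * q_neg_inv & MASK   # low-word 32-bit multiply
    t = (t + m * q) >> 32               # high word of the corrected product
    return t - q if t >= q else t

q = 998244353                  # a 30-bit NTT-friendly prime (toy choice)
q_neg_inv = pow(-q, -1, R)     # -q^-1 mod 2^32, precomputed once
```

A 64-bit prime system would instead need multi-word products emulated from several 32-bit instructions per multiply, which is exactly the cost the 25-30-bit range avoids.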

Furthermore, we pushed the hardware even further by building on Cheddar together with two of our earlier ideas: AESPA (arXiv 2022), which bypasses costly bootstrapping by approximating activation functions with square polynomials (requiring a slight, strategic modification to the original 7-layer CNN specification), and a new encrypted convolution method inspired by HyPHEN. As demonstrated in our recent microarchitectural studies (e.g., Theodosian @ ISPASS 2026), modern GPUs are now driven so hard that the bottleneck has shifted to on-chip L2 cache bandwidth. By taming these final bottlenecks, we recently achieved a sub-25 ms inference time for a 7-layer CNN on an off-the-shelf RTX 5090 GPU, matching the ambitious performance targets originally reserved for custom ASICs under the DARPA DPRIVE program.
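The reason square activations sidestep bootstrapping is multiplicative depth: a degree-2 polynomial consumes one ciphertext level per activation, whereas a high-degree ReLU approximation burns several, exhausting the level budget and forcing a bootstrap. The toy Python depth count below makes this concrete; the degree-15 comparison point and the default coefficients are illustrative assumptions, not AESPA's trained values.

```python
import math

def poly_depth(degree):
    """Multiplicative depth to evaluate a degree-d polynomial with a
    balanced (Paterson-Stockmeyer-style) schedule: ceil(log2 d)."""
    return math.ceil(math.log2(degree)) if degree > 1 else 0

def square_act(x, a=0.5, b=0.5, c=0.0):
    """AESPA-style square activation a*x^2 + b*x + c. The real coefficients
    are trained per layer; these defaults are placeholders."""
    return a * x * x + b * x + c
```

At one level per activation, a 7-layer network of convolutions and square activations stays within a typical CKKS level budget without bootstrapping, whereas a depth-4 activation (e.g., a degree-15 approximation) per layer would not.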

Summary

While specialized FHE ASICs demonstrate remarkable peak efficiency, the rapid evolution of cryptographic algorithms risks making fixed hardware prematurely obsolete, a risk underscored by Intel's announcement of its programmable FHE accelerator, HERACLES, at ISSCC 2026. By deeply understanding the interaction between FHE mathematics and hardware microarchitecture, my research demonstrates that intelligently optimized GPUs and memory systems can deliver real-time, practical privacy-preserving computation today.