Offloading less-used experts from Mixture-of-Experts (MoE) LLMs to SSDs is an attractive strategy for running large models on memory-constrained devices. However, our research highlights a critical consideration: energy efficiency.
We found that the energy required to fetch expert weights from an SSD can become a significant bottleneck, potentially dominating the total energy consumed during inference. Our paper quantifies this energy trade-off and concludes that, on current Flash technology, offloading sharply increases per-token energy; it becomes energy-viable only if Flash read energy improves by roughly an order of magnitude.
Furthermore, emerging techniques like speculative decoding can increase the ‘effective’ batch size even on edge devices, further challenging the viability of this offloading approach. Our work emphasizes that the role of storage hardware in MoE systems must be evaluated carefully, with energy efficiency as a first-class concern.
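To make the effective-batch-size point concrete, the sketch below (not from the paper) estimates how many distinct experts a single decode step touches per MoE layer as the effective batch grows, and hence how much expert-weight data an offloading tier must serve per step. It assumes uniform, independent top-k routing and DeepSeek-R1-like expert counts; both are simplifying assumptions for illustration.

```python
# Illustrative sketch (not from the paper): expected number of distinct experts
# activated per MoE layer in one decode step, assuming uniform, independent
# top-k routing. A larger effective batch (e.g., from speculative decoding)
# touches more experts, so more expert weights must be fetched per step.

def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """E[distinct experts] = E * (1 - (1 - k/E)^B) under uniform routing."""
    p_used = 1.0 - (1.0 - top_k / num_experts) ** batch
    return num_experts * p_used


if __name__ == "__main__":
    E, K = 256, 8  # DeepSeek-R1-style routed experts per layer / active per token
    for B in (1, 2, 4, 8, 16, 32):
        n = expected_distinct_experts(E, K, B)
        print(f"effective batch {B:3d}: ~{n:6.1f} distinct experts per layer")
```

Under these assumptions, a batch of 1 touches 8 experts per layer while an effective batch of 32 touches well over half of the 256 experts, eroding the sparsity that makes offloading attractive in the first place.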
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
Kwanhee Kyung, Sungmin Yun, Jung Ho Ahn
Large Language Models (LLMs) employing Mixture-of-Experts (MoE) scale to trillions of parameters but require vast memory, motivating a line of research to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs. While SSDs provide cost-effective capacity, their read energy per bit is substantially higher than that of DRAM. This paper quantitatively analyzes the energy implications of offloading MoE expert weights to SSDs during the critical decode stage of LLM inference. Our analysis, comparing SSD, CPU memory (DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that offloading MoE weights to current SSDs drastically increases per-token-generation energy consumption (e.g., by up to ∼12× compared to the HBM baseline), dominating the total inference energy budget. Although techniques like prefetching effectively hide access latency, they cannot mitigate this fundamental energy penalty. We further explore future technological scaling, finding that the inherent sparsity of MoE models could potentially make SSDs energy-viable if Flash read energy improves significantly, roughly by an order of magnitude.
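As a rough illustration of the trade-off the abstract describes, the sketch below estimates the per-token energy spent just fetching activated weights from each memory tier. Every energy-per-bit figure and the activated-weight size are placeholder assumptions chosen for illustration, not values measured or reported in the paper.

```python
# Back-of-envelope sketch (illustrative; not the paper's measured numbers):
# per-token energy spent fetching activated MoE weights from different tiers.
# Every energy-per-bit figure and the activated-weight size below are rough
# placeholder assumptions; substitute measured values for a real analysis.

ENERGY_PJ_PER_BIT = {  # assumed read energy per bit, placeholders only
    "HBM": 4.0,
    "DDR": 20.0,
    "SSD": 150.0,  # NAND + controller + interface; order-of-magnitude guess
}


def fetch_energy_joules(num_bytes: float, pj_per_bit: float) -> float:
    """Energy (J) to read `num_bytes` of weights at `pj_per_bit` per bit."""
    return num_bytes * 8 * pj_per_bit * 1e-12


if __name__ == "__main__":
    # ~37 GB per token: DeepSeek-R1's ~37B activated parameters at 1 byte each
    # (FP8), treating all activated weights as fetched from the given tier
    # (a deliberate simplification).
    activated_bytes = 37e9
    hbm = fetch_energy_joules(activated_bytes, ENERGY_PJ_PER_BIT["HBM"])
    for tier, pj in ENERGY_PJ_PER_BIT.items():
        e = fetch_energy_joules(activated_bytes, pj)
        print(f"{tier}: {e:6.2f} J/token ({e / hbm:4.1f}x HBM)")
```

Note that prefetching only changes when these bytes move, not how much energy moving them costs, which is why hiding access latency cannot recover the energy penalty the abstract describes.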