We have another paper accepted at IEEE Computer Architecture Letters (#19 @ IEEE CAL). This is our second collaboration with the CXL team at Samsung Electronics. This time, we focus on accelerating RAG (retrieval-augmented generation) using the near-data processing (NDP; Samsung calls it Processing-Near-Memory) capability of CXL-DRAM devices. Compared to previous solutions, we minimize data transfers between the host and the CXL devices by combining general-purpose cores with domain-specific processing units. The paper is available at the following arXiv page and DOI link.
Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
Seoyoung Ko, Hyunjeong Shim, Wanju Doh, Sungmin Yun, Jinin So, Yongsuk Kwon, Sang-Soo Park, Si-Dong Roh, Minyong Yoon, Taeksang Song, Jung Ho Ahn
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
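To give a rough feel for the adjacency-aware placement idea mentioned in the abstract, here is a minimal Python sketch. It is not the algorithm from the paper: the greedy heuristic, the function names (`place_clusters`, `squared_distance`), the `num_neighbors` parameter, and the toy centroids are all assumptions made for illustration. The intuition it tries to capture is that clusters whose centroids are close together tend to be probed by the same query, so spreading them over different devices lets those probes run in parallel instead of piling up on one device.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def place_clusters(centroids, num_devices, num_neighbors=2):
    """Greedily assign clusters to devices (hypothetical heuristic).

    Each cluster goes to the device that currently holds the fewest of its
    nearest-neighbor clusters, with ties broken by the lighter device.
    A per-device capacity keeps the cluster counts balanced.
    """
    n = len(centroids)

    # neighbors[i]: indices of the num_neighbors clusters closest to cluster i.
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: squared_distance(centroids[i], centroids[j]),
        )
        neighbors.append(set(others[:num_neighbors]))

    capacity = math.ceil(n / num_devices)
    devices = [[] for _ in range(num_devices)]
    assignment = {}

    for i in range(n):
        def cost(d):
            # Count clusters already on device d that are adjacent to cluster i.
            conflicts = sum(
                1 for j in devices[d] if j in neighbors[i] or i in neighbors[j]
            )
            return (conflicts, len(devices[d]), d)

        candidates = [d for d in range(num_devices) if len(devices[d]) < capacity]
        best = min(candidates, key=cost)
        devices[best].append(i)
        assignment[i] = best

    return assignment


# Toy example: two tight pairs of clusters, two devices.
centroids = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment = place_clusters(centroids, num_devices=2, num_neighbors=1)
print(assignment)
```

In this toy case, each pair of neighboring clusters (0 and 1, 2 and 3) ends up split across the two devices, and each device holds two clusters. A real system would also have to weigh cluster sizes and query popularity, which this sketch ignores.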