2025-11-19 | infra | data | code

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

Key claim

DCC achieves up to 13.17x speedup over GPU-only execution on PIM devices.

DCC is an ML compiler that jointly optimizes data rearrangements and compute code for processing-in-memory (PIM) devices. Across various ML kernels, it reports up to 13.17x speedup over GPU-only execution on AttAcc PIM (3.92x on average) and up to 7.68x on HBM-PIM (2.21x on average). This matters to builders who want to maximize efficiency in ML applications.

Novelty

8.0/10

DCC introduces a novel approach to jointly optimize data rearrangements and compute code for PIM systems.
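To make "jointly optimize" concrete, here is a toy sketch of why choosing a data layout and a compute schedule together can beat choosing the compute schedule in isolation. It is not DCC's algorithm: the layout names, schedule names, and every cost value below are hypothetical, in arbitrary units, and are not taken from the paper.

```python
from itertools import product

# Hypothetical costs in arbitrary units, for illustration only (not measured data).
REARRANGE_COST = {"row_major": 1, "tiled": 6}
SCHEDULES = ("scalar_loop", "vectorized")
COMPUTE_COST = {
    ("row_major", "scalar_loop"): 10,
    ("row_major", "vectorized"): 8,
    ("tiled", "scalar_loop"): 9,
    ("tiled", "vectorized"): 4,
}


def total_cost(layout, schedule):
    """End-to-end cost: moving data into the layout plus running the kernel."""
    return REARRANGE_COST[layout] + COMPUTE_COST[(layout, schedule)]


def joint_search():
    """Pick the (layout, schedule) pair that minimizes the end-to-end cost."""
    return min(product(REARRANGE_COST, SCHEDULES), key=lambda pair: total_cost(*pair))


def compute_only_search():
    """Pick the pair with the cheapest compute, ignoring rearrangement cost."""
    return min(COMPUTE_COST, key=COMPUTE_COST.get)


if __name__ == "__main__":
    for name, pair in (("joint", joint_search()), ("compute-only", compute_only_search())):
        print(name, pair, total_cost(*pair))
```

With these made-up numbers, the compute-only search picks the tiled layout because its kernel is fastest, but the rearrangement cost makes it slower end to end than the pair found by the joint search.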

Reliability

8.0/10

The evaluation is rigorous and the code is open-sourced, both of which support the paper's claims.

Deep reliability assessment

The methodology supports the claim that DCC can significantly optimize data rearrangement and compute scheduling for PIM architectures, but it may overclaim the extent of performance improvements without considering specific hardware limitations or configurations. The results are based on simulations and may not fully translate to all real-world scenarios.

Reproducibility

Yes. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.

Discussion questions

What assumptions about the interdependence of data rearrangement and compute scheduling might not hold for all types of ML kernels?
How can builders effectively integrate DCC into existing ML workflows, especially when dealing with diverse hardware setups?
What specific conditions or configurations would lead to DCC underperforming compared to traditional methods?

Key figure

Figure 1 illustrates the workflow of near-bank PIM architecture, detailing the three steps of input data rearrangement, computation execution, and output data rearrangement.
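The three steps named above can be sketched for a matrix-vector product. This is a minimal illustration only: the interleaved row-to-bank mapping, the function names, and the use of plain Python lists in place of memory banks are all assumptions, not the layout or API used by DCC or by any specific PIM device.

```python
def rearrange_input(matrix, num_banks):
    """Step 1: place row i in bank i % num_banks (hypothetical interleaved layout)."""
    banks = [[] for _ in range(num_banks)]
    for i, row in enumerate(matrix):
        banks[i % num_banks].append(row)
    return banks


def compute_in_banks(banks, vector):
    """Step 2: each bank computes dot products using only its local rows."""
    return [
        [sum(w * x for w, x in zip(row, vector)) for row in rows]
        for rows in banks
    ]


def rearrange_output(bank_results, num_rows):
    """Step 3: undo the interleaving so results are back in original row order."""
    num_banks = len(bank_results)
    return [bank_results[i % num_banks][i // num_banks] for i in range(num_rows)]


def pim_gemv(matrix, vector, num_banks):
    """Run the three steps end to end for y = matrix @ vector."""
    banks = rearrange_input(matrix, num_banks)
    partial = compute_in_banks(banks, vector)
    return rearrange_output(partial, len(matrix))


if __name__ == "__main__":
    print(pim_gemv([[1, 2], [3, 4], [5, 6]], [1, 1], num_banks=2))
```

In this sketch the compute step is trivial and both rearrangement steps are pure data movement, which is the part of the pipeline the summary says DCC optimizes together with the compute code.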

Benchmark results

GPT-3 and LLaMA-2 | speedup: 4.52 | vs GPU | up to 7.71× in LLaMA-2 | SOTA

various ML kernels | speedup: 7.68 | vs GPU-only execution | 2.21× average on HBM-PIM | SOTA

various ML kernels | speedup: 13.17 | vs GPU-only execution | 3.92× average on AttAcc PIM | SOTA

GitHub: 1 repo

SPIN-Research-Group/DCC (Official)
