Up to 96% Faster Processing-Near-Memory, Without Touching the Hardware

BY
Adam Awale

Why the real bottleneck for edge AI isn't silicon, it's the compiler.

Adam Awale, On-device research, AGI Inc., PhD candidate, MPS Lab, Arizona State University (advisor: Prof. Aviral Shrivastava)

TL;DR

Figure 1: Overview of DPC

The Processing-Near-Memory moment is coming

Processing-Near-Memory is a technology in which compute units sit physically next to the memory cells. The result is higher effective bandwidth: instead of shipping data across the chip to a distant processor, those units operate on it where it already sits, so the same power budget buys more computation. For edge devices balancing latency, area, and battery, that tradeoff is critical.

The industry has moved fast in the last year:

This is the trajectory that edge compute is on.

A terminology note. "In-memory" means compute before the sense amplifier, inside the DRAM. "Near-memory" means compute after the sense amplifier. People use the terms interchangeably and shouldn't: only the in-vs-near distinction actually matters, because it changes the technology stack underneath. The "processing" vs "compute" prefix is purely cosmetic.

Why this matters for what we're building at AGI Inc.

PNM is the industry's most promising answer to the data movement challenge.

At AGI, Inc. data movement is also the problem we solve every day. The agents we ship act across any app, on phones, cars, and wearables; what limits the model on those devices is memory, bandwidth, and power, not the compute. How well an agent runs comes down to how efficient the dataflows are.

Adam Awale on our on-device research team spent his PhD solving this problem, with Prof. Aviral Shrivastava at MPS Lab at Arizona State. The result is DPC, a compilation framework that gets optimal performance out of any DRAM-PNM architecture, where every previous attempt was tied to one chip or one workload. 

At AGI we build upon this same cost-model-driven approach to map our models onto the silicon we already ship on: optimize the workload first, fit it to each device's memory via cost models, tune for the specific hardware last - which allows us to deliver the fastest inference on edge hardware that transfers from chip to chip. 

That's our answer to one of the hardest scaling problems in on-device AI, because every device maker runs different silicon: supporting a new device becomes a mapping problem, not a rebuild. It's how we get the most out of the hardware that exists now, and why the same stack will be ready on day one when PNM chips ship in volume.

So why isn't PNM in your phone yet?

There are two reasons:

DRAM standards are over twenty years old, and adding compute without breaking memory-controller compatibility has historically been hard. Transistor scaling has finally made it tractable, which is why prototype silicon is appearing now.

As we mentioned above, the second reason is the compiler. Most production frameworks today (LLVM, MLIR, NVCC, MLX) were built for CPUs, GPUs, and NPUs, and PNM is a fundamentally different architecture. The scheduling and mapping heuristics that work there don't transfer, and until that gets rethought, PNM hardware doesn't reach consumers.

Existing PNM work falls into three buckets, none of which solve the general problem:

  1. Workload-specific accelerators. A chip optimized for one workload, with a compiler hand-tuned for that chip-workload pair. The benchmark numbers look great, but the approach falls apart in a real product, where workloads are dynamic and the user can't tolerate stalls.
  2. Academic compilation frameworks. Prior work like SimplePIM (UPMEM-only) and CINM, which defines PNM primitives but assumes placement and tile sizes upfront, shows that compilation can be partially automated. None of these projects generalize across vendors, workloads, and architectures at once.
  3. Hand-optimized commercial attempts. Samsung's and SK Hynix's compilers fall here. They work on the workloads they were tuned for, and performance collapses on anything else.

A generalized compiler for PNM

DPC (the DRAM-PNM Compiler) addresses the aforementioned issues.

Developed by Adam Awale and Prof. Aviral Shrivastava in the MPS Lab at Arizona State University, DPC treats mapping as a global optimization problem rather than a hand-tuning exercise. For a given target, it constructs a pruned design space defined by the architecture's parameters, its PNM ISA, and the affine workload, then iteratively searches that space, scoring each candidate point with analytical cost models, to find the set of optimizations that sits at the global cost minimum outlined in figure 2.

Figure 2: Breakdown of DPC compilation exploration

Three pieces:

  1. Generalized architecture parameters. A small parameter set that describes any DRAM-based PNM architecture. Different vendors place their compute units differently, with different bandwidth, parallelism, and memory layout. DPC's generalized parameters span the design space, so the same compiler covers any vendor's DRAM-PNM silicon.
  2. A generalized PNM Instruction Set Architecture (ISA). Higher-level PNM commands that compose into any vendor's instruction set. This is the lowering target. From DPC's ISA, code emits to LLVM IR for any supported backend.
  3. Analytical cost models for design-space exploration. The interesting move here: borrowing old math. Cost-modeling techniques from GCC-era CPU compilers fell out of fashion in the GPU era. DPC adapts them for PNM. They're closed-form equations rather than measured runs, so the full optimization space: loop tiling, loop reuse, operator fusion, loop fusion - gets explored in seconds. For each (architecture, workload) pair, DPC finds the global minimum on a combined latency + power cost function, not an educated guess.

Workload scope: Affine programs (loops of the form AX + B). That covers matrix multiplication, matrix-vector multiplication, and element-wise operations, which is most of what a transformer or CNN does at inference.

Results

DPC was benchmarked against the hand-tuned compilers shipped with Samsung Aquabolt-XL (PIMSimulator) and SK Hynix AiM (aim_simulator), using both vendors' cycle-accurate simulators.

Workloads: DORN, GPT2, LSTM, RNN, StarGAN, ResNet, ViT.

Validation

Before reporting performance numbers, DPC's analytical cost model was validated against both vendors' cycle-accurate simulators on GEMV across multiple dimensions. All deltas were zero, constant, or negligible at scale. That's the bar for using an analytical model to drive design-space exploration.

DPC vs Native Compilers

Head-to-head on native architectures. DPC beats both native compilers on the hardware they were designed for. Samsung Aquabolt-XL: average 67.42% latency improvement, max 96.42% (ResNet). SK Hynix AiM: average 18.9% improvement, max 87.82% (ResNet). The SK Hynix gap is smaller because their architecture uses a cheap SRAM buffer for the operations where DPC's optimizations apply, so there's less cost for the compiler to shave off.

Across architectures (generalization)

DPC architecture sweep

On DDR architectures: average latency reduction 36.54%, max 95.3%.

On HBM architectures: average latency 80.76%, up to 89.74%. The same compiler holds across the design space, not just at one operating point.

Power

On DDR architectures: average power improvement 4.03% (same BU-cost reason as above).

On HBM architectures: average 84.36%.

No hardware changes. Every number above is from compilation alone: memory placement, instruction selection, design-space search. The silicon stays identical.

The paper is currently under peer review. Code release will follow acceptance. The open-source simulators DPC builds on are publicly available: Samsung's PIMSimulator and SK Hynix's AiM Simulator.

What's next

The next milestone for DPC is LLVM lowering for the first commercial PNM silicon that ships. The ISA is designed to lower into LLVM IR for any backend, so when real PNM hardware reaches developers, the compilation path will be ready.

Beyond DPC, the broader question is whether the industry catches up on the compiler side fast enough to match the silicon. PNM chips arrive in 2027 whether their toolchains are ready or not. The interesting work over the next 18 months is everything in the gap between announcement and a developer actually getting useful throughput out of one.

The wider take

Today's consensus in on-device AI: quantize aggressively, throw the model onto the NPU, hope it runs fast. On high-end devices with generous power budgets, fine. On the edge, it falls apart. Heterogeneous compilation, putting the right computation on the right unit, with memory placement that doesn't waste cycles moving data, is what actually gets the latency down at low power. Quantization is a band-aid on top of an unsolved problem.

Look at the history. Vector instructions started as specialized accelerator hardware. So did VLIW. Both are standard in every CPU and GPU now. Compute-near-memory is at the same point in that arc today: vendor-specific, hard to program, niche. 

Same pattern. Within five years, every edge device will ship with some form of compute-near-memory. The vendors will keep changing the name. The architecture is settled. The compilers are not.

The next five years of edge compute will be defined by hardware-software co-design at the memory layer, not another round of quantization tricks. PNM is the first concrete instance of that shift. The compiler is what's gating it.

Edge Cases Research Series

Our research series on on-device AI, compilers, and edge inference.

Work with us

We're hiring on-device research talent. If the problems in this post are the problems you work on, check out our open roles.

References

Open-source simulators

Prior compilation work

Industry announcements

PNM startup ecosystem

Academic

The DPC paper is currently under peer review. Code release will follow acceptance.