You guys really don’t have to guess or argue about this stuff!
I have described the ANE HW in substantial detail (100s of pages) here:
https://github.com/name99-org/AArch64-Explore
Volume 7 is ANE, but you may find the other volumes interesting and even relevant.
And to put it bluntly
1. No, ANE has zero similarity to SME.
2. The first “ANE” unit, the one on the A11, was actually something like a Lattice FPGA. Whatever the details, it was vastly different from the next model, which is the start of the “real” ANE lineage, and where I begin the story. It's always misleading to begin a table of ANE capabilities with the A11; just because marketing wants to confuse things doesn't mean we have to go along with their nonsense!
Have fun with what looks like a fascinating project. I hope understanding the HW in much more detail will allow you to move faster.
Let's see, other than the identical performance metrics:
- They both appeared in the hardware at the same time
- They both convert bfloat16s to float32s, so the throughput is the same
- Neither support FP8 (even though ARM v9.2 specifications do)
- They have the same data layout preferences (z-tile streaming/coreml models)
- They "contend for resources" so only "one" of them can operate at a time
- Same SVL (512-bit) maps perfectly to ANE's 16×16 FP32 tile size
- Same amount of documentation (none)
The lore is certainly a testament to Apple's Marketing department, I'll give it that.
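For what it's worth, the tile-size numerology in that list checks out on paper. Here's the arithmetic as a quick Python sketch, taking the 512-bit SVL figure as given:

```python
# Sanity-check the tile-size numerology, taking SVL = 512 bits as given.

SVL_BITS = 512
SVL_BYTES = SVL_BITS // 8                 # 64 bytes per streaming vector

fp32_lanes = SVL_BITS // 32               # 16 FP32 elements per vector
za32_tile_shape = (fp32_lanes, fp32_lanes)        # 16x16 FP32 tile
za32_tile_bytes = fp32_lanes * fp32_lanes * 4     # 1024 bytes per ZA32 tile

# The full ZA storage is SVL_BYTES x SVL_BYTES bytes:
# exactly four ZA32 tiles' worth of accumulators.
za_total_bytes = SVL_BYTES * SVL_BYTES

print(za32_tile_shape)   # (16, 16)
print(za_total_bytes)    # 4096
```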
I have a bit of a different take. I believe the "Neural Engine" is just a mostly functional SME2 unit. I was going to submit a PR to the llama.cpp code base for that one issue that's been open for ages re: the ane, but haven't gotten around to finishing that yet.
I'll share the source code I have so far though: github.com/joshmorgan1000/ane
Thanks for the share, very cool project.
I agree llama.cpp should be using the ANE; after all, inference has been known to work on it for a while.
That's also the reason I opened this box: I wanted my compiler to leverage it as a backend.
Well, here's the thing: the cblas calls that llama.cpp already makes for Apple silicon use the same instructions under the hood.
The *real* benefit would be adding true 8-bit inference support to llama.cpp, because the ZA storage in the matrix unit is *4096 bytes* (four 16×16 int32 tiles). That's a LOT of 8-bit values to process in a single SMOPA instruction, and if you do the math and run 4 threads (one for each ZA tile), the performance you see matches Apple's NE marketing claims exactly.
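As a sketch of that arithmetic, assuming one widening `svmopa_za32_s8` per tile per cycle and a placeholder 1 GHz clock (both are assumptions for illustration, not measurements):

```python
# Back-of-the-envelope for the int8 SMOPA path described above.
# A widening svmopa_za32_s8 takes two 512-bit vectors (64 int8 lanes,
# grouped in fours) and accumulates a sum of 4 products into each of
# the 16x16 int32 accumulators of one ZA32 tile.

SVL_BITS = 512
I8_PER_VEC = SVL_BITS // 8               # 64 int8 lanes per vector
TILE = 16                                # 16x16 int32 ZA32 tile

macs_per_smopa = TILE * TILE * (I8_PER_VEC // TILE)   # 1024 MACs
ops_per_smopa = 2 * macs_per_smopa                    # mul + add = 2048 ops

ZA32_TILES = 4          # "run 4 threads (one for each ZA tile)"
CLOCK_HZ = 1.0e9        # ASSUMPTION: placeholder clock, not a measured value

peak_tops = ops_per_smopa * ZA32_TILES * CLOCK_HZ / 1e12
print(ops_per_smopa)    # 2048
print(peak_tops)        # 8.192 (TOPS at the assumed 1 GHz; scales with clock)
```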
If you just call the float32 matmuls in the code I posted, you'll essentially see the same performance as the cblas calls. The real "hidden" piece is the 8-bit ZA tile paths, as well as the LUTI2 and LUTI4 instructions, which could be extremely powerful for quantized models if written correctly.
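To illustrate what LUTI4 buys you, here's the operation emulated in plain Python: packed 4-bit indices expanded through a 16-entry table (held in the ZT0 register in hardware). The nibble ordering below is illustrative, not the architecturally defined element order:

```python
# LUTI4 emulated in plain Python: expand packed 4-bit indices through a
# 16-entry lookup table. This is exactly the shape of a 4-bit-quantized
# weight dequantization step.

def luti4_expand(packed: bytes, table: list) -> list:
    assert len(table) == 16
    out = []
    for byte in packed:
        out.append(table[byte & 0x0F])   # low nibble indexes the table
        out.append(table[byte >> 4])     # high nibble indexes the table
    return out

# Usage: a toy dequant table mapping codes 0..15 to -2.0..1.75.
table = [i * 0.25 - 2.0 for i in range(16)]
print(luti4_expand(bytes([0x10, 0xFF]), table))   # [-2.0, -1.75, 1.75, 1.75]
```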
Thanks for the LUT hint. Regarding SMOPA: it doesn't execute on the ANE. There's a separate SME unit on the SoC (a single unit on the M4, shared across all P/E cores) with a peak of around 4 TFLOPS (I ran the University of Jena benchmark to confirm that).
ANE path is totally separate though.
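For reference, a ~4 TFLOP/s FP32 figure decomposes plausibly as follows; the issue rate and clock here are illustrative assumptions, not measured values:

```python
# One plausible decomposition of a ~4 TFLOP/s FP32 peak.
# A single fp32 FMOPA fills one 16x16 ZA32 tile: 256 FMAs = 512 FLOPs.

FLOPS_PER_FMOPA = 16 * 16 * 2    # 512 FLOPs per instruction

# ASSUMPTIONS: illustrative issue rate and clock, not measured values.
FMOPA_PER_CYCLE = 2
CLOCK_HZ = 4.0e9

peak_tflops = FLOPS_PER_FMOPA * FMOPA_PER_CYCLE * CLOCK_HZ / 1e12
print(peak_tflops)   # 4.096
```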
Imagine that you're a hardware engineer. Your boss comes to you and says, "I need you to design this unit with these capabilities." They are huge capabilities; very large registers take up a lot of processor die space. So you work for a while and you figure it out. Then your boss comes back and says, "OK, now I need you to design another, separate unit that does the exact same thing, and put it in a separate place on the chip, but it can't operate at the same time as the one you just designed. Oh, and by the way, they both have to be released at the same time in the next chip we produce."
Kinda silly, right?
Hah, right. Separate SME unit. The one that has the exact same performance capabilities of the ANE, the exact same limitations, and can't be accessed at the same time due to "power issues" or something like that. Meanwhile the ANE is this mysterious thing that can only be accessed via CoreML.
You can believe that if you want. I've got all the clues I need.
Massive moment for on-device AI, and it gets even bigger when you zoom out. The ANE breakthrough shows there's hidden performance inside every chip. http://ROLV.ai shows how to unlock that performance everywhere, all at once.
ANE training is impressive, but it's still one chip, one custom kernel, one model slice. ROLV is a universal sparse compute primitive that already runs identically on:
- CPUs (where ROLV makes Intel beat NVIDIA on real workloads)
- GPUs (A100, H100, MI300)
- TPUs
- Apple ANE
- Mobile and EV silicon
- Any future accelerator
And instead of a single transformer layer, ROLV accelerates full real-world models (LLMs, recsys, FEM solvers) with 10–700× speedups and 65–99% energy savings, independently validated by the University of Miami.
So the difference is simple:
- ANE breakthrough: "Look what this one chip can secretly do."
- ROLV: "Now apply that efficiency to every chip, every model, everywhere, with no retraining."
Pair the two and your Mac doesn't just train: it becomes part of a global, universal compute fabric where training and inference run at GPU-class speeds on whatever hardware you already own. The future isn't just local training. It's universal compute with zero waste, and that's the world ROLV is already enabling.