You've done truly impressive reverse engineering and benchmarking here. I'm interested in your comment "The ideal LLM inference strategy on M4 is hybrid: prefill (large batch, high throughput) on ANE, decode (single token, latency-sensitive) on SME."
It seems to me that the KV cache will not fit in on-chip SRAM for any but the very smallest LLMs. I would therefore expect the KV cache to be stored in DRAM for Apple silicon. The KV cache has to be read in full for each token output during the decode phase of LLM inference. Since Apple's GPU has access to more DRAM bandwidth than SME, I would expect the decode phase of LLM inference to be done by Apple's GPU. I agree the prefill phase of LLM inference could be done on the ANE.
Vol. 7 in the link below contains Maynard Handley's 122 page description of the ANE based on his analysis of Apple patents.
https://github.com/name99-org/AArch64-Explore
Thanks for the follow-up. I agree that the full KV cache for any decently sized model can't stay in SRAM, hence decode on the CPU/GPU makes more sense.
The decode phase of LLM inference is DRAM bandwidth limited. On Apple's Max and Ultra chips, the DRAM bandwidth available to the GPU is significantly higher than the DRAM bandwidth available to the CPU/SME. Since the decode phase should be on the GPU for Max and Ultra, the decode phase might as well be done on the GPU for M4. The prefill phase could be done on the GPU or ANE.
The GPU on M5 Max can read 614 GB/s from DRAM while the CPU clusters can read a total of 288 GB/s from DRAM. Doing the decode phase of LLM inference on the GPU of M5 Max would be 614/288 = 2.1x faster than using the SME unit of each CPU cluster.
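Since decode is bandwidth-bound, the speedup falls straight out of the bandwidth ratio. A back-of-the-envelope sketch (the 8 GB model and 1 GB KV cache below are made-up figures for illustration, not measurements):

```python
# Rough decode-throughput estimate: decode is DRAM-bandwidth bound, so
# tokens/sec is at most usable bandwidth / bytes read per token
# (full weight set + full KV cache).

GPU_BW = 614e9   # M5 Max GPU read bandwidth, bytes/sec (from the text)
CPU_BW = 288e9   # total CPU-cluster read bandwidth, bytes/sec (from the text)

def decode_tokens_per_sec(bandwidth_bps, model_bytes, kv_cache_bytes):
    """Upper bound on decode rate when every token must stream the full
    weight set and KV cache from DRAM."""
    return bandwidth_bps / (model_bytes + kv_cache_bytes)

# Hypothetical 8B-parameter model at INT8 (8 GB) with a 1 GB KV cache.
model_bytes = 8e9
kv_bytes = 1e9
gpu_rate = decode_tokens_per_sec(GPU_BW, model_bytes, kv_bytes)
cpu_rate = decode_tokens_per_sec(CPU_BW, model_bytes, kv_bytes)
print(f"GPU: {gpu_rate:.1f} tok/s, CPU/SME: {cpu_rate:.1f} tok/s, "
      f"ratio {gpu_rate / cpu_rate:.2f}x")
```

The ratio is independent of the model size, since both devices stream the same bytes per token.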
There are 2 ways to use INT8 - linear INT8 and quantized INT8. Are you using linear INT8? That’s what you need to use. You also need to ensure correct data layout - see my PDF for the details.
I looked at all the places in Vol. 7 that contain the words "linear" or "quantized". I saw no mention of linear vs quantized INT8. Could you please explain what you are referring to and why you recommend linear INT8?
Suppose you have 8 bits available to store a large set of weights.
One option is to use some sort of table lookup, so you
- create a histogram of the set of weights
- find 256 values that are "optimally" dense within that histogram
- map each weight (say FP16) to the nearest one of those 256 values.
The details of this are interesting and have been refined for many years – the techniques began with image compression long ago, and in my earlier life media compression was my job/interest. But of course software will handle all this for you.
This seems to be called "quantization" these days due to history, but "palettization" is a more accurate term.
Now think about what this means - the table index is 8 bits, but the index is worthless for arithmetic; you have to load the value (presumably FP16) from the table and then perform arithmetic with that. So the 8 bits save you memory space and bandwidth, but you still have to perform FP16 ops.
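A minimal sketch of that table-lookup scheme, using quantiles as a simple stand-in for the k-means-style palette search real tools use (all sizes and data here are illustrative):

```python
import numpy as np

# Palettization sketch: replace each FP16 weight by an 8-bit index into a
# 256-entry lookup table. Real pipelines pick the palette with k-means
# (Lloyd's algorithm); quantiles are used here as a simple stand-in.

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float16)

# 256 palette entries placed at the quantiles of the weight histogram,
# so they are denser where the weights are denser.
palette = np.quantile(weights.astype(np.float32),
                      np.linspace(0, 1, 256)).astype(np.float16)

# Map each weight to the index of its nearest palette entry.
indices = np.abs(weights[:, None].astype(np.float32)
                 - palette[None, :].astype(np.float32)).argmin(axis=1)
indices = indices.astype(np.uint8)   # this is all you store (plus the table)

# At compute time you must dereference the table: arithmetic is still FP16.
dequant = palette[indices]
print("max abs error:", np.abs(weights.astype(np.float32)
                               - dequant.astype(np.float32)).max())
```

Note that `indices` carries no arithmetic meaning on its own - every multiply has to go through `palette[...]` first, which is the point being made above.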
Another alternative is to consider the range of the set of FP16 weights. Divide that range by 256, giving a scale. Also subtract off an offset if the range is not symmetric around zero.
Now you can approximate each FP16 value by INT8*scale+offset.
(Again there is theory for how to do this optimally, generally by not using the full range, so allowing outliers all to collapse to a single MAXINT value, but again, SW will handle those details.)
You CAN now do arithmetic with these INT8 weights, and at the end of the arithmetic you can rescale back to the original range. If you're doing any sort of non-linear operation (ReLU, SiLU, logistic, and all that), you may be able to fold the rescaling into the non-linear op, or you can use the ANE's planar engine hardware to do the job.
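The scale-and-offset scheme can be sketched in a few lines (the rounding and clipping choices here are one simple variant, not necessarily what any particular toolchain does):

```python
import numpy as np

# Linear (affine) INT8 sketch: approximate w ~ q * scale + offset, where q
# is a signed 8-bit integer. Arithmetic can then stay in the integer
# domain, with one rescale at the end.

def linear_quantize(w):
    w = w.astype(np.float32)
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0          # 256 levels spanning the range
    offset = lo + 128.0 * scale        # centre the range on int8's zero
    q = np.clip(np.round((w - offset) / scale), -128, 127).astype(np.int8)
    return q, scale, offset

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float16)
q, scale, offset = linear_quantize(w)

# Dequantize to check the reconstruction error (at most scale/2).
w_hat = q.astype(np.float32) * scale + offset
print("max abs error:", np.abs(w.astype(np.float32) - w_hat).max())
```

This uses the full range; the outlier-clipping refinement mentioned above would shrink `scale` and let extreme weights saturate at -128/127.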
So that's the difference. Palettized INT8 uses FP16 arithmetic. Linear INT8 uses INT8 arithmetic. I know CoreML distinguishes the two, but I don't know the APIs well so I couldn't give you details.
There are two more issues.
The first is the details of the op you want to perform. If you are routing straight to matrix math, obviously there's no issue. But if you are using a neural net, a particular model may talk about A8W16 or similar. This refers to the length of each Activation in bits vs the length of each Weight in bits. If your model is set up something like A16W8, then even though you think your weights are Linear INT8, if the model wants activations (i.e. the overall matrix-multiply result) to be FP16, it may be forcing the arithmetic to do that. (You obviously could just rescale at the end of an INT8/INT8 matrix multiply, but maybe the model did not allow for that.)
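To make the rescale-at-the-end point concrete, here is a sketch of an A8W8-style matrix multiply. Symmetric (zero-offset) quantization is assumed so the two scales combine into a single factor applied after the integer accumulate (real models may need offset handling as well):

```python
import numpy as np

# INT8 x INT8 matrix multiply with the rescale folded to the end.
# Symmetric quantization (offset = 0) keeps the integer product a plain
# dot product; the activation and weight scales combine afterwards.

def quantize_sym(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 64)).astype(np.float32)    # activations
W = rng.normal(size=(64, 8)).astype(np.float32)    # weights

qa, sa = quantize_sym(A)
qw, sw = quantize_sym(W)

# Accumulate in int32 (as integer MAC hardware would), rescale once at the end.
acc = qa.astype(np.int32) @ qw.astype(np.int32)
out = acc.astype(np.float32) * (sa * sw)

ref = A @ W
print("max abs error vs float:", np.abs(out - ref).max())
```

An A16W8 model would instead dequantize `qw` to FP16 up front and run the whole multiply in FP16, which is exactly the forced-arithmetic case described above.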
Secondly, the big constraint in both ANE and the GPU is getting data into the arithmetic units every cycle. For ANE what they appear to have done is
- they started with a 16b data path to an execution unit that could perform either an FP16 or an INT8 MAC
- it's "fairly" easy to slightly augment that execution unit to perform 2 INT8 MACs
- but how to get data in? Obviously you use the top and bottom halves of the FP16 data path.
- but that in turn implies you have an operation that's something like a SIMD INT8[2] operation.
So you have to set up the ANE in a way that's friendly to this 2-wide SIMD at every stage. Again I don't know the APIs (I just look at the hardware!) but this probably requires some care in setting up the shape of the tensors.
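The pair-packing can be illustrated with a dtype reinterpretation - numpy's `view` here is just a stand-in for what the hardware does on its 16-bit wires:

```python
import numpy as np

# Two int8 values ride in one 16-bit lane of the FP16 datapath, so tensors
# must come in pairs along the packed axis.

vals = np.array([[1, -2], [3, 4], [-5, 6]], dtype=np.int8)   # shape (3, 2)

# Reinterpret each int8 pair as one 16-bit lane (no data movement).
lanes = vals.reshape(-1).view(np.int16)                      # shape (3,)
print(lanes.dtype, lanes.shape)

# Unpacking recovers the original pairs exactly.
unpacked = lanes.view(np.int8).reshape(3, 2)
assert (unpacked == vals).all()

# An int8 axis of odd length cannot be packed this way - the shape must be
# even along the packed dimension, which is why tensor shapes need care.
```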
You might find it worthwhile looking at this page.
https://github.com/freedomtan/measure_ane_capacity
I have no idea *exactly* what this guy is doing, since his iPhone numbers seem to suggest a double-sized NE unit (he may have made a mistake in how he sets up timing?)
But his M4 numbers are pretty much what I would expect.
Thanks for your comments and the link to the work of Koan-Sin Tan (freedomtan). It's interesting that Koan-Sin measured twice the performance of INT8 compared to FP16 on the ANE of M4 Pro. Manjeet Singh measured equal performance for INT8 and FP16 on the ANE of M4. The difference between Koan-Sin's and Manjeet's measurements seems important for choosing between palettized/quantized INT8 and linear INT8. If there is a hardware table to depalettize/dequantize INT8 into FP16 and there is no performance difference between INT8 and FP16, I'd think palettized/quantized INT8 would be preferred because it would be more accurate without any performance loss. On the other hand, if INT8 is twice as fast as FP16, then linear INT8 would be preferred wherever it is accurate enough.