Metal Backend (Apple Silicon)¶
This document describes how the Metal backend compiles models for Apple Silicon's GPU, Neural Engine, and CPU via Core ML and MPSGraph.
Table of Contents¶
- Core ML Model Generation
- MPSGraph Code Generation
- Apple Silicon Optimisations
- Neural Engine vs GPU vs CPU Selection
- Known Limitations
Core ML Model Generation¶
The Metal backend primarily targets Core ML, Apple's on-device machine learning
framework. Core ML models (.mlpackage / .mlmodelc) can execute on the Neural
Engine (ANE), GPU (via Metal), or CPU, with automatic scheduling by the framework.
Generation Pipeline¶
Quantised IRGraph
          │
          ▼
┌────────────────────┐
│ 1. Legalise ops    │── Map IR ops to Core ML neural network layers
│                    │── Decompose unsupported ops into supported primitives
│                    │── Handle layout differences (NHWC → NCHW where needed)
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ 2. Assign compute  │── Determine optimal compute unit for each op
│    units           │── ANE for conv/pool, GPU for attention, CPU for control flow
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ 3. Build Core ML   │── Create ct.models.MLModel via coremltools
│    model spec      │── Add neural network layers
│                    │── Set input/output descriptions
│                    │── Configure model metadata
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ 4. Optimise for    │── Fuse adjacent layers (Conv+BN, Conv+ReLU)
│    Apple Silicon   │── Configure palette (quantisation) for weights
│                    │── Set compute unit preferences
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ 5. Compile and     │── Convert to .mlpackage (directory)
│    validate        │── Run Core ML model validation
│                    │── Test inference with dummy input
│                    │── Write output file
└─────────┬──────────┘
          │
          ▼
   model.mlpackage/
Core ML Model Structure¶
model.mlpackage/
├── Data/
│   └── com.apple.CoreML/
│       ├── weights/
│       │   ├── weight_0.bin    # Conv layer weights (INT8 quantised)
│       │   ├── weight_1.bin    # Next layer weights
│       │   └── ...
│       └── metadata.json       # Weight metadata
├── Model/
│   └── model.mlmodel           # Protobuf model spec
└── com.apple.CoreML/
    └── Info.plist              # Package metadata
Model Spec Construction¶
edgecompiler builds the Core ML model spec programmatically using coremltools:
import coremltools as ct
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

# Describe inputs and outputs as (name, type) pairs, the form that
# NeuralNetworkBuilder expects. Shapes are channel-first (C, H, W),
# assuming layouts were converted to NCHW during legalisation.
input_features = [("input_0", datatypes.Array(3, 224, 224))]
output_features = [("output_0", datatypes.Array(1001))]

# Build neural network
builder = NeuralNetworkBuilder(input_features, output_features)

# Add layers from IR
for op in ir_graph.ops:
    if op.op_type == "Conv2D":
        # weight_shape, weight_data and bias_data are read from the op's
        # constant tensors (lookup elided). weight_shape is (O, H, W, I);
        # Core ML expects W laid out as (H, W, I, O).
        builder.add_convolution(
            name=op.name,
            kernel_channels=weight_shape[3],
            output_channels=weight_shape[0],
            height=weight_shape[1],
            width=weight_shape[2],
            stride_height=op.attributes["stride_h"],
            stride_width=op.attributes["stride_w"],
            border_mode=op.attributes["padding"].lower(),
            groups=op.attributes.get("groups", 1),
            W=weight_data,
            b=bias_data,
            has_bias=bias_data is not None,
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    elif op.op_type == "ReLU":
        builder.add_activation(
            name=op.name,
            non_linearity="RELU",
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    # ... more op types

# Set compute unit (chosen when the model is loaded, not assigned afterwards)
model = ct.models.MLModel(builder.spec, compute_units=ct.ComputeUnit.ALL)  # ANE + GPU + CPU
INT8 Palette (Weight Quantisation)¶
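Core ML supports two forms of weight compression: linear quantisation, which stores weights as scaled 8-bit integers, and palettisation, which replaces each weight with an index into a small lookup table (a "palette") so that each weight needs only a few bits: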
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
    palettize_weights,
)

# INT8 linear quantisation
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=2048,  # Only quantise tensors with more than 2048 elements
    )
)
model = linear_quantize_weights(model, config=config)

# Alternative: 4-bit palettisation (higher compression than INT8)
config = OptimizationConfig(
    global_config=OpPalettizerConfig(
        nbits=4,
        mode="kmeans",
    )
)
model = palettize_weights(model, config=config)
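To make the lookup-table idea concrete, here is a minimal pure-Python sketch of palettisation on a flat list of weights. It is not how coremltools implements palettize_weights: the function names (build_palette, palettize, depalettize) are hypothetical, and the 1-D k-means loop is deliberately simple.

```python
def build_palette(weights, nbits, iters=20):
    """Cluster weights into 2**nbits centroids with a simple 1-D k-means.

    Assumes nbits >= 1. Illustrative only; real tools use tuned solvers.
    """
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    # Initialise centroids evenly across the weight range.
    palette = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for w in weights:
            idx = min(range(k), key=lambda i: abs(w - palette[i]))
            buckets[idx].append(w)
        # Move each centroid to the mean of its bucket; keep empty ones in place.
        palette = [sum(b) / len(b) if b else c for b, c in zip(buckets, palette)]
    return palette


def palettize(weights, palette):
    """Replace each weight by the index of its nearest palette entry."""
    return [
        min(range(len(palette)), key=lambda i: abs(w - palette[i]))
        for w in weights
    ]


def depalettize(indices, palette):
    """Reconstruct approximate weights from indices and the lookup table."""
    return [palette[i] for i in indices]


weights = [-0.9, -0.8, -0.1, 0.0, 0.1, 0.7, 0.8, 0.9]
palette = build_palette(weights, nbits=2)   # 4-entry lookup table
indices = palettize(weights, palette)       # each index fits in 2 bits
restored = depalettize(indices, palette)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Only the indices and the palette need to be stored, which is where the compression comes from. The reconstruction error depends on how well the weights cluster.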
MPSGraph Code Generation¶
For operations not supported by Core ML's neural network layer API (e.g., custom
attention mechanisms, dynamic control flow), edgecompiler generates an MPSGraph
executable, optionally combined with custom Metal compute shaders, that runs
directly on the GPU.
When MPSGraph is Used¶
MPSGraph is used as a supplementary code path when:
- Custom ops: The IR contains ops not mappable to Core ML layers.
- Dynamic shapes: Ops that produce dynamically-shaped outputs (e.g., NonMaxSuppression).
- Custom shaders: The user provides custom Metal compute shaders.
- MPS-accelerated ops: Ops where MPS provides a faster implementation than Core ML (e.g., batched matrix multiplication on GPU).
MPSGraph Integration Architecture¶
┌───────────────────────────────────────────────────────────┐
│                Core ML Model (.mlpackage)                 │
│                                                           │
│  ┌─────────────────┐  ┌──────────────────────────────┐    │
│  │ Core ML Layers  │  │ Custom Model (MPSGraph)      │    │
│  │                 │  │                              │    │
│  │ Conv2D          │  │ ┌──────────────────────────┐ │    │
│  │ BatchNorm       │  │ │ MPSGraph executable      │ │    │
│  │ ReLU            │  │ │                          │ │    │
│  │ MaxPool         │  │ │ ├── MPSGraphMatMul       │ │    │
│  │ ...             │  │ │ ├── MPSGraphSoftmax      │ │    │
│  │                 │  │ │ ├── Custom MTLFunction   │ │    │
│  │ Compute: ANE/GPU│  │ │ └── ...                  │ │    │
│  │                 │  │ │                          │ │    │
│  └─────────────────┘  │ │ Compute: GPU only        │ │    │
│                       │ └──────────────────────────┘ │    │
│                       └──────────────────────────────┘    │
└───────────────────────────────────────────────────────────┘
Generating MPSGraph Code¶
from edgecompiler.backends.metal.mpsgraph_builder import MPSGraphBuilder

builder = MPSGraphBuilder()

# Map IR ops to MPSGraph operations
for op in ir_graph.ops:
    if op.op_type == "MatMul":
        builder.add_matmul(
            name=op.name,
            input_a=op.inputs[0],
            input_b=op.inputs[1],
            output=op.outputs[0],
        )
    elif op.op_type == "Softmax":
        builder.add_softmax(
            name=op.name,
            input=op.inputs[0],
            output=op.outputs[0],
            axis=op.attributes.get("axis", -1),
        )
    elif op.op_type == "Custom":
        # Load user-provided Metal shader
        builder.add_custom_shader(
            name=op.name,
            source_path=op.attributes["shader_path"],
            inputs=op.inputs,
            outputs=op.outputs,
            threadgroup_size=op.attributes.get("threadgroup_size", [32, 1, 1]),
        )

# Generate the MPSGraph executable
graph_executable = builder.compile()
Custom Metal Shaders¶
Users can provide custom Metal compute shaders for specialised operations:
// custom_attention.metal
#include <metal_stdlib>
using namespace metal;

// Naive single-head attention: one thread per (query, dimension) output element.
kernel void scaled_dot_product_attention(
    device const float* queries [[buffer(0)]],
    device const float* keys    [[buffer(1)]],
    device const float* values  [[buffer(2)]],
    device float* output        [[buffer(3)]],
    constant int& seq_len       [[buffer(4)]],
    constant int& head_dim      [[buffer(5)]],
    uint2 gid [[thread_position_in_grid]]
) {
    int q = gid.x;
    int d = gid.y;
    if (q >= seq_len || d >= head_dim) return;

    float scale = 1.0f / sqrt(float(head_dim));
    float running_max = -INFINITY;  // keeps exp() in range (stable softmax)
    float sum = 0.0f;
    float acc = 0.0f;               // accumulate locally; output may be uninitialised

    for (int k = 0; k < seq_len; k++) {
        float dot = 0.0f;
        for (int i = 0; i < head_dim; i++) {
            dot += queries[q * head_dim + i] * keys[k * head_dim + i];
        }
        float score = dot * scale;
        float new_max = max(running_max, score);
        float correction = exp(running_max - new_max);  // rescale earlier terms
        float weight = exp(score - new_max);
        sum = sum * correction + weight;
        acc = acc * correction + weight * values[k * head_dim + d];
        running_max = new_max;
    }
    output[q * head_dim + d] = acc / sum;
}
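A CPU reference is useful for checking the output of a custom kernel like the one above. The following is a minimal pure-Python sketch of single-head scaled dot-product attention. The function name attention_reference is hypothetical and is not part of edgecompiler.

```python
import math


def attention_reference(queries, keys, values):
    """Single-head scaled dot-product attention on nested lists.

    queries, keys and values are [seq_len][head_dim] lists of floats,
    matching the row-major buffers the kernel reads.
    """
    head_dim = len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)
    output = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max so exp() cannot overflow
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        row = [
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(head_dim)
        ]
        output.append(row)
    return output


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention_reference(q, k, v)
```

Each output row is a weighted average of the value rows, so every element must lie between the smallest and largest value in its column. That gives a quick sanity check before comparing against GPU results element by element.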
Apple Silicon Optimisations¶
edgecompiler applies several optimisations specific to Apple Silicon hardware.
Memory Layout Optimisations¶
Standard NHWC (TFLite)          →   Optimised for Apple Silicon
┌─────────────────────────────┐     ┌──────────────────────────────┐
│ Tensor: [1, 224, 224, 3]    │     │ Tensor: [1, 3, 224, 224]     │
│ Layout: NHWC (channel-last) │     │ Layout: NCHW (channel-first) │
│                             │     │                              │
│                             │     │ ANE prefers NCHW for         │
│                             │     │ convolution; cache-friendly  │
│                             │     │ access patterns on the ANE   │
└─────────────────────────────┘     └──────────────────────────────┘
The Metal backend automatically selects the optimal memory layout for each operation based on the target compute unit:
| Compute Unit | Preferred Layout | Notes |
|---|---|---|
| Neural Engine (ANE) | NCHW | Hardware-optimised for channel-first |
| GPU (Metal) | NCHW | Better texture cache utilisation |
| CPU | NHWC | Standard for TFLite compatibility |
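The layout change is an index permutation. As an illustration (a sketch, not the backend's actual implementation; both function names are hypothetical), the following converts a shape and a flat buffer from NHWC to NCHW:

```python
def nhwc_to_nchw_shape(shape):
    """Permute a 4-D shape from (N, H, W, C) to (N, C, H, W)."""
    n, h, w, c = shape
    return (n, c, h, w)


def nhwc_to_nchw(data, shape):
    """Reorder a flat row-major NHWC buffer into a flat NCHW buffer."""
    n, h, w, c = shape
    out = [0] * len(data)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = data[src]
    return out


shape = (1, 2, 2, 3)
data = list(range(12))  # 4 pixels with 3 interleaved channels each
converted = nhwc_to_nchw(data, shape)
```

After conversion, all values for channel 0 are contiguous, followed by channel 1, and so on. A production implementation would use a vectorised transpose rather than nested loops.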
Layer Fusion¶
The backend fuses common layer patterns for reduced memory bandwidth:
Before fusion:                               After fusion:
┌──────────┐   ┌───────────┐   ┌───────┐     ┌──────────────────────┐
│ Conv2D   │ → │ BatchNorm │ → │ ReLU  │     │ FusedConvBNReLU      │
│          │   │           │   │       │     │ (single ANE op)      │
└──────────┘   └───────────┘   └───────┘     └──────────────────────┘

Before fusion:                               After fusion:
┌──────────┐   ┌───────────┐                 ┌──────────────────────┐
│ Conv2D   │ → │ BiasAdd   │                 │ Conv2D with bias     │
│          │   │           │                 │ (BiasAdd folded in)  │
└──────────┘   └───────────┘                 └──────────────────────┘

Before fusion:                               After fusion:
┌──────────┐   ┌───────────┐   ┌───────┐     ┌──────────────────────┐
│ Conv2D   │ → │ BatchNorm │ → │ ReLU6 │     │ FusedConvBNReLU6     │
│          │   │           │   │ clip  │     │ (clamp output [0,6]) │
└──────────┘   └───────────┘   └───────┘     └──────────────────────┘
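Conv+BatchNorm fusion usually relies on the standard folding identity: scale each output channel's kernel by gamma / sqrt(var + eps) and adjust the bias to match. The sketch below illustrates that identity in plain Python. Whether edgecompiler folds in exactly this way is an assumption, and fold_batchnorm is a hypothetical name.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm parameters into conv weights and bias.

    weights[o] is the flat list of kernel values for output channel o;
    bias, gamma, beta, mean and var hold one value per output channel.
    """
    folded_w, folded_b = [], []
    for o, kernel in enumerate(weights):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        folded_w.append([w * scale for w in kernel])
        folded_b.append((bias[o] - mean[o]) * scale + beta[o])
    return folded_w, folded_b


# Check the identity on a 1x1 convolution with a single channel.
w, b = [[2.0]], [0.5]
gamma, beta, mean, var = [1.5], [0.1], [0.3], [4.0]
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)

x = 3.0
conv = w[0][0] * x + b[0]
unfused = gamma[0] * (conv - mean[0]) / math.sqrt(var[0] + 1e-5) + beta[0]
fused = fw[0][0] * x + fb[0]
```

Because the folded convolution produces the same values as the two-op sequence, the BatchNorm layer can be dropped from the graph, removing one intermediate tensor write and read.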
Weight Compression¶
Apple Silicon supports several weight compression strategies:
| Strategy | Compression | Accuracy Impact | Speed |
|---|---|---|---|
| FP32 → FP16 | 2× | Negligible | Same or faster |
| FP16 → INT8 (linear or 8-bit palette) | 2× | Small (< 1% typical) | Same on ANE |
| FP16 → 4-bit palette | 4× | Moderate (1–3%) | Same on ANE |
| FP16 → 2-bit palette | 8× | Significant (> 5%) | Same on ANE |
| Structured sparsity | 2× | Small | Potentially faster |
# Apply weight compression
from edgecompiler import compile

result = compile(
    "model.pt",
    target="metal",
    quantize="ptq",
    weight_compression="int8",  # "fp16", "int8", "4bit", "2bit"
    output="model.mlpackage",
)
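The compression ratios in the table follow from bits per weight. As a rough sketch, the helper below estimates stored weight size. It assumes a single per-tensor lookup table with FP16 entries for palettised weights and ignores scales and metadata; compressed_size_bytes is a hypothetical name.

```python
import math


def compressed_size_bytes(num_weights, nbits, palettized=False, lut_entry_bytes=2):
    """Estimate weight storage: packed payload plus an optional lookup table."""
    payload = math.ceil(num_weights * nbits / 8)
    lut = (2 ** nbits) * lut_entry_bytes if palettized else 0
    return payload + lut


num_weights = 3_500_000  # illustrative parameter count
fp16_size = compressed_size_bytes(num_weights, 16)
sizes = {
    "int8": compressed_size_bytes(num_weights, 8),
    "4bit": compressed_size_bytes(num_weights, 4, palettized=True),
    "2bit": compressed_size_bytes(num_weights, 2, palettized=True),
}
ratios = {name: fp16_size / size for name, size in sizes.items()}
```

For large tensors the lookup table is negligible, so the ratios come out very close to 2×, 4× and 8× relative to FP16.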
Buffer Pooling¶
For inference workloads, edgecompiler's runtime implements buffer pooling to
minimise memory allocation overhead:
┌────────────────────────────────────────────────────┐
│ Buffer Pool                                        │
│                                                    │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                │
│ │ 4 KB │ │ 16KB │ │ 64KB │ │256KB │ ...            │
│ └──────┘ └──────┘ └──────┘ └──────┘                │
│                                                    │
│ Buffers are reused across inference calls.         │
│ Allocations are rounded up to the next power of 2. │
│ Pool grows dynamically up to a configurable max.   │
└────────────────────────────────────────────────────┘
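The pooling scheme can be sketched in a few lines of Python. This is an illustration of the idea, not edgecompiler's runtime code: the BufferPool class, its method names, and the choice to cap idle bytes (one reading of "configurable max") are all assumptions.

```python
class BufferPool:
    """Size-bucketed pool that reuses buffers across inference calls."""

    def __init__(self, max_pooled_bytes=64 * 1024 * 1024, min_bucket=4096):
        self.max_pooled_bytes = max_pooled_bytes
        self.min_bucket = min_bucket  # assumed to be a power of two
        self._free = {}               # bucket size -> list of idle buffers
        self._pooled_bytes = 0        # total bytes currently idle in the pool

    def _bucket(self, nbytes):
        """Round a request up to the next power of two (at least min_bucket)."""
        size = self.min_bucket
        while size < nbytes:
            size *= 2
        return size

    def acquire(self, nbytes):
        """Return an idle buffer from the matching bucket, or allocate one."""
        size = self._bucket(nbytes)
        idle = self._free.get(size)
        if idle:
            self._pooled_bytes -= size
            return idle.pop()
        return bytearray(size)

    def release(self, buf):
        """Return a buffer to the pool unless that would exceed the cap."""
        size = len(buf)
        if self._pooled_bytes + size > self.max_pooled_bytes:
            return  # over budget: drop the buffer and let it be freed
        self._free.setdefault(size, []).append(buf)
        self._pooled_bytes += size


pool = BufferPool()
a = pool.acquire(5000)  # rounded up to the 8192-byte bucket
pool.release(a)
b = pool.acquire(6000)  # same bucket, so the released buffer is reused
```

Rounding up rather than to the nearest size guarantees every buffer is at least as large as the request, at the cost of up to 2× internal fragmentation.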
Neural Engine vs GPU vs CPU Selection¶
Apple Silicon has three compute units for ML inference. edgecompiler configures
Core ML's compute unit selection to maximise performance.
Compute Unit Characteristics¶
| Property | Neural Engine (ANE) | GPU (Metal) | CPU |
|---|---|---|---|
| Latency | Very low (~0.1 ms/op) | Low (~0.3 ms/op) | High (~1 ms/op) |
| Throughput | Very high | High | Moderate |
| Power | Very low | Moderate | High |
| Precision | FP16, INT8 | FP32, FP16, INT8 | FP32, FP16, INT8 |
| Dynamic shapes | ❌ Not supported | ✅ Supported | ✅ Supported |
| Custom ops | ❌ Not supported | ✅ Via MPSGraph | ✅ Any |
| Concurrent | ❌ Single model | ✅ Multiple models | ✅ Multiple |
Selection Strategy¶
┌──────────────────┐
│    Operation?    │
└────────┬─────────┘
         │
┌────────▼─────────┐
│ Dynamic shapes?  │──────── Yes ────▶ GPU (CPU if not GPU-compatible)
└────────┬─────────┘
         │ No
┌────────▼─────────┐
│ Custom op?       │──────── Yes ────▶ GPU (MPSGraph)
└────────┬─────────┘
         │ No
┌────────▼─────────┐
│ In ANE support   │
│ list?            │──────── Yes ────▶ ANE
└────────┬─────────┘
         │ No
┌────────▼─────────┐
│ GPU-compatible?  │──────── Yes ────▶ GPU
└────────┬─────────┘
         │ No
         ▼
        CPU
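The decision procedure can be written as a small function. In this sketch the op sets are abbreviated samples and the names are hypothetical. It also assumes that dynamically-shaped ops go to the GPU when they are GPU-compatible and to the CPU otherwise, in line with the characteristics table, which lists dynamic shapes as supported on both.

```python
# Abbreviated, illustrative op sets; the real support lists are longer.
ANE_SUPPORTED = {
    "Conv2D", "DepthwiseConv2D", "FullyConnected",
    "MaxPool2D", "ReLU", "Softmax", "Add",
}
GPU_COMPATIBLE = ANE_SUPPORTED | {
    "MatMul", "LayerNorm", "Einsum", "NonMaxSuppression",
}


def select_compute_unit(op_type, dynamic_shape=False, custom=False):
    """Pick a compute unit for one op, checking constraints in priority order."""
    if dynamic_shape:
        # The ANE cannot handle dynamic shapes; prefer the GPU when possible.
        return "GPU" if op_type in GPU_COMPATIBLE else "CPU"
    if custom:
        return "GPU"  # custom ops run through MPSGraph
    if op_type in ANE_SUPPORTED:
        return "ANE"
    if op_type in GPU_COMPATIBLE:
        return "GPU"
    return "CPU"


plan = {
    op: select_compute_unit(op)
    for op in ["Conv2D", "LayerNorm", "StringLookup"]
}
```

The order of the checks matters: hard constraints (dynamic shapes, custom ops) come first so an op is never routed to a unit that cannot run it.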
ANE-Supported Operations¶
The Apple Neural Engine efficiently accelerates these operations:
- Conv2D (standard, depthwise, dilated)
- DepthwiseConv2D
- FullyConnected (InnerProduct)
- MaxPool2D, AveragePool2D
- ReLU, ReLU6, LeakyReLU
- Sigmoid, Tanh, HardSigmoid
- Softmax
- Add, Sub, Mul, Div (element-wise)
- Concatenation
- ResizeBilinear, ResizeNearestNeighbor
- Pad (constant, reflect, replicate)
- BatchNorm (fused into Conv2D)
- ReduceMean, ReduceMax, ReduceMin, ReduceSum
GPU-Preferred Operations¶
These operations run on the GPU when the ANE is not suitable:
- MatMul (large matrices)
- Attention / ScaledDotProductAttention
- LayerNorm
- LSTM, GRU (recurrent layers)
- Einsum
- ScatterND, GatherND
- Custom Metal shaders
- Operations with dynamic shapes
Configuration Options¶
from edgecompiler import compile

# Let Core ML auto-select (recommended)
result = compile("model.pt", target="metal",
                 compute_unit="auto")  # Default

# Force all ops to ANE (fails if unsupported ops exist)
result = compile("model.pt", target="metal",
                 compute_unit="ane_only")

# Force all ops to GPU
result = compile("model.pt", target="metal",
                 compute_unit="gpu_only")

# Use CPU + ANE only (GPU excluded)
result = compile("model.pt", target="metal",
                 compute_unit="cpu_and_ne")

# Use all available units
result = compile("model.pt", target="metal",
                 compute_unit="all")
Compute Unit Assignment Example¶
For a MobileNetV2 model:
┌───────────────────────────────────────────────────────────┐
│ MobileNetV2 Compute Plan                                  │
│                                                           │
│ Op 0:   Conv2D (stem)            → ANE (0.05 ms)          │
│ Op 1:   BatchNorm                → FUSED into Op 0        │
│ Op 2:   ReLU6                    → FUSED into Op 0        │
│ Op 3:   DepthwiseConv2D          → ANE (0.08 ms)          │
│ Op 4:   BatchNorm + ReLU6        → FUSED into Op 3        │
│ Op 5:   Conv2D (expand)          → ANE (0.06 ms)          │
│ Op 6:   BatchNorm + ReLU6        → FUSED into Op 5        │
│ ...                                                       │
│ Op 140: Conv2D (final)           → ANE (0.04 ms)          │
│ Op 141: GlobalAvgPool            → ANE (0.01 ms)          │
│ Op 142: FullyConnected           → ANE (0.03 ms)          │
│ Op 143: Softmax                  → ANE (0.01 ms)          │
│                                                           │
│ Total estimated latency: ~1.2 ms                          │
│ Compute unit breakdown: ANE 100%, GPU 0%, CPU 0%          │
└───────────────────────────────────────────────────────────┘
For a Transformer model:
┌───────────────────────────────────────────────────────────┐
│ Transformer Compute Plan                                  │
│                                                           │
│ Op 0:   MatMul (Q projection)    → GPU (0.12 ms)          │
│ Op 1:   MatMul (K projection)    → GPU (0.12 ms)          │
│ Op 2:   MatMul (V projection)    → GPU (0.12 ms)          │
│ Op 3:   ScaledDotProductAtt      → GPU (0.15 ms)          │
│ Op 4:   MatMul (output proj)     → GPU (0.12 ms)          │
│ Op 5:   LayerNorm                → GPU (0.05 ms)          │
│ Op 6:   Add (residual)           → ANE (0.01 ms)          │
│ Op 7:   FullyConnected (FFN)     → GPU (0.18 ms)          │
│ Op 8:   FullyConnected (FFN)     → GPU (0.18 ms)          │
│ Op 9:   LayerNorm                → GPU (0.05 ms)          │
│ Op 10:  Add (residual)           → ANE (0.01 ms)          │
│ ...                                                       │
│                                                           │
│ Total estimated latency: ~4.8 ms                          │
│ Compute unit breakdown: ANE 15%, GPU 80%, CPU 5%          │
└───────────────────────────────────────────────────────────┘
Known Limitations¶
Core ML / Neural Engine Limits¶
| Limitation | Detail | Workaround |
|---|---|---|
| ANE dynamic shapes | The Neural Engine does not support dynamically-shaped tensors | Pre-define all shapes at compile time |
| ANE INT8 support | ANE supports INT8 for some ops but not all | Use FP16 for unsupported ops |
| Custom ops on ANE | Custom ops cannot run on the Neural Engine | Route custom ops to GPU/CPU |
| Model size | Very large models (> 1 GB) may exceed ANE on-chip memory | Split into multiple models |
| Concurrent ANE | Only one model can use the ANE at a time | Queue models sequentially |
| LayerNorm on ANE | LayerNorm runs on GPU, not ANE | Use BatchNorm (ANE-accelerated) when possible |
MPSGraph Limits¶
| Limitation | Detail | Workaround |
|---|---|---|
| macOS version | MPSGraph requires macOS 12.3+ | Use Core ML on older systems |
| No ANE | MPSGraph runs on GPU only, not Neural Engine | Use Core ML for ANE-eligible ops |
| Debugging | Limited debugging tools for MPSGraph | Use Xcode GPU profiler |
| Memory | MPSGraph shares GPU memory with display | Leave headroom for UI |
macOS-Specific Limits¶
| Limitation | Detail |
|---|---|
| Thermal throttling | Sustained GPU inference on M1 Pro may throttle after 30–60 seconds |
| Unified memory | GPU and CPU share memory; large models may compete with system memory |
| External GPU | eGPU support for Core ML is limited; internal GPU is preferred |
| iOS deployment | Models compiled on macOS can be deployed to iOS, but ANE behaviour may differ |
Debugging Tips¶
- Check compute unit assignment: Use --verbose to see which compute unit each op is assigned to.
- Profile with Instruments: Use Xcode Instruments → Metal System Trace to profile GPU utilisation.
- Force compute unit: If the ANE is underutilised, try --compute-unit cpu_and_ne to prevent GPU fallback.
- Validate accuracy: Compare Core ML output against a reference implementation:

  from edgecompiler.runtime import InferenceSession, compare_outputs

  session = InferenceSession("model.mlpackage", target="metal")
  result = session.run({"input": test_input})
  reference = run_reference_model(test_input)
  mse, max_diff = compare_outputs(result, reference)
  print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")

- Reduce model size: If the model is too large for the ANE, try weight compression (--weight-compression int8) or model splitting.
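For readers without the runtime at hand, the two metrics reported by the accuracy check are easy to reproduce. The sketch below is a hypothetical stand-in for compare_outputs operating on flat sequences of floats. It is named compare_flat_outputs to make clear that it is not the edgecompiler function, whose exact behaviour is not documented here.

```python
def compare_flat_outputs(result, reference):
    """Return (mean squared error, max absolute difference) of two sequences."""
    if len(result) != len(reference):
        raise ValueError("outputs differ in length")
    if not result:
        raise ValueError("outputs are empty")
    diffs = [float(a) - float(b) for a, b in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    max_diff = max(abs(d) for d in diffs)
    return mse, max_diff


# Example: class probabilities from the compiled model vs. a reference run.
mse, max_diff = compare_flat_outputs([0.10, 0.70, 0.20], [0.10, 0.68, 0.22])
```

MSE summarises the overall drift, while the maximum difference catches a single badly quantised output that an average would hide, so it is worth checking both.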