Metal Backend (Apple Silicon)

This document describes how the Metal backend compiles models for Apple Silicon's GPU, Neural Engine, and CPU via Core ML and MPSGraph.

Table of Contents

Core ML Model Generation
MPSGraph Code Generation
Apple Silicon Optimisations
Neural Engine vs GPU vs CPU Selection
Known Limitations

Core ML Model Generation

The Metal backend primarily targets Core ML, Apple's on-device machine learning framework. Core ML models (.mlpackage / .mlmodelc) can execute on the Neural Engine (ANE), GPU (via Metal), or CPU, with automatic scheduling by the framework.

Generation Pipeline

  Quantised IRGraph
        │
        ▼
┌────────────────────┐
│ 1. Legalise ops    │──   Map IR ops to Core ML neural network layers
│                    │──   Decompose unsupported ops into supported primitives
│                    │──   Handle layout differences (NHWC → NCHW where needed)
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 2. Assign compute  │──   Determine optimal compute unit for each op
│    units           │──   ANE for conv/pool, GPU for attention, CPU for control flow
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 3. Build Core ML   │──   Create ct.models.MLModel via coremltools
│    model spec      │──   Add neural network layers
│                    │──   Set input/output descriptions
│                    │──   Configure model metadata
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 4. Optimise for    │──   Fuse adjacent layers (Conv+BN, Conv+ReLU)
│    Apple Silicon   │──   Configure palette (quantisation) for weights
│                    │──   Set compute unit preferences
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 5. Compile and     │──   Convert to .mlpackage (directory)
│    validate        │──   Run Core ML model validation
│                    │──   Test inference with dummy input
│                    │──   Write output file
└────────┬───────────┘
         │
         ▼
  model.mlpackage/

Core ML Model Structure

model.mlpackage/
├── Data/
│   └── com.apple.CoreML/
│       ├── weights/
│       │   ├── weight_0.bin        # Conv layer weights (INT8 quantised)
│       │   ├── weight_1.bin        # Next layer weights
│       │   └── ...
│       └── metadata.json           # Weight metadata
├── Model/
│   └── model.mlmodel               # Protobuf model spec
└── com.apple.CoreML/
    └── Info.plist                  # Package metadata

Model Spec Construction

edgecompiler builds the Core ML model spec programmatically using coremltools:

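import coremltools as ct
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

# Describe inputs and outputs as (name, type) pairs.
# NeuralNetworkBuilder takes channel-first (C, H, W) shapes without the batch dimension.
input_features = [("input_0", datatypes.Array(3, 224, 224))]
output_features = [("output_0", datatypes.Array(1001))]

# Build neural network
builder = NeuralNetworkBuilder(input_features, output_features)

# Add layers from IR
for op in ir_graph.ops:
    if op.op_type == "Conv2D":
        # weight_shape, weight_data, and bias_data are read from the op's constant
        # tensors (lookup omitted). The indexing below assumes the IR stores weights as
        # (out_channels, height, width, in_channels), while add_convolution expects W
        # as (height, width, kernel_channels, output_channels), hence the transpose
        # (assuming weight_data is a NumPy array).
        builder.add_convolution(
            name=op.name,
            kernel_channels=weight_shape[3],
            output_channels=weight_shape[0],
            height=weight_shape[1],
            width=weight_shape[2],
            stride_height=op.attributes["stride_h"],
            stride_width=op.attributes["stride_w"],
            border_mode=op.attributes["padding"].lower(),  # "valid" or "same"
            groups=op.attributes.get("groups", 1),
            W=weight_data.transpose(1, 2, 3, 0),
            b=bias_data,
            has_bias=bias_data is not None,
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    elif op.op_type == "ReLU":
        builder.add_activation(
            name=op.name,
            non_linearity="RELU",
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    # ... more op types

# Set compute units at construction time (MLModel.compute_unit is read-only)
model = ct.models.MLModel(builder.spec, compute_units=ct.ComputeUnit.ALL)  # ANE + GPU + CPU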

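INT8 Palette (Weight Quantisation)

Core ML supports two forms of weight compression: linear quantisation, which stores weights as INT8 values plus a scale factor, and palettisation, which replaces each weight with an index into a small lookup table (the "palette") so that weights occupy fewer bits. Both are exposed through coremltools.optimize.coreml, whose functions expect an ML Program model: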

from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    linear_quantize_weights,
)

# INT8 linear quantisation
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=2048,  # Only quantise tensors with more than 2048 elements
    )
)
model = linear_quantize_weights(model, config=config)

# Or: 4-bit palettisation (even more compression)
config = OptimizationConfig(
    global_config=OpPalettizerConfig(
        nbits=4,
        mode="kmeans",
    )
)
model = palettize_weights(model, config=config)
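To make the two schemes concrete, the following is a minimal, dependency-free sketch of what they do to a flat list of weights. It is not the coremltools implementation: the function names `quantize_linear_symmetric` and `palettize_kmeans`, the single per-tensor scale, and the deterministic centroid initialisation are all assumptions chosen for illustration.

```python
def quantize_linear_symmetric(weights):
    """Map floats to INT8 codes in [-127, 127] using one scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def palettize_kmeans(weights, nbits, iterations=20):
    """Cluster weights into 2**nbits palette entries with 1-D k-means."""
    k = 2 ** nbits
    ordered = sorted(weights)
    n = len(ordered)
    # Deterministic start: one centroid from the middle of each of k equal slices.
    palette = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    def nearest(w):
        return min(range(k), key=lambda j: abs(w - palette[j]))

    for _ in range(iterations):
        indices = [nearest(w) for w in weights]
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:  # an empty cluster keeps its previous centroid
                palette[j] = sum(members) / len(members)
    return palette, [nearest(w) for w in weights]


weights = [-1.02, -1.0, -0.98, -0.01, 0.0, 0.01,
           0.48, 0.5, 0.52, 0.99, 1.0, 1.01]

# Linear: each weight becomes an 8-bit code; dequantise with code * scale.
codes, scale = quantize_linear_symmetric(weights)
restored_linear = [c * scale for c in codes]

# Palette: each weight becomes a 2-bit index into a 4-entry lookup table.
palette, indices = palettize_kmeans(weights, nbits=2)
restored_palette = [palette[i] for i in indices]
```

Linear quantisation bounds the per-weight error by half the scale, while palettisation adapts its levels to where the weights actually cluster, which is why it can remain usable at 4 bits or fewer.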

MPSGraph Code Generation

For operations not supported by Core ML's neural network layer API (e.g., custom attention mechanisms, dynamic control flow), edgecompiler generates MPSGraph code that executes directly on the GPU through Metal.

When MPSGraph is Used

MPSGraph is used as a supplementary code path when:

Custom ops: The IR contains ops not mappable to Core ML layers.
Dynamic shapes: Ops that produce dynamically-shaped outputs (e.g., NonMaxSuppression).
Custom shaders: The user provides custom Metal compute shaders.
MPS-accelerated ops: Ops where MPS provides a faster implementation than Core ML (e.g., batched matrix multiplication on GPU).

MPSGraph Integration Architecture

┌───────────────────────────────────────────────────────────┐
│                   Core ML Model (.mlpackage)              │
│                                                           │
│  ┌─────────────────┐     ┌──────────────────────────────┐ │
│  │ Core ML Layers  │     │ Custom Model (MPSGraph)      │ │
│  │                 │     │                              │ │
│  │ Conv2D          │     │  ┌─────────────────────────┐ │ │
│  │ BatchNorm       │     │  │ MPSGraph executable     │ │ │
│  │ ReLU            │     │  │                         │ │ │
│  │ MaxPool         │     │  │  ├── MPSGraphMatMul     │ │ │
│  │ ...             │     │  │  ├── MPSGraphSoftmax    │ │ │
│  │                 │     │  │  ├── Custom MTLFunction │ │ │
│  │ Compute: ANE/GPU│     │  │  └── ...                │ │ │
│  │                 │     │  │                         │ │ │
│  └─────────────────┘     │  │  Compute: GPU only      │ │ │
│                          │  └─────────────────────────┘ │ │
│                          └──────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

Generating MPSGraph Code

from edgecompiler.backends.metal.mpsgraph_builder import MPSGraphBuilder

builder = MPSGraphBuilder()

# Map IR ops to MPSGraph operations
for op in ir_graph.ops:
    if op.op_type == "MatMul":
        builder.add_matmul(
            name=op.name,
            input_a=op.inputs[0],
            input_b=op.inputs[1],
            output=op.outputs[0],
        )
    elif op.op_type == "Softmax":
        builder.add_softmax(
            name=op.name,
            input=op.inputs[0],
            output=op.outputs[0],
            axis=op.attributes.get("axis", -1),
        )
    elif op.op_type == "Custom":
        # Load user-provided Metal shader
        builder.add_custom_shader(
            name=op.name,
            source_path=op.attributes["shader_path"],
            inputs=op.inputs,
            outputs=op.outputs,
            threadgroup_size=op.attributes.get("threadgroup_size", [32, 1, 1]),
        )

# Generate the MPSGraph executable
graph_executable = builder.compile()

Custom Metal Shaders

Users can provide custom Metal compute shaders for specialised operations:

// custom_attention.metal

#include <metal_stdlib>
using namespace metal;

// One thread computes one output element: query row q, feature d.
kernel void scaled_dot_product_attention(
    device const float* queries [[buffer(0)]],
    device const float* keys    [[buffer(1)]],
    device const float* values  [[buffer(2)]],
    device float* output        [[buffer(3)]],
    constant int& seq_len       [[buffer(4)]],
    constant int& head_dim      [[buffer(5)]],
    uint2 gid                   [[thread_position_in_grid]]
) {
    int q = gid.x;
    int d = gid.y;
    if (q >= seq_len || d >= head_dim) return;

    float scale = 1.0f / sqrt(float(head_dim));
    float sum = 0.0f;  // softmax denominator
    float acc = 0.0f;  // weighted sum of values, accumulated locally

    for (int k = 0; k < seq_len; k++) {
        float score = 0.0f;
        for (int i = 0; i < head_dim; i++) {
            score += queries[q * head_dim + i] * keys[k * head_dim + i];
        }
        // Unnormalised softmax weight. For large scores, subtract the row
        // maximum before exp() to avoid overflow.
        float weight = exp(score * scale);
        sum += weight;
        acc += weight * values[k * head_dim + d];
    }

    // Write once, so the result does not depend on the buffer's prior contents.
    output[q * head_dim + d] = acc / sum;
}
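When writing a custom shader like this one, it helps to have a slow but obviously correct host-side reference to compare against. The following is a minimal sketch in plain Python for the single-head case the kernel handles; the name `attention_reference` is hypothetical and not part of edgecompiler. It subtracts the row maximum before exponentiating, which is mathematically equivalent to the plain softmax but avoids overflow.

```python
import math


def attention_reference(queries, keys, values):
    """Single-head scaled dot-product attention on nested lists.

    queries, keys, values: seq_len rows of head_dim floats each.
    Returns seq_len rows of head_dim floats.
    """
    head_dim = len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)
    output = []
    for q in queries:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Numerically stable softmax over the key axis.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Probability-weighted average of the value rows.
        output.append([
            sum(p * v[d] for p, v in zip(probs, values))
            for d in range(head_dim)
        ])
    return output


# A zero query attends to all keys equally, so the result is the mean value row.
example = attention_reference(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 4.0], [4.0, 8.0]],
)
```

Comparing the shader's output buffer against this reference within a small tolerance catches indexing and normalisation mistakes early.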

Apple Silicon Optimisations

edgecompiler applies several optimisations specific to Apple Silicon hardware.

Memory Layout Optimisations

Standard NHWC (TFLite)          →    Optimised for Apple Silicon

┌───────────────────────────┐       ┌─────────────────────────────┐
│ Tensor: [1, 224, 224, 3]  │       │ Tensor: [1, 3, 224, 224]    │
│ Layout: NHWC              │       │ Layout: NCHW (channel-first)│
│                           │       │                             │
│ ANE prefers NCHW for      │       │ Cache-friendly access       │
│ convolution operations    │       │ patterns on the ANE         │
└───────────────────────────┘       └─────────────────────────────┘

The Metal backend automatically selects the optimal memory layout for each operation based on the target compute unit:

Compute Unit	Preferred Layout	Notes
Neural Engine (ANE)	NCHW	Hardware-optimised for channel-first
GPU (Metal)	NCHW	Better texture cache utilisation
CPU	NHWC	Standard for TFLite compatibility
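The layout change is purely a reordering of elements in memory. As a minimal sketch (plain Python over a flat buffer, with the hypothetical helper name `nhwc_to_nchw`; the backend's real conversion is not shown in this document), the index arithmetic looks like this:

```python
def nhwc_to_nchw(flat, n, h, w, c):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    out = [0] * (n * h * w * c)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    # NHWC: channels are the fastest-varying index.
                    src = ((ni * h + hi) * w + wi) * c + ci
                    # NCHW: width is the fastest-varying index.
                    dst = ((ni * c + ci) * h + hi) * w + wi
                    out[dst] = flat[src]
    return out


# Two pixels with three channels each: [R0, G0, B0, R1, G1, B1]
nhwc = [0, 1, 2, 3, 4, 5]
nchw = nhwc_to_nchw(nhwc, n=1, h=1, w=2, c=3)
# Channel planes are now contiguous: [R0, R1, G0, G1, B0, B1]
```

In NCHW, all values of one channel sit next to each other, which is what makes per-channel convolution reads contiguous.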

Layer Fusion

The backend fuses common layer patterns for reduced memory bandwidth:

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐  ┌─────┐    ┌──────────────────────┐
│ Conv2D   │→ │ BatchNorm│→ │ ReLU│    │ FusedConvBNReLU      │
│          │  │          │  │     │    │  (single ANE op)     │
└──────────┘  └──────────┘  └─────┘    └──────────────────────┘

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐             ┌──────────────────────┐
│ Conv2D   │→ │ BiasAdd  │             │ Conv2D with bias     │
│          │  │          │             │  (bias fused in op)  │
└──────────┘  └──────────┘             └──────────────────────┘

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐  ┌──────┐    ┌──────────────────────┐
│ Conv2D   │→ │ BatchNorm│→ │ ReLU6│    │ FusedConvBNReLU6     │
│          │  │          │  │ clip │    │  (clamp output [0,6])│
└──────────┘  └──────────┘  └──────┘    └──────────────────────┘
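Conv+BatchNorm fusion works because batch normalisation at inference time is a per-channel affine transform, so it can be absorbed into the convolution's weights and bias. The sketch below shows the standard folding arithmetic in plain Python and checks it at a single output position. The helper names are hypothetical, and this illustrates the general technique rather than edgecompiler's own fusion pass.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into per-output-channel conv weights and bias.

    weights[o] is the flattened kernel for output channel o.
    """
    folded_w, folded_b = [], []
    for o, kernel in enumerate(weights):
        s = gamma[o] / math.sqrt(var[o] + eps)
        folded_w.append([w * s for w in kernel])
        folded_b.append((bias[o] - mean[o]) * s + beta[o])
    return folded_w, folded_b


def conv_at_one_position(patch, weights, bias):
    """Convolution output for one input patch: a dot product per channel."""
    return [sum(x * w for x, w in zip(patch, kernel)) + b
            for kernel, b in zip(weights, bias)]


def relu6(x):
    return min(max(x, 0.0), 6.0)


weights = [[1.0, 2.0], [0.5, -1.0]]
bias = [0.1, -0.2]
gamma, beta = [2.0, 0.5], [0.3, 0.0]
mean, var = [0.5, -0.5], [4.0, 1.0]
patch = [3.0, -1.0]
eps = 1e-5

# Unfused: Conv2D, then BatchNorm, then ReLU6 as three separate steps.
conv_out = conv_at_one_position(patch, weights, bias)
unfused = [relu6((y - mean[o]) / math.sqrt(var[o] + eps) * gamma[o] + beta[o])
           for o, y in enumerate(conv_out)]

# Fused: one convolution with folded parameters, clamped to [0, 6].
fw, fb = fold_batchnorm(weights, bias, gamma, beta, mean, var, eps)
fused = [relu6(y) for y in conv_at_one_position(patch, fw, fb)]
```

Both paths produce the same values up to floating-point rounding, but the fused form reads and writes the activation tensor once instead of three times.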

Weight Compression

Apple Silicon supports several weight compression strategies:

Strategy	Compression	Accuracy Impact	Speed
FP32 → FP16	2×	Negligible	Same or faster
FP16 → INT8 (palette)	2× (4× vs FP32)	Small (< 1% typical)	Same on ANE
FP16 → 4-bit palette	4×	Moderate (1–3%)	Same on ANE
FP16 → 2-bit palette	8×	Significant (> 5%)	Same on ANE
Structured sparsity	2×	Small	Potentially faster

# Apply weight compression
from edgecompiler import compile

result = compile(
    "model.pt",
    target="metal",
    quantize="ptq",
    weight_compression="int8",    # "fp16", "int8", "4bit", "2bit"
    output="model.mlpackage",
)
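The compression factors in the table above follow from bits per weight, with a small overhead for the palette itself. The following back-of-envelope sketch assumes one palette per tensor with FP16 entries; both are illustrative assumptions, and the function names are hypothetical.

```python
import math


def palettized_bytes(num_weights, nbits, palette_entry_bytes=2):
    """Storage for n-bit indices plus one lookup table of 2**nbits entries."""
    index_bytes = math.ceil(num_weights * nbits / 8)
    palette_bytes = (2 ** nbits) * palette_entry_bytes
    return index_bytes + palette_bytes


def ratio_vs_fp16(num_weights, nbits):
    """Compression factor relative to storing every weight as FP16."""
    return (num_weights * 2) / palettized_bytes(num_weights, nbits)


large = 1_000_000
ratio_8bit = ratio_vs_fp16(large, 8)   # close to 2
ratio_4bit = ratio_vs_fp16(large, 4)   # close to 4
ratio_2bit = ratio_vs_fp16(large, 2)   # close to 8

# For a tiny tensor the palette overhead dominates and nothing is saved.
ratio_tiny = ratio_vs_fp16(256, 8)
```

The last case is one reason compression tools typically skip small tensors, as the weight_threshold setting shown earlier does.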

Buffer Pooling

For inference workloads, edgecompiler's runtime implements buffer pooling to minimise memory allocation overhead:

┌──────────────────────────────────────────────────┐
│              Buffer Pool                         │
│                                                  │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐             │
│  │ 4 KB │ │ 16KB │ │ 64KB │ │256KB │  ...        │
│  └──────┘ └──────┘ └──────┘ └──────┘             │
│                                                  │
│  Buffers are reused across inference calls.      │
│  Allocations are rounded up to a power of 2.     │
│  Pool grows dynamically up to a configurable max.│
└──────────────────────────────────────────────────┘
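The pooling behaviour described above can be sketched in a few lines. This is an illustrative model using Python bytearrays rather than Metal buffers; the class name `BufferPool`, the 4 KB minimum bucket, and the `max_bytes` parameter are assumptions, not edgecompiler's runtime API.

```python
class BufferPool:
    """Reuses buffers in power-of-two size buckets, up to a total byte limit."""

    def __init__(self, max_bytes=64 * 1024 * 1024, min_size=4096):
        self.max_bytes = max_bytes
        self.min_size = min_size
        self._free = {}       # bucket size -> list of idle buffers
        self._allocated = 0   # total bytes the pool has allocated so far

    def _bucket(self, nbytes):
        # Round the request up to the next power-of-two bucket.
        size = self.min_size
        while size < nbytes:
            size *= 2
        return size

    def acquire(self, nbytes):
        size = self._bucket(nbytes)
        idle = self._free.get(size)
        if idle:
            return idle.pop()  # reuse: no new allocation
        if self._allocated + size > self.max_bytes:
            raise MemoryError("buffer pool limit reached")
        self._allocated += size
        return bytearray(size)

    def release(self, buf):
        # Return the buffer to its bucket for the next inference call.
        self._free.setdefault(len(buf), []).append(buf)


pool = BufferPool(max_bytes=32 * 1024)
first = pool.acquire(5000)    # rounded up to an 8 KB buffer
pool.release(first)
second = pool.acquire(6000)   # same bucket, so the same buffer comes back
```

Rounding up trades some memory for reuse: requests of slightly different sizes land in the same bucket, so steady-state inference performs no allocations.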

Neural Engine vs GPU vs CPU Selection

Apple Silicon has three compute units for ML inference. edgecompiler configures Core ML's compute unit selection to maximise performance.

Compute Unit Characteristics

Property	Neural Engine (ANE)	GPU (Metal)	CPU
Latency	Very low (~0.1 ms/op)	Low (~0.3 ms/op)	High (~1 ms/op)
Throughput	Very high	High	Moderate
Power	Very low	Moderate	High
Precision	FP16, INT8	FP32, FP16, INT8	FP32, FP16, INT8
Dynamic shapes	❌ Not supported	✅ Supported	✅ Supported
Custom ops	❌ Not supported	✅ Via MPSGraph	✅ Any
Concurrent	❌ Single model	✅ Multiple models	✅ Multiple

Selection Strategy

                    ┌──────────────────┐
                    │   Operation?     │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │ Dynamic shapes?  │──────── Yes ────▶ CPU
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ Custom op?       │──────── Yes ────▶ GPU (MPSGraph)
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ In ANE support   │
                    │ list?            │──────── Yes ────▶ ANE
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ GPU-compatible?  │──────── Yes ────▶ GPU
                    └────────┬─────────┘
                             │ No
                             ▼
                           CPU

ANE-Supported Operations

The Apple Neural Engine efficiently accelerates these operations:

Conv2D (standard, dilated)
DepthwiseConv2D
FullyConnected (InnerProduct)
MaxPool2D, AveragePool2D
ReLU, ReLU6, LeakyReLU
Sigmoid, Tanh, HardSigmoid
Softmax
Add, Sub, Mul, Div (element-wise)
Concatenation
ResizeBilinear, ResizeNearestNeighbor
Pad (constant, reflect, replicate)
BatchNorm (fused into Conv2D)
ReduceMean, ReduceMax, ReduceMin, ReduceSum

GPU-Preferred Operations

These operations run on the GPU when the ANE is not suitable:

MatMul (large matrices)
Attention / ScaledDotProductAttention
LayerNorm
LSTM, GRU (recurrent layers)
Einsum
ScatterND, GatherND
Custom Metal shaders
Operations with dynamic shapes
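The Selection Strategy flowchart, combined with the two operation lists above, can be sketched as a single function. The sets below contain only a subset of the listed ops, and the function name and flags are hypothetical. The sketch follows the flowchart as drawn, so dynamically shaped ops go to the CPU; adjust the first branch if they should go to the GPU instead, as the GPU-preferred list suggests.

```python
# Subsets of the ANE-supported and GPU-preferred lists, for illustration.
ANE_OPS = {
    "Conv2D", "DepthwiseConv2D", "FullyConnected",
    "MaxPool2D", "AveragePool2D", "ReLU", "ReLU6",
    "Softmax", "Add", "Concatenation",
}
GPU_OPS = {
    "MatMul", "ScaledDotProductAttention", "LayerNorm",
    "LSTM", "GRU", "Einsum", "ScatterND", "GatherND",
}


def select_compute_unit(op_type, dynamic_shape=False, custom=False):
    """Walk the selection flowchart from top to bottom."""
    if dynamic_shape:
        return "CPU"
    if custom:
        return "GPU (MPSGraph)"
    if op_type in ANE_OPS:
        return "ANE"
    if op_type in GPU_OPS:
        return "GPU"
    return "CPU"


plan = {
    op: select_compute_unit(op)
    for op in ["Conv2D", "LayerNorm", "Add", "While"]
}
```

Because the checks run in order, a Conv2D with a dynamic output shape never reaches the ANE branch even though Conv2D is in the ANE list.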

Configuration Options

from edgecompiler import compile

# Let Core ML auto-select (recommended)
result = compile("model.pt", target="metal",
                 compute_unit="auto")  # Default

# Force all ops to ANE (fails if unsupported ops exist)
result = compile("model.pt", target="metal",
                 compute_unit="ane_only")

# Force all ops to GPU
result = compile("model.pt", target="metal",
                 compute_unit="gpu_only")

# Use CPU + ANE only (GPU excluded)
result = compile("model.pt", target="metal",
                 compute_unit="cpu_and_ne")

# Use all available units
result = compile("model.pt", target="metal",
                 compute_unit="all")

Compute Unit Assignment Example

For a MobileNetV2 model:

┌───────────────────────────────────────────────────────────┐
│                  MobileNetV2 Compute Plan                 │
│                                                           │
│  Op 0:  Conv2D (stem)           → ANE  (0.05 ms)          │
│  Op 1:  BatchNorm               → FUSED into Op 0         │
│  Op 2:  ReLU6                   → FUSED into Op 0         │
│  Op 3:  DepthwiseConv2D         → ANE  (0.08 ms)          │
│  Op 4:  BatchNorm + ReLU6       → FUSED into Op 3         │
│  Op 5:  Conv2D (expand)         → ANE  (0.06 ms)          │
│  Op 6:  BatchNorm + ReLU        → FUSED into Op 5         │
│  ...                                                      │
│  Op 140: Conv2D (final)         → ANE  (0.04 ms)          │
│  Op 141: GlobalAvgPool          → ANE  (0.01 ms)          │
│  Op 142: FullyConnected         → ANE  (0.03 ms)          │
│  Op 143: Softmax                → ANE  (0.01 ms)          │
│                                                           │
│  Total estimated latency: ~1.2 ms                         │
│  Compute unit breakdown: ANE 100%, GPU 0%, CPU 0%         │
└───────────────────────────────────────────────────────────┘

For a Transformer model:

┌───────────────────────────────────────────────────────────┐
│                  Transformer Compute Plan                 │
│                                                           │
│  Op 0:  MatMul (Q projection)  → GPU  (0.12 ms)           │
│  Op 1:  MatMul (K projection)  → GPU  (0.12 ms)           │
│  Op 2:  MatMul (V projection)  → GPU  (0.12 ms)           │
│  Op 3:  ScaledDotProductAtt    → GPU  (0.15 ms)           │
│  Op 4:  MatMul (output proj)   → GPU  (0.12 ms)           │
│  Op 5:  LayerNorm              → GPU  (0.05 ms)           │
│  Op 6:  Add (residual)         → ANE  (0.01 ms)           │
│  Op 7:  FullyConnected (FFN)   → GPU  (0.18 ms)           │
│  Op 8:  FullyConnected (FFN)   → GPU  (0.18 ms)           │
│  Op 9:  LayerNorm              → GPU  (0.05 ms)           │
│  Op 10: Add (residual)         → ANE  (0.01 ms)           │
│  ...                                                      │
│                                                           │
│  Total estimated latency: ~4.8 ms                         │
│  Compute unit breakdown: ANE 15%, GPU 80%, CPU 5%         │
└───────────────────────────────────────────────────────────┘

Known Limitations

Core ML / Neural Engine Limits

Limitation	Detail	Workaround
ANE dynamic shapes	The Neural Engine does not support dynamically-shaped tensors	Pre-define all shapes at compile time
ANE INT8 support	ANE supports INT8 for some ops but not all	Use FP16 for unsupported ops
Custom ops on ANE	Custom ops cannot run on the Neural Engine	Route custom ops to GPU/CPU
Model size	Very large models (> 1 GB) may exceed ANE on-chip memory	Split into multiple models
Concurrent ANE	Only one model can use the ANE at a time	Queue models sequentially
LayerNorm on ANE	LayerNorm runs on GPU, not ANE	Use BatchNorm (ANE-accelerated) when possible

MPSGraph Limits

Limitation	Detail	Workaround
macOS version	MPSGraph requires macOS 12.3+	Use Core ML on older systems
No ANE	MPSGraph runs on GPU only, not Neural Engine	Use Core ML for ANE-eligible ops
Debugging	Limited debugging tools for MPSGraph	Use Xcode GPU profiler
Memory	MPSGraph shares GPU memory with display	Leave headroom for UI

macOS-Specific Limits

Limitation	Detail
Thermal throttling	Sustained GPU inference on M1 Pro may throttle after 30–60 seconds
Unified memory	GPU and CPU share memory; large models may compete with system memory
External GPU	eGPU support for Core ML is limited; internal GPU is preferred
iOS deployment	Models compiled on macOS can be deployed to iOS, but ANE behaviour may differ

Debugging Tips

Check compute unit assignment: Use --verbose to see which compute unit each op is assigned to.
Profile with Instruments: Use Xcode Instruments → Metal System Trace to profile GPU utilisation.
Force compute unit: If the ANE is underutilised, try --compute-unit cpu_and_ne to prevent GPU fallback.
Validate accuracy: Compare Core ML output against reference implementation:

from edgecompiler.runtime import InferenceSession, compare_outputs

session = InferenceSession("model.mlpackage", target="metal")
result = session.run({"input": test_input})  # key must match the model's input name
reference = run_reference_model(test_input)  # user-supplied: runs the original, uncompiled model

mse, max_diff = compare_outputs(result, reference)
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")

Reduce model size: If the model is too large for the ANE, try weight compression (--weight-compression int8) or model splitting.
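For the accuracy check in the tips above, the two reported metrics are simple to compute by hand if a quick standalone check is needed. The sketch below is a plain-Python stand-in that assumes both outputs are flat sequences of equal length; it is not the implementation of compare_outputs, whose internals this document does not describe.

```python
def compare_outputs_sketch(result, reference):
    """Return (mean squared error, maximum absolute difference)."""
    if len(result) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    diffs = [float(r) - float(e) for r, e in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    max_diff = max(abs(d) for d in diffs)
    return mse, max_diff


mse, max_diff = compare_outputs_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
```

MSE summarises the overall drift, while the maximum difference exposes a single badly wrong element that an average would hide, so it is worth checking both.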