Metal Backend (Apple Silicon)

This document describes how the Metal backend compiles models for Apple Silicon's GPU, Neural Engine, and CPU via Core ML and MPSGraph.


Table of Contents

  1. Core ML Model Generation
  2. MPSGraph Code Generation
  3. Apple Silicon Optimisations
  4. Neural Engine vs GPU vs CPU Selection
  5. Known Limitations

Core ML Model Generation

The Metal backend primarily targets Core ML, Apple's on-device machine learning framework. Core ML models (.mlpackage / .mlmodelc) can execute on the Neural Engine (ANE), GPU (via Metal), or CPU, with automatic scheduling by the framework.

Generation Pipeline

  Quantised IRGraph
         │
         ▼
┌────────────────────┐
│ 1. Legalise ops    │──   Map IR ops to Core ML neural network layers
│                    │──   Decompose unsupported ops into supported primitives
│                    │──   Handle layout differences (NHWC → NCHW where needed)
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 2. Assign compute  │──   Determine optimal compute unit for each op
│    units           │──   ANE for conv/pool, GPU for attention, CPU for control flow
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 3. Build Core ML   │──   Create ct.models.MLModel via coremltools
│    model spec      │──   Add neural network layers
│                    │──   Set input/output descriptions
│                    │──   Configure model metadata
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 4. Optimise for    │──   Fuse adjacent layers (Conv+BN, Conv+ReLU)
│    Apple Silicon   │──   Configure weight quantisation or palettisation
│                    │──   Set compute unit preferences
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 5. Compile and     │──   Convert to .mlpackage (directory)
│    validate        │──   Run Core ML model validation
│                    │──   Test inference with dummy input
│                    │──   Write output file
└────────┬───────────┘
         ▼
  model.mlpackage/
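
As a rough illustration of stage 1, the sketch below decomposes an op that is assumed to be unsupported into supported primitives. The toy `Op` record, the `SUPPORTED` set, and the `legalise` function are hypothetical and are not edgecompiler's actual IR or API. The rewrite relies on the identity HardSwish(x) = x · HardSigmoid(x), which holds when HardSigmoid(x) is defined as ReLU6(x + 3) / 6.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy IR node (hypothetical; not edgecompiler's real IR class)."""
    op_type: str
    name: str
    inputs: list
    outputs: list
    attributes: dict = field(default_factory=dict)


# Assumed set of op types that map directly to a Core ML layer.
SUPPORTED = {"Conv2D", "ReLU", "HardSigmoid", "Mul", "Add"}


def legalise(ops):
    """Return a new op list in which every op type is in SUPPORTED."""
    legal = []
    for op in ops:
        if op.op_type in SUPPORTED:
            legal.append(op)
        elif op.op_type == "HardSwish":
            # HardSwish(x) = x * HardSigmoid(x), so emit two supported ops.
            gate = f"{op.name}_gate"
            legal.append(Op("HardSigmoid", f"{op.name}_hsig", [op.inputs[0]], [gate]))
            legal.append(Op("Mul", f"{op.name}_mul", [op.inputs[0], gate], list(op.outputs)))
        else:
            raise NotImplementedError(f"no legalisation rule for {op.op_type}")
    return legal


graph = [
    Op("Conv2D", "conv0", ["input_0"], ["t0"]),
    Op("HardSwish", "act0", ["t0"], ["t1"]),
]
legal_graph = legalise(graph)
print([op.op_type for op in legal_graph])
```

The decomposed ops keep the original output tensor name, so downstream ops need no rewiring.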

Core ML Model Structure

model.mlpackage/
├── Data/
│   └── com.apple.CoreML/
│       ├── weights/
│       │   ├── weight_0.bin        # Conv layer weights (INT8 quantised)
│       │   ├── weight_1.bin        # Next layer weights
│       │   └── ...
│       └── metadata.json           # Weight metadata
├── Model/
│   └── model.mlmodel               # Protobuf model spec
└── com.apple.CoreML/
    └── Info.plist                  # Package metadata

Model Spec Construction

edgecompiler builds the Core ML model spec programmatically using coremltools:

import coremltools as ct
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

# Describe inputs and outputs as (name, type) feature pairs
input_features = [("input_0", datatypes.Array(1, 224, 224, 3))]
output_features = [("output_0", datatypes.Array(1, 1001))]

# Build neural network (rank-4 inputs need the rank-5 shape mapping disabled)
builder = NeuralNetworkBuilder(
    input_features,
    output_features,
    disable_rank5_shape_mapping=True,
)

# Add layers from IR
for op in ir_graph.ops:
    if op.op_type == "Conv2D":
        # weight_data / bias_data: this op's constant tensors, taken from the IR
        weight_shape = weight_data.shape
        builder.add_convolution(
            name=op.name,
            kernel_channels=weight_shape[3],
            output_channels=weight_shape[0],
            height=weight_shape[1],
            width=weight_shape[2],
            stride_height=op.attributes["stride_h"],
            stride_width=op.attributes["stride_w"],
            border_mode=op.attributes["padding"].lower(),
            groups=op.attributes.get("groups", 1),
            W=weight_data,
            b=bias_data,
            has_bias=bias_data is not None,
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    elif op.op_type == "ReLU":
        builder.add_activation(
            name=op.name,
            non_linearity="RELU",
            input_name=op.inputs[0],
            output_name=op.outputs[0],
        )
    # ... more op types

# Wrap the spec in an MLModel; compute units are chosen at construction time
model = ct.models.MLModel(
    builder.spec,
    compute_units=ct.ComputeUnit.ALL,  # ANE + GPU + CPU
)

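INT8 Quantisation and Palettisation

Core ML supports two kinds of weight compression: linear quantisation, which stores weights as INT8 values plus a scale, and palettisation, which replaces each weight with an index into a small lookup table (a "palette"):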

# Note: coremltools.optimize.coreml operates on ML Program (mlprogram) models
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    linear_quantize_weights,
)

# INT8 linear quantisation
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=2048,  # Only quantise tensors with more than 2048 elements
    )
)
quantized_model = linear_quantize_weights(model, config=config)

# Or: 4-bit palettisation (even more compression), applied to the original model
config = OptimizationConfig(
    global_config=OpPalettizerConfig(
        nbits=4,
        mode="kmeans",
    )
)
palettized_model = palettize_weights(model, config=config)
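
To make the difference between the two schemes concrete, here is a minimal pure-Python sketch of what each one does to a single weight tensor. It is a conceptual illustration only, assuming a per-tensor scale and a plain 1-D k-means; it is not how coremltools implements either transform, and the function names are made up for this example.

```python
def quantize_symmetric_int8(weights):
    """Symmetric linear quantisation: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [scale * v for v in q]


def nearest(palette, w):
    """Index of the palette entry closest to w."""
    return min(range(len(palette)), key=lambda c: abs(w - palette[c]))


def palettize(weights, nbits, iterations=10):
    """1-D k-means with 2**nbits centroids (nbits >= 1). Returns (palette, indices)."""
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    # Start with centroids spread evenly over the weight range.
    palette = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        indices = [nearest(palette, w) for w in weights]
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:  # leave empty clusters where they are
                palette[c] = sum(members) / len(members)
    return palette, [nearest(palette, w) for w in weights]


weights = [-1.0, -0.9, 0.0, 0.1, 0.9, 1.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
palette, indices = palettize(weights, nbits=2)
print("int8 codes:", q)
print("palette:", palette, "indices:", indices)
```

Linear quantisation keeps one code per weight on a uniform grid, while palettisation lets the grid points move to where the weights cluster, which is why it can use fewer bits per weight.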

MPSGraph Code Generation

For operations that Core ML's neural network layer API does not support (e.g., custom attention mechanisms, dynamic control flow), edgecompiler generates MPSGraph code that executes directly on the GPU through Metal.

When MPSGraph is Used

MPSGraph is used as a supplementary code path when:

  1. Custom ops: The IR contains ops not mappable to Core ML layers.
  2. Dynamic shapes: Ops that produce dynamically-shaped outputs (e.g., NonMaxSuppression).
  3. Custom shaders: The user provides custom Metal compute shaders.
  4. MPS-accelerated ops: Ops where MPS provides a faster implementation than Core ML (e.g., batched matrix multiplication on GPU).

MPSGraph Integration Architecture

┌───────────────────────────────────────────────────────────┐
│                   Core ML Model (.mlpackage)              │
│                                                           │
│  ┌─────────────────┐     ┌──────────────────────────────┐ │
│  │ Core ML Layers  │     │ Custom Model (MPSGraph)      │ │
│  │                 │     │                              │ │
│  │ Conv2D          │     │  ┌─────────────────────────┐ │ │
│  │ BatchNorm       │     │  │ MPSGraph executable     │ │ │
│  │ ReLU            │     │  │                         │ │ │
│  │ MaxPool         │     │  │  ├── MPSGraphMatMul     │ │ │
│  │ ...             │     │  │  ├── MPSGraphSoftmax    │ │ │
│  │                 │     │  │  ├── Custom MTLFunction │ │ │
│  │ Compute: ANE/GPU│     │  │  └── ...                │ │ │
│  │                 │     │  │                         │ │ │
│  └─────────────────┘     │  │  Compute: GPU only      │ │ │
│                          │  └─────────────────────────┘ │ │
│                          └──────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

Generating MPSGraph Code

from edgecompiler.backends.metal.mpsgraph_builder import MPSGraphBuilder

builder = MPSGraphBuilder()

# Map IR ops to MPSGraph operations
for op in ir_graph.ops:
    if op.op_type == "MatMul":
        builder.add_matmul(
            name=op.name,
            input_a=op.inputs[0],
            input_b=op.inputs[1],
            output=op.outputs[0],
        )
    elif op.op_type == "Softmax":
        builder.add_softmax(
            name=op.name,
            input=op.inputs[0],
            output=op.outputs[0],
            axis=op.attributes.get("axis", -1),
        )
    elif op.op_type == "Custom":
        # Load user-provided Metal shader
        builder.add_custom_shader(
            name=op.name,
            source_path=op.attributes["shader_path"],
            inputs=op.inputs,
            outputs=op.outputs,
            threadgroup_size=op.attributes.get("threadgroup_size", [32, 1, 1]),
        )

# Generate the MPSGraph executable
graph_executable = builder.compile()

Custom Metal Shaders

Users can provide custom Metal compute shaders for specialised operations:

// custom_attention.metal

#include <metal_stdlib>
using namespace metal;

// Single-head scaled dot-product attention.
// Each thread computes one output element (query row q, feature d).
kernel void scaled_dot_product_attention(
    device const float* queries [[buffer(0)]],
    device const float* keys    [[buffer(1)]],
    device const float* values  [[buffer(2)]],
    device float* output        [[buffer(3)]],
    constant int& seq_len       [[buffer(4)]],
    constant int& head_dim      [[buffer(5)]],
    uint2 gid                   [[thread_position_in_grid]]
) {
    int q = gid.x;
    int d = gid.y;
    if (q >= seq_len || d >= head_dim) return;

    float scale = rsqrt(float(head_dim));
    float sum = 0.0f;
    float acc = 0.0f;  // Accumulate locally; the output buffer may be uninitialised

    for (int k = 0; k < seq_len; k++) {
        float dot_qk = 0.0f;
        for (int i = 0; i < head_dim; i++) {
            dot_qk += queries[q * head_dim + i] * keys[k * head_dim + i];
        }
        // Note: no max-subtraction here, so exp() can overflow for large scores
        float weight = exp(dot_qk * scale);
        sum += weight;
        acc += weight * values[k * head_dim + d];
    }

    output[q * head_dim + d] = acc / sum;
}
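
When writing a custom shader like this, it can help to have a CPU reference to compare against. The following is a minimal pure-Python sketch of single-head scaled dot-product attention over row-major nested lists; the function name is made up for this example, and unlike the shader it subtracts the row maximum before exponentiating for numerical stability.

```python
import math


def attention_reference(queries, keys, values):
    """Single-head scaled dot-product attention on nested lists [seq_len][dim]."""
    seq_len, head_dim = len(queries), len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for q in range(seq_len):
        # Scaled similarity between query row q and every key row.
        scores = [
            scale * sum(queries[q][i] * keys[k][i] for i in range(head_dim))
            for k in range(seq_len)
        ]
        top = max(scores)  # subtract the max so exp() cannot overflow
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the value rows.
        out.append([
            sum(w * values[k][d] for k, w in enumerate(weights)) / total
            for d in range(len(values[0]))
        ])
    return out


# With all-zero queries every key gets equal weight, so each output row
# is the mean of the value rows.
uniform = attention_reference(
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0]],
)
print(uniform)
```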

Apple Silicon Optimisations

edgecompiler applies several optimisations specific to Apple Silicon hardware.

Memory Layout Optimisations

Standard NHWC (TFLite)          →    Optimised for Apple Silicon

┌───────────────────────────┐       ┌─────────────────────────────┐
│ Tensor: [1, 224, 224, 3]  │       │ Tensor: [1, 3, 224, 224]    │
│ Layout: NHWC              │       │ Layout: NCHW (channel-first)│
│                           │       │                             │
│ ANE prefers NCHW for      │       │ Cache-friendly access       │
│ convolution operations    │       │ patterns on the ANE         │
└───────────────────────────┘       └─────────────────────────────┘

The Metal backend automatically selects the optimal memory layout for each operation based on the target compute unit:

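| Compute Unit        | Preferred Layout | Notes                                |
|---------------------|------------------|--------------------------------------|
| Neural Engine (ANE) | NCHW             | Hardware-optimised for channel-first |
| GPU (Metal)         | NCHW             | Better texture cache utilisation     |
| CPU                 | NHWC             | Standard for TFLite compatibility    |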
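
The layout change itself is just an index permutation. As a minimal sketch (the helper names are illustrative, not part of edgecompiler), the functions below convert a shape and a flat row-major buffer from NHWC to NCHW:

```python
def nhwc_to_nchw_shape(shape):
    """Permute an (N, H, W, C) shape into (N, C, H, W)."""
    n, h, w, c = shape
    return (n, c, h, w)


def nhwc_to_nchw(flat, shape):
    """Reorder a flat row-major NHWC buffer into NCHW order."""
    n, h, w, c = shape
    out = [0] * len(flat)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = flat[src]
    return out


# A 2x2 image with 3 channels: values 0..11 stored pixel by pixel.
converted = nhwc_to_nchw(list(range(12)), (1, 2, 2, 3))
print(nhwc_to_nchw_shape((1, 224, 224, 3)))
print(converted)
```

After conversion, all values of channel 0 are contiguous, followed by channel 1, and so on, which is what channel-first hardware reads most efficiently.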

Layer Fusion

The backend fuses common layer patterns for reduced memory bandwidth:

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐  ┌─────┐    ┌──────────────────────┐
│ Conv2D   │→ │ BatchNorm│→ │ ReLU│    │ FusedConvBNReLU      │
│          │  │          │  │     │    │  (single ANE op)     │
└──────────┘  └──────────┘  └─────┘    └──────────────────────┘

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐             ┌──────────────────────┐
│ Conv2D   │→ │ BiasAdd  │             │ Conv2D with bias     │
│          │  │          │             │  (bias becomes b)    │
└──────────┘  └──────────┘             └──────────────────────┘

Before fusion:                          After fusion:
┌──────────┐  ┌──────────┐  ┌──────┐   ┌──────────────────────┐
│ Conv2D   │→ │ BatchNorm│→ │ ReLU6│   │ FusedConvBNReLU6     │
│          │  │          │  │ clip │   │  (clamp output [0,6])│
└──────────┘  └──────────┘  └──────┘   └──────────────────────┘
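
The Conv+BN fusion works because inference-time BatchNorm is a per-channel affine transform, so it can be folded into the convolution's weights and bias ahead of time. The sketch below shows the standard folding arithmetic in plain Python; the function and variable names are illustrative and the weights are flattened per output channel for brevity. A trailing ReLU or ReLU6 is then applied as a clamp on the fused output.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BatchNorm parameters into conv weights and bias.

    weights[o] is the flat list of kernel values for output channel o.
    """
    folded_w, folded_b = [], []
    for o in range(len(weights)):
        s = gamma[o] / math.sqrt(var[o] + eps)  # per-channel scale
        folded_w.append([w * s for w in weights[o]])
        folded_b.append((bias[o] - mean[o]) * s + beta[o])
    return folded_w, folded_b


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


weights = [[1.0, 2.0], [0.5, -1.0]]
bias = [0.1, -0.2]
gamma, beta = [2.0, 0.5], [0.3, 0.0]
mean, var = [1.0, -1.0], [4.0, 0.25]
x = [3.0, -2.0]  # one input patch, flattened to match the kernels

# Unfused: convolution followed by BatchNorm.
unfused = [
    gamma[o] * (dot(weights[o], x) + bias[o] - mean[o]) / math.sqrt(var[o] + 1e-5) + beta[o]
    for o in range(2)
]

# Fused: a single convolution with folded parameters.
fw, fb = fold_batchnorm(weights, bias, gamma, beta, mean, var)
fused = [dot(fw[o], x) + fb[o] for o in range(2)]
print(unfused, fused)
```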

Weight Compression

Apple Silicon supports several weight compression strategies:

| Strategy              | Compression | Accuracy Impact      | Speed              |
|-----------------------|-------------|----------------------|--------------------|
| FP32 → FP16           |             | Negligible           | Same or faster     |
| FP16 → INT8 (palette) | 2–4×        | Small (< 1% typical) | Same on ANE        |
| FP16 → 4-bit palette  |             | Moderate (1–3%)      | Same on ANE        |
| FP16 → 2-bit palette  |             | Significant (> 5%)   | Same on ANE        |
| Structured sparsity   |             | Small                | Potentially faster |

# Apply weight compression
from edgecompiler import compile

result = compile(
    "model.pt",
    target="metal",
    quantize="ptq",
    weight_compression="int8",    # "fp16", "int8", "4bit", "2bit"
    output="model.mlpackage",
)

Buffer Pooling

For inference workloads, edgecompiler's runtime implements buffer pooling to minimise memory allocation overhead:

┌──────────────────────────────────────────────────┐
│              Buffer Pool                         │
│                                                  │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐             │
│  │ 4 KB │ │ 16KB │ │ 64KB │ │256KB │  ...        │
│  └──────┘ └──────┘ └──────┘ └──────┘             │
│                                                  │
│  Buffers are reused across inference calls.      │
│  Allocations are rounded up to a power of 2.     │
│  Pool grows dynamically up to a configurable max.│
└──────────────────────────────────────────────────┘
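
The pooling idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not edgecompiler's runtime: `BufferPool`, `max_bytes`, and the use of `bytearray` in place of GPU buffers are all hypothetical stand-ins.

```python
class BufferPool:
    """Toy size-class buffer pool (hypothetical; not edgecompiler's runtime API)."""

    def __init__(self, max_bytes=64 * 1024 * 1024):
        self.max_bytes = max_bytes  # cap on bytes held in the free lists
        self.pooled_bytes = 0
        self.free = {}              # size class -> list of idle buffers

    @staticmethod
    def size_class(nbytes):
        """Round a request up to the next power of two (minimum 1 byte)."""
        return 1 << max(nbytes - 1, 0).bit_length()

    def acquire(self, nbytes):
        size = self.size_class(nbytes)
        idle = self.free.get(size)
        if idle:
            self.pooled_bytes -= size
            return idle.pop()       # reuse an idle buffer
        return bytearray(size)      # otherwise allocate a new one

    def release(self, buf):
        size = len(buf)
        if self.pooled_bytes + size <= self.max_bytes:
            self.free.setdefault(size, []).append(buf)
            self.pooled_bytes += size
        # else: drop the buffer and let the allocator reclaim it


pool = BufferPool()
first = pool.acquire(3000)   # rounded up to a 4096-byte buffer
pool.release(first)
second = pool.acquire(4000)  # same size class, so the buffer is reused
print(len(first), second is first)
```

Rounding requests up to a shared size class is what makes reuse likely: two inference calls with slightly different tensor sizes still land in the same bucket.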

Neural Engine vs GPU vs CPU Selection

Apple Silicon has three compute units for ML inference. edgecompiler configures Core ML's compute unit selection to maximise performance.

Compute Unit Characteristics

| Property       | Neural Engine (ANE)   | GPU (Metal)        | CPU              |
|----------------|-----------------------|--------------------|------------------|
| Latency        | Very low (~0.1 ms/op) | Low (~0.3 ms/op)   | High (~1 ms/op)  |
| Throughput     | Very high             | High               | Moderate         |
| Power          | Very low              | Moderate           | High             |
| Precision      | FP16, INT8            | FP32, FP16, INT8   | FP32, FP16, INT8 |
| Dynamic shapes | ❌ Not supported      | ✅ Supported       | ✅ Supported     |
| Custom ops     | ❌ Not supported      | ✅ Via MPSGraph    | ✅ Any           |
| Concurrent     | ❌ Single model       | ✅ Multiple models | ✅ Multiple      |

Selection Strategy

                    ┌──────────────────┐
                    │   Operation?     │
                    └────────┬─────────┘
                    ┌────────▼─────────┐
                    │ Dynamic shapes?  │──────── Yes ────▶ CPU
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ Custom op?       │──────── Yes ────▶ GPU (MPSGraph)
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ In ANE support   │
                    │ list?            │──────── Yes ────▶ ANE
                    └────────┬─────────┘
                             │ No
                    ┌────────▼─────────┐
                    │ GPU-compatible?  │──────── Yes ────▶ GPU
                    └────────┬─────────┘
                             │ No
                           CPU
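
The decision tree above translates directly into a short function. The sketch below mirrors the flowchart's order of checks; the function name, its parameters, and the two abbreviated op sets (drawn from the lists in this section) are assumptions for illustration, not edgecompiler's actual implementation.

```python
# Abbreviated, illustrative subsets of the op lists in this section.
ANE_SUPPORTED = {"Conv2D", "DepthwiseConv2D", "FullyConnected", "MaxPool2D",
                 "ReLU", "Softmax", "Add"}
GPU_COMPATIBLE = {"MatMul", "LayerNorm", "Einsum", "GatherND"}


def select_compute_unit(op_type, has_dynamic_shape=False, is_custom=False):
    """Walk the flowchart top to bottom and return the chosen compute unit."""
    if has_dynamic_shape:
        return "CPU"
    if is_custom:
        return "GPU (MPSGraph)"
    if op_type in ANE_SUPPORTED:
        return "ANE"
    if op_type in GPU_COMPATIBLE:
        return "GPU"
    return "CPU"


for op_type in ("Conv2D", "MatMul", "TopK"):
    print(op_type, "->", select_compute_unit(op_type))
```

Because the checks run in order, an ANE-eligible op with a dynamic shape still falls through the first branch and never reaches the ANE test.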

ANE-Supported Operations

The Apple Neural Engine efficiently accelerates these operations:

  • Conv2D (standard, depthwise, dilated)
  • DepthwiseConv2D
  • FullyConnected (InnerProduct)
  • MaxPool2D, AveragePool2D
  • ReLU, ReLU6, LeakyReLU
  • Sigmoid, Tanh, HardSigmoid
  • Softmax
  • Add, Sub, Mul, Div (element-wise)
  • Concatenation
  • ResizeBilinear, ResizeNearestNeighbor
  • Pad (constant, reflect, replicate)
  • BatchNorm (fused into Conv2D)
  • ReduceMean, ReduceMax, ReduceMin, ReduceSum

GPU-Preferred Operations

These operations run on the GPU when the ANE is not suitable:

  • MatMul (large matrices)
  • Attention / ScaledDotProductAttention
  • LayerNorm
  • LSTM, GRU (recurrent layers)
  • Einsum
  • ScatterND, GatherND
  • Custom Metal shaders
  • Operations with dynamic shapes

Configuration Options

from edgecompiler import compile

# Let Core ML auto-select (recommended)
result = compile("model.pt", target="metal",
                 compute_unit="auto")  # Default

# Force all ops to ANE (fails if unsupported ops exist)
result = compile("model.pt", target="metal",
                 compute_unit="ane_only")

# Force all ops to GPU
result = compile("model.pt", target="metal",
                 compute_unit="gpu_only")

# Use CPU + ANE only (no GPU)
result = compile("model.pt", target="metal",
                 compute_unit="cpu_and_ne")

# Use all available units
result = compile("model.pt", target="metal",
                 compute_unit="all")

Compute Unit Assignment Example

For a MobileNetV2 model:

┌───────────────────────────────────────────────────────────┐
│                  MobileNetV2 Compute Plan                 │
│                                                           │
│  Op 0:  Conv2D (stem)           → ANE  (0.05 ms)          │
│  Op 1:  BatchNorm               → FUSED into Op 0         │
│  Op 2:  ReLU6                   → FUSED into Op 0         │
│  Op 3:  DepthwiseConv2D         → ANE  (0.08 ms)          │
│  Op 4:  BatchNorm + ReLU6       → FUSED into Op 3         │
│  Op 5:  Conv2D (expand)         → ANE  (0.06 ms)          │
│  Op 6:  BatchNorm + ReLU        → FUSED into Op 5         │
│  ...                                                      │
│  Op 140: Conv2D (final)         → ANE  (0.04 ms)          │
│  Op 141: GlobalAvgPool          → ANE  (0.01 ms)          │
│  Op 142: FullyConnected         → ANE  (0.03 ms)          │
│  Op 143: Softmax                → ANE  (0.01 ms)          │
│                                                           │
│  Total estimated latency: ~1.2 ms                         │
│  Compute unit breakdown: ANE 100%, GPU 0%, CPU 0%         │
└───────────────────────────────────────────────────────────┘

For a Transformer model:

┌───────────────────────────────────────────────────────────┐
│                  Transformer Compute Plan                 │
│                                                           │
│  Op 0:  MatMul (Q projection)  → GPU  (0.12 ms)           │
│  Op 1:  MatMul (K projection)  → GPU  (0.12 ms)           │
│  Op 2:  MatMul (V projection)  → GPU  (0.12 ms)           │
│  Op 3:  ScaledDotProductAtt    → GPU  (0.15 ms)           │
│  Op 4:  MatMul (output proj)   → GPU  (0.12 ms)           │
│  Op 5:  LayerNorm              → GPU  (0.05 ms)           │
│  Op 6:  Add (residual)         → ANE  (0.01 ms)           │
│  Op 7:  FullyConnected (FFN)   → GPU  (0.18 ms)           │
│  Op 8:  FullyConnected (FFN)   → GPU  (0.18 ms)           │
│  Op 9:  LayerNorm              → GPU  (0.05 ms)           │
│  Op 10: Add (residual)         → ANE  (0.01 ms)           │
│  ...                                                      │
│                                                           │
│  Total estimated latency: ~4.8 ms                         │
│  Compute unit breakdown: ANE 15%, GPU 80%, CPU 5%         │
└───────────────────────────────────────────────────────────┘

Known Limitations

Core ML / Neural Engine Limits

| Limitation         | Detail                                                        | Workaround                                   |
|--------------------|---------------------------------------------------------------|----------------------------------------------|
| ANE dynamic shapes | The Neural Engine does not support dynamically-shaped tensors | Pre-define all shapes at compile time        |
| ANE INT8 support   | ANE supports INT8 for some ops but not all                    | Use FP16 for unsupported ops                 |
| Custom ops on ANE  | Custom ops cannot run on the Neural Engine                    | Route custom ops to GPU/CPU                  |
| Model size         | Very large models (> 1 GB) may exceed ANE on-chip memory      | Split into multiple models                   |
| Concurrent ANE     | Only one model can use the ANE at a time                      | Queue models sequentially                    |
| LayerNorm on ANE   | LayerNorm runs on GPU, not ANE                                | Use BatchNorm (ANE-accelerated) when possible |

MPSGraph Limits

| Limitation    | Detail                                       | Workaround                      |
|---------------|----------------------------------------------|---------------------------------|
| macOS version | MPSGraph requires macOS 12.3+                | Use Core ML on older systems    |
| No ANE        | MPSGraph runs on GPU only, not Neural Engine | Use Core ML for ANE-eligible ops |
| Debugging     | Limited debugging tools for MPSGraph         | Use Xcode GPU profiler          |
| Memory        | MPSGraph shares GPU memory with display      | Leave headroom for UI           |

macOS-Specific Limits

| Limitation         | Detail                                                                      |
|--------------------|-----------------------------------------------------------------------------|
| Thermal throttling | Sustained GPU inference on M1 Pro may throttle after 30–60 seconds          |
| Unified memory     | GPU and CPU share memory; large models may compete with system memory       |
| External GPU       | eGPU support for Core ML is limited; internal GPU is preferred              |
| iOS deployment     | Models compiled on macOS can be deployed to iOS, but ANE behaviour may differ |

Debugging Tips

  1. Check compute unit assignment: Use --verbose to see which compute unit each op is assigned to.
  2. Profile with Instruments: Use Xcode Instruments → Metal System Trace to profile GPU utilisation.
  3. Force compute unit: If the ANE is underutilised, try --compute-unit cpu_and_ne to prevent GPU fallback.
  4. Validate accuracy: Compare the Core ML output against a reference implementation:

from edgecompiler.runtime import InferenceSession, compare_outputs

session = InferenceSession("model.mlpackage", target="metal")
result = session.run({"input": test_input})
# run_reference_model is a placeholder for the original framework's forward pass
reference = run_reference_model(test_input)

mse, max_diff = compare_outputs(result, reference)
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")

  5. Reduce model size: If the model is too large for the ANE, try weight compression (--weight-compression int8) or model splitting.
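
For the accuracy check in the tips above, the two reported numbers are simple to compute by hand. The sketch below assumes, based only on the variable names, that the comparison reports the mean squared error and the largest absolute element-wise difference; `mse_and_max_diff` is a hypothetical standalone helper over flat lists, not edgecompiler's `compare_outputs`.

```python
def mse_and_max_diff(result, reference):
    """Return (mean squared error, largest absolute difference) of two flat lists."""
    if len(result) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    diffs = [abs(a - b) for a, b in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    return mse, max(diffs)


mse, max_diff = mse_and_max_diff([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
```

A low MSE with a high maximum difference usually points to a few outlier elements rather than a uniformly degraded output, so it is worth looking at both.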