
Coral Backend (Edge TPU)

This document describes how the Coral backend compiles models for Google's Edge TPU USB Accelerator on macOS ARM64.


Table of Contents

  1. Edge TPU Binary Format
  2. TFLite FlatBuffer Construction
  3. Custom Op Embedding
  4. Parameter Data Serialisation
  5. Operation Partitioning Strategy
  6. Fallback to edgetpu_compiler
  7. Known Limitations

Edge TPU Binary Format

The Edge TPU does not execute standard TFLite models. Instead, it requires a specially crafted TFLite FlatBuffer that embeds compiled binary segments for the Edge TPU coprocessor alongside a standard TFLite graph for CPU fallback.

Binary Layout

┌──────────────────────────────────────────────────────────┐
│                    TFLite FlatBuffer                     │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Model Header                                      │  │
│  │  ├─ version: 3                                     │  │
│  │  ├─ description: "Edge TPU compiled model"         │  │
│  │  └─ metadata_buffer[]                              │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Subgraph 0 (Primary)                              │  │
│  │  ├─ tensors[]                                      │  │
│  │  │   ├─ input_0 (UINT8, [1, 224, 224, 3])          │  │
│  │  │   ├─ Conv2D_weight (INT8, [32, 3, 3, 3])        │  │
│  │  │   ├─ Conv2D_bias (INT32, [32])                  │  │
│  │  │   ├─ ... (intermediate tensors)                 │  │
│  │  │   └─ output_0 (UINT8, [1, 1001])                │  │
│  │  ├─ operators[]                                    │  │
│  │  │   ├─ Op 0: edgetpu-custom-op (Conv2D)           │  │
│  │  │   │   └─ builtin_code: CUSTOM                   │  │
│  │  │   │   └─ custom_code: "edgetpu-custom-op"       │  │
│  │  │   ├─ Op 1: edgetpu-custom-op (DepthwiseConv2D)  │  │
│  │  │   ├─ ...                                        │  │
│  │  │   └─ Op N: SOFTMAX (CPU fallback)               │  │
│  │  └─ inputs[], outputs[]                            │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Custom Op Metadata                                │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │  Edge TPU Binary Segments                    │  │  │
│  │  │  ├─ Segment 0: [offset, size]                │  │  │
│  │  │  │   └─ Compiled instructions for ops 0..k   │  │  │
│  │  │  ├─ Segment 1: [offset, size]                │  │  │
│  │  │  │   └─ Compiled instructions for ops k+1..m │  │  │
│  │  │  └─ ...                                      │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │  Parameter Data                              │  │  │
│  │  │  ├─ Quantised weights (aligned)              │  │  │
│  │  │  ├─ Bias values (INT32, aligned)             │  │  │
│  │  │  └─ Op configuration records                 │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Signature Defs (optional)                         │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Key Concepts

  1. Custom ops: All Edge TPU-compiled operations are represented as CUSTOM ops with the custom code "edgetpu-custom-op". The actual operation type (Conv2D, DepthwiseConv2D, etc.) is encoded in the custom op's binary metadata.

  2. Binary segments: The Edge TPU coprocessor executes pre-compiled binary instructions. These segments contain the microcode that tells the TPU how to execute each operation, including data routing, on-chip memory allocation, and computation scheduling.

  3. Parameter data: Weights and biases are stored in a separate data section within the FlatBuffer, aligned to the Edge TPU's DMA requirements (64-byte alignment for weights, 4-byte alignment for biases).

  4. Metadata buffer: The FlatBuffer's metadata_buffer field contains a serialised edgetpu_cache_entry that maps custom ops to their corresponding binary segments.


TFLite FlatBuffer Construction

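edgecompiler constructs the TFLite FlatBuffer from scratch using the flatbuffers Python library, so the default compilation path does not depend on the edgetpu_compiler binary. The binary remains usable as an optional fallback (see Fallback to edgetpu_compiler).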

Construction Pipeline

  Quantised IRGraph
         │
         ▼
┌────────────────────┐
│ 1. Legalise ops    │─── Map IR ops to TFLite builtin op codes
│                    │─── Replace unsupported ops with alternatives
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 2. Allocate tensors│─── Assign tensor indices
│                    │─── Compute buffer sizes
│                    │─── Determine buffer types (UINT8, INT32, etc.)
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 3. Build operators │─── Create TFLite Operator table for each IR op
│                    │─── Set opcode_index, inputs, outputs
│                    │─── Encode op-specific options (Conv2DOptions, etc.)
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 4. Embed Edge TPU  │─── Replace supported ops with custom ops
│    custom ops      │─── Generate binary segments for each op cluster
│                    │─── Build edgetpu_cache_entry metadata
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 5. Serialise       │─── Write FlatBuffer to bytes
│    FlatBuffer      │─── Verify with TFLite schema
│                    │─── Write .tflite file
└────────┬───────────┘
         ▼
  model_coral.tflite

FlatBuffer Schema Usage

edgecompiler uses the official TFLite FlatBuffer schema (schema.fbs) to construct valid models. The schema is bundled with the package.

from edgecompiler.backends.coral.flatbuffer_builder import TFLiteBuilder

builder = TFLiteBuilder()

# Add opcode entries
builder.add_opcode("CONV_2D", builtin=True)
builder.add_opcode("DEPTHWISE_CONV_2D", builtin=True)
builder.add_opcode("edgetpu-custom-op", builtin=False)

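# Add tensors. quantised_weights holds the INT8 weight data produced by
# quantisation; the filter shape is [out_channels, kernel_h, kernel_w, in_channels].
builder.add_tensor("input_0", shape=[1, 224, 224, 3], dtype="UINT8")
builder.add_tensor("Conv2D_weight", shape=[32, 3, 3, 3], dtype="INT8",
                   data=quantised_weights)

# Add operators. Input indices 0-2 are the input, weight, and bias tensors and
# output index 3 is the Conv2D output; the bias and output tensors are added
# with add_tensor() in the same way (omitted here).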
builder.add_operator(opcode_idx=2,  # edgetpu-custom-op
                     inputs=[0, 1, 2],
                     outputs=[3],
                     options=Conv2DOptions(pad="SAME", stride_h=1, stride_w=1))

# Build the model
flatbuffer_bytes = builder.build()

Validation

After construction, the FlatBuffer is validated against several criteria:

  1. Schema validity: The FlatBuffer is parseable by the TFLite schema.
  2. Tensor consistency: All operator input/output indices reference valid tensors.
  3. Quantisation consistency: Quantisation parameters are present on all INT8/UINT8 tensors used by custom ops.
  4. Edge TPU compatibility: All custom ops have corresponding binary segments.
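
The tensor-consistency and quantisation-consistency checks can be illustrated without the real FlatBuffer classes. The sketch below is only an approximation of the idea: the dictionary layout, the `check_model` function, and its field names are hypothetical stand-ins, not edgecompiler's internal representation.

```python
from typing import Dict, List

QUANTISED_DTYPES = {"INT8", "UINT8"}
CUSTOM_CODE = "edgetpu-custom-op"


def check_model(tensors: List[Dict], operators: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    for op_idx, op in enumerate(operators):
        for idx in op["inputs"] + op["outputs"]:
            # Tensor consistency: every index must point into tensors[].
            if not 0 <= idx < len(tensors):
                problems.append(f"op {op_idx}: tensor index {idx} out of range")
                continue
            tensor = tensors[idx]
            # Quantisation consistency: enforced only for tensors used by custom ops.
            if (op["code"] == CUSTOM_CODE
                    and tensor["dtype"] in QUANTISED_DTYPES
                    and tensor.get("quant") is None):
                problems.append(
                    f"op {op_idx}: tensor '{tensor['name']}' lacks quantisation parameters"
                )
    return problems


# Hypothetical model: the weight tensor has no (scale, zero_point) pair and
# the operator references a tensor index that does not exist.
tensors = [
    {"name": "input_0", "dtype": "UINT8", "quant": (0.0078, 128)},
    {"name": "Conv2D_weight", "dtype": "INT8", "quant": None},
    {"name": "Conv2D_bias", "dtype": "INT32", "quant": None},
    {"name": "Conv2D_output", "dtype": "UINT8", "quant": (0.02, 0)},
]
operators = [{"code": CUSTOM_CODE, "inputs": [0, 1, 2], "outputs": [3, 7]}]

for problem in check_model(tensors, operators):
    print(problem)
```

Collecting every problem in a list, rather than raising on the first one, lets a single validation run report all inconsistencies at once.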

Custom Op Embedding

How Custom Ops Work

The Edge TPU runtime identifies compiled operations through the "edgetpu-custom-op" custom op code. When the TFLite interpreter encounters this op, it delegates execution to the Edge TPU via libedgetpu.

┌───────────────────────────────────────────────────┐
│              TFLite Interpreter                   │
│                                                   │
│  Op 0: edgetpu-custom-op  ──┐                     │
│  Op 1: edgetpu-custom-op  ──┤                     │
│  Op 2: edgetpu-custom-op  ──┼──▶ Edge TPU Delegate│
│  Op 3: SOFTMAX            ──┼──▶ CPU execution    │
│  Op 4: edgetpu-custom-op  ──┘                     │
│                                                   │
└───────────────────────────────────────────────────┘

Custom Op Binary Format

Each custom op's binary data is structured as:

┌─────────────────────────────────────┐
│      Custom Op Binary Record        │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Op Header                    │  │
│  │  ├─ op_type: uint32           │  │
│  │  │   (0x01=Conv2D,            │  │
│  │  │    0x02=DepthwiseConv2D,   │  │
│  │  │    0x03=FullyConnected,    │  │
│  │  │    0x04=MaxPool2D,         │  │
│  │  │    0x05=AveragePool2D,     │  │
│  │  │    0x06=Add,               │  │
│  │  │    ...)                    │  │
│  │  ├─ input_count: uint32       │  │
│  │  ├─ output_count: uint32      │  │
│  │  ├─ flags: uint32             │  │
│  │  └─ reserved: uint32[4]       │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Input Descriptors            │  │
│  │  ├─ tensor_index: uint32      │  │
│  │  ├─ on_chip_address: uint32   │  │
│  │  ├─ size: uint32              │  │
│  │  └─ quant: (scale, zp)        │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Output Descriptors           │  │
│  │  ├─ tensor_index: uint32      │  │
│  │  ├─ on_chip_address: uint32   │  │
│  │  ├─ size: uint32              │  │
│  │  └─ quant: (scale, zp)        │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Op-Specific Parameters       │  │
│  │  (Conv2D: kernel_h, kernel_w, │  │
│  │   stride_h, stride_w, pad_h,  │  │
│  │   pad_w, dilation_h,          │  │
│  │   dilation_w,                 │  │
│  │   depth_multiplier, ...)      │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
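
The Op Header in the diagram consists of eight uint32 fields (op_type, input_count, output_count, flags, and four reserved words), which is 32 bytes. A minimal sketch of packing and unpacking such a header with Python's `struct` module follows. The little-endian byte order and the helper names `pack_op_header` and `unpack_op_header` are assumptions made for illustration; the format description above does not state the byte order.

```python
import struct

# Eight uint32 fields: op_type, input_count, output_count, flags, reserved[4].
# "<" (little-endian) is an assumption, not taken from the format description.
OP_HEADER = struct.Struct("<8I")

# Op type codes as listed in the Op Header diagram.
OP_TYPES = {
    "Conv2D": 0x01,
    "DepthwiseConv2D": 0x02,
    "FullyConnected": 0x03,
    "MaxPool2D": 0x04,
    "AveragePool2D": 0x05,
    "Add": 0x06,
}


def pack_op_header(op_name: str, input_count: int, output_count: int, flags: int = 0) -> bytes:
    """Serialise the header; the four reserved words are written as zero."""
    return OP_HEADER.pack(OP_TYPES[op_name], input_count, output_count, flags, 0, 0, 0, 0)


def unpack_op_header(data: bytes) -> dict:
    """Parse a header from the start of a record."""
    op_type, input_count, output_count, flags, *reserved = OP_HEADER.unpack_from(data)
    return {
        "op_type": op_type,
        "input_count": input_count,
        "output_count": output_count,
        "flags": flags,
        "reserved": reserved,
    }


header = pack_op_header("Conv2D", input_count=3, output_count=1)
print(len(header), unpack_op_header(header))
```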

Building Custom Op Records

edgecompiler builds these binary records from the IR operation attributes:

from edgecompiler.backends.coral.custom_ops import CustomOpBuilder

builder = CustomOpBuilder()

# Map an IR operation to a custom op binary record.
# graph is the quantised IRGraph and op is the Conv2D operation being lowered.
record = builder.build_conv2d(
    input_tensor=graph.get_tensor("input_0"),
    weight_tensor=graph.get_tensor("Conv2D_weight"),
    bias_tensor=graph.get_tensor("Conv2D_bias"),
    output_tensor=graph.get_tensor("Conv2D_output"),
    stride_h=op.attributes["stride_h"],
    stride_w=op.attributes["stride_w"],
    padding=op.attributes["padding"],       # "SAME" or "VALID"
    dilation_h=op.attributes.get("dilation_h", 1),
    dilation_w=op.attributes.get("dilation_w", 1),
)

# Record contains:
# record.op_type = 0x01
# record.binary_data = bytes(...)  # Serialised record
# record.parameter_references = [...]  # Weight/bias locations

Parameter Data Serialisation

Alignment Requirements

The Edge TPU requires parameter data to be aligned to specific boundaries for DMA transfers:

| Data Type | Alignment | Notes |
| --- | --- | --- |
| Weights (INT8) | 64 bytes | Must be contiguous per layer |
| Biases (INT32) | 4 bytes | Accumulator values |
| Per-channel scales | 4 bytes | Float32, per output channel |
| Per-channel zero points | 4 bytes | Int32, per output channel |

Serialisation Process

  IR Tensors (Quantised)
         │
         ▼
┌────────────────────┐
│ 1. Extract weights │─── Get constant tensor data from IR
│    and biases      │─── Verify INT8 / INT32 types
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 2. Align data      │─── Pad to 64-byte boundary (weights)
│                    │─── Pad to 4-byte boundary (biases)
│                    │─── Record offsets for reference
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 3. Pack into       │─── Concatenate all parameter data
│    data section    │─── Weights first, then biases, then scales
│                    │─── Build offset table for custom op records
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 4. Embed in        │─── Write parameter data to FlatBuffer buffer
│    FlatBuffer      │─── Set buffer offsets in tensor definitions
│                    │─── Validate all references are correct
└────────┬───────────┘
         ▼
  Complete FlatBuffer with embedded parameters
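
Steps 2 and 3 amount to padding each blob up to its alignment boundary and recording where it landed. The sketch below shows one way to do that. The `pack_parameters` helper, the blob names, and the use of zero bytes as padding are assumptions for illustration rather than edgecompiler's actual implementation.

```python
from typing import Dict, List, Tuple

WEIGHT_ALIGN = 64  # bytes, from the alignment requirements above
SMALL_ALIGN = 4    # bytes, for biases and per-channel scales


def pack_parameters(
    blobs: List[Tuple[str, bytes, int]],
) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate blobs so that each one starts on its alignment boundary.

    blobs holds (name, data, alignment) in the order the data should appear.
    Returns the packed section and an offset table of name -> (offset, size).
    """
    section = bytearray()
    offsets: Dict[str, Tuple[int, int]] = {}
    for name, data, alignment in blobs:
        # Bytes needed to reach the next multiple of `alignment`.
        padding = -len(section) % alignment
        section.extend(b"\x00" * padding)  # zero padding is an assumption
        offsets[name] = (len(section), len(data))
        section.extend(data)
    return bytes(section), offsets


# Weights first, then biases, then scales (placeholder data of plausible sizes).
blobs = [
    ("conv0/weights", bytes(100), WEIGHT_ALIGN),
    ("conv1/weights", bytes(30), WEIGHT_ALIGN),
    ("conv0/bias", bytes(4 * 32), SMALL_ALIGN),
    ("conv0/scales", bytes(4 * 32), SMALL_ALIGN),
]
section, offsets = pack_parameters(blobs)
for name, (offset, size) in offsets.items():
    print(f"{name}: offset={offset} size={size}")
```

The expression `-len(section) % alignment` yields 0 when the current length is already aligned, so no special case is needed for blobs that happen to end on a boundary.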

On-Chip Memory Management

The Edge TPU has limited on-chip SRAM (~2 MB on Coral USB). edgecompiler simulates on-chip memory allocation during compilation to ensure the binary segments are feasible:

from edgecompiler.backends.coral.memory_planner import OnChipMemoryPlanner

planner = OnChipMemoryPlanner(total_sram=2 * 1024 * 1024)  # 2 MB

for op in graph.ops:
    # Allocate on-chip buffers for activations
    planner.allocate(op.output_tensor, lifetime=(op.index, op.last_use_index))

# Verify feasibility
if not planner.is_feasible():
    # Some ops must fall back to off-chip (slower) or CPU
    planner.spill_to_offchip(planner.largest_activation)
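
The feasibility check rests on tensor lifetimes: two activations compete for SRAM only while both are live. A simplified sketch of that reasoning is shown below. It computes the peak number of live bytes across all op steps and compares it with the SRAM budget. The `peak_live_bytes` function and the sample sizes are hypothetical, and the sketch ignores address assignment and fragmentation, which a real planner would also have to handle.

```python
from typing import List, Tuple

SRAM_BYTES = 2 * 1024 * 1024  # ~2 MB, as stated above


def peak_live_bytes(activations: List[Tuple[int, int, int]]) -> int:
    """Peak memory held by simultaneously live activations.

    Each activation is (size_bytes, first_op_index, last_use_index), and it
    is considered live for every op index in that inclusive range.
    """
    if not activations:
        return 0
    last_step = max(end for _, _, end in activations)
    peak = 0
    for step in range(last_step + 1):
        live = sum(size for size, start, end in activations if start <= step <= end)
        peak = max(peak, live)
    return peak


# Hypothetical activation sizes and lifetimes for a four-op graph.
activations = [
    (1_200_000, 0, 1),
    (800_000, 1, 2),
    (600_000, 2, 3),
    (400_000, 2, 3),
]
peak = peak_live_bytes(activations)
feasible = peak <= SRAM_BYTES
print(f"peak={peak} bytes, feasible={feasible}")
```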

Operation Partitioning Strategy

Not all operations can run on the Edge TPU. The partitioner decides which ops execute on the TPU and which fall back to CPU.

Partitioning Algorithm

  IRGraph (all ops)
         │
         ▼
┌────────────────────────┐
│ 1. Classify each op    │─── TPU_SUPPORTED, CPU_ONLY, or CONDITIONAL
│    by Edge TPU support │─── CONDITIONAL ops depend on tensor shapes/sizes
└────────┬───────────────┘
         ▼
┌────────────────────────┐
│ 2. Build op clusters   │─── Group consecutive TPU ops into clusters
│    (TPU runs)          │─── A cluster = one binary segment
│                        │─── Minimum cluster size: 1 op
│                        │─── Maximum cluster size: limited by SRAM
└────────┬───────────────┘
         ▼
┌────────────────────────┐
│ 3. Insert data transfer│─── At cluster boundaries, insert
│    ops                 │    Dequantize → (CPU ops) → Quantize
│                        │    when going TPU → CPU → TPU
└────────┬───────────────┘
         ▼
┌────────────────────────┐
│ 4. Optimise partitions │─── Merge small clusters when possible
│                        │─── Move isolated CPU ops to TPU if
│                        │    a compatible variant exists
│                        │─── Minimise TPU↔CPU transitions
└────────┬───────────────┘
         ▼
  Partitioned Graph
  ┌───────────────────────────────────────┐
  │ TPU Cluster 0: Conv2D → BN → ReLU     │
  │   ↓ (Dequantize)                      │
  │ CPU: Reshape (dynamic shape)          │
  │   ↓ (Quantize)                        │
  │ TPU Cluster 1: Conv2D → ReLU → Pool   │
  │   ↓                                   │
  │ CPU: Softmax                          │
  └───────────────────────────────────────┘
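
Step 2 of the algorithm, grouping consecutive TPU ops into clusters, can be sketched for a purely linear op sequence with `itertools.groupby`. This is an illustrative simplification: the function names are hypothetical, a real graph has branches, and the SRAM limit on cluster size is not modelled here.

```python
from itertools import groupby
from typing import List, Tuple


def build_clusters(placements: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops that share a device.

    placements holds (op_name, device) in execution order, where device is
    "TPU" or "CPU". Returns (device, [op names]) for each run.
    """
    return [
        (device, [name for name, _ in run])
        for device, run in groupby(placements, key=lambda placement: placement[1])
    ]


def count_transitions(clusters: List[Tuple[str, List[str]]]) -> int:
    """Each boundary between adjacent runs is one TPU↔CPU transition."""
    return max(len(clusters) - 1, 0)


# The partitioned graph from the diagram, written as a linear sequence.
placements = [
    ("Conv2D", "TPU"), ("BN", "TPU"), ("ReLU", "TPU"),
    ("Reshape", "CPU"),
    ("Conv2D_1", "TPU"), ("ReLU_1", "TPU"), ("Pool", "TPU"),
    ("Softmax", "CPU"),
]
clusters = build_clusters(placements)
tpu_clusters = [ops for device, ops in clusters if device == "TPU"]
print(len(tpu_clusters), "TPU clusters,", count_transitions(clusters), "transitions")
```

Counting transitions this way gives the optimisation step a single number to minimise when it considers merging clusters or moving an isolated CPU op onto the TPU.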

Partitioning Rules

| Rule | Description |
| --- | --- |
| Minimise transitions | Each TPU↔CPU boundary adds ~0.1 ms overhead. Prefer keeping ops on one side. |
| Fusion priority | Fused ops (Conv+BN+ReLU) are always kept together in one cluster. |
| SRAM budget | A cluster's total activation memory must fit in ~2 MB. If not, split the cluster. |
| Dynamic shapes | Ops with dynamic output shapes (e.g., NonMaxSuppression) always fall back to CPU. |
| Softmax placement | Softmax is supported on Edge TPU but may be more accurate on CPU for large class counts (>1000). |
| Concat boundaries | Concat ops can merge multiple TPU clusters if all inputs are TPU-produced. |

Example Partitioning

For MobileNetV2 with 152 operations:

Typical partitioning result:
  TPU ops:  140 (92.1%)
  CPU ops:   12 (7.9%)
  TPU clusters: 3
  Transitions: 4

CPU-fallback ops typically include:
  - Reshape with dynamic target shape
  - Gather on non-0 axis
  - StridedSlice with complex masks
  - Custom post-processing ops

Fallback to edgetpu_compiler

When edgecompiler's native compilation encounters issues, it can fall back to the official edgetpu_compiler binary if available.

Fallback Conditions

The compiler falls back to edgetpu_compiler when:

  1. Complex custom ops: The model contains ops that edgecompiler cannot yet compile natively for the Edge TPU.
  2. Verification failure: The natively compiled model produces incorrect results compared to reference inference.
  3. User override: The user explicitly requests --fallback-edgetpu-compiler.

Fallback Mechanism

# Fallback logic, reached once one of the conditions above has triggered
from edgecompiler.backends.coral.fallback import EdgetpuCompilerFallback

fallback = EdgetpuCompilerFallback()

if fallback.is_available():
    # edgetpu_compiler is installed on the system: hand it the quantised model
    result = fallback.compile(
        input_tflite="quantized_model.tflite",
        output_path="model_coral.tflite",
    )
else:
    # No edgetpu_compiler available: stay on the native path.
    # ir_graph and config come from the surrounding compilation context.
    result = native_compile(ir_graph, config)

Detecting edgetpu_compiler

# edgecompiler checks these locations for edgetpu_compiler:
# 1. PATH environment variable
# 2. /usr/local/bin/edgetpu_compiler
# 3. ~/.edgecompiler/edgetpu_compiler
# 4. The edgetpu_compiler Docker image (if Docker is available)

edgecompile model.tflite --target coral --verbose
# Output:
# [INFO] edgetpu_compiler not found in PATH
# [INFO] Using native edgecompiler Coral backend
# [INFO] If you encounter issues, install edgetpu_compiler for fallback support
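
The search order listed above can be approximated with the standard library. The sketch below is not edgecompiler's detection code: `find_edgetpu_compiler` and its parameters are hypothetical, and the Docker step checks only that a `docker` executable is on PATH, not that the edgetpu_compiler image is actually present.

```python
import os
import shutil
from typing import Callable, List, Optional, Tuple

DEFAULT_CANDIDATES = [
    "/usr/local/bin/edgetpu_compiler",
    os.path.expanduser("~/.edgecompiler/edgetpu_compiler"),
]


def _is_executable(path: str) -> bool:
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_edgetpu_compiler(
    which: Callable[[str], Optional[str]] = shutil.which,
    candidates: Optional[List[str]] = None,
    is_executable: Callable[[str], bool] = _is_executable,
) -> Optional[Tuple[str, str]]:
    """Return (source, path) for the first match, or None if nothing is found."""
    # 1. PATH environment variable
    found = which("edgetpu_compiler")
    if found:
        return ("path", found)
    # 2. and 3. Fixed install locations
    for candidate in (DEFAULT_CANDIDATES if candidates is None else candidates):
        if is_executable(candidate):
            return ("file", candidate)
    # 4. Docker: only confirms the docker CLI exists (simplification)
    docker = which("docker")
    if docker:
        return ("docker", docker)
    return None


print(find_edgetpu_compiler())
```

The `which` and `is_executable` parameters exist only so the search order can be exercised without touching the real filesystem.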

Why Native Compilation is Preferred

| Aspect | edgecompiler (native) | edgetpu_compiler (fallback) |
| --- | --- | --- |
| macOS ARM64 | ✅ Native support | ❌ Requires x86 emulation or Docker |
| Installation | pip install edgecompiler | Download binary + Docker |
| Speed | Fast (in-process) | Slow (subprocess + Docker overhead) |
| Customisation | Full control over partitioning | Black-box compilation |
| Debugging | IR dumps, verbose logging | Limited output |
| Extensibility | Add custom passes | No extension API |

Known Limitations

Edge TPU Hardware Limits

| Limitation | Detail |
| --- | --- |
| SRAM size | ~2 MB on-chip memory for activations. Large models may require off-chip spills. |
| Tensor rank | Maximum 4 dimensions for most ops. 5D tensors are not supported. |
| Batch size | Only batch size 1 is supported for most operations. |
| Weight quantisation | Weights must be INT8 (per-channel) or UINT8 (per-tensor). FP16 weights are not supported. |
| Bias precision | Biases are INT32 (accumulated). No FP16 bias support. |
| Maximum model size | Practical limit of ~8 MB for parameter data on Coral USB. |

edgecompiler Software Limits

| Limitation | Detail | Workaround |
| --- | --- | --- |
| LSTM/GRU on TPU | Recurrent layers are not supported on Edge TPU | Use CNN or Transformer alternatives |
| Dynamic shapes | Ops with dynamic output shapes fall back to CPU | Pre-compute shapes at compile time |
| Einsum | Not supported on Edge TPU | Decompose into supported ops |
| ScatterND | Not supported on Edge TPU | Use Gather + alternative indexing |
| Multi-device | Only one Coral USB device is currently supported per process | Use multiple USB devices with separate processes |
| Training | No on-device training support | Train off-device, compile for inference |
| FP16 models | Edge TPU requires INT8 quantisation | Quantise to INT8 before compilation |

macOS-Specific Limits

| Limitation | Detail |
| --- | --- |
| libedgetpu ARM64 | No official ARM64 macOS build. Use our install_coral_runtime.sh script for a community-built binary. |
| USB driver | Some USB-C hubs may cause intermittent disconnections. Use a direct USB-A port with a USB-C adapter. |
| Thermal throttling | Extended inference on both Metal and Coral simultaneously may trigger thermal throttling on M1 Pro. |

Debugging Tips

When compilation fails or produces incorrect results:

  1. Check op support: Run with --verbose to see which ops are mapped to TPU vs CPU.
  2. Verify quantisation: Use --dump-ir to inspect the quantised IR before backend code generation.
  3. Compare outputs: Use the runtime to compare model outputs before and after compilation:
from edgecompiler.runtime import InferenceSession, compare_outputs

session = InferenceSession("model_coral.tflite", target="coral")
result = session.run({"input": test_input})   # test_input: a representative input array
reference = run_reference_model(test_input)   # user-supplied: runs the original, uncompiled model

mse, max_diff = compare_outputs(result, reference)
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
  4. Reduce model complexity: If partitioning fails, try reducing the model size or splitting it into sub-models.
  5. Use edgetpu_compiler fallback: Install edgetpu_compiler and re-run with --fallback-edgetpu-compiler to compare results.
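
The comparison in step 3 reports a mean squared error and a maximum difference. For readers who want to reproduce those two numbers independently, the sketch below computes them over flat sequences of floats. It is a plain-Python illustration under the assumption that the maximum difference is the largest absolute element-wise difference; it is not the implementation of `compare_outputs`, and `mse_and_max_diff` is a hypothetical name.

```python
from typing import Sequence, Tuple


def mse_and_max_diff(result: Sequence[float], reference: Sequence[float]) -> Tuple[float, float]:
    """Mean squared error and largest absolute element-wise difference."""
    if len(result) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    if len(result) == 0:
        raise ValueError("outputs must not be empty")
    diffs = [float(a) - float(b) for a, b in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    max_diff = max(abs(d) for d in diffs)
    return mse, max_diff


# Hypothetical class probabilities from the compiled and reference models.
mse, max_diff = mse_and_max_diff([0.10, 0.70, 0.20], [0.12, 0.68, 0.20])
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
```

A small MSE with a large maximum difference usually points at a few badly quantised outputs rather than a uniform loss of precision, which is why both figures are worth inspecting.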