Coral Backend (Edge TPU)¶

This document describes how the Coral backend compiles models for Google's Edge TPU USB Accelerator on macOS ARM64.

Table of Contents¶

Edge TPU Binary Format
TFLite FlatBuffer Construction
Custom Op Embedding
Parameter Data Serialisation
Operation Partitioning Strategy
Fallback to edgetpu_compiler
Known Limitations

Edge TPU Binary Format¶

The Edge TPU does not execute standard TFLite models. Instead, it requires a specially crafted TFLite FlatBuffer that embeds compiled binary segments for the Edge TPU coprocessor alongside a standard TFLite graph for CPU fallback.

Binary Layout¶

┌──────────────────────────────────────────────────────────┐
│                    TFLite FlatBuffer                     │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Model Header                                      │  │
│  │  ├─ version: 3                                     │  │
│  │  ├─ description: "Edge TPU compiled model"         │  │
│  │  └─ metadata_buffer[]                              │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Subgraph 0 (Primary)                              │  │
│  │  ├─ tensors[]                                      │  │
│  │  │   ├─ input_0 (UINT8, [1, 224, 224, 3])          │  │
│  │  │   ├─ Conv2D_weight (INT8, [32, 3, 3, 3])        │  │
│  │  │   ├─ Conv2D_bias (INT32, [32])                  │  │
│  │  │   ├─ ... (intermediate tensors)                 │  │
│  │  │   └─ output_0 (UINT8, [1, 1001])                │  │
│  │  ├─ operators[]                                    │  │
│  │  │   ├─ Op 0: edgetpu-custom-op (Conv2D)           │  │
│  │  │   │   └─ builtin_code: CUSTOM                   │  │
│  │  │   │   └─ custom_code: "edgetpu-custom-op"       │  │
│  │  │   ├─ Op 1: edgetpu-custom-op (DepthwiseConv2D)  │  │
│  │  │   ├─ ...                                        │  │
│  │  │   └─ Op N: SOFTMAX (CPU fallback)               │  │
│  │  └─ inputs[], outputs[]                            │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Custom Op Metadata                                │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │  Edge TPU Binary Segments                    │  │  │
│  │  │  ├─ Segment 0: [offset, size]                │  │  │
│  │  │  │   └─ Compiled instructions for ops 0..k   │  │  │
│  │  │  ├─ Segment 1: [offset, size]                │  │  │
│  │  │  │   └─ Compiled instructions for ops k+1..m │  │  │
│  │  │  └─ ...                                      │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │  Parameter Data                              │  │  │
│  │  │  ├─ Quantised weights (aligned)              │  │  │
│  │  │  ├─ Bias values (INT32, aligned)             │  │  │
│  │  │  └─ Op configuration records                 │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Signature Defs (optional)                         │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Key Concepts¶

Custom ops: All Edge TPU-compiled operations are represented as CUSTOM ops with the custom code "edgetpu-custom-op". The actual operation type (Conv2D, DepthwiseConv2D, etc.) is encoded in the custom op's binary metadata.
Binary segments: The Edge TPU coprocessor executes pre-compiled binary instructions. These segments contain the microcode that tells the TPU how to execute each operation, including data routing, on-chip memory allocation, and computation scheduling.
Parameter data: Weights and biases are stored in a separate data section within the FlatBuffer, aligned to the Edge TPU's DMA requirements (64-byte alignment for weights, 4-byte alignment for biases).
Metadata buffer: The FlatBuffer's metadata_buffer field contains a serialised edgetpu_cache_entry that maps custom ops to their corresponding binary segments.

TFLite FlatBuffer Construction¶

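edgecompiler constructs the TFLite FlatBuffer from scratch using the flatbuffers Python library, so the edgetpu_compiler binary is not required for normal compilation. It remains available as an optional fallback (see Fallback to edgetpu_compiler).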

Construction Pipeline¶

  Quantised IRGraph
        │
        ▼
┌───────────────────┐
│ 1. Legalise ops   │─── Map IR ops to TFLite builtin op codes
│                   │─── Replace unsupported ops with alternatives
└────────┬──────────┘
         │
         ▼
┌────────────────────┐
│ 2. Allocate tensors│─── Assign tensor indices
│                    │─── Compute buffer sizes
│                    │─── Determine buffer types (UINT8, INT32, etc.)
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 3. Build operators │─── Create TFLite Operator table for each IR op
│                    │─── Set opcode_index, inputs, outputs
│                    │─── Encode op-specific options (Conv2DOptions, etc.)
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 4. Embed Edge TPU  │─── Replace supported ops with custom ops
│    custom ops      │─── Generate binary segments for each op cluster
│                    │─── Build edgetpu_cache_entry metadata
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 5. Serialise       │─── Write FlatBuffer to bytes
│    FlatBuffer      │─── Verify with TFLite schema
│                    │─── Write .tflite file
└────────┬───────────┘
         │
         ▼
  model_coral.tflite
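
Step 1 of the pipeline can be pictured as a table lookup. The sketch below is illustrative only: the IR op names on the left and the `legalise` helper are hypothetical and are not edgecompiler's API, while the names on the right are builtin operator names from the TFLite schema.

```python
# Hypothetical IR op names mapped to TFLite builtin operator names.
IR_TO_TFLITE = {
    "Conv2D": "CONV_2D",
    "DepthwiseConv2D": "DEPTHWISE_CONV_2D",
    "FullyConnected": "FULLY_CONNECTED",
    "MaxPool2D": "MAX_POOL_2D",
    "AveragePool2D": "AVERAGE_POOL_2D",
    "Add": "ADD",
    "Softmax": "SOFTMAX",
}


def legalise(ir_ops):
    """Split IR ops into (TFLite builtin names, ops with no direct mapping)."""
    legal, unsupported = [], []
    for name in ir_ops:
        if name in IR_TO_TFLITE:
            legal.append(IR_TO_TFLITE[name])
        else:
            unsupported.append(name)
    return legal, unsupported


legal, unsupported = legalise(["Conv2D", "Einsum", "Softmax"])
print(legal)        # builtin names for the ops that map directly
print(unsupported)  # ops that need a replacement or a CPU fallback
```

Ops that land in the second list are the ones the "Replace unsupported ops with alternatives" step would have to handle.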

FlatBuffer Schema Usage¶

edgecompiler uses the official TFLite FlatBuffer schema (schema.fbs) to construct valid models. The schema is bundled with the package.

from edgecompiler.backends.coral.flatbuffer_builder import TFLiteBuilder

builder = TFLiteBuilder()

# Add opcode entries
builder.add_opcode("CONV_2D", builtin=True)
builder.add_opcode("DEPTHWISE_CONV_2D", builtin=True)
builder.add_opcode("edgetpu-custom-op", builtin=False)

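# Add tensors (indices follow insertion order: 0, 1, 2, 3).
# quantised_weights and quantised_bias stand for the constant arrays
# produced by the quantiser; they are not defined in this snippet.
builder.add_tensor("input_0", shape=[1, 224, 224, 3], dtype="UINT8")
builder.add_tensor("Conv2D_weight", shape=[32, 3, 3, 3], dtype="INT8",
                   data=quantised_weights)
builder.add_tensor("Conv2D_bias", shape=[32], dtype="INT32",
                   data=quantised_bias)
builder.add_tensor("Conv2D_output", shape=[1, 224, 224, 32], dtype="UINT8")

# Add operators (Conv2DOptions is assumed to be imported alongside TFLiteBuilder)
builder.add_operator(opcode_idx=2,  # edgetpu-custom-op
                     inputs=[0, 1, 2],
                     outputs=[3],
                     options=Conv2DOptions(pad="SAME", stride_h=1, stride_w=1))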

# Build the model
flatbuffer_bytes = builder.build()

Validation¶

After construction, the FlatBuffer is validated against several criteria:

Schema validity: The FlatBuffer is parseable by the TFLite schema.
Tensor consistency: All operator input/output indices reference valid tensors.
Quantisation consistency: Quantisation parameters are present on all INT8/UINT8 tensors used by custom ops.
Edge TPU compatibility: All custom ops have corresponding binary segments.
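
Two of these checks are easy to picture on a plain-Python view of the model. The following is a minimal sketch, assuming operators are represented as dictionaries with `inputs`, `outputs`, and `custom_code` keys; that representation and both function names are hypothetical and are not edgecompiler's internal data model.

```python
def check_tensor_indices(num_tensors, operators):
    """Return error strings for operator tensor indices that are out of range."""
    errors = []
    for op_idx, op in enumerate(operators):
        for role in ("inputs", "outputs"):
            for t in op[role]:
                # TFLite uses -1 to mark an optional input that is absent.
                if role == "inputs" and t == -1:
                    continue
                if not 0 <= t < num_tensors:
                    errors.append(f"op {op_idx}: {role} index {t} out of range")
    return errors


def ops_missing_segments(operators, segment_for_op):
    """Return indices of Edge TPU custom ops that have no binary segment."""
    return [
        i for i, op in enumerate(operators)
        if op.get("custom_code") == "edgetpu-custom-op" and i not in segment_for_op
    ]


operators = [
    {"inputs": [0, 1, 2], "outputs": [3], "custom_code": "edgetpu-custom-op"},
    {"inputs": [3, -1], "outputs": [7], "custom_code": "edgetpu-custom-op"},
]
print(check_tensor_indices(num_tensors=5, operators=operators))
print(ops_missing_segments(operators, segment_for_op={0: 0}))
```

Returning a list of problems rather than raising on the first one makes it possible to report every inconsistency in a single validation pass.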

Custom Op Embedding¶

How Custom Ops Work¶

The Edge TPU runtime identifies compiled operations through the "edgetpu-custom-op" custom op code. When the TFLite interpreter encounters this op, it delegates execution to the Edge TPU via libedgetpu.

┌───────────────────────────────────────────────────┐
│              TFLite Interpreter                   │
│                                                   │
│  Op 0: edgetpu-custom-op  ──┐                     │
│  Op 1: edgetpu-custom-op  ──┤                     │
│  Op 2: edgetpu-custom-op  ──┼──▶ Edge TPU Delegate│
│  Op 3: SOFTMAX            ──┼──▶ CPU execution    │
│  Op 4: edgetpu-custom-op  ──┘                     │
│                                                   │
└───────────────────────────────────────────────────┘
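
On the runtime side, the delegation shown above is usually set up when the interpreter is created. The sketch below assumes the `tflite_runtime` package is installed and that libedgetpu is on the loader path under its standard name; a community-built macOS binary may be named or located differently. The helper names are illustrative.

```python
import platform

# Shared-library names conventionally used for libedgetpu on each host OS.
EDGETPU_SHARED_LIB = {
    "Linux": "libedgetpu.so.1",
    "Darwin": "libedgetpu.1.dylib",
    "Windows": "edgetpu.dll",
}


def edgetpu_library_name(system=None):
    """Return the libedgetpu file name for the given (or current) OS."""
    return EDGETPU_SHARED_LIB[system or platform.system()]


def make_interpreter(model_path):
    """Create a TFLite interpreter that hands Edge TPU custom ops to libedgetpu."""
    # Imported lazily so this module loads even without tflite_runtime installed.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegate = load_delegate(edgetpu_library_name())
    return Interpreter(model_path=model_path, experimental_delegates=[delegate])


print(edgetpu_library_name("Darwin"))
```

Calling `make_interpreter("model_coral.tflite")` requires an attached accelerator; without the delegate, the interpreter has no kernel registered for "edgetpu-custom-op" and cannot run the model.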

Custom Op Binary Format¶

Each custom op's binary data is structured as:

┌─────────────────────────────────────┐
│      Custom Op Binary Record        │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Op Header                    │  │
│  │  ├─ op_type: uint32           │  │
│  │  │   (0x01=Conv2D,            │  │
│  │  │    0x02=DepthwiseConv2D,   │  │
│  │  │    0x03=FullyConnected,    │  │
│  │  │    0x04=MaxPool2D,         │  │
│  │  │    0x05=AveragePool2D,     │  │
│  │  │    0x06=Add,               │  │
│  │  │    ...)                    │  │
│  │  ├─ input_count: uint32       │  │
│  │  ├─ output_count: uint32      │  │
│  │  ├─ flags: uint32             │  │
│  │  └─ reserved: uint32[4]       │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Input Descriptors            │  │
│  │  ├─ tensor_index: uint32      │  │
│  │  ├─ on_chip_address: uint32   │  │
│  │  ├─ size: uint32              │  │
│  │  └─ quant: (scale, zp)        │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Output Descriptors           │  │
│  │  ├─ tensor_index: uint32      │  │
│  │  ├─ on_chip_address: uint32   │  │
│  │  ├─ size: uint32              │  │
│  │  └─ quant: (scale, zp)        │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │  Op-Specific Parameters       │  │
│  │  (Conv2D: kernel_h, kernel_w, │  │
│  │   stride_h, stride_w, pad_h,  │  │
│  │   pad_w, dilation_h,          │  │
│  │   dilation_w,                 │  │
│  │   depth_multiplier, ...)      │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
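
The header and descriptor layouts above map naturally onto Python's `struct` module. The sketch below makes several assumptions the diagram does not state: little-endian byte order, no padding between fields, and a float32 scale with an int32 zero point for the `quant` pair. The function names are illustrative.

```python
import struct

# op_type, input_count, output_count, flags, reserved[4]: eight uint32 values
HEADER = struct.Struct("<8I")
# tensor_index, on_chip_address, size (uint32), scale (float32), zero_point (int32)
DESCRIPTOR = struct.Struct("<3Ifi")

OP_CONV2D = 0x01


def pack_header(op_type, input_count, output_count, flags=0):
    """Serialise the op header; the four reserved words are written as zero."""
    return HEADER.pack(op_type, input_count, output_count, flags, 0, 0, 0, 0)


def pack_descriptor(tensor_index, on_chip_address, size, scale, zero_point):
    """Serialise one input or output descriptor."""
    return DESCRIPTOR.pack(tensor_index, on_chip_address, size, scale, zero_point)


record = pack_header(OP_CONV2D, input_count=3, output_count=1)
record += pack_descriptor(0, 0x0000, 224 * 224 * 3, 0.0078125, 128)

# Round-trip the header to confirm the layout.
op_type, n_in, n_out, flags, *reserved = HEADER.unpack_from(record)
print(op_type, n_in, n_out, flags, reserved, len(record))
```

Using precompiled `struct.Struct` objects keeps the field order in one place, so the writer and any reader used for debugging cannot drift apart.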

Building Custom Op Records¶

edgecompiler builds these binary records from the IR operation attributes:

from edgecompiler.backends.coral.custom_ops import CustomOpBuilder

builder = CustomOpBuilder()

# Map IR operation to custom op binary
record = builder.build_conv2d(
    input_tensor=graph.get_tensor("input_0"),
    weight_tensor=graph.get_tensor("Conv2D_weight"),
    bias_tensor=graph.get_tensor("Conv2D_bias"),
    output_tensor=graph.get_tensor("Conv2D_output"),
    stride_h=op.attributes["stride_h"],
    stride_w=op.attributes["stride_w"],
    padding=op.attributes["padding"],       # "SAME" or "VALID"
    dilation_h=op.attributes.get("dilation_h", 1),
    dilation_w=op.attributes.get("dilation_w", 1),
)

# Record contains:
# record.op_type = 0x01
# record.binary_data = bytes(...)  # Serialised record
# record.parameter_references = [...]  # Weight/bias locations

Parameter Data Serialisation¶

Alignment Requirements¶

The Edge TPU requires parameter data to be aligned to specific boundaries for DMA transfers:

Data Type	Alignment	Notes
Weights (INT8)	64 bytes	Must be contiguous per layer
Biases (INT32)	4 bytes	Accumulator values
Per-channel scales	4 bytes	Float32, per output channel
Per-channel zero points	4 bytes	Int32, per output channel

Serialisation Process¶

  IR Tensors (Quantised)
        │
        ▼
┌────────────────────┐
│ 1. Extract weights │─── Get constant tensor data from IR
│    and biases      │─── Verify INT8 / INT32 types
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 2. Align data      │─── Pad to 64-byte boundary (weights)
│                    │─── Pad to 4-byte boundary (biases)
│                    │─── Record offsets for reference
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 3. Pack into       │─── Concatenate all parameter data
│    data section    │─── Weights first, then biases, then scales
│                    │─── Build offset table for custom op records
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│ 4. Embed in        │─── Write parameter data to FlatBuffer buffer
│    FlatBuffer      │─── Set buffer offsets in tensor definitions
│                    │─── Validate all references are correct
└────────┬───────────┘
         │
         ▼
  Complete FlatBuffer with embedded parameters
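
Steps 2 and 3 amount to rounding each blob's start offset up to its alignment and recording where it landed. The following is a minimal sketch of that idea; the function names and blob names are hypothetical, and zero bytes are assumed as the padding value.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def pack_parameters(blobs):
    """Pack (name, data, alignment) blobs in order.

    Returns the packed section and a name -> start offset table.
    """
    section = bytearray()
    offsets = {}
    for name, data, alignment in blobs:
        start = align_up(len(section), alignment)
        section.extend(b"\x00" * (start - len(section)))  # padding (assumed zero)
        offsets[name] = start
        section.extend(data)
    return bytes(section), offsets


# Weights first (64-byte aligned), then biases (4-byte aligned).
section, offsets = pack_parameters([
    ("conv0/weights", bytes(100), 64),
    ("conv1/weights", bytes(30), 64),
    ("conv0/bias", bytes(4 * 32), 4),
])
print(offsets, len(section))
```

Here the second weight blob starts at offset 128 rather than 100, and the bias starts at 160 rather than 158; the offset table is what the custom op records would reference.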

On-Chip Memory Management¶

The Edge TPU has limited on-chip SRAM (~2 MB on Coral USB). edgecompiler simulates on-chip memory allocation during compilation to ensure the binary segments are feasible:

from edgecompiler.backends.coral.memory_planner import OnChipMemoryPlanner

planner = OnChipMemoryPlanner(total_sram=2 * 1024 * 1024)  # 2 MB

for op in graph.ops:
    # Allocate on-chip buffers for activations
    planner.allocate(op.output_tensor, lifetime=(op.index, op.last_use_index))

# Verify feasibility
if not planner.is_feasible():
    # Some ops must fall back to off-chip (slower) or CPU
    planner.spill_to_offchip(planner.largest_activation)
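
A feasibility check of this kind can be approximated by finding the peak number of bytes that are live at the same time. The sketch below is not the `OnChipMemoryPlanner` implementation; it is a simplified stand-in that ignores fragmentation and address assignment, so it gives a lower bound on the SRAM actually needed.

```python
SRAM_BYTES = 2 * 1024 * 1024  # 2 MB, as in the planner example


def peak_live_bytes(buffers):
    """buffers: (size_bytes, first_op, last_op) tuples with inclusive lifetimes."""
    events = []
    for size, first, last in buffers:
        events.append((first, size))      # buffer becomes live at op `first`
        events.append((last + 1, -size))  # buffer is released after op `last`
    live = peak = 0
    # At equal timestamps, negative deltas sort first, so releases are
    # applied before new allocations.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


activations = [(800_000, 0, 1), (900_000, 1, 2), (600_000, 2, 3)]
peak = peak_live_bytes(activations)
print(peak, peak <= SRAM_BYTES)
```

If the peak exceeds the budget, no allocation order can make the cluster fit, which is the situation in which spilling or splitting becomes necessary.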

Operation Partitioning Strategy¶

Not all operations can run on the Edge TPU. The partitioner decides which ops execute on the TPU and which fall back to CPU.

Partitioning Algorithm¶

  IRGraph (all ops)
        │
        ▼
┌────────────────────────┐
│ 1. Classify each op    │─── TPU_SUPPORTED, CPU_ONLY, or CONDITIONAL
│    by Edge TPU support │─── CONDITIONAL ops depend on tensor shapes/sizes
└────────┬───────────────┘
         │
         ▼
┌────────────────────────┐
│ 2. Build op clusters   │─── Group consecutive TPU ops into clusters
│    (TPU runs)          │─── A cluster = one binary segment
│                        │─── Minimum cluster size: 1 op
│                        │─── Maximum cluster size: limited by SRAM
└────────┬───────────────┘
         │
         ▼
┌────────────────────────┐
│ 3. Insert data transfer│─── At cluster boundaries, insert
│    ops                 │    Dequantize → (CPU ops) → Quantize
│                        │    when going TPU → CPU → TPU
└────────┬───────────────┘
         │
         ▼
┌────────────────────────┐
│ 4. Optimise partitions │─── Merge small clusters when possible
│                        │─── Move isolated CPU ops to TPU if
│                        │    a compatible variant exists
│                        │─── Minimise TPU↔CPU transitions
└────────┬───────────────┘
         │
         ▼
  Partitioned Graph
  ┌───────────────────────────────────────┐
  │ TPU Cluster 0: Conv2D → BN → ReLU     │
  │   ↓ (Dequantize)                      │
  │ CPU: Reshape (dynamic shape)          │
  │   ↓ (Quantize)                        │
  │ TPU Cluster 1: Conv2D → ReLU → Pool   │
  │   ↓                                   │
  │ CPU: Softmax                          │
  └───────────────────────────────────────┘
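
Step 2 of the algorithm, grouping consecutive ops with the same placement, can be sketched with `itertools.groupby`. This is a simplified illustration that assumes a linear chain of ops already in execution order; a real graph with branches needs a topological treatment, and the function names are hypothetical.

```python
from itertools import groupby


def build_clusters(placements):
    """Group consecutive ops by device.

    placements: one "TPU" or "CPU" string per op, in execution order.
    Returns a list of (device, [op indices]) runs.
    """
    clusters = []
    index = 0
    for device, run in groupby(placements):
        length = len(list(run))
        clusters.append((device, list(range(index, index + length))))
        index += length
    return clusters


def count_transitions(clusters):
    """Each boundary between adjacent runs is one TPU↔CPU transition."""
    return max(len(clusters) - 1, 0)


# Conv2D, BN, ReLU | Reshape | Conv2D, ReLU, Pool | Softmax
placements = ["TPU", "TPU", "TPU", "CPU", "TPU", "TPU", "TPU", "CPU"]
clusters = build_clusters(placements)
print(clusters)
print(count_transitions(clusters))
```

The example input mirrors the partitioned graph shown above: two TPU clusters separated by a CPU Reshape, followed by a CPU Softmax.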

Partitioning Rules¶

Rule	Description
Minimise transitions	Each TPU↔CPU boundary adds ~0.1 ms overhead. Prefer keeping ops on one side.
Fusion priority	Fused ops (Conv+BN+ReLU) are always kept together in one cluster.
SRAM budget	A cluster's total activation memory must fit in ~2 MB. If not, split the cluster.
Dynamic shapes	Ops with dynamic output shapes (e.g., `NonMaxSuppression`) always fall back to CPU.
Softmax placement	Softmax is supported on Edge TPU but may be more accurate on CPU for large class counts (>1000).
Concat boundaries	Concat ops can merge multiple TPU clusters if all inputs are TPU-produced.
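
The SRAM budget rule can be illustrated with a greedy split: keep adding ops to the current cluster until the next op's activations would exceed the budget, then start a new cluster. This is a sketch under simplifying assumptions (activation memory is treated as a plain per-op sum and fusion constraints are ignored); it is not the partitioner's actual algorithm.

```python
SRAM_BUDGET = 2 * 1024 * 1024  # ~2 MB, per the SRAM budget rule


def split_by_budget(activation_bytes, budget=SRAM_BUDGET):
    """Greedily split a run of ops so each cluster's activations fit the budget.

    activation_bytes: activation memory per op, in execution order.
    Returns a list of clusters, each a list of op indices.
    """
    clusters, current, used = [], [], 0
    for index, size in enumerate(activation_bytes):
        if size > budget:
            raise ValueError(f"op {index} alone needs {size} bytes, over budget")
        if current and used + size > budget:
            clusters.append(current)  # close the cluster before it overflows
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        clusters.append(current)
    return clusters


print(split_by_budget([900_000, 800_000, 700_000, 300_000]))
```

An op whose activations exceed the budget on their own cannot be placed in any cluster, which is why the sketch raises instead of splitting further.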

Example Partitioning¶

For MobileNetV2 with 152 operations:

Typical partitioning result:
  TPU ops:  140 (92.1%)
  CPU ops:   12 (7.9%)
  TPU clusters: 3
  Transitions: 4

CPU-fallback ops typically include:
  - Reshape with dynamic target shape
  - Gather on non-0 axis
  - StridedSlice with complex masks
  - Custom post-processing ops

Fallback to edgetpu_compiler¶

When edgecompiler's native compilation encounters issues, it can fall back to the official edgetpu_compiler binary if available.

Fallback Conditions¶

The compiler falls back to edgetpu_compiler when:

Complex custom ops: The model contains ops that edgecompiler cannot yet compile natively for the Edge TPU.
Verification failure: The natively compiled model produces incorrect results compared to reference inference.
User override: The user explicitly requests --fallback-edgetpu-compiler.

Fallback Mechanism¶

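# Automatic fallback logic. This runs only after one of the fallback
# conditions above has been met; otherwise native compilation is used directly.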
from edgecompiler.backends.coral.fallback import EdgetpuCompilerFallback

fallback = EdgetpuCompilerFallback()

if fallback.is_available():
    # edgetpu_compiler is installed on the system
    result = fallback.compile(
        input_tflite="quantized_model.tflite",
        output_path="model_coral.tflite",
    )
else:
    # No edgetpu_compiler available; edgecompiler compiles natively
    result = native_compile(ir_graph, config)

Detecting edgetpu_compiler¶

# edgecompiler checks these locations for edgetpu_compiler:
# 1. PATH environment variable
# 2. /usr/local/bin/edgetpu_compiler
# 3. ~/.edgecompiler/edgetpu_compiler
# 4. The edgetpu_compiler Docker image (if Docker is available)

edgecompile model.tflite --target coral --verbose
# Output:
# [INFO] edgetpu_compiler not found in PATH
# [INFO] Using native edgecompiler Coral backend
# [INFO] If you encounter issues, install edgetpu_compiler for fallback support
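
The first three lookup locations can be checked with the standard library alone. The sketch below is an illustration of that search order, not edgecompiler's implementation; the function names are hypothetical, and the Docker check only tests for the `docker` CLI rather than for a specific image.

```python
import os
import shutil
from pathlib import Path


def find_edgetpu_compiler(candidates=None):
    """Return the path to edgetpu_compiler, or None if it cannot be found."""
    found = shutil.which("edgetpu_compiler")  # 1. PATH
    if found:
        return found
    if candidates is None:
        candidates = [
            Path("/usr/local/bin/edgetpu_compiler"),             # 2.
            Path.home() / ".edgecompiler" / "edgetpu_compiler",  # 3.
        ]
    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None


def docker_available():
    """4. Only checks that a docker executable exists on PATH."""
    return shutil.which("docker") is not None


print(find_edgetpu_compiler(), docker_available())
```

Checking for the executable bit as well as the file's existence avoids reporting a binary that is present but cannot be run.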

Why Native Compilation is Preferred¶

Aspect	edgecompiler (native)	edgetpu_compiler (fallback)
macOS ARM64	✅ Native support	❌ Requires x86 emulation or Docker
Installation	`pip install edgecompiler`	Download binary + Docker
Speed	Fast (in-process)	Slow (subprocess + Docker overhead)
Customisation	Full control over partitioning	Black-box compilation
Debugging	IR dumps, verbose logging	Limited output
Extensibility	Add custom passes	No extension API

Known Limitations¶

Edge TPU Hardware Limits¶

Limitation	Detail
SRAM size	~2 MB on-chip memory for activations. Large models may require off-chip spills.
Tensor rank	Maximum 4 dimensions for most ops. 5D tensors are not supported.
Batch size	Only batch size 1 is supported for most operations.
Weight quantisation	Weights must be INT8 (per-channel) or UINT8 (per-tensor). FP16 weights are not supported.
Bias precision	Biases are INT32 (accumulated). No FP16 bias support.
Maximum model size	Practical limit of ~8 MB for parameter data on Coral USB.

edgecompiler Software Limits¶

Limitation	Detail	Workaround
LSTM/GRU on TPU	Recurrent layers are not supported on Edge TPU	Use CNN or Transformer alternatives
Dynamic shapes	Ops with dynamic output shapes fall back to CPU	Pre-compute shapes at compile time
Einsum	Not supported on Edge TPU	Decompose into supported ops
ScatterND	Not supported on Edge TPU	Use Gather + alternative indexing
Multi-device	Only one Coral USB device per process is currently supported	Run one process per USB device
Training	No on-device training support	Train off-device, compile for inference
FP16 models	Edge TPU requires INT8 quantisation	Quantise to INT8 before compilation

macOS-Specific Limits¶

Limitation	Detail
libedgetpu ARM64	No official ARM64 macOS build. Use our `install_coral_runtime.sh` script for a community-built binary.
USB driver	Some USB-C hubs may cause intermittent disconnections. Use a direct USB-A port with a USB-C adapter.
Thermal throttling	Extended inference on both Metal and Coral simultaneously may trigger thermal throttling on M1 Pro.

Debugging Tips¶

When compilation fails or produces incorrect results:

Check op support: Run with --verbose to see which ops are mapped to TPU vs CPU.
Verify quantisation: Use --dump-ir to inspect the quantised IR before backend code generation.
Compare outputs: Use the runtime to compare model outputs before and after compilation:

from edgecompiler.runtime import InferenceSession, compare_outputs

session = InferenceSession("model_coral.tflite", target="coral")
result = session.run({"input": test_input})
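# run_reference_model and test_input are placeholders supplied by the caller:
# the original (pre-compilation) model and a representative input.
reference = run_reference_model(test_input)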

mse, max_diff = compare_outputs(result, reference)
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")

Reduce model complexity: If partitioning fails, try reducing the model size or splitting it into sub-models.
Use edgetpu_compiler fallback: Install edgetpu_compiler and re-run with --fallback-edgetpu-compiler to compare results.
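
For the output comparison in tip 3, the two reported metrics are straightforward to compute by hand when the runtime is not available. The sketch below is a pure-Python stand-in operating on flat sequences of numbers; `compare_flat_outputs` is a hypothetical name and is not the `compare_outputs` function from `edgecompiler.runtime`.

```python
def compare_flat_outputs(result, reference):
    """Return (mean squared error, maximum absolute difference)."""
    if len(result) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    if not result:
        raise ValueError("outputs must not be empty")
    diffs = [float(a) - float(b) for a, b in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    max_diff = max(abs(d) for d in diffs)
    return mse, max_diff


mse, max_diff = compare_flat_outputs([0.10, 0.70, 0.20], [0.10, 0.65, 0.25])
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
```

MSE summarises the overall drift introduced by quantisation and compilation, while the maximum difference highlights a single badly affected output that an average could hide.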