Coral Backend (Edge TPU)
This document describes how the Coral backend compiles models for Google's Edge TPU USB Accelerator on macOS ARM64.
Table of Contents
- Edge TPU Binary Format
- TFLite FlatBuffer Construction
- Custom Op Embedding
- Parameter Data Serialisation
- Operation Partitioning Strategy
- Fallback to edgetpu_compiler
- Known Limitations
Edge TPU Binary Format
The Edge TPU does not execute standard TFLite models. Instead, it requires a specially crafted TFLite FlatBuffer that embeds compiled binary segments for the Edge TPU coprocessor alongside a standard TFLite graph for CPU fallback.
Binary Layout
┌──────────────────────────────────────────────────────────┐
│ TFLite FlatBuffer │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Model Header │ │
│ │ ├─ version: 3 │ │
│ │ ├─ description: "Edge TPU compiled model" │ │
│ │ └─ metadata_buffer[] │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Subgraph 0 (Primary) │ │
│ │ ├─ tensors[] │ │
│ │ │ ├─ input_0 (UINT8, [1, 224, 224, 3]) │ │
│ │ │ ├─ Conv2D_weight (INT8, [32, 3, 3, 3]) │ │
│ │ │ ├─ Conv2D_bias (INT32, [32]) │ │
│ │ │ ├─ ... (intermediate tensors) │ │
│ │ │ └─ output_0 (UINT8, [1, 1001]) │ │
│ │ ├─ operators[] │ │
│ │ │ ├─ Op 0: edgetpu-custom-op (Conv2D) │ │
│ │ │ │ └─ builtin_code: CUSTOM │ │
│ │ │ │ └─ custom_code: "edgetpu-custom-op" │ │
│ │ │ ├─ Op 1: edgetpu-custom-op (DepthwiseConv2D) │ │
│ │ │ ├─ ... │ │
│ │ │ └─ Op N: SOFTMAX (CPU fallback) │ │
│ │ └─ inputs[], outputs[] │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Custom Op Metadata │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Edge TPU Binary Segments │ │ │
│ │ │ ├─ Segment 0: [offset, size] │ │ │
│ │ │ │ └─ Compiled instructions for ops 0..k │ │ │
│ │ │ ├─ Segment 1: [offset, size] │ │ │
│ │ │ │ └─ Compiled instructions for ops k+1..m │ │ │
│ │ │ └─ ... │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Parameter Data │ │ │
│ │ │ ├─ Quantised weights (aligned) │ │ │
│ │ │ ├─ Bias values (INT32, aligned) │ │ │
│ │ │ └─ Op configuration records │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Signature Defs (optional) │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Key Concepts
- Custom ops: All Edge TPU-compiled operations are represented as CUSTOM ops with the custom code "edgetpu-custom-op". The actual operation type (Conv2D, DepthwiseConv2D, etc.) is encoded in the custom op's binary metadata.
- Binary segments: The Edge TPU coprocessor executes pre-compiled binary instructions. These segments contain the microcode that tells the TPU how to execute each operation, including data routing, on-chip memory allocation, and computation scheduling.
- Parameter data: Weights and biases are stored in a separate data section within the FlatBuffer, aligned to the Edge TPU's DMA requirements (64-byte alignment for weights, 4-byte alignment for biases).
- Metadata buffer: The FlatBuffer's metadata_buffer field contains a serialised edgetpu_cache_entry that maps custom ops to their corresponding binary segments.
TFLite FlatBuffer Construction
edgecompiler constructs the TFLite FlatBuffer from scratch using the flatbuffers
Python library, so the default compilation path does not need the edgetpu_compiler
binary. That binary is used only as an optional fallback.
Construction Pipeline
Quantised IRGraph
           │
           ▼
┌─────────────────────┐
│ 1. Legalise ops     │─── Map IR ops to TFLite builtin op codes
│                     │─── Replace unsupported ops with alternatives
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 2. Allocate tensors │─── Assign tensor indices
│                     │─── Compute buffer sizes
│                     │─── Determine buffer types (UINT8, INT32, etc.)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 3. Build operators  │─── Create TFLite Operator table for each IR op
│                     │─── Set opcode_index, inputs, outputs
│                     │─── Encode op-specific options (Conv2DOptions, etc.)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 4. Embed Edge TPU   │─── Replace supported ops with custom ops
│    custom ops       │─── Generate binary segments for each op cluster
│                     │─── Build edgetpu_cache_entry metadata
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 5. Serialise        │─── Write FlatBuffer to bytes
│    FlatBuffer       │─── Verify with TFLite schema
│                     │─── Write .tflite file
└──────────┬──────────┘
           │
           ▼
  model_coral.tflite
FlatBuffer Schema Usage
edgecompiler uses the official TFLite FlatBuffer schema (schema.fbs) to construct
valid models. The schema is bundled with the package.
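import numpy as np

# Conv2DOptions is assumed to be exported by the same module as TFLiteBuilder.
from edgecompiler.backends.coral.flatbuffer_builder import Conv2DOptions, TFLiteBuilder

# Placeholder weights for illustration; real values come from the quantised IR.
quantised_weights = np.zeros((32, 3, 3, 3), dtype=np.int8)

builder = TFLiteBuilder()

# Add opcode entries (indices 0, 1, 2 in insertion order)
builder.add_opcode("CONV_2D", builtin=True)
builder.add_opcode("DEPTHWISE_CONV_2D", builtin=True)
builder.add_opcode("edgetpu-custom-op", builtin=False)

# Add tensors (index 0 and 1). The bias (index 2) and output (index 3)
# tensors referenced below are added the same way and are omitted here.
builder.add_tensor("input_0", shape=[1, 224, 224, 3], dtype="UINT8")
builder.add_tensor("Conv2D_weight", shape=[32, 3, 3, 3], dtype="INT8",
                   data=quantised_weights)

# Add operators
builder.add_operator(opcode_idx=2,  # edgetpu-custom-op
                     inputs=[0, 1, 2],
                     outputs=[3],
                     options=Conv2DOptions(pad="SAME", stride_h=1, stride_w=1))

# Build the model
flatbuffer_bytes = builder.build()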
Validation
After construction, the FlatBuffer is validated against several criteria:
- Schema validity: The FlatBuffer is parseable by the TFLite schema.
- Tensor consistency: All operator input/output indices reference valid tensors.
- Quantisation consistency: Quantisation parameters are present on all INT8/UINT8 tensors used by custom ops.
- Edge TPU compatibility: All custom ops have corresponding binary segments.
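The tensor, quantisation, and Edge TPU compatibility checks above can be sketched over a plain-dictionary stand-in for the parsed model. This is a minimal illustration and not edgecompiler's validator: the `model` layout, the `segment` field, and the `validate_model` name are assumptions made for this example. Schema validity is not covered because it requires the real FlatBuffer parser.

```python
from typing import Dict, List

CUSTOM_CODE = "edgetpu-custom-op"
QUANTISED_DTYPES = {"INT8", "UINT8"}


def validate_model(model: Dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid.

    `model` is a hypothetical dict with "tensors" and "operators" lists,
    used here only as a stand-in for the parsed FlatBuffer.
    """
    problems = []
    tensors = model["tensors"]
    for op_idx, op in enumerate(model["operators"]):
        for t_idx in op["inputs"] + op["outputs"]:
            # Tensor consistency: every index must point at a real tensor.
            if not 0 <= t_idx < len(tensors):
                problems.append(f"op {op_idx}: tensor index {t_idx} out of range")
                continue
            tensor = tensors[t_idx]
            # Quantisation consistency: INT8/UINT8 tensors used by custom
            # ops must carry quantisation parameters.
            if (op["code"] == CUSTOM_CODE
                    and tensor["dtype"] in QUANTISED_DTYPES
                    and tensor.get("quant") is None):
                problems.append(
                    f"op {op_idx}: tensor '{tensor['name']}' lacks quantisation")
        # Edge TPU compatibility: every custom op needs a binary segment.
        if op["code"] == CUSTOM_CODE and op.get("segment") is None:
            problems.append(f"op {op_idx}: custom op has no binary segment")
    return problems


model = {
    "tensors": [
        {"name": "input_0", "dtype": "UINT8", "quant": (0.0078, 128)},
        {"name": "Conv2D_weight", "dtype": "INT8", "quant": None},
        {"name": "Conv2D_output", "dtype": "UINT8", "quant": (0.02, 0)},
    ],
    "operators": [
        {"code": CUSTOM_CODE, "inputs": [0, 1], "outputs": [2, 5], "segment": 0},
    ],
}
for problem in validate_model(model):
    print(problem)
```

The example model is deliberately broken in two places (a weight tensor without quantisation parameters and an output index past the end of the tensor list), so both are reported. Collecting every problem instead of stopping at the first one makes a failed validation easier to act on.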
Custom Op Embedding
How Custom Ops Work
The Edge TPU runtime identifies compiled operations through the "edgetpu-custom-op"
custom op code. When the TFLite interpreter encounters this op, it delegates execution
to the Edge TPU via libedgetpu.
┌──────────────────────────────────────────────────┐
│ TFLite Interpreter                               │
│                                                  │
│ Op 0: edgetpu-custom-op ──┐                      │
│ Op 1: edgetpu-custom-op ──┤                      │
│ Op 2: edgetpu-custom-op ──┼──▶ Edge TPU Delegate │
│ Op 3: SOFTMAX           ──┼──▶ CPU execution     │
│ Op 4: edgetpu-custom-op ──┘                      │
│                                                  │
└──────────────────────────────────────────────────┘
Custom Op Binary Format
Each custom op's binary data is structured as:
┌─────────────────────────────────────┐
│ Custom Op Binary Record │
│ │
│ ┌───────────────────────────────┐ │
│ │ Op Header │ │
│ │ ├─ op_type: uint32 │ │
│ │ │ (0x01=Conv2D, │ │
│ │ │ 0x02=DepthwiseConv2D, │ │
│ │ │ 0x03=FullyConnected, │ │
│ │ │ 0x04=MaxPool2D, │ │
│ │ │ 0x05=AveragePool2D, │ │
│ │ │ 0x06=Add, │ │
│ │ │ ...) │ │
│ │ ├─ input_count: uint32 │ │
│ │ ├─ output_count: uint32 │ │
│ │ ├─ flags: uint32 │ │
│ │ └─ reserved: uint32[4] │ │
│ └───────────────────────────────┘ │
│ │
│ ┌───────────────────────────────┐ │
│ │ Input Descriptors │ │
│ │ ├─ tensor_index: uint32 │ │
│ │ ├─ on_chip_address: uint32 │ │
│ │ ├─ size: uint32 │ │
│ │ └─ quant: (scale, zp) │ │
│ └───────────────────────────────┘ │
│ │
│ ┌───────────────────────────────┐ │
│ │ Output Descriptors │ │
│ │ ├─ tensor_index: uint32 │ │
│ │ ├─ on_chip_address: uint32 │ │
│ │ ├─ size: uint32 │ │
│ │ └─ quant: (scale, zp) │ │
│ └───────────────────────────────┘ │
│ │
│ ┌───────────────────────────────┐ │
│ │ Op-Specific Parameters │ │
│ │ (Conv2D: kernel_h, kernel_w, │ │
│ │ stride_h, stride_w, pad_h, │ │
│ │ pad_w, dilation_h, dilation_w,│ │
│ │ depth_multiplier, ...) │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
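The fixed-size Op Header in the diagram above (eight uint32 fields, counting the four reserved words) can be packed with Python's standard struct module. This is a sketch only: little-endian byte order and the helper names `pack_op_header` and `unpack_op_header` are assumptions, since the layout above does not state an endianness.

```python
import struct

# Assumed layout: eight little-endian uint32 fields, in the order listed in
# the "Op Header" box (op_type, input_count, output_count, flags, reserved[4]).
OP_HEADER = struct.Struct("<8I")

# Op type codes as given in the diagram.
OP_TYPES = {
    "Conv2D": 0x01,
    "DepthwiseConv2D": 0x02,
    "FullyConnected": 0x03,
    "MaxPool2D": 0x04,
    "AveragePool2D": 0x05,
    "Add": 0x06,
}


def pack_op_header(op_name, input_count, output_count, flags=0):
    """Serialise the header; the four reserved words are written as zero."""
    return OP_HEADER.pack(OP_TYPES[op_name], input_count, output_count,
                          flags, 0, 0, 0, 0)


def unpack_op_header(data):
    """Read the header back, ignoring the reserved words."""
    op_type, n_in, n_out, flags, *_reserved = OP_HEADER.unpack(
        data[:OP_HEADER.size])
    return op_type, n_in, n_out, flags


header = pack_op_header("Conv2D", input_count=3, output_count=1)
print(len(header))               # 32 bytes: 8 fields x 4 bytes
print(unpack_op_header(header))  # (1, 3, 1, 0)
```

A precompiled `struct.Struct` keeps the format string in one place, so the packer and the parser cannot drift apart.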
Building Custom Op Records
edgecompiler builds these binary records from the IR operation attributes:
from edgecompiler.backends.coral.custom_ops import CustomOpBuilder

# `graph` is the quantised IRGraph and `op` is the Conv2D IR operation being lowered.
builder = CustomOpBuilder()

# Map IR operation to custom op binary
record = builder.build_conv2d(
    input_tensor=graph.get_tensor("input_0"),
    weight_tensor=graph.get_tensor("Conv2D_weight"),
    bias_tensor=graph.get_tensor("Conv2D_bias"),
    output_tensor=graph.get_tensor("Conv2D_output"),
    stride_h=op.attributes["stride_h"],
    stride_w=op.attributes["stride_w"],
    padding=op.attributes["padding"],  # "SAME" or "VALID"
    dilation_h=op.attributes.get("dilation_h", 1),
    dilation_w=op.attributes.get("dilation_w", 1),
)

# Record contains:
#   record.op_type = 0x01
#   record.binary_data = bytes(...)       # Serialised record
#   record.parameter_references = [...]   # Weight/bias locations
Parameter Data Serialisation
Alignment Requirements
The Edge TPU requires parameter data to be aligned to specific boundaries for DMA transfers:
| Data Type | Alignment | Notes |
|---|---|---|
| Weights (INT8) | 64 bytes | Must be contiguous per layer |
| Biases (INT32) | 4 bytes | Accumulator values |
| Per-channel scales | 4 bytes | Float32, per output channel |
| Per-channel zero points | 4 bytes | Int32, per output channel |
Serialisation Process
IR Tensors (Quantised)
           │
           ▼
┌─────────────────────┐
│ 1. Extract weights  │─── Get constant tensor data from IR
│    and biases       │─── Verify INT8 / INT32 types
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 2. Align data       │─── Pad to 64-byte boundary (weights)
│                     │─── Pad to 4-byte boundary (biases)
│                     │─── Record offsets for reference
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 3. Pack into        │─── Concatenate all parameter data
│    data section     │─── Weights first, then biases, then scales
│                     │─── Build offset table for custom op records
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 4. Embed in         │─── Write parameter data to FlatBuffer buffer
│    FlatBuffer       │─── Set buffer offsets in tensor definitions
│                     │─── Validate all references are correct
└──────────┬──────────┘
           │
           ▼
Complete FlatBuffer with embedded parameters
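Steps 2 and 3 above (align, then pack weights, biases, and scales in that order while recording offsets) can be sketched as follows. The function names, the `(name, bytes)` input shape, and the zero fill used for padding are assumptions made for this illustration, not edgecompiler's actual serialiser.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def pack_parameters(weights, biases, scales=()):
    """Concatenate (name, bytes) pairs into one blob: weights, biases, scales.

    Returns the blob and a name -> byte offset table.
    """
    blob = bytearray()
    offsets = {}
    # Alignments follow the table above: 64 bytes for weights, 4 for the rest.
    for alignment, group in ((64, weights), (4, biases), (4, scales)):
        for name, data in group:
            start = align_up(len(blob), alignment)
            blob.extend(bytes(start - len(blob)))  # zero padding (assumed fill)
            offsets[name] = start
            blob.extend(data)
    return bytes(blob), offsets


blob, offsets = pack_parameters(
    weights=[("conv0/w", bytes(100)), ("conv1/w", bytes(30))],
    biases=[("conv0/b", bytes(8))],
)
print(offsets)    # {'conv0/w': 0, 'conv1/w': 128, 'conv0/b': 160}
print(len(blob))  # 168
```

In the example, the second weight tensor starts at 128 rather than 100 because of the 64-byte rule, and the bias starts at 160 rather than 158 because of the 4-byte rule.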
On-Chip Memory Management
The Edge TPU has limited on-chip SRAM (~2 MB on Coral USB). edgecompiler simulates
on-chip memory allocation during compilation to ensure the binary segments are
feasible:
from edgecompiler.backends.coral.memory_planner import OnChipMemoryPlanner

planner = OnChipMemoryPlanner(total_sram=2 * 1024 * 1024)  # 2 MB

# `graph` is the IRGraph being compiled
for op in graph.ops:
    # Allocate on-chip buffers for activations
    planner.allocate(op.output_tensor, lifetime=(op.index, op.last_use_index))

# Verify feasibility
if not planner.is_feasible():
    # Some ops must fall back to off-chip (slower) or CPU
    planner.spill_to_offchip(planner.largest_activation)
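One way to picture the feasibility check is as a peak-liveness computation over activation lifetimes. The sketch below is an assumption about how such a check could work, not a description of OnChipMemoryPlanner internals; `peak_live_bytes` is a hypothetical helper, and it ignores fragmentation, so it gives a lower bound on the SRAM actually needed.

```python
SRAM_BYTES = 2 * 1024 * 1024  # ~2 MB, as in the planner example above


def peak_live_bytes(buffers):
    """Peak number of bytes live at once.

    buffers: list of (size_bytes, first_op, last_op) with inclusive lifetimes.
    """
    events = []
    for size, first, last in buffers:
        events.append((first, size))      # allocated when op `first` runs
        events.append((last + 1, -size))  # released after op `last` completes
    live = peak = 0
    # Tuples sort by step, then delta, so releases are applied before
    # allocations that happen at the same step.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


activations = [
    (1_000_000, 0, 1),  # produced by op 0, last read by op 1
    (800_000, 1, 2),
    (600_000, 2, 3),
]
peak = peak_live_bytes(activations)
print(peak, peak <= SRAM_BYTES)  # 1800000 True

# A long-lived skip connection pushes the peak over the budget.
peak_with_skip = peak_live_bytes(activations + [(400_000, 0, 3)])
print(peak_with_skip, peak_with_skip <= SRAM_BYTES)  # 2200000 False
```

When the peak exceeds the budget, spilling the largest activation (as the planner example does) is a natural first choice, because it lowers the peak by the largest amount per spilled tensor.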
Operation Partitioning Strategy
Not all operations can run on the Edge TPU. The partitioner decides which ops execute on the TPU and which fall back to CPU.
Partitioning Algorithm
IRGraph (all ops)
             │
             ▼
┌─────────────────────────┐
│ 1. Classify each op     │─── TPU_SUPPORTED, CPU_ONLY, or CONDITIONAL
│    by Edge TPU support  │─── CONDITIONAL ops depend on tensor shapes/sizes
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ 2. Build op clusters    │─── Group consecutive TPU ops into clusters
│    (TPU runs)           │─── A cluster = one binary segment
│                         │─── Minimum cluster size: 1 op
│                         │─── Maximum cluster size: limited by SRAM
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ 3. Insert data transfer │─── At cluster boundaries: insert
│    ops                  │    Dequantize → (CPU ops) → Quantize
│                         │    if going TPU → CPU → TPU
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ 4. Optimise partitions  │─── Merge small clusters when possible
│                         │─── Move isolated CPU ops to TPU if
│                         │    a compatible variant exists
│                         │─── Minimise TPU↔CPU transitions
└────────────┬────────────┘
             │
             ▼
Partitioned Graph
┌───────────────────────────────────────┐
│ TPU Cluster 0: Conv2D → BN → ReLU     │
│      ↓ (Dequantize)                   │
│ CPU: Reshape (dynamic shape)          │
│      ↓ (Quantize)                     │
│ TPU Cluster 1: Conv2D → ReLU → Pool   │
│      ↓                                │
│ CPU: Softmax                          │
└───────────────────────────────────────┘
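Step 2 of the algorithm above, grouping consecutive TPU ops into clusters, can be sketched with itertools.groupby. This is a simplified illustration that treats the graph as a linear list of already classified ops, whereas the real partitioner works on an IRGraph; `build_clusters` and `count_transitions` are hypothetical names.

```python
from itertools import groupby


def build_clusters(ops):
    """Group consecutive ops that run on the same device.

    ops: list of (name, device) pairs in execution order, where device is
    "TPU" or "CPU". Returns a list of (device, [op names]) runs.
    """
    return [(device, [name for name, _ in run])
            for device, run in groupby(ops, key=lambda op: op[1])]


def count_transitions(clusters):
    """Each boundary between adjacent runs is one TPU<->CPU transition."""
    return max(len(clusters) - 1, 0)


# The same op sequence as the partitioned graph shown above.
ops = [
    ("Conv2D", "TPU"), ("BN", "TPU"), ("ReLU", "TPU"),
    ("Reshape", "CPU"),
    ("Conv2D", "TPU"), ("ReLU", "TPU"), ("Pool", "TPU"),
    ("Softmax", "CPU"),
]
clusters = build_clusters(ops)
for device, names in clusters:
    print(device, names)
print("transitions:", count_transitions(clusters))  # transitions: 3
```

Because groupby only merges adjacent items, a single CPU op in the middle of a TPU run splits it into two clusters. That is why step 4 tries to move isolated CPU ops onto the TPU.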
Partitioning Rules
| Rule | Description |
|---|---|
| Minimise transitions | Each TPU↔CPU boundary adds ~0.1 ms overhead. Prefer keeping ops on one side. |
| Fusion priority | Fused ops (Conv+BN+ReLU) are always kept together in one cluster. |
| SRAM budget | A cluster's total activation memory must fit in ~2 MB. If not, split the cluster. |
| Dynamic shapes | Ops with dynamic output shapes (e.g., NonMaxSuppression) always fall back to CPU. |
| Softmax placement | Softmax is supported on Edge TPU but may be more accurate on CPU for large class counts (>1000). |
| Concat boundaries | Concat ops can merge multiple TPU clusters if all inputs are TPU-produced. |
Example Partitioning
For MobileNetV2 with 152 operations:
Typical partitioning result:
TPU ops: 140 (92.1%)
CPU ops: 12 (7.9%)
TPU clusters: 3
Transitions: 4
CPU-fallback ops typically include:
- Reshape with dynamic target shape
- Gather on non-0 axis
- StridedSlice with complex masks
- Custom post-processing ops
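The figures in the MobileNetV2 example can be checked with a few lines of arithmetic. The overhead estimate uses the approximate 0.1 ms per boundary from the partitioning rules and assumes each transition is crossed once per inference, which is a simplification.

```python
tpu_ops, cpu_ops, transitions = 140, 12, 4
overhead_per_transition_ms = 0.1  # approximate figure from the rules table

total_ops = tpu_ops + cpu_ops
tpu_share = tpu_ops / total_ops
cpu_share = cpu_ops / total_ops
# Assumes every TPU<->CPU boundary is crossed exactly once per inference.
transition_overhead_ms = transitions * overhead_per_transition_ms

print(f"TPU share: {tpu_share:.1%}")  # TPU share: 92.1%
print(f"CPU share: {cpu_share:.1%}")  # CPU share: 7.9%
print(f"Transition overhead: ~{transition_overhead_ms:.1f} ms")
```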
Fallback to edgetpu_compiler
When edgecompiler's native compilation encounters issues, it can fall back to the
official edgetpu_compiler binary if available.
Fallback Conditions
The compiler falls back to edgetpu_compiler when:
- Complex custom ops: The model contains ops that edgecompiler cannot yet compile natively for the Edge TPU.
- Verification failure: The natively compiled model produces incorrect results compared to reference inference.
- User override: The user explicitly requests --fallback-edgetpu-compiler.
Fallback Mechanism
# Automatic fallback logic (runs when one of the fallback conditions above is met)
from edgecompiler.backends.coral.fallback import EdgetpuCompilerFallback

fallback = EdgetpuCompilerFallback()
if fallback.is_available():
    # edgetpu_compiler is installed on the system
    result = fallback.compile(
        input_tflite="quantized_model.tflite",
        output_path="model_coral.tflite",
    )
else:
    # No edgetpu_compiler available; edgecompiler compiles natively.
    # native_compile, ir_graph and config are not defined in this snippet.
    result = native_compile(ir_graph, config)
Detecting edgetpu_compiler
# edgecompiler checks these locations for edgetpu_compiler:
# 1. PATH environment variable
# 2. /usr/local/bin/edgetpu_compiler
# 3. ~/.edgecompiler/edgetpu_compiler
# 4. The edgetpu_compiler Docker image (if Docker is available)
edgecompile model.tflite --target coral --verbose
# Output:
# [INFO] edgetpu_compiler not found in PATH
# [INFO] Using native edgecompiler Coral backend
# [INFO] If you encounter issues, install edgetpu_compiler for fallback support
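The lookup order listed above can be sketched with the standard library. This is an illustration of the search order only, not edgecompiler's detection code: `find_edgetpu_compiler` is a hypothetical helper, and the Docker step merely checks that a docker executable exists rather than that the edgetpu_compiler image is present.

```python
import os
import shutil
from pathlib import Path

# Locations 2 and 3 from the list above.
FIXED_LOCATIONS = [
    Path("/usr/local/bin/edgetpu_compiler"),
    Path.home() / ".edgecompiler" / "edgetpu_compiler",
]


def find_edgetpu_compiler(name="edgetpu_compiler",
                          fixed_locations=FIXED_LOCATIONS):
    """Return (kind, location) for the first match, or None if nothing is found."""
    on_path = shutil.which(name)           # 1. PATH lookup
    if on_path:
        return ("binary", on_path)
    for candidate in fixed_locations:      # 2-3. fixed install locations
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return ("binary", str(candidate))
    docker = shutil.which("docker")        # 4. Docker as a last resort
    if docker:
        return ("docker", docker)
    return None


found = find_edgetpu_compiler()
if found is None:
    print("edgetpu_compiler not found; using the native Coral backend")
else:
    print(f"edgetpu_compiler available via {found[0]}: {found[1]}")
```

Passing the name and the fixed locations as parameters keeps the search order testable without touching the real file system layout.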
Why Native Compilation is Preferred
| Aspect | edgecompiler (native) | edgetpu_compiler (fallback) |
|---|---|---|
| macOS ARM64 | ✅ Native support | ❌ Requires x86 emulation or Docker |
| Installation | pip install edgecompiler | Download binary + Docker |
| Speed | Fast (in-process) | Slow (subprocess + Docker overhead) |
| Customisation | Full control over partitioning | Black-box compilation |
| Debugging | IR dumps, verbose logging | Limited output |
| Extensibility | Add custom passes | No extension API |
Known Limitations
Edge TPU Hardware Limits
| Limitation | Detail |
|---|---|
| SRAM size | ~2 MB on-chip memory for activations. Large models may require off-chip spills. |
| Tensor rank | Maximum 4 dimensions for most ops. 5D tensors are not supported. |
| Batch size | Only batch size 1 is supported for most operations. |
| Weight quantisation | Weights must be INT8 (per-channel) or UINT8 (per-tensor). FP16 weights are not supported. |
| Bias precision | Biases are INT32 (accumulated). No FP16 bias support. |
| Maximum model size | Practical limit of ~8 MB for parameter data on Coral USB. |
edgecompiler Software Limits
| Limitation | Detail | Workaround |
|---|---|---|
| LSTM/GRU on TPU | Recurrent layers are not supported on Edge TPU | Use CNN or Transformer alternatives |
| Dynamic shapes | Ops with dynamic output shapes fall back to CPU | Pre-compute shapes at compile time |
| Einsum | Not supported on Edge TPU | Decompose into supported ops |
| ScatterND | Not supported on Edge TPU | Use Gather + alternative indexing |
| Multi-device | Only one Coral USB device per process is currently supported | Run a separate process for each USB device |
| Training | No on-device training support | Train off-device, compile for inference |
| FP16 models | Edge TPU requires INT8 quantisation | Quantise to INT8 before compilation |
macOS-Specific Limits
| Limitation | Detail |
|---|---|
| libedgetpu ARM64 | No official ARM64 macOS build. Use our install_coral_runtime.sh script for a community-built binary. |
| USB driver | Some USB-C hubs may cause intermittent disconnections. Use a direct USB-A port with a USB-C adapter. |
| Thermal throttling | Extended inference on both Metal and Coral simultaneously may trigger thermal throttling on M1 Pro. |
Debugging Tips
When compilation fails or produces incorrect results:
- Check op support: Run with --verbose to see which ops are mapped to TPU vs CPU.
- Verify quantisation: Use --dump-ir to inspect the quantised IR before backend code generation.
- Compare outputs: Use the runtime to compare model outputs before and after compilation:
    from edgecompiler.runtime import InferenceSession, compare_outputs

    # test_input and run_reference_model are user-supplied: a sample input and
    # a function that runs the original (uncompiled) model on it.
    session = InferenceSession("model_coral.tflite", target="coral")
    result = session.run({"input": test_input})
    reference = run_reference_model(test_input)
    mse, max_diff = compare_outputs(result, reference)
    print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
- Reduce model complexity: If partitioning fails, try reducing the model size or splitting it into sub-models.
- Use edgetpu_compiler fallback: Install edgetpu_compiler and re-run with --fallback-edgetpu-compiler to compare results.
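For the "Compare outputs" step, the two reported metrics can also be computed by hand when the runtime helper is not available. The sketch below shows what mean squared error and maximum absolute difference mean for flat output vectors; `mse_and_max_diff` is a hypothetical stand-in and not the implementation of compare_outputs.

```python
def mse_and_max_diff(result, reference):
    """Mean squared error and largest absolute difference of two flat vectors."""
    if len(result) != len(reference):
        raise ValueError("outputs differ in length")
    if not result:
        raise ValueError("outputs are empty")
    diffs = [float(a) - float(b) for a, b in zip(result, reference)]
    mse = sum(d * d for d in diffs) / len(diffs)
    max_diff = max(abs(d) for d in diffs)
    return mse, max_diff


# Class probabilities from a compiled model vs. a reference model (made-up values).
compiled = [0.10, 0.70, 0.20]
expected = [0.10, 0.60, 0.30]
mse, max_diff = mse_and_max_diff(compiled, expected)
print(f"MSE: {mse:.6f}, Max diff: {max_diff:.6f}")
```

A low MSE with a high maximum difference usually points at a few badly quantised outputs rather than a uniformly noisy model, so it helps to look at both numbers together.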