Edge TPU Ecosystem Research

Comprehensive technical reference for the Google Coral Edge TPU platform, compiled as the basis for an alternative compiler project.


Table of Contents

  1. Official Coral USB / Edge TPU Software Stack
  2. Coral USB on macOS / Apple Silicon
  3. Existing Compiler Architectures (IREE, TVM, ExecuTorch)
  4. Edge TPU Operation Set & Limitations
  5. Sloth Integration Implications

1. Official Coral USB / Edge TPU Software Stack

1.1 Repository Structure

The Coral software stack comprises three principal repositories:

| Repository | Purpose | Language | Current TF Version |
|---|---|---|---|
| google-coral/libedgetpu | Userspace runtime driver (USB/PCIe communication) | C++ | 2.16.1 |
| google-coral/libcoral | High-level C++ inference/pipeline/transfer-learning API | C++ | |
| google-coral/pycoral | Python bindings (PyCoral) | Python/C++ | |

The legacy monorepo google-coral/edgetpu now serves primarily as an issue tracker; source has been split into the repos above. The deprecated edgetpu Python API still lives in the old repo but is superseded by pycoral.

1.2 Building libedgetpu

Three build paths are documented:

A. Docker

# x86-64
DOCKER_CPUS="k8" DOCKER_IMAGE="ubuntu:22.04" DOCKER_TARGETS=libedgetpu make docker-build

# ARM64 / ARMv7
DOCKER_CPUS="armv7a aarch64" DOCKER_IMAGE="debian:bookworm" DOCKER_TARGETS=libedgetpu make docker-build

All built binaries land in the out/ directory. Debian packages can be produced with debuild -us -uc -tc -b -a arm64 -d.

B. Bare Bazel

# Requires Bazel 6.5.0 for TF 2.16.1
make                  # native build
CPU=armv7a make       # cross-compile ARMv7
CPU=aarch64 make      # cross-compile AArch64

macOS caveat: Compilation fails out of the box. Two manual steps are required:

  1. Install flatbuffers via MacPorts.
  2. After the first build failure, patch the auto-generated Bazel BUILD file at /var/tmp/_bazel_xxxxx/.../external/local_config_cc/BUILD line 48:
"darwin_x86_64": ":cc-compiler-darwin",

C. Makefile (Linux-only, no Bazel)

sudo apt install libabsl-dev libflatbuffers-dev
git clone https://github.com/tensorflow/tensorflow && cd tensorflow && git checkout v2.16.1
TFROOT=<path-to-tf> make -f makefile_build/Makefile -j$(nproc) libedgetpu

This approach eliminates Bazel/Docker entirely and uses system packages for libabsl and libflatbuffers.

1.3 The *_edgetpu.tflite Binary Format

1.3.1 How the Edge TPU Custom Op Is Embedded

The Edge TPU Compiler (edgetpu_compiler) takes a standard TFLite FlatBuffer and transforms it into a compiled *_edgetpu.tflite file. The key transformation is:

  1. Subgraph partitioning: The compiler walks the TFLite subgraph from inputs to outputs. Starting from the beginning, it collects consecutive supported INT8 ops into a single contiguous subgraph destined for the Edge TPU.

  2. Custom op creation: All Edge TPU-compatible ops in that contiguous subgraph are replaced by a single custom op with the name "edgetpu-custom-op":

// From edgetpu.h
static const char kCustomOp[] = "edgetpu-custom-op";
  3. Opaque binary payload: The custom op carries an opaque binary blob in its custom_options field (the FlatBuffer Operator.custom_options byte vector). This blob contains:

     • The inference executable: a compiled binary that the Edge TPU DSP/systolic array can execute directly. This is NOT a TFLite subgraph; it is a proprietary instruction stream for the Edge TPU hardware.
     • Parameter data layout descriptors: metadata describing which tensors should be cached in on-chip SRAM versus streamed from off-chip memory.
     • A caching token: a 64-bit number uniquely identifying the parameter data layout for cache management.

  4. Remaining ops stay as-is: If the compiler encounters an unsupported op, it stops the Edge TPU subgraph. All remaining ops after the unsupported op stay as standard TFLite ops and execute on the CPU. This creates a "split" model:

     • Edge TPU custom op (runs on the TPU)
     • Remaining TFLite ops (run on CPU)

1.3.2 The "Executable Preamble"

The Coral documentation explicitly states:

"The Edge TPU Compiler adds a small executable inside the model that writes a specific amount of the model's parameter data to the Edge TPU RAM (if available) before running an inference."

This executable preamble is embedded within the edgetpu-custom-op custom options blob and serves two purposes:

  1. Parameter data loading (cache warm-up): On first invocation, the preamble streams parameter data (INT8 weights/biases) from host memory into the Edge TPU's ~8 MB on-chip SRAM. Subsequent invocations can skip this step if the caching token matches.

  2. Inference dispatch: After parameter data is loaded, the preamble transitions to the actual inference engine that drives the Edge TPU's systolic array.

1.3.3 Parameter-Data Caching Protocol

The Edge TPU has approximately 8 MB of SRAM for parameter caching. The protocol works as follows:

┌──────────────────────────────────────────────────┐
│             Edge TPU SRAM (~8 MB)                │
│  ┌────────────────────────────────────────────┐  │
│  │  Inference Executable Space (reserved)      │  │
│  ├────────────────────────────────────────────┤  │
│  │  Parameter Data Cache (remaining space)     │  │
│  │  - Weights, biases for cached ops           │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Key details:

  • Compiler-allocated scratchpad (not traditional cache): The compiler knows the exact SRAM size and the model's requirements, so it assigns fixed cache space at compile time.

  • Caching token: A 64-bit number assigned at compile time. The runtime compares the token of the incoming model against the token of the currently-cached data:

      • Match: Use cached data (fast path).
      • Mismatch: Wipe cache, write new data, then execute (slow first inference).

  • Co-compilation: Multiple models can be compiled together (edgetpu_compiler model_A.tflite model_B.tflite) to share the same caching token, eliminating cache thrashing when switching between models.

  • Compiler output example:

On-chip memory available for caching model parameters: 6.91 MiB
On-chip memory used for caching model parameters: 4.21 MiB
Off-chip memory used for streaming uncached model parameters: 0.00 B
  • Off-chip fallback: If parameter data exceeds available SRAM, the excess is streamed from host memory at inference time, degrading performance.
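
The token comparison described above can be illustrated with a toy simulation. This is only a sketch of the documented behavior, not the libedgetpu implementation; `CompiledModel` and `ParameterCacheSim` are hypothetical names, and the token values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompiledModel:
    name: str
    caching_token: int  # 64-bit token assigned at compile time


class ParameterCacheSim:
    """Toy model of the token check (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.resident_token: Optional[int] = None

    def invoke(self, model: CompiledModel) -> str:
        if self.resident_token == model.caching_token:
            return "hit"  # fast path: parameters are already on-chip
        # Mismatch: discard resident data and load this model's parameters.
        self.resident_token = model.caching_token
        return "miss"


# Separately compiled models carry different tokens and evict each other.
cache = ParameterCacheSim()
a = CompiledModel("model_A", caching_token=0x1111)
b = CompiledModel("model_B", caching_token=0x2222)
separate = [cache.invoke(m) for m in (a, b, a, b)]

# Co-compiled models share one token, so only the first call pays the cost.
cache2 = ParameterCacheSim()
a2 = CompiledModel("model_A", caching_token=0x3333)
b2 = CompiledModel("model_B", caching_token=0x3333)
co_compiled = [cache2.invoke(m) for m in (a2, b2, a2, b2)]
```

In this simulation, alternating between the separately compiled models misses on every call, while the co-compiled pair misses only once.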

1.4 The libedgetpu API (edgetpu.h)

The public API is defined in libedgetpu/edgetpu.h. Key components:

1.4.1 EdgeTpuManager (Singleton)

class EDGETPU_EXPORT EdgeTpuManager {
 public:
  // Singleton accessor
  static EdgeTpuManager* GetSingleton();

  // Shared-ownership device opening (preferred API)
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType device_type) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType device_type, const std::string& device_path) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType device_type, const std::string& device_path,
      const DeviceOptions& options) = 0;

  // Deprecated: exclusive-ownership API
  virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;

  // Device enumeration
  virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;

  // Currently opened shared devices
  virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;

  // Runtime version
  virtual std::string Version() const = 0;
  virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
};

Device options (passed as std::unordered_map<std::string, std::string>):

  • "Performance": "Low" | "Medium" | "High" | "Max" (default: "Max")
  • "Usb.AlwaysDfu": "True" | "False" (default: "False")
  • "Usb.MaxBulkInQueueLength": "0".."255" (default: "32")

Device types:

enum class DeviceType {
  kApexPci = 0,
  kApexUsb = 1,
};

1.4.2 EdgeTpuContext

class EdgeTpuContext : public TfLiteExternalContext {
 public:
  virtual ~EdgeTpuContext() = 0;
  virtual const EdgeTpuManager::DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
  virtual EdgeTpuManager::DeviceOptions GetDeviceOptions() const = 0;
  virtual bool IsReady() const = 0;
};

The context is a TfLiteExternalContext, meaning it plugs into the TFLite interpreter via interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get()).

1.4.3 TFLite Delegate Registration

The custom op is registered via the standard TFLite mechanism:

// Typical usage (non-NNAPI path).
// Assumes `model` is a std::unique_ptr<tflite::FlatBufferModel> loaded elsewhere.
auto tpu_context = edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();

tflite::ops::builtin::BuiltinOpResolver resolver;
// Register the Edge TPU custom op handler
resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());

std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Bind the TPU context to the interpreter
interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get());

interpreter->AllocateTensors();
interpreter->Invoke();

RegisterCustomOp() returns a TfLiteRegistration* that handles:

  • Init: Parses the custom_options blob, sets up the Edge TPU executable
  • Prepare: Resolves tensor shapes and memory layout
  • Invoke: Sends the inference request to the Edge TPU hardware via USB/PCIe
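
From Python, the same binding is usually done by loading libedgetpu as a TFLite delegate rather than registering the custom op by hand. The sketch below follows the pattern used in Coral's tflite_runtime examples; `edgetpu_lib_name` and `make_tpu_interpreter` are hypothetical helper names, and `make_tpu_interpreter` needs tflite_runtime, libedgetpu, and an attached device to actually run.

```python
import platform

# Shared-library file names for the Edge TPU delegate, per operating system.
_EDGETPU_LIBS = {
    "Linux": "libedgetpu.so.1",
    "Darwin": "libedgetpu.1.dylib",
    "Windows": "edgetpu.dll",
}


def edgetpu_lib_name(system: str = "") -> str:
    """Return the delegate library name for `system` (default: this host)."""
    return _EDGETPU_LIBS[system or platform.system()]


def make_tpu_interpreter(model_path: str):
    """Build an interpreter bound to the Edge TPU delegate (needs hardware)."""
    # Imported lazily so edgetpu_lib_name() works without tflite_runtime.
    import tflite_runtime.interpreter as tflite

    delegate = tflite.load_delegate(edgetpu_lib_name())
    interpreter = tflite.Interpreter(
        model_path=model_path, experimental_delegates=[delegate]
    )
    interpreter.allocate_tensors()
    return interpreter
```

Keeping the library name lookup separate from the interpreter construction makes the "failed to load delegate" class of errors easier to diagnose, since the expected file name can be checked before any hardware is touched.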

1.4.4 Runtime Version Negotiation

The compiled model encodes a minimum runtime version requirement. At inference time, the runtime checks compatibility:

Failed precondition: Package requires runtime version (12),
which is newer than this runtime version (10).

Version compatibility table:

| Compiler Version | Default Runtime Version Required |
|---|---|
| 16.0 | 14 |
| 15.0 | 13 |
| 14.1 | 13 |
| 2.1.302470888 | 13 |
| 2.0.291256449 | 12 |
| 1.0 | 10 |

Models are forward-compatible: a model compiled for runtime v10 will work on v12+. To make a model run on a runtime older than the compiler's default, lower the required version at compile time:

edgetpu_compiler --min_runtime_version 10 your_model.tflite

The runtime version also determines which operations are available (newer ops like LSTM, PReLU, ReduceMax, etc. require runtime ≥13 or ≥14).
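
The compatibility rule can be written down as a small check. This is a sketch built only from the version table and error message in this section; `DEFAULT_RUNTIME`, `check_runtime`, and `can_run` are hypothetical names, not part of any Coral API.

```python
from typing import Optional

# Default required runtime per compiler version, as listed in this section.
DEFAULT_RUNTIME = {
    "16.0": 14,
    "15.0": 13,
    "14.1": 13,
    "2.1.302470888": 13,
    "2.0.291256449": 12,
    "1.0": 10,
}


def check_runtime(required: int, installed: int) -> None:
    """Raise if the model needs a newer runtime than the installed one."""
    if required > installed:
        raise RuntimeError(
            f"Failed precondition: Package requires runtime version ({required}), "
            f"which is newer than this runtime version ({installed})."
        )


def can_run(compiler_version: str, installed: int,
            min_runtime_version: Optional[int] = None) -> bool:
    """True if a model from `compiler_version` loads on runtime `installed`.

    `min_runtime_version` mirrors the --min_runtime_version compiler flag.
    """
    required = (min_runtime_version if min_runtime_version is not None
                else DEFAULT_RUNTIME[compiler_version])
    try:
        check_runtime(required, installed)
    except RuntimeError:
        return False
    return True
```

For example, `can_run("2.0.291256449", 10)` is false, matching the error shown earlier, while a model compiled with `--min_runtime_version 10` passes the check on runtime 10.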

1.5 Software Stack Summary

┌─────────────────────────────────────────┐
│            User Application             │
├─────────────────────────────────────────┤
│  PyCoral (Python) / libcoral (C++)      │  High-level inference API
├─────────────────────────────────────────┤
│  TFLite Interpreter + Custom Op         │  Standard TFLite + edgetpu-custom-op
├─────────────────────────────────────────┤
│  libedgetpu                             │  Runtime driver (USB/PCIe comm)
├─────────────────────────────────────────┤
│  Edge TPU Hardware (USB / M.2 / PCIe)  │  ASIC accelerator
└─────────────────────────────────────────┘

2. Coral USB on macOS / Apple Silicon

2.1 Official Support Status

Google Coral's official documentation states that the USB Accelerator works on "Linux, Mac, or Windows." However, the official pre-built libedgetpu binaries are x86_64-only for macOS. The Edge TPU Compiler is Linux x86-64 only (since compiler v2.1, ARM64 builds are no longer provided).

As of 2024–2025, Google has not released native ARM64 (Apple Silicon) binaries for libedgetpu, pycoral, or libcoral.

2.2 Community Efforts

2.2.1 cocoa-xu/libedgetpu

Repository: github.com/cocoa-xu/libedgetpu

Cocoa Xu maintains a fork of libedgetpu that provides:

  • Pre-built darwin-arm64 (Apple Silicon) binaries of libedgetpu
  • aarch64-linux-musl builds (static linking for Alpine/musl-based systems)
  • riscv64-linux-musl builds
  • Integration with the tflite_beam Erlang/Elixir bindings, which bundles the native libraries

The fork modifies the Bazel/Makefile build system to cross-compile for darwin-arm64 and other targets. The key challenge is that the upstream libedgetpu build assumes darwin_x86_64 for macOS, and the Bazel toolchain configuration does not include darwin_arm64 by default.

How the ARM64 build was achieved:

  1. Fork the google-coral/libedgetpu repo
  2. Patch the Bazel build configuration to add darwin_arm64 as a target CPU
  3. Build on an Apple Silicon Mac using native Clang (or cross-compile from x86_64 using Rosetta 2 + ARM64 target triple)
  4. The resulting libedgetpu.1.dylib is a native ARM64 shared library

2.2.2 feranick/libedgetpu

Repository: github.com/feranick/libedgetpu

Feranick maintains an actively-updated fork that tracks newer TensorFlow versions:

  • Current release: 16.0TF2.19.1-1 (compatible with TF 2.19.1)
  • Provides deb packages for amd64, arm64, armhf
  • Also maintains feranick/TFlite-builds with updated tflite_runtime Python wheels

This fork is widely used by the Frigate NVR community for running Coral TPU on ARM64 Linux systems (e.g., Raspberry Pi 5, Orange Pi).

Note: feranick's builds target Linux ARM64, not macOS ARM64. They don't produce darwin-arm64 dylibs.

2.2.3 Tim Strobel's Setup Guide (2024)

URL: tim-strobel.de/coral.html

A practical guide for running Coral USB on macOS 14.x (Sonoma) as of October 2024:

Steps:

  1. Install edgetpu runtime and PyCoral per Coral's setup guide
  2. Python version: Must use Python 3.9 (PyCoral's last supported version; 3.10+ does not work)
  3. NumPy issue: PyCoral was compiled against NumPy 1.x; NumPy 2.x causes AttributeError: _ARRAY_API not found. Fix:
pip install "numpy<2.0"
  4. Library linking error: ValueError: Failed to load delegate from libedgetpu.1.dylib. The dylib is installed at /usr/local/lib/libedgetpu.1.dylib but the system can't find it. Fix:
sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib

This guide is for Intel Macs or Apple Silicon Macs running under Rosetta 2.
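
The three pitfalls in the guide (Python version, NumPy major version, missing symlink) can be checked before running any model. The following is a minimal sketch; `preflight` is a hypothetical helper, and it takes its inputs as arguments so it can be exercised without a Coral setup.

```python
import os
import tempfile
from typing import List, Tuple


def preflight(python_version: Tuple[int, int], numpy_version: str,
              lib_dir: str) -> List[str]:
    """Return human-readable problems for the setup described in the guide."""
    problems = []
    if python_version > (3, 9):
        problems.append("PyCoral's last supported Python version is 3.9")
    if int(numpy_version.split(".")[0]) >= 2:
        problems.append('NumPy 2.x detected; pin with pip install "numpy<2.0"')
    if not os.path.exists(os.path.join(lib_dir, "libedgetpu.dylib")):
        problems.append(f"libedgetpu.dylib symlink missing in {lib_dir}")
    return problems


# Demonstration against a scratch directory instead of /usr/local/lib.
with tempfile.TemporaryDirectory() as lib_dir:
    before = preflight((3, 11), "2.1.0", lib_dir)
    open(os.path.join(lib_dir, "libedgetpu.dylib"), "w").close()
    after = preflight((3, 9), "1.26.4", lib_dir)
```

On a real machine the call would look like `preflight(sys.version_info[:2], numpy.__version__, "/usr/local/lib")`.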

2.3 Steps to Install on Apple Silicon (M1/M2/M3) Today

There is no fully native, officially-supported path for Apple Silicon as of early 2025. The workable approaches are:

Approach A: Rosetta 2 (x86_64 emulation)

  1. Install Rosetta 2: softwareupdate --install-rosetta
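  2. Create an x86_64 Python environment:
# CONDA_SUBDIR=osx-64 makes conda resolve x86_64 packages on Apple Silicon
CONDA_SUBDIR=osx-64 conda create -n coral python=3.9
conda activate coral
  3. Install x86_64 libedgetpu:
# Download from coral.ai/software (macOS x86_64 package)
sudo cp libedgetpu.1.dylib /usr/local/lib/
sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
  4. Install pycoral, pin NumPy, and run:
# PyCoral wheels are served from Coral's own package index
arch -x86_64 python3 -m pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0
arch -x86_64 python3 -m pip install "numpy<2.0"
arch -x86_64 python3 classify_image.py ...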

Performance cost: USB communication runs through Rosetta 2 translation, adding overhead to every inference call.

Approach B: cocoa-xu Native ARM64 Build

  1. Obtain the native ARM64 libedgetpu.1.dylib from cocoa-xu's releases or build from source:
git clone https://github.com/cocoa-xu/libedgetpu
cd libedgetpu
make  # native build on Apple Silicon
  2. Install the ARM64 dylib:
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
  3. Build tflite_runtime and pycoral from source for ARM64.

2.4 What Breaks on Native ARM64

| Component | Issue | Workaround |
|---|---|---|
| libedgetpu build | Bazel toolchain doesn't define darwin_arm64 | Patch BUILD file (cocoa-xu approach) |
| pycoral pip wheel | Only x86_64 wheels published | Build from source |
| tflite_runtime pip wheel | Only x86_64 wheels published | Build from source or use feranick/TFlite-builds |
| NumPy ABI mismatch | Pre-compiled C extensions use NumPy 1.x ABI | Pin numpy<2.0 |
| Edge TPU Compiler | Linux x86-64 only | Use Google Colab or Docker on Linux |
| USB driver (libusb) | Works but needs proper ARM64 build | Install via Homebrew: brew install libusb |
| libedgetpu.1.dylib not found | Dynamic linker can't locate it | Create symlink: ln -s libedgetpu.1.dylib libedgetpu.dylib |

3. Existing Compiler Architectures (IREE, TVM, ExecuTorch)

3.1 IREE (Intermediate Representation Execution Environment)

URL: iree.dev | Repo: github.com/iree-org/iree

3.1.1 Architecture Overview

IREE is an MLIR-based end-to-end compiler and runtime that lowers ML models to a unified executable format. Its architecture is a multi-dialect MLIR pipeline:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Front-end    │────▶│  IREE Core   │────▶│  IREE HAL    │────▶│  IREE VM     │
│  Importers    │     │  Dialects    │     │  Dialects    │     │  Bytecode    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   TF / JAX /          linalg_ext,          hal.device,          FlatBuffer
   PyTorch →           stream,              hal.executable       executable
   MLIR                util                  target configs

3.1.2 Multi-Framework Front-Ends

IREE originally supported TensorFlow/TFLite as its primary front-end and later added JAX, reflecting the project's origins at Google. PyTorch support is being integrated via the torch-mlir project and the SHARK-Turbine prototype:

| Front-end | Import Path | Status |
|---|---|---|
| TensorFlow / TFLite | iree-import-tflite | Stable |
| JAX / JAX.export | jax.export → MLIR | Stable |
| PyTorch | torch-mlir → IREE | Experimental (SHARK-Turbine) |
| ONNX | Via iree-import-onnx | Community-maintained |

The common entry point is MLIR's linalg dialect — all front-ends lower to linalg-on-tensors before IREE's core compilation kicks in.

3.1.3 Quantization Pipeline

IREE's quantization support is evolving (tracked in issue #12005):

  • Import-time quantization: Models quantized in the front-end (e.g., TFLite INT8, PyTorch quantized) carry quantization metadata into IREE's MLIR representation via the quant dialect.
  • Compiler-internal quantization: IREE can perform post-training quantization during compilation, but this is less mature than TFLite's approach.
  • Hardware-aware quantization: Different backends (CPU/GPU) may require different quantization schemes. IREE's data-tiling path handles sub-byte quantization (4-bit, 2-bit) for specific targets.

3.1.4 Backend Delegation & Hardware Targets

IREE uses a HAL (Hardware Abstraction Layer) dialect to describe device targets and executables:

Compiler target backends:
├── llvm-cpu        → LLVM CPU codegen (x86-64, ARM, RISC-V)
├── vulkan-spirv    → GPU via Vulkan/SPIR-V
├── rocm            → AMD GPU via ROCm
├── cuda            → NVIDIA GPU via PTX
├── metal-spirv     → Apple Metal (via SPIR-V cross-compilation)
└── vmvx            → IREE's reference software backend

Adding a custom backend requires implementing:

  1. An IREE::HAL::ExecutableTargetAttr in MLIR
  2. A lowering pass from linalg → target-specific dialect
  3. A runtime HAL driver for the target device

Feasibility for Edge TPU: IREE's MLIR-based architecture makes it possible in theory to add an Edge TPU backend. However, the proprietary instruction set and lack of public documentation for the Edge TPU's microarchitecture make this extremely challenging. The most practical path would be:

  1. Import model → linalg dialect
  2. Lower to TFLite-compatible representation
  3. Feed into the edgetpu_compiler as a post-processing step
  4. Wrap the resulting edgetpu-custom-op as an IREE executable

3.2 Apache TVM

URL: tvm.apache.org | Repo: github.com/apache/tvm

3.2.1 Architecture Overview

TVM uses a Relay IR intermediate representation and a target-based compilation pipeline:

┌───────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Front-end    │────▶│  Relay IR    │────▶│  TIR         │────▶│  Runtime     │
│  Importers    │     │  (Functional │     │  (Tensor IR) │     │  Module      │
│               │     │   Graph IR)  │     │              │     │              │
└───────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   TF / ONNX /        High-level graph    Low-level loop      .so / .tar
   PyTorch / MXNet     with types          nests + buffer

3.2.2 Multi-Framework Front-Ends

| Front-end | Import Method | Status |
|---|---|---|
| TensorFlow | tvm.relay.frontend.from_tensorflow | Stable |
| TFLite | tvm.relay.frontend.from_tflite | Stable |
| ONNX | tvm.relay.frontend.from_onnx | Stable |
| PyTorch | tvm.relay.frontend.from_pytorch | Stable |
| MXNet | tvm.relay.frontend.from_mxnet | Stable |
| PaddlePaddle | tvm.relay.frontend.from_paddle | Experimental |

3.2.3 Quantization Pipeline

TVM provides a comprehensive quantization toolkit:

  1. Post-training quantization (PTQ):
# Quantization settings are supplied through a qconfig context
with tvm.relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = tvm.relay.quantize.quantize(mod, params)
  2. Quantization-aware training (QAT) import: Models pre-quantized in TensorFlow (via tf.quantization) or PyTorch (via torch.quantization) can be imported with their quantization metadata preserved.

  3. Calibration-based quantization: TVM can calibrate a float model using a representative dataset to determine optimal quantization parameters:

# calibration_data is an iterable of input dicts drawn from representative samples
with tvm.relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
    qmod = tvm.relay.quantize.quantize(mod, params, dataset=calibration_data)
  4. Custom quantization schemes: TVM exposes configurable quantization settings (for example bit widths, calibration mode, and weight-scale selection) through the tvm.relay.quantize.qconfig API.

3.2.4 Backend Delegation & External Codegen

TVM supports external codegen for proprietary hardware through two mechanisms:

  1. BYOC (Bring Your Own Codegen): Relay graph partitioning delegates subgraphs to external codegen tools:
# Register an external codegen; the key follows the "relay.ext.<compiler>" convention
@tvm.register_func("relay.ext.my_accel")
def my_accel_codegen(func):
    # func is the partitioned Relay function assigned to "my_accel".
    # Generate code for the accelerator and wrap it in a runtime module.
    compiled_module = ...  # backend-specific; not shown
    return compiled_module
  2. Target hooks: Custom targets can register compilation passes that intercept specific op patterns.

Feasibility for Edge TPU: TVM's BYOC framework could theoretically wrap the edgetpu_compiler as an external codegen target. The workflow would be:

  1. Import model → Relay IR
  2. Partition graph: identify Edge TPU-compatible subgraphs
  3. Export subgraphs to TFLite format
  4. Invoke edgetpu_compiler on each subgraph
  5. Link compiled Edge TPU blobs back into the TVM runtime as external modules
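
Step 2 of this workflow (partitioning) can be sketched on a simplified model. The example below treats the model as a linear chain of op names, which is an assumption; real Relay graphs are DAGs and BYOC partitioning works on annotated regions. `partition_regions` is a hypothetical helper, and "Gather" is used only as an example of an op absent from the supported list.

```python
from itertools import groupby
from typing import Iterable, List, Set, Tuple


def partition_regions(ops: Iterable[str],
                      supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into alternating 'edgetpu' and 'cpu' regions."""
    regions = []
    for on_tpu, run in groupby(ops, key=lambda op: op in supported):
        regions.append(("edgetpu" if on_tpu else "cpu", list(run)))
    return regions


supported_ops = {"Conv2d", "ReLU", "Softmax"}
model_ops = ["Conv2d", "ReLU", "Gather", "Conv2d", "Softmax"]
regions = partition_regions(model_ops, supported_ops)
```

Each "edgetpu" region would then be exported and compiled on its own in steps 3 and 4, which is how a BYOC flow could recover TPU coverage after a mid-graph unsupported op.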

This is similar to how TVM integrates with the VTA (Versatile Tensor Accelerator) and how the MATCH framework uses TVM for MCU deployment.

3.3 ExecuTorch (Meta / PyTorch)

URL: executorch.ai | Repo: github.com/pytorch/executorch

3.3.1 Architecture Overview

ExecuTorch is PyTorch's on-device inference framework, built on top of PyTorch 2's torch.export:

┌───────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  torch.export │────▶│  Edge IR     │────▶│  Partitioner │────▶│  ExecuTorch  │
│  (PyTorch 2)  │     │  (Core ATen  │     │  (Backend    │     │  Program     │
│               │     │   dialect)   │     │   Delegate)  │     │  (.pte)      │
└───────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   AOT graph           Standardized         Subgraph split       FlatBuffer
   export               operator set         per backend          executable

3.3.2 Front-End & IR

  • Single front-end: PyTorch models via torch.export (AOT compilation)
  • IR: "Core ATen Operator Set" — a standardized subset of ATen ops
  • Export flow: torch.export.export(model, example_inputs) → Edge dialect

3.3.3 Quantization Pipeline

ExecuTorch supports two quantization pathways:

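  1. Post-training quantization (PTQ):

     • XNNPACK quantizer: INT8 per-channel weight quantization
     • Quantization flow: prepare_pt2e(model, quantizer) → calibrate → convert_pt2e(model) → quantized Edge IR

  2. Backend-aware quantization:

     • Each backend delegate can define its own quantizer
     • Example: The OpenVINO backend provides OpenVINOQuantizer with backend-aware compression pathways
     • Example: The CoreML backend supports INT8 and FP16 quantization with hardware-specific optimizations

import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
# The quantizer import path varies between PyTorch/ExecuTorch releases
from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config

# `model` must already be a captured graph module; the capture call
# differs between PyTorch releases, so it is not shown here.
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())

# Quantize
model = prepare_pt2e(model, quantizer)
# ... calibrate by running representative inputs through `model` ...
model = convert_pt2e(model)

# Export to Edge
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported, compile_config=EdgeCompileConfig())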

3.3.4 Backend Delegation

ExecuTorch uses a partitioner + delegate model:

  1. Partitioner: Splits the Edge IR graph into subgraphs suitable for each backend. A partitioner identifies which ops a backend can handle and groups them.

  2. Delegate: Each backend implements a BackendDetails subclass whose preprocess step compiles its assigned subgraph into a binary blob ahead of time. At runtime, the ExecuTorch runtime hands that blob to the backend's registered runtime implementation to execute the subgraph.

Current backends (12+):

| Backend | Hardware | Status |
|---|---|---|
| XNNPACK | CPU (ARM, x86) | Stable |
| CoreML | Apple Neural Engine / GPU | Stable |
| Qualcomm QNN | Hexagon DSP / AI Engine | Stable |
| Vulkan | Mobile GPU (via SPIR-V) | Stable |
| MediaTek | APU | New |
| MPS | Apple Metal | Experimental |
| NNAPI | Android NNAPI | Community |
| OpenVINO | Intel CPU/GPU/VPU | Stable |

Feasibility for Edge TPU: ExecuTorch's delegate model is well-suited for Edge TPU integration. The approach would be:

  1. Implement a partitioner that identifies Edge TPU-compatible subgraphs (INT8 ops from the supported list)
  2. Implement a delegate that converts the subgraph to TFLite format and invokes edgetpu_compiler
  3. Package the compiled edgetpu-custom-op blob into the ExecuTorch program

The main challenge is the PyTorch → TFLite conversion step, which currently has limited support (ONNX as an intermediate format is one option).

3.4 Comparison Table

| Feature | IREE | TVM | ExecuTorch |
|---|---|---|---|
| IR | MLIR (linalg, stream, hal) | Relay IR + TIR | Core ATen (Edge dialect) |
| Front-ends | TF, JAX, PyTorch (exp.) | TF, ONNX, PyTorch, TFLite, etc. | PyTorch only |
| Quantization | Import-time + compiler-internal | PTQ + QAT + calibration | PTQ + backend-aware |
| Backend model | HAL device targets | BYOC external codegen | Partitioner + delegate |
| Custom HW path | Implement HAL target + runtime driver | BYOC: register codegen func | Implement partitioner + delegate |
| Edge TPU feasibility | Hard (proprietary ISA) | Medium (BYOC + edgetpu_compiler) | Medium (delegate + TFLite conversion) |
| MLIR as IR? | Yes (native) | No (Relay is custom) | No (Core ATen dialect) |
| Runtime | IREE VM (custom bytecode) | TVM runtime (graph executor) | ExecuTorch runtime (.pte) |

3.5 Feasibility of MLIR (IREE) or Relay (TVM) as Intermediate Representation

MLIR (via IREE)

Pros:

  • Native multi-dialect system allows clean separation of concerns
  • linalg dialect is a natural meeting point for different front-ends
  • Extensible: new hardware targets are "just" new dialects and lowering passes
  • Growing ecosystem (XLA, IREE, torch-mlir, stablehlo all use MLIR)
  • Potential for a stablehlo → linalg → Edge TPU lowering pipeline

Cons:

  • IREE's hal.executable abstraction assumes you can lower to some form of compute kernel (SPIR-V, PTX, LLVM IR). The Edge TPU's proprietary ISA doesn't fit this model.
  • The edgetpu_compiler is a black box; you can't decompose its output back into MLIR.
  • Significant engineering effort to implement a working pipeline

Recommended approach: Use MLIR as a front-end aggregation layer (TF/PyTorch/ONNX → stablehlo → linalg) and then lower to TFLite for the edgetpu_compiler step. Don't attempt to bypass edgetpu_compiler.

Relay (via TVM)

Pros:

  • Mature BYOC framework designed for proprietary hardware
  • Good TFLite front-end import; writing Relay back out to TFLite is not built in and would need a custom exporter
  • Quantization toolkit is well-tested for INT8 workflows
  • TVM has been used with real hardware accelerators (VTA, MATCH)

Cons:

  • Relay IR is not MLIR; doesn't benefit from the MLIR ecosystem
  • TVM's own development velocity has slowed relative to IREE/ExecuTorch
  • The BYOC → edgetpu_compiler path requires an awkward Relay → TFLite → edgetpu_compiler → custom op relay roundtrip

Recommended approach: If starting from TVM, use BYOC with a TFLite export/reimport strategy. This is the path of least resistance for integrating the edgetpu_compiler.


4. Edge TPU Operation Set & Limitations

4.1 Model Requirements

For a TensorFlow model to compile for the Edge TPU, it must meet all of these:

  1. Tensor parameters are quantized to 8-bit fixed-point (int8 or uint8)
  2. Tensor sizes are constant at compile-time (no dynamic shapes)
  3. Model parameters (weights, biases) are constant at compile-time
  4. Tensors are 1-, 2-, or 3-dimensional. If a tensor has >3 dimensions, only the 3 innermost dimensions may have size >1
  5. Only supported operations are used (see table below)

Float inputs are OK — the compiler will leave a QUANTIZE op at the graph entry point that runs on the CPU, converting float → INT8 before the Edge TPU custom op.

4.2 Complete Supported Operations Table

Source: coral.ai/docs/edgetpu/models-intro#supported-operations

| Operation | Min Runtime | Known Limitations |
|---|---|---|
| Add | All | |
| AveragePool2d | All | No fused activation function |
| Concatenation | All | No fused activation function. If any input is a compile-time constant tensor, there must be only 2 inputs, and this constant must be all zeros (zero-padding op) |
| Conv2d | All | Must use same dilation in x and y dimensions |
| DepthwiseConv2d | ≤12 | Dilated conv kernels not supported |
| DepthwiseConv2d | ≥13 | Must use same dilation in x and y dimensions |
| ExpandDims | ≥13 | |
| FullyConnected | All | Only default weight format. Output tensor is 1-dimensional |
| L2Normalization | All | |
| Logistic (Sigmoid) | All | |
| LSTM | ≥14 | Unidirectional LSTM only |
| Maximum | All | |
| MaxPool2d | All | No fused activation function |
| Mean | ≤12 | No batch-dim reduction. Supports reduction along x- and/or y-dimensions only |
| Mean | ≥13 | No batch-dim reduction. If z-reduction, z-dim must be multiple of 4 |
| Minimum | All | |
| Mul | All | |
| Pack | ≥13 | No packing in batch dimension |
| Pad | ≤12 | No batch-dim padding. Supports padding along x- and/or y-dimensions only |
| Pad | ≥13 | No batch-dim padding |
| PReLU | ≥13 | Alpha must be 1-dimensional. If using Keras PReLU with 4D input (batch, height, width, channels), shared_axes must be [1,2] |
| Quantize | ≥13 | |
| ReduceMax | ≥14 | Cannot operate on batch dimension |
| ReduceMin | ≥14 | Cannot operate on batch dimension |
| ReLU | All | |
| ReLU6 | All | |
| ReLUN1To1 | All | |
| Reshape | All | Certain reshapes might not be mapped for large tensor sizes |
| ResizeBilinear | All | Input/output is 3-dimensional. Might not be mapped to avoid precision loss depending on size |
| ResizeNearestNeighbor | All | Input/output is 3-dimensional. Might not be mapped depending on size |
| Rsqrt | ≥14 | |
| Slice | All | |
| Softmax | All | Supports only 1-D input with max 16,000 elements |
| SpaceToDepth | All | |
| Split | All | No splitting in batch dimension |
| Squeeze | ≤12 | Only when input has leading 1s (no relayout needed), e.g. [1,1,10] or [1,5,10] is OK; [5,1,10] is not |
| Squeeze | ≥13 | None |
| StridedSlice | All | All strides must equal 1 (effectively a Slice op), ellipsis-axis-mask == 0, new-axis-mask == 0 |
| Sub | All | |
| Sum | ≥13 | Cannot operate on batch dimension |
| SquaredDifference | ≥14 | |
| Tanh | All | |
| Transpose | ≥14 | |
| TransposeConv | ≥13 | |

Total: 40 operations (with varying runtime version requirements)
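
The Min Runtime column can be turned into a lookup that reports the oldest runtime a given op list could target. This is a sketch derived only from the table; `MIN_RUNTIME` and `min_runtime_for` are hypothetical names, and treating "All" as runtime 10 (the oldest version listed in section 1.4.4) is an assumption.

```python
from typing import Iterable

# Ops that first appear at runtime 13 or 14 according to the table.
# Ops with split rows (DepthwiseConv2d, Mean, Pad, Squeeze) exist on all
# runtimes and are therefore not listed here.
MIN_RUNTIME = {
    "ExpandDims": 13, "Pack": 13, "PReLU": 13, "Quantize": 13,
    "Sum": 13, "TransposeConv": 13,
    "LSTM": 14, "ReduceMax": 14, "ReduceMin": 14, "Rsqrt": 14,
    "SquaredDifference": 14, "Transpose": 14,
}

BASELINE_RUNTIME = 10  # assumption: "All" means the oldest listed runtime


def min_runtime_for(ops: Iterable[str]) -> int:
    """Oldest runtime version that offers every op in `ops`.

    This does not check whether an op is supported at all; names missing
    from MIN_RUNTIME are treated as baseline ops.
    """
    return max((MIN_RUNTIME.get(op, BASELINE_RUNTIME) for op in ops),
               default=BASELINE_RUNTIME)
```

A model that mixes `Conv2d` with `PReLU` therefore needs runtime 13 or newer, and any model using `LSTM` needs runtime 14.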

4.3 Key Constraints

4.3.1 INT8 Quantization Requirement

All tensor parameters (weights, biases) must be quantized to INT8/UINT8. The Edge TPU hardware has no floating-point execution units. Activation tensors must also be quantized; the TFLite converter handles this via "full integer quantization" with calibration data.
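
For reference, the affine mapping that TFLite's integer quantization uses is real_value = scale × (q − zero_point). The sketch below applies it to single values; the function names are illustrative and the scale and zero-point values are made up.

```python
def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """Map a real value to an 8-bit integer, clamping to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the real value represented by the quantized integer q."""
    return scale * (q - zero_point)


# int8 example with scale 0.5 and zero point 0.
q_small = quantize(1.0, scale=0.5, zero_point=0)    # 1.0 / 0.5 = 2
q_clamped = quantize(100.0, scale=0.5, zero_point=0)  # 200 clamps to 127

# uint8 example: same scale, zero point shifted to 10, range [0, 255].
q_uint8 = quantize(1.0, scale=0.5, zero_point=10, qmin=0, qmax=255)
```

The clamped case shows why calibration data matters: values outside the calibrated range saturate instead of being represented.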

4.3.2 Tensor Dimension Constraints

  • Tensors must be 1D, 2D, or 3D
  • If >3D, only the 3 innermost dimensions may have size >1
  • This effectively means the Edge TPU processes tensors in NHWC format where N (batch) must be 1 for most operations, and any further leading dimensions outside the innermost three (H, W, C) must also be 1
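
The dimension rule can be expressed as a one-line check. This is a sketch of the rule as stated here only; `shape_is_mappable` is a hypothetical helper and it ignores the per-op restrictions from section 4.2.

```python
from typing import Sequence


def shape_is_mappable(shape: Sequence[int]) -> bool:
    """True if only the 3 innermost dimensions have size greater than 1."""
    outer = shape[:-3]  # empty for tensors with 3 or fewer dimensions
    return all(dim == 1 for dim in outer)


nhwc_ok = shape_is_mappable((1, 224, 224, 3))       # batch of 1
nhwc_batched = shape_is_mappable((2, 224, 224, 3))  # batch of 2
three_d = shape_is_mappable((5, 1, 10))             # plain 3-D tensor
five_d = shape_is_mappable((1, 1, 8, 8, 4))         # leading 1s only
```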

4.3.3 On-Chip Memory

  • ~8 MB SRAM total for executable + parameter cache
  • Typically ~6.91 MiB available for parameter data (after executable space)
  • If model parameters exceed available SRAM, excess is streamed from host memory (off-chip), which significantly degrades performance

4.3.4 Max Parameter Data Size

There is no single documented hard limit on model size, but practical limits arise from the 8 MB SRAM:

  • Models with <7 MB of parameter data can be fully cached on-chip
  • Larger models must stream parameters, causing 2-10x slower inference
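
The on-chip versus streamed split can be estimated with simple arithmetic. The sketch below is a byte-granular simplification (an assumption; the real compiler decides what to cache, and its granularity is not documented here), using the 6.91 MiB figure from the compiler output in section 1.3.3. `split_parameters` is a hypothetical helper.

```python
from typing import Tuple

MIB = 1024 * 1024


def split_parameters(param_bytes: int, cache_bytes: int) -> Tuple[int, int]:
    """Return (bytes cached on-chip, bytes streamed from host memory)."""
    on_chip = min(param_bytes, cache_bytes)
    return on_chip, param_bytes - on_chip


cache = int(6.91 * MIB)  # "available for caching" in the sample output

# The 4.21 MiB model from the sample output fits entirely on-chip.
small_cached, small_streamed = split_parameters(int(4.21 * MIB), cache)

# A 10 MiB model overflows the cache and must stream the remainder.
large_cached, large_streamed = split_parameters(10 * MIB, cache)
```

For the 10 MiB case the remainder is roughly 3.09 MiB streamed on every inference, which is where the slowdown comes from.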

4.3.5 Compilation Behavior

The edgetpu_compiler operates in a greedy prefix manner:

  1. Starts from the graph input
  2. Collects consecutive supported ops
  3. Stops at the first unsupported op
  4. Everything before the stop → edgetpu-custom-op (runs on TPU)
  5. Everything after → standard TFLite ops (runs on CPU)

This means the position of unsupported ops matters — an unsupported op early in the graph will prevent most of the model from running on the TPU, even if later ops are individually supported.
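
The greedy prefix rule from the steps above can be sketched on a linear list of op names. This is an illustration of the described behavior under that simplification, not the compiler's code; `greedy_prefix_split` is a hypothetical helper and "Gather" stands in for any unsupported op.

```python
from typing import List, Set, Tuple


def greedy_prefix_split(ops: List[str],
                        supported: Set[str]) -> Tuple[List[str], List[str]]:
    """Split ops into (TPU prefix, CPU remainder) at the first unsupported op."""
    for index, op in enumerate(ops):
        if op not in supported:
            return ops[:index], ops[index:]
    return ops, []


supported_ops = {"Conv2d", "ReLU", "Softmax"}

# An unsupported op near the start pushes almost everything to the CPU,
# even though the later Conv2d/ReLU/Softmax ops are individually supported.
tpu_part, cpu_part = greedy_prefix_split(
    ["Conv2d", "Gather", "Conv2d", "ReLU", "Softmax"], supported_ops
)

# A fully supported model maps entirely to the TPU.
all_tpu, no_cpu = greedy_prefix_split(["Conv2d", "ReLU", "Softmax"], supported_ops)
```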

The -d / --search_delegate flag (available since compiler v16) makes the compiler retry after a failure: each retry stops compilation before the op that failed, stepping backward through the graph until compilation succeeds. This can produce a partial TPU mapping even when a mid-graph op causes the compiler to fail.

4.3.6 Subgraph Limitations

Many operations show "More than one subgraph is not supported" in compiler output. This occurs when:

  • The model has control flow (if/while) creating multiple subgraphs
  • Certain ops create implicit subgraphs
  • The Edge TPU currently supports only a single subgraph for hardware execution

4.3.7 No 3D Convolutions

The Edge TPU supports only Conv2D and DepthwiseConv2D. There is no 3D convolution support. Models requiring 3D convs (video processing, 3D medical imaging) cannot be fully accelerated.

4.3.8 Softmax Element Limit

Softmax supports only 1-D input with a maximum of 16,000 elements. This limits the number of classes in classification models running entirely on the Edge TPU.
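A pre-compilation check for this limit might look like the sketch below. It assumes that a shape such as [1, N] counts as 1-D once unit dimensions are ignored; that interpretation and the function name are assumptions, not documented compiler behavior.

```python
from typing import Sequence

SOFTMAX_MAX_ELEMENTS = 16_000  # limit quoted above

def softmax_fits_edgetpu(shape: Sequence[int]) -> bool:
    """Return True if the input is effectively 1-D and within the element limit."""
    non_unit = [dim for dim in shape if dim != 1]
    if len(non_unit) > 1:
        return False  # more than one non-trivial axis: not a 1-D input
    elements = non_unit[0] if non_unit else 1
    return elements <= SOFTMAX_MAX_ELEMENTS

if __name__ == "__main__":
    for candidate in ([1, 1000], [1, 21843], [2, 1000]):
        print(candidate, softmax_fits_edgetpu(candidate))
```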


5. Sloth Integration Implications

The repository now carries an integrated sloth_integration package that targets text-classification and embedding deployment to Coral USB.

5.1 Why This Matters for Edge TPU Research

  1. It validates an end-to-end SLM deployment path (adapter -> converter -> quantizer -> runtime).
  2. It stress-tests the runtime API compatibility layer against Coral USB delegates.
  3. It provides reproducible benchmark scenarios that compare CPU baseline, direct hardware, and hardware through sloth runtime abstractions.

5.2 Practical Constraints Observed

  • Synthetic micro-models are useful for pipeline validation but should not be used as production latency proxies.
  • Delegate preflight behavior can vary by model artifact; robust fallback paths are required in runtime wrappers.
  • Keeping package metadata centralized in root build config avoids drift between runtime code and published dependencies.

5.3 Current Internal References

  • Integration package: sloth-integration/src/sloth_integration
  • Integration tests: sloth-integration/tests
  • Benchmark report: sloth-integration/docs/benchmarks_sloth.md

Appendix A: Key URLs

| Resource | URL |
|---|---|
| Coral Documentation | https://www.coral.ai/docs |
| Edge TPU Models Intro | https://www.coral.ai/docs/edgetpu/models-intro |
| Edge TPU Compiler | https://www.coral.ai/docs/edgetpu/compiler |
| libedgetpu (official) | https://github.com/google-coral/libedgetpu |
| edgetpu issue tracker | https://github.com/google-coral/edgetpu |
| pycoral | https://github.com/google-coral/pycoral |
| libcoral | https://github.com/google-coral/libcoral |
| cocoa-xu/libedgetpu | https://github.com/cocoa-xu/libedgetpu |
| feranick/libedgetpu | https://github.com/feranick/libedgetpu |
| feranick/TFlite-builds | https://github.com/feranick/TFlite-builds |
| Tim Strobel's Mac Guide | https://tim-strobel.de/coral.html |
| IREE | https://iree.dev |
| Apache TVM | https://tvm.apache.org |
| ExecuTorch | https://executorch.ai |
| Edge TPU Compiler (Colab) | https://colab.research.google.com |

Appendix B: Edge TPU Compiler CLI Reference

edgetpu_compiler [options] model [...]

Options:
  -h, --help                     Print help
  -i, --intermediate_tensors     Output tensors from Edge TPU custom op
  -m, --min_runtime_version      Min runtime version (default: compiler-specific)
  -n, --num_segments             Compile into N segments for pipelining
  -o, --out_dir                  Output directory (default: cwd)
  -s, --show_operations          Print op mapping to console
  -d, --search_delegate          Retry compilation on failure (since v16)
  -t, --timeout_sec              Compiler timeout in seconds (default: 180)
  -v, --version                  Print compiler version
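When driving the compiler from a build script, it can help to assemble the argument list programmatically. The sketch below uses only the flags listed above; the function name and the example model path are hypothetical, and the command is only constructed here, not executed.

```python
from typing import List, Optional, Sequence

def build_compiler_args(
    models: Sequence[str],
    out_dir: Optional[str] = None,
    show_operations: bool = False,
    search_delegate: bool = False,
    num_segments: Optional[int] = None,
    min_runtime_version: Optional[int] = None,
    timeout_sec: Optional[int] = None,
) -> List[str]:
    """Build an edgetpu_compiler argv list from the options documented above."""
    if not models:
        raise ValueError("at least one model path is required")
    args = ["edgetpu_compiler"]
    if show_operations:
        args.append("-s")
    if search_delegate:
        args.append("-d")
    if num_segments is not None:
        args += ["-n", str(num_segments)]
    if min_runtime_version is not None:
        args += ["-m", str(min_runtime_version)]
    if timeout_sec is not None:
        args += ["-t", str(timeout_sec)]
    if out_dir is not None:
        args += ["-o", out_dir]
    args.extend(models)  # model paths come last, after all options
    return args

if __name__ == "__main__":
    print(build_compiler_args(["model_quant.tflite"], out_dir="out", show_operations=True))
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where the compiler is installed.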

Appendix C: edgetpu.h Complete Public API

namespace edgetpu {

// Custom op identifier
static const char kCustomOp[] = "edgetpu-custom-op";

enum class DeviceType { kApexPci = 0, kApexUsb = 1 };

class EdgeTpuManager {
 public:
  static EdgeTpuManager* GetSingleton();

  // Shared ownership (preferred)
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType, const std::string& path) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType, const std::string& path, const DeviceOptions&) = 0;

  // Exclusive ownership (deprecated)
  virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;

  virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;
  virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;
  virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
  virtual std::string Version() const = 0;
};

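class EdgeTpuContext : public TfLiteExternalContext {
 public:
  // DeviceEnumerationRecord and DeviceOptions are nested types of EdgeTpuManager.
  virtual const EdgeTpuManager::DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
  virtual EdgeTpuManager::DeviceOptions GetDeviceOptions() const = 0;
  // True if the device is most likely ready to accept requests.
  virtual bool IsReady() const = 0;
};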

// Register the custom op with TFLite
EDGETPU_EXPORT TfLiteRegistration* RegisterCustomOp();

}  // namespace edgetpu

Appendix D: Compiler-Runtime Version Compatibility

| Compiler | Default Runtime | Notable New Ops |
|---|---|---|
| 1.0 | 10 | Base set (Add, Conv2D, etc.) |
| 2.0 | 12 | DepthwiseConv2D dilation |
| 2.1 | 13 | ExpandDims, Pack, PReLU, Quantize, Sum, TransposeConv, improved Squeeze/Mean/Pad |
| 14.1–15.0 | 13 | |
| 16.0 | 14 | LSTM, ReduceMax, ReduceMin, Rsqrt, SquaredDifference, Transpose |
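The compatibility table lends itself to a simple lookup: given the ops a model uses, find the minimum runtime version and the compiler versions whose default runtime satisfies it. The sketch below encodes only the rows shown above; treating every unlisted op as part of the runtime-10 base set is an assumption, and the names are hypothetical.

```python
from typing import Dict, Iterable, List

# Compiler version -> default runtime version (range endpoints 14.1 and 15.0 listed separately).
DEFAULT_RUNTIME: Dict[str, int] = {
    "1.0": 10, "2.0": 12, "2.1": 13, "14.1": 13, "15.0": 13, "16.0": 14,
}

# Ops named in the table as new at a given runtime version.
MIN_RUNTIME_FOR_OP: Dict[str, int] = {
    "ExpandDims": 13, "Pack": 13, "PReLU": 13, "Quantize": 13, "Sum": 13, "TransposeConv": 13,
    "LSTM": 14, "ReduceMax": 14, "ReduceMin": 14, "Rsqrt": 14,
    "SquaredDifference": 14, "Transpose": 14,
}

def required_runtime(ops: Iterable[str], base: int = 10) -> int:
    """Minimum runtime version needed; unlisted ops are assumed to be in the base set."""
    return max([base] + [MIN_RUNTIME_FOR_OP.get(op, base) for op in ops])

def compilers_supporting(ops: Iterable[str]) -> List[str]:
    """Compiler versions whose default runtime is at least the required one."""
    needed = required_runtime(ops)
    return [compiler for compiler, runtime in DEFAULT_RUNTIME.items() if runtime >= needed]

if __name__ == "__main__":
    print(required_runtime(["Conv2D", "Transpose"]))
    print(compilers_supporting(["Conv2D", "Sum"]))
```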

Report compiled from official documentation, GitHub repositories, community guides, and web research. Last updated: 2025-03-05.