Edge TPU Ecosystem Research¶

Comprehensive technical reference for the Google Coral Edge TPU platform, compiled as the basis for an alternative compiler project.

Table of Contents¶

Official Coral USB / Edge TPU Software Stack
Coral USB on macOS / Apple Silicon
Existing Compiler Architectures
Edge TPU Operation Set & Limitations
Sloth Integration Implications

1. Official Coral USB / Edge TPU Software Stack¶

1.1 Repository Structure¶

The Coral software stack comprises three principal repositories:

Repository	Purpose	Language	Current TF Version
google-coral/libedgetpu	Userspace runtime driver (USB/PCIe communication)	C++	2.16.1
google-coral/libcoral	High-level C++ inference/pipeline/transfer-learning API	C++	—
google-coral/pycoral	Python bindings (PyCoral)	Python/C++	—

The legacy monorepo google-coral/edgetpu now serves primarily as an issue tracker; source has been split into the repos above. The deprecated edgetpu Python API still lives in the old repo but is superseded by pycoral.

1.2 Building libedgetpu¶

Three build paths are documented:

A. Docker + Bazel (Recommended)¶

# x86-64
DOCKER_CPUS="k8" DOCKER_IMAGE="ubuntu:22.04" DOCKER_TARGETS=libedgetpu make docker-build

# ARM64 / ARMv7
DOCKER_CPUS="armv7a aarch64" DOCKER_IMAGE="debian:bookworm" DOCKER_TARGETS=libedgetpu make docker-build

All built binaries land in the out/ directory. Debian packages can be produced with debuild -us -uc -tc -b -a arm64 -d.

B. Bare Bazel¶

# Requires Bazel 6.5.0 for TF 2.16.1
make                  # native build
CPU=armv7a make       # cross-compile ARMv7
CPU=aarch64 make      # cross-compile AArch64

macOS caveat: Compilation fails out of the box. Two manual steps are required:

Install flatbuffers via MacPorts.
After the first build failure, patch the auto-generated Bazel BUILD file at /var/tmp/_bazel_xxxxx/.../external/local_config_cc/BUILD line 48:

"darwin_x86_64": ":cc-compiler-darwin",

C. Makefile (Linux-only, no Bazel)¶

sudo apt install libabsl-dev libflatbuffers-dev
git clone https://github.com/tensorflow/tensorflow && cd tensorflow && git checkout v2.16.1
TFROOT=<path-to-tf> make -f makefile_build/Makefile -j$(nproc) libedgetpu

This approach eliminates Bazel/Docker entirely and uses system packages for libabsl and libflatbuffers.

1.3 The `*_edgetpu.tflite` Binary Format¶

1.3.1 How the Edge TPU Custom Op Is Embedded¶

The Edge TPU Compiler (edgetpu_compiler) takes a standard TFLite FlatBuffer and transforms it into a compiled *_edgetpu.tflite file. The key transformation is:

Subgraph partitioning: The compiler walks the TFLite subgraph from inputs to outputs. Starting from the beginning, it collects consecutive supported INT8 ops into a single contiguous subgraph destined for the Edge TPU.
Custom op creation: All Edge TPU-compatible ops in that contiguous subgraph are replaced by a single custom op with the name "edgetpu-custom-op":

// From edgetpu.h
static const char kCustomOp[] = "edgetpu-custom-op";

Opaque binary payload: The custom op carries an opaque binary blob in its custom_options field (the FlatBuffer Operator.custom_options byte vector). This blob contains:
The inference executable — a compiled binary that the Edge TPU DSP/systolic array can execute directly. This is NOT a TFLite subgraph; it is a proprietary instruction stream for the Edge TPU hardware.
Parameter data layout descriptors — metadata describing which tensors should be cached in on-chip SRAM versus streamed from off-chip memory.
A caching token — a 64-bit number uniquely identifying the parameter data layout for cache management.
Remaining ops stay as-is: If the compiler encounters an unsupported op, it stops the Edge TPU subgraph. All remaining ops after the unsupported op stay as standard TFLite ops and execute on the CPU. This creates a "split" model:
Edge TPU custom op (runs on the TPU)
Remaining TFLite ops (run on CPU)
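A quick way to tell whether a .tflite file has been through the Edge TPU Compiler is to look for the custom op name in the raw bytes. The sketch below is a heuristic, not a FlatBuffer parser: it assumes the name is stored as a plain string in the model's operator-code table, which is how TFLite records custom op codes. The function names `looks_edgetpu_compiled` and `check_file` are hypothetical.

```python
from pathlib import Path

# Custom op name from edgetpu.h (kCustomOp).
EDGETPU_CUSTOM_OP = b"edgetpu-custom-op"


def looks_edgetpu_compiled(model_bytes: bytes) -> bool:
    """Heuristic: True if the custom op name appears anywhere in the file.

    This only scans raw bytes. It does not parse the FlatBuffer, so it
    cannot say how many ops were mapped or where the partition boundary is.
    """
    return EDGETPU_CUSTOM_OP in model_bytes


def check_file(path: str) -> bool:
    """Convenience wrapper (hypothetical) that reads a model file from disk."""
    return looks_edgetpu_compiled(Path(path).read_bytes())
```

A positive result only says the custom op is present; the compiler log remains the place to see which ops were actually mapped.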

1.3.2 The "Executable Preamble"¶

The Coral documentation explicitly states:

"The Edge TPU Compiler adds a small executable inside the model that writes a specific amount of the model's parameter data to the Edge TPU RAM (if available) before running an inference."

This executable preamble is embedded within the edgetpu-custom-op custom options blob and serves two purposes:

Parameter data loading (cache warm-up): On first invocation, the preamble streams parameter data (INT8 weights/biases) from host memory into the Edge TPU's ~8 MB on-chip SRAM. Subsequent invocations can skip this step if the caching token matches.
Inference dispatch: After parameter data is loaded, the preamble transitions to the actual inference engine that drives the Edge TPU's systolic array.

1.3.3 Parameter-Data Caching Protocol¶

The Edge TPU has approximately 8 MB of SRAM for parameter caching. The protocol works as follows:

┌──────────────────────────────────────────────────┐
│             Edge TPU SRAM (~8 MB)                │
│  ┌────────────────────────────────────────────┐  │
│  │  Inference Executable Space (reserved)      │  │
│  ├────────────────────────────────────────────┤  │
│  │  Parameter Data Cache (remaining space)     │  │
│  │  - Weights, biases for cached ops           │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Key details:

Compiler-allocated scratchpad (not traditional cache): The compiler knows the exact SRAM size and the model's requirements, so it assigns fixed cache space at compile time.
Caching token: A 64-bit number assigned at compile time. The runtime compares the token of the incoming model against the token of the currently-cached data:
Match: Use cached data (fast path).
Mismatch: Wipe cache, write new data, then execute (slow first inference).
Co-compilation: Multiple models can be compiled together (edgetpu_compiler model_A.tflite model_B.tflite) to share the same caching token, eliminating cache thrashing when switching between models.
Compiler output example:

On-chip memory available for caching model parameters: 6.91 MiB
On-chip memory used for caching model parameters: 4.21 MiB
Off-chip memory used for streaming uncached model parameters: 0.00 B

Off-chip fallback: If parameter data exceeds available SRAM, the excess is streamed from host memory at inference time, degrading performance.
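The token comparison can be illustrated with a toy model. This is a sketch of the match/mismatch rule described above, not the runtime's actual implementation; the class names `CompiledModel` and `ParameterCache` and the token values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompiledModel:
    name: str
    caching_token: int  # stands in for the 64-bit token assigned at compile time


class ParameterCache:
    """Toy model of the token check; not the real runtime logic."""

    def __init__(self) -> None:
        self.resident_token: Optional[int] = None
        self.reloads = 0

    def prepare(self, model: CompiledModel) -> str:
        if self.resident_token == model.caching_token:
            return "hit"  # fast path: parameters are already on-chip
        # Mismatch: wipe the cache and write this model's parameters.
        self.resident_token = model.caching_token
        self.reloads += 1
        return "miss"


def count_reloads(sequence: List[CompiledModel]) -> int:
    """Run a sequence of inferences and count parameter reloads."""
    cache = ParameterCache()
    for model in sequence:
        cache.prepare(model)
    return cache.reloads


# Separately compiled models carry different tokens and evict each other.
a, b = CompiledModel("A", 0x1111), CompiledModel("B", 0x2222)
# Co-compiled models share one token.
a_co, b_co = CompiledModel("A", 0x3333), CompiledModel("B", 0x3333)

separate = count_reloads([a, b, a, b])            # 4 reloads
shared = count_reloads([a_co, b_co, a_co, b_co])  # 1 reload
```

Alternating between the two separately compiled models reloads parameters on every switch, while the co-compiled pair pays the cost once.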

1.4 The libedgetpu API (edgetpu.h)¶

The public API is defined in libedgetpu/edgetpu.h. Key components:

1.4.1 EdgeTpuManager (Singleton)¶

class EDGETPU_EXPORT EdgeTpuManager {
 public:
  // Singleton accessor
  static EdgeTpuManager* GetSingleton();

  // Shared-ownership device opening (preferred API)
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType device_type) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType device_type, const std::string& device_path) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType device_type, const std::string& device_path,
      const DeviceOptions& options) = 0;

  // Deprecated: exclusive-ownership API
  virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;

  // Device enumeration
  virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;

  // Currently opened shared devices
  virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;

  // Runtime version
  virtual std::string Version() const = 0;
  virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
};

Device options (passed as std::unordered_map<std::string, std::string>):

"Performance": "Low" | "Medium" | "High" | "Max" (default: "Max")
"Usb.AlwaysDfu": "True" | "False" (default: "False")
"Usb.MaxBulkInQueueLength": "0".."255" (default: "32")

Device types:

enum class DeviceType {
  kApexPci = 0,
  kApexUsb = 1,
};
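Because device options are plain string pairs, a typo in a key or value is easy to miss. The following sketch checks an options dictionary against the keys, values, and defaults listed above before it is handed to OpenDevice. The function `validate_device_options` is hypothetical and is not part of libedgetpu.

```python
from typing import Dict

# Documented enumerated values, copied from the option list above.
ALLOWED = {
    "Performance": {"Low", "Medium", "High", "Max"},
    "Usb.AlwaysDfu": {"True", "False"},
}
DEFAULTS = {
    "Performance": "Max",
    "Usb.AlwaysDfu": "False",
    "Usb.MaxBulkInQueueLength": "32",
}


def validate_device_options(options: Dict[str, str]) -> Dict[str, str]:
    """Return the defaults merged with `options`, or raise ValueError."""
    merged = dict(DEFAULTS)
    for key, value in options.items():
        if key == "Usb.MaxBulkInQueueLength":
            # Values are strings holding an integer in the range 0..255.
            is_number = value.isascii() and value.isdigit()
            if not (is_number and 0 <= int(value) <= 255):
                raise ValueError(f"{key} must be '0'..'255', got {value!r}")
        elif key in ALLOWED:
            if value not in ALLOWED[key]:
                raise ValueError(f"{key} must be one of {sorted(ALLOWED[key])}")
        else:
            raise ValueError(f"unknown option {key!r}")
        merged[key] = value
    return merged
```

For example, `validate_device_options({"Performance": "Low"})` returns the full option set with the other two keys at their defaults.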

1.4.2 EdgeTpuContext¶

class EdgeTpuContext : public TfLiteExternalContext {
 public:
  virtual ~EdgeTpuContext() = 0;
  virtual const EdgeTpuManager::DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
  virtual EdgeTpuManager::DeviceOptions GetDeviceOptions() const = 0;
  virtual bool IsReady() const = 0;
};

The context is a TfLiteExternalContext, meaning it plugs into the TFLite interpreter via interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get()).

1.4.3 TFLite Custom Op Registration¶

The custom op is registered via the standard TFLite mechanism:

// Typical usage (non-NNAPI path). `model` is a std::unique_ptr<tflite::FlatBufferModel>
// loaded beforehand, e.g. with tflite::FlatBufferModel::BuildFromFile().
auto tpu_context = edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();

tflite::ops::builtin::BuiltinOpResolver resolver;
// Register the Edge TPU custom op handler
resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());

// The builder fills in this pointer; it stays null if building fails
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Bind the TPU context to the interpreter
interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get());

interpreter->AllocateTensors();
interpreter->Invoke();

RegisterCustomOp() returns a TfLiteRegistration* that handles:

Init: Parses the custom_options blob, sets up the Edge TPU executable
Prepare: Resolves tensor shapes and memory layout
Invoke: Sends the inference request to the Edge TPU hardware via USB/PCIe

1.4.4 Runtime Version Negotiation¶

The compiled model encodes a minimum runtime version requirement. At inference time, the runtime checks compatibility:

Failed precondition: Package requires runtime version (12),
which is newer than this runtime version (10).

Version compatibility table:

Compiler Version	Default Runtime Version Required
16.0	14
15.0	13
14.1	13
2.1.302470888	13
2.0.291256449	12
1.0	10

Models are forward-compatible: a model compiled for runtime v10 runs on v10 and on every newer runtime. To make a model usable on a runtime older than the compiler's default, lower the required version at compile time:

edgetpu_compiler --min_runtime_version 10 your_model.tflite

The runtime version also determines which operations are available (newer ops like LSTM, PReLU, ReduceMax, etc. require runtime ≥13 or ≥14).
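The compatibility rule can be written down as a small check. The sketch below copies the table above into a dictionary and reproduces the error text shown earlier. It treats --min_runtime_version as a simple override of the default, which is a simplification, and the function names are hypothetical.

```python
from typing import Optional

# Compiler version -> default required runtime version (from the table above).
DEFAULT_RUNTIME_FOR_COMPILER = {
    "16.0": 14,
    "15.0": 13,
    "14.1": 13,
    "2.1.302470888": 13,
    "2.0.291256449": 12,
    "1.0": 10,
}


def required_runtime(compiler_version: str,
                     min_runtime_version: Optional[int] = None) -> int:
    """Runtime version a compiled model asks for.

    Assumption: --min_runtime_version simply replaces the compiler's default.
    """
    if min_runtime_version is not None:
        return min_runtime_version
    return DEFAULT_RUNTIME_FOR_COMPILER[compiler_version]


def check_runtime(required: int, installed: int) -> None:
    """Raise if the model needs a newer runtime than the one installed."""
    if required > installed:
        raise RuntimeError(
            f"Failed precondition: Package requires runtime version ({required}), "
            f"which is newer than this runtime version ({installed})."
        )
```

With these definitions, `check_runtime(required_runtime("16.0"), 13)` raises, while `check_runtime(required_runtime("16.0", min_runtime_version=10), 13)` passes.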

1.5 Software Stack Summary¶

┌─────────────────────────────────────────┐
│            User Application             │
├─────────────────────────────────────────┤
│  PyCoral (Python) / libcoral (C++)      │  High-level inference API
├─────────────────────────────────────────┤
│  TFLite Interpreter + Custom Op         │  Standard TFLite + edgetpu-custom-op
├─────────────────────────────────────────┤
│  libedgetpu                             │  Runtime driver (USB/PCIe comm)
├─────────────────────────────────────────┤
│  Edge TPU Hardware (USB / M.2 / PCIe)   │  ASIC accelerator
└─────────────────────────────────────────┘

2. Coral USB on macOS / Apple Silicon¶

2.1 Official Support Status¶

Google Coral's official documentation states that the USB Accelerator works on "Linux, Mac, or Windows." However, the official pre-built libedgetpu binaries are x86_64-only for macOS. The Edge TPU Compiler is Linux x86-64 only (since compiler v2.1, ARM64 builds are no longer provided).

As of 2024–2025, Google has not released native ARM64 (Apple Silicon) binaries for libedgetpu, pycoral, or libcoral.

2.2 Community Efforts¶

2.2.1 cocoa-xu/libedgetpu¶

Repository: github.com/cocoa-xu/libedgetpu

Cocoa Xu maintains a fork of libedgetpu that provides:

Pre-built darwin-arm64 (Apple Silicon) binaries of libedgetpu
aarch64-linux-musl builds (static linking for Alpine/musl-based systems)
riscv64-linux-musl builds
Integration with the tflite_beam Erlang/Elixir bindings, which bundles the native libraries

The fork modifies the Bazel/Makefile build system to cross-compile for darwin-arm64 and other targets. The key challenge is that the upstream libedgetpu build assumes darwin_x86_64 for macOS, and the Bazel toolchain configuration does not include darwin_arm64 by default.

How the ARM64 build was achieved:

Fork the google-coral/libedgetpu repo
Patch the Bazel build configuration to add darwin_arm64 as a target CPU
Build on an Apple Silicon Mac using native Clang (or cross-compile on an Intel Mac by targeting arm64; Rosetta 2 plays no part in this, since it only translates x86_64 binaries to run on Apple Silicon)
The resulting libedgetpu.1.dylib is a native ARM64 shared library

2.2.2 feranick/libedgetpu¶

Repository: github.com/feranick/libedgetpu

Feranick maintains an actively-updated fork that tracks newer TensorFlow versions:

Current release: 16.0TF2.19.1-1 (compatible with TF 2.19.1)
Provides deb packages for amd64, arm64, armhf
Also maintains feranick/TFlite-builds with updated tflite_runtime Python wheels

This fork is widely used by the Frigate NVR community for running Coral TPU on ARM64 Linux systems (e.g., Raspberry Pi 5, Orange Pi).

Note: feranick's builds target Linux ARM64, not macOS ARM64. They don't produce darwin-arm64 dylibs.

2.2.3 Tim Strobel's Setup Guide (2024)¶

URL: tim-strobel.de/coral.html

A practical guide for running Coral USB on macOS 14.x (Sonoma) as of October 2024:

Steps:

Install edgetpu runtime and PyCoral per Coral's setup guide
Python version: Must use Python 3.9 (the newest version covered by the official PyCoral wheels; 3.10+ does not work)
NumPy issue: PyCoral was compiled against NumPy 1.x; NumPy 2.x causes AttributeError: _ARRAY_API not found. Fix:

pip install "numpy<2.0"

Library linking error: ValueError: Failed to load delegate from libedgetpu.1.dylib. The dylib is installed at /usr/local/lib/libedgetpu.1.dylib but the system can't find it. Fix:

sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib

This guide is for Intel Macs or Apple Silicon Macs running under Rosetta 2.
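The three pitfalls in this guide (Python version, NumPy major version, missing symlink) can be checked up front. The sketch below is a hypothetical preflight helper; it takes the values as arguments so it can be run without PyCoral or NumPy installed, and the file names are the ones quoted in the guide.

```python
from typing import Iterable, List, Sequence


def preflight(python_version: Sequence[int],
              numpy_version: str,
              lib_names: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found.

    python_version: a tuple such as sys.version_info
    numpy_version:  a string such as numpy.__version__
    lib_names:      file names present in the library directory
    """
    names = set(lib_names)
    problems = []
    if tuple(python_version[:2]) > (3, 9):
        problems.append("Python is newer than 3.9")
    if int(numpy_version.split(".")[0]) >= 2:
        problems.append('NumPy 2.x installed; pin with pip install "numpy<2.0"')
    if "libedgetpu.1.dylib" in names and "libedgetpu.dylib" not in names:
        problems.append("libedgetpu.dylib symlink is missing")
    return problems
```

A typical call would be `preflight(sys.version_info, numpy.__version__, os.listdir("/usr/local/lib"))`, with `/usr/local/lib` being the install location named above.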

2.3 Steps to Install on Apple Silicon (M1/M2/M3) Today¶

There is no fully native, officially-supported path for Apple Silicon as of early 2025. The workable approaches are:

Approach A: Rosetta 2 (x86_64 emulation)¶

Install Rosetta 2: softwareupdate --install-rosetta
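Create an x86_64 Python 3.9 environment and install PyCoral into it:

# CONDA_SUBDIR=osx-64 makes conda resolve x86_64 packages on Apple Silicon
CONDA_SUBDIR=osx-64 conda create -n coral python=3.9
conda activate coral
conda config --env --set subdir osx-64
# PyCoral is served from Coral's own package index rather than PyPI
python3 -m pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0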

Install x86_64 libedgetpu:

# Download from coral.ai/software (macOS x86_64 package)
sudo cp libedgetpu.1.dylib /usr/local/lib/
sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib

Pin NumPy below 2.0 and run the example under x86_64:

pip install "numpy<2.0"
arch -x86_64 python3 classify_image.py ...

Performance cost: USB communication runs through Rosetta 2 translation, adding overhead to every inference call.

Approach B: cocoa-xu Native ARM64 Build¶

Obtain the native ARM64 libedgetpu.1.dylib from cocoa-xu's releases or build from source:

git clone https://github.com/cocoa-xu/libedgetpu
cd libedgetpu
make  # native build on Apple Silicon

Install the ARM64 dylib:

sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib

Build tflite_runtime and pycoral from source for ARM64.

2.4 What Breaks on Native ARM64¶

Component	Issue	Workaround
`libedgetpu` build	Bazel toolchain doesn't define `darwin_arm64`	Patch BUILD file (cocoa-xu approach)
`pycoral` pip wheel	Only x86_64 wheels published	Build from source
`tflite_runtime` pip wheel	Only x86_64 wheels published	Build from source or use feranick/TFlite-builds
NumPy ABI mismatch	Pre-compiled C extensions use NumPy 1.x ABI	Pin `numpy<2.0`
Edge TPU Compiler	Linux x86-64 only	Use Google Colab or Docker on Linux
USB driver (libusb)	Works but needs proper ARM64 build	Install via Homebrew: `brew install libusb`
`libedgetpu.1.dylib` not found	Dynamic linker can't locate it	Create symlink: `ln -s libedgetpu.1.dylib libedgetpu.dylib`

3. Existing Compiler Architectures (IREE, TVM, ExecuTorch)¶

3.1 IREE (Intermediate Representation Execution Environment)¶

URL: iree.dev | Repo: github.com/iree-org/iree

3.1.1 Architecture Overview¶

IREE is an MLIR-based end-to-end compiler and runtime that lowers ML models to a unified executable format. Its architecture is a multi-dialect MLIR pipeline:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Front-end   │────▶│  IREE Core   │────▶│  IREE HAL    │────▶│  IREE VM     │
│  Importers   │     │  Dialects    │     │  Dialects    │     │  Bytecode    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   TF / JAX /          linalg_ext,          hal.device,          FlatBuffer
   PyTorch →           stream,              hal.executable       executable
   MLIR                util                  target configs

3.1.2 Multi-Framework Front-Ends¶

IREE's first front-ends were TensorFlow and TFLite; JAX support followed, reflecting the project's origins at Google. PyTorch support is being integrated via the torch-mlir project and the SHARK-Turbine prototype:

Front-end	Import Path	Status
TensorFlow / TFLite	`iree-import-tflite`	Stable
JAX / JAX.export	`jax.export` → MLIR	Stable
PyTorch	`torch-mlir` → IREE	Experimental (SHARK-Turbine)
ONNX	Via `iree-import-onnx`	Community-maintained

The common entry point is MLIR's linalg dialect — all front-ends lower to linalg-on-tensors before IREE's core compilation kicks in.

3.1.3 Quantization Pipeline¶

IREE's quantization support is evolving (tracked in issue #12005):

Import-time quantization: Models quantized in the front-end (e.g., TFLite INT8, PyTorch quantized) carry quantization metadata into IREE's MLIR representation via the quant dialect.
Compiler-internal quantization: IREE can perform post-training quantization during compilation, but this is less mature than TFLite's approach.
Hardware-aware quantization: Different backends (CPU/GPU) may require different quantization schemes. IREE's data-tiling path handles sub-byte quantization (4-bit, 2-bit) for specific targets.

3.1.4 Backend Delegation & Hardware Targets¶

IREE uses a HAL (Hardware Abstraction Layer) dialect to describe device targets and executables:

Compiler target backends:
├── llvm-cpu        → LLVM CPU codegen (x86-64, ARM, RISC-V)
├── vulkan-spirv    → GPU via Vulkan/SPIR-V
├── rocm            → AMD GPU via ROCm
├── cuda            → NVIDIA GPU via PTX
├── metal-spirv     → Apple Metal (via SPIR-V cross-compilation)
└── vmvx            → IREE's reference software backend

Adding a custom backend requires implementing:

An IREE::HAL::ExecutableTargetAttr in MLIR
A lowering pass from linalg → target-specific dialect
A runtime HAL driver for the target device

Feasibility for Edge TPU: IREE's MLIR-based architecture makes it possible in theory to add an Edge TPU backend. However, the proprietary instruction set and lack of public documentation for the Edge TPU's microarchitecture make this extremely challenging. The most practical path would be:

Import model → linalg dialect
Lower to TFLite-compatible representation
Feed into the edgetpu_compiler as a post-processing step
Wrap the resulting edgetpu-custom-op as an IREE executable

3.2 Apache TVM¶

URL: tvm.apache.org | Repo: github.com/apache/tvm

3.2.1 Architecture Overview¶

TVM compiles through two IR levels, Relay (a functional graph IR) and TIR (a low-level tensor IR), and selects code generation per target:

┌───────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Front-end    │────▶│  Relay IR    │────▶│  TIR         │────▶│  Runtime     │
│  Importers    │     │  (Functional │     │  (Tensor IR) │     │  Module      │
│               │     │   Graph IR)  │     │              │     │              │
└───────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   TF / ONNX /        High-level graph    Low-level loop      .so / .tar
   PyTorch / MXNet     with types          nests + buffer

3.2.2 Multi-Framework Front-Ends¶

Front-end	Import Method	Status
TensorFlow	`tvm.relay.frontend.from_tensorflow`	Stable
TFLite	`tvm.relay.frontend.from_tflite`	Stable
ONNX	`tvm.relay.frontend.from_onnx`	Stable
PyTorch	`tvm.relay.frontend.from_pytorch`	Stable
MXNet	`tvm.relay.frontend.from_mxnet`	Stable
PaddlePaddle	`tvm.relay.frontend.from_paddle`	Experimental

3.2.3 Quantization Pipeline¶

TVM provides a comprehensive quantization toolkit:

Post-training quantization (PTQ):

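# qconfig carries the quantization options; global_scale needs no calibration data
with tvm.relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = tvm.relay.quantize.quantize(mod, params)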

Quantization-aware training (QAT) import: Models pre-quantized in TensorFlow (via tf.quantization) or PyTorch (via torch.quantization) can be imported with their quantization metadata preserved.
Calibration-based quantization: TVM can calibrate a float model using a representative dataset to determine optimal quantization parameters:

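# calibration_data is an iterable of input dicts, e.g. [{"data": batch}, ...]
with tvm.relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
    qmod = tvm.relay.quantize.quantize(mod, params, dataset=calibration_data)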

Custom quantization schemes: TVM supports configurable quantization (per-channel, per-tensor, symmetric, asymmetric) through the tvm.relay.quantize.qconfig API.

3.2.4 Backend Delegation & External Codegen¶

TVM supports external codegen for proprietary hardware through two mechanisms:

BYOC (Bring Your Own Codegen): Relay graph partitioning delegates subgraphs to external codegen tools:

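import tvm

# Register an external codegen for a target.
# BYOC looks it up as "relay.ext.<compiler name>"; "my_accel" is a placeholder.
@tvm.register_func("relay.ext.my_accel")
def my_accel_codegen(func):
    # func is one partitioned Relay function tagged with Compiler="my_accel".
    # Generate code for the accelerator and return it as a runtime module.
    compiled_module = build_my_accel_module(func)  # hypothetical helper
    return compiled_module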

Target hooks: Custom targets can register compilation passes that intercept specific op patterns.

Feasibility for Edge TPU: TVM's BYOC framework could theoretically wrap the edgetpu_compiler as an external codegen target. The workflow would be:

Import model → Relay IR
Partition graph: identify Edge TPU-compatible subgraphs
Export subgraphs to TFLite format
Invoke edgetpu_compiler on each subgraph
Link compiled Edge TPU blobs back into the TVM runtime as external modules

This is similar to how TVM integrates with the VTA (Versatile Tensor Accelerator) and how the MATCH framework uses TVM for MCU deployment.
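Step 2 of that workflow, graph partitioning, can be sketched on a simplified model. The example below assumes the graph is a linear chain of op names, which real Relay graphs (DAGs) are not, and groups consecutive ops by whether they appear in a supported set. Unlike a single-prefix split, every supported run becomes its own candidate subgraph. The function name `partition_runs` and the op list are illustrative.

```python
from itertools import groupby
from typing import Iterable, List, Set, Tuple


def partition_runs(ops: Iterable[str],
                   supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ("edgetpu" | "cpu", [ops]) segments."""
    segments = []
    for on_tpu, group in groupby(ops, key=lambda op: op in supported):
        target = "edgetpu" if on_tpu else "cpu"
        segments.append((target, list(group)))
    return segments


SUPPORTED = {"Conv2d", "ReLU", "Softmax"}
GRAPH = ["Conv2d", "ReLU", "Gather", "Conv2d", "Softmax"]

segments = partition_runs(GRAPH, SUPPORTED)
# segments == [("edgetpu", ["Conv2d", "ReLU"]),
#              ("cpu", ["Gather"]),
#              ("edgetpu", ["Conv2d", "Softmax"])]
```

Each "edgetpu" segment would then go through steps 3 to 5 separately, so a graph with several unsupported ops yields several compiled blobs.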

3.3 ExecuTorch (Meta / PyTorch)¶

URL: executorch.ai | Repo: github.com/pytorch/executorch

3.3.1 Architecture Overview¶

ExecuTorch is PyTorch's on-device inference framework, built on top of PyTorch 2's torch.export:

┌───────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  torch.export │────▶│  Edge IR     │────▶│  Partitioner │────▶│  ExecuTorch  │
│  (PyTorch 2)  │     │  (Core ATen  │     │  (Backend    │     │  Program     │
│               │     │   dialect)   │     │   Delegate)  │     │  (.pte)      │
└───────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
   AOT graph           Standardized         Subgraph split       FlatBuffer
   export               operator set         per backend          executable

3.3.2 Front-End & IR¶

Single front-end: PyTorch models via torch.export (AOT compilation)
IR: "Core ATen Operator Set" — a standardized subset of ATen ops
Export flow: torch.export.export(model, example_inputs) → ExportedProgram (ATen dialect) → to_edge() → Edge dialect

3.3.3 Quantization Pipeline¶

ExecuTorch supports two quantization pathways:

Post-training quantization (PTQ):
XNNPACK quantizer: INT8 per-channel weight quantization
Quantization flow: prepare_pt2e(model, quantizer) → calibrate → convert_pt2e(model) → quantized graph, which is then lowered to Edge IR
Backend-aware quantization:
Each backend delegate can define its own quantizer
Example: The OpenVINO backend provides OpenVINOQuantizer with backend-aware compression pathways
Example: The CoreML backend supports INT8 and FP16 quantization with hardware-specific optimizations

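import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer, get_symmetric_quantization_config)

# Capture the graph first (module paths and the capture API vary by release)
captured = torch.export.export(model, example_inputs).module()

# Quantize: prepare_pt2e takes a Quantizer object
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
model = prepare_pt2e(captured, quantizer)
model(*example_inputs)  # calibrate on representative inputs
model = convert_pt2e(model)

# Export to Edge: to_edge expects an ExportedProgram
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported, compile_config=EdgeCompileConfig())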

3.3.4 Backend Delegation¶

ExecuTorch uses a partitioner + delegate model:

Partitioner: Splits the Edge IR graph into subgraphs suitable for each backend. A partitioner identifies which ops a backend can handle and groups them.
Delegate: Each backend provides an ahead-of-time preprocess step (the BackendDetails interface) that compiles its assigned subgraph into a binary blob, plus a runtime backend that the ExecuTorch runtime calls to execute that blob.

Current backends (12+ exist; a selection is listed here):

Backend	Hardware	Status
XNNPACK	CPU (ARM, x86)	Stable
CoreML	Apple Neural Engine / GPU	Stable
Qualcomm QNN	Hexagon DSP / AI Engine	Stable
Vulkan	Mobile GPU (via SPIR-V)	Stable
MediaTek	APU	New
MPS	Apple Metal	Experimental
NNAPI	Android NNAPI	Community
OpenVINO	Intel CPU/GPU/VPU	Stable

Feasibility for Edge TPU: ExecuTorch's delegate model is well-suited for Edge TPU integration. The approach would be:

Implement a partitioner that identifies Edge TPU-compatible subgraphs (INT8 ops from the supported list)
Implement a delegate that converts the subgraph to TFLite format and invokes edgetpu_compiler
Package the compiled edgetpu-custom-op blob into the ExecuTorch program

The main challenge is the PyTorch → TFLite conversion step, which currently has limited support (ONNX as an intermediate format is one option).

3.4 Comparison Table¶

Feature	IREE	TVM	ExecuTorch
IR	MLIR (linalg, stream, hal)	Relay IR + TIR	Core ATen (Edge dialect)
Front-ends	TF, JAX, PyTorch (exp.)	TF, ONNX, PyTorch, TFLite, etc.	PyTorch only
Quantization	Import-time + compiler-internal	PTQ + QAT + calibration	PTQ + backend-aware
Backend model	HAL device targets	BYOC external codegen	Partitioner + delegate
Custom HW path	Implement HAL target + runtime driver	BYOC: register codegen func	Implement partitioner + delegate
Edge TPU feasibility	Hard (proprietary ISA)	Medium (BYOC + edgetpu_compiler)	Medium (delegate + TFLite conversion)
MLIR as IR?	Yes (native)	No (Relay is custom)	No (Core ATen dialect)
Runtime	IREE VM (custom bytecode)	TVM runtime (graph executor)	ExecuTorch runtime (.pte)

3.5 Feasibility of MLIR (IREE) or Relay (TVM) as Intermediate Representation¶

MLIR (via IREE)¶

Pros:

Native multi-dialect system allows clean separation of concerns
linalg dialect is a natural meeting point for different front-ends
Extensible: new hardware targets are "just" new dialects and lowering passes
Growing ecosystem (XLA, IREE, torch-mlir, stablehlo all use MLIR)
Potential for stablehlo → linalg → Edge TPU lowering pipeline

Cons:

IREE's hal.executable abstraction assumes you can lower to some form of compute kernel (SPIR-V, PTX, LLVM IR). The Edge TPU's proprietary ISA doesn't fit this model.
The edgetpu_compiler is a black box; you can't decompose its output back into MLIR.
Significant engineering effort to implement a working pipeline

Recommended approach: Use MLIR as a front-end aggregation layer (TF/PyTorch/ONNX → stablehlo → linalg) and then lower to TFLite for the edgetpu_compiler step. Don't attempt to bypass edgetpu_compiler.

Relay (via TVM)¶

Pros:

Mature BYOC framework designed for proprietary hardware
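Good TFLite front-end import via from_tflite. TVM has no built-in TFLite exporter, so writing partitioned subgraphs back out as TFLite needs a separate conversion step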
Quantization toolkit is well-tested for INT8 workflows
TVM has been used with real hardware accelerators (VTA, MATCH)

Cons:

Relay IR is not MLIR; doesn't benefit from the MLIR ecosystem
TVM's own development velocity has slowed relative to IREE/ExecuTorch
The BYOC → edgetpu_compiler path requires an awkward round trip: Relay → TFLite → edgetpu_compiler → custom op linked back into the Relay module

Recommended approach: If starting from TVM, use BYOC with a TFLite export/reimport strategy. This is the path of least resistance for integrating the edgetpu_compiler.

4. Edge TPU Operation Set & Limitations¶

4.1 Model Requirements¶

For a TensorFlow model to compile for the Edge TPU, it must meet all of these:

Tensor parameters are quantized to 8-bit fixed-point (int8 or uint8)
Tensor sizes are constant at compile-time (no dynamic shapes)
Model parameters (weights, biases) are constant at compile-time
Tensors are 1-, 2-, or 3-dimensional. If a tensor has >3 dimensions, only the 3 innermost dimensions may have size >1
Only supported operations are used (see table below)

Float inputs are OK — the compiler will leave a QUANTIZE op at the graph entry point that runs on the CPU, converting float → INT8 before the Edge TPU custom op.

4.2 Complete Supported Operations Table¶

Source: coral.ai/docs/edgetpu/models-intro#supported-operations

Operation	Min Runtime	Known Limitations
Add	All	—
AveragePool2d	All	No fused activation function
Concatenation	All	No fused activation function. If any input is a compile-time constant tensor, there must be only 2 inputs, and this constant must be all zeros (zero-padding op)
Conv2d	All	Must use same dilation in x and y dimensions
DepthwiseConv2d	≤12	Dilated conv kernels not supported
	≥13	Must use same dilation in x and y dimensions
ExpandDims	≥13	—
FullyConnected	All	Only default weight format. Output tensor is 1-dimensional
L2Normalization	All	—
Logistic (Sigmoid)	All	—
LSTM	≥14	Unidirectional LSTM only
Maximum	All	—
MaxPool2d	All	No fused activation function
Mean	≤12	No batch-dim reduction. Supports reduction along x- and/or y-dimensions only
	≥13	No batch-dim reduction. If z-reduction, z-dim must be multiple of 4
Minimum	All	—
Mul	All	—
Pack	≥13	No packing in batch dimension
Pad	≤12	No batch-dim padding. Supports padding along x- and/or y-dimensions only
	≥13	No batch-dim padding
PReLU	≥13	Alpha must be 1-dimensional. If using Keras PReLU with 4D input (batch, height, width, channels), `shared_axes` must be [1,2]
Quantize	≥13	—
ReduceMax	≥14	Cannot operate on batch dimension
ReduceMin	≥14	Cannot operate on batch dimension
ReLU	All	—
ReLU6	All	—
ReLUN1To1	All	—
Reshape	All	Certain reshapes might not be mapped for large tensor sizes
ResizeBilinear	All	Input/output is 3-dimensional. Might not be mapped to avoid precision loss depending on size
ResizeNearestNeighbor	All	Input/output is 3-dimensional. Might not be mapped depending on size
Rsqrt	≥14	—
Slice	All	—
Softmax	All	Supports only 1-D input with max 16,000 elements
SpaceToDepth	All	—
Split	All	No splitting in batch dimension
Squeeze	≤12	Only when input has leading 1s (no relayout needed), e.g. [1,1,10] or [1,5,10] is OK; [5,1,10] is not
	≥13	None
StridedSlice	All	All strides must equal 1 (effectively a Slice op), ellipsis-axis-mask == 0, new-axis-mask == 0
Sub	All	—
Sum	≥13	Cannot operate on batch dimension
SquaredDifference	≥14	—
Tanh	All	—
Transpose	≥14	—
TransposeConv	≥13	—

Total: 40 operations (with varying runtime version requirements)
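The "Min Runtime" column can be turned into a lookup that flags ops a given runtime is too old for. The sketch below copies only the rows marked ≥13 or ≥14 from the table above. Ops missing from the dictionary are treated as available on all runtimes, so it does not check whether an op is supported at all. The function name is hypothetical.

```python
from typing import Iterable, List

# Ops that need a minimum runtime version, copied from the table above.
MIN_RUNTIME = {
    "ExpandDims": 13, "Pack": 13, "PReLU": 13, "Quantize": 13,
    "Sum": 13, "TransposeConv": 13,
    "LSTM": 14, "ReduceMax": 14, "ReduceMin": 14, "Rsqrt": 14,
    "SquaredDifference": 14, "Transpose": 14,
}


def ops_blocked_by_runtime(ops: Iterable[str], runtime_version: int) -> List[str]:
    """Return the sorted, de-duplicated ops that need a newer runtime."""
    return sorted({op for op in ops if MIN_RUNTIME.get(op, 0) > runtime_version})
```

For a model using Conv2d, LSTM, and PReLU, a v13 runtime is short only of LSTM, while a v12 runtime is short of both LSTM and PReLU.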

4.3 Key Constraints¶

4.3.1 INT8 Quantization Requirement¶

All tensor parameters (weights, biases) must be quantized to INT8/UINT8. The Edge TPU hardware has no floating-point execution units. Activation tensors must also be quantized; the TFLite converter handles this via "full integer quantization" with calibration data.
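As a worked example of what 8-bit quantization means numerically, the sketch below uses the affine scheme TFLite uses for quantized tensors, real = scale * (q - zero_point), with int8 values in [-128, 127]. Deriving the parameters from a single min/max pair is a simplification of what calibration does, and the helper names are illustrative.

```python
from typing import Tuple

QMIN, QMAX = -128, 127


def choose_qparams(rmin: float, rmax: float) -> Tuple[float, int]:
    """Pick scale and zero point so that [rmin, rmax] maps onto int8."""
    # The representable range must contain 0.0 so that zero is exact.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (QMAX - QMIN)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(QMIN - rmin / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to int8, saturating at the range limits."""
    return max(QMIN, min(QMAX, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to the real number it represents."""
    return scale * (q - zero_point)


scale, zp = choose_qparams(-1.0, 3.0)  # scale = 4/255, zp = -64
```

With this range, 0.0 maps to -64 and back to exactly 0.0, the endpoints map to -128 and 127, and any value in between is recovered to within half a quantization step.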

4.3.2 Tensor Dimension Constraints¶

Tensors must be 1D, 2D, or 3D
If >3D, only the 3 innermost dimensions may have size >1
In practice this means activations are laid out as NHWC with N (batch) equal to 1 for most operations; any dimension outside the three innermost (H, W, C) must have size 1
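The dimension rule reduces to a one-line check on a shape tuple. This is a minimal sketch of the rule as stated above; it says nothing about per-op restrictions such as batch-dimension limits, and the function name is hypothetical.

```python
from typing import Sequence


def shape_is_mappable(shape: Sequence[int]) -> bool:
    """True if only the 3 innermost dimensions have size greater than 1."""
    outer = shape[:-3]  # empty for 1-D, 2-D, and 3-D tensors
    return all(dim == 1 for dim in outer)
```

So `(1, 224, 224, 3)` passes, while `(8, 224, 224, 3)` fails because the batch dimension lies outside the three innermost dimensions and is larger than 1.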

4.3.3 On-Chip Memory¶

~8 MB SRAM total for executable + parameter cache
In the compiler output example in section 1.3.3, 6.91 MiB was available for parameter data (after executable space)
If model parameters exceed available SRAM, excess is streamed from host memory (off-chip), which significantly degrades performance

4.3.4 Max Parameter Data Size¶

There is no single documented hard limit on model size, but practical limits arise from the 8 MB SRAM:

Models with <7 MB of parameter data can be fully cached on-chip
Larger models must stream parameters, causing 2-10x slower inference
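The on-chip versus off-chip split is simple arithmetic once the available cache size is known. The sketch below assumes a fill-then-spill model, in which parameters occupy the cache until it is full and the remainder is streamed. The real compiler decides placement itself and may not behave this way. The function name and the round-number sizes are illustrative.

```python
from typing import Tuple

MIB = 1024 * 1024


def split_parameters(param_bytes: int, cache_bytes: int) -> Tuple[int, int]:
    """Return (cached_bytes, streamed_bytes) under a fill-then-spill model."""
    cached = min(param_bytes, cache_bytes)
    return cached, param_bytes - cached


# A 4 MiB model fits entirely in a 7 MiB cache; a 10 MiB model does not.
fits = split_parameters(4 * MIB, 7 * MIB)     # (4 MiB, 0)
spills = split_parameters(10 * MIB, 7 * MIB)  # (7 MiB, 3 MiB)
```

The second element of the result corresponds to the "Off-chip memory used for streaming uncached model parameters" line in the compiler log.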

4.3.5 Compilation Behavior¶

The edgetpu_compiler operates in a greedy prefix manner:

Starts from the graph input
Collects consecutive supported ops
Stops at the first unsupported op
Everything before the stop → edgetpu-custom-op (runs on TPU)
Everything after → standard TFLite ops (runs on CPU)

This means the position of unsupported ops matters — an unsupported op early in the graph will prevent most of the model from running on the TPU, even if later ops are individually supported.

The -d / --search_delegate flag enables the compiler to retry compilation from an earlier point in the graph when it encounters an unsupported op, potentially allowing a partial TPU mapping even with mid-graph failures.
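The greedy prefix behavior can be sketched in a few lines. The example below models the graph as a linear list of op names and splits it at the first op that is not in a supported set. It does not model the retry behavior of --search_delegate, and the function name and op list are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def greedy_prefix_partition(ops: Sequence[str],
                            supported: Set[str]) -> Tuple[List[str], List[str]]:
    """Split ops into (edge_tpu_prefix, cpu_remainder) at the first unsupported op."""
    ops = list(ops)
    for index, op in enumerate(ops):
        if op not in supported:
            return ops[:index], ops[index:]
    return ops, []  # every op was supported


SUPPORTED = {"Conv2d", "ReLU", "Softmax"}

late = greedy_prefix_partition(["Conv2d", "ReLU", "Conv2d", "Gather"], SUPPORTED)
early = greedy_prefix_partition(["Gather", "Conv2d", "ReLU", "Conv2d"], SUPPORTED)
# late  -> (["Conv2d", "ReLU", "Conv2d"], ["Gather"])
# early -> ([], ["Gather", "Conv2d", "ReLU", "Conv2d"])
```

The two calls use the same ops in a different order: with the unsupported op last, three ops reach the TPU; with it first, none do.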

4.3.6 Subgraph Limitations¶

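Many operations show "More than one subgraph is not supported" in compiler output. The Edge TPU currently supports only a single subgraph for hardware execution, so the message is attached to ops that would need a second one. This occurs when:

An op is individually supported but sits after the first unsupported op (see 4.3.5)
The model has control flow (if/while) creating multiple subgraphs
Certain ops create implicit subgraphs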

4.3.7 No 3D Convolutions¶

The Edge TPU supports only Conv2D and DepthwiseConv2D. There is no 3D convolution support. Models requiring 3D convs (video processing, 3D medical imaging) cannot be fully accelerated.

4.3.8 Softmax Element Limit¶

Softmax supports only 1-D input with a maximum of 16,000 elements. This limits the number of classes in classification models running entirely on the Edge TPU.

5. Sloth Integration Implications¶

The repository now carries an integrated sloth_integration package that targets text-classification and embedding deployment to Coral USB.

5.1 Why This Matters for Edge TPU Research¶

It validates an end-to-end SLM deployment path (adapter -> converter -> quantizer -> runtime).
It stress-tests the runtime API compatibility layer against Coral USB delegates.
It provides reproducible benchmark scenarios that compare CPU baseline, direct hardware, and hardware through sloth runtime abstractions.

5.2 Practical Constraints Observed¶

Synthetic micro-models are useful for pipeline validation but should not be used as production latency proxies.
Delegate preflight behavior can vary by model artifact; robust fallback paths are required in runtime wrappers.
Keeping package metadata centralized in root build config avoids drift between runtime code and published dependencies.
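The fallback requirement noted above can be sketched as a small wrapper that tries the accelerated path first and falls back to CPU on failure. All names here are hypothetical and do not describe the sloth_integration API; the loaders are injected as callables so the sketch makes no assumption about how an interpreter or delegate is constructed, and the set of caught exception types is an assumption that should be matched to the runtime actually in use.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")

# Assumed failure types for delegate loading; adjust to the runtime in use.
FALLBACK_ERRORS = (ValueError, RuntimeError, OSError)


def load_with_fallback(
    load_accelerated: Callable[[], T],
    load_cpu: Callable[[], T],
    on_fallback: Optional[Callable[[Exception], None]] = None,
) -> Tuple[T, str]:
    """Return (runner, backend_name), preferring the accelerated loader."""
    try:
        return load_accelerated(), "edgetpu"
    except FALLBACK_ERRORS as exc:
        # Report the failure so a silent CPU fallback does not skew benchmarks.
        if on_fallback is not None:
            on_fallback(exc)
        return load_cpu(), "cpu"
```

Returning the backend name alongside the runner lets benchmark code record which path actually executed, which matters when comparing CPU baseline and hardware results.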

5.3 Current Internal References¶

Integration package: sloth-integration/src/sloth_integration
Integration tests: sloth-integration/tests
Benchmark report: sloth-integration/docs/benchmarks_sloth.md

Appendix A: Key URLs¶

Resource	URL
Coral Documentation	https://www.coral.ai/docs
Edge TPU Models Intro	https://www.coral.ai/docs/edgetpu/models-intro
Edge TPU Compiler	https://www.coral.ai/docs/edgetpu/compiler
libedgetpu (official)	https://github.com/google-coral/libedgetpu
edgetpu issue tracker	https://github.com/google-coral/edgetpu
pycoral	https://github.com/google-coral/pycoral
libcoral	https://github.com/google-coral/libcoral
cocoa-xu/libedgetpu	https://github.com/cocoa-xu/libedgetpu
feranick/libedgetpu	https://github.com/feranick/libedgetpu
feranick/TFlite-builds	https://github.com/feranick/TFlite-builds
Tim Strobel's Mac Guide	https://tim-strobel.de/coral.html
IREE	https://iree.dev
Apache TVM	https://tvm.apache.org
ExecuTorch	https://executorch.ai
Edge TPU Compiler (Colab)	https://colab.research.google.com

Appendix B: Edge TPU Compiler CLI Reference¶

edgetpu_compiler [options] model [...]

Options:
  -h, --help                     Print help
  -i, --intermediate_tensors     Output tensors from Edge TPU custom op
  -m, --min_runtime_version      Min runtime version (default: compiler-specific)
  -n, --num_segments             Compile into N segments for pipelining
  -o, --out_dir                  Output directory (default: cwd)
  -s, --show_operations          Print op mapping to console
  -d, --search_delegate          Retry compilation on failure (since v16)
  -t, --timeout_sec              Compiler timeout in seconds (default: 180)
  -v, --version                  Print compiler version
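
For scripted builds, the options above can be assembled into an argument list programmatically. The helper below is a hypothetical sketch: it uses only the long flag names listed in this appendix, and the --flag=value spelling is an assumption about how the CLI parses values.

```python
from typing import List, Optional


def build_compiler_command(
    model_path: str,
    out_dir: Optional[str] = None,
    show_operations: bool = False,
    search_delegate: bool = False,
    num_segments: Optional[int] = None,
    min_runtime_version: Optional[int] = None,
    timeout_sec: Optional[int] = None,
) -> List[str]:
    """Build an argv list for edgetpu_compiler without executing it."""
    cmd = ["edgetpu_compiler"]
    if show_operations:
        cmd.append("--show_operations")
    if search_delegate:
        cmd.append("--search_delegate")
    if out_dir is not None:
        cmd.append(f"--out_dir={out_dir}")
    if num_segments is not None:
        if num_segments < 1:
            raise ValueError("num_segments must be at least 1")
        cmd.append(f"--num_segments={num_segments}")
    if min_runtime_version is not None:
        cmd.append(f"--min_runtime_version={min_runtime_version}")
    if timeout_sec is not None:
        cmd.append(f"--timeout_sec={timeout_sec}")
    cmd.append(model_path)  # the model path comes after all options
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where the compiler is installed; keeping construction separate from execution makes the command easy to log and test.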

Appendix C: edgetpu.h Complete Public API¶

namespace edgetpu {

// Custom op identifier
static const char kCustomOp[] = "edgetpu-custom-op";

enum class DeviceType { kApexPci = 0, kApexUsb = 1 };

class EdgeTpuManager {
 public:
  static EdgeTpuManager* GetSingleton();

  // Shared ownership (preferred)
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType, const std::string& path) = 0;
  virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
      DeviceType, const std::string& path, const DeviceOptions&) = 0;

  // Exclusive ownership (deprecated)
  virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;

  virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;
  virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;
  virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
  virtual std::string Version() const = 0;
};

class EdgeTpuContext : public TfLiteExternalContext {
 public:
  virtual const DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
  virtual DeviceOptions GetDeviceOptions() const = 0;
  virtual bool IsReady() const = 0;
};

// Register the custom op with TFLite
EDGETPU_EXPORT TfLiteRegistration* RegisterCustomOp();

}  // namespace edgetpu

Appendix D: Compiler-Runtime Version Compatibility¶

Compiler	Default Runtime	Notable New Ops
1.0	10	Base set (Add, Conv2D, etc.)
2.0	12	DepthwiseConv2d dilation
2.1	13	ExpandDims, Pack, PReLU, Quantize, Sum, TransposeConv, improved Squeeze/Mean/Pad
14.1–15.0	13	—
16.0	14	LSTM, ReduceMax, ReduceMin, Rsqrt, SquaredDifference, Transpose
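
The table above can be encoded as a lookup to decide whether a deployed runtime is new enough for a model built with default settings. This is a sketch: the helper names are hypothetical, the 14.1–15.0 row is expanded into two keys, and the check assumes that a model needs a runtime version at least as high as the one the compiler targeted.

```python
# Default runtime version per compiler release, copied from the table above.
DEFAULT_RUNTIME_BY_COMPILER = {
    "1.0": 10,
    "2.0": 12,
    "2.1": 13,
    "14.1": 13,
    "15.0": 13,
    "16.0": 14,
}


def default_runtime_version(compiler_version: str) -> int:
    """Look up the default runtime version using the major.minor prefix."""
    key = ".".join(compiler_version.split(".")[:2])
    if key not in DEFAULT_RUNTIME_BY_COMPILER:
        raise KeyError(f"unknown compiler version: {compiler_version}")
    return DEFAULT_RUNTIME_BY_COMPILER[key]


def runtime_is_sufficient(compiler_version: str, installed_runtime: int) -> bool:
    """Assume a model needs a runtime >= the compiler's default target."""
    return installed_runtime >= default_runtime_version(compiler_version)
```

When the check fails, the -m / --min_runtime_version option from Appendix B is the documented way to target an older runtime, at the cost of the newer ops listed in the table.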

Report compiled from official documentation, GitHub repositories, community guides, and web research. Last updated: 2025-03-05.