Edge TPU Ecosystem Research
Comprehensive technical reference for the Google Coral Edge TPU platform, compiled as the basis for an alternative compiler project.
Table of Contents
- Official Coral USB / Edge TPU Software Stack
- Coral USB on macOS / Apple Silicon
- Existing Compiler Architectures
- Edge TPU Operation Set & Limitations
- Sloth Integration Implications
1. Official Coral USB / Edge TPU Software Stack
1.1 Repository Structure
The Coral software stack comprises three principal repositories:
| Repository | Purpose | Language | Current TF Version |
|---|---|---|---|
| google-coral/libedgetpu | Userspace runtime driver (USB/PCIe communication) | C++ | 2.16.1 |
| google-coral/libcoral | High-level C++ inference/pipeline/transfer-learning API | C++ | — |
| google-coral/pycoral | Python bindings (PyCoral) | Python/C++ | — |
The legacy monorepo google-coral/edgetpu now serves
primarily as an issue tracker; source has been split into the repos above. The deprecated
edgetpu Python API still lives in the old repo but is superseded by pycoral.
1.2 Building libedgetpu
Three build paths are documented:
A. Docker + Bazel (Recommended)
# x86-64
DOCKER_CPUS="k8" DOCKER_IMAGE="ubuntu:22.04" DOCKER_TARGETS=libedgetpu make docker-build
# ARM64 / ARMv7
DOCKER_CPUS="armv7a aarch64" DOCKER_IMAGE="debian:bookworm" DOCKER_TARGETS=libedgetpu make docker-build
All built binaries land in the out/ directory. Debian packages can be produced with
debuild -us -uc -tc -b -a arm64 -d.
B. Bare Bazel
# Requires Bazel 6.5.0 for TF 2.16.1
make # native build
CPU=armv7a make # cross-compile ARMv7
CPU=aarch64 make # cross-compile AArch64
macOS caveat: Compilation fails out of the box. Two manual steps are required:
- Install flatbuffers via MacPorts.
- After the first build failure, patch the auto-generated Bazel BUILD file at /var/tmp/_bazel_xxxxx/.../external/local_config_cc/BUILD, line 48:
C. Makefile (Linux-only, no Bazel)
sudo apt install libabsl-dev libflatbuffers-dev
git clone https://github.com/tensorflow/tensorflow && cd tensorflow && git checkout v2.16.1
TFROOT=<path-to-tf> make -f makefile_build/Makefile -j$(nproc) libedgetpu
This approach eliminates Bazel/Docker entirely and uses system packages for
libabsl and libflatbuffers.
1.3 The *_edgetpu.tflite Binary Format
1.3.1 How the Edge TPU Custom Op Is Embedded
The Edge TPU Compiler (edgetpu_compiler) takes a standard TFLite FlatBuffer and
transforms it into a compiled *_edgetpu.tflite file. The key transformation is:
- Subgraph partitioning: The compiler walks the TFLite subgraph from inputs to outputs. Starting from the beginning, it collects consecutive supported INT8 ops into a single contiguous subgraph destined for the Edge TPU.
- Custom op creation: All Edge TPU-compatible ops in that contiguous subgraph are replaced by a single custom op with the name "edgetpu-custom-op".
- Opaque binary payload: The custom op carries an opaque binary blob in its custom_options field (the FlatBuffer Operator.custom_options byte vector). This blob contains:
  - The inference executable: a compiled binary that the Edge TPU DSP/systolic array can execute directly. This is NOT a TFLite subgraph; it is a proprietary instruction stream for the Edge TPU hardware.
  - Parameter data layout descriptors: metadata describing which tensors should be cached in on-chip SRAM versus streamed from off-chip memory.
  - A caching token: a 64-bit number uniquely identifying the parameter data layout for cache management.
- Remaining ops stay as-is: If the compiler encounters an unsupported op, it stops the Edge TPU subgraph. All remaining ops after the unsupported op stay as standard TFLite ops and execute on the CPU. This creates a "split" model:
  - Edge TPU custom op (runs on the TPU)
  - Remaining TFLite ops (run on CPU)
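Because the custom op name is stored as a plain string inside the FlatBuffer, a quick way to tell whether a file has been through the compiler is to look for that string. The following is a minimal sketch of such a heuristic, not a FlatBuffer parser; the function names are hypothetical, and a match only indicates that the name occurs somewhere in the file.

```python
from pathlib import Path

# Custom op name assigned by the compiler, as described above.
EDGETPU_CUSTOM_OP = b"edgetpu-custom-op"


def looks_edgetpu_compiled(model_bytes: bytes) -> bool:
    """Heuristic: report whether the custom op name occurs in the model bytes."""
    return EDGETPU_CUSTOM_OP in model_bytes


def check_file(path: str) -> bool:
    """Read a .tflite file from disk and apply the heuristic to its contents."""
    return looks_edgetpu_compiled(Path(path).read_bytes())


# Synthetic byte strings standing in for real model files.
compiled_like = b"TFL3 ... edgetpu-custom-op ... <opaque blob>"
plain_like = b"TFL3 ... CONV_2D ... DEPTHWISE_CONV_2D"
print(looks_edgetpu_compiled(compiled_like))
print(looks_edgetpu_compiled(plain_like))
```

A real inspection tool would parse the FlatBuffer schema and read the operator codes; the byte search is only a cheap first check.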
1.3.2 The "Executable Preamble"
The Coral documentation explicitly states:
"The Edge TPU Compiler adds a small executable inside the model that writes a specific amount of the model's parameter data to the Edge TPU RAM (if available) before running an inference."
This executable preamble is embedded within the edgetpu-custom-op custom options blob
and serves two purposes:
- Parameter data loading (cache warm-up): On first invocation, the preamble streams parameter data (INT8 weights/biases) from host memory into the Edge TPU's ~8 MB on-chip SRAM. Subsequent invocations can skip this step if the caching token matches.
- Inference dispatch: After parameter data is loaded, the preamble transitions to the actual inference engine that drives the Edge TPU's systolic array.
1.3.3 Parameter-Data Caching Protocol
The Edge TPU has approximately 8 MB of SRAM for parameter caching. The protocol works as follows:
┌──────────────────────────────────────────────────┐
│ Edge TPU SRAM (~8 MB) │
│ ┌────────────────────────────────────────────┐ │
│ │ Inference Executable Space (reserved) │ │
│ ├────────────────────────────────────────────┤ │
│ │ Parameter Data Cache (remaining space) │ │
│ │ - Weights, biases for cached ops │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Key details:
- Compiler-allocated scratchpad (not a traditional cache): The compiler knows the exact SRAM size and the model's requirements, so it assigns fixed cache space at compile time.
- Caching token: A 64-bit number assigned at compile time. The runtime compares the token of the incoming model against the token of the currently cached data:
  - Match: Use cached data (fast path).
  - Mismatch: Wipe cache, write new data, then execute (slow first inference).
- Co-compilation: Multiple models can be compiled together (edgetpu_compiler model_A.tflite model_B.tflite) to share the same caching token, eliminating cache thrashing when switching between models.
- Compiler output example:
  On-chip memory available for caching model parameters: 6.91 MiB
  On-chip memory used for caching model parameters: 4.21 MiB
  Off-chip memory used for streaming uncached model parameters: 0.00 B
- Off-chip fallback: If parameter data exceeds available SRAM, the excess is streamed from host memory at inference time, degrading performance.
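The token comparison can be illustrated with a toy model. This is a sketch of the match/mismatch rule only, under the assumption that the cache holds parameter data for one token at a time; the class and function names are hypothetical and do not correspond to anything in libedgetpu.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CompiledModel:
    name: str
    caching_token: int  # 64-bit token assigned at compile time


class ParameterCacheModel:
    """Toy model of the token check; not the real runtime logic."""

    def __init__(self) -> None:
        self.resident_token: Optional[int] = None
        self.reloads = 0

    def prepare(self, model: CompiledModel) -> str:
        if self.resident_token == model.caching_token:
            return "hit"  # fast path: cached parameter data is reused
        # Slow path: the cache is wiped and rewritten for the new token.
        self.resident_token = model.caching_token
        self.reloads += 1
        return "miss"


def count_reloads(sequence: Iterable[CompiledModel]) -> int:
    """Count how many parameter reloads a sequence of inferences triggers."""
    cache = ParameterCacheModel()
    for model in sequence:
        cache.prepare(model)
    return cache.reloads


# Separately compiled models carry different tokens: every switch reloads.
a, b = CompiledModel("model_A", 0x1111), CompiledModel("model_B", 0x2222)
print(count_reloads([a, b, a, b]))

# Co-compiled models share one token: only the first inference reloads.
a_co, b_co = CompiledModel("model_A", 0x3333), CompiledModel("model_B", 0x3333)
print(count_reloads([a_co, b_co, a_co, b_co]))
```

The contrast between the two sequences is the cache thrashing that co-compilation is meant to avoid.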
1.4 The libedgetpu API (edgetpu.h)
The public API is defined in libedgetpu/edgetpu.h. Key components:
1.4.1 EdgeTpuManager (Singleton)
class EDGETPU_EXPORT EdgeTpuManager {
public:
// Singleton accessor
static EdgeTpuManager* GetSingleton();
// Shared-ownership device opening (preferred API)
virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType device_type) = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
DeviceType device_type, const std::string& device_path) = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
DeviceType device_type, const std::string& device_path,
const DeviceOptions& options) = 0;
// Deprecated: exclusive-ownership API
virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;
// Device enumeration
virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;
// Currently opened shared devices
virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;
// Runtime version
virtual std::string Version() const = 0;
virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
};
Device options (passed as std::unordered_map<std::string, std::string>):
- "Performance": "Low" | "Medium" | "High" | "Max" (default: "Max")
- "Usb.AlwaysDfu": "True" | "False" (default: "False")
- "Usb.MaxBulkInQueueLength": "0" .. "255" (default: "32")
Device types:
1.4.2 EdgeTpuContext
class EdgeTpuContext : public TfLiteExternalContext {
public:
virtual ~EdgeTpuContext() = 0;
virtual const EdgeTpuManager::DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
virtual EdgeTpuManager::DeviceOptions GetDeviceOptions() const = 0;
virtual bool IsReady() const = 0;
};
The context is a TfLiteExternalContext, meaning it plugs into the TFLite interpreter
via interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get()).
1.4.3 TFLite Delegate Registration
The custom op is registered via the standard TFLite mechanism:
// Typical usage (non-NNAPI path). Assumes `model` is an already loaded
// std::unique_ptr<tflite::FlatBufferModel>.
auto tpu_context = edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();
tflite::ops::builtin::BuiltinOpResolver resolver;
// Register the Edge TPU custom op handler
resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());
// Build the interpreter with the extended resolver
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Bind the TPU context to the interpreter
interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get());
interpreter->AllocateTensors();
interpreter->Invoke();
RegisterCustomOp() returns a TfLiteRegistration* that handles:
- Init: Parses the custom_options blob, sets up the Edge TPU executable
- Prepare: Resolves tensor shapes and memory layout
- Invoke: Sends the inference request to the Edge TPU hardware via USB/PCIe
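From Python, the same runtime is usually reached through the TFLite delegate mechanism rather than by registering the custom op by hand. The sketch below assumes tflite_runtime is installed and that the shared library is discoverable by the dynamic linker; the helper names are hypothetical, and the library file names follow the convention used in Coral's Python examples.

```python
import platform

# Shared library name per platform.system() value.
EDGETPU_SHARED_LIB = {
    "Linux": "libedgetpu.so.1",
    "Darwin": "libedgetpu.1.dylib",
    "Windows": "edgetpu.dll",
}


def edgetpu_library_name(system=None):
    """Return the libedgetpu file name for the given (or current) platform."""
    return EDGETPU_SHARED_LIB[system or platform.system()]


def make_interpreter(model_path):
    """Create a TFLite interpreter that routes the custom op to libedgetpu."""
    # Imported lazily so this module can be loaded without tflite_runtime.
    import tflite_runtime.interpreter as tflite

    delegate = tflite.load_delegate(edgetpu_library_name())
    return tflite.Interpreter(
        model_path=model_path, experimental_delegates=[delegate]
    )


print(edgetpu_library_name("Darwin"))
```

Calling make_interpreter("model_edgetpu.tflite") followed by allocate_tensors() and invoke() mirrors the C++ sequence above; it needs an attached device, so it is not exercised here.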
1.4.4 Runtime Version Negotiation
The compiled model encodes a minimum runtime version requirement. At inference time, the runtime checks compatibility:
Failed precondition: Package requires runtime version (12),
which is newer than this runtime version (10).
Version compatibility table:
| Compiler Version | Default Runtime Version Required |
|---|---|
| 16.0 | 14 |
| 15.0 | 13 |
| 14.1 | 13 |
| 2.1.302470888 | 13 |
| 2.0.291256449 | 12 |
| 1.0 | 10 |
Models are forward-compatible: a model compiled for runtime v10 will work on v12+. To create backward-compatible models:
The runtime version also determines which operations are available (newer ops like LSTM, PReLU, ReduceMax, etc. require runtime ≥13 or ≥14).
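The compatibility rule amounts to a single comparison. The sketch below restates it, using the version table above as data; the function name is hypothetical, and the real check lives inside libedgetpu.

```python
# Default runtime version required per compiler version (from the table above).
DEFAULT_RUNTIME_FOR_COMPILER = {
    "16.0": 14,
    "15.0": 13,
    "14.1": 13,
    "2.1.302470888": 13,
    "2.0.291256449": 12,
    "1.0": 10,
}


def check_runtime(package_min_runtime: int, runtime_version: int) -> None:
    """Raise if the model needs a newer runtime than the one installed."""
    if package_min_runtime > runtime_version:
        raise RuntimeError(
            "Failed precondition: Package requires runtime version "
            f"({package_min_runtime}), which is newer than this runtime "
            f"version ({runtime_version})."
        )


# A model compiled for runtime 10 loads on runtime 12.
check_runtime(DEFAULT_RUNTIME_FOR_COMPILER["1.0"], 12)

# A model that requires runtime 12 is rejected by runtime 10.
try:
    check_runtime(12, 10)
except RuntimeError as err:
    print(err)
```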
1.5 Software Stack Summary
┌─────────────────────────────────────────┐
│ User Application │
├─────────────────────────────────────────┤
│ PyCoral (Python) / libcoral (C++) │ High-level inference API
├─────────────────────────────────────────┤
│ TFLite Interpreter + Custom Op │ Standard TFLite + edgetpu-custom-op
├─────────────────────────────────────────┤
│ libedgetpu │ Runtime driver (USB/PCIe comm)
├─────────────────────────────────────────┤
│ Edge TPU Hardware (USB / M.2 / PCIe) │ ASIC accelerator
└─────────────────────────────────────────┘
2. Coral USB on macOS / Apple Silicon
2.1 Official Support Status
Google Coral's official documentation states that the USB Accelerator works on
"Linux, Mac, or Windows." However, the official pre-built libedgetpu binaries
are x86_64-only for macOS. The Edge TPU Compiler is Linux x86-64 only
(since compiler v2.1, ARM64 builds are no longer provided).
As of 2024–2025, Google has not released native ARM64 (Apple Silicon) binaries
for libedgetpu, pycoral, or libcoral.
2.2 Community Efforts
2.2.1 cocoa-xu/libedgetpu
Repository: github.com/cocoa-xu/libedgetpu
Cocoa Xu maintains a fork of libedgetpu that provides:
- Pre-built darwin-arm64 (Apple Silicon) binaries of libedgetpu
- aarch64-linux-musl builds (static linking for Alpine/musl-based systems)
- riscv64-linux-musl builds
- Integration with the tflite_beam Erlang/Elixir bindings, which bundles the native libraries
The fork modifies the Bazel/Makefile build system to cross-compile for
darwin-arm64 and other targets. The key challenge is that the upstream
libedgetpu build assumes darwin_x86_64 for macOS, and the Bazel
toolchain configuration does not include darwin_arm64 by default.
How the ARM64 build was achieved:
- Fork the google-coral/libedgetpu repo
- Patch the Bazel build configuration to add darwin_arm64 as a target CPU
- Build on an Apple Silicon Mac using native Clang (or cross-compile from an x86_64 Mac by targeting an ARM64 triple)
- The resulting libedgetpu.1.dylib is a native ARM64 shared library
2.2.2 feranick/libedgetpu
Repository: github.com/feranick/libedgetpu
Feranick maintains an actively updated fork that tracks newer TensorFlow versions:
- Current release: 16.0TF2.19.1-1 (compatible with TF 2.19.1)
- Provides deb packages for amd64, arm64, armhf
- Also maintains feranick/TFlite-builds with updated tflite_runtime Python wheels
This fork is widely used by the Frigate NVR community for running Coral TPU on ARM64 Linux systems (e.g., Raspberry Pi 5, Orange Pi).
Note: feranick's builds target Linux ARM64, not macOS ARM64. They don't
produce darwin-arm64 dylibs.
2.2.3 Tim Strobel's Setup Guide (2024)
URL: tim-strobel.de/coral.html
A practical guide for running Coral USB on macOS 14.x (Sonoma) as of October 2024:
Steps:
- Install the edgetpu runtime and PyCoral per Coral's setup guide
- Python version: Must use Python 3.9 (PyCoral's last supported version; 3.10+ does not work)
- NumPy issue: PyCoral was compiled against NumPy 1.x; NumPy 2.x causes AttributeError: _ARRAY_API not found. Fix:
- Library linking error: ValueError: Failed to load delegate from libedgetpu.1.dylib. The dylib is installed at /usr/local/lib/libedgetpu.1.dylib but the system can't find it. Fix:
This guide is for Intel Macs or Apple Silicon Macs running under Rosetta 2.
2.3 Steps to Install on Apple Silicon (M1/M2/M3) Today
There is no fully native, officially-supported path for Apple Silicon as of early 2025. The workable approaches are:
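Approach A: Rosetta 2 (x86_64 emulation)
- Install Rosetta 2: softwareupdate --install-rosetta
- Create an x86_64 Python environment:
  # CONDA_SUBDIR=osx-64 makes conda resolve x86_64 packages on an Apple Silicon host
  CONDA_SUBDIR=osx-64 conda create -n coral python=3.9
  conda activate coral
  conda config --env --set subdir osx-64
  # Force x86_64 arch; pycoral is served from Coral's own package index
  arch -x86_64 python3 -m pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0
- Install x86_64 libedgetpu:
  # Download from coral.ai/software (macOS x86_64 package)
  sudo cp libedgetpu.1.dylib /usr/local/lib/
  sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
- Install pycoral and run: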
Performance cost: USB communication runs through Rosetta 2 translation, adding overhead to every inference call.
Approach B: cocoa-xu Native ARM64 Build
- Obtain the native ARM64 libedgetpu.1.dylib from cocoa-xu's releases or build from source:
- Install the ARM64 dylib:
  sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
  sudo ln -s /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
- Build tflite_runtime and pycoral from source for ARM64.
2.4 What Breaks on Native ARM64
| Component | Issue | Workaround |
|---|---|---|
| libedgetpu build | Bazel toolchain doesn't define darwin_arm64 | Patch BUILD file (cocoa-xu approach) |
| pycoral pip wheel | Only x86_64 wheels published | Build from source |
| tflite_runtime pip wheel | Only x86_64 wheels published | Build from source or use feranick/TFlite-builds |
| NumPy ABI mismatch | Pre-compiled C extensions use NumPy 1.x ABI | Pin numpy<2.0 |
| Edge TPU Compiler | Linux x86-64 only | Use Google Colab or Docker on Linux |
| USB driver (libusb) | Works but needs proper ARM64 build | Install via Homebrew: brew install libusb |
| libedgetpu.1.dylib not found | Dynamic linker can't locate it | Create symlink: ln -s libedgetpu.1.dylib libedgetpu.dylib |
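Several rows of this table can be checked before anything is installed. The following is a minimal preflight sketch using only the standard library; the function names and warning strings are hypothetical, and the thresholds (Python 3.9, NumPy 2.0) are taken from the text above.

```python
import platform
import sys
from importlib import metadata


def preflight(python_version, numpy_version, machine):
    """Return warnings for the Python, NumPy, and architecture issues above."""
    problems = []
    if tuple(python_version[:2]) > (3, 9):
        problems.append("PyCoral wheels target Python 3.9 or older")
    if numpy_version is not None and int(numpy_version.split(".")[0]) >= 2:
        problems.append("NumPy 2.x detected; pin numpy<2.0")
    if machine == "arm64":
        problems.append("native arm64: pycoral and tflite_runtime need source builds")
    return problems


def installed_numpy_version():
    """Return the installed NumPy version string, or None if it is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


# Check the environment this script is running in.
for warning in preflight(sys.version_info, installed_numpy_version(),
                         platform.machine()):
    print("WARNING:", warning)
```

On macOS, platform.machine() reports "arm64" for a native interpreter and "x86_64" for one running under Rosetta 2, which is why the architecture check keys on that value.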
3. Existing Compiler Architectures (IREE, TVM, ExecuTorch)
3.1 IREE (Intermediate Representation Execution Environment)
URL: iree.dev | Repo: github.com/iree-org/iree
3.1.1 Architecture Overview
IREE is an MLIR-based end-to-end compiler and runtime that lowers ML models to a unified executable format. Its architecture is a multi-dialect MLIR pipeline:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Front-end │────▶│ IREE Core │────▶│ IREE HAL │────▶│ IREE VM │
│ Importers │ │ Dialects │ │ Dialects │ │ Bytecode │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
TF / JAX / linalg_ext, hal.device, FlatBuffer
PyTorch → stream, hal.executable executable
MLIR util target configs
3.1.2 Multi-Framework Front-Ends
IREE historically supported TensorFlow/TFLite as its primary frontend, then
expanded to JAX due to organizational affiliation at Google. PyTorch support
is being integrated via the torch-mlir project and the SHARK-Turbine prototype:
| Front-end | Import Path | Status |
|---|---|---|
| TensorFlow / TFLite | iree-import-tflite | Stable |
| JAX / JAX.export | jax.export → MLIR | Stable |
| PyTorch | torch-mlir → IREE | Experimental (SHARK-Turbine) |
| ONNX | Via iree-import-onnx | Community-maintained |
The common entry point is MLIR's linalg dialect — all front-ends lower to
linalg-on-tensors before IREE's core compilation kicks in.
3.1.3 Quantization Pipeline
IREE's quantization support is evolving (tracked in issue #12005):
- Import-time quantization: Models quantized in the front-end (e.g., TFLite INT8, PyTorch quantized) carry quantization metadata into IREE's MLIR representation via the quant dialect.
- Compiler-internal quantization: IREE can perform post-training quantization during compilation, but this is less mature than TFLite's approach.
- Hardware-aware quantization: Different backends (CPU/GPU) may require different quantization schemes. IREE's data-tiling path handles sub-byte quantization (4-bit, 2-bit) for specific targets.
3.1.4 Backend Delegation & Hardware Targets
IREE uses a HAL (Hardware Abstraction Layer) dialect to describe device targets and executables:
Compiler target backends:
├── llvm-cpu → LLVM CPU codegen (x86-64, ARM, RISC-V)
├── vulkan-spirv → GPU via Vulkan/SPIR-V
├── rocm → AMD GPU via ROCm
├── cuda → NVIDIA GPU via PTX
├── metal-spirv → Apple Metal (via SPIR-V cross-compilation)
└── vmvx → IREE's reference software backend
Adding a custom backend requires implementing:
- An IREE::HAL::ExecutableTargetAttr in MLIR
- A lowering pass from linalg → target-specific dialect
- A runtime HAL driver for the target device
Feasibility for Edge TPU: IREE's MLIR-based architecture makes it possible in theory to add an Edge TPU backend. However, the proprietary instruction set and lack of public documentation for the Edge TPU's microarchitecture make this extremely challenging. The most practical path would be:
- Import model → linalg dialect
- Lower to a TFLite-compatible representation
- Feed into the edgetpu_compiler as a post-processing step
- Wrap the resulting edgetpu-custom-op as an IREE executable
3.2 Apache TVM
URL: tvm.apache.org | Repo: github.com/apache/tvm
3.2.1 Architecture Overview
TVM uses a Relay IR intermediate representation and a target-based compilation pipeline:
┌──────────────-┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Front-end │────▶│ Relay IR │────▶│ TIR │────▶│ Runtime │
│ Importers │ │ (Functional │ │ (Tensor IR) │ │ Module │
│ │ │ Graph IR) │ │ │ │ │
└──────────────-┘ └──────────────┘ └──────────────┘ └──────────────┘
TF / ONNX / High-level graph Low-level loop .so / .tar
PyTorch / MXNet with types nests + buffer
3.2.2 Multi-Framework Front-Ends
| Front-end | Import Method | Status |
|---|---|---|
| TensorFlow | tvm.relay.frontend.from_tensorflow | Stable |
| TFLite | tvm.relay.frontend.from_tflite | Stable |
| ONNX | tvm.relay.frontend.from_onnx | Stable |
| PyTorch | tvm.relay.frontend.from_pytorch | Stable |
| MXNet | tvm.relay.frontend.from_mxnet | Stable |
| PaddlePaddle | tvm.relay.frontend.from_paddle | Experimental |
3.2.3 Quantization Pipeline
TVM provides a quantization toolkit:
- Post-training quantization (PTQ):
- Quantization-aware training (QAT) import: Models pre-quantized in TensorFlow (via tf.quantization) or PyTorch (via torch.quantization) can be imported with their quantization metadata preserved.
- Calibration-based quantization: TVM can calibrate a float model using a representative dataset to determine optimal quantization parameters:
  # Calibration settings are supplied through a qconfig scope
  with tvm.relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
      qmod = tvm.relay.quantize.quantize(mod, params, dataset=calibration_data)
- Custom quantization schemes: TVM supports configurable quantization (per-channel, per-tensor, symmetric, asymmetric) through the tvm.relay.quantize.qconfig API.
3.2.4 Backend Delegation & External Codegen
TVM supports external codegen for proprietary hardware through two mechanisms:
- BYOC (Bring Your Own Codegen): Relay graph partitioning delegates subgraphs to external codegen tools:
  import tvm

  # Register an external codegen under the hypothetical compiler name "my_accel".
  # BYOC looks up "relay.ext.<compiler name>" for each partitioned function.
  @tvm.register_func("relay.ext.my_accel")
  def my_accel_codegen(func):
      # Generate code for the accelerator from the partitioned Relay function.
      # compiled_module is a placeholder for the resulting runtime module.
      return compiled_module
- Target hooks: Custom targets can register compilation passes that intercept specific op patterns.
Feasibility for Edge TPU: TVM's BYOC framework could theoretically wrap the edgetpu_compiler as an external codegen target. The workflow would be:
- Import model → Relay IR
- Partition graph: identify Edge TPU-compatible subgraphs
- Export subgraphs to TFLite format
- Invoke edgetpu_compiler on each subgraph
- Link compiled Edge TPU blobs back into the TVM runtime as external modules
This is similar to how TVM integrates with the VTA (Versatile Tensor Accelerator) and how the MATCH framework uses TVM for MCU deployment.
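The compiler invocation step of that workflow is the only part with a fixed external interface, so it is the easiest to sketch. The helper below assembles an edgetpu_compiler command line; the function names are hypothetical, and the long option names (--out_dir, --min_runtime_version, --search_delegate) are assumed to match the installed compiler version, so check them against edgetpu_compiler --help before relying on them.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_compiler_command(models: Sequence[str], out_dir: str,
                           min_runtime_version: Optional[int] = None,
                           search_delegate: bool = False) -> List[str]:
    """Assemble an edgetpu_compiler argument list without running it."""
    cmd = ["edgetpu_compiler", "--out_dir", out_dir]
    if min_runtime_version is not None:
        cmd += ["--min_runtime_version", str(min_runtime_version)]
    if search_delegate:
        cmd.append("--search_delegate")
    # Passing more than one model requests co-compilation.
    cmd += list(models)
    return cmd


def expected_output(model: str, out_dir: str) -> Path:
    """Path of the compiled file, following the *_edgetpu.tflite convention."""
    return Path(out_dir) / (Path(model).stem + "_edgetpu.tflite")


cmd = build_compiler_command(["subgraph_0.tflite"], "build", min_runtime_version=13)
print(" ".join(cmd))
print(expected_output("subgraph_0.tflite", "build"))
```

On a Linux x86-64 host with the compiler installed, the list can be handed to subprocess.run(cmd, check=True); keeping command construction separate from execution makes the step testable on machines where the compiler is unavailable.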
3.3 ExecuTorch (Meta / PyTorch)
URL: executorch.ai | Repo: github.com/pytorch/executorch
3.3.1 Architecture Overview
ExecuTorch is PyTorch's on-device inference framework, built on top of
PyTorch 2's torch.export:
┌──────────────-┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ torch.export │────▶│ Edge IR │────▶│ Partitioner │────▶│ ExecuTorch │
│ (PyTorch 2) │ │ (Core ATen │ │ (Backend │ │ Program │
│ │ │ dialect) │ │ Delegate) │ │ (.pte) │
└──────────────-┘ └──────────────┘ └──────────────┘ └──────────────┘
AOT graph Standardized Subgraph split FlatBuffer
export operator set per backend executable
3.3.2 Front-End & IR
- Single front-end: PyTorch models via torch.export (AOT compilation)
- IR: "Core ATen Operator Set", a standardized subset of ATen ops
- Export flow: torch.export.export(model, example_inputs) → Edge dialect
3.3.3 Quantization Pipeline
ExecuTorch supports two quantization pathways:
- Post-training quantization (PTQ):
  - XNNPACK quantizer: INT8 per-channel weight quantization
  - Quantization flow: quantize(model, qconfig) → quantized Edge IR
- Backend-aware quantization:
  - Each backend delegate can define its own quantizer
  - Example: The OpenVINO backend provides OpenVINOQuantizer with backend-aware compression pathways
  - Example: The CoreML backend supports INT8 and FP16 quantization with hardware-specific optimizations
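import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
# Quantize. `model` must already be a captured graph module (for example the
# .module() of a torch.export result) and `quantizer` a backend Quantizer instance.
model = prepare_pt2e(model, quantizer)
# ... calibrate by running representative inputs through model ...
model = convert_pt2e(model)
# Export to Edge: to_edge expects an ExportedProgram, so re-export first
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported, compile_config=EdgeCompileConfig())
3.3.4 Backend Delegation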
ExecuTorch uses a partitioner + delegate model:
- Partitioner: Splits the Edge IR graph into subgraphs suitable for each backend. A partitioner identifies which ops a backend can handle and groups them.
- Delegate: Each backend implements a DelegateHandler that compiles its assigned subgraph into a binary blob. At runtime, the ExecuTorch runtime calls the delegate to execute the subgraph.
Current backends (12+; a selection is listed below):
| Backend | Hardware | Status |
|---|---|---|
| XNNPACK | CPU (ARM, x86) | Stable |
| CoreML | Apple Neural Engine / GPU | Stable |
| Qualcomm QNN | Hexagon DSP / AI Engine | Stable |
| Vulkan | Mobile GPU (via SPIR-V) | Stable |
| MediaTek | APU | New |
| MPS | Apple Metal | Experimental |
| NNAPI | Android NNAPI | Community |
| OpenVINO | Intel CPU/GPU/VPU | Stable |
Feasibility for Edge TPU: ExecuTorch's delegate model is well-suited for Edge TPU integration. The approach would be:
- Implement a partitioner that identifies Edge TPU-compatible subgraphs (INT8 ops from the supported list)
- Implement a delegate that converts the subgraph to TFLite format and invokes edgetpu_compiler
- Package the compiled edgetpu-custom-op blob into the ExecuTorch program
The main challenge is the PyTorch → TFLite conversion step, which currently has limited support (ONNX as an intermediate format is one option).
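The partitioner's core decision, which nodes are eligible and how they group, can be shown on a linear chain of ops. This is a toy sketch that assumes nodes arrive in topological order and have no branches; the Node type and partition function are hypothetical and are not ExecuTorch's partitioner API.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Node:
    name: str
    op: str
    dtype: str


def partition(nodes: List[Node], supported: Set[str]) -> List[List[Node]]:
    """Group consecutive delegate-eligible nodes into candidate subgraphs."""
    groups: List[List[Node]] = []
    current: List[Node] = []
    for node in nodes:
        eligible = node.op in supported and node.dtype in ("int8", "uint8")
        if eligible:
            current.append(node)
        elif current:
            # An ineligible node closes the group collected so far.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


graph = [
    Node("n0", "conv2d", "int8"),
    Node("n1", "relu", "int8"),
    Node("n2", "custom_norm", "float32"),  # stays on the CPU
    Node("n3", "add", "int8"),
]
for group in partition(graph, {"conv2d", "relu", "add"}):
    print([n.name for n in group])
```

Unlike the greedy prefix behaviour of edgetpu_compiler, this grouping yields every maximal run of eligible nodes, so each group would need its own conversion and compilation pass.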
3.4 Comparison Table
| Feature | IREE | TVM | ExecuTorch |
|---|---|---|---|
| IR | MLIR (linalg, stream, hal) | Relay IR + TIR | Core ATen (Edge dialect) |
| Front-ends | TF, JAX, PyTorch (exp.) | TF, ONNX, PyTorch, TFLite, etc. | PyTorch only |
| Quantization | Import-time + compiler-internal | PTQ + QAT + calibration | PTQ + backend-aware |
| Backend model | HAL device targets | BYOC external codegen | Partitioner + delegate |
| Custom HW path | Implement HAL target + runtime driver | BYOC: register codegen func | Implement partitioner + delegate |
| Edge TPU feasibility | Hard (proprietary ISA) | Medium (BYOC + edgetpu_compiler) | Medium (delegate + TFLite conversion) |
| MLIR as IR? | Yes (native) | No (Relay is custom) | No (Core ATen dialect) |
| Runtime | IREE VM (custom bytecode) | TVM runtime (graph executor) | ExecuTorch runtime (.pte) |
3.5 Feasibility of MLIR (IREE) or Relay (TVM) as Intermediate Representation
MLIR (via IREE)
Pros:
- Native multi-dialect system allows clean separation of concerns
- linalg dialect is a natural meeting point for different front-ends
- Extensible: new hardware targets are "just" new dialects and lowering passes
- Growing ecosystem (XLA, IREE, torch-mlir, stablehlo all use MLIR)
- Potential for a stablehlo → linalg → Edge TPU lowering pipeline
Cons:
- IREE's hal.executable abstraction assumes you can lower to some form of compute kernel (SPIR-V, PTX, LLVM IR). The Edge TPU's proprietary ISA doesn't fit this model.
- The edgetpu_compiler is a black box; you can't decompose its output back into MLIR.
- Significant engineering effort to implement a working pipeline
Recommended approach: Use MLIR as a front-end aggregation layer
(TF/PyTorch/ONNX → stablehlo → linalg) and then lower to TFLite for
the edgetpu_compiler step. Don't attempt to bypass edgetpu_compiler.
Relay (via TVM)
Pros:
- Mature BYOC framework designed for proprietary hardware
- Good TFLite front-end import (from_tflite reads TFLite models; writing TFLite back out is not part of the Relay front-end and needs separate tooling)
- Quantization toolkit is well-tested for INT8 workflows
- TVM has been used with real hardware accelerators (VTA, MATCH)
Cons:
- Relay IR is not MLIR; doesn't benefit from the MLIR ecosystem
- TVM's own development velocity has slowed relative to IREE/ExecuTorch
- The BYOC → edgetpu_compiler path requires an awkward round trip: Relay → TFLite → edgetpu_compiler → custom op → Relay
Recommended approach: If starting from TVM, use BYOC with a TFLite
export/reimport strategy. This is the path of least resistance for
integrating the edgetpu_compiler.
4. Edge TPU Operation Set & Limitations
4.1 Model Requirements
For a TensorFlow model to compile for the Edge TPU, it must meet all of these:
- Tensor parameters are quantized to 8-bit fixed-point (int8 or uint8)
- Tensor sizes are constant at compile-time (no dynamic shapes)
- Model parameters (weights, biases) are constant at compile-time
- Tensors are 1-, 2-, or 3-dimensional. If a tensor has >3 dimensions, only the 3 innermost dimensions may have size >1
- Only supported operations are used (see table below)
Float inputs are OK — the compiler will leave a QUANTIZE op at the graph
entry point that runs on the CPU, converting float → INT8 before the Edge TPU
custom op.
4.2 Complete Supported Operations Table
Source: coral.ai/docs/edgetpu/models-intro#supported-operations
| Operation | Min Runtime | Known Limitations |
|---|---|---|
| Add | All | — |
| AveragePool2d | All | No fused activation function |
| Concatenation | All | No fused activation function. If any input is a compile-time constant tensor, there must be only 2 inputs, and this constant must be all zeros (zero-padding op) |
| Conv2d | All | Must use same dilation in x and y dimensions |
| DepthwiseConv2d | ≤12 | Dilated conv kernels not supported |
| DepthwiseConv2d | ≥13 | Must use same dilation in x and y dimensions |
| ExpandDims | ≥13 | — |
| FullyConnected | All | Only default weight format. Output tensor is 1-dimensional |
| L2Normalization | All | — |
| Logistic (Sigmoid) | All | — |
| LSTM | ≥14 | Unidirectional LSTM only |
| Maximum | All | — |
| MaxPool2d | All | No fused activation function |
| Mean | ≤12 | No batch-dim reduction. Supports reduction along x- and/or y-dimensions only |
| Mean | ≥13 | No batch-dim reduction. If z-reduction, z-dim must be multiple of 4 |
| Minimum | All | — |
| Mul | All | — |
| Pack | ≥13 | No packing in batch dimension |
| Pad | ≤12 | No batch-dim padding. Supports padding along x- and/or y-dimensions only |
| Pad | ≥13 | No batch-dim padding |
| PReLU | ≥13 | Alpha must be 1-dimensional. If using Keras PReLU with 4D input (batch, height, width, channels), shared_axes must be [1,2] |
| Quantize | ≥13 | — |
| ReduceMax | ≥14 | Cannot operate on batch dimension |
| ReduceMin | ≥14 | Cannot operate on batch dimension |
| ReLU | All | — |
| ReLU6 | All | — |
| ReLUN1To1 | All | — |
| Reshape | All | Certain reshapes might not be mapped for large tensor sizes |
| ResizeBilinear | All | Input/output is 3-dimensional. Might not be mapped to avoid precision loss depending on size |
| ResizeNearestNeighbor | All | Input/output is 3-dimensional. Might not be mapped depending on size |
| Rsqrt | ≥14 | — |
| Slice | All | — |
| Softmax | All | Supports only 1-D input with max 16,000 elements |
| SpaceToDepth | All | — |
| Split | All | No splitting in batch dimension |
| Squeeze | ≤12 | Only when input has leading 1s (no relayout needed), e.g. [1,1,10] or [1,5,10] is OK; [5,1,10] is not |
| Squeeze | ≥13 | None |
| StridedSlice | All | All strides must equal 1 (effectively a Slice op), ellipsis-axis-mask == 0, new-axis-mask == 0 |
| Sub | All | — |
| Sum | ≥13 | Cannot operate on batch dimension |
| SquaredDifference | ≥14 | — |
| Tanh | All | — |
| Transpose | ≥14 | — |
| TransposeConv | ≥13 | — |
Total: 40 operations (with varying runtime version requirements)
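The version gating in the table lends itself to a lookup. The sketch below transcribes only the rows that require runtime 13 or 14; the function name is hypothetical, and an op absent from the dictionary is treated as available on all runtimes, so this checks version gating only, not whether an op is supported at all.

```python
# Minimum runtime version for ops that are not available on every runtime.
# Transcribed from the table above; ops listed as "All" are omitted.
MIN_RUNTIME = {
    "ExpandDims": 13, "Pack": 13, "PReLU": 13,
    "Quantize": 13, "Sum": 13, "TransposeConv": 13,
    "LSTM": 14, "ReduceMax": 14, "ReduceMin": 14,
    "Rsqrt": 14, "SquaredDifference": 14, "Transpose": 14,
}


def unavailable_ops(model_ops, runtime_version):
    """List the ops in model_ops that need a newer runtime than the one given."""
    return sorted(
        op for op in set(model_ops)
        if MIN_RUNTIME.get(op, 0) > runtime_version
    )


ops = ["Conv2d", "PReLU", "LSTM", "Softmax"]
for version in (12, 13, 14):
    print(version, unavailable_ops(ops, version))
```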
4.3 Key Constraints
4.3.1 INT8 Quantization Requirement
All tensor parameters (weights, biases) must be quantized to INT8/UINT8. The Edge TPU hardware has no floating-point execution units. Activation tensors must also be quantized; the TFLite converter handles this via "full integer quantization" with calibration data.
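For readers less familiar with 8-bit fixed-point representation, the mapping between real values and stored integers follows the affine scheme used by TFLite, real = scale * (q - zero_point). The sketch below shows the int8 case; the scale and zero point in the example are made-up values, whereas in practice the converter derives them from calibration data.

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to int8 with saturation to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the real value represented by an int8 code."""
    return scale * (q - zero_point)


# Illustrative parameters, not taken from any real model.
scale, zero_point = 0.5, -10

q = quantize(3.0, scale, zero_point)
print(q, dequantize(q, scale, zero_point))

# Values outside the representable range saturate instead of wrapping.
print(quantize(1000.0, scale, zero_point))
```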
4.3.2 Tensor Dimension Constraints
- Tensors must be 1D, 2D, or 3D
- If >3D, only the 3 innermost dimensions may have size >1
- In NHWC terms, this means the batch dimension N (and any further outer dimension) must be 1 for most operations, while H, W, and C may exceed 1
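The dimension rule reduces to a check on everything outside the three innermost dimensions. A minimal sketch, with a hypothetical function name:

```python
def shape_is_mappable(shape):
    """True if only the three innermost dimensions have size greater than 1."""
    outer = shape[:-3]  # empty for tensors with three or fewer dimensions
    return all(dim == 1 for dim in outer)


for shape in [(10,), (224, 224, 3), (1, 224, 224, 3), (8, 224, 224, 3), (1, 1, 5, 5, 3)]:
    print(shape, shape_is_mappable(shape))
```

A batch of 8 NHWC images fails the check, which is why batched inference has to be split into single-image invocations before compilation.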
4.3.3 On-Chip Memory
- ~8 MB SRAM total for executable + parameter cache
- Typically ~6.91 MiB available for parameter data (after executable space)
- If model parameters exceed available SRAM, excess is streamed from host memory (off-chip), which significantly degrades performance
4.3.4 Max Parameter Data Size
There is no single documented hard limit on model size, but practical limits arise from the 8 MB SRAM:
- Models with <7 MB of parameter data can be fully cached on-chip
- Larger models must stream parameters, causing 2-10x slower inference
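The split between cached and streamed parameter data is simple arithmetic once the available SRAM is known. The sketch below is a simplification that treats parameter data as one divisible block; the real compiler decides placement itself and reports the result in its output, and the function names here are hypothetical.

```python
MIB = 1024 * 1024


def split_parameters(param_bytes: int, available_bytes: int):
    """Return (on_chip, off_chip) byte counts for a given parameter size."""
    on_chip = min(param_bytes, available_bytes)
    off_chip = param_bytes - on_chip
    return on_chip, off_chip


def report(param_bytes: int, available_bytes: int) -> str:
    """Format the split in the style of the compiler summary lines."""
    on_chip, off_chip = split_parameters(param_bytes, available_bytes)
    return (f"cached on-chip: {on_chip / MIB:.2f} MiB, "
            f"streamed off-chip: {off_chip / MIB:.2f} MiB")


available = int(6.91 * MIB)  # figure from the compiler output example
print(report(int(4.21 * MIB), available))  # fits entirely on-chip
print(report(10 * MIB, available))         # excess must be streamed
```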
4.3.5 Compilation Behavior
The edgetpu_compiler operates in a greedy prefix manner:
- Starts from the graph input
- Collects consecutive supported ops
- Stops at the first unsupported op
- Everything before the stop → edgetpu-custom-op (runs on TPU)
- Everything after → standard TFLite ops (runs on CPU)
This means the position of unsupported ops matters — an unsupported op early in the graph will prevent most of the model from running on the TPU, even if later ops are individually supported.
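The greedy prefix rule can be written down in a few lines. The sketch below models the graph as a linear list of op names, which is an assumption made for illustration; real graphs branch, and the function name is hypothetical.

```python
def greedy_prefix_partition(ops, supported):
    """Split ops into (tpu_ops, cpu_ops) at the first unsupported op."""
    ops = list(ops)
    for index, op in enumerate(ops):
        if op not in supported:
            return ops[:index], ops[index:]
    return ops, []


supported = {"Conv2d", "ReLU", "Add", "Softmax"}

# Unsupported op late in the graph: most of the model maps to the TPU.
print(greedy_prefix_partition(["Conv2d", "ReLU", "Add", "Custom"], supported))

# Same ops, unsupported op first: nothing maps to the TPU.
print(greedy_prefix_partition(["Custom", "Conv2d", "ReLU", "Add"], supported))
```

The second call shows the effect described above: the ops after the unsupported one are individually supported, yet all of them fall back to the CPU.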
The -d / --search_delegate flag enables the compiler to retry compilation
from an earlier point in the graph when it encounters an unsupported op,
potentially allowing a partial TPU mapping even with mid-graph failures.
4.3.6 Subgraph Limitations
Many operations show "More than one subgraph is not supported" in compiler output. This occurs when:
- The model has control flow (if/while) creating multiple subgraphs
- Certain ops create implicit subgraphs
- The Edge TPU currently supports only a single subgraph for hardware execution
4.3.7 No 3D Convolutions
The Edge TPU supports only Conv2D and DepthwiseConv2D. There is no 3D convolution support. Models requiring 3D convs (video processing, 3D medical imaging) cannot be fully accelerated.
4.3.8 Softmax Element Limit
Softmax supports only 1-D input with a maximum of 16,000 elements. This limits the number of classes in classification models running entirely on the Edge TPU.
5. Sloth Integration Implications¶
The repository now carries an integrated sloth_integration package that targets
text-classification and embedding deployment to Coral USB.
5.1 Why This Matters for Edge TPU Research¶
- It validates an end-to-end SLM deployment path (adapter -> converter -> quantizer -> runtime).
- It stress-tests the runtime API compatibility layer against Coral USB delegates.
- It provides reproducible benchmark scenarios that compare CPU baseline, direct hardware, and hardware through sloth runtime abstractions.
5.2 Practical Constraints Observed¶
- Synthetic micro-models are useful for pipeline validation but should not be used as production latency proxies.
- Delegate preflight behavior can vary by model artifact; robust fallback paths are required in runtime wrappers.
- Keeping package metadata centralized in root build config avoids drift between runtime code and published dependencies.
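
The fallback requirement noted above can be sketched as a small wrapper. This is not the sloth_integration implementation; it is a generic pattern under stated assumptions. The names open_with_fallback, tpu_interpreter, and cpu_interpreter, and both model file names, are hypothetical. The tflite_runtime calls (Interpreter, load_delegate) are the standard ones, imported lazily so the wrapper itself has no hard dependency.

```python
import logging
from typing import Any, Callable, Tuple, Type

logger = logging.getLogger("delegate_fallback")

# Exception types treated as "delegate unavailable" (an assumption; adjust as needed).
RECOVERABLE: Tuple[Type[BaseException], ...] = (ValueError, RuntimeError, OSError)


def open_with_fallback(
    tpu_factory: Callable[[], Any],
    cpu_factory: Callable[[], Any],
    recoverable: Tuple[Type[BaseException], ...] = RECOVERABLE,
) -> Tuple[Any, str]:
    """Try the Edge TPU path first; fall back to the CPU path on failure."""
    try:
        return tpu_factory(), "edgetpu"
    except recoverable as exc:
        logger.warning("Edge TPU path failed (%r); falling back to CPU", exc)
        return cpu_factory(), "cpu"


def tpu_interpreter():
    # Lazy import: only needed when this factory is actually called.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegate = load_delegate("libedgetpu.so.1")  # Linux library name
    return Interpreter(
        model_path="model_edgetpu.tflite",  # hypothetical compiled model
        experimental_delegates=[delegate],
    )


def cpu_interpreter():
    from tflite_runtime.interpreter import Interpreter

    # Hypothetical uncompiled model: the compiled file contains
    # edgetpu-custom-op, which cannot run without the delegate.
    return Interpreter(model_path="model.tflite")
```

A caller would use `interpreter, backend = open_with_fallback(tpu_interpreter, cpu_interpreter)` and record the backend alongside any benchmark result, so CPU fallbacks are not mistaken for hardware runs.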
5.3 Current Internal References¶
- Integration package: sloth-integration/src/sloth_integration
- Integration tests: sloth-integration/tests
- Benchmark report: sloth-integration/docs/benchmarks_sloth.md
Appendix A: Key URLs¶
| Resource | URL |
|---|---|
| Coral Documentation | https://www.coral.ai/docs |
| Edge TPU Models Intro | https://www.coral.ai/docs/edgetpu/models-intro |
| Edge TPU Compiler | https://www.coral.ai/docs/edgetpu/compiler |
| libedgetpu (official) | https://github.com/google-coral/libedgetpu |
| edgetpu issue tracker | https://github.com/google-coral/edgetpu |
| pycoral | https://github.com/google-coral/pycoral |
| libcoral | https://github.com/google-coral/libcoral |
| cocoa-xu/libedgetpu | https://github.com/cocoa-xu/libedgetpu |
| feranick/libedgetpu | https://github.com/feranick/libedgetpu |
| feranick/TFlite-builds | https://github.com/feranick/TFlite-builds |
| Tim Strobel's Mac Guide | https://tim-strobel.de/coral.html |
| IREE | https://iree.dev |
| Apache TVM | https://tvm.apache.org |
| ExecuTorch | https://executorch.ai |
| Edge TPU Compiler (Colab) | https://colab.research.google.com |
Appendix B: Edge TPU Compiler CLI Reference¶
edgetpu_compiler [options] model [...]
Options:
-h, --help Print help
-i, --intermediate_tensors Output tensors from Edge TPU custom op
-m, --min_runtime_version Min runtime version (default: compiler-specific)
-n, --num_segments Compile into N segments for pipelining
-o, --out_dir Output directory (default: cwd)
-s, --show_operations Print op mapping to console
-d, --search_delegate Retry compilation on failure (since v16)
-t, --timeout_sec Compiler timeout in seconds (default: 180)
-v, --version Print compiler version
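
When the compiler is driven from a build script, the options above can be assembled programmatically. The sketch below only builds the argument list and uses only flags from the reference above; the helper name build_compiler_command and the example file names are hypothetical, and it assumes that options taking a value accept it as a separate argument.

```python
import shlex
from typing import List, Optional


def build_compiler_command(
    model_path: str,
    out_dir: Optional[str] = None,
    show_operations: bool = True,
    search_delegate: bool = False,
    num_segments: Optional[int] = None,
    timeout_sec: Optional[int] = None,
) -> List[str]:
    """Build an edgetpu_compiler argument list from the options listed above."""
    cmd = ["edgetpu_compiler"]
    if show_operations:
        cmd.append("--show_operations")
    if search_delegate:
        cmd.append("--search_delegate")
    if num_segments is not None:
        cmd += ["--num_segments", str(num_segments)]
    if timeout_sec is not None:
        cmd += ["--timeout_sec", str(timeout_sec)]
    if out_dir is not None:
        cmd += ["--out_dir", out_dir]
    cmd.append(model_path)  # the model is the final positional argument
    return cmd


cmd = build_compiler_command("model_quant.tflite", out_dir="build", search_delegate=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where edgetpu_compiler is installed.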
Appendix C: edgetpu.h Complete Public API¶
namespace edgetpu {
// Custom op identifier
static const char kCustomOp[] = "edgetpu-custom-op";
enum class DeviceType { kApexPci = 0, kApexUsb = 1 };
class EdgeTpuManager {
public:
static EdgeTpuManager* GetSingleton();
// Shared ownership (preferred)
virtual std::shared_ptr<EdgeTpuContext> OpenDevice() = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(DeviceType) = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
DeviceType, const std::string& path) = 0;
virtual std::shared_ptr<EdgeTpuContext> OpenDevice(
DeviceType, const std::string& path, const DeviceOptions&) = 0;
// Exclusive ownership (deprecated)
virtual std::unique_ptr<EdgeTpuContext> NewEdgeTpuContext(...) = 0;
virtual std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu() const = 0;
virtual std::vector<std::shared_ptr<EdgeTpuContext>> GetOpenedDevices() const = 0;
virtual TfLiteStatus SetVerbosity(int verbosity) = 0;
virtual std::string Version() const = 0;
};
class EdgeTpuContext : public TfLiteExternalContext {
public:
virtual const DeviceEnumerationRecord& GetDeviceEnumRecord() const = 0;
virtual DeviceOptions GetDeviceOptions() const = 0;
virtual bool IsReady() const = 0;
};
// Register the custom op with TFLite
EDGETPU_EXPORT TfLiteRegistration* RegisterCustomOp();
} // namespace edgetpu
Appendix D: Compiler-Runtime Version Compatibility¶
| Compiler | Default Runtime | Notable New Ops |
|---|---|---|
| 1.0 | 10 | Base set (Add, Conv2D, etc.) |
| 2.0 | 12 | DepthwiseConv2d dilation |
| 2.1 | 13 | ExpandDims, Pack, PReLU, Quantize, Sum, TransposeConv, improved Squeeze/Mean/Pad |
| 14.1–15.0 | 13 | — |
| 16.0 | 14 | LSTM, ReduceMax, ReduceMin, Rsqrt, SquaredDifference, Transpose |
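
The table above can be turned into a simple lookup for planning purposes. This sketch makes two assumptions that the table does not confirm: that an op requires the compiler release in which it is first listed, and that any op not listed belongs to the base set. The name earliest_compiler_for is hypothetical.

```python
from typing import Iterable, Tuple

BASE_RELEASE = ("1.0", 10)  # (compiler version, default runtime version)

# Ops first listed per compiler release, copied from the table above,
# ordered oldest to newest. Releases without new op names are omitted.
RELEASES = [
    ("2.1", 13, {"ExpandDims", "Pack", "PReLU", "Quantize", "Sum", "TransposeConv"}),
    ("16.0", 14, {"LSTM", "ReduceMax", "ReduceMin", "Rsqrt",
                  "SquaredDifference", "Transpose"}),
]


def earliest_compiler_for(ops: Iterable[str]) -> Tuple[str, int]:
    """Return (compiler, default_runtime) of the newest release that first lists any op."""
    wanted = set(ops)
    result = BASE_RELEASE
    for compiler, runtime, new_ops in RELEASES:
        if wanted & new_ops:
            result = (compiler, runtime)  # later releases overwrite earlier ones
    return result


print(earliest_compiler_for(["Conv2D", "Add"]))
print(earliest_compiler_for(["Conv2D", "Pack", "Transpose"]))
```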
Report compiled from official documentation, GitHub repositories, community guides, and web research. Last updated: 2025-03-05.