sloth-integration: Comprehensive Usage Guide

This document explains how to use sloth-integration with the existing edgecompiler repository and Google Coral USB Accelerator to deploy fine-tuned Small Language Models on edge hardware. It covers the complete workflow from fine-tuning with unsloth to running INT8 inference on a Coral Edge TPU, with specific instructions for MacBook M1 Pro.


Table of Contents

  1. Overview
  2. Prerequisites
  3. Installation
  4. Quick Start
  5. Step-by-Step Workflow
  6. Using the CLI
  7. Understanding Edge TPU Constraints
  8. Model Selection Guide
  9. Advanced Usage
  10. Integration Architecture
  11. Troubleshooting
  12. API Reference

1. Overview

sloth-integration bridges two projects:

  • unsloth provides FastLanguageModel, a wrapper around Hugging Face Transformers that, according to the unsloth project, makes LoRA/QLoRA fine-tuning roughly 2x faster. It handles 4-bit quantisation, gradient checkpointing, and optimised training loops.

  • edgecompiler provides a compiler toolchain for deploying ML models on edge hardware. Its compile() function converts models (ONNX, TFLite, PyTorch) to an intermediate representation, quantises them to INT8, and compiles for Google Coral Edge TPU or Apple Silicon Metal backends.

sloth-integration connects these tools with purpose-built components:

| Component | Purpose |
| --- | --- |
| SlothAdapter | Wraps unsloth FastLanguageModel for fine-tuning and model management |
| SlothConverter | Exports fine-tuned models (classification head, embedding layer) to ONNX/TFLite |
| SlothQuantizer | Prepares calibration data and configures INT8 quantisation for Edge TPU |
| SlothCoralRuntime | Runs text classification and embedding inference on Coral USB Accelerator |
| ModelDistiller | Distils large SLMs into smaller models suitable for edge deployment |

The typical workflow is:

Fine-tune with unsloth -> Export sub-model -> Compile with edgecompiler -> Run on Coral

2. Prerequisites

Hardware

| Item | Requirement | Notes |
| --- | --- | --- |
| MacBook | M1 Pro or later | M1 base model works but has fewer GPU cores |
| RAM | 16 GB minimum | 32 GB recommended for models larger than 1B parameters |
| Storage | 20 GB free | For model weights, exports, and compiled artefacts |
| Coral USB Accelerator | Any revision | USB 3.0 model recommended for best throughput |
| USB connection | USB 3.0 (5 Gbps) | Use a direct port; avoid unpowered USB-C hubs |

Software

| Package | Version | Purpose |
| --- | --- | --- |
| Python | 3.10+ | Runtime |
| unsloth | Latest | Fast fine-tuning with LoRA |
| edgecompiler | 0.1.0+ | Model compilation for Coral |
| PyTorch | 2.1+ | Backend for unsloth and model export |
| transformers | 4.36+ | Model loading and tokenisation |
| onnx | 1.15+ | ONNX export |
| onnxruntime | 1.16+ | ONNX model validation |
| tflite-runtime | 2.14+ | TFLite interpreter for Coral |
| numpy | 1.24+ | Array operations |

Coral Edge TPU Runtime

You must install libedgetpu for your platform. On macOS Apple Silicon, the official Google build is x86_64 only, so use the community ARM64 build from feranick/libedgetpu. See Section 3 for detailed installation steps.

3. Installation

Step 1: Install edgecompiler

Clone and install the edgecompiler repository:

cd ~/projects
git clone https://github.com/rotsl/edgecompiler.git
cd edgecompiler
pip install -e ".[dev]"

Verify the installation:

python -c "import edgecompiler; print(edgecompiler.__version__)"
# Expected output: 0.1.x (0.1.0 or later)

Step 2: Enable sloth integration extras (optional)

pip install -e ".[sloth]"

For ONNX and unsloth workflows:

pip install -e ".[sloth,sloth-onnx]"
pip install -e ".[sloth,sloth-unsloth]"

sloth_integration is packaged from sloth-integration/src/sloth_integration by the root pyproject.toml.

Step 3: Install Coral Edge TPU runtime (libedgetpu for ARM64 macOS)

The Google Coral Edge TPU runtime (libedgetpu) is required to offload inference to the Coral USB Accelerator. On macOS Apple Silicon, you need the ARM64 build.

Option A: Setup script

bash scripts/setup_coral_runtime.sh

This script:

  • Detects your CPU architecture (arm64 or x86_64)
  • Downloads the appropriate libedgetpu build
  • Installs the dylib to /usr/local/lib/
  • Sets up DYLD_LIBRARY_PATH in your shell configuration
  • Verifies the installation
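
The detection and verification steps listed above can be approximated in a few lines of Python. This is a minimal sketch, not the contents of setup_coral_runtime.sh: the function names are hypothetical, and /opt/homebrew/lib is an assumed extra search location (this guide only installs to /usr/local/lib).

```python
import platform
from pathlib import Path

# /usr/local/lib is where this guide installs the dylib.
# /opt/homebrew/lib is an assumption, added only for illustration.
DEFAULT_DIRS = ("/usr/local/lib", "/opt/homebrew/lib")
LIBRARY_NAMES = ("libedgetpu.1.dylib", "libedgetpu.dylib")


def locate_libedgetpu(search_dirs=DEFAULT_DIRS):
    """Return the first existing libedgetpu path, or None if nothing is found."""
    for directory in search_dirs:
        for name in LIBRARY_NAMES:
            candidate = Path(directory) / name
            if candidate.exists():
                return candidate
    return None


def describe_setup(search_dirs=DEFAULT_DIRS):
    """Summarise the CPU architecture and whether a runtime library was found."""
    return {
        "arch": platform.machine(),  # 'arm64' on Apple Silicon, 'x86_64' on Intel
        "library": locate_libedgetpu(search_dirs),
    }


if __name__ == "__main__":
    print(describe_setup())
```

If "library" comes back as None, the runtime was not installed where this sketch looks; the find_libedgetpu() helper shown in Step 4 is the check that edgecompiler itself provides.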

Option B: Manual installation

# Install build dependencies
brew install libusb cmake wget

# Clone and build for ARM64
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu
CPU=darwin_arm64 make

# Install the built library
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
sudo ln -sf /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
sudo update_dyld_shared_cache 2>/dev/null || true

# Set library path
echo 'export DYLD_LIBRARY_PATH="/usr/local/lib:${DYLD_LIBRARY_PATH:-}"' >> ~/.zshrc
source ~/.zshrc

Install tflite-runtime

pip install tflite-runtime

If tflite-runtime is not available for your Python version on ARM64 macOS, install a compatible wheel from feranick/TFlite-builds.

Step 4: Verify the complete setup

Run the following verification command:

python -c "
import edgecompiler
print(f'edgecompiler: {edgecompiler.__version__}')

from edgecompiler.runtime.coral_usb import CoralUSBRuntime, find_libedgetpu
lib_path = find_libedgetpu()
print(f'libedgetpu: {lib_path}')

runtime = CoralUSBRuntime()
devices = runtime.detect_devices()
print(f'Coral devices: {devices}')

from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
print('sloth-integration: all imports OK')
"

Expected output:

edgecompiler: 0.1.0
libedgetpu: /usr/local/lib/libedgetpu.1.dylib
Coral devices: [CoralDevice(path=':0', type='USB', name='Coral USB Accelerator')]
sloth-integration: all imports OK

If Coral devices is empty, check that the USB Accelerator is plugged in and the LED is lit. See the Troubleshooting section for help.


4. Quick Start

This minimal example fine-tunes a sentiment classifier on the IMDB dataset, exports the classification head, compiles it for Coral, and runs inference.

"""Quick start: fine-tune to Coral inference in under 20 lines."""

from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
import edgecompiler

# 1. Fine-tune a 1B model on IMDB sentiment
adapter = SlothAdapter("unsloth/Llama-3.2-1B-Instruct", max_seq_length=2048)
adapter.finetune_classification(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_labels=2,
    num_epochs=1,
)
adapter.save("sloth_output/imdb_sentiment")

# 2. Export the classification head to ONNX
converter = SlothConverter("sloth_output/imdb_sentiment")
onnx_path = converter.export_head(
    output_path="imdb_sentiment_head.onnx",
    format="onnx",
)

# 3. Compile for Coral Edge TPU
compiled_path = edgecompiler.compile(
    onnx_path,
    target="coral",
    quantize=True,
)
print(f"Compiled model: {compiled_path}")

# 4. Run inference on Coral USB
runtime = SlothCoralRuntime(compiled_path)
result = runtime.classify_text("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Latency: {result['latency_ms']:.1f} ms on Edge TPU")

5. Step-by-Step Workflow

Step 1: Fine-tune an SLM with unsloth

The SlothAdapter wraps unsloth's FastLanguageModel and provides methods for LoRA fine-tuning on classification and embedding tasks.

from sloth_integration import SlothAdapter

# Load a pre-trained model with 4-bit quantisation
adapter = SlothAdapter(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype="auto",
)

# Apply LoRA adapters
adapter.apply_lora(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Fine-tune for text classification
adapter.finetune_classification(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_labels=2,
    num_epochs=1,
    batch_size=2,
    learning_rate=2e-4,
    warmup_steps=10,
    output_dir="sloth_output/imdb_sentiment",
)

# Save the fine-tuned checkpoint
adapter.save("sloth_output/imdb_sentiment")
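
To give a feel for what r=16 and lora_alpha=16 mean, the sketch below counts the extra trainable weights LoRA adds to a single linear layer and computes the usual lora_alpha / r scaling factor. The helper names are hypothetical and the 2048 x 2048 layer is only an illustrative shape matching hidden_size=2048; it is not a description of how SlothAdapter configures a specific model.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


# Illustrative layer: a square 2048 x 2048 projection (hidden_size=2048).
full = 2048 * 2048
extra = lora_extra_params(2048, 2048, r=16)
print(f"frozen weights: {full:,}")         # 4,194,304
print(f"LoRA weights:   {extra:,}")        # 65,536
print(f"scaling: {lora_scaling(16, 16)}")  # 1.0
```

Under these assumptions the adapter adds well under 2% extra weights per layer, which is why LoRA fine-tuning fits in far less memory than full fine-tuning.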

You can also fine-tune on a custom dataset:

from datasets import load_dataset

# Load a custom CSV/JSON dataset
dataset = load_dataset("csv", data_files="my_data.csv")

adapter.finetune_classification(
    dataset=dataset,
    text_field="review_text",
    label_field="sentiment",
    num_labels=3,  # negative, neutral, positive
    num_epochs=3,
    output_dir="sloth_output/custom_classifier",
)

Step 2: Export the fine-tuned model for Coral

Edge TPU models must be small (under ~8 MB) and consist of operations that the Edge TPU supports. Full transformer models are far too large. The recommended approach is to export only the classification head -- a small feed-forward network that takes pre-computed embeddings as input.

from sloth_integration import SlothConverter

converter = SlothConverter(
    checkpoint_path="sloth_output/imdb_sentiment",
    tokenizer_path="sloth_output/imdb_sentiment",
)

# Export the classification head (recommended for SLMs)
onnx_path = converter.export_head(
    output_path="imdb_sentiment_head.onnx",
    format="onnx",
    opset=14,
    optimize=True,
    # The head expects embeddings of shape [batch_size, hidden_size]
    # hidden_size is auto-detected from the model config
)
print(f"Exported ONNX model: {onnx_path}")

# Alternative: export the embedding model (for hybrid inference)
embedding_onnx_path = converter.export_embeddings(
    output_path="imdb_embedding_model.onnx",
    format="onnx",
)

For very small models (under 8 MB when quantised), you can attempt to export the full model:

full_onnx_path = converter.export_full(
    output_path="full_model.onnx",
    format="onnx",
    max_seq_length=128,  # Truncate to reduce model size
)

Note: Full model export is only practical for models whose INT8 weights fit in roughly 8 MB, which at one byte per weight means fewer than ~8M parameters, and for short sequence lengths. Most SLMs (even 1B parameter models) are too large. Use the classification head or embedding extraction approach instead.

Step 3: Compile with edgecompiler for Edge TPU

Pass the exported ONNX model to edgecompiler.compile() with target="coral":

import edgecompiler

# One-call compilation with automatic INT8 quantisation
compiled_path = edgecompiler.compile(
    "imdb_sentiment_head.onnx",
    target="coral",
    quantize=True,
    # Optional: provide calibration data for better quantisation accuracy
    # calibration_data="calibration_samples.npz",
)
print(f"Compiled Edge TPU model: {compiled_path}")
# Output: imdb_sentiment_head_edgetpu.tflite

The compilation pipeline performs these steps internally:

  1. Convert the ONNX model to edgecompiler's IR (Intermediate Representation)
  2. Quantise the IR graph to INT8 using post-training quantisation (PTQ)
  3. Compile the quantised graph for the Coral Edge TPU backend
  4. Write the compiled *_edgetpu.tflite file

For better quantisation accuracy, provide calibration data:

import numpy as np

# Collect calibration samples (embeddings from the SLM) that are representative
# of the data the model will see at inference time.
# Placeholder only: random values show the expected shape and file format, but
# real embeddings are needed for meaningful calibration.
cal_samples = np.random.randn(100, 2048).astype(np.float32)  # 100 samples, hidden_size=2048
np.savez("calibration_samples.npz", embeddings=cal_samples)

compiled_path = edgecompiler.compile(
    "imdb_sentiment_head.onnx",
    target="coral",
    quantize=True,
    calibration_data="calibration_samples.npz",
    per_channel=True,
    symmetric=True,
)

Step 4: Run inference on Coral USB

Use SlothCoralRuntime for high-level text inference:

from sloth_integration import SlothCoralRuntime

# Initialize with the compiled Edge TPU model
runtime = SlothCoralRuntime(
    model_path="imdb_sentiment_head_edgetpu.tflite",
    labels=["negative", "positive"],
)

# Classify a single text
result = runtime.classify_text("This movie was absolutely fantastic!")
print(result)
# {'label': 'positive', 'confidence': 0.94, 'scores': [0.06, 0.94], 'latency_ms': 3.2}

# Classify multiple texts
texts = [
    "Terrible waste of time.",
    "One of the best films I have ever seen.",
    "It was okay, nothing special.",
]
results = runtime.classify_batch(texts)
for text, result in zip(texts, results):
    print(f"  {text!r} -> {result['label']} ({result['confidence']:.2f})")
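
The result dictionaries shown above contain a label, a confidence, and per-class scores. How SlothCoralRuntime builds them is not documented here; the sketch below shows one plausible way to derive such a dictionary from raw classifier logits using a softmax. The function names and the extra "top_k" field are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_result(logits, labels, top_k=5):
    """Build a result dictionary with label, confidence, and per-class scores."""
    scores = softmax(logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = ranked[0]
    return {
        "label": labels[best],
        "confidence": scores[best],
        "scores": scores,
        # Hypothetical extra field: the top_k (label, score) pairs.
        "top_k": [(labels[i], scores[i]) for i in ranked[:top_k]],
    }


result = to_result([-1.2, 1.5], ["negative", "positive"])
print(result["label"], round(result["confidence"], 2))
```

With only two classes, top_k=5 simply returns both labels, since slicing past the end of the ranking is harmless.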

For embedding inference:

runtime = SlothCoralRuntime("embedding_model_edgetpu.tflite")
embedding = runtime.embed_text("Hello, world!")
print(f"Embedding shape: {embedding['vector'].shape}")
print(f"Latency: {embedding['latency_ms']:.1f} ms")

6. Using the CLI

sloth-integration provides a sloth CLI tool with the following subcommands:

sloth finetune

Fine-tune a model on a classification dataset.

sloth finetune \
    --model unsloth/Llama-3.2-1B-Instruct \
    --dataset imdb \
    --text-field text \
    --label-field label \
    --num-labels 2 \
    --epochs 1 \
    --output-dir sloth_output/imdb

Options:

| Option | Default | Description |
| --- | --- | --- |
| --model | Required | Hugging Face model name or path |
| --dataset | Required | Dataset name (Hugging Face) or path to CSV/JSON |
| --text-field | text | Name of the text column |
| --label-field | label | Name of the label column |
| --num-labels | 2 | Number of classification classes |
| --epochs | 1 | Number of training epochs |
| --batch-size | 2 | Per-device batch size |
| --learning-rate | 2e-4 | Learning rate |
| --max-seq-length | 2048 | Maximum sequence length |
| --lora-r | 16 | LoRA rank |
| --output-dir | sloth_output | Output directory |

sloth export

Export a fine-tuned model checkpoint to ONNX or TFLite.

# Export classification head (recommended)
sloth export \
    --checkpoint sloth_output/imdb \
    --component head \
    --format onnx \
    --output sentiment_head.onnx

# Export embedding model
sloth export \
    --checkpoint sloth_output/imdb \
    --component embeddings \
    --format onnx \
    --output embedding_model.onnx

Options:

| Option | Default | Description |
| --- | --- | --- |
| --checkpoint | Required | Path to the fine-tuned checkpoint |
| --component | head | Which part to export: head, embeddings, or full |
| --format | onnx | Export format: onnx or tflite |
| --opset | 14 | ONNX opset version |
| --output | Auto | Output file path |
| --optimize | True | Apply ONNX optimisations |

sloth compile

Compile an exported model for Coral Edge TPU.

sloth compile \
    --model sentiment_head.onnx \
    --target coral \
    --quantize \
    --calibration-data calib.npz \
    --output sentiment_head_edgetpu.tflite

Options:

| Option | Default | Description |
| --- | --- | --- |
| --model | Required | Path to the ONNX/TFLite model |
| --target | coral | Target backend: coral or metal |
| --quantize | True | Apply INT8 quantisation |
| --calibration-data | None | Path to calibration data (.npz) |
| --per-channel | True | Use per-channel quantisation |
| --symmetric | True | Use symmetric quantisation |
| --output | Auto | Output path for compiled model |

sloth infer

Run inference on a compiled model.

# Text classification
sloth infer \
    --model sentiment_head_edgetpu.tflite \
    --text "This product is amazing!" \
    --labels negative,positive

# With input from file
sloth infer \
    --model sentiment_head_edgetpu.tflite \
    --input-file reviews.txt \
    --labels negative,positive \
    --batch

Options:

| Option | Default | Description |
| --- | --- | --- |
| --model | Required | Path to the compiled Edge TPU model |
| --text | None | Single text to classify |
| --input-file | None | File with one text per line |
| --labels | Auto | Comma-separated label names |
| --batch | False | Process input file in batches |
| --top-k | 5 | Number of top predictions to return |

sloth benchmark

Benchmark a compiled model on Coral Edge TPU.

sloth benchmark \
    --model sentiment_head_edgetpu.tflite \
    --iterations 100 \
    --warmup 5
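
The --iterations and --warmup options suggest a familiar benchmarking loop: run a few untimed warm-up calls, then time the remaining calls and summarise them. The sketch below illustrates that pattern with a stand-in workload. It is not the implementation of sloth benchmark, and the statistics it reports are an assumption about what such a tool might print.

```python
import statistics
import time


def benchmark(fn, iterations=100, warmup=5):
    """Time fn() and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "iterations": iterations,
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "min_ms": samples[0],
        "max_ms": samples[-1],
    }


# Stand-in workload; replace with a call into the real runtime.
stats = benchmark(lambda: sum(range(1000)), iterations=100, warmup=5)
print({k: round(v, 4) for k, v in stats.items()})
```

Warm-up matters on the Edge TPU in particular because the first inference also loads the model onto the device, so including it would inflate the averages.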

sloth info

Display information about a model and its Edge TPU compatibility.

sloth info --model sentiment_head.onnx

7. Understanding Edge TPU Constraints

The Google Coral Edge TPU is a specialised INT8 inference accelerator with strict constraints. Understanding these constraints is essential for successfully deploying SLM components on the device.

8 MB On-Chip Cache

The Edge TPU has 8 MB of on-chip SRAM that stores model parameters during inference. If a model exceeds this cache, parameters must be streamed from host DRAM, which significantly increases latency.

Implications for sloth-integration:

  • A classification head with hidden_size=2048 and 2 classes requires:
      • Weights: 2048 x 2 x 1 byte (INT8) = ~4 KB
      • Bias: 2 x 4 bytes (INT32) = 8 bytes
      • Total: ~4 KB -- easily fits in cache

  • An embedding model for a 1B parameter SLM with hidden_size=2048:
      • Embedding table: vocab_size x 2048 x 1 byte = ~256 MB for a 128K vocabulary
      • This far exceeds the 8 MB cache; embedding lookups must run on the host
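
The arithmetic behind these estimates is simple enough to script. The sketch below uses hypothetical helper names, counts one byte per INT8 weight and four bytes per INT32 bias, and ignores file-format overhead and activation memory, so treat the results as lower bounds.

```python
EDGE_TPU_CACHE_BYTES = 8 * 1024 * 1024  # 8 MB on-chip cache, as described above


def dense_head_bytes(hidden_size: int, num_classes: int) -> int:
    """INT8 weights (1 byte each) plus INT32 biases (4 bytes each)."""
    return hidden_size * num_classes + 4 * num_classes


def embedding_table_bytes(vocab_size: int, hidden_size: int) -> int:
    """INT8 embedding table: one byte per entry."""
    return vocab_size * hidden_size


def fits_in_cache(num_bytes: int) -> bool:
    """True if the parameters fit within the on-chip cache."""
    return num_bytes <= EDGE_TPU_CACHE_BYTES


head = dense_head_bytes(2048, 2)
table = embedding_table_bytes(128_000, 2048)
print(f"head:  {head} bytes, fits: {fits_in_cache(head)}")           # 4104 bytes, True
print(f"table: {table / 1e6:.0f} MB, fits: {fits_in_cache(table)}")  # 262 MB, False
```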

INT8 Only

The Edge TPU only executes INT8 operations. All float32 tensors must be quantised before compilation. edgecompiler handles this automatically when quantize=True, but you should be aware of the accuracy implications.

Mitigation strategies:

  • Use calibration data that is representative of your inference workload
  • Enable per-channel quantisation (per_channel=True) for weight tensors
  • Use symmetric quantisation (symmetric=True) for Edge TPU compatibility
  • Evaluate quantised model accuracy before deployment
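
To illustrate why per-channel quantisation tends to preserve accuracy, the sketch below quantises two output channels with very different weight ranges, once with a single shared scale and once with a scale per channel. It is a toy model of symmetric INT8 quantisation with made-up weights, not edgecompiler's quantiser.

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude onto the INT8 limit 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 values back to floats."""
    return [q * scale for q in q_values]


def round_trip_error(row, scale):
    """Worst absolute error after quantising and dequantising one channel."""
    restored = dequantize(quantize(row, scale), scale)
    return max(abs(a - b) for a, b in zip(row, restored))


# Made-up weights: one channel with tiny values, one with large values.
small = [0.011, -0.02, 0.015]
large = [1.9, -2.4, 0.7]

tensor_scale = symmetric_scale(small + large)  # one scale shared by both channels
channel_scale = symmetric_scale(small)         # scale fitted to the small channel

print("small channel, per-tensor scale: ", round_trip_error(small, tensor_scale))
print("small channel, per-channel scale:", round_trip_error(small, channel_scale))
```

With a shared scale the large channel dictates the step size, so the small channel collapses onto one or two INT8 levels; a per-channel scale gives it the full range.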

Supported Operations

The Edge TPU supports a specific set of TFLite operations. Operations not supported on the Edge TPU fall back to CPU execution, which increases latency.

Common Edge TPU supported operations:

  • Conv2D, DepthwiseConv2D
  • FullyConnected (dense layers)
  • Add, Subtract, Multiply, Maximum, Minimum
  • ReLU, ReLU6, Tanh, Sigmoid
  • Softmax
  • Reshape, Squeeze, ExpandDims, Transpose
  • MaxPool2D, AveragePool2D
  • Concatenation
  • Quantize, Dequantize

Operations NOT supported on Edge TPU (fall back to CPU):

  • LSTM, GRU, RNN
  • Attention mechanisms (self-attention, cross-attention)
  • Gather, EmbeddingLookup (large tables)
  • String operations
  • Dynamic shapes

This is why full transformer models cannot run entirely on the Edge TPU -- the attention and embedding layers are unsupported. The classification head approach works because it uses only FullyConnected and activation operations.
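
A quick way to reason about a candidate sub-model is to check its operation list against the supported set. The sketch below hard-codes the names from the lists above; it is an illustration only, not the Edge TPU compiler's authoritative support matrix, and the example op sequences (including the name "Attention") are hypothetical.

```python
# Operation names copied from the supported list in this section.
EDGE_TPU_OPS = {
    "Conv2D", "DepthwiseConv2D", "FullyConnected", "Add", "Subtract",
    "Multiply", "Maximum", "Minimum", "ReLU", "ReLU6", "Tanh", "Sigmoid",
    "Softmax", "Reshape", "Squeeze", "ExpandDims", "Transpose",
    "MaxPool2D", "AveragePool2D", "Concatenation", "Quantize", "Dequantize",
}


def split_by_support(ops):
    """Return (ops that can run on the Edge TPU, ops that fall back to CPU)."""
    on_tpu = [op for op in ops if op in EDGE_TPU_OPS]
    on_cpu = [op for op in ops if op not in EDGE_TPU_OPS]
    return on_tpu, on_cpu


# Hypothetical op sequences for a classification head and a transformer block.
head_ops = ["FullyConnected", "ReLU", "FullyConnected", "Softmax"]
transformer_ops = ["Gather", "Attention", "FullyConnected", "Softmax"]

print(split_by_support(head_ops))
print(split_by_support(transformer_ops))
```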


8. Model Selection Guide

Choosing the right SLM for Coral deployment depends on your task and accuracy requirements.

| Model | Parameters | Hidden Size | Recommended For |
| --- | --- | --- | --- |
| Llama 3.2 1B Instruct | 1.2B | 2048 | Text classification, sentiment analysis |
| Qwen3-4B | 4B | 2560 | More complex classification, topic categorisation |
| Phi-4-mini | 3.8B | 3072 | Reasoning tasks, multi-class classification |
| Gemma-2-2B | 2.6B | 2304 | General-purpose classification |

Deployment Strategies

Classification Head Extraction

This is the recommended approach for deploying SLMs on Coral. The full SLM runs on the host (CPU/GPU) to compute embeddings, and only the lightweight classification head runs on the Coral Edge TPU.

Workflow:

  1. Fine-tune the full SLM with LoRA on your classification dataset
  2. Extract the classification head (typically 2-3 linear layers)
  3. Export the head to ONNX
  4. Compile for Coral with INT8 quantisation
  5. At inference: compute embeddings on host, classify on Coral

Size estimate:

| Hidden Size | Classes | INT8 Size (approx.) | Fits in 8 MB? |
| --- | --- | --- | --- |
| 2048 | 2 | 4 KB | Yes |
| 2048 | 10 | 20 KB | Yes |
| 2560 | 100 | 256 KB | Yes |
| 3072 | 1000 | 3 MB | Yes |

Embedding Model Extraction

Extract the embedding layer (without the full transformer stack) for tasks like semantic similarity or retrieval. Note that the embedding table for most SLMs is too large for the Edge TPU cache, so embedding lookups run on CPU with post-embedding projection on the Edge TPU.

Size estimate:

| Model | Vocab Size | Hidden Size | Embedding Table (INT8) | Fits? |
| --- | --- | --- | --- | --- |
| Llama 3.2 1B | 128K | 2048 | ~256 MB | No |
| Qwen3-4B | 152K | 2560 | ~390 MB | No |

For embedding models, use the hybrid inference pattern (Section 9.4).

Full Model Export

Only practical for models under ~8 MB when quantised. This limits you to very small models with short sequence lengths. Not recommended for SLMs.


9. Advanced Usage

9.1 Knowledge Distillation for Edge Deployment

Use ModelDistiller to create a smaller student model that mimics the behaviour of the fine-tuned SLM (teacher). The student model can then be deployed on the Coral Edge TPU.

from sloth_integration import ModelDistiller, SlothAdapter, SlothConverter
import edgecompiler

# Load the fine-tuned teacher model
teacher = SlothAdapter("sloth_output/imdb_sentiment")

# Create a distiller
distiller = ModelDistiller(
    teacher_model=teacher,
    student_hidden_sizes=[512, 256, 128],  # Progressively smaller layers
    num_labels=2,
    temperature=4.0,       # Softmax temperature for soft labels
    alpha=0.7,             # Weight for distillation loss vs hard label loss
)

# Train the student model
distiller.train(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_epochs=5,
    batch_size=16,
    learning_rate=1e-3,
)

# Export and compile the student model
student_path = distiller.save_student("sloth_output/distilled_student")
converter = SlothConverter(student_path)
onnx_path = converter.export_head(format="onnx")
compiled_path = edgecompiler.compile(onnx_path, target="coral", quantize=True)
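
This guide does not show how ModelDistiller combines its temperature and alpha settings. The sketch below implements the commonly used formulation, in which alpha weights a temperature-softened KL term and (1 - alpha) weights the ordinary cross-entropy on the hard label. Treat it as an assumption about the loss, written in plain Python for a single example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-label term: how far the student's softened output is from the teacher's.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard-label term: standard cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


teacher = [0.2, 3.1]
print(distillation_loss([0.2, 3.1], teacher, label=1))  # student matches teacher
print(distillation_loss([2.0, 0.1], teacher, label=1))  # student disagrees
```

The T^2 factor is conventional: it keeps the soft-label gradients on a comparable scale when the temperature changes.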

9.2 Pruning Models to Fit Edge TPU

If a model is slightly too large for the Edge TPU cache, you can prune less important weights before quantisation:

from sloth_integration import SlothConverter

converter = SlothConverter("sloth_output/imdb_sentiment")

# Export with pruning
onnx_path = converter.export_head(
    format="onnx",
    prune_ratio=0.3,  # Remove 30% of least important weights
    prune_method="magnitude",  # Prune by weight magnitude
)
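
For intuition, magnitude pruning with prune_ratio=0.3 can be pictured as zeroing the 30% of weights closest to zero. The sketch below does this for a flat list of made-up weights; it is not SlothConverter's implementation, and the function name is hypothetical.

```python
def magnitude_prune(weights, prune_ratio):
    """Zero the prune_ratio fraction of weights with the smallest magnitude."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be between 0 and 1")
    num_to_prune = int(len(weights) * prune_ratio)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in order[:num_to_prune]:
        pruned[index] = 0.0
    return pruned


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.5, 0.08, -0.4]
print(magnitude_prune(weights, prune_ratio=0.3))
# [0.9, 0.0, 0.3, 0.0, -0.7, 0.2, 0.0, 0.5, 0.08, -0.4]
```

Note that zeroed entries in a dense INT8 tensor still occupy one byte each, so how much the exported file actually shrinks depends on how the pruned layer is stored, which this guide does not specify.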

9.3 Calibration Data Preparation for PTQ

Accurate calibration data is critical for INT8 quantisation quality. The calibration samples should represent the distribution of embeddings that the classification head will receive at inference time.

from sloth_integration import SlothQuantizer

quantizer = SlothQuantizer(
    checkpoint_path="sloth_output/imdb_sentiment",
)

# Generate calibration samples from your training data
cal_path = quantizer.prepare_calibration_data(
    dataset_name="imdb",
    text_field="text",
    num_samples=100,
    output_path="calibration_samples.npz",
)

# Use the calibration data during compilation
compiled_path = edgecompiler.compile(
    "sentiment_head.onnx",
    target="coral",
    quantize=True,
    calibration_data=cal_path,
)
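
The reason representative samples matter is that calibration fixes the numeric range the quantised model can express. In the toy sketch below, a scale derived from an overly narrow calibration set clips a typical inference value, while a representative set preserves it. The data and helper names are hypothetical, and real PTQ tools use more elaborate range estimation.

```python
def calibrate_scale(samples):
    """Symmetric INT8 scale from the largest magnitude seen during calibration."""
    max_abs = max(abs(v) for sample in samples for v in sample)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def fake_quantize(value, scale):
    """Quantise to INT8 and back, clamping values outside the calibrated range."""
    q = max(-127, min(127, round(value / scale)))
    return q * scale


# Hypothetical embeddings: the first set is too narrow, the second is representative.
narrow = [[0.1, -0.2, 0.15], [0.05, -0.1, 0.2]]
representative = narrow + [[1.8, -2.0, 0.9]]

inference_value = 1.5
for name, samples in (("narrow", narrow), ("representative", representative)):
    scale = calibrate_scale(samples)
    print(name, round(fake_quantize(inference_value, scale), 3))
```

With the narrow set the value 1.5 is clamped to the largest magnitude seen during calibration (0.2), which is the kind of silent distortion that shows up later as poor classification accuracy.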

9.4 Hybrid Inference (Embeddings on Host, Classifier on Coral)

This is the recommended pattern for SLM deployment. The full model runs on the host (MacBook M1 Pro) to generate embeddings, and the lightweight classification head runs on the Coral Edge TPU for fast, low-power inference.

from sloth_integration import SlothAdapter, SlothCoralRuntime
import numpy as np

# Load the full model on host for embedding computation
adapter = SlothAdapter("sloth_output/imdb_sentiment")

# Load the compiled classification head on Coral
coral = SlothCoralRuntime(
    model_path="sentiment_head_edgetpu.tflite",
    labels=["negative", "positive"],
)

# Hybrid inference pipeline
def classify_text_hybrid(text: str) -> dict:
    # Step 1: Compute embedding on host (using MPS/GPU on M1 Pro)
    embedding = adapter.get_embedding(text)  # np.ndarray of shape [1, hidden_size]

    # Step 2: Run classification on Coral Edge TPU
    result = coral.classify_embedding(embedding)

    return result

# Use the pipeline
result = classify_text_hybrid("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Total latency: {result['latency_ms']:.1f} ms")

The hybrid approach combines:

  • Host GPU (MPS): Best for transformer attention, embedding lookups, dynamic shapes, and variable-length sequences
  • Coral Edge TPU: Best for dense INT8 matrix multiplication (the classification head), with sub-millisecond latency for small layers
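
When tuning a hybrid pipeline it helps to know how much time each stage takes. This guide does not state whether result['latency_ms'] includes the host embedding step, so the sketch below times both stages explicitly. The embed and classify callables are stand-ins for adapter.get_embedding and coral.classify_embedding, and the timing field names are hypothetical.

```python
import time
from typing import Callable, Sequence


def run_hybrid(
    text: str,
    embed: Callable[[str], Sequence[float]],
    classify: Callable[[Sequence[float]], dict],
) -> dict:
    """Run host embedding, then accelerator classification, timing each stage."""
    start = time.perf_counter()
    embedding = embed(text)
    embedded = time.perf_counter()
    result = dict(classify(embedding))  # copy so the caller's dict is not mutated
    done = time.perf_counter()

    result["host_ms"] = (embedded - start) * 1000.0
    result["accelerator_ms"] = (done - embedded) * 1000.0
    result["total_ms"] = (done - start) * 1000.0
    return result


# Stand-ins so the sketch runs without a model or a Coral device attached.
def fake_embed(text):
    return [float(len(text)), 1.0]


def fake_classify(embedding):
    return {"label": "positive" if embedding[0] > 10 else "negative"}


print(run_hybrid("This movie was absolutely fantastic!", fake_embed, fake_classify))
```

In a real deployment the host stage will usually dominate, since the transformer forward pass is far heavier than the classification head.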

10. Integration Architecture

Import Relationship Diagram

sloth_integration
    |
    +-- SlothAdapter
    |       |-- uses: unsloth.FastLanguageModel
    |       |-- uses: transformers.AutoModelForSequenceClassification
    |       |-- uses: peft.LoraConfig, get_peft_model
    |
    +-- SlothConverter
    |       |-- uses: torch.onnx.export
    |       |-- uses: onnx (optimization, validation)
    |       |-- uses: SlothAdapter (to load model weights)
    |
    +-- SlothQuantizer
    |       |-- uses: SlothAdapter (to generate calibration data)
    |       |-- uses: numpy (to save .npz files)
    |
    +-- SlothCoralRuntime
    |       |-- uses: edgecompiler.runtime.CoralUSBRuntime
    |       |-- uses: edgecompiler.runtime.CoralInferenceSession
    |       |-- uses: transformers.AutoTokenizer
    |
    +-- ModelDistiller
    |       |-- uses: SlothAdapter (teacher model)
    |       |-- uses: torch.nn.Module (student model)
    |
    +-- cli
            |-- uses: SlothAdapter, SlothConverter, SlothQuantizer
            |-- uses: edgecompiler.compile()
            |-- uses: SlothCoralRuntime

Data Flow

1. Fine-tuning:
   HF Dataset --> DataLoader --> FastLanguageModel (LoRA) --> Checkpoint

2. Export:
   Checkpoint --> SlothConverter --> ONNX file (classification head)

3. Compilation:
   ONNX file --> edgecompiler.convert_to_ir() --> IR Graph
              --> edgecompiler.quantize_ptq() --> Quantised IR Graph
              --> edgecompiler.compile_for_coral() --> *_edgetpu.tflite

4. Hybrid Inference:
   Text --> Tokenizer --> FastLanguageModel (host GPU) --> Embedding vector
         --> SlothCoralRuntime.classify_embedding() --> Coral Edge TPU --> Label

5. Standalone Coral Inference:
   Embedding vector --> CoralUSBRuntime.infer() --> InferenceResult --> Label

11. Troubleshooting

"unsloth not found"

Symptoms: ModuleNotFoundError: No module named 'unsloth'

Solution: Install unsloth:

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

"edgecompiler not found"

Symptoms: ModuleNotFoundError: No module named 'edgecompiler'

Solution: Install edgecompiler from the repository:

cd ~/projects/edgecompiler
pip install -e .

"libedgetpu not found"

Symptoms: OSError: dlopen(libedgetpu.1.dylib, ...): Library not loaded

Solutions:

  1. Verify the dylib exists:
     ls -la /usr/local/lib/libedgetpu*.dylib
  2. Set DYLD_LIBRARY_PATH:
     export DYLD_LIBRARY_PATH="/usr/local/lib:${DYLD_LIBRARY_PATH:-}"
  3. Run the setup script:
     bash scripts/setup_coral_runtime.sh --force

"Architecture mismatch (x86_64 dylib on ARM64)"

Symptoms: OSError: mach-o, but not compatible architecture

Solution: Build and install the ARM64 version:

file /usr/local/lib/libedgetpu.1.dylib
# If it says x86_64, rebuild:
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu && CPU=darwin_arm64 make
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/

"No Coral USB device detected"

Solutions:

  1. Check the physical connection (LED should be lit)
  2. Verify with system_profiler SPUSBDataType | grep -i coral
  3. Avoid unpowered USB-C hubs; connect the accelerator directly to a port on the Mac, using a USB-A to USB-C adapter if your cable requires one
  4. Re-plug the device after installing libedgetpu

"Model exceeds Edge TPU cache"

Symptoms: Model runs but with high latency (30+ ms for small models)

Diagnosis: The model is too large for the 8 MB on-chip cache, forcing parameter streaming from DRAM.

Solutions:

  1. Use classification head extraction instead of the full model
  2. Reduce hidden_size via distillation
  3. Apply pruning with SlothConverter.export_head(prune_ratio=0.3)
  4. Use the hybrid inference pattern (embeddings on host, classifier on Coral)

"Quantisation accuracy is poor"

Symptoms: The compiled model produces incorrect classifications

Solutions:

  1. Provide calibration data:
     cal_path = quantizer.prepare_calibration_data(num_samples=200)
     edgecompiler.compile(model, target="coral", calibration_data=cal_path)
  2. Increase the number of calibration samples (100-500 recommended)
  3. Enable per-channel quantisation: per_channel=True
  4. Try symmetric quantisation: symmetric=True

"Operations falling back to CPU"

Symptoms: Warning during compilation, higher than expected latency

Diagnosis: Some operations in the model are not supported by the Edge TPU and must run on the host CPU.

Solutions:

  1. Check which ops are falling back:
     edgecompiler inspect model_edgetpu.tflite
  2. Restructure the model to avoid unsupported operations
  3. Use classification head extraction (avoids attention ops)
  4. Use the --strict flag to fail on unsupported ops rather than falling back

12. API Reference

SlothAdapter

class SlothAdapter:
    """Wraps unsloth FastLanguageModel for fine-tuning and inference."""

    def __init__(
        self,
        model_name: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = True,
        dtype: str = "auto",
    ): ...

    def apply_lora(
        self,
        r: int = 16,
        lora_alpha: int = 16,
        lora_dropout: float = 0,
        target_modules: list[str] | None = None,
    ): ...

    def finetune_classification(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_labels: int = 2,
        num_epochs: int = 1,
        batch_size: int = 2,
        learning_rate: float = 2e-4,
        warmup_steps: int = 10,
        output_dir: str = "sloth_output",
    ): ...

    def get_embedding(self, text: str) -> np.ndarray: ...

    def save(self, path: str) -> None: ...

    def estimate_model_size_mb(self) -> float: ...

SlothConverter

class SlothConverter:
    """Exports fine-tuned models to ONNX/TFLite for edge deployment."""

    def __init__(
        self,
        checkpoint_path: str,
        tokenizer_path: str | None = None,
    ): ...

    def export_head(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        opset: int = 14,
        optimize: bool = True,
        prune_ratio: float = 0.0,
        prune_method: str = "magnitude",
    ) -> str: ...

    def export_embeddings(
        self,
        output_path: str | None = None,
        format: str = "onnx",
    ) -> str: ...

    def export_full(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        max_seq_length: int = 128,
    ) -> str: ...

    def validate_onnx(self, onnx_path: str) -> bool: ...

SlothQuantizer

class SlothQuantizer:
    """Prepares calibration data and configures INT8 quantisation."""

    def __init__(
        self,
        checkpoint_path: str,
    ): ...

    def prepare_calibration_data(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        num_samples: int = 100,
        output_path: str = "calibration_samples.npz",
    ) -> str: ...

    def get_quantization_config(
        self,
        per_channel: bool = True,
        symmetric: bool = True,
    ) -> dict: ...

SlothCoralRuntime

class SlothCoralRuntime:
    """Runs text classification and embedding inference on Coral Edge TPU."""

    def __init__(
        self,
        model_path: str,
        labels: list[str] | None = None,
        libedgetpu_path: str | None = None,
    ): ...

    def classify_text(self, text: str, top_k: int = 5) -> dict: ...

    def classify_embedding(self, embedding: np.ndarray, top_k: int = 5) -> dict: ...

    def embed_text(self, text: str) -> dict: ...

    def classify_batch(self, texts: list[str]) -> list[dict]: ...

    def benchmark(self, num_runs: int = 100, warmup: int = 5) -> dict: ...

ModelDistiller

class ModelDistiller:
    """Distils a fine-tuned SLM into a smaller student model."""

    def __init__(
        self,
        teacher_model: SlothAdapter,
        student_hidden_sizes: list[int] = [512, 256],
        num_labels: int = 2,
        temperature: float = 4.0,
        alpha: float = 0.7,
    ): ...

    def train(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_epochs: int = 5,
        batch_size: int = 16,
        learning_rate: float = 1e-3,
    ): ...

    def save_student(self, path: str) -> str: ...

    def estimate_student_size_mb(self) -> float: ...