sloth-integration: Comprehensive Usage Guide¶
This document explains how to use sloth-integration with the existing edgecompiler repository and Google Coral USB Accelerator to deploy fine-tuned Small Language Models on edge hardware. It covers the complete workflow from fine-tuning with unsloth to running INT8 inference on a Coral Edge TPU, with specific instructions for MacBook M1 Pro.
Table of Contents¶
- Overview
- Prerequisites
- Installation
- Quick Start
- Step-by-Step Workflow
- Using the CLI
- Understanding Edge TPU Constraints
- Model Selection Guide
- Advanced Usage
- Integration Architecture
- Troubleshooting
- API Reference
1. Overview¶
sloth-integration bridges two projects:
- unsloth provides FastLanguageModel, a wrapper around Hugging Face Transformers that makes LoRA/QLoRA fine-tuning 2x faster. It handles 4-bit quantisation, gradient checkpointing, and optimised training loops.
- edgecompiler provides a compiler toolchain for deploying ML models on edge hardware. Its compile() function converts models (ONNX, TFLite, PyTorch) to an intermediate representation, quantises them to INT8, and compiles for Google Coral Edge TPU or Apple Silicon Metal backends.
sloth-integration connects these tools with purpose-built components:
| Component | Purpose |
|---|---|
| SlothAdapter | Wraps unsloth FastLanguageModel for fine-tuning and model management |
| SlothConverter | Exports fine-tuned models (classification head, embedding layer) to ONNX/TFLite |
| SlothQuantizer | Prepares calibration data and configures INT8 quantisation for Edge TPU |
| SlothCoralRuntime | Runs text classification and embedding inference on Coral USB Accelerator |
| ModelDistiller | Distils large SLMs into smaller models suitable for edge deployment |
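The typical workflow is: fine-tune with SlothAdapter, export the classification head with SlothConverter, compile with edgecompiler.compile(), and run inference with SlothCoralRuntime.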
2. Prerequisites¶
Hardware¶
| Item | Requirement | Notes |
|---|---|---|
| MacBook | M1 Pro or later | M1 base model works but has fewer GPU cores |
| RAM | 16 GB minimum | 32 GB recommended for models larger than 1B parameters |
| Storage | 20 GB free | For model weights, exports, and compiled artefacts |
| Coral USB Accelerator | Any revision | USB 3.0 model recommended for best throughput |
| USB connection | USB 3.0 (5 Gbps) | Use a direct port; avoid unpowered USB-C hubs |
Software¶
| Package | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Runtime |
| unsloth | Latest | Fast fine-tuning with LoRA |
| edgecompiler | 0.1.0+ | Model compilation for Coral |
| PyTorch | 2.1+ | Backend for unsloth and model export |
| transformers | 4.36+ | Model loading and tokenisation |
| onnx | 1.15+ | ONNX export |
| onnxruntime | 1.16+ | ONNX model validation |
| tflite-runtime | 2.14+ | TFLite interpreter for Coral |
| numpy | 1.24+ | Array operations |
Coral Edge TPU Runtime¶
You must install libedgetpu for your platform. On macOS Apple Silicon, the
official Google build is x86_64 only, so use the community ARM64 build from
feranick/libedgetpu (installation is covered in Section 3, Step 3).
3. Installation¶
Step 1: Install edgecompiler¶
Clone and install the edgecompiler repository:
cd ~/projects
git clone https://github.com/rotsl/edgecompiler.git
cd edgecompiler
pip install -e ".[dev]"
Verify the installation:
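One way to check, sketched here with the standard library only, is to confirm that the packages from the Software table are importable. The helper name `check_importable` is hypothetical and is not part of edgecompiler; adjust the module list to your environment.

```python
from importlib.util import find_spec


def check_importable(module_names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Top-level module names taken from the Software table in Section 2.
    wanted = ["edgecompiler", "torch", "transformers", "onnx", "onnxruntime", "numpy"]
    for name, ok in check_importable(wanted).items():
        print(f"{name:15s} {'OK' if ok else 'MISSING'}")
```

Any module reported as MISSING needs to be installed before continuing with the later steps.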
Step 2: Enable sloth integration extras (optional)¶
For ONNX and unsloth workflows:
sloth_integration is packaged from sloth-integration/src/sloth_integration
by the root pyproject.toml.
Step 3: Install Coral Edge TPU runtime (libedgetpu for ARM64 macOS)¶
The Google Coral Edge TPU runtime (libedgetpu) is required to offload
inference to the Coral USB Accelerator. On macOS Apple Silicon, you need
the ARM64 build.
Option A: Use the setup script (recommended)¶
This script:
- Detects your CPU architecture (arm64 or x86_64)
- Downloads the appropriate libedgetpu build
- Installs the dylib to /usr/local/lib/
- Sets up DYLD_LIBRARY_PATH in your shell configuration
- Verifies the installation
Option B: Manual installation¶
# Install build dependencies
brew install libusb cmake wget
# Clone and build for ARM64
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu
CPU=darwin_arm64 make
# Install the built library
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
sudo ln -sf /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
sudo update_dyld_shared_cache 2>/dev/null || true
# Set library path
echo 'export DYLD_LIBRARY_PATH="/usr/local/lib:${DYLD_LIBRARY_PATH:-}"' >> ~/.zshrc
source ~/.zshrc
Install tflite-runtime¶
If tflite-runtime is not available for your Python version on ARM64 macOS,
install a compatible wheel from feranick/TFlite-builds.
Step 4: Verify the complete setup¶
Run the verification script:
python -c "
import edgecompiler
print(f'edgecompiler: {edgecompiler.__version__}')
from edgecompiler.runtime.coral_usb import CoralUSBRuntime, find_libedgetpu
lib_path = find_libedgetpu()
print(f'libedgetpu: {lib_path}')
runtime = CoralUSBRuntime()
devices = runtime.detect_devices()
print(f'Coral devices: {devices}')
from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
print('sloth-integration: all imports OK')
"
Expected output:
edgecompiler: 0.1.0
libedgetpu: /usr/local/lib/libedgetpu.1.dylib
Coral devices: [CoralDevice(path=':0', type='USB', name='Coral USB Accelerator')]
sloth-integration: all imports OK
If Coral devices is empty, check that the USB Accelerator is plugged in and
the LED is lit. See the Troubleshooting section for help.
4. Quick Start¶
This minimal example fine-tunes a sentiment classifier on the IMDB dataset, exports the classification head, compiles it for Coral, and runs inference.
"""Quick start: fine-tune to Coral inference."""
from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
import edgecompiler
# 1. Fine-tune a 1B model on IMDB sentiment
adapter = SlothAdapter("unsloth/Llama-3.2-1B-Instruct", max_seq_length=2048)
adapter.finetune_classification(
dataset_name="imdb",
text_field="text",
label_field="label",
num_labels=2,
num_epochs=1,
)
adapter.save("sloth_output/imdb_sentiment")
# 2. Export the classification head to ONNX
converter = SlothConverter("sloth_output/imdb_sentiment")
onnx_path = converter.export_head(
output_path="imdb_sentiment_head.onnx",
format="onnx",
)
# 3. Compile for Coral Edge TPU
compiled_path = edgecompiler.compile(
onnx_path,
target="coral",
quantize=True,
)
print(f"Compiled model: {compiled_path}")
# 4. Run inference on Coral USB
runtime = SlothCoralRuntime(compiled_path)
result = runtime.classify_text("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Latency: {result['latency_ms']:.1f} ms on Edge TPU")
5. Step-by-Step Workflow¶
Step 1: Fine-tune an SLM with unsloth¶
The SlothAdapter wraps unsloth's FastLanguageModel and provides methods
for LoRA fine-tuning on classification and embedding tasks.
from sloth_integration import SlothAdapter
# Load a pre-trained model with 4-bit quantisation
adapter = SlothAdapter(
model_name="unsloth/Llama-3.2-1B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
dtype="auto",
)
# Apply LoRA adapters
adapter.apply_lora(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
)
# Fine-tune for text classification
adapter.finetune_classification(
dataset_name="imdb",
text_field="text",
label_field="label",
num_labels=2,
num_epochs=1,
batch_size=2,
learning_rate=2e-4,
warmup_steps=10,
output_dir="sloth_output/imdb_sentiment",
)
# Save the fine-tuned checkpoint
adapter.save("sloth_output/imdb_sentiment")
You can also fine-tune on a custom dataset:
from datasets import load_dataset
# Load a custom CSV/JSON dataset
dataset = load_dataset("csv", data_files="my_data.csv")
adapter.finetune_classification(
dataset=dataset,
text_field="review_text",
label_field="sentiment",
num_labels=3, # negative, neutral, positive
num_epochs=3,
output_dir="sloth_output/custom_classifier",
)
Step 2: Export the fine-tuned model for Coral¶
Edge TPU models must be small (under ~8 MB) and consist of operations that the Edge TPU supports. Full transformer models are far too large. The recommended approach is to export only the classification head -- a small feed-forward network that takes pre-computed embeddings as input.
from sloth_integration import SlothConverter
converter = SlothConverter(
checkpoint_path="sloth_output/imdb_sentiment",
tokenizer_path="sloth_output/imdb_sentiment",
)
# Export the classification head (recommended for SLMs)
onnx_path = converter.export_head(
output_path="imdb_sentiment_head.onnx",
format="onnx",
opset=14,
optimize=True,
# The head expects embeddings of shape [batch_size, hidden_size]
# hidden_size is auto-detected from the model config
)
print(f"Exported ONNX model: {onnx_path}")
# Alternative: export the embedding model (for hybrid inference)
embedding_onnx_path = converter.export_embeddings(
output_path="imdb_embedding_model.onnx",
format="onnx",
)
For very small models (under 8 MB when quantised), you can attempt to export the full model:
full_onnx_path = converter.export_full(
output_path="full_model.onnx",
format="onnx",
max_seq_length=128, # Truncate to reduce model size
)
Note: Full model export is only practical for models with fewer than ~30M parameters and short sequence lengths. Most SLMs (even 1B parameter models) are too large. Use the classification head or embedding extraction approach instead.
Step 3: Compile with edgecompiler for Edge TPU¶
Pass the exported ONNX model to edgecompiler.compile() with target="coral":
import edgecompiler
# One-call compilation with automatic INT8 quantisation
compiled_path = edgecompiler.compile(
"imdb_sentiment_head.onnx",
target="coral",
quantize=True,
# Optional: provide calibration data for better quantisation accuracy
# calibration_data="calibration_samples.npz",
)
print(f"Compiled Edge TPU model: {compiled_path}")
# Output: imdb_sentiment_head_edgetpu.tflite
The compilation pipeline performs these steps internally:
- Convert the ONNX model to edgecompiler's IR (Intermediate Representation)
- Quantise the IR graph to INT8 using post-training quantisation (PTQ)
- Compile the quantised graph for the Coral Edge TPU backend
- Write the compiled *_edgetpu.tflite file
For better quantisation accuracy, provide calibration data:
import numpy as np
# Generate or collect calibration samples (embeddings from the SLM)
# These should be representative of the data the model will see at inference time
# Placeholder only: random values stand in for real embeddings (100 samples, hidden_size=2048)
cal_samples = np.random.randn(100, 2048).astype(np.float32)
np.savez("calibration_samples.npz", embeddings=cal_samples)
compiled_path = edgecompiler.compile(
"imdb_sentiment_head.onnx",
target="coral",
quantize=True,
calibration_data="calibration_samples.npz",
per_channel=True,
symmetric=True,
)
Step 4: Run inference on Coral USB¶
Use SlothCoralRuntime for high-level text inference:
from sloth_integration import SlothCoralRuntime
# Initialize with the compiled Edge TPU model
runtime = SlothCoralRuntime(
model_path="imdb_sentiment_head_edgetpu.tflite",
labels=["negative", "positive"],
)
# Classify a single text
result = runtime.classify_text("This movie was absolutely fantastic!")
print(result)
# {'label': 'positive', 'confidence': 0.94, 'scores': [0.06, 0.94], 'latency_ms': 3.2}
# Classify multiple texts
texts = [
    "Terrible waste of time.",
    "One of the best films I have ever seen.",
    "It was okay, nothing special.",
]
results = runtime.classify_batch(texts)
for text, result in zip(texts, results):
    print(f"  {text!r} -> {result['label']} ({result['confidence']:.2f})")
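The result dictionary shown above carries a label, a confidence, and per-class scores. As a rough sketch of how such a dictionary could be derived from raw INT8 outputs, the snippet below applies the usual TFLite affine dequantisation, real = scale * (q - zero_point), followed by a softmax. The function name and the example scale and zero-point are illustrative assumptions; this is not SlothCoralRuntime's actual implementation.

```python
import math


def postprocess_int8_logits(q_values, scale, zero_point, labels):
    """Turn raw INT8 outputs into a label/confidence/scores dictionary."""
    # Affine dequantisation: real = scale * (q - zero_point).
    logits = [scale * (q - zero_point) for q in q_values]
    # Numerically stable softmax over the dequantised logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Index of the highest-scoring class.
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": labels[best], "confidence": scores[best], "scores": scores}


result = postprocess_int8_logits(
    [-20, 35], scale=0.05, zero_point=0, labels=["negative", "positive"]
)
print(result["label"], round(result["confidence"], 2))
```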
For embedding inference:
runtime = SlothCoralRuntime("embedding_model_edgetpu.tflite")
embedding = runtime.embed_text("Hello, world!")
print(f"Embedding shape: {embedding['vector'].shape}")
print(f"Latency: {embedding['latency_ms']:.1f} ms")
6. Using the CLI¶
sloth-integration provides a sloth CLI tool with the following subcommands:
sloth finetune¶
Fine-tune a model on a classification dataset.
sloth finetune \
--model unsloth/Llama-3.2-1B-Instruct \
--dataset imdb \
--text-field text \
--label-field label \
--num-labels 2 \
--epochs 1 \
--output-dir sloth_output/imdb
Options:
| Option | Default | Description |
|---|---|---|
| --model | Required | Hugging Face model name or path |
| --dataset | Required | Dataset name (Hugging Face) or path to CSV/JSON |
| --text-field | text | Name of the text column |
| --label-field | label | Name of the label column |
| --num-labels | 2 | Number of classification classes |
| --epochs | 1 | Number of training epochs |
| --batch-size | 2 | Per-device batch size |
| --learning-rate | 2e-4 | Learning rate |
| --max-seq-length | 2048 | Maximum sequence length |
| --lora-r | 16 | LoRA rank |
| --output-dir | sloth_output | Output directory |
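The defaults listed for sloth finetune can be mirrored in a small argparse parser. This is only a sketch of how such a subcommand might be wired; the real sloth CLI implementation is not shown in this guide, and `build_finetune_parser` is a hypothetical name.

```python
import argparse


def build_finetune_parser():
    """Build a parser whose options and defaults follow the sloth finetune table."""
    parser = argparse.ArgumentParser(prog="sloth finetune")
    parser.add_argument("--model", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--text-field", default="text")
    parser.add_argument("--label-field", default="label")
    parser.add_argument("--num-labels", type=int, default=2)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=2)
    parser.add_argument("--learning-rate", type=float, default=2e-4)
    parser.add_argument("--max-seq-length", type=int, default=2048)
    parser.add_argument("--lora-r", type=int, default=16)
    parser.add_argument("--output-dir", default="sloth_output")
    return parser


# Only the two required options are given; everything else falls back to its default.
args = build_finetune_parser().parse_args(
    ["--model", "unsloth/Llama-3.2-1B-Instruct", "--dataset", "imdb"]
)
print(vars(args))
```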
sloth export¶
Export a fine-tuned model checkpoint to ONNX or TFLite.
# Export classification head (recommended)
sloth export \
--checkpoint sloth_output/imdb \
--component head \
--format onnx \
--output sentiment_head.onnx
# Export embedding model
sloth export \
--checkpoint sloth_output/imdb \
--component embeddings \
--format onnx \
--output embedding_model.onnx
Options:
| Option | Default | Description |
|---|---|---|
| --checkpoint | Required | Path to the fine-tuned checkpoint |
| --component | head | Which part to export: head, embeddings, or full |
| --format | onnx | Export format: onnx or tflite |
| --opset | 14 | ONNX opset version |
| --output | Auto | Output file path |
| --optimize | True | Apply ONNX optimisations |
sloth compile¶
Compile an exported model for Coral Edge TPU.
sloth compile \
--model sentiment_head.onnx \
--target coral \
--quantize \
--calibration-data calib.npz \
--output sentiment_head_edgetpu.tflite
Options:
| Option | Default | Description |
|---|---|---|
| --model | Required | Path to the ONNX/TFLite model |
| --target | coral | Target backend: coral or metal |
| --quantize | True | Apply INT8 quantisation |
| --calibration-data | None | Path to calibration data (.npz) |
| --per-channel | True | Use per-channel quantisation |
| --symmetric | True | Use symmetric quantisation |
| --output | Auto | Output path for compiled model |
sloth infer¶
Run inference on a compiled model.
# Text classification
sloth infer \
--model sentiment_head_edgetpu.tflite \
--text "This product is amazing!" \
--labels negative,positive
# With input from file
sloth infer \
--model sentiment_head_edgetpu.tflite \
--input-file reviews.txt \
--labels negative,positive \
--batch
Options:
| Option | Default | Description |
|---|---|---|
| --model | Required | Path to the compiled Edge TPU model |
| --text | None | Single text to classify |
| --input-file | None | File with one text per line |
| --labels | Auto | Comma-separated label names |
| --batch | False | Process input file in batches |
| --top-k | 5 | Number of top predictions to return |
sloth benchmark¶
Benchmark a compiled model on Coral Edge TPU.
sloth info¶
Display information about a model and its Edge TPU compatibility.
7. Understanding Edge TPU Constraints¶
The Google Coral Edge TPU is a specialised INT8 inference accelerator with strict constraints. Understanding these constraints is essential for successfully deploying SLM components on the device.
8 MB On-Chip Cache¶
The Edge TPU has 8 MB of on-chip SRAM that stores model parameters during inference. If a model exceeds this cache, parameters must be streamed from host DRAM, which significantly increases latency.
Implications for sloth-integration:
- A classification head with hidden_size=2048 and 2 classes requires:
  - Weights: 2048 x 2 x 1 byte (INT8) = ~4 KB
  - Bias: 2 x 4 bytes (INT32) = 8 bytes
  - Total: ~4 KB -- easily fits in cache
- An embedding model for a 1B parameter SLM with hidden_size=2048:
  - Embedding table: vocab_size x 2048 x 1 byte = ~256 MB for a 128K vocabulary
  - This far exceeds the 8 MB cache; embedding lookups must run on the host
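The arithmetic behind these estimates can be reproduced in a few lines. The sketch below assumes one byte per INT8 weight and four bytes per INT32 bias, as in the list above; the function names are illustrative and not part of sloth-integration.

```python
EDGE_TPU_CACHE_BYTES = 8 * 1024 * 1024  # 8 MB on-chip cache, as stated above


def dense_head_bytes(hidden_size, num_classes):
    """INT8 weights (1 byte each) plus INT32 biases (4 bytes each)."""
    return hidden_size * num_classes + 4 * num_classes


def embedding_table_bytes(vocab_size, hidden_size):
    """INT8 embedding table: one byte per entry."""
    return vocab_size * hidden_size


def fits_in_cache(num_bytes):
    """True when the parameters fit in the on-chip cache."""
    return num_bytes <= EDGE_TPU_CACHE_BYTES


head = dense_head_bytes(2048, 2)
table = embedding_table_bytes(128 * 1024, 2048)
print(f"head:  {head} bytes, fits: {fits_in_cache(head)}")
print(f"table: {table / 2**20:.0f} MiB, fits: {fits_in_cache(table)}")
```

A 128K-entry table at hidden_size=2048 comes to 256 MiB, roughly 32 times the cache size, which is why the lookup stays on the host.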
INT8 Only¶
The Edge TPU only executes INT8 operations. All float32 tensors must be
quantised before compilation. edgecompiler handles this automatically when
quantize=True, but you should be aware of the accuracy implications.
Mitigation strategies:
- Use calibration data that is representative of your inference workload
- Enable per-channel quantisation (per_channel=True) for weight tensors
- Use symmetric quantisation (symmetric=True) for Edge TPU compatibility
- Evaluate quantised model accuracy before deployment
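To make the per-channel and symmetric options concrete, here is a minimal pure-Python sketch of symmetric per-channel weight quantisation: each output channel gets its own scale, and the zero-point is fixed at 0. It illustrates the general technique only; edgecompiler's internal quantiser is not shown in this guide and may differ.

```python
def quantize_per_channel_symmetric(weights):
    """Quantise each row (one output channel) to INT8 with its own scale."""
    quantised, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Symmetric: map [-max_abs, max_abs] onto [-127, 127], zero-point is 0.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_row = [max(-127, min(127, round(w / scale))) for w in row]
        quantised.append(q_row)
        scales.append(scale)
    return quantised, scales


def dequantize(quantised, scales):
    """Recover approximate float weights from INT8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantised, scales)]


# Two channels with very different value ranges.
weights = [[0.02, -0.01, 0.005], [1.5, -3.0, 0.75]]
q, scales = quantize_per_channel_symmetric(weights)
restored = dequantize(q, scales)
worst = max(abs(a - b) for ra, rb in zip(weights, restored) for a, b in zip(ra, rb))
print(q)
print(f"worst absolute error: {worst:.6f}")
```

With a single per-tensor scale, the first channel's small weights would share the scale chosen for the second channel and lose most of their resolution; per-channel scales avoid that.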
Supported Operations¶
The Edge TPU supports a specific set of TFLite operations. Operations not supported on the Edge TPU fall back to CPU execution, which increases latency.
Common Edge TPU supported operations:
- Conv2D, DepthwiseConv2D
- FullyConnected (dense layers)
- Add, Subtract, Multiply, Maximum, Minimum
- ReLU, ReLU6, Tanh, Sigmoid
- Softmax
- Reshape, Squeeze, ExpandDims, Transpose
- MaxPool2D, AveragePool2D
- Concatenation
- Quantize, Dequantize
Operations NOT supported on Edge TPU (fall back to CPU):
- LSTM, GRU, RNN
- Attention mechanisms (self-attention, cross-attention)
- Gather, EmbeddingLookup (large tables)
- String operations
- Dynamic shapes
This is why full transformer models cannot run entirely on the Edge TPU -- the attention and embedding layers are unsupported. The classification head approach works because it uses only FullyConnected and activation operations.
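A quick way to reason about fallback is to walk a model's op sequence against the supported list. The sketch below copies the supported set from this section (it is not exhaustive) and assumes that only the leading run of supported ops is delegated to the Edge TPU, with everything from the first unsupported op onward running on the CPU, which is how Coral's compiler documentation describes partitioning. The function name is illustrative.

```python
# Supported set copied from the list in this section; not exhaustive.
EDGE_TPU_OPS = {
    "Conv2D", "DepthwiseConv2D", "FullyConnected", "Add", "Subtract", "Multiply",
    "Maximum", "Minimum", "ReLU", "ReLU6", "Tanh", "Sigmoid", "Softmax", "Reshape",
    "Squeeze", "ExpandDims", "Transpose", "MaxPool2D", "AveragePool2D",
    "Concatenation", "Quantize", "Dequantize",
}


def partition_ops(ops):
    """Split an ordered op list into (edge_tpu_ops, cpu_ops).

    Assumption: the graph is partitioned once, at the first unsupported op.
    """
    for index, op in enumerate(ops):
        if op not in EDGE_TPU_OPS:
            return ops[:index], ops[index:]
    return list(ops), []


head = ["FullyConnected", "ReLU", "FullyConnected", "Softmax"]
with_lookup = ["Gather", "FullyConnected", "Softmax"]
print(partition_ops(head))         # every op stays on the Edge TPU
print(partition_ops(with_lookup))  # the leading Gather pushes the whole graph to CPU
```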
8. Model Selection Guide¶
Choosing the right SLM for Coral deployment depends on your task and accuracy requirements.
Recommended SLMs¶
| Model | Parameters | Hidden Size | Recommended For |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.2B | 2048 | Text classification, sentiment analysis |
| Qwen3-4B | 4B | 2560 | More complex classification, topic categorisation |
| Phi-4-mini | 3.8B | 3072 | Reasoning tasks, multi-class classification |
| Gemma-2-2B | 2.6B | 2304 | General-purpose classification |
Deployment Strategies¶
Text Classification Head Extraction (Recommended)¶
This is the recommended approach for deploying SLMs on Coral. The full SLM runs on the host (CPU/GPU) to compute embeddings, and only the lightweight classification head runs on the Coral Edge TPU.
Workflow:
- Fine-tune the full SLM with LoRA on your classification dataset
- Extract the classification head (typically 2-3 linear layers)
- Export the head to ONNX
- Compile for Coral with INT8 quantisation
- At inference: compute embeddings on host, classify on Coral
Size estimate:
| Hidden Size | Classes | INT8 Size (approx.) | Fits in 8 MB? |
|---|---|---|---|
| 2048 | 2 | 4 KB | Yes |
| 2048 | 10 | 20 KB | Yes |
| 2560 | 100 | 256 KB | Yes |
| 3072 | 1000 | 3 MB | Yes |
Embedding Model Extraction¶
Extract the embedding layer (without the full transformer stack) for tasks like semantic similarity or retrieval. Note that the embedding table for most SLMs is too large for the Edge TPU cache, so embedding lookups run on CPU with post-embedding projection on the Edge TPU.
Size estimate:
| Model | Vocab Size | Hidden Size | Embedding Table (INT8) | Fits? |
|---|---|---|---|---|
| Llama 3.2 1B | 128K | 2048 | ~256 MB | No |
| Qwen3-4B | 152K | 2560 | ~390 MB | No |
For embedding models, use the hybrid inference pattern (Section 9.4).
Full Model Deployment (Not Recommended)¶
Only practical for models under ~8 MB when quantised. This limits you to very small models with short sequence lengths. Not recommended for SLMs.
9. Advanced Usage¶
9.1 Knowledge Distillation for Edge Deployment¶
Use ModelDistiller to create a smaller student model that mimics the
behaviour of the fine-tuned SLM (teacher). The student model can then be
deployed on the Coral Edge TPU.
from sloth_integration import ModelDistiller, SlothAdapter, SlothConverter
import edgecompiler
# Load the fine-tuned teacher model
teacher = SlothAdapter("sloth_output/imdb_sentiment")
# Create a distiller
distiller = ModelDistiller(
teacher_model=teacher,
student_hidden_sizes=[512, 256, 128], # Progressively smaller layers
num_labels=2,
temperature=4.0, # Softmax temperature for soft labels
alpha=0.7, # Weight for distillation loss vs hard label loss
)
# Train the student model
distiller.train(
dataset_name="imdb",
text_field="text",
label_field="label",
num_epochs=5,
batch_size=16,
learning_rate=1e-3,
)
# Export and compile the student model
student_path = distiller.save_student("sloth_output/distilled_student")
converter = SlothConverter(student_path)
onnx_path = converter.export_head(format="onnx")
compiled_path = edgecompiler.compile(onnx_path, target="coral", quantize=True)
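The temperature and alpha parameters above correspond to the two terms of a standard knowledge-distillation loss. The following sketch uses the common formulation, alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy, in pure Python. Whether ModelDistiller uses exactly this form is an assumption, since its loss is not shown in this guide.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives softer probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, hard_label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-label term: how far the student's softened distribution is from the teacher's.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature**2 * kl + (1 - alpha) * hard


same = distillation_loss([2.0, -1.0], [2.0, -1.0], hard_label=0)
opposite = distillation_loss([-1.0, 2.0], [2.0, -1.0], hard_label=0)
print(f"agreeing student: {same:.4f}, disagreeing student: {opposite:.4f}")
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted hard-label term remains.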
9.2 Pruning Models to Fit Edge TPU¶
If a model is slightly too large for the Edge TPU cache, you can prune less important weights before quantisation:
from sloth_integration import SlothConverter
converter = SlothConverter("sloth_output/imdb_sentiment")
# Export with pruning
onnx_path = converter.export_head(
format="onnx",
prune_ratio=0.3, # Remove 30% of least important weights
prune_method="magnitude", # Prune by weight magnitude
)
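As an illustration of what prune_method="magnitude" selects, the sketch below zeroes the fraction of weights with the smallest absolute value in a flat list. It is a stand-alone example of magnitude pruning, not the code path used by SlothConverter, and the function name is hypothetical.

```python
def magnitude_prune(weights, prune_ratio):
    """Zero the prune_ratio fraction of weights with the smallest absolute value."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be between 0 and 1")
    num_to_prune = round(len(weights) * prune_ratio)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.1, -0.5, 0.05, 0.9, -0.2, 0.3, 0.7, -0.01, 0.4, 0.6]
print(magnitude_prune(weights, 0.3))
```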
9.3 Calibration Data Preparation for PTQ¶
Accurate calibration data is critical for INT8 quantisation quality. The calibration samples should represent the distribution of embeddings that the classification head will receive at inference time.
from sloth_integration import SlothQuantizer
quantizer = SlothQuantizer(
checkpoint_path="sloth_output/imdb_sentiment",
)
# Generate calibration samples from your training data
cal_path = quantizer.prepare_calibration_data(
dataset_name="imdb",
text_field="text",
num_samples=100,
output_path="calibration_samples.npz",
)
# Use the calibration data during compilation
compiled_path = edgecompiler.compile(
"sentiment_head.onnx",
target="coral",
quantize=True,
calibration_data=cal_path,
)
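To see why representative calibration data matters, consider how a simple min/max calibrator turns observed values into INT8 parameters. The sketch below uses the usual affine scheme (scale and zero-point from the observed range); edgecompiler's actual calibration method is not described in this guide, so treat this as an illustration of the principle rather than its implementation.

```python
def affine_qparams(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max seen in calibration samples."""
    lo = min(min(row) for row in samples)
    hi = max(max(row) for row in samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the representable range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(value, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to INT8, saturating at the ends of the calibrated range."""
    return max(qmin, min(qmax, round(value / scale) + zero_point))


calibration = [[-1.0, 0.5, 2.0], [0.25, 3.0, -0.75]]  # observed range is [-1, 3]
scale, zero_point = affine_qparams(calibration)
print(f"scale={scale:.5f}, zero_point={zero_point}")
print(quantize(3.0, scale, zero_point), quantize(5.0, scale, zero_point))
```

A value of 5.0 lies outside the calibrated range and saturates to the same INT8 code as 3.0, so inputs that the calibration set never covered lose information at inference time.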
9.4 Hybrid Inference (Embeddings on Host, Classifier on Coral)¶
This is the recommended pattern for SLM deployment. The full model runs on the host (MacBook M1 Pro) to generate embeddings, and the lightweight classification head runs on the Coral Edge TPU for fast, low-power inference.
from sloth_integration import SlothAdapter, SlothCoralRuntime
import numpy as np
# Load the full model on host for embedding computation
adapter = SlothAdapter("sloth_output/imdb_sentiment")
# Load the compiled classification head on Coral
coral = SlothCoralRuntime(
model_path="sentiment_head_edgetpu.tflite",
labels=["negative", "positive"],
)
# Hybrid inference pipeline
def classify_text_hybrid(text: str) -> dict:
    # Step 1: Compute embedding on host (using MPS/GPU on M1 Pro)
    embedding = adapter.get_embedding(text)  # np.ndarray of shape [1, hidden_size]
    # Step 2: Run classification on Coral Edge TPU
    result = coral.classify_embedding(embedding)
    return result
# Use the pipeline
result = classify_text_hybrid("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Total latency: {result['latency_ms']:.1f} ms")
The hybrid approach combines:
- Host GPU (MPS): Best for transformer attention, embedding lookups, dynamic shapes, and variable-length sequences
- Coral Edge TPU: Best for dense INT8 matrix multiplication (the classification head), with sub-millisecond latency for small layers
10. Integration Architecture¶
Import Relationship Diagram¶
sloth_integration
|
+-- SlothAdapter
| |-- uses: unsloth.FastLanguageModel
| |-- uses: transformers.AutoModelForSequenceClassification
| |-- uses: peft.LoraConfig, get_peft_model
|
+-- SlothConverter
| |-- uses: torch.onnx.export
| |-- uses: onnx (optimization, validation)
| |-- uses: SlothAdapter (to load model weights)
|
+-- SlothQuantizer
| |-- uses: SlothAdapter (to generate calibration data)
| |-- uses: numpy (to save .npz files)
|
+-- SlothCoralRuntime
| |-- uses: edgecompiler.runtime.CoralUSBRuntime
| |-- uses: edgecompiler.runtime.CoralInferenceSession
| |-- uses: transformers.AutoTokenizer
|
+-- ModelDistiller
| |-- uses: SlothAdapter (teacher model)
| |-- uses: torch.nn.Module (student model)
|
+-- cli
|-- uses: SlothAdapter, SlothConverter, SlothQuantizer
|-- uses: edgecompiler.compile()
|-- uses: SlothCoralRuntime
Data Flow¶
1. Fine-tuning:
HF Dataset --> DataLoader --> FastLanguageModel (LoRA) --> Checkpoint
2. Export:
Checkpoint --> SlothConverter --> ONNX file (classification head)
3. Compilation:
ONNX file --> edgecompiler.convert_to_ir() --> IR Graph
--> edgecompiler.quantize_ptq() --> Quantised IR Graph
--> edgecompiler.compile_for_coral() --> *_edgetpu.tflite
4. Hybrid Inference:
Text --> Tokenizer --> FastLanguageModel (host GPU) --> Embedding vector
--> SlothCoralRuntime.classify_embedding() --> Coral Edge TPU --> Label
5. Standalone Coral Inference:
Embedding vector --> CoralUSBRuntime.infer() --> InferenceResult --> Label
11. Troubleshooting¶
"unsloth not found"¶
Symptoms: ModuleNotFoundError: No module named 'unsloth'
Solution: Install unsloth, for example by enabling the sloth integration extras described in Installation, Step 2.
"edgecompiler not found"¶
Symptoms: ModuleNotFoundError: No module named 'edgecompiler'
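Solution: Install edgecompiler from the repository as described in Installation, Step 1 (run pip install -e ".[dev]" from the cloned edgecompiler directory).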
"libedgetpu not found"¶
Symptoms: OSError: dlopen(libedgetpu.1.dylib, ...): Library not loaded
Solutions:
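- Verify that the dylib exists at /usr/local/lib/libedgetpu.1.dylib
- Set DYLD_LIBRARY_PATH so that it includes /usr/local/lib (see Option B in Installation, Step 3)
- Run the setup script (Option A in Installation, Step 3)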
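The first check can also be scripted. The following standard-library sketch looks for the dylib in a couple of candidate directories and prints the machine architecture; /usr/local/lib is the install location used in this guide, while /opt/homebrew/lib and the function name `locate_libedgetpu` are assumptions added for illustration (edgecompiler's own find_libedgetpu may search differently).

```python
import platform
from pathlib import Path

# /usr/local/lib is used by this guide; /opt/homebrew/lib is an assumed extra location.
CANDIDATE_DIRS = ["/usr/local/lib", "/opt/homebrew/lib"]


def locate_libedgetpu(dirs=CANDIDATE_DIRS, name="libedgetpu.1.dylib"):
    """Return the first existing path to the dylib, or None if it is absent."""
    for directory in dirs:
        candidate = Path(directory) / name
        if candidate.exists():
            return candidate
    return None


path = locate_libedgetpu()
print(f"machine: {platform.machine()}")
print(f"libedgetpu: {path if path else 'not found'}")
```

If the file is found but loading still fails, the architecture mismatch case described in the next entry is the likely cause.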
"Architecture mismatch (x86_64 dylib on ARM64)"¶
Symptoms: OSError: mach-o, but not compatible architecture
Solution: Build and install the ARM64 version:
file /usr/local/lib/libedgetpu.1.dylib
# If it says x86_64, rebuild:
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu && CPU=darwin_arm64 make
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
"No Coral USB device detected"¶
Solutions:
- Check the physical connection (LED should be lit)
- Verify with system_profiler SPUSBDataType | grep -i coral
- Avoid unpowered USB-C hubs; connect the accelerator to a direct port on the Mac (with a USB-C to USB-A adapter if your cable requires one)
- Re-plug the device after installing libedgetpu
"Model exceeds Edge TPU cache"¶
Symptoms: Model runs but with high latency (30+ ms for small models)
Diagnosis: The model is too large for the 8 MB on-chip cache, forcing parameter streaming from DRAM.
Solutions:
- Use classification head extraction instead of the full model
- Reduce hidden_size via distillation
- Apply pruning with SlothConverter.export_head(prune_ratio=0.3)
- Use the hybrid inference pattern (embeddings on host, classifier on Coral)
"Quantisation accuracy is poor"¶
Symptoms: The compiled model produces incorrect classifications
Solutions:
- Provide calibration data:
cal_path = quantizer.prepare_calibration_data(num_samples=200)
edgecompiler.compile(model, target="coral", calibration_data=cal_path)
- Increase the number of calibration samples (100-500 recommended)
- Enable per-channel quantisation: per_channel=True
- Try symmetric quantisation: symmetric=True
"Operations falling back to CPU"¶
Symptoms: Warning during compilation, higher than expected latency
Diagnosis: Some operations in the model are not supported by the Edge TPU and must run on the host CPU.
Solutions:
- Check which ops are falling back:
- Restructure the model to avoid unsupported operations
- Use classification head extraction (avoids attention ops)
- Use the --strict flag to fail on unsupported ops rather than falling back
12. API Reference¶
SlothAdapter¶
class SlothAdapter:
    """Wraps unsloth FastLanguageModel for fine-tuning and inference."""
    def __init__(
        self,
        model_name: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = True,
        dtype: str = "auto",
    ): ...
    def apply_lora(
        self,
        r: int = 16,
        lora_alpha: int = 16,
        lora_dropout: float = 0,
        target_modules: list[str] | None = None,
    ): ...
    def finetune_classification(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_labels: int = 2,
        num_epochs: int = 1,
        batch_size: int = 2,
        learning_rate: float = 2e-4,
        warmup_steps: int = 10,
        output_dir: str = "sloth_output",
    ): ...
    def get_embedding(self, text: str) -> np.ndarray: ...
    def save(self, path: str) -> None: ...
    def estimate_model_size_mb(self) -> float: ...
SlothConverter¶
class SlothConverter:
    """Exports fine-tuned models to ONNX/TFLite for edge deployment."""
    def __init__(
        self,
        checkpoint_path: str,
        tokenizer_path: str | None = None,
    ): ...
    def export_head(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        opset: int = 14,
        optimize: bool = True,
        prune_ratio: float = 0.0,
        prune_method: str = "magnitude",
    ) -> str: ...
    def export_embeddings(
        self,
        output_path: str | None = None,
        format: str = "onnx",
    ) -> str: ...
    def export_full(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        max_seq_length: int = 128,
    ) -> str: ...
    def validate_onnx(self, onnx_path: str) -> bool: ...
SlothQuantizer¶
class SlothQuantizer:
    """Prepares calibration data and configures INT8 quantisation."""
    def __init__(
        self,
        checkpoint_path: str,
    ): ...
    def prepare_calibration_data(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        num_samples: int = 100,
        output_path: str = "calibration_samples.npz",
    ) -> str: ...
    def get_quantization_config(
        self,
        per_channel: bool = True,
        symmetric: bool = True,
    ) -> dict: ...
SlothCoralRuntime¶
class SlothCoralRuntime:
    """Runs text classification and embedding inference on Coral Edge TPU."""
    def __init__(
        self,
        model_path: str,
        labels: list[str] | None = None,
        libedgetpu_path: str | None = None,
    ): ...
    def classify_text(self, text: str, top_k: int = 5) -> dict: ...
    def classify_embedding(self, embedding: np.ndarray, top_k: int = 5) -> dict: ...
    def embed_text(self, text: str) -> dict: ...
    def classify_batch(self, texts: list[str]) -> list[dict]: ...
    def benchmark(self, num_runs: int = 100, warmup: int = 5) -> dict: ...
ModelDistiller¶
class ModelDistiller:
    """Distils a fine-tuned SLM into a smaller student model."""
    def __init__(
        self,
        teacher_model: SlothAdapter,
        student_hidden_sizes: list[int] = [512, 256],
        num_labels: int = 2,
        temperature: float = 4.0,
        alpha: float = 0.7,
    ): ...
    def train(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_epochs: int = 5,
        batch_size: int = 16,
        learning_rate: float = 1e-3,
    ): ...
    def save_student(self, path: str) -> str: ...
    def estimate_student_size_mb(self) -> float: ...