sloth-integration: Comprehensive Usage Guide¶

This document explains how to use sloth-integration with the existing edgecompiler repository and Google Coral USB Accelerator to deploy fine-tuned Small Language Models on edge hardware. It covers the complete workflow from fine-tuning with unsloth to running INT8 inference on a Coral Edge TPU, with specific instructions for MacBook M1 Pro.

1. Overview¶

sloth-integration bridges two projects:

unsloth provides FastLanguageModel, a wrapper around Hugging Face Transformers that speeds up LoRA/QLoRA fine-tuning (the project advertises roughly 2x). It handles 4-bit quantisation, gradient checkpointing, and optimised training loops.
edgecompiler provides a compiler toolchain for deploying ML models on edge hardware. Its compile() function converts models (ONNX, TFLite, PyTorch) to an intermediate representation, quantises them to INT8, and compiles for Google Coral Edge TPU or Apple Silicon Metal backends.

sloth-integration connects these tools with purpose-built components:

Component	Purpose
`SlothAdapter`	Wraps unsloth `FastLanguageModel` for fine-tuning and model management
`SlothConverter`	Exports fine-tuned models (classification head, embedding layer) to ONNX/TFLite
`SlothQuantizer`	Prepares calibration data and configures INT8 quantisation for Edge TPU
`SlothCoralRuntime`	Runs text classification and embedding inference on Coral USB Accelerator
`ModelDistiller`	Distils large SLMs into smaller models suitable for edge deployment

The typical workflow is:

Fine-tune with unsloth -> Export sub-model -> Compile with edgecompiler -> Run on Coral

2. Prerequisites¶

Hardware¶

Item	Requirement	Notes
MacBook	M1 Pro or later	M1 base model works but has fewer GPU cores
RAM	16 GB minimum	32 GB recommended for models larger than 1B parameters
Storage	20 GB free	For model weights, exports, and compiled artefacts
Coral USB Accelerator	Any revision	USB 3.0 model recommended for best throughput
USB connection	USB 3.0 (5 Gbps)	Use a direct port; avoid unpowered USB-C hubs

Software¶

Package	Version	Purpose
Python	3.10+	Runtime
unsloth	Latest	Fast fine-tuning with LoRA
edgecompiler	0.1.0+	Model compilation for Coral
PyTorch	2.1+	Backend for unsloth and model export
transformers	4.36+	Model loading and tokenisation
onnx	1.15+	ONNX export
onnxruntime	1.16+	ONNX model validation
tflite-runtime	2.14+	TFLite interpreter for Coral
numpy	1.24+	Array operations

Coral Edge TPU Runtime¶

You must install libedgetpu for your platform. On macOS Apple Silicon, the official Google build is x86_64 only, so use the community ARM64 build from feranick/libedgetpu. See Section 3 for detailed installation steps.

3. Installation¶

Step 1: Install edgecompiler¶

Clone and install the edgecompiler repository:

cd ~/projects
git clone https://github.com/rotsl/edgecompiler.git
cd edgecompiler
pip install -e ".[dev]"

Verify the installation:

python -c "import edgecompiler; print(edgecompiler.__version__)"
# Expected output: 0.1.0 or later

Step 2: Enable sloth integration extras (optional)¶

pip install -e ".[sloth]"

For ONNX and unsloth workflows:

pip install -e ".[sloth,sloth-onnx]"
pip install -e ".[sloth,sloth-unsloth]"

sloth_integration is packaged from sloth-integration/src/sloth_integration by the root pyproject.toml.

Step 3: Install Coral Edge TPU runtime (libedgetpu for ARM64 macOS)¶

The Google Coral Edge TPU runtime (libedgetpu) is required to offload inference to the Coral USB Accelerator. On macOS Apple Silicon, you need the ARM64 build.

Option A: Use the setup script (recommended)¶

bash scripts/setup_coral_runtime.sh

This script:

Detects your CPU architecture (arm64 or x86_64)
Downloads the appropriate libedgetpu build
Installs the dylib to /usr/local/lib/
Sets up DYLD_LIBRARY_PATH in your shell configuration
Verifies the installation

Option B: Manual installation¶

# Install build dependencies
brew install libusb cmake wget

# Clone and build for ARM64
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu
CPU=darwin_arm64 make

# Install the built library
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
sudo ln -sf /usr/local/lib/libedgetpu.1.dylib /usr/local/lib/libedgetpu.dylib
sudo update_dyld_shared_cache 2>/dev/null || true

# Set library path
echo 'export DYLD_LIBRARY_PATH="/usr/local/lib:${DYLD_LIBRARY_PATH:-}"' >> ~/.zshrc
source ~/.zshrc

Install tflite-runtime¶

pip install tflite-runtime

If tflite-runtime is not available for your Python version on ARM64 macOS, install a compatible wheel from feranick/TFlite-builds.

Step 4: Verify the complete setup¶

Run the verification script:

python -c "
import edgecompiler
print(f'edgecompiler: {edgecompiler.__version__}')

from edgecompiler.runtime.coral_usb import CoralUSBRuntime, find_libedgetpu
lib_path = find_libedgetpu()
print(f'libedgetpu: {lib_path}')

runtime = CoralUSBRuntime()
devices = runtime.detect_devices()
print(f'Coral devices: {devices}')

from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
print('sloth-integration: all imports OK')
"

Expected output:

edgecompiler: 0.1.0
libedgetpu: /usr/local/lib/libedgetpu.1.dylib
Coral devices: [CoralDevice(path=':0', type='USB', name='Coral USB Accelerator')]
sloth-integration: all imports OK

If Coral devices is empty, check that the USB Accelerator is plugged in and the LED is lit. See the Troubleshooting section for help.

4. Quick Start¶

This minimal example fine-tunes a sentiment classifier on the IMDB dataset, exports the classification head, compiles it for Coral, and runs inference.

"""Quick start: from fine-tuning to Coral inference in one short script."""

from sloth_integration import SlothAdapter, SlothConverter, SlothCoralRuntime
import edgecompiler

# 1. Fine-tune a 1B model on IMDB sentiment
adapter = SlothAdapter("unsloth/Llama-3.2-1B-Instruct", max_seq_length=2048)
adapter.finetune_classification(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_labels=2,
    num_epochs=1,
)
adapter.save("sloth_output/imdb_sentiment")

# 2. Export the classification head to ONNX
converter = SlothConverter("sloth_output/imdb_sentiment")
onnx_path = converter.export_head(
    output_path="imdb_sentiment_head.onnx",
    format="onnx",
)

# 3. Compile for Coral Edge TPU
compiled_path = edgecompiler.compile(
    onnx_path,
    target="coral",
    quantize=True,
)
print(f"Compiled model: {compiled_path}")

# 4. Run inference on Coral USB
runtime = SlothCoralRuntime(compiled_path)
result = runtime.classify_text("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Latency: {result['latency_ms']:.1f} ms on Edge TPU")

5. Step-by-Step Workflow¶

Step 1: Fine-tune an SLM with unsloth¶

The SlothAdapter wraps unsloth's FastLanguageModel and provides methods for LoRA fine-tuning on classification and embedding tasks.

from sloth_integration import SlothAdapter

# Load a pre-trained model with 4-bit quantisation
adapter = SlothAdapter(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype="auto",
)

# Apply LoRA adapters
adapter.apply_lora(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Fine-tune for text classification
adapter.finetune_classification(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_labels=2,
    num_epochs=1,
    batch_size=2,
    learning_rate=2e-4,
    warmup_steps=10,
    output_dir="sloth_output/imdb_sentiment",
)

# Save the fine-tuned checkpoint
adapter.save("sloth_output/imdb_sentiment")

You can also fine-tune on a custom dataset:

from datasets import load_dataset

# Load a custom CSV/JSON dataset
dataset = load_dataset("csv", data_files="my_data.csv")

adapter.finetune_classification(
    dataset=dataset,
    text_field="review_text",
    label_field="sentiment",
    num_labels=3,  # negative, neutral, positive
    num_epochs=3,
    output_dir="sloth_output/custom_classifier",
)

Step 2: Export the fine-tuned model for Coral¶

Edge TPU models should be small (ideally under ~8 MB, so that all parameters fit in the on-chip cache) and consist of operations that the Edge TPU supports. Full transformer models are far too large. The recommended approach is to export only the classification head: a small feed-forward network that takes pre-computed embeddings as input.

from sloth_integration import SlothConverter

converter = SlothConverter(
    checkpoint_path="sloth_output/imdb_sentiment",
    tokenizer_path="sloth_output/imdb_sentiment",
)

# Export the classification head (recommended for SLMs)
onnx_path = converter.export_head(
    output_path="imdb_sentiment_head.onnx",
    format="onnx",
    opset=14,
    optimize=True,
    # The head expects embeddings of shape [batch_size, hidden_size]
    # hidden_size is auto-detected from the model config
)
print(f"Exported ONNX model: {onnx_path}")

# Alternative: export the embedding model (for hybrid inference)
embedding_onnx_path = converter.export_embeddings(
    output_path="imdb_embedding_model.onnx",
    format="onnx",
)

For very small models (under 8 MB when quantised), you can attempt to export the full model:

full_onnx_path = converter.export_full(
    output_path="full_model.onnx",
    format="onnx",
    max_seq_length=128,  # Truncate to reduce model size
)

Note: Full model export is only practical for models with fewer than ~30M parameters and short sequence lengths. Most SLMs (even 1B parameter models) are too large. Use the classification head or embedding extraction approach instead.

Step 3: Compile with edgecompiler for Edge TPU¶

Pass the exported ONNX model to edgecompiler.compile() with target="coral":

import edgecompiler

# One-call compilation with automatic INT8 quantisation
compiled_path = edgecompiler.compile(
    "imdb_sentiment_head.onnx",
    target="coral",
    quantize=True,
    # Optional: provide calibration data for better quantisation accuracy
    # calibration_data="calibration_samples.npz",
)
print(f"Compiled Edge TPU model: {compiled_path}")
# Output: imdb_sentiment_head_edgetpu.tflite

The compilation pipeline performs these steps internally:

Convert the ONNX model to edgecompiler's IR (Intermediate Representation)
Quantise the IR graph to INT8 using post-training quantisation (PTQ)
Compile the quantised graph for the Coral Edge TPU backend
Write the compiled *_edgetpu.tflite file

For better quantisation accuracy, provide calibration data:

import numpy as np

# Collect calibration samples (embeddings from the SLM) that are
# representative of the data the model will see at inference time.
# Random values are used here only as a placeholder for real embeddings.
cal_samples = np.random.randn(100, 2048).astype(np.float32)  # 100 samples, hidden_size=2048
np.savez("calibration_samples.npz", embeddings=cal_samples)

compiled_path = edgecompiler.compile(
    "imdb_sentiment_head.onnx",
    target="coral",
    quantize=True,
    calibration_data="calibration_samples.npz",
    per_channel=True,
    symmetric=True,
)

Step 4: Run inference on Coral USB¶

Use SlothCoralRuntime for high-level text inference:

from sloth_integration import SlothCoralRuntime

# Initialize with the compiled Edge TPU model
runtime = SlothCoralRuntime(
    model_path="imdb_sentiment_head_edgetpu.tflite",
    labels=["negative", "positive"],
)

# Classify a single text
result = runtime.classify_text("This movie was absolutely fantastic!")
print(result)
# {'label': 'positive', 'confidence': 0.94, 'scores': [0.06, 0.94], 'latency_ms': 3.2}

# Classify multiple texts
texts = [
    "Terrible waste of time.",
    "One of the best films I have ever seen.",
    "It was okay, nothing special.",
]
results = runtime.classify_batch(texts)
for text, result in zip(texts, results):
    print(f"  {text!r} -> {result['label']} ({result['confidence']:.2f})")
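
The result dictionaries above carry a label, a confidence, and per-class scores. The guide does not show how SlothCoralRuntime derives them. As a minimal sketch, assuming the head emits one raw score per class and that a softmax is applied afterwards, a dictionary of that shape could be assembled as follows. `build_result` and the `top_k` key are hypothetical names used only for illustration.

```python
import math


def build_result(logits: list[float], labels: list[str], top_k: int = 5) -> dict:
    """Turn raw class scores into a label, a confidence, and a top-k list."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Rank (label, score) pairs from most to least likely.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    best_label, best_score = ranked[0]
    return {
        "label": best_label,
        "confidence": best_score,
        "scores": scores,
        "top_k": ranked[:top_k],  # hypothetical key, not shown in the guide
    }


result = build_result([-1.2, 1.5], labels=["negative", "positive"])
print(result["label"], round(result["confidence"], 2))
```

With only two labels, a `top_k` of 5 simply returns both classes.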

For embedding inference:

runtime = SlothCoralRuntime("embedding_model_edgetpu.tflite")
embedding = runtime.embed_text("Hello, world!")
print(f"Embedding shape: {embedding['vector'].shape}")
print(f"Latency: {embedding['latency_ms']:.1f} ms")

6. Using the CLI¶

sloth-integration provides a sloth CLI tool with the following subcommands:

`sloth finetune`¶

Fine-tune a model on a classification dataset.

sloth finetune \
    --model unsloth/Llama-3.2-1B-Instruct \
    --dataset imdb \
    --text-field text \
    --label-field label \
    --num-labels 2 \
    --epochs 1 \
    --output-dir sloth_output/imdb

Options:

Option	Default	Description
`--model`	Required	Hugging Face model name or path
`--dataset`	Required	Dataset name (Hugging Face) or path to CSV/JSON
`--text-field`	`text`	Name of the text column
`--label-field`	`label`	Name of the label column
`--num-labels`	`2`	Number of classification classes
`--epochs`	`1`	Number of training epochs
`--batch-size`	`2`	Per-device batch size
`--learning-rate`	`2e-4`	Learning rate
`--max-seq-length`	`2048`	Maximum sequence length
`--lora-r`	`16`	LoRA rank
`--output-dir`	`sloth_output`	Output directory

`sloth export`¶

Export a fine-tuned model checkpoint to ONNX or TFLite.

# Export classification head (recommended)
sloth export \
    --checkpoint sloth_output/imdb \
    --component head \
    --format onnx \
    --output sentiment_head.onnx

# Export embedding model
sloth export \
    --checkpoint sloth_output/imdb \
    --component embeddings \
    --format onnx \
    --output embedding_model.onnx

Options:

Option	Default	Description
`--checkpoint`	Required	Path to the fine-tuned checkpoint
`--component`	`head`	Which part to export: `head`, `embeddings`, or `full`
`--format`	`onnx`	Export format: `onnx` or `tflite`
`--opset`	`14`	ONNX opset version
`--output`	Auto	Output file path
`--optimize`	`True`	Apply ONNX optimisations

`sloth compile`¶

Compile an exported model for Coral Edge TPU.

sloth compile \
    --model sentiment_head.onnx \
    --target coral \
    --quantize \
    --calibration-data calib.npz \
    --output sentiment_head_edgetpu.tflite

Options:

Option	Default	Description
`--model`	Required	Path to the ONNX/TFLite model
`--target`	`coral`	Target backend: `coral` or `metal`
`--quantize`	`True`	Apply INT8 quantisation
`--calibration-data`	None	Path to calibration data (.npz)
`--per-channel`	`True`	Use per-channel quantisation
`--symmetric`	`True`	Use symmetric quantisation
`--output`	Auto	Output path for compiled model

`sloth infer`¶

Run inference on a compiled model.

# Text classification
sloth infer \
    --model sentiment_head_edgetpu.tflite \
    --text "This product is amazing!" \
    --labels negative,positive

# With input from file
sloth infer \
    --model sentiment_head_edgetpu.tflite \
    --input-file reviews.txt \
    --labels negative,positive \
    --batch

Options:

Option	Default	Description
`--model`	Required	Path to the compiled Edge TPU model
`--text`	None	Single text to classify
`--input-file`	None	File with one text per line
`--labels`	Auto	Comma-separated label names
`--batch`	`False`	Process input file in batches
`--top-k`	`5`	Number of top predictions to return

`sloth benchmark`¶

Benchmark a compiled model on Coral Edge TPU.

sloth benchmark \
    --model sentiment_head_edgetpu.tflite \
    --iterations 100 \
    --warmup 5
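
The guide does not list what `sloth benchmark` reports. As a rough sketch of what a warmup-plus-iterations benchmark typically measures, the helper below times any callable. `benchmark_callable` and `fake_infer` are hypothetical names, not part of sloth-integration, and the real command may compute different statistics.

```python
import statistics
import time
from typing import Callable


def benchmark_callable(fn: Callable[[], object], iterations: int = 100, warmup: int = 5) -> dict:
    """Time `fn` after discarding warmup runs; returns latency stats in ms."""
    for _ in range(warmup):
        fn()  # warmup runs are not recorded; first calls often include setup cost

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "iterations": iterations,
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
    }


def fake_infer() -> int:
    """Stand-in workload; replace with a real inference call."""
    return sum(range(1000))


stats = benchmark_callable(fake_infer, iterations=100, warmup=5)
print(stats)
```

Reporting a high percentile alongside the mean helps expose occasional slow runs, for example when parameters have to be streamed from the host.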

`sloth info`¶

Display information about a model and its Edge TPU compatibility.

sloth info --model sentiment_head.onnx

7. Understanding Edge TPU Constraints¶

The Google Coral Edge TPU is a specialised INT8 inference accelerator with strict constraints. Understanding these constraints is essential for successfully deploying SLM components on the device.

8 MB On-Chip Cache¶

The Edge TPU has 8 MB of on-chip SRAM that stores model parameters during inference. If a model exceeds this cache, parameters must be streamed from host DRAM, which significantly increases latency.

Implications for sloth-integration:

A classification head with hidden_size=2048 and 2 classes requires:
    Weights: 2048 x 2 x 1 byte (INT8) = ~4 KB
    Bias: 2 x 4 bytes (INT32) = 8 bytes
    Total: ~4 KB, which easily fits in cache
An embedding model for a 1B parameter SLM with hidden_size=2048:
    Embedding table: vocab_size x 2048 x 1 byte = ~256 MB for a 128K vocabulary
    This far exceeds the 8 MB cache; embedding lookups must run on the host
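
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one byte per INT8 weight, four bytes per INT32 bias, a vocabulary of 128 x 1024 tokens, and it ignores any metadata the compiled file carries. The function names are illustrative.

```python
EDGE_TPU_CACHE_BYTES = 8 * 1024 * 1024  # ~8 MB of on-chip SRAM


def linear_int8_bytes(in_features: int, out_features: int) -> int:
    """INT8 weights (1 byte each) plus INT32 biases (4 bytes each)."""
    return in_features * out_features + 4 * out_features


def embedding_int8_bytes(vocab_size: int, hidden_size: int) -> int:
    """INT8 embedding table: one byte per entry."""
    return vocab_size * hidden_size


head = linear_int8_bytes(2048, 2)
table = embedding_int8_bytes(128 * 1024, 2048)

print(f"head:  {head} bytes, fits in cache: {head <= EDGE_TPU_CACHE_BYTES}")
print(f"table: {table / 2**20:.0f} MiB, fits in cache: {table <= EDGE_TPU_CACHE_BYTES}")
```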

INT8 Only¶

The Edge TPU only executes INT8 operations. All float32 tensors must be quantised before compilation. edgecompiler handles this automatically when quantize=True, but you should be aware of the accuracy implications.

Mitigation strategies:

Use calibration data that is representative of your inference workload
Enable per-channel quantisation (per_channel=True) for weight tensors
Use symmetric quantisation (symmetric=True) for Edge TPU compatibility
Evaluate quantised model accuracy before deployment
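
To illustrate why per-channel quantisation can help, the sketch below quantises a small-magnitude output channel twice: once with a single scale shared across the whole weight matrix, and once with a scale of its own. It is a toy example in plain Python with made-up weights, not the scheme edgecompiler necessarily implements.

```python
def roundtrip_error(row: list[float], scale: float) -> float:
    """Largest |w - dequant(quant(w))| for symmetric INT8 with the given scale."""
    worst = 0.0
    for value in row:
        q = max(-127, min(127, round(value / scale)))  # zero-point is 0
        worst = max(worst, abs(value - q * scale))
    return worst


# Two output channels with very different weight ranges (illustrative values).
weights = [
    [0.010, -0.020, 0.015],  # small-magnitude channel
    [1.500, -2.000, 0.750],  # large-magnitude channel
]
small_channel = weights[0]

# Per-tensor: one scale from the largest |w| in the whole matrix.
per_tensor_scale = max(abs(v) for row in weights for v in row) / 127
# Per-channel: the small channel gets a scale from its own largest |w|.
per_channel_scale = max(abs(v) for v in small_channel) / 127

per_tensor_err = roundtrip_error(small_channel, per_tensor_scale)
per_channel_err = roundtrip_error(small_channel, per_channel_scale)

print(f"per-tensor  error on small channel: {per_tensor_err:.6f}")
print(f"per-channel error on small channel: {per_channel_err:.6f}")
```

With a shared scale, the large channel dictates the step size, so the small channel's weights collapse onto one or two integer levels. A per-channel scale keeps the rounding error proportional to that channel's own range.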

Supported Operations¶

The Edge TPU supports a specific set of TFLite operations. Operations not supported on the Edge TPU fall back to CPU execution, which increases latency.

Common Edge TPU supported operations:

Conv2D, DepthwiseConv2D
FullyConnected (dense layers)
Add, Subtract, Multiply, Maximum, Minimum
ReLU, ReLU6, Tanh, Sigmoid
Softmax
Reshape, Squeeze, ExpandDims, Transpose
MaxPool2D, AveragePool2D
Concatenation
Quantize, Dequantize

Operations NOT supported on Edge TPU (fall back to CPU):

LSTM, GRU, RNN
Attention mechanisms (self-attention, cross-attention)
Gather, EmbeddingLookup (large tables)
String operations
Dynamic shapes

This is why full transformer models cannot run entirely on the Edge TPU -- the attention and embedding layers are unsupported. The classification head approach works because it uses only FullyConnected and activation operations.
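
The effect of a single unsupported operation can be sketched with the op names listed above. The example assumes the partitioning behaviour that Coral's compiler documentation describes, where everything from the first unsupported op onwards stays on the CPU. Whether edgecompiler follows the same rule is an assumption, and real graphs are not simple linear lists, so treat this as an illustration only.

```python
EDGE_TPU_OPS = {
    "Conv2D", "DepthwiseConv2D", "FullyConnected",
    "Add", "Subtract", "Multiply", "Maximum", "Minimum",
    "ReLU", "ReLU6", "Tanh", "Sigmoid", "Softmax",
    "Reshape", "Squeeze", "ExpandDims", "Transpose",
    "MaxPool2D", "AveragePool2D", "Concatenation",
    "Quantize", "Dequantize",
}


def split_at_first_unsupported(ops: list[str]) -> tuple[list[str], list[str]]:
    """Return (ops mapped to the Edge TPU, ops left on the CPU)."""
    for index, op in enumerate(ops):
        if op not in EDGE_TPU_OPS:
            # Everything from the first unsupported op onwards stays on the CPU.
            return ops[:index], ops[index:]
    return list(ops), []


head_ops = ["FullyConnected", "ReLU", "FullyConnected", "Softmax"]
encoder_ops = ["Gather", "FullyConnected", "Softmax", "FullyConnected"]

print(split_at_first_unsupported(head_ops))
print(split_at_first_unsupported(encoder_ops))
```

Under this assumption, a model that begins with an embedding lookup (Gather) gets no benefit from the accelerator even if every later op is supported, which is another reason to export the head on its own.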

8. Model Selection Guide¶

Choosing the right SLM for Coral deployment depends on your task and accuracy requirements.

Recommended SLMs¶

Model	Parameters	Hidden Size	Recommended For
Llama 3.2 1B Instruct	1.2B	2048	Text classification, sentiment analysis
Qwen3-4B	4B	2560	More complex classification, topic categorisation
Phi-4-mini	3.8B	3072	Reasoning tasks, multi-class classification
Gemma-2-2B	2.6B	2304	General-purpose classification

Deployment Strategies¶

Text Classification Head Extraction (Recommended)¶

This is the recommended approach for deploying SLMs on Coral. The full SLM runs on the host (CPU/GPU) to compute embeddings, and only the lightweight classification head runs on the Coral Edge TPU.

Workflow:

Fine-tune the full SLM with LoRA on your classification dataset
Extract the classification head (typically 2-3 linear layers)
Export the head to ONNX
Compile for Coral with INT8 quantisation
At inference: compute embeddings on host, classify on Coral

Size estimate:

Hidden Size	Classes	INT8 Size (approx.)	Fits in 8 MB?
2048	2	4 KB	Yes
2048	10	20 KB	Yes
2560	100	256 KB	Yes
3072	1000	3 MB	Yes

Embedding Model Extraction¶

Extract the embedding layer (without the full transformer stack) for tasks like semantic similarity or retrieval. Note that the embedding table for most SLMs is too large for the Edge TPU cache, so embedding lookups run on CPU with post-embedding projection on the Edge TPU.

Size estimate:

Model	Vocab Size	Hidden Size	Embedding Table (INT8)	Fits?
Llama 3.2 1B	128K	2048	~256 MB	No
Qwen3-4B	152K	2560	~390 MB	No

For embedding models, use the hybrid inference pattern (Section 9.4).

Full Model Deployment (Not Recommended)¶

Only practical for models under ~8 MB when quantised. This limits you to very small models with short sequence lengths. Not recommended for SLMs.

9. Advanced Usage¶

9.1 Knowledge Distillation for Edge Deployment¶

Use ModelDistiller to create a smaller student model that mimics the behaviour of the fine-tuned SLM (teacher). The student model can then be deployed on the Coral Edge TPU.

from sloth_integration import ModelDistiller, SlothAdapter, SlothConverter
import edgecompiler

# Load the fine-tuned teacher model
teacher = SlothAdapter("sloth_output/imdb_sentiment")

# Create a distiller
distiller = ModelDistiller(
    teacher_model=teacher,
    student_hidden_sizes=[512, 256, 128],  # Progressively smaller layers
    num_labels=2,
    temperature=4.0,       # Softmax temperature for soft labels
    alpha=0.7,             # Weight for distillation loss vs hard label loss
)

# Train the student model
distiller.train(
    dataset_name="imdb",
    text_field="text",
    label_field="label",
    num_epochs=5,
    batch_size=16,
    learning_rate=1e-3,
)

# Export and compile the student model
student_path = distiller.save_student("sloth_output/distilled_student")
converter = SlothConverter(student_path)
onnx_path = converter.export_head(format="onnx")
compiled_path = edgecompiler.compile(onnx_path, target="coral", quantize=True)
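
The `temperature` and `alpha` parameters suggest the usual soft-label distillation objective. The guide does not spell out the loss that ModelDistiller uses, so the following is a sketch of the standard formulation under that assumption: a temperature-softened KL term weighted by `alpha`, plus a hard-label cross-entropy term weighted by `1 - alpha`.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard-label term uses the student's unsoftened probabilities.
    hard_probs = softmax(student_logits)
    cross_entropy = -math.log(hard_probs[true_label])

    # T^2 keeps the soft-label gradients on a comparable scale across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy


matched = distillation_loss([2.0, -1.0], [2.0, -1.0], true_label=0)
mismatched = distillation_loss([-1.0, 2.0], [2.0, -1.0], true_label=0)
print(f"student agrees with teacher:    {matched:.4f}")
print(f"student disagrees with teacher: {mismatched:.4f}")
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the less likely classes.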

9.2 Pruning Models to Fit Edge TPU¶

If a model is slightly too large for the Edge TPU cache, you can prune less important weights before quantisation:

from sloth_integration import SlothConverter

converter = SlothConverter("sloth_output/imdb_sentiment")

# Export with pruning
onnx_path = converter.export_head(
    format="onnx",
    prune_ratio=0.3,  # Remove 30% of least important weights
    prune_method="magnitude",  # Prune by weight magnitude
)
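
As an illustration of what `prune_method="magnitude"` usually means, the sketch below zeroes the fraction of weights with the smallest absolute values. It operates on a flat list of made-up weights and is not the converter's actual implementation. Whether zeroed weights reduce the compiled model's size depends on how the exporter stores them, which this guide does not specify.

```python
def magnitude_prune(weights: list[float], prune_ratio: float) -> list[float]:
    """Zero out the `prune_ratio` fraction of weights with the smallest |w|."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    # round() guards against float artefacts such as 2.9999999 for 10 * 0.3.
    num_to_prune = round(len(weights) * prune_ratio)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.5, -0.1, 0.2, -0.4]
pruned = magnitude_prune(weights, prune_ratio=0.3)
print(pruned)
```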

9.3 Calibration Data Preparation for PTQ¶

Accurate calibration data is critical for INT8 quantisation quality. The calibration samples should represent the distribution of embeddings that the classification head will receive at inference time.

import edgecompiler
from sloth_integration import SlothQuantizer

quantizer = SlothQuantizer(
    checkpoint_path="sloth_output/imdb_sentiment",
)

# Generate calibration samples from your training data
cal_path = quantizer.prepare_calibration_data(
    dataset_name="imdb",
    text_field="text",
    num_samples=100,
    output_path="calibration_samples.npz",
)

# Use the calibration data during compilation
compiled_path = edgecompiler.compile(
    "sentiment_head.onnx",
    target="coral",
    quantize=True,
    calibration_data=cal_path,
)
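
To see why the calibration set needs to cover the range seen at inference, consider a simplified symmetric scheme in which the scale is derived from the largest absolute value in the calibration samples. This is a toy sketch with made-up numbers, and the actual range estimation in edgecompiler may differ.

```python
def calibrate_scale(samples: list[list[float]]) -> float:
    """Symmetric INT8 scale from the largest |value| seen during calibration."""
    max_abs = max(abs(v) for sample in samples for v in sample)
    return max_abs / 127


def quantize_dequantize(value: float, scale: float) -> float:
    """Round-trip a float through INT8, clamping to [-127, 127]."""
    q = max(-127, min(127, round(value / scale)))
    return q * scale


# The first set covers the range seen at inference; the second is too narrow.
representative = [[-3.8, 0.4, 2.1], [1.0, -0.2, 4.0]]
too_narrow = [[-0.5, 0.1, 0.3], [0.2, -0.4, 0.5]]

inference_value = 3.5
for name, samples in [("representative", representative), ("too narrow", too_narrow)]:
    scale = calibrate_scale(samples)
    restored = quantize_dequantize(inference_value, scale)
    print(f"{name:>14}: scale={scale:.5f}, {inference_value} -> {restored:.3f}")
```

With the narrow set, any activation beyond the calibrated range is clipped to the largest representable value, which is one way unrepresentative calibration data leads to wrong classifications.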

9.4 Hybrid Inference (Embeddings on Host, Classifier on Coral)¶

This is the recommended pattern for SLM deployment. The full model runs on the host (MacBook M1 Pro) to generate embeddings, and the lightweight classification head runs on the Coral Edge TPU for fast, low-power inference.

from sloth_integration import SlothAdapter, SlothCoralRuntime
import numpy as np

# Load the full model on host for embedding computation
adapter = SlothAdapter("sloth_output/imdb_sentiment")

# Load the compiled classification head on Coral
coral = SlothCoralRuntime(
    model_path="sentiment_head_edgetpu.tflite",
    labels=["negative", "positive"],
)

# Hybrid inference pipeline
def classify_text_hybrid(text: str) -> dict:
    # Step 1: Compute embedding on host (using MPS/GPU on M1 Pro)
    embedding = adapter.get_embedding(text)  # np.ndarray of shape [1, hidden_size]

    # Step 2: Run classification on Coral Edge TPU
    result = coral.classify_embedding(embedding)

    return result

# Use the pipeline
result = classify_text_hybrid("This movie was absolutely fantastic!")
print(f"Label: {result['label']}, Confidence: {result['confidence']:.4f}")
print(f"Total latency: {result['latency_ms']:.1f} ms")

The hybrid approach combines:

Host GPU (MPS): Best for transformer attention, embedding lookups, dynamic shapes, and variable-length sequences
Coral Edge TPU: Best for dense INT8 matrix multiplication (the classification head), with sub-millisecond latency for small layers
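
When profiling a hybrid pipeline it helps to time the two stages separately, since the host-side embedding step may dominate. The sketch below does this with stand-in callables. `run_hybrid`, `fake_embed`, and `fake_classify` are hypothetical and would be replaced by the real host model and Coral head.

```python
import time
from typing import Callable


def run_hybrid(
    text: str,
    embed_on_host: Callable[[str], list[float]],
    classify_on_accelerator: Callable[[list[float]], str],
) -> dict:
    """Run both stages and report how long each one took, in milliseconds."""
    start = time.perf_counter()
    embedding = embed_on_host(text)
    after_embed = time.perf_counter()
    label = classify_on_accelerator(embedding)
    end = time.perf_counter()
    return {
        "label": label,
        "host_ms": (after_embed - start) * 1000.0,
        "accelerator_ms": (end - after_embed) * 1000.0,
        "total_ms": (end - start) * 1000.0,
    }


def fake_embed(text: str) -> list[float]:
    """Stand-in for the host-side model: two toy features."""
    return [float(len(text)), float(text.count("!"))]


def fake_classify(embedding: list[float]) -> str:
    """Stand-in for the accelerator-side classification head."""
    return "positive" if embedding[1] > 0 else "negative"


timing = run_hybrid("This movie was absolutely fantastic!", fake_embed, fake_classify)
print(timing)
```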

10. Integration Architecture¶

Import Relationship Diagram¶

sloth_integration
    |
    +-- SlothAdapter
    |       |-- uses: unsloth.FastLanguageModel
    |       |-- uses: transformers.AutoModelForSequenceClassification
    |       |-- uses: peft.LoraConfig, get_peft_model
    |
    +-- SlothConverter
    |       |-- uses: torch.onnx.export
    |       |-- uses: onnx (optimization, validation)
    |       |-- uses: SlothAdapter (to load model weights)
    |
    +-- SlothQuantizer
    |       |-- uses: SlothAdapter (to generate calibration data)
    |       |-- uses: numpy (to save .npz files)
    |
    +-- SlothCoralRuntime
    |       |-- uses: edgecompiler.runtime.CoralUSBRuntime
    |       |-- uses: edgecompiler.runtime.CoralInferenceSession
    |       |-- uses: transformers.AutoTokenizer
    |
    +-- ModelDistiller
    |       |-- uses: SlothAdapter (teacher model)
    |       |-- uses: torch.nn.Module (student model)
    |
    +-- cli
            |-- uses: SlothAdapter, SlothConverter, SlothQuantizer
            |-- uses: edgecompiler.compile()
            |-- uses: SlothCoralRuntime

Data Flow¶

1. Fine-tuning:
   HF Dataset --> DataLoader --> FastLanguageModel (LoRA) --> Checkpoint

2. Export:
   Checkpoint --> SlothConverter --> ONNX file (classification head)

3. Compilation:
   ONNX file --> edgecompiler.convert_to_ir() --> IR Graph
              --> edgecompiler.quantize_ptq() --> Quantised IR Graph
              --> edgecompiler.compile_for_coral() --> *_edgetpu.tflite

4. Hybrid Inference:
   Text --> Tokenizer --> FastLanguageModel (host GPU) --> Embedding vector
         --> SlothCoralRuntime.classify_embedding() --> Coral Edge TPU --> Label

5. Standalone Coral Inference:
   Embedding vector --> CoralUSBRuntime.infer() --> InferenceResult --> Label

11. Troubleshooting¶

"unsloth not found"¶

Symptoms: ModuleNotFoundError: No module named 'unsloth'

Solution: Install unsloth:

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

"edgecompiler not found"¶

Symptoms: ModuleNotFoundError: No module named 'edgecompiler'

Solution: Install edgecompiler from the repository:

cd ~/projects/edgecompiler
pip install -e .

"libedgetpu not found"¶

Symptoms: OSError: dlopen(libedgetpu.1.dylib, ...): Library not loaded

Solutions:

Verify the dylib exists:

ls -la /usr/local/lib/libedgetpu*.dylib

Set DYLD_LIBRARY_PATH:

export DYLD_LIBRARY_PATH="/usr/local/lib:${DYLD_LIBRARY_PATH:-}"

Run the setup script:

bash scripts/setup_coral_runtime.sh --force

"Architecture mismatch (x86_64 dylib on ARM64)"¶

Symptoms: OSError: mach-o, but not compatible architecture

Solution: Build and install the ARM64 version:

file /usr/local/lib/libedgetpu.1.dylib
# If it says x86_64, rebuild:
git clone https://github.com/feranick/libedgetpu.git
cd libedgetpu && CPU=darwin_arm64 make
sudo cp out/darwin_arm64/libedgetpu.1.dylib /usr/local/lib/
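
A mismatch can also come from the interpreter rather than the library: a Python build running under Rosetta 2 reports x86_64 and cannot load an arm64 dylib. The standard-library check below shows which architecture the current interpreter reports. `expected_dylib_arch` is a hypothetical helper for illustration.

```python
import platform


def expected_dylib_arch(machine: str) -> str:
    """Map the interpreter's machine string to the dylib architecture it can load."""
    if machine in ("arm64", "aarch64"):
        return "arm64"
    if machine in ("x86_64", "AMD64"):
        return "x86_64"
    raise ValueError(f"unrecognised machine type: {machine!r}")


machine = platform.machine()
print(f"Python reports machine={machine!r}")
```

Compare the reported value with the output of the `file` command above. If they differ, install a native arm64 Python or rebuild the library to match.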

"No Coral USB device detected"¶

Solutions:

Check the physical connection (LED should be lit)
Verify with system_profiler SPUSBDataType | grep -i coral
Avoid unpowered USB-C hubs; connect the accelerator directly to a port on the Mac, using a USB-C to USB-A adapter if needed
Re-plug the device after installing libedgetpu

"Model exceeds Edge TPU cache"¶

Symptoms: Model runs but with high latency (30+ ms for small models)

Diagnosis: The model is too large for the 8 MB on-chip cache, forcing parameter streaming from DRAM.

Solutions:

Use classification head extraction instead of the full model
Reduce hidden_size via distillation
Apply pruning with SlothConverter.export_head(prune_ratio=0.3)
Use the hybrid inference pattern (embeddings on host, classifier on Coral)

"Quantisation accuracy is poor"¶

Symptoms: The compiled model produces incorrect classifications

Solutions:

Provide calibration data:

cal_path = quantizer.prepare_calibration_data(dataset_name="imdb", num_samples=200)
edgecompiler.compile("sentiment_head.onnx", target="coral", quantize=True, calibration_data=cal_path)

Increase the number of calibration samples (100-500 recommended)
Enable per-channel quantisation: per_channel=True
Try symmetric quantisation: symmetric=True

"Operations falling back to CPU"¶

Symptoms: Warning during compilation, higher than expected latency

Diagnosis: Some operations in the model are not supported by the Edge TPU and must run on the host CPU.

Solutions:

Check which ops are falling back:

edgecompiler inspect model_edgetpu.tflite

Restructure the model to avoid unsupported operations
Use classification head extraction (avoids attention ops)
Use the --strict flag to fail on unsupported ops rather than falling back

12. API Reference¶

SlothAdapter¶

class SlothAdapter:
    """Wraps unsloth FastLanguageModel for fine-tuning and inference."""

    def __init__(
        self,
        model_name: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = True,
        dtype: str = "auto",
    ): ...

    def apply_lora(
        self,
        r: int = 16,
        lora_alpha: int = 16,
        lora_dropout: float = 0,
        target_modules: list[str] | None = None,
    ): ...

    def finetune_classification(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_labels: int = 2,
        num_epochs: int = 1,
        batch_size: int = 2,
        learning_rate: float = 2e-4,
        warmup_steps: int = 10,
        output_dir: str = "sloth_output",
    ): ...

    def get_embedding(self, text: str) -> np.ndarray: ...

    def save(self, path: str) -> None: ...

    def estimate_model_size_mb(self) -> float: ...

SlothConverter¶

class SlothConverter:
    """Exports fine-tuned models to ONNX/TFLite for edge deployment."""

    def __init__(
        self,
        checkpoint_path: str,
        tokenizer_path: str | None = None,
    ): ...

    def export_head(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        opset: int = 14,
        optimize: bool = True,
        prune_ratio: float = 0.0,
        prune_method: str = "magnitude",
    ) -> str: ...

    def export_embeddings(
        self,
        output_path: str | None = None,
        format: str = "onnx",
    ) -> str: ...

    def export_full(
        self,
        output_path: str | None = None,
        format: str = "onnx",
        max_seq_length: int = 128,
    ) -> str: ...

    def validate_onnx(self, onnx_path: str) -> bool: ...

SlothQuantizer¶

class SlothQuantizer:
    """Prepares calibration data and configures INT8 quantisation."""

    def __init__(
        self,
        checkpoint_path: str,
    ): ...

    def prepare_calibration_data(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        num_samples: int = 100,
        output_path: str = "calibration_samples.npz",
    ) -> str: ...

    def get_quantization_config(
        self,
        per_channel: bool = True,
        symmetric: bool = True,
    ) -> dict: ...

SlothCoralRuntime¶

class SlothCoralRuntime:
    """Runs text classification and embedding inference on Coral Edge TPU."""

    def __init__(
        self,
        model_path: str,
        labels: list[str] | None = None,
        libedgetpu_path: str | None = None,
    ): ...

    def classify_text(self, text: str, top_k: int = 5) -> dict: ...

    def classify_embedding(self, embedding: np.ndarray, top_k: int = 5) -> dict: ...

    def embed_text(self, text: str) -> dict: ...

    def classify_batch(self, texts: list[str]) -> list[dict]: ...

    def benchmark(self, num_runs: int = 100, warmup: int = 5) -> dict: ...

ModelDistiller¶

class ModelDistiller:
    """Distils a fine-tuned SLM into a smaller student model."""

    def __init__(
        self,
        teacher_model: SlothAdapter,
        student_hidden_sizes: list[int] = [512, 256],
        num_labels: int = 2,
        temperature: float = 4.0,
        alpha: float = 0.7,
    ): ...

    def train(
        self,
        dataset_name: str | None = None,
        dataset: Dataset | None = None,
        text_field: str = "text",
        label_field: str = "label",
        num_epochs: int = 5,
        batch_size: int = 16,
        learning_rate: float = 1e-3,
    ): ...

    def save_student(self, path: str) -> str: ...

    def estimate_student_size_mb(self) -> float: ...
