HFL

Run any HuggingFace model locally. 500K+ models at your fingertips.

v0.1.0 · Python ≥3.10 · 1900 tests · 90%+ coverage · OpenAI & Ollama compatible

What is HFL?

HFL (HuggingFace Local) is a CLI + API server that lets you run HuggingFace models locally. While Ollama offers ~500 curated models, HFL gives you access to the 500,000+ models on the HuggingFace Hub.

Key insight: HFL is to HuggingFace Hub what Ollama is to its own model library — but with 1000x more models available.

Quick Start

```bash
# Install
pip install hfl

# Pull a model
hfl pull microsoft/Phi-3-mini-4k-instruct-gguf

# Chat interactively
hfl run microsoft/Phi-3-mini-4k-instruct-gguf

# Start the API server (OpenAI + Ollama compatible)
hfl serve --model microsoft/Phi-3-mini-4k-instruct-gguf
```

HFL vs Ollama

| Feature | HFL | Ollama |
|---|---|---|
| Available models | 500,000+ | ~500 |
| Source | HuggingFace Hub | Ollama Library |
| OpenAI API compatible | ✓ | ✓ |
| Ollama API compatible | ✓ | ✓ |
| TTS support | ✓ | ✗ |
| Multiple backends | llama.cpp, transformers, vLLM | llama.cpp only |
| License verification | Automatic (5 levels) | ✗ |
| EU AI Act compliance | Built-in | ✗ |
| GGUF auto-conversion | ✓ | ✗ |
| i18n (EN/ES) | ✓ | ✗ |

API Compatibility

HFL exposes both OpenAI and Ollama compatible APIs, so it works as a drop-in replacement with existing tools:

OpenAI Endpoints

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /v1/models`
- `POST /v1/audio/speech`

Ollama Endpoints

- `POST /api/generate`
- `POST /api/chat`
- `GET /api/tags`
- `POST /api/tts`

Works with: Open WebUI, Chatbox, Continue.dev, and any OpenAI/Ollama-compatible client.
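As a sketch of that drop-in behavior, the snippet below builds a request against the OpenAI-style chat endpoint using only the standard library. The host and port are assumptions — match them to your `hfl serve` configuration:

```python
import json
import urllib.request

# Assumed local address; adjust to however you launched `hfl serve`.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "microsoft/Phi-3-mini-4k-instruct-gguf",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, send it and read the OpenAI-shaped response:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Because the same server also speaks the Ollama API, pointing the URL at `/api/chat` with an Ollama-style body works as well.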

12 CLI Commands

| Command | Description |
|---|---|
| `hfl pull` | Download a model from the HuggingFace Hub |
| `hfl run` | Interactive chat with a model |
| `hfl serve` | Start the API server |
| `hfl list` | List local models |
| `hfl search` | Search the HuggingFace Hub |
| `hfl inspect` | Show model details |
| `hfl rm` | Remove a model |
| `hfl alias` | Create model aliases |
| `hfl login` / `hfl logout` | Manage HF authentication |
| `hfl version` | Show version info |
| `hfl compliance-report` | Generate a legal compliance report |

Architecture Highlights

3 Inference Backends

llama.cpp for GGUF (CPU/GPU), transformers for safetensors, and vLLM for production GPU serving with real async streaming.
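HFL's actual dispatch logic isn't shown in this README, so the function below is a minimal, hypothetical sketch of how format-based backend selection could look:

```python
# Illustrative only: HFL's real selection logic is not documented here.
def pick_backend(model_file: str, production_gpu: bool = False) -> str:
    """Choose an inference backend from the model format and deployment target."""
    if model_file.endswith(".gguf"):
        return "llama.cpp"      # quantized GGUF, CPU or GPU
    if production_gpu:
        return "vllm"           # high-throughput GPU serving, async streaming
    return "transformers"       # safetensors / PyTorch weights

print(pick_backend("phi-3-mini.gguf"))          # llama.cpp
print(pick_backend("model.safetensors"))        # transformers
print(pick_backend("model.safetensors", True))  # vllm
```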

FailoverEngine

Multi-backend with sticky routing. Automatically retries with the next engine if one fails.
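The idea — try the last known-good backend first, fall through to the others on failure — can be sketched as follows; the class and method names here are illustrative, not HFL's real interface:

```python
# Hypothetical sketch of multi-backend failover with sticky routing.
class Backend:
    def __init__(self, name: str, fail: bool = False):
        self.name, self.fail = name, fail

    def generate(self, prompt: str) -> str:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return f"[{self.name}] reply to: {prompt}"

class FailoverEngine:
    def __init__(self, backends):
        self.backends = backends
        self.sticky = None  # last backend that succeeded

    def generate(self, prompt: str) -> str:
        # Try the sticky backend first, then the remaining ones in order.
        order = ([self.sticky] if self.sticky else [])
        order += [b for b in self.backends if b is not self.sticky]
        last_err = None
        for backend in order:
            try:
                out = backend.generate(prompt)
                self.sticky = backend  # route future requests here
                return out
            except RuntimeError as err:
                last_err = err
        raise RuntimeError("all backends failed") from last_err

engine = FailoverEngine([Backend("llama.cpp", fail=True), Backend("transformers")])
print(engine.generate("hi"))  # llama.cpp fails, transformers answers
```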

Model Pool

LRU eviction with real-time RAM/GPU memory tracking. Non-recursive concurrent loading.
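A stripped-down sketch of LRU eviction under a memory budget — the real pool tracks live RAM/GPU usage and handles concurrent loads, which this toy version does not:

```python
from collections import OrderedDict

# Hypothetical sketch; fixed per-model sizes stand in for real memory tracking.
class ModelPool:
    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.models = OrderedDict()  # name -> size_mb, least recent first

    def load(self, name: str, size_mb: int) -> None:
        if name in self.models:
            self.models.move_to_end(name)  # refresh recency on cache hit
            return
        # Evict least-recently-used models until the new one fits the budget.
        while self.models and sum(self.models.values()) + size_mb > self.budget_mb:
            self.models.popitem(last=False)
        self.models[name] = size_mb

pool = ModelPool(budget_mb=10)
pool.load("phi-3", 6)
pool.load("mistral", 5)    # over budget: evicts "phi-3" (least recently used)
print(list(pool.models))   # ['mistral']
```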

Legal Compliance

5-level license classification, EU AI Act notices, provenance logging, AI disclaimers.

Production Ready

Rate limiting, API key auth, health probes, Prometheus metrics, SLO monitoring, structured logging.

TTS Support

Text-to-speech via Bark and Coqui XTTS-v2 engines with OpenAI-compatible endpoints.
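A hedged sketch of calling the speech endpoint, assuming the OpenAI `/v1/audio/speech` payload shape (`model`, `input`, `voice`); the address, engine, and voice names below are placeholders, not HFL defaults:

```python
import json
import urllib.request

# Assumed address and payload shape; adjust to your `hfl serve` configuration.
request = urllib.request.Request(
    "http://localhost:8000/v1/audio/speech",
    data=json.dumps({
        "model": "bark",
        "input": "Hello from HFL.",
        "voice": "default",
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, the response body is the audio stream:
# with urllib.request.urlopen(request) as resp:
#     open("speech.wav", "wb").write(resp.read())
```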

Tech Stack

Core

Python 3.10+ • FastAPI • Typer • Rich • Pydantic

ML

llama-cpp-python • transformers • vLLM • Bark • Coqui TTS

Quality

1900 tests • 90%+ coverage • mypy • ruff • CI/CD