HFL

Run any HuggingFace model locally. 500K+ models at your fingertips.

v0.1.0 · Python ≥3.10 · 1900 tests · 90%+ coverage · OpenAI & Ollama compatible

What is HFL?

HFL (HuggingFace Local) is a CLI + API server that lets you run HuggingFace models locally. While Ollama offers ~500 curated models, HFL gives you access to the 500,000+ models on the HuggingFace Hub.

Key insight: HFL is to HuggingFace Hub what Ollama is to its own model library — but with 1000x more models available.

Quick Start

```bash
# Install
pip install hfl

# Pull a model
hfl pull microsoft/Phi-3-mini-4k-instruct-gguf

# Chat interactively
hfl run microsoft/Phi-3-mini-4k-instruct-gguf

# Start the API server (OpenAI + Ollama compatible)
hfl serve --model microsoft/Phi-3-mini-4k-instruct-gguf
```

HFL vs Ollama

| Feature | HFL | Ollama |
|---|---|---|
| Available models | 500,000+ | ~500 |
| Source | HuggingFace Hub | Ollama Library |
| OpenAI API compatible | ✓ | ✓ |
| Ollama API compatible | ✓ | ✓ |
| TTS support | ✓ | ✗ |
| Multiple backends | llama.cpp, transformers, vLLM | llama.cpp only |
| License verification | Automatic (5 levels) | ✗ |
| EU AI Act compliance | Built-in | ✗ |
| GGUF auto-conversion | ✓ | ✗ |
| i18n (EN/ES) | ✓ | ✗ |

API Compatibility

HFL exposes both OpenAI and Ollama compatible APIs, so it works as a drop-in replacement with existing tools:

OpenAI Endpoints

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /v1/models`
- `POST /v1/audio/speech`

Ollama Endpoints

- `POST /api/generate`
- `POST /api/chat`
- `GET /api/tags`
- `POST /api/tts`

Works with: Open WebUI, Chatbox, Continue.dev, and any OpenAI/Ollama-compatible client.
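As a sketch of that drop-in behavior, the snippet below builds a request against the OpenAI-style chat endpoint using only the standard library. The host and port are assumptions — match them to your `hfl serve` configuration:

```python
import json
import urllib.request

# Assumed local address; adjust to however you launched `hfl serve`.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "microsoft/Phi-3-mini-4k-instruct-gguf",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, send it and read the OpenAI-shaped response:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Because the same server also speaks the Ollama API, pointing the URL at `/api/chat` with an Ollama-style body works as well.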

12 CLI Commands

| Command | Description |
|---|---|
| `hfl pull` | Download a model from the HuggingFace Hub |
| `hfl run` | Interactive chat with a model |
| `hfl serve` | Start the API server |
| `hfl list` | List local models |
| `hfl search` | Search the HuggingFace Hub |
| `hfl inspect` | Show model details |
| `hfl rm` | Remove a model |
| `hfl alias` | Create model aliases |
| `hfl login` / `hfl logout` | Manage HF authentication |
| `hfl version` | Show version info |
| `hfl compliance-report` | Generate a legal compliance report |

Architecture Highlights

3 Inference Backends

llama.cpp for GGUF (CPU/GPU), transformers for safetensors, and vLLM for production GPU serving with real async streaming.
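HFL's actual dispatch logic isn't shown in this README, so the function below is a minimal, hypothetical sketch of how format-based backend selection could look:

```python
# Illustrative only: HFL's real selection logic is not documented here.
def pick_backend(model_file: str, production_gpu: bool = False) -> str:
    """Choose an inference backend from the model format and deployment target."""
    if model_file.endswith(".gguf"):
        return "llama.cpp"      # quantized GGUF, CPU or GPU
    if production_gpu:
        return "vllm"           # high-throughput GPU serving, async streaming
    return "transformers"       # safetensors / PyTorch weights

print(pick_backend("phi-3-mini.gguf"))          # llama.cpp
print(pick_backend("model.safetensors"))        # transformers
print(pick_backend("model.safetensors", True))  # vllm
```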

FailoverEngine

Multi-backend with sticky routing. Automatically retries with the next engine if one fails.
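The idea — try the last known-good backend first, fall through to the others on failure — can be sketched as follows; the class and method names here are illustrative, not HFL's real interface:

```python
# Hypothetical sketch of multi-backend failover with sticky routing.
class Backend:
    def __init__(self, name: str, fail: bool = False):
        self.name, self.fail = name, fail

    def generate(self, prompt: str) -> str:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return f"[{self.name}] reply to: {prompt}"

class FailoverEngine:
    def __init__(self, backends):
        self.backends = backends
        self.sticky = None  # last backend that succeeded

    def generate(self, prompt: str) -> str:
        # Try the sticky backend first, then the remaining ones in order.
        order = ([self.sticky] if self.sticky else [])
        order += [b for b in self.backends if b is not self.sticky]
        last_err = None
        for backend in order:
            try:
                out = backend.generate(prompt)
                self.sticky = backend  # route future requests here
                return out
            except RuntimeError as err:
                last_err = err
        raise RuntimeError("all backends failed") from last_err

engine = FailoverEngine([Backend("llama.cpp", fail=True), Backend("transformers")])
print(engine.generate("hi"))  # llama.cpp fails, transformers answers
```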

Model Pool

LRU eviction with real-time RAM/GPU memory tracking. Non-recursive concurrent loading.
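A stripped-down sketch of LRU eviction under a memory budget — the real pool tracks live RAM/GPU usage and handles concurrent loads, which this toy version does not:

```python
from collections import OrderedDict

# Hypothetical sketch; fixed per-model sizes stand in for real memory tracking.
class ModelPool:
    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.models = OrderedDict()  # name -> size_mb, least recent first

    def load(self, name: str, size_mb: int) -> None:
        if name in self.models:
            self.models.move_to_end(name)  # refresh recency on cache hit
            return
        # Evict least-recently-used models until the new one fits the budget.
        while self.models and sum(self.models.values()) + size_mb > self.budget_mb:
            self.models.popitem(last=False)
        self.models[name] = size_mb

pool = ModelPool(budget_mb=10)
pool.load("phi-3", 6)
pool.load("mistral", 5)    # over budget: evicts "phi-3" (least recently used)
print(list(pool.models))   # ['mistral']
```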

Legal Compliance

5-level license classification, EU AI Act notices, provenance logging, AI disclaimers.

Production Ready

Rate limiting, API key auth, health probes, Prometheus metrics, SLO monitoring, structured logging.

TTS Support

Text-to-speech via Bark and Coqui XTTS-v2 engines with OpenAI-compatible endpoints.
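A hedged sketch of calling the speech endpoint, assuming the OpenAI `/v1/audio/speech` payload shape (`model`, `input`, `voice`); the address, engine, and voice names below are placeholders, not HFL defaults:

```python
import json
import urllib.request

# Assumed address and payload shape; adjust to your `hfl serve` configuration.
request = urllib.request.Request(
    "http://localhost:8000/v1/audio/speech",
    data=json.dumps({
        "model": "bark",
        "input": "Hello from HFL.",
        "voice": "default",
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, the response body is the audio stream:
# with urllib.request.urlopen(request) as resp:
#     open("speech.wav", "wb").write(resp.read())
```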

Tech Stack

Core

Python 3.10+ • FastAPI • Typer • Rich • Pydantic

ML

llama-cpp-python • transformers • vLLM • Bark • Coqui TTS

Quality

1900 tests • 90%+ coverage • mypy • ruff • CI/CD