HuggingFace OpenAI API Server¶

A simple OpenAI API-compatible inference server for testing HuggingFace models.

The server and client should both work with other tools that support the OpenAI protocol.

For Maintainers: See ARCHITECTURE.md for detailed technical documentation on the codebase structure, design patterns, and implementation details.

Installation¶

The base Forgather install should be sufficient.

The scripts can be executed as stand-alone executables or started from the Forgather CLI, for convenience.

Authentication¶

The server requires a bearer token by default. Multi-user hosts share the same loopback addresses, so without auth any local user could connect to a running inference port and use a private/finetuned model.

For the full picture — how this token fits in alongside the forgather server's own bearer token, the per-job trainer-control token, and the TensorBoard / MkDocs spawn defaults — see the forgather server threat model and authentication overview.

Default behaviour — if you don't pass any auth flag, the server generates a random 64-hex-char token at startup and prints it on stderr:

inference_server auth token: 8f5b...
clients must send 'Authorization: Bearer <token>'
curl -H "Authorization: Bearer 8f5b..." http://127.0.0.1:8137/v1/models
shared token file: /home/<you>/.config/forgather/inference/8137.token

The auto-generated token is also written to a per-port file under ~/.config/forgather/inference/<port>.token (mode 0600 in a 0700 directory) and removed when the server exits. The bundled CLI client picks it up automatically when its --url resolves to a loopback host (127.0.0.1, ::1, or localhost), so forgather inf server paired with forgather inf client works with no auth flags on either side. The lookup is keyed by port — a server bound to 0.0.0.0:8137 and a client connecting to http://127.0.0.1:8137/v1 share the same file.
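The port-keyed lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the bundled client's actual code: the function name `token_cache_path` and the `home` parameter are hypothetical, and falling back to the scheme's default port when the URL has none is an assumption.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

# Hosts the documentation lists as loopback for auto-discovery.
LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}


def token_cache_path(url: str, home: Optional[Path] = None) -> Optional[Path]:
    """Return the per-port token file for a loopback URL, or None otherwise.

    Hypothetical helper: it only mirrors the documented file layout.
    """
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return None  # non-loopback servers need an explicitly supplied token
    # Assumption: use the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    base = home if home is not None else Path.home()
    return base / ".config" / "forgather" / "inference" / f"{port}.token"


if __name__ == "__main__":
    print(token_cache_path("http://127.0.0.1:8137/v1"))
    print(token_cache_path("http://10.0.0.5:8137/v1"))
```

Because only the port enters the file name, the bind address of the server (for example 0.0.0.0) does not affect which file a loopback client reads.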

For non-bundled clients (curl, OpenAI SDK calling directly, scripts), copy the token from stderr or read it from the file. Pass it in the Authorization: Bearer <token> header. The OpenAI Python SDK sends api_key as the bearer token.

Supplying a known token — --auth-token TOKEN or --auth-token-file PATH (mode 0600). The file form is preferred for orchestrators since --auth-token is visible to other local users via ps. When you supply a token explicitly the server does not publish it to the shared cache file — operator-managed tokens stay where the operator put them, and the client must be told about the token the same way.
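For orchestrators written in Python, a token file with the required mode can be created without the token ever appearing on a command line. The sketch below is one way to do it under stated assumptions: `write_token_file` is a hypothetical helper, and the 64-hex-character format simply mirrors the auto-generated token described earlier.

```python
import os
import secrets
from pathlib import Path


def write_token_file(path: Path) -> str:
    """Create (or overwrite) a token file readable only by its owner."""
    token = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    # Create the file with mode 0600 from the start, so there is no window
    # in which other local users could read it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)
    return token


if __name__ == "__main__":
    target = Path.home() / ".inf-token"
    write_token_file(target)
    print(f"wrote {target}; pass it with --auth-token-file")
```

The resulting path can then be given to both the server and the client via --auth-token-file.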

Disabling auth — --no-auth turns the gate off entirely. The startup banner warns prominently when this is set. Only use it on hosts where you're the only user (or you don't care who uses the model). No cache file is written.

Examples:

# Auto-generated token + auto-discovery (zero-config local use):
forgather inf server -m /path/to/model      # in one terminal
forgather inf client --message "hi"         # in another

# Auto-generated token consumed by curl / a custom client:
TOKEN=$(cat ~/.config/forgather/inference/8137.token)
curl -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8137/v1/models

# Known token from a file (orchestrators, multi-host, scripts):
echo -n "$(openssl rand -hex 32)" > ~/.inf-token && chmod 600 ~/.inf-token
forgather inf server -m /path/to/model --auth-token-file ~/.inf-token
forgather inf client --auth-token-file ~/.inf-token --message "hi"

# Opt out:
forgather inf server -m /path/to/model --no-auth

The /health endpoint is intentionally left open (no auth required) so the same-origin proxy in the forgather-server can probe upstream health before the model finishes loading. All other routes — /v1/models, /v1/chat/completions, /v1/completions, /tokenize, /v1/tokenize — require the bearer.
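Since /health needs no bearer, a launcher script can poll it before sending authenticated requests. The following is a minimal sketch using only the standard library; `wait_for_health` is a hypothetical name, and treating an HTTP 200 as "the endpoint is reachable" is an assumption (this document does not specify what the response body contains while the model is still loading).

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll <base_url>/health until it answers 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # No Authorization header: /health is documented as open.
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or returned an error); retry
        time.sleep(interval)
    return False


if __name__ == "__main__":
    # Note: /health lives at the server root, not under /v1.
    print(wait_for_health("http://127.0.0.1:8137", timeout=10))
```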

When the server is spawned by the forgather-server scheduler it auto-generates a per-job token under ~/.config/forgather/server/inference/<queue_id>.token (mode 0600) and the same-origin proxy adds the bearer for browser-initiated requests transparently.

Quick Start¶

Start server

# Load model in directory using AutoModelForCausalLM.from_pretrained()
# This defaults to bfloat16 on cuda:0
forgather inf server -m /path/to/model

# Load model from latest Forgather checkpoint
forgather inf server -c -m /path/to/model

Start client

# Start in interactive (chat) mode
forgather inf client

# Perform text completion on prompt
forgather inf client --completion "Once upon a time"

From Configuration File¶

Create server configuration:

# server_config.yaml
model: "/path/to/your/model"
device: "cuda:0"
port: 8007
stop-sequences: ["The end", "</s>"]

Start server:
```
python server.py server_config.yaml
```

Create client configuration:

# client_config.yaml  
url: "http://localhost:8007/v1"
max-tokens: 100
show-usage: true
repetition-penalty: 1.2

Test with client:

# Chat mode
python client.py client_config.yaml --message "Tell me a story"

# Completion mode with echo (default)
python client.py client_config.yaml --completion "Once upon a time"

# Pipeline mode (reads from stdin)
echo "In a magical forest" | python client.py client_config.yaml

Usage¶

Start the server¶

Using Command Line Arguments¶

forgather inf server --model microsoft/DialoGPT-medium

Using YAML Configuration File¶

forgather inf server server_config.yaml

Using YAML Configuration with CLI Overrides¶

forgather inf server server_config.yaml --port 8001 --log-level DEBUG

Options:

- config: YAML configuration file (optional positional argument)
- --model: HuggingFace model path or name (required, can be in config file)
- --host: Host to bind to (default: 127.0.0.1)
- --port: Port to bind to (default: 8137)
- --device: Device to use: cuda, cpu, or auto (default: cuda:0)
- --chat-template: Path to custom Jinja2 chat template file (optional)
- --dtype: Model data type (optional, see Data Types section)
- --stop-sequences: Custom stop sequences to halt generation (optional)
- --ignore-eos: Ignore EOS tokens during generation (optional, default: False)
- --log-level: Logging level: DEBUG, INFO, WARNING, ERROR (default: INFO)

Note: CLI arguments always override values from the configuration file.
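The precedence rule (built-in defaults, then config file, then CLI) can be illustrated with plain dictionaries. This is a sketch of the layering only, not the server's argument parser: `resolve_settings` is a hypothetical name, and representing "flag not given" as `None` is an assumption. The default values are the ones listed in the options above.

```python
from typing import Any, Dict, Optional

# Defaults taken from the options list above.
DEFAULTS: Dict[str, Any] = {
    "host": "127.0.0.1",
    "port": 8137,
    "device": "cuda:0",
    "log-level": "INFO",
}


def resolve_settings(
    config_file: Dict[str, Any],
    cli_args: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Layer settings: defaults < config file < CLI arguments."""
    settings = dict(DEFAULTS)
    settings.update(config_file)
    # Assumption: a CLI value of None means the flag was not passed,
    # so it must not clobber the config-file value.
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings


if __name__ == "__main__":
    config = {"model": "/path/to/your/model", "port": 8007}
    cli = {"port": 8001, "log-level": "DEBUG", "device": None}
    print(resolve_settings(config, cli))
```

Here the port comes from the CLI (8001), the model from the config file, and the device from the built-in default.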

YAML Configuration Files¶

The server supports YAML configuration files for convenient parameter management and reusable configurations.

Server Configuration Format¶

Create a server_config.yaml file:

# Model configuration
model: "/path/to/your/model"
device: "cuda:0"
dtype: "bfloat16"

# Server configuration  
host: "127.0.0.1"
port: 8007
log-level: "INFO"

# Chat template (optional)
chat-template: "/path/to/custom/template.jinja"

# Stop sequences (optional)
stop-sequences:
  - "The end"
  - "The End"
  - "</s>"

Usage Patterns¶

# Use config file only
forgather inf server my_config.yaml

# Override specific values
forgather inf server my_config.yaml --port 8001

# Multiple overrides
forgather inf server my_config.yaml --device cpu --log-level DEBUG

Chat Template Support¶

The server supports three chat template sources, in order of priority:

1. Custom template file (via --chat-template argument)
2. Tokenizer's built-in template (if available)
3. Default fallback template (simple format)
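The priority order above amounts to a short chain of fallbacks. The sketch below shows the selection logic only; `select_chat_template` and its parameters are hypothetical names, and the fallback template text is a placeholder rather than the server's real default.

```python
from pathlib import Path
from typing import Optional, Tuple


def select_chat_template(
    custom_path: Optional[str],
    tokenizer_template: Optional[str],
    fallback: str,
) -> Tuple[str, str]:
    """Return (source, template) following the documented priority order."""
    if custom_path:
        # 1. A file passed via --chat-template wins over everything else.
        return "custom", Path(custom_path).read_text(encoding="utf-8")
    if tokenizer_template:
        # 2. Otherwise use the template shipped with the tokenizer, if any.
        return "tokenizer", tokenizer_template
    # 3. Last resort: a simple built-in format.
    return "fallback", fallback


if __name__ == "__main__":
    placeholder = "{{ messages }}"  # placeholder, not the real default
    print(select_chat_template(None, None, placeholder)[0])
```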

Using ChatML template¶

forgather inf server --model /path/to/model --chat-template /path/to/chat_templates/chatml.jinja

Narrative Template for Story-Focused Models¶

For models trained on story/narrative data (like tiny language models), a narrative-style template often works better:

Create narrative_chat.jinja:

{%- for message in messages %}
    {%- if message['role'] == 'system' -%}
        Once upon a time, {{ message['content'] }}

    {%- elif message['role'] == 'user' -%}
        {%- if loop.first -%}
            There was a story about {{ message['content'] }}
        {%- else -%}
            Then someone said: "{{ message['content'] }}" 
        {%- endif -%}
    {%- elif message['role'] == 'assistant' -%}
        And the story continued: {{ message['content'] }}

    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    The story goes on: 
{%- endif -%}

This template frames conversations as storytelling, which can improve output quality for story-focused models.

Standard Assistant Template¶

Create a traditional assistant template:

{%- for message in messages %}
    {%- if message['role'] == 'system' -%}
        System: {{ message['content'] }}\n\n
    {%- elif message['role'] == 'user' -%}
        User: {{ message['content'] }}\n\n
    {%- elif message['role'] == 'assistant' -%}
        Assistant: {{ message['content'] }}\n\n
    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    Assistant: 
{%- endif -%}

Data Types¶

The server supports flexible data type configuration with intelligent defaults:

Default Behavior:

- GPU with bfloat16 support: bfloat16 (recommended for modern GPUs)
- GPU without bfloat16 support: float16
- CPU: float32

Supported Data Types:

- float32, fp32: 32-bit floating point
- float16, fp16, half: 16-bit floating point
- bfloat16, bf16: 16-bit brain floating point (recommended)
- float64, fp64, double: 64-bit floating point
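The alias table and the default rules can be combined into one small resolver. This is a sketch that works on plain strings so it runs without PyTorch; `resolve_dtype` is a hypothetical name, and treating every non-CUDA device string as CPU is an assumption. In real code the `bf16_supported` flag would typically come from a call such as `torch.cuda.is_bf16_supported()`.

```python
from typing import Optional

# Aliases as listed under "Supported Data Types".
DTYPE_ALIASES = {
    "float32": "float32", "fp32": "float32",
    "float16": "float16", "fp16": "float16", "half": "float16",
    "bfloat16": "bfloat16", "bf16": "bfloat16",
    "float64": "float64", "fp64": "float64", "double": "float64",
}


def resolve_dtype(requested: Optional[str], device: str, bf16_supported: bool) -> str:
    """Map a --dtype value (or None) to a canonical dtype name."""
    if requested is not None:
        try:
            return DTYPE_ALIASES[requested.lower()]
        except KeyError:
            raise ValueError(f"unsupported dtype: {requested!r}") from None
    # No explicit dtype: apply the documented defaults.
    if device.startswith("cuda"):
        return "bfloat16" if bf16_supported else "float16"
    return "float32"  # assumption: anything that is not CUDA is treated as CPU


if __name__ == "__main__":
    print(resolve_dtype(None, "cuda:0", bf16_supported=True))
    print(resolve_dtype("half", "cuda:0", bf16_supported=True))
    print(resolve_dtype(None, "cpu", bf16_supported=False))
```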

Examples:

# Use default (bfloat16 if supported)
forgather inf server --model /path/to/model

# Explicit bfloat16
forgather inf server --model /path/to/model --dtype bfloat16

# Use float16 for older GPUs
forgather inf server --model /path/to/model --dtype float16

# High precision for research
forgather inf server --model /path/to/model --dtype float32

# With custom stop sequences
forgather inf server --model /path/to/model --stop-sequences "<|im_end|>" "</s>"

Loading Quantized Models¶

The server transparently loads torchao-quantized artifacts produced by forgather finalize --quantize. No extra flag — the native checkpoint loader detects quantization (from config.json's quantization_config block, or by scanning the saved state_dict) and installs the matching quantized linear modules before weights load. See QAT Training § Evaluating Quantized Models for the underlying mechanism.

# Quantize a finalized model (one-time, after training)
forgather finalize --quantize int8-dynamic-act-int4-weight \
    output_models/my_run /serve/my_run_int4

# Serve it — same invocation shape as for bf16 models
forgather inf server -m /serve/my_run_int4 --from-checkpoint

Throughput. Quantized serving is currently slower than bf16 on small/medium models at batch size 1. Measured on the 4.43M Tiny Llama, single RTX 3090, greedy 64-token completion:

| Variant | tok/s |
|---|---|
| bf16 baseline | 379.9 |
| `int8-dynamic-act-int4-weight` | 61.9 |

The slowdown is the per-matmul dequant overhead being a large fraction of the work in a small model. Quantization wins (memory footprint, longer context, larger batch) appear at scale; benchmark your own setup before deploying.
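To put the two measurements in perspective, the arithmetic below converts them into a slowdown factor, a per-token latency, and the wall-clock time for the 64-token completion used in the benchmark. It uses only the figures quoted above; the helper name `per_token_ms` is illustrative.

```python
# Figures from the throughput table above (tokens per second).
BF16_TOK_S = 379.9
INT4_TOK_S = 61.9
TOKENS = 64  # length of the benchmarked completion


def per_token_ms(tokens_per_second: float) -> float:
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


slowdown = BF16_TOK_S / INT4_TOK_S

if __name__ == "__main__":
    print(f"slowdown: {slowdown:.2f}x")
    for name, rate in (("bf16", BF16_TOK_S), ("int4", INT4_TOK_S)):
        print(
            f"{name}: {per_token_ms(rate):.2f} ms/token, "
            f"{TOKENS / rate:.3f} s for {TOKENS} tokens"
        )
```

On these numbers the quantized variant is roughly six times slower per token, which is why the figures should be re-measured on the target model size before drawing conclusions.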

Dtype interaction. --dtype controls the unquantized layers (norms, embeddings, residuals) and the dequant target. The recipe controls activation/weight precision for the quantized linears. The default bfloat16 is the right choice unless you have a specific reason to override.

Device placement. Quantized linears are CUDA-bound for the v1 recipes. CPU serving of quantized models is not currently supported.

Stop Sequences¶

The server supports flexible stop sequence configuration to control when generation should halt:

Default Behavior:

- EOS token is always included as a stop criterion
- Models typically learn when to stop during training

Examples:

# ChatML format stopping
forgather inf server --model /path/to/model --stop-sequences "<|im_end|>"

# Multiple stop sequences
forgather inf server --model /path/to/model --stop-sequences "</s>" "<|endoftext|>" "Human:"

# Instruct format stopping
forgather inf server --model /path/to/model --stop-sequences "### Human:" "### Assistant:"
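A stop sequence that spans several tokens cannot be detected from a single token ID, so it is usually matched against the decoded text. The sketch below shows that kind of text-level truncation. It is an illustration, not the server's implementation: `truncate_at_stop` is a hypothetical name, and dropping the stop sequence itself from the returned text is an assumption modeled on the OpenAI API's behavior.

```python
from typing import Iterable, Tuple


def truncate_at_stop(text: str, stop_sequences: Iterable[str]) -> Tuple[str, bool]:
    """Cut text at the earliest stop sequence; report whether one was hit."""
    cut = len(text)
    hit = False
    for stop in stop_sequences:
        if not stop:
            continue  # an empty string would match at index 0; skip it
        index = text.find(stop)
        if index != -1 and index < cut:
            cut = index
            hit = True
    return text[:cut], hit


if __name__ == "__main__":
    generated = "Paris is the capital.<|im_end|>ignored tail"
    print(truncate_at_stop(generated, ["<|im_end|>", "</s>"]))
```

When several stop sequences occur in the text, the earliest match wins, regardless of the order in which they were listed.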

EOS Token Control¶

By default, generation stops when the model produces an EOS (End-Of-Sequence) token. You can override this behavior so that generation continues past EOS until max_tokens is reached or a stop sequence matches.

Server-level default:

# All requests will ignore EOS by default
forgather inf server --model /path/to/model --ignore-eos

Request-level override (using client):

# Ignore EOS for this specific request
forgather inf client --completion "Once upon a time" --max-tokens 512 --ignore-eos

# Or with Python OpenAI client
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8137/v1', api_key='<server-printed token>')
response = client.completions.create(
    model='test',
    prompt='Once upon a time',
    max_tokens=512,
    extra_body={'ignore_eos': True}
)
print(response.choices[0].text)
"

Use cases:

- Benchmarking: Generate fixed-length outputs for fair comparison
- Data generation: Produce exactly N tokens for datasets
- Testing: Verify model behavior at maximum context length

Note: Other stop conditions (max_tokens, stop_sequences) still apply when ignore_eos is enabled.
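The interaction between EOS and max_tokens can be shown with a toy loop over a fixed list of token IDs. This is a simplified model of the stopping rule only, not the server's decoding loop; `generate_until_stop` and the token values are made up for illustration, and stop sequences are left out to keep the example short.

```python
from typing import List, Sequence


def generate_until_stop(
    token_stream: Sequence[int],
    eos_id: int,
    max_tokens: int,
    ignore_eos: bool = False,
) -> List[int]:
    """Collect tokens until max_tokens is reached or (optionally) EOS appears."""
    output: List[int] = []
    for token in token_stream:
        if len(output) >= max_tokens:
            break  # the max_tokens limit always applies
        if token == eos_id and not ignore_eos:
            break  # default behavior: stop at EOS
        output.append(token)
    return output


if __name__ == "__main__":
    stream = [5, 6, 2, 7, 8]  # pretend the model emits EOS (id 2) third
    print(generate_until_stop(stream, eos_id=2, max_tokens=4))
    print(generate_until_stop(stream, eos_id=2, max_tokens=4, ignore_eos=True))
```

With `ignore_eos=True` the EOS token is kept in the output and generation runs on until the token budget is used up.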

Logging¶

The server uses structured logging with detailed request/response information:

Log Levels¶

DEBUG: Detailed debugging information
INFO: Request details, token IDs, and decoded text with special tokens (default)
WARNING: Important warnings (e.g., dtype fallbacks)
ERROR: Error conditions

What's Logged at INFO Level¶

For each request, the server logs:

- Request ID and parameters (model, max_tokens, temperature, etc.)
- Individual chat messages with content
- Formatted prompt text
- Input token IDs and decoded text with special tokens (BOS, EOS, etc.)
- Generated token IDs and decoded text with special tokens
- Final response text (cleaned)
- Token usage statistics

Example Log Output¶

2025-08-05 18:15:03,770 - root - INFO - [chatcmpl-c814be94] New chat completion request: model=test-model, max_tokens=30, temperature=0.7, top_p=1.0, messages_count=2
2025-08-05 18:15:03,771 - root - INFO - [chatcmpl-c814be94] Message 0: role=system, content='You are a helpful assistant.'
2025-08-05 18:15:03,771 - root - INFO - [chatcmpl-c814be94] Message 1: role=user, content='What is 2+2?'
2025-08-05 18:15:03,771 - root - INFO - [chatcmpl-c814be94] Input token IDs: [1, 4444, 29901, 887, 526, 263, 8444, 20255, 29889, 13, 13, 2659, 29901, 1724, 338, 29871, 29906, 29974, 29906, 29973, 13, 13, 7900, 22137, 29901, 29871]
2025-08-05 18:15:03,772 - root - INFO - [chatcmpl-c814be94] Generated tokens with special tokens: 'The answer is 4! I'm here to provide accurate information and support in a range of topics, from math to emotional well-be</s>'
2025-08-05 18:15:03,772 - root - INFO - [chatcmpl-c814be94] Response text (clean): 'The answer is 4! I'm here to provide accurate information and support in a range of topics, from math to emotional well-be'
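Because every line carries the request ID in square brackets, the log can be grouped per request with a small parser. The regular expression below is inferred from the sample output above and is only a sketch; if the server's log format differs or changes, the pattern needs adjusting. `parse_log_line` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the example lines: timestamp - logger - LEVEL - [id] message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>[A-Z]+) - "
    r"\[(?P<request_id>[^\]]+)\] (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Split one log line into its fields, or return None if it doesn't match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2025-08-05 18:15:03,771 - root - INFO - [chatcmpl-c814be94] "
        "Message 1: role=user, content='What is 2+2?'"
    )
    print(parse_log_line(sample))
```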

CLI Client Tool¶

A dedicated CLI client (client.py) is provided for easy interaction with the server using the official OpenAI Python client, ensuring full compatibility.

Usage¶

YAML Configuration for Client¶

The client supports YAML configuration files for convenient parameter management:

Create client_config.yaml:

# Connection settings
url: "http://localhost:8007/v1"
model: "my-model"

# Generation parameters
max-tokens: 60
temperature: 0.8
top-p: 0.9
show-usage: true

# HuggingFace generation parameters
repetition-penalty: 1.2
top-k: 40
no-repeat-ngram-size: 2

Usage with configuration:

# Use config file
forgather inf client client_config.yaml --message "Tell me a story"

# Override config values
forgather inf client client_config.yaml --message "Hello" --max-tokens 30 --temperature 0.5

# Completion mode with config
forgather inf client client_config.yaml --completion "Once upon a time"

Single Message¶

# Basic usage
forgather inf client --message "Hello, how are you?"

# With custom server URL
forgather inf client --message "What's 2+2?" --url http://localhost:8001/v1

# With system prompt
forgather inf client --message "Tell me a joke" --system "You are a funny comedian"

# Show token usage
forgather inf client --message "Explain AI" --show-usage --max-tokens 200

Text Completion (Completions API)¶

The server also supports the older /v1/completions endpoint for raw text completion without chat formatting. By default, the prompt is included in the output (echo enabled) for better pipeline compatibility.

# Basic text completion (includes prompt in output by default)
forgather inf client --completion "Once upon a time"

# With custom parameters
forgather inf client --completion "The weather today is" --max-tokens 50 --temperature 0.8

# Disable echo (only show generated text)
forgather inf client --completion "Python is" --no-echo --max-tokens 30

# With stop sequences
forgather inf client --completion "Q: What is AI? A:" --stop "Q:" --max-tokens 100

# Show detailed usage information
forgather inf client --completion "Hello world" --show-usage

# Advanced generation parameters for better quality
forgather inf client --completion "Once upon a time" --repetition-penalty 1.2 --top-k 50 --max-tokens 100

# Multiple parameters for fine control
forgather inf client --completion "The story begins" --repetition-penalty 1.1 --no-repeat-ngram-size 3 --top-k 40 --num-beams 2

Pipeline Support (Stdin Input)¶

The client supports reading prompts from stdin, making it perfect for Unix pipelines:

# Pipe from file
cat prompt.txt | python client.py --max-tokens 50

# Pipe from echo
echo "The brave knight ventured forth" | python client.py --max-tokens 40

# Pipe with config file
cat story_prompt.txt | python client.py client_config.yaml

# Pipe with parameters and output redirection
echo "In a distant galaxy" | python client.py --max-tokens 60 --temperature 0.8 > story_output.txt

# Chain operations (completion only, no prompt)
echo "Once upon a time" | python client.py --no-echo --max-tokens 100 | grep -v "Usage:"

Pipeline Benefits:

- Echo by default: Includes original prompt for context
- Stream-friendly: Outputs to stdout for redirection
- Config compatible: Works with YAML configurations
- Scriptable: Perfect for automation and batch processing

Interactive Chat Mode¶

# Start interactive session
forgather inf client

# With system prompt
forgather inf client --system "You are a helpful coding assistant"

# Custom generation parameters
forgather inf client --temperature 0.9 --max-tokens 1000

Utility Commands¶

# Check server health
forgather inf client --health

# List available models
forgather inf client --list-models

# Connect to different server
forgather inf client --health --url http://localhost:8001/v1

Interactive Commands¶

When in interactive mode, use these commands:

- /clear - Clear conversation history
- /system <message> - Set or change system prompt
- /help - Show available commands
- quit, exit, or q - Exit interactive mode

Client Options¶

Configuration:

- config - YAML configuration file (optional positional argument)

Basic Options:

- --url - Server base URL (default: http://localhost:8137/v1)
- --model - Model name (default: inference-server)
- --max-tokens - Maximum tokens to generate (default: 512)
- --temperature - Sampling temperature (default: 0.7)
- --top-p - Top-p sampling (default: 1.0)
- --system - System prompt (chat mode only)
- --show-usage - Display token usage information

Mode Options:

- --message - Single message to send (chat mode)
- --completion - Generate text completion for the given prompt
- --interactive - Run in interactive chat mode
- --stop - Stop sequences for completion mode (can specify multiple)
- --echo - Echo the prompt in the completion response (default for completion mode)
- --no-echo - Don't echo the prompt in the completion response

Input Methods:

- Command line argument: --completion "prompt text"
- Standard input (stdin): echo "prompt text" | python client.py
- File input: cat prompt.txt | python client.py

Advanced Generation Parameters:

- --repetition-penalty - Repetition penalty (e.g., 1.2 to reduce repetition)
- --no-repeat-ngram-size - Size of n-grams to avoid repeating
- --top-k - Top-k sampling parameter
- --num-beams - Number of beams for beam search
- --min-length - Minimum length of generated sequence
- --seed - Random seed for reproducible generation
- --ignore-eos - Ignore EOS tokens (continue past EOS until max_tokens or stop_sequence)

Test with curl¶

Chat Completions¶

curl -X POST http://localhost:8137/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 100
  }'

Text Completions¶

curl -X POST http://localhost:8137/v1/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "test-model",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.8
  }'

With HuggingFace Parameters¶

curl -X POST http://localhost:8137/v1/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "test-model",
    "prompt": "The quick brown fox",
    "max_tokens": 50,
    "temperature": 0.7,
    "repetition_penalty": 1.2,
    "top_k": 40,
    "no_repeat_ngram_size": 3
  }'

Test with OpenAI Python client¶

Chat Completions¶

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8137/v1",
    api_key="<token printed on server stderr>",  # used as Bearer token
)

response = client.chat.completions.create(
    model="test-model",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)

Text Completions¶

response = client.completions.create(
    model="test-model",
    prompt="Once upon a time",
    max_tokens=50,
    temperature=0.8
)

print(response.choices[0].text)

Using HuggingFace Parameters with extra_body¶

For advanced generation parameters, use the extra_body parameter to pass HuggingFace-specific options:

response = client.completions.create(
    model="test-model",
    prompt="The story begins with",
    max_tokens=100,
    temperature=0.7,
    extra_body={
        "repetition_penalty": 1.2,
        "top_k": 40,
        "no_repeat_ngram_size": 3,
        "num_beams": 2
    }
)

print(response.choices[0].text)

This mechanism allows you to use any HuggingFace generation parameter while maintaining compatibility with the OpenAI client library.
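On the wire, the OpenAI Python client merges the keys of `extra_body` into the top level of the JSON request body, which is why the result matches the curl request with HuggingFace parameters. The sketch below reproduces that merge with plain dictionaries so the payload can be inspected without a running server; `completion_payload` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json
from typing import Any, Dict, Optional


def completion_payload(
    model: str,
    prompt: str,
    max_tokens: int,
    temperature: float,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /v1/completions."""
    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # extra_body entries become ordinary top-level fields of the request.
    payload.update(extra_body or {})
    return payload


if __name__ == "__main__":
    body = completion_payload(
        "test-model",
        "The story begins with",
        max_tokens=100,
        temperature=0.7,
        extra_body={"repetition_penalty": 1.2, "top_k": 40},
    )
    print(json.dumps(body, indent=2))
```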

API Endpoints¶

GET /v1/models - List available models
POST /v1/chat/completions - Create chat completion
POST /v1/completions - Create text completion
GET /health - Health check
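For scripts that should not depend on the OpenAI package, the endpoints above can be called with the standard library. The helper below only constructs the request object, so it can be inspected or tested without a server; `build_request` is a hypothetical name. Per the Authentication section, every route except /health expects the bearer header.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(
    base_url: str,
    path: str,
    token: Optional[str] = None,
    payload: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Create a GET (no payload) or JSON POST request for the server."""
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if payload is not None else "GET"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


if __name__ == "__main__":
    request = build_request("http://127.0.0.1:8137", "/v1/models", token="<token>")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```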

Features¶

OpenAI API Compatibility: Full support for both chat completions and text completions endpoints
YAML Configuration Support: Both server and client support YAML config files with CLI override capability
HuggingFace GenerationConfig Integration: Automatically loads generation_config.json from model directories
HuggingFace Generation Parameters: Comprehensive support for all HuggingFace generation options
Flexible Chat Templates: Support for custom Jinja2 templates, including narrative templates for story models
Stop Sequence Control: Configurable stop sequences for precise generation control
Data Type Support: Intelligent dtype selection with bfloat16, float16, and float32 support
Automatic Device Selection: Smart GPU/CPU device placement
Detailed Logging: Structured request/response logging with token-level information
Token Usage Tracking: Accurate prompt, completion, and total token counts
EOS Token Control: Configurable EOS token handling - stop on EOS (default) or ignore to generate exact token counts
Extra Body Parameter Support: Client supports HuggingFace parameters via OpenAI's extra_body mechanism
Pipeline Support: Stdin input support for Unix-style command chaining and automation
Smart Echo Defaults: Completion mode includes prompt by default for better pipeline compatibility

HuggingFace Generation Parameters¶

The server supports all major HuggingFace generation parameters for tuning generation behavior:

Repetition Control:

- repetition_penalty - Penalty for repeating tokens (default: None)
- no_repeat_ngram_size - Prevent repeating n-grams (default: None)
- encoder_no_repeat_ngram_size - For encoder-decoder models (default: None)

Sampling Parameters:

- top_k - Top-k sampling (default: None, uses model default)
- typical_p - Typical sampling parameter (default: None)
- temperature - Sampling temperature (default: 0.7 for chat, 1.0 for completions)
- top_p - Nucleus sampling (default: 1.0)

Beam Search:

- num_beams - Number of beams for beam search (default: 1)
- num_beam_groups - Diverse beam search groups (default: None)
- diversity_penalty - Penalty for diverse beam search (default: None)
- length_penalty - Length penalty for beam search (default: None)

Length Control:

- min_length - Minimum generation length (default: None)
- max_new_tokens - Maximum new tokens to generate (default: varies)
- early_stopping - Stop at EOS token (default: True)

Other Options:

- seed - Random seed for reproducible generation (default: None)
- guidance_scale - Classifier-free guidance scale (default: None)
- bad_words_ids - Token IDs to avoid (default: None)
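Many of the parameters above default to None, meaning "let the model's own generation config decide". One plausible way to turn a request into keyword arguments for `generate()` is to forward only the values that were actually set. The sketch below illustrates that idea under stated assumptions: `to_generate_kwargs` is a hypothetical helper, the mapping of the OpenAI-style `max_tokens` onto `max_new_tokens` is assumed, and `seed` is left out because it is not a `generate()` argument and would have to be handled separately.

```python
from typing import Any, Dict

# Parameter names taken from the lists above (seed deliberately excluded).
HF_PASSTHROUGH = {
    "repetition_penalty", "no_repeat_ngram_size", "encoder_no_repeat_ngram_size",
    "top_k", "typical_p", "temperature", "top_p",
    "num_beams", "num_beam_groups", "diversity_penalty", "length_penalty",
    "min_length", "max_new_tokens", "early_stopping",
    "guidance_scale", "bad_words_ids",
}


def to_generate_kwargs(request: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only known generation parameters that were explicitly set."""
    kwargs = {
        key: value
        for key, value in request.items()
        if key in HF_PASSTHROUGH and value is not None
    }
    # Assumption: max_tokens maps to max_new_tokens unless the latter
    # was given explicitly.
    if "max_new_tokens" not in kwargs and request.get("max_tokens") is not None:
        kwargs["max_new_tokens"] = request["max_tokens"]
    return kwargs


if __name__ == "__main__":
    request = {"model": "test-model", "max_tokens": 50, "top_k": 40, "repetition_penalty": None}
    print(to_generate_kwargs(request))
```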

Limitations¶

Single model per server instance
Multi-token stop sequences require post-processing (slight performance impact)