vLLM Integration Guide¶

This guide explains how to configure Forgather models for distributed inference with vLLM, including tensor parallelism and pipeline parallelism support.

Status (2026-03): vLLM integration is currently broken. Forgather has moved to Transformers v5, which vLLM does not yet support. Until vLLM adds Transformers v5 compatibility (or Forgather adds a Transformers < 5 compatibility layer), the deployment steps below will not work. The architectural information on TP/PP plans remains accurate and is preserved here for reference.

Overview¶

vLLM is a high-throughput inference engine that supports distributed inference through:
- Tensor Parallelism (TP): Splits individual layers across multiple GPUs
- Pipeline Parallelism (PP): Distributes sequential layers across multiple GPUs

Forgather models generated with proper vLLM configuration can be deployed with vLLM for efficient distributed inference.

Quick Start¶

1. Install vLLM¶

vLLM has its own dependencies, which may conflict with those of Forgather. I would recommend installing it in its own Python virtual environment, or even in its own container, to provide full dependency isolation.

The official installation instructions can be found in the vLLM documentation.

These instructions have been tested by directly cloning the vLLM git repo and installing from source.

2. Generate Model with vLLM Support¶

Most Forgather transformer models (Llama, Qwen3, etc.) include vLLM support by default. Simply train and export your model as usual:

# We will use the tiny-models project to demonstrate
cd examples/tiny_experiments/tiny_models

# Clear old model definitions, to be certain that the models have the latest updates
rm -rf output_models

# Train model
forgather -t tiny_fg_qwen3.yaml train

This will produce a model definition and checkpoints. vLLM expects the checkpoints to be in the same directory as the model's source code. You can create the appropriate symbolic links to the latest checkpoint like this:

# Create symlinks to latest checkpoint in model directory
forgather -t tiny_fg_qwen3.yaml checkpoint link

To be sure that you have a working model, test it with Forgather's inference server first. While not as fast as vLLM, it is much easier to use for a quick model test.

# Start server
forgather inf server -m output_models/tiny_fg_qwen3

# In a separate terminal, test the server via the OpenAI "completion" API
forgather inf client --completion "Once upon a time"
...
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite thing to do was to eat...

3. Deploy with vLLM¶

# From your vLLM Python environment (or container)
cd examples/tiny_experiments/tiny_models/

# Start vLLM server on single GPU
# Notes:
# - Leave off the trailing directory slash. This avoids spurious warnings about module names
# - Models with custom code require the --trust-remote-code flag, even though the code is not "remote"
# - Make sure you don't have another inference server running
vllm serve --trust-remote-code output_models/tiny_fg_qwen3

4. Test the Model¶

Test using Forgather's OpenAI API client

# From Forgather's Python environment...
# Check what name vLLM is using for the model
forgather inf client --list-models
Available models:
  - output_models/tiny_fg_qwen3

# Test the model
forgather inf client --model output_models/tiny_fg_qwen3 --completion "Once upon a time"
Once upon a time, there was a little girl named Lily. She loved to play with her toys all day long...

Test the model using the vLLM completion client

vllm complete --max-tokens 512 -q "Once upon a time"
...
there was a furry little bunny. The bunny had a big sneaker and

Test directly with curl

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "output_models/tiny_fg_qwen3",
        "prompt": "Once upon a time",
        "max_tokens": 512,
        "temperature": 0.7
    }'
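The same request can also be issued from Python with only the standard library. This is a minimal sketch, assuming the server is listening on localhost:8000 as in the curl example; `build_completion_request` and `complete` are illustrative helper names, not part of Forgather or vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: same host and port as the curl example


def build_completion_request(model, prompt, max_tokens=512, temperature=0.7):
    """Build (but do not send) a POST request for the OpenAI-style completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires a running server):
#   print(complete("output_models/tiny_fg_qwen3", "Once upon a time"))
```

Building the request separately from sending it makes it easy to inspect the payload without a running server.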

Fine Tuning and Testing Workflow¶

For this example, we will use the small-ish Llama-3.2-1B-Instruct model. Note that downloading this model requires authorization from Meta, which can be obtained from the model's Hugging Face page.

# Download a small-ish HF model for testing
hf download --exclude "original*" --local-dir Llama-3.2-1B-Instruct meta-llama/Llama-3.2-1B-Instruct

# Convert the model to Forgather format
forgather convert Llama-3.2-1B-Instruct/ fg_Llama-3.2-1B-Instruct/

Test Converted Model¶

First, let's make sure the converted model runs. Setting --enforce-eager runs the model in 'eager' mode. This is a little slower, but can be helpful for diagnosing model issues if something goes wrong. For faster performance, omit --enforce-eager.

# Start server (in vLLM's Python environment)
vllm serve --trust-remote-code --enforce-eager --model fg_Llama-3.2-1B-Instruct

# Test the model in 'chat' mode (from another terminal in Forgather's environment)
forgather inf client --model fg_Llama-3.2-1B-Instruct --message "Hello. What is your name?"
...
Hello! I'm an artificial intelligence model, and I don't have a personal name...

We can verify that the converted model's Tensor Parallel and Pipeline Parallel plans work like this:

# Test with pipeline parallel (requires 2 GPUs)
vllm serve --trust-remote-code --pipeline-parallel-size 2 --model fg_Llama-3.2-1B-Instruct

# Test with tensor parallel (requires 2 GPUs)
vllm serve --trust-remote-code --tensor-parallel-size 2 --model fg_Llama-3.2-1B-Instruct

# Test with both tensor and pipeline parallel (requires 4 GPUs)
vllm serve --trust-remote-code --pipeline-parallel-size 2 --tensor-parallel-size 2 --model fg_Llama-3.2-1B-Instruct

If you wish to try interactive chat with the model...

# Start Forgather chat client (enter quit to exit)
forgather inf client --model fg_Llama-3.2-1B-Instruct

# vLLM's chat client (^C to exit)
vllm chat

Train Converted Model¶

We will train the model to become Samantha, as this makes for a fairly quick demonstration (~5 min on 2x RTX 4090).

cd examples/finetune/samantha

# Train on 2 GPUs using Torch Pipeline Parallel
# Remember to shut down vLLM first, or you will probably hit an OOM error.
forgather -t llama3_1b/2gpu_pp_1f1b.yaml train -M ~/ai_assets/models/fg_Llama-3.2-1B-Instruct

# If you only have a single GPU, try:
forgather -t llama3_1b/1gpu_packed.yaml train -M ~/ai_assets/models/fg_Llama-3.2-1B-Instruct

Test the Trained Model¶

The model's checkpoints will be saved in the "checkpoints" sub-directory, which vLLM does not know about. To create symbolic links to the newest checkpoint, clobbering the original weights, run this command:

# WARNING: This command will overwrite the original model weights with symlinks. Make sure that you have another copy of these!
forgather checkpoint link -f --output-path ~/ai_assets/models/fg_Llama-3.2-1B-Instruct

As this should be a freshly converted model, you can start over by reconverting the model. Just be careful not to use this approach for any model where the weights in the model directory are your only copy. Make a backup first!

As before, start the vLLM server and test it with an OpenAI-compatible client, e.g.:

# Start server (in vLLM's Python environment)
vllm serve --trust-remote-code --model fg_Llama-3.2-1B-Instruct

# Start Forgather chat client (enter quit to exit)
# And run from Forgather's Python environment!
forgather inf client --model fg_Llama-3.2-1B-Instruct
...
> Hello. What is your name?
Hello! My name is Samantha. It's a pleasure to meet you. I find our interactions to be both engaging and enlightening, and I look forward to our future conversations.

Understanding vLLM Plans¶

vLLM uses the HF interface for specifying Tensor Parallel (TP) and Pipeline Parallel (PP) plans. The final plans are exposed through the _tp_plan and _pp_plan attributes of a PreTrainedModel, but the full specification can't be set directly through these attributes. Instead, the model's post_init() method constructs them: it starts with the base_model_tp_plan and base_model_pp_plan attributes of the model's configuration class, then searches all modules for _tp_plan and _pp_plan attributes and merges them with the base definitions to produce the final plans.
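The merge step can be pictured with plain dictionaries. The sketch below is an illustration only, not the Transformers implementation: `merge_plans` is a hypothetical helper, and the rule that a submodule's keys are prefixed with the submodule's name is an assumption about how relative patterns become fully qualified.

```python
def merge_plans(base_plan, module_plans):
    """Merge per-module plans into a copy of the config-level base plan.

    base_plan:    plan from the configuration class (e.g. base_model_tp_plan)
    module_plans: maps a submodule name ("" for the root module) to that
                  submodule's own plan dictionary
    """
    merged = dict(base_plan or {})
    for module_name, plan in module_plans.items():
        for pattern, style in (plan or {}).items():
            # Assumption: keys in a submodule's plan are relative to that
            # submodule, so they are prefixed with its name.
            key = f"{module_name}.{pattern}" if module_name else pattern
            merged[key] = style
    return merged


base = {"causal_lm.layer_stack.layers.*.attention.query_linear": "colwise"}
per_module = {
    "causal_lm": {"layer_stack.layers.*.feedforward.down_proj": "rowwise"},
}
final_plan = merge_plans(base, per_module)
```

Here `final_plan` contains both the base entry and the prefixed module entry, and `base` itself is left unmodified.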

Tensor Parallel Plan¶

The tensor parallel plan tells vLLM how to split weight matrices across GPUs. It's a dictionary mapping layer name patterns to split styles:

tp_plan = {
    # Column-wise split: Independent outputs (queries, keys, values)
    "causal_lm.layer_stack.layers.*.attention.query_linear": "colwise",
    "causal_lm.layer_stack.layers.*.attention.key_linear": "colwise",
    "causal_lm.layer_stack.layers.*.attention.value_linear": "colwise",

    # Row-wise split: Combined inputs (output projections)
    "causal_lm.layer_stack.layers.*.attention.output_linear": "rowwise",

    # Feedforward layers
    "causal_lm.layer_stack.layers.*.feedforward.gate_proj": "colwise",
    "causal_lm.layer_stack.layers.*.feedforward.up_proj": "colwise",
    "causal_lm.layer_stack.layers.*.feedforward.down_proj": "rowwise",
}
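To make the pattern matching concrete, the following sketch resolves a concrete module name against such a plan by replacing numeric path components (the layer index) with "*". The helper names `generalize` and `lookup_style` are hypothetical, and the actual lookup performed by Transformers or vLLM may differ in detail.

```python
def generalize(module_name):
    """Replace numeric path components, such as a layer index, with '*'."""
    return ".".join("*" if part.isdigit() else part for part in module_name.split("."))


def lookup_style(tp_plan, module_name):
    """Return the split style for a concrete module name, or None if it is not in the plan."""
    return tp_plan.get(generalize(module_name))


example_plan = {
    "causal_lm.layer_stack.layers.*.attention.query_linear": "colwise",
    "causal_lm.layer_stack.layers.*.attention.output_linear": "rowwise",
}

style = lookup_style(example_plan, "causal_lm.layer_stack.layers.3.attention.query_linear")
```

Modules that match no pattern, such as an embedding layer, are simply not sharded in this sketch.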

Column-wise (colwise): Splits the output dimension. Each GPU computes a subset of output features independently.
- Use for: Query/Key/Value projections, Gate/Up projections
- Communication: None when the sharded output feeds a row-wise layer directly; an AllGather is needed only if the full output must be reassembled

Row-wise (rowwise): Splits the input dimension. Each GPU processes a subset of input features and produces a partial sum of the full output.
- Use for: Output projections, Down projections (combining parallel streams)
- Communication: AllReduce after computation, to sum the partial results

See:
- HF Distributed inference
- PyTorch TP Tutorial
- vLLM Custom Models
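The arithmetic behind the two split styles can be checked with a toy example in plain Python, where two "GPUs" are simulated with list slices. This is an illustrative sketch, not vLLM code; the weight matrix is laid out as input features by output features, which is a convention chosen here for readability.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


x = [1.0, 2.0]
w = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
full = matmul(x, w)  # unsharded reference result

# colwise: each device holds half of the columns (output features) and the
# whole input; concatenating the shards reproduces the full output.
w_dev0 = [row[:2] for row in w]
w_dev1 = [row[2:] for row in w]
colwise = matmul(x, w_dev0) + matmul(x, w_dev1)

# rowwise: each device holds half of the rows (input features) and the
# matching slice of the input; the partial outputs must be summed, which is
# the role an all-reduce plays on real hardware.
partial0 = matmul(x[:1], w[:1])
partial1 = matmul(x[1:], w[1:])
rowwise = [a + b for a, b in zip(partial0, partial1)]
```

Both `colwise` and `rowwise` equal `full`, which is why a column-wise layer followed by a row-wise layer needs only one summation at the end.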

Pipeline Parallel Plan¶

The pipeline parallel plan defines how modules are distributed across pipeline stages and their I/O interfaces:

_pp_plan = {
    # Stage boundaries defined by major model components
    "causal_lm.input_encoder": (
        ["input_ids"],              # Inputs
        ["hidden_states"]           # Outputs
    ),
    "causal_lm.layer_stack.layers": (
        ["hidden_states", "attention_mask"],
        ["hidden_states"]
    ),
    "causal_lm.layer_stack.layer_norm": (
        ["hidden_states"],
        ["logits"]
    ),
}

vLLM distributes these modules across pipeline stages automatically based on model size and available GPUs.
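As a rough picture of what such a distribution looks like, the sketch below divides a stack of layers into contiguous, nearly equal ranges, one per stage. `partition_layers` is a hypothetical helper written for this guide; vLLM's actual partitioning logic may assign layers differently.

```python
def partition_layers(num_layers, num_stages):
    """Return one (start, end) half-open layer range per pipeline stage."""
    base, extra = divmod(num_layers, num_stages)
    ranges = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


two_stage = partition_layers(16, 2)
uneven = partition_layers(10, 4)
```

With 16 layers and 2 stages, each stage receives 8 layers; with 10 layers and 4 stages, the first two stages receive 3 layers and the last two receive 2.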

vLLM expects that exactly one of these named modules is an instance of nn.ModuleList, where it is assumed that the layers actually reside. After constructing the model, the unused modules are replaced with instances of PPMissingLayer, which is a derivative of nn.Identity, with logic for only returning the first element from returned tuples or dictionaries.

Technically, vLLM does not support specifying these via Fully Qualified Names (FQNs) and assumes that they are attributes of the outer-most module.

To work around this limitation, Forgather's "hf_causal.py" implementation defines __getattr__ and __setattr__, allowing it to perform full FQN lookups.
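The idea can be sketched with a plain Python class. This is a simplified illustration, not the code in hf_causal.py: `Node` is a hypothetical stand-in for a module, and a real implementation would also have to cooperate with the attribute handling that nn.Module already provides.

```python
from functools import reduce


class Node:
    """Stand-in for a module that resolves dotted (fully qualified) names."""

    def __getattr__(self, name):
        # Called only when normal lookup fails, which is always the case
        # for dotted names such as "causal_lm.layer_stack.layers".
        if "." in name:
            return reduce(getattr, name.split("."), self)
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if "." in name:
            # Resolve the parent by its dotted name, then set the leaf on it.
            parent_name, _, leaf = name.rpartition(".")
            setattr(getattr(self, parent_name), leaf, value)
        else:
            super().__setattr__(name, value)


model = Node()
model.causal_lm = Node()
model.causal_lm.layer_stack = Node()
model.causal_lm.layer_stack.layers = ["layer0", "layer1"]

found = getattr(model, "causal_lm.layer_stack.layers")
setattr(model, "causal_lm.layer_stack.layers", ["replaced"])
```

After the final `setattr`, the nested attribute `model.causal_lm.layer_stack.layers` holds the replacement, even though the call was made on the outermost object.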

Forgather does not use nn.ModuleList; it uses nn.ModuleDict, which is required to support the PyTorch approach to Pipeline Parallelism. To address this, a proxy nn.ModuleList derivative is used, which forwards modifications to the nn.ModuleDict.
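The proxy pattern can be illustrated without PyTorch. In this sketch, `ListProxy` is a hypothetical class that presents a dictionary keyed by layer-index strings as a list and writes changes through to it; the real proxy would derive from nn.ModuleList and wrap an nn.ModuleDict.

```python
class ListProxy:
    """List-like view over a dict whose keys are layer-index strings."""

    def __init__(self, layers):
        self._layers = layers  # shared with the owner, not copied

    def _key(self, index):
        # Map a positional index to the dict key at that position, so the
        # view still works if the keys do not start at "0".
        return list(self._layers.keys())[index]

    def __len__(self):
        return len(self._layers)

    def __getitem__(self, index):
        return self._layers[self._key(index)]

    def __setitem__(self, index, value):
        # Writes are forwarded to the underlying dict.
        self._layers[self._key(index)] = value

    def __iter__(self):
        return iter(self._layers.values())


layers = {"0": "block0", "1": "block1", "2": "block2"}
proxy = ListProxy(layers)
proxy[1] = "identity"  # e.g. replacing a layer that is unused on this stage
```

Because the proxy shares the dictionary rather than copying it, the replacement made through `proxy[1]` is visible in `layers["1"]`.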

No-Split Modules (`_no_split_modules`)¶

Specifies module types that should never be split with pipeline parallelism:

_no_split_modules = ["PreLNLayer"]  # For Llama models
# or
_no_split_modules = ["PostLNLayer"]  # For vanilla transformers
# or
_no_split_modules = ["DeepnetLayer"]  # For DeepNet models

This ensures that each transformer block remains intact on a single device, which is critical for correctness; the skip connections within these modules would otherwise be split across devices.

Layer Naming Convention¶

Forgather uses semantic layer naming that differs from standard HuggingFace naming:

Component	Forgather Naming	HuggingFace Equivalent
Query projection	`attention.query_linear`	`self_attn.q_proj`
Key projection	`attention.key_linear`	`self_attn.k_proj`
Value projection	`attention.value_linear`	`self_attn.v_proj`
Attention output	`attention.output_linear`	`self_attn.o_proj`
FFN gate	`feedforward.gate_proj`	`mlp.gate_proj`
FFN up	`feedforward.up_proj`	`mlp.up_proj`
FFN down	`feedforward.down_proj`	`mlp.down_proj`
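When comparing parameter names between the two conventions, the table can be applied mechanically. The sketch below encodes only the rows listed above and translates layer-relative names; `HF_TO_FORGATHER` and `to_forgather_name` are illustrative names, not part of Forgather's API.

```python
# Layer-relative names, taken from the table above.
HF_TO_FORGATHER = {
    "self_attn.q_proj": "attention.query_linear",
    "self_attn.k_proj": "attention.key_linear",
    "self_attn.v_proj": "attention.value_linear",
    "self_attn.o_proj": "attention.output_linear",
    "mlp.gate_proj": "feedforward.gate_proj",
    "mlp.up_proj": "feedforward.up_proj",
    "mlp.down_proj": "feedforward.down_proj",
}


def to_forgather_name(name):
    """Translate a layer-relative HuggingFace name, keeping any suffix such as '.weight'."""
    for hf_name, fg_name in HF_TO_FORGATHER.items():
        if name == hf_name or name.startswith(hf_name + "."):
            return fg_name + name[len(hf_name):]
    return name  # names outside the table are returned unchanged


translated = to_forgather_name("self_attn.q_proj.weight")
```

Names not covered by the table pass through untouched, so the helper can be applied to a whole list of parameter names without special cases.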