Causal LM

A vanilla transformer

This model is a decoder-only transformer, roughly based on "Attention Is All You Need."

decoder-only (causal)
post-layer-norm
Absolute sinusoidal positional embeddings
Multi-head attention
MLP feedforward
ReLU activations
Layer Norm
Embedding weights are initialized with a scale of 1/sqrt(d_model), and input embeddings are multiplied by sqrt(d_model)
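
As a rough illustration of three of the items above (sinusoidal positions, the causal mask, and the embedding scaling), here is a minimal, dependency-free sketch. It is not this model's actual code: the function names are hypothetical, the normal distribution used for initialization is an assumption, and so is the choice to add positions after scaling, which follows the original paper.

```python
import math
import random


def sinusoidal_positions(seq_len: int, d_model: int) -> list[list[float]]:
    """Absolute sinusoidal table: sin on even dims, cos on odd dims."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Paired dims (2k, 2k+1) share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def init_embedding(vocab_size: int, d_model: int, seed: int = 0) -> list[list[float]]:
    """Draw weights with std 1/sqrt(d_model). The normal distribution is an assumption."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(d_model)
    return [[rng.gauss(0.0, std) for _ in range(d_model)] for _ in range(vocab_size)]


def embed(token_ids: list[int], weights: list[list[float]], d_model: int) -> list[list[float]]:
    """Look up tokens, multiply by sqrt(d_model), then add the position table."""
    scale = math.sqrt(d_model)
    pos = sinusoidal_positions(len(token_ids), d_model)
    return [
        [scale * w + p for w, p in zip(weights[tok], pos[t])]
        for t, tok in enumerate(token_ids)
    ]


if __name__ == "__main__":
    d_model = 8
    weights = init_embedding(vocab_size=10, d_model=d_model)
    hidden = embed([1, 2, 3], weights, d_model)
    print(len(hidden), len(hidden[0]))
    print(causal_mask(3))
```

Under these assumptions the two scale factors cancel: weights drawn with standard deviation 1/sqrt(d_model) and then multiplied by sqrt(d_model) enter the first layer with roughly unit variance, which is comparable in magnitude to the sinusoidal values in [-1, 1].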

It supports the Hugging Face (HF) attention interface and is compatible with vLLM.