Gemma-3 (text)
Forgather model project for Google Gemma-3 text models, supporting
HuggingFace ↔ Forgather round-trip conversion via forgather convert.
Architecture notes
Gemma-3 is mostly Llama-shaped but has several quirks that motivate the extra modelsrc components in this project:
- Decoupled `head_dim`: `head_dim * num_heads != hidden_size` (e.g. for the 270M model, `head_dim=256`, `num_heads=4`, `hidden_size=640`). Q/K/V/O projections use `num_heads * head_dim`, and `CausalMultiheadAttn` accepts an optional explicit `head_dim` for this.
- GemmaRMSNorm: RMSNorm applied as `(1 + weight) * normalize(x)`, with the weight initialized to zero. Computed in float32 for stability. Lives in `modelsrc/transformer/gemma_rms_norm.py`.
- QK-Norm: Query and key are normalized per-head (over `head_dim`) before RoPE. Reuses `CausalMultiheadAttn.qk_norm_factory`, pointed at `GemmaRMSNorm`.
- Four layernorms per decoder layer: `input_layernorm`, `post_attention_layernorm`, `pre_feedforward_layernorm`, `post_feedforward_layernorm`. `GemmaDecoderLayer` in `modelsrc/transformer/gemma_layer.py` wraps both the attention and feedforward blocks with pre- and post-norms.
- Dual RoPE: full-attention layers use a large `rope_theta` (e.g. 1e6) and sliding-attention layers use a small one (e.g. 1e4). `GemmaDualRotaryEmbedding` wraps two `RotaryEmbedding` instances and returns a dict keyed by layer type; `GemmaDecoderLayer` selects the appropriate entry per layer.
- Per-layer attention type: `config.layer_types` assigns each layer to either `full_attention` or `sliding_attention`. `GemmaDecoderLayer` constructs its attention module with the per-layer `sliding_window` (from `config.sliding_window`) for sliding layers and `None` for full layers. A `gemma_mask_fn` helper builds both a full causal mask and a sliding window mask at forward time and passes them through as a dict that the decoder layer unpacks.
- GeGLU with `gelu_pytorch_tanh`: the MLP activation is `nn.GELU(approximate='tanh')`. `use_triton` is set to `False` because the existing Triton-fused GLU kernel implements exact (erf-based) GELU.
- Embedding scaling by `sqrt(hidden_size)`: enabled via `InputEncoder.scale_sqrt_d_model`.
- Tied embeddings: `tie_word_embeddings=True` by default, wired via the standard model code generator `tied_weights_keys` mechanism.
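
Several of the quirks above are easier to see in a few lines of code. The following is a minimal, framework-free Python sketch, not the project's PyTorch implementation. The function names (`gemma_rms_norm`, `build_masks`, `per_layer_inputs`, `gelu_tanh`, `gelu_erf`), the `eps` default, the three-layer schedule, and the exact sliding-window boundary are assumptions made for illustration; only the numeric values for the 270M model and the example `rope_theta` values come from the notes above.

```python
import math

# Values for the 270M model, taken from the notes above.
HEAD_DIM, NUM_HEADS, HIDDEN_SIZE = 256, 4, 640
PROJ_WIDTH = NUM_HEADS * HEAD_DIM  # 1024, which differs from hidden_size (640)


def gemma_rms_norm(x, weight, eps=1e-6):
    """Return (1 + weight) * x / rms(x) for one vector (eps is an assumed default)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [(1.0 + w) * v * inv_rms for v, w in zip(x, weight)]


def build_masks(seq_len, sliding_window):
    """Build both masks as nested lists; True means query i may attend to key j."""
    full = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Assumed convention: the window covers the current token plus the
    # previous sliding_window - 1 tokens.
    sliding = [
        [j <= i and i - j < sliding_window for j in range(seq_len)]
        for i in range(seq_len)
    ]
    return {"full_attention": full, "sliding_attention": sliding}


def per_layer_inputs(layer_types, masks, rope_thetas):
    """Pick the mask and RoPE base each layer would use, keyed by its layer type."""
    return [(masks[t], rope_thetas[t]) for t in layer_types]


def gelu_tanh(x):
    """Tanh approximation of GELU, the form selected by approximate='tanh'."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_erf(x):
    """Exact (erf-based) GELU, shown only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical 3-layer schedule; a real model reads this from config.layer_types.
layer_types = ["sliding_attention", "sliding_attention", "full_attention"]
masks = build_masks(seq_len=4, sliding_window=2)
rope_thetas = {"full_attention": 1e6, "sliding_attention": 1e4}
selected = per_layer_inputs(layer_types, masks, rope_thetas)

# With a zero-initialized weight the norm reduces to plain RMS normalization.
normed = gemma_rms_norm([3.0, 4.0], weight=[0.0, 0.0])

# The two GELU variants are close but not identical, so a kernel that
# implements the erf form is not a drop-in replacement for the tanh form.
gelu_gap = abs(gelu_tanh(1.0) - gelu_erf(1.0))
```

Keying the masks and RoPE bases by layer type mirrors the dict-based hand-off described above: each layer only needs to know its own type to pick the right entry.
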
Usage
Convert a HuggingFace Gemma-3 model to Forgather format:
forgather convert ~/ai_assets/models/google_gemma-3-270m \
~/ai_assets/models/fg_gemma-3-270m \
--model-type gemma3 --dtype bfloat16
Serve it with the Forgather inference server:
forgather inf server --model ~/ai_assets/models/fg_gemma-3-270m
forgather inf client --completion "Once upon a time"
Reverse conversion is symmetric (the HF model type is stored in the Forgather config during conversion and auto-detected on the way back):