ollama

mirror of https://github.com/ollama/ollama.git synced 2026-01-29 07:12:03 +03:00

Files

Jeffrey Morgan 1044b0419a model: add MLA absorption for glm4moelite (#13810)

* model: add MLA absorption for glm4moelite

Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion. This enables MLA (Multi-head Latent Attention)
absorption, which compresses the KV cache for improved efficiency.

* ggml: enable MLA flash attention for GLM-4.7-flash

Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4; head size 576 was previously
supported only with gqa_ratio 16 (DeepSeek).

Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups

CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4

* model: add compatibility validation for glm4moelite architecture

2026-01-23 14:47:42 -08:00
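As a rough illustration of the MLA absorption described in the commit message above, the sketch below splits a combined KV_B weight into per-head K_B and V_B blocks and checks that folding K_B into the query gives the same attention score as decompressing the cached latent first. It is not the converter code from this repository. The function names (`split_kv_b`, `naive_score`, `absorbed_score`), the toy dimensions, and the row layout (each head stores its K rows followed by its V rows) are all assumptions made for the example.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_kv_b(kv_b, n_heads, k_dim, v_dim):
    """Split a combined KV_B weight into per-head K_B and V_B blocks.

    Assumed layout (hypothetical): rows are grouped per head, with
    k_dim key rows followed by v_dim value rows; columns span the latent rank.
    """
    per_head = k_dim + v_dim
    k_b, v_b = [], []
    for h in range(n_heads):
        block = kv_b[h * per_head:(h + 1) * per_head]
        k_b.append(block[:k_dim])
        v_b.append(block[k_dim:])
    return k_b, v_b


def naive_score(k_b_h, q, c):
    """Decompress the cached latent c into a full key, then score against q."""
    return dot(q, matvec(k_b_h, c))


def absorbed_score(k_b_h, q, c):
    """Fold K_B into the query, then score directly against the latent c."""
    k_b_t = [list(col) for col in zip(*k_b_h)]  # transpose of K_B for this head
    q_latent = matvec(k_b_t, q)
    return dot(q_latent, c)


# Toy sizes, chosen only to keep the example small.
random.seed(0)
n_heads, k_dim, v_dim, rank = 2, 3, 2, 4
kv_b = [[random.uniform(-1, 1) for _ in range(rank)]
        for _ in range(n_heads * (k_dim + v_dim))]
k_b, v_b = split_kv_b(kv_b, n_heads, k_dim, v_dim)

queries = [[random.uniform(-1, 1) for _ in range(k_dim)] for _ in range(n_heads)]
latent = [random.uniform(-1, 1) for _ in range(rank)]  # one cached latent vector

for h in range(n_heads):
    diff = naive_score(k_b[h], queries[h], latent) - absorbed_score(k_b[h], queries[h], latent)
    assert abs(diff) < 1e-9
```

Because both paths give the same score, only the low-rank latent needs to be cached per token, and that depends on K_B being available as its own tensor. V_B can be handled the same way on the output side, by applying it after the attention-weighted sum of latents.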

imageproc

deepseekocr

2025-11-18 16:11:37 -08:00

input

batch: use tensors for outputs (#12185)

2025-09-15 14:33:06 -07:00

models

model: add MLA absorption for glm4moelite (#13810)

2026-01-23 14:47:42 -08:00

parsers

model: add lfm2 architecture and LFM2.5-1.2B-Thinking support (#13792)