Jeffrey Morgan a1ca428c90 glm4moelite: fix attention scale calculation (#13893)
Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for
the attention scale instead of the MLA absorbed dimension (kvLoraRank +
qkRopeHeadDim = 576).

MLA (multi-head latent attention) absorption is a mathematically
equivalent reorganization of the attention computation, so it should
not change the effective attention scale. The scale should match
training, which uses 1/sqrt(256).
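
A minimal sketch of the before/after scale computation. The struct and
field names are illustrative rather than the exact ones in the Ollama
source, and the per-dimension split (qkRopeHeadDim = 64, hence
qkNopeHeadDim = 192 and kvLoraRank = 512) is an assumption consistent
with the sums 256 and 576 stated above:

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative options struct; the real model code may differ.
type options struct {
	qkNopeHeadDim int // non-RoPE portion of each query/key head (assumed 192)
	qkRopeHeadDim int // RoPE portion of each query/key head (assumed 64)
	kvLoraRank    int // compressed KV rank used by MLA absorption (assumed 512)
}

func main() {
	opts := options{qkNopeHeadDim: 192, qkRopeHeadDim: 64, kvLoraRank: 512}

	// Before: scale derived from the MLA absorbed key dimension
	// (kvLoraRank + qkRopeHeadDim = 576).
	absorbedDim := opts.kvLoraRank + opts.qkRopeHeadDim
	oldScale := 1.0 / math.Sqrt(float64(absorbedDim))

	// After: scale derived from the original key dimension
	// (qkNopeHeadDim + qkRopeHeadDim = 256), matching training.
	keyDim := opts.qkNopeHeadDim + opts.qkRopeHeadDim
	newScale := 1.0 / math.Sqrt(float64(keyDim))

	fmt.Printf("old: 1/sqrt(%d) = %.6f\n", absorbedDim, oldScale) // 0.041667
	fmt.Printf("new: 1/sqrt(%d) = %.6f\n", keyDim, newScale)      // 0.062500
}
```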

This improves tool calling and reduces model looping.
2026-01-24 17:48:09 -08:00