The KV cache problem
Standard transformer inference stores the key and value vectors of every previous token in a KV cache. For long contexts (32K–128K tokens), the cache grows linearly with sequence length: a 128K-token context with Qwen3-72B requires roughly 48 GB of KV cache memory alone, before counting the model weights.
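As a rough sanity check on that figure, the sketch below computes the cache footprint as 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The layer count, KV-head count, and head dimension are assumptions for a 72B-class model, not the published Qwen3-72B configuration, so the result (about 40 GiB in fp16) only illustrates the order of magnitude.

```python
# Back-of-the-envelope KV cache sizing.
# The hyperparameters below are illustrative assumptions, not the actual
# Qwen3-72B config; substitute the real values to reproduce the ~48 GB figure.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total bytes needed to store keys and values for seq_len tokens."""
    # 2x: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 72B-class configuration: 80 layers, grouped-query attention with
# 8 KV heads of dimension 128, fp16 cache entries.
total = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")  # ~40 GiB under these assumptions
```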
This is the central bottleneck for long-context inference on consumer hardware.
The mycelium insight
Fungal mycelium networks solve a similar problem: how do you route information across a large distributed network with limited bandwidth? They do it by:
1. Selective reinforcement — paths that carry useful signals get stronger
2. Pruning — unused pathways decay over time
3. Adaptive topology — the network restructures based on what information matters
HyphalLLM applies these principles to the KV cache (a minimal sketch follows the list):
- Salience scoring — each KV pair gets a learned importance score based on attention patterns
- Selective eviction — low-salience pairs are evicted first, preserving the most relevant context
- Hierarchical storage — hot (recent + high-salience) KV pairs stay in GPU memory; cold pairs are compressed to CPU RAM
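The following is a minimal Python sketch of how these three mechanisms could compose. The class name, the hot/cold dictionaries, and the exponential-moving-average salience update are illustrative stand-ins: in HyphalLLM the salience score comes from a learned model and the tiers are GPU and CPU buffers inside llama.cpp, neither of which is shown here.

```python
import numpy as np

class TieredKVCache:
    """Illustrative two-tier KV cache with salience-based eviction.

    Hot tier: recent and high-salience entries (would live in GPU memory).
    Cold tier: evicted entries, kept in a cheaper format on the CPU side.
    Names and structure are hypothetical, not the HyphalLLM C++ API.
    """

    def __init__(self, hot_budget, alpha=0.3, recent_window=64):
        self.hot_budget = hot_budget        # max entries kept in the hot tier
        self.alpha = alpha                  # fraction of candidates evicted per pass
        self.recent_window = recent_window  # most recent tokens are never evicted
        self.hot = {}                       # token_idx -> (key, value)
        self.cold = {}                      # token_idx -> compressed (key, value)
        self.salience = {}                  # token_idx -> running importance score

    def append(self, token_idx, key, value):
        self.hot[token_idx] = (key, value)
        self.salience[token_idx] = 0.0
        if len(self.hot) > self.hot_budget:
            self._evict()

    def observe_attention(self, attn_weights):
        """Update salience from the attention the newest query paid to each cached token.

        attn_weights: dict of token_idx -> attention probability. A learned
        scorer would replace this simple exponential moving average.
        """
        for idx, w in attn_weights.items():
            if idx in self.salience:
                self.salience[idx] = 0.9 * self.salience[idx] + 0.1 * w

    def _evict(self):
        newest = max(self.hot)
        # Only tokens outside the recent window are eviction candidates.
        candidates = [i for i in self.hot if i < newest - self.recent_window]
        candidates.sort(key=lambda i: self.salience[i])  # lowest salience first
        n_evict = max(1, int(self.alpha * len(candidates))) if candidates else 0
        for idx in candidates[:n_evict]:
            key, value = self.hot.pop(idx)
            # Stand-in "compression": downcast to fp16 before moving off-GPU.
            self.cold[idx] = (key.astype(np.float16), value.astype(np.float16))
```

With α = 0.3, each eviction pass moves 30% of the non-recent, lowest-salience entries to the cold tier, which is the eviction rate used in the results below.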
Results on Qwen3
On Qwen3-7B with a 128K token context, setting α = 0.3 (a 30% eviction rate) yields a 61% reduction in KV cache memory with less than 1% perplexity degradation.
The llama.cpp fork
HyphalLLM is implemented as a fork of llama.cpp with:
- A Python reference library for experimenting with eviction strategies (an illustrative usage sketch follows this list)
- A C++ implementation integrated into the llama.cpp inference loop
- A Python architecture library for training the salience scoring model
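To make the experimentation workflow concrete, here is a hypothetical experiment in the spirit of the Python reference library, reusing the TieredKVCache sketch above. The function name, its parameters, and the synthetic recency-biased attention pattern are all invented for illustration; the library's actual API is not shown, and a real evaluation of perplexity would require running the model through the fork itself.

```python
import numpy as np

def run_experiment(alpha, n_tokens=2048, hot_budget=512, dim=128):
    """Stream random KV pairs through the TieredKVCache sketch and report
    how much of the context is still resident in the (GPU-side) hot tier."""
    rng = np.random.default_rng(0)
    cache = TieredKVCache(hot_budget=hot_budget, alpha=alpha)
    for t in range(n_tokens):
        cache.append(t, rng.standard_normal(dim), rng.standard_normal(dim))
        # Synthetic, recency-biased attention standing in for real model weights.
        raw = {i: np.exp(-(t - i) / 256.0) for i in cache.hot}
        z = sum(raw.values())
        cache.observe_attention({i: w / z for i, w in raw.items()})
    return len(cache.hot) / n_tokens

for alpha in (0.1, 0.3, 0.5):
    print(f"alpha={alpha}: {run_experiment(alpha):.0%} of the context stays hot")
```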