The KV cache problem
Standard transformer inference stores the key and value vectors of every previous token in a KV cache. For long contexts (32K–128K tokens), the cache grows linearly with sequence length: a 128K-token context with Qwen3-72B requires roughly 48 GB of KV cache memory alone, before counting the model weights.
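As a rough sanity check on that figure, the sketch below computes the cache footprint as 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The layer count, KV-head count, and head dimension are assumptions for a 72B-class model, not the published Qwen3-72B configuration, so the result (about 40 GiB in fp16) only illustrates the order of magnitude.

```python
# Back-of-the-envelope KV cache sizing.
# The hyperparameters below are illustrative assumptions, not the actual
# Qwen3-72B config; substitute the real values to reproduce the ~48 GB figure.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total bytes needed to store keys and values for seq_len tokens."""
    # 2x: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 72B-class configuration: 80 layers, grouped-query attention with
# 8 KV heads of dimension 128, fp16 cache entries.
total = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")  # ~40 GiB under these assumptions
```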
This is the central bottleneck for long-context inference on consumer hardware.
The mycelium insight
Fungal mycelium networks solve a similar problem: how do you route information across a large distributed network with limited bandwidth? They do it by:
1. Selective reinforcement — paths that carry useful signals get stronger
2. Pruning — unused pathways decay over time
3. Adaptive topology — the network restructures based on what information matters
HyphalLLM applies these principles to the KV cache (a minimal sketch follows the list):
- Salience scoring — each KV pair gets a learned importance score based on attention patterns
- Selective eviction — low-salience pairs are evicted first, preserving the most relevant context
- Hierarchical storage — hot (recent + high-salience) KV pairs stay in GPU memory; cold pairs are compressed to CPU RAM
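The following is a minimal Python sketch of how these three mechanisms could compose. The class name, the hot/cold dictionaries, and the exponential-moving-average salience update are illustrative stand-ins: in HyphalLLM the salience score comes from a learned model and the tiers are GPU and CPU buffers inside llama.cpp, neither of which is shown here.

```python
import numpy as np

class TieredKVCache:
    """Illustrative two-tier KV cache with salience-based eviction.

    Hot tier: recent and high-salience entries (would live in GPU memory).
    Cold tier: evicted entries, kept in a cheaper format on the CPU side.
    Names and structure are hypothetical, not the HyphalLLM C++ API.
    """

    def __init__(self, hot_budget, alpha=0.3, recent_window=64):
        self.hot_budget = hot_budget        # max entries kept in the hot tier
        self.alpha = alpha                  # fraction of candidates evicted per pass
        self.recent_window = recent_window  # most recent tokens are never evicted
        self.hot = {}                       # token_idx -> (key, value)
        self.cold = {}                      # token_idx -> compressed (key, value)
        self.salience = {}                  # token_idx -> running importance score

    def append(self, token_idx, key, value):
        self.hot[token_idx] = (key, value)
        self.salience[token_idx] = 0.0
        if len(self.hot) > self.hot_budget:
            self._evict()

    def observe_attention(self, attn_weights):
        """Update salience from the attention the newest query paid to each cached token.

        attn_weights: dict of token_idx -> attention probability. A learned
        scorer would replace this simple exponential moving average.
        """
        for idx, w in attn_weights.items():
            if idx in self.salience:
                self.salience[idx] = 0.9 * self.salience[idx] + 0.1 * w

    def _evict(self):
        newest = max(self.hot)
        # Only tokens outside the recent window are eviction candidates.
        candidates = [i for i in self.hot if i < newest - self.recent_window]
        candidates.sort(key=lambda i: self.salience[i])  # lowest salience first
        n_evict = max(1, int(self.alpha * len(candidates))) if candidates else 0
        for idx in candidates[:n_evict]:
            key, value = self.hot.pop(idx)
            # Stand-in "compression": downcast to fp16 before moving off-GPU.
            self.cold[idx] = (key.astype(np.float16), value.astype(np.float16))
```

With α = 0.3, each eviction pass moves 30% of the non-recent, lowest-salience entries to the cold tier, which is the eviction rate used in the results below.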
Results on Qwen3
On Qwen3-7B with a 128K token context, setting α = 0.3 (a 30% eviction rate) yields a 61% reduction in KV cache memory with less than 1% perplexity degradation.
The llama.cpp fork
HyphalLLM is implemented as a fork of llama.cpp with:
- A Python reference library for experimenting with eviction strategies (an illustrative usage sketch follows this list)
- A C++ implementation integrated into the llama.cpp inference loop
- A Python architecture library for training the salience scoring model
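To make the experimentation workflow concrete, here is a hypothetical experiment in the spirit of the Python reference library, reusing the TieredKVCache sketch above. The function name, its parameters, and the synthetic recency-biased attention pattern are all invented for illustration; the library's actual API is not shown, and a real evaluation of perplexity would require running the model through the fork itself.

```python
import numpy as np

def run_experiment(alpha, n_tokens=2048, hot_budget=512, dim=128):
    """Stream random KV pairs through the TieredKVCache sketch and report
    how much of the context is still resident in the (GPU-side) hot tier."""
    rng = np.random.default_rng(0)
    cache = TieredKVCache(hot_budget=hot_budget, alpha=alpha)
    for t in range(n_tokens):
        cache.append(t, rng.standard_normal(dim), rng.standard_normal(dim))
        # Synthetic, recency-biased attention standing in for real model weights.
        raw = {i: np.exp(-(t - i) / 256.0) for i in cache.hot}
        z = sum(raw.values())
        cache.observe_attention({i: w / z for i, w in raw.items()})
    return len(cache.hot) / n_tokens

for alpha in (0.1, 0.3, 0.5):
    print(f"alpha={alpha}: {run_experiment(alpha):.0%} of the context stays hot")
```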