🚀 Speed Up Your LLM x10: The Magic of KV Caching (w/ Code!)
Hey NLP Engineers Family! 👋 Ever wondered why your second chat prompt is faster than the first? 🤔 The secret sauce is KV Caching. Today, we're breaking it down with zero math PhD required.
🐢 The Problem: Without KV Caching (The Slow Way)
When a Large Language Model (LLM) generates text, it re-calculates the same words over and over again. It has the memory of a goldfish. 🐠
How it works (Naive method):
1. You write "Hello" → the model computes keys/values (K,V) for "Hello" and generates "world".
2. You now have "Hello world" → the model re-computes K,V for "Hello" again, computes K,V for "world", and generates "!".
Result: O(n²) complexity. Quadratic slowdown. 😫
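You can check the arithmetic behind that quadratic blow-up yourself (a toy tally, not real model code):

```python
# Count token computations for generating n tokens.
def naive_ops(n: int) -> int:
    # Step i re-processes all i tokens seen so far.
    return sum(range(1, n + 1))

def cached_ops(n: int) -> int:
    # With a cache, each step processes exactly one new token.
    return n

print(naive_ops(100))   # 5050 computations
print(cached_ops(100))  # 100 computations
```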
Code Example (Pseudocode - Without Cache)
```python
# WITHOUT KV CACHING ❌
def generate_without_cache(input_ids, num_new_tokens):
    for _ in range(num_new_tokens):
        # Re-computes EVERY previous token's K,V from scratch
        keys, values = model.compute_all(input_ids)
        next_token = sample(keys, values)
        input_ids.append(next_token)
    return input_ids

# If you have 100 tokens, you do 100+99+98+...+1 calculations = 5050 ops 😫
```

⚡ The Solution: With KV Caching (The Smart Way)
KV Caching is like a post-it note 📝. Once the model computes the Key (K) and Value (V) for a word, it remembers them. It never does that math again.
How it works (Smart method):
1. You write "Hello" → the model computes K,V for "Hello" and stores them in the cache, then generates "world".
2. The cache stores K,V for "world" too.
3. You write "!" → the model reads the cache (Hello, world) and computes only "!".
Result: O(n) complexity. Linear speed. 🚀
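The smart method above can be sketched with a stand-in model that counts how many tokens it actually processes per step (a toy `ToyModel`, not a real LLM):

```python
class ToyModel:
    """Stand-in for an LLM: records how much work it does."""
    def __init__(self):
        self.tokens_processed = 0

    def forward(self, token, cache):
        # Only the NEW token is computed; its K,V pair joins the cache.
        self.tokens_processed += 1
        cache.append((f"K({token})", f"V({token})"))
        return cache

model = ToyModel()
cache = []
for token in ["Hello", "world", "!"]:
    cache = model.forward(token, cache)

print(model.tokens_processed)  # 3 -> one computation per token: O(n)
print(len(cache))              # 3 cached K,V pairs
```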
Code Example (With Cache)
```python
# WITH KV CACHING ✅
def generate_with_cache(input_ids, num_new_tokens):
    past_kv = None  # This is your "post-it" note
    for _ in range(num_new_tokens):
        # Only compute the NEW token; past K,V come from the cache
        keys, values, past_kv = model.compute_one(input_ids[-1], past_kv)
        next_token = sample(keys, values)
        input_ids.append(next_token)
    return input_ids, past_kv

# For 100 tokens: only 100 calculations. ~50x fewer ops! ⚡
```

🚀 The Difference (Real Talk)
| Feature | Without Cache 🐢 | With Cache 🚀 |
|---|---|---|
| Speed | Gets slower with every word | Near-constant per token |
| Memory | Low (no storage) | High (stores K,V tensors) |
| Use Case | Single short prompt | Long chats, summaries, code gen |
| Math Ops | O(n²) | O(n) |
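Crucially, both columns compute the exact same outputs. Here's a tiny single-head attention in NumPy that verifies it (a toy sketch; random projection matrices stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Attention for the latest query against all keys/values so far.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

x = rng.standard_normal((5, d))              # 5 token embeddings

# Without cache: recompute K, V for the whole prefix at every step.
out_naive = [attend(x[t] @ Wq, x[:t+1] @ Wk, x[:t+1] @ Wv) for t in range(5)]

# With cache: append one K,V row per step, never recompute old ones.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
out_cached = []
for t in range(5):
    K_cache = np.vstack([K_cache, x[t] @ Wk])
    V_cache = np.vstack([V_cache, x[t] @ Wv])
    out_cached.append(attend(x[t] @ Wq, K_cache, V_cache))

print(np.allclose(out_naive, out_cached))    # True: same math, less work
```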
Visual Example (Simple Python)
```python
import time

# Simulate token generation (10 tokens)
tokens = ["The", "cat", "sat", "on", "the", "mat", ".", "It", "was", "happy"]

# WITHOUT CACHE 🐢
start = time.time()
for i in range(1, len(tokens) + 1):
    # "Re-calculating" previous tokens
    _ = [f"re-calc-{x}" for x in range(i)]  # Simulated work
    print(f"Step {i}: Re-calculated {i} tokens")
print(f"🐢 Time: {time.time() - start:.4f}s\n")

# WITH CACHE ✅
start = time.time()
cache = []
for i, token in enumerate(tokens):
    # Only calculate current token
    cache.append(f"cached-{token}")
    print(f"Step {i+1}: Calculated 1 new token (cache size: {len(cache)})")
print(f"🚀 Time: {time.time() - start:.4f}s")
```

Output:

```text
Step 10: Re-calculated 10 tokens
🐢 Time: 0.0025s

Step 10: Calculated 1 new token (cache size: 10)
🚀 Time: 0.0003s  ← ~8x faster!
```

🎯 Pro Tips for Your Code
1. Enable it by default in Hugging Face: generation uses use_cache=True unless you turn it off.
2. Watch out for memory blow-up 💥 Long conversations = big cache. Use past_key_values trimming or a sliding window.
3. Perfect for chatbots 🤖 Without cache: 3s response after 50 messages. With cache: 300ms.
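The sliding-window trick from tip 2 can be sketched in a few lines (a minimal `SlidingWindowKVCache` invented for illustration; real frameworks ship their own cache classes):

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps only the most recent `window` K,V pairs to bound memory."""
    def __init__(self, window: int):
        self.window = window
        self.entries = deque(maxlen=window)  # oldest entries drop automatically

    def append(self, k, v):
        self.entries.append((k, v))

    def __len__(self):
        return len(self.entries)

cache = SlidingWindowKVCache(window=4)
for t in range(10):                  # simulate a 10-token conversation
    cache.append(f"K{t}", f"V{t}")

print(len(cache))                    # 4 -> memory stays bounded
print(cache.entries[0])              # ('K6', 'V6'): oldest surviving entry
```

The trade-off: trimmed tokens are gone, so the model can no longer attend to them.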
🎯 The Bottom Line
KV Caching = Remembering the past so you don't repeat work.
❌ Without: Goldfish memory. Slow. Painful.
✅ With: Elephant memory. Fast. Happy.
Go check your LLM code. If use_cache=False is there… delete it now. 🔥
Happy coding, Beehive! 🐝💛
Liked this? Share with a friend who still thinks AI is "just magic." ✨