LPL#6 : Agentic Memory Hierarchy
Computer systems have a memory hierarchy. Storage technologies have very different properties: some offer high density but high latency (e.g. HDDs), some offer low density but low latency (e.g. SRAM), and some sit in the middle (DRAM). By combining them, a system can achieve both high capacity and reasonably low latency. Data moves between the layers of the hierarchy based on the access pattern, and frequently accessed data tends to stay in the top layers.
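To make the data movement concrete, here is a minimal sketch of a two-layer store: a small fast tier in front of a large slow tier, with least-recently-used (LRU) eviction. LRU is only one possible policy, chosen here for illustration, and the class name `TwoLevelStore`, its fields, and the sample data are all hypothetical.

```python
from collections import OrderedDict


class TwoLevelStore:
    """A small fast tier (LRU-managed) in front of a large slow tier."""

    def __init__(self, fast_capacity, slow_tier):
        self.fast = OrderedDict()        # small, low-latency tier
        self.slow = slow_tier            # large, high-latency tier (a plain dict here)
        self.fast_capacity = fast_capacity
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.fast:
            self.hits += 1
            self.fast.move_to_end(key)   # mark as most recently used
            return self.fast[key]
        self.misses += 1
        value = self.slow[key]           # slow path: go to the next layer
        self.fast[key] = value           # promote into the fast tier
        if len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)  # evict the least recently used key
        return value


store = TwoLevelStore(fast_capacity=2, slow_tier={k: k * 10 for k in range(100)})
for key in [1, 2, 1, 1, 3, 1, 2]:
    store.read(key)

print(store.hits, store.misses)  # 3 4
print(list(store.fast))          # [1, 2]
```

Key 1 is accessed most often, so it never leaves the fast tier, while keys 2 and 3 take turns being evicted.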
The top layers (the L0, L1, L2, and L3 caches) are hardware-managed, because the cache policy must be fast enough for the layer to make sense: if deciding whether to evict a line takes longer than simply requesting the data from the next layer, the layer is useless. The bottom layers (disk, network) are software-managed, since the time a software cache policy takes is tiny compared to the cost of actually accessing the next layer.
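The argument can be sketched as a back-of-the-envelope model: a layer pays for itself only if the policy overhead, plus the layer's own latency, plus the expected cost of misses, is less than going straight to the next layer. The function names and every latency figure below are illustrative assumptions (rough orders of magnitude), not measurements.

```python
def effective_latency(hit_rate, layer_ns, policy_ns, next_ns):
    """Average latency of an access that goes through this layer first."""
    return policy_ns + layer_ns + (1.0 - hit_rate) * next_ns


def layer_is_worthwhile(hit_rate, layer_ns, policy_ns, next_ns):
    """True if adding the layer beats always going to the next layer."""
    return effective_latency(hit_rate, layer_ns, policy_ns, next_ns) < next_ns


# Assumed numbers: ~1 ns cache in front of ~100 ns DRAM, policy cost hidden in hardware.
hardware_cache = layer_is_worthwhile(hit_rate=0.9, layer_ns=1, policy_ns=0, next_ns=100)

# Same cache, but with an assumed 200 ns software policy on every access.
software_cache_over_dram = layer_is_worthwhile(hit_rate=0.9, layer_ns=1, policy_ns=200, next_ns=100)

# Assumed 1 us software policy in front of a ~10 ms disk access.
software_cache_over_disk = layer_is_worthwhile(hit_rate=0.9, layer_ns=100, policy_ns=1_000, next_ns=10_000_000)

print(hardware_cache)            # True
print(software_cache_over_dram)  # False
print(software_cache_over_disk)  # True
```

Under these assumptions the same software policy that would ruin an on-chip cache is negligible in front of a disk, which is why the split between hardware-managed and software-managed layers falls where it does.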
In LLMs, we have:
- Context window, where all the AI magic happens. Any data must be loaded here before the model can consider it.
- Skills, somewhat similar to programmable microcode. An agent follows a skill's instructions by loading them into the context window.
- RAG, which is somewhat like a filesystem. If you need a document, you can fetch it on demand.
- MCP/tool calls, somewhat like the I/O subsystem.
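The analogy above can be sketched as a context window that behaves like a software-managed cache: skills, retrieved documents, and tool results are all loaded on demand, and the least recently used entries are evicted once the token budget is exceeded. This is a toy model under stated assumptions, not how any particular agent framework works: `ContextWindow`, `fetch`, the `skill:`/`doc:`/`tool:` key prefixes, the whitespace token count, and the LRU policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class ContextWindow:
    """Fixed token budget; least recently used entries are evicted first."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self.entries = OrderedDict()     # key -> text currently in context

    @staticmethod
    def count_tokens(text):
        return len(text.split())         # crude stand-in for a real tokenizer

    def used(self):
        return sum(self.count_tokens(t) for t in self.entries.values())

    def fetch(self, key, loader):
        """Return the text for key, calling loader() only on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)    # hit: mark as most recently used
            return self.entries[key]
        text = loader()                      # miss: go to the lower layer
        if self.count_tokens(text) > self.token_budget:
            raise ValueError(f"{key} does not fit in the context window")
        self.entries[key] = text
        while self.used() > self.token_budget:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return text


# Hypothetical lower layers: a skill library and a document store standing in for RAG.
skills = {"summarize": "Summarize the document in one sentence"}
documents = {
    "design": "caches trade capacity for latency",
    "notes": "agents load data on demand",
}

window = ContextWindow(token_budget=12)
window.fetch("skill:summarize", lambda: skills["summarize"])  # load a skill
window.fetch("doc:design", lambda: documents["design"])       # retrieve a document
window.fetch("skill:summarize", lambda: skills["summarize"])  # hit, no reload
window.fetch("doc:notes", lambda: documents["notes"])         # evicts doc:design
window.fetch("tool:add", lambda: str(2 + 3))                  # tool call as I/O

print(list(window.entries))  # ['skill:summarize', 'doc:notes', 'tool:add']
print(window.used())         # 12
```

Because the skill was touched again before the second document arrived, the older document is the one that gets evicted, mirroring how frequently accessed data stays in the top layers of a memory hierarchy.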