<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://happyandslow.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://happyandslow.github.io/" rel="alternate" type="text/html" /><updated>2026-04-25T17:01:18+00:00</updated><id>https://happyandslow.github.io/feed.xml</id><title type="html">Le Xu</title><subtitle>personal description</subtitle><author><name>Le Xu</name><email>le.xu@ed.ac.uk</email><uri>https://lexu.space</uri></author><entry><title type="html">On Incremental LLM Memory</title><link href="https://happyandslow.github.io/posts/2025/04/blog-post-1/" rel="alternate" type="text/html" title="On Incremental LLM Memory" /><published>2025-04-25T00:00:00+00:00</published><updated>2025-04-25T00:00:00+00:00</updated><id>https://happyandslow.github.io/posts/2025/04/incremental</id><content type="html" xml:base="https://happyandslow.github.io/posts/2025/04/blog-post-1/"><![CDATA[<!-- # LLM as Incremental Data Processing Operator -->
<p><strong>Disclaimer: All opinions my own (not related to the company/team I work for). I know a tiny bit about data streaming systems and I only pretend to know LLMs.</strong></p>

<p>Most data-intensive (long context or long generation) LLM tasks can be seen as a data processing operator that consumes one or two event streams that look like the following:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Continuous "query" defining the behavior of an LLM-based chatbot</span>
<span class="k">CREATE</span> <span class="n">CONTINUOUS</span> <span class="k">VIEW</span> <span class="n">response_stream</span> <span class="k">AS</span>
<span class="n">LLM_GENERATE</span><span class="p">(</span>
    <span class="n">PROMPT</span><span class="o">=</span><span class="nv">"Given conversation context and support docs, generate a helpful response."</span><span class="p">,</span>
    <span class="n">CONTEXT</span><span class="o">=</span><span class="n">STREAM</span><span class="p">(</span><span class="n">conversation_events</span><span class="p">),</span>        <span class="c1">-- user interactions, updates incrementally</span>
    <span class="n">DATA_SOURCE</span><span class="o">=</span><span class="n">STREAM</span><span class="p">(</span><span class="n">support_docs_or_events</span><span class="p">)</span>  <span class="c1">-- knowledge updates streamed incrementally</span>
<span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">conversation_events</code> and <code class="language-plaintext highlighter-rouge">support_docs_or_events</code> streams here can be seen as either:</p>
<ol>
  <li>Event streams (each event corresponds to a new entry that triggers incremental computation), or more broadly,</li>
  <li>Delta streams (each event corresponds to a modification—add/delete/update—that triggers incremental computation).</li>
</ol>
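
<p>To make the distinction concrete, here is a minimal sketch of a delta stream being folded into a materialized collection. The <code class="language-plaintext highlighter-rouge">Delta</code> schema and the <code class="language-plaintext highlighter-rouge">apply_deltas</code> helper are hypothetical names made up for illustration; a plain event stream is simply the special case where every event is an <code class="language-plaintext highlighter-rouge">add</code>.</p>

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Delta:
    """One event in a delta stream (hypothetical schema)."""
    op: str                       # "add", "update", or "delete"
    key: str                      # identifier of the document/entry
    value: Optional[str] = None   # new content; unused for "delete"


def apply_deltas(state: Dict[str, str], deltas: Iterable[Delta]) -> Dict[str, str]:
    """Fold a delta stream into the current materialized state."""
    for d in deltas:
        if d.op in ("add", "update"):
            state[d.key] = d.value or ""
        elif d.op == "delete":
            state.pop(d.key, None)
        else:
            raise ValueError(f"unknown op: {d.op}")
    return state


docs = apply_deltas({}, [
    Delta("add", "faq-1", "How to reset a password"),
    Delta("add", "faq-2", "How to close an account"),
    Delta("update", "faq-1", "How to reset a password (v2)"),
    Delta("delete", "faq-2"),
])
```

<p>Each applied delta is the point where an incremental computation would be triggered downstream.</p>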

<p>Depending on the application, the <code class="language-plaintext highlighter-rouge">CONTEXT</code> and <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> may be defined by different data streams:</p>

<ul>
  <li>For an interactive chatbot, <code class="language-plaintext highlighter-rouge">CONTEXT</code> would be the current user prompt, and <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> would be the conversation history of the session.</li>
  <li>For summarization and data analysis apps, <code class="language-plaintext highlighter-rouge">CONTEXT</code> would be the summarization request (with customization), and <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> would be the target docs/conversations/meeting notes/papers to summarize.</li>
  <li>For a personal assistant, <code class="language-plaintext highlighter-rouge">CONTEXT</code> would be the spec used to capture user profile (e.g., basic information, calendar, work arrangements) and requests to complete work items (e.g., code review, slide generation). <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> would be the memory (e.g., uploaded docs, code, meeting recordings, group chats), possibly summarized.</li>
  <li>For a coding assistant, <code class="language-plaintext highlighter-rouge">CONTEXT</code> would be the code-generation/code-review/code-refactoring tasks, and <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> would be the code file or project.</li>
  <li>For RAG-QA, <code class="language-plaintext highlighter-rouge">CONTEXT</code> would be the question from the user, and <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> would be the retrieved texts.</li>
</ul>

<h2 id="updates-to-data_source">Updates to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code></h2>
<p>Updates to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> should either be streamed to lower-level data storage first (e.g., knowledge base, vector DB, etc.) or pushed to LLM memory first (if users subscribe to the view).</p>

<h3 id="maintaining-incremental-states-on-vector-db">Maintaining incremental states on vector DB</h3>
<p>It seems most vector DBs do not natively support real-time view maintenance, and traditional streaming semantics are expensive to support in vector DBs—especially when graph-based indexing is used (i.e., update/delete/insert operations are expensive, and adding TTLs to each vector further increases maintenance overhead).</p>

<p>The most relevant work I’ve found discussing something similar to view maintenance in vector DBs is <a href="https://vldb.org/cidrdb/papers/2025/p23-lu.pdf">VectraFlow</a>. VectraFlow maintains views incrementally for semantic-aware filter and top-K operations. The focus of the work, I think, is on reducing the number of views to check and update as new events occur (through clustering, which may involve accuracy/performance trade-offs).</p>

<p>Each individual view is maintained as a plain list (without a graph-based index), which might result in longer search/update times if the view contains a large amount of data.
<img src="/images/blogs/vectraflow.png" alt="vectraflow" width="200" /></p>

<p>I think the authors might still be working on the full version of the work: it’d be interesting to see how this approach would affect result accuracy over time.</p>

<h3 id="maintaining-incremental-llm-memory-textual-memory">Maintaining incremental LLM memory (Textual Memory)</h3>

<p><strong>Using semantic operators:</strong> An example of imposing semantics on top of an LLM operator is <a href="https://www.arxiv.org/abs/2407.11418">LOTUS</a>. One way to reason about LLM operation is to convert <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> into structured or unstructured data streams and convert <code class="language-plaintext highlighter-rouge">CONTEXT</code> into an operator with semantics.</p>

<p>Using LOTUS as an example, a long-context QA task can be converted into the following query, which (1) retrieves the papers most relevant to my research area, (2) generates an insight for each paper, and (3) creates a digest summarizing the research insights.
<img src="/images/blogs/lotus1.png" alt="lotus1" width="800" /></p>

<p>The semantic operators proposed are mostly similar to relational operators. Therefore, the idea of incremental view maintenance should transfer straightforwardly to this framework.</p>

<p>Say I make the following modification to my <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> (a collection of papers):</p>
<ol>
  <li><ins>Adding a new paper</ins>: If the paper is close to the target topic in vector space (via <code class="language-plaintext highlighter-rouge">sem_index</code> and <code class="language-plaintext highlighter-rouge">sem_join</code>), then the insight generated (via <code class="language-plaintext highlighter-rouge">sem_map</code>) would be pushed to <code class="language-plaintext highlighter-rouge">sem_agg</code> for aggregation. The paper mentions one technique for <code class="language-plaintext highlighter-rouge">sem_agg</code> is incrementally folding new input, which naturally supports incremental computation. <em>However, not all aggregation operations support incremental computation naturally, e.g., global ranking.</em></li>
  <li><ins>Removing a paper</ins> (or its expiration via TTL): If the removed paper wasn’t selected in the digest, this has no effect. Otherwise, we must:
    <ul>
      <li>Maintain <code class="language-plaintext highlighter-rouge">sem_index</code> by removing the entry. This overlaps with <a href="#maintaining-incremental-states-on-vector-db">Maintaining incremental states on vector DB</a>.</li>
      <li>Address the challenge of removing its contribution from <code class="language-plaintext highlighter-rouge">sem_agg</code>, which is hard if the aggregation is not invertible (e.g., “Writing a digest summarizing research”). One optimization is to keep partial results in memory for possible reuse. For example, a tree-based aggregation over inputs 1 to 4 takes three steps: 1-2, 3-4, and then 1-2-3-4. Modifying 4 to 4’ requires recomputing only 3-4’ and 1-2-3-4’, while 1-2 is reused.</li>
    </ul>
  </li>
</ol>
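
<p>The partial-result reuse described for the removal/modification case can be sketched as follows. This is a toy under stated assumptions: <code class="language-plaintext highlighter-rouge">TreeAggregator</code> is a hypothetical class, the <code class="language-plaintext highlighter-rouge">combine</code> function stands in for an LLM aggregation call, and partial results are cached by the content of the leaves they cover. It is not how LOTUS implements <code class="language-plaintext highlighter-rouge">sem_agg</code>.</p>

```python
from typing import Callable, Dict, List, Tuple


class TreeAggregator:
    """Pairwise tree aggregation that caches every partial result."""

    def __init__(self, combine: Callable[[str, str], str]):
        self.combine = combine   # stand-in for one LLM aggregation call
        self.cache: Dict[Tuple[str, ...], str] = {}
        self.calls = 0           # number of combine() invocations so far

    def aggregate(self, leaves: List[str]) -> str:
        if len(leaves) == 1:
            return leaves[0]
        key = tuple(leaves)
        if key not in self.cache:
            mid = len(leaves) // 2
            left = self.aggregate(leaves[:mid])
            right = self.aggregate(leaves[mid:])
            self.calls += 1
            self.cache[key] = self.combine(left, right)
        return self.cache[key]


agg = TreeAggregator(lambda a, b: f"({a}+{b})")
first = agg.aggregate(["1", "2", "3", "4"])
calls_initial = agg.calls                      # 1-2, 3-4, 1-2-3-4
second = agg.aggregate(["1", "2", "3", "4'"])
calls_update = agg.calls - calls_initial       # only 3-4' and 1-2-3-4'
```

<p>In this toy, the initial aggregation costs three combine calls and the update costs two, because the cached 1-2 partial is reused. The trade-off is memory: every intermediate result has to be kept around in case it is needed again.</p>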

<p>The figure above shows semantic operators in LOTUS. Many of these are derived from relational operators, so modifications to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> should map to existing literature.</p>

<p>Some operations like <code class="language-plaintext highlighter-rouge">sem_filter</code> and <code class="language-plaintext highlighter-rouge">sem_map</code> are easier to support incrementally—requiring one LLM inference each. Others like <code class="language-plaintext highlighter-rouge">sem_join</code>, <code class="language-plaintext highlighter-rouge">sem_topk</code>, and <code class="language-plaintext highlighter-rouge">sem_agg</code> require maintaining historical state or performing multiple inference requests. Whether <code class="language-plaintext highlighter-rouge">sem_agg</code> supports deletions depends on the language expression (e.g., natural language predicate) used by the user.</p>

<p>Additionally, the cost of using semantic operators could be high. Operators use LLMs to evaluate boolean predicates on input pairs (e.g., <code class="language-plaintext highlighter-rouge">sem_join</code> takes $M \times N$ LLM calls, <code class="language-plaintext highlighter-rouge">sem_topk</code> takes $O(N \log N)$). This is cheap and SIMD-friendly in databases but expensive in LLMs. This opens optimization avenues:</p>
<ul>
  <li>Fine-tune small LLMs/adapters per operator</li>
  <li>Operator-specific quantization</li>
  <li>Leverage attention sparsity
    <ul>
      <li>For instance, a query like <code class="language-plaintext highlighter-rouge">papers_df.sem_topk("the {abstract} makes the most outrageous claim", K=10)</code> likely attends to names, numbers, or trigger phrases. Since each entry is reused frequently in <code class="language-plaintext highlighter-rouge">sem_topk</code>, exploiting sparsity can yield efficiency without harming accuracy.</li>
    </ul>
  </li>
</ul>
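
<p>To put the operator costs mentioned above in perspective, here is a back-of-the-envelope sketch of the naive LLM call counts. The function names are hypothetical, the input sizes are made up, and the counts ignore any optimizations a real system may apply; the point is only that maintaining a join incrementally pairs the new rows alone instead of redoing all $M \times N$ pairs.</p>

```python
import math


def sem_join_calls(m: int, n: int) -> int:
    """Nested-loop semantic join: one LLM predicate call per input pair."""
    return m * n


def sem_topk_calls(n: int) -> int:
    """Rough comparison count for a sort-based ranking, N * log2(N)."""
    return math.ceil(n * math.log2(n)) if n > 1 else 0


def incremental_join_calls(new_left: int, n: int) -> int:
    """Rows added to the left input only need pairing with the right input."""
    return new_left * n


full = sem_join_calls(1000, 200)          # recompute the whole join
delta = incremental_join_calls(1, 200)    # one new paper arrives
rank = sem_topk_calls(1024)               # pairwise comparisons for ranking
```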

<p><strong>Using UDF Operators</strong>: Semantic-aware LLM-based operators model <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> as bags of data tuples (like relational operators). But not all continuous LLM tasks can be expressed as semantic operators, especially when the operator runs on an unstructured data source (e.g., a long document) or the task requires complex reasoning steps that semantic operators cannot express. One common approach is to first have an LLM explicitly extract the structure and relationships of the data, and then perform inference on the extracted structured data once a request is received (e.g., <a href="https://arxiv.org/pdf/2404.16130">GraphRAG</a>, <a href="https://arxiv.org/pdf/2502.14802">HippoRAG</a>).</p>

<p><img src="/images/blogs/HippoRAG.png" alt="hipporag" width="800" /></p>

<p><a href="https://arxiv.org/pdf/2502.14802">HippoRAG</a> constructs a knowledge graph during an offline indexing phase to support reasoning in RAG QA. It claims to support incremental updates to the knowledge graph.</p>

<p>Compared to black-box UDF LLMs, these approaches explicitly convert <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> into structured graphs, simplifying the memory maintenance problem to a streaming graph update problem. Queries involve retrieving subgraphs and reasoning over them.</p>

<p>Thus, we can selectively update outputs when their input subgraphs change. Another insight from HippoRAG is that identifying key entities/relations is critical for accuracy.</p>

<p>HippoRAG makes reasoning explicit via a knowledge graph. However, reasoning is increasingly handled implicitly by LLMs, so the retrieval (or broader “information extraction”) is embedded in the inference process. This motivates studying how textual memory relates to parameterized memory.</p>

<h3 id="maintaining-incremental-llm-memory-parameterized-memory">Maintaining incremental LLM memory (Parameterized Memory)</h3>
<p>Assuming the LLM can retrieve/reason from both context and its trained memory, can it “update” results when <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> changes?</p>

<p><a href="https://arxiv.org/pdf/2011.01060v2">2WikiMultiHopQA</a> provides reasoning examples with annotated key entities and relations—almost like implicit graphs. The question: when the LLM reasons internally, can it incrementally maintain this reasoning if the dependency graph is implicit?</p>

<p><img src="/images/blogs/wiki2-anotated.png" alt="wiki2" width="800" /></p>

<p>In general, we want LLMs to:</p>
<ol>
  <li>Identify data source (parameterized memory, user-provided <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code>, or generated context).</li>
  <li>Link output tokens to data sources.</li>
  <li>Track dependency structures over time.</li>
  <li>Use dependency structures to selectively update outputs when <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> changes—via:
    <ul>
      <li>Add (extend dependencies, regenerate as needed)</li>
      <li>Remove (invalidate/remove related outputs)</li>
      <li>Modify (track and recompute dependencies)</li>
    </ul>
  </li>
</ol>
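
<p>Steps 2 to 4 above amount to keeping a reverse index from source items to the outputs derived from them. A minimal sketch, assuming the links between outputs and sources are already known (how an LLM would produce them is exactly the open question); <code class="language-plaintext highlighter-rouge">DependencyTracker</code> and the identifiers below are hypothetical:</p>

```python
from collections import defaultdict
from typing import Dict, Set


class DependencyTracker:
    """Maps each source item to the outputs derived from it (hypothetical)."""

    def __init__(self) -> None:
        self.readers: Dict[str, Set[str]] = defaultdict(set)

    def link(self, output_id: str, source_id: str) -> None:
        """Record that an output was generated using a source item."""
        self.readers[source_id].add(output_id)

    def stale_outputs(self, changed_source: str) -> Set[str]:
        """Outputs to invalidate or regenerate when a source changes."""
        return set(self.readers.get(changed_source, set()))


tracker = DependencyTracker()
tracker.link("summary:intro", "doc1")
tracker.link("summary:intro", "doc2")
tracker.link("summary:risks", "doc3")
stale = tracker.stale_outputs("doc2")       # modify or remove doc2
untouched = tracker.stale_outputs("doc9")   # unknown source, nothing to redo
```

<p>Adding a new source is the harder case: there is no existing link to follow, so some relevance check is still needed to decide which outputs it should extend.</p>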

<p>The goal is to keep the generated result up-to-date at low cost—especially when full recomputation is expensive or <code class="language-plaintext highlighter-rouge">CONTEXT</code> is frequently reused (e.g., summaries, notifications, function completion).</p>

<p>Three strategies to approach this:</p>
<ol>
  <li><strong>Append delta as conversation</strong> (expensive)</li>
  <li><strong>Replace keywords</strong> (cheapest)</li>
  <li><strong>Recompute sub-paragraphs</strong> (middle ground)</li>
</ol>

<p><ins>Append delta as part of conversation</ins>: For a model $LLM$, say we want to maintain a previously generated result $LLM(C)$ with initial context $C$. When the context is updated with $\Delta_{C}$, the most straightforward way to get an updated result is to call the model again with the new context and previously generated output as history:<br />
$LLM(C + LLM(C) + \Delta_{C})$</p>

<p>As updates stream in, this forms a chronological log of deltas, naturally supporting versioning and time-travel. Some considerations:</p>

<ul>
  <li><em>Scenarios where this approach applies</em>: This is generalizable to most use cases discussed in <a href="#llm-as-incremental-data-processing-operator">earlier sections</a>.</li>
  <li><em>Complexity of this approach</em>: This resembles multi-turn dialogue. While later updates may be short, accumulating full history increases context length, which impacts decoding time. Although KV sharing during prefill can reduce computation, its benefit diminishes as context grows.<br />
A potential optimization is <strong>edit history compaction</strong>—merge old edits into the original context and prune them from the conversation. To retain KV reuse after compaction, KV entries may need to be regenerated asynchronously (i.e., off the critical path).</li>
</ul>
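
<p>The append-and-compact idea can be sketched as a small log structure. This is only an illustration: <code class="language-plaintext highlighter-rouge">DeltaLog</code> and <code class="language-plaintext highlighter-rouge">max_deltas</code> are hypothetical, compaction here is plain concatenation rather than actually applying the edits, and the asynchronous KV regeneration is left out.</p>

```python
from typing import List


class DeltaLog:
    """Base context plus a chronological log of textual deltas."""

    def __init__(self, base_context: str, max_deltas: int = 3):
        self.base = base_context
        self.deltas: List[str] = []
        self.max_deltas = max_deltas

    def append(self, delta: str) -> None:
        self.deltas.append(delta)
        if len(self.deltas) > self.max_deltas:
            self.compact()

    def compact(self) -> None:
        """Merge all logged deltas into the base and clear the log.

        A real system would apply the edits to the context and refresh
        its KV cache off the critical path; here we just concatenate.
        """
        self.base = "\n".join([self.base, *self.deltas])
        self.deltas = []

    def prompt(self, previous_output: str) -> str:
        """Build the input for LLM(C + LLM(C) + delta_C)."""
        return "\n".join([self.base, previous_output, *self.deltas])


log = DeltaLog("C0", max_deltas=2)
log.append("d1")
log.append("d2")
before = log.prompt("out")
log.append("d3")   # exceeds max_deltas, triggers compaction
after = log.prompt("out")
```

<p>Note that compaction gives up the versioning and time-travel property for the compacted edits unless the old log is archived elsewhere.</p>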

<p><ins>Replace keywords if identifiable</ins>: <a href="https://arxiv.org/pdf/2502.12067">Recent research</a> shows that Chains-of-Thought can be compressed into key text tokens with little impact on accuracy. This suggests:</p>

<p>If we can identify the key information in context that is likely to change, and correlate it with the generated output, we can perform efficient updates.</p>

<p>The examples below illustrate cases where key information in the input directly influences output, either as a fact or a reasoning bridge (examples from <a href="https://arxiv.org/pdf/2011.01060v2">2WikiMultiHopQA</a> dataset):</p>

<ul>
  <li>
    <p>[<em>Question</em>]: What is the <mark>cause of death</mark> of the founder of Versus (Versace)?<br />
[<em>Source</em>]: “…Versus (Versace)… a gift by the founder Gianni Versace… Versace was <mark>shot</mark> and killed…”<br />
[<em>Assistant</em>]: “… <mark>shot</mark>…”</p>
  </li>
  <li>
    <p>[<em>Question</em>]: Who is Dambar Shah’s <mark>grandchild</mark>?<br />
[<em>Source</em>]: “Doc1:… Dambar Shah … He was the <mark>father of</mark> Krishna Shah. Doc2: Krishna Shah … He was the <mark>father of</mark> Rudra Shah.”<br />
[<em>Assistant</em>]: “… Rudra Shah …”</p>
  </li>
</ul>

<p>This correlation can help determine whether to skip recomputation or invalidate outputs.</p>
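
<p>A minimal sketch of the skip/patch/invalidate decision for the two examples above. It assumes the caller already knows that the output depends on the changed fact; <code class="language-plaintext highlighter-rouge">patch_output</code> is a hypothetical helper, and the bracketed replacement values are placeholders rather than real facts.</p>

```python
from typing import Optional


def patch_output(output: str, old_fact: str, new_fact: str) -> Optional[str]:
    """Return a patched output, or None if regeneration is needed.

    Patching is only safe when the old fact appears verbatim in the
    output, i.e. the output copies the fact instead of reasoning over it.
    """
    if old_fact in output:
        return output.replace(old_fact, new_fact)
    return None


# Direct fact: the keyword is copied into the answer, so it can be patched.
patched = patch_output("Gianni Versace was shot.", "shot", "[new cause]")

# Reasoning bridge: the changed entity never appears in the answer, so the
# answer has to be invalidated and regenerated.
bridge = patch_output("Rudra Shah", "Krishna Shah", "[new person]")
```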

<p><ins>Find sub-paragraphs to recompute</ins>: Building on earlier reasoning examples, attention scores might help uncover dependencies between context and output. Consider this example from the <a href="https://huggingface.co/datasets/glaiveai/RAG-v1">glaiveai/RAG-v1</a> dataset:</p>

<p>The attention heatmap below shows alignment between generated sentences (Y-axis) and context sentences (X-axis). Blue/yellow boxes indicate where output closely follows input content. Columns with low attention were removed (green highlight), and the request was re-run.</p>

<p><img src="/images/blogs/glaive-3-mid-preupdate.png" alt="legal" width="800" /></p>

<p>Below, the updated result includes previously seen content (blue/yellow) and also retrieves new context (red box). Overall output remains similar.</p>

<p><img src="/images/blogs/glaive-3-mid-updated.png" alt="legal" width="800" /></p>

<p>In a more complex scenario like multi-hop QA, we can still trace how output depends on intermediate reasoning. Here’s an example from <a href="https://huggingface.co/datasets/xanhho/2WikiMultihopQA">2WikiMultiHopQA</a> using <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B">deepseek-ai/DeepSeek-R1-Distill-Qwen-7B</a>:</p>

<p><img src="/images/blogs/film-original.png" alt="legal" width="1000" /></p>

<p>Blue/red boxes highlight direct facts. The yellow box shows the final reasoning step. The model first extracts two key facts before comparison—following the “bridge entity and comparison” pattern in <a href="https://aclanthology.org/2020.coling-main.580.pdf">2WikiMultiHopQA</a>. The blue and red boxes both show higher attention scores compared to other sentences in the prompt (except for the question).</p>

<p>We also know that sentences containing numbers (e.g., times) typically receive higher attention scores. In this case, the green box has a higher attention score than the blue boxes in the same row, even though the blue boxes are more relevant. This could hurt the accuracy of dependency detection if attention scores are used to track correlation chains.</p>

<p><img src="/images/blogs/2wiki-bridge.png" alt="2wiki" width="600" /></p>

<p>When I modified birth dates of the directors (irrelevant to the question), attention patterns remained unchanged—as expected.</p>

<p><img src="/images/blogs/film-date.png" alt="legal" width="1000" /></p>

<p>When I changed a relevant fact—like nationality—the model’s attention shifted only in the reasoning step (yellow/pink boxes), leaving prior attention stable. This suggests fine-grained recomputation is feasible:</p>

<p><img src="/images/blogs/film-nationality.png" alt="legal" width="1000" /></p>
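
<p>One way to turn these observations into a recomputation rule is to threshold a sentence-level attention matrix. The sketch below is a toy, not the analysis used for the heatmaps: the matrix values are made up, the threshold is arbitrary, and <code class="language-plaintext highlighter-rouge">dependent_outputs</code> is a hypothetical helper.</p>

```python
from typing import List, Set


def dependent_outputs(attn: List[List[float]], changed: Set[int],
                      threshold: float = 0.2) -> List[int]:
    """Output sentences whose attention to any changed context sentence
    reaches the threshold.

    attn[i][j] is the (already aggregated) attention mass that output
    sentence i places on context sentence j.
    """
    return [
        i for i, row in enumerate(attn)
        if any(row[j] >= threshold for j in changed)
    ]


# Toy 3x4 matrix: rows are generated sentences, columns are context sentences.
attn = [
    [0.70, 0.10, 0.10, 0.10],   # fact about director A
    [0.10, 0.05, 0.75, 0.10],   # fact about director B
    [0.40, 0.05, 0.45, 0.10],   # final comparison step
]
recompute = dependent_outputs(attn, changed={2})   # context sentence 2 edited
unaffected = dependent_outputs(attn, changed={1})  # irrelevant sentence edited
```

<p>As noted above, sentences with numbers can attract attention without being relevant, so a fixed threshold like this would likely produce false positives in practice.</p>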

<h2 id="updates-to-context">Updates to <code class="language-plaintext highlighter-rouge">CONTEXT</code></h2>

<p>Let’s return to our initial continuous view definition:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Continuous "query" defining the behavior of an LLM-based chatbot</span>
<span class="k">CREATE</span> <span class="n">CONTINUOUS</span> <span class="k">VIEW</span> <span class="n">response_stream</span> <span class="k">AS</span>
<span class="n">LLM_GENERATE</span><span class="p">(</span>
    <span class="n">PROMPT</span><span class="o">=</span><span class="nv">"Given conversation context and support docs, generate a helpful response."</span><span class="p">,</span>
    <span class="n">CONTEXT</span><span class="o">=</span><span class="n">STREAM</span><span class="p">(</span><span class="n">conversation_events</span><span class="p">),</span>        <span class="c1">-- user interactions, updates incrementally</span>
    <span class="n">DATA_SOURCE</span><span class="o">=</span><span class="n">STREAM</span><span class="p">(</span><span class="n">support_docs_or_events</span><span class="p">)</span>  <span class="c1">-- knowledge updates streamed incrementally</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Earlier, we discussed updates to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code>. What happens when <code class="language-plaintext highlighter-rouge">CONTEXT</code> (e.g., user prompt) updates?</p>

<p>The goal remains: <strong>reuse past computation</strong> whenever possible.</p>

<ol>
  <li><ins>Reuse intermediate results</ins>: Many prompts share overlapping sub-tasks.<br />
Example:
    <ul>
      <li>Prompt A: “Are directors of Film A and Film B from the same country?”</li>
      <li>Prompt B: “Who is older: the director of Film A or Film B?”<br />
Both require identifying the directors first.</li>
    </ul>
  </li>
  <li><ins>Compute incrementally over prior results</ins>:<br />
Example:
    <ul>
      <li>Prompt: “Plan my TODOs for project X today.”<br />
Might depend on:
        <ul>
          <li>A summary of milestones and execution plans (which further depends on summarization of the planning/design documents).</li>
          <li>Tasks assigned during meetings (which further depends on summarization of the conversation in the meeting recordings).</li>
        </ul>
      </li>
    </ul>
  </li>
</ol>
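
<p>The first case, sharing intermediate results across prompts, is essentially memoization over sub-tasks. A minimal sketch, assuming the prompts have already been decomposed into sub-tasks (which is the hard part); <code class="language-plaintext highlighter-rouge">SubtaskCache</code> is hypothetical and <code class="language-plaintext highlighter-rouge">solve</code> stands in for an LLM call:</p>

```python
from typing import Callable, Dict


class SubtaskCache:
    """Memoizes sub-task answers so that different prompts can share them."""

    def __init__(self, solve: Callable[[str], str]):
        self.solve = solve                  # stand-in for an LLM call
        self.results: Dict[str, str] = {}
        self.misses = 0                     # how many times solve() ran

    def get(self, subtask: str) -> str:
        if subtask not in self.results:
            self.misses += 1
            self.results[subtask] = self.solve(subtask)
        return self.results[subtask]


cache = SubtaskCache(lambda q: f"answer({q})")
shared = ["director of Film A", "director of Film B"]
prompt_a = [cache.get(s) for s in shared]   # same-country question
prompt_b = [cache.get(s) for s in shared]   # who-is-older question
```

<p>Here Prompt B triggers no new sub-task computation, since both director lookups were already cached by Prompt A.</p>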

<p>Coming back to the film director example, the first heatmap below shows the attention score distribution when comparing the nationalities of the two directors, as seen earlier. I then changed the question from comparing the directors’ nationalities to comparing their ages; the second heatmap shows the resulting attention scores.</p>

<p>In this simplistic example, the two attention maps have similar distributions up to the second-to-last row, which is the generated sentence that reaches the conclusion. Asking a different question does not seem to change which entry has the highest attention score in each row, and the first stages of the reasoning are similar. Assuming that we can identify reasoning steps shared across different questions, the trick is to predict the most commonly used partial results for a set of contexts $S$, and to find out how these partial results can be reused by comparing the runtime context against each $C \in S$. Possibly related: <a href="https://arxiv.org/pdf/2504.13171">Test-Time Compute</a>, <a href="https://arxiv.org/pdf/2501.15915">Parametric RAG</a>, <a href="https://arxiv.org/pdf/2412.15605v1">CAG</a>.</p>

<p><img src="/images/blogs/film-original.png" alt="legal" width="1000" />
<img src="/images/blogs/film-age.png" alt="legal" width="1000" /></p>

<p>I also tried changing the question to something unrelated to film directors; the attention map from the decoding stage onward is completely different from the two attention maps shown earlier.</p>

<p><img src="/images/blogs/film-different-context.png" alt="legal" width="1000" /></p>

<p>One way to put the analogy: updates to the <code class="language-plaintext highlighter-rouge">CONTEXT</code> stream are constantly changing tasks/queries over a fixed pool of source data, whereas updates to the <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> stream constantly change the pool of source data that a single task is based on.</p>

<p>My impression, based on the (probably few out of many) papers I have come across, is that updates to <code class="language-plaintext highlighter-rouge">CONTEXT</code> are better explored than updates to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code>. The obvious observation is that if we treat an LLM as a data processing operator (which it is), this problem clearly overlaps with some traditional problems in OLAP systems, because new tasks arriving on the <code class="language-plaintext highlighter-rouge">CONTEXT</code> stream can be reasoned about semantically. To make the full analogy, these problems include the <a href="https://arxiv.org/pdf/2412.11828v1">view selection problem</a>, the <a href="https://arxiv.org/pdf/2203.16684">view maintenance problem</a>, and the <a href="https://dl.acm.org/doi/pdf/10.1145/376284.375706">query re-writing problem</a>. Specifically, using QA as an example, for every question received on the <code class="language-plaintext highlighter-rouge">CONTEXT</code> stream:</p>

<p><ins>Query Re-writing</ins> requires the question to be rewritten to better match views generated in the past.</p>

<p><ins>View Selection</ins> needs a search mechanism to identify past questions that are semantically similar to the target question (e.g., <a href="https://www.arxiv.org/pdf/2502.03771">VectorQ</a>). One thing I keep wondering is whether there is a good way to find a set of “base questions” that we predict will be useful for the questions we receive online. For example, questions like “Which director is older?” or “Are these directors of the same nationality?” can both be answered by combining the answers to “Tell me about the director of film A” and “Tell me about the director of film B”. How to decompose a question into a set of base views that we maintain, and how to reason about the performance/cost of using these views, is something I’d love to see.</p>
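
<p>As a rough illustration of view selection over base questions, the sketch below matches an incoming question against maintained views and returns the closest one above a threshold. Word-overlap (Jaccard) similarity is used purely as a self-contained stand-in for embedding similarity; <code class="language-plaintext highlighter-rouge">select_view</code>, the threshold, and the view names are all hypothetical.</p>

```python
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased bag of words, with question marks stripped."""
    return set(text.lower().replace("?", "").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity; a stand-in for embedding similarity."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def select_view(question: str, views: Dict[str, str],
                min_score: float = 0.5) -> Optional[str]:
    """Return the cached answer of the most similar base question, if any."""
    best_key, best_score = None, 0.0
    for past_question in views:
        score = jaccard(tokens(question), tokens(past_question))
        if score > best_score:
            best_key, best_score = past_question, score
    if best_key is not None and best_score >= min_score:
        return views[best_key]
    return None


views = {
    "Tell me about the director of film A": "view-A",
    "Tell me about the director of film B": "view-B",
}
hit = select_view("Who is the director of film A?", views)
miss = select_view("What is the weather today?", views)
```

<p>A miss means the question falls back to a full inference, and is also a signal that the set of maintained base views may need to grow.</p>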

<p><ins>View Maintenance</ins> requires the partial results maintained for a set of questions to be updated continuously as new updates arrive on the <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> stream, which <a href="#updates-to-data_source">has been discussed earlier</a>.</p>

<h2 id="thoughts-and-questions">Thoughts and Questions</h2>

<p><strong>Tracking intra-context dependencies?</strong></p>

<p>If we can accurately identify dependency structures, how much can we save?</p>

<p>By default, even small changes to <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> trigger a full prefill (possibly reusable) and a full decode. Ideally, incremental processing avoids recomputation by:</p>

<ol>
  <li><strong>Saving prefill compute</strong>: Predict which KV cache entries are still valid. Many works already exploit sparsity or modular KV reuse.</li>
  <li><strong>Saving decode compute</strong>: If outputs remain similar after context changes, we can predict when to reuse vs. regenerate tokens.</li>
</ol>

<p>This opens doors to innovations like:</p>
<ul>
  <li>Reusing tokens and KVs</li>
  <li>Exploring <em>KV editing</em> and <em>non-consecutive KV reuse</em> techniques</li>
</ul>
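
<p>For the prefill side, the simplest validity rule is prefix matching: with causal attention, the KV entry at a position depends only on the tokens at or before it, so everything up to the first changed token can be reused as-is. A minimal sketch with whitespace-split words standing in for real tokens (the helper names are hypothetical):</p>

```python
from typing import List, Tuple


def reusable_prefix(old_tokens: List[str], new_tokens: List[str]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for old, new in zip(old_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n


def prefill_plan(old_tokens: List[str], new_tokens: List[str]) -> Tuple[int, int]:
    """(tokens whose cached KV can be reused, tokens that need prefill)."""
    keep = reusable_prefix(old_tokens, new_tokens)
    return keep, len(new_tokens) - keep


old = "the director of film A was born in France".split()
late_edit = prefill_plan(old, "the director of film A was born in Italy".split())
early_edit = prefill_plan(old, "a director of film A was born in France".split())
```

<p>An edit near the end keeps almost the whole cache, while an edit to the first token invalidates all of it under this rule. That gap is what KV editing and non-consecutive KV reuse would need to close.</p>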

<p><strong>Semantic-aware KV cache?</strong></p>

<p>Today, KV cache is typically used as a read-append data store and is treated as semantic-agnostic. However, with reasoning models and long-context generation, recent work has been moving toward offloading IO-heavy operations out of HBM—see:</p>

<ul>
  <li><a href="https://github.com/kvcache-ai/ktransformers">KTransformers</a></li>
  <li><a href="https://arxiv.org/pdf/2409.10516">RetrievalAttention</a></li>
  <li><a href="https://arxiv.org/pdf/2504.10326">AlayaDB</a></li>
</ul>

<p>As reasoning becomes more common, we can assume that key information used during decoding will increasingly reside in the KV cache. Exploiting <em>semantic-aware KV cache</em> might enable:</p>

<ul>
  <li>Token-level KV reuse</li>
  <li>KV editing</li>
  <li>Indexing/versioning in KV cache design</li>
  <li>Layout optimizations for efficient memory transfer</li>
</ul>

<p><strong>View Maintenance over the Entire Workflow?</strong></p>

<p>So far, the discussion of maintaining incremental memory has assumed that the LLM is a single data processing operator. In practice, the operator could easily be a single step in a complex multi-step job, such as a workflow or multi-agent scenario. Multi-step view management could open up a problem space for rethinking programming frameworks along with resource-aware optimizations. I haven’t had the time to think more deeply about this yet, so I’m leaving it here as a placeholder for now.</p>

<p><strong>Structured memory vs. implicit reasoning?</strong></p>

<p>Most reasoning models first summarize <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code>, then extract reasoning steps. If we can differentiate these steps (i.e., the model’s internal graph), we may better track dependencies between outputs and both source data and intermediate reasoning.</p>

<p>This resembles <a href="#maintaining-incremental-llm-memory-textual-memory">HippoRAG</a>, where the knowledge graph is explicit. In LLMs, the structure is often implicit, but perhaps it can be inferred.</p>

<p>One might ask: why not explicitly convert <code class="language-plaintext highlighter-rouge">DATA_SOURCE</code> into a structured form like a knowledge graph?</p>

<p>Pros:</p>
<ul>
  <li>Incremental maintenance via graph updates</li>
  <li>Explicit dependencies simplify recomputation</li>
</ul>

<p>Cons:</p>
<ul>
  <li>Each update may trigger an LLM call</li>
  <li>Updates may invalidate large portions of downstream output</li>
</ul>

<p>Trade-offs will vary depending on:</p>
<ul>
  <li>Length of context</li>
  <li>Number of past results being maintained</li>
  <li>Amount of computation that can be skipped</li>
</ul>

<p>This is similar to the broader debate between long-context LLMs and RAG—more on that in a future discussion.</p>

<p><strong>Emergence of Slow Compute?</strong></p>

<p>The general rationale behind managing semantic-aware memory systems for generative AI is that we should attempt to trade expensive computation for cheaper storage by preserving and reusing processed data. Ideally, LLMs should not need to “re-learn” information they have already learned, and should spend only marginal effort to incorporate new knowledge. This allows models to become more capable over time as they:</p>
<ol>
  <li>Memorize more information,</li>
  <li>Learn the latest updates and store them, and</li>
  <li>Perform inference much faster and more resource-efficiently.</li>
</ol>

<p>Building incremental memory systems should be part of enabling continual learning for generative AI.</p>

<p><img src="/images/blogs/layers.png" alt="legal" width="400" /></p>

<p>Most of the problems discussed in this post focus on dynamic parameterized memory (e.g., KV cache) and textual memory, shown in the green box above, which should be shared as a service across models and applications. But what does this imply from a system-building perspective?</p>

<p>Incremental processing (if done right) should reduce computation significantly, as <a href="#maintaining-incremental-llm-memory-parameterized-memory">discussed previously</a>:</p>

<p>If we choose <strong>Append delta as conversation</strong> (<ins>best generalizability, least cost-effective</ins>), the updates are maintained as logs in the textual memory. The challenge then becomes <em>where</em> to maintain update history, as well as <em>how</em> and <em>when</em> to compact the update history to fit within the model’s context window.</p>

<p>If we choose to <strong>Recompute sub-paragraphs</strong> (<ins>medium generalizability, medium cost-effectiveness</ins>), we should identify reusable tokens from past results. The generation process would mix decoding and prefilling requests, similar to constrained decoding. This could fundamentally change engine design, which traditionally assumes LLM inference is a single round of prefill + decode.<br />
Meanwhile, depending on KV cache reusability, this approach could result in frequent, token-level updates to stored KV on every update request, making the KV cache a more compute-intensive component.</p>

<p>In the scenario where <strong>Replace keywords</strong> (<ins>least generalizability, most cost-effective</ins>) is possible, the update process could happen directly where results are stored. This could remove the need for accelerators for such updates, enabling “near-storage inference.”<br />
The challenge here is that mappings between keywords in source data and inference results must be pre-established offline.</p>

<p>Pre-establishing such mappings, along with other potential directions discussed earlier (e.g., <a href="#updates-to-context">tracking dependency structures, query rewriting, view maintenance and selection</a>), can all be seen as attempts to <strong>externalize the LLM’s thought process from fast to slow (but larger) storage</strong>. Many of these operations are not triggered by immediate requests and must happen off the inference critical path as part of background maintenance.</p>

<p>It is possible, I think, that the focus of building LLM serving stacks will start shifting—from purely optimizing inference tasks on <strong>fast compute devices</strong> (e.g., accelerators)—to <strong>more collaborative, full-stack solutions that leverage slow compute</strong> (i.e., near-storage) to enhance real-time inference, both in terms of performance and efficiency. This shift is likely to emphasize areas such as memory optimization techniques, efficient dependency tracking mechanisms, hybrid compute architectures that balance fast and slow compute, and innovative caching strategies to maximize reuse of intermediate results.</p>

<!-- Future directions: indexing context, caching for reuse, long vs. short generation -->]]></content><author><name>Le Xu</name><email>le.xu@ed.ac.uk</email><uri>https://lexu.space</uri></author><category term="LLM" /><category term="data streaming" /><summary type="html"><![CDATA[Disclaimer: All opinions my own (not related to the company/team I work for). I know a tiny bit about data streaming systems and I only pretend to know LLMs.]]></summary></entry></feed>