Context-Parametric Inversion

Does domain-specific fine-tuning make LLMs ignore retrieved context? A RAG faithfulness study.
May 30th, 2026

ai, machine learning, NLP, RAG, LLM, fine-tuning, LoRA

An NLP Research Paper at UNSW

As part of my semester abroad at the University of New South Wales, I took a course on Natural Language Processing that turned into one of the most genuinely research-driven projects I've worked on during my studies. Together with a group of four other students, I led the investigation into a subtle but practically important failure mode of instruction fine-tuning — all within the context of building a real, functioning RAG pipeline for querying NLP research papers from the ACL Anthology.

What started as a pipeline engineering project ended up raising a concrete empirical question about model behaviour that we actually went and tested. The result is something I'm pretty happy with: a controlled study with clear findings that are directly relevant to anyone building RAG systems on top of fine-tuned LLMs.

The RAG Pipeline

The core idea behind retrieval-augmented generation (RAG) is to avoid asking a language model to answer purely from memorized knowledge. Instead, you retrieve relevant documents at query time and expect the model to ground its answer in them. For a domain like NLP research — where a model's parametric knowledge is patchy, outdated, and sometimes confidently wrong — this is a natural fit.

Our pipeline takes a raw user question and runs it through several stages. A LangChain node first rewrites the question into a keyword-dense retrieval query, which is then embedded using SPECTER2, a scientific document embedding model trained on citation graphs. That embedding is matched against a Qdrant vector database of ACL Anthology abstracts. The retrieved chunks are filtered by an LLM relevance grader, checked pairwise for contradictions using a DeBERTa-based NLI model, and passed to a fine-tuned LLaMA 3.1 8B model, which generates the final answer.
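The order of these stages can be sketched in plain Python. This is only an illustration of the control flow, not our actual LangChain graph: every stage is passed in as a callable, and the names (`rewrite`, `embed`, `search`, `is_relevant`, `contradicts`, `generate`, `Chunk`) are hypothetical. How contradicting chunks are handled is also an assumption here; the sketch simply drops the later chunk of a conflicting pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    """A retrieved abstract chunk (hypothetical structure)."""
    paper_id: str
    text: str


def answer_question(
    question: str,
    rewrite: Callable[[str], str],                            # query rewriting node
    embed: Callable[[str], Sequence[float]],                  # e.g. a SPECTER2 encoder
    search: Callable[[Sequence[float], int], List[Chunk]],    # e.g. a vector DB lookup
    is_relevant: Callable[[str, Chunk], bool],                # LLM relevance grader
    contradicts: Callable[[Chunk, Chunk], bool],              # NLI contradiction check
    generate: Callable[[str, List[Chunk]], str],              # fine-tuned generator
    top_k: int = 5,
) -> str:
    query = rewrite(question)
    vector = embed(query)
    candidates = search(vector, top_k)

    # Keep only chunks the grader judges relevant to the original question.
    relevant = [c for c in candidates if is_relevant(question, c)]

    # Pairwise contradiction check. ASSUMPTION: a chunk that contradicts an
    # already accepted chunk is dropped; a real system might flag it instead.
    kept: List[Chunk] = []
    for chunk in relevant:
        if not any(contradicts(prev, chunk) for prev in kept):
            kept.append(chunk)

    return generate(question, kept)
```

Passing the stages in as callables keeps the sketch independent of any particular library and makes each stage easy to stub out when testing.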

The language model at the end of this pipeline was fine-tuned in two stages: first on TULU 3 for general instruction-following, then on QASPER — a dataset of information-seeking questions anchored in NLP research papers — using LoRA on top of the frozen base weights.
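For readers unfamiliar with LoRA, its standard formulation is sketched below. This is the general definition rather than anything specific to this project, and the rank \(r\) and scaling factor \(\alpha\) used in the fine-tuning runs are not listed in this post.

\[ W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) \]

Here \(W_0\) is the frozen pretrained weight matrix, and only the low-rank factors \(A\) and \(B\) receive gradient updates.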

The Problem: Context-Parametric Inversion

Fine-tuning the model on QASPER immediately raised an uncomfortable question. Recent work by Goyal et al. showed that instruction fine-tuning can cause a model's faithfulness to retrieved context to first increase and then gradually decline — a phenomenon they call context-parametric inversion. The mechanism they propose is intuitive: early in training, examples where the model genuinely needs the context to answer dominate the gradient signal, pushing the model toward context reliance. Later, as the loss on those examples falls, examples where the model's parametric memory is sufficient take over and gradually shift it back toward recalling from weights rather than attending to retrieved passages.

For a RAG system, this is a real failure mode. A model that ignores retrieved context in favour of memorized knowledge makes the entire retrieval step unreliable. And in our case the concern was especially pointed: we were fine-tuning on NLP text and then retrieving from NLP papers at inference time. If inversion occurred precisely on NLP domain questions, the pipeline would be working against itself.

What made our situation distinct from Goyal et al.'s original work is that they only studied general-purpose instruction tuning datasets like TULU and Alpaca, where the fine-tuning domain and the evaluation domain are largely separate. The domain-specific case — fine-tuning on NLP data and then evaluating faithfulness on NLP questions — had not been explicitly tested. That's the gap we set out to fill.

Measuring Context Faithfulness

We tracked context sensitivity across 14 checkpoints along the QASPER fine-tuning trajectory, starting from the TULU-only baseline at step 0 through to step 130. The central metric is defined as:

\[\text{CS} = \frac{n_{\text{faithful}}}{n_{\text{faithful}} + n_{\text{parametric}}} \tag{1}\]

where \(n_{\text{faithful}}\) is the number of examples where the model followed the retrieved context, and \(n_{\text{parametric}}\) is the number where it fell back on memorized knowledge instead. A score of CS = 1.0 means fully context-faithful; CS = 0.0 means fully parametric.
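As a minimal sketch, Eq. (1) translates directly into a few lines of Python. The function name and the error handling for an empty denominator are illustrative choices, not part of the study's code.

```python
def context_sensitivity(n_faithful: int, n_parametric: int) -> float:
    """Context sensitivity as in Eq. (1): the share of conflict examples
    where the model followed the context rather than its memory."""
    if n_faithful < 0 or n_parametric < 0:
        raise ValueError("counts must be non-negative")
    total = n_faithful + n_parametric
    if total == 0:
        # Eq. (1) is undefined when neither outcome was observed.
        raise ValueError("need at least one faithful or parametric example")
    return n_faithful / total


# Illustrative counts only, not results from the study.
print(context_sensitivity(3, 1))  # 0.75
```

Note that only responses counted as faithful or parametric enter the denominator, exactly as the formula is written.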

Each evaluation example presented the model with a perturbed context chunk — a passage in which one fact had been replaced with a plausible but wrong alternative — alongside a question the model was already known to answer correctly from memory. This creates a direct conflict between what the context says and what the model knows, letting us observe which one wins.
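To make the setup concrete, here is a small sketch of how such a conflict example could be built and how a response could be labelled. Everything in it is an assumption for illustration: the `ConflictExample` structure, the `perturb` and `classify` helpers, the substring-matching rule, and the wrong alternative ("speech recognition") are hypothetical and not necessarily how our evaluation scripts work.

```python
from dataclasses import dataclass


@dataclass
class ConflictExample:
    question: str
    context: str            # perturbed passage shown to the model
    context_answer: str     # the wrong fact asserted by the context
    parametric_answer: str  # the true fact the model knows from memory


def perturb(passage: str, true_fact: str, wrong_fact: str) -> str:
    """Replace one fact in the passage with a plausible but wrong alternative."""
    if true_fact not in passage:
        raise ValueError("the fact to replace does not occur in the passage")
    return passage.replace(true_fact, wrong_fact)


def classify(response: str, example: ConflictExample) -> str:
    """Label a response by simple substring matching (an assumed heuristic)."""
    text = response.lower()
    has_context = example.context_answer.lower() in text
    has_parametric = example.parametric_answer.lower() in text
    if has_context and not has_parametric:
        return "faithful"
    if has_parametric and not has_context:
        return "parametric"
    return "other"  # mentions both answers or neither


passage = "BLEU is a metric for machine translation evaluation."
example = ConflictExample(
    question="What task is BLEU used to evaluate?",
    context=perturb(passage, "machine translation", "speech recognition"),
    context_answer="speech recognition",
    parametric_answer="machine translation",
)
```

A response that repeats the context's claim would be labelled "faithful", while one that restates the memorized fact would be labelled "parametric".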

We ran two evaluation conditions:

SC1 — 180 custom NLP-domain examples, constructed specifically for this study. Each example targets a fact the model knows well (e.g. that BLEU is used for machine translation evaluation), with a context asserting a wrong alternative.
SC2 — 1,064 general knowledge examples drawn directly from Goyal et al.'s counterfactual datasets, covering biographical facts, counterfactual world events, and country capital assertions — none of which overlap with the NLP domain we're fine-tuning on.

SC2 acts as a control: if behaviour changes on SC2 across the fine-tuning trajectory, that change can't be attributed to NLP-domain adaptation specifically.

What We Found

The SC1 results were unambiguous: QASPER fine-tuning does not cause context-parametric inversion on NLP domain questions. Context sensitivity rises monotonically from CS = 0.574 at the TULU baseline all the way to CS = 0.947 by step 100, where it plateaus for the remainder of training. Every major jump is confirmed as statistically real by non-overlapping 95% confidence intervals.
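For readers who want to reproduce this kind of check, the sketch below computes a 95% interval for a proportion and tests two intervals for overlap. It uses the Wilson score interval, which is an assumption on my part for illustration: this post does not state which interval method the study used. The counts in the example are made up and are not the study's data.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts out of 180 examples at two checkpoints.
early = wilson_interval(100, 180)
late = wilson_interval(170, 180)
print(intervals_overlap(early, late))
```

Non-overlapping intervals are a conservative criterion: two proportions can differ significantly even when their intervals overlap slightly, so the absence of overlap is strong evidence of a real change.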

The structural reason makes sense in hindsight. QASPER is a reading comprehension dataset where the answer is always in the provided passage, so the model cannot reduce its training loss without attending to the context. The non-context-critical examples that are supposed to drive the inversion mechanism (where the model can answer from memory and gradually stops attending to the passage) are largely absent from QASPER. The gradient signal throughout training therefore consistently reinforces context-following, which would explain why no inversion appears on SC1.

The SC2 results tell a more nuanced story. Looking at the three subtypes individually:

CF_Biographies — no inversion whatsoever. The model rises steadily and plateaus near CS = 0.91, closely mirroring SC1, despite biographical facts being completely outside the NLP fine-tuning domain. This suggests QASPER instills a broad "follow the passage" tendency that generalizes well.
CF_WorldFacts — mild but genuine inversion. The score peaks early and then partially reverts, stabilizing roughly 18 percentage points below its peak for the remainder of training.
CF_Country_Capitals — strong and sustained inversion. CS drops below 0.5 by step 100, meaning the model is by that point more likely to recall the memorized capital city than to follow a context asserting otherwise.

The progression across the three subtypes is striking. Country capitals are encoded with extremely high parametric confidence, and the evaluation contexts asserting wrong capitals are short and templated, only about 34 words long. We tentatively attribute the inversion to two compounding factors: deeply entrenched parametric memory, and a context format that is structurally far shorter than the long, information-dense QASPER passages the model trained on. A short templated sentence provides a weaker inference-time signal, giving the parametric prior more room to push through. The two factors point in the same direction and are difficult to disentangle with the current data, which is something we flag as a direction for future work.
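The three patterns described above (no inversion, mild inversion, strong inversion) can be told apart from a CS trajectory by comparing its peak with its final value. The helper below is a rough sketch of that idea: the function names, the `min_drop` threshold of 0.05, and the trajectories in the example are all hypothetical and are not the values measured in the study.

```python
from typing import Mapping, NamedTuple


class InversionSummary(NamedTuple):
    peak_step: int
    peak_cs: float
    final_cs: float
    drop: float  # peak_cs minus final_cs; 0 means no reversion


def summarize(cs_by_step: Mapping[int, float]) -> InversionSummary:
    """Summarize a CS trajectory given as {training step: CS score}."""
    if not cs_by_step:
        raise ValueError("trajectory is empty")
    steps = sorted(cs_by_step)
    # On ties, max() returns the earliest step that reaches the peak value.
    peak_step = max(steps, key=lambda s: cs_by_step[s])
    peak_cs = cs_by_step[peak_step]
    final_cs = cs_by_step[steps[-1]]
    return InversionSummary(peak_step, peak_cs, final_cs, peak_cs - final_cs)


def shows_inversion(summary: InversionSummary, min_drop: float = 0.05) -> bool:
    """ASSUMED rule: call it inversion if CS ends at least min_drop below its peak."""
    return summary.drop >= min_drop


# Made-up trajectories for illustration only.
rising = {0: 0.5, 50: 0.8, 100: 0.9, 130: 0.9}
inverted = {0: 0.5, 30: 0.8, 100: 0.6, 130: 0.6}
print(shows_inversion(summarize(rising)), shows_inversion(summarize(inverted)))
```

A fixed threshold like this ignores sampling noise, so in practice it would make sense to pair it with confidence intervals on the peak and final scores.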

Takeaway

For the RAG pipeline, the finding is straightforwardly positive: domain-specific fine-tuning on QASPER makes the deployed model substantially more faithful to retrieved NLP chunks, not less. The concern that motivated the study turned out to be unfounded in practice.

More broadly, the study suggests that context-parametric inversion is not a monolithic effect. Whether and how severely it appears depends on both the parametric strength of the knowledge being tested and the format match between fine-tuning contexts and evaluation contexts. The account by Goyal et al. — focused on the ratio of context-critical to non-context-critical training examples — may not be the full picture. What precisely drives inversion in some settings but not others remains an open question, though data curation and retrieval-side interventions both seem like worthwhile starting points for future investigation.