
The attribution problem: tracing an output back to your data

Everyone wants the same thing from AI data ethics: pay people when their data is used. The quiet problem underneath is that "used" is doing enormous work in that sentence. Here is an honest look at why attribution is hard during pre-training, slightly less hard after it, and what a consent protocol should do about the parts that stay hard.

Published December 28, 2025 · By Subhadip Mitra · Tags: attribution, research

If you want to compensate people when their data is used to build a model, you eventually run into a question that sounds simple and is not: which data was used, and how much did it matter? Call it the attribution problem. Most public debate skips past it, because it is easier to argue about whether scraping is fair than to admit that even if everyone agreed it should be paid for, we would not currently know how to divide the money.

This post is an attempt to lay the problem out plainly, separate the parts that are genuinely hard from the parts that just sound hard, and explain the stance LLMConsent takes as a result. It gets technical. That is on purpose. A consent protocol that hand-waves over attribution is selling something.

Two questions that get confused

The first thing to do is split “was my data used” into two very different questions, because they have very different answers.

Provenance, or membership, asks: was this specific data point part of the training set? This is a question about a fact in the past. Did the file enter the pipeline or not?

Influence, or attribution proper, asks: how much did this data point shape the model’s parameters, and through them, a particular output? This is a question about cause and effect inside a system with hundreds of billions of moving parts.

These get blurred together constantly, and it matters, because provenance is mostly tractable and influence mostly is not. If you can keep them separate, a lot of the confusion clears up.

Provenance is the easy half

If you control the training pipeline, you know what went in. You can hash every document at ingestion, commit those hashes to a log, and later prove that a given file was or was not part of a given training run. Cryptographically this is ordinary. Merkle trees and signed manifests have done this for decades in other contexts. There is no deep research problem in proving “this document was in corpus version 7.”
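To make "hash every document and commit the hashes" concrete, here is a minimal sketch of a Merkle commitment over a corpus, with an inclusion proof for one document. The function names, the use of SHA-256, and the choice to duplicate the last node on odd levels are all assumptions made for illustration. A production log would use a vetted construction rather than this one.

```python
import hashlib


def leaf_hash(document: bytes) -> bytes:
    # The 0x00 prefix keeps leaf hashes distinct from internal node hashes.
    return hashlib.sha256(b"\x00" + document).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    return [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("cannot commit to an empty corpus")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: pad odd levels by duplication
        level = _next_level(level)
    return level[0]


def inclusion_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Return (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(document: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    current = leaf_hash(document)
    for sibling, sibling_is_left in proof:
        current = node_hash(sibling, current) if sibling_is_left else node_hash(current, sibling)
    return current == root


corpus = [b"post about rivers", b"post about rain", b"post about eddies"]
leaves = [leaf_hash(doc) for doc in corpus]
root = merkle_root(leaves)
proof = inclusion_proof(leaves, 2)
print(verify(b"post about eddies", proof, root))
```

Only the root needs to be published. Anyone holding a document and its proof can check membership without seeing the rest of the corpus.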

The harder version of provenance is when you do not control the pipeline and want to find out after the fact whether your data was in someone’s model. That is membership inference, and it is a statistical attack rather than a clean proof. You probe the model, look at how confidently it predicts your text, and infer membership from the gap between how it treats data it has seen and data it has not. It works better on outliers and memorized content, worse on ordinary text that looks like everything else. It gives you a probability, not a receipt.
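The simplest form of this attack compares the model's loss on your text with its loss on texts you are confident it never saw. The sketch below assumes you have already obtained per-token log-probabilities from the model somehow. That access, the function names, and the scoring rule are all illustrative assumptions, and the score is a rank among reference texts, not a calibrated probability.

```python
def mean_nll(token_logprobs: list[float]) -> float:
    """Average negative log-likelihood per token; lower means more confident."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def membership_score(
    candidate_logprobs: list[float],
    reference_logprobs: list[list[float]],
) -> float:
    """Fraction of known non-member texts the model finds harder than the candidate.

    A score near 1.0 means the model is unusually confident on the candidate,
    which is weak evidence of membership. It is never proof.
    """
    candidate = mean_nll(candidate_logprobs)
    references = [mean_nll(r) for r in reference_logprobs]
    harder = sum(1 for r in references if r > candidate)
    return harder / len(references)


# Hypothetical numbers: the candidate text is predicted very confidently.
score = membership_score(
    candidate_logprobs=[-0.1, -0.2],
    reference_logprobs=[[-2.0, -3.0], [-1.0, -1.0], [-0.05, -0.05]],
)
print(score)
```

The weakness described above shows up directly: ordinary text that resembles the references lands in the middle of the ranking whether or not it was trained on.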

For a consent protocol the lesson is to lean on the clean version. If consent is checked and recorded at the moment data enters a pipeline, provenance stops being a forensic question and becomes a logged fact. You do not have to reverse engineer membership from the outside if permission was verified on the way in.
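One way to picture "a logged fact" is an append-only log where each entry commits to the one before it, so a consent decision recorded at ingestion cannot be quietly edited later. This is a sketch under stated assumptions: the field names (`doc_sha256`, `consent_id`, `allowed`) are hypothetical and are not taken from LCS-001.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], doc_sha256: str, consent_id: str, allowed: bool) -> dict:
    """Record the outcome of a consent check at the moment of ingestion."""
    body = {
        "doc_sha256": doc_sha256,
        "consent_id": consent_id,  # hypothetical identifier for the grant
        "allowed": allowed,
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Check that no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


log: list[dict] = []
doc = hashlib.sha256(b"my blog post").hexdigest()
append_entry(log, doc, consent_id="grant-001", allowed=True)
print(verify_chain(log))
```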

Influence is the hard half

Now the difficult question. A model produced a sentence. Your blog post was somewhere in the trillions of tokens it trained on. How much credit does your post deserve for that sentence?

The honest baseline answer is that there is a precise definition of influence and it is almost never computable. The definition is counterfactual: the influence of your data point is the difference between the model you got and the model you would have gotten if that point had been left out. Train with it, train without it, compare. This is the leave-one-out ideal, and it is the thing every practical method is trying to approximate, because actually doing it means retraining a frontier model once per data point. Nobody is retraining a model a trillion times to settle a royalty.
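On a model small enough to refit instantly, the leave-one-out definition is easy to run directly. The toy below fits a one-parameter line through the origin by least squares and measures how much each training point moves one prediction. The model, the data, and the function names are illustrative assumptions, and nothing about the cost carries over to a frontier model.

```python
Point = tuple[float, float]


def fit_slope(points: list[Point]) -> float:
    """Closed-form least-squares slope for y = w * x."""
    sxx = sum(x * x for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least one point with x != 0")
    return sum(x * y for x, y in points) / sxx


def leave_one_out_influence(points: list[Point], x_query: float) -> list[float]:
    """Influence of each point on the prediction at x_query.

    influence[i] = prediction(trained on all) - prediction(trained without i)
    """
    full_prediction = fit_slope(points) * x_query
    influences = []
    for i in range(len(points)):
        without_i = points[:i] + points[i + 1:]  # one retraining run per point
        influences.append(full_prediction - fit_slope(without_i) * x_query)
    return influences


data = [(1.0, 1.0), (2.0, 2.0), (3.0, 9.0)]
for point, influence in zip(data, leave_one_out_influence(data, x_query=1.0)):
    print(point, round(influence, 4))
```

The loop is the whole problem in miniature: one full retraining run per data point. Here a run is two sums. At scale, it is the entire training budget.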

So the field builds estimators. A few worth knowing by name:

  • Influence functions, introduced to modern machine learning by Koh and Liang in 2017, estimate the leave-one-out effect without retraining, using the model’s gradients and a term involving the inverse Hessian of the loss. The math is clean. The cost is brutal, because the Hessian has one entry for every pair of parameters, so its size grows with the square of the parameter count, and it has to be approximated aggressively before the method runs on anything large. Anthropic’s 2023 work scaled influence functions to large language models using an approximation called EK-FAC, and one of their findings is instructive: the sequences that influence a given output are often related by theme and reasoning pattern rather than by surface wording. Influence is real, but it is diffuse and a little alien. It does not point at one source.

  • TracIn, from Pruthi and collaborators in 2020, takes a different route: at each saved training checkpoint it takes the dot product between the loss gradient of the training example and the loss gradient of the output, then sums those products. Intuitively, a training example influenced an output if a gradient step on it would have reduced the loss on that output. Cheaper in some setups, still approximate.

  • Datamodels, from Ilyas and collaborators in 2022, fit a surrogate that predicts a model’s behavior as a function of which training examples were included. It treats the model as something to be regressed against its own training set. Powerful, and also expensive to construct, because you train many models on many subsets to fit the surrogate.
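Of the three, the checkpoint-based idea behind TracIn is the easiest to show in a few lines. The sketch below applies it to a two-parameter linear model with squared loss, trained by plain SGD. It follows the general recipe of summing learning-rate-scaled gradient dot products over saved checkpoints. The model, data, and function names are illustrative assumptions, not the reference implementation.

```python
Params = tuple[float, float]   # (weight, bias)
Point = tuple[float, float]    # (x, y)


def loss_gradient(params: Params, point: Point) -> Params:
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    w, b = params
    x, y = point
    error = w * x + b - y
    return (error * x, error)


def sgd_checkpoints(train: list[Point], lr: float, epochs: int,
                    init: Params = (0.0, 0.0)) -> list[Params]:
    """Train with SGD and save the parameters before training and after each epoch."""
    params = init
    checkpoints = [params]
    for _ in range(epochs):
        for point in train:
            g = loss_gradient(params, point)
            params = (params[0] - lr * g[0], params[1] - lr * g[1])
        checkpoints.append(params)
    return checkpoints


def tracin_score(checkpoints: list[Params], lr: float,
                 train_point: Point, test_point: Point) -> float:
    """Sum over checkpoints of lr * <grad(train_point), grad(test_point)>."""
    total = 0.0
    for params in checkpoints:
        g_train = loss_gradient(params, train_point)
        g_test = loss_gradient(params, test_point)
        total += lr * (g_train[0] * g_test[0] + g_train[1] * g_test[1])
    return total


train = [(1.0, 2.0), (2.0, 4.0), (-1.0, 3.0)]
checkpoints = sgd_checkpoints(train, lr=0.05, epochs=5)
query = (1.5, 3.0)
for point in train:
    print(point, tracin_score(checkpoints, 0.05, point, query))
```

A positive score means steps on that training point tended to lower the loss on the query, a negative score means they tended to raise it. Even here the answer depends on which checkpoints were saved, which is one reason the estimate stays an estimate.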

Every one of these is an approximation of the counterfactual, each with its own failure modes, and none of them is cheap enough to run per output per data point at the scale of a real product. That is not a temporary engineering gap. The geometry of how these models store information is working against you.

Why the geometry fights back

A neural network does not file your blog post in a drawer labeled “your blog post.” Training distributes what it learns across the weights. A single concept is spread over many neurons, and a single neuron participates in many concepts. The interpretability literature calls the first property distributed representation and the second polysemanticity, and recent work on superposition suggests models learn to pack more features than they have dimensions, overlapping them because the resulting interference is usually small enough to tolerate.

The practical consequence is that there is no spot in the weights you can point to and say “that came from this person.” The information is smeared. Your data nudged millions of parameters by tiny amounts, and so did everyone else’s, and the output you care about is the joint result of all those nudges passed through a deeply nonlinear function. Asking for one source’s share of one output is a bit like asking which raindrop is responsible for a particular eddy in a river.

The exception: memorization

There is one regime where attribution gets dramatically easier, and it is the one that makes headlines. Sometimes a model does not generalize a piece of data, it memorizes it, and can reproduce it close to verbatim. Work by Carlini and others has shown you can extract memorized training data from large models, that larger models memorize more, and that duplication is one of the strongest drivers of memorization. Data that appears many times in the corpus is far more likely to be regurgitated, which is also why deduplicating training data measurably reduces memorization.
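Exact deduplication is the simplest version of that fix. The sketch below fingerprints documents after a light normalization and keeps the first copy of each. The normalization rule and function names are assumptions for illustration. Real pipelines typically also look for near-duplicates and repeated substrings, which this does not attempt.

```python
import hashlib
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def duplicate_counts(corpus: list[str]) -> Counter:
    """How many times each distinct (normalized) document appears."""
    return Counter(fingerprint(doc) for doc in corpus)


def deduplicate(corpus: list[str]) -> list[str]:
    """Keep the first occurrence of every distinct document, in order."""
    seen: set[str] = set()
    kept = []
    for doc in corpus:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


corpus = ["Hello  world", "hello world", "Something else"]
print(deduplicate(corpus))
print(max(duplicate_counts(corpus).values()))
```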

Memorized content is attributable almost by definition. If the model emits your text verbatim, the link is not statistical, it is visible. This is the easy and legally loud case. But it is the minority case. Most of what a model does is generalization, recombination, the diffuse kind of influence that the methods above can only estimate. Building an entire compensation regime on the memorized tail would miss most of how these systems actually use data.
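Checking for the visible case is ordinary string matching. A rough sketch: find the longest run of tokens that a model output shares word for word with a source document, and flag it if the run is long. Whitespace tokenization and the threshold are assumptions for illustration. What counts as "long enough" is a policy choice, not something this code settles.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(output: str, source: str) -> list[str]:
    """Longest contiguous token sequence that appears in both texts."""
    out_tokens = output.split()
    src_tokens = source.split()
    matcher = SequenceMatcher(None, out_tokens, src_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(out_tokens), 0, len(src_tokens))
    return out_tokens[match.a : match.a + match.size]


def looks_memorized(output: str, source: str, min_tokens: int) -> bool:
    """True if the shared verbatim run is at least min_tokens long."""
    return len(longest_verbatim_run(output, source)) >= min_tokens


source = "the quick brown fox jumps over the lazy dog"
output = "he said the quick brown fox jumps high"
print(longest_verbatim_run(output, source))
print(looks_memorized(output, source, min_tokens=5))
```

This only catches exact reproduction. A paraphrase of the same post sails through, which is the generalization case the rest of this section is about.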

After pre-training, the ground shifts

Everything so far is about pre-training, the giant first pass over the open corpus. The picture changes, somewhat for the better, in the stages that come after.

Fine-tuning runs on smaller, curated datasets. The counterfactual is still the right definition and still not free, but the numbers are friendlier. With thousands or millions of examples instead of trillions, influence estimates are more stable and the set of candidate sources is bounded. Attribution here is hard but no longer hopeless.

Reinforcement learning from human feedback is harder again in a different way. The training signal is preference data, humans choosing between outputs, and that signal shapes behavior globally rather than injecting retrievable facts. Tracing a specific model behavior back to a specific preference label is murky, and the labelers themselves are a data source whose contribution almost no one accounts for.

Retrieval-augmented generation is the pleasant surprise. When a model answers by pulling documents into its context at run time, the documents it used are not a mystery. They are right there in the request. Attribution becomes logging. This is why citations in retrieval systems are tractable while citations for pre-training knowledge are not. If you want clean, per-use attribution today, retrieval is where it already exists.
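"Attribution becomes logging" can be taken almost literally. Here is a minimal sketch of a per-request record that lists exactly which documents were placed in the context. The record layout, the `RetrievedDoc` type, and the field names are hypothetical and not part of any LCS specification.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str   # hypothetical stable identifier for the source
    text: str     # the passage actually inserted into the context


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def attribution_record(request_id: str, query: str, docs: list[RetrievedDoc]) -> str:
    """Serialize which sources were used for one request, in retrieval order."""
    record = {
        "request_id": request_id,
        "query_sha256": _sha256(query),  # hash, so the log need not store the query
        "sources": [
            {"doc_id": d.doc_id, "sha256": _sha256(d.text), "chars": len(d.text)}
            for d in docs
        ],
    }
    return json.dumps(record, sort_keys=True)


docs = [
    RetrievedDoc("blog/rivers", "Eddies form behind obstacles."),
    RetrievedDoc("blog/rain", "Raindrops merge as they fall."),
]
print(attribution_record("req-42", "why do eddies form?", docs))
```

Whether every retrieved document deserves equal credit is a separate question. The record only establishes which documents were in the room.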

In-context use is the trivial case. Whatever you put in the prompt, the model saw, and you know exactly what you put there. That is not really training, but it is increasingly how data reaches models in practice, and it points at where this is all heading.

Notice the pattern. The closer data sits to the moment of use, the more attributable it becomes. The deeper it is baked into the weights, the less. A consent system should take that seriously instead of pretending one mechanism fits all of it.

What LLMConsent does about it

Given all of the above, designing the protocol around perfect per-output attribution would be designing around something that does not exist. So we do not.

We separate the two questions on purpose. Provenance is treated as a verifiable fact, established when consent is checked at ingestion, not reconstructed later by attack. LCS-001 is built around that grant and check moment, so the record of what was permitted exists before any training happens, not as an afterthought.

For influence, we bound rather than measure. A consent token carries a maximum influence ceiling, expressed in basis points (hundredths of a percent), that caps how much any single source is allowed to shape a model. Capping influence is far more achievable than measuring it exactly after the fact, and it turns an impossible accounting problem into a tractable policy one. You do not need to know that a source contributed 0.0007 of an output if the terms already said it may contribute at most a set fraction.
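How a ceiling gets enforced is an implementation choice. One crude but checkable proxy is a source's share of the tokens in a training mix, and the sketch below works through that arithmetic. Treating token share as a stand-in for influence is an assumption made here for illustration, not something the post or LCS-001 specifies, and the function names are hypothetical.

```python
BPS_PER_WHOLE = 10_000  # 1 basis point = 0.01% = 1 / 10,000


def share_in_bps(source_tokens: int, total_tokens: int) -> float:
    """A source's share of the training mix, in basis points."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return source_tokens / total_tokens * BPS_PER_WHOLE


def within_ceiling(source_tokens: int, total_tokens: int, ceiling_bps: int) -> bool:
    """Integer-only check that the share does not exceed the ceiling."""
    return source_tokens * BPS_PER_WHOLE <= ceiling_bps * total_tokens


def max_tokens_allowed(total_tokens: int, ceiling_bps: int) -> int:
    """Largest token count from one source that still respects the ceiling."""
    return total_tokens * ceiling_bps // BPS_PER_WHOLE


# A ceiling of 25 bps on a 1,000,000-token fine-tuning mix.
print(max_tokens_allowed(1_000_000, 25))
print(within_ceiling(3_000, 1_000_000, 25))
```

The point of the integer comparison is that the check is exact and cheap, which is the whole appeal of bounding over measuring.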

Where attribution genuinely is available (retrieval, memorization, fine-tuning sets), the protocol expects implementations to use it, and the economics can be exact in those cases. Where it is not, the model is consent at ingestion plus bounded influence plus usage-based settlement, rather than a fantasy of tracing every token to a payee. We would rather ship a mechanism that is honest about its approximations than a precise-sounding one that quietly cannot be implemented.

There is also revocation, which drags in another genuinely open problem: machine unlearning. Removing the effect of a data point from an already-trained model is its own research frontier, with approaches that range from full retraining of data shards to approximate methods with real limits. The token in LCS-001 can signal that consent is revocable and that unlearning is requested, but we are not going to pretend the ecosystem can perfectly forget on command yet. It cannot. Naming the gap in the standard is better than hiding it.
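The shard-retraining end of that range is simple to sketch at the bookkeeping level. If each document is assigned to exactly one shard and each shard trains its own sub-model, a revocation only forces retraining of the shards that held the revoked data. The assignment rule and function names below are assumptions for illustration, and the sketch covers the bookkeeping only, not the training or the aggregation of sub-models.

```python
import hashlib
from collections import defaultdict


def shard_of(doc_id: str, num_shards: int) -> int:
    """Deterministically map a document to one shard."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign_shards(doc_ids: list[str], num_shards: int) -> dict[int, list[str]]:
    shards: dict[int, list[str]] = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_of(doc_id, num_shards)].append(doc_id)
    return dict(shards)


def shards_to_retrain(revoked_ids: list[str], num_shards: int) -> set[int]:
    """Only shards that contained revoked documents need retraining."""
    return {shard_of(doc_id, num_shards) for doc_id in revoked_ids}


doc_ids = [f"doc-{i}" for i in range(100)]
shards = assign_shards(doc_ids, num_shards=8)
affected = shards_to_retrain(["doc-3"], num_shards=8)
print(f"retrain {len(affected)} of {len(shards)} shards")
```

The saving is real but bounded: the affected shard still retrains in full, and the design has to be chosen before training, not bolted on after a revocation arrives.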

The honest summary

Provenance is a solved problem if you record consent at the right moment. Influence during pre-training is, for now, fundamentally approximate, and the shape of these models suggests it will stay that way. Attribution improves the closer you get to the point of use, which is exactly where AI is moving with retrieval and agents. A serious consent protocol should prove what it can prove, bound what it cannot measure, and refuse to dress up estimates as receipts.

If you work on training data attribution, influence estimation, or unlearning, we would genuinely like to be wrong about the hard parts. The standards are on GitHub, and the place to argue is an issue or a proposal.
