The smallest useful move

Consider a simple thought experiment: you are going to train a machine learning model on a small dataset (or on a big dataset with distinct subsets). Now imagine a massive grid where every possible training set is a row, every possible evaluation set is a column, and each cell records the performance for that train/eval pairing. In practice it helps to shrink the picture first and imagine just four data objects: A, B, C, and D. Those could be single observations in a toy example, or four large datasets we are considering mixing.

Once that grid is in view, the smallest useful counterfactual is leave-one-out: compare a row that includes one point with the nearby row in which that point is missing. By taking the difference between the corresponding cells in those two rows, we can see how much a given data point helped or hurt the model on each evaluation slice. From there the same logic can be extended to groups of data points, fixed-size subsets, synthetic replacements, corrupted examples, withheld data, or coordinated withdrawal.

In that sense, a data counterfactual is just a concrete version of the question: what changes when the training data change?
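As a sketch, leave-one-out is just a loop: train once with a point, once without it, and record the difference in evaluation score. The `train`/`evaluate` pair below is a deliberately tiny stand-in (a one-dimensional least-squares fit scored by negative mean squared error), not any particular library's API:

```python
def train(points):
    # Fit y = slope * x + intercept by least squares on (x, y) pairs.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / var if var else 0.0
    return slope, my - slope * mx

def evaluate(model, eval_points):
    # Negative mean squared error, so higher is better.
    slope, intercept = model
    return -sum((y - (slope * x + intercept)) ** 2
                for x, y in eval_points) / len(eval_points)

def leave_one_out(train_points, eval_points):
    # For each point: score with it minus score without it.
    # Positive delta means the point helped; negative means it hurt.
    base = evaluate(train(train_points), eval_points)
    deltas = {}
    for i, p in enumerate(train_points):
        rest = train_points[:i] + train_points[i + 1:]
        deltas[p] = base - evaluate(train(rest), eval_points)
    return deltas

deltas = leave_one_out([(0, 0), (1, 1), (2, 2), (3, 10)],
                       [(0, 0), (1, 1), (2, 2)])
print(deltas[(3, 10)])  # negative: the outlier hurt this evaluation
```

Swapping in a real model and metric changes only the two stand-in functions; the counterfactual loop itself stays the same.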

Leave-one-out toy example

        A      B      C      D
ABCD    0.92   0.88   0.85   0.82
ACD     0.78   0.55   0.82   0.80
With four toy data objects A, B, C, and D, the lower row leaves out B. The sharp drop on evaluation slice B is the kind of local contrast many attribution methods try to summarize.
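Read off as code, the contrast in the table is just a difference of two cells. The numbers are copied from the table above; the `grid` dict and `loo_effect` helper are illustrative, not part of any real API:

```python
# Rows are training sets, columns are evaluation slices (from the table above).
grid = {
    "ABCD": {"A": 0.92, "B": 0.88, "C": 0.85, "D": 0.82},
    "ACD":  {"A": 0.78, "B": 0.55, "C": 0.82, "D": 0.80},
}

def loo_effect(grid, with_row, without_row, eval_slice):
    # Cell-by-cell contrast between a row and its leave-one-out neighbor.
    return grid[with_row][eval_slice] - grid[without_row][eval_slice]

print(loo_effect(grid, "ABCD", "ACD", "B"))  # leaving out B costs 0.33 on slice B
```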

A grid for seeing the space

If we could fill in the whole grid, ideas from across different subfields start to show up in one place. Various ways of defining data value can be understood as summaries over particular slices, scaling patterns emerge as we move down the rows, and selection problems become questions about which rows are worth visiting.

Real systems do not literally enumerate this whole object, but keeping it in mind is conceptually useful. The toy grid is there to make the comparisons visible: which training world changed, where the effect landed at evaluation time, and how large the difference was.
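A quick sketch of why enumeration is off the table: the number of rows grows as 2^n in the number of data objects. The `all_rows` generator below is illustrative; each yielded subset would cost a full train-and-evaluate run:

```python
from itertools import combinations

def all_rows(objects):
    # Every nonempty training set that can be formed from the data objects:
    # one grid row per subset, 2^n - 1 rows in total.
    for k in range(1, len(objects) + 1):
        for subset in combinations(objects, k):
            yield subset

rows = list(all_rows("ABCD"))
print(len(rows))  # 15 rows for 4 objects; already over a million for 20
```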

That same picture also helps with interventions that are not just analytical. Collective action around data can be understood as changing which rows are available or attractive to an AI operator in the first place.

Toy world with four observations

        A      B      C      D
AB      0.85   0.72   0.45   0.38
ABC     0.88   0.80   0.75   0.52
ABCD    0.92   0.88   0.85   0.82
ACD     0.78   0.55   0.82   0.80
Imagine every possible training set as a row and every evaluation slice as a column. The payoff of the metaphor is in comparing nearby cells, rows, and paths through the grid.
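One such path: performance on evaluation slice D as the training set grows from AB to ABC to ABCD. The numbers are copied from the table above, and the `grid` dict is illustrative:

```python
# Rows are training sets, columns evaluation slices (from the table above).
grid = {
    "AB":   {"A": 0.85, "B": 0.72, "C": 0.45, "D": 0.38},
    "ABC":  {"A": 0.88, "B": 0.80, "C": 0.75, "D": 0.52},
    "ABCD": {"A": 0.92, "B": 0.88, "C": 0.85, "D": 0.82},
}

# A scaling-style read: walk down nested rows, watching one eval column.
path = [grid[row]["D"] for row in ("AB", "ABC", "ABCD")]
print(path)  # [0.38, 0.52, 0.82]
```

The same walk along a different column, or a different sequence of rows, gives a different scaling story, which is the sense in which scaling questions are questions about paths through the grid.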

What the idea helps connect

I think this frame is useful because it connects conversations that often happen in silos. Someone interested in differential privacy, an ML researcher running ablations, and a labor advocate thinking about data leverage are often exploring different neighborhoods of the same conceptual space.

That does not mean the projects are all formally identical. One major distinction is that some techniques only explore counterfactuals over data that already exist: subsets, reweightings, filtering, or held-out removals. Others try to change what data the world produces in the first place, which is part of why collective action matters so much. Many smaller distinctions exist besides.

  • Value and attribution

    Leave-one-out, influence functions, and Shapley-style methods can all be read as different ways of aggregating over slices of the giant grid showing all the choices for "stuff I can train on" and "stuff I can evaluate on".

  • Scaling and selection

    Scaling laws, active learning, coresets, curriculum learning, and dataset distillation ask how performance changes as we move through different rows, add data, or choose which examples to keep.

  • Robustness, privacy, and repair

    Poisoning, privacy interventions, and some fairness-by-data methods explore nearby counterfactuals in which training data are corrupted, hidden, repaired, or reweighted.

  • Collective action and leverage

    Data strikes, contribution campaigns, and bargaining interventions try to push AI operators toward less favorable rows by changing what data the world actually produces.
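As one concrete instance of the value-and-attribution bullet, an exact Shapley value averages a point's marginal contribution over every ordering of the data. The `utility` function here is a toy additive stand-in for a full train-and-evaluate run, chosen so the right answer is known in advance:

```python
from itertools import permutations

def shapley(objects, utility):
    # Average each object's marginal contribution over all orderings.
    values = {o: 0.0 for o in objects}
    orderings = list(permutations(objects))
    for order in orderings:
        seen = frozenset()
        for o in order:
            values[o] += utility(seen | {o}) - utility(seen)
            seen = seen | {o}
    return {o: v / len(orderings) for o, v in values.items()}

# Toy additive utility: each object contributes a fixed, known amount,
# so the Shapley value of each object should recover its weight exactly.
weights = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
u = lambda subset: sum(weights[o] for o in subset)
print(shapley("ABCD", u))
```

For four objects this is only 24 orderings, but the count grows factorially, so at realistic scale the same definition has to be approximated by sampling subsets: another reason the full grid is a conceptual object rather than something to enumerate.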

How to read the site

The site is currently closer to an interactive research note than a polished landing page. If you want the shared launch-post version of the argument, start with the main memo. If you want the visual intuition, open the explorer and move around the grid. If you want the more experimental angle on how new data get produced, try the 3D view. The formalisms note is there as a more technical web companion rather than required reading for the introduction. It is still a work in progress.

The implementation is still early, so it is best treated as a working model rather than a finished explainer. My hope is that it offers a cleaner mental model for why "what if the data were different?" keeps showing up across so many lines of work.