About This Project
An educational resource for understanding how changes to training data affect AI model behavior.
What Is This?
Data Counterfactuals is an interactive explainer and research reference designed to help people understand the relationship between training data and model outcomes. It connects several research areas that are often studied separately:
- Data valuation — How much is each data point worth?
- Influence functions — Which training examples affect a prediction?
- Scaling laws — How does more data improve performance?
- Data poisoning — How can bad data harm a model?
- Collective action — What leverage do data creators have?
The core insight is that all of these questions are variations on a single counterfactual: "What would happen if we trained on different data?"
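To make that counterfactual concrete, here is a minimal leave-one-out sketch: retrain a model with one training example removed and measure how the outcome changes. The synthetic dataset, model choice, and helper name `accuracy_without` are illustrative assumptions for this page, not code from the project.

```python
# Minimal sketch of the leave-one-out counterfactual behind data valuation
# and influence functions. Assumes scikit-learn; the dataset and model are
# illustrative placeholders, not part of this project.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

def accuracy_without(i):
    """Retrain with training example i removed; return test accuracy."""
    mask = np.arange(len(X_train)) != i
    model = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
    return model.score(X_test, y_test)

# Baseline: the model trained on all of the data.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Leave-one-out value of example i: how much accuracy we lose (or gain)
# by answering "what if we had trained without this point?"
loo_values = np.array([baseline_acc - accuracy_without(i)
                       for i in range(len(X_train))])
print("Most valuable example:", loo_values.argmax(),
      "value:", loo_values.max())
```

Retraining once per removed example is the exact but expensive version of this question; influence functions and data valuation methods can be viewed as cheaper approximations of the same quantity, while scaling laws and poisoning study it in aggregate.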
Current Status
This project is under active development. The interactive Grid explorer works, but many features are still being refined. The paper collections are curated but not comprehensive — they represent a starting point for exploring these research areas, not a complete literature review.
Contributions, corrections, and suggestions are welcome.
How Discussions Work
The Discuss/Feedback page connects to an external discussion platform where you can:
- Ask questions about the concepts or visualizations
- Report bugs or suggest improvements
- Propose papers to add to the collections
- Share how you're using this resource
Discussions are hosted on GitHub to keep everything in one place alongside the source code. You can also open issues or pull requests directly on the repository.
Who Made This?
This project is part of a broader effort to make AI/ML concepts more accessible. It's open source and welcomes contributions.
Related Projects
- Data Licenses — Understanding data licensing for AI
- Data Napkin Math — Quick estimates for AI data questions
- Shared References — Collaborative bibliography management