For AI Systems, Provenance Is Fundamental to Building Knowledge, Trust, and Assessment
Letter from the Executive Director, June 2026
In the 1980s, when discussing deals to reduce nuclear weapons stockpiles between the US and the Soviet Union, US president Ronald Reagan used a Russian proverb, “doveryay, no proveryay” (доверяй, но проверяй), to describe his position regarding any potential deal: “Trust, but verify.” As AI tools become more embedded in workflows, writing tools, and research processes, this basic principle is one that we would do well to remember. The question with generative AI systems is, how do we verify their outputs? The answer is provenance and attribution. These two concepts are also key to assessing impact and usage.
Similarly, it has long been standard practice in research communication and business processes to expect that claims, or processes upon which new claims are built, should reference prior work. Unfortunately, the first generation of AI tools either could not provide, or offered poor support for, the kind of true provenance and citation linking that is fundamental to research applications. Because provenance and citation links were often lacking, understanding the value or impact of these tools, and especially of the content that they ingest, has been incredibly difficult.
If we are going to trust generative AI systems or track AI usage, we need a clear understanding of what content these systems draw their result sets from and which sources a given response draws on. By analogy, one might know that Einstein developed the formula E=mc² without knowing exactly what its implications are or the exact publication in which it was first put forward. Without a reference to the source, however, we cannot verify the claim that it was Einstein who introduced it before anyone else. To have the reference, you need to know the source. And beyond that, without the provenance, you cannot assess the impact and usage of the materials.
Last month, Cambridge University Press, COUNTER, and NISO co-hosted a workshop on the interconnected issues of provenance, citation, and usage reporting. A couple of dozen technical experts from the publishing, AI tool developer, and library communities met in Cambridge to scope potential community efforts on these topics. A full report describing the discussions and next steps in detail will be issued later in June, though, under the Chatham House Rule, without specific attribution to the participants.
In terms of practical outputs, COUNTER will continue its ongoing work to improve the tracking of content in AI systems. In April, COUNTER released a best practice to facilitate reporting on the usage of publisher content by AI systems. This extension updated the COUNTER Code of Practice to “better reflect the distinction between human usage, malicious bots that MUST be excluded, and AI systems that SHOULD be included in COUNTER Reports” as well as “introducing a new Access_Method and several new Metric_Types” to better report AI usage. Its scope was limited to AI usage within publisher systems and did not extend to usage outside of those systems. New work focused on usage outside of publisher systems is being formulated.
NISO will focus its attention on the questions of provenance and attribution within AI systems. We need to move quickly, because the AI landscape is shifting so rapidly. To address this, rather than forming a dedicated working group, we will likely use a pilot exploration model before a formal recommendation is made. This approach was used in the initial phase of developing the Resource Access in the 21st Century (RA21) Recommended Practice, to rapidly test approaches before we moved to formalization and publication. (Note: The RA21 initiative developed into the production service SeamlessAccess in 2019.) It allowed teams to work quickly to develop prototype approaches and settle on a best-of-breed approach to addressing the problem. The same model will likely work well here, with groups ideally producing a viable strategy in months rather than years.
It is important to note here that there are several related issues and adjacent efforts underway. The IETF has been working on a preference signalling model that might replace the long-established and often-ignored robots.txt approach to limiting crawling of content. Its AI Preferences Working Group will standardize building blocks that allow for the expression of preferences about how content is collected and processed for AI model development, deployment, and use. A draft vocabulary for expressing AI-related preferences and a draft protocol specification for associating those preferences with content have been available since last fall. They are expected to be released as final RFCs in roughly August.
Also last month, as the Cambridge meeting was ending, Creative Commons announced additional developments related to its Signals initiative. Following conversations with the community about the Signals project, CC announced in April that it was “Moving from Signals to Agency.” Its new vision is to incorporate CC preferences and develop tools to support the process of signalling and ingestion for open content, with another meeting in London in late May.
Interoperability between AI tools has also been advancing at a furious pace with the Model Context Protocol (MCP), led by Anthropic and now managed by the Linux Foundation, as well as the newer, competing Google WebMCP structure. These interoperability frameworks support agentic access and tool interactions. By exposing structured tools, they allow AI agents to perform actions across multiple platforms, much as APIs facilitate exchange between existing online systems.
Each of these projects relates to how content is gathered and used for training, rather than to maintaining provenance through the workflow. Meanwhile, the Coalition for Content Provenance and Authenticity (C2PA) initiative has this in mind, but more as a chain-of-custody theory of how content changes as it passes from creator to consumer. Generative AI systems fundamentally break this chain. It seems to me, and to participants in the Cambridge meeting, that the generative AI process also needs to have a robust understanding of provenance baked into the tools. Obviously, base-level models cannot do this: their understanding of language and of how communication functions is too complex, and too probabilistic, to provide specific attribution for particular statements or facts. More specific approaches, such as Retrieval-Augmented Generation (RAG) and In-Context Learning (ICL), do have the capacity to retain source information in their processes. We’ll be focusing on these types of approaches as we work through consistent practice for provenance and attribution signalling.
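As a rough illustration of how a retrieval-based workflow might retain source information, consider the minimal Python sketch below. It is not a description of any existing tool or of any forthcoming NISO recommendation. The Passage type, the "example:" source identifiers, and the simple word-overlap retriever are all hypothetical stand-ins chosen for this illustration. The point is only that each retrieved passage carries its source identifier into the prompt, so that a provenance record can be produced alongside the response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier for the source (assumption)
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[Passage], top_k: int = 2) -> list[Passage]:
    """Rank passages by word overlap with the query.

    A toy stand-in for a real retriever; passages with no overlap are dropped.
    """
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(p.text)), p) for p in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked if score > 0][:top_k]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Put the retrieved passages in the prompt, each labeled with its source."""
    context = "\n".join(
        f"[{i}] ({p.source_id}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return f"Answer using only the numbered sources.\n{context}\nQuestion: {query}"


def provenance_record(passages: list[Passage]) -> list[dict]:
    """Map each citation number used in the prompt back to its source."""
    return [
        {"citation": i, "source_id": p.source_id}
        for i, p in enumerate(passages, start=1)
    ]


# Illustrative corpus; the identifiers are made up for this sketch.
corpus = [
    Passage("example:physics", "Einstein proposed the equivalence of mass and energy."),
    Passage("example:usage", "Usage reports count how often content is accessed."),
    Passage("example:baking", "Bread rises because yeast produces gas."),
]

query = "Who proposed the equivalence of mass and energy?"
retrieved = retrieve(query, corpus)
prompt = build_prompt(query, retrieved)
record = provenance_record(retrieved)

print(prompt)
print(record)
```

Because the citation numbers in the prompt and the entries in the provenance record are generated from the same list, a response that cites “[1]” can be traced back to a specific source, which is the property a base-level model on its own does not offer.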
In the coming weeks, we will gather interested parties to work on piloting these attribution and provenance models. As noted, COUNTER is continuing its work, with a focus on third-party usage and the assessment of agentic usage of content in AI systems. Our aim is to move with speed and agility and to have something for the wider community to consider this fall. If you’re interested in participating, please reach out.
Sincerely,
Todd A. Carpenter
Executive Director