Open Data for Public Interest AI – Calls for Collaborative Action Progress Update

Author: Bolaji Ayodeji, DPG Evangelist and Technical Coordinator, DPGA Secretariat
Last year, the DPGA Secretariat launched its first-ever set of Calls for Collaborative Action following discussions with experts in the open source ecosystem, digital public infrastructure, climate action, and public interest AI. These calls are designed to galvanise support and signal to stakeholders the actions they can take to contribute to the success of digital public goods in highly impactful areas. One of those calls is Open Data for Public Interest AI, which called for DPGs that make it easier to identify, prepare, share, and use higher-quality open training data.

The development of public interest AI depends on the opportunity to train models on both existing and new high-quality, openly licensed datasets. At a time when generative AI is advancing at breakneck speed, the term “open-source AI” is often misconstrued to describe systems with varying degrees of openness, such as those that release model weights without transparency around the training data. It has therefore become increasingly imperative to work towards a transparent and open way of building AI systems that serve the public interest.
Several challenges impede this at a larger scale, including infrastructure limitations, funding constraints, and limited access to open solutions, underscoring the need for greater resources to produce and share open data across diverse geographical contexts. In an earlier blog post, DPGA Secretariat CEO Liv Marte Nordhaug noted that “DPGs, as open, adaptable digital solutions, with documentation that can help facilitate reuse, can play an important role as tools for addressing common challenges to scaling public interest AI – both in the near future and longer term. In particular, DPGs can help unlock more and higher-quality open training data and data sharing.”

Over the past several months, we have therefore focused on exploring how DPGs can help reduce some of the technical barriers to having more high-quality open training data, particularly for use cases such as the development of language models that address language gaps in AI development, solutions for public service delivery, and research-based climate action (monitoring, mitigation, adaptation). One fundamental way we addressed this challenge, with multiple stakeholders participating in the call, was by creating an adaptable and reusable toolkit that can be recommended to countries and stakeholders to facilitate the collection, extraction, processing, validation, and preparation of data.
2025 Key Activities and Stakeholder Outputs
Our key activity during the past months was the development of this toolkit, which includes a list of existing DPGs, potential DPGs, and other open-source tools that could be relevant for advancing public interest AI. We use four core terms to summarise the focus areas of the solutions and their relevance to the call to action (a minimal code illustration follows the list):
- Data Identification (identifying data licenses, sources, etc.).
- Data Collection (data sourcing/capturing).
- Data Validation (data cleaning and checking correctness/quality).
- Data Processing (transforming and aggregating data, feature engineering, etc.).
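To make these four stages concrete, here is a minimal, hypothetical sketch of how they might look for a small tabular dataset, using the open-source pandas library; the file name, column names, and license check are illustrative assumptions rather than part of any specific DPG.

```python
# A hypothetical walk through the four stages for a small tabular dataset.
# File name, columns, and thresholds are illustrative assumptions.
import pandas as pd

# 1. Data Identification: confirm the source and its license before reuse.
source = {"path": "rainfall_2024.csv", "license": "CC-BY-4.0"}
assert source["license"] in {"CC0-1.0", "CC-BY-4.0"}, "not openly licensed"

# 2. Data Collection: load the raw records.
df = pd.read_csv(source["path"])

# 3. Data Validation: check correctness and quality.
df["rainfall_mm"] = pd.to_numeric(df["rainfall_mm"], errors="coerce")
valid = df.dropna(subset=["rainfall_mm"])
valid = valid[valid["rainfall_mm"].between(0, 500)]  # drop implausible readings

# 4. Data Processing: transform and aggregate for downstream use.
monthly = (
    valid.assign(month=pd.to_datetime(valid["date"]).dt.to_period("M"))
         .groupby("month")["rainfall_mm"]
         .mean()
)
print(monthly.head())
```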
Compiling the first version of the toolkit showed that the existing DPGs included are primarily focused on data collection, with only a few also involved in data validation or processing. This already indicates a gap in data identification, validation, and processing. We then aimed to coordinate stakeholder efforts to ensure that potential solutions on the list could become certified DPGs, while sharing the toolkit with relevant open-source community members who are building datasets that could eventually lead to the creation of new AI DPGs on the DPG Registry.
Our coordination efforts started by convening stakeholders (many of them DPGA members and DPG product owners) to gather their initial thoughts on how this topic affects their current workstreams, including Creative Commons, GeoPrism Registry, Open Knowledge Foundation, the Government of the Dominican Republic, Open Future, BMZ (GIZ FAIR Forward), and Open Data Services. We also had exploratory discussions with more than ten dataset creators to learn more about the challenges they faced in collecting and developing their datasets, any unique open-source digital solutions that supported the process, and any they wished had existed at the time. All these inputs helped validate our focus on tools for this call as one means of advancing open data. Below is an overview of the consulted stakeholders' work that is relevant to the Call for Collaborative Action (C4CA) and the toolkit.
- Open Knowledge Foundation is working on data quality, literacy, capacity building, and discoverability through the Open Data Editor (ODE), which, at the beginning of the Call, was still a work in progress. It has since been officially verified as a DPG. The ODE is a free, open-source tool that helps nonprofits, data journalists, activists, and public servants detect errors in their datasets. It is designed for people working with tabular data who lack the programming skills to automate data exploration and therefore spend far more time than they would like checking their datasets for possible errors and correcting them. Their work on this project helped them identify the technical barriers obstructing stakeholders from processing data and from communicating the business value and ROI of open data. The Open Knowledge Foundation also provided virtual training for DPGA members on this project.
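For readers who want to try this kind of tabular validation programmatically, below is a minimal sketch using the open-source Frictionless Framework, the Python toolkit that (to our understanding) underpins the Open Data Editor; the file name is a placeholder.

```python
# A minimal sketch of tabular validation with the open-source Frictionless
# Framework (pip install frictionless). "survey_data.csv" is a placeholder.
from frictionless import validate

report = validate("survey_data.csv")

if report.valid:
    print("No structural or type errors found.")
else:
    # Each entry is (row number, field number, error type) for one error.
    for issue in report.flatten(["rowNumber", "fieldNumber", "type"]):
        print(issue)
```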
- Open Future is working to develop a comprehensive understanding of data governance frameworks, including open data, open licensing, and other commons-based approaches, while collaborating with open source AI developers in Europe to establish shared perspectives on training data for AI, primarily within the context of European AI policies. They are also working with stewards of collections, especially in the heritage sector, on creating a blueprint for a “books data commons” for AI training and supporting the development of the CommonsDB project, a public registry for Public Domain and openly licensed works, to bring greater legal certainty to the reuse of digital content.
- Creative Commons is working on the CC Signals project, a framework for a simple pact between those stewarding data and those reusing it for AI development. The proposed framework will help content stewards express how they want their works used in AI training—emphasising reciprocity, recognition, and sustainability in machine reuse. They aim to preserve open knowledge by encouraging responsible AI behaviour without limiting innovation, serving as a broker between content and models.
- The Government of the Dominican Republic, through the Ministry of Public Administration, is working on a national framework for interoperability and data governance in public administration. The framework provides clear guidelines for data lifecycle management; promotes standardised metadata catalogues, open APIs, and digital government architecture; and ensures alignment with international best practices in transparency, efficiency, and data protection. This directly supports the preparation of data for AI development by strengthening data identification, collection, validation, and processing in the public sector. They are also working on an AI & Data Sandbox that will leverage DPGs such as CKAN, Open Data Editor, and X-Road to enable controlled experimentation with anonymised and public datasets. This will create a safe environment where public-sector data can be prepared and shared under open licenses.
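As a small illustration of what a standardised catalogue enables, here is a sketch of querying a CKAN instance's Action API for openly licensed datasets; the instance URL and filter values are illustrative assumptions, but the /api/3/action endpoints are part of CKAN's standard API.

```python
# A minimal sketch of searching a CKAN catalogue for openly licensed
# datasets via its Action API. The instance URL and license filter are
# illustrative; any CKAN deployment exposes the same endpoints.
import requests

CKAN_URL = "https://demo.ckan.org"  # placeholder instance

resp = requests.get(
    f"{CKAN_URL}/api/3/action/package_search",
    params={"q": "health", "fq": "license_id:cc-by", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} matching datasets")
for pkg in result["results"]:
    print(pkg["name"], "-", pkg.get("license_title"))
```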
- The GIZ FAIR Forward Project, funded by the German Federal Ministry for Economic Cooperation and Development (BMZ), gave us access to their network of dataset creators, whose input significantly shaped the toolkit's development. Their Lacuna Fund Learning and Evaluation Report offered insights into how the Lacuna Fund has effectively and efficiently enabled the creation, expansion, and maintenance of representative and unbiased training datasets for ML. It also examined process challenges experienced by stakeholders and provided recommendations for improvement.
- GeoPrism Registry is working on developing standards such as Geo-GraphRAGs to make geospatial and terminological data interoperable and usable by AI. Such standards would support the toolkit's data-processing objective by making this data easier for AI models to access, integrate, and reason over, which is crucial for building AI systems that serve public needs such as disaster response or public health monitoring. This work focuses on creating shared data and vocabulary standards that make complex, high-value data, such as geospatial and health data, interoperable, reusable, and machine-understandable. Together, these initiatives advance the infrastructure layer for open, interoperable data ecosystems by ensuring that AI systems can use standardised, well-described data from multiple domains and jurisdictions.
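The Geo-GraphRAG specifics are still evolving, but one plausible building block is publishing places as linked-data triples with standard vocabularies so that retrieval systems can traverse them. Below is a minimal, hypothetical sketch using the open-source rdflib library and the OGC GeoSPARQL vocabulary; the entity, namespace, and coordinates are invented for illustration.

```python
# A hypothetical sketch of making geospatial data machine-readable as
# linked data, one building block of graph-based retrieval for AI.
# Entity, namespace, and coordinates are invented for illustration.
from rdflib import RDF, RDFS, Graph, Literal, Namespace, URIRef

GEO = Namespace("http://www.opengis.net/ont/geosparql#")
EX = Namespace("https://example.org/registry/")  # placeholder namespace

g = Graph()
clinic = URIRef(EX["clinic-001"])
g.add((clinic, RDF.type, EX.HealthFacility))
g.add((clinic, RDFS.label, Literal("Riverside Community Clinic")))
g.add((clinic, GEO.asWKT, Literal("POINT(-70.16 18.73)", datatype=GEO.wktLiteral)))

# A retrieval layer can now answer structured questions over the graph,
# e.g. "list all health facilities with a known location":
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?label ?wkt WHERE {
    ?f a <https://example.org/registry/HealthFacility> ;
       rdfs:label ?label ;
       geo:asWKT ?wkt .
}
"""
for row in g.query(query):
    print(row.label, row.wkt)
```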
We also discovered complementary toolkits developed by various organisations, including educational materials, resources, templates, and checklists, which all aid different stakeholders in the production of open datasets. Here are a few examples:
- Open Government Data Toolkit: designed to help governments, World Bank staff, and diverse users understand the basic precepts of open data, then get “up to speed” in planning and implementing an open government data program while avoiding common pitfalls.
- The Data Innovation Toolkit: designed to give public servants practical tools to facilitate and enhance the implementation of data-driven initiatives for the public good.
- AI Training Dataset Sustainability Toolkit: designed to support Lacuna Fund grantees and the broader machine learning community in publishing sustainable AI training datasets, including a step-by-step playbook, with guidance and checklists for researchers.
- Towards Best Practices for Open Datasets for LLM Training: The research outlines possible tiers of openness, normative principles, and technical best practices for sourcing, processing, governing, and releasing open datasets for LLM training, as well as opportunities for policy and technical investments to help the emerging community overcome its challenges.
- A Blueprint to Unlock New Data Commons for AI: This blueprint provides guidance and resources to support organisations (particularly those that have or steward data) seeking to create data commons for AI in the public interest.
When we spoke to the developers of impressive datasets like the Simula Datasets, Africa Biomass, Hyperlocal Mapping of Air Pollution, Inclusive Digital Monitoring System in the Eastern Himalayas, High Carbon Stock Approach for Forest Protection, Kenyan Language Corpora, Masakhane African Languages Hub, and others, we discovered several challenges that affirmed the need for more technical tools and resources.

For example, the team working with fishing communities faced complications in data collection due to the diversity of dialects across communities, with language barriers arising between those who spoke the local languages and those responsible for gathering the data. Environmental challenges, such as sensor failures due to heat and maintenance issues, further slowed progress. Although Excel was primarily used for processing, there was a need for better open-source tools to automate sensor-data processing and improve interoperability. Working with universities helped refine and analyse the data, but issues around data ownership, transparency in payment, release agreements, and government buy-in persisted.

Others working on biomass data relied heavily on paper-based processes that were later digitised, which often led to inefficiencies: data scientists had to review information line by line, while field teams repeatedly verified corrections, creating a time-consuming back-and-forth. For those building on existing data, broader systemic challenges included unclear or missing data licenses that caused long delays, fragmented storage without a central repository, and restricted access to locked or government-owned datasets.

Those working on medical data relied on an in-house data processing pipeline. For data labelling, they partnered with proprietary platforms that offer a free academic license and strong compliance with GDPR and privacy regulations. While they explored open-source alternatives, most required local setups and lacked the scalability needed for multi-expert remote collaboration.
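To illustrate the kind of automation these teams asked for, here is a minimal, hypothetical sketch of cleaning field-sensor readings with the open-source pandas library instead of reviewing rows by hand in a spreadsheet; the file name, column names, and plausibility thresholds are illustrative assumptions.

```python
# A hypothetical sketch of automating sensor-data clean-up with pandas.
# File name, columns, and plausibility thresholds are assumptions.
import pandas as pd

df = pd.read_csv("field_sensors.csv", parse_dates=["timestamp"])

# Coerce malformed readings (e.g. from heat-damaged sensors) to NaN.
df["temperature_c"] = pd.to_numeric(df["temperature_c"], errors="coerce")

# Flag physically implausible or missing values rather than silently
# dropping them, so field teams can verify corrections in a single pass.
df["suspect"] = ~df["temperature_c"].between(-10, 60)

# Aggregate clean readings to hourly means for interoperable sharing.
hourly = (
    df.loc[~df["suspect"]]
      .set_index("timestamp")["temperature_c"]
      .resample("1h")
      .mean()
)
hourly.to_csv("field_sensors_hourly.csv")
```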
A broader challenge lies in the economics of data sharing. Many see datasets as commercial assets rather than public goods, leading to data being bought, sold, or locked away instead of shared through open collaboration. Overall, progress is steady but slow: people are warming up to the idea that openness and privacy can coexist, and that responsible data sharing can drive scientific and AI advancement for the common good. Altogether, these experiences underscore a clear need for open, community-centred, and interoperable digital solutions that streamline data collection and validation, ensure standardisation, improve transparency, and promote collaboration between local communities, researchers, and government stakeholders, encouraging the latter to consider sharing national data openly to aid public interest AI innovation.
Next Steps - Priorities for 2026
Although the toolkit and the ongoing work of the stakeholders mentioned earlier are a first step in closing the identified gaps in enabling more and better open data for public interest AI, much still needs to be done to improve the availability and quality of open data resources for AI training. For this reason, the toolkit was established as a dynamic collection that allows for real-time updates, rather than a static resource. We aim to expand it over the next year through research, sourcing, and resource assessments, developing a more granular understanding of the needs of the toolkit’s target groups and adding further resources. Additionally, we plan to demonstrate the use of the products included in the toolkit through several case studies, thereby bridging the gap between theory and practice.

Our work in 2026 will also focus on sourcing more AI System DPGs by engaging the open science community and exploring the creation of open datasets for SDG-relevant models, fine-tuning, and public benchmarks that are representative of lived realities in global majority contexts. In the spirit of the Calls for Collaborative Action, we urge DPGA members, stakeholders, and partners to collectively support this effort to expand the open data for public interest AI ecosystem. Only collectively can we realise the potential of public interest AI and thereby make progress toward the SDGs.