Open Data for Public Interest AI Toolkit

C4CA2 – Comprehensive resources and tools for building responsible AI systems

Complementary Toolkits

Below are complementary toolkits developed by various organisations. They include educational materials, resources, templates, and checklists that help different stakeholders produce open data sets.

Open Government Data Toolkit

Designed to help governments, Bank staff and users understand the basic precepts of Open Data, then get "up to speed" in planning and implementing an open government data program while avoiding common pitfalls.

The Data Innovation Toolkit

Designed to provide practical tools to facilitate and enhance the implementation of data-driven initiatives for the public good by public servants.

A Blueprint to Unlock New Data Commons for AI

This Blueprint provides guidance and resources to support organisations (particularly those that have or steward data) seeking to create data commons for AI in the public interest.

Towards Best Practices for Open Datasets for LLM Training

The research outlines possible tiers of openness, normative principles, and technical best practices for sourcing, processing, governing, and releasing open datasets for LLM training, as well as opportunities for policy and technical investments to help the emerging community overcome its challenges.

AI Training Dataset Sustainability Toolkit

Designed to support Lacuna Fund grantees and the broader machine learning community in publishing sustainable AI training datasets. The toolkit is intended as a step-by-step playbook, with guidance and checklists that researchers can reference as they prepare their datasets for widespread reuse by others. Topics include producing high-quality datasets before publication and selecting appropriate platforms to host the dataset.

Climate AI Data Gaps

Designed to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps (candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases).

OECD AI Principles Implementation Toolkit

Designed to offer practical, region-specific guidance for countries working to strengthen their AI ecosystems in the face of difficulties with policy development, governance, unequal access to compute resources, etc.

Use Cases

Lacuna Fund Learning and Evaluation Report
Report
The report aimed to assess what aspects of Lacuna Fund have effectively and efficiently enabled the creation, expansion, and maintenance of representative and unbiased training datasets for ML; examine process challenges experienced by stakeholders; and provide recommendations for improvement.
The Responsible Foundation Model Development Cheatsheet
Guide
The research aims to shape responsible development practices through a growing collection of 250+ tools and resources. It includes a survey of resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact of training; careful model evaluation of capabilities, risks, and claims; and responsible model release, licensing, and deployment practices.