10 Data Commons for Cultural Knowledge and Preservation

Artificial Intelligence is opening unprecedented possibilities for how cultural heritage can be preserved, revitalized, and expressed — from revealing forgotten histories to enabling multilingual access and keeping endangered cultural practices alive. Yet realizing this potential requires access to rich and representative cultural data. This creates a unique tension: cultural data must be available enough to enable inclusive and culturally aware AI, while also being safeguarded against extractive use, misuse, and erasure.

Around the world, new cultural data commons are emerging to navigate this tension. These initiatives — spanning museum archives, public-domain literature, 3D scans of heritage sites, and contemporary artistic works — demonstrate how cultural data can be made accessible in ways that are collaborative, respectful, and aligned with community expectations. They show that access and protection are not opposing goals, but co-requirements for culturally responsible AI. In this blog we highlight 10 compelling cultural data commons. These initiatives are exemplary, not exhaustive, and are listed in alphabetical order. We then outline several commonalities across these examples and reflections on how they are designed and put into practice. We conclude with pathways for future research and initiatives.

This blog is part of a larger initiative to examine and illustrate how data commons (collaboratively governed data ecosystems) can enable responsible AI development. To complement our blueprint for using data commons and our recent innovation challenge, we collected over 70 data commons from around the world that support AI in different ways. Among them, we found several initiatives that are supplying AI-ready cultural data to help developers adapt AI models for different cultures and contexts. These data commons range from museum archives to contemporary artworks to historical literature.

Data Commons Examples

Common Qualities

The examples above demonstrate several patterns in how cultural data commons are designed and implemented. These include:

Purpose

Initiatives such as the common European data space, AI4Culture, MetaBelgica, and the TRANSFER Data Trust are increasing access to cultural heritage data for preservation and research. These initiatives underscore the importance of digitizing cultural artifacts and of sharing these digital assets with cultural heritage communities to minimize the duplication of efforts. Wikimedia Commons GLAM emphasizes the importance of supporting cultural heritage institutions, researchers, and those contributing to the platform. Similarly, Open Heritage seeks to support research and education and specifies that the data can only be used for non-commercial purposes.

Other initiatives — the Institutional Data Initiative, European Books Data Commons, and the Common Corpus — aim to broaden access to cultural knowledge through the digitization of books, newspapers, and educational texts. These efforts cover several topics and materials.

Data Types

These data commons provide access to a range of cultural assets, from newspapers and artworks to audio recordings. Initiatives such as the common European data space and MetaBelgica are making cultural artifacts available, while the TRANSFER Data Trust focuses on contemporary artworks, enabling artists to preserve and share their work without relying on institutions.

Other efforts concentrate on cultural texts such as newspapers, books, and historical documents. For example, the HathiTrust Research Center offers access to an extensive corpus of books, ranging from historical volumes to works of literature in multiple languages.

Additionally, projects such as Open Heritage and Wikimedia Commons GLAM include 3D data. Open Heritage supplies 3D data from cultural heritage sites — including data collected from LiDAR sensors.

Funding Models

Setting up and maintaining a data commons for the cultural sector requires substantial investment. In Open Future’s recent publication, Outline for a European Books Data Commons, the team explains that “estimated annual operating costs [will be] between €500k and €750k” for their proposed books data commons, which underscores the importance of a funding model that remains sustainable over time.

The funding models behind these initiatives echo those identified in previous analyses. Projects such as AI4Culture, MetaBelgica, and the common European data space are supported by government funding. Others rely on grants from private companies and philanthropic organizations. For example, Harvard’s Institutional Data Initiative is funded by Microsoft and OpenAI; the Common Corpus has received support from several institutions, including the Nvidia Inception program; and the TRANSFER Data Trust is backed by the Knight Art + Tech Expansion Fund. The TRANSFER Data Trust also supplements its funding with public donations. In contrast, the HathiTrust Research Center collects membership fees.

Data Access Models

AI4Culture, the HathiTrust Research Center, and the TRANSFER Data Trust have built their own platforms to facilitate access to cultural data under clear licenses. These projects illustrate the potential of data commons to provide not only secure access but also trusted environments where researchers, cultural heritage specialists, and others can process and analyze data responsibly.

Open Heritage’s data is available for download from its website. Much like other data commons, it uses Creative Commons licensing but allows each data supplier to select the type of license for their respective dataset. Each dataset is also accompanied by a DOI to help streamline the citation process.
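As a rough illustration of how a supplier-chosen license and a DOI can streamline citation, the sketch below builds a one-line citation from a dataset record. The record fields, the example values, and the DOI are hypothetical placeholders, not Open Heritage's actual metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata for one dataset in a cultural data commons."""
    title: str
    supplier: str  # institution that contributed the data
    year: int
    license: str   # Creative Commons license chosen by the supplier
    doi: str       # persistent identifier used for citation


def format_citation(record: DatasetRecord) -> str:
    """Build a one-line citation that credits the supplier and states the license."""
    return (
        f"{record.supplier} ({record.year}). {record.title}. "
        f"Licensed under {record.license}. https://doi.org/{record.doi}"
    )


# Placeholder values for illustration only; this is not a real dataset or DOI.
example = DatasetRecord(
    title="Example Heritage Site 3D Scan",
    supplier="Example Institution",
    year=2020,
    license="CC BY-NC 4.0",
    doi="10.0000/example",
)
print(format_citation(example))
```

Keeping the license and the DOI on the same record means anyone reusing the data can see the terms and the credit line together.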

Other initiatives are applying different strategies to support attribution. Wikimedia Commons GLAM, for instance, includes structured elements that make it easier to attribute the data to the contributing institution.

Some initiatives, such as the HathiTrust Research Center, use tiered access models: members receive full access to the complete data corpus, while non-members have more limited access.
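A tiered access model of this kind can be sketched in a few lines. The example below is a simplified illustration only: the tier names, the sample items, and the rule that each item declares a minimum tier are assumptions, and it does not describe how the HathiTrust Research Center implements access.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical access tiers, ordered from most to least restricted."""
    NON_MEMBER = 1
    MEMBER = 2


# Hypothetical corpus: each item names the minimum tier allowed to read it.
CORPUS = [
    {"id": "vol-001", "min_tier": Tier.NON_MEMBER},
    {"id": "vol-002", "min_tier": Tier.MEMBER},
    {"id": "vol-003", "min_tier": Tier.NON_MEMBER},
]


def accessible_ids(user_tier: Tier) -> list[str]:
    """Return the ids of the items a user at the given tier may access."""
    return [
        item["id"]
        for item in CORPUS
        if user_tier.value >= item["min_tier"].value
    ]


print(accessible_ids(Tier.MEMBER))      # full corpus
print(accessible_ids(Tier.NON_MEMBER))  # limited subset
```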

Other projects — including Common Corpus — leverage existing infrastructure, making their datasets available through commonly used platforms like GitHub and Hugging Face.

What emerges is not a uniform model but a shared orientation: data access structures that seek to balance openness with protection, and community agency with the practical needs of cultural preservation and technological innovation.

Governance and Operational Models

Across the examples, a distinctive governance landscape begins to take shape — one characterized far less by centralized authority than by distributed, collaborative arrangements.

Several cultural data commons are governed, managed and maintained in collaboration with partner institutions. These partners serve not only as data contributors but also as active participants in governance and decision-making. For instance, the common European data space is operated by the Europeana Foundation, which is composed of 19 partner institutions. MetaBelgica has established a “Follow-Up Committee” that brings together organizations such as the Dutch Heritage Network and Wikimedia Belgium to provide subject-matter expertise throughout implementation. Open Heritage is managed by a consortium of institutions.

Other initiatives rely upon robust community engagement and networks of volunteers. Wikimedia Commons GLAM, for instance, sets up data supply partnerships with GLAM institutions and harnesses its volunteers for maintenance needs and other tasks.

Reflections

As the examples above illustrate, data commons have the potential to transform how cultural knowledge is preserved and to make communities visible within AI applications. Alongside this potential, however, it is equally important to consider when it is appropriate to make community data available and when it should not be included in AI systems. Establishing a system to maintain a social license is critical not only for understanding community expectations, but also for actively involving affected groups in decision-making processes.

Below we provide a set of illustrative questions to guide future research. As interest in cultural data commons continues to grow, we hope these questions serve as jumping-off points for deeper exploration:

How can the governance of cultural data commons reflect and align with community values and expectations?
What strategies and mechanisms can cultural data commons adopt to establish and maintain the necessary social license to operate? In what ways can and do cultural data commons negotiate community expectations around what should or should not be made accessible for AI use?
How can data commons organizers ensure that the communities they represent meaningfully benefit from the resulting AI?
How can data commons organizers increase the use of cultural data commons in AI applications while reducing dependence on copyrighted data?
How can cultural institutions collaborate to create collective data commons and minimize duplication of effort?
How are data commons defining boundaries around the use of cultural data in AI (e.g., deepfakes, generative remixing, derivative works)?

***

Have questions or want to collaborate? Reach out to us at newcommons@opendatapolicylab.org.