PAPER SESSION 14: Engaging with AI
Tracks
Matiu
Wednesday, November 5, 2025 | 3:00 PM - 4:30 PM | Matiu Meeting Room
Speaker
Matthias Priem
meemoo
Publishing AI-generated metadata - lessons learned
Summary Abstract
For over five years, meemoo has been working on AI to generate descriptive metadata. We researched potential solutions and built pipelines to generate metadata. In recent projects, we were able to scale this up, generating metadata for over 250,000 hours of audio and video from more than 120 archives in Flanders.
Until now, this metadata has been kept separate from, but linked to, the actual archived objects, so that archivists can assess its quality.
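As a purely illustrative aside (the talk itself will not go into technical detail), here is a minimal Python sketch of what such a linked-but-separate record could look like; the field names are hypothetical and do not reflect meemoo's actual data model:

# Hypothetical sketch of an AI-generated metadata record kept separate
# from, but linked to, the archived object it describes. Field names
# are illustrative only, not meemoo's actual schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class GeneratedMetadata:
    object_id: str            # link to the archived audio/video object
    field_name: str           # e.g. "transcript", "keywords", "summary"
    value: str                # the AI-generated content itself
    model: str                # which pipeline produced it (paradata)
    confidence: float         # model confidence, useful for quality triage
    review_status: str = "unreviewed"   # archivist assessment workflow
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The record references the object but lives in its own store, so an
# archivist can assess or reject it without touching the object itself.
suggestion = GeneratedMetadata(
    object_id="archive-item-00042",
    field_name="keywords",
    value="brass band; procession; 1950s",
    model="speech-to-text+ner-v2",
    confidence=0.81,
)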
We will briefly discuss the generated metadata and quality aspects before turning our focus to the next step: making use of this metadata in archive management tools or dissemination platforms.
We will not go into technical detail, but will highlight the many aspects and challenges that come into play when disseminating this kind of metadata: its quality, its sheer volume, technical hurdles, paradata, and, most importantly, the legal and ethical considerations when working with privacy-sensitive information.
We will explain how we are approaching this in collaboration with a large number of partners and stakeholders, making sure their voices are heard and taken into account.
This lightning talk will present our current state of affairs and lessons learned, and pose questions for future work. We hope this sparks a fruitful discussion among conference attendees.
Miss Pengyin Shan
Senior Research Software Engineer
NCSA/University of Illinois Urbana-Champaign
Intelligence and Data: The Symbiotic Future of LLM and Digital Preservation
Summary Abstract
As someone who builds production LLM integrations with a knowledge base, I've witnessed firsthand how LLMs are creating both unprecedented opportunities and existential challenges for digital preservation. This lightning talk shares insights from the frontlines of AI development, examining how the intersection of technology and digital preservation theory is reshaping our field's future. This presentation complements my accepted Tutorial "Build Smart Search Interfaces: Use Prompts to Turn Questions into Solr Queries with Minimal Coding" by exploring the strategic implications that every preservation professional must understand.
The first half examines how digital preservation practices directly impact LLM effectiveness. As AI systems increasingly shape information access and understanding, preservation specialists' decisions, such as how much contextual documentation to capture, can determine whether AI provides accurate, nuanced responses or sophisticated-sounding misinformation. In an era where organizations cannot afford custom model training but need LLMs to understand their specific collections, digital preservation can become a primary factor determining whether AI applications can effectively serve users.
The second half addresses how LLMs challenge traditional preservation practices, particularly the fundamental question of what materials deserve preservation when AI-generated content becomes indistinguishable from human creation. Does AI-generated content possess the same preservation significance as human-created works? If not, how do we develop robust authentication methods? Conversely, if we acknowledge that AI-generated content reflects genuine knowledge creation, then we need new frameworks capturing complete provenance chains, including AI systems, training data, and human prompts. I will bring forward the critical question of whether we need entirely new preservation frameworks for an age where the boundaries between human and artificial intelligence become increasingly blurred, challenging our fundamental assumptions about authenticity, authorship, and cultural significance.
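As a purely illustrative aside, one way such a provenance chain could be modeled is sketched below in Python; the record structure and field names are hypothetical, not a proposed standard:

# Minimal, hypothetical sketch of a provenance record for AI-generated
# content, capturing the chain the abstract describes: the AI system,
# its training data, and the human prompt. Illustrative only.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class AIProvenance:
    model_name: str          # e.g. the generating model's public name
    model_version: str       # pin the exact system that produced the text
    training_data_ref: str   # pointer to documented training corpus, if known
    prompt: str              # the human contribution to the record
    output_sha256: str       # fixity for the generated content itself

def record_generation(model_name: str, model_version: str,
                      training_data_ref: str, prompt: str,
                      output_text: str) -> AIProvenance:
    """Bind a generated text to the full chain that produced it."""
    digest = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
    return AIProvenance(model_name, model_version, training_data_ref,
                        prompt, digest)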
Mr Corey Davis
Digital Preservation Librarian
University of Victoria
Unlocking Web Histories: Leveraging LLMs and RAG to Transform Discovery in Web Archives
Summary Abstract
This paper explores how Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) can be applied to improve access to web archives. These collections are often difficult to navigate due to their complexity and the limitations of traditional search tools. The author examines how RAG can help address concerns around trust and transparency in AI by grounding LLM outputs in external, curated sources. At the University of Victoria Libraries, a custom RAG pipeline was developed to build on tools like WARC-GPT. This pipeline enables natural language querying with source attribution and incorporates optimizations in data preprocessing, chunking strategies, and hardware acceleration. When tested on real-world web archives of cultural significance, it demonstrated improved retrieval accuracy and computational efficiency. The discussion also considers the broader implications for digital preservation in an age of uncertainty, emphasizing the importance of trust, ethics, sustainability, and continued human oversight in AI-powered discovery. The findings suggest that RAG offers a promising way to unlock the value of underused digital heritage collections while upholding the foundational values of research libraries, including long-term preservation, equitable access, and responsible stewardship of historic materials.
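The abstract does not include implementation details, but as a rough illustration of the retrieve-then-generate pattern it describes, the following toy Python sketch chunks pre-extracted page text, ranks chunks with a simple bag-of-words cosine retriever standing in for a real embedding model, and assembles source-attributed context; it is not UVic's actual pipeline:

# Toy illustration of retrieval-augmented generation over web-archive
# text, NOT the pipeline described in the paper. The LLM call is left
# out; chunking and source attribution are the parts being illustrated.
import math
from collections import Counter

def chunk(text: str, source: str, size: int = 50) -> list[dict]:
    """Split extracted page text into fixed-size word chunks, keeping
    the source URL so answers can cite where each passage came from."""
    words = text.split()
    return [{"source": source, "text": " ".join(words[i:i + size])}
            for i in range(0, len(words), size)]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Rank chunks against the query with bag-of-words cosine similarity."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: cosine(q, Counter(c["text"].lower().split())),
                    reverse=True)
    return ranked[:k]

def answer(query: str, chunks: list[dict]) -> str:
    """Assemble source-attributed context; a real pipeline would pass
    this context to an LLM to generate a grounded answer."""
    hits = retrieve(query, chunks)
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return f"Question: {query}\nGrounded context:\n{context}"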
Biography
Corey Davis is the Digital Preservation Librarian at the University of Victoria, where he develops policies and strategies to ensure the long-term preservation of digital collections. With over 15 years of experience in academic libraries, he works at the intersection of digital preservation, web archives, and AI, advising faculty, researchers, and students while also advancing the Libraries' technological infrastructure. Corey is a founding member and co-chair of the Canadian Web Archiving Coalition (CWAC), founding chair of the Portage Preservation Experts Group, and an active member of the Canadian Association of Research Libraries’ Digital Preservation Working Group. He previously served as the Digital Preservation Network Coordinator for the Council of Prairie and Pacific University Libraries (COPPUL), where he was instrumental in creating a shared digital preservation infrastructure and education network. Corey frequently presents at national conferences and has authored peer-reviewed work such as “Archiving the Web: A Case Study from the University of Victoria” in the Code4Lib Journal. He also served as a Visiting Program Officer for Digital Preservation with CARL, contributing to the development of national strategies to ensure trustworthy, sustainable access to born-digital research collections. Corey holds a Master of Library and Information Studies from the University of British Columbia and a BA in Greek and Roman Studies from UVic.
Mr Naishuai Zhang
Librarian
Peking University
From Preserving Newspapers to Preserving News: An Exploration of AI-Enhanced Preservation of Chinese Historical Newspapers
Summary Abstract
Historical newspapers comprehensively document social, economic, cultural, educational, and daily life, serving as an indispensable medium for historical research and the understanding of societal transformation. This study proposes an AI-based framework for the recognition and structured processing of Chinese historical newspaper articles. The framework encompasses article region segmentation, identification of article titles and body text, and enhanced content representation. Experimental results demonstrate that annotation methods incorporating visual modifiers significantly improve model performance. The YOLOv10-based segmentation model accurately identifies article regions, while the detection model effectively distinguishes article titles from body text. By integrating Optical Character Recognition (OCR) and large language models, the proposed framework enables automatic content structuring and summary generation, substantially enhancing the readability and retrievability of historical newspapers. This work offers strong support for digital preservation, digital humanities research, and intelligent information retrieval.
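As an illustration only, the following Python sketch mirrors the pipeline stages the abstract describes; the three model calls are stubs, since the actual YOLOv10 detectors, OCR engine, and language model are not specified here:

# Structural sketch of the described stages: article segmentation,
# title/body detection with OCR, and LLM summarization. All model
# calls are placeholders, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Article:
    bbox: tuple[int, int, int, int]   # article region on the page
    title: str
    body: str
    summary: str

def segment_articles(page_image) -> list[tuple[int, int, int, int]]:
    """Stub: a YOLOv10 segmentation model would return article boxes."""
    return [(0, 0, 400, 600)]

def detect_title_and_body(page_image, bbox) -> tuple[str, str]:
    """Stub: a detection model plus OCR would split and transcribe the
    title and body text inside one article region."""
    return ("Example title", "Example body text ...")

def summarize(body: str) -> str:
    """Stub: an LLM would generate a structured summary here."""
    return body[:60] + "..."

def process_page(page_image) -> list[Article]:
    """Chain the stages: segment, transcribe, then summarize each article."""
    articles = []
    for bbox in segment_articles(page_image):
        title, body = detect_title_and_body(page_image, bbox)
        articles.append(Article(bbox, title, body, summarize(body)))
    return articles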
Biography
Naishuai Zhang, Librarian at Peking University Library, with research interests in digital preservation, digital libraries, and data mining.
