Guest Post: The Library Innovation Lab Launches Data.gov Archive Search
This week's guest post is from the Harvard Law School Library Innovation Lab. As October draws to a close, we are grateful to this team for their efforts to update the Data.gov Archive, which now includes a search interface to support the discoverability and usability of federal data.
In February, the Library Innovation Lab announced its archive of the federal data clearinghouse Data.gov, and now we have launched Data.gov Archive Search, an interface for exploring this important collection of government datasets. Our work builds on recent advancements in lightweight, browser-based querying to enable discovery of more than 311,000 datasets comprising some 17.9 terabytes of data on topics ranging from automotive recalls to chronic disease indicators.
Traditionally, supporting discovery features across massive collections has required investment in dedicated computing infrastructure, such as a server running a database or search index. In recent years, innovative tools and methods for client-side querying have opened a new path. With these technologies, users can execute fast queries over large volumes of static data using only a web browser, saving resources for organizations while still providing high-quality access.
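To make the idea of querying static data without a server concrete, here is a minimal sketch in Python. It is not LIL's implementation, and the post does not say which tools the interface uses. The record layout (`RECORD_SIZE`, `KEY_SIZE`), the helper names, and the `read_range` callable are all assumptions made for illustration; in a browser, `read_range` would typically be backed by HTTP range requests against a static file.

```python
from typing import Callable, Optional

# Hypothetical fixed-width layout, chosen only for this illustration.
RECORD_SIZE = 32  # bytes per record
KEY_SIZE = 16     # the first 16 bytes of a record hold the space-padded key


def build_index(entries: dict[str, str]) -> bytes:
    """Pack (key, value) pairs into sorted fixed-width ASCII records."""
    records = []
    for key in sorted(entries):
        key_bytes = key.encode("ascii").ljust(KEY_SIZE)
        value_bytes = entries[key].encode("ascii").ljust(RECORD_SIZE - KEY_SIZE)
        records.append(key_bytes + value_bytes)
    return b"".join(records)


def lookup(
    read_range: Callable[[int, int], bytes],
    total_size: int,
    key: str,
) -> Optional[str]:
    """Binary-search the static index, reading one record per step.

    read_range(start, length) stands in for a ranged read of a static
    file, so the client never has to download the whole index.
    """
    lo, hi = 0, total_size // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        record = read_range(mid * RECORD_SIZE, RECORD_SIZE)
        record_key = record[:KEY_SIZE].decode("ascii").rstrip()
        if record_key == key:
            return record[KEY_SIZE:].decode("ascii").rstrip()
        if record_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

The point of the sketch is that the work moves to the client: the host only has to serve bytes from a static file, and each lookup touches a handful of small ranges rather than the full dataset.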
When LIL began thinking about how to provide discovery for the Data.gov Archive, we decided that building a lightweight and easily maintained access point from the beginning would be worth our team’s effort. We wanted to provide low-effort discovery with minimal impact on our resources. We also wanted to ensure that whatever path we chose would encourage, rather than impede, long-term access. By experimenting with these new client-side technologies, we began to demonstrate that comparable discovery is feasible for large amounts of data.
For hosting, LIL chose Source Cooperative as the repository for its Data.gov archive for a number of reasons. Built on cloud object storage, the repository supports direct publication of massive datasets, making it easy to share the data in its entirety or as discrete objects. Additionally, LIL has used BagIt, the Library of Congress standard for the transfer of digital files. The “BagIt” principles of archiving ensure that each object is digitally signed and retains detailed metadata for authenticity and provenance. Our hope is that these additional steps will make it easier for researchers and the public to cite and access the information they need over time.
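For readers curious what checking a bag looks like in practice, the sketch below verifies payload files against a BagIt checksum manifest using only the Python standard library. It assumes a bag directory containing a `manifest-sha256.txt` file whose lines pair a checksum with a file path, as the BagIt specification describes. The function name `verify_bag_manifest` is hypothetical, and the sketch covers only checksum fixity, not the digital signatures mentioned above or any LIL-specific tooling.

```python
import hashlib
from pathlib import Path


def verify_bag_manifest(bag_dir: str, algorithm: str = "sha256") -> dict[str, bool]:
    """Return {relative_path: True/False} for each entry in the payload manifest.

    Assumes a BagIt-style manifest named manifest-<algorithm>.txt in which
    each non-empty line is "<checksum> <path relative to the bag root>".
    """
    bag_path = Path(bag_dir)
    manifest = bag_path / f"manifest-{algorithm}.txt"
    results: dict[str, bool] = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(maxsplit=1)
        target = bag_path / relative_path
        if not target.is_file():
            # A file listed in the manifest but absent from the bag fails.
            results[relative_path] = False
            continue
        digest = hashlib.new(algorithm)
        with open(target, "rb") as handle:
            # Hash in 1 MiB chunks so large payload files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        results[relative_path] = digest.hexdigest() == expected.lower()
    return results
```

Calling `verify_bag_manifest("path/to/bag")` on a downloaded bag returns a per-file report, which a researcher could use to confirm that the files they cite match what was originally archived.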
In the coming month, we will continue our work, fine-tuning the interface and incorporating feedback. We also continue to explore various modes of access to large government datasets, including how we might create greater access to the 710 TB of Smithsonian collections data we recently copied. Please be in touch with questions or feedback.