Guest Post: GovScape: A Public Search System for 10+ Million Government PDFs
This week's guest post is from Benjamin Charles Germain Lee, Assistant Professor at the University of Washington, and Kyle Deeds, Assistant Professor at Boston University. Learn more about their recent collaboration to create GovScape, a fantastic resource for searching publicly available government documents.
We are excited to share GovScape: https://govscape.net, a public search system for 10+ million government PDFs. GovScape is built upon the End of Term Web Archive (https://eotarchive.org/), an incredible multi-institutional effort to document the federal government’s online presence at the end of each presidential administration going back to 2004. GovScape currently includes all renderable PDFs from the 2020 crawl that are 50 pages or under in length.
GovScape supports three forms of searching over these PDFs:
- keyword search, or exact text search: this is conventional keyword search, matching the exact query text against document text.
- semantic text search, or vectorized natural language text search: with this form of search, you can define more flexible textual queries such as “budgetary data related to the Iraq war” or “rural healthcare for children,” which will return PDF pages ranked based on the relevance of page text to your query – even if the exact string is not present. For this search functionality, we leverage embeddings from the BAAI/bge-base-en-v1.5 model.
- visual search over individual PDF pages: here, you can try queries like “redacted documents,” “pie charts,” or “aerial photography” – PDF pages are returned based on the relevance of their visual features to the query. For this search functionality, we leverage CLIP embeddings generated using the openai/clip-vit-base-patch32 model.
All three of these search methods can be combined with metadata filtering according to domain and crawl date, available by clicking the filter button just to the left of the magnifying glass (which executes the search). The image above provides an example of visual search.
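To make the idea of embedding-based ranking with metadata filters concrete, here is a minimal sketch in Python. It is not GovScape's actual implementation (see the GitHub repository for that); it only illustrates the general technique of filtering pages by domain and crawl date, then ranking the remaining pages by cosine similarity to a query embedding. The `Page` schema, the `search` function, the page IDs, and the tiny three-dimensional vectors are all hypothetical placeholders. In a real system the vectors would come from a model such as BAAI/bge-base-en-v1.5 or openai/clip-vit-base-patch32, and a production system would likely use a vector index rather than the brute-force scan shown here.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """One PDF page with a precomputed embedding (hypothetical schema)."""
    page_id: str
    domain: str
    crawl_date: str  # ISO date string, so string comparison matches date order
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    pages: List[Page],
    query_embedding: List[float],
    domain: Optional[str] = None,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    top_k: int = 10,
) -> List[Tuple[float, Page]]:
    """Apply metadata filters, then rank the remaining pages by similarity."""
    candidates = [
        p for p in pages
        if (domain is None or p.domain == domain)
        and (date_from is None or p.crawl_date >= date_from)
        and (date_to is None or p.crawl_date <= date_to)
    ]
    scored = [(cosine_similarity(query_embedding, p.embedding), p) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:top_k]


# Toy data: hand-written 3-d vectors standing in for real model embeddings.
pages = [
    Page("nasa-report-p3", "nasa.gov", "2020-10-01", [0.9, 0.1, 0.0]),
    Page("cdc-brief-p1", "cdc.gov", "2020-11-15", [0.1, 0.9, 0.1]),
    Page("cdc-chart-p7", "cdc.gov", "2020-12-02", [0.7, 0.6, 0.0]),
]
query = [0.0, 1.0, 0.0]  # placeholder for an embedded text query

for score, page in search(pages, query, domain="cdc.gov"):
    print(f"{page.page_id}: {score:.3f}")
```

The same ranking function serves both semantic text search and visual search in this sketch: only the source of the embeddings changes, since CLIP-style models place text queries and page images in a shared vector space.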
Want to learn more about how GovScape works? We have posted a preprint on arXiv: https://arxiv.org/abs/2511.11010, and our open-source code is available on GitHub (https://github.com/bcglee/govscape). We have also uploaded a tutorial video showing how to use the site: https://www.youtube.com/watch?v=mNda8lVKT1U.
We are already working to incorporate PDFs from the other crawls in the End of Term Web Archive, including the 2024 crawl once it is fully uploaded. We are also working on improving accessibility, adding new functionality, and expanding the metadata. Please do reach out to us at bcgl@uw.edu if you have any questions about GovScape or discover anything of interest – we would love to hear from you! For regular updates on GovScape, you can follow @govscape.bsky.social on Bluesky.
We are extremely grateful to collaborators on GovScape: Ying-Hsiang Huang, Claire Gong, Shreya Shaji, Alison Yan, Leslie Harka, Trevor Owens, Mark Phillips, Shannon Zejiang Shen, and SJ Klein!