Data infrastructure isn’t just technical

Photo by Abraham Barrera / Unsplash

The New York Times recently published a story detailing the mishandling and misuse of genetic data from over 20,000 U.S. children. Scientists falsified access requests to the NIH, pretending to be located in the United States and obscuring their true aim: to use the data to make false race-based claims. It is chilling and devastating – but perhaps not surprising. While this misuse occurred in 2024, there is heightened concern that it could happen again, since “the Government Accountability Office, a federal watchdog, reported last April that the N.I.H. did not have the resources to properly monitor all the downloads of genetic data and ‘may be missing violations that go unreported by researchers.’”

The infrastructure necessary to manage and protect data is complex, requiring a combination of technical systems, human labor, and intellectual policies and procedures. As the federal data ecosystem faces repeated threats, both to the content of the data and to the infrastructure underpinning it, critical gaps emerge, leaving space for data loss and data misuse. Without the personnel to review and monitor data access requests and data use, this information remains under threat. As our colleagues in PEDP recently wrote, “Datasets do not maintain themselves.”

In DRP’s work, our efforts focus on capturing data and advocating for robust infrastructure. We remind our colleagues, our families, anyone who will listen: data require a significant amount of technical, human, and intellectual infrastructure to be properly stewarded. In other words, data cannot simply be captured, stored on a shelf, and forgotten. Data need active preservation to remain usable and useful. This necessitates not just storage, but also experts who know the data, as well as policies and procedures to protect them.

Public access to public data is a public good – and restricted, protected access to private data is essential for trust in science. Moments like this remind us that thoughtful data stewardship is required for preventing data misuse and misinterpretation – and that requires an investment in the people who collect, process, publish, and understand the data.