New Data Sources to Improve Federal Statistics

Photo by Stephen Dawson / Unsplash

Today, we have the seventh post in the series to help diverse audiences understand and support the federal statistical system. This post, from Barry Johnson, provides an overview of different sources of federal data.

Check out their first post in the series on the uses of public data from reducing lead exposure in consumer products to improving agricultural productivity.


In our last post, we discussed the origin and evolution of federal statistics and noted that, as survey response rates have declined and user expectations have evolved, agencies are increasingly exploring new data sources. Several experts in the field propose supplementing or replacing surveys with administrative and commercial data. In this post, we examine how far these sources can go toward supplementing or replacing the traditional data used to produce official statistics.

Administrative Data to the Rescue? 

Administrative data are collected and maintained by government agencies or commercial firms to run public programs and provide services. These include data from taxes, Social Security, student loans, veterans’ benefits, and crime records—to name just a few. The potential benefits and challenges of using administrative data for official statistics were first recognized in the 1970s, both to expand coverage or content and for technical uses such as constructing sampling frames.

In the decades since, some agencies have integrated administrative data into their data production processes, most frequently to help measure and mitigate bias due to nonresponse (from those who choose not to participate in a survey), to impute or fill in missing values that respondents fail to provide, or to otherwise supplement data collected through surveys. Recently, administrative data have been moving beyond this supporting role to become the stars of the show; for example, the National Agricultural Statistics Service uses NASA satellite images to improve crop yield estimates.
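
To make the imputation idea concrete, here is a minimal Python sketch of filling in a skipped survey answer from a linked administrative record. Everything in it is an assumption made for illustration: the person IDs, the field names, the wage figures, and the helper `impute_wages` are all hypothetical, and real agency processes involve far more careful record linkage and statistical modeling.

```python
# Hypothetical survey responses keyed by a made-up person ID.
# None marks a question the respondent did not answer.
survey = {
    "p1": {"age": 34, "wages": 52000},
    "p2": {"age": 51, "wages": None},
    "p3": {"age": 29, "wages": None},
}

# Hypothetical administrative wage records for the same IDs.
# Note that p3 has no matching administrative record.
admin_wages = {"p1": 51800, "p2": 67500}


def impute_wages(survey, admin_wages):
    """Return a copy of the survey with missing wages filled from admin records.

    Each row gets a 'wages_source' flag so analysts can tell reported
    values from imputed ones.
    """
    result = {}
    for pid, row in survey.items():
        new_row = dict(row)  # copy so the original survey is left unchanged
        if row["wages"] is not None:
            new_row["wages_source"] = "survey"
        elif pid in admin_wages:
            new_row["wages"] = admin_wages[pid]
            new_row["wages_source"] = "admin"
        else:
            new_row["wages_source"] = "missing"
        result[pid] = new_row
    return result


completed = impute_wages(survey, admin_wages)
```

Keeping a source flag on every value is a deliberate choice in this sketch: it lets later users see which figures were reported and which were borrowed, and it shows that a record with no administrative match (p3 here) stays missing.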

However, administrative data are not perfect replacements for traditional surveys. We describe four points to keep in mind when incorporating administrative data into federal statistical products.

Population Coverage

Survey methodologists carefully design questions and scientifically select respondents to address specific research questions, such as understanding U.S. households’ well-being. Government administrative data, on the other hand, are by law limited to just the information needed from those participating in a specific program. Often, this means administrative data will not fully cover a population of interest. As an example, tax data from the Internal Revenue Service (IRS) represent individuals or legally married couples, not households, and do not include those whose income is below the filing requirement. This means tax data could be a major component of a population count but would need to be combined with other data sources to include all U.S. households.

When administrative data come from state or local governments, varying local policies and available resources may make obtaining complete coverage of the target population especially challenging. For example, not all states participate in the Census Bureau’s Longitudinal Employer-Household Dynamics Program, which provides economic indicators including employment, job creation, earnings, and other measures of employment flows.

Construct Mismatch

Even when administrative data include the right data subjects, they may not gather all the required information. IRS tax returns do not collect demographic data like birth date, gender, or race, nor information on income that is not subject to tax (like some Social Security or other social benefit payments, and interest on tax-exempt investments). In addition, legal and program changes may impact the content of administrative data, thus breaking data series. For example, recent increases in the Standard Deduction greatly reduced tax data on specific deductions, making it impossible to use tax data to track changes in charitable contributions over time.

Privacy Limitations

Administrative data are also subject to agency-specific privacy laws that may prohibit data from being used for statistical purposes or shared across statistical agencies. The Foundations for Evidence-Based Policymaking Act of 2018, commonly referred to as the Evidence Act, made some changes to expand the use of administrative data for official statistics but did not fully address this issue.

Data Quality

Finally, administrative data may have serious data quality challenges, including unintentional errors or intentional misreporting by program participants seeking to fraudulently claim a benefit. Program agencies work to detect fraud, but, in general, data items are not tested with the same rigor as that applied to statistical survey data (which are also subject to undetected errors, both intentional and unintentional). Even high-quality administrative data may not be directly suitable for statistical uses without some editing. For example, occupation descriptions might need to be translated into standard codes. Advances in artificial intelligence (AI) have great promise for mitigating these challenges by automating some formerly manual editing tasks.
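
As a rough illustration of the occupation-coding example, the Python sketch below maps free-text job descriptions to codes. It uses a simple keyword lookup rather than AI, and the keyword table, the codes (which are made up and are not real occupation codes), and the function names are all assumptions chosen for the example.

```python
import re

# Made-up lookup table: keyword -> illustrative code.
# These are NOT real occupation codes from any official system.
KEYWORD_TO_CODE = {
    "nurse": "OCC-100",
    "teacher": "OCC-200",
    "truck driver": "OCC-300",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def code_occupation(description):
    """Return the code for the first keyword found, or None if nothing matches.

    A None result means the description should go to manual review.
    """
    cleaned = normalize(description)
    for keyword, code in KEYWORD_TO_CODE.items():
        # Match whole words only, so "nurse" does not match "nursery".
        if re.search(r"\b" + re.escape(keyword) + r"\b", cleaned):
            return code
    return None


coded = [code_occupation(d) for d in
         ["Registered NURSE (RN)", "Truck-Driver, long haul", "Astronaut"]]
```

Descriptions that match no keyword come back as None and would be routed to a person for review. That manual step is the kind of work the AI methods mentioned above might help reduce.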

Commercial Data to the Rescue?

Data purchased from businesses are another possibility for supplementing official statistics. For example, the Census Bureau uses point-of-sale (scanner) data to assist in producing its retail trade sector statistics. Commercial data have the potential to provide more timely data at lower cost than some traditional sources. However, they are also subject to many of the same challenges as government administrative data. Additional issues arise when businesses are bought or sold, which can affect the content and availability of data. Prices and even availability of commercial data may also vary significantly over time, so that what is affordable one year may break the bank or disappear entirely in another.

Despite Challenges, Administrative and Commercial Data Will Be Foundational to a Modernized Statistical System

Despite these issues, administrative and commercial data have vast potential for improving the timeliness, relevance, accuracy, and affordability of federal statistics in the 21st century. Developing efficient, repeatable methods for mitigating the challenges associated with these data could yield significant benefits for policy makers. Fortunately, federal statistical agencies such as the Census Bureau, Bureau of Labor Statistics, National Agricultural Statistics Service, Statistics of Income Division, and many others are already making good progress.

Policy makers could support additional progress by:

  1. Ensuring that agencies have the resources needed to experiment with alternative data sources and modernize legacy production processes. This must include new uses of AI to address data cleaning, editing, and imputation needed to make administrative data suitable for statistical purposes. 
  2. Reducing remaining data sharing barriers that limit access to promising uses of existing administrative data and developing new privacy-protecting data sharing methods where direct sharing isn’t advisable.
  3. Clarifying roles so that statistical agencies work closely with chief data officers at every level of government, from local to federal, and with those responsible for administering government programs, to ensure that administrative data are suitable for approved statistical uses.
  4. Providing incentives to data holders, including state, local, tribal, and territory governments and commercial businesses, to ensure data quality and availability.
  5. Fundamentally rethinking the role of federal surveys so that they more effectively complement administrative and commercial data by supplying information those sources lack.

Administrative and commercial data sources will play an increasingly important role in ensuring timely, accurate, relevant, and affordable federal statistics. However, only by ensuring that statistical agencies have the resources needed to address the challenges these data sources pose will the public reap the potential benefits.
