Data, Not Documents: Modernizing the Regulatory State

March 2019


“On a back wall in the apple packinghouse, there are 13 clipboards with various logs — first-aid monitoring, pest control, visitor sign-in sheets and more — required for food safety audits. There are about another dozen thick binders and manuals in the farm office for navigating rules and regulations on such things as migrant and seasonal worker protections.” [1]

In December 2017, The New York Times documented how “17 federal regulations with about 5,000 restrictions and rules” imposed a significant burden on an apple orchard in upstate New York. When federal regulators showed up at the farm to check compliance with labor laws, they required “ream[s] of paper” in documentation.

The federal regulatory system—from rule development to compliance to enforcement—still relies on paper in too many areas. Even as regulatory agencies have moved to electronic systems, many agencies still collect simple, static documents throughout the regulatory process, often reverting to printed documents for review or adjudication. Paper, and electronic “paper” (i.e. documents in Microsoft Word or Adobe PDF formats), are still how most regulators make decisions and enforce rules. Americans spend 11.6 billion hours on regulatory paperwork each year; small businesses spend 3.3 billion hours, with an estimated annual paperwork cost of $111 billion. [2]

Catalyzed by the challenges of the HealthCare.gov rollout, the Obama Administration focused on improving government digital services and using technology to make government more user-focused. New teams, including the U.S. Digital Service (USDS) and the U.S. General Services Administration’s 18F, have helped people apply for insurance on HealthCare.gov, check the status of their citizenship and visa applications, and discover nearby public parks, among many other projects. [3] The Trump Administration has continued to support USDS and 18F, and more broadly has made IT modernization a major component of the President’s Management Agenda. [4]

Federal regulatory agencies, in contrast, have not focused as much on improving the digital experiences of regulated parties. As a result, there is a huge opportunity to bring best practices of user-centered design and modern technology to the regulatory state. Among other benefits, modernizing the regulatory state could make it easier for businesses and other regulated entities to understand and comply with existing rules.

One area particularly ripe for modernization is the regulatory data collection process. Collecting data from regulated parties is a major function of many agencies, and existing technological tools and standards make this modernization a low-risk, high-impact way to improve efficiency while minimizing the burden on businesses.

This report starts with a brief primer about the federal regulatory process, including how regulatory agencies collect information from businesses and regulated parties. We then explore potential benefits of data collection modernization—including newer models of regulators harvesting openly published data—and highlight the importance of regulators working with regulated parties to design data collection processes that work for both government and the regulated party.


The Regulatory Lifecycle

Creating a federal regulation is a lengthy process, governed by the Administrative Procedure Act of 1946. The process depends on the responsible agency and the regulation’s goals, but generally involves an agency drafting a proposed rule, seeking public comments on the original draft and subsequent revisions, and then publishing the final rule in the Federal Register, after which it is codified in the Code of Federal Regulations. [5]

While agencies can revise a proposed regulation based on public comments prior to finalizing the regulation, they cannot easily iterate and revise after it has been finalized. But as technology rapidly changes, some regulations can become outdated. Some have suggested that regulatory agencies could adopt “regulatory sandboxing,” and create a relaxed rule for a constrained area, then use lessons from that experiment to inform the final rule. [6]

After finalizing a rule, a regulatory agency typically then focuses on compliance, and if violations are found, enforcement. Compliance procedures vary from voluntary attestation to inspections. Some regulators ask firms affected by a new rule to send evidence of compliance. For example, every year the Consumer Financial Protection Bureau (CFPB) requires banks to send mortgage application data to ensure compliance with non-discrimination rules in home lending. [7]

As part of compliance, regulators may require firms to present documents, either in paper or as an electronic document. The regulator then processes that information to ensure a firm is compliant. This requires substantial time for both parties. Firms must write documents and keep written logs; regulators read the documents or logs to validate compliance with the rule.

Electronic data submission can transform this process for both regulatory agencies and businesses, and there are examples where the private sector is already seeing the benefits. Since 2009, the U.S. Securities and Exchange Commission has required financial institutions to electronically report information like annual earnings. [8] Businesses suggested that this change would prevent duplicating information. [9] Many also argued that this would allow “filers to complete their submissions more accurately and efficiently,” lower costs, increase transparency, and connect data sources to improve decision-making. [10] The resulting data was also of higher quality than the previous paper documents. Independent analysts say that machine-readable data makes their work easier; weeks-long projects can now be completed in hours. [11]

As the SEC example demonstrates, firms can more easily send compliance information to regulators by submitting data, rather than preparing full documents. If applied widely, both government agencies and private-sector firms could improve efficiencies by adopting techniques that are already commonplace in the technology industry.

By collecting structured data, government agencies can also more easily extract and analyze information about firms’ regulatory compliance. Duke School of Law Professor Lawrence Baxter argues that automating this analysis may be the only way government agencies can properly regulate large and dynamic industries like the financial industry. [12] Compliance data, once collected, can be used for predictive analytics. Imagine software that helps regulators anticipate firms that might become non-compliant; regulatory agencies could then develop and target programs to help firms become compliant before formal rule enforcement mechanisms.

Automating compliance analysis raises important questions about the technical capabilities of regulators, and how Congress and the public can provide effective oversight over regulatory agencies. In addition to economists and lawyers, should regulatory agencies also hire software engineers, designers, and data scientists? How should regulatory agencies responsibly and fairly incorporate artificial intelligence and machine learning?

There is clearly a growing societal debate about the role of artificial intelligence and its benefits and risks, and that debate needs to include the use of AI in the regulatory state—but it starts with regulators being able to collect data, not just documents.


Creating a Data Pipeline

A regulator must take two key steps to collect data: 1) create a standard format for reporting data, called a schema or taxonomy, and 2) develop a standard for data submission.


Developing a Schema

Collecting data from a regulated party starts with a regulator defining a data schema. Technically speaking, “a database schema specifies, on the basis of the database administrator's knowledge of possible applications, those facts which can enter the database.” [13] In other words, a schema is simply what information should be collected, and how that information should be represented.

For example, an air pollution regulation may require factories to tell the government the level of carbon emissions they produce. While a government rule usually says what information is necessary, it may not always specify how to submit that information. In the case of carbon emissions, what units should companies use? If a firm doesn’t produce any, should it write “0” or “none”? These differences may seem trivial, but collecting messy or imperfect data makes it difficult for governments to analyze the information they collect.
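To make the ambiguity concrete, here is a minimal Python sketch of the cleanup a regulator might face when a rule fixes neither the unit nor the way to report zero emissions. The unit names, the list of zero spellings, and the sample reports are assumptions made for illustration; they are not drawn from any actual rule.

```python
# Conversion factors to metric tons. The set of accepted units is an assumption.
TO_METRIC_TONS = {
    "metric_tons": 1.0,
    "kilograms": 0.001,
    "short_tons": 0.90718474,
}

# Ways a free-form report might say "nothing emitted" (illustrative only).
ZERO_SPELLINGS = {"none", "nil", "zero"}


def normalize_emissions(raw_value: str, unit: str) -> float:
    """Convert a free-form emissions entry to metric tons."""
    text = raw_value.strip().lower()
    if text in ZERO_SPELLINGS:
        return 0.0
    if unit not in TO_METRIC_TONS:
        raise ValueError(f"unknown unit: {unit!r}")
    # Strip thousands separators such as "1,200" before parsing the number.
    return float(text.replace(",", "")) * TO_METRIC_TONS[unit]


reports = [("1,200", "kilograms"), ("none", "metric_tons"), ("3.5", "metric_tons")]
print([normalize_emissions(value, unit) for value, unit in reports])
```

Every branch in this function exists only because the format was left open. A schema that fixes one unit and one representation of zero would make most of it unnecessary.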

With clearly defined standards for submitting data, government agencies can help regulated entities by creating data validation tools which can tell regulated entities if they are following the schema. Regulated entities can use these tools to ensure the correct format before they submit data, instead of submitting incorrectly-formatted data and going through the process again. This reduces the total time it takes companies to submit their data.
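A validation tool of this kind can be quite small. The sketch below is one possible shape, not any agency’s actual tool: it checks a record against a hypothetical emissions schema and returns a list of problems that a filer could fix before submitting. All field names and allowed values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """One field in a hypothetical reporting schema."""
    name: str
    expected_type: type
    allowed_values: tuple = ()  # empty tuple means any value of the right type


# Assumed example schema for an emissions report; not taken from a real rule.
EMISSIONS_SCHEMA = (
    FieldRule("facility_id", str),
    FieldRule("reporting_year", int),
    FieldRule("co2_metric_tons", float),
    FieldRule("unit", str, allowed_values=("metric_tons",)),
)


def validate(record: dict, schema=EMISSIONS_SCHEMA) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for rule in schema:
        if rule.name not in record:
            problems.append(f"missing field: {rule.name}")
            continue
        value = record[rule.name]
        # Exact type match, so the string "2018" is not accepted where an int is expected.
        if type(value) is not rule.expected_type:
            problems.append(f"{rule.name}: expected {rule.expected_type.__name__}")
        elif rule.allowed_values and value not in rule.allowed_values:
            problems.append(f"{rule.name}: {value!r} is not an allowed value")
    # Fields the schema does not define are reported as well.
    for extra in sorted(set(record) - {rule.name for rule in schema}):
        problems.append(f"unexpected field: {extra}")
    return problems


draft = {"facility_id": "F-001", "reporting_year": "2018", "co2_metric_tons": 0.0}
print(validate(draft))
```

Returning every problem at once, rather than stopping at the first, lets a filer correct a whole file in a single pass.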

Some government agencies have already embraced the idea of creating schemas when collecting data from firms. For example, the U.S. Centers for Medicare & Medicaid Services (CMS) asks insurance companies to publish data on insurance plans as well as the drugs and health care providers that each plan covers. This lets users search for insurance plans sorted by drugs and healthcare providers they care about.


Health care plans filtered to include specific drugs.


To make it possible for users to search plans for specific doctors or drugs, CMS needs to have specific information for each insurance plan. So, they worked with a software development firm called Ad Hoc to create a detailed schema for insurance companies. [14]



A screenshot of part of CMS’s schema for health insurance companies.

CMS’s schema is valuable because it defines exactly what information is required from the regulated entities. For a given data field, only a certain set of inputs may be accepted. For example, a given prescription drug may have a quantity limit. When CMS asks insurance companies if a given drug has a quantity limit, it only accepts values “true” or “false.” In this example, Ad Hoc also developed data validation tools to help insurance companies submit data in the correct format.
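As a rough illustration of what “only true or false” means in practice, the Python sketch below checks a small JSON file for a strict boolean. The record shape and the field names `drug_name` and `quantity_limit` are assumptions made for this example and should not be read as CMS’s published schema.

```python
import json

# Hypothetical submission; the record shape is an assumption for illustration.
SUBMISSION = """
[
  {"drug_name": "Example Drug A", "quantity_limit": true},
  {"drug_name": "Example Drug B", "quantity_limit": "yes"},
  {"drug_name": "Example Drug C", "quantity_limit": 1}
]
"""


def quantity_limit_errors(json_text: str) -> list[str]:
    """List every entry whose quantity_limit is not a JSON true or false."""
    errors = []
    for index, entry in enumerate(json.loads(json_text)):
        value = entry.get("quantity_limit")
        # json.loads maps JSON true/false to Python bool; strings and numbers stay as they are.
        if not isinstance(value, bool):
            errors.append(f"entry {index} ({entry.get('drug_name')}): got {value!r}")
    return errors


for message in quantity_limit_errors(SUBMISSION):
    print(message)
```

Values like "yes" or 1 are exactly the kind of near-miss a person reading a document would accept without noticing, but that breaks automated filtering downstream.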


Submitting Data to Regulators

After developing a schema, the next step is creating the data pipeline that lets companies submit their data to the government.

There are three different methods to submit data to a regulatory agency:

  • A dedicated web portal: A person in a regulated entity manually uploads a data file to a dedicated web portal that a regulatory agency maintains.
  • System-to-system communication: Regulated entities install software that automatically sends required data to the regulator’s data system.
  • Distributed publishing: In this newer model, regulated entities publish data at specified URLs, and the government regularly views or collects that data.


Web Portal

One of the most common data submission methods is a web portal, where people at regulated entities log in and upload data at a specific regulator website. This method is often used for data required at regular intervals, such as annual disclosures. It is also used widely for data that must be submitted at different points in a complex process, such as pharmaceutical drug development.

The benefits of using web portals for regular data submission can be seen through the Consumer Financial Protection Bureau (CFPB). Under the 1975 Home Mortgage Disclosure Act (HMDA), financial institutions must report data about all of their mortgage applications. [15] To collect this data, the CFPB created a web portal for submission as well as data validation tools to assist with formatting. [16]

Vendors who assist financial institutions in collecting mortgage data have praised the portal’s ease of use. Tim Kline, vice president and software development engineer at vendor Marquis, says that the interface is simple and intuitive. [17] He also noted that using modern web technology allows the CFPB to update the platform without requiring vendors to reinstall software.

Web portals work well when companies only need to report data on specific occasions. For example, the FDA requires pharmaceutical companies to submit clinical trial data for all New Drug Applications (NDAs). [18] Previously, the FDA required multiple physical copies of all this data. Now, companies upload this information through the FDA’s online web portal. In addition to being more environmentally friendly, the portal makes data submission easier and less costly for the pharmaceutical companies. The data also takes up less physical space, cutting costs for the FDA, and is more secure.


System-to-System Communication

Instead of a person submitting data to a web portal, businesses can use software to send data to a regulator’s data collection system using system-to-system communication. Older generations of IT systems used formats like Electronic Data Interchange (EDI); newer Internet-based systems use modern Application Programming Interfaces (APIs). Regardless of the interface, software, rather than a person, submits the data to the regulator. A human usually still has to direct the software to send the data or approve the submission before it goes out. For example, Instem’s Submit is a platform that lets research organizations submit data for government applications like the FDA’s New Drug Applications. [19] EDGARsuite is another software product that lets companies submit electronic data directly to the SEC. [20] There is an entire industry of software companies, which some have dubbed “reg-tech,” that helps firms organize and submit data to regulators.
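The sketch below shows, in Python, what an API-based submission might look like from the filer’s side, including the human approval step described above. The endpoint URL, the token, the payload layout, and the `approved_by` field are all hypothetical; a real regulator would publish its own interface.

```python
import json
import urllib.request

# Hypothetical endpoint; a real regulator would document its own.
SUBMISSION_URL = "https://regulator.example/api/v1/filings"


def build_submission(record: dict, api_token: str, approved_by: str) -> urllib.request.Request:
    """Package one record as an HTTP POST, noting who approved the filing."""
    if not approved_by:
        raise ValueError("a person must approve the filing before it is sent")
    body = json.dumps({"approved_by": approved_by, "record": record}).encode("utf-8")
    return urllib.request.Request(
        SUBMISSION_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )


request = build_submission({"filer_id": "ABC-123", "period": "2018"}, "demo-token", "j.doe")
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

Keeping request construction separate from sending makes it easy to log or review exactly what will be filed before anything leaves the firm’s systems.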

In Australia and the Netherlands, system-to-system communication has allowed different government agencies to combine schemas and share common information (e.g., company names, identification numbers). [21] This has dramatically reduced the amount of information companies need to submit. In Australia, the number of data fields a business needed to submit fell from 35,000 to 7,000, an 80 percent decrease. This reduction saved $400 million in business-to-business and business-to-government interactions.
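The mechanism behind that reduction can be sketched in a few lines of Python: when agencies share a schema, a field that appears on several forms is reported once rather than once per form. The three agencies and their field names below are invented for illustration; only the 35,000 and 7,000 figures come from the text above.

```python
# Hypothetical forms from three agencies; the field names are invented.
FORMS = {
    "tax_agency": {"company_name", "company_id", "address", "annual_revenue"},
    "statistics_bureau": {"company_name", "company_id", "address", "employee_count"},
    "securities_regulator": {"company_name", "company_id", "annual_revenue", "net_income"},
}


def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from before to after."""
    return 100 * (before - after) / before


# Each form filled in on its own: shared fields are counted once per form.
separate_total = sum(len(fields) for fields in FORMS.values())
# One combined schema: each distinct field is reported a single time.
shared_total = len(set().union(*FORMS.values()))

print(separate_total, shared_total, percent_reduction(separate_total, shared_total))
# The same calculation applied to the Australian figures cited above.
print(percent_reduction(35_000, 7_000))
```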

It’s even possible to imagine scenarios where regulators have direct visibility into business systems, decreasing the need for formal data submission. For example, the Netherlands allows government agencies (like their Education Executive Agency) to automatically retrieve data they need directly from regulated entities. If a business is using specific software to log a certain type of transaction, that software could automatically send relevant data to the government. [22]

However, giving regulators direct access to company systems also poses a real risk of government overreach. For example, a regulator might see alarming data and impose penalties when the regulated entity could have solved the problem given some time. One way to mitigate this risk would be to only allow access to data at certain time intervals. We suspect that the broader political concerns about regulatory overreach mean that direct visibility into business systems isn’t likely to be tried in the U.S. anytime soon.


Distributed Publishing

In a newer model of regulatory data collection, regulated entities can openly publish data online on their own websites, for regulators to collect. CMS pioneered this method while implementing the Affordable Care Act to collect information from insurance companies. CMS requires insurance companies to post files with data about their insurance plans as well as covered health care providers and drugs. CMS teams regularly access these files from each insurance company to ensure that HealthCare.gov has the latest information for consumers looking for health insurance. [23]

Distributed publishing allows regulated entities to update published data only when changes are made, while allowing regulators to access the most up-to-date information at any time. This makes this method well suited for data that does not change frequently. It also enables the public and accountability groups to see the same information as regulators, at the same time. However, because data is published online, this method should not be used to report confidential business or personally sensitive data.
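On the regulator’s side, harvesting published data amounts to visiting each registered URL on a schedule and noticing what changed. The Python sketch below shows one way to do that with content hashes. It is an assumption-laden illustration rather than a description of how CMS does it: the URLs are placeholders, and a dictionary stands in for the HTTP client so the example runs offline.

```python
import hashlib
from typing import Callable, Iterable


def harvest(urls: Iterable[str], fetch: Callable[[str], bytes], known_digests: dict) -> list:
    """Fetch each published file and return the URLs whose content changed."""
    changed = []
    for url in urls:
        digest = hashlib.sha256(fetch(url)).hexdigest()
        # A new or different digest means the file was added or updated since the last pass.
        if known_digests.get(url) != digest:
            known_digests[url] = digest
            changed.append(url)
    return changed


# Stand-in for files that insurers publish on their own sites (placeholder URLs).
published = {
    "https://insurer-a.example/plans.json": b'{"plans": 3}',
    "https://insurer-b.example/plans.json": b'{"plans": 5}',
}
digests = {}

# First pass: every file is new to the regulator.
print(harvest(published, published.__getitem__, digests))
# Insurer B updates its file; the next pass picks up only that change.
published["https://insurer-b.example/plans.json"] = b'{"plans": 6}'
print(harvest(published, published.__getitem__, digests))
```

Because unchanged files are skipped, the regulated entity does nothing between updates, which matches the point above that this model suits data that changes infrequently.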

The three data submission approaches outlined above can also be used in combination. For example, government agencies can develop APIs for internal use, while also using those APIs for both a web portal and distributed publishing programs. A combined approach can be useful because it brings the best of simplicity (human-readable instructions and data validation tools on a web portal) and automation (system-to-system communication and distributed publishing models for automatic and infrequent data submissions).

CMS uses this combined approach for the aforementioned example of collecting data from health insurance companies; in addition to requiring health insurance companies to publish data on their websites, it also manages a website with additional information and resources for insurance companies. [24] Up until April 2018, it also supported an API that offered data submission confirmation. [25]



While moving from documents to data has many potential benefits, there are also implementation risks to consider. For example, the state of California tried to replicate CMS’s tool to let consumers filter health care plans. [26] However, California regulators required insurance companies to submit much more information—three times the amount that CMS requires—and the companies struggled to comply. If California regulators had worked more closely with the insurance companies, it’s possible the state of California would have been able to adapt or reuse an existing schema and create an insurance shopping tool that worked for their residents.

Government agencies also need to make ongoing changes to their data collection systems to reflect the realities of regulated businesses. Tim Kline at HMDA vendor Marquis admires the CFPB’s work on its HMDA platform in part because the CFPB readily adapted to changes requested by regulatory technology vendors like Marquis. [27] When Kline reported an issue he had with the platform, the CFPB fixed the issue in less than five hours. This responsiveness builds trust between regulators and regulated entities, ensuring that businesses can more easily submit data using new tools.

Moving away from document-based compliance regimes also risks disadvantaging smaller and less sophisticated firms that are content to keep information on paper. Larger firms can more easily afford people and software to automate, submit, and publish data in whatever format a regulator requires. Smaller firms, including non-profits and single-person businesses, may not have the resources to adapt. We would suggest that regulators use human-centered designers to research and understand the perspective of a small firm trying to comply with the rules, and then build data submission mechanisms that are as simple and easy as possible, frequently testing them with real people. Regulators ought to also consider phased approaches, starting with larger firms.



The apple orchard in upstate New York required 13 clipboards and about a dozen thick binders and manuals to understand and comply with government regulations. What if it could use software that automatically sends relevant financial, labor, and environmental data to the relevant government regulators?

Currently, the regulatory state is still too document-based and requires lots of back-and-forth communication between regulatory agencies and regulated entities. By leveraging structured data, regulators can get higher-quality information faster, and analyze it more easily. Companies could submit data once, instead of multiple times.

Federal regulatory agencies can lead this move to structured data collection by collaborating with the firms they regulate to create well-defined schemas and data validation tools. They can also upgrade their compliance systems to ingest structured data, including via modern techniques like APIs and harvesting openly published data. As more regulatory agencies modernize, they can collaborate with each other, even across jurisdictions, so that if they regulate the same party, they can appropriately share data and further reduce regulatory burden.

Modernizing data collection is part of a larger opportunity to modernize the regulatory state. Regulatory sandboxing, greater collaboration among regulators and regulated parties, and making it easier for people to understand regulations, in addition to much-needed IT modernization, are opportunities to bring regulatory agencies into the modern age. [28] [29]



[1] Eder, “When Picking Apples on a Farm With 5,000 Rules, Watch Out for the Ladders.”

[2] “In The News | Small Business Committee - Democrats.”

[3] Deahl, “Reimagining the Immigration Process”; Barnes, “How to Design a Government Site for Kids.”

[4] “President’s Management Agenda.”

[5] “How Laws Are Made | USAGov.”

[6] “Sandboxing And Smart Regulation In An Age of A/B Testing.”

[7] Consumer Financial Protection Bureau, “About HMDA.”

[8] “Final Rule: Interactive Data to Improve Financial Reporting.”

[9] “Comments on SEC Proposed Rule - 33-8496: XBRL Voluntary Financial Reporting Program on the EDGAR System”; Wicklund, “Comments of G. Wicklund on S7-35-04.”

[10] PricewaterhouseCoopers LLP to Katz, “PricewaterhouseCoopers to Secretary Jonathan Katz”; Thornton to Katz, “Grant Thornton to Secretary Jonathan Katz.”

[11] Castagno, interview.

[12] Baxter, “Adaptive Financial Regulation and RegTech.”

[13] Imielinski and Lipski, “A Systematic Approach to Relational Database Theory.”

[14] Smith and Gershman, QHP-Provider-Formulary-APIs.

[15] Consumer Financial Protection Bureau, “The Home Mortgage Disclosure Act.”

[16] HMDA Platform Introduction; Consumer Financial Protection Bureau, “HMDA File Format Verification Tool.”

[17] Kline, interview.

[18] Minish et al., interview.

[19] “Instem - SubmitTM.”

[20] “EDGARsuite SEC Filing Software | XBRL US.”

[21] Hollister et al., “Standard Business Reporting: Open Data to Cut Compliance Costs.”

[22] Hollister et al.

[23] For example, CMS can access the healthcare providers covered in insurance plans at the URL

[24] “The Quality Payment Program.”

[25] “QPP Submissions API Developer Documentation.”

[26] Gershman, interview.

[27] Kline, interview.

[28] Thottungal, interview.

[29] CFPB and 18F’s eRegulations are examples of solutions to digitize regulations and make them easier to understand, for people participating in the regulatory process.



Thank you to everyone who agreed to be interviewed for this report: Hudson Hollister, Erie Meyer, Justin Herman, David Bray, Robin Thottungal, Paul Smith, Greg Gershman, Mike Willis, John Turner, Mohini Singh, Tim Kline, Melissa Kozicki, Juliet Minish, Gretchen Trout, Gina Funderberk, and Todd Castagno. Additionally, thank you to everyone who provided edits, including: Shannon Sartin, Ben McGuire, and Giuseppe Morgana.


About the Authors
Alisha Ukani is a Harvard undergraduate (class of 2020) studying Computer Science, with a focus in systems engineering. Outside of classes, she serves on the City of Cambridge's Open Data Review Board, and conducts computer science research on virtualization and browser security. She has previously served as the Director of the 501(c)(3) nonprofit organization HackHarvard, and as Senior Tech Director of the Harvard Political Review. She is an incoming software engineering intern at Google and previously completed a software engineering internship at Slack.

Nick Sinai is Adjunct Lecturer in Public Policy at the Harvard Kennedy School. Nick is a faculty affiliate of Digital HKS and the Shorenstein Center, and was previously a Walter Shorenstein Media and Democracy Fellow, where he researched and wrote about data as a public good. He teaches the field class Tech and Innovation in Government, where students learn design, product management, and public-sector entrepreneurship. Nick served as U.S. Deputy Chief Technology Officer in the Obama White House, and as Director of Energy and Environment at the Federal Communications Commission.

For more information on this publication: Belfer Communications Office
For Academic Citation: Ukani, Alisha and Nick Sinai. “Data, Not Documents: Modernizing the Regulatory State.” Paper, March 2019.
