Researcher Access to Social Media Data
Following the public disclosure of Facebook’s internal research on the adverse impacts of its products and tools on users, the United States Senate Committee on Homeland Security and Governmental Affairs held a hearing on October 26, 2021, to explore harms presented by social media companies. That day, Professor Nate Persily of Stanford University testified before the Committee on “Social Media Platforms and the Amplification of Domestic Extremism and Other Harmful Content.” Included with his testimony was a legislative proposal to give academic researchers greater access to social media company data. That proposal informed a Senate bill by Senators Coons, Portman, and Klobuchar, the Platform Accountability and Transparency Act (PATA), released for review on December 9, 2021.
PATA and other similar proposals (including the EU Digital Services Act) attempt to remedy the massive lack of empirical data needed to substantiate arguments for regulatory interventions to address online harms. Without data, the formation of comprehensive theories about harms is left to existing case law, which is often constrained by outdated legal frameworks written before many modern technologies existed. Without empirical evidence it is also difficult to establish societal goals for the proper use and control of technologies and their corresponding impact on society. When harms have not been fully documented or defined, policymakers are left to regulate assumptions.
The idea that the business practices, architectures, and tools of various technology companies contribute to social harms unique to the internet is not new. That the federal government should mandate the sharing of company data on its users with academic researchers and through an independent federal enforcement agency is novel, however. Although the United States has a societal interest in understanding how various technologies and companies are impacting privacy, competition, innovation, the vulnerable, and national security, among other issues, this note does not discuss those interests. Instead, it highlights examples from previous historical efforts and identifies eight concerns with existing approaches.
Background of Data Sharing, Technology, and the U.S. Government
The earliest government-mandated data collection efforts in the United States occurred with the first census in 1790, not long after the inauguration of President George Washington. Until 1902 the census was conducted outside of any federal agency and was limited to four subject areas: population, agriculture, manufactures, and vital statistics. The first federal bureau of statistics was not the census, however. In response to errors in the 1840 census, and to meet the needs of an expanding economy, in 1845 two reports from a select committee in the House of Representatives called for the establishment of a permanent bureau of commerce and statistics within the Treasury Department. Despite the Congressional support for such an effort, it took twenty years, until 1865, for a Bureau of Statistics to be created within the Treasury Department.
One hundred years later, with the increasing use of computers by both the federal government and major research universities, a number of academics began advocating for greater researcher access to federal census and statistical data. In 1965 a formal proposal for a National Data Center emerged from a three-year study by the Social Science Research Council (SSRC) of the American Economic Association (often referred to as the Ruggles report after its Chair). “The committee noted that 20 federal statistical agencies had over 600 major data sets that were stored on approximately 100 million punchcards and 30,000 computer tapes.” Researchers also highlighted that the way the data was organized and distributed made it extremely difficult to access and use.
The proposal was followed by another report, from the Bureau of the Budget and led by Edgar Dunn, which endorsed many of the ideas in the Ruggles report. Following the Dunn and Ruggles reports, in a 1967 statement to Congress President Lyndon Johnson announced the creation of a task force to evaluate the ways that new information technologies could make the government’s statistics and data more efficient and useful.
The report of the Task Force on the Storage and Access to Government Statistics, otherwise known as the Kaysen Report, argued for the creation of a national data center under the Executive Office of the President, with a “Director of the Federal Statistical System” and two advisory councils: one for the interests of government users and one for the private sector and broader public. Unlike the Dunn and Ruggles reports, the Kaysen report specifically addressed privacy issues.
Congress reacted strongly to the idea of a centralized repository of statistical data and held several hearings in which the concept of “dossier banks” emerged, centering the debate on whether a national data center would be a massive intelligence center or a purely statistical organization.1 The resulting press coverage of a national data center was negative, and concerns about privacy dominated the public conversation and led to the failure of the effort. A national data center was never created, but the ongoing debate about privacy and data banks played a significant role in the eventual development of the Privacy Act of 1974.
Although the research community was unsuccessful in the formation of a national data center, researchers have had access to federal census and statistical data through the United States’ Federal Statistical System since before either the Ruggles or Kaysen reports. The system, coordinated through the Office of Management and Budget (OMB), is a decentralized network of 13 principal statistical agencies, identified by OMB, that sit on the Interagency Council on Statistical Policy. It includes 29 Federal Statistical Research Data Centers (RDCs), which are partnerships between federal agencies and leading scientific research institutions. RDCs are managed and administered by the United States Census Bureau, and they provide researchers with access to survey, census, and administrative data. Major academic research institutions provide the funding to maintain secure research labs for the RDCs, which grant restricted-access use of microdata to approved researchers and research projects.
Other provisions regarding researcher access to data include a 2013 memorandum from the Office of Science and Technology Policy (OSTP), which directed federal agencies with more than $100 million in annual research and development expenditures to develop plans to increase public access to federally funded research. At the time the proposal generated a great deal of discussion, primarily around concerns that all academic research products that use federal data might be made free to everyone. Six years later, a 2019 Government Accountability Office (GAO) report reviewing the progress of the 19 eligible agencies found that several had failed to develop plans due to issues with data access and with mechanisms to ensure researchers complied with public access requirements.
Additionally, the Shelby Amendment or Data Access Act, attached to the Omnibus Appropriations Act for FY1999, mandated the Office of Management and Budget (OMB) to amend Circular A-110 to require federal agencies to ensure that “all data produced under a [federally funded] award will be made available to the public through the procedures established under the Freedom of Information Act [FOIA].”
The key difference between these provisions and the latest proposals in Congress, however, is that in these earlier efforts the federal government was attempting to make the research it funded more accessible to the public who paid for it through taxes, and to make federal data more accessible to and across agencies. The private sector was not involved as an information-sharing partner.
The Cybersecurity Example
Although these examples highlight how the federal government has shared data with academic researchers, the competing interests involved in these cases are distinct from the concerns of information-sharing between the federal government and the private sector, as well as among academia, the private sector, and the government.
The example of cybersecurity data illustrates the complexities of information-sharing between the private sector and federal government, and the challenges involved in balancing private and public interests.
Since 1998, when President Clinton signed Presidential Decision Directive 63 on critical infrastructure protection, information-sharing between industry and the government regarding potential cybersecurity threats has been conducted through industry-specific information sharing and analysis centers (ISACs).2 Because information-sharing between industry and the government was often hampered by legitimate concerns about privacy, civil liability, intellectual property rights, and antitrust issues, on February 13, 2015, President Obama signed Executive Order 13691 “to encourage and promote sharing of cybersecurity threat information within the private sector and between the private sector and government.”
Following the Executive Order, Congress passed two bills in the House, through the Select Committee on Intelligence and the Committee on Homeland Security, and one in the Senate through the Select Committee on Intelligence. After fierce negotiations, the three bills were reconciled into the Cybersecurity Information Sharing Act of 2015. The bill was the product of years of intense debate about how “non-federal entities,” which include the private sector and tribal, state, and local government entities, receive and share cybersecurity threat information. To encourage non-federal entities to share such information, the bill provided protection from liability, non-waiver of privilege, and exemption from certain FOIA disclosures, depending on the entities involved.
Europe’s Digital Services Act
On January 20, 2022, the European Parliament passed the European Digital Services Act by a large majority. The Digital Services Act and the Digital Markets Act both attempt to address the public harms and many of the issues, particularly regarding misinformation and disinformation, presented by large technology companies and our increasing reliance on their services. Article 31 of the Act is very similar to the PATA proposal: it includes a provision to allow “third party researchers” access to platform data to facilitate transparency and oversight. The European Commission’s original proposal included a very limited definition of what constitutes a researcher, which the EU Council’s text then expanded to include researchers as defined in the EU’s 2019 Copyright Directive. Despite the expansion, activists, public interest researchers, and journalists, including Nobel Prize winner Maria Ressa, have urged the European Parliament to expand the definition of a researcher within Article 31 further.
Although there are many ways to analyze the DSA and DMA and their potential role in addressing content moderation and other issues, one key difference between PATA and the DSA is that the EU already has the General Data Protection Regulation (GDPR) in place, while the United States Congress has not passed comprehensive privacy regulation, despite many efforts to do so. The DSA also includes provisions to protect users from dark patterns; the comparable Senate dark patterns bill has not passed.
The PATA proposal and others that seek to make social media companies more transparent and accountable will likely face the same concerns about privacy and information security that arose during the cybersecurity information-sharing debates, the national data center debates, and any public discourse about how data is accessed and handled or how disinformation is addressed. There are also other concerns unique to the types of data that social media companies collect through their relationships with their users, many of which could be addressed if privacy legislation preceded attempts to regulate content moderation.
Issues to Consider
Several current proposals in Congress, like PATA, attempt to address harms created by increasingly powerful technology companies and protect consumers by bolstering the funding and technical capacity behind the Federal Trade Commission’s (FTC) investigative, law enforcement, and rulemaking authority. Some address privacy concerns, and others create full offices within the FTC for civil liberties and privacy protection. These are not bad ideas: the FTC needs greater technical capacity to fulfill its duties to stop unfair, deceptive, or fraudulent practices, particularly in the digital marketplace. Increased transparency and accountability are important, as are considerations of market distortions. Mandating and ensuring that social media companies comply with the FTC’s 6(b) investigative authority is different, however, from requiring social media companies to hand over user data to researchers before several other critical elements are addressed.
There are at least eight areas for Congress to continue exploring when crafting legislation to provide researcher access to data:
1. As in the DSA debates, the definition of a “researcher” is a source of contention. Currently there is no uniform or regulated definition or standard for what constitutes a “researcher,” even one affiliated with a major research institution. There are also no uniform and regulated standards for Institutional Review Boards (IRBs) at major institutions, and there are varying perspectives on what constitutes “human subject” research. One researcher noted that images of people used for deepfake research did not constitute human subject research according to the IRB at their major research institution, but another researcher said it absolutely would qualify at theirs. Restrictions on researchers for privacy and national security reasons have been an issue since at least the Second World War. Although social science researchers have limited access to census data in the U.S., the type of data that would be collected from social media companies presents different challenges than microdata about specific census-related information. On some social media platforms, users make up a significant part of the product.

2. The Fifth Amendment to the United States Constitution states that private property shall not be taken for public use without just compensation. The United States does not currently have a doctrine of “eminent domain for data,” even personal data, but provisions that require companies to share that type of data in the public interest could potentially lead toward one. (See the work of Professor Usha Ramanathan, who argues that the doctrine of eminent domain in India applies to data rights and ownership as embodied in the Land Acquisition Act of 1894.) Would these requirements be considered similar to the United States Government “condemning” personal user data for public purposes? And since “public purpose” for eminent domain of property has always been vaguely defined, what might it mean for personal data as “property”? This is a complicated element of several legislative proposals, and individual ownership of data is far from settled. Since users are a large part of the product of social media websites, is the data they produce on the platforms the “property” of individuals or of the company? Who, then, is primarily responsible for owning and maintaining the security of that “property”?

3. The privacy issues involved in sharing company data with academic researchers without comprehensive privacy regulations in place are not insignificant. The academic research community’s arguments for why privacy issues need not be an obstacle have historical precedents, but those concerns have never been fully resolved. Efforts to establish a national data center, and later to increase threat-information sharing between the government and the private sector, were both initially scuttled by privacy concerns. Recent proposals could face similar challenges.

4. The national security concerns presented by the collection and aggregation of social media data are also significant. In addition to the lack of international jurisdiction over such data, maintaining the security of behavioral and other revealing data that could be re-identified is a legitimate concern. The Office of Personnel Management (OPM) breach is an example of the vulnerability of aggregated, valuable data to theft by criminal or state actors.

5. Even if the previously mentioned challenges did not exist, the practical and technical challenges to sharing data are significant. Data collection and cleaning are expensive and time consuming, and no legislative solution has been found for the myriad issues presented by interoperability, data residency laws, and related constraints.

6. The federal government, as currently designed, lacks the capacity to adequately address the technical challenges presented by the harms these legislative proposals seek to remedy. The legislative branch also currently lacks the capacity to conduct adequate oversight of all of the issue areas where technology may be involved in harms against the public.

7. Third-party auditing of research practices might mitigate some concerns about privacy and researcher standards, but it is not clear whether GAO has the capacity to conduct the kind of auditing and oversight that would be needed, or whether an adequate framework exists to evaluate the practices of social media researchers.

8. It is unclear how, or whether, Congress might evaluate the impact of these proposals, including the unintended consequences and potentially unwanted incentives they might create. Will companies be disincentivized from considering the public impact of their internal research or business models at all?
Efforts to mitigate harms created by the power of technology companies and our reliance on digital systems are extremely important. But technology policy often has a sequencing problem, and researcher access to data should not precede robust privacy legislation.
Bogan, Leisel. “Congress and Researcher Access to Social Media Data.” February 4, 2022.