In 1970, Philadelphia’s Franklin Institute got a Moog synthesizer—a giant, unwieldy computer that created music. The sound was weird and quite evidently electronic. My fellow high school science geeks and I found it endlessly fascinating. But we couldn’t imagine how wonderfully diverse and inventive electronic music would become. It’s grown into a mainstay of pop music and enabled a host of new musical genres.
Now, the Tax Policy Center (TPC) has built a sort-of tax Moog with generous support from Arnold Ventures. Our synthesizer creates data instead of music. Working with our data analytics colleagues at the Urban Institute and the tax data guardians at the Statistics of Income (SOI) Division of the IRS, we developed a new synthetic supplemental public use file (SSPUF), a statistical representation of data on about 26,000 lower-income individuals who do not file income tax returns. A new TPC paper describes how we create the data and assesses how well the synthesis preserves information while protecting taxpayer privacy.
We built this tool to complement SOI’s public use tax return dataset. The SSPUF allows researchers, for the first time, to get the data on those nonfilers. It gives a more complete picture of the distribution of income and how potential reforms could affect households at all income levels.
While SOI compiles nonfiler data from information returns (W-2 forms filed by employers, Social Security records, 1099 forms filed by financial institutions, and so on), it has never made those data public.
The SOI has, however, produced an annual public use file (PUF) of anonymous individual income tax returns. But even anonymous administrative data risks being uncloaked by miscreants who use their own powerful computers to match some tax information with the enormous amount of public information available in other datasets. Once they find a match, hackers could identify individuals and link them to their confidential tax data.
SOI and other government statistical offices have long recognized this threat. They alter the tax return data in numerous ways to protect against disclosure, such as suppressing some data and adding random noise to easily identifiable records. But data stewards are concerned that as computers grow more powerful and more non-tax data become available, even aggressive measures might not fully protect against a determined attacker. Increasingly aggressive disclosure-proofing also risks altering the data so much that they become much less valuable for analysis.
That’s where data synthesis comes in. By drawing random records from a statistical distribution designed to mimic the distribution of actual records, it can produce results similar to those derived from the actual administrative data without making those data public.
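To make the idea concrete, here is a minimal sketch of synthesis in Python. It is an illustration only, not the method described in the TPC paper: it assumes a single wage variable, stands in made-up numbers for the confidential records, and assumes a lognormal distribution fits them. The function names `fit_lognormal` and `synthesize` are hypothetical.

```python
import math
import random
import statistics

def fit_lognormal(values):
    """Estimate lognormal parameters from positive values (assumed model)."""
    logs = [math.log(v) for v in values]
    return statistics.mean(logs), statistics.stdev(logs)

def synthesize(values, n, seed=0):
    """Draw n synthetic records from the fitted distribution.

    Only the fitted parameters are used to generate records;
    no confidential record is copied into the output.
    """
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(values)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

# Toy stand-in for confidential wage records (made-up numbers).
toy_rng = random.Random(42)
confidential = [toy_rng.lognormvariate(9.5, 0.6) for _ in range(5000)]

synthetic = synthesize(confidential, n=len(confidential), seed=1)

# Summary statistics from the two files should be close.
print("confidential median:", round(statistics.median(confidential)))
print("synthetic median:   ", round(statistics.median(synthetic)))
```

A real synthesis has to handle many variables at once and the relationships among them, which is where most of the difficulty lies.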
There are two practical challenges with this approach. First, since we start with administrative data, we might inadvertently reveal too much confidential information. A hacker might be able to reverse-engineer the synthesis process to learn about actual data records. Our working paper describes how we protect against such attacks.
The second problem is that the empirical distribution of tax data can be very complex. The SSPUF represents a relatively simple population—nonfilers—but our ultimate goal is to produce a synthetic PUF representing the entire tax filing public.
Matching the distribution without revealing too much information inevitably blurs important details about the underlying data. We think the synthetic data will be valuable for measuring the distribution of income and tax liability for a specific year, much as TPC does with its tax model. However, at this stage of development the synthetic data probably won’t work for more complex statistical analyses. For example, modern research attempts to measure taxpayers’ sensitivity to tax rate changes by studying how people lower their income to avoid jumps in marginal tax rates. The synthesis process tends to smooth out that income bunching, so an analysis run on the synthetic data would not show any response to tax rate changes.
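A toy example in Python shows how bunching can vanish. Everything here is assumed for illustration: the $40,000 threshold, the $500 window, the made-up incomes, and the simple lognormal fit standing in for a real synthesis model.

```python
import math
import random
import statistics

KINK = 40_000   # hypothetical income where the marginal rate jumps
WINDOW = 500    # count records within $500 below the kink

def share_just_below(incomes, kink=KINK, window=WINDOW):
    """Share of records that sit in the window just below the kink."""
    hits = sum(1 for x in incomes if kink - window <= x < kink)
    return hits / len(incomes)

rng = random.Random(7)

# Made-up "actual" data: a smooth income distribution plus a
# cluster of 1,000 people who report income just under the kink.
smooth = [rng.lognormvariate(10.5, 0.5) for _ in range(9000)]
bunched = [KINK - rng.uniform(0, WINDOW) for _ in range(1000)]
actual = smooth + bunched

# Fit a smooth distribution to the actual data and draw from it.
logs = [math.log(x) for x in actual]
mu, sigma = statistics.mean(logs), statistics.stdev(logs)
synthetic = [rng.lognormvariate(mu, sigma) for _ in range(len(actual))]

print(f"actual share just below kink:    {share_just_below(actual):.3f}")
print(f"synthetic share just below kink: {share_just_below(synthetic):.3f}")
```

The smooth fit has no way to represent the spike, so the cluster below the threshold is spread across neighboring incomes in the synthetic file.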
That is where the concept of a validation server comes in. The second stage in our project is to develop a secure way to run statistical programs on the actual administrative data while altering the results just enough to preserve privacy. Researchers will be able to develop, test, and debug their programs on the synthetic PUF and then submit them via a secure portal to receive sanitized versions of the statistical results that are based on the actual data. The estimates will be statistically valid but somewhat less precise than estimates based on the actual data because of the privacy-preserving alterations. Best of all, the researchers never touch the actual sensitive data.
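One way a validation server could alter results is to add calibrated random noise before releasing them. The Python sketch below uses Laplace noise, a common technique from the differential privacy literature. That choice is an assumption made for illustration, as are the function names, the income bounds, and the `epsilon` setting. It does not describe the design we will ultimately adopt.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sanitized_mean(values, lower, upper, epsilon, rng):
    """Return a noisy mean; the caller never sees the exact value.

    Values are clipped to [lower, upper] so that changing one record
    can move the mean by at most (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# Made-up stand-in for confidential wages held on the server.
rng = random.Random(2024)
confidential_wages = [rng.uniform(0, 30_000) for _ in range(2_000)]
true_mean = sum(confidential_wages) / len(confidential_wages)

# The researcher submits a request and receives only the noisy result.
released = sanitized_mean(confidential_wages, 0, 30_000, epsilon=0.5, rng=rng)
print(f"released estimate: {released:,.0f}")
```

A smaller `epsilon` means more noise and stronger protection, which is the trade-off between precision and privacy described above.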
Today, we are at the 1970 Moog synthesizer stage of data synthesis. The SSPUF is a modern wonder, but it loses a lot of nuance and detail that is in the original confidential data. Like synthetic music, synthetic data will get better as our techniques improve and computing power increases.
We are in the very early stages of developing a secure and reliable validation server, but the combination of synthetic data and validation servers eventually could produce much better evidence to guide public policy. Teenagers may never shake to the techno beat of our synthetic tax data, but we nerdy old researchers are dancing a little jig at the prospect of finally having access to the gold standard of tax research databases.