SSN Study - FAQ

 

Predicting Social Security Numbers from Public Data [pdf] [html] [appendix] [Commentary by William E. Winkler]


Alessandro Acquisti (Heinz College, Carnegie Mellon University)

Ralph Gross (Heinz College, Carnegie Mellon University)


Proceedings of the National Academy of Sciences, July 7, 2009.


To be presented at BlackHat Las Vegas, July 29, 2009



We gratefully acknowledge research support from the National Science Foundation under Grant 0713361, from the U.S. Army Research Office under Contract DAAD190210389, from Carnegie Mellon CyLab and Berkman Fund, and from the Pittsburgh Supercomputing Center.


We also would like to thank Jimin Lee, Ihn Aee Choi, Dhruv Deepan Mohindra, and, in particular, Ioanis Alexander Biternas Wischnienski for outstanding research assistance.



This is a draft document. We will keep adding Q&As as we receive or read relevant questions about the study in comments and emails. Please bear with us as we add content and work towards a final, clean version of this FAQ. Thank you!



Additional information can be found on our research blog. Belorussian translation provided by PC.


Index


General Questions:


  1. What is this research about?

  2. Why does this matter? Why is such predictability a problem?

  3. What are the implications of your results?

  4. Why are you publishing these results?

  5. What steps have you taken before publishing the results?

  6. How can your results be used to address the problem of identity theft? Do you have practical recommendations?


Technical Questions:

  1. What exactly does it mean that SSNs are "predictable"?

  2. How do such predictions work?

  3. How did you verify your predictions?

  4. If the algorithm only produces windows of values likely to include the correct SSN, why is this a concern?

  5. Have you "broken" some secret code? Doesn't the Social Security Administration publicly disclose information about the assignment scheme?

  6. Isn't this old news? Everybody knows that Area Numbers are associated with states (etc.)

  7. Can the predictability of SSNs lead to identity theft? Does this research publication instruct identity thieves on how to acquire SSNs?

  8. How does this differ from previous research?

  9. What data do you need to predict SSNs? Isn't birth data hard to come by?

  10. From which social networking site did you find data for one of your tests?

  11. Aren't SSNs in fact as available as birth data?

  12. Can you accurately predict *every* SSN?

  13. How many actual SSNs can be predicted?

  14. I posted my date of birth online. Has my SSN been "broken"?

  15. Aren't data breaches a larger problem?

  16. Isn't it cheaper to just pay a data broker to acquire SSN data?

  17. Isn't it the case that SSNs, alone, are not sufficient to impersonate a person? Banks and other services ask for additional information (such as your mother's maiden name, your pet's name, and so forth).

  18. If SSNs were no longer used for authentication, what else could we use?

  19. Who funded your research?

  20. Were the tests IRB approved?

Executive summary


Social Security numbers were created under the Social Security Act of 1935 as identifiers for accounts tracking individual earnings. However, over time, they started being used as sensitive authentication devices, becoming one of the pieces of information most often sought by identity thieves: knowledge of a person's name, SSN, and date of birth is often sufficient to impersonate that individual and obtain access to a variety of services, leading to so-called identity theft. Current public policy in the area of identity theft suggests that SSNs should be kept confidential: consumers are urged to protect their SSNs. However, we show that it is possible to predict individual SSNs simply from publicly available data. Based on observation of issuance patterns in the "Death Master File" (a public database that contains SSNs of people who have died), we were able to use information about an individual's date and state of birth to predict narrow ranges of values likely to contain that individual's SSN. The predictions are particularly accurate for the SSNs of people who were born after 1988 (when the SSA initiated the Enumeration at Birth program, through which babies receive SSNs soon after birth) and in less populous states. Since SSNs are predictable from public data, identity theft could occur even without events such as data breaches.
Some of the implications are that 1) the SSA should randomize the entire SSN assignment process; 2) current policy initiatives in the area of SSN and identity theft should be reconsidered: most policy-making currently focuses on removing SSNs from databases or redacting their digits, so that they can still be used as "confidential information" - however, since SSNs are predictable from otherwise publicly available data, SSNs cannot be kept confidential even if they are removed from databases, and therefore those initiatives may be ineffective; 3) since SSNs can be predicted and are therefore, in a sense, semi-public information, consumers should not be required by private sector entities to use SSNs as passwords or for authentication.

General questions


Q. What is this research about?


We studied the assignment scheme of Social Security numbers (SSNs) and discovered that individual SSNs can be predicted entirely from public data. Specifically, we found that it is possible to combine information from government sources with simple demographic data (such as an individual's state and date of birth, widely available from commercial databases, voter registration lists, or online social networks) to predict narrow ranges of values wherein individual SSNs are likely to fall.



Q. Why does this research matter? Why is predictability of SSNs a problem?


SSNs are supposed to be confidential information - the predictability of SSNs increases the risk of large-scale identity theft.

SSNs were originally designed in the 1930s to be used as identifiers of accounts tracking individual earnings. However, over time, they started being used for "authentication" in a variety of private sector services - that is, to verify identity and determine whether someone is who he/she is claiming to be. Hence, they came to be considered sensitive information. The inherent tension between using the same number as the identifier of an account (which may be shared with other parties) and as a "password" (which is supposed to be private and confidential) has contributed to the rise of identity theft. In the US, knowledge of someone's name, date of birth, and SSN is often sufficient to impersonate that person for financial, medical, or other types of fraud. Hence, if SSNs can be predicted from public data, the risk of identity theft increases.



Q. What are the implications of your results?


First: SSNs, in their current form, are highly insecure passwords and should not be used for authentication. If one can successfully identify all nine digits of an SSN in fewer than 10, 100, or even 1,000 attempts, that Social Security number is no more secure than a three-digit PIN. Both government agencies (including the SSA and the FTC) and researchers (e.g., [LoPucki, 2003], [Samuelson, 2007], [Solove, 2003]) have warned against the use of SSNs for authentication. Unfortunately, SSNs are still used (and abused) everywhere in the private sector to authenticate identities, which leads to widespread crimes of identity theft.


Second: Current legislative and policy initiatives in the area of identity theft prevention, which focus on removing SSNs from public exposure or redacting their first five digits, are well-meaning but may be misguided, because even redacted or removed SSNs remain predictable from otherwise publicly available data.


Third: More broadly, our findings highlight the unexpected consequences of the interaction of multiple data sources in modern information economies. They show how non-sensitive personal data (such as information people reveal about themselves online) can be combined with other data sources, also non-sensitive, leading to the inference of much more sensitive information.
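The PIN comparison in the first implication above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `equivalent_pin_digits` is hypothetical, and the only assumption is that a secret which can be found within N attempts protects no better than a PIN with N possible values.

```python
import math


def equivalent_pin_digits(attempts: int) -> int:
    """Digits of the shortest decimal PIN with at least `attempts` possible values.

    Hypothetical helper for illustration: a secret that can be found within
    `attempts` guesses is treated as no stronger than such a PIN.
    """
    if attempts < 2:
        raise ValueError("attempts must be at least 2")
    # The largest value of a k-digit PIN space is 10**k - 1, so count the
    # digits of attempts - 1 (999 -> 3 digits covers 1,000 values).
    return len(str(attempts - 1))


for attempts in (10, 100, 1_000, 10**9):
    digits = equivalent_pin_digits(attempts)
    bits = math.log2(attempts)  # same search space expressed in bits
    print(f"{attempts:>13,} attempts ~ {digits}-digit PIN (~{bits:.1f} bits)")
```

A nine-digit number guessed blindly corresponds to the last row (a 9-digit PIN), whereas a number found within 1,000 attempts corresponds to a 3-digit PIN, which is the comparison made in the text.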



Q. Why are you publishing these results?


SSNs are very insecure passwords. However, notwithstanding warnings by numerous government agencies (including the SSA and the FTC), they are often used in the private sector both as identifiers and for authentication - this causes costs and damages of billions of dollars every year to businesses and consumers. Our intention is to show that, in their current form, SSNs are compromised as passwords; to alert not just policy-makers, but also businesses and consumers of the threats to individual identities deriving from the use (and abuse) of SSNs as means of authentication; and to contribute to the debate on more efficient, secure, and privacy-preserving means of verifying identities in our information society.

Identity theft is so widespread in the US because Social Security numbers are incongruously used by businesses both as identifiers and as passwords -- something they were never designed to be [Smith, 2002]. This is a practice that the Samuelson Clinic at UC Berkeley has defined as "irresponsible" [Samuelson, 2007] and that legal scholar Daniel Solove has referred to as an "architecture of vulnerability" [Solove, 2003]. In the US, the overall costs of identity theft in 2007 were estimated at $49.3 billion [Johannes, 2006]. As Chris Hoofnagle noted [Hoofnagle, 2007], those costs are borne by all parties, but particularly by consumers, either directly (lost time, inconvenience, and out-of-pocket costs) or indirectly (through higher fees paid for credit services, or as taxpayers, when financial institutions write off identity theft losses when computing their corporate income taxes). Furthermore, additional costs are incurred every year even in the absence of fraud, because of attempts to defend, and exploit, the system [idanalytics, 2005] -- consider, for instance, the investments that companies and individuals are required to make in order to protect sensitive data. By showing that SSNs are predictable from public data, and therefore inadequate as passwords, we hope to help stop the costs associated with their use as means of verifying identities, and redirect attention towards the progress of research on secure, privacy-preserving authentication methods - from 2-factor authentication to digital certificates.



Q. What steps have you taken before publishing the results?


Among other things, we have omitted sensitive details about the prediction strategy from the published article, and we have shared our results with government agencies prior to publication.



Q. How can your results be used to address the problem of identity theft? Do you have practical recommendations?


The findings suggest a number of considerations and possible strategies for public and private sector entities, as well as for individuals.


Government agencies

The assignment scheme of SSNs could be changed to incorporate true randomness. This would eliminate the risk of predictability for newly assigned SSNs - however, it will not do much to protect the hundreds of millions of SSNs already assigned. It may also make us complacent about preserving the current -- and insecure -- system where SSNs are incongruously used by private sector entities both as public identifiers and private passwords - a role that SSNs were never meant to fulfill when they were designed in the 1930s. Government agencies (and policy-makers) may instead consider incentivizing private sector entities to abandon the use (and penalizing the abuse) of SSNs as means of authentication, and may encourage academic and industry research on the application of more efficient, secure, and privacy-preserving means of electronic authentication - such as 2-factor authentication and digital certificates.


Policy-makers

Current policy initiatives in the area of SSN protection and identity theft prevention may need to be reevaluated [LoPucki, 2003]. Many current initiatives in this area (see [GAO, 2008], [FTC, 2008]), as well as the 2007 President's Identity Theft Task Force's recommendations, are well-meaning; however, they focus on removing SSNs from public exposure (or redacting their first five digits), in order to preserve SSNs' role as sensitive numbers and means of authentication [The President’s Identity Theft Task Force, 2007]. Our results, instead, suggest that approaches solely focused on removing or redacting SSNs may be ineffective, or misguided: assigned SSNs cannot be revoked to avoid future fraud, exposed data cannot be taken back, and the first 5 digits of an SSN are, in fact, the easiest to infer. This leaves even redacted or truncated SSNs still predictable and, therefore, still vulnerable.


Credit Reporting Agencies, financial, and other institutions

CRAs and financial institutions should stop using SSNs for authentication (that is, as proof of identity), and strengthen their identity matching strategies and authentication techniques. Reports from the FTC [FTC, 2004] and academia [Hoofnagle, 2007] have highlighted how credit applications with incorrect names or even incorrect SSN digits are routinely accepted as valid (because credit reports are known to contain errors and inaccuracies). Such practices leave open "holes" in the identity verification infrastructure that fraudsters can and do exploit.

In fact, both Credit Reporting Agencies and initiatives such as E-Verify and SSNVS should pay heightened attention to attempts at identity crimes that rely on "tumbling." Tumbling is a documented cyber-criminal practice that consists of slightly changing numerical details, such as addresses and known SSNs, across multiple fraudulent account applications [idanalytics, 2005].


Online services

Online services which post or allow members to post demographic information (from online people search services to online social networks) should consider strategies (from choosing appropriate defaults to setting adequate security policies) that balance, as much as possible, the need for free data flows and exchanges with protection against abuses of those data. They should pay particular attention to the fact that even innocuous data can be combined with other sources to produce more sensitive information.


Consumers

By realizing the potential use of public documents as "breeder" documents of more sensitive data, we, as consumers, can make better informed decisions, weighing the benefits of online information sharing against its potential costs. However, the problem our paper highlights goes well beyond users' control - it is a systemic problem due to the exploitation of SSNs for goals (authentication) they were never designed to fulfill. Hence, the emphasis on asking consumers to "protect" their SSNs [SSA, 2007] may be misplaced, if even well-meaning consumers' SSNs may be compromised because of information other entities have revealed about them. In other words, our results indicate that the problem of SSN security goes far beyond consumers' responsibility and control: it has to do with the use (and abuse) of SSNs in the private sector for purposes (such as authentication) they were never designed to fulfill. As consumers, we have very little control over that. At the end of the day, this is a systemic problem that industry, policy-makers, and of course researchers must resolve.



Technical questions



Q. What exactly does it mean that SSNs are "predictable"?


It means that information about an individual's state and date of birth can be sufficient to statistically infer narrow ranges of values wherein that individual's SSN is likely to fall.

"Can," because this is true (in general, and simplifying things a bit) only for individuals who received their SSN around the time of their birth (by 2005, at least 92 percent of SSNs assigned to US citizens were assigned at birth [SSA, 2006]; the percentage of individuals receiving their SSNs around the time of their birth started increasing dramatically in the late 1980s as a result of the Enumeration at Birth initiative).

"Ranges of values" means that the predictions are based on statistical inferences: in general, the first 5 digits can be predicted with a very high degree of accuracy with a single attempt - especially for individuals born after 1988 and in less populous states. In some cases, we were able to predict all 9 digits of individual SSNs at the very first attempt. More often, the predictions produce windows of values that are likely to include the actual 9 digits. These windows can be very large (and, therefore, inaccurate) for certain years and states (for instance, for individuals born in California in 1973), but can get very narrow (and therefore more concerning, in terms of identity theft risks) for smaller states and recent years (for instance, 1 out of 20 SSNs of individuals born in DE in 1996 in our dataset could be identified with just 10 or fewer attempts per SSN).


Q. How do your SSN predictions work?


Our predictions are based on the fact that SSNs are assigned according to a complex yet regular - and therefore predictable - pattern. The prediction works based on the interpolation of an individual's date and state of birth with SSN issuance patterns derived from the so-called "Death Master File", a publicly available file reporting SSNs, names, dates of birth and death, and states of SSN application for individuals whose deaths have been reported to the SSA (also popularly known as SSDI or SSN Death Index).  Part of the process is described in the PNAS paper. Certain details have been omitted from publication.



Q. How did you verify your predictions?


We ran two tests. In the first test, we plotted the SSNs of Death Master File (DMF) records versus time for data between 1973 and 2003. We observed statistical patterns that appeared in the DMF data; then, we used these patterns to predict the SSNs of DMF records. In a second test, we interpolated demographic data extracted from students' profiles on an online social network, with patterns extracted from the DMF, and used it to predict the profile owners' SSNs. We verified the accuracy of our predictions against the individuals' actual SSNs using a secure, IRB-approved, anonymized protocol which only produced aggregate statistics, without revealing to us the actual SSN of any individual in particular.



Q. If the algorithm only produces windows of values likely to include the correct SSN, why is this a concern?


Because various public- and private-sector online services may be attacked to test (using brute-force verifications) subsets of variations predicted by the algorithm.

Statistical predictions of windows of possible SSNs do not imply, alone, that an exact SSN will be found. However, when the range of values wherein an SSN is likely to fall gets dramatically reduced, a number of "brute force" attacks which would otherwise be inefficient or infeasible become feasible. When one or two attempts are sufficient to identify a large proportion of issued SSNs' first five digits, an attacker has incentives to invest resources into harvesting the remaining four from public documents or commercial services. When fewer than 10, 100, or 1,000 attempts are sufficient to identify complete SSNs for massive numbers of targets, attackers can exploit various public- and private-sector online services (such as online "instant" credit approval sites, as discussed in the paper) to test subsets of variations predicted by the algorithm in order to verify which SSN corresponds to an individual with a given birth date.



Q. Have you "broken" some secret code? Doesn't the Social Security Administration publicly disclose information about the assignment scheme?


No, we have not broken a secret code, and yes, the assignment scheme is publicly available. The SSN assignment scheme was created in the 1930s and was not designed to be "secure": back then, it was not imagined that one day SSNs would start being used for authentication. The assignment scheme is complex, and that complexity has led to the belief that the assignment, from the perspective of the user, is effectively random (see "SSNs are assigned randomly by computer within the confines of the area numbers allocated to a particular state based on data keyed to the Modernized Enumeration System" [SSA, 2001]). Indeed, we only used publicly available information, and ended up discovering, based on that information, that the randomness is effectively so low that the entire 9 digits of an SSN can be predicted with a limited number of attempts. We also discovered that certain interpretations of the assignment scheme held outside the SSA were, in fact, incorrect.



Q. Isn't this old news? Everybody knows that Area Numbers are associated with states (etc.)


Yes, the SSN assignment scheme is well known, and yes, the existence of a link between Area Numbers and states is public knowledge - but the patterns we discovered (and the accuracy of the predictions based on them) are not.

As noted in the manuscript, the SSN assignment scheme is public knowledge (p. 1). In fact, previous work in this area used those patterns to estimate when and where an SSN may have been issued (p. 1 and [Wessmiller, 2002], [Sweeney, 2004], [EPIC, 2008]): that is, starting from a *known* SSN, and trying to infer the state and the range of years when it may have been issued. Instead, our work focused on the inverse, harder, and much more consequential inference: exploiting the presumptive exact date and location of SSN issuance to estimate, quite reliably, SSNs. This became possible because:


- We discovered (p. 3) that the interpretation held *outside* the SSA about how Area Numbers are assigned was incorrect: contrary to a commonly held view about their assignment, the same AN is used for 9,999 consecutively assigned SSNs (under the interpretation of the assignment scheme held outside the SSA, the SSA was believed to rotate through all of a state's ANs for each assigned SN. Such a scheme would render the AN random for states with multiple ANs, and the predictions we present in this article dramatically less accurate).


- We discovered (p. 4) that the assignment of the last 4 digits is not only sequential (as indeed stated in the publicly available information about the assignment scheme), but in fact highly correlated with the applicant's date of birth, and therefore not random (note that the SSA states, instead, that "SSNs are assigned randomly by computer within the confines of the area numbers allocated to a particular state" [SSA, 2001]). In various cases, we were able to predict all 9 digits of an SSN at the first attempt (the odds of that happening by random guess are roughly 1 in 1 billion). This is particularly the case for SSNs assigned after the onset of the EAB (1987 onwards).


- We discovered that the analysis of publicly available SSNs assigned to deceased individuals (and included in the DMF) allows the inference of granular assignment patterns that make it possible to predict the SSNs of individuals still alive. For instance, the relationship between Area Numbers and states, while public knowledge, would not be sufficient, alone, to predict Area Numbers except in very specific cases (see p. 1): low-population states (such as WY) and certain U.S. possessions are allocated 1 AN each - implying that knowledge that an individual applied for his/her SSN in that state or possession does indeed provide almost certain knowledge of the first 3 digits of his/her SSN. However, other states are allocated *sets* of ANs. For instance, an individual applying from a ZIP code within the state of New York may be assigned any of 85 possible first 3 SSN digits. Therefore, knowledge that an individual applied for his/her SSN in that state provides low odds (1 in 85) of correctly guessing his/her first 3 digits with a single random guess. Those odds do not even include the probability of also correctly guessing the Group Numbers - which vary from 01 to 99 in combination with the different Area Numbers.


In short, without the discovery of patterns linking SSN digits to demographic data, knowledge of the assignment scheme would not be sufficient to predict either the first 5 digits or the entire 9 digits of an SSN with the degree of accuracy necessary to expose them to practical risks of identification. For instance, the probability of correctly guessing the first 5 digits of the SSN of an individual born in NY in 1998, even assuming knowledge that the SSN was issued within that state, would be 0.012%, and the probability of correctly guessing the entire 9 digits with fewer than 1,000 attempts would be 0.0012%. However, under the more granular understanding of the relationships between assignment scheme and demographic patterns described in the manuscript, those probabilities are 30% and 3% respectively: several orders of magnitude larger, and much more vulnerable to brute-force attacks. See Table 6 on p. 27 of the Supporting Information.
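The baseline figures quoted above (0.012% and 0.0012%) can be reproduced with simple arithmetic. The sketch below is only a check of those numbers under stated assumptions, not a description of the prediction method: it assumes blind, uniform guessing over the 85 New York Area Numbers, Group Numbers 01 to 99, and 9,999 serial numbers per group, and it treats "fewer than 1,000 attempts" as an upper bound of 1,000 distinct guesses. The constant and function names are hypothetical.

```python
# Figures taken from the text; names are illustrative only.
NY_AREA_NUMBERS = 85     # possible first-3-digit values for New York
GROUP_NUMBERS = 99       # Group Numbers 01..99
SERIAL_NUMBERS = 9_999   # serial numbers assigned per Area/Group pair


def baseline_first5_probability(area_numbers: int = NY_AREA_NUMBERS) -> float:
    """Chance that one blind guess matches the first 5 digits."""
    return 1 / (area_numbers * GROUP_NUMBERS)


def baseline_full_probability(attempts: int,
                              area_numbers: int = NY_AREA_NUMBERS) -> float:
    """Chance that `attempts` distinct blind guesses include the full SSN."""
    candidates = area_numbers * GROUP_NUMBERS * SERIAL_NUMBERS
    return min(1.0, attempts / candidates)


p_first5 = baseline_first5_probability()
p_full9 = baseline_full_probability(1_000)

print(f"first 5 digits, one guess:   {p_first5:.3%}")   # 0.012%
print(f"all 9 digits, 1,000 guesses: {p_full9:.4%}")    # 0.0012%

# Reported accuracies (30% and 3%) relative to the blind-guess baseline.
print(f"improvement factors: {0.30 / p_first5:,.0f}x and {0.03 / p_full9:,.0f}x")
```

Under these assumptions, both reported accuracies are roughly 2,500 times the blind-guess baseline.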



Q. Can the predictability of SSNs lead to identity theft? Does this research publication provide all that is needed to acquire SSNs?


No. Aside from the fact that sensitive details were omitted from the article, to move from mere statistical predictions to actual identity theft an attacker needs to exploit holes and weaknesses in the U.S. identity "infrastructure": the widespread availability of personal, demographic data for millions of individuals, the existence of large botnets of compromised computers, and the lax identity matching and authentication techniques adopted in the credit/financial sectors (among others). Our findings can help combat and decrease identity theft by showing why such known (yet underestimated) weaknesses in our identity infrastructure should finally be addressed; by alerting industry and policy-makers to a new exploit; and by highlighting the need to abandon SSNs as passwords and move toward more secure, efficient, and privacy-preserving means of authenticating identities.



Q. How does this differ from previous research?


Previous research in the area of SSNs focused on detecting SSNs in public databases, using SSNs to link data across multiple data sources, or - in the cases closest to our study - inferring the year[s] and state of issuance of known SSNs. Per se, the existence of SSN issuance patterns is well known - the SSA makes certain details available through public materials, and others (notably, Latanya Sweeney and her "SSN Watch") have used those patterns, plus a combination of public and private SSN data, to estimate when and where a *known* SSN may have been issued [Wessmiller, 2002], [Sweeney, 2004], [EPIC, 2008]. However, our work focuses on the inverse, harder, and much more consequential inference: it shows that it is possible to exploit the presumptive time and location of SSN issuance to estimate, quite reliably, *unknown* SSNs.



Q. What data do you need to predict SSNs? Isn't birth data hard to come by?


Data about SSNs from the so-called "Death Master File," which is publicly available, and demographic data (dates of birth and states of birth) from wherever it is available. Large amounts of birth data for US residents can be obtained or inferred - often for free, or at negligible per-unit prices - from multiple sources, including commercial data brokers (such as www.peoplefinders.com, which sells access to birth data and personal addresses for "almost every adult in the United States"); voter registration lists (for most states); free online people searches (such as www.zabasearch.com); as well as social networking sites: our estimates indicate that at least 10 million US residents make their birthday information publicly available or inferable on their online profiles.



Q. From which social networking site did you find data for one of your tests?


There is no specific networking site which is uniquely exposed. The data can be extracted from several such sites, as well as other sources, as noted above.



Q. Aren't SSNs in fact as available as birth data?


They are not.

It is true that SSNs are widely available. They have been found in public records of federal agencies, states, counties, courts, hospitals, and so forth [The President’s Identity Theft Task Force, 2007], as well as in personal documents, such as online resumes [Sweeney, 2006]. Companies exchange SSNs in personal information markets, and individuals obtain "credit reports," containing their SSNs, from credit bureaus; stolen SSNs are lucratively exchanged in underground cybermarkets [Franklin, 2007]. However, the GAO found that only a few brokers offering SSNs for sale to the general public are actually able to sell whole SSNs [GAO, 2006]. Furthermore, the GAO also found that while still widespread, SSNs are becoming harder to find in public documents [GAO, 2008]. In fact, the number of SSNs widely available may also be decreasing because of numerous legislative initiatives in this area. Various recent initiatives have been focusing on removing SSNs from public exposure or redacting their first five digits [NCSL, 2007], [FTC, 2008], and [GAO, 2008]. On the other hand, birth data remains widely available, as noted above.




Q. Can you accurately predict *every* SSN?


No.

Every SSN is issued under the same basic assignment scheme (and the scheme, while complex, contains observable regularities). Hence, in theory, any SSN may be predicted. However, the probability that a given SSN can be effectively predicted ranges from very low (or zero) to very high, depending on factors such as the year and state in which the SSN was applied for, how close to the individual's birth date it was applied for, and so forth. For the tests we ran, our predictions were several orders of magnitude more accurate than random chance over the 1973 through 1988 period; however, dramatic and widespread increases in accuracy were especially observable for individuals born after 1988 (the onset of the nationwide EAB program), particularly in less-populous states.



Q. How many actual SSNs can be predicted?


There is no single number that can answer that question. The number is a function of many parameters and probabilistic inferences, including - as noted above - the availability of birth data, the accuracy of prediction across different states and years, the availability of tools to verify the system, and so forth. We present some possible extrapolations in the paper, but we stress that they must be weighed and considered under the caveats also presented there.



Q. I posted my date of birth online. Has my SSN been "broken"?


No.

That knowledge is not sufficient to "compromise" an SSN without the possibility of attempting to find the right number among possible variations - that is, attackers still need to succeed in exploiting other systems to compromise one's identity. Again, statistical predictions of windows of possible SSNs do not imply, alone, identity theft. The likelihood that probabilistic inferences can translate to actual SSN identification is a function of several parameters. Inaccurate or unavailable birth information, or the attacker's inability to complete repeated attempts, will reduce the accuracy of the predictions and the number of individuals' SSNs under actual threat compared to the DMF estimations we present in the paper.



Q. Aren't data breaches a larger problem?


Not necessarily - although this is an apples vs. oranges kind of comparison.


First: not all data breaches involve SSNs. Estimates based on attrition.org data at the time of writing indicate that the average breach involves 140k SSN records. However, that average (as well as most of the largest breaches that involved SSNs) includes accidental data losses that may not have resulted in actual information exposure, such as the 26.5M US veterans' records stored in a laptop stolen during a burglary in 2006.


Second, and more importantly: unlike data breaches, which are local threats (that is, specific to the records contained within a certain database, however large that may be), the predictability we observed is, in principle, universal, in that it applies, theoretically (and with different degrees of accuracy, depending on the factors highlighted above), to any current or future SSN - unless the assignment scheme is modified.


Third: companies can invest to protect their databases, and compromised credit cards can be blocked and renewed. However, unlike traditional passwords, SSNs cannot be blacklisted after failed attempts, nor changed to avoid future fraud [SSA, 2009].


Fourth: data breaches can be discovered, and the owners of compromised accounts can be notified of the breach. Predicting SSNs is more akin to a "stealth" way of compromising an identity, and could be harder to detect.


Hence, the predictability of SSNs is an issue that should be faced with different tools than the ones used to prevent and deal with data breaches.



Q. Isn't it cheaper to just pay a data broker to acquire SSN data?


Unfortunately (or, perhaps, fortunately), no.


In the grey market (that is, excluding the market where certified and vetted companies trade personal data), it is becoming increasingly difficult [GAO, 2006], and prohibitively expensive, to obtain SSNs: according to [Krim, 2005], SSNs are sold in the grey markets for prices around $35 to $45. According to [idanalytics, 2006], stolen identities in the US can be traded in the black market for $30 to $50 per identity. However, estimates of the value of SSNs in underground markets vary greatly, given the relative illiquidity of these markets [McCarty, 2003], [Thomas, 2006], and some estimates are significantly smaller than $30, with credit cards ranging from $0.10 to $25 and full identities (comprising SSNs) ranging from $0.90 to $25 [Herley, 2009].


On the other hand, the birth data necessary for the predictions is much cheaper, and the availability of botnets of compromised computers may make harvesting credentials on a large scale quite easy (although estimates vary, controlling 10,000 IPs for an entire day could cost as little as $1000 [Lesk, 2007]).
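The comparison in this answer comes down to simple per-unit arithmetic. The sketch below uses only figures quoted here (the $35 to $45 grey-market price from [Krim, 2005] and the $1000 for 10,000 IPs per day estimate from [Lesk, 2007]); the function and variable names are ours, and since an IP-day and an SSN are not the same kind of unit, the ratio is a rough sense of scale rather than a cost model.

```python
def cost_per_unit(total_cost: float, units: int) -> float:
    """Return the average cost of one unit, given a total cost for `units` units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Figures quoted in the answer above.
BOTNET_COST_PER_DAY_USD = 1000.0       # [Lesk, 2007], "as little as"
BOTNET_IPS = 10_000                    # [Lesk, 2007]
GREY_MARKET_SSN_PRICE_USD = (35.0, 45.0)  # [Krim, 2005], low and high

# Average cost of controlling a single IP for one day.
cost_per_ip_day = cost_per_unit(BOTNET_COST_PER_DAY_USD, BOTNET_IPS)

# How many IP-days the price of one grey-market SSN would buy.
ip_days_per_ssn_low = GREY_MARKET_SSN_PRICE_USD[0] / cost_per_ip_day
ip_days_per_ssn_high = GREY_MARKET_SSN_PRICE_USD[1] / cost_per_ip_day

print(f"Cost per IP per day: ${cost_per_ip_day:.2f}")
print(f"One grey-market SSN costs as much as roughly "
      f"{ip_days_per_ssn_low:.0f} to {ip_days_per_ssn_high:.0f} IP-days")
```

On these figures, a single IP costs about ten cents per day, so the price of one grey-market SSN corresponds to a few hundred IP-days of botnet capacity.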



Q. Isn't it the case that SSNs, alone, are not sufficient to impersonate a person? Banks and other services ask for additional information (such as your mother's maiden name, your pet's name, and so forth).


In the US, knowledge of someone's name, date of birth, and SSN is sometimes sufficient to impersonate that person in a variety of situations.

We need to distinguish between current account fraud (somebody tries to access a bank account you already created) and new account fraud (somebody tries to create a new credit card under your name). In "current account" fraud, the attacker, to gain access to an account already created and owned by the individual, indeed needs not just the victim's name, date of birth, and SSN, but (most often) also additional passwords or personal information. In "new account" fraud, the attacker more likely needs only the victim's name, date of birth, and SSN to create a new account in the victim's name. Therefore, new account fraud can be perpetrated even without knowledge of the victim's phone number, mother's maiden name, or other pieces of personal information. Mounting empirical evidence suggests, in fact, that providing an SSN and a matching date of birth is sufficient to create new fraudulent accounts [Cook, 2005], [Hoofnagle, 2007], [Consumers Union, 2007], even when the name associated with that SSN did not match, or the address was wrong, or even - as noted above - some of the submitted SSN digits were wrong.


Besides, adding more questions to authenticate a person to an account is hardly good security if the answers to those questions can still be inferred or compromised.



Q. If SSNs were no longer used for authentication, what else could we use?


Simply asking more personal questions (such as your mother's maiden name, your pet's name, or your high school) cannot work, since that information can also be compromised, stolen or - in this age of self-revelation - inferred from various sources. However, plenty of research has focused on systems which protect sensitive data while allowing exchanges of information: work on two-factor authentication, digital certificates, and privacy-preserving identity management systems. While there is neither a foolproof system nor a panacea (as Bruce Schneier noted, "Proposed fixes tend to concentrate on [...] making personal data harder to steal--whereas the real problem is [...] preventing and detecting fraudulent transactions" [Schneier, 2007]), research in this area has made significant progress in recent years, and we hope that the debate will focus on systems which combine privacy with the necessary and efficient flow of information.



Q. Who funded your research?


The National Science Foundation (under Grant 0713361) and the U.S. Army Research Office (under Contract DAAD190210389, through Carnegie Mellon's CyLab). We also received support from the Carnegie Mellon Berkman Fund and from the Pittsburgh Supercomputing Center.



Q. Were the tests IRB approved?


Yes, they were approved. No SSNs were harmed during the writing of this paper.