SSN Study - FAQ
Predicting Social Security Numbers from Public Data [pdf] [html] [appendix] [Commentary by William E. Winkler]
Alessandro Acquisti (Heinz College, Carnegie Mellon University)
Ralph Gross (Heinz College, Carnegie Mellon University)
Proceedings of the National Academy of Sciences, July 7, 2009.
To be presented at BlackHat Las Vegas, July 29, 2009
We
gratefully acknowledge research support from the National Science
Foundation under Grant 0713361, from the U.S. Army Research Office
under Contract DAAD190210389, from Carnegie Mellon CyLab and Berkman Fund,
and from the Pittsburgh Supercomputing Center.
We also would like to thank Jimin Lee, Ihn Aee Choi, Dhruv Deepan
Mohindra, and, in particular, Ioanis Alexander Biternas Wischnienski for
outstanding research assistance.
This is a draft document. We will keep adding Q&As as we receive or read relevant questions about the study in comments and emails. Please bear with us as we add content and work towards a final, clean version of this FAQ. Thank you!
Additional information can be found on our research blog. Belorussian translation provided by PC.
Index
General Questions:
Technical Questions:
• Isn't this old news? Everybody knows that Area Numbers are associated with states (etc.)
• What data do you need to predict SSN? Isn't birth data hard to come by?
• From which social networking site did you find data for one of your tests?
• I posted my date of birth online. Has my SSN been "broken"?
• Isn't it cheaper to just pay a data broker to acquire SSN data?
• If SSNs were no longer used for authentication, what else could we use?
Executive summary
Social Security numbers were created under the Social Security Act of 1935 as identifiers for accounts tracking individual earnings. However, over time, they started being used as sensitive authentication devices, becoming one of the pieces of information most often sought by identity thieves: knowledge of a person's name, SSN, and date of birth is often a sufficient condition to impersonate that individual and obtain access to a variety of services, leading to so-called identity theft. The current public policy in the area of identity theft suggests that SSNs should be kept confidential: consumers are urged to protect their SSNs. However, we show that it is possible to predict individual SSNs simply from publicly available data. Based on observation of issuance patterns in the "Death Master File" (a public database that contains SSNs of people who have died), we were able to use information about an individual's date and state of birth to predict narrow ranges of values likely to contain that individual's SSN. The predictions are particularly accurate for the SSNs of people who were born after 1988 (when the SSA initiated the Enumeration at Birth program, through which babies receive SSNs soon after birth) and in states with lower population. Since SSNs are predictable from public data, identity theft could occur even without events such as data breaches.
Some of the implications are that 1) the SSA should randomize the entire SSN assignment process; 2) current policy initiatives in the area of SSN and identity theft should be reconsidered: most policy-making currently focuses on removing SSNs from databases or redacting their digits, so that they can still be used as "confidential information" - however, since SSNs are predictable from otherwise publicly available data, SSNs cannot be kept confidential even if they are removed from databases, and therefore those initiatives may be ineffective; 3) since SSNs can be predicted and are therefore, in a sense, semi-public information, consumers should not be required by private sector entities to use SSNs as passwords or for authentication.
General questions
Q. What is this research about?
We
studied the assignment scheme of Social Security numbers (SSNs) and
discovered that individual SSNs can be predicted entirely from public
data. Specifically, we found that it is possible to combine information
from government sources with simple demographic data (such as an
individual's state and date of birth, widely available from commercial
databases, voter registration lists, or online social networks) to
predict narrow ranges of values wherein individual SSNs are likely to
fall.
Q. Why does this research matter? Why is predictability of SSNs a problem?
SSNs are supposed to be confidential information - the predictability of SSNs increases the risk of vast-scale identity theft.
SSNs were originally designed in the 1930s to be used as identifiers of
accounts tracking individual earnings. However, over time, they started
being used for "authentication" in a variety of private sector
services - that is, to verify identity and determine whether someone is
who he/she is claiming to be. Hence, they came to be considered
sensitive information. The inherent tension between using the same
number as the identifier of an account (which may be shared with other
parties) and as a "password" (which is supposed to be private and
confidential) has contributed to the rise of identity theft. In the US,
knowledge of someone's name, date of birth, and SSN is often a sufficient
condition to impersonate that person for financial, medical, or other
types of fraud. Hence, if SSNs can be predicted from public data, the
risk of identity theft increases.
Q. What are the implications of your results?
First:
SSNs, in their current form, are highly insecure passwords and should
not be used for authentication. If one can successfully identify all
nine digits of an SSN in fewer than 10, 100, or even 1,000 attempts,
that Social Security number is no more secure than a three-digit PIN. Both government agencies (including the SSA and the FTC) and researchers (e.g., [LoPucki, 2003], [Samuelson, 2007], [Solove, 2003]) have warned against the use of SSNs for authentication. Unfortunately, SSNs are still used (and abused) everywhere in the private sector to authenticate identities, which leads to widespread crimes of identity theft.
Second:
Current legislative and policy initiatives in the area of identity
theft prevention that focus on removing SSNs from public exposure, or
redacting their first five digits, are well-meaning but may be
misguided, because even redacted or removed SSNs remain predictable
from otherwise publicly available data.
Third:
More broadly, our findings highlight the unexpected consequences of the
interaction of multiple data sources in modern information economies.
They show how non-sensitive personal data (such as information people
reveal about themselves online) can be combined with other data
sources, also non-sensitive, leading to the inference of much more
sensitive information.
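To make the first implication above concrete, here is a minimal arithmetic sketch in Python. It computes the length of the PIN whose search space matches a given number of guessing attempts. The function name is ours and purely illustrative; nothing here is part of the study's method.

```python
def equivalent_pin_digits(max_attempts: int) -> int:
    """Return the smallest number of decimal digits d with 10**d >= max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    digits = 0
    # Integer arithmetic only, to avoid floating-point rounding in logarithms.
    while 10 ** digits < max_attempts:
        digits += 1
    return digits


NOMINAL_SSN_DIGITS = 9  # an SSN is written with nine digits

for attempts in (10, 100, 1_000):
    d = equivalent_pin_digits(attempts)
    print(f"{attempts:>5} attempts: comparable to a {d}-digit PIN "
          f"(nominal SSN length: {NOMINAL_SSN_DIGITS} digits)")
```

In other words, a number that can be found within 1,000 guesses offers the protection of three secret digits, regardless of how many digits it is written with.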
Q. Why are you publishing these results?
SSNs are very insecure passwords. However, notwithstanding warnings by numerous government agencies (including the SSA and the FTC), they are often used in the private sector both as identifiers and for authentication - this causes costs and damages of billions of dollars every year to businesses and consumers. Our intention is to show that, in their current form, SSNs are compromised as passwords; to alert not just policy-makers, but also businesses and consumers of the threats to individual identities deriving from the use
(and abuse) of SSNs as means of authentication; and to contribute to
the debate on more efficient, secure, and privacy-preserving means of
verifying identities in our information society.
Identity
theft is so widespread in the US because Social Security numbers are
incongruously used by businesses both as identifiers and as passwords
-- something they were never designed to be [Smith, 2002]. This is a practice that the Samuelson Clinic at UC Berkeley has defined as "irresponsible" [Samuelson, 2007] and that law scholar Daniel Solove has referred to as an "architecture of vulnerability" [Solove, 2003]. In the US, the overall costs of identity theft in 2007 were estimated at $49.3 billion [Johannes, 2006]. As Chris Hoofnagle noted [Hoofnagle, 2007],
those costs are borne by all parties, but particularly by consumers,
either directly (lost time, inconvenience, and out-of-pocket costs) or
indirectly (through higher fees paid for credit services, or as
taxpayers, when financial institutions write off identity theft losses
when computing their corporate income taxes). Furthermore, additional
costs are incurred every year even in absence of fraud, because of
costs caused by attempts to defend, and exploit, the system [idanalytics, 2005]
-- consider, for instance, the investments that companies and
individuals are required to bear in order to protect sensitive data. By
showing that SSNs are predictable from public data, and therefore
inadequate as passwords, we hope to help stop the costs associated with their use as means of
verifying identities, and redirect
attention towards progress in research on secure,
privacy-preserving authentication methods - from 2-factor
authentication to digital certificates.
Q. What steps have you taken before publishing the results?
Among
other things, we have omitted sensitive details about the prediction
strategy from the published article, and we have shared our results
with government agencies prior to publication.
Q. How can your results be used to address the problem of identity theft? Do you have practical recommendations?
The
findings suggest a number of considerations and possible strategies for public
and private sector entities, as well as for individuals.
Government agencies
The
assignment scheme of SSNs could be changed to incorporate true
randomness. This would eliminate the risk of predictability for newly
assigned SSNs - however, it will not do much to protect the hundreds of
millions of SSNs already assigned. It may also make us complacent about preserving the current -- and insecure -- system where SSNs are incongruously used by private sector entities both as public identifiers and private passwords - a role that SSNs were never meant to fulfill when they were designed in the 1930s. Government agencies (and policy-makers)
may instead consider incentivizing private sector entities to
abandon the use (and penalizing the abuse) of SSNs as means of authentication, and may
encourage academic and industry research on the application of more
efficient, secure, and privacy-preserving means of electronic
authentication - such as 2-factor authentication and digital
certificates.
Policy-makers
Current policy initiatives in the area of SSN protection and identity theft prevention may be reevaluated [LoPucki, 2003].
Many current initiatives in this area (see [GAO, 2008], [FTC, 2008]), as well as the 2007 President's
Identity Theft Task Force's recommendations, are well-meaning; however, they focus on removing SSNs
from public exposure (or redacting their first five digits), in order to preserve SSNs' role as sensitive numbers and means of authentication [The President’s Identity Theft Task Force, 2007].
Our results, instead, suggest that approaches solely focused on removing or redacting SSNs may be ineffective, or misguided: assigned SSNs cannot be revoked to avoid future fraud, exposed
data cannot be taken back, and the first 5 digits of an SSN are,
in fact, the ones that are easier to infer. This leaves even redacted or truncated SSNs
still predictable and, therefore, still vulnerable.
Credit Reporting Agencies, financial, and other institutions
CRAs and financial institutions should stop using SSNs for authentication (that is, as proof of identity), and strengthen their identity matching strategies and authentication techniques. Reports from the FTC [FTC, 2004] and academia [Hoofnagle, 2007]
have highlighted how credit applications with incorrect
names or even incorrect SSN digits are routinely accepted as valid (because credit reports are known to contain errors and inaccuracies).
Such practices leave open "holes" in the identity verification infrastructure that fraudsters can and do exploit.
In fact, both Credit Reporting Agencies and initiatives such as E-Verify and SSNVS
should pay heightened attention to attempted identity
crimes that rely on "tumbling." Tumbling is a documented cyber-criminal practice
that consists of slightly changing numerical details in
fraudulent applications, such as addresses or the digits of
known SSNs, across multiple account applications [idanalytics, 2005].
Online services
Online
services which post or allow members to post demographic information
(from online people search services to online social networks) should
consider strategies (from choosing appropriate defaults to setting adequate security policies) that,
as much as possible, balance the need for free data flows and
exchanges with protection against abuses of those data, paying
particular attention to the fact that even innocuous data can be
combined with other sources to produce more sensitive
information.
Consumers
By
realizing the potential use of public documents as "breeder" documents
for more sensitive data, we, as consumers, can make better informed
decisions, weighing the benefits of online
information sharing against its potential costs. However, the problem our paper
highlights goes well beyond users' control: it is a systemic problem due
to the exploitation of SSNs for goals (authentication) they were never
designed to fulfill. Hence, the emphasis on asking consumers to
"protect" their SSNs [SSA, 2007]
may be misplaced, if even well-meaning consumers' SSNs may be
compromised because of information other entities have revealed about
them. As consumers, we have very little control over how SSNs are used (and abused) in
the private sector. At
the end of the day, this is a systemic problem that industry,
policy-makers, and of course researchers must resolve.
Technical questions
Q. What exactly does it mean that SSNs are "predictable"?
It
means that information about an individual's state and date of birth
can be sufficient to statistically infer narrow ranges of values
wherein that individual's SSN is likely to fall.
"Can,"
because this is true (in general, and simplifying things a bit) only
for individuals who received their SSN around the time of their birth
(by 2005, at least 92 percent of SSNs assigned to US citizens were
assigned at birth [SSA, 2006];
the percentages of individuals receiving their SSNs around the time of
their birth started increasing dramatically in the late 1980s as a
result of the Enumeration at Birth initiative).
"Ranges
of values" means that the predictions are based on statistical
inferences: in general, the first 5 digits can be predicted with a very
high degree of accuracy with a single attempt - especially for
individuals born after 1988 and in less populous states. In some cases,
we were able to predict the whole 9 digits of individual SSNs at the very
first attempt. More often, the predictions produce windows of values
that are likely to include the actual 9 digits. These windows can be
very large (and, therefore, inaccurate) for certain years and states
(for instance, for individuals born in California in 1973), but can get
very narrow (and therefore more concerning, in terms of identity theft
risks) for smaller states and recent years (for instance, 1 out of 20
SSNs of individuals born in DE in 1996 in our dataset could be
identified with just 10 or fewer attempts per SSN).
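For readers unfamiliar with how the nine digits are organized, the following minimal Python sketch splits an SSN into its three standard fields: the Area Number (first 3 digits), the Group Number (next 2), and the Serial Number (last 4). The "first 5 digits" mentioned above are the Area and Group Numbers together. The helper names are hypothetical and the number used is a generic placeholder, not a real SSN; the sketch only illustrates the layout and does not implement any prediction.

```python
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # first 3 digits: Area Number (AN)
    group: str   # middle 2 digits: Group Number (GN)
    serial: str  # last 4 digits: Serial Number (SN)


def split_ssn(ssn: str) -> SSNParts:
    """Split a 9-digit SSN, optionally written as AAA-GG-SSSS, into its fields."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected 9 digits, optionally written as AAA-GG-SSSS")
    return SSNParts(digits[:3], digits[3:5], digits[5:])


# Placeholder value for illustration only.
parts = split_ssn("123-45-6789")
first_five = parts.area + parts.group
print(parts)
print("first five digits:", first_five)
```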
Q. How do your SSN predictions work?
Our
predictions are based on the fact that SSNs are assigned according to a
complex yet regular - and therefore predictable - pattern. The
prediction works based on the interpolation of an individual's date and
state of birth with SSN issuance patterns derived from the so-called "Death Master File",
a publicly available file reporting SSNs, names, dates of birth and
death, and states of SSN application for individuals whose deaths have
been reported to the SSA (also popularly known as SSDI or SSN Death
Index). Part of the process is described in the PNAS paper.
Certain details have been omitted from publication.
Q. How did you verify your predictions?
We
ran two tests. In the first test, we plotted the SSNs of Death Master
File (DMF) records versus time for data between 1973 and 2003. We
observed statistical patterns that appeared in the DMF data; then, we
used these patterns to predict the SSNs of DMF records. In a second
test, we interpolated demographic data extracted from students'
profiles on an online social network, with patterns extracted from the
DMF, and used it to predict the profile owners' SSNs. We verified the
accuracy of our predictions against the individuals' actual SSNs using
a secure, IRB-approved, anonymized protocol which only produced
aggregate statistics, without revealing to us the actual SSN of any
individual in particular.
Q. If the algorithm only produces windows of values likely to include the correct SSN, why is this a concern?
Because
various public- and private-sector online services may be attacked to
test (using brute-force verifications) subsets of variations predicted
by the algorithm.
Statistical
predictions of windows of possible SSNs do not imply, alone, that an
exact SSN will be found. However, when the range of values wherein an
SSN is likely to fall gets dramatically reduced, a number of "brute
force" attacks which would be otherwise inefficient or unfeasible
become possible and feasible. When one or two attempts are sufficient
to identify a large proportion of issued SSNs' first five digits, an
attacker has incentives to invest resources into harvesting the
remaining four from public documents or commercial services. When fewer
than 10, 100, or 1,000 attempts are sufficient to identify complete
SSNs for massive amounts of targets, attackers can exploit various
public- and private-sector online services (such as online "instant"
credit approval sites, as discussed in the paper) to test subsets of variations predicted by the
algorithm in order to verify which SSN corresponds to an individual
with a given birth date.
Q. Have you "broken" some secret code? Doesn't the Social Security Administration publicly disclose information about the assignment scheme?
No, we have not broken a secret code, and yes, the assignment scheme is publicly available.
The SSN assignment scheme was created in the
1930s and was not designed to be "secure": back then, it was not imagined that one day SSNs would start being used for authentication. The assignment scheme is
complex, and that complexity has led to the belief that the assignment,
from the perspective of the user, is effectively random (see "SSNs are
assigned randomly by computer within the confines of the area numbers
allocated to a particular state based on data keyed to the Modernized
Enumeration System" [SSA, 2001]).
Indeed, we only used publicly available information, and
ended up discovering, based on that information, that the randomness is effectively so low that the
entire 9 digits of an SSN can be predicted with a limited number of attempts. We also discovered that certain interpretations of the assignment scheme held outside the SSA were, in fact, incorrect.
Q. Isn't this old news? Everybody knows that Area Numbers are associated with states (etc.)
Yes, the SSN assignment scheme is well known, and yes, the existence of a link between Area Numbers and states is public knowledge - but the patterns we discovered (and the accuracy of the predictions based on them) are not.
As noted in the manuscript, the SSN assignment scheme is public knowledge (p. 1). In fact, previous work in this area used those patterns to estimate when and where a known SSN may have been issued. However:
- We discovered (p. 3) that the interpretation held *outside* the SSA about how Area Numbers are assigned was incorrect: contrary to a commonly held view about their assignment, the same AN is used for 9,999 consecutively assigned SSNs (under the interpretation of the assignment scheme held outside the SSA, the SSA was believed to rotate through all of a state's ANs for each assigned SN. Such a scheme would render the AN random for states with multiple ANs, and the predictions we present in this article dramatically less accurate).
- We discovered (p. 4) that the assignment of the last 4 digits is not only sequential (as indeed stated in the publicly available information about the assignment scheme), but in fact highly correlated with the applicant's date of birth, and therefore not random (note that the SSA states, instead, that "SSNs are assigned randomly by computer within the confines of the area numbers allocated to a particular state" [SSA, 2001]). In various cases, we were able to predict the entire 9 digits of an SSN at the first attempt (the odds of that happening by random guess are roughly 1 in 1 billion). This is particularly the case for SSNs assigned after the onset of the EAB (1987 onwards).
- We discovered that the analysis of publicly available SSNs assigned to deceased individuals (and included in the DMF) allows the inference of granular assignment patterns that make it possible to predict the SSNs of individuals still alive. For instance, the relationship between Area Numbers and states, while public knowledge, would not be sufficient, alone, to predict Area Numbers except in very specific cases (see p. 1): low-population states (such as WY) and certain U.S. possessions are allocated 1 AN each - implying that knowledge that an individual applied for his/her SSN in that state or possession does indeed provide almost certain knowledge of the first 3 digits of his/her SSN. However, other states are allocated *sets* of ANs. For instance, an individual applying from a ZIP code within the state of New York may be assigned any of 85 possible first 3 SSN digits. Therefore, knowledge that an individual applied for his/her SSN in that state provides low odds (1 in 85) of correctly guessing his/her first 3 digits with a single random guess. Those odds do not even include the probability of also correctly guessing the Group Numbers - which vary from 01 to 99 in combination with the different Area Numbers.
In short, without the discovery of patterns linking SSN digits to demographic data, knowledge of the assignment scheme would not be sufficient to predict either the first 5 digits or the entire 9 digits of an SSN with the degree of accuracy necessary to expose them to practical risks of identification. For instance, the probability of correctly guessing the first 5 digits of the SSN of an individual born in NY in 1998, even assuming knowledge that the SSN was issued within that state, would be 0.012%, and the probability of correctly guessing the entire 9 digits with fewer than 1,000 attempts would be 0.0012%. However, under the more granular understanding of the relationships between assignment scheme and demographic patterns described in the manuscript, those probabilities are 30% and 3% respectively: several orders of magnitude larger, and much more vulnerable to brute-force attacks. See Table 6 on p. 27 of the Supporting Information.
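The two baseline percentages for New York can be reproduced with simple arithmetic. The Python sketch below assumes uniform random guessing over 85 Area Numbers and 99 Group Numbers (both figures from the text above) and, as our own assumption, 9,999 possible last-four-digit values (0001 to 9999). The function and constant names are ours. It covers only the random-guess baseline, not the study's prediction method.

```python
AREA_NUMBERS_NY = 85     # possible first-3-digit values for New York (from the text)
GROUP_NUMBERS = 99       # Group Numbers 01..99 (from the text)
SERIAL_NUMBERS = 9_999   # assumption: last four digits range over 0001..9999


def random_guess_probability(attempts: int, candidates: int) -> float:
    """Chance that one of `attempts` distinct, uniformly chosen guesses is correct."""
    return min(attempts, candidates) / candidates


first5_space = AREA_NUMBERS_NY * GROUP_NUMBERS   # 8,415 combinations
full9_space = first5_space * SERIAL_NUMBERS      # 84,141,585 combinations

first5 = random_guess_probability(1, first5_space)
full9 = random_guess_probability(1_000, full9_space)

print(f"first 5 digits, 1 guess:      {first5:.3%}")   # 0.012%
print(f"all 9 digits, 1,000 guesses:  {full9:.4%}")    # 0.0012%
```

Under these assumptions the results match the 0.012% and 0.0012% baselines quoted above, which is what makes the reported 30% and 3% figures stand out.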
Q. Can the predictability of SSNs lead to identity theft? Does this research publication provide all that is needed to acquire SSNs?
No. Aside from the fact that sensitive details were omitted from the article, to move from mere statistical predictions to actual identity theft an attacker needs to exploit holes and weaknesses in the U.S. identity "infrastructure:" the widespread availability of personal, demographic data for millions of individuals, the existence of large botnets of compromised computers, and the lax identity matching and authentication techniques adopted in the credit/financial sectors (among others). Our findings can help combat and decrease identity theft by showing why such known (yet underestimated) weaknesses in our identity infrastructure should finally be addressed; by alerting industry and policy-makers of a new exploit; and by highlighting the need to abandon SSNs as passwords and move toward more secure, efficient, and privacy-preserving means of authenticating identities.
Q. How does this differ from previous research?
Previous
research in the area of SSNs focused on detecting SSNs in public
databases, using SSNs to link data across multiple data sources, or -
in the cases closest to our study - inferring the year[s] and state of
issuance of known
SSNs. Per se, the existence of SSN issuance patterns is well known -
the SSA makes certain details available through public materials, and
others (notably, Latanya Sweeney and her "SSN Watch")
have used those patterns, plus a combination of public and private SSN
data, to estimate when and where a *known* SSN may have been
issued [Wessmiller, 2002], [Sweeney, 2004], [EPIC, 2008].
However, our work focuses on the inverse, harder, and much more
consequential inference: it shows that it is possible to exploit the
presumptive time and location of SSN issuance to estimate, quite
reliably, *unknown* SSNs.
Q. What data do you need to predict SSN? Isn't birth data hard to come by?
Data about SSNs from the so-called "Death Master File,"
which is publicly available, and demographic data (dates of birth and
states of birth) from wherever it is available. Mass amounts of birth
data for US residents can be obtained or inferred - often for free, or
at negligible per unit prices - from multiple sources, including
commercial data brokers (such as www.peoplefinders.com,
which sells access to birth data and personal addresses for "almost
every adult in the United States"); voter registration lists (for most
states); online free people searches (such as www.zabasearch.com);
as well as social networking sites: our estimates indicate that at
least 10 million US residents make their birthday information
publicly available or inferable on their online profiles.
Q. From which social networking site did you find data for one of your tests?
There
is no specific networking site which is uniquely exposed. The data can
be extracted from several such sites, as well as other sources, as
noted above.
Q. Aren't SSNs in fact as available as birth data?
They are not.
It
is true that SSNs are widely available. They have been found in public
records of federal agencies, states, counties, courts, hospitals, and
so forth [The President’s Identity Theft Task Force, 2007], as well as in personal documents, such as online resumes [Sweeney, 2006].
Companies exchange SSNs in personal information markets, and
individuals obtain "credit reports," containing their SSNs, from
credit bureaus; stolen SSNs are lucratively exchanged in underground
cybermarkets [Franklin, 2007].
However, the GAO found that only a few brokers offering SSNs for sale to
the general public are actually able to sell whole SSNs [GAO, 2006].
Furthermore, the GAO also found that while still widespread, SSNs are becoming harder to find in public documents [GAO, 2008].
In fact, the number of SSNs widely available may also be decreasing
because of numerous legislative initiatives in this area. Various
recent initiatives have been focusing on removing SSNs from public
exposure or redacting their first five digits [NCSL, 2007], [FTC, 2008], and [GAO, 2008]. On the other hand, birth data remains widely available, as noted above.
Q. Can you accurately predict *every* SSN?
No.
Every SSN is issued under the same basic assignment scheme (and the
scheme, while complex, contains observable regularities). Hence, in theory, any SSN may be predicted. However, the
probability that a given SSN can be effectively predicted ranges from very
low (or zero) to very high, depending on factors such as the year and
state the SSN was applied for, how close to the individual's birth date
it was applied for, and so forth. For the tests we ran, our predictions were several orders of magnitude more accurate than
random chance over the 1973 through 1988 period; however, dramatic and widespread increases in
accuracy were especially observable for individuals born after 1988 (the onset of the nationwide EAB program), particularly in less-populous states.
Q. How many actual SSNs can be predicted?
There
is no single number that can answer that question. The number is a
function of many parameters and probabilistic inferences, including -
as noted above - the availability of birth data, the accuracy of
prediction across different states and years, the availability of tools
to verify the system, and so forth. We present some possible
extrapolations in the paper, but we stress that they must be weighed
and considered under the caveats also presented there.
Q. I posted my date of birth online. Has my SSN been "broken"?
No.
That
knowledge is not sufficient to "compromise" an SSN without the
possibility of attempting to find the right number among possible
variations - that is, attackers still need to succeed in exploiting
other systems to compromise one's identity. Again, statistical
predictions of windows of possible SSNs do not imply, alone, identity theft.
The likelihood that probabilistic inferences can translate to actual
SSN identification is a function of several parameters. Inaccurate or
unavailable birth information, or the attacker's inability to complete
repeated attempts, will reduce the accuracy of the predictions and the
number of individuals' SSNs under actual threat compared to the DMF
estimations we present in the paper.
Q. Aren't data breaches a larger problem?
Not necessarily - although this is an apples vs. oranges kind of comparison.
First: not all data breaches involve SSNs. Estimates based on attrition.org
data at the time of writing indicate that the average breach involves
140k SSN records. However, that average (as well as most of the largest
breaches that involved SSNs) includes accidental data losses that may not
have resulted in actual information exposure, such as the 26.5M US
veterans' records stored in a laptop stolen during a burglary in 2006.
Second,
and more importantly, unlike data breaches, which are local threats
(that is, specific to the records contained within a certain database,
however large that may be), the predictability we observed is, in principle,
universal, in that it applies, theoretically (and with different degrees of accuracy, depending on the factors highlighted above), to any current and future
SSNs - unless their assignment scheme is modified.
Third:
Companies can invest to protect their databases, and compromised credit
cards can be blocked and renewed. However, unlike traditional
passwords, SSNs cannot be blacklisted after failed attempts, nor
changed to avoid future fraud [SSA, 2009].
Fourth:
data breaches can be discovered, and the owners of compromised
accounts can be notified of the breach. Predicting SSNs is more akin to
a "stealth" way of compromising an identity, and could be harder to
detect.
Hence,
the predictability of SSNs is an issue that should be faced with
different tools than the ones used to prevent and deal with data
breaches.
Q. Isn't it cheaper to just pay a data broker to acquire SSN data?
Unfortunately (or, perhaps, fortunately), no.
In
the grey market (that is, excluding the market where certified and
vetted companies trade personal data), it is becoming increasingly
difficult to obtain SSNs [GAO, 2006], and prohibitively expensive: according to [Krim, 2005], SSNs are sold in the grey markets for prices around $35 to $45. In the black market, according to [idanalytics, 2006],
stolen identities in the US can be traded for a
value of $30 to $50 per identity. However, estimates of the value of
SSNs in underground markets vary greatly (with some estimates
significantly smaller than $30), given the relative illiquidity of
these markets [McCarty, 2003], [Thomas, 2006] (credit cards are reported as ranging from $0.10 to $25, and full identities (comprising SSNs) as ranging from $0.90 to $25 [Herley, 2009]).
On
the other hand, the birth data necessary for the predictions is much
cheaper, and the availability of botnets of compromised computers
may make harvesting credentials on large scales quite easy (although
estimates vary, controlling 10,000 IPs for an entire day could cost as
little as $1000 [Lesk, 2007]).
Q. Isn't it the case that SSNs, alone, are not sufficient to impersonate a person? Banks and other services ask for additional information (such as your mother's maiden name, your pet's name, and so forth).
In
the US, knowledge of someone's name, date of birth, and SSN is sometimes
sufficient to impersonate that person in a variety of situations.
We
need to distinguish between current account fraud (somebody tries to
access a bank account you already created) and new account fraud
(somebody tries to create a new credit card under your name). While in
"current account" frauds the attacker, to gain access to an account already
created and owned by the individual, indeed needs not just the victim's
name, date of birth, and SSN, but (most often) also additional
passwords or personal information, in "new account" frauds, the
attacker more likely only needs to use the victim's name, date of
birth, and SSN to create a new
account in the victim's name. Therefore, new account frauds can be
perpetrated even without knowledge of the victim's phone number, mother's
maiden name, or other pieces of personal information. Mounting
empirical evidence suggests, in fact, that providing an SSN and a date
of birth which match that SSN is sufficient to create new fraudulent accounts [Cook, 2005], [Hoofnagle, 2007], [Consumers Union, 2007], even when the name associated with that SSN did not
match, or the address was wrong, or even -
as noted above - some of the submitted SSN digits were wrong.
Besides,
adding more questions to authenticate a person to an account is hardly
good security, if answers to those questions can be still inferred, or
compromised.
Q. If SSNs were no longer used for authentication, what else could we use?
Simply
asking more personal questions (such as your mother's maiden name, your
pet's name, or your high school) cannot work, since that information
can also be compromised, stolen or - in this age of self-revelation -
inferred from various sources. However, plenty of research has focused
on systems which protect sensitive data while allowing exchanges of
information: work on 2-factor authentication, digital certificates,
and privacy preserving identity management systems. While there is no
foolproof system nor a panacea (as Bruce Schneier noted, "Proposed
fixes tend to concentrate on [...] making personal data harder to
steal--whereas the real problem is [...] preventing and detecting
fraudulent transactions" [Schneier, 2007]),
research in this area has made significant progress in recent years,
and we hope that the debate will focus on systems which combine privacy
with the necessary and efficient flow of information.
Q. Who funded this research?
The
National Science Foundation (under Grant 0713361) and the U.S. Army
Research Office (under Contract DAAD190210389, through Carnegie
Mellon's CyLab) supported this research. We also received support from the Carnegie Mellon
Berkman Fund and from the Pittsburgh Supercomputing Center.
Q. Were the tests IRB approved?
Yes, they were approved. No SSNs were harmed during the writing of this paper.