Tuesday, 11 May 2010

Zipf's law for data breaches

As we've seen in previous posts, there's lots of structure in the available data on data breaches. In particular, the size of data breaches seems to follow a lognormal distribution as well as Benford's law. It looks like we can add a third law that this data follows, and that's Zipf's law.

Suppose that we rank our data from largest to smallest. Zipf's law tells us that if we plot the log of the data versus the log of the rank we get a straight line. Zipf's law was first formulated based on the observation by linguist George Zipf that the frequency of words is inversely proportional to the rank of the word in a word frequency table. It also seems to hold for other data sets, like the size of US cities.
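
To make the ranking step concrete, here is a minimal Python sketch. The counts are made up so that they follow an exact inverse-rank rule, and the helper name log_rank_points is hypothetical; real data will only approximate the straight line. The values must be positive for the logs to exist.

```python
import math

def log_rank_points(values):
    """Return (log10 rank, log10 value) pairs, ranking from largest to smallest."""
    ranked = sorted(values, reverse=True)
    return [(math.log10(rank), math.log10(v))
            for rank, v in enumerate(ranked, start=1)]

# Hypothetical counts that follow an exact inverse-rank rule: count = 1200 / rank.
counts = [1200 / r for r in range(1, 7)]
points = log_rank_points(counts)

# Slope between the first and last point of the log-log plot.
# For an exact inverse-rank rule the points are collinear with slope -1.
(x0, y0), (x1, y1) = points[0], points[-1]
slope = (y1 - y0) / (x1 - x0)
print(round(slope, 6))
```

A slope close to -1 corresponds to the classic word-frequency form of the law; other data sets that look Zipf-like on a log-log plot may have a different slope.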

Let's look at the data breaches from 2007 through 2009 that are listed in the OSF's data breach database. Here's what we get when we plot the log of the rank of a breach versus the log of the size of the breach. The blue dots represent actual data breaches. The red dots are what we get from fitting a straight line to the log-log data.

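[Figure: log-log plot of breach rank versus breach size, with actual breaches in blue and the fitted line in red]
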
Although the fit is much better for the breaches that aren't either too big or too small, the line actually fits the overall data fairly well, with a coefficient of determination of R² = 0.873.
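
A fit of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the breach sizes below are invented placeholders rather than records from the database, fit_line and r_squared are hypothetical helper names, and ordinary least squares on the log-log points is assumed as the fitting method, since the post doesn't say which one was used.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented breach sizes (records exposed), for illustration only.
sizes = [4_200_000, 900_000, 310_000, 95_000, 40_000,
         12_000, 2_500, 600, 90, 12]

# Rank from largest to smallest, then take logs of both rank and size.
ranked = sorted(sizes, reverse=True)
xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
ys = [math.log10(size) for size in ranked]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(f"slope={b:.3f} intercept={a:.3f} R^2={r2:.3f}")
```

Plotting the points (xs, ys) alongside the fitted values a + b*x would give a picture of the same kind as the one above, with the fitted values playing the role of the red dots.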
