
Tuesday, 26 May 2009

Why do data breaches have a lognormal distribution?

In a previous post, I noted how the size of data breaches seems to follow a lognormal distribution. The available data seems to support this claim, but this raises an interesting question: exactly why do the sizes of data breaches have a lognormal distribution?

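One way to get a lognormal distribution is as the product of many independent, positive random factors. The logarithm of such a product is a sum of independent terms, and by the central limit theorem that sum tends toward a normal distribution, which makes the product itself approximately lognormal. The factors do not need to be normally distributed, and a product of normally distributed values is not, in general, lognormal. If we try to apply that interpretation to data breaches, we might think of the size of a data breach as resulting from the failure of one or more security mechanisms, with each failure acting independently and scaling the size of the breach by some random factor.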
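To make the multiplicative interpretation concrete, here is a minimal simulation sketch. It is not based on any breach data: the number of factors, the range each factor is drawn from, and the function names are all assumptions chosen only for illustration. Each simulated "size" is the product of independent positive factors, and the skewness of the raw values is compared with the skewness of their logarithms. If the product is roughly lognormal, the raw values should have a long right tail while their logarithms should be close to symmetric.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


def simulate_products(n_samples=20_000, n_factors=20, seed=0):
    """Return n_samples values, each the product of n_factors independent factors.

    The factors are drawn uniformly from (0.5, 1.5). That range is an arbitrary,
    hypothetical choice; the point is only that the factors are positive,
    independent, and not themselves lognormal.
    """
    rng = random.Random(seed)
    return [
        math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
        for _ in range(n_samples)
    ]


products = simulate_products()
logs = [math.log(p) for p in products]

print("skewness of raw products:", round(skewness(products), 2))
print("skewness of log(products):", round(skewness(logs), 2))
```

Skewness is only a rough check of symmetry, not a formal test of lognormality, but it is enough to show the effect: multiplying independent factors produces a heavily right-skewed result whose logarithm looks approximately normal.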

On the other hand, just because a distribution is lognormal doesn't mean that it's derived from the product of several factors, at least not in a meaningful way. There are many examples of things that follow a lognormal distribution but don't seem to be derived from an obvious product of independent factors. An interesting article by Eckhard Limpert, Werner Stahel and Markus Abbt, "Log-normal Distributions across the Sciences: Keys and Clues," published in BioScience, gives some examples where this is the case.

According to this article, the following values have a lognormal distribution:

  • the concentration of gold or uranium in ore deposits

  • the latency period of bacterial food poisoning

  • the age of the onset of Alzheimer's disease

  • the amount of air pollution in Los Angeles

  • the abundance of fish species

  • the size of ice crystals in ice cream

  • the number of words spoken in a telephone conversation

  • the length of sentences written by George Bernard Shaw or Gilbert K. Chesterton

Because the lognormal distribution seems to be so common, the fact that data breaches follow it shouldn't be that surprising. It's hard to explain why the number of words spoken in telephone conversations follows a lognormal distribution by treating that number as the product of several independent random factors. In the same way, there may be a limit to how much we can understand about data breaches by analyzing the lognormal distribution. The size of data breaches seems to follow a lognormal distribution, but it may be a mistake to read too much into that fact.
