
Tuesday, April 14, 2009

What on earth is tokenization?

Tokenization is a technology that's sometimes used to obscure sensitive information. I'm not convinced that it's a good solution in many cases, but I'm definitely convinced that calling it "tokenization" was a bad idea. "Tokenization" has a well-established meaning that most undergraduate computer science students learn, and it seems reasonable to assume that the marketing person who decided to overload this term didn't know this. I doubt that they were trying to confuse things by using an existing term in a totally unrelated way, but I may be giving them too much credit.

A token is one or more characters in a source code file that act as a single logical symbol in a high-level language when they're grouped together, and converting a source code file into tokens is called tokenization. An example of this is converting the code

commission=sales*rate;

into the following tokens

commission

=

sales

*

rate

;
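To make the compiler sense of the word concrete, here is a minimal sketch of a tokenizer in Python. The token pattern is an assumption made for illustration (identifiers, integer literals, and a handful of single-character operators); a real compiler's lexer handles many more cases, and the name `tokenize` is hypothetical.

```python
import re

# Assumed token pattern for a tiny C-like language: identifiers,
# integer literals, and single-character operators or punctuation.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[=+\-*/;()]")


def tokenize(source):
    """Split source text into a list of token strings, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {source[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("commission=sales*rate;"))
# prints ['commission', '=', 'sales', '*', 'rate', ';']
```

The characters of `commission` are grouped into one token because together they form a single identifier, which is exactly the grouping described above.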

Tokenization is also used by linguists who try to understand natural languages. Instead of parsing a file of source code, linguists want to parse a sentence into words. An example of this is converting the sentence

I flew to New York.

into the following tokens

I

flew

to

New York

Tokenization, particularly of natural languages, is actually fairly tricky. How do you know whether to parse "New York" into a single token or into "New" and "York," for example?
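One simple way to see why this is tricky is to notice that the answer depends on outside knowledge. The sketch below, which is only an illustration and not how any particular linguistics toolkit works, keeps a small hypothetical lexicon of multiword expressions and joins adjacent words that appear in it. The names `MULTIWORD` and `tokenize_sentence` are made up for this example, and punctuation is simply dropped.

```python
import re

# Hypothetical lexicon of two-word expressions to keep as one token.
MULTIWORD = {("New", "York")}


def tokenize_sentence(sentence, multiword=MULTIWORD):
    """Split a sentence into word tokens, joining known two-word expressions."""
    words = re.findall(r"\w+", sentence)  # punctuation is discarded
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in multiword:
            tokens.append(" ".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize_sentence("I flew to New York."))
# prints ['I', 'flew', 'to', 'New York']

print(tokenize_sentence("I flew to New York.", multiword=set()))
# prints ['I', 'flew', 'to', 'New', 'York']
```

The same sentence produces different tokens depending on the lexicon, so the tokenizer is only as good as the list of expressions it is given.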

Many books and papers published over the past few decades describe how to do tokenization, for both source code and natural languages. Given that, why would a marketing person decide to use the same term for a totally unrelated concept, particularly when many of the people they'll be talking to about their version of tokenization already understand the other one? The average guy on the street doesn't think of tokenization as something that a compiler does, but many IT people do, and they're the ones who are going to be involved in deciding which technology to use. Overloading a term that they already know was probably a bad idea.
