What is the meaning of life, the universe and everything ... and some information theory?

We all know the answer to the question:
- What is the meaning of life, the universe and everything?
- 42.
... right! Douglas Adams told us not so long ago.

How did he know... of course it was a joke, but still... it is fun to ponder :)
Can we find some contrived way to prove it? You are in luck, because I found one...
Even if you don't find the example compelling, because it really comes out of... somewhere :), I'm still using it as an excuse so we can have some fun exploring the basics of Information theory...

“Information is a difference that makes a difference.” --Gregory Bateson

Wow!!! ... deep.
The central goal of Information theory is to understand what information is, how we measure information content, how we transport and encode information efficiently, and so on.

We will first explore the problem of measuring the amount of information content in a message.
To keep the article short, let's accept that the most basic way to encode information is bits. (That is true anyway; we just won't go through the steps to show it.)
(The word "bit" is shorthand for BInary digiT.)
A bit represents at most two possible choices: 1 and 0, yes and no, true or false, black or white... you get it? To fathom some meaning from new information we need a way to measure the amount of difference/unpredictability, i.e. we first have to find a way to quantify information.
The good thing is that this is a very intuitive concept. If you just learned something new, then you didn't know it beforehand, ergo your knowledge changed by that "difference", which we will for now call information content.

Our question then is: what is the information content of the things we learn? Glad you asked :) ... Let's go with the easiest case, using 1 bit, i.e. there are two possible values, 0 and 1. We decide, then, that the information content of one bit will be 2 units of something (whatever the UNIT turns out to be). Using 0 and 1 we can exchange 2 types of messages (symbols) with each other. So 1 bit = 2 symbols = 2 units of information.
That is nice, but in reality we rarely have such simple needs, i.e. exchanging only 2 types of messages (symbols). Let's say, for example, we want to communicate 8 different types of messages (symbols). In this case we will need 3 bits (n bits give us `2^n` possible combinations).
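As a quick sanity check, here is a small Python sketch (my own illustration, not from the article) of the bits-to-symbols relationship; the function name `bits_needed` is mine:

```python
import math

def bits_needed(n_symbols: int) -> int:
    """Smallest k such that 2**k combinations cover n_symbols."""
    return math.ceil(math.log2(n_symbols))

print(bits_needed(2))   # -> 1: one bit covers 2 symbols
print(bits_needed(8))   # -> 3: three bits cover 2^3 = 8 symbols
print(bits_needed(7))   # -> 3: seven symbols still need 3 bits (one code goes unused)
```

Note the rounding up: symbol counts that are not exact powers of two still need a whole extra bit, a point we will come back to below.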

What, then, will be the information content of messages using 3-bit symbols? Listing all the combinations (000, 001, ..., 111), we can clearly see it should be 8 units of information.
We would expect that summing the 1-bit information content 3 times, once for every bit, would give us the correct answer, i.e. `2+2+2 = 6`, but as we can see that is not the case. The correct answer is `2*2*2 = 8` (3 bits can encode 8 different symbols)... we have a dilemma: how can we measure the information content of "compound" messages in such a way that it stays consistent with our basic unit definition?
Don't despair ;) there is a very elegant solution to our problem, if we decide to represent information content as the logarithm of the number of possible symbols, rather than as the enumeration of all possible combinations. We can cheat this way because we are still in the exploration and definition phase :). So what are the consequences if we follow this through...

If we accept the logarithm definition instead of the enumeration, then the information content of 1 bit will be:
`log_2(2) = 1` unit
and for the 3-bit case we sum the logarithms of the 1-bit case:
`log_2 (2*2*2) = log_2(2) + log_2(2) + log_2(2) = 3` units

This translates to: we need 3 bits to convey 8 types of messages, and the information content is 3 units, instead of 8.
We picked the base-two logarithm because it suits a binary system best... we could pick any other base we wish, but let's stick with this one for now.
The guiding principle is to preserve the additivity we just explored, so we can do easy calculations going from simple to more complex encodings.
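A tiny Python check (again my own illustration) confirms that the logarithm turns the multiplication of possibilities into a sum of per-bit contributions:

```python
import math

# Enumeration multiplies: 3 bits give 2*2*2 = 8 possible symbols.
symbols = 2 * 2 * 2

# The logarithm turns that product into a sum, one term per bit.
per_bit = math.log2(2)                           # 1 unit per bit
total = math.log2(2) + math.log2(2) + math.log2(2)

print(symbols, per_bit, total)                   # -> 8 1.0 3.0
print(total == math.log2(symbols))               # -> True: log of a product = sum of logs
```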

So far we have also made an assumption: as we send messages back and forth, the probability of getting any one symbol out of n possible symbols is on average equal, i.e. `1/n`.
But in real life that is not always true, i.e. some symbols are used more often than others, and this changes the balance of the information content that can be transmitted.
Remember, from a layman's perspective: the more surprising a message is to us, the more information it contains.
What happens if we want to encode messages using 7 symbols instead of 8? We still have to use 3 bits, because 2 bits can only encode 4 symbols, so one of the codes/symbols would go unused; this means the probability of getting any single one of the 8 possible symbols changes.
Or what if we still use 8 symbols, but the symbol '101' occurs more often than the rest? Then we will expect it more often and won't be so surprised by it. But we said that the measure of information content is how surprised we are, or how unpredictable the message is.
To get another angle on how to handle those cases in our newly invented scheme, let's get back to the case of a 1-bit symbol. For example, flipping an unfair coin over and over, we find that we get heads (1) 9 times out of 10, and tails (0) 1 time out of 10. We can take this imbalance into account by weighting with the probabilities (in this case based on the frequencies of the events occurring):
`90% * log_2(90%) + 10% * log_2(10%) ~= 0.9 * log_2(0.9) + 0.1 * log_2(0.1) ~= -0.469`

The information content is then 0.47 bits. (We disregard the minus sign.)
BTW, from now on we will also use the bit as our unit of measure; it makes sense.
The term information content is also a synonym for Entropy, or unpredictability. (Try not to think of Entropy as used in Physics; the Information theory approach is much cleaner and easier to grasp.)
The formula we just used above is the following:
`H = sum_{i=1}^n p_i * log_2(1/p_i)` where: `H` is the information content/Entropy, `p_i` is the probability of encountering symbol "i", and `n` is the number of possible symbols.

This is the same thing we used before, just written with probabilities.
Let's try it for the 8-symbol (3-bit) case, where each symbol has equal probability 1/8:
`sum_{i=1}^8 1/8 * log_2(1/(1/8)) = 8 * 1/8 * log_2(8) = log_2(8) = 3` bits
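The formula is easy to turn into code. Here is a minimal Python sketch (function name `entropy` is mine) that reproduces both the biased-coin and the uniform 8-symbol results from above:

```python
import math

def entropy(probs):
    """H = sum p_i * log2(1/p_i), in bits; zero-probability terms contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Biased coin: heads 90%, tails 10% -> about 0.47 bits per flip.
print(round(entropy([0.9, 0.1]), 3))   # -> 0.469

# Eight equally likely symbols -> exactly 3 bits.
print(entropy([1 / 8] * 8))            # -> 3.0
```

A fair coin gives exactly 1 bit per flip, the maximum for a binary symbol; any imbalance pushes the entropy below that.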

So far so good: it seems we have found a way to quantify information content, and we can now further explore how to manage, manipulate and transport information. Our speculations are starting to look more and more like a science.

The next step is to explore the process of exchanging information.
As we saw, if we use n bits to define a message in a binary system, there are `2^n` possible symbols. That holds under ideal conditions; in real life we can't use them all. If we used all the available bits to encode all possible symbols, we would lose information transmitting it over a noisy channel (there is no such thing as a noiseless channel). To make information exchange possible, we are forced to add redundancy to our encoding scheme, to overcome the inevitability of noise disrupting our communications.
Striking the right balance is a major goal of Information theory. In this short article we won't go that deep; our main goal here is to get a good understanding of information content/Entropy.
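To make the redundancy idea concrete, here is a deliberately naive toy scheme of my own (not one from the article): repeat every bit three times and recover each bit by majority vote, so any single flipped bit per triple is corrected at the cost of tripling the message length.

```python
def encode(bits):
    """Triple every bit: 1 data bit -> 3 channel bits (200% redundancy)."""
    return [b for b in bits for _ in range(3)]

def decode(channel_bits):
    """Majority vote over each triple survives one flipped bit per triple."""
    triples = [channel_bits[i:i + 3] for i in range(0, len(channel_bits), 3)]
    return [1 if sum(t) >= 2 else 0 for t in triples]

msg = [1, 0, 1, 1]
sent = encode(msg)            # [1,1,1, 0,0,0, 1,1,1, 1,1,1]
sent[4] = 1                   # noise flips one channel bit in transit
print(decode(sent) == msg)    # -> True: the error is corrected
```

Real codes (Hamming, Reed-Solomon, etc.) buy far more protection per redundant bit; this sketch only shows the trade-off itself.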

OK, now that we know how to measure information... let's proceed to the answer to the Ultimate Question!
We will take a look at the English language.
The Latin alphabet contains 26 characters, plus the space, i.e. 27 symbols. Knowing this we can find the maximum entropy. (Maximum entropy happens when every character/symbol is equally likely when devising a word, i.e. if we assume full freedom in combining characters.)
`sum_{i=1}^27 1/27 * log_2(1/(1/27))` or just: `log_2(27) = 4.7549`, so the information content is `~ 4.75` bits

That is the maximum information content we can expect. But how can we find the real one? We are not free to use all possible combinations when building words; the English language imposes constraints on us. A first approximation is to use the frequency of occurrence of characters in English text. If we did that and used the probability formula for Entropy we devised earlier, we would find that the answer is approximately ~4 bits.
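As a rough illustration of the frequency approach, we can estimate per-character entropy from any sample text; the sentence below is just a stand-in of mine, not the corpus used for the ~4 bit figure, so its exact value will differ, but it must stay below the `log_2(27)` ceiling:

```python
import math
from collections import Counter

def char_entropy(text):
    """First-order entropy in bits per character, from observed frequencies."""
    counts = Counter(text)
    total = len(text)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
h = char_entropy(sample)
print(round(h, 2))   # below the log2(27) ~ 4.75 ceiling, since frequencies are uneven
```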
We can go to the next level, i.e. use word frequencies: build the per-word entropy, then divide it by the average length of a word in English (~4.5 characters) to get the per-character entropy/unpredictability.
Claude Shannon, the father of Information theory, did exactly that. He also employed several other methods, finally arriving at an entropy for the English language of ~2 bits per character.
So the maximum Entropy of English per letter is 4.75 bits (less predictable) and the real average Entropy is something like 2 bits (more predictable), i.e. with every new character we get 2-5 bits of new information. We cannot pinpoint the exact average Entropy very easily: if you think about it, it is much easier to figure out a word when we are in the middle of it than when we are at its beginning, which makes intuitive sense. Another gut feeling that proves to be true is that the "longer" the symbols we use to encode messages, the greater the information content per symbol. For example, compare a 1-bit symbol (0|1) with an alphabet character (roughly a 5-bit symbol): a character packs more information into a shorter message, because it has 27 possibilities. In the same sense, say we enumerated all possible words and created codes for the most frequently used ones. Using those codes to convey information, instead of words built from characters, provides even more information per symbol. The same goes for sentences vs. words.
The drawback of this scheme is that if we don't know the alphabet, or we don't know the word codes, we cannot extract meaningful information from a message. So if we talk about machines exchanging information, both sides have to have the code tables, or the same algorithm has to be known on both sides (if we use an algorithm to encode the messages).
We can translate this example into human parlance by comparing different languages: let's speculate that Chinese characters provide more information per symbol than the English alphabet. If Chinese is more "information rich", why don't we all use it then? The reason is that to use it you have to learn it first ;). Or vice versa.
So this is another problem tackled by Information theory: how to find the best code with the least code complexity. This is called Compression. You have probably heard of .zip files, just one application of Information theory.
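The effect is easy to see with Python's standard `zlib` module (my example, not from the article): predictable text compresses well, while random bytes barely shrink, because they already carry near-maximal entropy.

```python
import os
import zlib

predictable = b"the cat sat on the mat. " * 40   # highly redundant text
random_ish = os.urandom(len(predictable))        # near-maximal entropy bytes

print(len(predictable), len(zlib.compress(predictable)))  # large saving
print(len(random_ish), len(zlib.compress(random_ish)))    # almost no saving
```

A good compressor is, in effect, an entropy estimator: the compressed size approaches the information content of the input.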
But let's go back to our original problem, English language ....
Now that we have the range of the Entropy, what can we do with it? For one, if we compute the ratio, we can find how much relative freedom of expression the English language allows.
Let's do that:
`2/4.75 ~= 0.42` choice
`1 - 0.42 = 0.58` redundancy

So we got our answer: English allows 42% freedom/choice in picking what the next character in a word will be. One minus the freedom = 58%, which represents the constraints the language imposes, or, said another way, the amount of redundancy.
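The arithmetic behind the punchline, in Python (using `log_2(27)` for the ceiling and the ~2 bits/character estimate from above):

```python
import math

max_entropy = math.log2(27)   # ~4.75 bits: 26 letters + space, all equally likely
real_entropy = 2.0            # Shannon's rough per-character estimate for English

choice = real_entropy / max_entropy
redundancy = 1 - choice

print(round(choice, 2))       # -> 0.42: the answer to everything
print(round(redundancy, 2))   # -> 0.58
```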
More examples:
It is easy to see now that we gain understanding just by knowing how to measure information content.
For example, let's see how much more information we can exchange per digit when we use the binary system (0|1) compared to the decimal system (0-9). We know the formula already:
`log_2(2) = 1` `log_2(10) = 3.32`
So a decimal digit is more than three times as information rich as a binary one.
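The same comparison as a Python one-liner per base:

```python
import math

print(math.log2(2))                   # -> 1.0 bit per binary digit
print(round(math.log2(10), 2))        # -> 3.32 bits per decimal digit
print(round(math.log2(10) / math.log2(2), 2))   # a decimal digit carries ~3.32x more
```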

Final remarks

I guess I cheated a bit, but hey, it is for science ;)...
More recent tests suggest the lower band of the Entropy of English may be 1.5 bits or even less as the average English word grows longer; but I will assume that the acquisition over the last 50 years of many new words, from the Information age, the Internet and globalization, adds unpredictability to guessing the next character when building words.
The important thing we learned in this short article is how to measure information content: if we know the frequency/probability of the symbols used to encode the possible messages generated by "any" source, we just sum, over the possible symbols, each probability multiplied by the logarithm of its reciprocal.
`H = sum_{i=1}^n p_i * log_2(1/p_i)`
We also learned that this quantity is the so-called Entropy.

What other use could this be to us? We now know how to find the physical limit on the maximum information that can be encoded, given the number of symbols and their probabilities.
Based on this insight, Information theory goes further, exploring the capacity of information channels and the ways to encode information, so that in real life we can build reliable communication systems.
What the Carnot cycle gave to thermodynamics, Information theory gave to Communication.

Math references

    Information theory
  • An Introduction to Information Theory: Symbols, Signals and Noise - John R. Pierce
  • Great Ideas in Information Theory, Language and Cybernetics - Jagjit Singh