Wednesday, December 12, 2012

Information theory: questions and answers

Information theory is fundamentally about questions and answers.

We understand information itself in terms of questions and answers: 1 bit of information is the uncertainty in the answer to a question whose two possible answers are equally likely, e.g. "will this coin flip give tails?".
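To make that concrete, here's a quick Python sketch (the function name entropy_bits and the use of numpy are just my choices for illustration) of the Shannon entropy H(X) = -sum_i p_i log2 p_i:

    import numpy as np

    def entropy_bits(p):
        # Shannon entropy H(X) = -sum_i p_i * log2(p_i), in bits
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # terms with p_i = 0 contribute nothing to the sum
        return -np.sum(p * np.log2(p))

    print(entropy_bits([0.5, 0.5]))  # fair coin: exactly 1.0 bit
    print(entropy_bits([0.9, 0.1]))  # biased coin: ~0.47 bits

A fair coin gives exactly 1 bit; any bias reduces the uncertainty in the answer, and so reduces the information gained by learning it.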

Just as importantly though, the measures of information theory themselves are all about questions and answers too.

For the basic measures, the questions they ask seem fairly obvious. Shannon entropy asks:
"how much uncertainty is there in the state of this variable X?"
Mutual information asks:
"how much information does the state of variable X tell me about the state of Y?"
And conditional mutual information asks:
"how much information does the state of variable X tell me about the state of Y, given that I already know the state of Z?"
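For discrete variables, all of these questions can be answered directly from a joint probability table. A quick sketch, again in Python with numpy, using a made-up joint distribution:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # a hypothetical joint distribution p(x, y) over two binary variables
    pxy = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    print(entropy(px))                                       # H(X) = 1 bit
    print(entropy(px) + entropy(py) - entropy(pxy.ravel()))  # I(X;Y) = H(X) + H(Y) - H(X,Y), ~0.28 bits
    # the conditional form follows the same pattern:
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)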

But I want to make a few more subtle points about these questions and answers.

In my opinion (which is of course the only correct one), the answers that the measures give are always correct. If you think they're wrong, then you're asking the wrong question, or have malformed the question in some way. There are plenty of ways to do this, or at least to inadvertently change the question that you're asking.

I see the sample data itself as part of the question that a measure is answering. When you estimate the probability distribution functions (PDFs) empirically from a given sample data set, your original question about entropy really becomes:
"How much uncertainty is there in the state of this variable X, given what we're assuming to be a representative sample of realisations x of X here?"
Of course, your supposedly representative sample could simply be too short, and thereby completely misrepresent the PDF. Or you could get into trouble with stationarity (1) of the process - you might implicitly have appended "given what we're assuming to be a representative stationary sample here" to the question, but that assumption may not be true.
In both cases, the measure will give the correct answer to your question, but it might not be the question you really intended to ask.
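The "too short" failure mode is easy to demonstrate. In this sketch (plugin_entropy is just an illustrative name), the true process is uniform over 8 symbols, so the true entropy is 3 bits - the plug-in estimator answers its question about the sample correctly, it's just not the question about the underlying process:

    import numpy as np
    rng = np.random.default_rng(0)

    def plugin_entropy(samples, n_symbols):
        # entropy of the empirical (plug-in) distribution, in bits
        p = np.bincount(samples, minlength=n_symbols) / len(samples)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # true process: uniform over 8 symbols, so the true entropy is 3 bits
    for n in (10, 100, 10000):
        est = np.mean([plugin_entropy(rng.integers(0, 8, n), 8) for _ in range(100)])
        print(n, round(est, 3))  # short samples systematically under-estimate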

As another way of inadvertently changing the question, one must realise that for the same information-theoretic measure, different estimators (or indeed different parameter settings for the same estimator) answer different questions. Take mutual information, for example, which one could measure on continuous-valued data via (box) kernel estimation. Using this estimator, the measure asks: "how much information does knowing the state of variable X within radius r tell me about the state of variable Y within radius r?". Clearly, using different parameter values for r amounts to asking different questions - potentially very different questions, if one uses radically different scales for r.
Going further, one could measure the mutual information using the enhanced Kraskov-Grassberger estimation technique (based on nearest-neighbour statistics). With this estimator, the mutual information measure asks: "how much information does knowing the state of variable X tell me about the state of variable Y, to the precision defined by the k closest neighbours of each point in the joint X-Y space of the sample data set?". Apart from being something of a mouthful, that is obviously a different question to the one the box kernel estimator is asking. And again, changing the parameter k changes the question being asked.
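To make the contrast concrete, here's a rough sketch of both estimators side by side - the function names are my own, the box-kernel version is a naive plug-in estimate, and mi_ksg follows algorithm 1 of Kraskov et al. (2004); numpy and scipy are assumed, and the data is just synthetic correlated Gaussians. The point isn't the implementations, but that the same "mutual information" returns different numbers as r and k change the question being asked:

    import numpy as np
    from scipy.special import digamma

    def mi_box_kernel(x, y, r):
        # plug-in MI with a box kernel: probabilities estimated as the
        # fraction of samples within radius r of each point (the kernel
        # volumes cancel between the joint and marginal terms)
        in_x = np.abs(x[:, None] - x[None, :]) <= r
        in_y = np.abs(y[:, None] - y[None, :]) <= r
        px, py = in_x.mean(axis=1), in_y.mean(axis=1)
        pxy = (in_x & in_y).mean(axis=1)
        return np.mean(np.log2(pxy / (px * py)))

    def mi_ksg(x, y, k):
        # Kraskov et al. (2004) algorithm 1 (max-norm), converted to bits
        n = len(x)
        dx = np.abs(x[:, None] - x[None, :])
        dy = np.abs(y[:, None] - y[None, :])
        dj = np.maximum(dx, dy)
        np.fill_diagonal(dj, np.inf)
        eps = np.sort(dj, axis=1)[:, k - 1]         # k-th neighbour distance, joint space
        nx = np.sum(dx < eps[:, None], axis=1) - 1  # marginal counts (excluding self)
        ny = np.sum(dy < eps[:, None], axis=1) - 1
        nats = digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
        return nats / np.log(2)

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 0.7 * x + rng.normal(size=1000)  # true MI here is ~0.29 bits
    for r in (0.1, 0.5, 2.0):
        print("box kernel, r =", r, "->", round(mi_box_kernel(x, y, r), 3))
    for k in (2, 4, 16):
        print("KSG, k =", k, "->", round(mi_ksg(x, y, k), 3))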

So to reiterate, information theory is fundamentally about questions and answers - the better you can keep that in mind, the better you will understand information theory and its tools.


UPDATE - 13/12/12 - My colleague Oliver Obst provided a perfect quote about this: "Better a rough answer to the right question than an exact answer to the wrong question" - attributed to Lord Kelvin.

-------------------
Footnotes:
(1) Here's a controversial statement: I suggest that it can be valid to make information-theoretic measurements on non-stationary processes. This simply changes the question that is being asked, to something like: "how much uncertainty is there in the state of this non-stationary variable X, if we don't know how the joint probability distribution of the non-stationary process is operating at this specific time, given what we're assuming to be a representative sample of the joint probability distribution weighted over all possible ways it may operate?". Now, obviously that's quite a mouthful, but I'm trying to capture the intuition that one could validly consider how much information it takes to predict X if we don't know the specifics of the non-stationarity at this particular point in time, but do know the overall distribution of X (covering all possible behaviours). So long as one bears in mind that a different question is being asked (indeed a question that is quite different to the intended use of the measure), then certainly the answer can be validly interpreted. Of course, the bigger issue is in properly sampling the PDF of X over all possible behaviours, but that's another story.
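As a toy illustration of this (entirely hypothetical): suppose each run of a non-stationary process has drifted to its own coin bias by the time we measure, and we pool one observation per run to sample the ensemble:

    import numpy as np
    rng = np.random.default_rng(2)

    # hypothetical ensemble: each run of the non-stationary process has
    # drifted to its own coin bias by the time we take our measurement
    biases = rng.uniform(0.2, 0.8, size=5000)    # one bias per run
    x = (rng.random(5000) < biases).astype(int)  # one observation of X per run
    p = np.bincount(x, minlength=2) / len(x)
    print(-np.sum(p[p > 0] * np.log2(p[p > 0]))) # ~1 bit over the ensemble

    # if we DID know how the process was operating in each run, the average
    # uncertainty would be lower - a different question, a different answer:
    print(np.mean(-(biases * np.log2(biases) + (1 - biases) * np.log2(1 - biases))))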
 