Wednesday, December 12, 2012

Information theory: questions and answers

Information theory is fundamentally about questions and answers.

We understand information itself in terms of questions and answers: 1 bit of information is the uncertainty in the answer to a question with a 50-50 outcome, e.g. "will this coin flip give tails?".
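
To make that concrete: for a fair coin the entropy works out as H(X) = -Σ_x p(x) log2 p(x) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit, whereas a biased coin landing tails 90% of the time carries only about 0.47 bits of uncertainty.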

Just as importantly though, the measures of information theory themselves are all about questions and answers too.

For the basic measures, the questions they ask seem fairly obvious. Shannon entropy asks:
"How much uncertainty is there in the state of this variable X?".
Mutual information asks "how much information does the state of variable X tell me about the state of Y?", while conditional mutual information asks "how much information does the state of variable X tell me about the state of Y, given that I already know the state of Z?"
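
To make these questions concrete, here's a minimal sketch in Python of computing entropy and mutual information for discrete variables (my own illustration; the joint distribution p_xy is made up):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = p[p > 0]  # zero-probability states contribute nothing (and avoid log 0)
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.flatten())

# A made-up joint distribution over two binary variables X and Y:
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(entropy(p_xy.sum(axis=1)))   # H(X) = 1 bit (X is fair)
print(mutual_information(p_xy))    # I(X;Y) ≈ 0.278 bits
```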

But I want to make a few more subtle points about these questions and answers.

In my opinion (which is of course the only correct one), the answers that the measures give are always correct. If you think they're wrong, then you're asking the wrong question, or have malformed the question in some way. There are plenty of ways to go wrong here, or at least to inadvertently change the question that you're asking.

I see the sample data itself as part of the question that a measure is answering. When you estimate the probability distribution functions (PDFs) empirically from a given sample data set, your original question about entropy really becomes:
"How much uncertainty is there in the state of this variable X, given what we're assuming to be a representative sample of realisations x of X here?"
Of course, your representative sample could simply be too short, and thereby completely misrepresent the PDF. Or you could get into trouble with stationarity (1) of the process - you might implicitly have appended "given what we're assuming to be a representative stationary sample here" to the question, but that assumption may not be true.
In both cases, the measure will give the correct answer to your question, but it might not be the question you really intended to ask.
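
To illustrate how the sample becomes part of the question, here's a little sketch (a made-up example) of the plug-in entropy estimate for a fair eight-sided die: short samples systematically understate the true 3 bits.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [10, 100, 10000]:
    sample = rng.integers(0, 8, size=n)              # n rolls of a fair 8-sided die
    freqs = np.bincount(sample, minlength=8) / n     # empirical (plug-in) PDF
    p = freqs[freqs > 0]
    h_est = -np.sum(p * np.log2(p))
    print(f"n={n}: estimated H = {h_est:.2f} bits (true H = 3 bits)")
```

The estimator answers the question it was asked, about the sample in hand; for n=10 that question is just very different from the one about the true die.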

As another way of inadvertently changing the question, one must realise that for the same information-theoretic measure, different estimators (or indeed different parameter settings for the same estimator) answer different questions. Take the mutual information, for example, which one could measure on continuous-valued data via (box) kernel estimation. Using this estimator, the measure asks: "how much information does knowing the state of variable X within radius r tell me about the state of variable Y within radius r?" Clearly, using different parameter values for r amounts to asking different questions - potentially very different questions if one uses radically different scales for r.

Going further, one could measure the mutual information using the enhanced Kraskov-Grassberger kernel estimation technique. With this estimator, the mutual information measure asks "how much information does knowing the state of variable X tell me about the state of variable Y, to the precision defined by the k nearest neighbours in the joint X-Y space of the sample data set?" Apart from that being something of a mouthful, it's obviously a different question to the one the box kernel estimation is asking. And again, changing the parameter k changes the question being asked as well.
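
To see the r-dependence in action, here's a rough sketch of a box kernel mutual information estimate (a bare-bones illustration of the neighbour-counting idea, not the implementation from any particular toolkit):

```python
import numpy as np

def mi_box_kernel(x, y, r):
    """Crude box-kernel MI estimate in bits: the average over samples of
    log2[ p_hat(x,y) / (p_hat(x) p_hat(y)) ], where each density is
    estimated by counting points within radius r (max-norm box)."""
    n = len(x)
    mi_sum = 0.0
    for i in range(n):
        nx = np.sum(np.abs(x - x[i]) <= r)   # neighbours of x_i within r (incl. i)
        ny = np.sum(np.abs(y - y[i]) <= r)   # neighbours of y_i within r (incl. i)
        nxy = np.sum((np.abs(x - x[i]) <= r) & (np.abs(y - y[i]) <= r))
        mi_sum += np.log2(n * nxy / (nx * ny))
    return mi_sum / n

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = x + rng.normal(scale=0.5, size=2000)     # y is correlated with x

for r in [0.1, 0.5, 2.0]:                    # different r: different questions
    print(f"r={r}: MI estimate = {mi_box_kernel(x, y, r):.3f} bits")
```

The three answers differ because the three questions differ - none of them is "wrong".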

So to reiterate, information theory is fundamentally about questions and answers - the better you can keep that in mind, the better you will understand information theory and its tools.


UPDATE - 13/12/12 - My colleague Oliver Obst provided a perfect quote about this: "Better a rough answer to the right question than an exact answer to the wrong question" - attributed to John Tukey.

-------------------
Footnotes:
(1) Here's a controversial statement: I suggest that it can be valid to make information-theoretic measurements on non-stationary processes. This simply changes the question that is being asked to something like: "how much uncertainty is there in the state of this non-stationary variable X, if we don't know how the joint probability distribution of the non-stationary process is operating at this specific time, given what we're assuming to be a representative sample of the joint probability distribution weighted over all possible ways it may operate?". Now, obviously that's quite a mouthful, but I'm trying to capture the intuition that one could validly consider how much information it takes to predict X if we don't know the specifics of the non-stationarity at this particular point in time, but do know the overall distribution of X (covering all possible behaviours). So long as one bears in mind that a different question is being asked (indeed a question that is quite different to the intended use of the measure), then certainly the answer can be validly interpreted. Of course, the bigger issue is in properly sampling the PDF of X over all possible behaviours, but that's another story.

Saturday, March 17, 2012

Workshop on Non-linear Interdependence Measures in Neuroscience

I'm pleased to announce the NeFF-Workshop on Non-linear and model-free Interdependence Measures in Neuroscience and TRENTOOL course which will be held at Goethe University Frankfurt, Germany on April 26-27, 2012 (hosted by the MEG Unit of the Brain Imaging Centre, Frankfurt).

The synopsis from the workshop announcement is as follows:
Understanding complex systems composed of many interacting units, such as neural networks, means understanding their directed and causal interactions. If the units in question interact in a nonlinear way, as can be assumed in neural networks, we are faced with the problem that the analysis of interactions must be blind to the type of interaction if we want to cover all possible interactions in the network, as we may not know the type of nonlinear interaction a priori. Prematurely limiting our search to specific models, nonlinearities or, even worse, linear interactions may block the road to discovery. Novel model-free techniques for the quantification of directed interactions from information theory offer a promising alternative to more traditional methods in the field of interaction analyses, but also come with their own specific challenges. This symposium brings together the most active researchers in the field to discuss the state of the art, future prospects and challenges on the way to a model-free, information-theoretic assessment of neuronal directed interactions.
I'm happy to be co-organising this workshop with Michael Wibral (head of the MEG Unit, Brain Imaging Center, Goethe University Frankfurt) and Raul Vicente (Frankfurt Institute for Advanced Studies).

We've got several speakers lined up to talk about their work in this field, particularly using information-theoretic tools including the transfer entropy. The speakers include some collaborators of mine (e.g. Mikhail Prokopenko, Paul Williams), many others I'm looking forward to meeting (e.g. Stefano Panzeri, Luca Faes), and the organisers of course :).

There will also be a workshop on Michael and Raul's transfer entropy toolbox (TRENTOOL), which is designed to provide effective network analysis on neural data sets in Matlab. I'm looking forward to playing around with this more myself; I've already got a project and some data in mind.

We're hoping to get lots of participants (though space is limited) - full details on how to register are available at the workshop website - http://www.neff-ffm.de/de/veranstaltungen/seminars/workshop.php

I hope to see you there!

Sunday, March 4, 2012

Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: a walk counting approach

Updated 26 October - our paper has been published in Europhysics Letters 99 68007 (2012) doi:10.1209/0295-5075/99/68007 - we've also updated the preprint on arXiv with the revised material.

Update 3 December - MPI made a press release about the paper, and New Scientist (German edition) published an article about it (in German).

-------------------------------------------------------------------

The news at this end is that Frank Bauer and I just submitted a new preprint on arXiv:

F. Bauer and J.T. Lizier, "Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: a walk counting approach", MPI MIS Preprint 1/2012, arXiv:1203.0502, 2012.

We introduce a new method to efficiently approximate the number of infections resulting from a given initially-infected node in a network of susceptible individuals. Our approach is based on counting the number of possible infection walks of various lengths to each other node in the network. We analytically study the properties of our method, in particular demonstrating different forms for SIS and SIR disease spreading (e.g. under the SIR model our method counts self-avoiding walks). In comparison to existing methods to infer the spreading efficiency of different nodes in the network (based on degree, k-shell decomposition analysis and different centrality measures), our method directly considers the spreading process and, as such, is unique in providing estimation of actual numbers of infections. Crucially, in simulating infections on various real-world networks with the SIR model, we show that our walks-based method improves the inference of effectiveness of nodes over a wide range of infection rates compared to existing methods. We also analyse the trade-off between estimate accuracy and computational cost, showing that the better accuracy here can still be obtained at a comparable computational cost to other methods.

Epidemic spreading in biological, social, and technological networks has recently attracted much attention. The structure of such networks is generally complex and heterogeneous, so a key question in this domain is: "Given a first infected individual of the network (patient zero) - how likely is it that a substantial part of the network will become infected?" It is, of course, of particular interest to identify the most influential spreaders. This knowledge could, for instance, be used to prioritise vaccinations.

The most obvious and direct way to address this question is to estimate the number of infections by running simulations of the infection model. There are several well-known infection models which can be used to simulate diseases with different properties. These include the SIR (susceptible-infected-removed) model for diseases where a subject may only be infected once (due to either recovery with full immunity or death), and the SIS (susceptible-infected-susceptible) model for diseases where infected subjects recover and become susceptible to reinfection. To run a simulation, one must consider the network structure connecting susceptible individuals, and the infection rate or probability β that an infected individual will infect a given neighbour.
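
For a flavour of what such a simulation involves, here's a minimal discrete-time SIR sketch (my own simplified illustration on a toy networkx graph, not the paper's code):

```python
import random
import networkx as nx

def simulate_sir(graph, patient_zero, beta, rng=random):
    """Discrete-time SIR: each infected node infects each susceptible
    neighbour with probability beta, then is removed. Returns the
    total number of nodes ever infected."""
    infected = {patient_zero}
    removed = set()
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbour in graph.neighbors(node):
                if (neighbour not in infected and neighbour not in removed
                        and neighbour not in newly_infected):
                    if rng.random() < beta:
                        newly_infected.add(neighbour)
        removed |= infected            # infected nodes recover/die (immune)
        infected = newly_infected
    return len(removed)

# Average outbreak size from node 0 over 100 runs on a toy network:
g = nx.barabasi_albert_graph(1000, 3, seed=42)
runs = [simulate_sir(g, 0, beta=0.1) for _ in range(100)]
print(sum(runs) / len(runs))
```

Multiply this by every node as patient zero, thousands of runs per node and many values of β, and the computational cost quoted below becomes easy to believe.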

One problem with running simulations, however, is that it takes a lot of computational time to obtain appropriate accuracy. For example, for an SIR model we ran on a network of 27519 nodes and 116181 undirected edges, 10000 simulated initial infections per node for around 20 values of infection rate β took 2000 hours to simulate. Of course, this runtime does not scale well with the size of the network, yet for real-world problems it is generally the large networks that we are genuinely interested in.

As such, it would be useful to find a more efficient way than full simulation to estimate infection numbers, and/or to be able to use local network properties of nodes to understand their spreading efficiencies.
So the problem that we are addressing here is two-fold:
  1. How can we efficiently estimate the number of infections resulting from a given initially-infected node in a network of susceptible individuals?
  2. Which network structural properties, local to the initially-infected node, are most useful for predicting how well disease will spread from it?
In fact, there has been a lot of work recently trying to find local network properties of nodes that are useful in predicting the relative spreading influence of different initially-infected nodes in a network. This attempts to address problem 1, but additionally gives very useful insight into how local network structure can promote or inhibit disease spreading (i.e. problem 2). The properties other authors have investigated range from simply examining out-degree, to k-shell analysis, to various measures of node centrality in the network (e.g. eigenvector centrality). And the good news is that you get a surprisingly accurate insight into the relative spreading efficiency of the various nodes with these very simple measures. Such work has attracted a lot of attention, for example being published in Nature Physics.
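
For reference, these predictors are straightforward to compute; here's a hypothetical sketch using networkx (the toy graph is made up):

```python
import networkx as nx

g = nx.barabasi_albert_graph(1000, 3, seed=42)

degree = dict(g.degree())              # node degree (out-degree for directed nets)
k_shell = nx.core_number(g)            # k-shell index of each node
eig_cent = nx.eigenvector_centrality(g)

# Rank candidate spreaders by each predictor:
top_by_degree = sorted(degree, key=degree.get, reverse=True)[:5]
top_by_kshell = sorted(k_shell, key=k_shell.get, reverse=True)[:5]
print(top_by_degree, top_by_kshell)
```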

However, we observed two issues with these approaches:
  1. They only infer the relative spreading efficiency of initially infected nodes; i.e. they do not provide an estimate of the actual numbers of infections resulting from each node. These actual infection numbers could be very important in many applications.
  2. While the existing inference measures do a good job, they do not actually directly consider the mechanics of disease spreading. As such, there is still room for improvement. As an example of potential improvement areas: none of the above-mentioned measures change their relative inference with the rate of infection β.
So, we sat down and thought hard about the local network properties that best relate to the mechanics of disease spreading. We focussed on the fact that disease spreads on a network in the manner of a walk. The disease can only reach node B from node A on a walk from A to B, where every node on that walk is also infected. Our basic premise is that the count of the number of walks from the initially infected node to other susceptible nodes (an approach known as walk counting) should be a local network structural property that gives good insight into disease spreading. The idea had been previously raised in the literature, but not properly examined.
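
To make the walk-counting premise concrete: the number of walks of length ℓ between nodes i and j is the (i, j) entry of A^ℓ, where A is the network's adjacency matrix. Here's a rough sketch (this illustrates the general premise only, not our actual method - in particular it counts all walks, including the non-self-avoiding ones that must be excluded for SIR):

```python
import numpy as np
import networkx as nx

g = nx.barabasi_albert_graph(1000, 3, seed=42)
A = nx.to_numpy_array(g)

max_length = 4
walk_counts = []              # walk_counts[l-1][i, j] = # walks of length l from i to j
power = np.eye(len(A))
for length in range(1, max_length + 1):
    power = power @ A         # power is now A^length
    walk_counts.append(power.copy())

# Total number of walks of length 3 starting from node 0:
print(walk_counts[2][0].sum())
```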

We developed the idea further, working out the mathematics to turn these walk counts into estimates of infection numbers, as a function of infection rate β (a crude sketch of this weighting appears after the list below). Interestingly, different types of walks are involved for different disease spreading models; e.g. for SIR spreading one is only interested in self-avoiding walks (since no node can be infected twice), whereas for SIS spreading one is interested in any type of walk. Our estimates are not perfect, and we identify where and how known errors will occur. However, the estimates have several very important and useful properties compared to other approaches:
  1. Our technique provides estimates of actual numbers of infections from a given initially-infected node, which is more useful than inferred relative spreading efficiency alone. It's useful to know who the most influential spreaders are, but you also want to know the extent to which they will spread the disease.
  2. Importantly, testing with simulations on various social network structures reveals that our technique infers more accurate relative spreading efficiencies than those of the aforementioned previously published techniques over a wide range of infection rates β (up to the lower super-critical spreading regime). This is because our technique directly considers the mechanics of disease spreading.
  3. And our technique has excellent computational efficiency. Note that there is a trade-off in our technique between computational efficiency and accuracy: by considering only short infection walks, our algorithm runs faster, but better accuracy is generally obtained by considering longer walks. Our short-walk estimates can be made in approximately the same run-time as the aforementioned techniques, but with greater accuracy in inferring relative spreading efficiency. Our longer-walk estimates produce better accuracy still, and do so with orders of magnitude less runtime than simulations of the disease spreading process that obtain the same accuracy.
As such, we show that our walk counting approach provides a unique combination of features: walk counts are the most relevant local network structural feature for inferring relative disease spreading efficiency, they provide estimates of actual infection numbers, and they are computationally efficient.
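
To give a flavour of the β-weighting mentioned above, here's a deliberately crude sketch (my own simplification: it treats each counted walk as transmitting independently with probability β^ℓ, and ignores the self-avoiding-walk correction for SIR, both of which the paper handles properly):

```python
import numpy as np

def crude_infection_estimate(walk_counts, source, beta):
    """Naive estimate: for each target node, the probability that at
    least one counted walk from source transmits, treating walks as
    independent; then sum those probabilities over targets."""
    n = walk_counts[0].shape[0]
    p_not_infected = np.ones(n)   # prob. that no counted walk reaches each node
    for length, counts in enumerate(walk_counts, start=1):
        p_not_infected *= (1 - beta ** length) ** counts[source]
    p_infected = 1 - p_not_infected
    p_infected[source] = 0        # don't count the source itself
    return p_infected.sum()

# e.g., using the walk_counts computed in the previous sketch:
# print(crude_infection_estimate(walk_counts, 0, beta=0.1))
```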

Comments/suggestions welcome, of course ...
 