Sunday, March 4, 2012

Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: a walk counting approach

Updated 26 October - our paper has been published in Europhysics Letters 99 68007 (2012) doi:10.1209/0295-5075/99/68007 - we've also updated the preprint on arXiv with the revised material.

Update 3 December - MPI made a press release about the paper, and New Scientist (German edition) published an article about it (in German).


The news at this end is that Frank Bauer and I just submitted a new preprint on arXiv:

F. Bauer and J.T. Lizier, "Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: a walk counting approach", MPI MIS Preprint 1/2012, arXiv:1203.0502, 2012.

We introduce a new method to efficiently approximate the number of infections resulting from a given initially-infected node in a network of susceptible individuals. Our approach is based on counting the number of possible infection walks of various lengths to each other node in the network. We analytically study the properties of our method, in particular demonstrating different forms for SIS and SIR disease spreading (e.g. under the SIR model our method counts self-avoiding walks). In comparison to existing methods to infer the spreading efficiency of different nodes in the network (based on degree, k-shell decomposition analysis and different centrality measures), our method directly considers the spreading process and, as such, is unique in providing estimation of actual numbers of infections. Crucially, in simulating infections on various real-world networks with the SIR model, we show that our walks-based method improves the inference of effectiveness of nodes over a wide range of infection rates compared to existing methods. We also analyse the trade-off between estimate accuracy and computational cost, showing that the better accuracy here can still be obtained at a comparable computational cost to other methods.

Epidemic spreading in biological, social, and technological networks has recently attracted much attention. The structure of such networks is generally complex and heterogeneous, so a key question in this domain is: "Given a first infected individual of the network (patient zero) - how likely is it that a substantial part of the network will become infected?" It is, of course, of particular interest to identify the most influential spreaders. This knowledge could, for instance, be used to prioritise vaccinations.

The most obvious and direct way to address this question is to estimate the number of infections by running simulations of the infection model. There are several well-known infection models which can be used to simulate diseases with different properties. These include the SIR (susceptible-infected-removed) model for diseases where a subject may only be infected once (due to either recovery with full immunity or death), and the SIS (susceptible-infected-susceptible) model for diseases where infected subjects recover and become susceptible to reinfection. To run a simulation, one must consider the network structure connecting susceptible individuals, and the infection rate or probability β that an infected individual will infect a given neighbour.

One problem with running simulations however, is that it takes a lot of computational time to obtain appropriate accuracy. For example, for an SIR model we run on a network of 27519 nodes and 116181 undirected edges, 10000 simulated initial infections per node for around 20 values of infection rate β took 2000 hours to simulate. Of course, this runtime does not scale well with the size of the network, while for real-world problems it is generally the large networks that we are genuinely interested in.

As such, it would be useful to find a more efficient way than full simulation to estimate infection numbers, and/or to be able to use local network properties of nodes to understand their spreading efficiencies.
So the problem that we are addressing here is two-fold:
  1. How to efficiently estimate the number of infections resulting from a given initially-infected node in a network of susceptible individuals?
  2. What network structural properties which are local to the initially-infected node are most useful for predicting how well disease will spread from it?
In fact, there has been a lot of work recently trying to find local network properties of nodes that are useful in predicting the relative spreading influence of different initially-infected nodes in a network. This attempts to address problem 1, but additionally gives very useful insight into how local network structure can promote or inhibit disease spreading (i.e. problem 2). The properties other authors have investigated range from simply examining out-degree, to k-shell analysis, to various measures of node centrality in the network (e.g. eigenvector centrality). And the good news is that you get a surprisingly accurate insight into the relative spreading efficiency of the various nodes with these very simple measures. Such work has attracted a lot of attention, for example being published in Nature Physics.

However, we observed two issues with these approaches:
  1. They only infer the relative spreading efficiency of initially infected nodes; i.e. they do not provide an estimate of the actual numbers of infections resulting from each node. These actual infection numbers could be very important in many applications.
  2. While the existing inference measures do a good job, they do not actually directly consider the mechanics of disease spreading. As such, there is still room for improvement. As an example of potential improvement areas: none of the above-mentioned measures change their relative inference with the rate of infection β.
So, we sat down and thought hard about the local network properties that best relate to the mechanics of disease spreading. We focussed on the fact that disease spreads on a network in the manner of a walk. The disease can only reach node B from node A on a walk from A to B, where every node on that walk is also infected. Our basic premise is that the count of the number of walks from the initially infected node to other susceptible nodes (an approach known as walk counting) should be a local network structural property that gives good insight into disease spreading. The idea had been previously raised in the literature, but not properly examined.

We developed the idea further, working out the mathematics to turn these walk counts into estimates of infection numbers, as a function of infection rate β. Interestingly, different types of walks are involved for different disease spreading models; e.g. for SIR spreading one is only interested in self-avoiding walks (since no node can be infected twice), whereas for SIS spreading one is interested in any type of walk. Our estimates are not perfect, and we identify where and how known errors will occur. However, the estimates have several very important and useful properties compared to other approaches:
  1. Our technique provides estimates of actual numbers of infections from a given initially-infected node, which is more useful than inferred relative spreading efficiency alone. It's useful to know who the most influential spreaders are, but you also want to know the extent to which they will spread the disease.
  2. Importantly, testing with simulations on various social network structures reveals that our technique infers more accurate relative spreading efficiencies than those of the aforementioned previously published techniques over a wide range of infection rates β (up to the lower super-critical spreading regime). This is because our technique directly considers the mechanics of disease spreading.
  3. And our technique has excellent computational efficiency. Note that there is a trade-off in our technique between computational efficiency and higher accuracy: by considering only short infection walks, our algorithm runs faster, but better accuracy is generally obtained by considering longer walks. Our short-walk estimates can be made in approximately the same run-time as the aforementioned techniques, but with greater accuracy in inferring relative spreading efficiency. Our longer-walk estimates produce better accuracy again, and do so with orders of magnitude less runtime than simulations of the disease spreading process which obtain the same accuracy.
As such, we show that our walk counting approach provides the following unique combination of features: they are the most relevant local network structural feature to infer relative disease spreading efficiency, provide estimates of actual infection numbers, and are computationally efficient.

Comments/suggestions welcome, of course ...

No comments: