```
library(spatstat)
plot(X)
<- density(x = X, sigma = 10)
den plot(den, main = "Intensity")
contour(den, add = TRUE)
```

Assistant Professor of Statistics

King Abdullah University of Science

and Technology (KAUST), Saudi Arabia

Geospatial methods use data on disease cases, population at risk, and risk factors such as environmental, climate and socio-economic variables to

- Understand geographic and temporal patterns
- Identify potential risk factors
- Highlight high risk areas and detect clusters
- Measure inequalities
- Early detection of outbreaks

🗣 Results can be communicated using maps and other visualizations

✍️ Results guide decision-makers to better allocate limited resources and to design strategies for disease prevention and control

🌍 Many methods useful in fields such as ecology, environment, criminology

Moraga and Lawson, Computational Statistics & Data Analysis, 2012

Moraga et al., Parasites & Vectors, 2015

Moraga and Montes, Statistics in Medicine, 2011

**Standardized Mortality Ratio (SMR)** is often used to estimate disease risk

\[ SMR_i = \frac{Y_i}{E_i} = \frac{\mbox{number observed cases in area } i}{\mbox{number expected cases in area } i}\]

\(SMR_i = 1\) same number observed as expected

\(SMR_i > 1\) more observed than exp. (high risk)

\(SMR_i < 1\) less observed than exp. (low risk)

**Limitations**

SMRs easy to calculate but may be misleading and unreliable in areas with small populations or rare diseases. Models enable to incorporate covariates & borrow information from neighboring areas to obtain smoothed relative risks

Model to estimate disease relative risk \(\theta_i\) in areas \(i=1,\ldots,n\)

\[Y_i|\theta_i \sim Poisson(E_i \times \theta_i)\] \[\log(\theta_i) = \boldsymbol{z}_i \boldsymbol{\beta} + u_i + v_i\]

- \(Y_i\) observed cases, \(E_i\) expected cases, \(\theta_i\) relative risk in area \(i\)

Fixed effects quantify the effects of the covariates on the disease risk

- \(\boldsymbol{z}_i = (1, z_{i1}, \ldots, z_{ip})\) intercept and cov. \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)'\) coeff.

Random effects represent residual variation not explained by the covariates

- \(u_i\): structured spatial effect to account for spatial dependence between relative risks (close areas show more similar risk than not close areas)
- \(v_i\): unstructured effect to account for independent noise

\[Y_i|\theta_i \sim Poisson(E_i \times \theta_i)\] \[\log(\theta_i) = \boldsymbol{z}_i \boldsymbol{\beta} + u_i + v_i\]

```
library(INLA)
formula <- Y ~ cov +
f(idareau, model = "besag", graph = g, scale.model = TRUE) +
f(idareav, model = "iid")
res <- inla(formula, family = "poisson", data = map@data,
E = E, control.predictor = list(compute = TRUE))
# Posterior mean and 95% CI
map$RR <- res$summary.fitted.values[, "mean"]
map$LL <- res$summary.fitted.values[, "0.025quant"]
map$UL <- res$summary.fitted.values[, "0.975quant"]
```

```
library(leaflet)
pal <- colorNumeric(palette = "YlOrRd", domain = map$SIR)
leaflet(map) %>% addTiles() %>%
addPolygons(color = "grey", weight = 1, fillColor = ~ pal(SIR)) %>%
addLegend(pal = pal, values = ~SIR, title = "SIR")
```

Lymphatic filariasis caused by microscopic worms and transmitted by mosquitoes

Main strategy against the disease is Mass Drug Administration. Resources are limited and need to decide which areas most in need

3197 LF prevalence surveys, 1990-2014

Models to predict prevalence at unsampled locations

\[Y_i|P(\boldsymbol{x}_i)\sim \mbox{Binomial} (n_i, P(\boldsymbol{x}_i))\] \[\mbox{logit}(P(\boldsymbol{x}_i)) = \boldsymbol{z}_i \boldsymbol{\beta} + S(\boldsymbol{x}_i) + u_i\]

\(Y_i\) number people positive, \(n_i\) number people tested, \(P(\boldsymbol{x}_i)\) prevalence at \(\boldsymbol{x}_i\)

Covariates based on characteristics known to affect disease transmission (temperature, precipitation, vegetation, elevation, land cover, population density, etc.)

Random effects model residual variation not explained by covariates

```
library(INLA)
mesh <- inla.mesh.2d(
loc = coo, offset = c(50, 100),
cutoff = 1, max.edge = c(30, 60))
```

\[Y_i|P(\boldsymbol{x}_i)\sim \mbox{Binomial} (n_i, P(\boldsymbol{x}_i))\] \[\mbox{logit}(P(\boldsymbol{x}_i)) = \boldsymbol{z}_i \boldsymbol{\beta} + S(\boldsymbol{x}_i) + u_i\]

```
library(INLA)
formula <- y ~ 0 + b0 + alt + f(s, model=spde) + f(idu, model="iid")
res <- inla(formula, family = "binomial", Ntrials = numtrials,
control.family = list(link = "logit"),
data = inla.stack.data(stk.full),
control.predictor = list(compute = TRUE, link = 1,
A = inla.stack.A(stk.full)))
index <- inla.stack.index(stack = stk.full, tag = "pred")$data
# Posterior mean and 95% CI
prev_mean <- res$summary.fitted.values[index, "mean"]
prev_ll <- res$summary.fitted.values[index, "0.025quant"]
prev_ul <- res$summary.fitted.values[index, "0.975quant"]
```

Assume point pattern \(\{s_i: i=1, \ldots, n\}\) has been generated as a realization of a point process \(Z = \{Z(s): s \in \mathbb{R}^2\}\)

A point process model can be used to estimate the intensity of events, identify patterns in the distribution of the observed locations, and learn about the correlation between the locations and spatial covariates

Intensity of the process: mean number of events per unit area at location \(s\) \[\lambda(s) = lim_{|ds|\rightarrow 0} \frac{E[N(ds)]}{|ds|}\]

Simple model that assumes events are equally likely to occur at any location within the study area, independent of the locations of other events

Typically used to model phenomena that are environmentally driven. Poisson process with a varying intensity which is itself a stochastic process of the form \(\Lambda(s) = exp(\eta(s))\), \(\eta = \{\eta(s): s \in \mathbb{R}^2\}\) spatial Gaussian process

number of events in any region \(A\) is distributed as \(Po(\int_A \Lambda(s)ds)\)

locations indep. random sample with probability density proportional to \(\Lambda(s)\)

Disease mapping is important to understand geographic and temporal patterns of diseases and allocate resources where most needed

Often, maps given at an areal resolution which difficulties decision-making

Map shows malaria prevalence in Mozambique. However, disease risk varies continuously in space & areal data unable to show how risk varies within areas

Areal estimates make difficult targeting health interventions and directing resources where most needed

High-resolution estimates permit to find differences in disease risk within study regions, and identify areas and groups of people at higher risk

Model assumes there is a spatially continuous variable underlying all observations that can be modeled using a zero-mean Gaussian random field

\[\begin{equation*} \begin{aligned} Y(\mathbf{x}) & \sim \pi \left( \theta(\mathbf{x}), \tau \right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ \theta(\mathbf{x}_i) & = g^{-1}\left(\mu(\mathbf{x}_i)+S\left(\mathbf{x}_i\right) \right), \quad i=1, \ldots, n, \\ \theta(B_i) & =\left|B_i\right|^{-1} \int_{B_i} g^{-1}(\mu (\mathbf{x}) + S(\mathbf{x})) d \mathbf{x}, \quad i=n+1, \ldots, n+m. \end{aligned} \end{equation*}\]

Inference using INLA and a modification of the SPDE approach

Integrated nested Laplace approximations (INLA) is a computational approach to perform approximate Bayesian inference in latent Gaussian models

In the SPDE approach, the continuously indexed Gaussian random field \(S\) is represented as a discretely indexed Gaussian Markov random field (GMRF) by means of a finite basis function defined on a triangulation of the study region

\[S(\boldsymbol{x}) = \sum_{g=1}^G \psi_g(\boldsymbol{x}) S_g\]

\(\psi_g(\cdot)\) piecewise polynomial basis functions on each triangle

\(\{S_g \}\) zero-mean Gaussian distributed

\(G\) number of vertices in triangulation

\(S(\boldsymbol{x})\) weighted average of the GMRF values at the vertices of the triangle containing the point. Weights = barycentric coordinates

\[S(\boldsymbol{x}) \approx \frac{T_{1}}{T}S_1 + \frac{T_{2}}{T}S_2 + \frac{T_{3}}{T}S_3\]

\(T_1, T_2, T_3\) areas subtriangles formed by \(\boldsymbol{x}\) and vertices. \(T\) area whole triangle

\(S(B)=|B|^{-1} \int_{B} S(\boldsymbol{x})d\boldsymbol{x}\) weighted average of the GMRF values at the vertices of the triangles within the area. Weights = \(\mbox{ (number vertices)}^{-1}\)

\[S(B) \approx \frac{1}{m} \sum_{g \in B} S_g\]

\[S(\boldsymbol{x}_i) \approx \sum_{g=1}^G A_{ig} S_g\ \ \ \ \ \ \ \ \ \ \ S(B_j) \approx \sum_{g=1}^G A_{jg} S_g\] \(A\) projection matrix that maps GMRF from observations to triangulation nodes

Row \(i\) of point observation: possibly three non-zero values at columns of vertices of the triangle containing the point (= barycentric coordinates)

Row \(j\) of area: non-zero values in all the m vertices inside the area (= \(1/m\))

\[A = \begin{bmatrix} A_{11} & A_{12} & A_{13} & \dots & A_{1G} \\ A_{21} & A_{22} & A_{23} & \dots & A_{2G} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & \dots & A_{nG} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & \dots & 0 \\ .2 & .2 & 0 & \dots & .6 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1/m & 1/m & 1/m & \dots & 0 \end{bmatrix}\]

\[\begin{aligned} Y(\mathbf{x}) & \sim \operatorname{Binomial}\left(N(\mathbf{x}), P(\mathbf{x})\right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ P(\mathbf{x_i}) & = \text{logit}^{-1}\left(\mu(\mathbf{x}_i)+S(\mathbf{x}_i)\right), \quad i=1, \ldots, n, \\ P(B_i) & = \left|B_i\right|^{-1} \int_{B_i} \text{logit}^{-1} \left(\mu (\mathbf{x}) + S(\mathbf{x}) \right) d \mathbf{x}, \quad i=n+1, \ldots, n+m. \end{aligned}\]

A disease cluster is an unusual aggregation of cases occurring together in a particular place and time. Traditional cluster detection methods utilizing areal data often identify clusters comprising multiple areas, despite disease risk varying continuously in space

We propose a method to detect clusters of any shape indep. of boundaries

First obtain risk surfaces from a Bayesian spatial disaggregation model

\[\begin{equation*} \begin{aligned} Y(\mathbf{x}) & \sim \pi \left( \theta(\mathbf{x}), \tau \right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ \theta(\mathbf{x}_i) & = g^{-1}\left(\mu(\mathbf{x}_i)+S\left(\mathbf{x}_i\right) \right), \quad i=1, \ldots, n, \\ \theta(B_i) & =\left|B_i\right|^{-1} \int_{B_i} g^{-1}(\mu (\mathbf{x}) + S(\mathbf{x})) d \mathbf{x}, \quad i=n+1, \ldots, n+m. \end{aligned} \end{equation*}\]

Then use exceedance probabilities to identify high-risk locations

\[P(\theta(\mathbf{x}) > \mbox{threshold})\]

Through simulation, the disaggregation model showed high sensitivity and competitive specificity when compared to the circular scan statistic, flexible scan statistic, and exceedance probabilities from a Bayesian areal model

Simulation of a cluster with shape rectangle with a hole

Detecting clusters of lung cancer in Pennsylvania

Disease surveillance systems are critical to early detection of epidemics and the design of control strategies

Traditional surveillance systems rely on data gathered with a considerable delay and make surveillance systems ineffective for real-time surveillance

Real-time digital information may enable to detect outbreaks earlier

“Flu plus fever, not a good way to start the weekend”

“I’m so irritated at this cough and fever”

“This flu, fever & throat ache won’t let me sleep”

**Data-gathering platform****Modelling framework**that integrates multiple data sources so as to produce local probabilistic predictions of disease activity in real-time**Interactive dashboard**that alerts public health officials when elevated disease levels are anticipated, and provides insights about disease drivers

Dengue is a mosquito-borne disease caused by four dengue virus serotypes and transmitted by mosquitoes of the *Aedes* species. Dengue poses significant public health challenges in tropical and sub-tropical regions, including Brazil.

Many dengue cases only result in mild, flu-like illness, but some can be severe and even fatal.

Dengue does not have a specific treatment, but early detection and timely access to proper medical care significantly reduce the fatality rates associated with severe cases. Prevention focuses on personal protection and mosquito control.

Surveillance systems are crucial for dengue prevention and control.

*Aedes aegypti*

Vector control efforts

In Brazil, the InfoDengue system collects and generates indicators of dengue and other arboviruses: https://info.dengue.mat.br/

In principle, dengue is meant to be reported within seven days of case identification. In practice,

- Less than 50% cases are reported within one week
- Less than 75% cases reported within four weeks
- No more than 90% cases reported within nine weeks

Reported dengue cases in Rio de Janeiro, January 2011 to April 2012. Red line reported cases for those weeks.

Black line eventually reported cases after 10 weeks.

Assess the value Google Trends information to complement reported dengue cases for understanding dengue activity levels

Google Trends index for a specific keyword is an index ranging from 0 to 100. Calculated using the number of searches for that keyword divided by the total number of searches of the region and time period considered to compare relative popularity

Weekly Google Trends index for keyword ‘dengue’ in Brazil, 2019 to 2024.

We wish to assess the value of Google Trends for weekly dengue nowcasting in the 27 Brazilian states using dengue data from March to June 2024.

Each we collect reported dengue cases and Google Trends indices, fit several nowcasting models using different information, and compare nowcasts with the actual cases reported after 10 weeks. Models incorporate:

- Only dengue cases
- Only Google Trends data
- Both cases and Google Trends
- InfoDengue joint model for reported cases and delay distribution
- Naive approach where nowcasts are reported cases in previous week

Performance evaluated using error measures and uncertainty intervals

Results vary by state. In general, Google Trends and InfoDengue are the best-performing approaches

Dengue-tracker provides weekly updates on the number of dengue cases per state in Brazil

We present official and corrected case counts incorporating information from Google Trends

Reports assist policymakers

and the general public in understanding dengue levels

and guide their decisions

Geospatial health problems deal with data that come from different sources and are available at several spatial and spatio-temporal resolutions.

We have presented a flexible and fast model-based approach for the combination of spatially misaligned data that can be extended to address many problems of interest (including covariates, preferential sampling, spatio-temporal settings) and applied in a wide range of disciplines.

Integrating health, climate, environmental, socio-economic, and digital data sources enhances traditional surveillance systems, leading to better decision-making and improved health and well-being of the population.

Zhong, et al. (2024). Spatial data fusion adjusting for preferential sampling using integrated nested Laplace approximation and stochastic partial differential equation. *Journal of the Royal Statistical Society Series A: Statistics in Society*

Pavani, et al. (2023). Joint spatial modeling of the risks of co-circulating mosquito-borne diseases in Ceara, Brazil. *Spatial and Spatio-temporal Epidemiology*

Zhong and Moraga (2023). Bayesian Hierarchical Models for the Combination of Spatially Misaligned Data: A Comparison of Melding and Downscaler Approaches Using INLA and SPDE. *Journal of Agricultural, Biological and Environmental Statistics*, 29, 110–129

Ribeiro Amaral, et al. (2022). Spatio-temporal modeling of infectious diseases by integrating compartment and point process models. *Stochastic Environmental Research and Risk Assessment*, 37, 1519-1533

Moraga and Baker (2022). rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R. *F1000Research*, 11:77

Moraga. (2018). Small Area Disease Risk Estimation and Visualization Using R. *The R Journal*, 10(1):495-506

Moraga. (2017). SpatialEpiApp: A Shiny Web Application for the analysis of Spatial and Spatio-Temporal Disease Data. *Spatial and Spatio-temporal Epidemiology*, 23:47-57

Moraga, et al. (2017). A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE. *Spatial Statistics*, 21:27-41