# Bayesian modeling of spatially misaligned health and environmental data

### Paula Moraga, Ph.D.

Assistant Professor of Statistics

King Abdullah University of Science and Technology (KAUST), Saudi Arabia

# Digital health surveillance

## Geospatial methods for disease surveillance

Geospatial methods use disease, population and other data at several spatial and temporal resolutions to understand geographic and temporal patterns, identify risk factors, measure inequalities, and detect outbreaks early. Results help inform the development of strategies for disease prevention and control

# Precision disease mapping

## Disease mapping

Disease mapping is important to understand geographic and temporal patterns of diseases and allocate resources where most needed

Often, maps are given at an areal resolution, which hinders decision-making

The map shows malaria prevalence in Mozambique. However, disease risk varies continuously in space, and areal data cannot show how risk varies within areas

Areal estimates make it difficult to target health interventions and direct resources where they are most needed

## Disaggregate area-level data

High-resolution estimates make it possible to find differences in disease risk within study regions, and to identify areas and groups of people at higher risk

## Bayesian spatial disaggregation model

Model assumes there is a spatially continuous variable underlying all observations that can be modeled using a zero-mean Gaussian random field

\begin{equation*} \begin{aligned} Y(\mathbf{x}) & \sim \pi \left( \theta(\mathbf{x}), \tau \right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ \theta(\mathbf{x}_i) & = g^{-1}\left(\mu(\mathbf{x}_i)+S\left(\mathbf{x}_i\right) \right), \quad i=1, \ldots, n, \\ \theta(B_i) & =\left|B_i\right|^{-1} \int_{B_i} g^{-1}(\mu (\mathbf{x}) + S(\mathbf{x})) d \mathbf{x}, \quad i=n+1, \ldots, n+m, \end{aligned} \end{equation*}

Inference using INLA and a modification of the SPDE approach

Moraga et al., Spatial Statistics, 2017

## Inference using INLA and SPDE

Integrated nested Laplace approximations (INLA) is a computational approach to perform approximate Bayesian inference in latent Gaussian models

In the SPDE approach, the continuously indexed Gaussian random field $S$ is represented as a discretely indexed Gaussian Markov random field (GMRF) by means of a finite basis function representation defined on a triangulation of the study region

$S(\boldsymbol{x}) = \sum_{g=1}^G \psi_g(\boldsymbol{x}) S_g$

• $\psi_g(\cdot)$: piecewise polynomial basis functions on each triangle
• $\{S_g \}$: zero-mean Gaussian distributed weights
• $G$: number of vertices in the triangulation

## Point observations

$S(\boldsymbol{x})$ weighted average of the GMRF values at the vertices of the triangle containing the point. Weights = barycentric coordinates

$S(\boldsymbol{x}) \approx \frac{T_{1}}{T}S_1 + \frac{T_{2}}{T}S_2 + \frac{T_{3}}{T}S_3$

$T_1, T_2, T_3$: areas of the subtriangles formed by $\boldsymbol{x}$ and the vertices. $T$: area of the whole triangle
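The area-ratio weights can be checked numerically. The following is a minimal sketch in Python with NumPy, not the implementation used in the talk: the triangle, the point and the vertex values are hypothetical, and the helper names `triangle_area` and `barycentric_weights` are introduced only for illustration. It assumes the point lies inside the triangle, since unsigned areas are used.

```python
import numpy as np

def triangle_area(p, q, r):
    """Unsigned area of the triangle with corners p, q, r (2D points)."""
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def barycentric_weights(x, v1, v2, v3):
    """Weights T_k / T, where T_k is the subtriangle opposite vertex k.

    Valid only when x lies inside (or on the edge of) the triangle.
    """
    T = triangle_area(v1, v2, v3)
    T1 = triangle_area(x, v2, v3)   # subtriangle opposite v1
    T2 = triangle_area(v1, x, v3)   # subtriangle opposite v2
    T3 = triangle_area(v1, v2, x)   # subtriangle opposite v3
    return np.array([T1, T2, T3]) / T

# Hypothetical triangle and observation location
v1, v2, v3 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
x = (0.2, 0.2)
w = barycentric_weights(x, v1, v2, v3)

# Hypothetical GMRF values at the three vertices
S_vertices = np.array([1.0, 2.0, 4.0])
S_x = float(w @ S_vertices)   # interpolated value S(x)
```

The weights sum to one and reproduce the point itself when applied to the vertex coordinates, which is what makes them suitable interpolation weights.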

## Areal observations

$S(B)=|B|^{-1} \int_{B} S(\boldsymbol{x})d\boldsymbol{x}$ weighted average of the GMRF values at the vertices of the triangles within the area. Weights = $\mbox{ (number vertices)}^{-1}$

$S(B) \approx \frac{1}{m} \sum_{g \in B} S_g$

## Projection matrix

$S(\boldsymbol{x}_i) \approx \sum_{g=1}^G A_{ig} S_g \qquad S(B_j) \approx \sum_{g=1}^G A_{jg} S_g$

$A$ is the projection matrix that maps the GMRF from the triangulation nodes to the observations

• Row $i$ of a point observation: at most three non-zero values, at the columns of the vertices of the triangle containing the point (= barycentric coordinates)

• Row $j$ of an area: non-zero values at the columns of all $m$ vertices inside the area (= $1/m$)

$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} & \dots & A_{1G} \\ A_{21} & A_{22} & A_{23} & \dots & A_{2G} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & \dots & A_{nG} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & \dots & 0 \\ .2 & .2 & 0 & \dots & .6 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1/m & 1/m & 1/m & \dots & 0 \end{bmatrix}$
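The structure of $A$ can be illustrated with a small example. This is a sketch under stated assumptions rather than the construction used in INLA: the mesh has only $G = 5$ vertices, the vertex indices and GMRF values are hypothetical, and `point_row` and `area_row` are illustrative helper names.

```python
import numpy as np

G = 5  # hypothetical number of mesh vertices

def point_row(vertex_ids, weights, G):
    """Row for a point observation: barycentric weights at up to three vertices."""
    row = np.zeros(G)
    row[list(vertex_ids)] = weights
    return row

def area_row(vertex_ids_in_area, G):
    """Row for an areal observation: equal weight 1/m on the m vertices inside the area."""
    row = np.zeros(G)
    row[list(vertex_ids_in_area)] = 1.0 / len(vertex_ids_in_area)
    return row

A = np.vstack([
    point_row([2], [1.0], G),                     # point located exactly at vertex 2
    point_row([0, 1, 4], [0.2, 0.2, 0.6], G),     # point inside triangle (0, 1, 4)
    area_row([0, 1, 2], G),                       # area containing vertices 0, 1, 2
])

# Hypothetical GMRF values at the mesh vertices
S_g = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
S_obs = A @ S_g   # field values at the two points and the area
```

Every row sums to one, so each observation-level value is a weighted average of vertex values. In practice $A$ is sparse, and a sparse matrix format would be the natural choice for a realistic mesh.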

\begin{equation*} \begin{aligned} Y(\mathbf{x}) & \sim \operatorname{Binomial}\left(N(\mathbf{x}), P(\mathbf{x})\right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ P(\mathbf{x}_i) & = \text{logit}^{-1}\left(\mu(\mathbf{x}_i)+S(\mathbf{x}_i)\right), \quad i=1, \ldots, n, \\ P(B_i) & = \left|B_i\right|^{-1} \int_{B_i} \text{logit}^{-1} \left(\mu (\mathbf{x}) + S(\mathbf{x}) \right) d \mathbf{x}, \quad i=n+1, \ldots, n+m. \end{aligned} \end{equation*}

Alahmadi and Moraga (under review), 2024

# Detection of disease clusters

## Detection of disease clusters

A disease cluster is an unusual aggregation of cases occurring together in a particular place and time. Detection of clusters is crucial to determine whether they are due to chance or to specific environmental or occupational risk factors, so that resources can be allocated and health threats addressed more effectively

Disease cases are typically aggregated at areal resolution based on administrative boundaries, mainly for confidentiality reasons

Traditional cluster detection methods utilizing areal data often identify clusters comprising multiple areas, despite disease risk varying continuously in space

## Exceedance probabilities from Bayesian spatial disaggregation models

We propose a method to detect clusters of any shape, independently of boundaries

First obtain risk surfaces from a Bayesian spatial disaggregation model

\begin{equation*} \begin{aligned} Y(\mathbf{x}) & \sim \pi \left( \theta(\mathbf{x}), \tau \right), \quad \mathbf{x} \in A \subset \mathbb{R}^2, \\ \theta(\mathbf{x}_i) & = g^{-1}\left(\mu(\mathbf{x}_i)+S\left(\mathbf{x}_i\right) \right), \quad i=1, \ldots, n, \\ \theta(B_i) & =\left|B_i\right|^{-1} \int_{B_i} g^{-1}(\mu (\mathbf{x}) + S(\mathbf{x})) d \mathbf{x}, \quad i=n+1, \ldots, n+m, \end{aligned} \end{equation*}

Then use exceedance probabilities to identify high-risk locations

$P(\theta(\mathbf{x}) > \mbox{threshold})$
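One common way to approximate such a probability is as the share of posterior draws that exceed the threshold at each location. The sketch below assumes posterior samples of $\theta(\mathbf{x})$ are already available as an array. The draws, the threshold of 1.0 and the 0.95 cutoff for flagging a location are all hypothetical, and the talk does not state which computation was used.

```python
import numpy as np

def exceedance_probability(samples, threshold):
    """Fraction of posterior draws above the threshold, per location.

    samples: array of shape (n_draws, n_locations).
    """
    samples = np.asarray(samples, dtype=float)
    return (samples > threshold).mean(axis=0)

# Hypothetical posterior draws of risk at three locations (rows = draws)
draws = np.array([
    [0.8, 1.3, 2.1],
    [1.1, 0.9, 1.8],
    [0.7, 1.2, 2.4],
    [0.9, 1.6, 1.9],
])
threshold = 1.0
prob = exceedance_probability(draws, threshold)

# Flag locations whose exceedance probability reaches a hypothetical cutoff
high_risk = prob >= 0.95
```

With a real model, many more draws would be needed for stable estimates, and the threshold and cutoff would be chosen for the application.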

Alahmadi and Moraga (under review), 2024

## Simulation

Through simulation, the disaggregation model showed high sensitivity and competitive specificity when compared to the circular scan statistic, flexible scan statistic, and exceedance probabilities from a Bayesian areal model
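As a reminder of what these two measures capture, the sketch below computes them from a true cluster mask and a detected mask over a set of locations. Evaluating per location (for example per grid cell) is an assumption made here for illustration, and the masks are hypothetical. The simulation study may define the measures differently.

```python
import numpy as np

def sensitivity_specificity(truth, detected):
    """Sensitivity and specificity of a detected cluster against the true one.

    truth, detected: boolean arrays with one entry per location.
    """
    truth = np.asarray(truth, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    tp = np.sum(truth & detected)      # in cluster, flagged
    fn = np.sum(truth & ~detected)     # in cluster, missed
    tn = np.sum(~truth & ~detected)    # outside cluster, not flagged
    fp = np.sum(~truth & detected)     # outside cluster, flagged
    return float(tp / (tp + fn)), float(tn / (tn + fp))

# Hypothetical masks over ten locations
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
detected = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
sens, spec = sensitivity_specificity(truth, detected)
```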

## Results

Simulation of a cluster shaped as a rectangle with a hole

## Real application

Detecting clusters of lung cancer in Pennsylvania

# Digital health surveillance

## Disease surveillance systems

Disease surveillance systems are critical to early detection of epidemics and the design of control strategies

Traditional surveillance systems rely on data gathered with considerable delay, which makes them ineffective for real-time surveillance

## Digital data sources

Real-time digital information may enable earlier detection of outbreaks

“Flu plus fever, not a good way to start the weekend”

“I’m so irritated at this cough and fever”

“This flu, fever & throat ache won’t let me sleep”

## Digital health surveillance system

• Data-gathering platform
• Modelling framework that integrates multiple data sources so as to produce local probabilistic predictions of disease activity in real-time
• Interactive dashboard that alerts public health officials when elevated disease levels are anticipated, and provides insights about disease drivers

## Dengue in Brazil

Dengue is a mosquito-borne disease that poses significant public health challenges in tropical and sub-tropical regions, including Brazil.

Many dengue cases only result in mild, flu-like illness, but some can be severe and even fatal.

Dengue does not have a specific treatment, but early detection and timely access to proper medical care significantly reduce the fatality rates associated with severe cases. Prevention focuses on personal protection and mosquito control.

Surveillance systems are crucial for dengue prevention and control.

Aedes aegypti

Vector control efforts

## Reporting delays in official dengue cases

InfoDengue is a data collection and analysis system that generates indicators of dengue and other arboviruses in Brazil: https://info.dengue.mat.br/

In principle, dengue is meant to be reported within seven days of case identification. In practice,

• Fewer than 50% of cases are reported within one week
• Fewer than 75% of cases are reported within four weeks
• No more than 90% of cases are reported within nine weeks

Reported dengue cases in Rio de Janeiro, January 2011 to April 2012. Red line: cases reported for those weeks. Black line: cases eventually reported after 10 weeks.

Bastos, et al. Statistics in Medicine, 2019

## Dengue nowcasting in Brazil

The objective of this work is to assess the usefulness of Google Trends data for weekly dengue nowcasting in each of the 27 Brazilian states.

We collect weekly reported dengue cases and Google Trends indices, fit several nowcasting models using different information, and compare nowcasts with the actual cases reported after 10 weeks. Models incorporate:

• Only dengue cases
• Both cases and Google Trends
• InfoDengue joint model for reported cases and delay distribution
• Naive approach where nowcasts are reported cases in previous week

Performance evaluated using error measures and uncertainty intervals
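The slides do not say which error measures were used. As an illustration only, the sketch below scores a set of nowcasts with mean absolute error, root mean squared error and the empirical coverage of the uncertainty intervals. The choice of these three measures, the function name `nowcast_scores` and all numbers are assumptions, not results from the study.

```python
import numpy as np

def nowcast_scores(actual, nowcast, lower, upper):
    """MAE, RMSE and interval coverage of nowcasts against eventually reported cases."""
    actual = np.asarray(actual, dtype=float)
    nowcast = np.asarray(nowcast, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    err = nowcast - actual
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        # share of weeks whose interval contains the eventually reported count
        "coverage": float(np.mean((lower <= actual) & (actual <= upper))),
    }

# Hypothetical weekly values for one state
actual = [100, 120, 150, 130]    # cases reported after 10 weeks
nowcast = [90, 125, 140, 135]    # point nowcasts
lower = [70, 100, 120, 132]      # lower interval bounds
upper = [110, 150, 160, 160]     # upper interval bounds

scores = nowcast_scores(actual, nowcast, lower, upper)
```

Comparing these scores across the four approaches, state by state, is one way to organize the kind of comparison described above.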

## Preliminary results

Results vary by state. In general, Google Trends and InfoDengue are the best-performing approaches

## Dengue tracker in Brazil

Dengue-tracker provides weekly updates on the number of dengue cases per state in Brazil

We present official and corrected case counts incorporating information from Google Trends

Reports assist policymakers and the general public in understanding dengue levels and guide their decisions

Xiao, et al. (in preparation), 2024

## Conclusion

• Geospatial health problems deal with data that come from different sources and are available at several spatial and spatio-temporal resolutions

• We have presented a flexible and fast model-based approach to combine data at different spatial resolutions. This model can be extended to address many problems of interest (including covariates, preferential sampling, spatio-temporal settings). It has applications in a wide range of disciplines where information at different spatial resolutions needs to be combined

• The dengue study highlights the value of digital data in improving traditional surveillance systems. This is preliminary research; future studies will address search query biases, and use spatial models and covariates to obtain predictions at higher resolutions for more actionable insights.

• Integrating health, climate, environmental, socio-economic, and digital data sources enhances traditional surveillance systems, leading to better decision-making and improved health and well-being of the population.

## Join my research group at KAUST!

KAUST is an international university located on the shores of the Red Sea

All students receive a living allowance, free housing and medical coverage

👩‍💻 Potential research areas include the development of innovative statistical and computational methods for health and environmental applications

💪 Work closely with collaborators at KAUST and around the world

✈️ Generous travel funding for conferences and collaborative work

✨ Excellent research environment. Superb equipment and research facilities

## References

Zhong, et al. (2024). Spatial data fusion adjusting for preferential sampling using integrated nested Laplace approximation and stochastic partial differential equation. Journal of the Royal Statistical Society Series A: Statistics in Society

Pavani, et al. (2023). Joint spatial modeling of the risks of co-circulating mosquito-borne diseases in Ceará, Brazil. Spatial and Spatio-temporal Epidemiology

Zhong and Moraga (2023). Bayesian Hierarchical Models for the Combination of Spatially Misaligned Data: A Comparison of Melding and Downscaler Approaches Using INLA and SPDE. Journal of Agricultural, Biological and Environmental Statistics, 29, 110–129

Ribeiro Amaral, et al. (2022). Spatio-temporal modeling of infectious diseases by integrating compartment and point process models. Stochastic Environmental Research and Risk Assessment, 37, 1519-1533

Moraga and Baker (2022). rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R. F1000Research, 11:77

Moraga. (2018). Small Area Disease Risk Estimation and Visualization Using R. The R Journal, 10(1):495-506

Moraga. (2017). SpatialEpiApp: A Shiny Web Application for the analysis of Spatial and Spatio-Temporal Disease Data. Spatial and Spatio-temporal Epidemiology, 23:47-57

Moraga, et al. (2017). A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE. Spatial Statistics, 21:27-41

Thanks!

Paula Moraga