Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny describes spatial and spatio-temporal statistical methods and visualization techniques to analyze georeferenced health data in R. After a detailed introduction to geospatial data, the book shows how to develop Bayesian hierarchical models for disease mapping and apply computational approaches such as the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approach to analyze areal and geostatistical data. These approaches make it possible to quantify disease burden, understand geographic patterns and changes over time, identify risk factors, and measure inequalities between populations. The book also shows how to create interactive and static visualizations such as disease maps and time plots, and describes several R packages that can be used to easily turn analyses into visually informative and interactive reports, dashboards, and Shiny web applications that facilitate the communication of insights to collaborators and policymakers.
The book features detailed worked examples of several disease and environmental applications using real-world data such as malaria in The Gambia, cancer in Scotland and the USA, and air pollution in Spain. Examples in the book focus on health applications, but the approaches covered are also applicable to other fields that use georeferenced data including epidemiology, ecology, demography or criminology. The book covers the following topics:
- Types of spatial data and coordinate reference systems,
- Manipulating and transforming point, areal, and raster data,
- Retrieving high-resolution spatially referenced environmental data,
- Fitting and interpreting Bayesian spatial and spatio-temporal models with the R-INLA package,
- Modeling disease risk and quantifying risk factors in different settings,
- Creating interactive and static visualizations such as disease risk maps and time plots,
- Creating reproducible reports with R Markdown,
- Developing dashboards with flexdashboard,
- Building interactive Shiny web applications.
The book uses publicly available data, and provides clear descriptions of the R code for data importing, manipulation, modeling and visualization, as well as the interpretation of the results. This ensures contents are fully reproducible and accessible for students, researchers and practitioners.
This book is primarily aimed at epidemiologists, biostatisticians, public health specialists, and professionals of government agencies working with georeferenced health data. Moreover, since the methods discussed in the book are applicable not only to health data but also to many other fields that deal with georeferenced data, the book is also suitable for researchers and practitioners in areas such as epidemiology, ecology, demography or criminology who wish to learn how to model and visualize this type of data. The book is also appropriate for postgraduate students of statistics and epidemiology or other subjects with a strong statistical background.
Prerequisites and recommended reading
It is assumed readers are familiar with R and the basics of data analysis. R (https://www.r-project.org) is a free, open-source software environment for statistical computing and graphics with many excellent packages for importing and manipulating data, statistical modeling, and visualization. R can be downloaded from CRAN (the Comprehensive R Archive Network) (https://cran.rstudio.com). It is recommended to run R using the integrated development environment (IDE) called RStudio, which can be freely downloaded from https://www.rstudio.com/products/rstudio/download. RStudio allows one to interact with R more readily. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of tools for plotting, history, debugging and workspace management.
Resources available for readers wanting to improve their R skills include Grolemund (2014) which provides a friendly introduction to R with hands-on examples. Books for readers already comfortable with R include Wickham and Grolemund (2016) which teaches how to do data science with R, and Wickham (2019) which is designed primarily for R users who want to improve their programming skills and understanding of the language. Excellent resources to learn how to handle, analyze, and visualize spatial and spatio-temporal data in R are Bivand, Pebesma, and Gómez-Rubio (2013), Lovelace, Nowosad, and Muenchow (2019), and the website https://www.r-spatial.org.
It is also recommended that readers have a working knowledge of linear models, generalized linear models, Gaussian, Poisson and Binomial probability distributions, and Bayesian inference. Wang, Ryan, and Faraway (2018) covers a wide range of Bayesian regression models and detailed examples to fit them using INLA. Specific resources that focus on spatial and spatio-temporal modeling include Blangiardo and Cameletti (2015) which provides an introduction to the Bayesian approach and presents practical examples using real data problems. Krainski et al. (2019) describes the SPDE approach in detail and presents models that can deal with a variety of problems including multivariate data, measurement error, non-stationarity, and point process models. Further resources to learn INLA and SPDE can be found on the website http://www.r-inla.org/.
This book describes several R packages that can be used to easily turn our analyses into visually informative and interactive reports (Allaire et al. 2021), dashboards (Iannone, Allaire, and Borges 2020), and Shiny web applications (Chang et al. 2021). These tools facilitate communication with collaborators and allow stakeholders to understand our research and make informed decisions. Resources to deepen expertise in these packages can be found on the RStudio website https://www.rstudio.com/, which contains excellent tutorials, articles and examples on advanced concepts as well as information on hosting and deployment of web products.
Why read this book?
Geospatial health data are essential to inform public health and policy across high-, middle-, and low-income countries. These data can be used to understand the burden and geographic patterns of disease, and can help in the development of hypotheses that relate disease risk to potential demographic and environmental factors.
This book shows how to apply cutting-edge spatial and spatio-temporal statistical methods to disease data to produce disease risk maps and quantify risk factors. Specifically, the book shows how to develop Bayesian hierarchical models and apply computational approaches such as INLA and SPDE to analyze data collected in areas (e.g., counties or provinces) and at particular locations by disease registries, national and regional institutes of statistics, and other organizations. These approaches make it possible to quantify the disease burden, understand geographic and temporal patterns, identify risk factors, and measure inequalities.
This book also provides the necessary tools to design and develop web-based digital applications such as disease atlases that incorporate interactive visualizations to make disease risk estimates available and accessible to a wide audience, including policymakers, researchers, health professionals, and the general public. These tools allow users to explore vast amounts of data in an interactive and approachable way by means of maps, time plots, tables and other visualizations that support interactive filtering and zooming over different regions and periods of time to display the information of interest. These tools are beneficial when trying to identify information for specific regions, compare risks between populations, and understand how disease patterns have changed over time.
The statistical methods and visualization techniques presented in this book are valuable for analyzing a wide range of conditions including infectious diseases, non-communicable diseases, injuries, and health-related behaviors, and for providing policymakers with actionable information for the development and implementation of appropriate population health policies.
Structure of the book
This book consists of three parts and an appendix. Part I provides an overview of geospatial health and the R-INLA package (Havard Rue et al. 2021). The goal of this part is to provide background on geospatial data and computational methods that supports the subsequent chapters. Chapter 1 provides an overview of geospatial health and discusses methods for analysis and tools for communication of results. Chapter 2 reviews the basic characteristics of spatial data including areal, geostatistical and point patterns, and introduces coordinate reference systems and formats for storing geographic data. This chapter also presents R packages that are commonly used to create maps. Chapter 3 provides an introduction to Bayesian inference and to INLA, which performs approximate Bayesian inference in latent Gaussian models. This first part concludes with Chapter 4, which provides an overview of the R-INLA package. This chapter details how to use R-INLA to specify and fit models and how to interpret the results.
Part II of the book is devoted to modeling and visualization of both areal and geostatistical data. Health data that are aggregated over areas such as administrative divisions are common in public health surveillance. Examples include the number of disease cases or the number of road accidents in provinces. Chapter 5 introduces methods to analyze this type of data including spatial proximity matrices and standardized incidence ratios (SIRs), and discusses common areal issues such as the Misaligned Data Problem (MIDP) and the Modifiable Areal Unit Problem (MAUP). This chapter also introduces Bayesian hierarchical models to obtain small area disease risk estimates in spatial and spatio-temporal settings. Chapter 6 provides an example of how to use INLA to obtain cancer risk estimates in the counties of Scotland and quantify risk factors. Chapter 7 uses a spatio-temporal model to obtain cancer risk estimates in the counties of Ohio across several years.
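As a rough illustration of the standardized incidence ratio mentioned above, the following base R sketch divides observed counts by expected counts for three areas. The counts and populations are hypothetical, and the expected counts assume the simplest form of indirect standardization, with no strata and a single overall rate applied to every area. It is not the code used in Chapter 5.

```r
# Hypothetical toy data: observed cases and population in three areas
observed   <- c(A = 12, B = 30, C = 8)
population <- c(A = 10000, B = 20000, C = 10000)

# Overall rate in the whole study region (assumption: no age or sex strata)
overall_rate <- sum(observed) / sum(population)

# Expected counts if every area experienced the overall rate
expected <- overall_rate * population

# SIR > 1: more cases than expected; SIR < 1: fewer cases than expected
sir <- observed / expected
print(round(sir, 2))
```

With these toy numbers, area B has more cases than expected and areas A and C have fewer. SIRs computed this way can be unstable in areas with small populations, which is one motivation for the model-based risk estimates described in this part of the book.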
Geostatistical data are measurements of a spatially continuous phenomenon that have been collected at particular sites. Examples of this type of data are disease prevalence observations collected at specific villages using surveys, and air pollution levels measured at several monitoring stations. Chapter 8 shows how to develop spatial and spatio-temporal models that make it possible to predict at unsampled locations and times using the SPDE approach. Chapter 9 presents an example that predicts malaria prevalence in The Gambia using survey data and high-resolution environmental covariates. Chapter 10 shows how to model measurements of air pollution obtained at several monitoring stations in Spain across different years to produce continuous maps representing the spatial variation of air pollution over time. The examples presented in these chapters provide the R code needed for data importing, manipulation and modeling, and show how to create static and interactive visualizations such as maps and time plots of disease risk and risk factors using the R packages ggplot2 (Wickham, Chang, et al. 2021), gganimate (Pedersen and Robinson 2020), plotly (Sievert et al. 2021), leaflet (Cheng, Karambelkar, and Xie 2021), mapview (Appelhans et al. 2021) and tmap (Tennekes 2021).
A key aspect of geospatial research is to determine how to share the results of our analyses in a proper, timely and actionable way. Part III of the book describes several R packages that facilitate communication with collaborators and stakeholders. In Chapter 11 we introduce the package R Markdown (Allaire et al. 2021). This package enables the easy creation of high-quality, fully reproducible reports including narrative text, tables and visualizations, as well as the R code to generate them. While documents generated with R Markdown can be easily used to reproduce results and help other researchers determine how they were derived, they may not be the best tool for reporting to the relevant stakeholders. Stakeholders may not be interested in the statistical analyses, but they need to fully understand the results to support decision making. Dashboards can help to communicate large amounts of information visually and quickly and support data-driven decision making. In Chapter 12 we introduce the R package flexdashboard (Iannone, Allaire, and Borges 2020), which can be used to create dashboards that contain visual displays of the most important information arranged on a single screen in HTML format.
Interactive web applications are also an essential tool for communicating information in an approachable and actionable way. In Chapter 13 we introduce the package shiny (Chang et al. 2021), which provides a framework to turn results into web applications that allow users to experiment with different data scenarios so that they can answer their own questions. For example, they can filter data to obtain specific summaries, or change several options to obtain different visualizations. In Chapter 14 we show how to create interactive dashboards with Shiny, and Chapter 15 describes how to build a Shiny app that allows users to upload and visualize spatio-temporal data. Chapter 16 presents SpatialEpiApp (Moraga 2017b), a Shiny web application that allows users to visualize spatial and spatio-temporal disease data, estimate disease risk and detect clusters. Finally, Appendix A contains resources about R and shows the packages used in this book.
R represents an excellent tool for the analysis of geospatial health data. I would like to thank the R community and the developers and contributors of open-source software that enable reproducible data analysis. In particular, I would like to thank the developers of spatial packages, and the authors of INLA and SPDE for the great resources they created for spatial and spatio-temporal modeling. I would also like to thank the developers of packages for mapping, interactive visualization, and the creation of Shiny web applications, which really make a difference in how insights are communicated.
This book is written in R Markdown (Allaire et al. 2021) with bookdown (Xie 2021a). I am grateful to the developers of these packages, which made the creation of this book really easy.
I would also like to express my sincere gratitude to the anonymous reviewers for their helpful comments that greatly improved the first version of this book. I also thank my editor John Kimmel and the team at CRC Press for their suggestions and guidance throughout the publication process.
Finally, I would like to thank Peter J. Diggle, Francisco Montes, Al Ozonoff, Martin Kulldorff, and all my collaborators and colleagues for their guidance and support, and for the opportunity to work with them on great problems to advance spatial data science and public health surveillance.
KAUST, Saudi Arabia