Preface

Spatial Statistics for Data Science: Theory and Practice with R describes statistical methods, modeling approaches, and visualization techniques to analyze spatial data using R. The book starts by providing a comprehensive overview of the types of spatial data and R packages for spatial data retrieval, manipulation, and visualization. Then, it provides a detailed explanation of the theoretical concepts of spatial statistics, along with fully reproducible examples demonstrating how to simulate, describe, and analyze areal, geostatistical, and point pattern data in various applications.

The book combines theory and practice using real-world data science examples such as disease risk mapping, air pollution prediction, species distribution modeling, crime mapping, and real estate analyses. The book covers the following topics:

Spatial data including areal, geostatistical, and point patterns
Coordinate reference systems and geographical data storage formats
R packages for retrieval, manipulation, and visualization of spatial data
Statistical methods to simulate, describe, and analyze spatial data
Areal data: neighborhood matrices, spatial autocorrelation, Bayesian spatial models
Geostatistical data: Gaussian random fields, spatial interpolation, Kriging, model-based geostatistics
Point patterns: kernel intensity estimation, clustering, log-Gaussian Cox processes
Fitting and interpreting Bayesian spatial models using the integrated nested Laplace approximation (INLA) and stochastic partial differential equation (SPDE) approaches
Model assessment criteria and cross-validation
Effective communication using interactive visualizations and dashboards

The book uses publicly available data and offers clear explanations of the R code for importing, manipulating, analyzing, and visualizing data, as well as the interpretation of the results. This ensures that the contents are easily accessible and reproducible for students, researchers, and practitioners.

Audience

This book serves as a valuable resource for anyone interested in the theoretical and practical aspects of spatial statistics, with a focus on applying these methods using R. This includes statisticians, data scientists, epidemiologists, environmental scientists, geographers, urban planners, climate scientists, and professionals in government agencies looking to deepen their understanding of spatial data analysis. The book is also appropriate for students of statistics and data science, as well as students in other fields who have a strong statistical background. The book provides readers with a solid foundation in the theory of spatial statistics, as well as practical skills for working with spatial data using R for data retrieval, manipulation, and visualization across a range of disciplines.

Prerequisites and recommended reading

Readers are assumed to have a good understanding of statistical concepts such as probability distributions, descriptive statistics, confidence intervals, hypothesis testing, and generalized linear models. The book employs the statistical software R to illustrate methods and examples. It is assumed readers have some basic understanding of R programming, including how to install and load packages, manipulate data objects, and create plots. Books that can assist readers in enhancing their R skills include Grolemund (2014) and Wickham and Grolemund (2016), which provide friendly introductions to R and data analysis with hands-on examples. Moraga (2019) describes spatial and spatio-temporal models in health and environmental applications. It also shows how to easily turn analyses into visually informative reports, dashboards, and Shiny web applications for reproducible research and communication.

Why read this book?

Spatial data arise in many fields including environment, health, ecology, agriculture, urban planning, economy, and society. The utilization of spatial data has emerged as a critical component in data science, serving as a powerful tool for governments, companies, and individuals to improve their decision-making processes. A significant example is the utilization of spatial data by statistical offices across the world to improve the assessment and surveillance of the United Nations’ Sustainable Development Goals (SDGs). Spatial data are crucial to understand patterns of health outcomes and risk factors, monitor and manage natural resources, analyze demographics, design cities, preserve endangered species, and rapidly detect infectious disease outbreaks.

This book provides a comprehensive reference to spatial statistics for data science, supported by practical and fully reproducible examples across diverse fields. The statistical software R is used throughout the book, providing a wide range of packages and functions for handling spatial data in different formats, and facilitating analysis and visualization. R is available for download and use for free, making it an accessible option for researchers, educators, and practitioners. By employing the cutting-edge methods presented in the book, readers can gain valuable insights that support informed decision-making across a wide range of fields including public health, environment, and business.

Structure of the book

This book consists of four parts and an appendix.

Part I. Spatial data

The objective of the first part of the book is to introduce readers to the different types of spatial data, the file formats used to store spatial data, and coordinate reference systems. This part also introduces packages that can be used to create, read, manipulate, and write spatial data in R. Additionally, it presents packages that facilitate the creation of maps, and packages that allow us to download open spatial data.

Part II. Areal data

The second part is devoted to the analysis of areal data. This type of data arises when a study region is partitioned into a finite number of areas at which outcomes are aggregated. Examples of areal data include the number of individuals with a certain disease in municipalities of a country or the average housing prices in districts of a city. This part covers key concepts such as spatial neighborhood matrices and spatial autocorrelation. It also shows how to fit and interpret Bayesian spatial models to analyze areal data. Examples of disease mapping and housing price prediction are used to illustrate the application of these techniques.

Part III. Geostatistical data

The third part of the book is centered on geostatistical data, which refers to measurements of a spatially continuous phenomenon collected at specific locations, such as air pollution or temperature levels taken at a set of monitoring stations. This part provides an introduction to Gaussian fields and R packages used for their simulation and analysis. In addition, it presents various spatial interpolation methods including inverse distance weighted methods, Kriging, and model-based geostatistics. These methods are illustrated using several examples such as the prediction of soil metal concentrations and air pollution levels. This part also covers measures to assess the predictive performance of the interpolation methods using cross-validation techniques.

Part IV. Spatial point patterns

Spatial point patterns are countable sets of points that arise as realizations of stochastic spatial point processes within a planar region. Examples of point patterns include the locations of trees in a forest, addresses of individuals with a disease in a city, and the locations of cells in a tissue. The fourth part of the book provides an overview of techniques for analyzing point pattern data, including methods to assess the randomness of spatial point patterns, intensity estimation, and clustering analysis. It also demonstrates how to formulate and fit log-Gaussian Cox process models for point pattern data, which are typically used to model environmentally driven phenomena. Examples in this part include the analysis of disease data, crime mapping, and species distribution modeling.

Appendix

Finally, the appendix provides useful resources on the R software and packages for visualization, as well as the creation of interactive dashboards and Shiny web applications to effectively communicate results to collaborators and policymakers.

Acknowledgments

R is a powerful and accessible tool for spatial statistics and data science. I am grateful to the R community, and to the developers and contributors of open-source software, for providing valuable resources that enable the analysis of spatial data. This book is the result of a compilation of teaching materials developed over several years. I am grateful to the students in the courses I have had the opportunity to teach for their feedback and insightful questions, which helped me improve this book. Finally, I would also like to express my sincere gratitude to my research group and all my collaborators over the years for the opportunity to work with them on great problems to advance spatial statistics and data science.

Paula Moraga
KAUST
