# 23 Point process modeling

## 23.1 Log-Gaussian Cox processes

Log-Gaussian Cox processes (LGCPs) are typically used to model phenomena that are environmentally driven. A LGCP is a Poisson process with a varying intensity, which is itself a stochastic process of the form

$\Lambda(\boldsymbol{s}) = \exp(Z(\boldsymbol{s})),$ where $$Z = \{Z(\boldsymbol{s}) : \boldsymbol{s} \in \mathbb{R}^2\}$$ is a Gaussian process. Then, conditional on $$\Lambda(\cdot)$$, the point process is an inhomogeneous Poisson process with intensity $$\Lambda(\cdot)$$. This implies that the number of events in any region $$A$$ is Poisson distributed with mean $$\int_A \Lambda(\boldsymbol{s})d\boldsymbol{s}$$, and the locations of events are an independent random sample from the distribution on $$A$$ with probability density proportional to $$\Lambda(\cdot)$$. A LGCP model can also include spatial explanatory variables, providing a flexible approach for describing and predicting a wide range of spatial phenomena.
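To make this definition concrete, the following sketch (an illustration, not code from this chapter) simulates one realization of an inhomogeneous Poisson process on the unit square by thinning, with a fixed log-intensity surface standing in for a Gaussian process draw:

```r
# Simulate an inhomogeneous Poisson process on [0, 1]^2 by thinning.
# A fixed surface stands in for a realization Z(s) of a Gaussian process.
set.seed(1)
log_intensity <- function(x, y) 4 + 2 * sin(2 * pi * x) * cos(2 * pi * y)
lambda <- function(x, y) exp(log_intensity(x, y))
lambda_max <- exp(6)  # upper bound of the intensity on the unit square

# Step 1: homogeneous Poisson process with rate lambda_max
n_max <- rpois(1, lambda_max)
x <- runif(n_max)
y <- runif(n_max)

# Step 2: retain each point with probability lambda(s) / lambda_max
keep <- runif(n_max) < lambda(x, y) / lambda_max
pts <- cbind(x[keep], y[keep])
nrow(pts)  # number of retained events
```

Regions where the surface is high retain more points, so the retained pattern has Poisson counts with mean proportional to the integral of the intensity, matching the description above.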

In this chapter, we assume that we have observed a spatial point pattern of event locations $$\{\boldsymbol{x}_i: i = 1, \ldots, n\}$$ that has been generated as a realization of a LGCP, and we show how to fit a LGCP model to the data using the INLA and SPDE approaches. Chapter 15 introduced the SPDE approach and described its implementation in the context of model-based geostatistics using an example of air pollution prediction. Here, we describe how to use SPDE to fit a LGCP model to a point pattern of plant species to estimate the intensity of the process.

## 23.2 Fitting a LGCP

### Grid approach

A common method for inference in LGCP models is to approximate the latent Gaussian field by means of a gridding approach. In this approach, the study region is first discretized into a regular grid of $$n_1 \times n_2 = N$$ cells, $$\{s_{ij}\}$$, $$i=1,\ldots,n_1$$, $$j=1,\ldots,n_2$$. In the LGCP, the mean number of events in cell $$s_{ij}$$ is given by the integral of the intensity over the cell, $$\Lambda_{ij} = \int_{s_{ij}} \exp(\eta(s)) ds$$, and this integral can be approximated by $$\Lambda_{ij} \approx |s_{ij}| \exp(\eta_{ij})$$, where $$|s_{ij}|$$ is the area of the cell $$s_{ij}$$. Then, conditional on the latent field $$\eta_{ij}$$, the observed numbers of locations in the grid cells, $$y_{ij}$$, are independent and Poisson distributed as follows,

$y_{ij}| \eta_{ij} \sim \mbox{Poisson}(|s_{ij}| \exp(\eta_{ij})).$

Then, the LGCP model can be expressed within the generalized linear mixed model framework. For example, the log-intensity of the Poisson process can be modeled using covariates and random effects as follows:

$\eta_{ij} = \boldsymbol{c}(s_{ij}) \boldsymbol{\beta} + f_s(s_{ij}) + f_u(s_{ij}).$

Here, $$\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$$ is the coefficient vector of $$\boldsymbol{c}(s_{ij}) = (1, c_1(s_{ij}), \ldots, c_p(s_{ij}))$$, the vector of the intercept and covariate values at $$s_{ij}$$. $$f_s(\cdot)$$ is a spatially structured random effect reflecting unexplained variability, specified as a second-order two-dimensional conditional autoregressive model on a regular lattice. $$f_u(\cdot)$$ is an unstructured random effect reflecting independent variability in the cells. Moraga (2021b) provides an example of how to implement the grid approach to fit a LGCP with INLA using a species distribution modeling study in Costa Rica.
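As an illustration of this gridded Poisson approximation (a sketch with made-up grid dimensions, cell area, and latent field, not code from this chapter), cell counts can be simulated from the discretized intensity as follows:

```r
# Gridded LGCP approximation: counts per cell are Poisson with
# mean |s_ij| * exp(eta_ij), where eta_ij is the latent log-intensity.
set.seed(123)
n1 <- 20
n2 <- 20          # grid dimensions (illustrative)
cell_area <- 0.25 # |s_ij|, assumed equal for all cells

# Stand-in latent field: intercept plus a smooth spatial trend plus noise
xs <- seq(0, 1, length.out = n1)
ys <- seq(0, 1, length.out = n2)
eta <- outer(xs, ys, function(x, y) 1 + sin(pi * x) + cos(pi * y)) +
  matrix(rnorm(n1 * n2, sd = 0.1), n1, n2)

# Conditional on eta, the cell counts are independent Poisson
y <- matrix(rpois(n1 * n2, lambda = cell_area * exp(eta)), n1, n2)
sum(y)  # total number of simulated events
```

Fitting the model then amounts to a Poisson regression of the counts `y` on the latent field, which is how the LGCP is expressed in the generalized linear mixed model framework described above.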

### Going off the grid

While the previous approach is a common method for inference in LGCP models, the results obtained depend on the construction of a fine regular grid that cannot be locally refined. An alternative, computationally efficient method for inference on LGCPs is presented in Simpson et al. (2016). This method, rather than defining a Gaussian random field over a fine regular grid, proposes a finite-dimensional continuously specified random field of the form

$Z(\boldsymbol{s}) = \sum_{i=1}^n z_i \phi_i(\boldsymbol{s}),$ where $$\boldsymbol{z} = (z_1, \ldots, z_n)'$$ is a multivariate Gaussian random vector and $$\{ \phi_i(\boldsymbol{s})\}_{i=1}^n$$ is a set of linearly independent deterministic basis functions. Unlike the lattice approach, this method models the observations at their exact locations instead of binning them into cells. Thus, its implementation does not require a regular grid but rather a triangulated mesh to approximate the Gaussian random field using the SPDE approach. Below, we give an example of species distribution modeling where we fit a LGCP using this method and INLA and SPDE for fast approximate inference. This approach is also explained in Krainski et al. (2019).
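The finite-dimensional representation can be sketched directly. The toy example below uses Gaussian bumps at hypothetical node locations as basis functions purely for illustration; the SPDE approach instead uses piecewise linear basis functions defined on the mesh triangles.

```r
# Evaluate Z(s) = sum_i z_i * phi_i(s) at an arbitrary location,
# with illustrative Gaussian-bump basis functions centered at nodes.
set.seed(42)
nodes <- expand.grid(x = seq(0, 1, length.out = 5),
                     y = seq(0, 1, length.out = 5))  # 25 basis centers
z <- rnorm(nrow(nodes))                              # Gaussian weights

# phi_i(s): a deterministic basis function centered at node i
phi <- function(s, node, range = 0.3) {
  exp(-((s[1] - node[1])^2 + (s[2] - node[2])^2) / (2 * range^2))
}

# Z(s): weighted sum of the basis functions evaluated at s
Z <- function(s) {
  sum(z * apply(nodes, 1, function(nd) phi(s, nd)))
}

Z(c(0.5, 0.5))  # value of the random field at one location
```

Because the weights $$\boldsymbol{z}$$ are jointly Gaussian and the basis functions are deterministic, $$Z(\boldsymbol{s})$$ is a Gaussian process that can be evaluated at any location, which is what allows observations to be treated at their exact positions.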

## 23.3 Species distribution modeling

Species distribution models allow us to understand spatial patterns and assess the influence of factors on species occurrence. These models are crucial for the development of appropriate strategies that help protect species and the environments where they live. Here, we show how to formulate and fit a LGCP model for Solanum plant species in Bolivia using a continuously specified Gaussian random field with INLA and SPDE. The model allows us to estimate the intensity of the process that generates the locations.

### 23.3.1 Observed Solanum plant species in Bolivia

In this example, we estimate the intensity of Solanum plant species in Bolivia from January 2015 to December 2022, using occurrence data obtained from the Global Biodiversity Information Facility (GBIF) database with the spocc package. We retrieve the data using the occ() function, specifying the plant species scientific name, data source, dates, and country code. We also specify has_coords = TRUE to return only occurrences that have coordinates, and limit = 1000 to set the maximum number of records returned.

library("sf")
library("spocc")

df <- occ(query = "solanum", from = "gbif",
          date = c("2015-01-01", "2022-12-31"),
          gbifopts = list(country = "BO"),
          has_coords = TRUE, limit = 1000)
d <- occ2df(df)

We use the st_as_sf() function to create a sf object called d that contains the retrieved locations (nrow(d) is 244). We set the coordinate reference system (CRS) to EPSG code 4326 since the coordinates of the locations are given by longitude and latitude values.

d <- st_as_sf(d[, 2:3], coords = c("longitude", "latitude"))
st_crs(d) <- "EPSG:4326"

In order to work with kilometers instead of degrees, we project the data to UTM zone 19S, corresponding to EPSG code 5356, with kilometers as units. To do that, we obtain st_crs("EPSG:5356")$proj4string and change +units=m to +units=km.

st_crs("EPSG:5356")$proj4string

projUTM <- "+proj=utm +zone=19 +south +ellps=GRS80
+towgs84=0,0,0,0,0,0,0 +units=km +no_defs"
d <- st_transform(d, crs = projUTM)

We also obtain the map of Bolivia with the rnaturalearth package, and we project it to UTM 19S with kilometers as units.

library(rnaturalearth)
map <- ne_countries(type = "countries", country = "Bolivia",
                    scale = "medium", returnclass = "sf")
map <- st_transform(map, crs = projUTM)

Figure 23.1 shows a map with the retrieved locations of Solanum plant species in Bolivia.

library("ggplot2")
ggplot() + geom_sf(data = map) +
  geom_sf(data = d) + coord_sf(datum = projUTM)

Finally, we create the matrix coo with the event locations.

coo <- st_coordinates(d)

### 23.3.2 Prediction data

Now, we construct a matrix with the locations coop where we want to predict the point process intensity. To do that, we first create a raster covering the map with the rast() function of terra. Then, we retrieve the coordinates of the cells with the xyFromCell() function of terra.

library(sf)
library(terra)

# raster grid covering map
grid <- terra::rast(map, nrows = 100, ncols = 100)
# coordinates of all cells
xy <- terra::xyFromCell(grid, 1:ncell(grid))

We create a sf object called dp with the prediction locations with st_as_sf(), and use st_filter() to keep the prediction locations that lie within the map. We also retrieve the indices of the points within the map by using st_intersects() setting sparse = FALSE.

# transform points to a sf object
dp <- st_as_sf(as.data.frame(xy), coords = c("x", "y"),
               crs = st_crs(map))

# indices of points within the map
indicespointswithin <- which(st_intersects(dp, map,
                                           sparse = FALSE))

# points within the map
dp <- st_filter(dp, map)

Figure 23.1 depicts the prediction locations in the study region.

ggplot() + geom_sf(data = map) +
  geom_sf(data = dp) + coord_sf(datum = projUTM)

We create the matrix coop with the prediction locations.

coop <- st_coordinates(dp)

### 23.3.3 Model

We use a LGCP to model the point pattern of plant species. Thus, we assume that the process that generates the plant species locations is a Poisson process with a varying intensity expressed as

$\log(\Lambda(\boldsymbol{s})) = \beta_0 + Z(\boldsymbol{s}),$

where $$\beta_0$$ is the intercept, and $$Z(\cdot)$$ is a zero-mean Gaussian spatial process with a Matérn covariance function.

### 23.3.4 Mesh construction

To fit the model using INLA and SPDE, we first construct a mesh. In the analysis of point patterns, we do not usually employ the locations as mesh vertices. We construct a mesh that covers the study region using the inla.mesh.2d() function setting loc.domain equal to a matrix with the point locations of the boundary of the region. Other arguments are as follows. max.edge denotes the maximum allowed triangle edge lengths in the inner region and the extension. offset specifies the size of the inner and outer extensions around the data locations. cutoff is the minimum allowed distance between points that we use to avoid building many small triangles around clustered locations. Figure 23.2 shows the mesh created.

library(INLA)

summary(dist(coo)) # summary of distances between event locations

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    65.2   196.8   238.8   382.9  1171.0

loc.d <- cbind(st_coordinates(map)[, 1], st_coordinates(map)[, 2])

mesh <- inla.mesh.2d(loc.domain = loc.d, max.edge = c(50, 100),
                     offset = c(50, 100), cutoff = 1)
plot(mesh)
points(coo, col = "red")
axis(1)
axis(2)

We also create variables nv with the number of mesh vertices, and the variable n with the number of events of the point pattern. Later, we will use these variables to construct the data stacks.
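The code for this step is not shown in the excerpt above; in the chapter's pipeline it would simply be nv <- mesh$n and n <- nrow(coo). A self-contained sketch with stand-in objects (the mesh list and coo matrix here are hypothetical placeholders for the objects created earlier):

```r
# Stand-ins for the mesh and coo objects created earlier (hypothetical values)
mesh <- list(n = 817)                    # an INLA mesh exposes its vertex count as mesh$n
coo <- matrix(runif(2 * 244), ncol = 2)  # 244 event locations

nv <- mesh$n    # number of mesh vertices
n <- nrow(coo)  # number of events in the point pattern
```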

### 23.3.9 Results

A summary of the results can be inspected by typing summary(res). The data frame res$summary.fitted.values contains the fitted values. The indices of the rows corresponding to the predictions can be obtained with inla.stack.index() specifying the tag "pred.pp" of the prediction stack.

index <- inla.stack.index(stk.full.pp, tag = "pred.pp")$data

pred_mean <- res$summary.fitted.values[index, "mean"]
pred_ll <- res$summary.fitted.values[index, "0.025quant"]
pred_ul <- res$summary.fitted.values[index, "0.975quant"]

Then, we add layers to the grid raster with the posterior mean and the 2.5 and 97.5 percentile values in the cells that are within the map.

grid$mean <- NA
grid$ll <- NA
grid$ul <- NA

grid$mean[indicespointswithin] <- pred_mean
grid$ll[indicespointswithin] <- pred_ll
grid$ul[indicespointswithin] <- pred_ul

Finally, we create maps of the posterior mean and the lower and upper limits of 95% credible intervals of the intensity of the point process of species in Bolivia (Figure 23.4). To do that, we use the levelplot() function of the rasterVis package, specifying names.attr with the name of each panel and layout with the number of columns and rows.

library(rasterVis)
levelplot(raster::brick(grid), layout = c(3, 1),
          names.attr = c("Mean", "2.5 percentile", "97.5 percentile"))

Overall, we observe a low intensity of species, with higher intensity in the central west part of Bolivia. Note that we have modeled species occurrence data retrieved from GBIF by assuming the observed spatial point pattern is a realization of the underlying process that generates the species locations. In real applications, it is important to understand how the data were collected, and to assess potential data biases, such as overrepresentation of certain areas, that can invalidate the results. Moreover, it is important to incorporate expert knowledge to create models that include relevant covariates and random effects to account for various types of variability, enabling a more comprehensive understanding of the variable under investigation.

### 23.3.10 Function to create the dual mesh

The following code corresponds to the book.mesh.dual() function to create the dual mesh.

book.mesh.dual <- function(mesh) {
  if (mesh$manifold == 'R2') {
    ce <- t(sapply(1:nrow(mesh$graph$tv), function(i)
      colMeans(mesh$loc[mesh$graph$tv[i, ], 1:2])))
    library(parallel)
    pls <- mclapply(1:mesh$n, function(i) {
      p <- unique(Reduce('rbind', lapply(1:3, function(k) {
        j <- which(mesh$graph$tv[, k] == i)
        if (length(j) > 0)
          return(rbind(ce[j, , drop = FALSE],
            cbind(mesh$loc[mesh$graph$tv[j, k], 1] +
                    mesh$loc[mesh$graph$tv[j, c(2:3, 1)[k]], 1],
                  mesh$loc[mesh$graph$tv[j, k], 2] +
                    mesh$loc[mesh$graph$tv[j, c(2:3, 1)[k]], 2]) / 2))
        else return(ce[j, , drop = FALSE])
      })))
      j1 <- which(mesh$segm$bnd$idx[, 1] == i)
      j2 <- which(mesh$segm$bnd$idx[, 2] == i)
      if ((length(j1) > 0) | (length(j2) > 0)) {
        p <- unique(rbind(mesh$loc[i, 1:2], p,
          mesh$loc[mesh$segm$bnd$idx[j1, 1], 1:2] / 2 +
            mesh$loc[mesh$segm$bnd$idx[j1, 2], 1:2] / 2,
          mesh$loc[mesh$segm$bnd$idx[j2, 1], 1:2] / 2 +
            mesh$loc[mesh$segm$bnd$idx[j2, 2], 1:2] / 2))
        yy <- p[, 2] - mean(p[, 2]) / 2 - mesh$loc[i, 2] / 2
        xx <- p[, 1] - mean(p[, 1]) / 2 - mesh$loc[i, 1] / 2
      } else {
        yy <- p[, 2] - mesh$loc[i, 2]
        xx <- p[, 1] - mesh$loc[i, 1]
      }
      Polygon(p[order(atan2(yy, xx)), ])
    })
    return(SpatialPolygons(lapply(1:mesh$n, function(i)
      Polygons(list(pls[[i]]), i))))
  } else {
    stop("It only works for R2!")
  }
}