http://www.paulamoraga.com/book-geospatial

In this tutorial we use disease mapping methods to estimate lung cancer risk in Pennsylvania, United States, in year 2002. We use data of the R package `SpatialEpi`

which contains the population, the lung cancer cases and the smoking proportions in the Pennsylvania counties. Population data was obtained from the 2000 decennial census, and the disease cases and smoking proportions were obtained from the Pennsylvania Department of Health website.

We show how to calculate the observed and expected disease cases, and the Standardized Mortality Ratios (SMRs) in each of the Pennsylvania counties. We also obtain disease risk estimates and quantify risk factors using `INLA`

. Finally, we show how to make interactive maps of the disease risk estimates using `leaflet`

.

A extended version of this example appears in Moraga, P. Small Area Disease Risk Estimation and Visualization Using R. The R Journal, 10(1):495-506, 2018, and the book Moraga, P. Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC, 2019.

We start by loading the `SpatialEpi`

package and attaching the data.

```
library(SpatialEpi)
data(pennLC)
```

Next we inspect the data.

`class(pennLC)`

`## [1] "list"`

`names(pennLC)`

`## [1] "geo" "data" "smoking" "spatial.polygon"`

`?pennLC`

We see that `pennLC`

is a list object with the following elements:

`geo`

: a table of county ids, and the longitude and latitude of the centroid of each county,`data`

: a table of county ids, number of cases, population and strata information,`smoking`

: a table of county ids and proportion of smokers,`spatial.polygon`

: a`SpatialPolygons`

object.

We inspect the first rows of `pennLC$data`

and `pennLC$smoking`

. `pennLC$data`

contains the number of lung cancer cases and the population at county level, stratified on race (white and non-white), gender (female and male) and age (under 40, 40-59, 60-69 and 70+).

`head(pennLC$data)`

`pennLC$smoking`

contains county-specific proportion of smokers.

`head(pennLC$smoking)`

The map of Pennsylvania counties is given by the `SpatialPolygons`

object called `pennLC$spatial.polygon`

. We can plot the map as follows:

```
map <- pennLC$spatial.polygon
plot(map)
```

We now create a data frame called `d`

with columns containing the counties ids, the observed and expected number of cases, the smoking proportions and the SMRs. Specifically, `d`

will contain the following columns:

`county`

: id of each county,`Y`

: observed number of cases in each county,`E`

: expected number of cases in each county,`smoking`

: smoking proportion in each county,`SMR`

: SMR of each county.

`pennLC$data`

contains the cases in each county stratified by race, gender and age. We can obtain the number of cases in each county, `Y`

, by aggregating the rows of `pennLC$data`

by county and adding up the observed number of cases using the `dplyr`

package.

```
library(dplyr)
d <- group_by(pennLC$data, county) %>% summarize(Y = sum(cases))
head(d)
```

An alternative to `dplyr`

to calculate the number of cases in each county is to use the `aggregate()`

function. `aggregate()`

accepts the three following arguments:

`x`

: data frame with the data,`by`

: list of grouping elements (each element needs to have the same length as the variables in data frame`x`

),`FUN`

: function to compute the summary statistics to be applied to all data subsets.

We can use `aggregate()`

with `x`

equal to the vector of cases, `by`

equal to a list with the counties, and `FUN`

equal to `sum`

. Then, we set the names of the columns of the data frame obtained equal to `county`

and `Y`

.

```
d <- aggregate(
x = pennLC$data$cases,
by = list(county = pennLC$data$county),
FUN = sum
)
names(d) <- c("county", "Y")
```

Now we calculate the expected number of cases in each county which represent the total number of disease cases one would expect if the population in the county behaved the way the Pennsylvania population behaves. The expected counts \(E_i\) in each county \(i\) can be calculated using indirect standardization as \[E_i=\sum_{j=1}^m r_j^{(s)} n_j^{(i)},\] where \(r_j^{(s)}\) is the rate (number of cases divided by population) in stratum \(j\) in the population of Pennsylvania, and \(n_j^{(i)}\) is the population in stratum \(j\) of county \(i\).

We can calculate the expected counts this by using the `expected()`

function of the `SpatialEpi`

package. This function has three arguments, namely,

`population`

: a vector of population counts for each strata in each area,`cases`

: a vector with the number of cases for each strata in each area,`n.strata`

: number of strata considered.

Vectors `population`

and `cases`

have to be sorted by area first and then, within each area, the counts for all strata need to be listed in the same order. All strata need to be included in the vectors, including strata with 0 cases.

In order to obtain the expected counts we first sort the data using the `order()`

function where we specify the order as county, race, gender and finally age.

`pennLC$data <- pennLC$data[order(pennLC$data$county, pennLC$data$race, pennLC$data$gender, pennLC$data$age), ]`

Then, we obtain the expected counts `E`

in each county by calling the `expected()`

function where we set `population`

equal to `pennLC$data$population`

and `cases`

equal to `pennLC$data$cases`

. There are 2 races, 2 genders and 4 age groups for each county, so number of strata is set to 2 x 2 x 4 = 16.

`E <- expected(population = pennLC$data$population, cases = pennLC$data$cases, n.strata = 16)`

Now we add the vector `E`

to the data frame `d`

which contains the counties ids (`county`

) and the observed counts (`Y`

).

```
d$E <- E
head(d)
```

We also add to `d`

the variable `smoking`

which represents the proportion of smokers in each county. We add this variable using the `merge()`

function where we specify `county`

as the column for merging.

`d <- merge(d, pennLC$smoking, by = "county")`

For each county \(i\) the SMR is defined as the ratio of observed counts to the expected counts. \[\mbox{SMR}_i=Y_i/E_i.\]

We compute the vector of SMRs as the ratio of the observed to the expected counts, and add it to the data frame `d`

.

`d$SMR <- d$Y/d$E`

The final dataset is this:

`head(d)`

We have constructed a data frame `d`

containing the observed and expected disease counts, the smokers proportions and the SMR for each of the counties. The map of Pennsylvania counties is given by the `SpatialPolygons`

object called `map`

. Using this object and the data frame `d`

we can create a `SpatialPolygonsDataFrame`

that will allow us to make maps of the variables in `d`

. In order to do that, we first set the row names of the data frame `d`

equal to `d$county`

. Then we merge the `SpatialPolygons`

`map`

and the data frame `d`

matching the `SpatialPolygons`

member Polygons ID slot values with the data frame row names (`match.ID = TRUE`

).

```
library(sp)
rownames(d) <- d$county
map <- SpatialPolygonsDataFrame(map, d, match.ID = TRUE)
head(map@data)
```

We can visualize the observed and expected disease counts, the SMRs, as well as the smokers proportions in an interactive chropleth map. We construct this map using the `leaflet`

package.

We create the map by first calling `leaflet()`

and adding the default OpenStreetMap map tiles to the map with `addTiles()`

. Then we add the Pennsylvania counties with `addPolygons()`

where we specify the areas boundaries color (`color`

) and the stroke width (`weight`

). We fill the areas with the colours given by the color palette function generated with `colorNumeric()`

, and set `fillOpacity`

to a value less than 1 to be able to see the background map. We use `colorNumeric()`

to create a color palette function that maps data values to colors according to a given palette. We create the function using the two following parameters:

`palette`

: color function that values will be mapped to, and`domain`

: possible values that can be mapped.

Finally, we add the legend by specifying the color palette function (`pal`

) and the values used to generate colors from the palette function (`values`

). We set `opacity`

to the same value as the opacity in the areas, and specify a title and a position for the legend.

```
library(leaflet)
l <- leaflet(map) %>% addTiles()
pal <- colorNumeric(palette = "YlOrRd", domain = map$SMR)
l %>% addPolygons(color = "grey", weight = 1, fillColor = ~pal(SMR), fillOpacity = 0.5) %>%
addLegend(pal = pal, values = ~SMR, opacity = 0.5, title = "SMR", position = "bottomright")
```

We can improve the map by highlighting the counties when the mouse hovers over them, and showing information about the observed and expected counts, SMRs, and smoking proportions. We do this by adding the arguments `highlightOptions`

, `label`

and `labelOptions`

to `addPolygons()`

. We choose to highlight the areas using a bigger stroke width (`highlightOptions(weight = 4)`

). We create the labels using HTML syntax. First, we create the text to be shown using the function `sprintf()`

which returns a character vector containing a formatted combination of text and variable values and then applying `htmltools::HTML()`

which marks the text as HTML. In `labelOptions`

we specify the labels `style`

, `textsize`

, and `direction`

. Possible values for `direction`

are `left`

, `right`

and `auto`

and this specifies the direction the label displays in relation to the marker. We choose `auto`

so the optimal direction will be chosen depending on the position of the marker.

```
labels <- sprintf("<strong>%s</strong><br/>Observed: %s <br/>Expected: %s <br/>Smokers proportion: %s <br/>SMR: %s",
map$county, map$Y, round(map$E, 2), map$smoking, round(map$SMR, 2)) %>%
lapply(htmltools::HTML)
l %>% addPolygons(color = "grey", weight = 1, fillColor = ~pal(SMR), fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px", direction = "auto")) %>%
addLegend(pal = pal, values = ~SMR, opacity = 0.5, title = "SMR", position = "bottomright")
```

Now we can examine the map and see which areas have

- SMR \(=\) 1 which indicates observed cases are the same as expected,
- SMR \(>\) 1 which indicates observed cases are higher than expected,
- SMR \(<\) 1 which indicates observed cases are lower than expected.

This map gives a sense of the disease risk across Pennsylvania. However, SMRs may be misleading and insufficiently reliable in counties with small populations. In contrast, model-based approaches enable to incorporate covariates and borrow information from neighboring counties to improve local estimates, resulting in the smoothing of extreme values based on small sample sizes. In the next section we will show how to obtain disease risk estimates using a model using `INLA`

.

In this Section we specify the model for the observed data, and detail the required steps to fit the model and obtain the disease risk estimates using `INLA`

.

Let \(Y_i\) and \(E_i\) be the observed and expected number of disesase cases, respectively, and let \(\theta_i\) be the relative risk for county \(i=1,\ldots,n\). The model is specified as follows:

\[Y_i|\theta_i \sim Poisson(E_i \times \theta_i),\ i=1,\ldots,n,\]

\[\log(\theta_i) = \beta_0 + \beta_1 \times smoking_i + u_i + v_i.\]

- \(\beta_0\) is the intercept,
- \(\beta_1\) is the coefficient of the smokers proportion covariate,
- \(u_i\) is a structured spatial effect, \(u_i|\mathbf{u_{-i}} \sim N(\bar{u}_{\delta_i}, \frac{1}{\tau_u n_{\delta_i}})\) (Conditionally autoregressive model (CAR)),
- \(v_i\) is an unstructured spatial effect, \(v_i \sim N(0, 1/\tau_v)\).

We create the neighbourhood matrix needed to define the spatial random effect using the `poly2nb()`

and the `nb2INLA()`

functions of the `spdep`

package. First, we use `poly2nb()`

to create a neighbours list based on areas with contiguous boundaries. Each element of the list `nb`

represents one area and contains the indices of its neighbours. For example, `nb[[2]]`

contains the neighbours of area 2.

```
library(spdep)
library(INLA)
nb <- poly2nb(map)
head(nb)
```

```
## [[1]]
## [1] 21 28 67
##
## [[2]]
## [1] 3 4 10 63 65
##
## [[3]]
## [1] 2 10 16 32 33 65
##
## [[4]]
## [1] 2 10 37 63
##
## [[5]]
## [1] 7 11 29 31 56
##
## [[6]]
## [1] 15 36 38 39 46 54
```

Then, we use the `nb2INLA()`

function to convert the `nb`

list into a file called `map.adj`

with the representation of the neighbourhood matrix as required by `INLA`

. The `map.adj`

file is saved in the working directory which can be obtained with `getwd()`

. Then, we read the `map.adj`

file using the `inla.read.graph()`

function of `INLA`

, and store it in the object `g`

which we will later use to specify the spatial disease model with `INLA`

.

```
nb2INLA("map.adj", nb)
g <- inla.read.graph(filename = "map.adj")
```

The model includes two random effects, namely, \(u_i\) for modelling spatial residual variation, and \(v_i\) for modelling unstructured noise. We need to include two vectors in the data that denote the indices of these random effects. We call `re_u`

the indices vector for \(u_i\), and `re_v`

the indices vector for \(v_i\). We set both `re_u`

and `re_v`

equal to \(1,\ldots,n\), where \(n\) is the number of counties. In our example, \(n\)=67 and this can be obtained with the number of rows in the data (`nrow(map@data)`

).

```
map$re_u <- 1:nrow(map@data)
map$re_v <- 1:nrow(map@data)
```

We specify the model formula by including the response in the left-hand side, and the fixed and random effects in the right-hand side. Random effects are set using `f()`

with parameters equal to the name of the variable and the chosen model. For \(u_i\), we use `model = "besag"`

with neighbourhood matrix given by `g`

. For \(v_i\) we choose `model = "iid"`

.

`formula <- Y ~ smoking + f(re_u, model = "besag", graph = g, scale.model = TRUE) + f(re_v, model = "iid")`

We fit the model by calling the `inla()`

function and using the default priors in `INLA`

. We specify the formula, family, data, and the expected counts, and set `control.predictor`

equal to `list(compute = TRUE)`

to compute the posterior means of the predictors.

`res <- inla(formula, family = "poisson", data = map@data, E = E, control.predictor = list(compute = TRUE))`

We can inspect the results object `res`

using `summary()`

.

`summary(res)`

```
##
## Call:
## c("inla(formula = formula, family = \"poisson\", data = map@data, ", " E = E, control.predictor = list(compute = TRUE))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.7038 0.6089 0.2032 1.5159
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -0.3233 0.1502 -0.6207 -0.3231 -0.0277 -0.3228 0
## smoking 1.1556 0.6244 -0.0808 1.1568 2.3841 1.1600 0
##
## Random effects:
## Name Model
## re_u Besags ICAR model
## re_v IID model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant
## Precision for re_u 230.78 124.81 77.34 202.17 551.13
## Precision for re_v 18148.62 18199.95 1164.16 12725.30 66367.47
## mode
## Precision for re_u 156.91
## Precision for re_v 3144.96
##
## Expected number of effective parameters(std dev): 18.68(4.433)
## Number of equivalent replicates : 3.587
##
## Marginal log-Likelihood: -289.45
## Posterior marginals for linear predictor and fitted values computed
```

We see the intercept \(\hat \beta_0=\) -0.323 with a 95% credible interval equal to (-0.621, -0.028), and the coefficient of smoking is \(\hat \beta_1=\) 1.156 with a 95% credible interval equal to (-0.081, 2.384). We can plot the posterior distribution of the smoking coefficient. We do this by calculating a smoothing of the marginal distribution of the coefficient with `inla.smarginal()`

and then plot it with `ggplot()`

of the `ggplot2`

package.

```
library(ggplot2)
marginal <- inla.smarginal(res$marginals.fixed$smoking)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) + geom_line() + labs(x = expression(beta[1]), y = "Density") +
geom_vline(xintercept = 0, col = "blue") + theme_bw()
```

The disease risk estimates and uncertainty for each of the counties are given by the mean posterior and the 95% credible intervals of \(\theta_i\), \(i=1,\ldots,n\) which are in `res$summary.fitted.values`

. Column `mean`

is the mean posterior and `0.025quant`

and `0.975quant`

are the 2.5 and 97.5 percentiles, respectively. We add these data to `map`

to be able to make maps of these variables. We assign `mean`

to the estimate of the relative risk, and `0.025quant`

and `0.975quant`

to the lower and upper limits of 95% credible intervals of the risks.

`head(res$summary.fitted.values)`

```
map$RR <- res$summary.fitted.values[, "mean"]
map$LL <- res$summary.fitted.values[, "0.025quant"]
map$UL <- res$summary.fitted.values[, "0.975quant"]
```

We show the estimated disease risk in an interactive map using `leaflet`

. In the map, we add labels that appear when mouse hovers over the counties showing information about observed and expected counts, SMRs, smokers proportions, RRs, and lower and upper limits of 95% credible intervals.

We observe counties with greater disease risk are located in the west and south east of Pennsylvania, and counties with lower risk are located in the center. The 95% credible intervals indicate the uncertainty in the risk estimates.

```
pal <- colorNumeric(palette = "YlOrRd", domain = map$RR)
labels <- sprintf("<strong> %s </strong> <br/> Observed: %s <br/> Expected: %s <br/>
Smokers proportion: %s <br/>SMR: %s <br/>RR: %s (%s, %s)",
map$county, map$Y, round(map$E, 2), map$smoking, round(map$SMR, 2),
round(map$RR, 2), round(map$LL, 2), round(map$UL, 2)) %>%
lapply(htmltools::HTML)
leaflet(map) %>% addTiles() %>%
addPolygons(color = "grey", weight = 1, fillColor = ~pal(RR), fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px", direction = "auto")) %>%
addLegend(pal = pal, values = ~RR, opacity = 0.5, title = "RR", position = "bottomright")
```

Spatial modeling using areal data.

Datasets about health outcomes are presented at the end of this document. `map`

contains a map with the area id (`areaid`

), number of observed cases (`Y`

), number of expected cases (`E`

), and a covariate (`covariate`

) for each of the areas.

Please choose a dataset and answer the following questions:

- Calculate the Standardized Mortality Ratios (SMRs) for of each area. Create a leaflet map with the SMR values.
- Fit a spatial model to obtain the relative risks for each of the areas. The linear predictor of the model should include an intercept, a covariate, a spatial random effect, and an unstructured random effect.
- What is the effect of the covariate on the disease risk?
- Create a table with columns area id, number of observed cases, number of expected cases, covariate, SIR, RR, and lower and upper limits of 95% CI.
- Create a leaflet map that shows RR values.

**Data 1**

Lip cancer among males in Scotland, 1975-1980.

Covariate: proportion of the population engaged in agriculture, fishing, or forestry (AFF).

Data and map are in package `SpatialEpi`

.

```
library(SpatialEpi)
data(scotland)
d <- scotland$data
map <- scotland$spatial.polygon
# Map is in projection OSGB 1936/British National Grid which has EPSG code 27700
proj4string(map) <- "+proj=tmerc +lat_0=49 +lon_0=-2
+k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36
+units=km +no_defs"
# Transform to latitude/longitude
map <- spTransform(map, CRS("+proj=longlat +datum=WGS84 +no_defs"))
library(sp)
rownames(d) <- d$county.names
map <- SpatialPolygonsDataFrame(map, d, match.ID = TRUE)
map$areaid <- map$county.names
d$areaid <- d$county.names
d$Y <- d$cases
d$E <- d$expected
d$covariate <- d$AFF
order <- match(map$areaid, d$areaid)
map@data <- d[order, c("areaid", "Y", "E", "covariate")]
head(map@data)
```

**Data 2**

Leukemia in upstate New York, 1978-1982.

Covariate: inverse of the distance to the nearest Trichloroethene (TCE) site.

Data and map are in package `DClusterm`

.

```
library(DClusterm)
data(NY8)
d <- NY8
d$areaid <- d$AREAKEY
d$Y <- round(d$TRACTCAS)
d$E <- d$POP8*sum(d$TRACTCAS)/sum(d$POP8)
d$covariate <- d$PEXPOSURE
map <- d[, c("areaid", "Y", "E", "covariate")]
map <- spTransform(map, CRS("+proj=longlat +datum=WGS84 +no_defs"))
head(map@data)
```

**Data 3**

Sudden infant deaths in North Carolina, 1974-1978.

Covariate: latitude.

Data are in package `spdep`

. Map is in package `maptools`

.

```
library(spdep)
library(maptools)
data(nc.sids)
d <- nc.sids
map <- readShapeSpatial(paste0(system.file("shapes", package = "maptools"),"/co37_d90"))
map$areaid <- map$NAME
d$areaid <- rownames(d)
d$Y <- d$SID74
d$E <- d$BIR74*sum(d$SID74)/sum(d$BIR74)
order <- match(map$areaid, d$areaid)
map@data <- d[order, c("areaid", "Y", "E")]
map$covariate <- coordinates(map)[, 2]
head(map@data)
```

**Data 4**

Brain cancer in males in Navarre, Spain, 1988-1994.

Covariate: latitude.

Data and map are in package `DClusterm`

.

```
library(DClusterm)
data(Navarre)
d <- brainnav
d$areaid <- d$NAME
d$Y <- d$OBSERVED
d$E <- d$EXPECTED
map <- d[, c("areaid", "Y", "E")]
map <- spTransform(map, CRS("+proj=longlat +datum=WGS84 +no_defs"))
map$covariate <- coordinates(map)[, 2]
head(map@data)
```

**Data 5**

Lung cancer in Ohio, 1968.

Covariate: latitude.

Data and map are in the folder `appdir`

that can be downloaded from Chapter 15 at https://www.paulamoraga.com/book-geospatial-info/. You need to use `setwd()`

to set the path of the folder `appdir`

in your computer.

```
# set the path of the folder appdir in your computer
setwd("C:\\appdir")
library(rgdal)
d <- read.csv("data/data.csv")
map <- readOGR("data/fe_2007_39_county/fe_2007_39_county.shp")
d <- d[which(d$year == 1968), ]
map@data <- d[ match(map@data$NAME, d$county), ]
map$areaid <- map$county
map$Y <- map$cases
map$E <- map$population*(sum(map$Y)/sum(map$population))
map$covariate <- coordinates(map)[, 2]
map@data <- map@data[, c("areaid", "Y", "E", "covariate")]
head(map@data)
```

**Data 6**

Lung cancer in Ohio, 1988.

Covariate: longitude.

Data and map are in the folder `appdir`

that can be downloaded from Chapter 15 at https://www.paulamoraga.com/book-geospatial-info/. You need to use `setwd()`

to set the path of the folder `appdir`

in your computer.

```
# set the path of the folder appdir in your computer
setwd("C:\\appdir")
library(rgdal)
d <- read.csv("data/data.csv")
map <- readOGR("data/fe_2007_39_county/fe_2007_39_county.shp")
d <- d[which(d$year == 1988), ]
map@data <- d[ match(map@data$NAME, d$county), ]
map$areaid <- map$county
map$Y <- map$cases
map$E <- map$population*(sum(map$Y)/sum(map$population))
map$covariate <- coordinates(map)[, 1]
map@data <- map@data[, c("areaid", "Y", "E", "covariate")]
head(map@data)
```

Moraga, P. Small Area Disease Risk Estimation and Visualization Using R. The R Journal, 10(1):495-506, 2018 https://journal.r-project.org/archive/2018/RJ-2018-036/index.html

Moraga, P. (2019). Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series, http://www.paulamoraga.com/book-geospatial

This work by Paula Moraga is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.