library(tidyverse)
library(sf)
library(rnaturalearth)
africa <- ne_countries(continent = "Africa",
returnclass = "sf")
africa %>%
select(geometry)%>%
head()
#> Simple feature collection with 6 features and 0 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -17.06342 ymin: -13.25723 xmax: 51.13387 ymax: 27.65643
#> Geodetic CRS: WGS 84
#> geometry
#> 1 MULTIPOLYGON (((33.90371 -0...
#> 2 MULTIPOLYGON (((-8.66559 27...
#> 3 MULTIPOLYGON (((29.34 -4.49...
#> 4 MULTIPOLYGON (((41.58513 -1...
#> 5 MULTIPOLYGON (((39.20222 -4...
#> 6 MULTIPOLYGON (((24.56737 8....
12 Spatial Data Modelling and Visualisation
Learning Objectives
- Learn how to model and visualise spatial data
- Understand the concepts of spatial data, spatial data models, and spatial models
- Create maps, simulate infections, and predict the spatial distribution of phenomena
In the previous chapters, we explored the fundamentals of data visualisation with various techniques for creating static and interactive plots. In this chapter, we will learn how to model and visualise spatial data, which involves analysing and representing data with spatial components. We will explore the concepts of spatial data, spatial data models, and spatial models, and learn how to create maps, simulate infections, and predict the spatial distribution of phenomena such as disease spread. By combining spatial data with machine learning techniques, we can model the spatial dynamics of disease transmission and assess the risks associated with the spread of infections. We will use various packages such as sf
, ggplot2
, and gstat
to create spatial models and visualisations, and gain insights into the spatial distribution of infections in a specific region.
12.1 Spatial Data, Spatial Data Models and Spatial Models
Understanding the spatial distribution of phenomena such as disease spread, land cover changes, or climate patterns requires the use of spatial data and spatial models. These tools are essential for analysing and predicting spatial processes and for informing data-driven decisions in public health, environmental management, and urban planning.
A notable example is the Ebola Virus Disease outbreak in West Africa (2014–2016). The virus spread across countries like Guinea, Liberia, and Sierra Leone, driven by factors such as population density, healthcare infrastructure, and human mobility. By analysing spatial data on cases, fatalities, and healthcare facilities and applying spatial models to simulate the virus’s spread, researchers gained valuable insights into the dynamics of the outbreak. This analysis helped identify high-risk areas for targeted interventions1.
But what exactly distinguishes spatial data, spatial data models, and spatial models?
- Spatial Data: This refers to data that includes geographic coordinates, addresses, or boundaries, representing the physical location of objects or events. Spatial data can be stored in two main formats: vector and raster. Vector data models use geometric shapes (points, lines, polygons) to represent features such as roads, cities, and boundaries. Raster data models, on the other hand, use a grid of cells or pixels, each with a value representing a variable like temperature, land cover, or elevation.
- Spatial Data Models: These are frameworks for organizing and representing spatial data. They provide the structure for connecting real-world phenomena with their digital representation through algorithms and spatial primitives (like spatial relationships and topology). Spatial data models form the foundation for managing and processing spatial information, enabling meaningful analysis and visualization.
- Spatial Models: Unlike spatial data models, which organize data, spatial models simulate dynamic spatial processes—phenomena that change over time. Examples include the spread of infectious diseases, flood development, or land use changes. Spatial models are crucial for understanding and forecasting the evolution of spatial phenomena, allowing researchers and decision-makers to plan interventions and management strategies effectively.
For further information on spatial data models and spatial process models, refer to the ArcGIS documentation for guidance on solving spatial problems and the Esri blog for in-depth discussions on spatial analysis. Additionally, R-Spatial.org offers a wealth of resources on spatial data analysis, including tutorials and the latest developments in the field.
12.2 Making a Map
To learn how to make a map we first need spatial data. There are various sources and packages that provide spatial data for different regions and purposes. One such package is rnaturalearth, this package contains various types of boundaries of countries in the world. For more information type ?rnaturalearth
. In particular, we use the ne_countries()
function to get the boundaries of Africa, with the option returnclass = "sf"
to return the data as simple features.
A simple feature object is a type of spatial data that contains the geometry of the spatial features (e.g., points, lines, polygons) and additional attributes such as the name of the country, population, and other information.
This is a simple map of Africa, but we can make it more interesting by adding some colours to the countries. A better resolution of the colored map can be found in the online version of the book.
ggplot(data = africa) +
geom_sf() +
coord_sf()
# Africa with colours
ggplot(data = africa) +
geom_sf(aes(fill=name),
color= "white",
linewidth = 1,
show.legend = F) +
coord_sf()
12.3 Coordinate Reference System (CRS)
A coordinate reference system (CRS) is a standardised way of defining the spatial location of features on the Earth’s surface. It uses a set of coordinates to represent the position of points, lines, and polygons in space. There are different types of CRS, such as geographic and projected CRS, which are used to represent the Earth’s surface in different ways. Geographic CRS (LatLong) represents locations on a curved surface using latitude and longitude for global mapping, while projected CRS or Universal Transverse Mercator (UTM) projects the curved Earth onto a flat map, using metres for precise mapping. In particular, the UTM system provides the distance in metres from the equator and the central meridian of a specific zone.
We use the sf package to define and transform CRS. The st_crs()
function allows you to retrieve the CRS of a spatial object, while the st_transform()
function allows you to transform the coordinates of a spatial object to a different CRS.
africa_crs <- st_crs(africa)
africa_crs
#> Coordinate Reference System:
#> User input: WGS 84
#> wkt:
#> GEOGCRS["WGS 84",
#> DATUM["World Geodetic System 1984",
#> ELLIPSOID["WGS 84",6378137,298.257223563,
#> LENGTHUNIT["metre",1]]],
#> PRIMEM["Greenwich",0,
#> ANGLEUNIT["degree",0.0174532925199433]],
#> CS[ellipsoidal,2],
#> AXIS["latitude",north,
#> ORDER[1],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> AXIS["longitude",east,
#> ORDER[2],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> ID["EPSG",4326]]
For example, the CRS of the Africa map is WGS 84 or World Geodetic System 1984 which is a reference system for the Earth’s surface that defines origin and orientation of the coordinate axes, called a datum (a fact known from direct observation). The EPSG is a structured dataset of CRS and Coordinate Transformations, it was originally compiled by the, now defunct, European Petroleum Survey Group2. The EPSG code for WGS 84 is 4326.
12.4 Example: Simulation of Infections in Central African Rep.
Synthetic data are created for the Central African Republic. The simulation of the number of infected individuals, their location, and temperature levels are obtained with random numbers generating functions from the {stats}
package, which is a base package.
The Central African Rep. is a landlocked country in Central Africa, with a population of approximately 5 million people. The country is known for its diverse wildlife and natural resources, and has faced challenges such as political instability and armed conflict.
12.4.1 Bounding Box
We can use the st_bbox()
function to retrieve the bounding box of the ctr_africa
object, as a matrix with the minimum and maximum values of the coordinates. A bounding box is a rectangular area defined by two points: the lower-left corner (minimum values of the coordinates) and the upper-right corner (maximum values of the coordinates). It provides a quick way to determine the spatial extent of a region and is often used in spatial analysis to define the boundaries of a study area.
12.4.2 Spatial Coordinates
The st_coordinates()
function allows to extract the coordinates from a geometry vector and release a matrix with 5 columns, where the first two columns are the longitude (X) and latitude (Y), and the other are indexes to group the coordinates into polygons.
The as.data.frame()
function converts the matrix to a data frame, and the dplyr::select()
function selects the longitude and latitude columns. The summary()
function provides a summary of the data frame, including the mean, median, and quartiles of the longitude and latitude values.
ctr_africa_coords <- ctr_africa %>%
sf::st_coordinates() %>%
as.data.frame() %>%
dplyr::select(X, Y)
ctr_africa_coords %>%
summary()
#> X Y
#> Min. :14.46 Min. : 2.268
#> 1st Qu.:16.58 1st Qu.: 4.648
#> Median :20.96 Median : 5.354
#> Mean :20.70 Mean : 6.294
#> 3rd Qu.:24.26 3rd Qu.: 8.145
#> Max. :27.37 Max. :11.142
Synthetic data are created as an image of the spread of infections observed in Central Africa on a specific point in time. The data is generated using random numbers for the number of infected individuals, their location, and temperature levels. The temperature levels consider a min of 20.3 degrees Celsius (69 degrees Fahrenheit), a maximum of 29.2 °C (85 °F), and a daily average of 24.7 °C (76 °F).
The synthetic data is created using the rbinom()
, rnorm()
, and rpois()
functions from the {stats}
package. The data is generated for 100 locations in Central Africa, with 70% of the locations infected and 30% non-infected. The latitude and longitude values are generated using the rnorm()
function with mean values of 6.294 and 20.70, respectively. The number of infected individuals is generated using the rpois()
function with a mean value of 10, and the temperature levels are generated using the rnorm()
function with a mean value of 25.
Set the number of points:
num_points <- 100
12.4.3 Data Simulation
set.seed(21082024) # set seed for reproducibility
# simulate the presence of infection
# 1 = "infected" and 0 = "non_infected"
longitude <- rnorm(n = num_points, mean = 20.70, sd = 1)
latitude <- rnorm(n = num_points, mean = 6.294, sd = 1)
presence <- rbinom(100, 1, prob = 0.3)
cases <- ifelse(presence == 1,
rpois(n = num_points*0.7, lambda = 10), 0)
temperature <- rnorm(n = num_points,
mean = 24.7,
sd = (29.2 - 20.3) / 4)
# build a dataframe
df <- data.frame(longitude, latitude,
presence, cases, temperature)
head(df)
#> longitude latitude presence cases temperature
#> 1 21.09134 6.052162 0 0 20.58159
#> 2 22.25082 7.814450 0 0 29.77619
#> 3 21.52356 6.101959 0 0 23.97023
#> 4 19.76787 6.004291 0 0 28.53059
#> 5 21.25535 5.239799 0 0 22.16691
#> 6 20.10175 7.256303 0 0 20.66836
The str()
function allows to inspect the structure of the data frame, showing the type of each column and the first few rows of the data frame.
df %>%
filter(presence == 1) %>%
head() %>%
str()
#> 'data.frame': 6 obs. of 5 variables:
#> $ longitude : num 18.8 20.3 22.9 20 20.1 ...
#> $ latitude : num 4.54 5.09 6.21 8.42 7.63 ...
#> $ presence : int 1 1 1 1 1 1
#> $ cases : num 13 10 11 13 9 9
#> $ temperature: num 24 27.3 25.2 25.1 24.2 ...
ggplot() +
geom_sf(data = africa) +
geom_sf(data = ctr_africa, fill = "brown") +
labs(title = "Map of Africa and Central African Rep.")
ggplot() +
geom_sf(data = africa) +
geom_sf(data = ctr_africa,
fill = alpha("brown", alpha = 0.2)) +
geom_point(data = df,
aes(x = longitude, y = latitude,
color = factor(presence)),
shape = ".") +
labs(title = "Synthetic Data Visualisation",
color = "Infection Status") +
theme(legend.position = "top")
12.4.4 Correlation between Response and Predictor
The correlation coefficient between the number of infected individuals and the temperature can be calculated using the cor()
function with the method set to “pearson”. This will provide a measure of the strength and direction of the linear relationship between the two variables. A correlation coefficient close to 1 indicates a strong positive linear relationship, while a coefficient close to -1 indicates a strong negative linear relationship. A coefficient close to 0 indicates no significant linear relationship between the variables.
cor(df$cases, df$temperature, method = "pearson")
#> [1] -0.05243272
The correlation coefficient is -0.052, suggesting that there is a weak negative linear relationship between the number of infected individuals and the temperature. This indicates that as the temperature increases, the number of infected individuals decreases slightly, but the relationship is not significant.
12.4.5 Histogram and Scatter Plot
To visualise the distribution of the presence of infection in Central Africa, we can create a histogram of the presence variable using the geom_bar()
function from the ggplot2 package. The histogram shows the frequency of infected and non-infected locations in Central Africa, with the x-axis representing the presence of infection and the y-axis representing the count of locations.
ggplot(data = df,
aes(x = factor(presence))) +
geom_bar() +
labs(title = "Histogram of Presence of Infection",
x = "Presence",
y = "Count") +
theme_minimal()
ggplot(data = df %>% filter(presence== 1),
aes(x = temperature, y = cases)) +
geom_point() +
geom_smooth() +
labs(title = "Relationship between Temperature and Number of Infected Individuals",
x = "Temperature",
y = "Number of Infected Individuals") +
theme_minimal()
The scatter plot shows the temperature on the x-axis and the number of infected individuals on the y-axis, with each point representing an infected location in Central Africa. The geom_point()
function adds the points to the plot, while the geom_smooth()
function fits a smooth curve to the data, showing the trend between independent and dependent variables, temperature and the number of infected individuals, respectively. The labs()
function adds titles and labels to the plot, specifying the title, x-axis label, and y-axis label. The theme_minimal()
function sets the plot theme to a minimal style.
12.4.6 Grid of Points
To create a grid of points we transform the df
data as a simple features object with the st_as_sf()
function from the sf package, specifying the longitude and latitude columns as the coordinates and the CRS as 43263. Then, intersect the data with st_intersection(ctr_africa)
to keep only the points within the country. Finally, select the relevant columns for the analysis.
df_sf <- df %>%
st_as_sf(coords = c("longitude","latitude"),
# or use crs = 4326
crs = "+proj=longlat +datum=WGS84") %>%
st_intersection(ctr_africa) %>%
st_make_valid()
grid <- df_sf %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(what = "centers",
cellsize = .2,
square = F)
To generate the grid of points to cover all Central African Republic we can also use the expand_grid()
function from the tidyr package, which generates a series of points within specified bounding box, and the point in polygon (PtInPoly()
) function from the DescTools package. The pip
vector contains the information if the point is inside the polygon or not.
library(DescTools)
set.seed(240724) # set seed for reproducibility
bbox_grid <- expand_grid(x = seq(from = bbox$xmin,
to = bbox$xmax,
length.out = 100),
y = seq(from = bbox$ymin,
to = bbox$ymax,
length.out = 100))
ctr_africa_grid_full <- data.frame(PtInPoly(bbox_grid,
ctr_africa_coords))
ctr_africa_grid_full %>% head()
#> x y pip
#> 1 14.45941 2.267640 0
#> 2 14.45941 2.357284 0
#> 3 14.45941 2.446928 0
#> 4 14.45941 2.536572 0
#> 5 14.45941 2.626216 0
#> 6 14.45941 2.715860 0
Mapping the grid of points on the map of the Central African Republic, we can visualise the spatial distribution of infected locations.
# Window Grid
ggplot() +
geom_sf(data = ctr_africa)+
geom_sf(data = grid, shape = ".") +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(fill = presence),
shape = 21, stroke = 0.3) +
coord_sf() +
labs(title = "Window Grid") +
theme(legend.position = "right")
# Central Africa Grid
ggplot(data = ctr_africa_grid_full) +
geom_point(aes(x, y, color = factor(pip)), shape = ".") +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(fill = presence),
shape = 21, stroke = 0.3,
show.legend = F,) +
coord_sf() +
scale_color_manual(values = c("1" = "grey20", "0" = "grey")) +
labs(title = "Grid Map",
color = "PIP") +
theme(legend.position = "top")
12.4.7 Create a Raster of Temperature
To visualise the temperature data on the map of the Central African Republic we create a raster template. The raster template has the same extent as the simulated infected locations and will be used to rasterize the temperature data. This means that the temperature data will be converted into a raster format, with each cell in the raster grid representing a specific temperature value. The rasterized temperature data will be visualised as a heatmap on the map, with warmer colours indicating higher temperatures.
The raster template is created using the rast()
function from the terra package, specifying the number of rows and columns, the minimum and maximum values for the x and y coordinates, and the coordinate reference system (CRS) of the raster. If you want to know more about the package, type ?terra
.
The raster template is set to have 18 rows and 36 columns, with the x and y coordinates ranging from 11 to 28 and 2 to 12, respectively.
Rasterize the temperature data using the rasterize()
function to convert the temperature data from the data frame to a raster format based on the raster template. The temperature data is assigned to the raster cells based on the latitude and longitude coordinates.
ctr_africa_raster <- rasterize(df_sf,
raster_template,
field = "temperature",
fun = max)
Before plotting the rasterized temperature data on the map of the Central African Republic, we need to convert the rasterized data to a data frame format. This will allow us to visualise the temperature data as a heatmap on the map. The rasterized data contains the temperature values for each cell in the raster grid, which are assigned based on the latitude and longitude coordinates of the temperature data.
ctr_africa_raster_df <- as.data.frame(ctr_africa_raster,
xy = TRUE)
ctr_africa_raster_df %>% head()
#> x y max
#> 200 20.20833 8.944444 24.09246
#> 203 21.62500 8.944444 21.51325
#> 205 22.56944 8.944444 23.46981
#> 234 19.26389 8.388889 25.77824
#> 235 19.73611 8.388889 25.11388
#> 268 18.31944 7.833333 25.24466
# Temperature Raster
ggplot() +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
viridis::scale_fill_viridis(name = "Temperature",
na.value = "transparent") +
labs(title = "Rasterized Temperature") +
theme(legend.position = "right")
# Central Africa Raster
ggplot() +
geom_point(data = ctr_africa_grid_full %>% filter(pip == 1),
aes(x = x, y = y),
shape = ".") +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(size = cases),
shape = 21, stroke = 0.3) +
scale_fill_gradient(low = "white", high = "grey30",
na.value = "transparent") +
labs(title = "Max Temperature in Central African Rep.\n(Simulated Values)",
size = "Number of Infections",
fill = "Max Temperature") +
theme(legend.position = "right")
This plot shows the temperature data as a heatmap on the map of the Central African Republic, with warmer colours indicating higher temperatures. The infected locations are represented as circles with size indicating the number of infected individuals at each location.
12.4.8 The Epicenter of Infection
A central point that represents the spatial distribution of infections in the region is the center of mass. It is a useful metric for understanding the spatial dynamics of disease spread and identifying critical zones for intervention. To calculate the coordinates of the center of mass for the infections, even called the epicenter of the infections, we use the center_of_mass()
function.
The center of mass is the average of the latitude and longitude of the infected individuals, weighted by the number of infected individuals at each location.
df_com <- tibble(longitude = center_of_mass(df)[1],
latitude = center_of_mass(df)[2])
df_com
#> # A tibble: 1 × 2
#> longitude latitude
#> <dbl> <dbl>
#> 1 20.7 6.47
Finally, plot the infections on the map of the Central African Republic, with the temperature data represented as a raster layer. The temperature data is displayed as a heatmap, with warmer colours indicating higher temperatures. The infections are shown as points on the map, with different colours representing infected and non-infected locations.
Note, here a new package is used, the ggnewscale package, which is a package that allows you to add multiple colour scales to a single plot. This is useful when you want to visualise different variables with different colour scales in the same plot.
# Location of Infections
ggplot() +
geom_sf(data = df_sf, aes(shape = factor(presence),
color = factor(presence))) +
geom_point(data = df_com,
aes(x = longitude, y = latitude),
color = "red", size = 4) +
scale_colour_manual("Infection Status",
values = c("1" = "red", "0" = "blue")) +
# combine shape and colour legends
guides(color = guide_legend(title = "Infection Status"),
shape = guide_legend(title = "Infection Status")) +
labs(title = "Simulated Locations in Central African Rep.",
subtitle = "Synthetic Data with Infections Center of Mass") +
theme(legend.position = "right")
# Infections Size on Max Temperature
ggplot() +
geom_point(data = ctr_africa_grid_full %>% filter(pip == 1),
aes(x = x, y = y), shape = ".") +
geom_sf(data = ctr_africa, fill = NA) +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
scale_fill_gradient(low = "white", high = "grey30",
na.value = "transparent") +
geom_sf(data = df_sf %>% filter(presence == 1),
mapping = aes(group = presence),
alpha = 0.8, color = "red", size = 0.5) +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(size = cases),
shape = 21, stroke = 0.3) +
geom_point(data = df_com,
aes(x = longitude, y = latitude),
color = "red", size = 4, shape = 13) +
labs(title = "Simulated Infections with Size on Max Temperature in Central African Rep.",
subtitle = "Synthetic Data with Center of Mass",
fill = "Temperature",
size = "Number") +
theme(legend.position = "right")
Interactive Maps
The infections can be visualised with mapview or leaflet, type of interactive maps, here is a code snippet of how to do it.
Map with Mapview:
This is the interactive map with the max temperatures for the simulated data in Central African Republic. The data represented as a raster points, with warmer colours indicating higher temperatures are shown as points on the map.