library(tidyverse)
library(sf)
library(rnaturalearth)
africa <- ne_countries(continent = "Africa",
returnclass = "sf")
africa %>%
select(geometry)%>%
head()
#> Simple feature collection with 6 features and 0 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -17.06342 ymin: -13.25723 xmax: 51.13387 ymax: 27.65643
#> Geodetic CRS: WGS 84
#> geometry
#> 1 MULTIPOLYGON (((33.90371 -0...
#> 2 MULTIPOLYGON (((-8.66559 27...
#> 3 MULTIPOLYGON (((29.34 -4.49...
#> 4 MULTIPOLYGON (((41.58513 -1...
#> 5 MULTIPOLYGON (((39.20222 -4...
#> 6 MULTIPOLYGON (((24.56737 8....
12 Spatial Data Modelling and Visualization
In the previous chapters, we explored the fundamentals of data visualization with various techniques for creating static and interactive plots. In this chapter, we will learn how to model and visualize spatial data, which involves analyzing and representing data with spatial components. We will explore the concepts of spatial data, spatial data models, and spatial models, and learn how to create maps, simulate infections, and predict the spatial distribution of phenomena such as disease spread. By combining spatial data with machine learning techniques, we can model the spatial dynamics of disease transmission and assess the risks associated with the spread of infections. We will use various packages such as sf
, ggplot2
, and gstat
to create spatial models and visualizations, and gain insights into the spatial distribution of infections in a specific region.
12.1 Spatial Data, Spatial Data Models and Spatial Models
Essential for understanding the spatial distribution of phenomena such as disease spread, land cover change, or climate patterns, spatial data and spatial models are crucial for analyzing and predicting spatial processes. But, what is the difference between spatial data, spatial data models, and spatial models?
Spatial data includes geographic coordinates, addresses, or boundaries. This type of data is most often represented by one of two data models: vector or raster. Vector data models use geometric shapes like points, lines, and polygons to represent spatial features, whereas raster data models use a grid of cells or pixels, where each cell has a value representing information such as temperature or land cover.
Another important distinction is that between spatial data models and spatial models. Spatial data models are frameworks for organizing and representing spatial data, connecting our understanding of events to their representation and processing through algorithms as spatial primitives and relationships.
On the other hand, spatial models are defined as process models, which represent dynamic spatial processes—phenomena that change over time, such as the spread of a virus, flood formation, or land cover change. These models are crucial for understanding and predicting how spatial phenomena evolve, providing a basis for intervention and management strategies.
To learn more about the differences and applications of spatial data models and process models, you can refer to the ArcGIS documentation on solving spatial problems with representation and process models and the Esri blog on spatial analysis.
Another great source for practical examples and applications of spatial data in R is https://r-spatial.org/ which is the main source for the most recent developments in spatial data analysis in R.
12.2 Making a Map
To learn how to make a map we first need spatial data. There are various sources and packages that provide spatial data for different regions and purposes. One such package is rnaturalearth, this package contains various types of boundaries of countries in the world. For more information type ?rnaturalearth
. In particular, we use the ne_countries()
function to get the boundaries of Africa, with the option returnclass = "sf"
to return the data as simple features.
A simple feature object is a type of spatial data that contains the geometry of the spatial features (e.g., points, lines, polygons) and additional attributes such as the name of the country, population, and other information.
This is a simple map of Africa, but we can make it more interesting by adding some colors to the countries. A better resolution of the colored map can be found in the online version of the book.
ggplot(data = africa) +
geom_sf() +
coord_sf()
# Africa with colors
ggplot(data = africa) +
geom_sf(aes(fill=name),
color= "white",
linewidth = 1,
show.legend = F) +
coord_sf()
12.3 Coordinate Reference System (CRS)
A coordinate reference system (CRS) is a standardized way of defining the spatial location of features on the Earth’s surface. It uses a set of coordinates to represent the position of points, lines, and polygons in space. There are different types of CRS, such as geographic and projected CRS, which are used to represent the Earth’s surface in different ways. Geographic CRS (LatLong) represents locations on a curved surface using latitude and longitude for global mapping, while projected CRS or Universal Transverse Mercator (UTM) projects the curved Earth onto a flat map, using meters for precise mapping. In particular, the UTM system provides the distance in meters from the equator and the central meridian of a specific zone.
We use the sf package to define and transform CRS. The st_crs()
function allows you to retrieve the CRS of a spatial object, while the st_transform()
function allows you to transform the coordinates of a spatial object to a different CRS.
africa_crs <- st_crs(africa)
africa_crs
#> Coordinate Reference System:
#> User input: WGS 84
#> wkt:
#> GEOGCRS["WGS 84",
#> DATUM["World Geodetic System 1984",
#> ELLIPSOID["WGS 84",6378137,298.257223563,
#> LENGTHUNIT["metre",1]]],
#> PRIMEM["Greenwich",0,
#> ANGLEUNIT["degree",0.0174532925199433]],
#> CS[ellipsoidal,2],
#> AXIS["latitude",north,
#> ORDER[1],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> AXIS["longitude",east,
#> ORDER[2],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> ID["EPSG",4326]]
For example, the CRS of the Africa map is WGS 84 or World Geodetic System 1984 which is a reference system for the Earth’s surface that defines origin and orientation of the coordinate axes, called a datum (a fact known from direct observation). The EPSG is a structured dataset of CRS and Coordinate Transformations, it was originally compiled by the, now defunct, European Petroleum Survey Group1. The EPSG code for WGS 84 is 4326.
12.4 Simulate Infections in Central Africa Rep.
Synthetic data are created for the Central African Republic. The simulation of the number of infected individuals, their location, and temperature levels are obtained with random numbers generating functions from the {stats}
package, which is a base package in R.
The Central African Rep. is a landlocked country in Central Africa, with a population of approximately 5 million people. The country is known for its diverse wildlife and natural resources, and has faced challenges such as political instability and armed conflict.
12.4.0.1 Bounding Box
We can use the st_bbox()
function to retrieve the bounding box of the ctr_africa
object, as a matrix with the minimum and maximum values of the coordinates. A bounding box is a rectangular area defined by two points: the lower-left corner (minimum values of the coordinates) and the upper-right corner (maximum values of the coordinates). It provides a quick way to determine the spatial extent of a region and is often used in spatial analysis to define the boundaries of a study area.
12.4.0.2 Spatial Coordinates
The st_coordinates()
function allows to extract the coordinates from a geometry vector and release a matrix with 5 columns, where the first two columns are the longitude (X) and latitude (Y), and the other are indexes to group the coordinates into polygons.
The as.data.frame()
function converts the matrix to a data frame, and the dplyr::select()
function selects the longitude and latitude columns. The summary()
function provides a summary of the data frame, including the mean, median, and quartiles of the longitude and latitude values.
ctr_africa_coords <- ctr_africa %>%
sf::st_coordinates() %>%
as.data.frame() %>%
dplyr::select(X, Y)
ctr_africa_coords %>%
summary()
#> X Y
#> Min. :14.46 Min. : 2.268
#> 1st Qu.:16.58 1st Qu.: 4.648
#> Median :20.96 Median : 5.354
#> Mean :20.70 Mean : 6.294
#> 3rd Qu.:24.26 3rd Qu.: 8.145
#> Max. :27.37 Max. :11.142
Synthetic data are created as an image of the spread of infections observed in Central Africa on a specific point in time. The data is generated using random numbers for the number of infected individuals, their location, and temperature levels. The temperature levels consider a min of 20.3 degrees Celsius (69 degrees Fahrenheit), a maximum of 29.2 °C (85 °F), and a daily average of 24.7 °C (76 °F).
The synthetic data is created using the rbinom()
, rnorm()
, and rpois()
functions from the {stats}
package. The data is generated for 100 locations in Central Africa, with 70% of the locations infected and 30% non-infected. The latitude and longitude values are generated using the rnorm()
function with mean values of 6.294 and 20.70, respectively. The number of infected individuals is generated using the rpois()
function with a mean value of 10, and the temperature levels are generated using the rnorm()
function with a mean value of 25.
Set the number of points:
num_points <- 100
Simulate Data:
set.seed(21082024) # set seed for reproducibility
# simulate the presence of infection
# 1 = "infected" and 0 = "non_infected"
longitude <- rnorm(n = num_points, mean = 20.70, sd = 1)
latitude <- rnorm(n = num_points, mean = 6.294, sd = 1)
presence <- rbinom(100, 1, prob = 0.3)
cases <- ifelse(presence == 1,
rpois(n = num_points*0.7, lambda = 10), 0)
temperature <- rnorm(n = num_points,
mean = 24.7,
sd = (29.2 - 20.3) / 4)
# build a dataframe
df <- data.frame(longitude, latitude,
presence, cases, temperature)
head(df)
#> longitude latitude presence cases temperature
#> 1 21.09134 6.052162 0 0 20.58159
#> 2 22.25082 7.814450 0 0 29.77619
#> 3 21.52356 6.101959 0 0 23.97023
#> 4 19.76787 6.004291 0 0 28.53059
#> 5 21.25535 5.239799 0 0 22.16691
#> 6 20.10175 7.256303 0 0 20.66836
The str()
function allows to inspect the structure of the data frame, showing the type of each column and the first few rows of the data frame.
df %>%
filter(presence == 1) %>%
head() %>%
str()
#> 'data.frame': 6 obs. of 5 variables:
#> $ longitude : num 18.8 20.3 22.9 20 20.1 ...
#> $ latitude : num 4.54 5.09 6.21 8.42 7.63 ...
#> $ presence : int 1 1 1 1 1 1
#> $ cases : num 13 10 11 13 9 9
#> $ temperature: num 24 27.3 25.2 25.1 24.2 ...
ggplot() +
geom_sf(data = africa) +
geom_sf(data = ctr_africa, fill = "brown") +
labs(title = "Map of Africa and Central Africa Rep.")
ggplot() +
geom_sf(data = africa) +
geom_sf(data = ctr_africa,
fill = alpha("brown", alpha = 0.2)) +
geom_point(data = df,
aes(x = longitude, y = latitude,
color = factor(presence)),
shape = ".") +
labs(title = "Synthetic Data Visualization",
color = "Infection Status") +
theme(legend.position = "top")
12.5 Correlation between Response and Predictor
The correlation coefficient between the number of infected individuals and the temperature can be calculated using the cor()
function with the method set to “pearson”. This will provide a measure of the strength and direction of the linear relationship between the two variables. A correlation coefficient close to 1 indicates a strong positive linear relationship, while a coefficient close to -1 indicates a strong negative linear relationship. A coefficient close to 0 indicates no significant linear relationship between the variables.
cor(df$cases, df$temperature, method = "pearson")
#> [1] -0.05243272
The correlation coefficient is -0.052, suggesting that there is a weak negative linear relationship between the number of infected individuals and the temperature. This indicates that as the temperature increases, the number of infected individuals decreases slightly, but the relationship is not significant.
12.6 Histogram and Scatter Plot
To visualize the distribution of the presence of infection in Central Africa, we can create a histogram of the presence variable using the geom_bar()
function from the ggplot2 package. The histogram shows the frequency of infected and non-infected locations in Central Africa, with the x-axis representing the presence of infection and the y-axis representing the count of locations.
ggplot(data = df,
aes(x = factor(presence))) +
geom_bar() +
labs(title = "Histogram of Presence of Infection",
x = "Presence",
y = "Count") +
theme_minimal()
ggplot(data = df %>% filter(presence== 1),
aes(x = temperature, y = cases)) +
geom_point() +
geom_smooth() +
labs(title = "Relationship between Temperature and Number of Infected Individuals",
x = "Temperature",
y = "Number of Infected Individuals") +
theme_minimal()
The scatter plot shows the temperature on the x-axis and the number of infected individuals on the y-axis, with each point representing an infected location in Central Africa. The geom_point()
function adds the points to the plot, while the geom_smooth()
function fits a smooth curve to the data, showing the trend between independent and dependent variables, temperature and the number of infected individuals, respectively. The labs()
function adds titles and labels to the plot, specifying the title, x-axis label, and y-axis label. The theme_minimal()
function sets the plot theme to a minimal style.
12.7 Grid of Points
To create a grid of points we transform the df
data as a simple features object with the st_as_sf()
function from the sf package, specifying the longitude and latitude columns as the coordinates and the CRS as 43262. Then, intersect the data with st_intersection(ctr_africa)
to keep only the points within the country. Finally, select the relevant columns for the analysis.
df_sf <- df %>%
st_as_sf(coords = c("longitude","latitude"),
# or use crs = 4326
crs = "+proj=longlat +datum=WGS84") %>%
st_intersection(ctr_africa) %>%
st_make_valid()
grid <- df_sf %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(what = "centers",
cellsize = .2,
square = F)
To generate the grid of points to cover all Central African Republic we can also use the expand_grid()
function from the tidyr package, which generates a series of points within specified bounding box, and the point in polygon (PtInPoly()
) function from the DescTools package. The pip
vector contains the information if the point is inside the polygon or not.
library(DescTools)
set.seed(240724) # set seed for reproducibility
bbox_grid <- expand_grid(x = seq(from = bbox$xmin,
to = bbox$xmax,
length.out = 100),
y = seq(from = bbox$ymin,
to = bbox$ymax,
length.out = 100))
ctr_africa_grid_full <- data.frame(PtInPoly(bbox_grid,
ctr_africa_coords))
ctr_africa_grid_full %>% head()
#> x y pip
#> 1 14.45941 2.267640 0
#> 2 14.45941 2.357284 0
#> 3 14.45941 2.446928 0
#> 4 14.45941 2.536572 0
#> 5 14.45941 2.626216 0
#> 6 14.45941 2.715860 0
Mapping the grid of points on the map of Central African Republic, we can visualize the spatial distribution of infected locations.
# Window Grid
ggplot() +
geom_sf(data = ctr_africa)+
geom_sf(data = grid, shape = ".") +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(fill = presence),
shape = 21, stroke = 0.3) +
coord_sf() +
labs(title = "Window Grid") +
theme(legend.position = "right")
# Central Africa Grid
ggplot(data = ctr_africa_grid_full) +
geom_point(aes(x, y, color = factor(pip)), shape = ".") +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(fill = presence),
shape = 21, stroke = 0.3,
show.legend = F,) +
coord_sf() +
scale_color_manual(values = c("1" = "grey20", "0" = "grey")) +
labs(title = "Grid Map",
color = "PIP") +
theme(legend.position = "top")
12.8 Create a Raster of Temperature
To visualize the temperature data on the map of Central African Republic we create a raster template. The raster template has the same extent as the simulated infected locations and will be used to rasterize the temperature data. This means that the temperature data will be converted into a raster format, with each cell in the raster grid representing a specific temperature value. The rasterized temperature data will be visualized as a heatmap on the map, with warmer colors indicating higher temperatures.
The raster template is created using the rast()
function from the terra package, specifying the number of rows and columns, the minimum and maximum values for the x and y coordinates, and the coordinate reference system (CRS) of the raster. If you want to know more about the package, type ?terra
.
The raster template is set to have 18 rows and 36 columns, with the x and y coordinates ranging from 11 to 28 and 2 to 12, respectively.
Rasterize the temperature data using the rasterize()
function to convert the temperature data from the data frame to a raster format based on the raster template. The temperature data is assigned to the raster cells based on the latitude and longitude coordinates.
ctr_africa_raster <- rasterize(df_sf,
raster_template,
field = "temperature",
fun = max)
Before plotting the rasterized temperature data on the map of Central African Republic, we need to convert the rasterized data to a data frame format. This will allow us to visualize the temperature data as a heatmap on the map. The rasterized data contains the temperature values for each cell in the raster grid, which are assigned based on the latitude and longitude coordinates of the temperature data.
ctr_africa_raster_df <- as.data.frame(ctr_africa_raster,
xy = TRUE)
ctr_africa_raster_df %>% head()
#> x y max
#> 200 20.20833 8.944444 24.09246
#> 203 21.62500 8.944444 21.51325
#> 205 22.56944 8.944444 23.46981
#> 234 19.26389 8.388889 25.77824
#> 235 19.73611 8.388889 25.11388
#> 268 18.31944 7.833333 25.24466
# Temperature Raster
ggplot() +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
viridis::scale_fill_viridis(name = "Temperature",
na.value = "transparent") +
labs(title = "Rasterized Temperature") +
theme(legend.position = "right")
# Central Africa Raster
ggplot() +
geom_point(data = ctr_africa_grid_full %>% filter(pip == 1),
aes(x = x, y = y),
shape = ".") +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(size = cases),
shape = 21, stroke = 0.3) +
scale_fill_gradient(low = "white", high = "grey30",
na.value = "transparent") +
labs(title = "Max Temperature in Central African Rep.\n(Simulated Values)",
size = "Number of Infections",
fill = "Max Temperature") +
theme(legend.position = "right")
This plot shows the temperature data as a heatmap on the map of Central African Republic, with warmer colors indicating higher temperatures. The infected locations are represented as circles with size indicating the number of infected individuals at each location.
12.9 The Epicenter of Infection
A central point that represents the spatial distribution of infections in the region is the center of mass. It is a useful metric for understanding the spatial dynamics of disease spread and identifying critical zones for intervention. To calculate the coordinates of the center of mass for the infections, even called epicenter of the infections, we use the center_of_mass()
function.
The center of mass is the average of the latitude and longitude of the infected individuals, weighted by the number of infected individuals at each location.
df_com <- tibble(longitude = center_of_mass(df)[1],
latitude = center_of_mass(df)[2])
df_com
#> # A tibble: 1 × 2
#> longitude latitude
#> <dbl> <dbl>
#> 1 20.7 6.47
Finally, plot the infections on the map of Central African Republic, with the temperature data represented as a raster layer. The temperature data is displayed as a heatmap, with warmer colors indicating higher temperatures. The infections are shown as points on the map, with different colors representing infected and non-infected locations.
Note, here a new package is used, the ggnewscale package, which is a package that allows you to add multiple color scales to a single plot. This is useful when you want to visualize different variables with different color scales in the same plot.
# Location of Infections
ggplot() +
geom_sf(data = df_sf, aes(shape = factor(presence),
color = factor(presence))) +
geom_point(data = df_com,
aes(x = longitude, y = latitude),
color = "red", size = 4) +
scale_colour_manual("Infection Status",
values = c("1" = "red", "0" = "blue")) +
# combine shape and color legends
guides(color = guide_legend(title = "Infection Status"),
shape = guide_legend(title = "Infection Status")) +
labs(title = "Simulated Locations in Central Africa Rep.",
subtitle = "Synthetic Data with Infections Center of Mass") +
theme(legend.position = "right")
# Infections Size on Max Temperature
ggplot() +
geom_point(data = ctr_africa_grid_full %>% filter(pip == 1),
aes(x = x, y = y), shape = ".") +
geom_sf(data = ctr_africa, fill = NA) +
geom_raster(data = ctr_africa_raster_df,
aes(x = x, y = y, fill = max)) +
scale_fill_gradient(low = "white", high = "grey30",
na.value = "transparent") +
geom_sf(data = df_sf %>% filter(presence == 1),
mapping = aes(group = presence),
alpha = 0.8, color = "red", size = 0.5) +
geom_sf(data = df_sf %>% filter(presence == 1),
aes(size = cases),
shape = 21, stroke = 0.3) +
geom_point(data = df_com,
aes(x = longitude, y = latitude),
color = "red", size = 4, shape = 13) +
labs(title = "Simulated Infections with Size on Max Temperature in Central Africa Rep.",
subtitle = "Synthetic Data with Center of Mass",
fill = "Temperature",
size = "Number") +
theme(legend.position = "right")
Interactive Maps
The infections can be visualized with mapview or leaflet, type of interactive maps, here is a code snippet of how to do it.
Map with Mapview:
This is the interactive map with the max temperatures for the simulated data in Central African Republic. The data represented as a raster points, with warmer colors indicating higher temperatures are shown as points on the map.