13 Advanced Data Visualization Techniques
Learning Objectives
- Learn advanced data visualization techniques
- Create contour plots to visualize interaction effects
- Create a pyramid plot to visualize population distribution
13.1 Example: Detecting Interaction Effects with Contour Plots
Interaction effects can cause predictors to act together on the response variable, leading to additional variation in the response. Interaction effects occur when the combined effect of two or more predictors is different from what would be expected if the impact of each predictor were considered alone. For instance, in the case of cardio vascular disease (CVD) risk prediction, the interaction between age and cholesterol levels can have a significant impact on the risk of developing heart disease. In addition, if smoke is added to the model, the effect of cholesterol levels on CVD risk may change depending on whether the individual is a smoker or non-smoker. So, smoke is said to interact with cholesterol levels in predicting CVD risk, as well as age, but in what way?
y=\beta_0 +\beta_1 x_1 + \beta_2 x_2 +\beta_3 x_3 + \epsilon \tag{13.1}
where y is the risk of developing heart disease, \beta_0 is the intercept, \beta_1 and \beta_2 are the coefficients for the predictors x_1 and x_2, respectively, and \beta_3 is the coefficient for the interaction term x_1*x_2. The error term \epsilon accounts for the variability in the response variable that is not explained by the predictors.
There are four type of interactions:
- additive is when \beta_3 \approx{0}
- antagonistic is when \beta_3 < 0
- synergistic is when \beta_3 > 0
- atypical is when \beta_3 \neq 0
In order to simulate the interaction effects for each type of interaction, we cannot just simulate random values for y, we need to simulate different levels of interactions between predictors setting up the values for the coefficients \beta_0, \beta_1, \beta_2, and \beta_3, and the predictors x_1 and x_2 and the error term \epsilon.
Load the necessary libraries and set the seed for reproducibility.
Simulate the data ready for different interaction effects.
Then, simulate the predictors x_1 and x_2 and the error term \epsilon for 200 observations.
Then for each of the three cases we simulate a \beta_3 based on antagonism, no interaction, or synergism with values -10, 0, and 10 respectively.
Case 1: Antagonism
beta3 <- rep(-10,200)
Assemble the values for the coefficients and predictors into the equation, to obtaining the response variable y. This will be our synthetic “observed” data.
y1 = beta0 + beta1*x1 + beta2*x2 + beta3*(x1*x2) + e
Apply a linear model to the observed data to predict the response variable y based on interaction between x_1 and x_2.
Then create a grid of values for x_1 and x_2 to predict the response variable y based on the interaction between x_1 and x_2.
grid <- with(observed1,
interp::interp(x = x1, y = x2, z))
griddf <- subset(data.frame(x = rep(grid$x, nrow(grid$z)),
y = rep(grid$y,
each = ncol(grid$z)),
z = as.numeric(grid$z)),!is.na(z))
p1 <- ggplot(griddf, aes(x, y, z = z)) +
geom_contour(aes(colour = after_stat(level),
linetype = factor(after_stat(level))),
linewidth = 2) +
scale_color_viridis_c()+
guides(linetype = "none")+
labs(title="Antagonism",
color="Prediction", x = "x1", y = "x2")+
theme(legend.position = "top")
Case 2: Additive (no interaction)
beta3 <- rep(0,200)
y2 = beta0 + beta1*x1 + beta2*x2 + beta3*(x1*x2) + e
observed2 <- tibble(y2, x1, x2)
mod2 <- lm(y2 ~ x1*x2, data = observed2)
observed2$z <- predict(mod2, observed2)
grid <- with(observed2, interp::interp(x = x1, y = x2, z))
griddf <- subset(data.frame(x = rep(grid$x, nrow(grid$z)),
y = rep(grid$y,
each = ncol(grid$z)),
z = as.numeric(grid$z)),!is.na(z))
p2 <- ggplot(griddf, aes(x, y, z = z)) +
geom_contour(aes(colour = after_stat(level),
linetype = factor(after_stat(level))),
linewidth = 2) +
scale_color_viridis_c()+
guides(linetype = "none")+
labs(title="Antagonism",
color="Prediction", x = "x1", y = "x2")+
theme(legend.position = "top")
Case 3: Synergism
beta3<- rep(10,200)
y3 = beta0 + beta1*x1 + beta2*x2 + beta3*(x1*x2) + e
observed3 <- tibble(y3, x1, x2)
mod3 <- lm(y3 ~ x1 * x2 , data = observed3)
observed3$z <- predict(mod3, observed3)
grid <- with(observed3, interp::interp(x=x1,y=x2,z))
griddf <- subset(data.frame(x = rep(grid$x, nrow(grid$z)),
y = rep(grid$y, each = ncol(grid$z)),
z = as.numeric(grid$z)),!is.na(z))
p3 <- ggplot(griddf, aes(x, y, z = z)) +
geom_contour(aes(colour = after_stat(level),
linetype = factor(after_stat(level))),
linewidth = 2) +
scale_color_viridis_c()+
guides(linetype = "none")+
labs(title="Antagonism",
color="Prediction", x = "x1", y = "x2")+
theme(legend.position = "top")
13.2 Example: Pyramid Plot
A pyramid plot is a type of bar chart that displays the distribution of a population by age. We use the data from the wpp2022 package, data(package = "wpp2022")
which is a collection of population datasets, with population estimates. The data is from the United Nations World Population Prospects 2022.
The data contains the following columns:
# load the data on the population
data(popAge1dt)
popAge1dt %>% head()
#> country_code name year age popM popF pop
#> <int> <char> <int> <int> <num> <num> <num>
#> 1: 900 World 1949 0 41312.32 39439.29 80751.61
#> 2: 900 World 1949 1 35761.05 34274.35 70035.40
#> 3: 900 World 1949 2 33514.72 32065.08 65579.81
#> 4: 900 World 1949 3 31076.46 29780.73 60857.19
#> 5: 900 World 1949 4 28786.66 27647.06 56433.72
#> 6: 900 World 1949 5 28082.09 26882.94 54965.03
Check all Countries in the dataset with View()
:
The data is then aggregated to show the total population for all ages by year. Here, in particular, we show the use of geomtextpath::geom_textline()
to add a label to the World line plot. geomtextpath is a nice R-package which allows to add text along a path.
all_ages <- popAge1dt %>%
group_by(year, name) %>%
reframe(tot_pop=sum(pop))
all_ages %>%
filter(!name == "World")%>%
ggplot(aes(x = year, y = tot_pop, group = name)) +
geom_line(color= "grey", linewidth =0.2) +
geomtextpath::geom_textline(data = all_ages %>% filter(name == "World"),
aes(label=name),
color = "red", linewidth = 1) +
labs(title = "United Nations World Population Prospects 2022 - [1949 - 2021]",
x = "Year", y = "Population in thousands",
caption = "Data Source: UN World Pop | Graphic: @fgazzelloni")
Data is then transformed to a long format to create a pyramid plot. We select the columns name
, year
, age
, popF
, and popM
and pivot the data to a long format using the pivot_longer()
function. We then create a new column value
that contains positive values for male population and negative values for females.
data <- popAge1dt %>%
select(name, year, age, popF, popM) %>%
pivot_longer(cols = c(popM, popF),
names_to = "sex",
values_to = "population") %>%
mutate(value = ifelse(sex == "popF",
as.integer(population * -1),
as.integer(population)))
data %>% head()
#> # A tibble: 6 × 6
#> name year age sex population value
#> <chr> <int> <int> <chr> <dbl> <int>
#> 1 World 1949 0 popM 41312. 41312
#> 2 World 1949 0 popF 39439. -39439
#> 3 World 1949 1 popM 35761. 35761
#> 4 World 1949 1 popF 34274. -34274
#> 5 World 1949 2 popM 33515. 33514
#> 6 World 1949 2 popF 32065. -32065
Then, we can create pyramid plots that shows the distribution of the population by age for High-income countries, Lower-middle-income countries, and Low-income countries, and put them in one chunk with the option layout-ncol: 3
.
data %>%
filter(name == "High-income countries") %>%
ggplot(aes(x = age, y = value, fill = sex)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#CC6666", "#9999CC")) +
coord_flip() +
labs(title = "United Nations World Population Prospects 2022",
subtitle = "High-income countries",
x= "Age", y = "Population in thousands", fill = "",
caption = "Data Source: UN World Pop | Graphic: @fgazzelloni") +
theme_minimal() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text = element_text(size = 10))
data %>%
filter(name == "Lower-middle-income countries") %>%
ggplot(aes(x = age, y = value, fill = sex)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#CC6666", "#9999CC")) +
coord_flip() +
labs(title = "United Nations World Population Prospects 2022",
subtitle = "Lower-middle-income countries",
x= "Age", y = "Population in thousands", fill = "",
caption = "Data Source: UN World Pop | Graphic: @fgazzelloni") +
theme_minimal() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text = element_text(size = 10))
data %>%
filter(name == "Low-income countries") %>%
ggplot(aes(x = age, y = value, fill = sex)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#CC6666", "#9999CC")) +
coord_flip() +
labs(title = "United Nations World Population Prospects 2022",
subtitle = "Low-income countries",
x= "Age", y = "Population in thousands", fill = "",
caption = "Data Source: UN World Pop | Graphic: @fgazzelloni") +
theme_minimal() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text = element_text(size = 10))
pyramid <-
data %>%
filter(name == "World") %>%
ggplot(aes(x = age, y = value, fill = sex)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#CC6666", "#9999CC")) +
coord_flip() +
labs(title = "United Nations World Population Prospects 2022 - {closest_state}",
x= "Age", y = "Population in thousands", fill = "",
caption = "Data Source: UN World Pop | Graphic: @fgazzelloni") +
theme_minimal() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5),
axis.text = element_text(size = 10))
library(gganimate)
pyramid_gif <- pyramid +
transition_states(year,
transition_length = 1,
state_length = 2) +
enter_fade() +
exit_fade() +
ease_aes("cubic-in-out")
animate(pyramid_gif,
fps = 72, duration = 6,
width = 1200, height = 1400, res = 180,
renderer = gifski_renderer("images/12_unp_pyramid.gif"))