library(deSolve)
library(tidyverse)
# Define parameters
N <- 1e6 # Total population
beta <- 0.5 # Transmission rate
gamma <- 0.1 # Recovery rate
t_exp <- 5 # Latent period
# Initial conditions
init <- c(S = N - 1, E = 1, I = 0, R = 0)
# Define the SEIR model
seir_model <- function(t, y, parameters) {
with(as.list(y), {
dS <- -beta * S * I / N
dE <- beta * S * I / N - (1 / t_exp) * E
dI <- (1 / t_exp) * E - gamma * I
dR <- gamma * I
return(list(c(dS, dE, dI, dR)))
})
}
15 COVID-19 Outbreaks
In this chapter, we explore the dynamics of COVID-19 disease outbreaks in more details. We examine the results of various model’s applications that simulate the spread of the virus.
15.1 Epidemiology
COVID-19, short for “Coronavirus Disease 2019,” is an infectious disease caused by the novel coronavirus SARS-CoV-2, a type of virus that is part of the Coronaviridae1 family of viruses that can cause illness in animals and humans alike. The disease was first identified in December 2019 in Wuhan, Hubei province, China, and has since then spread globally, leading to a pandemic2.
The origin of the virus is still under investigation3, but it is believed to have zoonotic4 origins, with bats being the most likely reservoir. The virus has also been linked to pangolins5, mammals that are illegally trafficked for their scales and meat. The virus is thought to have jumped from animals to humans, possibly through a wet market in Wuhan where live animals were sold.
On January 30, 2020, The World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern and a worldwide investigation started to understand more about the virus and its effects on humans. This challenge involved the unprecedented identification of this type of virus, requiring thorough testing and confirmation of the incubation period, recovery rate, and infection rate through the collection of cases globally.
The virus spreads primarily through respiratory droplets. Symptoms can include loss of taste and smell, fatigue, muscle aches, and gastrointestinal issues, with fever, cough, and shortness of breath. Severe cases can lead to pneumonia, acute respiratory distress syndrome (ARDS), and death. The emergency was primarily due to the high number of deaths occurring without a known pharmacological intervention to contain it.
Various preventive measures were established, but the presence of asymptomatic and symptomatic cases created challenges in identifying the incubation time, leading governments to impose restrictions through lockdowns of entire cities and regions in affected countries. These measures restricted movement and activities to curb the virus’s transmission.
The pandemic has led to widespread healthcare challenges, economic disruptions, and significant changes to daily life, including social distancing measures, lockdowns, and travel restrictions.
Since its identification, COVID-19 has caused multiple outbreaks globally. The incubation period ranges from 2 to 14 days, during which an infected person may be asymptomatic but still contagious. On March 11, 2020, it was declared a pandemic. As of May 2024, there have been over 775 million confirmed cases and more than seven million deaths worldwide. The impact of the pandemic has varied across regions, influenced by factors such as virus variants, public health responses, and vaccination rates.
COVID-19 has had a profound impact on global health, economies, and daily life. Efforts to combat the pandemic have involved unprecedented levels of international cooperation, scientific research, and public health initiatives. Ongoing vaccination campaigns and adherence to public health guidelines are critical in controlling the spread of the virus and preventing further outbreaks. However, vaccine hesitancy and resistance, driven by concerns about the speed of development and potential long-term effects, remain significant challenges in achieving widespread vaccination coverage.
15.2 Mapping COVID-19 Outbreaks
Mapping COVID-19 outbreaks helps identify infection hotspots, track the virus’s spread, and inform public health interventions. Geographic Information Systems (GIS) and spatial analysis techniques can visualize and analyze COVID-19 data, such as the number of cases, deaths, and recoveries, at local, regional, and global levels.
15.3 Example: Modelling the Spread of COVID-19
Spatiotemporal modelling of COVID-19 outbreaks can provide valuable insights into the pandemic’s dynamics and help guide public health responses. In this example, we’ll use R to model the spread of COVID-19 using a Susceptible-Infected-Recovered (SIR) model, a common compartmental model used in epidemiology to simulate the spread of infectious diseases. A Bayesian machine learning approach is used to predict the spread of the virus over time and across different regions. Bayesian methods handle uncertainties and allow for the integration of prior information into the model.
First, we use the SIR model function to simulate the spread of the disease. The model compartments are: the number of susceptible (S), the number of infected (I), and R represents the number of recovered individuals. The parameters of the model are the transmission rate (\beta) and the recovery rate (\gamma), and (N) is the total population.
As seen previously, we’ll use the ode()
function from the deSolve package to solve the ordinary differential equations (ODEs) and simulate the spread of the virus over a period of 365 days:
# Solve the ODEs
times <- seq(0, 365, by = 1)
result <- ode(y = init, times = times, func = seir_model)
# Plot the results
ggplot(as.data.frame(result), aes(time, I, color = "Infected")) +
geom_line() +
geom_line(aes(time, R, color = "Recovered")) +
geom_line(aes(time, S, color = "Susceptible")) +
geom_line(aes(time, E, color = "Exposed")) +
scale_color_discrete(name = "Compartment") +
labs(x = "Time (days)", y = "Number of individuals")
In reality, the spread of a virus is much more complex and influenced by many factors such as human behavior, government policies, healthcare systems, vaccination campaigns, and human contact.
15.3.1 Bayesian Analysis
The use of Bayesian analysis in modelling infectious diseases, such as COVID-19, offers several advantages in understanding and predicting the dynamics of the spread. We implement a Bayesian regression model in R using the brms
package, which interfaces with Stan for Bayesian analysis.
The incorporation of prior knowledge allows for the integration of expert insights, which is particularly valuable for new viruses where historical data may be limited. Expert knowledge can improve model accuracy and reduce uncertainty in the estimates.
We can specify the prior distributions for beta and gamma using expert knowledge and available data. For example, we might specify a gamma distribution with a mean of 0.1 and a standard deviation of 0.05 for beta, and a gamma distribution with a mean of 0.05 and a standard deviation of 0.02 for gamma.
library(brms)
# Load the data
data <- as.data.frame(result)
# Convert simulate data to have the date column to Date type and add additional predictors
data <- data %>%
mutate(date = as.Date(time, origin = "2022-01-01"),
day_of_week = wday(date, label = TRUE),
week_of_year = week(date),
month = month(date),
lag_1_cases = lag(I, 1),
lag_7_cases = lag(I, 7)) %>%
drop_na()
# Define the Bayesian regression model
bayesian_model <- brm(
I ~ day_of_week + week_of_year + month + lag_1_cases + lag_7_cases,
data = data,
family = gaussian(),
prior = c(set_prior("normal(0, 10)", class = "b"),
set_prior("cauchy(0, 5)", class = "sigma")),
iter = 2000,
warmup = 1000,
chains = 4,
seed = 42
)
After fitting the model, we can plot the model diagnostics to assess its performance:
# Plot the model diagnostics
plot(bayesian_model)
To make predictions, we extract the posterior samples from the model and apply them to the original data:
# Make predictions
predictions <- predict(bayesian_model, newdata = data)
predictions <- as_tibble(predictions)
data <- data %>%
mutate(predicted_cases = predictions$Estimate)
# Plot actual vs. predicted values
ggplot(data, aes(x = date)) +
geom_point(aes(y = I,
color = "Actual")) +
geom_line(aes(y = predicted_cases,
color = "Predicted")) +
labs(title = "Actual vs. Predicted COVID-19 Cases",
x = "Date", y = "Cases", color = "Legend")
15.3.2 Ensemble Modelling - Combining Multiple Models
Ensemble modelling combines the predictions of multiple models to improve the overall accuracy and robustness of predictions. This technique is particularly useful in infectious disease modelling to account for uncertainties and variability in data, providing more reliable estimates of disease spread.
library(caret)
library(randomForest)
library(xgboost)
library(e1071)
# Scale the cases data
data <- data %>%
mutate(cases_scaled = scale(I))
# Split the data into training and testing sets
set.seed(42)
train_index <- createDataPartition(data$cases_scaled, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Define the predictors for the model
predictors <- c("day_of_week", "week_of_year",
"month", "lag_1_cases", "lag_7_cases")
In summary, integrating machine learning models and advanced statistical techniques such as Bayesian analysis and ensemble modelling can significantly enhance our ability to predict and understand the spread of infectious diseases like COVID-19. These models provide critical insights that can inform public health responses and help mitigate the impact of pandemics.
15.4 COVID-19 & DALYs
The influence of COVID-19 on global health can be measured using the DALYs. We use the data from the JHU official GitHub repository CSSEGISandData6 to calculate the DALYs for COVID-19. The data includes the number of confirmed cases, deaths, and recovered cases for each country and region. We will calculate the DALYs for the US, China, United Kingdom, and Canada.
cases1 <- cases %>%
pivot_longer(cols = starts_with("X"),
names_to = "date",
values_to = "cases") %>%
mutate(date = gsub("X", "", date),
date = as.Date(date, format = "%m.%d.%y")) %>%
janitor::clean_names()
cases1 %>% head()
#> # A tibble: 6 × 6
#> province_state country_region lat long date cases
#> <chr> <chr> <dbl> <dbl> <date> <int>
#> 1 "" Afghanistan 33.9 67.7 2020-01-22 0
#> 2 "" Afghanistan 33.9 67.7 2020-01-23 0
#> 3 "" Afghanistan 33.9 67.7 2020-01-24 0
#> 4 "" Afghanistan 33.9 67.7 2020-01-25 0
#> 5 "" Afghanistan 33.9 67.7 2020-01-26 0
#> 6 "" Afghanistan 33.9 67.7 2020-01-27 0
deaths1 <- deaths %>%
pivot_longer(cols = starts_with("X"),
names_to = "date",
values_to = "deaths") %>%
mutate(date = gsub("X", "", date),
date = as.Date(date, format = "%m.%d.%y")) %>%
janitor::clean_names()
deaths1 %>% head()
#> # A tibble: 6 × 6
#> province_state country_region lat long date deaths
#> <chr> <chr> <dbl> <dbl> <date> <int>
#> 1 "" Afghanistan 33.9 67.7 2020-01-22 0
#> 2 "" Afghanistan 33.9 67.7 2020-01-23 0
#> 3 "" Afghanistan 33.9 67.7 2020-01-24 0
#> 4 "" Afghanistan 33.9 67.7 2020-01-25 0
#> 5 "" Afghanistan 33.9 67.7 2020-01-26 0
#> 6 "" Afghanistan 33.9 67.7 2020-01-27 0
COVID19 <- cases1 %>%
left_join(deaths1,
by = c("province_state",
"country_region",
"lat", "long", "date")) %>%
mutate(cases = ifelse(is.na(cases), 0, cases),
deaths = ifelse(is.na(deaths), 0, deaths)) %>%
group_by(country_region, date) %>%
mutate(cases = sum(cases),
deaths = sum(deaths),
cfr = round(deaths / cases, 3)) %>%
filter(cases > 0) %>%
arrange(country_region, date)
COVID19 %>%
arrange(desc(date)) %>%
head()
#> # A tibble: 6 × 8
#> # Groups: country_region, date [6]
#> province_state country_region lat long date cases deaths cfr
#> <chr> <chr> <dbl> <dbl> <date> <int> <int> <dbl>
#> 1 "" Afghanistan 33.9 67.7 2023-03-09 209451 7896 0.038
#> 2 "" Albania 41.2 20.2 2023-03-09 334457 3598 0.011
#> 3 "" Algeria 28.0 1.66 2023-03-09 271496 6881 0.025
#> 4 "" Andorra 42.5 1.52 2023-03-09 47890 165 0.003
#> 5 "" Angola -11.2 17.9 2023-03-09 105288 1933 0.018
#> 6 "" Antarctica -71.9 23.3 2023-03-09 11 0 0
COVID19_4countries <- COVID19 %>%
filter(country_region %in%
c("US", "China", "United Kingdom", "Canada"))
COVID19_4countries %>% head()
#> # A tibble: 6 × 8
#> # Groups: country_region, date [1]
#> province_state country_region lat long date cases deaths cfr
#> <chr> <chr> <dbl> <dbl> <date> <int> <int> <dbl>
#> 1 Alberta Canada 53.9 -117. 2020-01-23 2 0 0
#> 2 British Columbia Canada 53.7 -128. 2020-01-23 2 0 0
#> 3 Diamond Princess Canada 0 0 2020-01-23 2 0 0
#> 4 Grand Princess Canada 0 0 2020-01-23 2 0 0
#> 5 Manitoba Canada 53.8 -98.8 2020-01-23 2 0 0
#> 6 New Brunswick Canada 46.6 -66.5 2020-01-23 2 0 0
Selected countries are the US, China, United Kingdom, and Canada.
COVID19 %>%
ggplot(aes(x = date, y = cases,
group = country_region)) +
geom_line(color = "grey",
linewidth = 0.2) +
geomtextpath::geom_textline(
data = COVID19_4countries,
aes(label = country_region),
show.legend = F) +
scale_y_log10() +
labs(title = "COVID-19 Cases by Country",
x = "Date",
y = "Cases")
COVID19 %>%
ggplot(aes(x = date, y = deaths,
group = country_region)) +
geom_line(color = "grey",
linewidth = 0.2) +
geomtextpath::geom_textline(
data = COVID19_4countries,
aes(label = country_region),
show.legend = F) +
scale_y_log10() +
labs(title = "COVID-19 Deaths by Country",
x = "Date",
y = "Deaths")
options(scipen = 999)
ggplot() +
geom_polygon(data = worldmap,
aes(x = long, y = lat, group = group),
fill = "grey",
color = "white") +
geom_point(data = centroids,
aes(x = long, y = lat,
size = tot_cases,
color = log(avg_cfr)),
alpha = 0.5) +
scale_size_continuous(
labels = scales::unit_format(unit = "M",
scale = 1e-6)) +
scale_color_viridis_c() +
coord_quickmap() +
labs(title = "COVID-19 Cases by Country",
caption = "Circle size: total cases, Color: average CFR",
size= "Total cases (M)",
color= "Average CFR (log scale)",
x = "", y = "") +
theme(axis.text = element_blank(),
plot.subtitle = element_text(size = 8))
The same graph made with selected countries only.
ggplot() +
geom_polygon(data = worldmap,
aes(x = long, y = lat, group = group),
fill = "grey",
color = "white") +
geom_point(data = centroids %>%
filter(country_region %in%
c("US", "China","United Kingdom", "Canada")),
aes(x = long, y = lat,
size = tot_cases,
color = log(avg_cfr)),
alpha = 0.5) +
scale_size_continuous(
labels = scales::unit_format(unit = "M",
scale = 1e-6)) +
scale_color_viridis_c() +
coord_quickmap() +
labs(title = "COVID-19 Cases by Country",
caption = "Circle size: total cases, Color: average CFR",
subtitle = "US, China, United Kingdom, Canada",
size= "Total cases (M)",
color= "Average CFR (log scale)",
x = "", y = "") +
theme(axis.text = element_blank(),
plot.subtitle = element_text(size = 8))
The difference in the number of cases and deaths between the US and China is striking. The US has the highest number of cases and deaths, while China has the lowest number of cases and deaths. The United Kingdom and Canada have a similar number of cases and deaths, but the United Kingdom has a higher case fatality rate (CFR) than Canada.
Let’s now calculate the DALYs for COVID-19, starting with the YLLs. We know that the simplified formula for YLLs is the number of deaths multiplied by the standard life expectancy at the age of death. The standard life expectancy for these countries varies, but we can use the global average of 72.6 years.
Also, we group the date by 4 months cycles to calculate the YLLs.
COVID19_yll_4months <- COVID19_4countries %>%
mutate(month = as.numeric(format(date, "%m")),
year = as.numeric(format(date, "%Y")),
month_cycle = case_when(month %in% 1:4 ~ "Jan-Apr",
month %in% 5:8 ~ "May-Aug",
month %in% 9:12 ~ "Sep-Dec")) %>%
group_by(country_region, year, month_cycle) %>%
summarise(cases = sum(cases),
deaths = sum(deaths),
cfr = round(deaths / cases, 3)) %>%
mutate(ylls = deaths * 72.6) %>%
arrange(year, month_cycle, ylls)
COVID19_yll_4months %>%
select(year, month_cycle, ylls) %>%
head()
#> # A tibble: 6 × 4
#> # Groups: country_region, year [4]
#> country_region year month_cycle ylls
#> <chr> <dbl> <chr> <dbl>
#> 1 Canada 2020 Jan-Apr 62225750.
#> 2 US 2020 Jan-Apr 80674499.
#> 3 China 2020 Jan-Apr 655246654.
#> 4 United Kingdom 2020 Jan-Apr 800881092
#> 5 US 2020 May-Aug 1158612365.
#> 6 Canada 2020 May-Aug 1165061568
And then plot the YLLs by country and month cycle, with the y axis representing the YLLs formatted in millions.
COVID19_yll_4months %>%
ggplot() +
geom_col(aes(x = month_cycle, y = ylls,
fill = country_region)) +
scale_y_continuous(
labels = scales::unit_format(unit = "M",
scale = 1e-6)) +
labs(title = "YLLs by Country and Month Cycle",
fill = "Country",
x = "4-Month cycle",
y = "YLLs") +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The majority of YLLs are in the United Kingdom, followed by the US, Canada, and China. The YLLs are highest in the first 4-month cycle (Jan-Apr) and decrease in the following cycles.
Let’s now calculate the YLDs for COVID-19. The YLDs are calculated by multiplying the number of cases by the disability weight. The Disability Weight applied is that for low respiratory diseases which is 0.133. As of 2020, the Global Burden of Disease Study 2019 estimated that the DALYs for lower respiratory diseases was 1,000,000. We can use the hmsidwR package to get the disability weight for lower respiratory diseases.
hmsidwR::disweights %>%
filter(str_detect(sequela, "lower respiratory")) %>%
select(sequela, dw)
#> # A tibble: 4 × 2
#> sequela dw
#> <chr> <dbl>
#> 1 Moderate lower respiratory infections 0.051
#> 2 Severe lower respiratory infections 0.133
#> 3 Moderate lower respiratory infections 0.0506
#> 4 Severe lower respiratory infections 0.133
We know that the simplified formula for YLDs is the prevalence of the disease multiplied by the disability weight. To calculate the prevalence, we need the population of each country. For this task we use the wpp2022 package, which provides the population data from the 2022 revision of the United Nations World Population Prospects (WPP 2022). It includes both historical estimates and projections of demographic indicators such as population counts and fertility rates, accommodating the challenges brought on by events like the COVID-19 pandemic and transitioning from five-year to one-year data intervals.
# library(devtools)
# options(timeout = 600)
# install_github("PPgp/wpp2022")
data(pop1dt)
pop_4countries <- pop1dt %>%
filter(
year %in% c(2020, 2021),
name %in% c(
"United States of America",
"China",
"United Kingdom",
"Canada")) %>%
mutate(name = ifelse(name == "United States of America", "US", name)) %>%
select(name, year, pop) %>%
arrange(year, name)
pop_4countries %>% head()
#> name year pop
#> <char> <int> <num>
#> 1: Canada 2020 38019.18
#> 2: China 2020 1425861.54
#> 3: US 2020 336495.77
#> 4: United Kingdom 2020 67167.77
#> 5: Canada 2021 38290.85
#> 6: China 2021 1425925.39
Calculate the prevalence of COVID-19 cases per 1,000,000 people.
prevalence = \frac{cases}{population} * 1,000,000
COVID19_dalys <- COVID19_yll_4months %>%
full_join(pop_4countries,
by = c("country_region" = "name", "year")) %>%
mutate(prevalence = cases / pop * 1000000,
dw = 0.133,
ylds = prevalence * dw,
dalys = ylls + ylds) %>%
arrange(year, month_cycle, prevalence) %>%
select(country_region, year,
month_cycle, ylls, ylds, dalys)
COVID19_dalys %>%
head()
#> # A tibble: 6 × 6
#> # Groups: country_region, year [4]
#> country_region year month_cycle ylls ylds dalys
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 US 2020 Jan-Apr 80674499. 8342075. 89016575.
#> 2 China 2020 Jan-Apr 655246654. 21433596. 676680249.
#> 3 Canada 2020 Jan-Apr 62225750. 60966727. 123192477.
#> 4 United Kingdom 2020 Jan-Apr 800881092 107263600. 908144692.
#> 5 China 2020 May-Aug 1434752563. 34879873. 1469632436.
#> 6 US 2020 May-Aug 1158612365. 153456778. 1312069143.
If we calculate the YLDs with the incidence instead of prevalence, we get the same results, because the incidence is the number of new cases in a given period, while the prevalence is the number of existing cases in a given period. So, we should consider the cumulative number of cases to calculate the prevalence. In order to calculate the cumulative number of cases, we need to group the data by year and month cycle.
And then plot the YLDs by country and month cycle, with the y axis representing the YLDs formatted in millions.
COVID19_dalys %>%
filter(year %in% c(2020, 2021)) %>%
ggplot() +
geom_col(aes(x = month_cycle, y = ylds,
fill = country_region)) +
scale_y_continuous(
labels = scales::unit_format(unit = "M",
scale = 1e-6)) +
labs(title = "YLDs by Country and Month Cycle",
fill = "Country",
x = "4-Month Cycle", y = "YLDs") +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can see that the YLDs are highest in the United Kingdom, followed by the US, Canada, and China. The YLDs are highest in the first 4-month cycle (Jan-Apr) and decrease in the following cycles.
Finally, let’s plot the DALYs by country and month cycle, with the y axis representing the DALYs formatted in millions.
COVID19_dalys %>%
filter(year %in% c(2020, 2021)) %>%
ggplot() +
geom_col(aes(x = month_cycle, y = dalys,
fill = country_region)) +
scale_y_continuous(
labels = scales::unit_format(unit = "M",
scale = 1e-6)) +
labs(title = "DALYs by Country and Month Cycle",
fill = "Country",
x = "4-Month Cycle", y = "DALYs") +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To explain the magnitude of difference between ylls and ylds,
prevalence as the proportion of cases at baseline among the total number of enrolled participants who completed baseline visits. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8992269/)
cumulative incidence as the number of incident cases divided by the total follow-up time per 100 person-years and assumed a uniform incidence distribution across the 6-month follow-up time. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8992269/)
“Coronaviridae - Wikipedia,” n.d., https://en.wikipedia.org/wiki/Coronaviridae.↩︎
“Pandemic - Wikipedia,” n.d., https://en.wikipedia.org/wiki/Pandemic.↩︎
“WHO-Convened Global Study of Origins of SARS-CoV-2: China Part,” n.d., https://www.who.int/publications/i/item/who-convened-global-study-of-origins-of-sars-cov-2-china-part.↩︎
“Zoonosis - Wikipedia,” n.d., https://en.wikipedia.org/wiki/Zoonosis.↩︎
Xing-Yao Huang et al., “A Pangolin-Origin SARS-CoV-2-Related Coronavirus: Infectivity, Pathogenicity, and Cross-Protection by Preexisting Immunity,” Cell Discovery 9, no. 1 (June 17, 2023): 1–13, doi:10.1038/s41421-023-00557-9.↩︎
cases <- read.csv(“https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv”)) deaths <- read.csv(”https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv”))↩︎