Corona in Belgium

I lost a few hours this afternoon digging into the Corona virus data, mainly triggered by reading this article, which gives a nice view on the potential issues which can arise when collecting data, on hidden factors to be aware of, and which also covers Belgium.

  • As a Belgian, I was interested to see how Corona might impact our lives in the coming weeks and, out of curiosity, how we are doing compared to other countries regarding containment of the spread of the Corona virus - especially since we still do not have a government in Belgium, one year after the elections. 
  • In what follows, I'll show some graphs using data available at https://github.com/CSSEGISandData/COVID-19 (it provides up-to-date statistics on Corona cases). If you want to reproduce this, clone the repository and execute the R code shown below.

Data

Let's first check whether the data matches what is shown on our national television.

library(data.table)
library(lattice)
## read all daily report csv files and stack them into one data.table
x <- list.files("csse_covid_19_data/csse_covid_19_daily_reports/", pattern = ".csv", full.names = TRUE)
x <- data.frame(file = x, date = substr(basename(x), 1, 10), stringsAsFactors = FALSE)
x <- split(x$file, x$date)
x <- lapply(x, fread)
x <- rbindlist(x, fill = TRUE, idcol = "date")
x$date <- as.Date(x$date, format = "%m-%d-%Y")
x <- setnames(x, 
              old = c("date", "Country/Region", "Province/State", "Confirmed", "Deaths", "Recovered"),
              new = c("date", "region", "subregion", "confirmed", "death", "recovered"))
## keep Hubei plus a set of European countries and Singapore
x <- subset(x, subregion %in% "Hubei" |
               region %in% c("Belgium", "France", "Netherlands", "Spain", "Singapore", "Germany", "Switzerland", "Italy"))

x$area <- ifelse(x$subregion %in% "Hubei", x$subregion, x$region)
x <- x[!duplicated(x, by = c("date", "area")), ]
x <- x[, c("date", "area", "confirmed", "death", "recovered")]
subset(x, area %in% "Belgium" & confirmed > 1)

Yes, the data from https://github.com/CSSEGISandData/COVID-19 looks correct indeed: the same numbers as reported on Belgian television. 

date       area    confirmed death recovered
2020-03-01 Belgium 2 0 1
2020-03-02 Belgium 8 0 1
2020-03-03 Belgium 13 0 1
2020-03-04 Belgium 23 0 1
2020-03-05 Belgium 50 0 1
2020-03-06 Belgium 109 0 1
2020-03-07 Belgium 169 0 1
2020-03-08 Belgium 200 0 1
2020-03-09 Belgium 239 0 1
2020-03-10 Belgium 267 0 1
2020-03-11 Belgium 314 3 1

Exponential number of cases of Corona

  • Now, is the outbreak really exponential? Let's make some graphs.

What is clear when looking at the plots is that infections indeed grow at an exponential rate, except in Singapore, where the government managed to completely isolate the Corona cases. In Belgium and other European countries the governments missed the opportunity to isolate the cases, and we are now in a phase of trying to slow down the outbreak and spread its impact over time.

[Figure corona1: confirmed cases of Corona per area]

You can reproduce the plot as follows:

trellis.par.set(strip.background = list(col = "lightgrey"))
xyplot(confirmed ~ date | area, data = x, type = "b", pch = 20, 
       scales = list(y = list(relation = "free", rot = 0), x = list(rot = 45, format = "%A %d/%m")), 
       layout = c(5, 2), main = sprintf("Confirmed cases of Corona\n(last date in this graph is %s)", max(x$date)))

Compare to other countries - onset

It is clear that the onset of Corona differs per country. Let's define the onset (day 0) as the first day on which more than 75 persons were confirmed to have Corona in the country. That allows us to compare the different countries. In Belgium we started to have more than 75 patients with Corona on Friday 2020-03-06. In the Netherlands that was one day earlier. 

date       area    confirmed
2020-01-22 Hubei 444
2020-02-17 Singapore 77
2020-02-23 Italy 155
2020-02-29 Germany 79
2020-02-29 France 100
2020-03-01 Spain 84
2020-03-04 Switzerland 90
2020-03-05 Netherlands 82
2020-03-06 Belgium 109

Reproduce as follows:

## sort with the most recent date first and compute per area the number of days
## since onset (confirmed > 75) and the number of newly confirmed cases per day
x <- x[order(x$date, x$area, decreasing = TRUE), ]
x <- x[, days_since_case_onset := as.integer(date - min(date[confirmed > 75])), by = list(area)]
x <- x[, newly_confirmed := as.integer(confirmed - shift(confirmed, n = 1, type = "lead")), by = list(area)]
onset <- subset(x, days_since_case_onset == 0, select = c("date", "area", "confirmed"))
onset[order(onset$date), ]

Compare to other countries - what can we expect?

  • Now, are we doing better than other countries in the EU?

The following plot shows the log of the number of people diagnosed with Corona since the onset date shown above. It looks like Belgium has learned a bit from the issues in Italy, but it still hasn't managed to deal with the virus outbreak the way e.g. Singapore has done (a country which learned from the SARS outbreak).

Based on the blue line, we can expect Belgium to have roughly 1100 confirmed cases next week (log(1100) ≈ 7) or, if we follow the trend of France, roughly 3000 (log(3000) ≈ 8) patients with Corona. We hope it is only the former.
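
For the curious, here is a quick sketch of how you could make that extrapolation explicit: fit a simple log-linear trend on the Belgian counts since onset and project it one week ahead. Mind that this is plain curve fitting for illustration only, not an epidemiological model, and the 7-day horizon is just an assumption.

## illustrative only: log(confirmed) grows roughly linearly in time
be  <- subset(x, area %in% "Belgium" & days_since_case_onset >= 0)
fit <- lm(log(confirmed) ~ days_since_case_onset, data = be)
## project the fitted trend 7 days past the last observed day
exp(predict(fit, newdata = data.frame(days_since_case_onset = max(be$days_since_case_onset) + 7)))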

[Figure corona2: log(confirmed cases) of Corona since onset, per country]

Reproduce as follows:

xyplot(log(confirmed) ~ days_since_case_onset | "Log(confirmed cases) of Corona since onset of sick person nr 75", 
       groups = area,
       data = subset(x, days_since_case_onset >= 0 &
                        area %in% c("Hubei", "France", "Belgium", "Singapore", "Netherlands", "Italy")),
       xlab = "Days since Corona onset (confirmed case 75)", ylab = "Log of number of confirmed cases",
       auto.key = list(space = "right", lines = TRUE),
       type = "b", pch = 20, lwd = 2) 

Compared to the Netherlands

  • Now, are we doing better than The Netherlands?

Currently it looks like we are, but time will tell. Given the trend shown above, I can only hope everyone in Belgium follows the government guidelines as strictly as possible.

[Figure corona3: newly confirmed Corona cases in Belgium and the Netherlands]


Reproduce as follows:

xyplot(newly_confirmed ~ date | "Newly confirmed cases of Corona", groups = area,
       data = subset(x, area %in% c("Belgium", "Netherlands") & date > as.Date("2020-03-01")),
       xlab = "Date", ylab = "Number of new Corona cases",
       scales = list(x = list(rot = 45, format = "%A %d/%m", at = seq(as.Date("2020-03-01"), Sys.Date(), by = "day"))),
       auto.key = list(space = "right", lines = TRUE),
       type = "b", pch = 20, lwd = 2)

Last call for the course on Advanced R programming

Next week we will hold our yearly course on Advanced R programming at LStat, Leuven. If you are interested in learning any of the following techniques, don't hesitate to subscribe at https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

  • Functions, the apply family of functions, parallelisation, advanced data manipulation with R
  • S3 programming
  • Building reports with markdown / Sweave
  • Build an R package

Interested in other trainings? Visit http://bnosac.be/index.php/training


See you next week!

Neural Network Machine Learning for NLP

Last week, we updated the package ruimtehol on CRAN. The package provides an easy interface for R users to Starspace, a general-purpose neural embedding model for text data.

Notable changes are that the package now also builds fine on Mac OS and runs fine on all CRAN platforms. If you are interested in seeing what the package can do, have a look at the presentation below or visit the package vignette at https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf

[Presentation: screenshot of the ruimtehol pdf slides]
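
To give a small taste of the API, below is a minimal sketch of training a tagspace model for text classification; the toy texts, labels and parameter settings are made up purely for illustration - see the vignette for fully worked examples.

library(ruimtehol)
## hypothetical toy data: a handful of labelled sentences
texts  <- c("i love pizza and pasta", "the stock market crashed today",
            "the match ended in a draw", "inflation and interest rates rose")
labels <- c("food", "finance", "sports", "finance")
## embed the texts and their labels in the same vector space (settings illustrative)
model <- embed_tagspace(x = texts, y = labels, dim = 10, epoch = 20, minCount = 1)
## retrieve the most likely labels for a new text
predict(model, "prices rose sharply", k = 2)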

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

Upcoming training schedule 

Interested in NLP? Then you might also be interested in the following courses provided in Belgium. Hope to see you there!

  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-06&07: Text Mining with Python: Subscribe here

Upcoming AI-related courses

I forgot to do some marketing for the following upcoming AI-related courses, which will be given in Leuven, Belgium by BNOSAC:

  • 2019-10-17&18: Statistical Machine Learning with R: Subscribe here
  • 2019-11-14&15: Text Mining with R: Subscribe here
  • 2019-12-17&18: Applied Spatial Modelling with R: Subscribe here
  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-06&07: Text Mining with Python: Subscribe here

Hope to see you there.

Dependency parsing with udpipe

We have been blogging about udpipe several times now in a series of earlier posts.

Dependency parsing

A point which we haven't touched upon much yet is dependency parsing. Dependency parsing is an NLP technique which links each word in a sentence to another word in the sentence, called its syntactical head. Each such link between 2 words furthermore has a certain type of relationship, giving you further details about it.

The R package udpipe provides such a dependency parser. With the output of dependency parsing, you can answer questions like the following (a small code sketch after the list shows the idea):

  1. What is the nominal subject of a text
  2. What is the object of a verb
  3. Which word modifies a noun
  4. Which words are linked to negative words
  5. Which words are compound statements
  6. What are the noun phrases and verb phrases in the text
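
As a sketch of how such questions translate into code: annotate a sentence and filter the output on the dep_rel column. The example sentence below is my own, and the relation names ('obj' for the object of a verb, 'amod' for an adjective modifying a noun) follow the universal dependencies definitions linked further below.

library(udpipe)
anno <- udpipe("The dog chased the red ball", "english")
## question 2: the object of a verb carries dependency relation 'obj'
subset(anno, dep_rel %in% "obj", select = c("token", "dep_rel", "head_token_id"))
## question 3: an adjective which modifies a noun carries relation 'amod'
subset(anno, dep_rel %in% "amod", select = c("token", "dep_rel", "head_token_id"))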

Examples

In the following sentence:

His speech about marshmallows in New York is utter bullshit

you can see this dependency parsing in action in the graph below. You can see a compound statement like 'New York', that the word speech is linked to the word bullshit with the relationship nominal subject, that the 2 nominals marshmallows and speech are linked as nominal noun modifiers, and that the word utter is an adjective which modifies the noun bullshit.

[Figure: dependency parsing graph of the example sentence]

Obtaining such relationships in R is pretty simple nowadays. Running this code will provide you with the dependency relationships among the words of the sentence in the columns token_id, head_token_id and dep_rel. The possible values in the field dep_rel are defined at https://universaldependencies.org/u/dep/index.html.

library(udpipe)
x <- udpipe("His speech about marshmallows in New York is utter bullshit", "english")

[Output: annotated data.frame with one row per token, including the columns token_id, head_token_id and dep_rel]

R is excellent at visualisation. For visualising the relationships between the words which were found, you can just use the ggraph R package. Below we create a basic function which selects the right columns from the annotation and puts them into a graph.

library(igraph)
library(ggraph)
library(ggplot2)
plot_annotation <- function(x, size = 3){
  stopifnot(is.data.frame(x) & all(c("sentence_id", "token_id", "head_token_id", "dep_rel",
                                     "token", "lemma", "upos", "xpos", "feats") %in% colnames(x)))
  x <- x[!is.na(x$head_token_id), ]
  x <- x[x$sentence_id %in% min(x$sentence_id), ]
  edges <- x[x$head_token_id != 0, c("token_id", "head_token_id", "dep_rel")]
  edges$label <- edges$dep_rel
  g <- graph_from_data_frame(edges,
                             vertices = x[, c("token_id", "token", "lemma", "upos", "xpos", "feats")],
                             directed = TRUE)
  ggraph(g, layout = "linear") +
    geom_edge_arc(ggplot2::aes(label = dep_rel, vjust = -0.20),
                  arrow = grid::arrow(length = unit(4, 'mm'), ends = "last", type = "closed"),
                  end_cap = ggraph::label_rect("wordswordswords"),
                  label_colour = "red", check_overlap = TRUE, label_size = size) +
    geom_node_label(ggplot2::aes(label = token), col = "darkgreen", size = size, fontface = "bold") +
    geom_node_text(ggplot2::aes(label = upos), nudge_y = -0.35, size = size) +
    theme_graph(base_family = "Arial Narrow") +
    labs(title = "udpipe output", subtitle = "tokenisation, parts of speech tagging & dependency relations")
}

We can now call the function as follows to get the plot shown above:

plot_annotation(x, size = 4)

Let us see what it gives with the following sentence. 

The economy is weak but the outlook is bright

x <- udpipe("The economy is weak but the outlook is bright", "english")
plot_annotation(x, size = 4)

[Figure: dependency parsing graph of the second example sentence]

You can see that with dependency parsing you can now answer the question 'What is weak?': it is the economy. And 'What is bright?': it is the outlook, as these nouns relate to the adjectives through the nominal subject type of relationship. That is a lot richer information than just looking at wordclouds.
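
If you prefer to extract those answers programmatically instead of reading them off the graph, a minimal sketch (my own addition) is to join each token to its head via head_token_id and keep the nominal subjects:

## link every token to its syntactic head and keep the nominal subjects
deps <- merge(x, x,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              suffixes = c("", "_head"))
subset(deps, dep_rel %in% "nsubj", select = c("token", "token_head"))
## this should list economy -> weak and outlook -> bright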

Hope this has shown beginning users of natural language processing that there is a myriad of NLP options beyond mere frequency-based word counting. Enjoy!