Update of udpipe

I'm happy to announce that the R package udpipe was updated recently on CRAN. CRAN now hosts version 0.8.3 of udpipe. The main features incorporated in the update include

  • parallel NLP annotation across your CPU cores
  • default models are now the ones trained on Universal Dependencies 2.4, which allows annotation in 64 languages, based on 94 treebanks from Universal Dependencies. We now have models built on afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb
  • some fixes as indicated in the NEWS file

What does parallel NLP annotation look like right now? Let's do some annotation in French.

library(udpipe)
data("brussels_reviews", package = "udpipe")
## Keep the French reviews and shape them as doc_id/text
x <- subset(brussels_reviews, language %in% "fr")
x <- data.frame(doc_id = x$id, text = x$feedback, stringsAsFactors = FALSE)
## Annotate on 1 core, printing progress every 100 documents
anno <- udpipe(x, "french-gsd", parallel.cores = 1, trace = 100)
## Annotate on 4 cores: up to roughly 4 times as fast if you have 4 CPU cores
anno <- udpipe(x, "french-gsd", parallel.cores = 4)
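As a rough illustration of why extra cores help: annotating one document does not depend on the others, so the documents can be divided into groups that are processed independently and stitched together afterwards. The sketch below uses only base R, with made-up texts and a hypothetical stand-in function `toy_annotate` instead of a real annotator. It is not a description of how udpipe implements `parallel.cores` internally.

```r
# Toy corpus standing in for the reviews data frame (made-up texts)
docs <- data.frame(doc_id = paste0("doc", 1:6),
                   text   = c("un", "deux mots", "trois petits mots",
                              "a b c d", "e f g h i", "j k l m n o"),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for an annotator: counts space-separated tokens
toy_annotate <- function(d) {
  data.frame(doc_id   = d$doc_id,
             n_tokens = lengths(strsplit(d$text, " ", fixed = TRUE)),
             stringsAsFactors = FALSE)
}

# Divide the documents into one group per worker
n_workers <- 2
groups    <- split(docs, rep_len(seq_len(n_workers), nrow(docs)))

# Each group is independent of the others. lapply runs them one by one;
# a parallel backend (e.g. parallel::parLapply) could run them concurrently.
parts  <- lapply(groups, toy_annotate)

# Combine the pieces and restore the original document order
result <- do.call(rbind, parts)
result <- result[match(docs$doc_id, result$doc_id), ]
rownames(result) <- NULL
```

Because the groups never need to talk to each other, the speed-up is bounded mainly by the number of cores and by the overhead of starting the workers.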


Note that udpipe works particularly well in combination with the following R packages

And nothing stops you from using R packages tm / tidytext / quanteda or text2vec alongside it!

Upcoming training schedule

If you want to know more, come attend the course on text mining with R or text mining with Python. Here is a list of scheduled upcoming public courses which BNOSAC is providing each year at the KULeuven in Belgium.

  • 2019-10-17&18: Statistical Machine Learning with R: Subscribe here
  • 2019-11-14&15: Text Mining with R: Subscribe here
  • 2019-12-17&18: Applied Spatial Modelling with R: Subscribe here
  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-05&06: Text Mining with Python: Subscribe here

Transfer learning and semi-supervised learning with ruimtehol

Last week the R package ruimtehol was updated on CRAN. The update gives R users who do Natural Language Processing the possibility to

  • do semi-supervised learning (learning where you have both text and labels, but not always both of them on the same document identifier).
  • do transfer learning by passing on an embedding matrix (e.g. obtained via fasttext, GloVe or the like) and either keep on training based on that matrix or just use the embeddings in your Natural Language Processing flow.
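To make the transfer learning point a bit more concrete: a pretrained embedding matrix is just a numeric matrix with one row per term and one column per dimension. The base R sketch below builds a tiny made-up matrix and keeps only the rows for terms that occur in your own texts. The values, the terms and the variable names are illustrative assumptions; check the ruimtehol documentation for the actual argument that accepts such a matrix.

```r
# Made-up 'pretrained' embeddings: 4 terms, 3 dimensions
pretrained <- matrix(c(0.1, 0.3, 0.5,
                       0.2, 0.1, 0.9,
                       0.7, 0.7, 0.1,
                       0.4, 0.2, 0.2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(c("koning", "belasting", "budget", "trein"), NULL))

# Terms that occur in our own (toy) documents
texts <- c("de koning en het budget", "budget voor de trein")
vocab <- unique(unlist(strsplit(texts, " ", fixed = TRUE)))

# Keep only the pretrained rows for terms we actually use
known            <- intersect(rownames(pretrained), vocab)
start_embeddings <- pretrained[known, , drop = FALSE]
dim(start_embeddings)
```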

More information can be found in the package vignette, which you can open after installing the package with the following R code. Enjoy!

vignette("ground-control-to-ruimtehol", package = "ruimtehol")


King Filip looks like ... (Koning Filip lijkt op ...)

Last call for the course on Text Mining with R, held next week in Leuven, Belgium on April 1-2. You can view the course description and subscribe at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r

One of the things you'll learn is that King Filip of Belgium is similar to public expenses, at least if we only look at open data from questions and answers in the Belgian parliament (retrieved from http://data.dekamer.be). Proof is below. See you next week.

library(ruimtehol)
library(lattice)
library(latticeExtra)  ## provides c() for combining lattice plots
data("dekamer", package = "ruimtehol")
## Text: split on non-word characters, drop empty tokens, lowercase
dekamer$x <- strsplit(dekamer$question, "\\W")
dekamer$x <- lapply(dekamer$x, FUN = function(x) setdiff(x, ""))
dekamer$x <- sapply(dekamer$x, FUN = function(x) paste(x, collapse = " "))
dekamer$x <- tolower(dekamer$x)
## Labels: one or more comma-separated themes per question
dekamer$y <- strsplit(dekamer$question_theme, split = ",")
dekamer$y <- lapply(dekamer$y, FUN = function(x) gsub(" ", "-", x))
model <- embed_tagspace(x = dekamer$x, y = dekamer$y,
                        early_stopping = 0.8, validationPatience = 10,
                        dim = 50,
                        lr = 0.01, epoch = 40, loss = "softmax", adagrad = TRUE,
                        similarity = "cosine", negSearchLimit = 50,
                        ngrams = 2, minCount = 2)
## Embeddings of the words, of the labels and of the person of interest
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "labels", prefix = FALSE)
## Any other name works as well, e.g. tolower("Theo Francken")
embedding_person <- starspace_embedding(model, tolower(c("Koning Filip")))
## Words closest to the person, leaving out the name itself
similarities <- embedding_similarity(embedding_person, embedding_words, top = 9)
similarities <- subset(similarities, !term2 %in% c("koning", "filip"))
similarities$term <- factor(similarities$term2, levels = rev(similarities$term2))
plt1 <- barchart(term ~ similarity | term1, data = similarities,
         scales = list(x = list(relation = "free"), y = list(relation = "free")),
         col = "darkgreen", xlab = "Similarity", main = "Koning Filip lijkt op ...")
## Labels (question themes) closest to the person
similarities <- embedding_similarity(embedding_person, embedding_labels, top = 7)
similarities$term <- factor(similarities$term2, levels = rev(similarities$term2))
plt2 <- barchart(term ~ similarity | term1, data = similarities,
         scales = list(x = list(relation = "free"), y = list(relation = "free")),
         col = "darkgreen", xlab = "Similarity", main = "Koning Filip lijkt op ...")
c(plt1, plt2)
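The ranking above rests on cosine similarity between embedding vectors (the model is trained with similarity = "cosine"). As a minimal sketch of that computation in base R, using made-up 3-dimensional vectors rather than real model output and a hypothetical helper called `cosine_similarity`:

```r
# Cosine similarity between one vector and every row of a matrix
cosine_similarity <- function(v, m) {
  as.vector(m %*% v) / (sqrt(sum(v^2)) * sqrt(rowSums(m^2)))
}

# Made-up embeddings (the model above uses dim = 50 instead of 3)
person <- c(1, 0, 1)
words  <- rbind(uitgaven  = c(2, 0, 2),
                begroting = c(1, 1, 0),
                sport     = c(0, 1, 0))

sims <- cosine_similarity(person, words)
names(sims) <- rownames(words)

# Most similar terms first
sort(sims, decreasing = TRUE)
```

Cosine similarity only looks at the direction of the vectors, not their length, which is why the toy term that is an exact multiple of the person vector comes out on top.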

Human Face Detection with R

Doing human face detection with computer vision is probably something you do once, unless you work for a police department, in the surveillance industry or for the Chinese government. To reduce the time you lose on that small exercise, bnosac created a small R package (source code available at https://github.com/bnosac/image) which wraps the weights of a Single Shot Detector (SSD) Convolutional Neural Network trained with the Caffe deep learning framework. That network allows you to detect human faces in images. An example is shown below (tested on Windows and Linux).

install.packages("image.libfacedetection", repos = "https://bnosac.github.io/drat")
library(magick)
library(image.libfacedetection)
image <- image_read("http://bnosac.be/images/bnosac/blog/wikipedia-25930827182-kerry-michel.jpg")
faces <- image_detect_faces(image)
plot(faces, image, border = "red", lwd = 7, col = "white")


What you get back is, for each face, the x/y location and the width and height of the face. If you want to extract only the faces, loop over the detected faces and crop them from the image as shown below.

## Crop one sub-image per detected face
allfaces <- Map(
    x      = faces$detections$x,
    y      = faces$detections$y,
    width  = faces$detections$width,
    height = faces$detections$height,
    f = function(x, y, width, height){
      image_crop(image, geometry_area(x = x, y = y, width = width, height = height))
    })
## Combine the cropped faces into one multi-frame image
allfaces <- do.call(c, allfaces)
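For reference, ImageMagick describes a crop region with a geometry string of the form <width>x<height>+<x>+<y>. The base R sketch below builds such strings from a made-up detections table with the same four columns used above (x, y, width, height). The numbers and the helper name `crop_geometry` are purely illustrative.

```r
# Made-up detections: one row per face
detections <- data.frame(x = c(10, 200), y = c(20, 35),
                         width = c(40, 64), height = c(50, 80))

# Hypothetical helper: format a crop region as "<width>x<height>+<x>+<y>"
crop_geometry <- function(x, y, width, height) {
  sprintf("%dx%d+%d+%d", as.integer(width), as.integer(height),
          as.integer(x), as.integer(y))
}

geoms <- with(detections, crop_geometry(x, y, width, height))
geoms
```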

Hope this saves you some time on what seems to be the t-test of computer vision. Want to learn more about computer vision? Next time, just follow our course on Computer Vision with R and Python: https://lstat.kuleuven.be/training/coursedescriptions/ComputervisionwithRandPython

Making thematic maps for Belgium

For people from Belgium working in R with spatial data, you can find excellent workshop material on creating thematic maps for Belgium at https://workshop.mhermans.net/thematic-maps-r/index.html. The workshop was given by Maarten Hermans from HIVA - Onderzoeksinstituut voor Arbeid en Samenleving.
The plots are heavily based on BelgiumMaps.Statbel, an R package from bnosac released 2 years ago (more info at http://www.bnosac.be/index.php/blog/55-belgiummaps-statbel-r-package-with-administrative-boundaries-of-belgium).