A comparison between spaCy and UDPipe for Natural Language Processing for R users

In the last few years, Natural Language Processing (NLP) has increasingly become an open, multi-lingual task instead of being held back by language, country and legal boundaries. With the advent of commonly used open data for natural language processing tasks, as available at http://universaldependencies.org, one can now relatively easily compare different toolkits which perform natural language processing. In this post we compare the udpipe R package to the spacyr R package.

UDPipe - spaCy comparison

A traditional natural language processing flow consists of a number of building blocks which you can use to structure your Natural Language Application, namely:

1. tokenisation
2. parts of speech tagging
3. lemmatisation
4. morphological feature tagging
5. syntactic dependency parsing
6. entity recognition
7. extracting word & sentence meaning

Both of these R packages allow you to perform these tasks.
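
As a quick illustration, below is a minimal sketch of one annotation call per package (the example sentences are ours; model download and initialisation follow the same conventions as the code later in this post):

library(udpipe)
## download/load a model, annotate raw text, get a data.frame of annotations
m <- udpipe_load_model(udpipe_download_model(language = "english")$file_model)
head(as.data.frame(udpipe_annotate(m, x = "UDPipe can parse raw text.")))

library(spacyr)
spacy_initialize(model = "en")
head(spacy_parse("spaCy can parse raw text.", dependency = TRUE))
spacy_finalize()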

Comparison

In the comparison, we will provide general feedback on the following elements

  • Languages which are covered by the tools
  • Ease of use
  • Annotation possibilities
  • Annotation accuracy of the models
  • Annotation speed   

Annotation languages

  • udpipe provides annotation models for more than 50 languages (afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese) of which 17 languages are released under a commercially more liberal license, while the others are released under the CC-BY-NC-SA license
  • spaCy currently provides models for 7 languages: English, German, Spanish, Portuguese, French, Italian and Dutch.
    • For English and German these were trained on data which is not available at http://universaldependencies.org; for the other languages the models were trained on data from http://universaldependencies.org
    • In order to train your own models, you need to do this directly in Python; the Python community has been building these since the end of 2017.

Ease of use

  • Both packages are on CRAN
  • Models can easily be downloaded with both packages. For udpipe this is done directly from R; for spaCy this needs to be done in Python.
  • udpipe has no external dependencies and can easily be installed with install.packages('udpipe'); after that you are ready to go
  • installation of spacyr will probably give you some trouble, namely it
    • requires installation of the Python package spacy, which is more difficult to install on a local computer in your corporate office. You can easily get stuck in problems of Python versioning, 32 versus 64 bit architecture issues, admin rights or basic shell commands that you should be aware of
    • does not allow you to switch seamlessly between 2 languages (you need to initialise and finalise), which is a burden if you live in a multi-language country like e.g. Belgium (see the sketch after this list)
    • spaCy models are constructed on different treebanks, each following different guidelines, which makes cross-language downstream analysis more difficult to harmonise
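
To illustrate the language-switching point, a minimal sketch (the example sentences are ours; depending on your setup you may need to pass python_executable to spacy_initialize, as in the benchmark code below):

## udpipe: several language models can live side by side in one session
library(udpipe)
nl <- udpipe_load_model(udpipe_download_model(language = "dutch")$file_model)
fr <- udpipe_load_model(udpipe_download_model(language = "french")$file_model)
anno_nl <- as.data.frame(udpipe_annotate(nl, x = "Dit is een zin."))
anno_fr <- as.data.frame(udpipe_annotate(fr, x = "Ceci est une phrase."))

## spacyr: switching languages means finalising and re-initialising
library(spacyr)
spacy_initialize(model = "nl")
anno_nl <- spacy_parse("Dit is een zin.")
spacy_finalize()
spacy_initialize(model = "fr")
anno_fr <- spacy_parse("Ceci est une phrase.")
spacy_finalize()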

Annotation accuracy of the models

As the spaCy and UDPipe models for Spanish, Portuguese, French, Italian and Dutch have been built on data from the same Universal Dependencies treebanks (version 2.0), one can compare the accuracies of the different NLP processing steps (tokenisation, POS tagging, morphological feature tagging, lemmatisation, dependency parsing).
Evaluation is traditionally done by leaving out some sentences from the training part and seeing how well the model does on these held-out sentences, which were tagged by humans; that is why they are called 'gold'.
Below you can find accuracy statistics for the different NLP tasks, obtained by running the CoNLL 2017 shared task evaluation script on the held-out test sets. These graphs basically show that

  • UDPipe provides better results for French, Italian & Portuguese, equal results for Spanish, and for Dutch less good results for dependency parsing and treebank-specific tags but better results for the universal parts of speech tags.
  • For English, only the Penn Treebank XPOS tag can be compared, and spaCy shows less good results than we expected when comparing to the UDPipe model.

[Figure: aligned accuracy of the UDPipe and spaCy models per NLP task and language]

Exact reproducible details on the evaluation can be found at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide comments there.
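
Such an evaluation essentially boils down to handing a gold CoNLL-U file and a system-produced CoNLL-U file to the shared task script. A hedged sketch of calling it from R (the file names are placeholders, and conll17_ud_eval.py is assumed to be available in the working directory with python on the path):

## compare system output against the gold standard annotations
system2("python", args = c("conll17_ud_eval.py", "gold.conllu", "system.conllu"))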

Annotation possibilities

  • udpipe
    • allows you to do tokenisation, parts of speech tagging, lemmatisation, morphological feature tagging and dependency parsing
    • udpipe does not provide entity recognition and does not provide word vectors (you can use existing R packages for that, e.g. text2vec)
  • spacyr
    • spaCy allows you to do tokenisation, parts of speech tagging, morphological feature tagging and dependency parsing
    • On top of that, it also does entity recognition
    • spacyr does not provide lemmatisation
    • spaCy also provides word vectors (for English only) but they are not made available in spacyr

    So if you need entity recognition, udpipe is not an option. If you need lemmatisation, spacyr is not an option.
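
A minimal sketch of how the two packages complement each other on these points (the example sentence is ours; entity_extract() is the spacyr helper that pulls the recognised entities out of the parsed output):

## entity recognition: spacyr only
library(spacyr)
spacy_initialize(model = "en")
parsed <- spacy_parse("BNOSAC is based in Belgium.", entity = TRUE)
entity_extract(parsed)
spacy_finalize()

## lemmatisation: udpipe only
library(udpipe)
m <- udpipe_load_model(udpipe_download_model(language = "english")$file_model)
anno <- as.data.frame(udpipe_annotate(m, x = "BNOSAC is based in Belgium."))
anno[, c("token", "lemma")]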

Annotation speed

  • spacyr is in our experiments 5 times faster than udpipe for a comparable full annotation pipeline (tokenisation, POS tagging, lemmatisation, feature tagging, dependency parsing) and comparable output (see the code below)
library(udpipe)
library(spacyr)
library(microbenchmark)
data(brussels_reviews, package = "udpipe")

## annotate with udpipe: tokenisation, POS tagging, lemmatisation,
## morphological features and dependency parsing in one call
f_udpipe <- function(x, model){
  x_anno <- udpipe_annotate(model, x = x)
  x_anno <- as.data.frame(x_anno)
  invisible(x_anno)
}
## annotate with spacyr: comparable output, entity recognition switched off
f_spacy <- function(x){
  x_anno <- spacy_parse(x, pos = TRUE, tag = TRUE, lemma = TRUE, entity = FALSE, dependency = TRUE)
  invisible(x_anno)
}

## Dutch
x <- subset(brussels_reviews, language == "nl")
x <- x$feedback
ud_model <- udpipe_download_model(language = "dutch")
ud_model <- udpipe_load_model(ud_model$file_model)
spacy_initialize(model = "nl", python_executable = "C:/Users/Jan/Anaconda3/python.exe")
microbenchmark(
  f_udpipe(x, model = ud_model),
  f_spacy(x),
  times = 2)
spacy_finalize()

Enjoy

Hope this provides you some guidance when you are thinking about extending your NLP workflow with deeper natural language processing than merely sentiment analysis.

Last call for the course on Advanced R programming

Last call for the course on Advanced R programming scheduled in Leuven, Belgium on February 20-21 2018. Register at: https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

You'll learn during that course:


  • The apply family of functions, basic parallel programming for these functions and commonly needed data manipulation skills
  • Making a basic reproducible report using Sweave and knitr including tables, graphs and literate programming
  • How to create an R package
  • Understanding how S3 programming works: generics, environments and namespaces
  • Basic tips on how to organise, develop and test R code

Need other training? Visit http://bnosac.be/index.php/training

Log shiny app visitors and R usage to Google Analytics

If you work on applications for clients or have open-sourced some shiny apps, a question that arises is how your application is being used. One way to find out how your hard work is being consumed is to add logging statements to your code and then inspect the logs.

An easier way to track usage of your application, however, is to just send page views or application events to Google Analytics. That is exactly what the GAlogger R package (https://github.com/bnosac/GAlogger) does. It allows you to log R events and R usage to Google Analytics and was created with the following use cases in mind:

Track usage of your application

  • If someone visits a page in your web application (e.g. Shiny) or web service (e.g. RApache, Plumber), use the GAlogger R package to send the page and the title of the page which is visited, so that you can easily see how visitors are using your application
  • Do you want to know which user inputs are set in your Shiny app? You can now collect these events easily with this R package (a Shiny sketch is shown further below)

Track usage of your scripts / package usage / functions

  • Keep track of how your internal useRs are using your package (e.g. when a user loads your package or uses a specific function or web service); a sketch of this is shown below
  • Do you want to keep track of the status of a long-running process or keep an error message if something failed? You can log these as events too
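
As a hedged sketch of the package-usage case, a package author could call GAlogger from the package's .onAttach hook (the tracking ID below is a placeholder and the event wording is illustrative):

.onAttach <- function(libname, pkgname) {
  if (requireNamespace("GAlogger", quietly = TRUE)) {
    GAlogger::ga_set_tracking_id("UA-XXXXX-Y")
    GAlogger::ga_set_approval(consent = TRUE)
    ## log that the package was loaded
    GAlogger::ga_collect_event(event_category = "Package",
                               event_action = paste("load", pkgname))
  }
}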


How

First of all, get the R package from https://github.com/bnosac/GAlogger

Get your own free tracking ID from Google Analytics (it looks like UA-XXXXX-Y), set it as shown below and indicate that you approve that data will be sent to Google Analytics. Put that code in your shiny app or R script.

library(GAlogger)
ga_set_tracking_id("UA-25938715-4")
ga_set_approval(consent = TRUE)

Next start sending data to Google Analytics. You can either send page visits or events.

Page visits

Someone is visiting your web service or shiny web application? Great, log it as follows.

ga_collect_pageview(page = "/home")
ga_collect_pageview(page = "/simulation", title = "Mixture process")
ga_collect_pageview(page = "/simulation/bayesian")
ga_collect_pageview(page = "/textmining-exploratory")
ga_collect_pageview(page = "/my/killer/app")
ga_collect_pageview(page = "/home", title = "Homepage", hostname = "www.xyz.com")

Events

An event is happening in your app or R code? Great, log it as follows.

ga_collect_event(event_category = "Start", event_action = "shiny app launched")
ga_collect_event(event_category = "Error", event_label = "convergence failed", event_action = "Oh no")
ga_collect_event(event_category = "Error", event_label = "Bad input", 
                 event_action = "send the firesquad", event_value = 911)
ga_collect_event(event_category = "Simulation", event_label = "Launching Bayesian multi-level model",
                 event_action = "How many simulations", event_value = 10)

Visit Google Analytics to see who visited you or what happened in your script

  • Logged pageviews can be viewed in the Google Analytics > Behaviour tab or in the Real-Time part of Google Analytics
  • Logged events can be viewed in the Google Analytics > Behaviour > Events tab or in the Real-Time part of Google Analytics

Enjoy!


Natural Language Processing for non-English languages with udpipe

BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization', 'morphological feature tagging' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at http://universaldependencies.org/format.html.

Language models

The package provides direct access to language models trained on more than 50 languages. The following languages are directly available:

afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese

We hope that the package will allow other R users to build natural language applications on top of the resulting parts of speech tags, tokens, morphological features and dependency parsing output. And we hope in particular that applications will arise which are not limited to English only (like the textrank R package or the cleanNLP package, to name a few).


Easy installation, great docs

  • Note that the package has no external software dependencies (no Java nor Python) and depends only on 2 R packages (Rcpp and data.table), which makes the package easy to install on any platform. The package is available for download at https://CRAN.R-project.org/package=udpipe and is developed at https://github.com/bnosac/udpipe. A small docusaurus website is made available at https://bnosac.github.io/udpipe/en
  • We hope you enjoy using it and we would like to thank Milan Straka for all the efforts done on UDPipe as well as all persons involved in http://universaldependencies.org.

Training on Text Mining with R

Are you interested in text mining? Feel free to register for the upcoming course on text mining.

Example

Want to get started right away? Below is an example annotating Polish text in UTF-8 encoding, but you can pick any language of choice listed above. Enjoy.

library(udpipe)
## download and load the Polish annotation model
model <- udpipe_download_model(language = "polish")
model <- udpipe_load_model(file = model$file_model)
## annotate raw text and turn the output into a data.frame
x <- udpipe_annotate(model, x = "Budynek otrzymany od parafii wymaga remontu, a placówka nie otrzymała jeszcze żadnej dotacji.")
x <- as.data.frame(x)
x


An overview of open data from Belgium

BNOSAC is working on building an application on top of open data from questions and answers given at the parliament in Belgium. It will basically show what our civil servants in parliament are busy with. If you are interested in co-developing, feel free to get in touch for a quick chat. For those of you interested in an overview of open data available in Belgium, we've made a presentation showing what open data is available in Belgium for direct use (see below).

Interested in how open data can be used for your business? Get in touch.

[Embedded presentation: an overview of open data available in Belgium]