Basic R Automation

Last Wednesday, a small presentation was given at the RBelgium meetup in Brussels on Basic R Automation. For those of you who could not attend, here are the slides of that presentation which showed the use of the cronR and taskscheduleR R packages for automating basic R scripts.
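To give a flavour of what was covered, below is a minimal sketch of scheduling an R script with cronR on Linux or Mac; the script path, schedule and job id are hypothetical and only serve as illustration.

library(cronR)
## Build the command that runs an R script and logs its output (hypothetical path)
cmd <- cron_rscript("/home/user/scripts/daily_report.R")
## Schedule the script to run every day at 07:00
cron_add(command = cmd, frequency = "daily", at = "07:00",
         id = "daily_report", description = "Daily reporting job")
## List the scheduled jobs and remove the job again
cron_ls()
cron_rm(id = "daily_report")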

If you are interested in setting up a project for more advanced ways on how to automate your R processes for your specific environment, get in touch.


An overview of keyword extraction techniques

In this blogpost, we will show 6 keyword extraction techniques which allow you to find keywords in plain text. Keywords are words or multi-word expressions which occur frequently and tend to appear together in plain text. Common examples are New York, Monte Carlo, Mixed Models, Brussels Hoofdstedelijk Gewest, Public Transport, Central Station, p-values, ...

If you master these techniques, you can easily step away from simple word frequency statistics towards more business-relevant text summarisation. For this, we will use the udpipe R package (docs at https://CRAN.R-project.org/package=udpipe or https://bnosac.github.io/udpipe/en), which is the core R package you need for this type of text processing. We'll basically show how to easily extract keywords as follows:

1. Find keywords by doing Parts of Speech tagging in order to identify nouns
2. Find keywords based on Collocations and Co-occurrences
3. Find keywords based on the Textrank algorithm
4. Find keywords based on RAKE (rapid automatic keyword extraction)
5. Find keywords by looking for Phrases (noun phrases / verb phrases)
6. Find keywords based on results of dependency parsing (getting the subject of the text)

These techniques will allow you to move away from showing silly word graphs to more relevant graphs containing keywords.

wordclouds

Example

As an example we are going to use Spanish feedback from customers who stayed in Airbnb apartments in Brussels. This data is part of the udpipe R package. We extract the Spanish text and annotate it using the udpipe R package. Annotation performs tokenisation, parts of speech tagging, lemmatisation and dependency parsing.

library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback)
x <- as.data.frame(x)

udpipe example airbnb

Once we have the annotation, finding keywords is a breeze. Let's show how this can be easily accomplished.

Option 1: Extracting only nouns

An easy way to find keywords is to look at nouns. Since each term gets a Parts of Speech tag when you annotate text with the udpipe package, you can do this as follows.

stats <- subset(x, upos %in% "NOUN")
stats <- txt_freq(x = stats$lemma)
library(lattice)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring nouns", xlab = "Freq")

keywords plot1

Option 2: Collocation & co-occurrences

Although nouns are a great start, you are probably interested in multi-word expressions. You can get multi-word expressions by looking either at collocations (words following one another), at word co-occurrences within each sentence, or at co-occurrences of words which are close neighbours of one another. These approaches can be executed as follows using the udpipe R package. If we combine this with selecting only the nouns and adjectives, the results already become quite nice.

## Collocation (words following one another)
stats <- keywords_collocation(x = x,
                             term = "token", group = c("doc_id", "paragraph_id", "sentence_id"),
                             ngram_max = 4)
## Co-occurrences: How frequently do words co-occur within the same sentence, in this case only nouns or adjectives
stats <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")),
                     term = "lemma", group = c("doc_id", "paragraph_id", "sentence_id"))
## Co-occurrences: How frequently do words follow one another
stats <- cooccurrence(x = x$lemma,
                     relevant = x$upos %in% c("NOUN", "ADJ"))
## Co-occurrences: How frequently do words follow one another, even if we were to skip 2 words in between
stats <- cooccurrence(x = x$lemma,
                     relevant = x$upos %in% c("NOUN", "ADJ"), skipgram = 2)
head(stats)
      term1     term2 cooc
     barrio tranquilo   36
   estacion      tren   30
 transporte   publico   23
     centro    ciudad   23
      pleno    centro   20
   estacion   central   19

Visualisation of these co-occurrences can be done using a network plot as follows for the top 30 most frequent co-occurring nouns and adjectives.

library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(stats, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
  geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
  theme_graph(base_family = "Arial Narrow") +
  theme(legend.position = "none") +
  labs(title = "Cooccurrences within 3 words distance", subtitle = "Nouns & Adjective")

keywords plot2

Option 3: Textrank (word network ordered by Google Pagerank)

Another approach for keyword detection is Textrank. Textrank is an algorithm implemented in the textrank R package. The algorithm can be used both to summarise text and to extract keywords. This is done by constructing a word network which links words that follow one another. On top of that network the 'Google Pagerank' algorithm is applied to extract relevant words, after which relevant words which follow one another are combined to obtain keywords. In the example below, we are interested in finding keywords consisting of nouns or adjectives following one another. You can see from the plot below that the keywords combine words into multi-word expressions.

stats <- textrank_keywords(x$lemma, 
                          relevant = x$upos %in% c("NOUN", "ADJ"),
                          ngram_max = 8, sep = " ")
stats <- subset(stats$keywords, ngram > 1 & freq >= 5)
library(wordcloud)
wordcloud(words = stats$keyword, freq = stats$freq)

keywords plot5

Option 4: Rapid Automatic Keyword Extraction: RAKE

The next basic algorithm is called RAKE, which is an acronym for Rapid Automatic Keyword Extraction. It looks for keywords by considering contiguous sequences of words which do not contain irrelevant words. Namely by

  1. calculating a score for each word which is part of any candidate keyword; this is done as follows
    • among the words of the candidate keywords, the algorithm looks at how many times each word occurs and how many times it co-occurs with other words
    • each word gets a score which is the ratio of the word degree (how many times it co-occurs with other words) to the word frequency
  2. a RAKE score for the full candidate keyword is calculated by summing up the scores of each of the words which define the candidate keyword (a toy illustration of this scoring is shown below, after the example output)
stats <- keywords_rake(x = x, 
                      term = "token", group = c("doc_id", "paragraph_id", "sentence_id"),
                      relevant = x$upos %in% c("NOUN", "ADJ"),
                      ngram_max = 4)
head(subset(stats, freq > 3))
keyword ngram freq     rake
 perfectas condiciones     2    4 2.000000
            unica pega     2    7 2.000000
           grand place     2    6 1.900000
   grandes anfitriones     2    4 1.809717
    transporte publico     2   21 1.685714
    buenos anfitriones     2    9 1.662281
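To make the scoring concrete, the snippet below is a toy illustration of the RAKE scoring idea on a handful of made-up candidate keywords; it is not the udpipe implementation, just a sketch of the degree/frequency logic described above.

## Hypothetical candidate keywords (each a sequence of relevant words)
candidates <- list(c("transporte", "publico"), c("transporte", "publico"),
                   c("centro", "ciudad"), c("centro"))
words <- unlist(candidates)
freq  <- table(words)
## Degree of a word: sum of the lengths of the candidates it occurs in
## (its own frequency plus the number of words it co-occurs with)
deg <- sapply(names(freq), function(w) {
  sum(sapply(candidates, function(k) if (w %in% k) length(k) else 0L))
})
word_score <- deg / as.numeric(freq)
## RAKE score of a candidate keyword: the sum of the scores of its words
rake_score <- function(keyword) sum(word_score[keyword])
rake_score(c("transporte", "publico"))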

Option 5: Phrases

The next option is to extract phrases. These are defined as a sequence of Parts of Speech tags. Common types of phrases are noun phrases or verb phrases. How does this work? Parts of Speech tags are recoded to one of the following one-letter codes (A: adjective, C: coordinating conjunction, D: determiner, M: modifier of verb, N: noun or proper noun, P: preposition). Next you can define a regular expression which indicates the sequence of parts of speech tags you want to extract from the text.

## Simple noun phrases (adjective+noun, pre/postposition, optional determiner and another adjective+noun)
x$phrase_tag <- as_phrasemachine(x$upos, type = "upos")
stats <- keywords_phrases(x = x$phrase_tag, term = x$token,
                         pattern = "(A|N)+N(P+D*(A|N)*N)*",
                         is_regex = TRUE, ngram_max = 4, detailed = FALSE)
head(subset(stats, ngram > 2))
keyword ngram freq
                   Gare du Midi     3   12
       pleno centro de Bruselas     4    6
               15 minutos a pie     4    4
               nos explico todo     3    4
 primera experiencia con Airbnb     4    3
                   Gare du Nord     3    3

Option 6: Use dependency parsing output to get the nominal subject and the adjective modifying it

In the last option, we will show how to use the results of the dependency parsing. When you executed the annotation using udpipe, the dep_rel field indicates how words are related to one another: a token is related to its parent using token_id and head_token_id. The types of relations are defined at http://universaldependencies.org/u/dep/index.html. For this exercise we are going to take the words which have as dependency relation nsubj, indicating the nominal subject, and we add to that the adjective which modifies the nominal subject.

In this way we can combine what people are talking about with the adjectives they use when they talk about the subject.

stats <- merge(x, x, 
           by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
           by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
           all.x = TRUE, all.y = FALSE,
           suffixes = c("", "_parent"), sort = FALSE)
stats <- subset(stats, dep_rel %in% "nsubj" & upos %in% c("NOUN") & upos_parent %in% c("ADJ"))
stats$term <- paste(stats$lemma_parent, stats$lemma, sep = " ")
stats <- txt_freq(stats$term)
library(wordcloud)
wordcloud(words = stats$key, freq = stats$freq, min.freq = 3, max.words = 100,
          random.order = FALSE, colors = brewer.pal(6, "Dark2"))

keywords plot4

Now up to you. Can you do the same on your own text?

Credits: This analysis would not have been possible without the Spanish annotated treebanks (https://github.com/UniversalDependencies/UD_Spanish-GSD in particular as made available through http://universaldependencies.org) and the UDPipe C++ library and models provided by Milan Straka (https://github.com/ufal/udpipe). All credits have to go there. 

Automate R processes

Last week we updated the cronR R package and released it to CRAN, allowing you to schedule any R code at whichever timepoint you like. The package was updated in order to comply with stricter CRAN policies regarding writing to folders. Along the way, the RStudio add-in of the package was also updated. It now looks as shown below and is tailored to data scientists who want to automate basic R scripts.

cronR rstudioaddin


The cronR (https://github.com/bnosac/cronR) and taskscheduleR (https://github.com/bnosac/taskscheduleR) R packages are distributed on CRAN and provide the basic functionality to schedule R code at your specific timepoints of interest. The taskscheduleR R package is designed to schedule processes on Windows, while the cronR R package lets you schedule jobs on Linux or Mac. Hope you enjoy the packages.
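As an illustration, here is a minimal sketch of scheduling an R script with taskscheduleR on Windows; the script path, task name and start time are hypothetical.

library(taskscheduleR)
## Path to the R script you want to run (hypothetical path)
myscript <- "C:/scripts/daily_report.R"
## Run the script every day at 09:10
taskscheduler_create(taskname = "daily_report", rscript = myscript,
                     schedule = "DAILY", starttime = "09:10")
## Inspect the scheduled tasks and remove the task again
tasks <- taskscheduler_ls()
taskscheduler_delete(taskname = "daily_report")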

If you need support in automating and integrating more complex R flows in your current architecture, feel free to get in contact here.

A comparison between spaCy and UDPipe for Natural Language Processing for R users

In the last few years, Natural Language Processing (NLP) has become more and more an open, multi-lingual task instead of one held back by language, country and legal boundaries. With the advent of commonly used open data for natural language processing tasks, as available at http://universaldependencies.org, one can now relatively easily compare different toolkits which perform natural language processing. In this post we compare the udpipe R package to the spacyr R package.

UDPipe - spaCy comparison

A traditional natural language processing flow consists of a number of building blocks which you can use to structure your Natural Language Application, namely:

1. tokenisation
2. parts of speech tagging
3. lemmatisation
4. morphological feature tagging
5. syntactic dependency parsing
6. entity recognition
7. extracting word & sentence meaning

Both of these R packages allow you to do this.

Comparison

In the comparison, we will provide general feedback on the following elements

  • Languages which are covered by the tools
  • Ease of use
  • Annotation possibilities
  • Annotation accuracy of the models
  • Annotation speed   

udpipe spacy

Annotation languages

  • udpipe provides annotation models for more than 50 languages (afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese) of which 17 languages are released under a commercially more liberal license, the others are released under the CC-BY-SA-NC licence
  • spaCy currently provides models for 8 languages, including English/German/Spanish/Portuguese/French/Italian/Dutch.
    • For English and German these were trained on data which is not available on http://universaldependencies.org, for the other models they were trained on data from http://universaldependencies.org
    • In order to train your own models you need to do this directly in Python; the Python community has been building these since the end of 2017.

Ease of use

  • Both packages are on CRAN
  • Models can be easily downloaded with both packages. For udpipe this is directly from R, for spacy this needs to be done in Python.
  • udpipe has no external dependencies and can easily be installed with install.packages('udpipe') and next you are ready to go
  • installation of spacyr will probably give you some trouble, namely it
    • requires installation of the Python package spacy, which is more difficult to install on a local computer in your corporate office. You can easily get stuck in problems of Python versioning, 32 vs 64 bit architecture issues, admin rights or basic shell commands that you should be aware of
    • does not allow you to switch seamlessly between 2 languages (you need to initialize and finalize), which is a burden if you live in a multi-language country like e.g. Belgium (see the sketch below this list)
    • spaCy models are constructed on different treebanks, each following different guidelines, which makes cross-language downstream analysis more difficult to harmonise
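To make these ease-of-use differences concrete, below is a minimal sketch of getting both packages ready for Dutch and French annotation. The spaCy model names and the availability of spacy_install() depend on your spacyr and spaCy versions, so treat this as an assumption-laden illustration rather than a recipe.

## udpipe: no external dependencies, models are downloaded from within R
install.packages("udpipe")
library(udpipe)
nl <- udpipe_download_model(language = "dutch")
nl <- udpipe_load_model(nl$file_model)

## spacyr: needs a working Python + spaCy installation
install.packages("spacyr")
library(spacyr)
spacy_install()   # one-time setup of Python + spaCy, if your spacyr version provides it
## language models are downloaded on the Python side, e.g. python -m spacy download nl
## switching between languages requires finalising and re-initialising
spacy_initialize(model = "nl")
## ... annotate Dutch text with spacy_parse() ...
spacy_finalize()
spacy_initialize(model = "fr")
## ... annotate French text ...
spacy_finalize()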

Annotation accuracy of the models

As the spaCy and UDPipe models for Spanish, Portuguese, French, Italian and Dutch have been built on data from the same Universal Dependencies treebank (version 2.0) one can compare the accuracies of the different NLP processing steps (tokenisation, POS tagging, morphological feature tagging, lemmatisation, dependency parsing).
Evaluation is traditionally done by leaving out some sentences from the training part and seeing how well the model does on these held-out sentences, which were tagged by humans; that's why they are called 'gold'.
Below you can find accuracy statistics for the different NLP tasks, obtained by running the CoNLL 2017 shared task evaluation script on the held-out test sets. These graphs basically show that

  • UDPipe provides better results for French, Italian & Portuguese and equal results for Spanish; for Dutch it does worse on dependency parsing and treebank-specific tags but better on the universal parts of speech tags.
  • For English, only the Penn Treebank XPOS tag can be compared, and spaCy performs worse than we expected when comparing to the UDPipe model

results alignedaccuracy2

Exact reproducible details on the evaluation can be found at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide comments there.

Annotation possibilities

  • udpipe
    • allows you to do tokenisation, parts of speech tagging, lemmatisation, morphological feature tagging and dependency parsing
    • udpipe does not provide entity recognition and does not provide word vectors (you can use existing R packages for that, e.g. text2vec)
  • spacyr
    • spaCy allows you to do tokenisation, parts of speech tagging, morphological feature tagging and dependency parsing
    • On top of that it also does entity recognition
    • spacyr does not provide lemmatisation
    • spaCy also provides word vectors (for English only) but they are not made available in spacyr

    So if you need entity recognition, udpipe is not an option. If you need lemmatisation, spacyr is not an option.

Annotation speed

  • In our experiments, spacyr is 5 times faster than udpipe for a comparable full annotation pipeline (tokenisation, POS tagging, lemmatisation, feature tagging, dependency parsing) and comparable output (see the code below)
library(udpipe)
library(spacyr)
library(microbenchmark)
data(brussels_reviews, package = "udpipe")
f_udpipe <- function(x, model){
  x_anno <- udpipe_annotate(model, x = x)
  x_anno <- as.data.frame(x_anno)
  invisible()
}
f_spacy <- function(x){
  x_anno <- spacy_parse(x, pos = TRUE, tag = TRUE, lemma = TRUE, entity = FALSE, dependency = TRUE)
  invisible()
}
## Dutch
x <- subset(brussels_reviews, language == "nl")
x <- x$feedback
ud_model <- udpipe_download_model(language = "dutch")
ud_model <- udpipe_load_model(ud_model$file_model)
spacy_initialize(model = "nl", python_executable = "C:/Users/Jan/Anaconda3/python.exe")
microbenchmark(
  f_udpipe(x, model = ud_model),
  f_spacy(x),
  times = 2)
spacy_finalize()

Enjoy

Hope this provides you some guidance when you are thinking about extending your NLP workflow with deeper natural language processing than merely sentiment analysis.

Last call for the course on Advanced R programming

Last call for the course on Advanced R programming scheduled in Leuven, Belgium on February 20-21 2018. Register at: https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

You'll learn during that course:


  • The apply family of functions, basic parallel programming for these functions and commonly needed data manipulation skills
  • Making a basic reproducible report using Sweave and knitr including tables, graphs and literate programming
  • How to create an R package
  • Understand how S3 programming works, generics, environments, namespaces.
  • Basic tips on how to organise and develop R code and test it.

Need other training? Visit http://bnosac.be/index.php/training