Computer Vision Algorithms for R users

Just before the summer holidays, BNOSAC presented a talk called Computer Vision and Image Recognition algorithms for R users at the UseR conference. In the talk 6 packages on Computer Vision with R were introduced in front of an audience of about 250 persons. The R packages we covered and that were developed by BNOSAC are:

  • image.CornerDetectionF9:  FAST-9 corner detection
  • image.CannyEdges: Canny Edge Detector
  • image.LineSegmentDetector: Line Segment Detector (LSD)
  • image.ContourDetector:  Unsupervised Smooth Contour Line Detection
  • image.dlib: Speeded up robust features (SURF) and histogram of oriented gradients (FHOG) features
  • image.darknet: Image classification using darknet with deep learning models AlexNet, Darknet, VGG-16, GoogleNet and Darknet19. As well object detection using the state-of-the art YOLO detection system

For those of you who missed this, you can still see the video of the presentation & view the pdf of the presentation below. The packages are open-sourced and made available at https://github.com/bnosac/image

If you have a computer vision endaveour in mind, feel free to get in touch for a quick chat. For those of you interested in following training on how to do image analysis, you can always register for our training on Computer Vision with R and Python here. More details on the full training program and training dates provided by BNOSAC: visit http://bnosac.be/index.php/training

Loading...

Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Parts of Speech (POS) tagging is a crucial part in natural language processing. It consists of labelling each word in a text document with a certain category like noun, verb, adverb, pronoun, ... . At BNOSAC, we use it on a dayly basis in order to select only nouns before we do topic detection or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and all have up or downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger which is terribly slow, the language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) contains more languages but is only usable for non-commercial purposes (can be used based on the koRpus R package)
  • OpenNLP is faster and allows to do POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish, German but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish, Italian but needs Python installed which is not always easy to request at IT departments
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed which might be too much installation hassle in a corporate setting not to mention the computational resources needed.

Comes in RDRPOSTagger which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:

zinsontledingmogelijk

  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend the UD_ to the language if you want to used these models.
    MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform running on Windows/Linux/Mac
  6. It allows to do the Morphological, POS tagging and universal POS tagging of sentences

The Ripple Down Rules a basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology of this is explained in detail at the paper ' A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging' available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging on your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>

The function rdr_pos requests as input a vector of sentences. If you need to transform you text data to sentences, just use tokenize_sentences from the tokenizers package.

Good luck with text mining.
If you need our help for a text mining project. Let us know, we'll be glad to get you started.

Scheduling R scripts and processes on Windows and Unix/Linux

cronR rstudioaddin2 new R packages were put on CRAN last week by BNOSAC (www.bnosac.be).

  • One package for scheduling R scripts and processes on Windows (taskscheduleR) and
  • Another package for scheduling R scripts and processes on Unix / Linux (cronR)

These 2 packages allow you to schedule R processes from R directly. This is done by passing commands directly to cron which is a basic Linux/Unix job scheduling utility or by using the Windows Task Scheduler. The packages were developed for beginning R users who are unaware of that fact that R scripts can also be run non-interactively and can be automated.

We blogged already about the taskscheduleR R package at this blog post and also here. This time we devote some more details to the cronR R package.

The cronR package allows to

  • Get the list of scheduled jobs
  • Remove scheduled jobs
  • Add a job
    • a job is basically a script with R code which is run through Rscript
    • You can schedule tasks 'ONCE', 'EVERY MINUTE', 'EVERY HOUR', 'EVERY DAY', 'EVERY WEEK', 'EVERY MONTH' or any complex schedule
    • The task log contains the stdout & stderr of the Rscript which was run on that timepoint. This log can be found at the same folder as the R script

The package is especially suited for persons working on an RStudio server in the cloud or within the premises of their corporate environment. It allows to easily schedule processes. To make that extremely easy for beginning R users, an RStudio addin was developed, which is shown in the example below. The RStudio addin basically allows you to select an R script and schedule it at specific timepoints. It does this by copying the script to your launch/log folder and setting up a cronjob for that script.

The example below shows how to set up a cron job using the RStudio addin so that the scripts are launched every minute or every day at a specific hour. The R code is launched through Rscript and the log will contain the errors and the warnings in case your script failed so that you can review where the code failed.

Mark that you can also pass on arguments to the R script so that you can launch the same script for productXYZ and productABC.

cronR rstudioaddin

Of course scheduling scripts can also be done from R directly. Some examples are shown below. More information at https://github.com/bnosac/cronR

library(cronR)
f <- system.file(package = "cronR", "extdata", "helloworld.R")
cmd <- cron_rscript(f, rscript_args = c("productx", "20160101"))
## Every minute
cron_add(cmd, frequency = 'minutely', id = 'job1', description = 'Customers')
## Every hour at 20 past the hour on Monday and Tuesday
cron_add(cmd, frequency = 'hourly', id = 'job2', at = '00:20', description = 'Weather', days_of_week = c(1, 2))
## Every day at 14h20 on Sunday, Wednesday and Friday
cron_add(cmd, frequency = 'daily', id = 'job3', at = '14:20', days_of_week = c(0, 3, 5))
## Every starting day of the month at 10h30
cron_add(cmd, frequency = 'monthly', id = 'job4', at = '10:30', days_of_month = 'first', days_of_week = '*')
## Get all the jobs
cron_ls()
## Remove all scheduled jobs
cron_clear(ask=FALSE)

We hope this will gain you some precious time and if you need more help on automating R processes, feel free to get into contact. We have a special training devoted to managing R processes which can be given in your organisation. More information at our training curriculum.

r training

 

Detect Lines in Digital Images

As part of our data science training initiative, bnosac is also providing a course on computer vision with R & Python which is held in March 9-10 in Leuven, Belgium (subscribe here or have a look at our full training offer here). Part of the course is covering finding blobs, corners, gradients, edges & lines in images.

For this reason, the R package image.LineSegmentDetector was made available at https://github.com/bnosac/image. It allows to detect segment lines in digital images. An example of this is shown below.

library(image.LineSegmentDetector)
library(pixmap)

## Read in the image + make sure input to the algorithm is matrix with grey-scale values in 0-255 range
imagelocation <- system.file("extdata", "chairs.pgm", package="image.LineSegmentDetector")
image <- read.pnm(file = imagelocation, cellres = 1)
x <- image@grey * 255
## Detect and plot the lines segments
linesegments <- image_line_segment_detector(x)
linesegments
plot(image)
plot(linesegments, add = TRUE, col = "red")

chairs result

The line segment detector finds lines in digital grey-scale images and is an implementation of the linear-time Line Segment Detector explained at https://doi.org/10.5201/ipol.2012.gjmr-lsd. It gives subpixel accurate results and is designed to work on any digital image without parameter tuning. It controls its own number of false detections where on average, one false alarm is allowed per image.
More information here https://github.com/bnosac/image

The algorithm requires as input a grey-scale image. So if you have another image, you can use the excellent magick package to transform it to grey scale.

library(magick)
f <- tempfile(fileext = ".pgm")
x <- image_read("http://www.momentumshipping.net/lounge/wp-content/uploads/2015/10/containership2.jpg")
x <- image_convert(x, format = "pgm", depth = 8)
image_write(x, path = f, format = "pgm")
image <- read.pnm(file = f, cellres = 1)
linesegments <- image_line_segment_detector(image@grey * 255)
plot(image)
plot(linesegments, add = TRUE, col = "red")

linesegmentdetector example

The algorithm is implemented with Rcpp and is released under the AGPL-3 license. Hope you enjoy it.

Need support in image recognition? Let us know.

An overview of text mining visualisations possibilities with R on the CETA trade agreement

Text Mining has become quite mainstream nowadays as the tools to make a reasonable text analysis are ready to be exploited and give astoundingly nice and reasonable results. At BNOSAC, we use it mainly for text mining on call center data, poetry, salesforce data, emails, HR reviews, IT logged tickets, customer reviews, survey feedback and many more. We also provide training on text mining with R in Belgium (next trainings are scheduled on November 14/15 2016 and March 23/24 2017 in Belgium, to subscribe: go here or here).

For this blog post, we will focus on the most relevant visualisations which exist in text mining by using the CETA trade agreement between the EU and Canada.

maple europa

This trade agreement is available as a pdf on www.trade.ec.europa.eu/doclib/docs/2014/september/tradoc_152806.pdf and we would like to see what is written in there without having to read all of the 1598 pages in which our politicians in Belgium seem to have found good reasons to at least for a while politically block the process of signing this treaty.

Let us start by reading in the pdf file in R which is pretty easy nowadays.

library(pdftools)
library(data.table)
txt <- pdf_text("tradoc_152806.pdf")
txt <- unlist(txt)
txt <- paste(txt, collapse = " ")

Research done in 2012 (Wikipedia) showed that the average reading speed is 184 words per minute or 863 characters per minute. The CETA treaty has 3091384 characters and 334723 terms (words excluding punctuations). A quick calculation tells me that reading this document would take me between 30 and 59 hours non-stop. Were are not going to do this but instead do some text mining on the CETA treaty.

If you are doing text mining, people refer to text as part of documents. In this case there is only 1 document, the CETA treaty. For further analysis, we will assume that each article of the CETA treaty is a document. As you can see in the following screenshot, the article starts always with the word article and next some numbers. Let's extract this so that we have a dataset with 453 articles, the header and the content. The R code for this is put at https://github.com/jwijffels/ceta-txtmining/blob/master/01_import_ceta_treaty_pos_tagging.R. This dataset is available at https://github.com/jwijffels/ceta-txtmining.

ceta reorganisation

A first step in text mining is always to do a Parts-of-Speech (POS) tagging. This assigns to each word a tag (noun, verb, adverb, adjective, pronoun, ...). Mostly you are only interested in nouns and some verbs and POS tagging allows you to easily do this. POS tagging for Dutch/French/English/German/Spanish/Italian can be done with the pattern.nlp package available at https://github.com/bnosac/pattern.nlp. We do this for all 453 articles in the treaty and show the distribution of the POS tags.

## Natural Language Processing: POS tagging
library(pattern.nlp)
ceta_tagged <- mapply(article.id = ceta$article.id, content = ceta$txt, FUN=function(article.id, content){
out <- pattern_pos(x = content, language = "english", core = TRUE)
out$article.id <- rep(article.id, times = nrow(out))
out
}, SIMPLIFY = FALSE)
ceta_tagged <- rbindlist(ceta_tagged)

## Take only nouns
ceta_nouns <- subset(ceta_tagged, word.type %in% c("NN") & nchar(word.lemma) > 2)

## All data in 1 list
ceta <- list(ceta_txt = txt, ceta = ceta, ceta_tagged = ceta_tagged, ceta_nouns = ceta_nouns)

## Look at POS tags
library(lattice)
barchart(sort(table(ceta$ceta_tagged$word.type)), col = "lightblue", xlab = "Term frequency",
main = "Parts of Speech Tag term frequency\n in CETA treaty")

ceta terms vis1

In what comes further, we only work on the nouns (POS tags NN). The data preparation for the text analysis is now done, we have a dataset with 1 row per word and the corresponding tag. We can now start to do some analysis and to make some graphs

In order to split up the 453 articles into relevant topics, we generate a topic model (code can be found here). Let's hope we find an indication that some part of the treaty will indeed cover the issues the Walloon minister president Paul Magnette was talking about. The core R code which fits this is shown below and fits a topic model with 5 topics on the top 250 terms occurring in the CETA treaty.

ceta_topics <- LDA(x = dtm[, topterms], k = 5, method = "VEM", control = list(alpha = 0.1, estimate.alpha = TRUE, seed = as.integer(10:1), verbose = FALSE, nstart = 10, save = 0, best = TRUE))

Can we find the topic which Mr Magnette was talking about? Yes it is topic 3 which emits the words claim, dispute, investment certainty, enterprise, decision and authority, these words will be visualised later on. If we were only interested in this, we can use the topic model to score the treaty and limit the further reading of that document to the 196 articles which cover that. Instead, we will show which typical text mining graphs can be generated with R.

1. Text Mining is mostly about frequency statistics of terms

Wordclouds are a commonly used way to display how many times each word is occurring in the text. You can easily do this with the wordcloud or wordcloud2 R package. The following code calculates how many times each noun is occurring and plots these in a wordcloud. The wordcloud2 package gives HTML as output and allows also to set a background image like an image of the Canada flag and the EU.

## Word frequencies
x <- ceta$ceta_nouns[, list(n = .N), by = list(word.lemma)]
x <- x[order(x$n, decreasing = TRUE), ]
x <- as.data.frame(x)

## Visualise them with wordclouds
library(wordcloud)
wordcloud(words = x$word.lemma, freq = x$n, max.words = 150, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

library(wordcloud2)
wordcloud2(data = head(x, 700), figPath = "maple_europa_black.png")

ceta wordclouds

 

2. But text mining is also about co-occurrence statistics

Co-occurrence statistics indicate how many times 2 words are occuring in the same CETA article. That is great as it gives more indication of how word are used in its context. For this, you can rely on the ggraph & ggforce R packages. The following shows the top 70 word pairs which co-occur and visualises them.

library(ggraph)
library(ggforce)
library(igraph)
library(tidytext)
word_cooccurences <- pair_count(data=ceta$ceta_nouns, group="article.id", value="word.lemma", sort = TRUE)
set.seed(123456789)
head(word_cooccurences, 70) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8, col = "darkgreen") +
ggtitle(sprintf("\n%s", "CETA treaty\nCo-occurrence of nouns")) +
theme_void()

ceta txtmining vis1

 

3. And text mining is also about correlations between words

Text mining is also about visualising which words are used next to other words. One can use correlations for that which can be visualised. The following graph shows the top 25 correlations of the words emmitted by topic 3 (the Mr Magnette topic which generated some media fuzz). It shows the words emitted by topic 3, these are words like: claim, dispute, investment certainty, enterprise, decision and authorit ). This correlation visualisation clearly shows the correlations between the words which got into the news recently namely investor claims of enterprises, investment certainties, who is deciding on that and when does this new type of procedure get into place.

library(topicmodels.utils)
idx <- predict(ceta_topics, dtm, type = "topics")
idx <- idx$topic == 3
termcor <- termcorrplot(dtm[idx, ],
words = names(ceta_topic_terms$topic3), highlight = head(names(ceta_topic_terms$topic3), 3),
autocorMax = 25, lwdmultiplier = 3, drawlabel = TRUE, cex.label = 0.8)

ceta txtmining vis2


In text mining, you have a lot of terms, you can reduce the many word links to relevant links with the EBICglasso function from the qgraph package. This function allows to reduce the correlation matrix between the words to a similar matrix where correlations which are not relevant are set as much as possible to zero so that these do not need to be visualised. The following R code does this and visualises the positive (green) and negative (red) correlations which are left in the correlation matrix of words emmited by all topics. There is clearly and accreditation body which assess conformity according to regulations, an arbitration panel, and a lot is linked to each other regarding access to the market and the government which ensures the legality of the agreement.

library(Matrix)
library(qgraph)
terms <- predict(ceta_topics, type = "terms", min_posterior = 0.025)
terms <- unique(unlist(sapply(terms, names)))
out <- dtm[, terms]
out <- cor(as.matrix(out))
out <- nearPD(x=out, corr = TRUE)$mat
out <- as.matrix(out)

m <- EBICglasso(out, n = 1000)
qgraph(m, layout="spring", labels = colnames(out), label.scale=FALSE,
label.cex=1, node.width=.5)

ceta txtmining vis3



4. If you think of it, words are basically also a network which can be visualised

R packages semnet and visNetwork are your friends here. Words which occur next to each other or within a certain window can be extracted and visualised. Package semnet uses the igraph package for this, package visNetwork allows to generate HTML output to be published on websites. Example graphs are shown below.

library(semnet)
terms <- unique(unlist(sapply(ceta_topic_terms, names)))
cooc <- coOccurenceNetwork(dtm[, terms])
cooc <- simplify(cooc)
plot(cooc, vertex.size=V(cooc)$freq / 20, edge.arrow.size=0.5)

ceta networkvis


5. And of course you can find topics and visualise these topics

If you are interested in topic models, like the one fitted on the CETA treaty, you should have a look at the excellent LDAvis package which allows you to interactively explore the most salient words that each topic emits and as well visualises the distance between the topics.
The visualisation is interactive and can be put into a web application easily. The following call to the createJSON and serVis R functions from the LDAvis package give you an interactive topic visualisation in no time.

library(LDAvis)
json <- createJSON(phi = posterior(ceta_topics)$terms,
theta = posterior(ceta_topics)$topics,
doc.length = row_sums(dtm),
vocab = colnames(dtm),
term.frequency = col_sums(dtm))
serVis(json)

ceta txmining vis6



Hope this triggers you a bit to do text mining with R.

If you are interested in all of this, you might be interested also in attending our course on Text mining with R.

Our next course on Text Mining with R is scheduled on 2016 November 14/15 in Brussels (Belgium): more info + register at http://di-academy.com/event/text-mining-with-r/
Our next course on Text Mining with R is scheduled on 2017 March 23/24 in Leuven (Belgium): more info + register at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r

For all other enquiries:  Get in touch