An overview of open data from Belgium

BNOSAC is building an application on top of open data from the questions and answers given in the Belgian parliament. It will basically show what our civil servants in parliament are busy with. If you are interested in co-developing, feel free to get in touch for a quick chat. For those of you who want an overview, we've made a presentation showing what open data is available in Belgium for direct use (see below).

Interested in how open data can be used for your business? Get in touch.

[Embedded presentation: overview of open data available in Belgium]

CRAN search based on natural language processing

As of October 2017, CRAN contains more than 11500 R packages. Scrolling through all of them would take about two working days, assuming you need 5 seconds per package and work 8 hours a day (11500 x 5 seconds is roughly 16 hours).
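The estimate above can be checked with a few lines of arithmetic. This is just a back-of-the-envelope sketch; the variable names are ours and the inputs are the numbers quoted in the text.

```r
# Rough estimate of the time needed to scroll through all CRAN packages
n_packages          <- 11500  # number of packages quoted in the text
seconds_per_package <- 5      # assumed reading time per package
hours_per_day       <- 8      # assumed length of a working day

total_hours  <- n_packages * seconds_per_package / 3600
working_days <- total_hours / hours_per_day

round(total_hours, 1)
round(working_days, 1)
```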

Since R version 3.4, we can also get a dataset with all packages, their dependencies, the package title, the description and even the installation errors which the packages have. This makes the CRAN database an excellent dataset for text mining. To get that dataset, run the following in R:

library(tools)
crandb <- CRAN_package_db()

Based on that data, the CRAN NLP searcher app shown below was built. It's available for inspection at http://datatailor.be:9999/app/cran_search and is a thin wrapper around the result of annotating the package titles and descriptions with the udpipe R package: https://github.com/bnosac/udpipe


If you want to easily extract what is written in text without reading it, a common approach is to do Parts of Speech tagging, extract the nouns and/or the verbs and then plot the co-occurrences, correlations and frequencies of the lemmata. The udpipe package allows you to do exactly that. Annotating with Parts of Speech tags is pretty easy with the udpipe_annotate function from the udpipe R package (https://github.com/bnosac/udpipe). Note that this takes a while (about 30 minutes) and is probably something you want to run as a web service or an integrated stored procedure.

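library(udpipe)
# Download the pre-trained English model and load it from the downloaded file
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Annotate title + description of each package, using the package name as document id
crandb_annotated <- udpipe_annotate(ud_model,
                                    x = paste(crandb$Title, crandb$Description, sep = " \n "),
                                    doc_id = crandb$Package)
# Convert the annotation to a data.frame with one row per token
crandb_annotated <- as.data.frame(crandb_annotated)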
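To give an idea of what you can do with such an annotated data frame, here is a minimal base R sketch which counts how often each noun lemma occurs. It does not use the real annotation: toy_annotated is a hand-made stand-in with hypothetical package names and rows, and only the column names (doc_id, lemma, upos) follow the udpipe output.

```r
# Hand-made stand-in for a few rows of an annotated data frame.
# Column names follow the udpipe output; the content is hypothetical.
toy_annotated <- data.frame(
  doc_id = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgC", "pkgC"),
  lemma  = c("fit", "model", "data", "model", "estimate", "data", "model"),
  upos   = c("VERB", "NOUN", "NOUN", "NOUN", "VERB", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep only the nouns and count how often each lemma occurs
nouns     <- subset(toy_annotated, upos == "NOUN")
noun_freq <- sort(table(nouns$lemma), decreasing = TRUE)
noun_freq
```

On the real data you would replace toy_annotated by crandb_annotated; the same two lines then give the most frequent nouns across all package titles and descriptions.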

Once we have that data annotated, making a web application which allows you to visualise, structure and display the content of the CRAN packages is pretty easy with tools like flexdashboard. That's exactly what the web application available at http://datatailor.be:9999/app/cran_search does. The application allows you to:

[Figure: CRAN search application, cluster view]

  • List all packages which are part of a CRAN Task View
  • Search for CRAN packages based on what the author has written in the package title and description
  • Visualise the nouns and verbs in the titles and descriptions of the packages found by your search, using
    • Word co-occurrence graphs indicating how many times each lemma occurs in the same package as another lemma
    • Word correlation graphs showing the positive correlations between the top n most frequently occurring lemmas in the packages
    • Word clouds indicating the frequency of nouns/verbs or consecutive nouns/verbs (bigrams) in the package descriptions
    • A topic model (latent Dirichlet allocation) to cluster packages and visualise them
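The co-occurrence and correlation numbers behind such graphs can be sketched in a few lines of base R. This is not the code of the application itself, just an illustration under assumptions: toy_annotated is a hand-made data frame with hypothetical packages, and co-occurrence is taken to mean "both lemmas appear in the same package".

```r
# Hand-made stand-in for annotated nouns; packages and rows are hypothetical
toy_annotated <- data.frame(
  doc_id = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB",
             "pkgC", "pkgC", "pkgD", "pkgD"),
  lemma  = c("model", "regression", "data", "model", "regression",
             "data", "text", "text", "model"),
  upos   = "NOUN",
  stringsAsFactors = FALSE
)

# Package x lemma incidence matrix: 1 if the lemma occurs in the package
dtm <- unclass(table(toy_annotated$doc_id, toy_annotated$lemma))
dtm <- (dtm > 0) * 1

# Co-occurrence: number of packages in which two lemmas occur together
cooc <- crossprod(dtm)
diag(cooc) <- 0

# Correlation between lemmas across packages
lemma_cor <- cor(dtm)

cooc
round(lemma_cor, 2)
```

The co-occurrence matrix can be fed to a graph plotting package as an edge list, while the correlation matrix shows which lemmas tend to appear in the same packages (positive values) or rarely do (negative values).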

The web application (flexdashboard) was launched on a small shinyproxy server and is available here: http://datatailor.be:9999/app/cran_search. Can you find topics which are not yet covered by the CRAN task views? Can you find the content of the Rcpp universe or the sp package universe?

If you are interested in these techniques, you can always subscribe to our Text Mining with R course; the upcoming dates are listed below.

Text Mining with R - upcoming courses in Belgium

We use text mining a lot in day-to-day data mining operations. In order to share our knowledge on this, to show that R is an extremely mature platform to do business-oriented text analytics and to give you practical experience with text mining, our course on Text Mining with R is scheduled for the 3rd consecutive year at LStat, the Leuven Statistics Research Center (Belgium) as well as at the Data Science Academy in Brussels. Courses are scheduled 2 times in November 2017 and also in March 2018.

This course is a hands-on course covering the use of text mining tools for the purpose of data analysis. It covers basic text handling, natural language engineering and statistical modelling on top of textual data. The following items are covered.

  • Text encodings
  • Cleaning of text data, regular expressions
  • String distances
  • Graphical displays of text data
  • Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation
  • Sentiment analysis
  • Statistical topic detection modelling and visualization (latent Dirichlet allocation)
  • Visualisation of correlations & topics
  • Word embeddings
  • Document similarities & Text alignment

Feel free to register at the following dates:

  • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing?

If you work on natural language processing in a day-to-day setting which involves statistical engineering, at some point you need to process your text with a number of text mining procedures. The following steps are ones you must do before you can get useful information out of your text:

  • Tokenisation (splitting your full text into words/terms)
  • Parts of Speech (POS) tagging (assigning each word a syntactic tag: is the word a verb/noun/adverb/number/...)
  • Lemmatisation (replacing each word by its lemma, e.g. "are" is replaced by the verb "be"; more information: https://en.wikipedia.org/wiki/Lemmatisation)
  • Dependency Parsing (finding relationships between "head" words and the words which modify those heads, allowing you to look at words which may be far away from each other in the raw text but influence each other)
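To make the first and third step a bit more concrete, here is a deliberately naive base R sketch of tokenisation and lemmatisation on a Dutch sentence. It is only an illustration: splitting on a regular expression and looking up lemmas in a tiny hand-written table (lemma_lookup, our own hypothetical helper) is not how a trained model works, which learns these steps from annotated text.

```r
text <- "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur."

# Naive tokenisation: runs of letters, or single punctuation characters
tokens <- regmatches(text, gregexpr("[[:alpha:]]+|[[:punct:]]", text))[[1]]

# Naive lemmatisation: a tiny hand-written lookup table (hypothetical helper)
lemma_lookup <- c(ging = "ga", nam = "neem")
key    <- tolower(tokens)
lemmas <- unname(ifelse(key %in% names(lemma_lookup), lemma_lookup[key], tokens))

data.frame(token = tokens, lemma = lemmas, stringsAsFactors = FALSE)
```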


If you do this in R, there aren't many tools available. In fact there are none which

  1. do this for multiple languages
  2. do not depend on external software (java/python)
  3. also allow you to train your own parsing & tagging models.

The exception is the R package udpipe (https://github.com/bnosac/udpipe, https://CRAN.R-project.org/package=udpipe), which satisfies all 3 criteria.

If you are interested in doing the annotation, pre-trained models are available for 50 languages (see ?udpipe_download_model for details). Let's show how this works on some Dutch text and what you get out of this.


library(udpipe)
dl <- udpipe_download_model(language = "dutch")
dl

language                                                                      file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

udmodel_dutch <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_dutch,
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
x

The result is a dataset in which the text has been split into paragraphs, sentences and words. Words are replaced by their lemma (ging > ga, nam > neem), and you get the universal parts of speech tags, detailed parts of speech tags and morphological features of each word. With the head_token_id you can see which words influence other words in the text, as well as the dependency relationship between these words.
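As a small illustration of how head_token_id can be used, the sketch below looks up the head word of each token. The data frame is hand-written for the first four words of the sentence and is an assumption on our side, not actual udpipe output; only the column names (token_id, token, head_token_id, dep_rel) follow the udpipe data frame.

```r
# Hand-written illustration for "Ik ging op reis"; values are hypothetical,
# not copied from real udpipe output
ann <- data.frame(
  token_id      = c("1", "2", "3", "4"),
  token         = c("Ik", "ging", "op", "reis"),
  head_token_id = c("2", "0", "4", "2"),
  dep_rel       = c("nsubj", "root", "case", "obl"),
  stringsAsFactors = FALSE
)

# Look up the token which each word depends on (the root has head 0, so NA)
ann$head_token <- ann$token[match(ann$head_token_id, ann$token_id)]
ann[, c("token", "dep_rel", "head_token")]
```

Token ids restart in every sentence, so on real data this lookup has to be done within each combination of doc_id and sentence_id.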

[Figure: udpipe example output]

Going from that dataset to meaningful visualisations like this one is then just a matter of a few lines of code. The following visualisation shows the co-occurrence of nouns in customer feedback on Airbnb apartment stays in Brussels (open data available at http://insideairbnb.com/get-the-data.html).

[Figure: co-occurrence graph of nouns in Airbnb customer feedback]

In a next post, we'll show how to train your own tagging models.

If you like this type of analysis or if you are interested in text mining with R, we have 3 upcoming courses planned on text mining. Feel free to register at the following links.

  • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

For business questions on text mining, feel free to contact BNOSAC by sending us a mail here.

Machine Learning with R - upcoming course in Belgium

R users interested in Machine Learning can attend our upcoming course on Machine Learning with R, which is scheduled on 18-19 October 2017 in Leuven, Belgium. This is the 4th year the course is given at the University of Leuven, and we have made quite a few updates since it was first given.

During the course you'll learn the following techniques from a methodological as well as a practical perspective: naive Bayes, trees, feed-forward neural networks, penalised regression, bagging, random forests, boosting and, if time permits, graphical lasso, penalised generalised additive models and support vector machines.

Subscribe here: https://lstat.kuleuven.be/training/coursedescriptions/statistical-machine-learning-with-r

For a full list of training courses provided by BNOSAC - either in-house or in-public: go to http://www.bnosac.be/training

R users interested in text mining with R, applied spatial modelling with R, advanced R programming or computer vision can also subscribe to the following courses, scheduled at the University of Leuven.

  • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here