Text Mining with R - upcoming courses in Belgium

We use text mining a lot in our day-to-day data mining operations. To share our knowledge, to show that R is a mature platform for business-oriented text analytics and to give you practical experience with text mining, our course on Text Mining with R is scheduled for the 3rd consecutive year at LStat, the Leuven Statistics Research Center (Belgium), as well as at the Data Science Academy in Brussels. The course runs twice in November 2017 and again in March 2018.

This is a hands-on course covering the use of text mining tools for the purpose of data analysis. It covers basic text handling, natural language engineering and statistical modelling on top of textual data. The following topics are covered:

  • Text encodings
  • Cleaning of text data, regular expressions
  • String distances
  • Graphical displays of text data
  • Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation
  • Sentiment analysis
  • Statistical topic modelling and visualisation (latent Dirichlet allocation)
  • Visualisation of correlations & topics
  • Word embeddings
  • Document similarities & Text alignment

Feel free to register for one of the following course dates:

  • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + register by email
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing?

If you work on natural language processing in a day-to-day setting that involves statistical engineering, at some point you will need to process your text with a number of text mining procedures. The following steps are ones you must take before you can get useful information out of your text:

  • Tokenisation (splitting your full text into words/terms)
  • Parts of Speech (POS) tagging (assigning each word a syntactical tag, e.g. whether the word is a verb/noun/adverb/number/...)
  • Lemmatisation (replacing each term by its lemma, e.g. the term "are" is replaced by the verb "be"; more information: https://en.wikipedia.org/wiki/Lemmatisation)
  • Dependency Parsing (finding relationships between words, namely between "head" words and the words which modify those heads, allowing you to look at words which may be far away from each other in the raw text but still influence each other)


If you do this in R, there are not many tools available for it. In fact there are none which

  1. do this for multiple languages
  2. do not depend on external software (Java/Python)
  3. also allow you to train your own parsing & tagging models.

Except for the R package udpipe (https://github.com/bnosac/udpipe, https://CRAN.R-project.org/package=udpipe), which satisfies all three criteria.

If you are interested in doing the annotation, pre-trained models are available for 50 languages (see ?udpipe_download_model for details). Let's show how this works on some Dutch text and what you get out of it.


library(udpipe)
## download the pre-trained Dutch model (this only needs to be done once)
dl <- udpipe_download_model(language = "dutch")
dl

language                                                                      file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

## load the model from the downloaded file and annotate some Dutch text
udmodel_dutch <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_dutch,
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
x

The result is a data.frame in which the text has been split into paragraphs, sentences and words. Each word comes with its lemma (ging > ga, nam > neem), its universal and detailed parts of speech tags and its morphological features, and the head_token_id shows which words influence other words in the text, together with the dependency relationship between those words.

(Figure: the annotated data.frame returned by udpipe)
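
If you want to inspect that result programmatically, the sketch below (assuming the annotated data.frame x from the code above) picks out the most relevant columns and tabulates the universal POS tags:

## select the columns of interest from the annotated data.frame x
x[, c("token_id", "token", "lemma", "upos", "head_token_id", "dep_rel")]
## frequency of the universal parts of speech tags
table(x$upos)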

Going from that dataset to a meaningful visualisation such as the one below is then just a matter of a few lines of code (a sketch is given after the figure). The following visualisation shows the co-occurrence of nouns in customer feedback on Airbnb apartment stays in Brussels (open data available at http://insideairbnb.com/get-the-data.html).

(Figure: co-occurrence network of nouns in Airbnb customer feedback on Brussels apartment stays)
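
A minimal sketch of how such a plot can be built with the cooccurrence function from udpipe and the igraph/ggraph packages is shown below. It assumes a data.frame reviews_anno holding the udpipe annotations of the Airbnb reviews; that object name is illustrative only.

library(udpipe)
library(igraph)
library(ggraph)
library(ggplot2)
## keep only the nouns and count how often two noun lemmas occur within the same sentence
## (reviews_anno is an assumed data.frame with the standard udpipe annotation columns)
nouns <- subset(reviews_anno, upos == "NOUN")
cooc  <- cooccurrence(nouns, term = "lemma", group = c("doc_id", "sentence_id"))
## plot the strongest co-occurrences as a network
wordnetwork <- graph_from_data_frame(head(cooc, 50))
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(edge_alpha = cooc), colour = "steelblue") +
  geom_node_text(aes(label = name), size = 4) +
  theme_void() +
  labs(title = "Co-occurrence of nouns in Airbnb reviews")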

In a follow-up post, we'll show how to train your own tagging models.

If you like this type of analysis or if you are interested in text mining with R, we have 3 upcoming courses planned on text mining. Feel free to register at the following links.

    • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
    • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
    • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + register by email
    • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
    • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
    • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
    • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

For business questions on text mining, feel free to contact BNOSAC by sending us a mail here.

Machine Learning with R - upcoming course in Belgium

For R users interested in machine learning, you can attend our upcoming course on Machine Learning with R, which is scheduled on 18-19 October 2017 in Leuven, Belgium. This is the 4th year the course is given at the University of Leuven, so we have made quite some updates since it was first given 4 years ago.

During the course you'll learn the following techniques from a methodological as well as a practical perspective: naive Bayes, trees, feed-forward neural networks, penalised regression, bagging, random forests, boosting and, if time permits, the graphical lasso, penalised generalised additive models and support vector machines.

Subscribe here: https://lstat.kuleuven.be/training/coursedescriptions/statistical-machine-learning-with-r

For a full list of training courses provided by BNOSAC, either in-house or public, go to http://www.bnosac.be/training

For R users interested in text mining with R, applied spatial modelling with R, advanced R programming or computer vision, you can also subscribe for the following courses, scheduled at the University of Leuven.

  • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here


Computer Vision Algorithms for R users

Just before the summer holidays, BNOSAC presented a talk called Computer Vision and Image Recognition algorithms for R users at the useR! conference. In the talk, 6 packages on computer vision with R were introduced to an audience of about 250 people. The R packages we covered, all developed by BNOSAC, are:

  • image.CornerDetectionF9:  FAST-9 corner detection
  • image.CannyEdges: Canny Edge Detector
  • image.LineSegmentDetector: Line Segment Detector (LSD)
  • image.ContourDetector:  Unsupervised Smooth Contour Line Detection
  • image.dlib: Speeded-Up Robust Features (SURF) and histogram of oriented gradients (FHOG) features
  • image.darknet: Image classification using darknet with the deep learning models AlexNet, Darknet, VGG-16, GoogleNet and Darknet19, as well as object detection using the state-of-the-art YOLO detection system

For those of you who missed it, you can still watch the video of the presentation & view the pdf of the slides below. The packages are open-sourced and made available at https://github.com/bnosac/image.

If you have a computer vision endeavour in mind, feel free to get in touch for a quick chat. For those of you interested in training on how to do image analysis, you can always register for our course on Computer Vision with R and Python here. More details on the full training program and training dates provided by BNOSAC: visit http://bnosac.be/index.php/training


Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Parts of Speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a category like noun, verb, adverb, pronoun, ... . At BNOSAC, we use it on a daily basis, for example to select only the nouns before we do topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and each option has its upsides and downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger), which covers more languages but is only usable for non-commercial purposes (it can be used through the koRpus R package).
  • OpenNLP, which is faster and allows you to do POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but not for French or Eastern European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp), which allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian but needs Python installed, which is not always easy to get approved by IT departments.
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet), which have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.

Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:


  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    • UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models.
    • MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    • POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging, as the Single Classification Ripple Down Rules are easy to execute and hence quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform: runs on Windows/Linux/Mac
  6. It allows you to do morphological tagging, POS tagging and universal POS tagging of sentences

The Ripple Down Rules are basic binary classification trees built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper 'A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging', available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging to your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>

The function rdr_pos requires as input a vector of sentences. If you need to transform your text data into sentences, just use tokenize_sentences from the tokenizers package.
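
As a small illustration of that last step (a minimal sketch; the example text is made up), you can split raw text into sentences and feed those to rdr_pos:

library(tokenizers)
## tokenize_sentences returns a list with one element per input document
sentences <- unlist(tokenize_sentences("Dit is de eerste zin. En dit is de tweede zin."))
## pass the sentences on to the tagger, e.g. the UD_Dutch tagger built above
rdr_pos(tagger, x = sentences)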

Good luck with text mining!
If you need our help for a text mining project, let us know; we'll be glad to get you started.