Starspace for NLP #nlproc

Our latest addition to the R ecosystem for NLP is the R package ruimtehol, which is open source at https://github.com/bnosac/ruimtehol. The package is a wrapper around Starspace, which provides a neural embedding model for the following text tasks:

  • Text classification
  • Learning word, sentence or document level embeddings
  • Finding sentence or document similarity
  • Ranking web documents
  • Content-based recommendation (e.g. recommend text/music based on the content)
  • Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
  • Identification of entity relationships

If you are an R user interested in NLP techniques, feel free to test the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but it can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification of questions asked in the Belgian parliament. Each question was labelled with several of the 1785 categories.

library(ruimtehol)
data(dekamer, package = "ruimtehol")

## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
## Plain text of the question in parliament
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- sapply(dekamer$text, FUN=function(x) paste(x, collapse = " "))
dekamer$text <- tolower(dekamer$text)
## Build starspace model
model <- embed_tagspace(x = dekamer$text,
                        y = dekamer$question_themes,
                        dim = 50,
                        ngram = 3, loss = "hinge", similarity = "cosine", adagrad = TRUE,
                        early_stopping = 0.8, minCount = 2,
                        thread = 4)
## Get embeddings of the dictionary of words as well as the categories
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "label")
## Find closest labels / predict
embedding_combination <- starspace_embedding(model, "federale politie patrouille", type = "document")
embedding_similarity(embedding_combination,
                     embedding_labels,
                     top_n = 3)

                      term1                      term2 similarity rank
federale politie patrouille           __label__POLITIE  0.8480641    1
federale politie patrouille          __label__OPENBARE  0.6919607    2
federale politie patrouille __label__BEROEPSMOBILITEIT  0.6907637    3

## Predict the most likely categories for a new piece of text
predict(model, "de migranten komen naar europa, in asielcentra ...")
$input
"de migranten komen naar europa, in asielcentra ..."
$prediction
                label               label_starspace similarity
 VLUCHTELINGENCENTRUM __label__VLUCHTELINGENCENTRUM  0.7075160
          VLUCHTELING          __label__VLUCHTELING  0.6253517
             ILLEGALE             __label__ILLEGALE  0.5997692
       MIGRATIEBELEID       __label__MIGRATIEBELEID  0.5939595
           UITWIJZING           __label__UITWIJZING  0.5376520
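
To make the similarity column in the output above more concrete, here is a minimal base R sketch of ranking labels by cosine similarity to a document embedding. It is not the code that ruimtehol runs internally: the helper names cosine_to_rows and rank_labels are hypothetical, and the 3-dimensional embeddings are invented toy values, whereas the model above uses dim = 50.

```r
## Toy 3-dimensional embeddings; the values are invented for illustration only
doc <- matrix(c(1, 0, 1), nrow = 1,
              dimnames = list("toy document", NULL))
labels <- rbind("__label__A" = c(1, 0, 1),
                "__label__B" = c(0, 1, 0),
                "__label__C" = c(1, 1, 0))

## Cosine similarity between one vector and every row of a matrix
## (hypothetical helper, not part of ruimtehol)
cosine_to_rows <- function(x, m) {
  x <- as.numeric(x)
  as.numeric(m %*% x) / (sqrt(rowSums(m^2)) * sqrt(sum(x^2)))
}

## Rank the labels by similarity and keep the top_n closest ones
## (hypothetical helper, not part of ruimtehol)
rank_labels <- function(x, m, top_n = 2) {
  sims <- cosine_to_rows(x, m)
  ord  <- head(order(sims, decreasing = TRUE), top_n)
  data.frame(term1 = rownames(x),
             term2 = rownames(m)[ord],
             similarity = sims[ord],
             rank = seq_along(ord),
             stringsAsFactors = FALSE)
}

closest <- rank_labels(doc, labels, top_n = 2)
print(closest)
```

In this toy setup the label whose vector points in the same direction as the document ranks first. With a trained model, embedding_similarity plays this role on the real embeddings.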

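The text preparation at the start of the example above can be tried on a single record to see what it does. The sketch below uses a hypothetical one-row data frame named toy with made-up content. Only the column names question and question_theme and the three preprocessing calls come from the example.

```r
## Hypothetical toy record mimicking the two columns used from dekamer
toy <- data.frame(question = "Federale politie: patrouille!",
                  question_theme = "POLITIE | VEILIGHEID",
                  stringsAsFactors = FALSE)

## Split the themes on the " | " separator: one character vector per question
themes <- strsplit(toy$question_theme, " +\\| +")

## Split on non-word characters, glue the pieces back with spaces, lowercase
tokens <- strsplit(toy$question, "\\W")
text <- sapply(tokens, FUN = function(x) paste(x, collapse = " "))
text <- tolower(text)

print(themes)
print(text)
```

Because "\\W" matches one character at a time, punctuation followed by a space leaves an empty token and therefore a double space in the result. Splitting on "\\W+" instead would collapse such runs into a single separator.
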
The list of R packages for text mining maintained by BNOSAC has been growing steadily. It currently contains the following packages.

  • udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
  • crfsuite: named entity recognition, text classification, chunking, sequence modelling
  • textrank: text summarisation
  • ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content-based recommendation, collaborative filtering-based recommendation

More details on ruimtehol are available at the development repository https://github.com/bnosac/ruimtehol, where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might also be interested in the following data science courses, which are held in the coming months.

  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here