CRAN Task View: Natural Language Processing
Contact: f.wild at open.ac.uk
This CRAN task view contains a list of packages useful for natural language processing.
Side-note on text mining: In recent years, we have elaborated a framework to be used in
packages dealing with the processing of written material: the package tm.
Extension packages in this area are highly recommended to interface with tm's basic routines,
and developers are cordially invited to join in the discussion on further developments of this
framework package.
wordnet provides an R interface to WordNet, a large
lexical database of English.
Keyword Extraction and General String Manipulation:
R's base package already provides a rich set of character manipulation
routines. See
help.search(keyword = "character", package = "base")
for more information on these capabilities.
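A few of these base routines in action (no packages needed; the sample string is invented for illustration):

```r
# A sample of base R's character manipulation routines.
x <- "Natural Language Processing"

toupper(x)                    # case conversion: "NATURAL LANGUAGE PROCESSING"
nchar(x)                      # number of characters: 27
strsplit(x, " ")[[1]]         # tokenise on whitespace
substr(x, 1, 7)               # substring extraction: "Natural"
gsub("a", "_", x)             # pattern substitution
grepl("Lang", x)              # pattern matching: TRUE
sprintf("%s (%d chars)", x, nchar(x))  # string formatting
```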
RKEA provides an R interface to KEA (Version 5.0). KEA (for
Keyphrase Extraction Algorithm) allows for extracting keyphrases from
text documents. It can be used either for free indexing or for indexing
with a controlled vocabulary.
gsubfn can be used for certain parsing tasks such as
extracting words from strings by content rather than by delimiters.
A demo in the package shows an example of this in a natural language
processing context.
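The idea of content-based extraction can be illustrated with base R alone; the sketch below pulls out only the capitalised words from an invented sentence (gsubfn's strapply offers a comparable pattern-first interface on top of this):

```r
# Extract words by what they look like (capitalised), not by their
# position between delimiters, using base R's gregexpr()/regmatches().
s <- "The CRAN Task View lists packages for Natural Language Processing"
m <- gregexpr("\\b[A-Z][a-z]+\\b", s, perl = TRUE)
words <- regmatches(s, m)[[1]]
words   # "The" "Task" "View" "Natural" "Language" "Processing"
        # ("CRAN" is all caps, so the pattern deliberately skips it)
```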
tau contains basic string manipulation and analysis routines needed in text processing, such as dealing with character encoding, language, pattern counting, and tokenization.
Natural Language Processing:
openNLP provides an R interface to OpenNLP, a
collection of natural language processing tools including a
sentence detector, tokenizer, pos-tagger, shallow and full
syntactic parser, and named-entity detector, using the Maxent
Java package for training and using maximum entropy models.
A companion package ships trained models for English and
for Spanish to be used with openNLP.
RWeka is an interface to Weka,
a collection of machine learning algorithms for data
mining tasks written in Java. Especially useful in the context
of natural language processing is its functionality for
tokenization and stemming.
Snowball provides the Snowball stemmers, which contain the Porter
stemmer and several other stemmers for different languages. See the
Snowball webpage for details.
Rstem (available from Omegahat) is an alternative
interface to a C version of Porter's word stemming algorithm.
KoNLP provides a collection of conversion routines (e.g. Hangul to Jamo),
stemming, and part-of-speech tagging for Korean through an interface to Lucene's HanNanum analyzer.
As of version 0.0-8.0, the documentation is sparse and still needs some work.
kernlab allows one to create and compute with string kernels, such as full string,
spectrum, or bounded-range string kernels. It can directly use
the document format used by tm as input.
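To illustrate the family, a spectrum kernel of order k simply counts shared k-character substrings with multiplicity; the base-R sketch below is a toy version of the idea (kernlab's implementations are, of course, far more efficient and general):

```r
# Spectrum string kernel of order k: k(x, y) = number of matching
# k-length substring pairs between x and y, counted with multiplicity.
spectrum_kernel <- function(x, y, k = 3) {
  kmers <- function(s) {
    n <- nchar(s)
    if (n < k) return(character(0))
    substring(s, 1:(n - k + 1), k:n)   # all k-length substrings
  }
  tx <- table(kmers(x))
  ty <- table(kmers(y))
  shared <- intersect(names(tx), names(ty))
  sum(as.numeric(tx[shared]) * as.numeric(ty[shared]))
}

spectrum_kernel("statistics", "statistical", k = 3)  # 7 shared trigrams
```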
tm provides a comprehensive text mining framework for R.
The Journal of Statistical Software article "Text Mining
Infrastructure in R"
gives a detailed overview and presents
techniques for count-based analysis methods, text clustering,
text classification, and string kernels.
lsa provides routines for performing a latent semantic analysis with R.
The basic idea of latent semantic analysis (LSA) is
that texts have a higher-order (= latent semantic) structure which,
however, is obscured by word usage (e.g., through the use of synonyms
or polysemy). By using conceptual indices that are derived statistically
via a truncated singular value decomposition (a two-mode factor analysis)
over a given document-term matrix, this variability problem can be overcome.
The article "Investigating Unstructured Texts with Latent Semantic Analysis"
gives a detailed overview and demonstrates the use of the package
with examples from the area of technology-enhanced learning.
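The mechanics of the truncated SVD step can be sketched in base R on a toy term-document matrix (the terms, documents, and rank k below are invented for illustration; the package adds weighting schemes and higher-level functions on top of this):

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 1, 0, 0,   # "car"
                0, 1, 1, 0,   # "automobile"
                0, 0, 1, 1,   # "flower"
                0, 0, 0, 1),  # "petal"
              nrow = 4, byrow = TRUE,
              dimnames = list(c("car", "automobile", "flower", "petal"),
                              paste0("doc", 1:4)))

s <- svd(tdm)                 # two-mode factor analysis via SVD
k <- 2                        # keep only k latent dimensions
doc_space <- diag(s$d[1:k]) %*% t(s$v[, 1:k])  # documents in latent space
round(doc_space, 2)           # a k x 4 matrix; compare documents here
```

Documents are then compared (e.g. by cosine similarity) in this k-dimensional space rather than over raw term counts, which is what smooths over synonymy.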
topicmodels provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors, and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
koRpus is a diverse collection of functions for automatic language detection,
hyphenation, several indices of lexical diversity (e.g., type-token ratio, HD-D/vocd-D, MTLD) and
readability (e.g., Flesch, SMOG, LIX, Dale-Chall). See the package documentation
for more information.
RTextTools is a machine learning package for automatic
text classification. It implements nine different algorithms (svm, slda,
boosting, bagging, rf, glmnet, tree, nnet, and maxent) and routines supporting
the evaluation of accuracy.
textir is a suite of tools for text and sentiment mining.
textcat provides support for n-gram based text categorization.
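The underlying idea (in the spirit of Cavnar and Trenkle's out-of-place measure) can be sketched in base R: rank a text's character n-grams by frequency, then score a sample by how far each of its n-grams is displaced in a reference profile. The toy texts and parameters below are invented for illustration; the package itself ships proper language profiles:

```r
# Build a frequency-ranked character n-gram profile of a text.
ngram_profile <- function(text, n = 2, top = 50) {
  text <- tolower(text)
  grams <- substring(text, 1:(nchar(text) - n + 1), n:nchar(text))
  head(names(sort(table(grams), decreasing = TRUE)), top)
}

# Out-of-place distance: sum of rank displacements; n-grams missing
# from the reference profile receive a maximal penalty.
out_of_place <- function(sample, reference) {
  pos <- match(sample, reference)
  pos[is.na(pos)] <- length(reference) + 1
  sum(abs(pos - seq_along(sample)))
}

en <- ngram_profile("the quick brown fox jumps over the lazy dog and the cat")
de <- ngram_profile("der schnelle braune fuchs springt ueber den faulen hund")
unknown <- ngram_profile("the dog and the fox")

c(en = out_of_place(unknown, en), de = out_of_place(unknown, de))
# smaller score = closer profile, so the sample is classified as English
```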
corpora offers utility functions for the statistical analysis of corpus frequency data.
languageR provides data sets and functions exemplifying statistical methods, and some
facilitatory utility functions used in the book by R. H. Baayen, "Analyzing Linguistic Data: A Practical
Introduction to Statistics Using R" (Cambridge University Press, 2008).
zipfR offers some statistical models for word frequency distributions. The
utilities include functions for loading, manipulating and visualizing word frequency data and
vocabulary growth curves. The package also implements several statistical models for the
distribution of word frequencies in a population. (The name of this package derives from the
most famous word frequency distribution, Zipf's law.)
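Zipf's law itself is easy to demonstrate in base R: a word's frequency is roughly proportional to 1/rank, so log-frequency against log-rank falls on a line of slope about -1. The toy counts below are constructed to follow the law exactly:

```r
# Constructed rank-frequency data following Zipf's law exactly: f(r) = 120 / r.
freqs <- c(the = 120, of = 60, and = 40, to = 30, a = 24, it = 20)
ranks <- seq_along(freqs)

fit <- lm(log(freqs) ~ log(ranks))  # linear fit on log-log scale
coef(fit)[["log(ranks)"]]           # slope: -1 for perfectly Zipfian counts
```

Real corpus data scatter around this line; zipfR's models (e.g. for vocabulary growth) go well beyond such a simple regression.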
maxent is an implementation of maximum entropy minimising memory consumption on very large data sets.
Another package predicts valued outputs based on an input matrix and assesses predictive power ('the bag-of-words oracle').
wordcloud provides a visualisation similar to the famous Wordle ones: it distributes features horizontally and vertically in a pleasing layout, with font size scaled by frequency.
Import filters and Data Handling:
tm.plugin.dc allows for distributing corpora across storage devices (local files or Hadoop Distributed File System).
tm.plugin.mail helps with importing mail messages from archive files such as those used in Thunderbird (mbox, eml).