As a way of learning new ideas and skills, I resolved to take on little experiments. An interesting one was generating a word cloud in R with RStudio.
To get started generating a word cloud, we need to install the following packages:
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
Load the installed packages with:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
I downloaded Theodore Roosevelt’s “Man in the Arena” speech and saved it in a text file. To access the file in R, locate it and read its contents.
# read the file
filePath <- "/home/wordcloud/man_in_arena.txt"
text <- readLines(filePath)
# load the data as a corpus
docs <- Corpus(VectorSource(text))
# inspect the contents of the document
inspect(docs)
Next, clean up the contents of the document to reduce noise in the text, such as special characters and extra white space.
# replace "/", "@" and "|" with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, toSpace, "/")
Stemming is the process of reducing inflected words to their word stem. For example, “fishing” has “fish” as its root word. Before stemming, you can also normalize the text and remove any words you’d like to exclude:
# convert to lowercase
docs <- tm_map(docs, content_transformer(tolower))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove your own undesired words (lowercase, since tolower has already run)
docs <- tm_map(docs, removeWords, c("theodore", "we", "shall"))
# remove punctuation
docs <- tm_map(docs, removePunctuation)
# eliminate extra white space
docs <- tm_map(docs, stripWhitespace)
# text stemming
docs <- tm_map(docs, stemDocument)
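To see what the stemming step does on its own, SnowballC’s wordStem() can be called directly on a few words (the example words here are just illustrations, not taken from the speech):

```r
library(SnowballC)

# wordStem() reduces each word to its stem;
# e.g. "fishing" and "fished" both reduce to "fish"
wordStem(c("fishing", "fished", "strives"))
```

Note that stems are not always dictionary words; the Porter algorithm simply strips common suffixes.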
Build a term-document matrix
This indexes each word in the text file along with the frequency at which it appears in the document.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# sample output
#    word freq
#   great    3
#  actual    2
#    deed    2
#    fail    2
#  strive    2
# achieve    1
#   arena    1
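The counting step above is plain base R: rowSums() totals each term’s occurrences across documents, and sort() ranks them. On a small hand-made matrix (toy data, not from the speech) it works like this:

```r
# toy term-document matrix: rows are terms, columns are documents
m <- matrix(c(2, 1,
              0, 3,
              1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("great", "strive", "arena"),
                            c("doc1", "doc2")))
# total frequency of each term, highest first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
d
```

The resulting data frame has one row per term, ordered by frequency, which is exactly the shape wordcloud() expects.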
Finally, generate the word cloud with:
set.seed(1200)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(9, "RdPu"))