The best way I have found to get all of these packages installed on your machine is to install a Python distribution such as Anaconda (https://www.continuum.io/downloads).
This tutorial gives an example of working with text data. The scenario: you have a large dataset of documents, and each document can be classified or categorised based on its topic. One solution is to manually read and categorise each document one by one; alternatively, we can train an algorithm to classify them. The first approach is labour intensive and costly, the second is not.
In this tutorial we will make use of the Twenty Newsgroups dataset, a collection of approximately 20,000 documents partitioned into 20 different newsgroups (categories). You can read more about this dataset at http://qwone.com/~jason/20Newsgroups/. Conveniently, it is already available in the sklearn package.
By the end of this tutorial we will have trained a classifier that can take free-form text input and predict which newsgroup that input belongs to. For simplicity we will only look at a subset of the 20 possible newsgroups when classifying and training our model.
categories = ['comp.sys.mac.hardware', 'rec.autos', 'sci.electronics', 'alt.atheism']
from sklearn.datasets import fetch_20newsgroups
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=20170212)
print(list(data_train.keys()))
# ['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']
# target holds integer labels; target_names maps them back to newsgroup names
for cat in data_train.target[:5]:
    print(data_train.target_names[cat])
# sci.electronics # rec.autos # sci.electronics # sci.electronics # comp.sys.mac.hardware
In order to run our data through a machine learning algorithm we first need to turn the text into numerical feature vectors, from which we can then build a feature matrix.
A naive approach: the bag-of-words technique.
Each column of the feature matrix X will correspond to a unique word found in the corpus. There are 2243 documents, which means a potentially very large number of unique words. However, most words will not occur in most documents (in particular, words common in one category may be rare or absent in documents from other categories), so this matrix will be quite sparse (mostly 0s).
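To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library (the toy sentences are made up for illustration; sklearn's CountVectorizer, used below, does the same thing far more efficiently):

```python
from collections import Counter

# A toy corpus of three tiny "documents"
docs = [
    "the car has a big engine",
    "the mac has a fast cpu",
    "the engine is a big engine",
]

# Vocabulary: every unique word across the corpus, one column per word
vocab = sorted({word for doc in docs for word in doc.split()})

# Feature matrix: one row per document, one count per vocabulary word
X = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

for doc, row in zip(docs, X):
    print(row, "<-", doc)
```

Even on this tiny corpus most entries in each row are zero, which is exactly the sparsity described above.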
Assuming a 100000 by 2243 matrix where each cell is a 32-bit float (4 bytes), that is approximately 900,000,000 bytes, or roughly 1 GB of RAM. That is manageable on most computers these days, but imagine what happens when you scale this up; I assume most readers will not have access to a distributed computing environment. Fortunately, scipy provides sparse matrix data structures that store only the non-zero values, saving a great deal of memory.
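To illustrate the savings, here is a small sketch (assuming scipy and numpy are installed) that builds a mostly-zero dense matrix, converts it to scipy's CSR sparse format, and compares the memory footprints; the matrix shape and fill rate are arbitrary choices for the demonstration:

```python
import numpy as np
from scipy import sparse

# A dense float32 matrix where ~99% of entries are zero,
# mimicking a word-count feature matrix
rng = np.random.default_rng(0)
dense = np.zeros((2000, 5000), dtype=np.float32)
rows = rng.integers(0, 2000, size=100_000)
cols = rng.integers(0, 5000, size=100_000)
dense[rows, cols] = 1.0

# CSR stores only the non-zero values plus their row/column indices
csr = sparse.csr_matrix(dense)

print(dense.nbytes)  # 40000000 bytes: every cell stored, zero or not
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)
```

The sparse representation here needs well under a tenth of the dense memory, and the gap grows as the matrix gets sparser.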
Let's create the feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
W_counts = count_vect.fit_transform(data_train.data)
print(W_counts.shape)
# (2243, 30500)
Creating a feature matrix of raw word counts is not a bad first attempt. However, it can bias the classifier: longer documents tend to have higher occurrences of particular words simply because they are longer, which over-weights those words towards the document's category. One way to overcome this bias is to divide the number of occurrences of each word by the total number of words in the document, giving the word's frequency. Another way to fine-tune and improve the classification is to penalize words that occur in many documents and therefore carry less information than rarer words.
The combination of converting counts into frequencies and penalizing words that occur in many documents is called Term Frequency times Inverse Document Frequency (tf-idf). It can easily be computed in sklearn.
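Before handing this off to sklearn, it may help to see the computation spelled out by hand. The sketch below applies tf-idf to a made-up count matrix using numpy, following sklearn's TfidfTransformer defaults as I understand them (smoothed idf and l2 row normalization); the toy numbers are purely illustrative:

```python
import numpy as np

# Toy word-count matrix: 3 documents x 4 vocabulary terms
counts = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [3, 0, 0, 2],
], dtype=np.float64)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # document frequency: how many docs contain each term

# Smoothed inverse document frequency (sklearn's smooth_idf=True default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = counts * idf  # weight each count by how rare the term is
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row

print(np.round(tfidf, 3))
```

Note that the first term appears in all three documents, so its idf is exactly 1 and it gets no boost, while the rarer terms are weighted up relative to it. The row normalization is what removes the document-length bias discussed above.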
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
W_counts_tfidf = tfidf_transformer.fit_transform(W_counts)
print(W_counts_tfidf.shape)
# (2243, 30500)
Now you have extracted and normalized your feature vectors and feature matrix. Next we will learn how to train a classifier using the data we have just extracted. Read more here: Machine Learning Text Classification Using Naive Bayes and Support Vector Machines Part 2