# Requirements

• Python (>=2.6 or >=3.3)
• Numpy (>= 1.6.1)
• SciPy (>=0.9)

The easiest way I found to get all of these packages installed on your machine is to use a Python distribution such as Anaconda (https://www.continuum.io/downloads).

# Objective

This tutorial gives an example of working with text data. The scenario: you have a large dataset of documents, each of which can be classified or categorised based on its topic. One solution is to read each document and categorise it by hand, one by one; the other is to train an algorithm to classify the documents for us. The first is labour-intensive and costly, the other is not.

In this tutorial we will make use of the Twenty Newsgroups dataset, a collection of approximately 20,000 documents partitioned into 20 different newsgroups or categories. You can read more about this dataset at http://qwone.com/~jason/20Newsgroups/. Conveniently, the dataset is already available in the sklearn package.

By the end of this tutorial we will have trained a classifier that can take free-form text input and predict which newsgroup that input belongs to. For simplicity we will only use a subset of the 20 possible newsgroups to train our model and classify against.

# Steps

### Data Extraction

• Determine the categories of documents we will classify.
categories = ['comp.sys.mac.hardware', 'rec.autos', 'sci.electronics', 'alt.atheism']
• Next we will load the documents from the dataset that fall within those categories. We need a training set, and the model we will use assumes the samples are iid (independent and identically distributed), which is why we shuffle the data.
from sklearn.datasets import fetch_20newsgroups
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=20170212)
• The data_train object is a data structure that stores the data under keys. These can be accessed much like the entries of a Python dictionary.
data_train.keys()
# ['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']
• target_names = the list of categories that we specified earlier.
• data = a list of the raw text of the documents that were retrieved. There are 2243 documents.
• target = the integer label of each document's category in the training set. Each value is an index into the target_names list. To illustrate this, let's print the target of the first 5 documents and retrieve their corresponding target_names entries.
for cat in data_train.target[:5]:
    print(data_train.target_names[cat])
# sci.electronics
# rec.autos
# sci.electronics
# sci.electronics
# comp.sys.mac.hardware

### Extracting features

In order to run our data through a machine learning algorithm, we first need to turn the text into numerical feature vectors, which we can then stack into a feature matrix.

Naive approach: the bag-of-words technique.

1. Map each word to an integer (this can be done by using a dictionary or a map data structure).
2. For each document i, count each word w and store the count in a feature matrix X[i,j], where i is the document number and j is the index of the word being counted.
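The two steps above can be sketched by hand on a toy corpus (the documents and words below are illustrative, not from the newsgroups dataset):

```python
# A minimal bag-of-words sketch: map words to integer indices, then
# count occurrences of each word in each document.
corpus = [
    "the car needs new brakes",
    "the mac needs a new battery",
]

# Step 1: map each unique word to an integer index using a dictionary.
vocab = {}
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 2: count each word w in document i and store it in X[i][j],
# where j is the index assigned to w above.
X = [[0] * len(vocab) for _ in corpus]
for i, doc in enumerate(corpus):
    for word in doc.split():
        X[i][vocab[word]] += 1

print(X[0][vocab["car"]])  # 1: "car" occurs once in document 0
print(X[1][vocab["car"]])  # 0: "car" does not occur in document 1
```

CountVectorizer, used later in this tutorial, performs both steps for us (and handles tokenization far more carefully).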

Each column of the feature matrix X corresponds to a unique word found in the corpus. With 2243 documents, the number of unique words can be very large. However, most words occur in only a few documents, and many not at all in documents from other categories, so this matrix will be quite sparse (mostly 0s).

Assuming we have a 100,000 by 2243 matrix where each cell is a float32, that is approximately 900,000,000 bytes, or roughly 0.9 GB of RAM. That is manageable on most computers these days, but imagine what happens when you scale this up; I assume most readers will not have access to a distributed computing environment. Fortunately, scipy has a sparse matrix data structure that stores only the non-zero values in memory, saving a great deal of RAM.
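A quick sketch of the memory arithmetic above, comparing the dense float32 matrix with scipy's sparse CSR format (the matrix here is empty, purely to illustrate what CSR actually stores):

```python
# Dense storage: every cell costs 4 bytes, occupied or not.
import numpy as np
from scipy.sparse import csr_matrix

rows, cols = 2243, 100000
dense_bytes = rows * cols * 4  # float32 = 4 bytes per cell
print(dense_bytes)  # 897200000, roughly 0.9 GB

# Sparse CSR storage: only the non-zero values, their column indices,
# and one row-pointer array are kept in memory.
X = csr_matrix((rows, cols), dtype=np.float32)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(sparse_bytes)  # only a few kilobytes, since nothing is stored yet
```

For a real document-term matrix the sparse cost grows with the number of non-zero counts, not with the full rows × cols product.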

### Preprocessing, Tokenizing and Filtering

Let's create the feature vectors.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
W_counts = count_vect.fit_transform(data_train.data)
W_counts.shape
# (2243, 30500)
• The count_vect object has built a vocabulary that maps each word in the corpus to a feature (column) index.
• To see this mapping of words to indices, run the script below:
count_vect.vocabulary_
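Since vocabulary_ stores each word's column index, summing that column of the count matrix gives the word's total count over the corpus. A small self-contained sketch on a toy corpus (the documents here are illustrative, not the newsgroups data):

```python
# Use vocabulary_ to look up a word's column, then sum that column
# to get the word's total count across the corpus.
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the car the road", "the mac and the car"]
vect = CountVectorizer()
counts = vect.fit_transform(toy_corpus)

col = vect.vocabulary_["car"]    # column index assigned to "car"
total = counts[:, col].sum()     # total occurrences across all documents
print(total)  # 2
```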

### Bias and Normalization

Creating a feature matrix of raw word counts is not a bad first attempt. However, it can bias the classifier: longer documents will tend to have higher occurrences of particular words simply because they contain more words, which over-weights those words towards the document's category or classification. One way to remove this bias is to divide the number of occurrences of each word by the total number of words in that document, giving the word's frequency. Another way to fine-tune and improve the classification is to penalize words that occur in many documents and therefore carry less information than rarer words.
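The frequency-normalization idea can be seen in a hand-rolled sketch (the counts below are made up for illustration; the TfidfTransformer used in this tutorial does this, plus the inverse-document-frequency penalty, for you):

```python
# Dividing each row of raw counts by the row total converts counts into
# term frequencies, making long and short documents comparable.
import numpy as np

counts = np.array([
    [8, 2, 0],   # a longer document (10 words)
    [4, 1, 0],   # a shorter document (5 words)
], dtype=np.float64)

tf = counts / counts.sum(axis=1, keepdims=True)
print(tf[0, 0], tf[1, 0])  # 0.8 0.8 -- same frequency despite different lengths
```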

The combination of converting counts into frequencies and penalizing words that occur in many documents is called Term Frequency times Inverse Document Frequency (tf-idf). It can easily be computed in sklearn.

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
W_counts_tfidf = tfidf_transformer.fit_transform(W_counts)
W_counts_tfidf.shape
# (2243, 30500)

# Conclusion

Now you have extracted and normalized your feature vectors and feature matrix. Next, we will learn how to train a classifier using the data we have just extracted. Read more here: Machine Learning Text Classification Using Naive Bayes and Support Vector Machines Part 2
