Machine Learning Text Classification Using Naive Bayes and Support Vector Machines Part 2

Requirements

This tutorial continues from Machine Learning Text Classification Using Naive Bayes and Support Vector Machines Part 1.

Training a classifier

Now that we have the feature matrix from the training data, we can train a classifier to try to predict the category of new posts.

Naive Bayes Classifier

The Naive Bayes algorithm assumes that the value of a feature is independent of the value of any other feature. In this particular example one could argue that sentences or sequences of words are correlated with one another simply because of the way the English language is constructed, so this assumption will not strictly hold. However, for some probability models, Naive Bayes classifiers can be trained quite efficiently in a supervised learning setting, which is what we have here. There are a number of Naive Bayes models to choose from; since each word count in a document is treated as a random variable, we will use the Multinomial Naive Bayes model.

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(W_counts_tfidf, data_train.target)
  • Now we can try to predict the outcome of a new document.
  • We need to extract the features using the same techniques as above. Instead of fit_transform we use transform, as the transformers have already been fitted to the training set.
new_docs = ['Cars are used for travelling', 'Computers contain wires and chips']
W_new_counts = count_vect.transform(new_docs)
W_new_tfidf = tfidf_transformer.transform(W_new_counts)

prediction = classifier.predict(W_new_tfidf)

for doc, category in zip(new_docs, prediction):
    print('%r == %s' % (doc, data_train.target_names[category]))
# 'Cars are used for travelling' == rec.autos
# 'Computers contain wires and chips' == sci.electronics

So far the classifier seems to be classifying the new documents correctly.

Data Pipelines

The process so far has been vectorizer > transformer > classifier. sklearn has a Pipeline class that streamlines this process.

from sklearn.pipeline import Pipeline
text_classifier = Pipeline([('vectorizer', CountVectorizer()),
                            ('normalization', TfidfTransformer()),
                            ('classifier', MultinomialNB())])

We can now train the model with one command:

text_classifier = text_classifier.fit(data_train.data, data_train.target)

Evaluation

Evaluating the accuracy of the model is very easy using numpy:

import numpy as np
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle = True, random_state=42)
docs_test = data_test.data
predicted = text_classifier.predict(docs_test)
np.mean(predicted == data_test.target)
# 0.93636972538513064

93.6% accuracy is quite good. Now we will investigate whether a support vector machine classifier improves the performance of the model. An SVM is widely regarded as one of the best text classification algorithms, although it runs a bit slower than Naive Bayes. Changing the algorithm is easy with the Pipeline class.

from sklearn.linear_model import SGDClassifier
text_classifier = Pipeline([('vectorizer', CountVectorizer()),
                            ('normalization', TfidfTransformer()),
                            ('classifier', SGDClassifier(loss='hinge', penalty='l2',
                                                         alpha=1e-3, n_iter=5,
                                                         random_state=20170212))])
_ = text_classifier.fit(data_train.data, data_train.target)
predicted = text_classifier.predict(docs_test)
np.mean(predicted == data_test.target)
# 0.94373744139316806

Using the SVM yielded a higher accuracy on the test data.

The SVM classifier has a number of additional parameters. The l2 penalty term is the regularization term used in the loss function to avoid over-fitting the model to the training data. Alpha is a free parameter (to be optimized) that multiplies the regularization term; it is also used to compute the learning rate in the gradient descent algorithm. Too low an alpha means the algorithm may not reach a local optimum within the number of iterations specified; too large an alpha and the model can actually move away from the local optimum.
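To get a feel for how alpha behaves in practice, here is a minimal sketch (reusing the pipeline components, training data and test set defined above, and assuming the same scikit-learn version in which SGDClassifier accepts n_iter) that re-fits the model for a few alpha values and compares test accuracy:

# Illustrative only: re-fit the SVM pipeline for several alpha values and
# compare the resulting test accuracy. Larger alpha means stronger regularization.
for alpha in (1e-1, 1e-2, 1e-3, 1e-4):
    candidate = Pipeline([('vectorizer', CountVectorizer()),
                          ('normalization', TfidfTransformer()),
                          ('classifier', SGDClassifier(loss='hinge', penalty='l2',
                                                       alpha=alpha, n_iter=5,
                                                       random_state=20170212))])
    candidate.fit(data_train.data, data_train.target)
    accuracy = np.mean(candidate.predict(docs_test) == data_test.target)
    print('alpha = %g -> accuracy = %.4f' % (alpha, accuracy))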

You can get a more detailed performance analysis of the results using the following commands:

from sklearn import metrics
print(metrics.classification_report(data_test.target,predicted, target_names = data_test.target_names))
                       precision    recall  f1-score   support

          alt.atheism       0.99      0.97      0.98       319
comp.sys.mac.hardware       0.92      0.94      0.93       385
            rec.autos       0.97      0.95      0.96       396
      sci.electronics       0.91      0.91      0.91       393

          avg / total       0.94      0.94      0.94      1493
metrics.confusion_matrix(data_test.target, predicted)
array([[310,   3,   2,   4],
       [  1, 363,   5,  16],
       [  0,   2, 377,  17],
       [  2,  26,   6, 359]])

You can see from the confusion matrix that a material number of documents in the comp.sys.mac.hardware category have been misclassified as sci.electronics and vice versa. This is perhaps due to the similar nature of these two categories.
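If the raw array is hard to read, a small sketch like the one below (assuming pandas is available) labels the confusion matrix with the category names, where rows are the actual categories and columns are the predicted ones:

import pandas as pd

# Label the confusion matrix: rows are actual categories, columns are predictions
conf_matrix = metrics.confusion_matrix(data_test.target, predicted)
print(pd.DataFrame(conf_matrix,
                   index=data_test.target_names,
                   columns=data_test.target_names))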

Parameter Tuning using Grid Search

Recall that the SVM classifier has a number of parameters that had to be provided. A different set of parameters will yield a different optimal solution in the gradient descent optimizer. To find the best set of parameters it is possible to run an exhaustive search over a grid of possible values. We will try all combinations: unigrams or bigrams in the vectorizer, with or without idf weighting, and a penalty parameter (alpha) of either 0.01 or 0.001 for the linear SVM.

from sklearn.model_selection import GridSearchCV
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2)],
              'normalization__use_idf': (True, False),
              'classifier__alpha': (1e-2, 1e-3)}

gs_classifier = GridSearchCV(text_classifier, parameters, n_jobs=-1)

Setting n_jobs=-1 tells the grid search to detect how many cores are installed and use them all in parallel to speed up the search.

  • We will run the grid search on a smaller subset of the training data to speed up the process.
gs_classifier = gs_classifier.fit(data_train.data[:300], data_train.target[:300])

# Can use GridSearchCV object to predict new documents
data_train.target_names[gs_classifier.predict(['I like fast cars'])[0]]
# 'rec.autos'

# The object's best_score_ and best_params_ attributes store the best mean score and the parameters corresponding to that score.
gs_classifier.best_score_
# 0.89666666666666661
gs_classifier.best_params_
# {'classifier__alpha': 0.001,
#  'normalization__use_idf': True,
#  'vectorizer__ngram_range': (1, 2)}

NOTE: try varying the sample size used in the grid search process and let me know what relationship you find, if any.
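As a starting point for that exercise, a minimal sketch (reusing text_classifier and parameters defined above) could rerun the grid search on increasing slices of the training data and record the best cross-validated score for each:

# Illustrative only: rerun the grid search on increasing amounts of training
# data and record the best mean cross-validated score for each sample size.
for size in (300, 600, 1200, len(data_train.data)):
    gs = GridSearchCV(text_classifier, parameters, n_jobs=-1)
    gs = gs.fit(data_train.data[:size], data_train.target[:size])
    print('%d samples -> best score %.3f with %s'
          % (size, gs.best_score_, gs.best_params_))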

You can get more details of the results via: gs_classifier.cv_results_

You can easily import the results into a pandas dataframe to inspect the data more easily.
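For example (assuming pandas is installed), cv_results_ is a dictionary of arrays and can be passed straight to the DataFrame constructor:

import pandas as pd

# Each row of cv_results_ corresponds to one parameter combination
results_df = pd.DataFrame(gs_classifier.cv_results_)
results_df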

mean_fit_time mean_score_time mean_test_score mean_train_score param_classifier__alpha param_normalization__use_idf param_vectorizer__ngram_range params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.171132 0.047264 0.860000 1.000000 0.01 True (1, 1) {u'vectorizer__ngram_range': (1, 1), u'normali... 4 0.882353 1.000000 0.84 1.00 0.857143 1.00000 0.012056 0.001255 0.017489 0.000000
1 0.441058 0.122723 0.866667 1.000000 0.01 True (1, 2) {u'vectorizer__ngram_range': (1, 2), u'normali... 3 0.872549 1.000000 0.85 1.00 0.877551 1.00000 0.015408 0.013504 0.011961 0.000000
2 0.132625 0.046148 0.730000 0.974949 0.01 False (1, 1) {u'vectorizer__ngram_range': (1, 1), u'normali... 8 0.745098 0.979798 0.63 0.95 0.816327 0.99505 0.009616 0.002459 0.076454 0.018708
3 0.452035 0.122118 0.766667 0.994983 0.01 False (1, 2) {u'vectorizer__ngram_range': (1, 2), u'normali... 7 0.803922 0.994949 0.67 0.99 0.826531 1.00000 0.020628 0.007971 0.068974 0.004083
4 0.133968 0.048394 0.876667 1.000000 0.001 True (1, 1) {u'vectorizer__ngram_range': (1, 1), u'normali... 2 0.862745 1.000000 0.90 1.00 0.867347 1.00000 0.008025 0.006457 0.016606 0.000000
5 0.437073 0.120282 0.896667 1.000000 0.001 True (1, 2) {u'vectorizer__ngram_range': (1, 2), u'normali... 1 0.872549 1.000000 0.90 1.00 0.918367 1.00000 0.018241 0.011208 0.018849 0.000000
6 0.129757 0.044549 0.806667 1.000000 0.001 False (1, 1) {u'vectorizer__ngram_range': (1, 1), u'normali... 5 0.823529 1.000000 0.78 1.00 0.816327 1.00000 0.007547 0.003771 0.019084 0.000000
7 0.394275 0.112885 0.806667 1.000000 0.001 False (1, 2) {u'vectorizer__ngram_range': (1, 2), u'normali... 5 0.794118 1.000000 0.77 1.00 0.857143 1.00000 0.013236 0.008274 0.036524 0.000000

Conclusion

You have completed the tutorial on building a classification model to categorize documents into various categories using training data and test data. You have also learned about parameter tuning to optimise the model using a grid search technique.
