Super Accurate Document Classification at the Speed of Light

“My life is very monotonous,” the fox said. “I hunt chickens; men hunt me. All the chickens are just alike, and all the men are just alike. And, in consequence, I am a little bored.”
– Antoine de Saint-Exupéry

My recent visit to a KPO reminded me of the famous French quote above. There seemed to be tremendous boredom and monotony among the educated and well-paid workforce. I could not help but feel sorry for them, as I could imagine the amount of human effort spent manually classifying documents every day. Being a software geek who had solved several such problems earlier, I knew there was a better way for those folks to get the job done: automating the process using Machine Learning.

Document Classification Automation aims to ease the life of a domain expert by avoiding painstaking, repetitive, and time-consuming processes.

What the Classifiers Do

Classifiers make ‘predictions’ based on experience. When a classifier is fed a new document, it predicts that the document belongs to a particular class or category and assigns a “label.”

Source Data for Building a Classification Process

The Source dataset is a collection of documents that have been classified in the past. It must be split into two parts: a Training dataset and a Testing dataset.

  1. Training dataset – Building a classification model requires a training dataset.
    It needs to be large enough to have adequate documents in each class. The Training dataset also needs to be of good quality, with clear differences between the documents belonging to the different categories.
  2. Testing dataset – Evaluating the effectiveness of the classification model requires a testing dataset.
    It should consist of classified documents that were not used to train the model, so that the evaluation reflects how the model handles new documents.
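
The split described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in any particular product: the function name `stratified_split`, the 25% test fraction, and the fixed random seed are all assumptions made for the example. It splits class by class so that every category with more than one document is represented in both parts.

```python
import random
from collections import defaultdict


def stratified_split(documents, labels, test_fraction=0.25, seed=0):
    """Split labeled documents into training and testing lists, class by class.

    Hypothetical helper for illustration; test_fraction and seed are assumed values.
    """
    by_class = defaultdict(list)
    for doc, label in zip(documents, labels):
        by_class[label].append(doc)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label, docs in by_class.items():
        docs = docs[:]  # copy so the caller's data is not reordered
        rng.shuffle(docs)
        # Hold out at least one document per class, unless the class has only one.
        n_test = max(1, round(len(docs) * test_fraction)) if len(docs) > 1 else 0
        test += [(doc, label) for doc in docs[:n_test]]
        train += [(doc, label) for doc in docs[n_test:]]
    return train, test


# Usage: eight toy documents, four per class.
docs = [f"invoice {i}" for i in range(4)] + [f"contract {i}" for i in range(4)]
labels = ["invoice"] * 4 + ["contract"] * 4
train_set, test_set = stratified_split(docs, labels)
```

Splitting per class rather than over the whole collection is a design choice: with a purely random split, a small category could end up entirely in one of the two parts.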

How the Classifier is built

  1. Pre-processing of dataset
    Pre-processing is necessary because source data may contain noise and unreliable information. The objective is to structure the data so that it facilitates the Classification Process. Data pre-processing includes Data Cleansing, Normalization, Feature Extraction, and Feature Selection. Keep in mind that Data Preparation is a complex subject involving many iterations of exploration and analysis. Preparing the data well in these steps is essential to get good results, since pre-processing plays a vital role in the accuracy of a Classifier.
  2. Classification Algorithm
    Documents are classified by comparing the terms in a document's vector with those of each class to see which class the document most closely resembles. The Classifier places the document into one of the category types and assigns it a label within that category type. In my experience, classification algorithms such as Support Vector Machines, Naive Bayes, and Rocchio are well suited for Document Classification.
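
To make the two steps above concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions rather than a production approach: the stop-word list, the helper names (`preprocess`, `to_vector`, `train_centroids`, `classify`), and the toy training documents are all invented for the example. The classifier is a simplified Rocchio-style one that uses raw term counts and cosine similarity; a real system would more likely use weighted features such as TF-IDF.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "for", "in", "on"}


def preprocess(text):
    """Cleanse and normalize: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def to_vector(text):
    """Feature extraction: a bag-of-words vector of term counts."""
    return Counter(preprocess(text))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_docs):
    """Build one prototype vector per class by summing its documents' vectors.

    Summing instead of averaging is fine here because cosine ignores scale.
    """
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(to_vector(text))
    return dict(centroids)


def classify(text, centroids):
    """Assign the label of the class prototype the document most resembles."""
    vector = to_vector(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Usage with a toy training set.
training = [
    ("Invoice for the purchase of office supplies", "invoice"),
    ("Payment due on invoice number 1042", "invoice"),
    ("Employment contract between the company and the employee", "contract"),
    ("The contract terms and termination clause", "contract"),
]
centroids = train_centroids(training)
print(classify("Please settle the invoice payment", centroids))  # prints "invoice"
```

The example skips Feature Selection entirely; with a realistic corpus, pruning very rare and very common terms is usually one of the first things to try.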

The Accuracy Measure

Once the Classification model is built, it needs to be evaluated by feeding it the Testing dataset and comparing its predictions with the known labels. If the accuracy of the model is not as expected, you must take steps to improve it. I took the following measures to improve accuracy:

  1. Revisit pre-processing of the dataset and filter out unwanted data
  2. Improve the quality of the Training corpus
  3. Try other Classification Algorithms or try the Ensemble approach
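
As a rough sketch of the evaluation step and of the Ensemble idea in the list above, the following standard-library Python computes accuracy, counts which categories get confused with each other, and combines several classifiers by majority vote. The function names and the toy labels are assumptions made for illustration, and majority voting is just one simple way to build an ensemble.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of test documents whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs to see which categories get mixed up."""
    return Counter(zip(true_labels, predicted_labels))


def majority_vote(predictions):
    """Ensemble: combine per-classifier label lists by majority vote.

    `predictions` holds one list of labels per classifier, all in the same
    document order. Ties go to the label encountered first among the votes.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Usage with toy labels for four test documents.
known = ["invoice", "contract", "invoice", "contract"]
predicted = ["invoice", "contract", "contract", "contract"]
print(accuracy(known, predicted))  # prints 0.75
```

Looking at the confusion counts, not just the single accuracy number, helps decide which of the three measures to try first: if two categories are regularly confused, the Training corpus for those two is a good place to start.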

Summary

Document Classification is a supervised learning method: a model is built from a pre-processed training dataset, and the trained Classifier then predicts the category of any given document.

The quality of the training dataset affects the quality of prediction. Each document category should therefore cover the variation found among its documents, so that the training dataset stays representative.

If you wish to look at our Document Classification Product, please feel free to contact us at [email protected].