Document classification is the process of assigning documents to one or more predefined categories, or classes, of documents. Document classification is also referred to as text categorization. It can be divided into two broad approaches: the first is supervised learning-based classification, and the second is unsupervised learning-based clustering. A supervised learning-based method requires a training process along with training data sets, whereas an unsupervised learning-based method does not require such a training phase. An example of unsupervised learning is k-means document clustering. The goal of this session is not to comprehensively discuss classification algorithms, but to apply the TF-IDF weighting that we covered in the previous session to feature weighting for document classification.

There are three major components of document classification. The first is the document collection, which typically consists of training and test data sets, or in some cases training, validation, and test collections. A training data set is a set of labeled documents used to learn a classifier that predicts the class of a document. A test data set is a set of documents used to assess the predictive power of a classifier. The second is the classifier, which is an algorithm that implements classification in a concrete implementation; it also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category or class. The third is the feature set, which for document classification is usually a bag of words.

Let me describe a couple of classification techniques. The simplest algorithm is the Naive Bayes classifier. It is a simple probabilistic classifier based on applying Bayes' theorem with strong, in many cases naive, independence assumptions between the features. It is a popular baseline method for text categorization. It assumes that the value of a particular feature is independent of the value of any other feature given the class variable. For example, a document A may belong to the category of sports if it contains terms like basketball, baseball, and golf. A Naive Bayes classifier considers each of these features to contribute independently to the probability that the document belongs to the sports category. I will show a small sketch of this scoring idea in a moment.

The second classifier is k-Nearest Neighbours, also called kNN. It is a non-parametric method used for classification or regression. The input to the k-Nearest Neighbour algorithm consists of the k closest training examples in the feature space. A document is classified by a majority vote of its neighbours, with the document being assigned to the class most common among its k nearest neighbours. Here, k is a positive integer, typically a small number. If k equals one, the document is simply assigned to the class of its single nearest neighbour. The output is a class membership; the neighbours are taken from a set of documents for which the class is known, which can be thought of as the training set for the algorithm. Again, a small sketch follows in a moment.

Let me move on to SVM. SVM stands for Support Vector Machine. It was first introduced at COLT-92 by Boser, Guyon, and Vapnik, and since then it has become a very popular supervised learning algorithm. The central website for SVMs is www.kernel-machines.org. Support vectors are the data points that lie closest to the decision surface. SVMs maximize the margin around the separating hyperplane, and the decision function is fully specified by a subset of the training samples, the support vectors.
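Before we look at the diagram, here is a minimal sketch in Java of the Naive Bayes scoring idea I just described. It is only an illustration, not any particular library: the class priors and per-class term probabilities are made-up numbers, and a real classifier would estimate them from a labeled training collection.

```java
import java.util.Map;

// Minimal Naive Bayes scoring sketch. All probabilities below are
// hypothetical, hard-coded values chosen purely for illustration.
public class NaiveBayesSketch {
    public static void main(String[] args) {
        // P(class): hypothetical prior probabilities.
        Map<String, Double> prior = Map.of("sports", 0.5, "politics", 0.5);
        // P(term | class): hypothetical per-class term probabilities.
        Map<String, Map<String, Double>> likelihood = Map.of(
            "sports",   Map.of("basketball", 0.20, "baseball", 0.15,
                               "golf", 0.10, "election", 0.01),
            "politics", Map.of("basketball", 0.01, "baseball", 0.01,
                               "golf", 0.02, "election", 0.25));

        String[] document = {"basketball", "baseball", "golf"};

        // Score each class as log P(c) + sum of log P(t | c). The plain sum
        // over terms is the naive independence assumption at work: each
        // feature contributes on its own, given the class.
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : prior.keySet()) {
            double score = Math.log(prior.get(c));
            for (String t : document) {
                // Tiny floor for terms unseen in a class (a crude
                // stand-in for proper smoothing).
                score += Math.log(likelihood.get(c).getOrDefault(t, 1e-6));
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        System.out.println("Predicted class: " + best); // prints "sports"
    }
}
```

Since the document contains only sports terms, the sports class wins; adding political terms would shift the score toward the other class, one independent log-probability at a time.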
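And here is the same kind of sketch for kNN. The two-dimensional document vectors and labels below are again hypothetical; in our exercise the vectors would be tf-idf vectors built from the training collection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Minimal kNN classification sketch over hypothetical document vectors.
public class KnnSketch {
    // Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Training documents with known classes (made-up values).
        double[][] train = {{0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}, {0.2, 0.8}};
        String[] labels = {"sports", "sports", "politics", "politics"};
        double[] query = {0.7, 0.3}; // the document to classify
        int k = 3;

        // Sort training documents by distance to the query document.
        Integer[] idx = {0, 1, 2, 3};
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], query)));

        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        String prediction = Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        System.out.println("Predicted class: " + prediction); // prints "sports"
    }
}
```

With k = 3, the two nearby sports documents outvote the single politics neighbour, which is exactly the majority-vote rule described above.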
Back to SVMs: the grey squares in the diagram refer to the support vectors with respect to separation by the hyperplane. Suppose we are dealing with linearly separable data: in two dimensions the classes can be separated by a line, and in higher dimensions by a hyperplane. To find a separating hyperplane, linear learning algorithms such as the perceptron are commonly used.

There are two famous SVM implementations. The first is LibSVM; its full description is found at www.csie.ntu.edu.tw/~cjlin/libsvm/. The second is SVM-Light, which can be found at svmlight.joachims.org. In this session we will stick with LibSVM, simply because LibSVM has a Java implementation, so we can easily integrate it into our Text Miner package. LibSVM was developed by Chih-Jen Lin and his colleagues. It is known as a tool for multi-class support vector classification and regression. It has many different implementations, including C++, Java, Python, Matlab, and Perl, and it supports many operating systems, such as Linux, UNIX, and Windows.

Since we are going to use LibSVM for the document classification exercise, let's take a closer look at the data file format that LibSVM accepts. In fact, this format is also accepted by SVM-Light. The format is the same for both training and test data sets. The first column of the training and test data is the class label; the example here is binary, with +1 and -1 as the two classes. The second through the last columns are the features. In the example we have four features, in other words four unique words. You can think of this feature set as a vocabulary list built from the data collection. Each feature is written as a pair separated by a colon: the left side is the feature, or term, and the right side is its value. In our exercise, we will use the tf-idf value for the right side. When we test the classifier, the input documents must be transformed into this same feature space. I will show a tiny sample file in this format at the end of this part.

As far as evaluation methods are concerned, there are several, such as leave-one-out and n-fold cross validation. Since n-fold cross validation is the most well-accepted approach, I will briefly describe what it is about. The data set you have is partitioned into n equal-sized disjoint subsets. You then use each subset as a test data set and combine the remaining n-1 subsets into a training set to learn a classifier. You execute this procedure n times, which gives n accuracies, and finally you average them to obtain the overall performance. 10-fold or 5-fold cross validation is commonly used.
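To make the procedure concrete, here is a minimal sketch of n-fold cross validation. The "classifier" inside the loop is a deliberately trivial stand-in (it just predicts the majority label of the training folds); in our exercise, a LibSVM model trained on the n-1 folds would take its place.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal n-fold cross validation sketch with a stand-in classifier.
public class CrossValidationSketch {
    public static void main(String[] args) {
        int n = 5; // 5-fold cross validation
        List<String> labels = new ArrayList<>(List.of(
            "sports", "sports", "politics", "sports", "politics",
            "politics", "sports", "politics", "sports", "politics"));
        Collections.shuffle(labels, new Random(42)); // random disjoint partition

        double totalAccuracy = 0;
        for (int fold = 0; fold < n; fold++) {
            // Documents with index i % n == fold form the test subset;
            // the remaining n-1 subsets form the training set.
            List<String> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < labels.size(); i++)
                (i % n == fold ? test : train).add(labels.get(i));

            // Stand-in "classifier": predict the majority training label.
            Map<String, Long> counts = new HashMap<>();
            for (String l : train) counts.merge(l, 1L, Long::sum);
            String majority = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();

            long correct = test.stream().filter(majority::equals).count();
            totalAccuracy += (double) correct / test.size();
        }
        // Average the n per-fold accuracies into one overall figure.
        System.out.printf("Average accuracy over %d folds: %.3f%n",
                n, totalAccuracy / n);
    }
}
```

The important part is the loop structure: each of the n subsets serves exactly once as the test set, and the n per-fold accuracies are averaged at the end.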
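Finally, going back to the LibSVM data format: here is what a tiny training file might look like for our binary example with a four-word vocabulary. The tf-idf values are made up for illustration. Note that LibSVM expects the feature indices in ascending order and lets you omit zero-valued features, so each line is a sparse representation of a document vector.

```
+1 1:0.42 2:0.13 4:0.25
+1 1:0.38 3:0.07
-1 2:0.51 3:0.19 4:0.33
-1 1:0.02 2:0.44 4:0.61
```

A test file looks exactly the same; the class label in the first column is then used only to score the classifier's predictions.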