In Partial Fulfillment of the
Requirements for the Degree of
Master of Science
Will defend her thesis
The EM algorithm is a prominent statistical algorithm, but its iterative behavior makes it slow on large data sets. It has generally been used to compute descriptive models, clustering being the most common. This thesis introduces a fast-convergence EM algorithm that fits a separate mixture of Gaussians per class on large data sets to compute a Bayesian classification model. Our algorithm improves on Naive Bayes to achieve higher accuracy while providing efficient computation. Our approach offers several advantages: the algorithm can learn the model incrementally, the model can adapt to local characteristics of the data set, and classification accuracy can be tuned by varying the number of clusters. Our proposed algorithm incorporates improvements for incremental learning, faster convergence, and higher classification accuracy. We present an extensive experimental evaluation on real and synthetic data sets studying accuracy and scalability. Our algorithm is shown to be more accurate than Naive Bayes and decision trees. From a performance point of view, our algorithm scales linearly with data set size and is significantly faster than standard EM.
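The abstract describes fitting one Gaussian mixture per class with EM and then classifying with Bayes' rule. The following is a minimal sketch of that general idea, not the thesis's actual algorithm (which adds incremental learning and convergence improvements not shown here); all function names, the diagonal-covariance assumption, and the toy data are illustrative assumptions.

```python
import numpy as np

def em_gmm(X, k, iters=50, seed=0):
    """Fit a k-component Gaussian mixture (diagonal covariance) with EM.
    Illustrative sketch only, not the thesis's improved algorithm."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]       # component means
    var = np.full((k, d), X.var(axis=0) + 1e-6)   # component variances
    w = np.full(k, 1.0 / k)                       # mixture weights
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)   # stabilize exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibilities
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def fit_per_class(X, y, k):
    """One Gaussian mixture per class plus class priors."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = (len(Xc) / len(X), *em_gmm(Xc, k))
    return models

def predict(models, X):
    """Bayes rule: argmax over classes of prior * mixture density."""
    classes = sorted(models)
    scores = []
    for c in classes:
        prior, w, mu, var = models[c]
        logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        m = logp.max(axis=1)   # log-sum-exp over components
        scores.append(np.log(prior) + m
                      + np.log(np.exp(logp - m[:, None]).sum(axis=1)))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Toy demo on two well-separated synthetic classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(4.0, 0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
models = fit_per_class(X, y, k=2)
acc = (predict(models, X) == y).mean()
```

Each EM iteration is a full pass over the data, which is why scaling this to large data sets motivates the faster-converging, incremental variant the thesis proposes.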
Date:
Time:
Place: 550-PGH
Faculty, students,
and the general public are invited.
Advisor: Prof. Carlos Ordonez