In Partial Fulfillment of the
Requirements for the Degree of
Master of Science
Will defend her thesis
The EM algorithm is a prominent statistical algorithm, but its iterative behavior makes it slow on large data sets. It has generally been used to compute descriptive models, clustering being the most common. This thesis introduces a fast-convergence EM algorithm that fits a separate mixture of Gaussians per class on large data sets to compute a Bayesian classification model. Our algorithm improves on Naive Bayes to achieve higher accuracy while providing efficient computation. Our approach offers several advantages: the algorithm can learn the model incrementally, the model can adapt to local characteristics of the data set, and classification accuracy can be tuned by varying the number of clusters. Our proposed algorithm incorporates improvements for incremental learning, faster convergence, and higher classification accuracy. We present an extensive experimental evaluation on real and synthetic data sets studying accuracy and scalability. Our algorithm is shown to be more accurate than Naive Bayes and decision trees. From a performance point of view, our algorithm scales linearly with data set size and is significantly faster than standard EM.
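The abstract describes fitting one Gaussian mixture per class with EM and then classifying with Bayes' rule. The following is a minimal sketch of that general idea, not the thesis's actual algorithm (which adds incremental learning and convergence improvements not shown here); all function names, the diagonal-covariance assumption, and the toy data are illustrative assumptions.

```python
import numpy as np

def em_gmm(X, k, iters=50, seed=0):
    """Fit a k-component Gaussian mixture (diagonal covariance) with EM.
    Illustrative sketch only, not the thesis's improved algorithm."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]       # component means
    var = np.full((k, d), X.var(axis=0) + 1e-6)   # component variances
    w = np.full(k, 1.0 / k)                       # mixture weights
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)   # stabilize exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibilities
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def fit_per_class(X, y, k):
    """One Gaussian mixture per class plus class priors."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = (len(Xc) / len(X), *em_gmm(Xc, k))
    return models

def predict(models, X):
    """Bayes rule: argmax over classes of prior * mixture density."""
    classes = sorted(models)
    scores = []
    for c in classes:
        prior, w, mu, var = models[c]
        logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        m = logp.max(axis=1)   # log-sum-exp over components
        scores.append(np.log(prior) + m
                      + np.log(np.exp(logp - m[:, None]).sum(axis=1)))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Toy demo on two well-separated synthetic classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(4.0, 0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
models = fit_per_class(X, y, k=2)
acc = (predict(models, X) == y).mean()
```

Each EM iteration is a full pass over the data, which is why scaling this to large data sets motivates the faster-converging, incremental variant the thesis proposes.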
Date:
Time:
Place: 550-PGH
Faculty, students,
and the general public are invited.
Advisor: Prof. Carlos Ordonez