In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
In this work we solve the problem of variable selection for linear regression on large data sets stored in a Database Management System (DBMS), under the Bayesian approach. This is a challenging problem because data sets with a large number of variables present a combinatorial search space, and because data sets may be so large that they cannot fit in main memory. We introduce a three-step algorithm to solve variable selection for large data sets: pre-selection, summarization, and accelerated Gibbs sampling. Since MCMC methods such as Gibbs sampling require thousands of iterations for the Markov chain to stabilize, unoptimized algorithms can either run for many hours or fail due to insufficient main memory. We overcome these issues with several non-trivial database-oriented optimizations, which are experimentally validated in both a parallel Array DBMS and a Row DBMS. We highlight the superiority of the Array DBMS, a novel kind of DBMS, for analyzing large matrices. We analyze performance from two perspectives: data set size and dimensionality. We present interesting results for high-dimensional microarray data. Our algorithm shows promising performance, especially on high-dimensional data sets: our prototype is generally two orders of magnitude faster than a variable selection package for R.
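As an illustrative aside (not taken from the thesis itself), the Python sketch below shows how a summarization step can accelerate Gibbs sampling for variable selection: sufficient statistics (n, X'X, X'y, y'y) are computed in one pass over the data, after which every Gibbs iteration reads only the d x d summary rather than the full data set. The Zellner g-prior, the uniform prior over models, the assumption that X and y are centered, and all function names (gamma_summary, log_marginal, gibbs_select) are hypothetical choices made here for illustration, not the thesis's actual design.

    import numpy as np

    def gamma_summary(X, y):
        # One pass over the data; the sampler never touches X or y again.
        n = X.shape[0]
        return n, X.T @ X, X.T @ y, float(y @ y)

    def log_marginal(gamma, n, XtX, Xty, yty, g):
        # Log marginal likelihood of the sub-model selected by the 0/1
        # vector gamma, under a Zellner g-prior (assumed here), up to a
        # constant shared by all models. Assumes centered X and y.
        idx = np.flatnonzero(gamma)
        q = idx.size
        if q == 0:
            return 0.0                      # null model: R^2 = 0, terms cancel
        A = XtX[np.ix_(idx, idx)]           # q x q block of the summary
        b = Xty[idx]
        r2 = (b @ np.linalg.solve(A, b)) / yty   # R^2 from the summary only
        return 0.5 * (n - 1 - q) * np.log(1.0 + g) \
             - 0.5 * (n - 1) * np.log(1.0 + g * (1.0 - r2))

    def gibbs_select(X, y, n_iter=2000, g=None, seed=0):
        # Collapsed Gibbs sampler over inclusion indicators,
        # with a uniform prior over models.
        rng = np.random.default_rng(seed)
        n, XtX, Xty, yty = gamma_summary(X, y)
        d = X.shape[1]
        g = float(n) if g is None else g    # unit-information default
        gamma = np.zeros(d, dtype=int)
        counts = np.zeros(d)
        for _ in range(n_iter):
            for j in range(d):              # resample one indicator at a time
                gamma[j] = 0
                lm0 = log_marginal(gamma, n, XtX, Xty, yty, g)
                gamma[j] = 1
                lm1 = log_marginal(gamma, n, XtX, Xty, yty, g)
                gamma[j] = int(rng.random() < 1.0 / (1.0 + np.exp(lm0 - lm1)))
            counts += gamma
        return counts / n_iter              # posterior inclusion probabilities

Once the summary is cached, each of the thousands of Gibbs iterations costs at most O(d * q^3), independent of the number of rows n, which is presumably why summarization pays off on data sets too large for main memory.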
Date: Tuesday, November 11, 2014
Time: 9:30 AM
Place: PGH 550
Faculty, students, and the general public are invited.
Advisor: Carlos Ordóñez