In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Will defend his dissertation proposal
Along with the proliferation of cheap storage and more efficient CPUs, autonomous, heterogeneous semistructured sources have been created. These large heterogeneous sources are difficult to query and explore, even when they originate from a common structured source. Multiple ad-hoc systems have been proposed to manage the data resulting from this pervasive integration problem. Our research defends the idea that integrating and managing a collection of semistructured data together with a central database repository (structured data), within a database management system (DBMS), is efficient for medium-size collections and allows complex querying and knowledge discovery. To perform this combined querying, we present several data layouts and algorithms for extracting hidden links, at different granularity levels, between the metadata (table names and column names) and content (records) in a DBMS and the keywords within a corpus of heterogeneous sources (e.g., documents, source code, and spreadsheets). These algorithms focus on efficiently creating and managing summarization tables and keyword-matching routines using standard SQL queries and extensibility features of the DBMS (e.g., User-Defined Functions). Ultimately, these links can be ranked, explored, and queried by several search algorithms that we introduce. As a result, we extend relational queries to handle documents and provide a complexity analysis of the proposed algorithms. Furthermore, we present additional knowledge discovery techniques (stream clustering and ontology extraction) within the DBMS to explore and manage these common unstructured sources.
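
To make the idea of linking DBMS metadata to document keywords concrete, the following is a minimal sketch and not the dissertation's actual algorithms or data layouts. It assumes SQLite (through Python's standard sqlite3 module) as the DBMS, exact keyword matching after lowercasing, and term frequency as the only ranking signal. The helper tables doc_keyword, metadata, and link, the function link_metadata_to_documents, and the UDF norm are all hypothetical names introduced for illustration.

```python
import re
import sqlite3
from collections import Counter


def link_metadata_to_documents(conn, documents):
    """Link table and column names in `conn` to keywords in `documents`.

    `documents` maps a document id to its raw text. The helper tables
    created here (metadata, doc_keyword, link) are hypothetical names.
    Returns (name, kind, doc_id, freq) rows, most frequent links first.
    """
    # Assumption: a User-Defined Function that normalizes by lowercasing.
    conn.create_function("norm", 1, lambda s: s.lower())
    cur = conn.cursor()

    # Read the user tables before any helper table is created.
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'").fetchall()]

    # Metadata layout: one row per table name and per column name.
    cur.execute("CREATE TABLE metadata (name TEXT, kind TEXT, source_table TEXT)")
    for table in tables:
        cur.execute("INSERT INTO metadata VALUES (?, 'table', ?)", (table, table))
        columns = cur.execute(f'PRAGMA table_info("{table}")').fetchall()
        cur.executemany("INSERT INTO metadata VALUES (?, 'column', ?)",
                        [(col[1], table) for col in columns])

    # Keyword layout: one row per (document, keyword) with its frequency.
    cur.execute("CREATE TABLE doc_keyword (doc_id TEXT, keyword TEXT, freq INTEGER)")
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))
        cur.executemany("INSERT INTO doc_keyword VALUES (?, ?, ?)",
                        [(doc_id, kw, n) for kw, n in counts.items()])

    # Summarization table: the links, built with a plain SQL join.
    cur.execute(
        "CREATE TABLE link AS "
        "SELECT m.name AS name, m.kind AS kind, k.doc_id AS doc_id, k.freq AS freq "
        "FROM metadata m JOIN doc_keyword k ON k.keyword = norm(m.name)")
    conn.commit()

    # Rank links by frequency; ties are broken by name, then document id.
    return cur.execute(
        "SELECT name, kind, doc_id, freq FROM link "
        "ORDER BY freq DESC, name, doc_id").fetchall()


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE Customer (id INTEGER, Name TEXT, City TEXT)")
    docs = {"report.txt": "Each customer has a city. Customer city list."}
    for row in link_metadata_to_documents(demo, docs):
        print(row)
```

Because the links are materialized in an ordinary table, they can be filtered or joined back to the records they refer to with further SQL. Matching against record content, or at other granularity levels, would need additional tables that this sketch does not attempt.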