The Instrument will enable data mining research through its numerous built-in algorithms for mining vector and raster geospatial data. These range from data reduction algorithms (feature selection, feature extraction, and instance selection such as sampling) to spatial data mining algorithms, including spatial co-location pattern discovery, spatial outlier detection, spatial clustering, and spatial classification.

 

In many applications, the data are naturally multi-modal, in the sense that they are represented by multiple sets of features. With multiple information sources available, the challenge is to conduct integrated exploratory analysis that extracts more information than is possible from any single source. Many multi-view learning algorithms in the literature are designed to learn from multiple information sources.

 

From the information fusion perspective, multi-view learning methods can be categorized into three different types based on the way that information from different sources is used:

 

(1) Feature Integration, where the feature representation is enlarged to incorporate all attributes from the different sources and a unified feature space is generated [WOC99].

(2) Semantic Integration, where computational methods are first applied to each dataset separately and the results on the different datasets are then combined [Bis06].

(3) Intermediate Integration (or kernel integration), where the datasets are kept in their original form and are integrated at the similarity computation or kernel level [BS02, LCB+06].
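
The three integration types above can be illustrated with a minimal sketch in Python. This is only an illustration under stated assumptions, not the framework proposed here: the function names, the toy two-view data, the RBF similarity, the equal kernel weights, and the distance-to-centroid score used as the per-view method are all hypothetical choices made for the example.

```python
from math import exp, sqrt

# Toy data (hypothetical): three instances described by two views.
view_a = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
view_b = [[1.0], [1.0], [5.0]]


def feature_integration(a, b):
    """(1) Feature integration: concatenate each instance's feature
    vectors so that all attributes live in one unified feature space."""
    return [xa + xb for xa, xb in zip(a, b)]


def centroid_distance_scores(view):
    """Illustrative per-view method: Euclidean distance of each
    instance to the centroid of the view."""
    n, dims = len(view), len(view[0])
    centroid = [sum(x[d] for x in view) / n for d in range(dims)]
    return [sqrt(sum((x[d] - centroid[d]) ** 2 for d in range(dims)))
            for x in view]


def semantic_integration(a, b):
    """(2) Semantic integration: run the method on each view separately,
    then combine the per-view results (here, by averaging the scores)."""
    scores_a = centroid_distance_scores(a)
    scores_b = centroid_distance_scores(b)
    return [(sa + sb) / 2 for sa, sb in zip(scores_a, scores_b)]


def rbf_kernel(view, gamma=1.0):
    """Pairwise RBF similarity matrix for a single view."""
    def sqdist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return [[exp(-gamma * sqdist(x, y)) for y in view] for x in view]


def intermediate_integration(a, b, weight_a=0.5):
    """(3) Intermediate integration: keep each view in its original form
    and fuse at the kernel level with a weighted sum of kernel matrices."""
    ka, kb = rbf_kernel(a), rbf_kernel(b)
    n = len(ka)
    return [[weight_a * ka[i][j] + (1.0 - weight_a) * kb[i][j]
             for j in range(n)] for i in range(n)]


unified = feature_integration(view_a, view_b)
combined_scores = semantic_integration(view_a, view_b)
combined_kernel = intermediate_integration(view_a, view_b)
```

In this sketch, `unified` can be passed to any single-view algorithm, `combined_scores` merges decisions made per view, and `combined_kernel` can be passed to any kernel method. Methods such as [LCB+06] learn the kernel weights from data rather than fixing them as done here.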

 

From the learning perspective, most multi-view learning algorithms to date are designed under the semi-supervised and clustering frameworks. Semi-supervised learning deals with scarce labeled examples alongside abundant unlabeled examples, whereas clustering focuses on learning the hidden patterns of the data set in an unsupervised way.

 

The team of Dr. Tao Li [http://users.cis.fiu.edu/~taoli/research-project.html] at FIU will use the Instrument to establish a comprehensive framework for large-scale data mining from multiple information sources. The framework focuses on unsupervised and semi-supervised learning and can perform fusion at all three levels: feature integration, semantic integration, and intermediate integration.

 

With the explosive growth in the volume and complexity of document data, it has become necessary to understand documents semantically and deliver meaningful information to users. The research areas that address these problems span data mining, information retrieval, and machine learning.

 

Dr. Tao Li's group has been focusing on developing advanced data mining and machine learning algorithms for

 

1) improving document clustering and summarization performance;

2) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation;

3) summarizing the difference and evolution of different document sources; and

4) building document understanding systems to solve real-world applications.

 

The Instrument, with its vast repository of geo-located OCR-ed documents, will allow the group to scale up these methods.

 

 

References Cited

 

[WOC99] L. Wu, S. L. Oviatt, P.R. Cohen. Multimodal Integration - A Statistical View. IEEE Transactions on Multimedia, 1(4): 334-341, 1999.

[Bis06] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[BS02] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[LCB+06] G.R. Lanckriet, N. Cristianini, P.L. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with semi-definite programming. In Proceedings of International Conference on Machine Learning (ICML), pages 323-330, 2006.