A Holistic Approach to Automatic Deep Understanding of Technical Documents
by Dr. Nikolaos Bourbakis, IEEE Fellow
Director, CART-WSU, BAIF
Most technical documents are composed of several modalities, such as diagrams, tables, formulas, functions, algorithms, graphics, pictures and natural language (NL) text. Each of these modalities, and the associations among them, contributes significantly to the deep understanding of the technical document and of the knowledge it represents. Here we treat all of these modalities, except NL text, as “images”, so that each technical document is composed mainly of NL text sentences and “images”. In this talk we present a holistic approach in which all these modalities are expressed in the same two forms (NL text sentences and SPN graphs) for better associations and deeper understanding of a technical document. This deeper understanding will come from two novel contributions.
The first unique contribution will be an enrichment of the NL text part with additional NL text sentences extracted from the “images” of the technical document. The second unique contribution will come from the SPN models of these “images” that enrich the main block diagram’s functionality by generating a simulator for the system described in that technical document.
Graph degeneracy and applications to social networks and text mining
by Dr. Michalis Vazirgiannis, Professor at LIX,
Ecole Polytechnique in France
Graph degeneracy is a popular method for approximating the densest subgraph in almost linear time. In our research we extended this method to weighted and directed graphs and capitalized on these extensions to investigate its potential in different graph and text mining cases. One such case is k-core based community evaluation, specifically metrics that integrate authority and collaboration, properties not captured by single-node metrics or by established community evaluation metrics. Starting from the k-core, which essentially measures the robustness of a community under degeneracy, we extend the concept to weighted graphs. We further introduce novel metrics for evaluating the collaborative nature of directed graphs and define a novel D-core metric, extending the classic graph-theoretic notion of k-cores for undirected graphs to directed ones. We applied the D-core approach to large real-world graphs such as Wikipedia and Aminer.org citation data and report interesting results. The D-core metric has been adopted by Aminer as part of its reported metrics. We also investigate the issue of influence maximization in graphs, using degeneracy as a means to select the optimal spreaders. The results are promising, in particular for epidemics started from the densest k-truss. We also thoroughly investigate the issue of graph similarity via novel graph kernels and embedding schemes, with applications to graph classification in chemo-informatics, social networks and text mining.
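The k-core decomposition that underlies graph degeneracy can be illustrated with a minimal sketch. It covers only the classic undirected, unweighted case, not the weighted or D-core extensions described above, and it is not the speaker's implementation. The function names `core_numbers` and `k_core` are hypothetical, and the simple minimum search used here is quadratic; a bucket queue over degrees would be needed to reach the near-linear time mentioned in the abstract.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected, unweighted graph.

    Illustrative peeling procedure: repeatedly remove a node of minimum
    remaining degree. A node's core number is the largest minimum degree
    seen up to the moment it is removed.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-loops are ignored in this sketch
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Quadratic for clarity; a bucket queue makes this step O(1).
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


def k_core(edges, k):
    """Return the set of nodes whose core number is at least k."""
    return {node for node, c in core_numbers(edges).items() if c >= k}


# A triangle (a, b, c) with one pendant node d attached to c.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
example_cores = core_numbers(example_edges)
degeneracy = max(example_cores.values())
```

In this example the pendant node is peeled first, so only the triangle survives in the 2-core, and the degeneracy of the graph is the largest core number found.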
At the level of text mining, we capitalize on the Graph-of-Words (GoW) model, a graph representation of documents that inherently captures the order of and distances between words in the document, in addition to their frequency, in order to measure document similarity. We applied graph-of-words to various tasks such as ad hoc Information Retrieval, Single-Document Keyword Extraction, Text Categorization, Sub-event Detection in Textual Streams (e.g. Twitter) and document summarization. In all cases the graph-of-words approach, assisted by degeneracy at times, outperforms the state-of-the-art baselines. We are currently investigating the potential of GoW as input to deep learning architectures for text mining tasks.
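As a rough illustration of how a graph-of-words might be built, the sketch below links each term to the terms that follow it within a sliding window, so that word order and distance are reflected in the edges. Everything specific here is an assumption rather than the speaker's method: the function name `graph_of_words`, the window size of 3, the regular-expression tokenizer, and the choice to weight each directed edge by its co-occurrence count.

```python
import re
from collections import Counter


def graph_of_words(text, window=3):
    """Build a directed, weighted graph-of-words as {(earlier, later): count}.

    Assumptions of this sketch: lower-cased alphanumeric tokens, and an
    edge from each term to the next `window - 1` terms that follow it.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    edges = Counter()
    for i, earlier in enumerate(tokens):
        # Terms inside the window that starts at position i.
        for later in tokens[i + 1 : i + window]:
            if earlier != later:  # skip self-loops from repeated words
                edges[(earlier, later)] += 1
    return dict(edges)


small_graph = graph_of_words("graph of words graph", window=2)
wider_graph = graph_of_words("graph of words graph", window=3)
```

With a window of 2 only adjacent terms are linked; widening the window adds edges between terms that are further apart. The resulting edge list could then be passed to a core decomposition to rank terms, which is one way degeneracy can assist the tasks listed above.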