Data Collection, Verification and Querying from Heterogeneous Data Sources on the Internet


In this project, a framework will be developed for collecting data from heterogeneous data sources on the Internet, measuring its consistency, automatically computing and indexing the accuracy of the collected data, and running fast and effective queries over the indexed data. Knowledge bases such as Wikipedia, Freebase, YAGO, Satori, and the Knowledge Graph focus mainly on text-based inference methods. In this project, by contrast, correlations will be established between data obtained from very different kinds of sources, such as social media content, tabular data sources, and Web content. Supervised learning methods will be used to determine the type of each data source.
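As an illustration of the source-type classification step, the sketch below trains a small supervised classifier that distinguishes social media posts, tabular dumps, and Web pages. The label set, the toy training snippets, and the choice of a TF-IDF plus logistic regression pipeline are assumptions made for this example; the project may use any other supervised method.

```python
# Minimal sketch of supervised source-type detection (illustrative assumptions:
# the three labels, the toy snippets, and the TF-IDF + logistic regression model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus: each snippet is tagged with the kind of source it came from.
train_texts = [
    "RT @user great match tonight #football",
    "name,age,city\nAlice,34,Ankara\nBob,29,Izmir",
    "<html><body><h1>Company history</h1><p>Founded in 1990.</p></body></html>",
    "just posted a photo @ Istanbul #travel",
    "id;product;price\n1;laptop;950\n2;phone;420",
    "<div class='article'><p>The minister announced new measures today.</p></div>",
]
train_labels = ["social_media", "tabular", "web",
                "social_media", "tabular", "web"]

# Character n-grams capture markup tags, delimiters, and hashtags that
# distinguish the three kinds of sources.
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

print(classifier.predict(["<html><p>Breaking news from the capital.</p></html>"]))
# likely output: ['web']
```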
With the framework to be developed, methods from the literature will be applied at the preprocessing stage to identify and remove the parts of English and Turkish texts that do not contribute to their semantic content, and new methods will be developed where the existing methods for Turkish prove insufficient. For Turkish, improvements are planned in stages such as synonym handling and reduction of words to their roots. The data collected from the Internet will be organized horizontally, covering all fields, rather than structured vertically. Morphological analysis will be performed on the data, and each information-bearing term group will be reduced to a (subject, predicate, object) triple. In addition, semantic matching will be carried out among the large volumes of collected data, and the degree of relatedness will be determined; as a result, the texts obtained as triples will be used to verify other texts. Within the project, new algorithms will be developed to reduce data to triples, verify their accuracy against Internet resources, perform morphological analysis of texts, and build an index.
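To make the triple-reduction idea concrete, the sketch below lemmatizes English text and extracts naive subject-verb-object triples from a dependency parse. spaCy's small English model stands in for the project's own morphological analyzers (the Turkish pipeline is not shown), and the extraction rule is an illustrative assumption rather than the project's algorithm.

```python
# Minimal sketch: lemmatization plus naive subject-verb-object triple extraction.
# spaCy and its small English model are assumptions used only for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def extract_triples(text: str):
    """Return (subject, predicate, object) triples found by a simple dependency pattern."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for subj in subjects:
                for obj in objects:
                    # Lemmas give the "reduced to the root" form of each term.
                    triples.append((subj.lemma_, token.lemma_, obj.lemma_))
    return triples

print(extract_triples("The company acquired a startup. Researchers published the results."))
# likely: [('company', 'acquire', 'startup'), ('researcher', 'publish', 'result')]
```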
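Similarly, the following sketch illustrates the matching-based verification idea: a candidate triple is compared against triples already collected from other sources, and the strength of the best match serves as a rough support score. The TF-IDF cosine similarity measure, the example triples, and the 0.5 threshold are all assumptions for illustration; the project will develop its own relatedness and verification algorithms.

```python
# Minimal sketch: verify a candidate triple against previously collected triples.
# The similarity measure and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_triples = [
    ("ankara", "be capital of", "turkey"),
    ("company_x", "acquire", "startup_y"),
]
candidate = ("ankara", "capital", "turkey")

def triple_text(triple):
    # Flatten a triple into one string so it can be vectorized like a short text.
    return " ".join(triple).replace("_", " ")

corpus = [triple_text(t) for t in known_triples] + [triple_text(candidate)]
vectors = TfidfVectorizer().fit_transform(corpus)

candidate_vec = vectors[len(known_triples)]
support = cosine_similarity(candidate_vec, vectors[:len(known_triples)])[0].max()
print(f"best support score: {support:.2f}",
      "-> supported" if support >= 0.5 else "-> unverified")
```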
