Guest Talk: Probabilistic Data Integration – Dealing with data quality issues in data integration

Wednesday, 20.12.2017, 11:00am

Location: RWTH Aachen University, Department of Computer Science - Ahornstr. 55, building E3, room 9u10


Speaker: Prof. Maurice van Keulen, University of Twente, NL


Probabilistic data integration is a specific kind of data integration in which integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. The approach is based on the view that data quality problems, as they occur in an integration process, can be modeled as uncertainty, and that this uncertainty is an important result of the integration process. In a sense, data quality problems arising during data integration are not solved immediately but are explicitly represented in the resulting integrated data. This data can be stored in a probabilistic database and queried directly, yielding possible or approximate answers. A probabilistic database is a specific kind of DBMS that allows storage, querying, and manipulation of uncertain data. It keeps track of alternatives and the dependencies among them.
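
As a rough illustration of "alternatives" and "possible answers", the following minimal Python sketch stores several mutually exclusive values per entity, each with a probability, and answers queries with probabilities instead of single facts. It is not taken from the talk: the data, the names (uncertain_city, possible_answers, prob_same_value), and the assumption that alternatives of different entities are independent are all hypothetical simplifications. A real probabilistic database would also track dependencies among alternatives.

```python
from collections import defaultdict

# Hypothetical integrated data: each entity has mutually exclusive
# alternatives (value, probability) left over from conflicting sources.
# Assumption: alternatives of different entities are independent.
uncertain_city = {
    "person_1": [("Aachen", 0.7), ("Enschede", 0.3)],
    "person_2": [("Enschede", 0.5), ("Aachen", 0.25), ("Delft", 0.25)],
}


def possible_answers(relation, entity):
    """Return every possible value for `entity` with its total probability."""
    totals = defaultdict(float)
    for value, prob in relation[entity]:
        totals[value] += prob
    return dict(totals)


def prob_same_value(relation, a, b):
    """Probability that entities a and b have the same value.

    Relies on the independence assumption stated above.
    """
    pa = possible_answers(relation, a)
    pb = possible_answers(relation, b)
    return sum(p * pb.get(value, 0.0) for value, p in pa.items())


if __name__ == "__main__":
    print(possible_answers(uncertain_city, "person_1"))
    print(prob_same_value(uncertain_city, "person_1", "person_2"))
```

The query does not force a decision about which city is correct. It reports all possible answers together with their probabilities, so that resolving the conflict can be postponed.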

While traditional data integration methods more or less explicitly treat uncertainty as a problem to be avoided, probabilistic data integration treats uncertainty as an additional source of information, which is precious and should be preserved. It effectively allows data integration problems to be resolved later rather than up front. When combined with an effective method for data quality measurement, it also has the potential to support a pay-as-you-go, good-is-good-enough approach in which small iterations reduce the overall effort of improving the data quality of the integrated result.

In this presentation, we give an overview of various data integration problems, such as entity resolution, merging of grouping data, and information extraction from natural language, and show how a probabilistic approach can help address them.


