# Probabilistic Databases Under Open World Assumptions

Probabilistic databases (PDBs) [1] are used to store, process and to query large amounts of uncertain (probabilistic) data as may arise, for example, in information extraction, sensory data or by using machine learning methods. Probabilistic databases are modeled using the so-called possible worlds semantics. Instead of a single relational database, a PDB describes a probability distribution over a collection of traditional instances. These individual instances are called the “possible worlds” of the probabilistic database. More formally, a PDB is a probability space over sets of relational structures.

Preliminary work on probabilistic databases dates back to the 80s and 90s with (among others) the seminal work of Cavallo and Pittarelli [2], of Barbará et al. [3] and of Fuhr and Röllecke [4]. The topic regained new interest in the mid-2000s with the work of Dalvi and Suciu (see [5]) who showed a notable dichotomy result for the computational complexity of a very natural class of queries.

To that point however, the concrete notions that were considered for modeling probabilistic databases mostly comprised the following two properties:

- the domain (resp. the union of all attribute domains) is of fixed, finite size; and
- the stochastic model is quite simple (using various kinds of independence assumptions).

Parallel and subsequent work proposed systems that may handle even continuous distributions (for example [6], [7]) and even introduced a continuous, formal semantics for tree-databases [8]. Still, the probabilistic data herein initially follows a closed world semantics. That is, every possible database record that has no explicitly specified positive probability, is assumed to be almost surely (i. e. with probability 1) false. However, as Ceylan et al. [9] pointed out, moving towards an open world assumption is more reasonable in a lot of application scenarios. They suggest specifying a probability interval for all “unspecified” facts that can be built using the underlying domain. However, their approach is limited to finite domains and the assumption of stochastic independence of individual tuples. While this is no real restriction in a finite setting, it is unclear, whether and how their method can be extended to support infinite domains.

The goal of this research project is to build and sharpen the theoretical foundations for probabilistic databases with infinite universes and rich correlations as well as explore the possibilities of performing (approximate) open-world query evaluation in such a model.

So far, there are no immediate intersections with other dissertation topics of the RTG.