Christoph Standke: Probabilites in Database Queries: Power and Complexity



Christoph Standke

Associated doctoral researcher


+49 241 80-21716



Uncertainty plays a central role in modern database systems. It arises from many use cases, such as data integration, noisy data, data from unreliable sources or randomized processes. To obtain a quantitative framework to handle uncertainty, it is modelled via probability distributions.

Best studied in this area is the notion of finite probabilistic databases (PDBs), which are finite probability spaces over database instances. For finite PDBs, the questions of representation and query evaluation under set semantics are well understood. We extend this knowledge in different directions:

Tuple-Independent Representations of Infinite Probabilistic Databases We systematically study the representability problem for infinite PDBs by means of tuple-independence and first-order views. Although first-order views over tuple-independent PDBs are not a complete representation system for infinite PDBs, they form a fairly robust class: Adding first-order constraints does not give them additional expressive power, and they cover many relevant special cases such as block-independent disjoint PDBs, and PDBs of bounded instance size. We identify criteria for representability (or non-representability) in this class and explore their limits.

Probabilistic Query Evaluation with Bag Semantics We study probabilistic query evaluation under bag semantics where tuples are allowed to be present with duplicates. Due to potentially unbounded multiplicities, the bag probabilistic databases we consider are no longer finite objects, which requires a treatment of representation mechanisms. Moreover, the answer to a Boolean query is a probability distribution over non-negative integers, rather than a probability distribution over {true, false}. Therefore, we explore two flavors of probabilistic query evaluation: computing expectations of answer tuple multiplicities, and computing the probability that a tuple is contained in the answer at most k times for some parameter k.

Furthermore, we also consider other sources of randomness than distributions over database instances:

The Importance of Parameter Values in Database Queries We propose and study a framework for quantifying the importance of the choices of parameter values to the result of a query over a database. These parameters occur as constants in logical queries such as conjunctive queries. This quantity is the SHAP score of the individual parameter values, as previously applied to feature values in machine-learning models. The application of the SHAP score requires two components in addition to the query and the database: (a) a probability distribution over the combination of parameter values, and (b) a utility function that measures the distance between the result for the original parameters and the result for hypothetical parameters. The main question we investigate is the complexity of calculating the SHAP score for different distributions and distances.