Loading…

In-database batch and query-time inference over probabilistic graphical models using UDA–GIST

To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to im...

Full description

Saved in:

Bibliographic Details
Published in:	The VLDB journal 2017-04, Vol.26 (2), p.177-201
Main Authors:	Li, Kun, Zhou, Xiaofeng, Wang, Daisy Zhe, Grant, Christan, Dobra, Alin, Dudley, Christopher
Format:	Article
Language:	English
Subjects:	Computer Science Database Management Probabilistic models Regular Paper
Citations:	Items that this one cites Items that cite this one
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms where most of the work is performed by iterative transitions over a large state that cannot be naively partitioned due to data dependency. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA–GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications: cross-document coreference, image denoising and one query-time inference application: marginal inference queries over probabilistic knowledge graphs. The 3 applications use probabilistic graphical models, which encode complex relationships of different variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a 27 times larger problem than a scalable distributed solution for the first application and achieves 43 times speedup over the state-of-the-art for the second application. For the third application, we implement query-time inference using the UDA–GIST framework and apply over a probabilistic knowledge graph, achieving 10 times speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over large probabilistic knowledge base. We show that the UDA–GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.
ISSN:	1066-8888 0949-877X
DOI:	10.1007/s00778-016-0446-1