dblinkR is an R interface for
dblink, an Apache Spark package
for performing unsupervised entity resolution. It implements a
generative Bayesian model for entity resolution called blink (Steorts
2015), with extensions proposed by Marchant et al. (2021). Unlike many
entity resolution methods, dblink approximates the full posterior
distribution over the linkage structure. This facilitates propagation
of uncertainty to post-entity resolution analysis, and provides a
framework for answering probabilistic queries about entity membership.
dblinkR is not currently available on CRAN. The latest development
version can be installed from source using devtools as follows:

library(devtools)
install_github("ngmarchant/dblinkR")

dblinkR depends heavily on the sparklyr R interface for Apache
Spark. Please refer to the sparklyr
website for information about connecting
to a Spark deployment.
dblinkR currently supports Spark releases in the 2.3.x series and
2.4.x series. Spark releases prior to 2.3.x are not supported.
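
As a rough illustration of the version constraint above, the following sketch checks a Spark version string before opening a local connection with sparklyr. The helper names `is_supported_spark` and `connect_local_spark` are hypothetical and not part of dblinkR, and the default version `"2.4.3"` is only an example. It assumes that version of Spark is already installed locally (for instance via `sparklyr::spark_install()`).

```r
# Hypothetical helper (not part of dblinkR): returns TRUE if `version`
# falls in the Spark 2.3.x or 2.4.x series.
is_supported_spark <- function(version) {
  v <- package_version(version)
  v >= "2.3.0" && v < "2.5.0"
}

# Hypothetical wrapper: refuses unsupported versions, then opens a local
# Spark connection through sparklyr. The default version is an assumption.
connect_local_spark <- function(version = "2.4.3") {
  if (!is_supported_spark(version)) {
    stop("Spark ", version, " is outside the 2.3.x and 2.4.x series")
  }
  sparklyr::spark_connect(master = "local", version = version)
}
```

Calling `sc <- connect_local_spark()` would then return a sparklyr connection, and `sparklyr::spark_version(sc)` can be used to confirm which Spark release is actually running.
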
The RLdata500 vignette demonstrates how to
use dblinkR to perform entity resolution for a small synthetic data
set. This example is small enough to run on a laptop (Spark cluster not
required).
dblinkR is licensed under GPL-3.
Marchant, Neil G., Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, and Rebecca C. Steorts. 2021. “d-blink: Distributed End-to-End Bayesian Entity Resolution.” Journal of Computational and Graphical Statistics 30 (2): 406–21. https://doi.org/10.1080/10618600.2020.1825451.
Steorts, Rebecca C. 2015. “Entity Resolution with Empirically Motivated Priors.” Bayesian Analysis 10 (4): 849–75. https://doi.org/10.1214/15-BA965SI.