"Distributed Methods" refers to data operations that can be implemented using map-reduce architecture with each node in the network performing data operations independently, orchestrated by a central system that manages intermediate and final results ("aggregator", or "mapper" or "oracle"). Methods include Analysis Methods (e.g. regression, SVM), Record Linkage Methods (a.k.a EMPI), Data Transformation, and Data Profiling Methods.
Development often requires adapting existing APIs or services to a distributed architecture (e.g. OHDSI WebAPI, R and/or SAS code for fitting generalized linear models).
For a method to be implemented successfully in a distributed framework, it must separate the components of computation that will occur in different parts of the architecture.
The current approach is that any method that can be included in a pSCANNER Protocol and proposed to a subnetwork in a PMN Scanner Study Request Type must be broken down into 4 component resources; all of the Distributed Methods should assume this model.
- Server-side query parameterization & UI (currently integrated with the PMN portal - NewAnalysis.cs)
- Client-side computations that need to be executed by each site. In the PMN framework, this is managed by the DMC; the pSCANNER Adapters live here.
- "Aggregator" code that specifies what, if anything, needs to be computed on combined resulst for all sites participating in a distributed query. This may be simple pass-through for display, concatenation, or one-time compuation, as well as more complex operations - the Aggregating.cs PMN code handles the queueing and DMC result collection currently.
- Server-side result display (currently not tightly integrated with the PMN portal, so it could route to a Shiny server, for example)
Technically, any distributed method must specify:
- What data input formats are expected for components 1-4
- What data output formats are expected for components 1-4
- Requirements and platforms/semantics/models for each step (a descriptor sketch follows this list).
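As a working sketch (the field and component names here are hypothetical, not an established pSCANNER schema), a distributed method could be declared as four component resources, each with its expected input/output formats and platform requirements:

```python
# Hypothetical descriptor for the four component resources of a distributed
# method; the field and component names are illustrative only.
from dataclasses import dataclass

@dataclass
class MethodComponent:
    name: str           # e.g. "query-parameterization", "client", "aggregator", "result-display"
    input_format: str   # expected data input format for this component
    output_format: str  # expected data output format for this component
    platform: str       # requirements and platform/semantics/model for the step

@dataclass
class DistributedMethod:
    name: str
    components: list

glm = DistributedMethod(
    name="horizontally-partitioned-glm",
    components=[
        MethodComponent("query-parameterization", "user selections (UI)", "JSON parameter set", "PMN portal (NewAnalysis.cs)"),
        MethodComponent("client", "JSON parameter set + local CDM tables", "local error and var-cov matrix", "site DMC / pSCANNER adapter"),
        MethodComponent("aggregator", "per-site outputs", "updated coefficients or final model", "Aggregating.cs queueing + analysis engine"),
        MethodComponent("result-display", "final model object", "rendered HTML", "portal or external Shiny server"),
    ],
)
```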
Note that the PFA model calls component #3 (the aggregator) a "Folding Engine" and splits it into "Tally" actions and "Merging". Currently, the PMN-pSCANNER approach does not initiate Merging until all parallelized outputs have been obtained by the PMN-pSCANNER Aggregating.cs function and subsequently released to the DMC responsible for the "Merging" operations.
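A minimal sketch of that tally/merge split (plain Python, not actual PFA) - per-site tally outputs are only merged once every site has responded:

```python
# Sketch of the tally/merge split described by the PFA "Folding Engine".
# In the current PMN-pSCANNER flow, tally outputs are queued by
# Aggregating.cs and merge() runs only after all sites have responded.
def tally(local_records):
    """Per-site action: reduce local data to a small partial result."""
    return {"n": len(local_records), "sum": sum(local_records)}

def merge(partials):
    """Aggregator action: combine the per-site partial results."""
    n = sum(p["n"] for p in partials)
    total = sum(p["sum"] for p in partials)
    return {"n": n, "mean": total / n if n else None}

site_outputs = [tally([1.0, 2.0, 3.0]), tally([4.0, 5.0])]  # collected by the portal queue
print(merge(site_outputs))                                   # run only after all sites respond
```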
These conventions need to be specified clearly as we work through the different use cases. More detail below...
Methods follow a basic input -> {action} -> output model. Inputs and outputs are typically datasets (or arrays); actions are typically computations that may be user-defined functions or functions built into existing computing platforms. In distributed methods, actions are distributed over the network. In many cases a high-level action is parallelized and decomposed into component computations. Most distributed methods have two roles corresponding to different categories of actions: one role handles consolidation of the parallelized outputs (the "aggregator") and the other implements operations on locally managed data (the "client").
In order to facilitate platform independence for analysis methods, we are trying to observe the syntax and specification formats developed for PMML: http://dmg.org. A network of heterogeneous PMML or PFA consumers (e.g. SPSS, R, SAS, KNIME) can collaborate to estimate any generalized linear model (or any model with a convex error function) using gradient descent or Newton-Raphson methods orchestrated by an "aggregator", so only the aggregator node needs the user-defined functions for iteratively updating the coefficients until the error converges. We might consider adopting some of the conventions and requirements in PFA. In particular, PFA models "aggregator" functions as what it calls a Folding Engine, which allows us to clearly articulate that the PFA tally activities are handled by PMN, whereas merge activities operate on the collected results. A minimal sketch of this orchestration follows the link below.
http://dmg.org/pfa/docs/document_structure/
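The following sketch (Python/NumPy, not the pSCANNER or PMML/PFA implementation) shows the orchestration for a horizontally partitioned logistic regression fit by Newton-Raphson: each client returns only its local gradient and Hessian, and the aggregator sums them and updates the coefficients until convergence, so row-level data never leaves a site.

```python
# Sketch only: distributed Newton-Raphson for logistic regression.
import numpy as np

def client_step(X, y, beta):
    """Runs at each site: local gradient and Hessian for the current coefficients."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    return grad, hess

def aggregator_fit(sites, n_features, tol=1e-8, max_iter=25):
    """Runs at the aggregator: Newton updates on the summed site contributions."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        parts = [client_step(X, y, beta) for X, y in sites]  # in practice, remote calls via PMN
        grad = sum(g for g, _ in parts)
        hess = sum(h for _, h in parts)
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:  # convergence: release results for display
            break
    return beta

# Toy example: two "sites" holding different rows of the same design.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1])))).astype(float)
sites = [(X[:100], y[:100]), (X[100:], y[100:])]
print(aggregator_fit(sites, n_features=2))
```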
While the PMML model includes transformations, data transformations and processing steps are not as well represented in PMML expressions, and we may opt to extend them with expressions/UDFs from another standard (e.g. HQMF/CQF).
The pSCANNER-PMN system enables users to create sub-groups of users and resources that can collaborate on a range of data operations (methods) in an ongoing way. For example, a subnetwork containing only the UC System might approve cohort discovery methods. All of the PRIME participants in pSCANNER might approve mutual sharing of quality measurement.
There are typically several steps in creating an end product that can be published, including steps required for implementing a common data model for cohort discovery purposes. A workflow specification will be a resource in its own right; the SCANNER and pSCANNER-PMN systems include user interfaces for specifying these workflows, including approval steps.
The project approval workflow involves obtaining approval from each entity participating in a networked protocol. This includes specification of what data resources are required (e.g. OMOP CDM; bariatric surgery dataset) and what sequence of operations (e.g. dataset extraction; logistic regression) is required. This step also allows specification of the privacy level of the input and output data for each step, including outputs that are transferred centrally (a protocol-step sketch follows the list below). (SOMEDAY: Ideally these would be directly linked with local and/or central IRBs and ultimately patient-level approvals). Of particular note:
- Operations that change privacy levels
- Operations that create new resources for future use (e.g. creating a DB view or data set)
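A hypothetical sketch of what the approval workflow might review for each protocol step (the field names are illustrative, not a pSCANNER schema): required resources, the operation sequence, and the privacy level of each step's inputs and outputs, including whether a step changes the privacy level or creates a new resource.

```python
# Illustrative protocol-step specification; field names are hypothetical.
protocol = {
    "name": "bariatric-surgery-outcomes",
    "required_resources": ["OMOP CDM", "bariatric surgery dataset"],
    "steps": [
        {"operation": "dataset extraction",
         "input_privacy": "identified", "output_privacy": "de-identified",
         "changes_privacy_level": True, "creates_resource": True,   # e.g. a DB view or data set
         "output_transferred_centrally": False},
        {"operation": "logistic regression",
         "input_privacy": "de-identified", "output_privacy": "aggregate",
         "changes_privacy_level": True, "creates_resource": False,
         "output_transferred_centrally": True},
    ],
}
```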
The original SCANNER protocol specification includes specification of asynchronous and synchronous operations for each step in the protocol (PMN "attended" and "unattended" operations). The PopMedNet framework, as currently implemented, requires that each site create a separate data mart resource for data operated upon via "unattended" and "attended" mechanisms. The SHRINE approach to networking i2b2 data marts instantiates all i2b2 databases as "unattended".
The UX on this for the pSCANNER-PMN Portal needs to be worked out.
- Data Input(s)
- Data Operations and Parameters
- Data Output(s)
TODO: Create a list of required metadata elements for each resource indexed in pSCANNER Distributed Methods.
- Data -Definitions- could be single datasets using metadata standards or reference models with known specifications.
- Data -Operations- (input data dictionary or data type, output data dictionary or type, environment requirements)
TODO: Consider use of the FHIR Task model to pipe and chain data operations together to complete an analysis (a rough sketch follows).
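A rough sketch of what such chaining could look like, loosely modeled on the FHIR Task resource's input/output and partOf elements; the value formats, IDs, and type strings are hypothetical, not a worked-out FHIR profile.

```python
# Hypothetical chaining of data operations, loosely in the spirit of FHIR Task.
extract_task = {
    "resourceType": "Task",
    "id": "extract-cohort",
    "status": "requested",
    "input": [{"type": "data-definition", "value": "OMOP CDM condition_occurrence"}],
    "output": [],  # filled in when the site DMC completes the extraction
}

analyze_task = {
    "resourceType": "Task",
    "id": "fit-regression",
    "status": "requested",
    "partOf": [{"reference": "Task/extract-cohort"}],            # chains onto the extraction task
    "input": [{"type": "dataset", "value": "#extract-cohort.output[0]"}],  # illustrative reference format
    "output": [],
}
```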
A desired feature of the pSCANNER Architecture is support for heterogeneous data models and software environments. There are different ways of meeting this requirement.
We could use PFA and/or PMML as inputs to a central service for code generation, or we could contribute/extend packages or stored procedures for existing systems (e.g. the RPMML package on CRAN).
Natural language specifications as well as platform-specific technical/executable programs are required.
- Maintain independent versions of programs with different stewards (e.g. a Postgres version and an Oracle version). This approach is the most likely to get out of sync across implementations, but it is also the easiest to understand. This is what we do with all of our ETL from source systems to CDMs and between different CDMs.
- Code Generation from a Platform-Specific Implementation - Maintain a reference file in one format, and use software that converts from one platform/dialect to another (e.g. LINQ, many different R-based analysis frameworks).
- Code Generation from a Platform-Independent Specification - Maintain a platform-independent specification for operations and employ translation software for specific environments (e.g. converting XML or JSON specifications to SQL, converting PMML into R or SAS).
- This approach is most attractive in cases where user interfaces exist for authoring programs (e.g. i2b2, Measure Authoring Tools, etc.).
- It is also attractive for creating multi-step workflows where operations might involve transformations in one environment and analysis in another, for example.
Both approaches 2 and 3 lend themselves to implementation as web services and the possibility of "on the fly" code generation. In most cases, logical operations that are more complex than boolean logic or common database expressions will require creating or invoking stored procedures (e.g. "logistic regression"). In the case of healthcare-specific components, concepts like "length of stay" might be easier to implement programmatically as stored procedures in the different RDBMS or statistical platforms. We need some kind of validation framework and system for maintaining stored procedures for each platform and keeping refactoring in sync. Each local resource will need to be validated before queries are run.
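As an illustration of approach 3, the toy generator below turns a small platform-independent (JSON-style) criteria specification into SQL and dispatches anything beyond simple boolean/comparison logic to a per-platform stored procedure. The table, field, and procedure names are invented for the example, not pSCANNER conventions.

```python
# Toy on-the-fly code generation from a platform-independent specification.
STORED_PROCS = {
    "length_of_stay": {"postgres": "scanner.length_of_stay", "oracle": "SCANNER_PKG.LENGTH_OF_STAY"},
}

def criteria_to_sql(spec, dialect="postgres"):
    clauses = []
    for c in spec["criteria"]:
        if c["op"] in ("=", ">", "<", ">=", "<="):
            clauses.append(f"{c['field']} {c['op']} {c['value']!r}")
        else:  # beyond boolean/comparison logic: invoke a per-platform stored procedure
            proc = STORED_PROCS[c["op"]][dialect]
            clauses.append(f"{proc}({c['field']}) {c.get('test', '')}".strip())
    joiner = f" {spec.get('logic', 'AND')} "
    return f"SELECT person_id FROM {spec['table']} WHERE " + joiner.join(clauses)

spec = {"table": "omop.condition_occurrence", "logic": "AND",
        "criteria": [{"field": "condition_concept_id", "op": "=", "value": 201826},
                     {"field": "visit_occurrence_id", "op": "length_of_stay", "test": "> 3"}]}
print(criteria_to_sql(spec, dialect="postgres"))
```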
There are a number of ways in which information models describing the underlying data can be specified. We use these information models to create the inputs to data operations. (TODO: Explain all of this better).
A FHIR Data Resource Profile, a Data Dictionary, Metadata, etc. are all ways that data can be exposed for analysis; a data-dictionary sketch follows the list below.
- Variables (or table columns, fields, or in ML "features")
- Value Sets (or "ontologies" - this is the simplest type of data operation).
- Derivation and processing rules in natural language as well as implementation programs
- Privacy level
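A hypothetical data-dictionary entry illustrating these elements; the names, codes, and derivation expressions are illustrative only.

```python
# Illustrative data-dictionary entry exposing a dataset for analysis.
data_definition = {
    "dataset": "bariatric_surgery_cohort",
    "privacy_level": "de-identified",
    "variables": [
        {"name": "age_at_surgery",
         "type": "integer",
         "value_set": None,
         "derivation_text": "Years between birth date and index surgery date, truncated.",
         "derivation_code": "FLOOR(DATEDIFF(year, birth_date, surgery_date))"},
        {"name": "procedure_code",
         "type": "code",
         "value_set": "bariatric surgery procedure value set (illustrative)",
         "derivation_text": "Source procedure code mapped to the shared value set.",
         "derivation_code": None},
    ],
}
```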
The user experience for groups of users needs to be ironed out in PMN. The permissioning system is very flexible, but therefore also difficult to manage. It would be attractive to use InCommon for some of this.
The conventions we use and the structure of our repositories should adhere to a shared understanding of these approaches. Ideally, we would have validation procedures to ensure conformance to metadata standards.
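A minimal sketch of such a conformance check, assuming a data-definition structure like the one sketched above and a required-element list that is still to be settled by the TODO above:

```python
# Illustrative conformance check; the required element names are placeholders.
REQUIRED_METADATA = {"dataset", "privacy_level", "variables"}
REQUIRED_VARIABLE_FIELDS = {"name", "type", "derivation_text"}

def validate_data_definition(defn):
    """Return a list of conformance problems; an empty list means it validates."""
    problems = [f"missing element: {k}" for k in REQUIRED_METADATA - defn.keys()]
    for i, var in enumerate(defn.get("variables", [])):
        for k in REQUIRED_VARIABLE_FIELDS - var.keys():
            problems.append(f"variable {i}: missing field: {k}")
    return problems

example = {"dataset": "bariatric_surgery_cohort", "privacy_level": "de-identified",
           "variables": [{"name": "age_at_surgery", "type": "integer"}]}
print(validate_data_definition(example))  # reports the missing derivation_text field
```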
These are some use cases and how each node in the computation operates. Note that the Portal is used only for rendering HTML, queueing, and routing, and that the OHDSI API may be very slow and require that sites run in "unattended" mode because the API makes multiple calls to the OMOP database; some visualization methods, however, will be able to run on a single de-identified extract. (A sketch of the record-linkage matching step follows the table.)
| Use Case | Locus | Role/Component | Computation |
|---|---|---|---|
| Horizontally Partitioned Regression | Portal | Protocol Specification | Render HTML for analysis parameters (sites, data variables & model specifications/parameters) |
| Horizontally Partitioned Regression | Data Site DMC | Model Scoring | Fits each iteration of the model to local data, returns error and var-cov matrix |
| Horizontally Partitioned Regression | Aggregator/HB DMC | Model Parameter Estimation/Iteration | Retrieves the error from each locally scored model, computes new coefficients/parameters on each iteration using the IRLS algorithm, and on convergence sends results for display |
| Horizontally Partitioned Regression | Portal | Invoke Result Display Service | Obtains Final Message with model, displays results |
| Privacy Preserving Record Linkage | Data Site | Preparation | Offline/in advance, create encrypted dataset with shared key and register it to the DMC (maybe compression too?) |
| Privacy Preserving Record Linkage | Portal | Protocol Specification | Render html for linkage parameters (identification of pre-encrypted data set, threshold?, subnetwork) |
| Privacy Preserving Record Linkage | Data Site DMC | Retrieval | Retrieve specified data set and send records to portal for queueing (maybe compression too?) |
| Privacy Preserving Record Linkage | Portal | Queueing | Wait for all sites to respond, route to aggregator node |
| Privacy Preserving Record Linkage | Aggregator DMC | Matching | (decompress), Run PPRL, assign identifier, probability, return result data set to portal with local and global IDs and metadata for routing |
| Privacy Preserving Record Linkage | Portal | Routing | Route identifiers to correct sites (this may require updates to current PMN portal code here) |
| Privacy Preserving Record Linkage | Data Site DMC | Local Update | Update local data set with network wide identifiers, send "completed" response to portal |
| De-ID Data Visualization* | Portal | Protocol Specification | Select data visualization method, parameters, and subnetwork |
| De-ID Data Visualization | Data Site DMC | Extraction | Extract De-ID dataset and return to portal (compression?) |
| De-ID Data Visualization | Portal | Queuing | Wait for all sites to respond, send to aggregator node |
| De-ID Data Visualization | Aggregator/HB DMC | Merge and Prep | Merge Data Sets, prepare for display, return to portal for visualization |
| De-ID Data Visualization | Portal | Invoke Display Service | Render results (currently a link out) |
| OHDSI/Multi-Query* | Portal | Protocol Specification | Render OHDSI GUI (or container), route each API call to Data Sites |
| OHDSI/Multi-Query | Data Site DMC | Local Computation | Invoke OHDSI API, retrieve results, return results |
| OHDSI/Multi-Query | Portal | Queueing | Wait for all sites to respond (or timeout?), send to aggregator |
| OHDSI/Multi-Query | Aggregator | Merge and Prep | Take list of results, combine into a single result in OHDSI format, route to portal |
| OHDSI/Multi-Query | Portal | Invoke Display Service | Route results to OHDSI GUI and display |
[Two use cases for visualization/data exploration - one based on a de-identified data set where interactivity does not require multiple queries to source data, the other where interactivity requires multiple queries of source data (OHDSI)]
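Below is a sketch of one simple flavor of the privacy preserving record linkage flow in the table above - keyed hashing of normalized linkage fields with a shared key - which is not necessarily the matching method pSCANNER uses. It only illustrates how sites can pre-encrypt data offline and how the aggregator can assign network-wide identifiers without seeing raw identifiers.

```python
# Illustrative PPRL sketch: keyed hashing at the sites, matching at the aggregator.
import hmac, hashlib

SHARED_KEY = b"distributed-out-of-band"  # illustrative; never hard-code a real key

def encrypt_record(local_id, name, dob):
    """Site-side preparation: keyed hash of normalized linkage fields."""
    token = f"{name.strip().lower()}|{dob}".encode()
    digest = hmac.new(SHARED_KEY, token, hashlib.sha256).hexdigest()
    return {"local_id": local_id, "hash": digest}

def match(site_batches):
    """Aggregator-side matching: the same hash across sites gets the same global ID."""
    global_ids, links = {}, []
    for site, records in site_batches.items():
        for rec in records:
            gid = global_ids.setdefault(rec["hash"], f"G{len(global_ids) + 1}")
            links.append({"site": site, "local_id": rec["local_id"], "global_id": gid})
    return links  # routed back so each site can update its local data set

site_a = [encrypt_record("A-001", "Ada Lovelace", "1815-12-10")]
site_b = [encrypt_record("B-042", "ada lovelace ", "1815-12-10")]
print(match({"site_a": site_a, "site_b": site_b}))  # both records map to the same global ID
```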