Chapter 2 Introduction

2.1 Assumptions

It is assumed that the reader is familiar with the basic concepts and motivations around using DataSHIELD for federated analysis. More information about this can be found here.

Knowledge of R is also assumed. For harmonisation, knowledge of JavaScript and/or MagmaScript is needed. For analysis, an understanding of DSLite is needed.

(TO DO add some more links to helpful stuff here)

2.2 Motivation

The benefits of the federated approach are that data can be harmonised and analysed without giving complete access to or transferring the data. A less desirable consequence of this approach is that it is more challenging for data experts to harmonise data to common standards, and for analysts to run their scripts, when the data are not tangibly in front of them. One could say that it is like trying to build a Lego model while wearing a blindfold.

With harmonisation, some groups have been through the process of transferring and centralising the data, but this negates one of the benefits of the federated approach. Although only the harmonisation team needs to receive the data, the transfer still involves a lot of bureaucracy. Others have mandated that each group harmonise their own data. The challenge with this approach is that different teams can apply the process inconsistently, and each team needs training and expertise in harmonisation.

Analysis via DataSHIELD avoids such compromises, but it requires the analyst to make their own checks to validate that their analysis is progressing as planned. These checks have to rely on non-disclosive information about the data that the analysis has generated. For example, to confirm that a subset into male and female groups has been successful, the analyst could ask for a summary of the original gender column and check that the counts of male and female participants match the lengths of the subset data frames. These extra steps are manageable, but it can be trickier to confirm the behaviour of more complex functions such as ds.lexis and ds.reShape.
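The subset check described above can be sketched in base R. This is only an illustration of the logic, run on a hypothetical local data frame: in a real DataSHIELD session the analyst would not hold the rows, and the counts would instead come from aggregate client functions (for example ds.table and ds.dim in dsBaseClient, named here as an assumption about the installed version).

```r
# Hypothetical data standing in for a server-side table. In DataSHIELD the
# analyst would never see these rows, only the aggregate counts used below.
set.seed(1)
D <- data.frame(
  id     = 1:200,
  gender = sample(c("female", "male"), 200, replace = TRUE)
)

# Step under test: split the table into two subsets.
D_female <- D[D$gender == "female", ]
D_male   <- D[D$gender == "male", ]

# Non-disclosive aggregates the analyst could request.
gender_counts <- table(D$gender)
subset_rows   <- c(female = nrow(D_female), male = nrow(D_male))

# The check: each subset length must equal the matching category count,
# and together the subsets must account for every row.
check_ok <- all(subset_rows == as.vector(gender_counts[names(subset_rows)])) &&
  sum(subset_rows) == nrow(D)
print(check_ok)
```

The same comparison of category counts against subset dimensions applies regardless of how the aggregates are obtained.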

2.3 Hypothesis for using synthetic data

R packages like synthpop (Nowok, Raab, and Dibben 2016) have been developed to generate realistic synthetic data that are not disclosive. A dsSynthpop package could be used to run the generation on the server side and return the synthetic data set to the client side. Users can then perform harmonisation with full access to the synthetic data on the client, to confirm that their algorithms are working as expected. When the user is happy that the algorithms are working correctly, they can be applied to the real data on the server side. The user therefore has the benefit of being able to see the data they are working with, but without the need to go through laborious data transfer processes. The same benefits are realised for an analysis user.

Other packages that provide synthetic data generation are simstudy and gcipdr. With simstudy, the user must define the characteristics of the variables and their relationships; non-disclosive access via DataSHIELD can provide the summary statistics needed for this. There is also the benefit that the user then has precise control over the nature of the synthetic data generated. Likewise, gcipdr makes it easy for users to extract features such as means, standard deviations and correlations via DataSHIELD, and to use these for a more automated generation of the synthetic data. In dsSynthetic we provide functionality built on simstudy as it is more mature, has simpler dependencies and is faster. The compromise is that gcipdr should provide more accurate results, as it was designed to produce synthetic data from which actual inferences can be drawn as if from the real data. However, for our purposes we only want synthetic data that are realistic enough to write harmonisation code and plan analysis code: this work is then applied to the real data to get the inferences.
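As a minimal sketch of the simstudy approach, the example below declares three variables from summary statistics and generates a synthetic data set. The variable names, means, variances and regression coefficients are all hypothetical values chosen for illustration; in practice they would be read from non-disclosive DataSHIELD outputs. This shows plain simstudy usage and is not necessarily how dsSynthetic calls it internally.

```r
library(simstudy)

# Hypothetical summary statistics, as they might be read off non-disclosive
# DataSHIELD outputs. These are not real study values.
age_mean    <- 45
age_var     <- 144
prop_female <- 0.52

# Declare each variable and how it relates to the others.
def <- defData(varname = "age", dist = "normal",
               formula = age_mean, variance = age_var)
def <- defData(def, varname = "female", dist = "binary",
               formula = prop_female)
# bmi depends linearly on age and sex; the coefficients are illustrative only.
def <- defData(def, varname = "bmi", dist = "normal",
               formula = "22 + 0.05 * age - 0.8 * female", variance = 9)

# Generate 500 synthetic rows; genData adds an `id` column automatically.
set.seed(2021)
synth <- genData(500, def)
head(synth)
```

Because each variable is declared explicitly, the user keeps precise control over distributions and dependencies, at the cost of having to specify them by hand.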

2.4 Overview of steps

The hypothesis can be described by the following steps for harmonisation:

  1. The data custodian uploads the raw data to the server side and installs the server-side package dsSynthetic.
  2. The user installs the package dsSyntheticClient on the client side.
  3. The user calls functions in the dsSyntheticClient package to generate a synthetic but non-disclosive data set which is returned to the client side.
  4. With the synthetic data on the client side, the user can view the data and build harmonisation algorithms. They will be able to see the results of the algorithms for each row of data.
  5. When the algorithms are complete, they can be implemented on Opal using the real data.

Figure 2.1: Central harmonisation via synthetic data without full access
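To illustrate step 4 of the harmonisation workflow, the sketch below prototypes two harmonisation rules in R on a small synthetic table and inspects the result row by row. The column names, codings and target variables are hypothetical, and prototyping in R before porting the rules to Opal is an assumption made for this example rather than a prescribed workflow.

```r
# Hypothetical synthetic extract; column names and codings are invented for
# illustration and do not come from any real study.
synth <- data.frame(
  id        = 1:6,
  height_cm = c(172, 158, 181, 165, NA, 190),
  weight_kg = c(70, 54, 95, 61, 80, NA),
  smoke_raw = c("never", "current", "ex", "current", "never", "ex")
)

# Rule 1: derive BMI in kg/m^2 from height in cm and weight in kg.
# Missing inputs propagate to a missing BMI.
synth$bmi <- round(synth$weight_kg / (synth$height_cm / 100)^2, 1)

# Rule 2: map the study-specific smoking coding to a common binary
# target variable (1 = current smoker, 0 = otherwise).
synth$smoker <- ifelse(synth$smoke_raw == "current", 1L, 0L)

# Row-level inspection, which is possible only because the data are synthetic.
print(synth)
```

Seeing every row makes edge cases such as missing heights or unexpected category labels visible before the rules are applied to the real data.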

And for analysis:

  1. Assuming steps #1 and #2 above are complete, the user calls functions in the dsSyntheticClient package to generate a synthetic but non-disclosive data set which is returned to the client side.
  2. With the synthetic data on the client side, the user then starts a DSLite instance and places the synthetic data into it.
  3. The user can then write their analysis using DataSHIELD commands against the DSLite instance. DSLite allows the user to retrieve any object from the server side, so users can see the results of each step of their script for each row of data.
  4. When the analysis script is complete, the user can run it against the real data on the server side.

Figure 2.2: Prototyping DataSHIELD analysis using synthetic data on DSLite
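The analysis steps above can be sketched as follows. The synthetic table here is a hypothetical stand-in generated locally, and the server name "sim" and symbol "D" are arbitrary choices. The example assumes that DSLite, DSI, dsBaseClient and the server-side package dsBase are all installed locally, and that it is run at the top level of an R session so that DSLite can find the server object by name.

```r
library(DSLite)
library(DSI)
library(dsBaseClient)

# Hypothetical stand-in for the synthetic data returned to the client.
set.seed(42)
synth <- data.frame(
  age    = round(rnorm(100, mean = 45, sd = 12)),
  female = rbinom(100, size = 1, prob = 0.5)
)

# Start an in-memory DataSHIELD server that holds the synthetic table.
dslite.server <- newDSLiteServer(
  tables = list(synth = synth),
  config = defaultDSConfiguration(include = c("dsBase"))
)

# Log in as for a remote server; `url` is the name of the R object above.
builder <- newDSLoginBuilder()
builder$append(server = "sim", url = "dslite.server",
               table = "synth", driver = "DSLiteDriver")
conns <- datashield.login(builder$build(), assign = TRUE, symbol = "D")

# Ordinary DataSHIELD client calls run against the synthetic data.
ds.dim("D", datasources = conns)
ds.mean("D$age", datasources = conns)

# Unlike a real server, DSLite lets the user pull back the server-side object.
server_side <- getDSLiteData(conns, "D")
head(server_side$sim)

datashield.logout(conns)
```

Once the script behaves as expected, the same client calls can be pointed at connections to the real servers instead of the DSLite instance.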

2.5 Prerequisites

Using DataSHIELD also requires some R packages to be installed on the client side. So far, the following R packages must be installed (in their development version):

install.packages("DSOpal", dependencies = TRUE)
install.packages("dsBaseClient", repos = c("https://cloud.r-project.org", "https://cran.obiba.org"), dependencies = TRUE)
devtools::install_github("tombisho/dsSyntheticClient", dependencies = TRUE)
install.packages("simstudy")

The package dependencies are then loaded as follows:

library(DSOpal)
## Loading required package: opalr
## Loading required package: httr
## Loading required package: DSI
## Loading required package: progress
## Loading required package: R6
library(dsSyntheticClient)

References

Nowok, Beata, Gillian M. Raab, and Chris Dibben. 2016. "synthpop: Bespoke Creation of Synthetic Data in R." Journal of Statistical Software 74 (11). https://doi.org/10.18637/jss.v074.i11.