Chapter 6 Generating data using simstudy methods

In this chapter we will look at how to generate synthetic data on the client side using DataSHIELD functions to extract summary characteristics of the data set from the server side. These parameters are then used with the simstudy package to generate the synthetic data. This method requires more work to extract the parameters correctly, and the dsSyntheticClient package provides some helper functions. The advantage of working in this way is that it may be preferred by data custodians as only summary statistics leave the remote server. While using synthpop should not “leak” personal data when the synthetic data is generated on the server side, this still may be of concern to some data custodians.

6.1 Overview

The steps for using the simstudy to generate a synthetic data set:

  1. Connect to the server
  2. Split factor variables in dummy (binary) variables
  3. Prepare lists of continuous and factor variables
  4. Generate synthetic data set
  5. Recombine dummy variables.

6.2 Getting set up

Again we need to build a login object for the server that holds the data. Note that the dsSynthetic functions have been written to work with a single connection:

builder <- DSI::newDSLoginBuilder()
# hide credentials
builder$append(server="server1", url="https://opal-sandbox.mrc-epid.cam.ac.uk",
               user="dsuser", password="P@ssw0rd", 
               table = "DASIM.DASIM1")
logindata <- builder$build()

And then we establish a connection to the server:

library(DSOpal)
if(exists("connections")){
  datashield.logout(conns = connections)
}
connections <- datashield.login(logins=logindata, assign = TRUE)

6.3 Preparing to generate synthetic data

This method builds on the fundamentals of the simstudy package. There is a good introduction to this package here.

First we load the library

library("simstudy")
library("dsBaseClient")
library("dsSyntheticClient")

The simstudy method uses a function called genCorFlex (see simstudy help for more details). This function is currently not able to support categorical distributions, but does support binary distributions. Therefore we need to convert factor variables with more than 2 levels into a series of binary dummy variables. For example, the PM_BMI_CATEGORICAL variable has levels 1, 2 and 3. This will need to be converted into 2 binary variables with 2 levels. A helper function is provided to make this process easier.

ds.binary.helper(dataframe = "D", factor_variables = "PM_BMI_CATEGORICAL", newobj = "my.bin", datasources = connections)
## $server1
## $server1$return.message
## [1] "Object(s) 'dummy,X1_PM_BMI_CATEGORICAL,X2_PM_BMI_CATEGORICAL,temp_mat' was deleted."
## 
## $server1$deleted.objects
## [1] "dummy,X1_PM_BMI_CATEGORICAL,X2_PM_BMI_CATEGORICAL,temp_mat"
## 
## $server1$missing.objects
## [1] ""
## 
## $server1$problem.objects
## [1] ""

Looking at the new dataframe, which has the 2 binary variables in it:

ds.summary("my.bin", datasources = connections)
## $server1
## $server1$class
## [1] "data.frame"
## 
## $server1$`number of rows`
## [1] 10000
## 
## $server1$`number of columns`
## [1] 2
## 
## $server1$`variables held`
## [1] "X1_PM_BMI_CATEGORICAL" "X2_PM_BMI_CATEGORICAL"

We will make a new dataframe without the original PM_BMI_CATEGORICAL variable but with the new binaries:

vars = c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_FASTING", "PM_BMI_CONTINUOUS", "DIS_CVA", "DIS_DIAB", "DIS_AMI", "GENDER")
ds.subset(x="D", subset="D2", cols=vars, datasources = connections)
ds.dataFrame(x = c("D2","my.bin"), newobj = "new_frame", datasources = connections)
## $is.object.created
## [1] "A data object <new_frame> has been created in all specified data sources"
## 
## $validity.check
## [1] "<new_frame> appears valid in all sources"

6.4 Generating synthetic data

Now we are ready to generate the synthetic data using the ds.genCorFlex.helper function. We need to provide a vector of continuous variables and factor variables:

cont_vars = c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_FASTING", "PM_BMI_CONTINUOUS")
factor_vars = c( "DIS_CVA", "DIS_DIAB", "DIS_AMI", "GENDER", "X1_PM_BMI_CATEGORICAL", "X2_PM_BMI_CATEGORICAL")
dd = ds.genCorFlex.helper(dataframe = "new_frame", cont_variables = cont_vars, factor_variables = factor_vars, num_rows = 10000, datasources = connections)

If all has gone well the synthetic data should have a similar structure and properties to the real data:

head(dd)
##    id DIS_CVA_num DIS_DIAB_num DIS_AMI_num GENDER_num X1_PM_BMI_CATEGORICAL_num
## 1:  1           0            0           0          1                         1
## 2:  2           0            0           0          1                         0
## 3:  3           0            0           0          0                         0
## 4:  4           0            0           0          1                         0
## 5:  5           0            0           0          0                         0
## 6:  6           0            0           0          1                         0
##    X2_PM_BMI_CATEGORICAL_num  LAB_TSC LAB_TRIG  LAB_HDL LAB_GLUC_FASTING
## 1:                         0 5.737100 1.144921 1.877549         5.011586
## 2:                         0 6.169881 2.148150 1.893974         3.228336
## 3:                         0 3.773902 1.671582 1.877346         4.535217
## 4:                         0 4.708924 2.160444 1.398481         3.746336
## 5:                         0 5.359086 4.137004 1.271885         3.174415
## 6:                         0 5.415491 1.681056 2.018016         4.472380
##    PM_BMI_CONTINUOUS
## 1:          30.18235
## 2:          21.24348
## 3:          26.15846
## 4:          30.28508
## 5:          23.84756
## 6:          31.23694

We can reconstruct the PM_BMI_CATEGORICAL variable as follows:

dd$PM_BMI_CATEGORICAL = 1 + dd$X1_PM_BMI_CATEGORICAL_num + dd$X2_PM_BMI_CATEGORICAL_num

Such that it now has levels 1, 2 and 3.

6.5 Brief comments on the validity of the data

In the example above we chose to synthetically generate the variable PM_BMI_CATEGORICAL. This variable is actually derived from the continuous variable PM_BMI_CONTINUOUS, with (say) BMI <25 being category 1, 25<= BMI < 30 being category 2 etc. Because of the probabilistic way in which the data are generated, this categorisation is not enforced in the data synthesis. It might be better to generate the continuous variable only and add the categorical variable afterwards.