Chapter 3 Generating data using synthpop methods

This chapter shows how to generate synthetic data on the server side using DataSHIELD functions that build on the synthpop package.

First we need to build a login object for the server that holds the data:

builder <- DSI::newDSLoginBuilder()
# credentials are hard-coded for this example only
builder$append(server="server1", url="https://opal-sandbox.mrc-epid.cam.ac.uk",
               user="dsuser", password="password", 
               table = "DASIM.DASIM1")
logindata <- builder$build()
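
The credentials above are written inline. For a real server, one option is to read them from environment variables instead. The sketch below uses only base R and assumes two hypothetical variable names, `DS_USER` and `DS_PASSWORD`; it falls back to the values used above when they are not set:

```r
# Hypothetical environment variable names; set them in ~/.Renviron or the shell.
ds_user     <- Sys.getenv("DS_USER", unset = "dsuser")
ds_password <- Sys.getenv("DS_PASSWORD", unset = "password")

# Fail early if either value ended up empty.
stopifnot(nzchar(ds_user), nzchar(ds_password))
```

The two values can then be passed as `user = ds_user, password = ds_password` in the `builder$append()` call.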

And then we establish a connection to the server:

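library(DSOpal)
# close any connection left over from an earlier session
if (exists("connections")) {
  datashield.logout(conns = connections)
}
# log in and assign the whole table to the default symbol D on the server
connections <- datashield.login(logins = logindata, assign = TRUE)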
## 
## Logging into the collaborating servers
## 
##   No variables have been specified. 
##   All the variables in the table 
##   (the whole dataset) will be assigned to R!
## 
## Assigning table data...

The first option is to generate a synthetic dataset using an implementation of the synthpop package on the server side. synthpop requires some thought on the part of the user: if your dataset has a large number of columns, generating the synthetic data may take a long time. Assuming we have a dataset with a small number of columns (around 10), we can simply execute the following command:

library(dsSyntheticClient)
library(dsBaseClient)
# run the synthesis on the server and keep the synthetic data frame returned for server1
synth_data <- ds.syn(data = "D", method = "cart", m = 1, seed = 123)$server1$syn

We then have the synthetic data on the client side and can view and manipulate it as required:

head(synth_data)
##       LAB_TSC  LAB_TRIG   LAB_HDL LAB_GLUC_FASTING PM_BMI_CONTINUOUS DIS_CVA
## 1013 5.939322 0.4123825 2.5331337         4.780652          16.56022       0
## 1001 4.488742 1.4545074 1.1666774         4.354192          29.37608       0
## 1002 5.530095 1.4927338 1.2354929         4.144381          27.20650       0
## 1003 6.223834 2.7842252 1.4955468         2.751385          22.57965       0
## 1004 6.249578 3.4324239 0.9035903         4.379078          32.29457       0
## 1005 2.615003 0.3251287 1.7154251         3.953342          24.67133       0
##      DIS_DIAB DIS_AMI GENDER PM_BMI_CATEGORICAL
## 1013        0       0      1                  1
## 1001        0       0      0                  2
## 1002        0       0      1                  2
## 1003        0       0      0                  1
## 1004        0       0      1                  3
## 1005        0       0      1                  1
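
To get a feel for what the `cart` method does without a DataSHIELD server, the same kind of synthesis can be tried locally with the synthpop package itself. This is only a sketch: the toy data frame and its column names are made up, and it assumes that the server-side function passes `method`, `m` and `seed` through to `synthpop::syn()`, which this chapter does not confirm. synthpop synthesises the columns one after another, each conditional on the columns already synthesised, which is why wide datasets take longer.

```r
# Toy data standing in for a small cohort table (values are simulated, not real).
set.seed(1)
n <- 200
toy <- data.frame(
  bmi     = rnorm(n, mean = 27, sd = 4),
  glucose = rnorm(n, mean = 5, sd = 1),
  gender  = factor(sample(c("0", "1"), n, replace = TRUE))
)

if (requireNamespace("synthpop", quietly = TRUE)) {
  # One synthetic copy (m = 1) using classification and regression trees.
  toy_syn <- synthpop::syn(toy, method = "cart", m = 1, seed = 123,
                           print.flag = FALSE)$syn
  # Same shape as the original, but the rows are simulated.
  print(dim(toy_syn))
  print(head(toy_syn))
} else {
  message("synthpop is not installed; skipping the local synthesis")
}
```

The `requireNamespace()` guard simply lets the block run on machines where synthpop is not installed.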

If you have a dataset with a larger number of columns, you could generate a synthetic dataset for only the subset of variables needed to derive a particular harmonised variable. For example, if we needed to generate a diabetes variable based on blood triglycerides, HDL and glucose, we could generate a dataset for just those variables:

ds.subset(x = "D", subset = "D2", cols = c("LAB_HDL", "LAB_TRIG", "LAB_GLUC_FASTING"))
synth_data_sub <- ds.syn(data = "D2", method = "cart", m = 1, seed = 123)$server1$syn
head(synth_data_sub)
##       LAB_HDL  LAB_TRIG LAB_GLUC_FASTING
## 1013 1.830697 1.0700770         3.696904
## 1001 1.017168 2.2473066         3.626215
## 1002 1.560354 2.5917701         4.229809
## 1003 1.257052 3.2711980         3.941001
## 1004 1.636933 0.3509866         4.419090
## 1005 1.454375 2.3146460         4.580741
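
As an illustration of the kind of harmonisation code the synthetic subset lets you develop on the client, the sketch below derives a binary diabetes indicator. The data frame reuses the first rows printed above so that the block runs on its own. The variable name `DIABETES_DERIVED` and the rule (fasting glucose of at least 7.0, assuming `LAB_GLUC_FASTING` is in mmol/L) are assumptions chosen for the example, not the rule used by any particular study, and for brevity the rule uses glucose only:

```r
# First three rows of the synthetic subset shown above, retyped so the block is self-contained.
synth_demo <- data.frame(
  LAB_HDL          = c(1.830697, 1.017168, 1.560354),
  LAB_TRIG         = c(1.0700770, 2.2473066, 2.5917701),
  LAB_GLUC_FASTING = c(3.696904, 3.626215, 4.229809),
  row.names        = c("1013", "1001", "1002")
)

# Assumed cut-off for the example (mmol/L).
glucose_cutoff <- 7.0

# 1 if fasting glucose is at or above the cut-off, 0 otherwise.
synth_demo$DIABETES_DERIVED <- as.integer(synth_demo$LAB_GLUC_FASTING >= glucose_cutoff)

table(synth_demo$DIABETES_DERIVED)
```

Once the derivation behaves as intended on the synthetic rows, the same logic can be translated into the corresponding server-side DataSHIELD calls.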

Lastly, we save the synthetic data for use in later chapters:

# the data/ directory must already exist, otherwise write.csv() fails
write.csv(x = synth_data, file = "data/synth_data.csv")
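
Because `write.csv()` writes row names as an unnamed first column by default, code that reads the file back will usually want `row.names = 1`. A small round trip illustrates this, using a temporary file and a made-up two-row data frame rather than the real `data/synth_data.csv`:

```r
# Made-up stand-in for synth_data, with the same kind of row names.
demo <- data.frame(
  LAB_HDL   = c(1.830697, 1.017168),
  LAB_TRIG  = c(1.0700770, 2.2473066),
  row.names = c("1013", "1001")
)

# Write to a temporary file so nothing in the project is overwritten.
tmp <- tempfile(fileext = ".csv")
write.csv(x = demo, file = tmp)

# row.names = 1 turns the first column back into row names.
reloaded <- read.csv(tmp, row.names = 1)
print(reloaded)
```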