Chapter 3 Generating data using synthpop methods

In this chapter we look at how to generate synthetic data using DataSHIELD functions that build on synthpop functionality. In this scenario the synthetic data are generated on the server side and then returned to the client.

The syn function in the synthpop package has many detailed options for optimising the generation of synthetic data. A limited number of these options are available in the dsSynthetic package. Details of how to use these options can be found in the vignettes for synthpop.

3.1 Getting set up

First we need to build a login object for the server that holds the data. Note that the dsSynthetic functions have been written to work with a connection to a single server:

builder <- DSI::newDSLoginBuilder()
# hide credentials
builder$append(server="server1", url="https://opal-sandbox.mrc-epid.cam.ac.uk",
               user="dsuser", password="P@ssw0rd", 
               table = "DASIM.DASIM1")
logindata <- builder$build()

And then we establish a connection to the server:

library(DSOpal)
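# Log out of any session left over from an earlier run before logging in again
if(exists("connections")){
  datashield.logout(conns = connections)
}
# assign = TRUE loads the table named in the login object into the server-side symbol D
connections <- datashield.login(logins=logindata, assign = TRUE)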

3.2 Generate synthetic data with synthpop

The recommended way to generate a synthetic dataset is to use the implementation of the synthpop package on the server side. synthpop requires some thought on the part of the user: if your dataset has a large number of columns, generating the synthetic data may take a long time. Assuming we have a dataset with a small number of columns (around 10), we can simply execute the following command:

library(dsSyntheticClient)
library(dsBaseClient)
# N.B. you may need to replace `server1` if you have named your connection differently
synth_data <- ds.syn(data = "D", method = "cart", m = 1, seed = 123)$server1$Data$syn

We then have the synthetic data on the client side and can view and manipulate it as required:

head(synth_data)
##        LAB_TSC  LAB_TRIG   LAB_HDL LAB_GLUC_FASTING PM_BMI_CONTINUOUS DIS_CVA
## 1     5.716675 1.2131437 1.2263569         4.746905          28.30447       0
## 10    5.635550 1.0352904 0.9980688         4.724115          26.40500       0
## 100   4.304612 0.2782473 1.7977266         3.922677          28.83399       0
## 1000  6.250217 0.8718639 1.8747723         3.457615          23.08530       0
## 10000 8.621496 4.3548597 1.1010694         4.637112          27.85634       0
## 1001  5.026787 2.2295412 1.1088745         3.802152          21.22769       0
##       DIS_DIAB DIS_AMI GENDER PM_BMI_CATEGORICAL
## 1            0       0      0                  2
## 10           0       0      1                  2
## 100          0       0      1                  2
## 1000         0       0      1                  1
## 10000        0       0      0                  2
## 1001         0       0      0                  1

If you have a dataset with a larger number of columns, you could generate a synthetic dataset for only the subset of variables needed for a particular part of your code development. For example, if we needed to derive a diabetes variable from blood triglycerides, HDL and glucose, we could generate a synthetic dataset for just those variables:

ds.subset(x = "D", subset = "D2", cols = c("LAB_HDL", "LAB_TRIG", "LAB_GLUC_FASTING"))
# N.B. you may need to replace `server1` if you have named your connection differently
synth_data_sub <- ds.syn(data = "D2", method = "cart", m = 1, seed = 123)$server1$Data$syn
head(synth_data_sub)
##         LAB_HDL LAB_TRIG LAB_GLUC_FASTING
## 1     0.9853245 3.076382         4.456135
## 10    1.3826338 2.876982         4.873135
## 100   1.2971431 2.419555         3.803016
## 1000  1.4092677 1.197656         4.057115
## 10000 1.3146526 2.818617         5.289527
## 1001  1.5117405 1.244554         5.430503
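
With the synthetic subset on the client, derivation code can be drafted locally before it is run against the real data. The following is a minimal sketch only. It assumes a purely illustrative rule that flags diabetes when fasting glucose is at least 7.0 (taking the values to be in mmol/L); the threshold, the units and the column name DIS_DIAB_DERIVED are assumptions rather than part of the original workflow, and a real rule might also use triglycerides and HDL. A small data frame holding the rows printed above stands in for synth_data_sub so that the block runs on its own:

```r
# Stand-in for synth_data_sub, built from the six rows printed above
synth_data_sub_demo <- data.frame(
  LAB_HDL          = c(0.9853245, 1.3826338, 1.2971431, 1.4092677, 1.3146526, 1.5117405),
  LAB_TRIG         = c(3.076382, 2.876982, 2.419555, 1.197656, 2.818617, 1.244554),
  LAB_GLUC_FASTING = c(4.456135, 4.873135, 3.803016, 4.057115, 5.289527, 5.430503)
)

# Hypothetical rule (an assumption for illustration): flag diabetes when
# fasting glucose is at or above this threshold
glucose_threshold <- 7.0

# Derive a 0/1 indicator; DIS_DIAB_DERIVED is a made-up column name
synth_data_sub_demo$DIS_DIAB_DERIVED <-
  as.integer(synth_data_sub_demo$LAB_GLUC_FASTING >= glucose_threshold)

# Tabulate the derived variable to check the rule behaves as intended
table(synth_data_sub_demo$DIS_DIAB_DERIVED)
```

Once the logic works on the synthetic data, the same steps can be translated into the corresponding DataSHIELD calls on the server-side data.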

Lastly we save our data for later chapters:

write.csv(x = synth_data, file = "data/synth_data.csv")
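
Note that write.csv stores the row names as an unnamed first column by default, so reading the file back needs row.names = 1 to avoid picking up an extra column. A minimal sketch, using a small stand-in data frame and a temporary file (both are assumptions made so that the example runs on its own):

```r
# Small stand-in for synth_data, with the same kind of row names
demo_data <- data.frame(
  LAB_TSC = c(5.716675, 5.635550),
  DIS_CVA = c(0, 0),
  row.names = c("1", "10")
)

# Write to a temporary file instead of data/synth_data.csv
tmp_file <- tempfile(fileext = ".csv")
write.csv(x = demo_data, file = tmp_file)

# Read the file back, treating the first column as row names
reloaded <- read.csv(file = tmp_file, row.names = 1)
head(reloaded)
```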

3.3 Brief comments on the validity of the data

In the example above we chose to synthesise the variable PM_BMI_CATEGORICAL. This variable is actually derived from the continuous variable PM_BMI_CONTINUOUS, with (say) BMI < 25 being category 1, 25 <= BMI < 30 being category 2, and so on. Because the data are generated probabilistically, this relationship between the two variables is not enforced in the synthesis, so a synthetic row can have a category that does not match its continuous value. It might be better to synthesise only the continuous variable and add the categorical variable afterwards.
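
One way to add the categorical variable afterwards is sketched below. It assumes the cut-points mentioned above (25 and 30) and, as a further assumption, that everything from 30 upwards forms a third category; the real coding of PM_BMI_CATEGORICAL may differ. A small data frame holding the six rows printed in section 3.2 stands in for synth_data so that the block runs on its own:

```r
# Stand-in for synth_data, using the two BMI columns from the rows printed earlier
synth_demo <- data.frame(
  PM_BMI_CONTINUOUS  = c(28.30447, 26.40500, 28.83399, 23.08530, 27.85634, 21.22769),
  PM_BMI_CATEGORICAL = c(2, 2, 2, 1, 2, 1)
)

# Assumed category boundaries: [-Inf, 25), [25, 30), [30, Inf)
bmi_breaks <- c(-Inf, 25, 30, Inf)

# right = FALSE makes the intervals closed on the left, so 25 falls in category 2;
# labels = FALSE returns integer category codes 1, 2, 3
derived_category <- cut(synth_demo$PM_BMI_CONTINUOUS,
                        breaks = bmi_breaks, right = FALSE, labels = FALSE)

# Count rows where the synthesised category disagrees with the derived one
n_inconsistent <- sum(derived_category != synth_demo$PM_BMI_CATEGORICAL)

# Replace the synthesised category with the derived one
synth_demo$PM_BMI_CATEGORICAL <- derived_category
```

Running the same check on the full synthetic dataset gives a quick indication of how often the synthesis breaks the relationship between the two variables.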