Chapter 6 Analysis with synthetic data

In this section we describe how we can use a copy of the synthetic data to help users write DataSHIELD analysis code. Again we will make use of an existing dataset (** TO DO would be better to run this on some harmonised data)

Recall that the objective here is to use a synthetic copy of already harmonised real data that is then made available to the analyst on the client side. This synthetic data set can be loaded into DSLite, a special client side implementation of server side DataSHIELD used for development purposes: the user can have full access to the data. This is acceptable in this situation as the data are synthetic. Therefore the user has the chance to write DataSHIELD code but see the complete results of each step. This makes it easier to develop the code.

6.1 Getting set up

We require DSLite and dsBase to be installed:

install.packages("DSLite", dependencies = TRUE)
## Installing package into '/home/vagrant/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("dsBase", repos = c("https://cloud.r-project.org", "https://cran.obiba.org"), dependencies = TRUE)
## Installing package into '/home/vagrant/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

Let us assume that the DASIM1 dataset has previously been harmonised and is held on a server. We do not have access to the full dataset but want to do some analysis on it. Using the method described above, the first thing to do is generate a synthetic version that we can fully access. To do this we will use the same steps that were shown in 3

6.2 Create synthetic dataset

We build our log in object

builder <- DSI::newDSLoginBuilder()
builder$append(server="server1", url="https://opal-sandbox.mrc-epid.cam.ac.uk",
               user="dsuser", password="password", 
               table = "DASIM.DASIM1")
logindata <- builder$build()

Then perform the log in to ther server:

library(DSOpal)
if(exists("connections")){
  datashield.logout(conns = connections)
}
connections <- datashield.login(logins=logindata, assign = TRUE)
## 
## Logging into the collaborating servers
## 
##   No variables have been specified. 
##   All the variables in the table 
##   (the whole dataset) will be assigned to R!
## 
## Assigning table data...

The DASIM1 dataset is relatively small (i.e. around 10 columns). This is probably also true of many harmonised data sets. Therefore we can just create a synthetic version of it:

library(dsSyntheticClient)
synth_data = ds.syn(data = "D", method = "cart", m = 1, seed = 123)
DASIM = synth_data$server1$syn

6.3 Start DSLite local instance

library(DSLite)
library(dsBaseClient)
dslite.server <- newDSLiteServer(tables=list(DASIM=DASIM))
dslite.server$config(defaultDSConfiguration(include=c("dsBase", "dsSynthetic")))

builder <- DSI::newDSLoginBuilder()
builder$append(server="server1", url="dslite.server", table = "DASIM", driver = "DSLiteDriver")
logindata <- builder$build()

if(exists("connections")){
  datashield.logout(conns = connections)
}
connections <- datashield.login(logins=logindata, assign = TRUE)
## 
## Logging into the collaborating servers
## 
##   No variables have been specified. 
##   All the variables in the table 
##   (the whole dataset) will be assigned to R!
## 
## Assigning table data...

We can now check for ourselves that our DASIM data is in the DSLite server:

ds.summary('D')
## $server1
## $server1$class
## [1] "data.frame"
## 
## $server1$`number of rows`
## [1] 10000
## 
## $server1$`number of columns`
## [1] 10
## 
## $server1$`variables held`
##  [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"           
##  [4] "LAB_GLUC_FASTING"   "PM_BMI_CONTINUOUS"  "DIS_CVA"           
##  [7] "DIS_DIAB"           "DIS_AMI"            "GENDER"            
## [10] "PM_BMI_CATEGORICAL"

Now suppose we want to subset the DASIM data into men and women. We can use the ds.subset function:

ds.subset(x="D", subset = "women", logicalOperator = "==", threshold = 1)

With DSLite we have the chance to look at actually what happened in detail:

women = getDSLiteData(conns = connections, symbol = "women")$server1
head(women)
##       LAB_TSC  LAB_TRIG   LAB_HDL LAB_GLUC_FASTING PM_BMI_CONTINUOUS DIS_CVA
## 1013 5.939322 0.4123825 2.5331337         4.780652          16.56022       0
## 1001 4.488742 1.4545074 1.1666774         4.354192          29.37608       0
## 1002 5.530095 1.4927338 1.2354929         4.144381          27.20650       0
## 1003 6.223834 2.7842252 1.4955468         2.751385          22.57965       0
## 1004 6.249578 3.4324239 0.9035903         4.379078          32.29457       0
## 1005 2.615003 0.3251287 1.7154251         3.953342          24.67133       0
##      DIS_DIAB DIS_AMI GENDER PM_BMI_CATEGORICAL
## 1013        0       0      1                  1
## 1001        0       0      0                  2
## 1002        0       0      1                  2
## 1003        0       0      0                  1
## 1004        0       0      1                  3
## 1005        0       0      1                  1

This doesn’t look quite right. There are still rows with GENDER == 0. There is an error in our code (we didn’t specify GENDER as part of the logicalOperator parameter) but didn’t get a warning. Let’s make a correction and try again:

ds.subset(x="D", subset = "women", logicalOperator = "GENDER==", threshold = 1)
# get the data again
women = getDSLiteData(conns = connections, symbol = "women")$server1
head(women)
##       LAB_TSC  LAB_TRIG   LAB_HDL LAB_GLUC_FASTING PM_BMI_CONTINUOUS DIS_CVA
## 1013 5.939322 0.4123825 2.5331337         4.780652          16.56022       0
## 1002 5.530095 1.4927338 1.2354929         4.144381          27.20650       0
## 1004 6.249578 3.4324239 0.9035903         4.379078          32.29457       0
## 1005 2.615003 0.3251287 1.7154251         3.953342          24.67133       0
## 1010 5.189675 1.5872435 1.7576039         3.502300          30.10351       0
## 2301 6.401451 2.8097121 1.4876340         4.964355          28.72204       0
##      DIS_DIAB DIS_AMI GENDER PM_BMI_CATEGORICAL
## 1013        0       0      1                  1
## 1002        0       0      1                  2
## 1004        0       0      1                  3
## 1005        0       0      1                  1
## 1010        0       0      1                  3
## 2301        0       0      1                  2

This now looks much better.

We can also compare results obtained via DataSHIELD with results on the actual data:

from_server = ds.glm(formula = "DIS_DIAB~PM_BMI_CONTINUOUS+LAB_TSC+LAB_HDL", data = "women", family = "binomial")
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge
from_local = glm(formula = "DIS_DIAB~PM_BMI_CONTINUOUS+LAB_TSC+LAB_HDL", data = women, family = "binomial")

from_server$coefficients
##                     Estimate Std. Error   z-value      p-value low0.95CI.LP
## (Intercept)       -6.0430125 1.07150306 -5.639753 1.702944e-08  -8.14311987
## PM_BMI_CONTINUOUS  0.1439233 0.02563643  5.614015 1.976845e-08   0.09367683
## LAB_TSC           -0.2203753 0.10708899 -2.057871 3.960252e-02  -0.43026590
## LAB_HDL           -0.5597964 0.31679551 -1.767059 7.721835e-02  -1.18070413
##                   high0.95CI.LP        P_OR low0.95CI.P_OR high0.95CI.P_OR
## (Intercept)         -3.94290507 0.002368771   0.0002906442      0.01902291
## PM_BMI_CONTINUOUS    0.19416978 1.154795539   1.0982047814      1.21430243
## LAB_TSC             -0.01048477 0.802217643   0.6503361470      0.98957001
## LAB_HDL              0.06111143 0.571325401   0.3070624495      1.06301736
from_local$coefficients
##       (Intercept) PM_BMI_CONTINUOUS           LAB_TSC           LAB_HDL 
##        -6.0430125         0.1439233        -0.2203753        -0.5597964

TO DO make a more realistic example for example with ds.lexis or ds.reshape where it is much harder to know what is going on