Chapter 5 Harmonisation with synthetic data

This chapter describes how to harmonise synthetic data on the client side. It assumes that you have already generated a synthetic dataset using one of the methods described previously.

Recall that the aim is to use synthetic data on the client side to design harmonisation algorithms, and then to implement them in Opal on the server side using the real data. This removes the need for the user to have full access to the real data. Harmonisation algorithms are implemented in Opal using MagmaScript (JavaScript with some additional functions). The idea is that writing JavaScript on the client side, with full access to the synthetic data, is easier than writing it on the server side with access only to summaries.

The steps for harmonisation following generation of synthetic data are:

  1. User requests synthetic copy of real data
  2. Synthetic data generated & available on client side
  3. Synthetic data loaded into JavaScript (JS). User writes harmonisation code (MagmaScript) on client side.
  4. When complete, MagmaScript code implemented on server side to run on real data to generate new, harmonised data set

Figure 5.1: Prototyping DataSHIELD harmonisation using synthetic data in JavaScript

5.1 Getting set up

First we start a JavaScript session and load the additional MagmaScript functionality that is found in Opal. We also load our synthetic data into the JavaScript session.

library(V8)
## Using V8 engine 9.6.180.12
ct2 = v8()
ct2$source(system.file("MagmaScript.min.js", package = "dsSyntheticClient"))
## [1] "true"
synth_data = read.csv(file = "data/synth_data.csv")
ct2$assign("synth_data", synth_data)

We then open the interactive V8 JavaScript console.

ct2$console()

5.2 Experiment with a single row

We bind the MagmaScript $ function to the first row of the synthetic data (index 0). We can then write JavaScript that operates on that single row and inspect the result:

var $ = MagmaScript.MagmaScript.$.bind(synth_data[0]);

if ($('y3age').value() > 25 ){
  out = 1
} else {
  out = 0
}
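
The binding pattern above can also be tried without loading MagmaScript at all. The sketch below is a minimal stand-in, not the real MagmaScript implementation: makeAccessor and the sample row are hypothetical, and only the $('name').value() call shape and the column name y3age are taken from the example above.

```javascript
// Hypothetical stand-in for the MagmaScript `$` accessor (NOT the real library):
// it returns an object whose value() yields the named field of the bound row.
function makeAccessor(row) {
  return function (name) {
    return {
      value: function () {
        // Fields that are absent from the row are reported as null
        return Object.prototype.hasOwnProperty.call(row, name) ? row[name] : null;
      }
    };
  };
}

// Made-up example row; only the column name y3age comes from the text.
var exampleRow = { y3age: 31 };
var $ = makeAccessor(exampleRow);

// Same recoding rule as above: 1 if y3age is over 25, otherwise 0.
var out = $('y3age').value() > 25 ? 1 : 0;
console.log(out); // prints 1
```

Changing the value in exampleRow is a quick way to probe edge cases, such as an age of exactly 25, before moving on to the whole dataset.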

5.3 Test on whole dataset

Now we test our code against the whole dataset. This is done by:

  1. Defining the script as a string assigned to a variable
  2. Executing this script in a loop over each row of data
  3. Capturing the output on each iteration

myScript = `
if ($('y3age').value() > 25 ){
  out = 1
} else {
  out = 0
}
`

// Collect one harmonised value per row
var my_out = [];
var out = null;

for (var j = 0; j < synth_data.length; j++){
  my_out.push(MagmaScript.evaluator(myScript, synth_data[j]))
}
exit
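
The three steps can also be sketched in plain JavaScript, which is handy for checking the loop logic in isolation. Everything here is an assumption made for illustration: evaluateRow is a hypothetical stand-in for MagmaScript.evaluator built on new Function, the accessor mimics only the $('name').value() call shape, and the rows are made up.

```javascript
// Step 1: the harmonisation script held as a string.
var recodeScript = "if ($('y3age').value() > 25) { out = 1 } else { out = 0 }";

// Hypothetical stand-in for MagmaScript.evaluator (NOT the real library):
// runs the script with `$` bound to one row and returns the value left in `out`.
function evaluateRow(script, row) {
  var accessor = function (name) {
    return { value: function () { return row[name]; } };
  };
  var run = new Function('$', 'var out = null;\n' + script + '\nreturn out;');
  return run(accessor);
}

// Made-up rows standing in for synth_data.
var exampleRows = [{ y3age: 31 }, { y3age: 19 }, { y3age: 25 }];

// Steps 2 and 3: loop over the rows and capture each output.
var results = [];
for (var i = 0; i < exampleRows.length; i++) {
  results.push(evaluateRow(recodeScript, exampleRows[i]));
}
console.log(JSON.stringify(results)); // prints [1,0,0]
```

Note that the third row, with an age of exactly 25, is recoded to 0 because the comparison is strict.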

After leaving the console with exit, we pull the results back into R for inspection:

my_out = ct2$get("my_out")

synth_data_harm = synth_data
synth_data_harm$my_var = my_out

5.4 Run the code on the real data

If we are happy with the code, we can paste it directly into the Opal script interface so that it can be executed on the real data:


Figure 5.2: Script editor in Opal

This will generate a harmonised variable in the view on Opal which can be used in analyses. The summary statistics of the harmonised data can be checked to make sure the harmonisation is working correctly.
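
To have something to compare those summary statistics against, the harmonised values produced on the synthetic data can be tabulated first. This is a minimal sketch, assuming the outputs are available as a plain array; tabulate and the sample values are hypothetical. Because the synthetic data is not the real data, the counts should be expected to be similar in proportion rather than identical.

```javascript
// Hypothetical helper: count how often each harmonised value occurs.
function tabulate(values) {
  var counts = {};
  for (var i = 0; i < values.length; i++) {
    var key = String(values[i]); // a null value is counted under the key "null"
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Made-up harmonised outputs standing in for my_out.
var exampleOut = [1, 0, 0, 1, 1, null];
var freq = tabulate(exampleOut);
console.log(JSON.stringify(freq));
```

An unexpected category, or a count under "null", is a hint that the script does not yet handle every case in the data.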

A similar process could be conducted in a platform like MOLGENIS.