Chapter 5 Harmonisation with synthetic data

In this section we describe how to harmonise synthetic data on the client side. This assumes that you have used one of the previous methods to generate your synthetic data set.

Recall that the aim is to use synthetic data on the client side to design harmonisation algorithms, and then to implement these algorithms in Opal on the server side using the real data. This removes the need for the user to have full access to the real data. Harmonisation algorithms are implemented in Opal using MagmaScript (JavaScript with some additional functions). The idea is that writing JavaScript on the client side, with full access to the synthetic data, is easier than writing it on the server side with access only to summaries.

The additional steps for harmonisation are:

  1. With the synthetic data on the client side, the user can view the data and develop their code. They will be able to see how the data changes as the code is run.
  2. When the code is complete, it can be run on the server side using the real data.

In detail, the steps proposed are:

  1. Start a JavaScript session on the client side
  2. Load the synthetic data into the session
  3. Write and test JavaScript code in the session against the synthetic data
  4. When happy, copy the code into Opal to generate the harmonised data

Figure 5.1: Prototyping DataSHIELD harmonisation using synthetic data in JavaScript

5.1 Getting set up

First, some system-level packages may need to be installed. On Debian or Ubuntu, install either libv8-dev or libnode-dev; on Fedora, install v8-devel. These allow the R V8 package to be installed.

install.packages("V8")
## Installing package into '/home/vagrant/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library("V8")
## Using V8 engine 9.6.180.12

5.2 Do the work

Now we can start a JavaScript session and load the additional MagmaScript functionality that is found in Opal. We also load our synthetic data into the JavaScript session.

ct2 = v8()
ct2$source("https://raw.githubusercontent.com/tombisho/dsSyntheticClient/main/MagmaScript.min.js")
## [1] "true"
synth_data = read.csv(file = "data/synth_data.csv")
ct2$assign("als_syn", synth_data)

We then enter the interactive JavaScript console. Binding the MagmaScript $ function to the first row of the data, als_syn[0], lets us write JavaScript that operates on that single row and shows the result:

ct2$console()
var $ = MagmaScript.MagmaScript.$.bind(als_syn[0]);

if ($('y3age').value() > 25 ){
  out = 1
} else {
  out = 0
}
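The rule above uses a strict comparison, so it is worth checking how it treats a value exactly at the threshold and a row with no age recorded. The following is a minimal sketch that can be run in plain JavaScript without MagmaScript. The makeAccessor helper is a hypothetical stand-in that only mimics the $('name').value() call shape used in this chapter; it is not the MagmaScript implementation, and the sample rows are made up for illustration.

```javascript
// Hypothetical stand-in for the MagmaScript `$` accessor (an assumption,
// not the real implementation): it only supports $('name').value().
function makeAccessor(row) {
  return function (name) {
    return { value: function () { return row[name]; } };
  };
}

// The harmonisation rule from this section, wrapped in a function.
function ageOver25(row) {
  var $ = makeAccessor(row);
  if ($('y3age').value() > 25) {
    return 1;
  }
  return 0;
}

// Made-up rows: above the threshold, exactly at it, below it, and missing.
var sampleRows = [{ y3age: 31 }, { y3age: 25 }, { y3age: 19 }, {}];
var sampleResults = sampleRows.map(ageOver25);
console.log(sampleResults); // [ 1, 0, 0, 0 ]
```

In plain JavaScript a missing value compares as false against 25, so in this sketch a row without an age is coded 0 rather than left missing. Whether that is the desired behaviour depends on the harmonisation target.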

Now we test our code against the whole dataset. This is done by:

  1. Defining the script as a string assigned to a variable
  2. Executing this script in a loop over each row of data
  3. Capturing the output each time

myScript = `
if ($('y3age').value() > 25 ){
  out = 1
} else {
  out = 0
}
`

var my_out = [];

for (var j = 0; j < als_syn.length; j++){
  my_out.push(MagmaScript.evaluator(myScript, als_syn[j]))
}
exit
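To make the three steps concrete outside the V8 session, here is a rough sketch of the same pattern in plain JavaScript: a script held as a string, evaluated once per row, with each output collected. The evaluateSketch function is a hypothetical stand-in written for this example. It is not MagmaScript.evaluator and makes no claim about how that function works internally; it assumes only that the script reads variables through $('name').value() and leaves its result in out. The rows are made up.

```javascript
// Hypothetical stand-in for a per-row script evaluator (an assumption, not
// MagmaScript.evaluator). It exposes `$` to the script and returns `out`.
function evaluateSketch(script, row) {
  var accessor = function (name) {
    return { value: function () { return row[name]; } };
  };
  // Build a function whose only parameter is `$` and whose body is the script.
  var run = new Function('$', 'var out;\n' + script + '\nreturn out;');
  return run(accessor);
}

// Step 1: the script as a string.
var ruleScript = `
if ($('y3age').value() > 25 ){
  out = 1
} else {
  out = 0
}
`;

// Steps 2 and 3: loop over (made-up) rows and capture each output.
var demoRows = [{ y3age: 40 }, { y3age: 12 }, { y3age: 26 }];
var demoOut = [];
for (var i = 0; i < demoRows.length; i++) {
  demoOut.push(evaluateSketch(ruleScript, demoRows[i]));
}
console.log(demoOut); // [ 1, 0, 1 ]
```

Keeping the script as a string means the same text that was tested in the loop is what gets pasted into Opal, with no retyping in between.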

We then pull the results into R for inspection:

my_out = ct2$get("my_out")

synth_data_harm = synth_data
synth_data_harm$my_var = my_out
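A simple first inspection is to count how often each harmonised value occurs, which quickly flags unexpected codes or missing outputs. The sketch below shows one way this might be done in JavaScript before the results are pulled into R. The tally helper and the example vector are hypothetical and made up for illustration; they are not part of MagmaScript or of the synthetic data.

```javascript
// Hypothetical helper: count how often each value occurs in an array.
function tally(values) {
  var counts = {};
  values.forEach(function (v) {
    var key = String(v); // null and undefined become the keys "null" / "undefined"
    counts[key] = (counts[key] || 0) + 1;
  });
  return counts;
}

// Made-up output vector, including one missing result.
var exampleOut = [1, 0, 0, 1, 1, null];
var exampleCounts = tally(exampleOut);
console.log(exampleCounts);
```

Any key other than 0 or 1 in the counts would suggest that the script did not produce an output for some rows.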

If we are happy with the code, we can paste it directly into the Opal script interface so that it can be executed on the real data:


Figure 5.2: Script editor in Opal