Realism in synthetic data generation : a thesis presented in fulfilment of the requirements for the degree of Master of Philosophy in Science, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand
Date
2017
Publisher
Massey University
Rights
The Author
Abstract
There are many situations where researchers cannot make use of real data because either the data
does not exist in the required format or privacy and confidentiality concerns prevent release of the data.
The work presented in this thesis has been undertaken in the context of security and privacy for the
Electronic Healthcare Record (EHR). Where real data cannot be used, synthetic data generation (SDG) methods are
sought to create a replacement for real data. In order to be a proper replacement, that synthetic data
must be realistic, yet no method currently exists to develop and validate realism in a unified way. This
thesis investigates the problem of characterising, achieving and validating realism in synthetic data
generation. A comprehensive domain analysis provides the basis for new characterisation and
classification methods for synthetic data, as well as a previously undescribed but consistently applied
generic SDG approach. In order to achieve realism, an existing knowledge discovery in databases
approach is extended to discover realistic elements inherent to real data. This approach is validated
through a case study. The case study demonstrates the realism characterisation and validation
approaches and establishes whether the synthetic data is a realistic replacement. This
thesis presents the ATEN framework, which incorporates three primary contributions: (1) the THOTH
approach to SDG; (2) the RA approach to characterise the elements and qualities of realism for use in
SDG; and (3) the HORUS approach for validating realism in synthetic data. The ATEN framework
presented is significant in that it allows researchers to substantiate claims of success and realism in
their synthetic data generation projects. The THOTH approach is significant in providing a new
structured way for engaging in SDG. The RA approach is significant in enabling a researcher to discover
and specify realism characteristics that must be achieved synthetically. The HORUS approach is
significant in providing a new practical and systematic validation method for substantiating and justifying
claims of success and realism in SDG works. Future efforts will focus on further validation of the ATEN
framework through a controlled multi-stream synthetic data generation process.
Keywords
Computer simulation, Data protection, Data mining