Realism in synthetic data generation : a thesis presented in fulfilment of the requirements for the degree of Master of Philosophy in Science, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand

dc.contributor.author: McLachlan, Scott
dc.date.accessioned: 2017-08-07T02:05:08Z
dc.date.available: 2017-08-07T02:05:08Z
dc.date.issued: 2017
dc.description.abstract: There are many situations where researchers cannot make use of real data, either because the data does not exist in the required format or because privacy and confidentiality concerns prevent its release. In these situations, synthetic data generation (SDG) methods are sought to create a replacement for real data. The work presented in this thesis has been undertaken in the context of security and privacy for the Electronic Healthcare Record (EHR). To be a proper replacement, synthetic data must be realistic, yet no method currently exists to develop and validate realism in a unified way. This thesis investigates the problem of characterising, achieving and validating realism in synthetic data generation. A comprehensive domain analysis provides the basis for new characterisation and classification methods for synthetic data, as well as a previously undescribed but consistently applied generic SDG approach. To achieve realism, an existing knowledge discovery in databases approach is extended to discover realistic elements inherent to real data. This approach is validated through a case study, which demonstrates the realism characterisation and validation approaches and establishes whether or not the synthetic data is a realistic replacement. This thesis presents the ATEN framework, which incorporates three primary contributions: (1) the THOTH approach to SDG; (2) the RA approach to characterise the elements and qualities of realism for use in SDG; and (3) the HORUS approach for validating realism in synthetic data. The ATEN framework is significant in that it allows researchers to substantiate claims of success and realism in their synthetic data generation projects. The THOTH approach is significant in providing a new structured way of engaging in SDG. The RA approach is significant in enabling a researcher to discover and specify realism characteristics that must be achieved synthetically. The HORUS approach is significant in providing a new practical and systematic validation method for substantiating and justifying claims of success and realism in SDG works. Future efforts will focus on further validation of the ATEN framework through a controlled multi-stream synthetic data generation process. (en_US)
dc.identifier.uri: http://hdl.handle.net/10179/11569
dc.language.iso: en (en_US)
dc.publisher: Massey University (en_US)
dc.rights: The Author (en_US)
dc.subject: Computer simulation (en_US)
dc.subject: Data protection (en_US)
dc.subject: Data mining (en_US)
dc.title: Realism in synthetic data generation : a thesis presented in fulfilment of the requirements for the degree of Master of Philosophy in Science, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand (en_US)
dc.type: Thesis (en_US)
massey.contributor.author: McLachlan, Scott
thesis.degree.discipline: Science (en_US)
thesis.degree.grantor: Massey University (en_US)
thesis.degree.level: Masters (en_US)
thesis.degree.name: Master of Philosophy (MPhil) (en_US)
Files
Original bundle
Name: 01_front.pdf; Size: 116.41 KB; Format: Adobe Portable Document Format
Name: 02_whole.pdf; Size: 3.62 MB; Format: Adobe Portable Document Format
License bundle
Name: license.txt; Size: 3.32 KB; Format: Item-specific license agreed upon to submission