Scalable, high-performance, and generalized subtree data anonymization approach for Apache Spark

Data anonymization strategies such as subtree generalization have been hailed as more efficient than their full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of its scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage through caching of frequently referenced intermediate values, and enhanced iteration support. Our experimental results show high performance compared to existing state-of-the-art privacy-preserving approaches, while ensuring the data utility and privacy levels required of any competitive data anonymization technique.
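
As a rough illustration of the subtree generalization idea named in the abstract, the following minimal Python sketch replaces every value under one node of a taxonomy tree with that node, while values in other subtrees stay unchanged. The taxonomy, the attribute values, and the function names (`PARENT`, `ancestors`, `generalize_subtree`) are hypothetical and chosen only for illustration; this is not the paper's RDD-based implementation.

```python
# Hypothetical taxonomy for an "occupation" attribute, stored as child -> parent.
PARENT = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Dancer": "Artist",
    "Writer": "Artist",
    "Professional": "Any",
    "Artist": "Any",
}


def ancestors(value):
    """Yield the ancestors of a value in the taxonomy, nearest first."""
    while value in PARENT:
        value = PARENT[value]
        yield value


def generalize_subtree(records, node):
    """Replace every value that lies below `node` with `node` itself.

    Values outside that subtree are left untouched, which is what
    distinguishes subtree generalization from full-tree generalization,
    where all values of the attribute move to the same level.
    """
    return [node if node in ancestors(v) else v for v in records]


records = ["Engineer", "Lawyer", "Dancer", "Writer", "Engineer"]
result = generalize_subtree(records, "Professional")
print(result)
```

Only the "Professional" subtree is generalized here; "Dancer" and "Writer" keep their original values. On Spark, a function of this kind could plausibly be applied per record with `rdd.map`, with `cache()` used on intermediate RDDs that are reused across iterations, but that mapping is an assumption and not taken from the paper.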

Keywords

Spark, subtree generalization, privacy, data anonymization, Resilient Distributed Dataset (RDD)

Citation

Bazai SU, Jang-Jaccard J, Alavizadeh H. (2021). Scalable, high-performance, and generalized subtree data anonymization approach for Apache Spark. Electronics (Switzerland). 10(5). (pp. 1-28).

URI

https://mro.massey.ac.nz/handle/10179/69139

Collections

Journal Articles
