Novel approaches for multimedia data processing : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, Auckland, New Zealand

Thumbnail Image
Open Access Location
Journal Title
Journal ISSN
Volume Title
Massey University
The Author
Multimedia data processing is an active research field contributing to many frontiers of science and technology. It involves the processing of audio, image, video, text, and other forms of data. In this thesis, four novel approaches are proposed to address two key issues in multimedia data processing: (i) how to reduce the annotation costs of sound event classification/tagging, and (ii) how to improve the quality of video captions. To address the issue of how to reduce the annotation costs of sound event classification/tagging, we propose a Gabor dictionary-based active learning (DBAL) approach for semi-automatic sound event classification. In DBAL, sound features are extracted from audio recordings through a Gabor dictionary. Based on the extracted features, sound events in the recordings will be manual or automatic tagged through active learning. Then a classifier is trained by these recordings with their true or predicted labels. Thus, DBAL can be evaluated by the accuracy of the classifier. Further, a learnt dictionary-based active learning (LDAL) approach is proposed to tackle the same issue. In LDAL, a K-SVD learnt dictionary replaces the Gabor dictionary for feature extraction. The same active learning mechanism and classifier are used for tagging and evaluation. Compared with other existing approaches, our approaches (i.e., DBAL and LDAL) achieve higher classification accuracies but require much fewer annotation costs. To tackle the issue of how to improve the quality of video captions, we propose an attention-based dual learning (ADL) approach for video captioning. Two modules (i.e., a caption generation module and a video reconstruction module) are contained in ADL, which are fine-tuned via dual learning. Thus, ADL can enhance the quality of the generated captions by minimizing the differences between raw and reconstructed/reproduced videos. Further, we propose a bidirectional relational recurrent neural network (Bidirectional RRNN) to tackle the same issue. By fully utilizing the local and global context information as well as visual information in videos, Bidirectional RRNN can capture all events in a video, reason the relationships between events, and generate a set of informative sentences to describe video contents. Experimental results on benchmark datasets demonstrate that our approaches (i.e., ADL and Bidirectional RRNN) are superior to the state-of-the-art approaches. In conclusion, this thesis proposes four effective approaches for processing multimedia data. Experimental results show that our approaches outperform the state-of-the-art approaches.
Multimedia systems, Research