Novel approaches for multimedia data processing : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, Auckland, New Zealand

dc.confidentialEmbargo: Yes
dc.contributor.advisor: Wang, Ruili
dc.contributor.author: Ji, Wanting
dc.date.accessioned: 2020-10-06T02:34:34Z
dc.date.accessioned: 2021-05-10T02:39:47Z
dc.date.available: 2020-10-06T02:34:34Z
dc.date.available: 2021-05-10T02:39:47Z
dc.date.issued: 2020
dc.description.abstract: Multimedia data processing is an active research field contributing to many frontiers of science and technology. It involves the processing of audio, image, video, text, and other forms of data. In this thesis, four novel approaches are proposed to address two key issues in multimedia data processing: (i) how to reduce the annotation cost of sound event classification/tagging, and (ii) how to improve the quality of video captions. To address the first issue, we propose a Gabor dictionary-based active learning (DBAL) approach for semi-automatic sound event classification. In DBAL, sound features are extracted from audio recordings through a Gabor dictionary. Based on the extracted features, sound events in the recordings are tagged either manually or automatically through active learning. A classifier is then trained on these recordings with their true or predicted labels, so DBAL can be evaluated by the accuracy of this classifier. Further, a learnt dictionary-based active learning (LDAL) approach is proposed to tackle the same issue. In LDAL, a K-SVD learnt dictionary replaces the Gabor dictionary for feature extraction; the same active learning mechanism and classifier are used for tagging and evaluation. Compared with existing approaches, our approaches (i.e., DBAL and LDAL) achieve higher classification accuracy while requiring far less annotation effort. To tackle the second issue, we propose an attention-based dual learning (ADL) approach for video captioning. ADL contains two modules (a caption generation module and a video reconstruction module) that are fine-tuned via dual learning, so ADL can enhance the quality of the generated captions by minimizing the differences between the raw and reconstructed/reproduced videos. Further, we propose a bidirectional relational recurrent neural network (Bidirectional RRNN) to tackle the same issue. By fully utilizing the local and global context information as well as the visual information in videos, Bidirectional RRNN can capture all events in a video, reason about the relationships between events, and generate a set of informative sentences to describe the video content. Experimental results on benchmark datasets demonstrate that our approaches (i.e., ADL and Bidirectional RRNN) are superior to state-of-the-art approaches. In conclusion, this thesis proposes four effective approaches for processing multimedia data, and experimental results show that they outperform the state-of-the-art approaches.
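The abstract describes a semi-automatic tagging pipeline: dictionary-based features are extracted, uncertain recordings are labelled by a human, confident ones are tagged automatically, and a classifier is retrained on the result. The snippet below is only a minimal sketch of a generic pool-based active learning loop under that description, not the thesis implementation; the function names (active_tagging, oracle), the SVC classifier, and the confidence-based query rule are illustrative assumptions, and dictionary feature extraction is assumed to have been done beforehand.

```python
import numpy as np
from sklearn.svm import SVC

def active_tagging(X_seed, y_seed, X_pool, oracle, budget=50, batch=10):
    """Sketch of pool-based active learning for sound event tagging.
    X_seed, y_seed: small manually labelled seed set (dictionary features).
    X_pool: unlabelled feature matrix, one row per recording.
    oracle(i): returns a manual (human) label for pool row i.
    budget: total number of manual annotations allowed."""
    X_train, y_train = X_seed.copy(), list(y_seed)
    unlabelled = list(range(len(X_pool)))
    clf = SVC(probability=True).fit(X_train, y_train)
    spent = 0
    while unlabelled and spent < budget:
        proba = clf.predict_proba(X_pool[unlabelled])
        # query the least confident recordings for manual tags
        order = np.argsort(proba.max(axis=1))[:batch]
        queried = [unlabelled[i] for i in order]
        for i in queried:
            X_train = np.vstack([X_train, X_pool[i:i + 1]])
            y_train.append(oracle(i))
        spent += len(queried)
        unlabelled = [i for i in unlabelled if i not in queried]
        clf = SVC(probability=True).fit(X_train, y_train)
    # remaining recordings are tagged automatically with predicted labels
    auto = clf.predict(X_pool[unlabelled]) if unlabelled else np.array([])
    return clf, dict(zip(unlabelled, auto))
```

In this reading, the annotation cost is the number of oracle calls (the budget), and the approach is evaluated by the accuracy of the final classifier, as stated in the abstract.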
dc.identifier.uri: http://hdl.handle.net/10179/16333
dc.publisher: Massey University
dc.rights: The Author
dc.subject: Multimedia systems
dc.subject: Research
dc.subject.anzsrc: 460399 Computer vision and multimedia computation not elsewhere classified
dc.title: Novel approaches for multimedia data processing : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, Auckland, New Zealand
dc.type: Thesis
massey.contributor.author: Ji, Wanting
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Massey University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (PhD)
Files
Original bundle
Name: JiPhDThesis.pdf
Size: 2.44 MB
Format: Adobe Portable Document Format