Design Of An Improved Model For Video Summarization Using Multimodal Fusion And Reinforcement Learning

  • Mr. Sushant Savita Madhukar Gandhi
  • Dr. Mukesh Shrimali
  • Dr. Pradip Mane
Keywords: Video Summarization, Multimodal Fusion, Reinforcement Learning, Sentiment Analysis, Emotion Detection, Process

Abstract

The growth of video content across platforms has driven sustained research into efficient and effective video summarization. Existing techniques, however, have limitations that stem largely from their focus on a single modality, such as video or text alone, which often yields summaries that are shallow in context and emotional depth. Moreover, most of these methods are not designed to adapt to user preferences or to incorporate engagement feedback, which restricts their usefulness in real-world applications. To address these difficulties, we present a novel framework that combines multimodal fusion, reinforcement learning, and sentiment-emotion analysis for advanced video summarization. Our model, the Multimodal Fusion Transformer (MMFT), uses Transformer networks with a cross-modal attention mechanism to fuse data streams derived from video frames, audio spectrograms, and textual transcripts. This design captures the fine-grained correlations among the modalities, resulting in contextually enriched summaries. On top of this multimodal representation, we introduce a Reinforcement Summarization Agent (RSA) that dynamically refines generated summaries by optimizing for user satisfaction and engagement metrics. The RSA treats summarization as a sequential decision-making problem in a reinforcement learning setting, iteratively improving summary quality from real-time feedback. Finally, to make the summaries emotionally richer and more sentimentally relevant, we adapt an Irritability-Aware BERT with Emotion-Enriched CNN-LSTM (IA-BE-CNNLSTM).
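As a minimal illustration of the cross-modal attention underlying the MMFT fusion step (a generic sketch, not the authors' implementation), the following NumPy snippet lets video-frame embeddings act as queries that attend over audio embeddings; all array sizes and variable names here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats):
    """Scaled dot-product attention: one modality's features (queries)
    attend over another modality's features (keys/values)."""
    d_k = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d_k)   # (Tq, Tk)
    weights = softmax(scores, axis=-1)                  # rows sum to 1
    return weights @ key_feats                          # (Tq, d)

# toy example: 4 video-frame embeddings attend over 6 audio embeddings
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
audio = rng.normal(size=(6, 16))
fused = cross_modal_attention(video, audio)
print(fused.shape)  # (4, 16)
```

In a full Transformer the queries, keys, and values would pass through learned projections and multiple heads; this sketch only shows how attention aligns one modality's timeline against another's before fusion.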
This hybrid model draws sentiment information from textual data and emotional cues from visual and audio data, ensuring that key emotional moments are represented in the output. The resulting fusion yields substantial improvements in the accuracy and emotional impact of the summaries. Experimental results on the YouTube and TRECVID datasets show that the approach achieves precision of 0.82-0.92 and recall of 0.75-0.88, with notable gains in user engagement and emotional resonance. This represents a significant advance in video summarization and provides a robust, adaptable system for a variety of application domains.
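The feedback loop the RSA is described as running can be caricatured as a bandit-style sketch: shot scores are nudged toward an observed engagement reward after each candidate summary. Everything below (the reward proxy, the helper names, the update rule) is a hypothetical illustration, not the paper's algorithm:

```python
import numpy as np

def select_summary(scores, k):
    """Pick the k highest-scoring shots as the current summary."""
    return sorted(np.argsort(scores)[-k:])

def update_scores(scores, summary, reward, lr=0.1):
    """Move the scores of selected shots toward the observed reward."""
    scores = scores.copy()
    for i in summary:
        scores[i] += lr * (reward - scores[i])
    return scores

# toy loop: reward is overlap with a hypothetical 'ideal' shot set,
# standing in for real user-engagement feedback
ideal = {1, 3, 5}
scores = np.zeros(8)
rng = np.random.default_rng(1)
for _ in range(200):
    noisy = scores + rng.normal(scale=0.3, size=8)  # exploration noise
    summary = select_summary(noisy, k=3)
    reward = len(ideal & set(summary)) / 3.0
    scores = update_scores(scores, summary, reward)

print(select_summary(scores, k=3))
```

Shots that tend to appear in higher-reward summaries accumulate higher scores over the iterations, which is the core intuition behind treating summarization as sequential decision-making under feedback.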

Author Biographies

Mr. Sushant Savita Madhukar Gandhi

PhD Scholar, Pacific University, Udaipur

Dr. Mukesh Shrimali

Director, Pacific University, Udaipur

Dr. Pradip Mane

Associate Professor, VPPCOE&VA, Mumbai

References

[1] J. Lin et al., "VideoXum: Cross-Modal Visual and Textural Summarization of Videos," in IEEE Transactions on Multimedia, vol. 26, pp. 5548-5560, 2024, doi: 10.1109/TMM.2023.3335875.
[2] P. Kadam et al., "Recent Challenges and Opportunities in Video Summarization With Machine Learning Algorithms," in IEEE Access, vol. 10, pp. 122762-122785, 2022, doi: 10.1109/ACCESS.2022.3223379.
[3] F. Wang, J. Chen and F. Liu, "Keyframe Generation Method via Improved Clustering and Silhouette Coefficient for Video Summarization," in Journal of Web Engineering, vol. 20, no. 1, pp. 147-170, January 2021, doi: 10.13052/jwe15409589.2018.
[4] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris and I. Patras, "Video Summarization Using Deep Neural Networks: A Survey," in Proceedings of the IEEE, vol. 109, no. 11, pp. 1838-1863, Nov. 2021, doi: 10.1109/JPROC.2021.3117472.
[5] Y. Zhang, Y. Liu, W. Kang and Y. Zheng, "MAR-Net: Motion-Assisted Reconstruction Network for Unsupervised Video Summarization," in IEEE Signal Processing Letters, vol. 30, pp. 1282-1286, 2023, doi: 10.1109/LSP.2023.3313091.
[6] K. Davila, F. Xu, S. Setlur and V. Govindaraju, "FCN-LectureNet: Extractive Summarization of Whiteboard and Chalkboard Lecture Videos," in IEEE Access, vol. 9, pp. 104469-104484, 2021, doi: 10.1109/ACCESS.2021.3099427.
[7] P. Nagar, A. Rathore, C. V. Jawahar and C. Arora, "Generating Personalized Summaries of Day Long Egocentric Videos," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6832-6845, 1 June 2023, doi: 10.1109/TPAMI.2021.3118077.
[8] T. Liu, Q. Meng, J. -J. Huang, A. Vlontzos, D. Rueckert and B. Kainz, "Video Summarization Through Reinforcement Learning With a 3D Spatio-Temporal U-Net," in IEEE Transactions on Image Processing, vol. 31, pp. 1573-1586, 2022, doi: 10.1109/TIP.2022.3143699.
[9] Y. Yuan and J. Zhang, "Unsupervised Video Summarization via Deep Reinforcement Learning With Shot-Level Semantics," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 445-456, Jan. 2023, doi: 10.1109/TCSVT.2022.3197819.
[10] G. Mujtaba, A. Malik and E. -S. Ryu, "LTC-SUM: Lightweight Client-Driven Personalized Video Summarization Framework Using 2D CNN," in IEEE Access, vol. 10, pp. 103041-103055, 2022, doi: 10.1109/ACCESS.2022.3209275.
[11] O. Issa and T. Shanableh, "CNN and HEVC Video Coding Features for Static Video Summarization," in IEEE Access, vol. 10, pp. 72080-72091, 2022, doi: 10.1109/ACCESS.2022.3188638.
[12] Z. Ji, Y. Zhao, Y. Pang, X. Li and J. Han, "Deep Attentive Video Summarization With Distribution Consistency Learning," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 4, pp. 1765-1775, April 2021, doi: 10.1109/TNNLS.2020.2991083.
[13] M. Ma et al., "Keyframe Extraction From Laparoscopic Videos via Diverse and Weighted Dictionary Selection," in IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 5, pp. 1686-1698, May 2021, doi: 10.1109/JBHI.2020.3019198.
[14] J. Gao, X. Yang, Y. Zhang and C. Xu, "Unsupervised Video Summarization via Relation-Aware Assignment Learning," in IEEE Transactions on Multimedia, vol. 23, pp. 3203-3214, 2021, doi: 10.1109/TMM.2020.3021980.
[15] B. Zhao, M. Gong and X. Li, "AudioVisual Video Summarization," in IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 5181-5188, Aug. 2023, doi: 10.1109/TNNLS.2021.3119969.
[16] Y. Zhang, Y. Liu, W. Kang and R. Tao, "VSS-Net: Visual Semantic Self-Mining Network for Video Summarization," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 2775-2788, April 2024, doi: 10.1109/TCSVT.2023.3312325.
[17] P. Tang, K. Hu, L. Zhang, J. Luo and Z. Wang, "TLDW: Extreme Multimodal Summarization of News Videos," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1469-1480, March 2024, doi: 10.1109/TCSVT.2023.3296196.
[18] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris and I. Patras, "AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 8, pp. 3278-3292, Aug. 2021, doi: 10.1109/TCSVT.2020.3037883.
[19] B. Köprü and E. Erzin, "Use of Affective Visual Information for Summarization of Human-Centric Videos," in IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 3135-3148, 1 Oct.-Dec. 2023, doi: 10.1109/TAFFC.2022.3222882.
[20] J. Xie et al., "Multimodal-Based and Aesthetic-Guided Narrative Video Summarization," in IEEE Transactions on Multimedia, vol. 25, pp. 4894-4908, 2023, doi: 10.1109/TMM.2022.3183394.
[21] B. Zhao, H. Li, X. Lu and X. Li, "Reconstructive Sequence-Graph Network for Video Summarization," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2793-2801, 1 May 2022, doi: 10.1109/TPAMI.2021.3072117.
[22] Y. Zhang, Y. Liu, P. Zhu and W. Kang, "Joint Reinforcement and Contrastive Learning for Unsupervised Video Summarization," in IEEE Signal Processing Letters, vol. 29, pp. 2587-2591, 2022, doi: 10.1109/LSP.2022.3227525.
[23] B. Zhao, X. Li and X. Lu, "TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization," in IEEE Transactions on Industrial Electronics, vol. 68, no. 4, pp. 3629-3637, April 2021, doi: 10.1109/TIE.2020.2979573.
[24] M. Ma, S. Mei, S. Wan, Z. Wang, D. D. Feng and M. Bennamoun, "Similarity Based Block Sparse Subset Selection for Video Summarization," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3967-3980, Oct. 2021, doi: 10.1109/TCSVT.2020.3044600.
[25] S. Priyadharshini and A. Mahapatra, "MOHASA: A Dynamic Video Synopsis Approach for Consumer-Based Spherical Surveillance Video," in IEEE Transactions on Consumer Electronics, vol. 70, no. 1, pp. 290-298, Feb. 2024, doi: 10.1109/TCE.2023.3324712.
Published
2024-10-10
How to Cite
Mr. Sushant Savita Madhukar Gandhi, Dr. Mukesh Shrimali, & Dr. Pradip Mane. (2024). Design Of An Improved Model For Video Summarization Using Multimodal Fusion And Reinforcement Learning. Revista Electronica De Veterinaria, 25(1), 2260-2267. https://doi.org/10.69980/redvet.v25i1.1180
Section
Articles