LI Zhenpeng, CHI Simin, YANG Jian
Accepted: 2025-11-02
Multimodal data (images, speech, and video) are a common form of data in the field of information science. Multimodal sentiment analysis techniques allow recognition algorithms to capture sentiment tendencies and attitudes more accurately, thereby greatly improving the user experience of artificial intelligence (AI) products. They therefore have important potential application value in economics, management, and sociology. Multimodal sentiment analysis is still in its ascendancy, and how to effectively integrate multimodal information for sentiment analysis has become an important topic in the field of artificial intelligence. This paper implements facial emotion recognition based on EfficientNetV2 and BiLSTM, speech emotion recognition based on a CNN architecture, and text emotion recognition based on a BiLSTM architecture combined with a bag-of-words model. By optimizing the model parameters and applying four multimodal fusion strategies, namely early fusion, late fusion, semantic fusion, and stage fusion, we construct a multimodal sentiment recognition algorithm. The experimental results show that the accuracy of the fused audio-visual bimodal algorithm reaches 73.86%, a 3.1% improvement in classification accuracy over Zhang et al. (2019). Multimodal recognition algorithms under the different fusion strategies significantly improve recognition accuracy compared with unimodal or bimodal methods. Meanwhile, testing on simulated datasets shows that semantic fusion performs best among the fusion strategies, and that late fusion outperforms early fusion. We also find that the multimodal classification performance is best when the influence factor of stage fusion is 0.84.
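The abstract does not give the exact formulation of the stage-fusion influence factor. As a rough illustration only, a stage-fusion step that blends the class probabilities produced by an early-fusion (feature-level) model and a late-fusion (decision-level) model with a weight α might look like the sketch below; the function name, the direction of the weighting, and the three-class setup are all assumptions, not the paper's definitive method.

```python
import numpy as np

def stage_fusion(early_probs: np.ndarray, late_probs: np.ndarray,
                 alpha: float = 0.84) -> np.ndarray:
    """Blend early-fusion and late-fusion class probabilities.

    `alpha` is the stage-fusion influence factor; 0.84 is the value the
    abstract reports as best. Which model receives the larger weight is
    an assumption here; the paper's exact formulation may differ.
    """
    fused = alpha * late_probs + (1.0 - alpha) * early_probs
    # Renormalize so the blended scores remain a probability distribution.
    return fused / fused.sum(axis=-1, keepdims=True)

# Hypothetical per-class probabilities for one sample (3 emotion classes).
early = np.array([0.2, 0.5, 0.3])  # from a feature-level (early) fusion model
late = np.array([0.1, 0.7, 0.2])   # from a decision-level (late) fusion model
print(stage_fusion(early, late))   # predicted class = argmax of the blend
```

Under this reading, α sweeps between pure early fusion (α = 0) and pure late fusion (α = 1), and the reported optimum of 0.84 would correspond to a blend dominated by the decision-level scores.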