
Qingdao University of Science and Technology (青岛科技大学)

杨树国 (Yang Shuguo)

Research and Practice on a "Three-Axis Linkage, Five-Level Progression" Model for Cultivating Graduate Students' Innovation Ability Based on Mathematical Modeling: supporting materials for the 9th Shandong Provincial Teaching Achievement Award.

  1. Awards previously received by this achievement
  2. Statistics of graduate mathematical modeling contest awards won under the supervision of the team's key members
  3. Our university has received the "Outstanding Organization Award" of the China Postgraduate Mathematical Contest in Modeling for 12 consecutive years
  4. Teaching research projects approved for the team's key members
  5. Course development projects approved for the team's key members
  6. Teaching papers and textbooks by the team's key members
  7. Evidence of the promotion and application of the achievement


ASiT-CRNN: A method for sound event detection with fine-tuning of self-supervised pre-trained ASiT-based model

Published: 2025-07-14

  • Keywords: NEURAL-NETWORKS; TRANSFORMER
  • Abstract: Recently, using pre-trained models to transfer knowledge to downstream tasks has become a growing trend. In this paper, we present an effective sound event detection (SED) method that improves the performance of a convolutional recurrent neural network (CRNN) baseline system for DCASE 2022 Task 4 by embedding a local-global audio spectrogram vision transformer (ASiT) with a two-phase fine-tuning strategy; we refer to the resulting model as ASiT-CRNN. ASiT targets the audio classification task and is pre-trained on the large-scale audio dataset AudioSet with several self-supervised learning (SSL) methods. However, because of the differences between clip-level and frame-level tasks, feeding the output of ASiT into the SED task without any processing does not give the desired results. The ASiT-CRNN model therefore applies frequency-averaged pooling (FAP) and nearest neighbour interpolation (NNI) operations to the ASiT output of the original network architecture, in order to obtain a sequence of frame-level features and improve the temporal resolution of the embedding. The two complementary feature sequences, from ASiT and the CNN, are also fused to obtain a higher-quality and more discriminative representation of the audio features. We train the ASiT-CRNN model on the development set of DCASE 2022 Task 4 and fine-tune it in two phases using semi-supervised mean-teacher methods to address the challenge of limited labelled data. Finally, we also fairly compare ASiT with several other self-supervised pre-trained models on the SED task. The ASiT-CRNN model achieves PSDS1 and PSDS2 scores of 0.488 and 0.767, respectively, significantly better than the baseline CRNN system's scores of 0.351 and 0.552. In addition, ASiT-CRNN outperforms several other SSL pre-trained models on the SED task in the comparison experiments. Source code is available at https://github.com/qingkezyy/ASiT-CRNN. (A minimal illustrative sketch of the FAP and NNI adaptation appears after this list.)
  • Volume: 160
  • Issue:
  • Translation:
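
To make the frame-level adaptation described in the abstract more concrete, below is a minimal PyTorch sketch of how frequency-averaged pooling (FAP), nearest-neighbour interpolation (NNI), and the fusion of ASiT and CNN feature sequences could be wired together. The patch-grid layout, the concatenation-based fusion, the bidirectional GRU head, and all names (FrameLevelAdapter, FusedSEDHead) are assumptions made for illustration only; they are not taken from the paper or from the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameLevelAdapter(nn.Module):
    """Converts patch-level ASiT tokens into a frame-level feature sequence
    via frequency-averaged pooling (FAP) and nearest-neighbour interpolation (NNI).
    Shapes and patch ordering are assumptions, not the authors' implementation."""

    def __init__(self, freq_patches: int, time_patches: int, target_frames: int):
        super().__init__()
        self.freq_patches = freq_patches
        self.time_patches = time_patches
        self.target_frames = target_frames

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, freq_patches * time_patches, embed_dim);
        # any class token is assumed to have been dropped already.
        b, n, d = tokens.shape
        x = tokens.reshape(b, self.freq_patches, self.time_patches, d)
        # FAP: average over the frequency patch axis, keeping the time axis.
        x = x.mean(dim=1)                          # (batch, time_patches, embed_dim)
        # NNI: upsample along time to match the CRNN frame rate.
        x = x.transpose(1, 2)                      # (batch, embed_dim, time_patches)
        x = F.interpolate(x, size=self.target_frames, mode="nearest")
        return x.transpose(1, 2)                   # (batch, target_frames, embed_dim)


class FusedSEDHead(nn.Module):
    """Fuses ASiT frame features with CNN frame features and predicts
    frame-level event probabilities with a bidirectional GRU."""

    def __init__(self, asit_dim: int, cnn_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.rnn = nn.GRU(asit_dim + cnn_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, asit_frames: torch.Tensor, cnn_frames: torch.Tensor) -> torch.Tensor:
        # Simple concatenation fusion; the paper's actual fusion may differ.
        fused = torch.cat([asit_frames, cnn_frames], dim=-1)
        out, _ = self.rnn(fused)
        return torch.sigmoid(self.classifier(out))  # (batch, frames, num_classes)
```

The point of the sketch is the property the abstract highlights: pooling over the frequency patch axis keeps the time axis intact, and nearest-neighbour upsampling then stretches the transformer's coarser time grid to the CRNN frame rate, so the ASiT embedding can be fused with the CNN features at frame level.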