AN ENHANCED AUDIO EVENT DETECTION WITH ATTENTION NEURAL NETWORKS
Abstract
Audio event detection is the task of recognizing audio events in an audio recording, and multimedia recordings are central to it. The aim of the proposed work is to improve on existing audio/sound event detection in continuous streams and audio recordings: to classify audio events such as gunshots, screaming, door slamming, bell ringing, coffee, and bird singing in a recording, and to estimate the onset and offset of these acoustic events. The proposed system uses a modern machine learning method, attention neural networks, and the enhancement in the quality of audio event detection is achieved with this attention-based approach. Five activation functions were investigated: ReLU, LeakyReLU, ReLU6, ELU, and Swish (also known as SiLU). The performance of the baseline system was evaluated with each of these activation functions, and the results were compared with those of previously studied papers. As shown in this research, the Swish network achieved an mAP of 0.361432 and a d-prime of 2.642, outperforming the ReLU network of the baseline paper, even though both achieved the same AUC with the same architecture of 1024 hidden units. Using the feature-level attention model, the Swish activation function (mAP 0.370) outperformed ReLU (mAP 0.369) from the baseline paper. Swish performance is...
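The components named in the abstract can be sketched briefly. Below is a minimal illustration (not the authors' implementation) of the five activation functions compared, of feature-level attention pooling over time frames, and of the d-prime statistic as it is commonly derived from AUC in AudioSet tagging work; the function names and shapes are illustrative assumptions.

```python
import math
from statistics import NormalDist

# Scalar versions of the five activation functions compared in this work.
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def relu6(x):
    return min(max(0.0, x), 6.0)

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x); beta = 1 gives SiLU.
    return x / (1.0 + math.exp(-beta * x))

# Feature-level attention pooling (illustrative shapes): per-frame class
# probabilities p[t][c] are averaged with attention weights a[t][c]
# normalized over the T time frames, yielding one clip-level score per class.
def attention_pool(p, a):
    T, C = len(p), len(p[0])
    clip = []
    for c in range(C):
        z = sum(a[t][c] for t in range(T))
        clip.append(sum(a[t][c] / z * p[t][c] for t in range(T)))
    return clip

# d-prime as commonly computed from AUC in AudioSet tagging evaluations,
# assuming that convention here: d' = sqrt(2) * Phi^{-1}(AUC),
# where Phi is the standard normal CDF.
def d_prime(auc):
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)
```

Under this convention a chance-level classifier (AUC = 0.5) has d' = 0, and uniform attention weights reduce `attention_pool` to a plain temporal average.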
References
Adavanne, S. and Virtanen, T., “A report on sound event detection with different binaural features,” DCASE2017 Challenge, Tech. Rep., September 2017.
Krizhevsky, A., “Convolutional deep belief networks on CIFAR-10,” unpublished manuscript, 2010.
Shah, A., “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” DCASE2017 Challenge, Tech. Rep., 2017.
Bello, J. P., Mydlarz, C., and Salamon, J., “Sound analysis in smart cities,” in Computational Analysis of Sound Scenes and Events, Virtanen, T., Plumbley, M., and Ellis, D. P., Eds. Springer, 2018, pp. 373 – 397.
Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X. Z., Raich, R., Hadley, S. J., Hadley, A. S., and Betts, M. G., “Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach,” The Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640 – 4650, 2012.
Chu, S., Narayanan, S., Kuo, C., and Mataric, M., “Where am I? Scene recognition for mobile robots using audio features,” in IEEE International Conference on Multimedia and Expo (ICME), 2006.
Dennis, J., Tran, H. D., and Li, H., “Spectrogram image feature for sound event classification in mismatched conditions,” IEEE Signal Processing Letters, vol. 18, no. 2, pp. 130 – 133, 2011.
Elizalde, B., et al., “Experimentation on the DCASE Challenge 2016: Task 1 - Acoustic scene classification and Task 3 - Sound event detection in real life audio,” DCASE2016 Challenge, Tech. Rep., 2016.
Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., Lorho, G., and Huopaniemi, J., “Audio-based context recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321 – 329, 2006.
Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M., “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776 – 780.
Goetze, S., Schroder, J., Gerlach, S., Hollosi, D., Appell, J.-E., and Wallhoff, F., “Acoustic monitoring and localization for social care,” Journal of Computing Science and Engineering, vol. 6, no. 1, pp. 40 – 50, 2012.
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. C., and Bengio, Y.,“Maxout networks.” ICML (3), vol. 28, pp. 1319 – 1327, 2013.
Han, K. J., Chandrashekaran, A., Kim, J., and Lane, I., “The capio 2017 conversational speech recognition system,” arXiv preprint arXiv:1801.00059, 2017.
Hawkins, D. M., “The problem of overfitting,” Journal of chemical information and computer sciences, vol. 44, no. 1, pp. 1 – 12, 2004.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
Kao, C.-C., Wang, W., Sun, M., and Wang, C., “R-crnn: Region-based convolutional recurrent neural network for audio event detection,” arXiv preprint arXiv:1808.06627, 2018.
Kingma D. P. and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
Kong, Q., Yu, C., Xu, Y., Iqbal, T., Wang, W., and Plumbley, M. D., “Weakly labelled AudioSet tagging with attention neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1791 – 1802, 2019.
Schütt, K. T., Kindermans, P.-J., Sauceda, H. E., Chmiela, S., Tkatchenko, A., and Müller, K.-R., “SchNet: A continuous-filter convolutional neural network for modeling quantum interactions,” in Advances in Neural Information Processing Systems, 2017.
Lee, Y., Han, D. K., and Ko, H., “Acoustic signal based abnormal event detection in indoor environment using multiclass Adaboost,” in Proc. 2013 IEEE International Conference on Consumer Electronics (ICCE), 2013, pp. 322 – 323.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
Maron, O. and Lozano-Pérez, T., “A framework for multiple-instance learning,” in Advances in Neural Information Processing Systems, 1998, pp. 570 – 576.
Okomba, N. (2024). “Intelligent lighting control systems for energy savings in hospital buildings using artificial neural networks,” FUDMA Journal of Sciences, vol. 8, no. 2, pp. 390 – 398. https://doi.org/10.33003/fjs-2024-0802-2320
Pauly, L., Peel, H., Luo, S., Hogg, D., and Fuentes, R., “Deeper networks for pavement crack detection,” in Proc. 34th International Symposium on Automation and Robotics in Construction (ISARC), 2017.
Ramachandran, P., Zoph, B., and Le, Q. V., “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
Ramachandran, P., Zoph, B., and Le, Q. V., “Swish: A self-gated activation function,” arXiv preprint, 2017.
Radhakrishnan, R., Divakaran, A., and Smaragdis, A., “Audio analysis for surveillance applications,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005. IEEE, 2005, pp. 158 – 161.
Schilit, B., Adams, N., and Want, R., “Context-aware computing applications,” in First Workshop on Mobile Computing Systems and Applications, 1994, pp. 85 – 90.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R., “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929 – 1958, 2014.
Valenzise, G., Gerosa, L., Tagliasacchi, M., Antonacci, F., and Sarti, A., “Scream and gunshot detection and localization for audio-surveillance systems,” in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), 2007, pp. 21 – 26.
Copyright (c) 2024 FUDMA JOURNAL OF SCIENCES
This work is licensed under a Creative Commons Attribution 4.0 International License.