The Detection of Crowd Noise by Fusing Two CNNs

Z Matin
6 min read · May 5, 2021

A Deep Learning Fusion Model for Audio

Our main focus in this paper is to compare the performance of different audio features in a CNN-driven binary classification system.

For this project, crowd noise was chosen as our audio of interest, and our aim was to train a neural network model that can detect whether or not an audio sample contains this sound. Crowd noise is a sound everyone knows, made up mostly of overlapping conversation, and it is often a component of the background noise one experiences when trying to focus on another sound. If one were to build a crowd-noise removal system, a detector could be useful as a way to automatically activate the removal process. The main reason for choosing crowd detection is that, while it is not as demanding as training a speech recognition system, its stochastic character makes the signal varied and complex enough for the task at hand. Crowd noise is also abundant in the audio databases available to us and forms a good basis for testing different models for audio classification.

Two sets of models are applied to the data, both convolutional neural networks. The most significant difference between them is that the first takes raw waveform data while the second takes an MFCC representation of the same data. More successful audio classification work has been based on the latter (Hershey et al., 2017), but there has also been notable success using raw audio (Dai et al., 2016). After training and analysing the performance of each model, a final late-fusion stage is implemented to see whether results improve when the two models are combined.

Given the Covid-19 outbreak, the project could not take advantage of my school's GPUs.

Training was therefore done on a smaller-than-usual dataset that could be processed on an ordinary laptop. The Isolated Urban Sound Database, which holds ~3 hours of urban soundscape recordings including ~20 minutes of crowd-labelled audio, was an appropriate labelled set for this case.

(Side note: I recently found out that Google Colab offers free GPU usage, so I might update my training and research soon!)

R a w M o d e l

As a preprocessing step, the stereo channels were downmixed to mono in order to reduce the number of data channels; this is reasonable since stereo information is not usually an important factor in an audio classification model. The data was also downsampled from 44.1 kHz to 8 kHz to improve computational speed and efficiency, an important consideration when designing a deep neural network, which is naturally compute intensive.
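The original code is not shown here, but a minimal sketch of this preprocessing step, assuming a PyTorch/torchaudio pipeline, might look like this:

```python
import torch
import torchaudio

TARGET_SR = 8000  # target rate after downsampling, as described above

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples) at native rate
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix stereo to mono
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(sr, TARGET_SR)(waveform)
    return waveform                                # (1, samples) mono at 8 kHz
```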

All audio clips are sliced into a series of non-overlapping 4-second segments (32,000 samples at 8 kHz); clips shorter than this were zero-padded to make full 4-second inputs.
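Under the same assumed PyTorch setup, the slicing and zero-padding step could be sketched as follows (the helper name and batching layout are illustrative):

```python
import torch
import torch.nn.functional as F

SEGMENT_LEN = 4 * 8000  # 4 seconds at 8 kHz = 32,000 samples

def slice_clip(waveform: torch.Tensor) -> torch.Tensor:
    """Split a (1, samples) mono clip into a (num_segments, 1, 32000) batch."""
    num_samples = waveform.shape[-1]
    pad = (-num_samples) % SEGMENT_LEN           # samples needed to reach a multiple of 4 s
    if pad:
        waveform = F.pad(waveform, (0, pad))     # zero-pad the final (or only) segment
    return waveform.view(1, -1, SEGMENT_LEN).transpose(0, 1)
```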

The model is based on Dai et al. (2016), which optimises the performance of a CNN audio classifier that takes raw audio as a one-dimensional tensor input. Our CNNs take as input a one-dimensional vector of size 32,000. Deep CNNs would normally be inefficient at producing high-level inference from a vector of this size, but through batch normalisation and some downsampling in the initial layers, the paper shows that a 34-layer CNN with residual learning performs impressively and trains faster than expected. Several variations of the architecture, differing mainly in depth, were taken from the paper and implemented in our project. The M5 variation (5 layers) produced very positive results on our test data, yielding a 97% accuracy rate (40/41 predictions correct). It also has fewer layers than every variation apart from M3, which made training considerably faster: 30 epochs took ~20 minutes, whereas the 11-layer variation took ~2 hours to reach the same point.
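The post does not include the implementation, but an M5-style network in the spirit of Dai et al. (2016) could be sketched as below; the kernel sizes and channel widths follow that paper and may not match the exact settings used here.

```python
import torch.nn as nn

class M5(nn.Module):
    """5-layer raw-waveform CNN (4 conv layers + 1 linear), after Dai et al. (2016)."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # wide first kernel with stride 4 downsamples the raw waveform early
            nn.Conv1d(1, 128, kernel_size=80, stride=4), nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 128, kernel_size=3), nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 256, kernel_size=3), nn.BatchNorm1d(256), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(256, 512, kernel_size=3), nn.BatchNorm1d(512), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.classifier = nn.Linear(512, n_classes)

    def forward(self, x):            # x: (batch, 1, 32000)
        x = self.features(x)         # (batch, 512, time)
        x = x.mean(dim=-1)           # global average pooling over time
        return self.classifier(x)    # (batch, n_classes) logits
```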

M F C C M o d e l

Mel-frequency cepstral coefficients (MFCCs) are a feature derived from the mel-spectrogram representation of audio data. They have considerably lower resolution than their counterpart, which may be suboptimal for maximising information per datum, but they are this project's preference because they are less compute intensive and more efficient to train on. It has been argued that a mel-spectrogram is preferable for more robust results in a high-level classifier, but an MFCC should suffice since the task here is a relatively simple binary classification (Huzaifah, 2017).
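With torchaudio, the MFCC front-end might be set up as follows; the coefficient count and FFT parameters here are assumptions rather than values taken from the project.

```python
import torch
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=8000,
    n_mfcc=40,                                                  # assumed number of coefficients
    melkwargs={"n_fft": 512, "hop_length": 128, "n_mels": 64},  # assumed STFT/mel settings
)

segment = torch.randn(1, 32000)     # stand-in for one 4-second mono slice
mfcc = mfcc_transform(segment)      # -> (1, 40, 251) time-frequency "image"
```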

Before the popularisation of neural networks, audio event detection systems often combined time-frequency representations with traditional machine learning classifiers such as GMMs, HMMs and SVMs. More recently, deep neural networks have become the state of the art because of their capacity to take in large amounts of data, a significant advantage over the former classifiers. The use of time-frequency representations in DNNs perhaps follows the success of image processing in this field.

Like images, MFCC data is two-dimensional, which in turn calls for 2D convolutional layers. Five convolutional layers were used for this model as well, along with batch normalisation. After initially observing overfitting, indicated by an extremely low training loss, a dropout layer was included after the last fully connected layer.
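A hedged sketch of such a model, with five 2D convolutional blocks, batch normalisation and dropout, is given below; the channel widths, pooling scheme and exact placement of the dropout layer are assumptions.

```python
import torch.nn as nn

class MFCCNet(nn.Module):
    """Five 2D conv blocks with batch norm, dropout, and a single linear classifier."""
    def __init__(self, n_classes: int = 2, p_drop: float = 0.5):
        super().__init__()
        channels = [1, 16, 32, 64, 64, 128]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.dropout = nn.Dropout(p_drop)        # added to counter the overfitting noted above
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mfcc, frames), e.g. (batch, 1, 40, 251)
        x = self.features(x)
        x = x.mean(dim=(-2, -1))       # global average pooling -> (batch, 128)
        return self.classifier(self.dropout(x))
```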

After experimenting further with several forms of the architecture, it was found that a network with five convolutional layers produced the best result, achieving an accuracy rate of 62% (25/41), only slightly better than chance. Training was considerably faster than for the raw waveform model, with 30 epochs taking ~15 minutes.

F u s i o n

A late-fusion approach combines the two models by creating an ensemble network that optimises over the outputs of the two pre-trained models. The parameters of the trained models are stored in separate files and subsequently loaded into an ensemble module defined in a separate fusion file. In this fusion network, the last linear layer of each pretrained model is removed, the resulting feature outputs are concatenated, and the combined tensor passes through a final linear layer and non-linear activation function, with all pre-trained parameters remaining unaffected throughout training.
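A minimal sketch of this ensemble, reusing the hypothetical M5 and MFCCNet classes from the earlier sketches and hypothetical checkpoint file names, might look like this:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, raw_model: nn.Module, mfcc_model: nn.Module, n_classes: int = 2):
        super().__init__()
        # remove each model's last linear layer so both output feature vectors
        raw_model.classifier = nn.Identity()     # M5 now returns 512-d features
        mfcc_model.classifier = nn.Identity()    # MFCCNet now returns 128-d features
        for p in list(raw_model.parameters()) + list(mfcc_model.parameters()):
            p.requires_grad = False              # pre-trained weights stay fixed during training
        self.raw_model, self.mfcc_model = raw_model, mfcc_model
        self.fc = nn.Linear(512 + 128, n_classes)

    def forward(self, wave, mfcc):
        feats = torch.cat([self.raw_model(wave), self.mfcc_model(mfcc)], dim=1)
        return torch.sigmoid(self.fc(feats))     # final linear layer + non-linear activation

# hypothetical checkpoint paths; the post only states that the parameters are saved to files
raw_net = M5()
raw_net.load_state_dict(torch.load("raw_model.pt"))
mfcc_net = MFCCNet()
mfcc_net.load_state_dict(torch.load("mfcc_model.pt"))
fusion = FusionNet(raw_net, mfcc_net)
```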

Even though our individual results show the raw model outperforming the MFCC model by some margin, it may be that the latter is better at classifying certain sounds than the former, as is the case in several papers (e.g. Salamon 2017). Our fusion model was relatively fast to train (30 epochs in ~10 minutes) and yielded a 97% accuracy rate on the test set.

Considering how little room there is to improve on the first model, this outcome should come as no surprise. If results for both models had been low, the evaluation would be more interesting, as there would be more room for the fusion model to improve on both. A multi-label classification task could perhaps benefit more from fusion, since each individual network's performance can be measured across a number of classes.

C o n c l u d i n g N o t e s

Although a test set was used to compare the performance of each model, it was drawn from a relatively small dataset, so we should not hastily compare our results, and we should be cautious about drawing strong inferences from these findings. This paper should, however, give further encouragement to research into, and a rethink of, how raw audio works in deep neural networks. Feeding a simple raw waveform into a neural network may allow feature extraction to occur naturally, jointly with classification. The network could extract some feature representation within its layers, like its own version of a mel-spectrogram, and adjust its parameters (e.g. the number of frequency bands) in a supervised manner to optimise classification. By letting the “feature engineering” process be decided by the network itself, there is potential for the network to make use of all the information contained in the raw audio.

R e f e r e n c e s

S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv preprint arXiv:1502.03167, 2015.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, and R. C. Moore, “CNN Architectures for Large-Scale Audio Classification,” https://arxiv.org/pdf/1609.09430.pdf, 2017.

W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very Deep Convolutional Neural Networks for Raw Waveforms,” https://arxiv.org/pdf/1610.00087.pdf, 2016.

M. Huzaifah, “Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks,” https://arxiv.org/pdf/1706.07156.pdf, 2017.
