
genreClassification

Group 27: Alexander Schott, James Johnson, Shan Yu Suen, Stuti Rao, Elizabeth Dworkin

Purpose

The purpose of this project was to create classifiers that categorize songs by genre. Manually categorizing songs by genre is a time-intensive and subjective process. Machine learning is well suited to genre assignment because it is a classification problem in which every category is known in advance. Automated genre assignment is widely applied in major music collections such as Spotify and Apple Music.

Prior research in musical genre classification has focused on content-based approaches, with an early comparative study published in 2003 indicating that the then-new Daubechies Wavelet Coefficient Histograms (DWCH) method of feature extraction yielded significantly more accurate results than other existing methods [1]. Later studies used a neural network to categorize songs into predefined genre categories, as we have done; these often rely on tempo, volume, and other features [2]. This project builds on many of the conditions of the latter study.

Million Song Dataset

The Million Song Dataset is a collection of songs tagged with various labels, including genres, along with datasets of features extracted from the MP3 files of those songs.
The specific dataset we used from the Million Song Dataset was the Top-MAGD Million Song Dataset Benchmarks. From this dataset, we used the partition mapping files to split the feature set into testing and training data, the label assignment file, and the various feature files detailed in the next section. The original dataset consisted of roughly 350,000 songs, but more than 200,000 of them came from the ‘Pop_Rock’ category alone. This imbalance risked pushing any of our models toward classifying every song as pop or rock, so we used a stratified split that ensured the training data contained an equal number of songs from every genre.

The 13 genre labels are Blues, Country, Electronic, Folk, International, Jazz, Latin, New Age, Pop Rock, Rap, Reggae, RnB, and Vocal.
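The balanced split can be sketched as follows, assuming the Top-MAGD label assignments have been loaded into a pandas DataFrame; the file name, separator, and column names here are illustrative rather than the exact ones we used.

```python
import pandas as pd

# Illustrative: load the genre label assignment file
# (track_id / genre pairs in the Top-MAGD benchmarks).
labels = pd.read_csv("topMAGD_genreAssignment.cls", sep="\t",
                     names=["track_id", "genre"])

# Draw the same number of tracks from every genre so that no single
# class (e.g. Pop_Rock) dominates the training data.
n_per_genre = labels["genre"].value_counts().min()
balanced = (labels.groupby("genre", group_keys=False)
                  .apply(lambda g: g.sample(n=n_per_genre, random_state=0)))
```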

Methods

All of the models were created in Python 3.7 using the Keras library with the TensorFlow backend. The output layer and loss function were common across models. The output layer consisted of a 13-node, softmax-activated dense layer: one node for each of the 13 possible genre labels, with softmax used because we trained against a one-hot vector representing the ground-truth genre label. The loss function was Keras’s categorical cross-entropy, since this is a classification problem in which only a single label applies to any data point. This is not necessarily true in a real-world application, but our data assigns a single label to each song, so there is no way for our loss function to express that a song is, say, a mix of Jazz and New Age music.
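As a concrete illustration, the shared output layer and loss might look like the following in Keras; the hidden layers are model-specific and omitted, and the optimizer shown is an assumption rather than something specified above.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# ... model-specific hidden layers would be added here ...
model.add(Dense(13, activation='softmax'))  # one output node per genre label

# Categorical cross-entropy against the one-hot ground-truth genre vector.
# The 'adam' optimizer is only an illustrative choice.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```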

For each model below, hyperparameters were tuned manually until the model began to overfit the training data. The tuned parameters included the number of epochs, layer and kernel sizes, and the number of layers. In general, hyperparameters were chosen to maximize validation accuracy.
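A sketch of the training loop implied by this tuning process is shown below; the array names, epoch count, and batch size are placeholders.

```python
# Train on the training split while monitoring the validation split
# after each epoch; a growing gap between training and validation
# accuracy was taken as a sign of overfitting.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=64)

# history.history holds per-epoch 'acc' and 'val_acc'
# ('accuracy'/'val_accuracy' in newer Keras versions) for comparison.
```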

Results

Perceptrons and Neural Networks

Neural network models were used on the Marsyas, JMIR, and Rhythm Histogram feature sets. None of these feature sets had a time component or an extremely large number of features, so a simple neural network was judged sufficient to classify songs using them.
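A dense network of roughly the following shape is a plausible sketch of this approach; the layer widths, dropout rate, and input dimension are illustrative, not the exact values used (see the architecture figures for those).

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_dense_model(n_features):
    """Simple fully connected classifier for a fixed-length feature vector."""
    model = Sequential()
    model.add(Dense(256, activation='relu', input_shape=(n_features,)))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(13, activation='softmax'))  # one node per genre
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```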

Marsyas

Architecture

Marsyas Architecture

Confusion Matrix

The true label is on the y axis and the predicted label is on the x axis. Marsyas Confusion

Convolutional Neural Networks

The Rhythm Pattern feature set contained a total of 1,440 features for each song. A convolutional neural network was used for this feature set as a means of extracting relevant patterns from the large two-dimensional array of features. Through a series of convolutional layers, the feature space was shrunk enough to allow a reasonably sized dense network to analyze the resulting features.

In addition, many of these feature sets include a temporal dimension that can be simplified with a kernel. Convolutional layers are known to be useful for analyzing data along a temporal dimension, and we took advantage of that wherever the features were spread across time.
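A hedged sketch of such a convolutional model is given below. The 1,440 Rhythm Pattern values are assumed here to reshape into a 24 × 60 grid, and the layer sizes are illustrative; the architecture actually used is shown in the figure that follows.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# Convolutional layers shrink the 2-D feature grid before the dense layers.
model.add(Conv2D(16, kernel_size=(3, 3), activation='relu',
                 input_shape=(24, 60, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(13, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```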

Rhythm Pattern

Architecture

Rhythm Pattern Architecture

Confusion Matrix

The true label is on the y axis and the predicted label is on the x axis. Rhythm Pattern Confusion

The models and architectures used for the additional feature sets reported here can be found in the Appendix.

| Network Type | Feature Set | Accuracy (%) | Loss |
| --- | --- | --- | --- |
| Neural Network | JMIR Low Level Spectral Data | 63.5 | 1.25 |
| Neural Network | JMIR MFCC Voice Data | 67 | 1.05 |
| Neural Network | Rhythm Histograms | 32.1 | 2.24 |
| Neural Network | Marsyas Timbral Data | 70.1 | 0.96 |
| Convolutional Neural Network | Rhythm Patterns | 31 | 3.8 |
| Convolutional Neural Network | Statistical Spectrum Data | 46 | 1.8 |
| Convolutional Neural Network | Temporal SSD | 46 | 1.8 |

Conclusion

Given the above results, it is clear that the Marsyas Timbral data are the best features to use when classifying music genres. Many of the models would fit the training data well but struggle to achieve high accuracy across all genres with the validation data. Only the Marsyas features were able to sufficiently split the feature space into genre categories such that the boundaries worked well in both the training and validation phases. It appears that the other feature sets were not able to sufficiently partition the feature space or lacked the necessary information to discriminate between various genres.

One reason the Marsyas features classified music genre so well is that they consist of heavily processed information extracted from the songs. Marsyas (Music Analysis, Retrieval and SYnthesis for Audio Signals) is an audio processing library focused on extracting information from audio signals, with an emphasis on music. It is unclear exactly how the Vienna University of Technology used the library to extract these features, as the link describing the process is dead, but, given the above results, the Marsyas features appear to be heavily correlated with the genre of the song, requiring only minimal additional analysis to extract that correlation.

Future Work

The Marsyas feature set worked quite well for classifying music overall but struggled to classify folk music. A future project could engineer a combined dataset that pairs the Marsyas feature set’s overall classification strength with the Rhythm Pattern model’s ability to classify folk music, producing a better overall classifier. Ensemble learning methods could also be applied to reduce variance in the testing accuracy, as sketched below.
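As one example of the ensembling idea, the softmax outputs of the Marsyas and Rhythm Pattern models could simply be averaged; the model and array names below are placeholders for trained models and their held-out feature matrices.

```python
import numpy as np

# Each model outputs an (n_songs, 13) matrix of genre probabilities.
marsyas_probs = marsyas_model.predict(X_marsyas_test)
rhythm_probs = rhythm_model.predict(X_rhythm_test)

# Average the two distributions and take the most likely genre.
ensemble_probs = (marsyas_probs + rhythm_probs) / 2.0
predicted_genre = np.argmax(ensemble_probs, axis=1)
```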

Appendix - Additional Feature Sets

JMIR Low Level Spectral Data

Low Level Features Result

JMIR Confusion low level

MFCC Voice Data Result

JMIR Confusion MFCC

Rhythm Histogram

Histograms

RH Confusion

Architecture

Rhythm Histogram Architecture

Statistical Spectrum Descriptor

SSD Confusion

Architecture

Statistical Spectrum Descriptor Architecture

Temporal Statistical Spectrum Descriptor

Temporal SSD Confusion

Architecture

Temporal Statistical Spectrum Descriptor Architecture

References

  1. https://dl.acm.org/citation.cfm?id=860487
  2. https://pdfs.semanticscholar.org/1314/aee7880cb2bc9fc20fba12264545caa45018.pdf
  3. http://www.ifs.tuwien.ac.at/mir/msd/