Pre-print of new work released; sound scene city classification | audio geotagging
We are happy to release a pre-print of our latest work online today, conducted in partnership with our friends at Technical University of Tampere, Finland.
Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names.
In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes. In this problem, each city has recordings from multiple scenes.
We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve an accuracy of 50%. This is lower than the acoustic scene classification baseline in the DCASE 2018 ASC challenge on the same data. With a simple adaptation of the class labels, pairing city labels with grouped scenes, accuracy increases to 52%, closer to that of the simpler scene classification task. Finally, we formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.
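To make the label adaptation mentioned in the abstract more concrete, here is a minimal Python sketch of what pairing city labels with grouped scenes could look like. It is only an illustration: the grouping into indoor, outdoor and vehicle scenes, the `city/group` label format, and the helper names `paired_label` and `build_label_index` are all assumptions made for this example and are not taken from the paper.

```python
# Hypothetical scene grouping, assumed for illustration only;
# the grouping used in the paper may differ.
SCENE_GROUPS = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "park": "outdoor",
    "public_square": "outdoor",
    "street_pedestrian": "outdoor",
    "street_traffic": "outdoor",
    "bus": "vehicle",
    "metro": "vehicle",
    "tram": "vehicle",
}


def paired_label(city, scene):
    """Combine a city with the group of its scene into one class label."""
    return f"{city}/{SCENE_GROUPS[scene]}"


def build_label_index(examples):
    """Map each distinct paired label to an integer class id (sorted for determinism)."""
    labels = sorted({paired_label(city, scene) for city, scene in examples})
    return {label: i for i, label in enumerate(labels)}


# Toy (city, scene) annotations, one pair per recording.
examples = [
    ("helsinki", "park"),
    ("helsinki", "tram"),
    ("london", "park"),
    ("london", "airport"),
]

label_index = build_label_index(examples)
targets = [label_index[paired_label(city, scene)] for city, scene in examples]
```

In this sketch a classifier would be trained on `targets` instead of the plain city ids, so that recordings from the same city but very different scene types no longer share a single class.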