Human perception of music and images is closely related: both can evoke similar sensations such as emotion, motion, and power. The main objective of this paper is to investigate whether, and how, music and images can be bridged by machine. The contributions are threefold. First, we construct a dataset of more than 25,000 music-image pairs collected from music videos, and have human labelers annotate how well the music and image in each pair match. The annotations show that the labelers largely agree with one another on the matching degree of music-image pairs. Second, we propose semantic representations of music and images suited to the cross-modal matching task. Specifically, we use lyrics as an intermediate medium connecting music and images, and extract a set of attributes from the lyrics for image representation. Third, we propose a new method, cross-modal kernel analysis (CMKA), to learn the semantic similarity between music and images from side information. CMKA seeks embedding spaces for music and images that maximize the ordinal margin between the labeler-annotated music-image pairs and random ones. The proposed method learns the non-linear relationship between music and images and, more importantly, efficiently integrates heterogeneous data from the two modalities into a unified space. Experimental results demonstrate that the proposed method performs best on the music-image matching task.
Supplementary material with the details of this project.
A video illustrating the framework.
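The paper's CMKA objective (maximizing an ordinal margin between annotated and random pairs) is not reproduced here. As a loosely related illustration of the general idea of kernel-based cross-modal embedding, the sketch below implements plain regularized kernel CCA with NumPy: it learns dual coefficients whose kernel projections place paired music and image samples close together in a shared space. All function names, parameters, and toy features are illustrative assumptions, not the authors' code.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """RBF (Gaussian) kernel matrix between rows of A and rows of B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def center_kernel(K):
    """Center a (training) kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kernel_cca(Kx, Ky, reg=1e-2, dim=2):
    """Regularized kernel CCA, a stand-in for a cross-modal kernel method.

    Returns dual coefficients (alpha, beta) such that the projections
    Kx @ alpha (music side) and Ky @ beta (image side) are maximally
    correlated. A change of variables reduces the regularized objective
    to an SVD.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Rx = np.linalg.inv(Kx + reg * n * I)
    Ry = np.linalg.inv(Ky + reg * n * I)
    # Maximize alpha' Kx Ky beta under regularized norm constraints:
    # with u = (Kx + reg*n*I) alpha and v = (Ky + reg*n*I) beta, the
    # objective becomes u' C v over unit vectors, solved by the top SVD.
    C = Rx @ Kx @ Ky @ Ry
    U, _, Vt = np.linalg.svd(C)
    alpha = Rx @ U[:, :dim]   # music-side dual coefficients
    beta = Ry @ Vt[:dim].T    # image-side dual coefficients
    return alpha, beta
```

To score a new music clip against candidate images, one would project each side through its own kernel against the training samples (e.g. `rbf_kernel(x_new, X_train) @ alpha`) and rank by cosine similarity in the shared space; CMKA's actual ordinal-margin training and the lyric-derived attributes are specified in the paper itself.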