Show Me the Instrument: Musical Instrument Retrieval from Mixture Audio

Kyungsu Kim¹, Minju Park¹, Haesun Joung¹, Yunki Chae², Yeongbeom Hong¹, Seonghyeon Go¹, and Kyogu Lee¹²³

¹ Music and Audio Research Group (MARG), Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea
² Interdisciplinary Program in Artificial Intelligence, Seoul National University
³ Artificial Intelligence Institute, Seoul National University

Fig 1. Training Single-Instrument Encoder

Fig 2. Training Multi-Instrument Encoder

Fig 3. Retrieving similar instruments from the instrument library using the proposed method.

The proposed method consists of a Single-Instrument Encoder and a Multi-Instrument Encoder. The Single-Instrument Encoder extracts an instrument embedding from single-track audio of one instrument. The Multi-Instrument Encoder is then trained to estimate the embeddings of the multiple instruments in mixture audio, using the embeddings computed by the Single-Instrument Encoder as its targets. This page presents samples from our proposed dataset and results from our instrument encoder models.
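The training objective described above can be illustrated with a small sketch. It assumes the Multi-Instrument Encoder's outputs are compared to the Single-Instrument Encoder's target embeddings with a mean cosine distance; the page does not state the actual loss, so both the metric and the function name `cosine_distance_loss` are hypothetical.

```python
import numpy as np

def cosine_distance_loss(predicted, targets):
    """Mean cosine distance between predicted and target embeddings.

    predicted, targets: arrays of shape (num_instruments, embedding_dim).
    Row i of `predicted` is compared with row i of `targets`.
    NOTE: cosine distance is an assumption, not the loss confirmed by the source.
    """
    predicted = np.asarray(predicted, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Normalize every embedding to unit length.
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    # Row-wise cosine similarity, then convert to a distance and average.
    cosine_similarity = np.sum(p * t, axis=1)
    return float(np.mean(1.0 - cosine_similarity))
```

In this sketch the target embeddings are treated as fixed, so only the Multi-Instrument Encoder would be updated to reduce the loss.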

1. Multi-Instrument Encoder Results

The goal of our method is to retrieve the instruments used in a piece of reference music from a library of musical instrument samples. In the inference stage depicted in Fig 3, we used samples from Nlakh-single as the library of musical instruments.
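As a rough illustration of the retrieval step, the sketch below ranks library samples by their similarity to each estimated instrument embedding. It assumes cosine similarity as the measure and that library embeddings are precomputed; the function `retrieve_instruments` and its parameters are hypothetical names, not part of the released models.

```python
import numpy as np

def retrieve_instruments(query_embeddings, library_embeddings, library_names, top_k=1):
    """Return the top_k closest library samples for each query embedding.

    query_embeddings:   (num_queries, dim) embeddings estimated from the mixture.
    library_embeddings: (num_samples, dim) embeddings of the library samples.
    library_names:      list of num_samples labels for the library samples.
    NOTE: cosine similarity is an assumed measure for this sketch.
    """
    q = np.asarray(query_embeddings, dtype=float)
    lib = np.asarray(library_embeddings, dtype=float)
    # Unit-normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    lib = lib / np.linalg.norm(lib, axis=1, keepdims=True)
    similarities = q @ lib.T  # shape: (num_queries, num_samples)
    # Sort each row from most to least similar and keep the first top_k indices.
    order = np.argsort(-similarities, axis=1)[:, :top_k]
    return [[library_names[j] for j in row] for row in order]
```

With a toy three-sample library whose embeddings are the unit vectors of a 3-dimensional space, a query close to the first axis retrieves the first sample, and raising `top_k` returns further neighbors in order of similarity.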

The following are some retrieval results from the Multi-Instrument Encoder.

1-1. Experiment with Nlakh Multi-track

The following examples show three cases: (a) the prediction perfectly matches the ground truth, (b) the prediction covers only part of the ground truth, and (c) the predicted instruments differ from the ground truth.
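One way to make the three cases concrete is to compare the predicted and ground-truth instrument sets. The sketch below is only an interpretation of the case definitions, assuming instruments are identified by labels; `classify_retrieval` is a hypothetical helper.

```python
def classify_retrieval(predicted, ground_truth):
    """Assign a retrieval result to one of the three cases by set comparison.

    predicted, ground_truth: iterables of instrument labels.
    NOTE: this set-based rule is an assumed reading of cases (a), (b), and (c).
    """
    pred, truth = set(predicted), set(ground_truth)
    if pred == truth:
        return "perfect match"          # case (a)
    if pred & truth:
        return "partial match"          # case (b): some, but not all, are correct
    return "different instruments"      # case (c): no overlap with the ground truth
```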

As the examples show, the model predicts well which musical instruments are present in the multi-track music. Furthermore, even when a prediction is wrong, the retrieved instrument tends to have characteristics similar to those of the correct one, because the two lie close together in the embedding space.