Kyungsu Kim¹, Minju Park¹, Haesun Joung¹, Yunki Chae², Yeongbeom Hong¹, Seonghyeon Go¹, and Kyogu Lee¹²³

¹Music and Audio Research Group (MARG), Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea
²Interdisciplinary Program in Artificial Intelligence, Seoul National University
³Artificial Intelligence Institute, Seoul National University

Fig 1. Training Single-Instrument Encoder

Fig 2. Training Multi-Instrument Encoder

Fig 3. Retrieving similar instruments from the instrument library using the proposed method.

The proposed method consists of a Single-Instrument Encoder and a Multi-Instrument Encoder. The Single-Instrument Encoder extracts an instrument embedding from a single-track recording of an instrument. Using the instrument embeddings computed by the Single-Instrument Encoder as a set of target embeddings, the Multi-Instrument Encoder is trained to estimate the embeddings of all instruments present in a mixture. Below are samples from our proposed dataset and results from our Instrument Encoder models.
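To make the two-stage setup concrete, here is a minimal PyTorch sketch. Everything in it, including the mel-spectrogram input, the tiny convolutional trunk, the fixed number of output slots, and the nearest-slot cosine loss, is an illustrative simplification of ours, not the architecture or training objective actually used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_trunk(out_dim):
    """Tiny convolutional feature extractor over a mel-spectrogram."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class SingleInstrumentEncoder(nn.Module):
    """Single-track mel-spectrogram -> one unit-norm instrument embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = conv_trunk(emb_dim)

    def forward(self, mel):                        # mel: (batch, 1, n_mels, time)
        return F.normalize(self.net(mel), dim=-1)  # (batch, emb_dim)

class MultiInstrumentEncoder(nn.Module):
    """Mixture mel-spectrogram -> a fixed number of embedding 'slots'."""
    def __init__(self, emb_dim=128, max_tracks=4):
        super().__init__()
        self.net = conv_trunk(emb_dim * max_tracks)
        self.emb_dim, self.max_tracks = emb_dim, max_tracks

    def forward(self, mix_mel):
        z = self.net(mix_mel).view(-1, self.max_tracks, self.emb_dim)
        return F.normalize(z, dim=-1)              # (batch, max_tracks, emb_dim)

def nearest_slot_loss(pred_slots, target_embs):
    """Pull each target embedding toward its closest predicted slot.

    pred_slots: (max_tracks, emb_dim); target_embs: (n_targets, emb_dim).
    A stand-in for whatever target-to-slot matching the paper actually uses.
    """
    sim = target_embs @ pred_slots.T               # cosine sims (unit-norm inputs)
    return (1.0 - sim.max(dim=1).values).mean()
```

In this sketch, the Single-Instrument Encoder would be trained first and then held fixed, so that its outputs serve as the target embeddings that `nearest_slot_loss` pulls the Multi-Instrument Encoder's slots toward.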

1. Multi-Instrument Encoder Results

The goal of our method is to retrieve the instruments used in a reference track from a library of musical instrument samples. At the inference stage, depicted in Fig 3, we use the samples from Nlakh-single as the library of musical instruments.
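As a rough sketch of this retrieval step, the snippet below performs nearest-neighbor search over the library embeddings. It assumes unit-normalized embeddings, so cosine similarity is a plain dot product; the function and variable names (`retrieve_instruments`, `library_embs`, and so on) are ours, not from the paper's code.

```python
import torch

def retrieve_instruments(pred_slots, library_embs, library_names, top_k=1):
    """Return, for each predicted slot, the most similar library instruments.

    pred_slots:   (max_tracks, emb_dim) output of the Multi-Instrument Encoder.
    library_embs: (n_library, emb_dim) Single-Instrument Encoder embeddings of
                  the Nlakh-single samples (the instrument library).
    """
    sim = pred_slots @ library_embs.T            # (max_tracks, n_library)
    scores, idx = sim.topk(top_k, dim=1)         # best matches per slot
    return [[(library_names[j], scores[i, k].item())
             for k, j in enumerate(row)]
            for i, row in enumerate(idx.tolist())]
```

With `top_k=1` this returns, for each embedding slot, the single nearest instrument in the library, loosely mirroring the per-instrument results shown below.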

Below are some of the Multi-Instrument Encoder's retrieval results.

1-1. Experiment with Nlakh Multi-track

The following examples show three cases: (a) a perfect match, (b) a partial match, where only some of the ground-truth instruments are retrieved, and (c) a mismatch, where instruments different from the ground truth are retrieved.

As the examples show, the model correctly predicts which instruments are present in the multi-track music. Moreover, even when a retrieved instrument is wrong, its characteristics are similar to the ground truth because it lies a short distance away in the embedding space.
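As a toy illustration of that "short distance" claim, with random stand-in vectors rather than real model outputs:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for two instrument embeddings; in practice these would
# come from the Single-Instrument Encoder (e.g., the ground-truth
# bass_electronic_27 vs. the retrieved bass_electronic_25 in Case (c) below).
emb_gt = F.normalize(torch.randn(128), dim=0)
emb_retrieved = F.normalize(torch.randn(128), dim=0)

# With unit-norm embeddings, cosine distance = 1 - dot product;
# a small value means the two instruments sound alike to the encoder.
cos_dist = 1.0 - torch.dot(emb_gt, emb_retrieved)
print(f"cosine distance: {cos_dist.item():.3f}")
```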

Case (a). Perfect match

Multi-track Music (spectrogram: mix.png, audio: mix.wav)

Ground Truth (Single Tracks)


Instrument: guitar_acoustic_15 (spectrogram: 19_0.png, audio: 19_0.wav)
Instrument: organ_electronic_104 (spectrogram: 36_0.png, audio: 36_0.wav)
Instrument: reed_acoustic_11 (spectrogram: 41_0.png, audio: 41_0.wav)
Instrument: string_acoustic_5 (spectrogram: 47_0.png, audio: 47_0.wav)

Retrieval Results (Samples in the Library)


Instrument: guitar_acoustic_15 (spectrogram: 20-0001.png, audio: 20-0001.wav)
Instrument: organ_electronic_104 (spectrogram: 37-0001.png, audio: 37-0001.wav)
Instrument: reed_acoustic_11 (spectrogram: 42-0001.png, audio: 42-0001.wav)
Instrument: string_acoustic_5 (spectrogram: 48-0001.png, audio: 48-0001.wav)


Case (b). Predicts only part of the ground truth

Multi-track Music (spectrogram: mix.png, audio: mix.wav)

Ground Truth (Single Tracks)

Instrument: guitar_acoustic_10 (spectrogram: 17_0.png, audio: 17_0.wav)
Instrument: reed_acoustic_37 (spectrogram: 44_0.png, audio: 44_0.wav)
Instrument: keyboard_acoustic_4 (spectrogram: 24_0.png, audio: 24_0.wav)

Retrieval Results (Samples in the Library)

Instrument: guitar_acoustic_10 (spectrogram: 18-0001.png, audio: 18-0001.wav)
Instrument: reed_acoustic_37 (spectrogram: 45-0001.png, audio: 45-0001.wav)

(The keyboard_acoustic_4 track is not retrieved.)


Case (c). Predicts instruments different from the ground truth

Multi-track Music (spectrogram: mix.png, audio: mix.wav)

Ground Truth (Single Tracks)

Instrument: bass_electronic_27 (spectrogram: 2_0.png, audio: 2_0.wav)
Instrument: keyboard_synthetic_0 (spectrogram: 31_0.png, audio: 31_0.wav)
Instrument: guitar_acoustic_10 (spectrogram: 17_0.png, audio: 17_0.wav)

Retrieval Results (Samples in the Library)

Instrument: bass_electronic_25 (spectrogram: 02-0001.png, audio: 02-0001.wav)
Instrument: keyboard_synthetic_0 (spectrogram: 32-0001.png, audio: 32-0001.wav)

(bass_electronic_25 is retrieved in place of the ground-truth bass_electronic_27, and the guitar_acoustic_10 track is not retrieved.)