An important thing to note is that the dadabots focus heavily on curating[1] the resultant audio clips. A black-box neural network can produce any kind of audio, from silence to a cacophony of sounds and everything in between, so one needs to curate the results to combine them into a cohesive piece of music.
The dadabots published a paper and tool[2],[3] for curation:
With the creation of neural synthesis systems which output raw audio, it has become possible to generate dozens of hours of music. While not a perfect imitation of the original training data, the quality of neural synthesis can provide an artist with many variations of musical ideas. However, it is tedious for an artist to explore the full musical range and select interesting material when searching through the output.
They built a visual tool, but I wanted to write a simpler automated tool. The goal of my automated curation script is to:
Load every clip generated by the trained SampleRNN model
Trim out the silence
Potentially apply some MIR techniques* to group similar clips together, to create the sense of a "cohesive" musical piece
*Ideally, we would use popular MIR Python libraries (librosa, madmom, Essentia) to assist in the curation step, e.g. by grouping similar clips by their musical content. However, as this is more in the domain of MUMT 621 (music information retrieval) and not an easy task, I won't spend much time on it beyond the rough sketch below.
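To give a rough idea of what such grouping could look like, here is a hypothetical sketch (this is not part of the final script; the generated_clips/ directory and the choice of mean chroma vectors are my own assumptions for illustration): summarize each clip by its average chroma vector with librosa, then compare clips pairwise by cosine similarity.
import glob
import os

import librosa
import numpy as np

def mean_chroma(path):
    # summarize a clip as its average chroma vector (12 pitch classes)
    y, sr = librosa.load(path, sr=None, mono=True)
    return librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# hypothetical directory of generated clips
clips = sorted(glob.glob(os.path.join('generated_clips', '*.wav')))
features = {c: mean_chroma(c) for c in clips}
for i, c1 in enumerate(clips):
    for c2 in clips[i + 1:]:
        print(c1, c2, cosine_similarity(features[c1], features[c2]))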
Chromaprint-based curation script
The first tool I use in the curation script is librosa's silence trimming[4] to remove leading and trailing silence from the generated clips. This helps because most clips contain some musical content surrounded by lots of silence.
The next thing I need is a way to compare the non-silent pieces of musical content with each other, so that I can concatenate clips by similarity. This should have the effect of creating a maximally cohesive piece of music. I settled on chromaprints[5]. Using chromaprints, we can get an array of 32-bit integers representing the "acoustic fingerprint" of a waveform.
Combining these two elements is the first job of the curation script:
import os

import acoustid as ai
import essentia.standard as es
import librosa
import soundfile

# p (the directory of generated clips) and args.sample_rate are assumed to come
# from the script's argument parsing
all_audio = {}
for wav_file in os.listdir(p):
    full_path = os.path.join(p, wav_file)
    print('trimming silence and computing chromaprint for {0}'.format(full_path))
    x, _ = soundfile.read(full_path, dtype='float32')
    # trim trailing and leading silence
    x_trimmed, _ = librosa.effects.trim(x, top_db=50, frame_length=256, hop_length=64)
    # compute the chromaprint and decode it into a list of 32-bit integers
    fingerprint = es.Chromaprinter(sampleRate=args.sample_rate)(x_trimmed)
    int_fingerprint = ai.chromaprint.decode_fingerprint(bytes(fingerprint, encoding='utf-8'))[0]
    all_audio[full_path] = {
        'raw_audio': x_trimmed,
        'chromaprint': int_fingerprint,
    }
This code extracts the sequences of non-silent music from the generated clips and stores their chromaprints. We also need a way to compare the arrays of 32-bit integer chromaprints, to be able to rank the generated clips by their similarity to other clips. For this I borrowed some code that computes a correlation between lists of integer chromaprints[6] (a simplified sketch of the idea is shown after the loop below). I rank clips by the maximum correlation between their chromaprints:
# map correlation score -> pair of files, used to rank pairs by similarity
correlation_scores = {}

# naive O(n^2) comparison
for filename1, data1 in all_audio.items():
    for filename2, data2 in all_audio.items():
        print('comparing chromaprint correlation for {0}, {1}'.format(filename1, filename2))
        if filename1 == filename2:
            # don't compare a file to itself
            continue
        try:
            chromaprint_correlation = correlation(data1['chromaprint'], data2['chromaprint'])
        except Exception:
            # skip pairs whose fingerprints can't be compared
            continue
        correlation_scores[chromaprint_correlation] = (filename1, filename2)
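For reference, here is a minimal sketch of what the correlation function could look like (my own simplification for illustration, not the exact borrowed code from [6], which is more elaborate than this): count the matching bits between the two lists of 32-bit fingerprint integers over their overlapping length.
def correlation(fp1, fp2):
    # fraction of identical bits between two lists of 32-bit fingerprint integers,
    # compared over their overlapping length
    n = min(len(fp1), len(fp2))
    if n == 0:
        raise ValueError('empty chromaprint')
    matching_bits = 0
    for a, b in zip(fp1[:n], fp2[:n]):
        # XOR exposes the differing bits; subtract their count from 32
        matching_bits += 32 - bin((a ^ b) & 0xFFFFFFFF).count('1')
    return matching_bits / (32.0 * n)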
The last part of the script sorts the pairs of clips by their chromaprint correlation and accumulates them into a "total_curated.wav" file:
import numpy

# sort by the most highly correlated pairs of audio
sorted_correlation_scores = dict(sorted(correlation_scores.items(), reverse=True))

total_audio = None
for v in sorted_correlation_scores.values():
    print('concatenating audio by similarity')
    # if we've already consumed a clip in an earlier pair, skip it
    if v[0] not in all_audio.keys() or v[1] not in all_audio.keys():
        continue
    # keep a running accumulation of similar clips
    if total_audio is None:
        total_audio = all_audio[v[0]]['raw_audio']
    else:
        total_audio = numpy.concatenate((total_audio, all_audio[v[0]]['raw_audio']))
    total_audio = numpy.concatenate((total_audio, all_audio[v[1]]['raw_audio']))
    # delete the data we don't need anymore
    del all_audio[v[0]]
    del all_audio[v[1]]

print('writing output file')
soundfile.write("total_curated.wav", total_audio, args.sample_rate)
Curating good 2-tier results
We saw on the 2-tier experiment page how we generated some very believable clips of music which sounded like the training album, Cannibal Corpse - A Skeletal Domain. There were 200 20-second clips generated, on which we'll apply the curation script.
We'll first copy all the generated clips from epoch 10/iteration 220,001 into a directory to-curate/:
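(A rough sketch of this copy step; the source directory below is hypothetical and depends on where the trained 2-tier model wrote its generated clips.)
import glob
import os
import shutil

# hypothetical source path for the generated clips; adjust as needed
generated_dir = 'generated/epoch10_iter220001'
os.makedirs('to-curate', exist_ok=True)
for clip in glob.glob(os.path.join(generated_dir, '*.wav')):
    shutil.copy(clip, 'to-curate/')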
Then, we'll run the chromaprint-based curation script. This logs a lot of info (especially when comparing 200 clips against each other, i.e. on the order of 200*200 comparisons). Once it finishes, we can listen to the resulting total_curated.wav:
(1000sharks-curator) sevagh:dadabots_SampleRNN $ mpv total_curated.wav
Resuming playback. This behavior can be disabled with --no-resume-playback.
(+) Audio --aid=1 (pcm_s16le 1ch 16000Hz)
AO: [pulse] 16000Hz mono 1ch s16
A: 00:00:13 / 00:32:00 (0%)
It sounds great - 32 minutes of fake Cannibal Corpse. I'll name this song "DOMAINAL SKELETON", and it's included as one of the final songs in the demo. I also applied the same curation script to the results of the SampleRNN bridging-the-gap experiments, to create the songs "Animatronics as Leaders" (from the dadabots result) and "Ambiance as Leaders" (from the PRiSM result).
Salvaging bad 3-tier results
First, I use the trained 3-tier SampleRNN model described earlier to generate audio clips at various epochs. I chose the epochs randomly - the ones before checkpoint 100 are trained only on Periphery, while the ones after checkpoint 100 are trained on a mix of Periphery and Mestis (aka "Mestiphery").
According to the dadabots' README[1], they don't necessarily accept that the latest training epoch is the best one. That's why I picked a variety of epochs:
However, we found the latest checkpoint does not always create the best music. Instead we listen to the test audio generated at each checkpoint, choose our favorite checkpoint, and delete the newer checkpoints, before generating a huge batch with this script.
I generated sequences of different durations. The generate commands are as follows:
After listening to several of the files manually, I noticed that the lower epochs produce better, more cohesive, and longer bits of music. To me, this indicates the failure of experiment 2, mixing Periphery + Mestis and training for 250 epochs. This could be due to the nature of the training data: the neural network is trying to generate samples that minimize error across two distinct musical artists and styles, so it can resort to outputting nothing (to minimize numerical error between the two sets of training data). Also, as the dadabots say, perhaps training for too long isn't beneficial.
Running the curation script on all of the generated clips mentioned previously gives a 24-minute song. Although cacophonous and random, it actually sounds like (very weird) music. Also, it emphatically does not sound like Periphery or Mestis, the two bands which supplied the training data, so I didn't include it in the real results: