An important thing to note is that the dadabots focus heavily on curating[1] the resultant audio clips. A black-box neural network can produce any kind of audio, from silence to a cacophony of sounds and everything in between, so one needs to curate the results to combine them into a cohesive piece of music.
The dadabots published a paper and tool[2],[3] for curation:
With the creation of neural synthesis systems which output raw audio, it has become possible to generate dozens of hours of music. While not a perfect imitation of the original training data, the quality of neural synthesis can provide an artist with many variations of musical ideas. However, it is tedious for an artist to explore the full musical range and select interesting material when searching through the output.
They built a visual tool, but I wanted to write a simpler automated tool. The goal of my automated curation script is to:
Load every clip generated by the trained SampleRNN model
Trim out the silence
Potentially apply some MIR techniques* to group similar clips together, to create the sense of a "cohesive" musical piece
*Ideally, we would use popular MIR Python libraries (librosa, madmom, Essentia) to assist in the curation step, e.g. by grouping similar clips by their musical content. However, as this is more in the domain of MUMT 621 (music information retrieval) and not an easy task, I won't spend much time on it beyond the rough sketch below.
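To give a rough idea of what such grouping could look like, here is a hypothetical sketch (this is not part of the final script; the generated_clips/ directory and the choice of mean chroma vectors are my own assumptions for illustration): summarize each clip by its average chroma vector with librosa, then compare clips pairwise by cosine similarity.
import glob
import os

import librosa
import numpy as np

def mean_chroma(path):
    # summarize a clip as its average chroma vector (12 pitch classes)
    y, sr = librosa.load(path, sr=None, mono=True)
    return librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# hypothetical directory of generated clips
clips = sorted(glob.glob(os.path.join('generated_clips', '*.wav')))
features = {c: mean_chroma(c) for c in clips}
for i, c1 in enumerate(clips):
    for c2 in clips[i + 1:]:
        print(c1, c2, cosine_similarity(features[c1], features[c2]))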
Chromaprint-based curation script
The first tool I use in the curation script is librosa's silence trimming[4] to remove leading and trailing silence from the generated clips. This helps because most clips contain some musical content surrounded by lots of silence.
The next thing I need is a way to compare the non-silent pieces of musical content with each other, so that I can concatenate clips by similarity. This should have the effect of creating a maximally cohesive piece of music. I settled on chromaprints[5]. Using chromaprints, we can get an array of 32-bit integers representing the "acoustic fingerprint" of a waveform.
Combining these two elements is the first job of the curation script:
import os

import acoustid as ai
import essentia.standard as es
import librosa
import soundfile

# p (the directory of generated clips) and args.sample_rate are assumed to come
# from the script's argument parsing
all_audio = {}
for wav_file in os.listdir(p):
    full_path = os.path.join(p, wav_file)
    print('trimming silence and computing chromaprint for {0}'.format(full_path))
    x, _ = soundfile.read(full_path, dtype='float32')
    # trim trailing and leading silence
    x_trimmed, _ = librosa.effects.trim(x, top_db=50, frame_length=256, hop_length=64)
    # compute the chromaprint and decode it into a list of 32-bit integers
    fingerprint = es.Chromaprinter(sampleRate=args.sample_rate)(x_trimmed)
    int_fingerprint = ai.chromaprint.decode_fingerprint(bytes(fingerprint, encoding='utf-8'))[0]
    all_audio[full_path] = {
        'raw_audio': x_trimmed,
        'chromaprint': int_fingerprint,
    }
This code extracts the sequences of non-silent music from the generated clips and stores their chromaprints. We also need a way to compare the arrays of 32-bit integer chromaprints, to be able to rank the generated clips by their similarity to other clips. For this I borrowed some code that computes a correlation between lists of integer chromaprints[6] (a simplified sketch of the idea is shown after the loop below). I rank clips by the maximum correlation between their chromaprints:
# map correlation score -> pair of files, used to rank pairs by similarity
correlation_scores = {}

# naive O(n^2) comparison
for filename1, data1 in all_audio.items():
    for filename2, data2 in all_audio.items():
        print('comparing chromaprint correlation for {0}, {1}'.format(filename1, filename2))
        if filename1 == filename2:
            # don't compare a file to itself
            continue
        try:
            chromaprint_correlation = correlation(data1['chromaprint'], data2['chromaprint'])
        except Exception:
            # skip pairs whose fingerprints can't be compared
            continue
        correlation_scores[chromaprint_correlation] = (filename1, filename2)
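For reference, here is a minimal sketch of what the correlation function could look like (my own simplification for illustration, not the exact borrowed code from [6], which is more elaborate than this): count the matching bits between the two lists of 32-bit fingerprint integers over their overlapping length.
def correlation(fp1, fp2):
    # fraction of identical bits between two lists of 32-bit fingerprint integers,
    # compared over their overlapping length
    n = min(len(fp1), len(fp2))
    if n == 0:
        raise ValueError('empty chromaprint')
    matching_bits = 0
    for a, b in zip(fp1[:n], fp2[:n]):
        # XOR exposes the differing bits; subtract their count from 32
        matching_bits += 32 - bin((a ^ b) & 0xFFFFFFFF).count('1')
    return matching_bits / (32.0 * n)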
The last part of the script sorts the pairs of clips by their chromaprint correlation and accumulates them into a "total_curated.wav" file:
import numpy

# sort by the most highly correlated pairs of audio
sorted_correlation_scores = dict(sorted(correlation_scores.items(), reverse=True))

total_audio = None
for v in sorted_correlation_scores.values():
    print('concatenating audio by similarity')
    # if we've already consumed a clip in an earlier pair, skip it
    if v[0] not in all_audio.keys() or v[1] not in all_audio.keys():
        continue
    # keep a running accumulation of similar clips
    if total_audio is None:
        total_audio = all_audio[v[0]]['raw_audio']
    else:
        total_audio = numpy.concatenate((total_audio, all_audio[v[0]]['raw_audio']))
    total_audio = numpy.concatenate((total_audio, all_audio[v[1]]['raw_audio']))
    # delete the data we don't need anymore
    del all_audio[v[0]]
    del all_audio[v[1]]

print('writing output file')
soundfile.write("total_curated.wav", total_audio, args.sample_rate)
Curating good 2-tier results
We saw on the 2-tier experiment page how we generated some very believable clips of music which sounded like the training album, Cannibal Corpse - A Skeletal Domain. There were 200 20-second clips generated, on which we'll apply the curation script.
We'll first copy all the generated clips from epoch 10/iteration 220,001 into a directory to-curate/:
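(A rough sketch of this copy step; the source directory below is hypothetical and depends on where the trained 2-tier model wrote its generated clips.)
import glob
import os
import shutil

# hypothetical source path for the generated clips; adjust as needed
generated_dir = 'generated/epoch10_iter220001'
os.makedirs('to-curate', exist_ok=True)
for clip in glob.glob(os.path.join(generated_dir, '*.wav')):
    shutil.copy(clip, 'to-curate/')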
Then, we'll run the chromaprint-based curation script. This logs a lot of info (especially when comparing 200 clips against each other, i.e. on the order of 200*200 comparisons). Once it finishes, we can listen to the resulting total_curated.wav:
(1000sharks-curator) sevagh:dadabots_SampleRNN $ mpv total_curated.wav
Resuming playback. This behavior can be disabled with --no-resume-playback.
(+) Audio --aid=1 (pcm_s16le 1ch 16000Hz)
AO: [pulse] 16000Hz mono 1ch s16
A: 00:00:13 / 00:32:00 (0%)
It sounds great - 32 minutes of fake Cannibal Corpse. I'll name this song "DOMAINAL SKELETON", and it's included as one of the final songs in the demo. I also applied the same curation script to the results of the SampleRNN bridging-the-gap experiments, to create the songs "Animatronics as Leaders" (from the dadabots result) and "Ambiance as Leaders" (from the PRiSM result).
Salvaging bad 3-tier results
First, I use the trained 3-tier SampleRNN model described earlier to generate audio clips at various epochs. I chose the epochs randomly - the ones before checkpoint 100 are trained only on Periphery, while the ones after checkpoint 100 are trained on a mix of Periphery and Mestis (aka "Mestiphery").
According to the dadabots' README[1], they don't necessarily accept that the latest training epoch is the best one. That's why I picked a variety of epochs:
However, we found the latest checkpoint does not always create the best music. Instead we listen to the test audio generated at each checkpoint, choose our favorite checkpoint, and delete the newer checkpoints, before generating a huge batch with this script.
I generated sequences of different durations. The generate commands are as follows:
After listening to several of the files manually, I noticed that the lower epochs produce better, more cohesive, and longer bits of music. To me, this indicates the failure of experiment 2, mixing Periphery + Mestis and training for 250 epochs. This could be due to the nature of the training data: the neural network is trying to generate samples that minimize error across two distinct musical artists and styles, so it can resort to outputting nothing (to minimize numerical error between the two sets of training data). Also, as the dadabots say, perhaps training for too long isn't beneficial.
Running the curation script on all of the generated clips mentioned previously gives a 24-minute song. Although cacophonous and random, it actually sounds like (very weird) music. Also, it emphatically does not sound like Periphery or Mestis, the two bands which supplied the training data, so I didn't include it in the real results: