The RNCM (Royal Northern College of Music) PRiSM (Practice and Research in Science and Music) lab released a modern implementation of 3-tier SampleRNN[1]:
PRiSM is shortly going to publish its own implementation, using TensorFlow 2, and we’ll be explaining the features of the PRiSM SampleRNN in our next instalment – when we will also make the code available on PRiSM’s GitHub pages, along with a number of pretrained and optimised models.
Since the Dadabots discourage the use of their own 3-tier model, we'll use the PRiSM repo (where the 3-tier model is the primary focus).
Python setup + minor code tweaks
I vendored the original prism-samplernn[2] codebase in the 1000sharks project to make a minor adjustment in the scripts[3].
Without this adjustment, training would crash on my GPU (an RTX 2070 SUPER) with the cryptic error message "Fail to find the dnn implementation" - an esoteric problem whose fix is scattered across GitHub issue threads.
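The usual remedy in those threads is TensorFlow's GPU memory-growth setting; a minimal sketch of that kind of tweak (the exact placement in the training scripts will vary):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand rather than reserving it all
# up front; on some cards the greedy default leaves cuDNN unable to
# initialize, producing the "Fail to find the dnn implementation" error.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)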
The Python setup is straightforward using conda[4] and following the project's README.md:
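Something along these lines (the environment name and Python version here are illustrative, not the README's exact values, and I'm assuming the repo ships a requirements.txt):

# Create and activate an isolated environment, then install the dependencies.
conda create -n prism-samplernn python=3.8
conda activate prism-samplernn
pip install -r requirements.txt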
I'll summarize the available SampleRNN hyperparameters and other customizable settings, compared across the original 2017 ICLR implementation, the Dadabots fork, the PRiSM fork (which I use throughout the rest of this report), and finally my own modifications to the PRiSM parameters after experiment 0:
| Hyperparameter | Original | Dadabots | PRiSM | Mine | Description |
| --- | --- | --- | --- | --- | --- |
| RNN layers | 4 | 5 | 4 | 5 | Quality of results (Dadabots note that 5 learns music better than 4) |
| Tiers | 2 or 3 | 2 or 3 (2 recommended for good music) | 3 | 2, 3 | Tiers of RNN (more = wider temporal timescale, but...*) |
| Frame sizes (correspond to tiers) | 16 | 16 | 16, 64 | 3-tier: 16, 64; 2-tier: 16 | Samples apart between the low and high timescales |
| Sample rate | 16000 (fixed) | 16000 | 16000 | 16000 | Sample rate of the training/generated waveform (lower = faster learning, better able to learn long-timescale patterns) |
| Training input | No details | Chop albums into 8 s chunks with 1 s overlap | Chop albums into 8 s chunks with 1 s overlap | Chop albums into 8 s chunks with 1 s overlap | Suggestions on how to prepare training data |
| Epochs | Not customizable | Not customizable | 100 | 100, 250 | Complete passes over the same training data (more = possibly better learning, but not necessarily) |

*: recall that 3-tier is in practice worse for music
The frame sizes 16 and 64 correspond to the additional tiers of SampleRNN (the first tier always operates on consecutive samples, i.e. n = 1). A 2-tier SampleRNN has one upper tier learning over frames of 16 samples, while a 3-tier SampleRNN has upper tiers learning over frames of 16 and 64 samples.
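To put those frame sizes in time units: at the 16 kHz sample rate used throughout these experiments, even the largest frame spans only a few milliseconds of audio:

SAMPLE_RATE = 16000                     # Hz, used throughout these experiments
ARCHITECTURES = {2: [16], 3: [16, 64]}  # upper-tier frame sizes, in samples

for tiers, frame_sizes in ARCHITECTURES.items():
    spans_ms = [1000 * size / SAMPLE_RATE for size in frame_sizes]
    print(f"{tiers}-tier: frames of {frame_sizes} samples "
          f"= {spans_ms} ms of audio per upper-tier frame")

# 2-tier: frames of [16] samples = [1.0] ms of audio per upper-tier frame
# 3-tier: frames of [16, 64] samples = [1.0, 4.0] ms of audio per upper-tier frame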
Experiment 0: training on a single album
We'll train 3-tier SampleRNN first on Animals as Leaders' self-titled album:
I downloaded the audio using youtube-dl, converted it to 16kHz mono with ffmpeg (recommended for SampleRNN to perform better), split it up into chunks (ignoring silence) using the prism-samplernn script chunk_audio.py, and ran the training with default parameters:
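Roughly, the commands look like this (the playlist ID and directory names are placeholders, and the train.py flags are approximate rather than a verbatim copy of the repo's interface):

# Fetch the album, downmix/resample to 16 kHz mono, chunk, and train.
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i "<album-playlist-id>"
mkdir -p aal-processed aal-chunks
for f in *.wav; do
    ffmpeg -i "$f" -ac 1 -ar 16000 "aal-processed/$(basename "$f")"
done
for f in aal-processed/*.wav; do
    python chunk_audio.py --input_file "$f" --output_dir aal-chunks --chunk_length 8000 --overlap 1000
done
# train.py flag names are approximate - check the repo's README / --help.
python train.py --id animals_as_leaders --data_dir ./aal-chunks --num_epochs 100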
This emitted some generated clips during training. Let's listen to two of the more musically interesting clips (a lot of them are just silence), generated at epochs 20 and 85 of the 100-epoch training. An epoch is one complete pass over the training dataset - meaning the neural network iterated over the same album 100 times to learn how to model it:
Epoch 20:
Epoch 85:
After the training was done (it took ~3 days on my machine), I generated two 10-second clips of what I hoped would be "Animals-as-Leaders-esque" music. The generate command is:
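(Approximately - the checkpoint path below is illustrative and the flag names may differ from the repo's current interface; consult generate.py for the exact options:)

# Generate two 10-second clips at 16 kHz from the epoch-90 checkpoint.
python generate.py \
    --output_path ./generated/aal.wav \
    --checkpoint_path ./logdir/animals_as_leaders/<timestamp>/model.ckpt-90 \
    --config_file ./default.config.json \
    --dur 10 \
    --num_seqs 2 \
    --sample_rate 16000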
This says to use model checkpoint 90. Even though we specified 100 epochs of training, training stops early once additional epochs are no longer improving the model; in this case it seems 90 epochs exhausted the model's learning ability. Here's one of the clips (both sound equally bad):
Experiment 0 lessons
Drawing on the tweaks Karl Hiner[5] made in his experiments, for my next experiment I would try some or all of the following:
Increase the number of RNN layers from 4 to 5 to try to create more cohesive music
Use more training data than just one album
Increase epochs from 100 to 250 (longer training may lead to better results)
We're also probably running into the same issue others have discovered: the 3-tier architecture of the PRiSM SampleRNN implementation may produce worse music than the 2-tier one.
Experiment 1: longer training on multiple albums
For my next experiment, I downloaded instrumental versions of Periphery's albums (I didn't want vocals mixing into the results, since I wanted to focus on the acoustics of musical instruments) and the albums of Mestis (an instrumental band). The data fetch and preprocessing scripts are available in my prism-samplernn fork:
#!/usr/bin/env bash
echo "Fetching training data - youtube-dl wav files for Mestis and Periphery albums"
# youtube playlists for Mestis - Eikasia, Polysemy, Basal Ganglia
mestis_album_1="PLNOrZEIoYAMgLJeZeCUEhABLPz7yqkyfI"
mestis_album_2="PLfoVvOUi1CqV0O-yMdOvTff_vp8hOQnWi"
mestis_album_3="PLRK89uMjq03BMsxBKFGBcDAh2G7ACwJMK"
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${mestis_album_1}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${mestis_album_2}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${mestis_album_3}
# youtube playlists for instrumental Periphery albums - Periphery III, I, II, IV, Omega, Juggernaut
periphery_album_1="PLSTnbYVfZR03JGmoJri6Sgvl4f0VAi9st"
periphery_album_2="PL7DVODcLLjFplM5Rw-bNUyrwAECIPRK26"
periphery_album_3="PLuEYu7jyZXdde7ePWV1RUvrpDKB8Gr6ex"
periphery_album_45="PLEFyfJZV-vtKeBedXTv82yxS7gRZkzfWr"
periphery_album_6="PL6FJ2Ri6gSpOWcbdq--P5J0IRcgH-4RVm"
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${periphery_album_1}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${periphery_album_2}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${periphery_album_3}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${periphery_album_45}
youtube-dl -ci -f "bestaudio" -x --audio-format wav -i ${periphery_album_6}
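# Sort the downloaded wav files into per-band folders (some Mestis tracks are
# matched by artist or song name rather than the band name), then delete any
# leftover wav files that didn't match.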
mkdir -p periphery-raw
mkdir -p mestis-raw
find . -maxdepth 1 -mindepth 1 -type f -iname '*PERIPHERY*.wav' -exec mv {} periphery-raw/ \;
find . -maxdepth 1 -mindepth 1 -type f -iname '*MESTIS*.wav' -exec mv {} mestis-raw/ \;
find . -maxdepth 1 -mindepth 1 -type f -iname '*Javier*.wav' -exec mv {} mestis-raw/ \;
find . -maxdepth 1 -mindepth 1 -type f -iname '*Suspiro*.wav' -exec mv {} mestis-raw/ \;
find . -maxdepth 1 -mindepth 1 -type f -name '*.wav' -exec rm {} \;
mkdir -p mestis-processed
mkdir -p periphery-processed
echo "Processing each wav file to 16kHz mono"
for f in mestis-raw/*.wav; do
    ffmpeg -i "${f}" -ac 1 -ar 16000 "mestis-processed/$(basename "$f")"
done
for f in periphery-raw/*.wav; do
    ffmpeg -i "${f}" -ac 1 -ar 16000 "periphery-processed/$(basename "$f")"
done
mkdir -p periphery-chunks
mkdir -p mestis-chunks
mkdir -p mixed-chunks
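# Chunk every processed track into 8-second pieces with 1 second of overlap,
# building per-band training sets plus a combined "mixed" set.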
for f in mestis-processed/*.wav; do
    python ../chunk_audio.py --input_file "${f}" --output_dir mestis-chunks --chunk_length 8000 --overlap 1000
    python ../chunk_audio.py --input_file "${f}" --output_dir mixed-chunks --chunk_length 8000 --overlap 1000
done
for f in periphery-processed/*.wav; do
    python ../chunk_audio.py --input_file "${f}" --output_dir periphery-chunks --chunk_length 8000 --overlap 1000
    python ../chunk_audio.py --input_file "${f}" --output_dir mixed-chunks --chunk_length 8000 --overlap 1000
done
What the script does is:
Fetch files for every Mestis song (from YouTube playlists + youtube-dl)
Fetch files for every instrumental Periphery song (from YouTube playlists + youtube-dl)
Pre-process them into 16kHz mono with ffmpeg (for optimal training)
Apply the chunk_audio.py script to split into non-silent 8-second chunks with 1 second overlap
Create 3 sets of training data - periphery-chunks, mestis-chunks, mixed-chunks
My intention was to train the model on each of the three sets of training chunks, to create generated music that resembles Periphery, resembles Mestis, or blends the two.
Here's a 30-second clip generated by the model trained on Periphery only:
Another trait is that most of the generated audio consists of silence; I was very lucky to get almost 30 seconds of musical content in a single clip. Subjectively, this sounds nothing like Periphery:
Some more clips exhibit the same high-pitched output - melodic, but again bizarre given that the model was trained on downtuned, palm-muted rhythm guitar riffs.
Epoch 67:
Epoch 70:
Overfitting to a single song, Make Total Destroy
Since SampleRNN was trained on six different albums, let's narrow down why it wasn't able to reproduce Periphery's characteristic sound by overfitting it specifically on a single song, Make Total Destroy.
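Concretely, the overfitting run amounts to chunking that one track and training on it alone (paths are placeholders and the train.py flags are approximate, as before):

mkdir -p mtd-chunks
python chunk_audio.py --input_file make_total_destroy_16khz_mono.wav \
    --output_dir mtd-chunks --chunk_length 8000 --overlap 1000
python train.py --id make_total_destroy --data_dir ./mtd-chunks --num_epochs 100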
After creating a dataset containing only the song Make Total Destroy and training on it for 100 epochs, the model reaches the following loss and accuracy:
The above clips show collections of realistic note onsets. One of my original claims about WaveNet and SampleRNN was that they could produce music with dynamics and timbre convincing enough to make us believe real humans played it. Examining various aspects of the waveform in the time and frequency domains should help evaluate that claim.
Let's view each clip (epochs 67, 70) in the time domain and frequency domain (with a spectrogram):
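Plots like these can be produced with librosa and matplotlib; here's a minimal sketch (the clip filename is a placeholder):

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# "epoch_67.wav" is a placeholder name for one of the generated clips.
y, sr = librosa.load("epoch_67.wav", sr=16000, mono=True)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6))

# Time domain: raw waveform.
t = np.arange(len(y)) / sr
ax_wave.plot(t, y, linewidth=0.5)
ax_wave.set(title="Waveform", xlabel="Time (s)", ylabel="Amplitude")

# Frequency domain: log-magnitude spectrogram.
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax_spec)
ax_spec.set(title="Spectrogram")
fig.colorbar(img, ax=ax_spec, format="%+2.0f dB")

plt.tight_layout()
plt.show()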
Although this is subjective, one can see the dynamic nature of the produced audio in the plots above. It really does look like there are real musical variations in the complex waveform (aside from the totally blank silences, which would be odd in real music).
Mu-law vs linear quantization
Karl Hiner's blog post touches on WaveNet's mu-law quantization and claims it sounds better than SampleRNN's linear quantization. However, every SampleRNN implementation I looked at has options for both linear and mu-law quantization (perhaps mu-law was added later), and the original ICLR 2017 paper even mentions a-law quantization (similar to mu-law). Let's hear what each sounds like:
Periphery epoch 83, mu-law quantization:
Periphery epoch 83, linear quantization WARNING! LOUD!:
In my subjective listening test, the linear-quantization output is very loud, almost to the point of distortion and clipping. The mu-law quantization outputs music at a more restrained volume, possibly because it's better suited to the logarithmic way humans perceive loudness.
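To make the difference concrete, here's a minimal sketch of the two quantization schemes (not the repo's actual code): mu-law companding spends far more of its 256 bins on quiet material than a linear mapping does.

import numpy as np

Q_LEVELS = 256  # 8-bit quantization
MU = Q_LEVELS - 1

def linear_quantize(x, q_levels=Q_LEVELS):
    # Uniformly quantize samples in [-1, 1] to q_levels integer bins.
    x = np.clip(x, -1.0, 1.0)
    return ((x + 1.0) / 2.0 * (q_levels - 1)).round().astype(np.int64)

def mu_law_quantize(x, q_levels=Q_LEVELS, mu=MU):
    # Mu-law companding followed by uniform quantization: small amplitudes
    # get proportionally more bins, roughly matching how we perceive loudness.
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * (q_levels - 1)).round().astype(np.int64)

# A quiet signal occupies only a handful of bins under linear quantization,
# but dozens of bins under mu-law.
quiet = 0.01 * np.sin(np.linspace(0, 20 * np.pi, 1000))
print(len(np.unique(linear_quantize(quiet))))  # a handful of distinct bins
print(len(np.unique(mu_law_quantize(quiet))))  # dozens of distinct bins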
Abandoning the original hypotheses
The results were very different from my expectations:
The resulting generated audio is mostly silence and junk
There are some interesting potentially musical sounds, but it's all high-pitched whistling and doesn't contain any characteristics of the band Periphery (palm-muted distorted guitar chords, etc.)
I would have to generate hundreds or thousands of clips and curate the results to assemble a final piece - the chances of getting one cohesive "song" (say, 2 minutes of contiguous music) are pretty slim
With the demonstrated poor quality of results (e.g. generated audio that sounds nothing like the band), my original hypotheses were debunked. I mixed the Mestis data into the Periphery data and continued training the model (which had initially been trained only on Periphery for 100 epochs) for 250 epochs. At this point my goal was to "embrace the weird" and see what sort of strange music I could create. The sum total of all my training and experiments (including failed starts) with 3-tier SampleRNN took 3 weeks.
Training results
The results of training SampleRNN are stored in the logdir directory, in timestamped subdirectories holding checkpoints from intermediate epochs of training. Here's a tree view of the training directory, which is 51GB in size after training on all of the albums listed above (709MB of music):