Let's run some experiments with the SampleRNN implementation that has been most successful at producing music - that of the dadabots. My earlier hesitation about using the dadabots code was how old its dependencies are; as a result, it was slightly trickier to get up and running than the modern TensorFlow 2 PRiSM fork.
Minor Python code tweaks
I vendored the dadabots_sampleRNN codebase[1] into the 1000sharks project to make a minor adjustment in the scripts[2].
diff --git a/models/two_tier/two_tier16k.py b/models/two_tier/two_tier16k.py
index 5a579c8..9f9d53f 100644
--- a/models/two_tier/two_tier16k.py
+++ b/models/two_tier/two_tier16k.py
@@ -505,10 +505,13 @@ def generate_and_save_samples(tag):
numpy.int32(t == FRAME_SIZE)
)
- samples[:, t] = sample_level_generate_fn(
- frame_level_outputs[:, t % FRAME_SIZE],
- samples[:, t-FRAME_SIZE:t],
- )
+ try:
+ samples[:, t] = sample_level_generate_fn(
+ frame_level_outputs[:, t % FRAME_SIZE],
+ samples[:, t-FRAME_SIZE:t],
+ )
+ except:
+ pass
total_time = time() - total_time
log = "{} samples of {} seconds length generated in {} seconds."
@@ -705,8 +708,9 @@ while True:
# 5. Generate and save samples (time consuming)
# If not successful, we still have the params to sample afterward
print "Sampling!",
+ print "skipping because it crashes!"
# Generate samples
- generate_and_save_samples(tag)
+ #generate_and_save_samples(tag)
print "Done!"
if total_iters-last_print_iters == PRINT_ITERS \
After several cycles of training, the program would crash with an index-out-of-range error in the sample-slicing code. After adding a try/except, I noticed a second problem: the sample generation code (which runs in between training steps) was using up all 32GB of my system RAM (not GPU VRAM). I ended up commenting out the generate_and_save_samples call. This means that the model won't emit sample clips during training for listening to how the quality is progressing - but at least it no longer crashes.
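In hindsight, the bare except is heavy-handed. A narrower handler along these lines (a sketch using the same variables as the diff above, untested) would swallow only the index error and log it instead of passing silently:
try:
    samples[:, t] = sample_level_generate_fn(
        frame_level_outputs[:, t % FRAME_SIZE],
        samples[:, t-FRAME_SIZE:t],
    )
except IndexError as e:
    # only catch the out-of-range slice error, and log it
    # rather than silently passing
    print "sample generation failed at t={}: {}".format(t, e)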
Python setup
The dadabots SampleRNN installation instructions[3] are tailored to a Google Cloud Platform Ubuntu 16.04 setup with an NVIDIA V100 GPU. I adjusted the steps to work on my computer.
The full install steps for a functional dadabots SampleRNN (with notes) were as follows:
# we need python 2.7 for dadabots sampleRNN
# create a conda environment
$ conda create -n dadabots_SampleRNN python=2.7 anaconda
$ conda activate dadabots_SampleRNN
# we need Theano==1.0.0 - newer breaks
(dadabots_SampleRNN) $ conda install -c mila-udem -c mila-udem/label/pre theano==1.0.0 pygpu
# we need Lasagne (based on Theano) 0.2.dev1
# this command installs 0.2.dev1 correctly - suggested by https://github.com/imatge-upc/salgan/issues/29
(dadabots_SampleRNN) $ pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
# we need to create a ~/.theanorc file with some cudnn details
# cudnn.h is installed on my system through Fedora repos, as mentioned previously
#
# nvcc requires an older GCC, 8.4, which I already had set up
$ cat ~/.theanorc
[dnn]
include_path = /usr/include/cuda
[global]
mode = FAST_RUN
device = cuda0
floatX = float32
[nvcc]
compiler_bindir = /home/sevagh/GCC-8.4.0/bin
# at this point, we have a working dadabots 2-tier SampleRNN model
# install some extra pip packages
(dadabots_SampleRNN) $ pip install graphviz pydot pydot-ng
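Before training, it's worth a quick sanity check that Theano actually picked up the GPU settings from ~/.theanorc. This one-liner is my own addition, not part of the dadabots instructions; if everything is wired up, Theano prints its GPU/cuDNN banner on import and the command echoes cuda0:
# verify Theano is using the GPU configured in ~/.theanorc
(dadabots_SampleRNN) $ python -c "import theano; print(theano.config.device)"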
Training data preparation
I ran the 2-tier dadabots SampleRNN on a Cannibal Corpse album, A Skeletal Domain:
As with the 3-tier procedure, I downloaded the audio using youtube-dl, converted it to 16kHz mono with ffmpeg, placed the wav file in the required directory, and ran the data prep script.
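The download and conversion commands were along these lines (the URL and filenames are placeholders; the data prep script itself is the one from the vendored repo):
# download the album audio as wav (placeholder URL)
$ youtube-dl --extract-audio --audio-format wav "<album URL>"
# downmix to mono and resample to 16kHz for the two_tier16k model
$ ffmpeg -i downloaded.wav -ac 1 -ar 16000 skeletal_domain_16k.wav
# then place the 16kHz wav in the repo's datasets directory and run its data prep script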
I had to reduce their batch size from 128 (which crashed with a GPU out-of-memory error) to 32 to fit in the 8GB of GPU memory on my video card.
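The resulting training invocation looked roughly like the following. The flag names come from the upstream SampleRNN README, so treat this as a sketch rather than a verbatim record of the dadabots fork's arguments; the experiment and dataset names are placeholders:
(dadabots_SampleRNN) $ python models/two_tier/two_tier16k.py --exp SKELETAL_DOMAIN \
    --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn True --dim 1024 \
    --n_rnn 5 --rnn_type LSTM --q_levels 256 --q_type linear \
    --batch_size 32 --weight_norm True --learn_h0 True --which_set SKELETAL_DOMAIN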
This ran for 10 epochs, or 90 hours of training (220,000 total iterations), before I decided to stop and verify the outputs. The saved on-disk model, 23GB in size, looks like this:
The results sound great - almost exactly like what the dadabots created. The output sounds like unique (and random) compositions by the band Cannibal Corpse in the style of A Skeletal Domain, the album the model was trained to overfit.
Checkpoint/training iteration 220,001:
In their raw form, the 200 disjoint 20-second output clips have varying musical content and need to be curated. Curation is a personal choice - the dadabots discuss its role after generating 10 hours of music with their model[4] - and I wanted to work on a curation script of my own. I chose to work with the 200x 20s wav files as my curation base.
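As a starting point, here is a minimal sketch of a first-pass curation filter - my own idea, not the dadabots' method. It ranks the generated clips by RMS loudness, so that near-silent or degenerate clips sink to the bottom of the listening queue:
# curate.py: first-pass filter over the generated 20s clips (a sketch)
import glob
import numpy
from scipy.io import wavfile

clips = []
for path in sorted(glob.glob('*.wav')):
    rate, samples = wavfile.read(path)
    x = samples.astype(numpy.float64)
    # RMS loudness of the whole clip; near-zero means mostly silence
    clips.append((numpy.sqrt(numpy.mean(x ** 2)), path))

# loudest first; the quiet clips at the bottom are candidates to discard
for rms, path in sorted(clips, reverse=True):
    print('{0:12.1f}  {1}'.format(rms, path))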
Mini-conclusion
This sounds great and is the best-performing music model so far, as promised by the dadabots in their paper[4]. The paper describes their modifications to SampleRNN and the hyperparameters they chose; the important ones are:
We use a 2-tier SampleRNN with 256 embedding size, 1024 dimensions, 5 layers, LSTM, 256 linear quantization levels, 16kHz or 32kHz sample rate, skip connections, and a 128 batch size, using weight normalization. The initial state h0 is randomized to generate more variety