Bridging the gap between SampleRNN implementations
The "music" resulting from the default parameters of the prism-samplernn implementation led to (subjectively) bad results when trained on the Animals as Leaders self-titled album.
My main judgement criterion is how similar the generated audio sounds to the original music:
Result (from the prism-samplernn experiment page):
I'll use the same album as training input to dadabots SampleRNN, to rule out the possibility that the training input itself is at fault (i.e. somehow unlearnable). Then, I'll apply various tweaks to the prism SampleRNN implementation to see how much the music generation can be improved.
This is valuable because, as previously mentioned, the prism-samplernn codebase[1],[2] is more modern and more performant, uses Python 3 (as opposed to 2), and builds on the up-to-date TensorFlow 2 library, in contrast to the dadabots[3],[4] and reference[5] implementations. It also incorporates fast audio generation (the reference SampleRNN implementation doesn't include an audio generation script at all, and the dadabots generation is very slow).
Repeating experiment 0 with dadabots SampleRNN
I followed the same procedure as in the dadabots SampleRNN experiment on Cannibal Corpse music. The training ran for a little over 200,000 steps, which amounted to 44 hours.
The preprocessing, training, and generation commands were all the same as in the Cannibal Corpse experiment - 6400 overlapping flac files created from the 16kHz mono WAV of the entire album:
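For illustration, here is a minimal sketch of that style of overlapping-chunk preprocessing - this is not the actual dadabots script, and the chunk length, hop size, and file names are assumptions:

```python
# Illustrative sketch only (not the dadabots preprocessing script):
# slice a 16kHz mono album WAV into fixed-length, heavily overlapping FLAC chunks.
import os
import soundfile as sf

os.makedirs("chunks", exist_ok=True)
album, sr = sf.read("album_16khz_mono.wav")  # hypothetical input file name
chunk_len = 8 * sr                           # 8-second chunks (assumed)
hop = sr // 2                                # 0.5-second hop -> heavy overlap (assumed)

for i, start in enumerate(range(0, len(album) - chunk_len + 1, hop)):
    sf.write(f"chunks/chunk_{i:05d}.flac", album[start:start + chunk_len], sr)
```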
Just like with the Cannibal Corpse experiment, the resulting generated clips sound similar to the training music (although unstructured and cacophonous):
Repeating experiment 0 with PRiSM + 2-tier architecture
I repeated Experiment 0, training on a single album (Animals as Leaders' self-titled album), but after modifying the prism-samplernn code to support the 2-tier architecture (which as discussed in the overview is purported to produce better music).
The parameter `frame_sizes = [16,64]` specifies the two additional tiers, frame and big frame (on top of the base sample-level tier) - in 2-tier SampleRNN, there is no big frame. I modified my fork of prism-samplernn to accept `frame_sizes = [16]` as a configuration for 2-tier. The code changes can be viewed here. The exact steps of experiment 0 were repeated.
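As a rough illustration of the idea behind that change (a minimal sketch, not the actual prism-samplernn code; layer sizes and names are assumptions), the frame-level tiers can be built directly from the `frame_sizes` list, so a single entry yields a 2-tier model:

```python
# Illustrative sketch only: one frame-level RNN tier per entry in frame_sizes,
# on top of the sample-level tier (not shown). With [16, 64] we get the 3-tier
# model (frame + big frame); with [16] the big-frame tier is simply absent.
import tensorflow as tf

def build_frame_tiers(frame_sizes, dim=1024):
    tiers = []
    for frame_size in frame_sizes:  # e.g. [16] (2-tier) or [16, 64] (3-tier)
        tiers.append({
            "frame_size": frame_size,
            "rnn": tf.keras.layers.GRU(dim, return_sequences=True),
            # learned upsampling of frame outputs for conditioning the tier below
            "upsample": tf.keras.layers.Dense(dim * frame_size),
        })
    return tiers

two_tier = build_frame_tiers([16])
three_tier = build_frame_tiers([16, 64])
```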
The resultant clips exhibit the same strange high-pitched whistling and erratic drum beats as some other bad PRiSM-SampleRNN results - no significant improvement. Although the code "works" (in that it doesn't crash, and successfully trains and generates something at all), I can't really say whether it "works" from a neural-network perspective, i.e. whether it converges to a good solution.
The next thing to compare is the remaining model hyperparameters. In the dadabots case these are all specified on the command line, while the PRiSM implementation takes a JSON config file:
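To show the shape of such a config (the key names follow prism-samplernn's config format as I recall it, but the values here are only illustrative, not the exact settings used), a 2-tier configuration might be written like this:

```python
# Illustrative only: a 2-tier config in the style of prism-samplernn's JSON
# config files. Key names and values are examples, not canonical defaults.
import json

config = {
    "frame_sizes": [16],   # single entry -> 2-tier model (see code change above)
    "seq_len": 1024,
    "dim": 1024,
    "rnn_type": "gru",
    "num_rnn_layers": 4,
    "q_type": "mu-law",
    "q_levels": 256,
    "emb_size": 256,
}

with open("2tier_config.json", "w") as f:
    json.dump(config, f, indent=2)
```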
Another difference is that I copied the dadabots preprocessing (6400 overlapping clips, as opposed to the default, which splits albums into 8-second chunks with a 1-second overlap, resulting in 443 clips):
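As a rough sanity check on those clip counts (the album duration and the dadabots-style hop size below are my assumptions, back-derived from the quoted numbers):

```python
# Back-of-the-envelope check of the clip counts above. Album duration is
# approximate; exact counts depend on how trailing partial chunks are handled.
album_sec = 52 * 60              # ~52-minute album (assumed)
chunk_sec = 8

default_hop = chunk_sec - 1      # 8 s chunks, 1 s overlap -> 7 s hop
print(album_sec // default_hop)  # ~445, close to the 443 clips from the default split

assumed_hop = 0.5                # a ~0.5 s hop would be consistent with ~6400 clips
print(int(album_sec / assumed_hop))  # ~6240
```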
I let this run for 25 epochs (due to the greatly increased number of training clips, each epoch took much longer to train), or 44 hours (exactly the same duration as the dadabots training, for a fair comparison).
Results:
Mini-conclusion
Implementing 2-tier SampleRNN and adopting the dadabots preprocessing and hyperparameters definitely improved the results of the prism-samplernn codebase in generating progressive metal music.