Album art generation

SampleRNN, which I've shown so far, is a model for unconditional music generation. NVIDIA's StyleGAN2^[1] is a model for unconditional image generation, and I'll use it to create the fake album art for 1000sharks. Both are unconditional generators (audio is a 1D waveform with an implicit time axis, while an image is a 2D, non-temporal matrix of pixels), so the training and generation procedures are broadly similar to what I've described for SampleRNN.

Image preprocessing script

StyleGAN2 expects the training data to be square images that all share the same power-of-two dimension (e.g. 256x256). I wrote a Python script that crops the centered square region of each image, resizes it to dim x dim pixels, and saves the result as a png file using Pillow^[2]:

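    import os

    from PIL import Image, ImageOps

    seq = 0
    # args holds the parsed command-line arguments:
    # dim (user-supplied output size), inpaths, and outpath

    for p in args.inpaths:
        for image in os.listdir(p):
            img = Image.open(os.path.join(p, image))

            # center-crop to a square and resize to dim x dim
            thumbnail = ImageOps.fit(
                img,
                (args.dim, args.dim),
                Image.LANCZOS  # same filter as Image.ANTIALIAS, which Pillow 10 removed
            )
            thumbnail.save(os.path.join(args.outpath, '{0}.png'.format(seq)))
            seq += 1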

I've committed this script (crop_images.py) to my fork of StyleGAN2^[3]. Note that, to support the webp image format, I had to run conda install libwebp in my Conda environment before installing Pillow.
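To make the cropping step concrete, here is a minimal sketch of the geometry involved. It is not taken from crop_images.py: the helper names center_square_box and is_power_of_two are hypothetical, and the crop box only approximates what ImageOps.fit does with its default centering (Pillow's own rounding may differ by a pixel).

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; a valid --dim value should pass this check."""
    return n > 0 and (n & (n - 1)) == 0


# A 640x480 landscape image loses 80 pixels on each side.
print(center_square_box(640, 480))  # (80, 0, 560, 480)
print(is_power_of_two(256), is_power_of_two(300))  # True False
```

Checking the dimension up front is cheaper than discovering a bad value after the dataset has been built.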

Earlier on this page I mentioned vague "difficulties" with relying on older machine learning libraries. I encountered many of these with StyleGAN2:

I needed to install Python 3.7, which is the newest version of Python supported by TensorFlow 1.15 (the older version required by StyleGAN2)

I had to symlink several CUDA libraries to the 10.0 versions expected by TensorFlow 1.15:

        ln -snf /usr/lib64/libcudart.so.10.2.89 /usr/lib64/libcudart.so.10.0
        ln -snf /usr/lib64/libcublas.so.10.2.89 /usr/lib64/libcublas.so.10.0
        ln -snf /usr/lib64/libcufft.so.10.1.2.89 /usr/lib64/libcufft.so.10.0
        ln -snf /usr/lib64/libcublas.so.10.2.2.89 /usr/lib64/libcublas.so.10.0
        ln -snf /usr/lib64/libcusparse.so.10.3.1.89 /usr/lib64/libcusparse.so.10.0
        ln -snf /usr/lib64/libcurand.so.10.1.2.89 /usr/lib64/libcurand.so.10.0
        ln -snf /usr/lib64/libcusolver.so.10.3.0.89 /usr/lib64/libcusolver.so.10.0

After encountering several resource exhaustion crashes, I had to allow GPU memory growth, similar to the modification I had to make to SampleRNN: export TF_FORCE_GPU_ALLOW_GROWTH="true"
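If exporting the variable in the shell is inconvenient, the same setting can be applied from Python. This is only a sketch: the helper name allow_gpu_memory_growth is hypothetical, and it assumes the variable is set before TensorFlow is imported and initializes the GPU.

```python
import os


def allow_gpu_memory_growth():
    """Set the environment variable that asks TensorFlow to grow GPU memory on demand."""
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


allow_gpu_memory_growth()
# Assumption: TensorFlow is imported only after this point, e.g.
# import tensorflow as tf
```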

Preprocessing, training, and generation commands

The training data consists of shark images (saved from a Google image search^[4]) and heavy metal album covers (saved from this article^[5]).

After downloading these to a directory, I ran the following commands:

    # create 256x256 middle cropped images from sharks and album covers
    $ python crop_images.py --dim=256 ./output-images/ ./shark-images/ ./metal-album-covers/

    # convert the cropped images to StyleGAN2's dataset format using its own tool
    $ python dataset_tool.py create_from_images datasets/1000sharks/ ./output-images/

    # train with config-e for kimg=1000
    $ python run_training.py --data-dir=./datasets/ --dataset=1000sharks --config=config-e --total-kimg=1000

    # generate 1000 images (one per seed, 0 through 999) for curation
    $ python run_generator.py generate-images --seeds=0-999 --truncation-psi=1.0 --network=results/00008-stylegan2-1000sharks-1gpu-config-e/network-final.pkl

I chose the training configuration config-e, the second-best configuration. config-f is the best configuration in StyleGAN2, but it uses a larger neural network that's much slower to train (the configurations are explained in the source code here^[6]). I set kimg=1000, which is the training length measured in thousands of images shown to the network. This plays a role similar to the epochs of SampleRNN, in that more is probably better but increases the training time.
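As a rough sanity check on what kimg=1000 implies, the arithmetic can be sketched as follows. The minibatch size of 32 and the rate of about 105 seconds per kimg are taken from the training log of this run; the function name training_budget is hypothetical.

```python
TOTAL_KIMG = 1000      # --total-kimg from the training command
MINIBATCH = 32         # "minibatch 32" in the training log
SEC_PER_KIMG = 104.85  # "sec/kimg" from one log line; it varies slightly per tick


def training_budget(total_kimg, minibatch, sec_per_kimg):
    """Return (images shown, training iterations, pure training hours)."""
    images = total_kimg * 1000
    iterations = images // minibatch
    hours = total_kimg * sec_per_kimg / 3600
    return images, iterations, hours


images, iterations, hours = training_budget(TOTAL_KIMG, MINIBATCH, SEC_PER_KIMG)
print(images, iterations, round(hours, 1))  # 1000000 31250 29.1
```

The estimate covers the training ticks only. The remaining wall-clock time of the run presumably goes to writing snapshots and evaluating the fid50k metric, which took over 11 minutes per snapshot.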

After just under 32 hours of training, the model was done:

    tick 118   kimg 951.7    lod 0.00  minibatch 32   time 1d 06h 10m   sec/tick 845.5   sec/kimg 104.85  maintenance 0.0    gpumem 5.0
    tick 119   kimg 959.7    lod 0.00  minibatch 32   time 1d 06h 24m   sec/tick 845.2   sec/kimg 104.81  maintenance 0.0    gpumem 5.0
    tick 120   kimg 967.8    lod 0.00  minibatch 32   time 1d 06h 38m   sec/tick 844.9   sec/kimg 104.78  maintenance 0.0    gpumem 5.0
    network-snapshot-000967        time 11m 39s      fid50k 192.4708
    tick 121   kimg 975.9    lod 0.00  minibatch 32   time 1d 07h 04m   sec/tick 844.7   sec/kimg 104.76  maintenance 712.7  gpumem 5.0
    tick 122   kimg 983.9    lod 0.00  minibatch 32   time 1d 07h 18m   sec/tick 845.8   sec/kimg 104.89  maintenance 0.0    gpumem 5.0
    tick 123   kimg 992.0    lod 0.00  minibatch 32   time 1d 07h 32m   sec/tick 845.7   sec/kimg 104.87  maintenance 0.0    gpumem 5.0
    tick 124   kimg 1000.1   lod 0.00  minibatch 32   time 1d 07h 47m   sec/tick 845.6   sec/kimg 104.87  maintenance 0.0    gpumem 5.0
    network-snapshot-001000        time 11m 41s      fid50k 195.7975
    dnnlib: Finished training.training_loop.training_loop() in 1d 07h 59m.

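The logs to stdout are similar to SampleRNN's (kimg plays the role of the epoch counter, and fid50k is the Fréchet Inception Distance computed over 50,000 generated images, a quality measure where lower is better^[7]).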
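Since lower fid50k is better, it can be worth comparing snapshots rather than assuming the last one is the best. The sketch below pulls the snapshot scores out of log text shaped like the excerpt above. The regular expression and the function name parse_fid_lines are my own assumptions about the line format, not part of StyleGAN2.

```python
import re

# Matches lines such as:
# network-snapshot-000967        time 11m 39s      fid50k 192.4708
SNAPSHOT_RE = re.compile(
    r"network-snapshot-(\d+)\s+time\s+(.+?)\s+fid50k\s+([0-9.]+)"
)


def parse_fid_lines(log_text):
    """Return a list of (snapshot kimg, fid50k) pairs found in the log text."""
    results = []
    for line in log_text.splitlines():
        match = SNAPSHOT_RE.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(3))))
    return results


log = """\
network-snapshot-000967        time 11m 39s      fid50k 192.4708
tick 121   kimg 975.9    lod 0.00  minibatch 32   time 1d 07h 04m   sec/tick 844.7
network-snapshot-001000        time 11m 41s      fid50k 195.7975
"""

scores = parse_fid_lines(log)
best = min(scores, key=lambda pair: pair[1])
print(scores)  # [(967, 192.4708), (1000, 195.7975)]
print(best)    # (967, 192.4708)
```

In this excerpt the snapshot at kimg 967 scores slightly better than the final one, although a difference this small may well be within the noise of the metric.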

Curated album art

These are the 9 results I liked the most and included in the project:

References