The next iteration of CANNe, and a paper accepted in IJCNN 2020.
One of the main frustrations I had with CANNe was its unruly nature. In its first implementation, CANNe was trained on a dataset of 5-octave scales played on almost every patch of a MicroKORG synthesizer. While I tuned the network's hyperparameters to improve its reconstruction quality, I didn't pay enough attention to the synthesizer's usability.
When sampling CANNe's latent space, there was very little control over what sounds came out. Sometimes they were soothing and gentle, but often they were screechy and unpleasant. Moreover, there was no control over what note came out of the synthesizer: you got what you got.
To address these issues, I went back to the drawing board with the training corpus and network design. After returning to the literature and some brainstorming, I came up with a way to condition the network to output a specific note class. I could ask the network for a C note, for example, though I did not ask it to distinguish between C3 and C5.
To do this, I fed a one-hot encoded vector of the input spectrogram's chroma (pitch class) into both the encoder's first layer and the bottleneck layer of the autoencoder.
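The conditioning idea can be sketched as concatenating the chroma one-hot onto both the encoder input and the latent code before decoding. This is a minimal NumPy sketch with made-up layer sizes (the actual network dimensions, activations, and layer counts are assumptions, not the paper's architecture):

```python
import numpy as np

# Hypothetical dimensions -- the real layer sizes may differ.
SPEC_BINS = 2048   # spectrogram frame size (assumed)
LATENT_DIM = 8     # bottleneck size (assumed)
N_CHROMA = 12      # one class per pitch class, C through B

def one_hot_chroma(pitch_class):
    """One-hot encode a pitch class index (0 = C, ..., 11 = B)."""
    v = np.zeros(N_CHROMA)
    v[pitch_class] = 1.0
    return v

def encode(spec_frame, chroma, W_enc):
    # Chroma is appended to the spectrogram frame at the encoder's input.
    x = np.concatenate([spec_frame, chroma])   # (SPEC_BINS + 12,)
    return np.tanh(W_enc @ x)                  # (LATENT_DIM,)

def decode(z, chroma, W_dec):
    # Chroma is appended again at the bottleneck, so the decoder always
    # sees the requested note class regardless of where z was sampled.
    zc = np.concatenate([z, chroma])           # (LATENT_DIM + 12,)
    return np.maximum(W_dec @ zc, 0.0)         # non-negative magnitudes

# Shape check with random weights (no training here).
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(LATENT_DIM, SPEC_BINS + N_CHROMA))
W_dec = rng.normal(size=(SPEC_BINS, LATENT_DIM + N_CHROMA))
z = encode(rng.normal(size=SPEC_BINS), one_hot_chroma(0), W_enc)
out = decode(z, one_hot_chroma(0), W_dec)
```

Injecting the label at the bottleneck as well as the input is what lets you sample z freely and still steer the output pitch class.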
I also took a subset of the 5-octave dataset to create a 1-octave dataset, eliminating training examples that were very high-pitched and screechy.
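Building the subset amounts to filtering the corpus by note range. A toy sketch, where the filenames, MIDI numbers, and the choice of octave (C3 to B3, MIDI 48 to 59) are all illustrative assumptions:

```python
# Hypothetical labeled corpus: (filename, midi_note) pairs.
dataset = [("a.wav", 36), ("b.wav", 50), ("c.wav", 55), ("d.wav", 84)]

# Keep only notes inside one assumed octave, C3..B3 (MIDI 48..59).
one_octave = [(f, n) for f, n in dataset if 48 <= n <= 59]
# → [("b.wav", 50), ("c.wav", 55)]
```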
The conditioning is not perfect, as shown below. Even when I asked for an F note, sampling from the bottom-right portion of the latent space yielded D notes.