Master Thesis: Accelerating Neural Audio Synthesis

My ETH Zürich Master Thesis. Making AI-generated instrument sounds fast enough to comfortably run in real-time.

The text is available here and the code is here.

Thumbnail image, adapted from the DDSP paper.

In short, I started with DDSP (see also blog), RAVE, and NEWT as three baseline models for synthesising sounds of musical instruments. The main application of these models is timbre transfer: you put in a melody performed on one instrument (such as your voice) and you get back the same melody played on a violin, a trumpet, or whatever instrument the model was trained on.

The dream was to be able to run one of these models in a DAW so that electronic music artists can use them seamlessly. The issue is that the models are very CPU-intensive compared to regular synthesizers – when I started working on the thesis, they were barely real-time. I used techniques such as neural network quantization and pruning to speed the models up. A summary of the findings:

  • The original DDSP paper uses a model with about 6M parameters, but I found that a tiny model with roughly 7k parameters works just as well. This suggests that the DDSP architecture itself has a very strong inductive bias that determines what the model can do (see the first sketch after this list).
  • The framework you run the model in matters a lot. ONNX Runtime and PyTorch’s TorchScript were generally the fastest (see the export sketch below).
  • Quantizing to 8 bits helps, although, as I learned, the gain comes not from saved CPU cycles but from reduced memory traffic (see the quantization sketch below).
  • Pruning does nothing for speed in most libraries: the zeroed weights are still stored and multiplied like any others (see the pruning sketch below). DeepSparse is the exception, but the library was less mature than the alternatives at the time I wrote the thesis, and it didn’t support acceleration for the dilated convolutional layers that made up the bulk of the networks I was working with.
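
To make the parameter-count point concrete, here is a minimal sketch (not the thesis code) of what a tiny DDSP-style decoder can look like: a small MLP mapping per-frame pitch and loudness features to the controls of a harmonic-plus-noise synthesizer. The class name, layer sizes, and output dimensions are illustrative assumptions; the point is that a network of this size already lands in the few-thousand-parameter range, because the differentiable synthesizer, not the network, does most of the work.

```python
import torch
import torch.nn as nn

class TinyDDSPDecoder(nn.Module):
    """Hypothetical minimal DDSP-style decoder (illustration, not the thesis code).

    Maps per-frame pitch (f0) and loudness features to the controls of a
    harmonic-plus-noise synthesizer: a global amplitude, a distribution over
    harmonics, and filter magnitudes for the noise component. The synthesizer
    itself (not shown) does the heavy lifting, which is why such a small
    network can be sufficient.
    """

    def __init__(self, hidden=32, n_harmonics=60, n_noise_bands=65):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden),       # inputs: [f0, loudness] per frame
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        self.amp_out = nn.Linear(hidden, 1)
        self.harm_out = nn.Linear(hidden, n_harmonics)
        self.noise_out = nn.Linear(hidden, n_noise_bands)

    def forward(self, f0, loudness):
        # f0, loudness: (batch, frames, 1)
        h = self.net(torch.cat([f0, loudness], dim=-1))
        amplitude = torch.sigmoid(self.amp_out(h))
        harmonic_distribution = torch.softmax(self.harm_out(h), dim=-1)
        noise_magnitudes = torch.sigmoid(self.noise_out(h))
        return amplitude, harmonic_distribution, noise_magnitudes

model = TinyDDSPDecoder()
print(sum(p.numel() for p in model.parameters()))  # a few thousand parameters
```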
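
A rough sketch of how the framework comparison can be run, reusing the hypothetical TinyDDSPDecoder from above. The input shapes, the file name decoder.onnx, and the benchmark loop are assumptions for illustration; torch.jit.trace, torch.onnx.export, and onnxruntime.InferenceSession are the standard export and inference entry points.

```python
import time
import torch
import onnxruntime as ort

model = TinyDDSPDecoder().eval()   # hypothetical model from the sketch above
f0 = torch.rand(1, 250, 1)         # illustrative control frames
loud = torch.rand(1, 250, 1)

# TorchScript: trace the model to run it without eager-mode Python overhead.
scripted = torch.jit.trace(model, (f0, loud))

# ONNX: export once, then run with ONNX Runtime.
torch.onnx.export(model, (f0, loud), "decoder.onnx",
                  input_names=["f0", "loudness"])
session = ort.InferenceSession("decoder.onnx")

def bench(fn, n=1000):
    # Average wall-clock time per call over n runs.
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

with torch.no_grad():
    print("eager      :", bench(lambda: model(f0, loud)))
    print("torchscript:", bench(lambda: scripted(f0, loud)))
print("onnxruntime:", bench(lambda: session.run(
    None, {"f0": f0.numpy(), "loudness": loud.numpy()})))
```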
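
A minimal sketch of post-training dynamic quantization in PyTorch, again using the hypothetical model. The Linear weights are stored as int8 and dequantized on the fly, so the main saving is in how much weight data has to be moved, which matches the memory-bandwidth observation above.

```python
import torch

model_fp32 = TinyDDSPDecoder().eval()   # hypothetical model from the sketch above

# Dynamic quantization: Linear weights stored in int8, activations stay float.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},   # which layer types to quantize
    dtype=torch.qint8,
)

f0, loud = torch.rand(1, 250, 1), torch.rand(1, 250, 1)
with torch.no_grad():
    out_fp32 = model_fp32(f0, loud)
    out_int8 = model_int8(f0, loud)
    # Quantization error on the amplitude output, for a quick sanity check.
    print((out_fp32[0] - out_int8[0]).abs().max())
```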
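
And a sketch of why pruning alone does not help: PyTorch's pruning utilities only zero out entries of the dense weight tensors, so a normal dense kernel still performs the full computation. The 80% sparsity level here is an arbitrary example.

```python
import torch
import torch.nn.utils.prune as prune

model = TinyDDSPDecoder()   # hypothetical model from the sketch above

# Zero out 80% of the weights (by magnitude) in every Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")   # bake the zeros into the weight tensor

# The weight tensors are still dense; 80% of the entries are simply zero.
# Standard dense kernels do not skip zeros, so inference time is unchanged
# unless a sparsity-aware runtime (e.g. DeepSparse) is used.
w = model.net[0].weight
print(f"sparsity: {(w == 0).float().mean().item():.0%}")
```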
