Adaptive representation learning for the gestural control of deep audio generative models

Sarah Nabi, PhD candidate within the EDITE doctoral school (ED130) of Sorbonne Université, conducted her research entitled « Adaptive representation learning for the gestural control of deep audio generative models » as part of the Movement Music Sound Interaction (ISMM) team and the Analysis-Synthesis team at the STMS Laboratory (IRCAM, CNRS, Sorbonne Université, Ministry of Culture), co-supervised by Frédéric Bevilacqua, Philippe Esling and Geoffroy Peeters.

Jury composition:

Rebecca Fiebrink - Professor, University of the Arts London (UAL) - Reviewer
Anna Huang - Associate Professor, Massachusetts Institute of Technology (MIT) - Reviewer
Magdalena Fuentes - Assistant Professor, New York University (NYU) - Examiner
Andrew McPherson - Professor, Imperial College London (ICL) - Examiner
Olivier Sigaud - Professor, Sorbonne Université (SU) - Examiner

Abstract:
In recent years, neural audio synthesis has advanced significantly, offering promising tools for musical creation. Neural audio synthesis models learn the underlying data distribution from a set of observations. In particular, their potential lies in their ability to learn a parametric representation, called a latent space, used to condition the synthesis process. This parallels Digital Musical Instruments (DMI), where gestural controllers drive synthesis parameters in real time, and raises the following question: To what extent could these generative models meet the technical, ergonomic, and aesthetic criteria required to qualify as a musical instrument? Since the learning process is implicit, controlling such models is challenging. These latent representations are very abstract and generally too high-dimensional to be directly interpretable. Existing methods mainly rely on conditioning, which requires retraining the synthesis model with massive sets of labeled examples and fails to consider the individual ways users may engage with such systems. This highlights the need for user-centered methods to investigate the role of these models and their integration in creative workflows, while enabling greater personalization and adaptability to diverse endeavors.

This thesis focuses on one of the core aspects of DMI design, namely sound synthesis control, and more specifically on how to provide gestural control over deep audio synthesis models for live musical performance. We acknowledge a fundamental mismatch between the performer’s gesture space and the latent space. While performers interact through gestures in a continuous, 3-dimensional Euclidean space, latent representations typically lie in non-linear, high-dimensional manifolds in which features are deeply entangled and, therefore, not directly interpretable. This disparity challenges interaction design, as straightforward control mapping strategies might fail to provide predictable or musically meaningful gestures. Although performers can leverage these topological constraints in creative ways, another strategy may be to adapt the latent representations to match specific properties of the performer’s gesture space. This reframes our problem as mapping these relevant latent parameters to a new user-adapted control space that preserves the local linearity and smoothness properties necessary for movement-based interaction. However, these relevant parameters can vary across users and models. Hence, we aim to enable users to define personalized controls from limited examples. First, we establish an art-research collaboration with a dancer to creatively explore these latent spaces through embodied interaction while creating an interactive dance/music performance. We propose a new motion-sound interactive system integrating deep audio generative models with three interaction strategies using IMU sensors, and analyze interviews with the dancer. Second, we propose the model PLaTune, a new supervised disentanglement method based on flow matching, that efficiently reshapes the latent space into a disentangled control-style space defined by the end-user to add temporal controls on pretrained synthesis models.
Finally, we propose to implicitly define controls as the underlying variations found within data groups defined by a limited set of user-specified labels. We instantiate this idea as a general framework and formalize a new contrastive learning objective in a separate variation space that defines relative differences between the grouped embeddings to create multiple views of the target control. Applying this contrastive strategy to variational autoencoders, we propose the model CoALa, which directly reshapes the pretrained latent space into a new customized control space that disentangles the targeted features under limited supervision.
