Our recent findings
Zehranaz Canfes*, Furkan Atasoy*, Alara Dirik*, and Pinar Yanardag
Latent space manipulation has recently become a topic of great interest in the field of generative models. Recent research shows that latent directions can be used to manipulate images towards certain attributes. However, controlling the generation process of 3D generative models remains a challenge. In this work, we propose a novel 3D manipulation method that can manipulate both the shape and texture of a model using text- or image-based prompts such as ‘a young face’ or ‘a surprised face’. We leverage the Contrastive Language-Image Pre-training (CLIP) model together with a pre-trained 3D GAN designed to generate face avatars, and build a fully differentiable rendering pipeline to manipulate meshes. More specifically, our method takes an input latent code and modifies it so that the target attribute specified by a text or image prompt is present or enhanced, while leaving other attributes largely unaffected. Our method requires only 5 minutes per manipulation, and we demonstrate the effectiveness of our approach with extensive results and comparisons.
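To make the optimization loop concrete, here is a minimal sketch of CLIP-guided latent optimization through a differentiable renderer. The names `generator`, `differentiable_render`, and `initial_latent` are placeholders rather than the actual models used in the paper, and the loss weights and prompt are illustrative only.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for the optimization

# Encode the target text prompt once.
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["a surprised face"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

w = initial_latent.clone().requires_grad_(True)  # latent code of the 3D GAN (assumed given)
optimizer = torch.optim.Adam([w], lr=0.01)

for step in range(200):
    mesh, texture = generator(w)                  # hypothetical pre-trained 3D GAN
    image = differentiable_render(mesh, texture)  # hypothetical renderer -> (1, 3, H, W) in [0, 1]
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    img_feat = clip_model.encode_image(image)     # CLIP's input normalization omitted for brevity
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (img_feat * text_feat).sum()  # cosine distance to the prompt
    reg_loss = (w - initial_latent).pow(2).mean()   # keep unrelated attributes intact
    loss = clip_loss + 0.5 * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```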
Cemre Efe Karakas*, Alara Dirik*, Eylul Yalcinkaya, and Pinar Yanardag
Recent advances in generative adversarial networks have shown that it is possible to generate high-resolution and hyperrealistic images. However, the images produced by GANs are only as fair and representative as the datasets on which they are trained. In this paper, we propose a method that directly modifies a pre-trained StyleGAN2 model so that it generates a balanced set of images with respect to one attribute (e.g., eyeglasses) or several attributes jointly (e.g., gender and eyeglasses). Our method takes advantage of the style space of the StyleGAN2 model to perform disentangled control of the target attributes to be debiased. It does not require training additional models and directly debiases the GAN model, paving the way for its use in various downstream applications. Our experiments show that our method successfully debiases the GAN model within a few minutes without compromising the quality of the generated images.
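As a rough illustration of the general idea (not the paper's actual algorithm), the sketch below perturbs a single style-space channel of a pre-trained StyleGAN2 generator and grid-searches for the shift that brings a binary attribute closest to a 50/50 split. The generator interface, the channel coordinates, and the attribute classifier (`G`, `eyeglasses_classifier`) are all placeholders.

```python
import torch

TARGET_LAYER, TARGET_CHANNEL = 9, 409   # placeholder style-space coordinates for "eyeglasses"

@torch.no_grad()
def attribute_ratio(generator, classifier, shift, n=1000):
    """Fraction of generated images classified as having the target attribute."""
    z = torch.randn(n, 512)
    styles = generator.map_to_style_space(z)           # hypothetical: z -> style space S
    styles[:, TARGET_LAYER, TARGET_CHANNEL] += shift
    images = generator.synthesize_from_styles(styles)  # hypothetical synthesis call
    return classifier(images).sigmoid().gt(0.5).float().mean().item()

# Search for the channel shift that best balances the attribute distribution.
best_shift, best_gap = 0.0, 1.0
for shift in torch.linspace(-3.0, 3.0, 25):
    gap = abs(attribute_ratio(G, eyeglasses_classifier, shift.item()) - 0.5)
    if gap < best_gap:
        best_shift, best_gap = shift.item(), gap
```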
Meliksah Turker, Alara Dirik, and Pinar Yanardag
While recent works have shown that it is possible to find disentangled directions in the latent space of image generation networks, finding directions in the latent space of sequential models for music generation remains a largely unexplored topic. In this work, we propose a method for discovering linear directions in the latent space of a music-generating Variational Auto-Encoder (VAE). We use principal component analysis (PCA), a statistical method that transforms the data so that the variance along the new axes is maximized. We apply PCA to the latent space activations of our model and find largely disentangled directions that change the style and characteristics of the input music. Our experiments show that the discovered directions are often monotonic and global, and encode fundamental musical characteristics such as colorfulness, speed, and repetitiveness. Moreover, we propose a set of quantitative metrics that describe different musical styles and characteristics and use them to evaluate our results. We show that the discovered directions decouple content from style and can be utilized for style transfer and conditional music generation tasks.
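The core procedure is simple enough that a short sketch conveys it: fit PCA to a batch of latent codes and treat the principal components as edit directions. The VAE's `encode`/`decode` interface and the data variables below are placeholders for whichever music VAE is used.

```python
import numpy as np
from sklearn.decomposition import PCA

# 1. Collect latent codes for a corpus of input sequences (placeholder encoder call).
latents = np.stack([vae.encode(seq) for seq in training_sequences])  # (N, latent_dim)

# 2. PCA yields orthogonal directions ordered by explained variance.
pca = PCA(n_components=10)
pca.fit(latents)
directions = pca.components_          # (10, latent_dim): candidate edit directions

# 3. Move an input's latent code along one direction and decode the result.
z = vae.encode(some_sequence)
alpha = 2.0                           # edit strength; sign and magnitude control the change
edited_sequence = vae.decode(z + alpha * directions[0])  # e.g. faster or more repetitive music
```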
Umut Kocasari, Alara Dirik, Mert Tiftikci, and Pinar Yanardag
Pre-trained GANs have been shown to contain interpretable directions in their latent space. The discovery of such directions is often done in a supervised or self-supervised manner that requires manual annotations, which limits their application in practice. Unsupervised approaches, on the other hand, can discover interpretable directions without any supervision, but cannot target fine-grained attributes. Recent work such as StyleCLIP aims to overcome this limitation by leveraging the power of CLIP, a joint representation model for text and images, for text-driven image manipulation. While promising, these methods take several hours of pre-processing or training time and require multiple text prompts. In this work, we propose a fast and efficient method for text-guided image generation and manipulation by leveraging the power of StyleGAN2 and CLIP. Our method uses a CLIP-based loss and an identity loss to manipulate images via user-supplied text prompts without changing any of the irrelevant attributes. Unlike previous work, our method requires only 12 seconds of optimization per text prompt and can be used with any pre-trained StyleGAN2 model. We demonstrate the effectiveness of our method with extensive results and comparisons to state-of-the-art approaches.
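A minimal sketch of the loss combination described above is given below, assuming a StyleGAN2 generator `G`, a face-identity embedding network `id_net` (e.g., an ArcFace-style model), and an input style code `w_styles`. These names, the prompt, and the loss weights are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()

with torch.no_grad():
    t = clip_model.encode_text(clip.tokenize(["a smiling face"]).to(device))
    t = t / t.norm(dim=-1, keepdim=True)

delta = torch.zeros_like(w_styles, requires_grad=True)   # edit direction to optimize
opt = torch.optim.Adam([delta], lr=0.05)

for _ in range(100):                                      # a few seconds on a single GPU
    img_orig = G.synthesis(w_styles)                      # placeholder generator calls
    img_edit = G.synthesis(w_styles + delta)
    e = clip_model.encode_image(
        F.interpolate(img_edit, size=224, mode="bilinear", align_corners=False))
    e = e / e.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (e * t).sum()                       # match the text prompt
    id_loss = 1.0 - F.cosine_similarity(                  # preserve the subject's identity
        id_net(img_orig), id_net(img_edit)).mean()
    loss = clip_loss + 0.5 * id_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```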
Alara Dirik*, Hilal Donmez*, and Pinar Yanardag
We propose the novel task of theatrical cue generation from dialogues. Using our play scripts dataset, which consists of 775K lines of dialogue and 277K cues, we approach cue generation as a controlled text generation task and show how cues can be used to amplify the impact of dialogue by conditioning a language model on a dialogue/cue discriminator. In addition, we explore the use of topic keywords and emotions to drive cue generation. Extensive quantitative and qualitative experiments show that language models can be successfully used to generate plausible and attribute-controlled text in highly specialized domains such as theater play scripts.
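For the inference side, a hedged sketch is shown below: prompting a causal language model that has been fine-tuned on dialogue/cue pairs to emit a cue after a special marker token. The special tokens and the base checkpoint are assumptions for illustration; an off-the-shelf GPT-2 will not produce meaningful cues without the fine-tuning and discriminator conditioning described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # a cue-fine-tuned checkpoint in practice

# Hypothetical control format: a dialogue line followed by a cue marker.
prompt = "<DIALOGUE> I never want to see you again. <CUE>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated cue text, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```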