Our recent findings
Zehranaz Canfes*, Furkan Atasoy*, Alara Dirik*, and Pinar Yanardag
Latent space manipulation has recently become a topic of great interest in the field of generative models. Recent research shows that latent directions can be used to manipulate images towards certain attributes. However, controlling the generation process of 3D generative models remains a challenge. In this work, we propose a novel 3D manipulation method that can manipulate both the shape and texture of a model using text- or image-based prompts such as ‘a young face’ or ‘a surprised face’. We leverage the Contrastive Language-Image Pre-training (CLIP) model together with a pre-trained 3D GAN designed to generate face avatars, and build a fully differentiable rendering pipeline to manipulate meshes. More specifically, our method takes an input latent code and modifies it so that the target attribute specified by a text or image prompt is present or enhanced, while leaving other attributes largely unaffected. Our method requires only 5 minutes per manipulation, and we demonstrate the effectiveness of our approach with extensive results and comparisons.
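To make the optimization loop concrete, here is a minimal sketch of CLIP-guided latent optimization through a differentiable renderer. The names `generator`, `differentiable_render`, and `initial_latent` are placeholders rather than the actual models used in the paper, and the loss weights and prompt are illustrative only.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for the optimization

# Encode the target text prompt once.
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["a surprised face"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

w = initial_latent.clone().requires_grad_(True)  # latent code of the 3D GAN (assumed given)
optimizer = torch.optim.Adam([w], lr=0.01)

for step in range(200):
    mesh, texture = generator(w)                  # hypothetical pre-trained 3D GAN
    image = differentiable_render(mesh, texture)  # hypothetical renderer -> (1, 3, H, W) in [0, 1]
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    img_feat = clip_model.encode_image(image)     # CLIP's input normalization omitted for brevity
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (img_feat * text_feat).sum()  # cosine distance to the prompt
    reg_loss = (w - initial_latent).pow(2).mean()   # keep unrelated attributes intact
    loss = clip_loss + 0.5 * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```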
Cemre Efe Karakas*, Alara Dirik*, Eylul Yalcinkaya, and Pinar Yanardag
Recent advances in generative adversarial networks have shown that it is possible to generate high-resolution and hyperrealistic images. However, the images produced by GANs are only as fair and representative as the datasets on which they are trained. In this paper, we propose a method that directly modifies a pre-trained StyleGAN2 model so that it generates a balanced set of images with respect to one attribute (e.g., eyeglasses) or several attributes jointly (e.g., gender and eyeglasses). Our method takes advantage of the style space of the StyleGAN2 model to perform disentangled control of the target attributes to be debiased. It does not require training additional models and directly debiases the GAN model, paving the way for its use in various downstream applications. Our experiments show that our method successfully debiases the GAN model within a few minutes without compromising the quality of the generated images.
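As a rough illustration of the general idea (not the paper's actual algorithm), the sketch below perturbs a single style-space channel of a pre-trained StyleGAN2 generator and grid-searches for the shift that brings a binary attribute closest to a 50/50 split. The generator interface, the channel coordinates, and the attribute classifier (`G`, `eyeglasses_classifier`) are all placeholders.

```python
import torch

TARGET_LAYER, TARGET_CHANNEL = 9, 409   # placeholder style-space coordinates for "eyeglasses"

@torch.no_grad()
def attribute_ratio(generator, classifier, shift, n=1000):
    """Fraction of generated images classified as having the target attribute."""
    z = torch.randn(n, 512)
    styles = generator.map_to_style_space(z)           # hypothetical: z -> style space S
    styles[:, TARGET_LAYER, TARGET_CHANNEL] += shift
    images = generator.synthesize_from_styles(styles)  # hypothetical synthesis call
    return classifier(images).sigmoid().gt(0.5).float().mean().item()

# Search for the channel shift that best balances the attribute distribution.
best_shift, best_gap = 0.0, 1.0
for shift in torch.linspace(-3.0, 3.0, 25):
    gap = abs(attribute_ratio(G, eyeglasses_classifier, shift.item()) - 0.5)
    if gap < best_gap:
        best_shift, best_gap = shift.item(), gap
```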
Meliksah Turker, Alara Dirik, and Pinar Yanardag
While recent works have shown that it is possible to find disentangled directions in the latent space of image generation networks, finding directions in the latent space of sequential models for music generation remains a largely unexplored topic. In this work, we propose a method for discovering linear directions in the latent space of a music-generating Variational Auto-Encoder (VAE). We use principal component analysis (PCA), a statistical method that transforms the data so that the variance along the new axes is maximized. We apply PCA to the latent space activations of our model and find largely disentangled directions that change the style and characteristics of the input music. Our experiments show that the discovered directions are often monotonic and global, and encode fundamental musical characteristics such as colorfulness, speed, and repetitiveness. Moreover, we propose a set of quantitative metrics that describe different musical styles and characteristics and use them to evaluate our results. We show that the discovered directions decouple content from style and can be utilized for style transfer and conditional music generation tasks.
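The core procedure is simple enough that a short sketch conveys it: fit PCA to a batch of latent codes and treat the principal components as edit directions. The VAE's `encode`/`decode` interface and the data variables below are placeholders for whichever music VAE is used.

```python
import numpy as np
from sklearn.decomposition import PCA

# 1. Collect latent codes for a corpus of input sequences (placeholder encoder call).
latents = np.stack([vae.encode(seq) for seq in training_sequences])  # (N, latent_dim)

# 2. PCA yields orthogonal directions ordered by explained variance.
pca = PCA(n_components=10)
pca.fit(latents)
directions = pca.components_          # (10, latent_dim): candidate edit directions

# 3. Move an input's latent code along one direction and decode the result.
z = vae.encode(some_sequence)
alpha = 2.0                           # edit strength; sign and magnitude control the change
edited_sequence = vae.decode(z + alpha * directions[0])  # e.g. faster or more repetitive music
```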
Umut Kocasari, Alara Dirik, Mert Tiftikci, and Pinar Yanardag
Pre-trained GANs have been shown to contain interpretable directions in their latent space. The discovery of such directions is often done in a supervised or self-supervised manner that requires manual annotations, which limits their application in practice. Unsupervised approaches, on the other hand, can discover interpretable directions without any supervision, but cannot target fine-grained attributes. Recent work such as StyleCLIP aims to overcome this limitation by leveraging the power of CLIP, a joint representation model for text and images, for text-driven image manipulation. While promising, these methods take several hours of pre-processing or training time and require multiple text prompts. In this work, we propose a fast and efficient method for text-guided image generation and manipulation by leveraging the power of StyleGAN2 and CLIP. Our method uses a CLIP-based loss and an identity loss to manipulate images via user-supplied text prompts without changing any of the irrelevant attributes. Unlike previous work, our method requires only 12 seconds of optimization per text prompt and can be used with any pre-trained StyleGAN2 model. We demonstrate the effectiveness of our method with extensive results and comparisons to state-of-the-art approaches.
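A minimal sketch of the loss combination described above is given below, assuming a StyleGAN2 generator `G`, a face-identity embedding network `id_net` (e.g., an ArcFace-style model), and an input style code `w_styles`. These names, the prompt, and the loss weights are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()

with torch.no_grad():
    t = clip_model.encode_text(clip.tokenize(["a smiling face"]).to(device))
    t = t / t.norm(dim=-1, keepdim=True)

delta = torch.zeros_like(w_styles, requires_grad=True)   # edit direction to optimize
opt = torch.optim.Adam([delta], lr=0.05)

for _ in range(100):                                      # a few seconds on a single GPU
    img_orig = G.synthesis(w_styles)                      # placeholder generator calls
    img_edit = G.synthesis(w_styles + delta)
    e = clip_model.encode_image(
        F.interpolate(img_edit, size=224, mode="bilinear", align_corners=False))
    e = e / e.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (e * t).sum()                       # match the text prompt
    id_loss = 1.0 - F.cosine_similarity(                  # preserve the subject's identity
        id_net(img_orig), id_net(img_edit)).mean()
    loss = clip_loss + 0.5 * id_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```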
Alara Dirik*, Hilal Donmez*, and Pinar Yanardag
We propose the novel task of theatrical cue generation from dialogues. Using our play scripts dataset, which consists of 775K lines of dialogue and 277K cues, we approach cue generation as a controlled text generation task and show how cues can be used to amplify the impact of dialogue by conditioning a language model on a dialogue/cue discriminator. In addition, we explore the use of topic keywords and emotions to drive cue generation. Extensive quantitative and qualitative experiments show that language models can be successfully used to generate plausible and attribute-controlled text in highly specialized domains such as theater play scripts.
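For the inference side, a hedged sketch is shown below: prompting a causal language model that has been fine-tuned on dialogue/cue pairs to emit a cue after a special marker token. The special tokens and the base checkpoint are assumptions for illustration; an off-the-shelf GPT-2 will not produce meaningful cues without the fine-tuning and discriminator conditioning described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # a cue-fine-tuned checkpoint in practice

# Hypothetical control format: a dialogue line followed by a cue marker.
prompt = "<DIALOGUE> I never want to see you again. <CUE>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated cue text, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```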