Image and Video Synthesis


Dynamic textures are image sequences that exhibit visual pattern repetition in time and space, such as smoke, flames, or moving water. They form one class of textures, defined either as patterns with temporal regularity in time and space or as temporal changes in the spectral parameters of image sequences. Dynamic texture synthesis is, in general, texture synthesis applied to video data. Extending video to arbitrary lengths in time and space has become a major topic in computer vision, with broad applications in digital special effects, video editing, non-photorealistic rendering, and video synthesis. Our work on video synthesis aims to understand image content through its textural properties and to retrieve visual information, so that operations on dynamic textures can provide a continuous and infinitely varying stream of images. Synthesis of dynamic textures typically refers to rendering natural phenomena such as waves, waterfalls, or fire. Recently, video synthesis has been extended to human facial motion, such as speech and expressions, with potential applications in games, movies, animation, and virtual reality.
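As a concrete illustration of "texture synthesis applied to video data", the sketch below fits a linear dynamical system (LDS) to a sequence of frames and rolls it forward to generate an arbitrarily long sequence. This is a standard dynamic-texture model from the literature, not necessarily the method used in the publications below; the function names and the choice of SVD-based fitting are our own assumptions for this sketch.

```python
import numpy as np

def fit_dynamic_texture(frames, n_states=20):
    """Fit an LDS dynamic-texture model (a common approach; an assumption here).

    frames: (n_frames, n_pixels) array of flattened grayscale frames.
    Model: y_t = C x_t + mean, with hidden state x_{t+1} = A x_t.
    """
    Y = frames.T.astype(float)                 # (n_pixels, n_frames)
    mean = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n_states]                        # observation (appearance) matrix
    X = s[:n_states, None] * Vt[:n_states]     # hidden states over time
    # Least-squares estimate of the state-transition matrix A
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, mean, X[:, 0]

def synthesize(A, C, mean, x0, n_frames, noise=0.0, rng=None):
    """Roll the LDS forward to produce a sequence of any desired length."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x, out = x0.copy(), []
    for _ in range(n_frames):
        out.append(C @ x + mean[:, 0])         # project state to image space
        x = A @ x + noise * rng.standard_normal(x.shape)
    return np.array(out)
```

Because the model is generative, the synthesized sequence can run far longer than the training clip, which is exactly the "infinitely varying stream of images" the text refers to.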


One application is video-realistic speech animation, a topic that plays an important role in affective human-robot interaction. The goal is to synthesize a visually realistic face that talks the way we do, providing a natural platform for a human user and a robot to communicate with each other. The same techniques also have other potential applications, such as generating synchronized visual cues for audio to help hearing-impaired people better capture information, or creating human characters in movies. As illustrated in the figure, we first record a video corpus in which a human character is asked to speak different utterances. The mouth is then cropped from the original speech videos and used to learn generative models for synthesizing novel mouth images. A generative model treats the whole utterance contained in a video as a continuous process and represents it using a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions to image space is found through graph embedding. Such a model allows us to synthesize mouth images at arbitrary positions in the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones to find the phoneme combinations that best approximate it. New mouth images are then synthesized from the learned models based on these combinations. Finally, the synthesized mouth is projected seamlessly back onto a background video to gain realism.
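The path-graph idea above can be sketched in a few lines. The eigenvectors of a path graph's Laplacian are sampled cosines, so each training frame of an utterance receives low-dimensional trigonometric coordinates, and a least-squares map projects those coordinates to image space. Evaluating the same cosines at a fractional position then synthesizes a mouth image "between" training frames. This is a minimal illustration of the general technique, with dimensions, function names, and the least-squares fit chosen by us; it is not the authors' exact formulation.

```python
import numpy as np

def trig_embedding(t, n, dims):
    """Path-graph embedding coordinates at (possibly fractional) position t.

    Uses the path-graph Laplacian eigenvector form cos(pi*k*(2t+1)/(2n)),
    evaluated continuously; dims is the number of trigonometric components.
    """
    k = np.arange(1, dims + 1)
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n))

def learn_projection(frames, dims=6):
    """Least-squares map from embedding coordinates to image space.

    frames: (n, n_pixels) flattened mouth images of one utterance.
    Returns W with shape (dims, n_pixels).
    """
    n = len(frames)
    E = np.stack([trig_embedding(i, n, dims) for i in range(n)])
    W, *_ = np.linalg.lstsq(E, frames, rcond=None)
    return W

def synthesize_frame(W, t, n):
    """Synthesize a mouth image at an arbitrary position t in [0, n-1]."""
    return trig_embedding(t, n, W.shape[0]) @ W
```

The key property is that `t` need not be an integer: requesting `t = 3.5` interpolates along the learned trajectory, which is what lets the model produce mouth images at arbitrary positions in the utterance.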

In summary, our research on image and video synthesis aims to understand, model, and process texture, and ultimately to simulate natural scenes and human visual processes using computer vision technologies. The topics being investigated are:

Related publications:

Guo Y, Zhao G, Chen J, Pietikäinen M & Xu Z (2009) Dynamic texture synthesis using a spatial temporal descriptor. Proc. IEEE International Conference on Image Processing (ICIP 2009), Cairo, Egypt, 2277-2280, ISBN 978-1-4244-5655-0.

Zhou Z, Zhao G & Pietikäinen M (2010) Synthesizing a talking mouth. Proc. Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2010), Chennai, India, accepted.

CMV/Research/ImageAndVideoSynthesis (last edited 2011-11-19 15:09:32 by WebMaster)