Recognition of Facial Expressions and Emotions


The goal of facial expression recognition is to determine the emotional state of a face regardless of its identity. The face can express emotion sooner than people verbalize or even realize their feelings, and research in social psychology has shown that facial expressions are a major modality in human communication. Facial expression is thus one of the most powerful, natural and immediate means for human beings to communicate their emotions and intentions. Even though much work has been done, recognizing spontaneous facial expressions with high accuracy in uncontrolled environments, e.g. under view variations, illumination changes and occlusions, remains difficult due to the complexity and variety of facial expressions.

Moreover, it is widely agreed that emotion expression is a multimodal process: emotions are conveyed mainly by facial expressions, but also by head and body movements, gestures, speech, and physiological signals such as heart rate and blood pressure. Our interest is to combine signals from different sensors and develop robust systems for emotion recognition.

Our work:

1) Dynamic facial expression recognition with spatiotemporal LBP features.

Considering the motion of the facial region, we propose region-concatenated descriptors based on LBP-TOP for facial expression recognition. An LBP description computed over the whole facial expression sequence encodes only the occurrences of the micropatterns, without any indication of their locations. To overcome this limitation, we introduce a representation in which the face image is divided into several nonoverlapping or overlapping blocks. The LBP-TOP histograms in each block are computed and concatenated into a single histogram, as Fig. 1 shows. All features extracted from the block volumes are concatenated to represent the appearance and motion of the facial expression sequence.

In this way, we effectively obtain a description of the facial expression on three different levels of locality. The labels (bins) in the histogram contain information from three orthogonal planes, describing appearance and temporal information at the pixel level. The labels are summed over a small block to produce information at a regional level, expressing the characteristics of the appearance and motion in specific locations, and all information from the regional level is concatenated to build a global description of the face and expression motion.

This approach does not require error-prone segmentation of the lips and other facial features, and it is robust against monotonic gray-scale changes caused, for example, by illumination and skin-color variations, and against errors in face alignment.
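The block-based descriptor described above can be sketched as follows. This is a minimal illustration with basic 8-neighbor LBP codes and synthetic data; the published method uses circular sampling with interpolation and different block/radius parameters, and all function names here are our own.

```python
# Minimal sketch of block-based LBP-TOP features (illustrative only;
# the published method uses circular neighborhoods with interpolation).
import numpy as np

def lbp_codes(plane):
    """Basic 8-neighbor LBP codes for one 2-D plane (borders excluded)."""
    c = plane[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        nb = plane[1 + dy:plane.shape[0] - 1 + dy, 1 + dx:plane.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.int32) << bit
    return codes

def lbp_top_histogram(volume):
    """Concatenate 256-bin LBP histograms from the XY, XT and YT planes
    of one block volume (T x H x W array)."""
    t, h, w = volume.shape
    planes = [volume[t // 2],        # XY: appearance of the middle frame
              volume[:, h // 2, :],  # XT: horizontal motion
              volume[:, :, w // 2]]  # YT: vertical motion
    hists = [np.bincount(lbp_codes(p).ravel(), minlength=256) for p in planes]
    return np.concatenate(hists)      # 3 x 256 = 768 bins per block

def block_features(video, blocks=(2, 2)):
    """Divide each frame into non-overlapping blocks and concatenate the
    LBP-TOP histograms of all block volumes into one global descriptor."""
    t, h, w = video.shape
    by, bx = blocks
    feats = []
    for i in range(by):
        for j in range(bx):
            vol = video[:, i * h // by:(i + 1) * h // by,
                        j * w // bx:(j + 1) * w // bx]
            feats.append(lbp_top_histogram(vol))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(9, 32, 32))   # synthetic face sequence
desc = block_features(video)
print(desc.shape)   # (3072,) = 2 x 2 blocks x 768 bins each
```

The concatenation order is fixed, so descriptors of different sequences are directly comparable bin by bin.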


Fig. 1: (a) Three planes in dynamic texture (b) LBP histograms from each plane (c) Concatenated histogram.


Facial expression with VIS camera (avi 2.85 MB)

2) Facial expression recognition under NIR imaging system.


Fig. 2: VIS (top row) and NIR images (bottom row). Columns from left to right: normal, weak and dark illumination.


Fig. 3: (a) 38 facial points tracked by STASM (b) Six facial components cropped from the face image using the 38 facial points


Facial expression with NIR camera (avi 1.8 MB)

3) Visual speech recognition (lip-reading)

It is well known that human speech perception is a multimodal process. Visual observation of the lips, teeth and tongue offers important information about the place of articulation. The McGurk effect demonstrates that inconsistency between audio and visual information can result in perceptual confusion. Visual information plays an important role especially in noisy environments or for listeners with hearing impairments. A human listener can use visual cues, such as lip and tongue movements, as shown in Fig. 4, to improve speech understanding. The process of using the visual modality is often referred to as lip-reading: making sense of what someone is saying by watching the movements of their lips.


Fig. 4. Sequence for saying “You are welcome”.

Our work:

We studied local spatiotemporal descriptors (Fig. 5) to represent and recognize spoken isolated phrases based solely on visual input. Spatiotemporal local binary patterns extracted from mouth regions are used for describing isolated phrase sequences. Fig. 6 shows the system diagram. The advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of the moving lips is needed.
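Once each phrase sequence is reduced to a concatenated histogram, classification amounts to comparing histograms. As a toy sketch, a 1-nearest-neighbour rule with the chi-square distance commonly paired with LBP histograms; a full system would typically use a stronger classifier such as an SVM, and the data below are purely illustrative.

```python
# Toy phrase classification by nearest-neighbour chi-square matching
# of LBP histogram descriptors (illustrative data and labels).
import numpy as np

def chi_square(p, q, eps=1e-10):
    """Chi-square distance between two histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def predict(train_feats, train_labels, test_feat):
    """Label of the training descriptor closest to the test descriptor."""
    dists = [chi_square(f, test_feat) for f in train_feats]
    return train_labels[int(np.argmin(dists))]

# Two "phrase classes" with clearly different histogram shapes.
train = [np.array([10., 0., 2.]), np.array([0., 10., 2.])]
labels = ["hello", "thank you"]
print(predict(train, labels, np.array([9., 1., 2.])))   # prints "hello"
```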


Fig. 5: Features in each block volume. (a) Block volumes (b) LBP features from three orthogonal planes (c) Concatenated features for one block volume with the appearance and motion


Fig. 6: System diagram.

We also proposed using graph embedding to capture video dynamics (Fig. 7). Instead of ignoring audio information, we use it to better align video sequences of the same utterance. A new distance metric on a pair of frames is then defined based on the alignment results. Graphs are constructed from the distance metric to characterize the temporal connections among video frames. For each utterance, the high-dimensional visual features are mapped into a low-dimensional subspace in which the video dynamics are well preserved by the curves of the mapped features along each dimension. Discriminative cues are decoded from the curves for classification.


Fig. 7. Overview of the proposed method. The training process is included in the dashed box. The training video sequences (a) of different utterances are first represented by graphs (b), which are then used to learn subspaces of the visual feature space through the graph embedding framework described in [Yan, TPAMI2007]. Given a test sequence (c), the extracted visual features are mapped into the subspaces. The spectra (e) of the mapped feature curves (d) along particular dimensions are calculated, and the magnitudes within certain frequency bands are used as cues to determine the associated utterance.
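The subspace-mapping step can be illustrated with a Laplacian-eigenmap-style embedding. This is only a toy sketch: the actual system builds graphs from audio-aligned frame distances and follows the graph embedding framework of [Yan, TPAMI2007], whereas the Gaussian weights, parameters and random features below are illustrative assumptions.

```python
# Toy Laplacian-eigenmap-style embedding of per-frame visual features
# (illustrative stand-in for the graph embedding framework in the paper).
import numpy as np

def embed(features, dim=2, sigma=2.0):
    """Map per-frame features (n_frames x d) into a dim-dimensional subspace
    using eigenvectors of a normalized graph Laplacian, so that frames
    strongly connected in the graph stay close in the embedding."""
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))                # toy frame-affinity graph
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                              # graph Laplacian
    d_isqrt = 1.0 / np.sqrt(deg)
    L_sym = d_isqrt[:, None] * L * d_isqrt[None, :]   # symmetric normalization
    evals, evecs = np.linalg.eigh(L_sym)
    order = np.argsort(evals)
    # Skip the trivial constant eigenvector; the next eigenvectors give the
    # low-dimensional curves that preserve the graph's structure.
    return (d_isqrt[:, None] * evecs)[:, order[1:dim + 1]]

rng = np.random.default_rng(1)
frames = rng.normal(size=(12, 4))    # toy per-frame visual features
curves = embed(frames)
print(curves.shape)                  # (12, 2): one curve per dimension
```

In the actual method, the curves along each dimension are then analyzed in the frequency domain to extract the discriminative cues used for classifying the utterance.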


Lipreading (avi 2 MB)

Related publications:


Zhao G & Pietikäinen M (2007) Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):915-928.

Taini M, Zhao G, Li SZ & Pietikäinen M (2008) Facial expression recognition from near-infrared video sequences. Proc. 19th International Conference on Pattern Recognition (ICPR 2008), Tampa, FL, 4 p.

Zhao G & Pietikäinen M (2009) Boosted multi-resolution spatiotemporal descriptors for facial expression recognition. Pattern Recognition Letters 30(12):1117-1127.

Taini M, Zhao G & Pietikäinen M (2009) Weight-based facial expression recognition from near-infrared video sequences. In: Image Analysis, SCIA 2009 Proceedings, Lecture Notes in Computer Science 5575, 239-248, ISBN 978-3-642-02229-6.

Huang X, Zhao G, Pietikäinen M & Zheng W (2010) Dynamic facial expression recognition using boosted component-based spatiotemporal features and multi-classifier fusion. Proc. Advanced Concepts for Intelligent Vision Systems (ACIVS 2010), Sydney, Australia, in press.

Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.

Zhou Z, Zhao G & Pietikäinen M (2010) Lipreading: A graph embedding approach. Proc. 20th International Conference on Pattern Recognition (ICPR 2010), Istanbul, Turkey, 523-526.


Oulu-CASIA NIR&VIS facial expression database: It contains videos of the six typical expressions (happiness, sadness, surprise, anger, fear, disgust) from 80 subjects, captured with two imaging systems, NIR (near infrared) and VIS (visible light), under three different illumination conditions: normal indoor illumination, weak illumination (only the computer display is on) and dark illumination (all lights are off). Here is the document describing the database. The database can be used, for example, to study the effects of illumination variations on facial expressions, cross-imaging-system facial expression recognition, or face recognition.

The database has now been released. If you are interested, please contact Dr. Guoying Zhao.

OuluVS database: It includes video and audio data from 20 subjects uttering ten phrases: Hello, Excuse me, I am sorry, Thank you, Good bye, See you, Nice to meet you, You are welcome, How are you, Have a good time. Each person spoke each phrase five times. There are also videos with head motion from front to left and from front to right, without utterance, five times for each person. Here is the document with the collection information. The details and the baseline results for visual speech recognition can be found in:

Zhao G, Barnard M & Pietikäinen M (2009). Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.

The database can be used, for example, to study visual speech recognition (lip-reading). To obtain the database, please contact Guoying Zhao.


CMV/Research/RecognitionOfFacialExpressionsAndEmotions (last edited 2011-11-19 15:09:40 by WebMaster)