components in the face image, and DCT features are calculated as the DCT energies for the horizontal, diagonal and vertical directions.
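Such directional DCT energies can be sketched as follows. The exact partition of the DCT coefficients into horizontal, diagonal, and vertical bands is not specified here, so the split below (by row/column frequency index of the AC coefficients) is an assumption for illustration only.

```python
import numpy as np
from scipy.fft import dct

def dct_energy_features(block):
    """Directional DCT energies for one face-image block.

    The assignment of AC coefficients to horizontal / diagonal /
    vertical bands is an illustrative assumption, not the paper's
    exact partition.
    """
    # 2-D type-II DCT, computed separably (rows, then columns).
    c = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    u, v = np.meshgrid(np.arange(c.shape[0]), np.arange(c.shape[1]),
                       indexing='ij')
    ac = (u + v) > 0                 # exclude the DC coefficient
    horiz = ac & (v > u)             # dominated by horizontal frequency
    vert = ac & (u > v)              # dominated by vertical frequency
    diag = ac & (u == v)             # equal horizontal/vertical frequency
    e = c ** 2                       # energy = squared coefficient
    return np.array([e[horiz].sum(), e[diag].sum(), e[vert].sum()])

feats = dct_energy_features(np.random.rand(8, 8))
```

A constant block has no AC energy, so all three features vanish; a textured block distributes its AC energy across the three bands.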
Body posture at each time instant is estimated from the trinocular images acquired by three CCD cameras observing the person from three directions. First, significant points such as the top of the head and the fingertips are located by analyzing the contour of the silhouette, which is extracted from the background by thresholding each image. Main joints such as the elbows and knees are located with a learning procedure, because these points are difficult to estimate by simple contour analysis. By evaluating the appropriateness of the three views, two views are selected for each significant point so that its 3D coordinates can be computed by the principle of triangulation.
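Once two views are selected for a significant point, its 3D position follows from standard two-view triangulation. A minimal sketch, assuming calibrated cameras given as 3x4 projection matrices and using the common linear (DLT) formulation rather than the authors' specific implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one significant point.

    P1, P2 : 3x4 projection matrices of the two selected views.
    x1, x2 : the point's pixel coordinates in each view.
    Each image measurement contributes two linear constraints on the
    homogeneous 3D point; the solution is the null vector of A.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                     # right singular vector of smallest value
    return X[:3] / X[3]            # homogeneous -> Euclidean coordinates

# Toy setup: two unit-focal cameras, the second shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([1.0, 2.0, 5.0])
x1 = (P1 @ np.append(X_true, 1.0))[:2] / X_true[2]
x2_h = P2 @ np.append(X_true, 1.0)
x2 = x2_h[:2] / x2_h[2]
X_est = triangulate(P1, P2, x1, x2)
```

With exact projections the estimate recovers the original 3D point; with noisy image measurements the SVD gives the least-squares solution.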
To reproduce facial expressions in the 3D model, the DCT features obtained by the method described in Section 2.2 are used to deform the face model; the deformation follows the authors' reproduction method based on artistic anatomy (plastic anatomy). To reproduce body postures, the 3D coordinates of the significant points are assigned to the 3D model, and the remaining vertices of the model are displaced by an interpolation method.
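The interpolation scheme for the remaining vertices is not named here; a simple stand-in is inverse-distance weighting (IDW), where each free vertex is displaced by a blend of the displacements given at the significant points, weighted by proximity:

```python
import numpy as np

def interpolate_displacements(vertices, ctrl_pts, ctrl_disp, power=2.0):
    """Displace model vertices from displacements at significant points.

    IDW is an illustrative substitute for the paper's unnamed
    interpolation method: nearby significant points dominate the
    blend, and a vertex coinciding with a significant point takes
    that point's displacement exactly.
    """
    moved = vertices.astype(float).copy()
    for i, v in enumerate(vertices):
        d = np.linalg.norm(ctrl_pts - v, axis=1)
        nearest = np.argmin(d)
        if d[nearest] < 1e-9:              # vertex is a significant point
            moved[i] += ctrl_disp[nearest]
            continue
        w = d ** -power                    # closer control points dominate
        moved[i] += (w[:, None] * ctrl_disp).sum(axis=0) / w.sum()
    return moved

ctrl = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
disp = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
verts = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
moved = interpolate_displacements(verts, ctrl, disp)
```

When all significant points share one displacement, every vertex inherits it, so rigid motions are reproduced exactly.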
In general, humans display different facial expressions sequentially. Computers therefore need to spot each facial expression in a video sequence and recognize the spotted facial expression. The authors have developed an HMM (Hidden Markov Model)-based method.
First, the motion around the right eye and the mouth is estimated with a gradient-based optical flow algorithm. A 2-D Fourier transform is then applied, and the lower-frequency coefficients are extracted as a 15-dimensional feature vector. The temporal sequence of feature vectors is matched against the HMMs representing the facial expressions to be recognized, so that each facial expression is both spotted and recognized. The method basically uses a left-to-right HMM, and each state is assigned to a condition of the facial muscles: neutral, contracting, apex, or relaxing. The rightmost (final) state has transitions back to the leftmost (initial) state of not only its own category but also the other categories. By thresholding the forward probability of the apex state, the duration corresponding to a facial expression can be spotted in the video sequence; at the same time, the expression category of the spotted duration can be recognized. As shown in Fig. 2, the duration corresponding to each facial expression in the sequence can be spotted and recognized accurately.
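The spotting step can be sketched with the standard forward recursion over a four-state left-to-right HMM. All parameters below (transition matrix, synthetic frame likelihoods, threshold 0.5) are illustrative choices, not the authors' trained model:

```python
import numpy as np

# States of the toy HMM: neutral, contracting, apex, relaxing.
APEX = 2

def spot_expression(obs_like, trans, init, thresh=0.5):
    """Report frames whose normalised forward probability of the apex
    state exceeds the threshold.

    obs_like[t, s] is the likelihood of frame t's feature vector under
    state s (in practice, a density over the 15-D Fourier features).
    """
    alpha = init * obs_like[0]
    alpha /= alpha.sum()
    spotted = [0] if alpha[APEX] > thresh else []
    for t in range(1, obs_like.shape[0]):
        alpha = (alpha @ trans) * obs_like[t]
        alpha /= alpha.sum()           # keep the forward vector normalised
        if alpha[APEX] > thresh:
            spotted.append(t)
    return spotted

# Left-to-right transitions; the final state loops back to the start.
trans = np.array([[0.6, 0.4, 0.0, 0.0],
                  [0.0, 0.6, 0.4, 0.0],
                  [0.0, 0.0, 0.6, 0.4],
                  [0.4, 0.0, 0.0, 0.6]])
init = np.array([1.0, 0.0, 0.0, 0.0])

# Synthetic likelihoods: frames 0-2 look neutral, 3-4 contracting,
# 5-7 apex, 8-9 relaxing.
obs_like = np.full((10, 4), 0.1)
for t, s in zip(range(10), [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]):
    obs_like[t, s] = 0.9

spotted = spot_expression(obs_like, trans, init)   # -> [5, 6, 7]
```

The thresholded apex probability rises only while the observations support the apex state, so the reported frames delimit the expression's duration within the sequence.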