Multi-label Co-regularization for Semi-supervised Facial Action Unit Recognition

Xuesong Niu1,3, Hu Han1,2, Shiguang Shan1,2,3,4, Xilin Chen1,3
1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 Peng Cheng Laboratory, Shenzhen, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Facial action unit (AU) recognition is essential for emotion analysis and has been widely applied in mental state analysis. Existing work on AU recognition usually requires large face datasets with accurate AU labels. However, manual AU annotation requires expertise and can be time-consuming. In this work, inspired by co-training methods, we propose a semi-supervised approach for AU recognition that utilizes a large number of web face images without AU labels together with a small face dataset with AU labels. Unlike traditional co-training methods that require provided multi-view features and model re-training, we propose a novel co-training method, namely multi-label co-regularization, for semi-supervised facial AU recognition. Two deep neural networks are used to generate multi-view features for both labeled and unlabeled face images, and a multi-view loss is designed to enforce the features generated from the two views to be conditionally independent representations. In order to obtain consistent predictions from the two views, we further design a multi-label co-regularization loss that minimizes the distance between the predicted AU probability distributions of the two views. In addition, prior knowledge of the relationships between individual AUs is embedded through a graph convolutional network (GCN) to exploit useful information from the large unlabeled dataset. Experiments on several benchmarks show that the proposed approach can effectively leverage large datasets of unlabeled face images to improve the robustness of AU recognition, and that it outperforms state-of-the-art semi-supervised AU recognition methods. Code is available.

1 Introduction

Facial action units, coded by the Facial Action Coding System (FACS) [5], refer to a set of facial muscle movements defined by their appearance on the face. These facial movements can be used to code nearly any anatomically possible facial expression, and have wide potential applications in mental state analysis, e.g., deception detection [10], diagnosing mental health [28], and improving e-learning experiences [22].

Most existing AU recognition methods are supervised [3, 16, 21, 29, 40], and thus require a large number of face images with AU labels. However, since AUs are subtle, local, and show significant subject-dependent variations, qualified FACS experts are required to annotate facial AUs. In addition, labeling AUs is time-consuming and labor-intensive, making it impractical to manually annotate a large set of face images. At the same time, massive numbers of face images are available, e.g., from the Internet, video surveillance, and social media. There are still very limited studies on how to use such massive unlabeled face images to assist AU recognition given only a relatively small labeled face dataset.
As illustrated in Fig. 1(a), there are different views of a face that can be used for the classification of AUs. The diversity among models trained on different views exists for both labeled and unlabeled face images, and can further be used to enhance the generalization ability of each model. This idea is inspired by traditional co-training methods [2], which have been proven effective for multi-view semi-supervised learning. However, traditional co-training methods usually require multiple views from different sources [2] or representations [20], which can be difficult to obtain in practice. In addition, traditional co-training methods usually need pseudo annotations for the unlabeled data and re-training of the classifiers, which is not suitable for end-to-end training. Besides, traditional co-training methods seldom consider multi-label classification, which has its own particular characteristics, such as correlations between different classifiers. For these reasons, co-training has seldom been studied for semi-supervised AU recognition.

In recent years, deep neural networks (DNNs) have proven effective for representation learning in various computer vision tasks, including AU recognition [3, 16, 21, 29]. The strong representation learning ability of DNNs makes it possible to generate multi-view representations that can be used for co-training based semi-supervised learning. In addition, there exist strong correlations among different AUs. For example, as shown in Fig. 1(b), AU6 (cheek raiser) and AU12 (lip corner puller) are usually activated simultaneously in a common facial expression called the Duchenne smile. Several methods utilize this prior knowledge to improve AU recognition accuracy [1, 3, 6]. However, these methods are all supervised, and their generalization abilities are limited by the sizes of existing labeled AU databases. At the same time, such AU correlations exist in face images regardless of whether they are labeled, and could make the AU classifiers more robust in a semi-supervised setting because more face images are considered.

In this paper, we propose a semi-supervised co-training approach named multi-label co-regularization for AU recognition, aiming to improve AU recognition with massive unlabeled face images and domain knowledge of AUs. Unlike traditional co-training methods that require provided multi-view features and model re-training, we propose a co-regularization method to perform semi-supervised co-training. For each face image, with or without AU annotation, we first generate features of two different views via two DNNs. A multi-view loss $\mathcal{L}_{mv}$ is used to enforce the two feature generators to produce conditionally independent facial representations, which are used as the multi-view features of the input image. A multi-label co-regularization loss $\mathcal{L}_{cr}$ is also utilized to constrain the prediction consistency of the two views for both labeled and unlabeled images. In addition, in order to leverage the AU relationships in both labeled and unlabeled images within our co-regularization framework, we use a graph convolutional network (GCN) to embed the domain knowledge.
The contributions of this paper are as follows: i) we propose a novel multi-label co-regularization method for semi-supervised AU recognition that leverages massive unlabeled face images and a relatively small set of labeled face images; ii) we utilize domain knowledge of AU relationships, embedded via a graph convolutional network, to further mine useful information from the unlabeled images; and iii) we achieve performance superior to both the results obtained without using massive unlabeled face images and the state-of-the-art semi-supervised AU recognition approaches.

2 Related Work

In this section, we review existing methods that are related to our work, including semi-supervised and weakly supervised AU recognition, AU recognition with relationship modeling, and co-training.

Semi-supervised and Weakly Supervised AU Recognition. Previous works on semi-supervised and weakly supervised AU recognition mainly focused on utilizing face images with incomplete labels, noisy labels, or related emotion labels to improve AU recognition accuracy. Wu et al. [33] proposed to use a Restricted Boltzmann Machine to model the AU distribution, which is further used to train AU classifiers with partially labeled data. Peng et al. [26] proposed an adversarial network to improve AU recognition accuracy with emotion-labeled images. In [39], Zhao et al. proposed a weakly supervised clustering method for pruning noisy labels and trained the AU classifiers with re-annotated data. Although these methods do not need massive, completely AU-labeled images, they still need other annotations, such as noisy AU labels or emotion labels, in addition to labeled face images. Recently, several methods have tried to utilize face images with only emotion labels to recognize AUs. Peng et al. [25] utilized prior knowledge of AUs and emotions to generate pseudo AU labels for training from face images with only emotion labels. Zhang et al. [37] proposed a knowledge-driven strategy for jointly training multiple AU classifiers without any AU annotation by leveraging prior probabilities on AUs. Although these methods do not need any AU labels, they still need the related emotion labels. Besides, most of these methods are evaluated on lab-collected data, and their generalization abilities to web face images are limited.

Figure 1: (a) An illustration of the idea of co-training. For an input image, representations generated by different models can highlight different cues for AU recognition (e.g., "the lips of the subject are apart" and "the texture of the mouth is like a smile" both indicate that AU12, lip corner puller, is activated). Sharing such multi-view representations for unlabeled images can improve the generalization ability of each model. (b) The correlation maps of different AUs, calculated based on Eq. 8 in Section 3.4 for the EmotioNet and BP4D databases, suggest that there exist strong correlations between different AUs.

AU Recognition Based on Relationship Modeling. There exist strong probabilistic dependencies between different AUs that can be treated as domain knowledge and further used for AU recognition. Almaev et al. [1] learned a classifier for one single AU and transferred it to other AUs using the latent relations between different AUs. Eleftheriadis et al. [6] proposed a latent space embedding method for AU classification that considers AU label dependencies. In [3], Corneanu et al. proposed a structure inference network to model AU relationships based on a fully-connected graph.
However, all these methods need fully annotated face images, and the generalization abilities of these models are limited by the limited sizes of existing AU databases.

Co-training for Semi-supervised Learning. Semi-supervised learning is a widely studied problem, and many milestone works have been proposed, e.g., the mean-teacher method [31] and the transductive SVM [14]. Among these works, co-training [2] is designed for multi-view semi-supervised learning and has been proven to have good theoretical results. Traditional co-training methods are mainly based on provided multi-view data, which is usually not available in practice. Meanwhile, they usually need to obtain pseudo labels for re-training, making them impractical for end-to-end training. Recently, Qiao et al. [27] utilized adversarial examples to learn multi-view features for multi-class image classification. However, for the problem of facial AU recognition, it is hard to obtain adversarial examples for multiple classifiers. In [34], Xing et al. proposed a multi-label co-training (MLCT) method considering the co-occurrence of pairwise labels. However, they still required provided multi-view features for training, which may be hard to obtain.

3 Proposed Method

In this section, we first introduce traditional co-training. Then, we detail the proposed multi-label co-regularization approach for AU recognition with AU relationship learning.

3.1 Traditional Co-training

The traditional co-training approach [2] is an award-winning method for semi-supervised learning. It assumes that each sample in the training set has two different views $v_1$ and $v_2$, and that each view can provide sufficient information to learn an effective model. The two different views in the co-training assumption come from different sources or data representations. Two models, i.e., $M_1$ and $M_2$, are trained based on $v_1$ and $v_2$, respectively. Then the predictions of each model for the unlabeled data are used to augment the training set of the other model. This procedure is conducted for several iterations until $M_1$ and $M_2$ become stable.

Figure 2: An overview of the proposed multi-label co-regularization method for semi-supervised AU recognition. The losses defined for labeled face images, i.e., $\mathcal{L}_{v_1}$ and $\mathcal{L}_{v_2}$, are illustrated with blue dashed lines. The losses defined for both labeled and unlabeled images, i.e., $\mathcal{L}_{mv}$ and $\mathcal{L}_{cr}$, are illustrated with red solid lines.

This simple but effective approach can significantly improve the models' performance when it is used to exploit useful information from massive unlabeled data, and it has been proven to have PAC-style guarantees on semi-supervised learning under the assumption that the two views are conditionally independent [2]. Two characteristics guarantee the success of co-training: i) the two-view features are conditionally independent; ii) the models trained on different views tend to make similar predictions because of the re-training mechanism. Our multi-label co-regularization method is designed based on these two characteristics.

3.2 Multi-view Feature Generation

Deep neural networks have been proven effective for feature generation [11, 13, 30]. We utilize two deep neural networks to generate the two-view features [4]. Here, we choose two ResNet-34 networks [13] as the feature generators.
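As a concrete illustration, a minimal PyTorch-style sketch of this two-view setup is given below: two independently initialized ResNet-34 backbones serve as feature generators, each followed by a bank of per-AU sigmoid classifiers matching Eq. (1) below. The class and variable names (and the default of 12 AUs) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class TwoViewAUModel(nn.Module):
    """Two independently initialized ResNet-34 feature generators,
    each followed by C per-AU sigmoid classifiers (cf. Eq. (1))."""

    def __init__(self, num_aus: int = 12):
        super().__init__()
        # Two backbones; different random initializations yield different "views".
        self.backbones = nn.ModuleList([self._make_backbone() for _ in range(2)])
        # One linear unit per AU and per view; a single Linear with C outputs
        # packs the C classifiers (w_ij, b_ij) of view i into one layer.
        self.classifiers = nn.ModuleList(
            [nn.Linear(512, num_aus) for _ in range(2)]
        )

    @staticmethod
    def _make_backbone() -> nn.Module:
        net = resnet34(weights=None)
        net.fc = nn.Identity()  # expose the 512-d pooled feature as f_i
        return net

    def forward(self, x: torch.Tensor):
        # Returns per-view AU probabilities p_ij, each of shape (B, C).
        probs = []
        for backbone, clf in zip(self.backbones, self.classifiers):
            f_i = backbone(x)                       # view-specific feature f_i
            probs.append(torch.sigmoid(clf(f_i)))   # p_ij = sigma(w_ij^T f_i + b_ij)
        return probs  # [p_1, p_2]
```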
Given a face image dataset $\mathcal{D} = \mathcal{L} \cup \mathcal{U}$, where $\mathcal{L}$ denotes the face images with AU labels and $\mathcal{U}$ denotes the unlabeled face images, the two-view features $f_1$ and $f_2$ can be generated for each image in $\mathcal{D}$ using the two different generators. Then $C$ classifiers can be learned to predict the probabilities of the $C$ AUs using the feature of each view. Let $p_{ij}$ denote the probability predicted for the $j$-th AU using the $i$-th view; the final probabilities can be formulated as

$$p_{ij} = \sigma(w_{ij}^T f_i + b_{ij}), \tag{1}$$

where $\sigma$ denotes the sigmoid function, and $w_{ij}$ and $b_{ij}$ are the respective classifier parameters.

We first calculate the losses for all the labeled images in $\mathcal{L}$. A binary cross-entropy loss is utilized for both views. In order to better handle the data imbalance [32, 23] of AUs, a selective learning strategy [12] is adopted. The AU recognition loss for the $i$-th view, $\mathcal{L}_{v_i}$, is formulated as

$$\mathcal{L}_{v_i} = -\frac{1}{C}\sum_{j=1}^{C} a_c \left[ p_j \log p_{ij} + (1 - p_j)\log(1 - p_{ij}) \right], \tag{2}$$

where $p_j$ is the ground-truth probability of occurrence for the $j$-th AU, with 1 denoting occurrence of an AU and 0 denoting no occurrence, and $a_c$ is a balancing parameter calculated for each batch using the selective learning strategy [12].

One key characteristic of co-training is that the input multi-view features are supposed to be conditionally independent. Although different networks may achieve similar performance in a complementary way when they are initialized differently, they can gradually come to resemble each other when they are supervised by the same target. In order to encourage the two feature generators to produce conditionally independent features instead of collapsing into each other, we propose a multi-view loss that orthogonalizes the weights of the AU classifiers of the different views. The multi-view loss $\mathcal{L}_{mv}$ is defined as

$$\mathcal{L}_{mv} = \frac{1}{C}\sum_{j=1}^{C} \frac{W_{1j}^T W_{2j}}{\|W_{1j}\| \, \|W_{2j}\|}, \tag{3}$$

where $W_{ij} = [w_{ij}; b_{ij}]$ denotes the parameters of the $j$-th AU's classifier of the $i$-th view. With this multi-view loss, the features generated for the different views are expected to be different from yet complementary to each other.

3.3 Multi-label Co-regularization

Besides the conditional independence assumption, another key characteristic of co-training is forcing the classifiers of the different views to make consistent predictions. Instead of using the labeling and re-training mechanism of traditional co-training methods, we propose a co-regularization loss that encourages the classifiers of the different views to generate similar predictions. For the face images in $\mathcal{D}$, we first obtain the predicted probabilities $p_{ij}$ for the $j$-th AU from the classifier of the $i$-th view. Then, we minimize the distance between the two predicted probability distributions w.r.t. all AU classes of the two views. The Jensen-Shannon divergence [7] is utilized to measure the distance between the two distributions, and the co-regularization loss is defined as

$$\mathcal{L}_{cr} = \frac{1}{C}\sum_{j=1}^{C} \left( H\!\left(\frac{p_{1j} + p_{2j}}{2}\right) - \frac{H(p_{1j}) + H(p_{2j})}{2} \right), \tag{4}$$

where $H(p) = -(p \log p + (1 - p)\log(1 - p))$ is the entropy w.r.t. $p$. The final loss function of our multi-label co-regularization can be formulated as

$$\mathcal{L} = \frac{1}{2}\sum_{i=1}^{2} \mathcal{L}_{v_i} + \lambda_{mv}\mathcal{L}_{mv} + \lambda_{cr}\mathcal{L}_{cr}, \tag{5}$$

where $\lambda_{mv}$ and $\lambda_{cr}$ are hyper-parameters that balance the influences of the different losses.

3.4 AU Relationship Learning

Given the nature of facial anatomy, there exist strong relationships among different AUs. In order to make full use of such correlations as prior knowledge present in massive unlabeled images, we further embed this prior knowledge into our multi-label co-regularization AU recognition model via graph convolutional networks.
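To make Eqs. (2)-(5) concrete, the following minimal PyTorch sketch implements the three losses, assuming a model such as the two-view sketch above that outputs per-view AU probabilities. The function names, the numerical epsilon, and the default loss weights are illustrative placeholders, and the balancing weights $a_c$ are taken as given; this is not the released implementation.

```python
import torch
import torch.nn.functional as F

EPS = 1e-7  # numerical floor to keep the logarithms finite

def binary_entropy(p: torch.Tensor) -> torch.Tensor:
    # H(p) = -(p log p + (1 - p) log(1 - p))
    p = p.clamp(EPS, 1.0 - EPS)
    return -(p * torch.log(p) + (1 - p) * torch.log(1 - p))

def au_loss(p_pred: torch.Tensor, y: torch.Tensor, a_c: torch.Tensor) -> torch.Tensor:
    # Eq. (2): class-balanced binary cross-entropy over the C AUs.
    # p_pred, y: (B, C) probabilities / {0,1} labels; a_c: (C,) balance weights.
    p_pred = p_pred.clamp(EPS, 1.0 - EPS)
    bce = -(y * torch.log(p_pred) + (1 - y) * torch.log(1 - p_pred))
    return (a_c * bce).mean()  # mean over batch and AUs

def multi_view_loss(W1: torch.Tensor, W2: torch.Tensor) -> torch.Tensor:
    # Eq. (3): mean cosine similarity between the per-AU classifier parameters
    # of the two views; minimizing it pushes the two views apart.
    # W1, W2: (C, d+1), one row [w_ij ; b_ij] per AU j.
    return F.cosine_similarity(W1, W2, dim=1).mean()

def co_regularization_loss(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    # Eq. (4): Jensen-Shannon divergence between the two views' predicted
    # AU probabilities, averaged over AUs (and the batch).
    m = 0.5 * (p1 + p2)
    return (binary_entropy(m)
            - 0.5 * (binary_entropy(p1) + binary_entropy(p2))).mean()

def total_loss(p1, p2, y, a_c, W1, W2, lambda_mv=0.5, lambda_cr=1.0):
    # Eq. (5); lambda_mv and lambda_cr are placeholder values, not the
    # paper's tuned hyper-parameters. For simplicity this sketch assumes a
    # fully labeled batch; in the semi-supervised setting the supervised
    # term uses only labeled images, while L_mv and L_cr use all images.
    l_sup = 0.5 * (au_loss(p1, y, a_c) + au_loss(p2, y, a_c))
    return (l_sup
            + lambda_mv * multi_view_loss(W1, W2)
            + lambda_cr * co_regularization_loss(p1, p2))
```

In a training step, $\mathcal{L}_{v_1}$ and $\mathcal{L}_{v_2}$ would be computed on the labeled portion of the batch only, while $\mathcal{L}_{mv}$ and $\mathcal{L}_{cr}$ are computed on every image, which is what lets the unlabeled web images contribute gradient signal.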
