1 Introduction
Face recognition is a primary technique in computer vision to model and understand the real world. Many methods and enormous datasets
[vggface2, msceleb1m, kemelmacher2016megaface, vggface, imdbface, casia_webface] have been introduced, and recently, methods that use deep learning
[ArcFace, uniformface, AFRN, SphereFace, CosFace, centerloss, regularface] have greatly improved face recognition accuracy, but it still falls short of expectations. To reduce the shortfall, most recent research in face recognition has focused on improving the loss function. The streams from CenterLoss
[centerloss], CosFace [CosFace], ArcFace [ArcFace] and RegularFace [regularface] all try to minimize the intra-class variation and maximize the inter-class variation. These methods are effective and have gradually improved accuracy by elaborating the objective of learning. Despite the development of loss functions, general-purpose networks, rather than networks devised for face recognition, can have difficulty enabling effective training to recognize a huge number of person identities. Unlike common problems such as classification, in the evaluation stage, a face-recognition model encounters new identities that are not included in the training set. Thus, the model has to embed nearly 100K identities [msceleb1m] in the training set and also consider a huge number of unknown identities. However, most existing methods just attach several fully-connected layers after widely-used backbone networks such as VGG [vggface] and ResNet [resnet] without any design for the characteristics of face recognition.
Grouping is a key idea to efficiently and flexibly embed a significant number of people and briefly describe an unknown person. Each person has his or her own facial characteristics. At the same time, people share common characteristics within a group. In the real world, a group-based description (a man with deep, black eyes and a red beard) that involves characteristics common to the group can be useful to narrow down the set of candidates, even though it cannot identify the exact person. Unfortunately, explicit grouping requires manual categorization of huge data and may be limited by the finite range of descriptions within human knowledge. However, by adopting the concept of grouping, the recognition network can reduce the search space and flexibly embed a significant number of identities into an embedding feature.
We propose a novel face-recognition architecture called GroupFace that learns multiple latent groups and constructs group-aware representations to effectively adopt the concept of grouping (Figure 1). We define Latent Groups, which are internally determined as latent variables by comprehensively considering facial factors (e.g., hair, pose, beard) and non-facial factors (e.g., noise, background, illumination). To learn the latent groups, we introduce a self-distributed grouping method that determines group labels by considering the overall distribution of latent groups. The proposed GroupFace structurally ensembles multiple group-aware representations into the original instance-based representation for face recognition.
We summarize the contributions as follows:


GroupFace is a novel face-recognition-specialized architecture that integrates the group-aware representations into the embedding feature and provides well-distributed group labels to improve the quality of feature representation. GroupFace also suggests a new similarity metric that additionally considers the group information.

We prove the effectiveness of GroupFace through extensive experiments and ablation studies on the behaviors of GroupFace.

GroupFace can be applied to many existing face-recognition methods to obtain a significant improvement with a marginal increase in resources. In particular, a hard-ensemble version of GroupFace can achieve high recognition accuracy by adaptively using only a few additional convolutions.
2 Related Works
Face Recognition
has been studied for decades. Many researchers proposed machine learning techniques with feature engineering
[10.1007/9783540246701_36, 6619233, joint_bayesian, fisherface, 5459250, CSML, eigenface, WHT:ECCVW08:DBMW, QiYin:2011:AMF:2191740.2192084]. Recently, deep learning methods have overcome the limitations of traditional face-recognition approaches with public face-recognition datasets [vggface2, msceleb1m, kemelmacher2016megaface, vggface, imdbface, casia_webface]. DeepFace [deepface] used 3D face frontalization to achieve a breakthrough in face recognition methods that use deep learning. FaceNet [FaceNet] proposed the triplet loss to maximize the distance between an anchor and its negative sample, and to minimize the distance between the same anchor and its positive sample. CenterLoss [centerloss] proposed the center loss to minimize the distance between samples and their class centers. MarginalLoss [marginal_loss] adopted the concept of margin to minimize intra-class variations and to keep inter-class distances with a margin. RangeLoss [range_loss] used long-tailed data during the training stage. RingLoss [ring_loss] constrained a feature's magnitude to be a certain number. NormFace [NormFace] proposed normalizing the features and fully-connected layer weights; verification accuracy increased after normalization. SphereFace [SphereFace] proposed the angular softmax (A-Softmax) loss with a multiplicative angular margin. Based on A-Softmax, CosFace [CosFace] proposed an additive cosine margin and ArcFace [ArcFace] applied an additive angular margin. The authors of RegularFace [regularface] and UniformFace [uniformface] argued that approaches that use an angular margin [ArcFace, SphereFace, CosFace] concentrated on intra-class compactness only, then suggested new losses to increase the inter-class variation. These previous methods, in general, focused on improving loss functions to improve face recognition accuracy with conventional feature representation. A slight change, such as adding a few layers or increasing the number of channels, commonly did not bring a noticeable improvement.
However, GroupFace improves the quality of feature representation and achieves a significant improvement by adding a few more layers in parallel.

Grouping
or clustering methods such as k-means internally categorize samples by considering relative metrics such as cosine similarity or Euclidean distance without explicit class labels. In general, these clustering methods attempt to construct well-distinguished categories by preventing the assignment of most images to one or a few clusters. Recently, several methods that use deep learning [caron2018deep, noroozi2016unsupervised, yang2016joint] have been introduced. These methods are effective; however, they use full batches as in previous methods, not mini-batches as in deep learning. Thus, these methods cannot readily be incorporated deeply and end-to-end in an application framework. To efficiently learn the latent groups, our method introduces a self-distributed grouping method that considers an expectation-normalized probability in a deep manner.
3 Proposed Method
Our GroupFace learns the latent groups by using a self-distributed grouping method, constructs multiple group-aware representations, and ensembles them into the standard instance-based representation to enrich the feature representation for face recognition.
3.1 GroupFace
We discuss how the scheme of latent groups is effectively integrated into the embedding feature in GroupFace.
Instancebased Representation.
We will call a feature vector in the conventional face recognition scheme
[ArcFace, CosFace, centerloss, regularface] an Instance-based Representation in this paper (Figure 2). The instance-based representation is commonly trained as an embedding feature by using a softmax-based loss (e.g., CosFace [CosFace] and ArcFace [ArcFace]) and is used to predict an identity as:

(1) \hat{y} = \arg\max_{y} f(\mathbf{v}_x)_y

where y is an identity label, \mathbf{v}_x is the instance-based representation of a given sample x, and f is a function that projects an embedding feature of 512 dimensions into an N-dimensional space, where N is the number of person identities.
Group-aware Representation. GroupFace uses a novel Group-aware Representation as well as the instance-based representation to enrich the embedding features. Each group-aware representation vector \mathbf{v}^{G_k}_x is extracted by deploying fully-connected layers for each corresponding group (Figure 2). The embedding feature (\bar{\mathbf{v}}_x, Final Representation in Figure 2) of GroupFace is obtained by aggregating the instance-based representation \mathbf{v}_x and the weighted-summed group-aware representation \mathbf{g}_x. GroupFace predicts an identity by using the enriched final representation as:

(2) \hat{y} = \arg\max_{y} f(\bar{\mathbf{v}}_x)_y, \quad \bar{\mathbf{v}}_x = \mathbf{v}_x + \mathbf{g}_x

where \mathbf{g}_x is an ensemble of the multiple group-aware representations with the group probabilities.
Structure. GroupFace calculates and uses the instance-based representation and the group-aware representations concurrently. The instance-based representation is obtained by the same procedures that are used in conventional face recognition methods [ArcFace, CosFace, centerloss, regularface], and the
group-aware representations are obtained similarly by deploying a fully-connected layer for each group. Then, the group probabilities are calculated from the instance-based representation vector by deploying a Group Decision Network (GDN) that consists of three fully-connected layers and a softmax layer. Using the group probabilities, the multiple group-aware representations are sub-ensembled in a soft manner (S-GroupFace) or a hard manner (H-GroupFace).

S-GroupFace aggregates the multiple group-aware representations with the corresponding group probabilities as weights, and is defined as:

(3) \mathbf{g}_x = \sum_{k=0}^{K-1} p(G_k \mid x)\, \mathbf{v}^{G_k}_x

where K is the number of groups. H-GroupFace selects the group-aware representation for which the corresponding group probability has the highest value, and is defined as:

(4) \mathbf{g}_x = \mathbf{v}^{G_{k^*}}_x, \quad k^* = \arg\max_{k}\, p(G_k \mid x)
S-GroupFace provides a significant improvement in recognition accuracy with a marginal requirement for additional resources, and H-GroupFace is more suitable for practical applications than S-GroupFace, at the cost of only a few additional convolutions. The final representation is enriched by aggregating both the instance-based representation and the sub-ensembled group-aware representation.
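As an illustration (not the authors' released implementation), the soft and hard sub-ensembling schemes can be sketched in NumPy as follows; the tensor shapes and the `group_ensemble` helper are assumptions made for exposition:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def group_ensemble(instance_repr, group_reprs, gdn_logits, mode="soft"):
    """Combine K group-aware representations into one final representation.

    instance_repr: (B, D) instance-based representations v_x
    group_reprs:   (B, K, D) group-aware representations v_x^{G_k}
    gdn_logits:    (B, K) raw outputs of the group decision network
    """
    probs = softmax(gdn_logits)                      # p(G_k | x)
    if mode == "soft":                               # S-GroupFace: weighted sum
        g = (probs[:, :, None] * group_reprs).sum(axis=1)
    else:                                            # H-GroupFace: strongest group only
        k_star = probs.argmax(axis=1)
        g = group_reprs[np.arange(len(k_star)), k_star]
    return instance_repr + g                         # aggregate into final representation
```

In a trained network the soft and hard variants coincide whenever the GDN is confident, which is why the hard variant can trade a small accuracy gap for a fixed, small compute cost.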
Group-aware Similarity. We introduce a group-aware similarity, a new similarity that considers both the standard embedding feature and the intermediate feature of GDN in the inference stage. The group-aware similarity is penalized by the distance between the intermediate features of two given instances, because the intermediate feature is not trained on the cosine space and describes only the group identity of a given sample, not its explicit identity. The group-aware similarity between image x_i and image x_j is defined as:

(5) S(x_i, x_j) = \cos\big(\bar{\mathbf{v}}_{x_i}, \bar{\mathbf{v}}_{x_j}\big) - \alpha\, d\big(\mathbf{h}_{x_i}, \mathbf{h}_{x_j}\big)

where \cos(\cdot,\cdot) is a cosine similarity metric, d(\cdot,\cdot) is a distance metric, \mathbf{h}_x denotes the intermediate feature of GDN, and \alpha is a constant parameter. The parameters are determined empirically.
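The penalized similarity can be sketched as below; the normalized Euclidean distance for d and the penalty weight `alpha` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_aware_similarity(v1, v2, h1, h2, alpha=0.1):
    """Cosine similarity of final representations, penalized by the distance
    between the GDN intermediate features (alpha is an assumed weight)."""
    # Normalize the distance so the penalty is scale-invariant.
    penalty = np.linalg.norm(h1 - h2) / (np.linalg.norm(h1) + np.linalg.norm(h2) + 1e-8)
    return cosine(v1, v2) - alpha * penalty
```

Two faces with identical embeddings but dissimilar GDN features are thus scored slightly lower than the plain cosine similarity would suggest.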
3.2 Selfdistributed Grouping
In this work, we define a group as a set of samples that share any common visual or non-visual features that are used for face recognition. Such a group is determined by the deployed GDN. Our GDN is gradually trained in a self-grouping manner that provides a group label by considering the distribution of latent groups, without any explicit ground-truth information.
Naïve Labeling. A naïve way to determine a group label is to take the index that has the maximum activation among the softmax outputs. We build a GDN to determine the belonging group of a given sample by deploying an MLP and attaching a softmax function:

(6) p(G_k \mid x) = \mathrm{softmax}\big(\mathrm{MLP}(\mathbf{v}_x)\big)_k

(7) k^{*} = \arg\max_{k}\, p(G_k \mid x)

where G_k is the k-th group. The lack of consideration for the group distribution can cause this naïve solution to assign most samples to one or a few groups.
Self-distributed Labeling.
We introduce an efficient labeling method that utilizes a modified probability, regulated by a prior probability, to generate uniformly-distributed group labels in a deep manner. We define an expectation-normalized probability \tilde{p}(G_k \mid x) to balance the number of samples among groups:

(8) \tilde{p}(G_k \mid x) = \max\Big(0,\; p(G_k \mid x) - \mathbb{E}_x\big[p(G_k \mid x)\big] + \frac{1}{K}\Big)

where the max operation bounds the normalized probability to the valid range. Then, the expectation of the expectation-normalized probability is computed as:

(9) \mathbb{E}_x\big[\tilde{p}(G_k \mid x)\big] \approx \mathbb{E}_x\Big[p(G_k \mid x) - \mathbb{E}_x\big[p(G_k \mid x)\big]\Big] + \frac{1}{K} = \frac{1}{K}

The optimal self-distributed label is obtained as:

(10) G^{*}(x) = \arg\max_{k}\, \tilde{p}(G_k \mid x)
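The labeling rule above can be sketched as follows; here the batch mean stands in for the expectation over the population (the training procedure averages over recent batches), and all names are illustrative:

```python
import numpy as np

def self_distributed_labels(probs):
    """Balanced group labels via the expectation-normalized probability.

    probs: (B, K) softmax group probabilities p(G_k | x) for a batch.
    Subtracting the per-group expectation and adding the uniform prior 1/K
    pushes the expected label distribution toward uniform.
    """
    K = probs.shape[1]
    expectation = probs.mean(axis=0, keepdims=True)        # estimate of E_x[p(G_k | x)]
    p_tilde = np.maximum(0.0, probs - expectation + 1.0 / K)
    return p_tilde.argmax(axis=1)
```

On a batch whose naïve argmax labels collapse into a single group, the normalized labels spread the samples across groups while still respecting each sample's relative preferences.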
The trained GDN estimates a set of group probabilities that represent the degree to which a sample belongs to the latent groups. As the number of samples approaches infinity, the proposed method stably outputs uniformly-distributed labels (Figure 3).

3.3 Learning
The network of GroupFace is trained simultaneously by the standard classification loss, which is a softmax-based loss that distinguishes identities, and the self-grouping loss, which is a softmax loss that trains the latent groups.
Loss Function. A softmax-based loss (ArcFace [ArcFace] is mainly used in this work) is used to train a feature representation for identities and is defined as:

(11) \mathcal{L}_1 = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{s\,\cos(\theta_{y_i} + m)}}{e^{s\,\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s\,\cos\theta_j}}

where M is the number of samples in a mini-batch, \theta_j is the angle between a feature and the corresponding weight, s is a scale factor, and m is a marginal factor. To construct the optimal group space, a self-grouping loss, which reduces the difference between the prediction and the self-generated label, is defined as:

(12) \mathcal{L}_2 = -\frac{1}{M} \sum_{i=1}^{M} \log p\big(G^{*}(x_i) \mid x_i\big)
Training. The whole network is trained by using the aggregation of the two losses:

(13) \mathcal{L} = \mathcal{L}_1 + \lambda\, \mathcal{L}_2

where the parameter \lambda balances the weights of the two losses and is set empirically. Thus, GDN can learn the groups, which are attributes beneficial to face recognition.
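A hedged sketch of the two losses and their aggregation, using the ArcFace formulation for the identity loss; the hyper-parameter values (s=64, m=0.5, and the balancing weight `lam`) are typical illustrative choices, not values confirmed by this paper:

```python
import numpy as np

def arcface_loss(cos_theta, labels, s=64.0, m=0.5):
    """Additive angular margin softmax loss.

    cos_theta: (B, N) cosines between features and the N class weights.
    The margin m is added to the angle of the ground-truth class only.
    """
    theta = np.arccos(np.clip(cos_theta, -1 + 1e-7, 1 - 1e-7))
    logits = s * np.cos(theta)
    rows = np.arange(len(labels))
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)  # penalize the target class
    logits -= logits.max(axis=1, keepdims=True)                 # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

def self_grouping_loss(group_log_probs, group_labels):
    """Softmax cross-entropy against the self-generated group labels."""
    rows = np.arange(len(group_labels))
    return -group_log_probs[rows, group_labels].mean()

def total_loss(l_id, l_group, lam=0.1):
    """Aggregate the identity loss and the self-grouping loss (lam is assumed)."""
    return l_id + lam * l_group
```

The margin m strictly lowers the target-class logit, so the identity loss with a margin is always at least as large as the plain softmax loss on the same batch.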
4 Experiments
We describe implementation details and extensively perform experiments and ablation studies to show the effectiveness of GroupFace.
4.1 Implementation Details
Datasets. For training, we use MS-Celeb-1M [msceleb1m], which contains about 10M images of 100K identities. Due to the noisy labels of the original MS-Celeb-1M dataset, we use the refined version [ArcFace], which contains 3.8M images of 85K identities. For testing, we conduct our experiments on nine commonly used datasets as follows:

LFW [LFW], which contains 13,233 images of 5,749 identities and provides 6,000 pairs from them. CALFW [CALFW] and CPLFW [CPLFWTech] are datasets reorganized from LFW to include higher pose and age variations.

YTF [YTF], which consists of 3,425 videos of 1,595 identities.

MegaFace [kemelmacher2016megaface], which is composed of more than 1 million images of 690K identities for challenge 1 (MF1).

CFP-FP [CFP_FP], which contains 500 subjects, each with 10 frontal and 4 profile images.

AgeDB-30 [age_db], which contains 12,240 images of 440 identities.

IJB-B [IJB_B], which contains 67,000 face images, 7,000 face videos and 10,000 non-face images.

IJB-C [IJB_C], which contains 138,000 face images, 11,000 face videos and 10,000 non-face images.
Metrics. We compare the verification accuracy for identity pairs on LFW [LFW], YTF [YTF], CALFW [CALFW], CPLFW [CPLFWTech], CFP-FP [CFP_FP], AgeDB-30 [age_db] and the MegaFace [kemelmacher2016megaface] verification task. The MegaFace identification task is evaluated by rank-1 identification accuracy with 1 million distractors. We compare the True Accept Rate at a certain False Accept Rate (TAR@FAR) from 1e-4 to 1e-6 on IJB-B [IJB_B] and IJB-C [IJB_C].
Experimental Setting. We construct a normalized face image [ArcFace, SphereFace, CosFace] by warping a face region using five facial points: the two eyes, the nose and the two corners of the mouth. We employ ResNet-100 [resnet] as the backbone network, similar to recent works [ArcFace, AFRN]. We vectorize the activation and reduce the number of activations to 4096 (the shared feature in Figure 2) by a block of BN-FC. Our GroupFace is attached after res5c in ResNet-100, where the activation dimension is 512×7×7. The MLP in GDN consists of two blocks of BN-FC and an FC layer for group classification. We follow [ArcFace, CosFace] to set the hyper-parameters of the loss function.
Learning. We train the model with synchronized GPUs and a mini-batch of images on each GPU. To stabilize the group probability, the network of GroupFace is trained from a pre-trained network that is trained with only the softmax-based loss [ArcFace, CosFace]. The learning rate is decayed over three stages (the first 50k, the next 20k, and the final 10k iterations) with weight decay and momentum, using stochastic gradient descent (SGD). We compute the expectation of group probabilities by computing the group probabilities of the samples on all GPUs and averaging the expectations over the recent batches to accurately estimate the expectation of the group probabilities on the population; a range of history lengths empirically shows similar performance.

4.2 Ablation Studies
To show the effectiveness of the proposed method, we perform ablation studies on its behaviors. For all experiments, we use the same network structure with the hyper-parameters mentioned earlier. To clearly show the effect of each ablation study, the TAR@FAR of the models is compared on the IJB-B dataset [IJB_B]; all models in the ablation studies show comparable accuracy on LFW.
Number of Groups. We compare the recognition performance according to the number of groups (Table 1(a)). As the number of groups grows, the performance increases steadily. In particular, even a few initial groups bring a large benefit, and deploying more groups obtains a further significant improvement in performance.
Learning for GDN. We compare the learning methods for GDN (Table 1(b)): (1) without the grouping loss (adopting the group-aware network structure only), (2) naïve labeling, and (3) self-distributed labeling. Just applying our novel network structure greatly improves the recognition performance. The performance is further increased by adopting the proposed self-distributed labeling method.
Hard vs. Soft. S-GroupFace shows a large improvement in performance because it uses all group-aware representations comprehensively with a reasonable amount of additional resources (Table 1(c)). Since H-GroupFace uses only the single strongest group-aware representation even when many groups are deployed, the burden of increasing the number of groups is fixed to a slight amount of additional resources. Thus, H-GroupFace can be applied immediately for high performance gains in practical applications.
Aggregation vs. Concatenation. We compare two ways to combine the instance-based representation and the group-aware representations into one embedding feature (Table 1(d)): (1) aggregation and (2) concatenation. Concatenation-based GroupFace shows a better TAR@FAR=1e-6 by 0.67 percentage points than aggregation-based GroupFace; however, aggregation-based GroupFace shows a much better TAR@FAR=1e-5 by 1.16 percentage points. We chose aggregation-based GroupFace, which generally performs better with fewer feature dimensions.
Group-aware Similarity. The recognition performance is once again improved significantly by evaluating the group-aware similarity (Table 1(e)). Even though the group-aware similarity increases the feature dimension for calculating a similarity, the required feature is easy to extract because it is an intermediate output of the recognition network. In particular, this experiment shows that the group-based information is distinct enough from the conventional identity-based information to improve performance in practical usage. We show more detailed experiments in Table 5.
Lightweight Model. GroupFace is also effective for a lightweight model such as ResNet-34 [resnet], which requires only 8.9 GFLOPS, much less than ResNet-100 [resnet], which requires 24.2 GFLOPS. ResNet-34-based GroupFace shows performance similar to that of ResNet-100-based ArcFace [ArcFace] and greatly outperforms it on the most difficult criterion (FAR=1e-6). In addition, the group-aware similarity significantly exceeds the basic performance of the ResNet-34 model (Table 1(f)).
4.3 Evaluations
Method  #Image  LFW  YTF 

DeepID [contrastive_loss]  0.2M  99.47  93.2 
DeepFace [deepface]  4.4M  97.35  91.4 
VGGFace [vggface]  2.6M  98.95  97.3 
FaceNet [FaceNet]  200M  99.64  95.1 
CenterLoss [centerloss]  0.7M  99.28  94.9 
RangeLoss [range_loss]  5M  99.52  93.7 
MarginalLoss [marginal_loss]  3.8M  99.48  95.9 
SphereFace [SphereFace]  0.5M  99.42  95.0 
RegularFace [regularface]  3.1M  99.61  96.7 
CosFace [CosFace]  5M  99.81  97.6 
UniformFace [uniformface]  6.1M  99.80  97.7 
AFRN [AFRN]  3.1M  99.85  97.1 
ArcFace [ArcFace]  5.8M  99.83  97.7 
GroupFace  5.8M  99.85  97.8 
Method  CALFW  CPLFW  CFP-FP  AgeDB-30 

CenterLoss [centerloss]  85.48  77.48     
SphereFace [SphereFace]  90.30  81.40     
VGGFace2 [vggface2]  90.57  84.00     
CosFace [CosFace]  95.76  92.28  98.12  98.11 
ArcFace [ArcFace]  95.45  92.08  98.27  98.28 
GroupFace  96.20  93.17  98.63  98.28 
Method  Protocol  Ident  Verif 

RegularFace [regularface]  Large  75.61  91.13 
UniformFace [uniformface]  Large  79.98  95.36 
CosFace [CosFace]  Large  80.56  96.56 
ArcFace [ArcFace]  Large  81.03  96.98 
GroupFace  Large  81.31  97.35 
SphereFace [SphereFace]  Large / R  97.91  97.91 
AdaptiveFace [adaptiveface]  Large / R  95.02  95.61 
CosFace [CosFace]  Large / R  97.91  97.91 
ArcFace [ArcFace]  Large / R  98.35  98.49 
GroupFace  Large / R  98.74  98.79 
Method  TAR on IJB-B (FAR=1e-6 / 1e-5 / 1e-4)  TAR on IJB-C (FAR=1e-6 / 1e-5 / 1e-4)  
VGGFace2 [vggface2]    0.671  0.800    0.747  0.840 
CenterFace [centerloss]          0.781  0.853 
ComparatorNet [ComparatorNet]      0.849      0.885 
PRN [PRN]    0.721  0.845       
AFRN [AFRN]    0.771  0.885    0.884  0.931 
CosFace [CosFace]  0.3649  0.8811  0.9480  0.8801  0.9370  0.9615 
ArcFace [ArcFace]  0.3828  0.8933  0.9425  0.8625  0.9315  0.9565 
GroupFace  0.4166  0.8983  0.9453  0.8858  0.9399  0.9606 
GroupFace  0.4678  0.9115  0.9445  0.9053  0.9437  0.9602 
GroupFace  0.5212  0.9124  0.9493  0.8928  0.9453  0.9626 
LFW, YTF, CALFW, CPLFW, CFP-FP and AgeDB-30. We compare the verification accuracy on LFW [LFW] and YTF [YTF] under the unrestricted-with-labelled-outside-data protocol (Table 2). On YTF, we evaluate all the images without excluding noisy images from the image sequences. Even though both datasets are highly saturated, our GroupFace surpasses the other recent methods. We also report the verification accuracy on the variants of LFW (CALFW [CALFW], CPLFW [CPLFWTech]), CFP-FP [CFP_FP] and AgeDB-30 [age_db] (Table 3). Our GroupFace shows better accuracy on all of the above datasets.
MegaFace. We evaluate our GroupFace on MegaFace [kemelmacher2016megaface] under the large-training-set protocol, in which models are trained using a training set containing more than 0.5M images (Table 4). GroupFace is the top-ranked face recognition model among the recently published state-of-the-art methods. On the refined MegaFace [ArcFace], our GroupFace also outperforms the other models.
IJB-B and IJB-C. We compare the proposed method with other methods on the IJB-B [IJB_B] and IJB-C [IJB_C] datasets (Table 5). Recent angular-margin-softmax-based methods [ArcFace, CosFace] show great performance on these datasets. We report the improvement of GroupFace in verification accuracy based on both CosFace [CosFace] and ArcFace [ArcFace], without any test-time augmentations such as horizontal flipping. Our GroupFace shows significant improvements over ArcFace [ArcFace] on all FAR criteria: by 8.5 percentage points at FAR=1e-6, 1.8 percentage points at FAR=1e-5 and 0.2 percentage points at FAR=1e-4 on IJB-B, and by 4.3 percentage points at FAR=1e-6, 1.2 percentage points at FAR=1e-5 and 0.4 percentage points at FAR=1e-4 on IJB-C. The recognition performance is once again improved significantly by applying the group-aware similarity (Eq. 5), especially on the most difficult criterion (TAR@FAR=1e-6) on IJB-B, by 5.3 percentage points.
4.4 Visualization
To show the effectiveness of the proposed method, we visualize the feature representation, the average activation of groups and the visual interpretation of groups.
2D Projection of Representation. Figure 4 shows a qualitative comparison among (a) the final representation of the baseline network (ArcFace [ArcFace]), (b) the instance-based representation of GroupFace and (c) the final representation of GroupFace on a 2D space. We select the first eight identities in the refined MS-Celeb-1M dataset [msceleb1m] and map the extracted features onto the angular space by using t-SNE [t_SNE]. The comparison shows that the proposed model generates more distinctive feature representations than the baseline model and that it also enhances the instance-based representation.
Activation Distribution of Groups.
The proposed self-grouping tries to make the samples spread evenly throughout all the groups, and at the same time, the softmax-based loss also propagates gradients into GDN so that the identification works best. Thus, the probability distribution is not exactly uniform (Figure 5). Some probabilities of the groups are low and others are high (e.g., groups 1, 2, 5, 6, 14, 15, 17, 18, 28, 29, 30 and 31). The overall distribution is not as uniform as we expected, but we see that there is no dominant one among the highly activated groups.

Interpretation of Groups. The trained latent groups are not always visually distinguishable because they are categorized by the non-linear function of GDN using a latent feature, not a facial attribute (e.g., hair, glasses, and mustache). However, there are two groups (Groups 5 and 20 in Figure 6) whose visual properties we can clearly see: 95 of 100 randomly-selected images in Group 5 are men, and 94 of 100 randomly-selected images in Group 20 are bald men. The other groups cannot be described by a single visual property; however, they seem to be describable by multiple visual properties, such as smiling women, right-profile people and scared people in Group 1.
5 Conclusion
We introduce a new face-recognition-specialized architecture that consists of a group-aware network structure and a self-distributed grouping method to effectively manipulate multiple latent group-aware representations. Through extensive ablation studies and experiments, we prove the effectiveness of our GroupFace. The visualization also shows that GroupFace fundamentally enhances the feature representations compared with existing methods and that the latent groups have some meaningful visual descriptions. Our GroupFace provides a significant improvement in recognition performance and is practically applicable to existing recognition systems. The rationale behind the effectiveness of GroupFace is summarized in two main ways: (1) It is well known that additional supervision from a different objective can improve the given task by sharing a network for feature extraction, e.g., a segmentation head can improve accuracy in object detection [bell2016inside, he2017mask]. Likewise, learning the groups can be a helpful cue to train a more generalized feature extractor for face recognition. (2) GroupFace proposes a novel structure that fuses the instance-based representation and the group-aware representations, whose effectiveness is empirically proved.

Acknowledgement
We thank AI team of Kakao Enterprise, especially Wonjae Kim and Yoonho Lee for their helpful feedback.