
Contrastive Loss (similarity metric / face verification / 2005.06)

by 콜라찡 2023. 3. 13.

Contrastive Loss

 

The original PDF is not publicly available

Paper

  • Learning a similarity metric discriminatively, with application to face verification
  • Sumit Chopra et al. (2005.06 / NYU)
  • No public PDF / GitHub

Points

  • Proposes a method for training a similarity metric from data for recognition and verification problems.
  • The term "contrastive loss" originated here: the loss is computed over pairs of similar and dissimilar points.
  • Maps input patterns into a target space and uses the L1 norm there to measure semantic distance.

Abstract

  • Used the L1 norm to derive semantic distance; notable, since L2 is the usual choice.
  • The L1 norm gave better feature extraction, so the method reached 98.5% accuracy on the face verification task with relatively little data.

The authors learn a function that maps input patterns into a target space such that the L1 norm in that space approximates "semantic" distance in the input space. They chose L1 over the more common L2 because it approximates semantic distance better, which improves face verification performance when there are many categories and only small training samples per category.
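As a minimal sketch of the difference between the two norms (the embedding vectors below are made-up numbers, not values from the paper):

import numpy as np

# Two hypothetical embeddings in the learned target space (made-up numbers).
a = np.array([0.2, 0.9, 0.1, 0.4, 0.7])
b = np.array([0.3, 0.8, 0.2, 0.6, 0.5])

l1 = np.sum(np.abs(a - b))           # L1 norm, used by the paper: 0.70
l2 = np.sqrt(np.sum((a - b) ** 2))   # L2 norm, the more common default: ~0.33

print(f"L1 distance: {l1:.2f}, L2 distance: {l2:.2f}")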

Dataset

1. AT&T Database of Faces: This dataset consists of 400 images from 40 subjects, with each subject having ten frontal face images taken under varying lighting conditions.

2. AR and FERET Databases: The second round of training and testing combined the AR database created at Purdue University (which contains over 3,500 facial images) with a subset of the grayscale FERET database containing more than one thousand faces.

Model

  • The paper is not public, so details were hard to verify.
  • Presumed to be (CNN - Dense - Batch norm - max pooling) × 5 + (Dense w/ L1 regularization) × 2, though it seems doubtful that BN would follow a Dense layer (batch normalization was only introduced in 2015, a decade after this paper).
  • CNN filter count: 16 → 512 // kernel size: 7×7 → 2×2

The paper does not provide a model.summary()-style listing. However, it describes an architecture consisting of a convolutional neural network (CNN) followed by a fully connected layer and an L1 normalization layer that produces feature vectors for the face verification task. The CNN consists of 5 convolutional layers with max-pooling between each pair, followed by two fully connected layers before the final embeddings are produced with L1 normalization. The number of filters increases from 16 to 512 while the kernel size decreases from 7×7 to 2×2. Additionally, batch normalization is applied after every activation except the last, along with dropout regularization during training.
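Since the paper's exact layer listing isn't available, here is only a rough Keras sketch of the architecture as guessed above. The input shape, the intermediate filter/kernel schedule, and the placement of batch normalization are all assumptions for illustration, not details confirmed by the paper (and note the BN caveat above):

from tensorflow.keras import layers, models

def build_embedding_net(input_shape=(56, 46, 1), embed_dim=50):
    # Hypothetical embedding tower: 5 x (Conv -> BN -> MaxPool), then 2 Dense.
    # Filters grow 16 -> 512 while kernels shrink 7x7 -> 2x2, per the guess above.
    x_in = layers.Input(shape=input_shape)
    x = x_in
    for filters, kernel in zip([16, 32, 64, 256, 512], [7, 5, 3, 3, 2]):
        x = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=2, padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(embed_dim)(x)  # final feature vector (embedding)
    return models.Model(x_in, x, name="embedding_net")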

Activation Function

  • ReLU & Leaky ReLU
  • Normalization: y = γ * ((x - μ)/σ) + β

The paper uses the Rectified Linear Unit (ReLU) activation function, defined as f(x) = max(0, x), where x is the input to a neuron and f(·) its output. Some experiments also use Leaky ReLU with α = 0.1, defined as f(x) = x if x > 0, and α·x otherwise; here α is a small constant that determines how much of the negative values is allowed through. Finally, batch normalization is used during training: y = γ * ((x - μ)/σ) + β, where y is the normalized activation, μ and σ are the mean and standard deviation computed over each mini-batch, and γ and β are parameters learned per channel.
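A minimal NumPy sketch of the three formulas above (γ and β are fixed constants here for brevity; in a real network they are learned per channel):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # f(x) = max(0, x)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)   # f(x) = x if x > 0, else alpha * x

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # y = gamma * (x - mu) / sigma + beta, statistics taken over the mini-batch
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))        # [0.   0.   0.   1.5  3. ]
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5   3.  ]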

Loss Function (Contrastive Loss)

  • L = contrastive loss value
  • Y = 0 if the pair of face images is from the same class, 1 otherwise
  • D = embedded distance between the two feature vectors
  • m = margin: the minimum distance by which dissimilar pairs must be separated
  • max(0, x) is just ReLU; for a dissimilar pair (Y = 1), only the Y * max(0, m - D)^2 term survives, so minimizing the loss pushes the distance out toward the margin.

L = (1-Y) * D^2 + Y * max(0, m - D)^2

where L is the value of the contrastive loss, Y denotes whether a pair belongs to the same class (Y = 0 if same class, 1 otherwise), D is the distance between the two feature vectors in the embedded space, and m is the margin controlling how far apart dissimilar pairs should be. The code implementation depends on the deep learning framework; most frameworks such as Keras let you define this kind of custom loss for use during training. Here's an example snippet in Python/Keras:

(Reference implementation)
import keras.backend as K

def contrastive_loss(y_true, y_pred):
    # y_true follows the convention above: 0 = same class, 1 = different.
    # y_pred is the embedded distance D between the two feature vectors.
    margin = 1.0  # set your own desired margin here
    square_pred = K.square(y_pred)                                 # D^2
    squared_margin_diff = K.square(K.maximum(margin - y_pred, 0))  # max(0, m - D)^2
    return (1 - y_true) * square_pred + y_true * squared_margin_diff
    
------------------------------------------------------------------------
(A simpler way to use it)
from tensorflow.keras import backend as K
from tensorflow.keras.losses import Loss

class ContrastiveLoss(Loss):
    def __init__(self, margin=1.0, **kwargs):
        super(ContrastiveLoss, self).__init__(**kwargs)
        self.margin = margin

    def call(self, y_true, y_pred):
        # y_true: 0 = same class, 1 = different (matching the equation above)
        # y_pred: embedded distance D between the two feature vectors
        y_true = K.cast(y_true, y_pred.dtype)
        square_pred = K.square(y_pred)
        squared_margin_diff = K.square(K.maximum(self.margin - y_pred, 0))
        return K.mean((1 - y_true) * square_pred + y_true * squared_margin_diff)
        
model.compile(optimizer='adam', loss=ContrastiveLoss())

This defines a custom `ContrastiveLoss` that takes the true labels (`y_true`) and the predicted distances (`y_pred`) from the model's output layer and returns the loss computed from the equation above.
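Note that `ContrastiveLoss` expects `y_pred` to be the embedded distance D, so the model itself must output that distance. A minimal sketch of Siamese wiring that produces it (using the hypothetical `build_embedding_net` from the Model section; the input shape is again an assumption):

import tensorflow as tf
from tensorflow.keras import layers, models

embed = build_embedding_net()            # shared weights: one tower, used twice
in_a = layers.Input(shape=(56, 46, 1))
in_b = layers.Input(shape=(56, 46, 1))

# L1 distance between the two embeddings becomes the model output (y_pred = D).
dist = layers.Lambda(
    lambda t: tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True)
)([embed(in_a), embed(in_b)])

siamese = models.Model([in_a, in_b], dist)
siamese.compile(optimizer='adam', loss=ContrastiveLoss(margin=1.0))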

Results

  • Tested on the Purdue/AR face data, accuracy was 98.5%.
  • There are limitations, though; covered below.

The paper reports the accuracy of the proposed method on a face verification task using the Purdue/AR Face Database. The authors achieved an overall accuracy of 98.5%, which is quite high considering the large degree of variability in pose, lighting, expression, and occlusion present in this dataset. It's important to note that performance can vary with factors such as the choice of hyperparameters (e.g., the margin value) and the size and composition of the training data. Still, these results suggest the approach has potential for real-world applications where robustness to variation is critical.

Why is this paper famous for its use of contrastive loss?

  • It could reliably pick out a small amount of same-face data from among many images of different faces.
  • The term "contrastive" comes from learning features through similar and dissimilar examples.

It used the contrastive loss which drives the similarity metric to be small for pairs of faces from the same person and large for pairs from different persons. The name "contrastive" comes from this idea that it contrasts between similar and dissimilar samples in order to learn a better feature representation. The authors chose this particular type of loss because they were working with face verification tasks where there are only two classes (same or different) instead of multiple categories as seen in traditional classification problems. Additionally, their method was designed specifically to handle cases where training data has very few examples per class making other methods such as softmax-based cross-entropy less effective. Overall, using contrastive loss allowed them to train an embedding model that could produce compact yet discriminative representations suitable for recognition/verification applications even when dealing with limited amounts of labeled data.

Limitation

  • The method is evaluated only on the face verification task, so its generality needs checking; performance may differ on other recognition or classification tasks.
  • Although the AR dataset used for testing has a high degree of variability, it still consists mostly of controlled images captured under laboratory conditions, so generalization to real-world scenarios with uncontrolled variations such as occlusions or disguises remains unclear.
  • Even a subset of the data still has to be labeled, which is costly. Also, while Siamese networks are effective for learning similarity metrics from small datasets, training can be computationally expensive due to pairwise comparisons between samples, which limits scalability to large-scale datasets with millions of examples (see the sketch below).
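A small sketch of why pairwise training scales poorly: n images yield n(n-1)/2 candidate pairs. The label convention follows the loss above; `make_pairs` is illustrative, not from the paper:

from itertools import combinations

def make_pairs(images, labels):
    # Y = 0 for same-class pairs, 1 for different-class pairs (as above).
    pairs, targets = [], []
    for i, j in combinations(range(len(images)), 2):
        pairs.append((images[i], images[j]))
        targets.append(0 if labels[i] == labels[j] else 1)
    return pairs, targets

# Even the small AT&T set (400 images) already yields 400*399/2 = 79,800 pairs.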

 

 

 
