This computer vision project detects your facial expression and converts it to an emoji!
train.csv contains two columns, "emotion" and "pixels". The "emotion" column contains a numeric code from 0 to 6, inclusive, for the emotion present in the image. The "pixels" column contains a quoted string for each image. The contents of this string are space-separated pixel values in row-major order. test.csv contains only the "pixels" column, and your task is to predict the "emotion" column.
The training set consists of 28,709 examples. The public test set used for the leaderboard consists of 3,589 examples. The final test set, which was used to determine the winner of the competition, consists of another 3,589 examples.
- The dataset is from a Kaggle contest.
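The "pixels" format described above can be parsed with a few lines of standard-library Python. This is a minimal sketch, assuming 48 x 48 images (as the model input shapes suggest); the helper names `parse_pixels` and `read_training_rows` are hypothetical, and the sample row is synthetic rather than real data.

```python
import csv
import io

IMG_SIZE = 48  # assumed: 48 x 48 = 2304 pixel values per image


def parse_pixels(pixel_string, size=IMG_SIZE):
    """Turn a space-separated pixel string into a list of `size` rows."""
    values = [int(v) for v in pixel_string.split()]
    if len(values) != size * size:
        raise ValueError(f"expected {size * size} values, got {len(values)}")
    # Row-major order: the first `size` values form row 0, the next row 1, ...
    return [values[r * size:(r + 1) * size] for r in range(size)]


def read_training_rows(csv_file):
    """Yield (emotion, image) pairs from a file-like object shaped like train.csv."""
    for row in csv.DictReader(csv_file):
        yield int(row["emotion"]), parse_pixels(row["pixels"])


# Tiny synthetic stand-in for train.csv (not real data): one uniform gray image.
sample = io.StringIO('emotion,pixels\n3,"' + " ".join(["128"] * 2304) + '"\n')
rows = list(read_training_rows(sample))
```

For the real file, `open("train.csv", newline="")` can be passed in place of the in-memory sample. In practice the nested lists would usually be converted to an array before training.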
Model: "sequential"
Layer (type) Output Shape Param #
conv2d (Conv2D) (None, 48, 48, 64) 640
batch_normalization (BatchNo (None, 48, 48, 64) 256
activation (Activation) (None, 48, 48, 64) 0
dropout (Dropout) (None, 48, 48, 64) 0
conv2d_1 (Conv2D) (None, 48, 48, 128) 73856
batch_normalization_1 (Batch (None, 48, 48, 128) 512
activation_1 (Activation) (None, 48, 48, 128) 0
max_pooling2d (MaxPooling2D) (None, 24, 24, 128) 0
dropout_1 (Dropout) (None, 24, 24, 128) 0
conv2d_2 (Conv2D) (None, 24, 24, 512) 590336
batch_normalization_2 (Batch (None, 24, 24, 512) 2048
activation_2 (Activation) (None, 24, 24, 512) 0
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 512) 0
dropout_2 (Dropout) (None, 12, 12, 512) 0
conv2d_3 (Conv2D) (None, 12, 12, 512) 2359808
batch_normalization_3 (Batch (None, 12, 12, 512) 2048
activation_3 (Activation) (None, 12, 12, 512) 0
max_pooling2d_2 (MaxPooling2 (None, 6, 6, 512) 0
dropout_3 (Dropout) (None, 6, 6, 512) 0
flatten (Flatten) (None, 18432) 0
dense (Dense) (None, 256) 4718848
batch_normalization_4 (Batch (None, 256) 1024
activation_4 (Activation) (None, 256) 0
dropout_4 (Dropout) (None, 256) 0
dense_1 (Dense) (None, 512) 131584
batch_normalization_5 (Batch (None, 512) 2048
activation_5 (Activation) (None, 512) 0
dropout_5 (Dropout) (None, 512) 0
dense_2 (Dense) (None, 7) 3591
Total params: 7,886,599
Trainable params: 7,882,631
Non-trainable params: 3,968
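The parameter counts in this summary can be reproduced by hand, which is a useful sanity check when changing the architecture. The sketch below assumes 3x3 kernels and a single-channel (grayscale) input; both are inferred from the counts in the summary, not stated in the text, and the function names are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_features, units):
    """One weight per input per unit, plus one bias per unit."""
    return (in_features + 1) * units


def batch_norm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


# Assumed: 3x3 kernels, 1 input channel (inferred from the summary's counts).
conv = [
    conv2d_params(3, 3, 1, 64),
    conv2d_params(3, 3, 64, 128),
    conv2d_params(3, 3, 128, 512),
    conv2d_params(3, 3, 512, 512),
]
dense = [
    dense_params(6 * 6 * 512, 256),  # flatten output is 18432
    dense_params(256, 512),
    dense_params(512, 7),
]
bn_channels = [64, 128, 512, 512, 256, 512]
bn = [batch_norm_params(c) for c in bn_channels]

total = sum(conv) + sum(dense) + sum(bn)
non_trainable = sum(2 * c for c in bn_channels)  # moving statistics only
trainable = total - non_trainable
```

Under these assumptions, the three totals agree with the last three lines of the summary.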
Layer (type) Output Shape Param #
lambda_1 (Lambda) (None, 48, 48, 3) 0
resnet50 (Functional) (None, 2, 2, 2048) 23587712
flatten_1 (Flatten) (None, 8192) 0
batch_normalization_4 (Batch (None, 8192) 32768
dense_4 (Dense) (None, 256) 2097408
dropout_3 (Dropout) (None, 256) 0
batch_normalization_5 (Batch (None, 256) 1024
dense_5 (Dense) (None, 128) 32896
dropout_4 (Dropout) (None, 128) 0
batch_normalization_6 (Batch (None, 128) 512
dense_6 (Dense) (None, 64) 8256
dropout_5 (Dropout) (None, 64) 0
batch_normalization_7 (Batch (None, 64) 256
dense_7 (Dense) (None, 7) 455
Total params: 25,761,287
Trainable params: 17,132,295
Non-trainable params: 8,628,992
Here, 'resnet50' is the ResNet50 architecture used as the base model, and the 'lambda_1' layer in front of it passes a (48, 48, 3) tensor to it. We feed in the images after preprocessing and resizing them to (48, 48). The output is a fully connected (dense) layer of 7 nodes.
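To illustrate how the 7 output nodes become an emoji, here is a minimal sketch that picks the highest-scoring class. The label order follows the usual convention for this Kaggle dataset and is an assumption, since this README only states that the codes run from 0 to 6. The emoji choices and the function name `to_emoji` are purely illustrative.

```python
# Assumed label order (common convention for this dataset, not stated in this README).
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
# Illustrative emoji choices, not necessarily the ones this project uses.
EMOJIS = ["😠", "🤢", "😨", "😄", "😢", "😲", "😐"]


def to_emoji(scores):
    """Return (label, emoji) for the highest of the 7 output scores."""
    if len(scores) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=scores.__getitem__)
    return EMOTIONS[best], EMOJIS[best]


# Made-up scores standing in for one row of model output.
label, emoji = to_emoji([0.02, 0.01, 0.05, 0.80, 0.04, 0.06, 0.02])
```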
We obtained the facial landmarks for the images using dlib with OpenCV. Facial landmarks are used to localize and represent salient regions of the face, such as:
- Eyes
- Eyebrows
- Nose
- Mouth
- Jawline
The positions of these points change noticeably with emotion, so we trained our model on the relative positions of the jaw, eyes and other salient features.
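One possible way to make landmark positions "relative" is to center them on their centroid and divide by a face-size scale, so the features do not depend on where the face sits in the frame or how large it is. This is only a sketch of that idea; the README does not say how the relative positions were actually computed, and `relative_landmarks` is a hypothetical helper.

```python
import math


def relative_landmarks(points):
    """Center (x, y) landmarks on their centroid and scale them to unit size."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    # Scale by the largest distance from the centroid (fall back to 1.0
    # if every point coincides, to avoid dividing by zero).
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


# Hypothetical 4-point "faces": same shape, different position and size.
small = [(10, 10), (20, 10), (20, 20), (10, 20)]
large = [(100, 50), (140, 50), (140, 90), (100, 90)]

small_rel = relative_landmarks(small)
large_rel = relative_landmarks(large)
```

Both inputs normalize to the same coordinates, which is the property a model trained on relative positions relies on. With real data, `points` would be the 68 (x, y) pairs for one face.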
Model: "sequential"
Layer (type) Output Shape Param #
conv2d (Conv2D) (None, 68, 2, 128) 256
batch_normalization (BatchNo (None, 68, 2, 128) 512
max_pooling2d (MaxPooling2D) (None, 34, 1, 128) 0
conv2d_1 (Conv2D) (None, 34, 1, 128) 16512
batch_normalization_1 (Batch (None, 34, 1, 128) 512
conv2d_2 (Conv2D) (None, 34, 1, 256) 33024
batch_normalization_2 (Batch (None, 34, 1, 256) 1024
max_pooling2d_1 (MaxPooling2 (None, 17, 1, 256) 0
conv2d_3 (Conv2D) (None, 17, 1, 256) 65792
batch_normalization_3 (Batch (None, 17, 1, 256) 1024
conv2d_4 (Conv2D) (None, 17, 1, 256) 65792
batch_normalization_4 (Batch (None, 17, 1, 256) 1024
max_pooling2d_2 (MaxPooling2 (None, 9, 1, 256) 0
conv2d_5 (Conv2D) (None, 9, 1, 256) 65792
batch_normalization_5 (Batch (None, 9, 1, 256) 1024
conv2d_6 (Conv2D) (None, 9, 1, 128) 32896
batch_normalization_6 (Batch (None, 9, 1, 128) 512
max_pooling2d_3 (MaxPooling2 (None, 5, 1, 128) 0
conv2d_7 (Conv2D) (None, 5, 1, 64) 8256
batch_normalization_7 (Batch (None, 5, 1, 64) 256
flatten (Flatten) (None, 320) 0
dense (Dense) (None, 7) 2247
Total params: 296,455
Trainable params: 293,511
Non-trainable params: 2,944
In this model, we feed the landmarks in batches as (68, 2) arrays, one (x, y) pair per landmark. After training, we take the output from the fully connected layer of 7 nodes.
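The shapes in the landmark model's summary can be traced by hand. The sketch below assumes each pooling layer uses a stride of 2 with "same" padding, which is inferred from the 17 to 9 step in the summary rather than stated in the text; it also relies on the convolutions leaving height and width unchanged, as the summary shows.

```python
import math


def pooled(n, pool=2):
    """Output length of stride-2 pooling with 'same' padding: ceil(n / pool)."""
    return math.ceil(n / pool)


height, width = 68, 2  # 68 landmarks, one (x, y) pair each
shapes = [(height, width)]
for _ in range(4):  # the summary lists four MaxPooling2D layers
    height, width = pooled(height), pooled(width)
    shapes.append((height, width))

flatten_size = height * width * 64          # the last conv layer has 64 filters
dense_param_count = (flatten_size + 1) * 7  # weights plus biases for 7 outputs
```

Under these assumptions the height goes 68, 34, 17, 9, 5, and the flatten size and final dense parameter count match the summary (320 and 2,247).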
Model: "sequential_1"
Layer (type) Output Shape Param #
conv2d_4 (Conv2D) (None, 224, 224, 64) 1792
batch_normalization_6 (Batch (None, 224, 224, 64) 256
activation_6 (Activation) (None, 224, 224, 64) 0
dropout_6 (Dropout) (None, 224, 224, 64) 0
conv2d_5 (Conv2D) (None, 224, 224, 128) 73856
batch_normalization_7 (Batch (None, 224, 224, 128) 512
activation_7 (Activation) (None, 224, 224, 128) 0
max_pooling2d_3 (MaxPooling2 (None, 112, 112, 128) 0
dropout_7 (Dropout) (None, 112, 112, 128) 0
conv2d_6 (Conv2D) (None, 112, 112, 512) 590336
batch_normalization_8 (Batch (None, 112, 112, 512) 2048
activation_8 (Activation) (None, 112, 112, 512) 0
max_pooling2d_4 (MaxPooling2 (None, 56, 56, 512) 0
dropout_8 (Dropout) (None, 56, 56, 512) 0
conv2d_7 (Conv2D) (None, 56, 56, 512) 2359808
batch_normalization_9 (Batch (None, 56, 56, 512) 2048
activation_9 (Activation) (None, 56, 56, 512) 0
max_pooling2d_5 (MaxPooling2 (None, 28, 28, 512) 0
dropout_9 (Dropout) (None, 28, 28, 512) 0
flatten_1 (Flatten) (None, 401408) 0
dense_3 (Dense) (None, 256) 102760704
batch_normalization_10 (Batc (None, 256) 1024
activation_10 (Activation) (None, 256) 0
dropout_10 (Dropout) (None, 256) 0
dense_4 (Dense) (None, 128) 32896
batch_normalization_11 (Batc (None, 128) 512
activation_11 (Activation) (None, 128) 0
dropout_11 (Dropout) (None, 128) 0
dense_5 (Dense) (None, 6) 774
Total params: 105,826,566
Trainable params: 105,823,366
Non-trainable params: 3,200
Here, the images are fed in after preprocessing and resizing them to (224, 224), and the output is again a fully connected layer, this time of 6 nodes (we left out the emotions that suffer from insufficient data).
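The jump in total parameters compared with the 48 x 48 model comes almost entirely from the first dense layer, whose input size grows with the square of the image side. A short sketch of that arithmetic, using only numbers from the two summaries (the function names are illustrative):

```python
def flatten_size(input_side, n_pools=3, channels=512):
    """Flattened feature count after `n_pools` 2x2 max-pool layers."""
    side = input_side
    for _ in range(n_pools):
        side //= 2  # each pooling layer halves height and width
    return side * side * channels


def first_dense_params(input_side, units=256):
    """Weights plus biases of the 256-unit dense layer after the flatten."""
    return (flatten_size(input_side) + 1) * units


small = first_dense_params(48)    # 48 x 48 model
large = first_dense_params(224)   # 224 x 224 model
share = large / 105_826_566       # fraction of the 224 x 224 model's total params
```

Here `small` and `large` reproduce the `dense` and `dense_3` counts from the summaries (4,718,848 and 102,760,704), and `share` comes out at about 0.97.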