Implementation of CNN-RNN for Video Classification - Fine-Tuning with Action Recognition & Emotion Detection


Video classification tasks, like recognizing human actions or detecting emotions in short video clips, are critical in industries such as sports analytics, security surveillance, and entertainment. In this tutorial, we will fine-tune pre-trained CNN models, combine them with RNNs, and train them on real-world datasets like UCF-101 and FER 2013.


Table of Contents

  1. Data Preprocessing for Video Classification
  2. Fine-Tuning Pre-trained CNN Models
  3. CNN-RNN Architecture for Video Classification
  4. Real-World Examples
  5. Evaluation Metrics for Video Classification
  6. Best Practices for Fine-Tuning CNN-RNN Models
  7. Limitations of CNN-RNN Architecture & Future Directions
  8. Conclusion

1. Data Preprocessing for Video Classification

1.1 Frame Extraction

Videos consist of multiple frames, which are essentially still images. Before passing these to a CNN-RNN architecture, you need to extract individual frames from the video. Depending on the task, you may not need to process every frame but instead sample frames at a specified interval (e.g., one frame every second). This reduces the computational load and focuses on key frames.

Code: Frame Extraction with Padding

An important consideration in video frame extraction is handling cases where fewer frames are available than expected. Padding can be applied to ensure the correct number of frames is used for consistent model input.

import cv2

def extract_frames(video_path, frame_rate, num_frames):
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    frame_interval = max(int(fps / frame_rate), 1)
    frames = []
    count = 0

    while True:
        success, frame = video.read()
        if not success:
            break
        if count % frame_interval == 0:
            frames.append(frame)
        count += 1

        if len(frames) >= num_frames:
            break

    video.release()

    # If fewer frames were captured, pad the frames by repeating the last frame
    if len(frames) < num_frames and frames:
        frames += [frames[-1]] * (num_frames - len(frames))
    elif len(frames) == 0:
        return None  # Return None if no frames were extracted

    return frames

Note: The frame_rate parameter controls how many frames per second are extracted, and padding ensures consistent input sizes.

1.2 Resizing

After extracting frames, resize them to a consistent dimension that matches the input size expected by the CNN model. Common sizes include $224 \times 224$ for models like ResNet or VGG.

Code: Frame Resizing

def resize_frames(frames, target_size=(224, 224)):
    resized_frames = [cv2.resize(frame, target_size) for frame in frames]
    return resized_frames

1.3 Normalization

Normalize pixel values to a range suitable for neural networks. A common convention, used throughout this tutorial, is to scale pixel values to the range [0, 1].

Code: Normalizing Frames

import numpy as np

def normalize_frames(frames):
    normalized_frames = [frame / 255.0 for frame in frames]
    return np.array(normalized_frames)

1.4 Combining Preprocessing Steps

Combine all preprocessing steps to ensure the video frames are extracted, resized, normalized, and padded if necessary.

Code: Video Preprocessing Pipeline

def preprocess_video(video_path, frame_rate=5, target_size=(224, 224), num_frames=10):
    frames = extract_frames(video_path, frame_rate, num_frames)
    if frames is None:
        return None

    resized_frames = resize_frames(frames, target_size)
    normalized_frames = normalize_frames(resized_frames)

    return normalized_frames
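
Code: Quick Usage Check

As a quick sanity check, the full pipeline can be run on a single clip; the path below is a hypothetical placeholder, and the expected output shape follows from the frame_rate, target_size, and num_frames arguments.

clip = preprocess_video('/path/to/sample_video.avi', frame_rate=5, target_size=(224, 224), num_frames=10)
if clip is not None:
    print(clip.shape)  # expected: (10, 224, 224, 3)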

2. Fine-Tuning Pre-trained CNN Models

Fine-tuning adapts a pre-trained CNN (such as ResNet or Inception), trained on a large dataset like ImageNet, for a new task involving video data. The CNN extracts spatial features from individual frames, while the RNN (or LSTM) captures temporal dependencies between those frames.

2.1 Why Fine-Tune?

  • Pre-trained Knowledge: The CNN has already learned useful features like edges, textures, and shapes.
  • Efficiency: Fine-tuning speeds up convergence, crucial when working with large video datasets.
  • Data Requirements: Reduces the need for massive amounts of video data, suitable for domains with limited labeled data.

2.2 Setting Up Fine-Tuning in TensorFlow

Import a pre-trained CNN model (e.g., ResNet50) and remove its fully connected layers. Add RNN layers to process the sequence of features extracted from each frame.

Code: Pre-trained CNN + RNN in TensorFlow

from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# Load the pre-trained ResNet50 model without the top layers
cnn_base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the pre-trained layers
for layer in cnn_base.layers:
    layer.trainable = False

# Create the CNN-RNN model
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(128, return_sequences=False),
    layers.Dense(101, activation='softmax')  # For UCF-101 dataset
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary of the model architecture
model.summary()

2.3 Explanation of Key Components

  • TimeDistributed Layer: Applies the CNN to each frame independently.
  • Global Average Pooling: Collapses each frame's feature map into a single feature vector by averaging over the spatial dimensions.
  • LSTM Layer: Captures temporal patterns across the video sequence.
  • Freezing Layers: Prevents overfitting and reduces training time by keeping pre-trained weights fixed.

2.4 Fine-Tuning Considerations

  1. Unfreezing Layers: Unfreeze deeper layers for additional fine-tuning as training progresses.
  2. Learning Rate Adjustment: Use a smaller learning rate to prevent drastic changes to pre-trained weights.

Code: Unfreezing Layers and Learning Rate Scheduling

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler

# Unfreeze the last few layers of the CNN
for layer in cnn_base.layers[-10:]:
    layer.trainable = True

# Learning rate scheduler
def lr_scheduler(epoch, lr):
    if epoch > 10:
        return lr * 0.5
    return lr

# Compile the model with a lower initial learning rate
model.compile(optimizer=Adam(learning_rate=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with the learning rate scheduler
# (train_data and val_data are assumed to be prepared datasets of frame sequences and labels)
model.fit(train_data, epochs=20, validation_data=val_data, callbacks=[LearningRateScheduler(lr_scheduler)])

3. CNN-RNN Architecture for Video Classification

Combining CNNs and RNNs involves using the CNN to process each video frame individually and passing the extracted features to an RNN to model temporal dependencies.

3.1 CNN for Spatial Feature Extraction

The CNN processes each video frame to extract spatial features.

$$\text{CNN output for frame } t = f_{\text{CNN}}(x_t)$$

Where:

  • $x_t$ is the frame at time step $t$,
  • $f_{\text{CNN}}$ is the function learned by the CNN.

3.2 RNN/LSTM for Temporal Sequence Processing

The RNN processes the sequence of per-frame features produced by the CNN; a short code sketch of both steps follows the definitions below.

$$h_t = f_{\text{RNN}}\left(h_{t-1},\, f_{\text{CNN}}(x_t)\right)$$

Where:

  • $h_t$ is the hidden state at time step $t$,
  • $f_{\text{RNN}}$ is the function learned by the RNN.
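
Code: Minimal Sketch of the Two Steps

The snippet below is a minimal sketch of these two formulas, assuming the frozen ResNet50 cnn_base from Section 2.2 and a dummy clip of random frames (placeholders, not real data). In the full model, the TimeDistributed wrapper performs the per-frame CNN application automatically.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Dummy clip of 10 frames (illustrative only)
frames = np.random.rand(10, 224, 224, 3).astype('float32')

# Spatial step: f_CNN(x_t) applied to every frame, one feature map per time step
features = cnn_base(frames)                           # shape (10, 7, 7, 2048)
features = layers.GlobalAveragePooling2D()(features)  # shape (10, 2048)

# Temporal step: h_t = f_RNN(h_{t-1}, f_CNN(x_t)) over the feature sequence
sequence = tf.expand_dims(features, axis=0)           # add batch dimension -> (1, 10, 2048)
h_final = layers.LSTM(128)(sequence)                  # final hidden state h_T, shape (1, 128)
print(h_final.shape)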

4. Real-World Examples

4.1 Example 1: UCF-101 Action Recognition

Dataset Overview

UCF-101 is a popular action recognition dataset with 13,320 video clips across 101 action categories.

Data Preprocessing

Extract frames from videos, resize them to $224 \times 224$, and normalize pixel values.

Code: Data Preprocessing for UCF-101

import os
import cv2
import numpy as np
from tensorflow.keras.utils import to_categorical

def load_ucf101_data(video_dir, frame_rate=5, target_size=(224, 224), num_frames=10):
    X_data = []
    y_data = []
    class_names = sorted(os.listdir(video_dir))

    for class_idx, class_name in enumerate(class_names):
        class_path = os.path.join(video_dir, class_name)
        for video_file in os.listdir(class_path):
            video_path = os.path.join(class_path, video_file)
            frames = preprocess_video(video_path, frame_rate, target_size, num_frames)
            if frames is not None:
                X_data.append(frames)
                y_data.append(class_idx)

    X_data = np.array(X_data)
    y_data = to_categorical(y_data, num_classes=len(class_names))

    return X_data, y_data, class_names

Model Setup and Training

Code: Creating and Training the Model

# Load the data
video_dir = '/path/to/ucf101/videos'
X_data, y_data, class_names = load_ucf101_data(video_dir)

# Split data
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

# Create the CNN-RNN model
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(128, dropout=0.5),
    layers.Dense(len(class_names), activation='softmax')
])

# Compile and train the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, batch_size=8, validation_data=(X_val, y_val))

Performance Evaluation

Code: Evaluating the Model

from sklearn.metrics import classification_report

# Predict on validation set
y_pred = model.predict(X_val)
y_pred_labels = np.argmax(y_pred, axis=1)
y_true_labels = np.argmax(y_val, axis=1)

# Classification report
report = classification_report(y_true_labels, y_pred_labels, target_names=class_names)
print(report)

4.2 Example 2: FER 2013 Emotion Detection

Dataset Overview

FER 2013 consists of 35,887 grayscale images of facial expressions across 7 emotion categories.

Adapting Images to Video Clips

Since FER 2013 consists of images, create synthetic video clips by grouping images.

Code: Creating Video Clips from Images

def create_video_clips(image_dir, clip_length=10, target_size=(48, 48)):
    video_clips = []
    labels = []
    class_names = sorted(os.listdir(image_dir))

    for class_idx, class_name in enumerate(class_names):
        class_path = os.path.join(image_dir, class_name)
        image_files = sorted(os.listdir(class_path))

        for i in range(0, len(image_files) - clip_length + 1, clip_length):
            clip = []
            for j in range(clip_length):
                image_path = os.path.join(class_path, image_files[i + j])
                image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
                if image is None:
                    continue
                image = cv2.resize(image, target_size)
                image = image / 255.0
                image = np.expand_dims(image, axis=-1)
                clip.append(image)
            if len(clip) == clip_length:
                video_clips.append(np.array(clip))
                labels.append(class_idx)

    X_data = np.array(video_clips)
    y_data = to_categorical(labels, num_classes=len(class_names))
    return X_data, y_data, class_names

Adjusting the CNN Base Model for Grayscale Images

Use a CNN model that accepts grayscale images or replicate channels.

Option 1: Adjust Input Layer

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input

# ImageNet weights require 3-channel inputs, so with a single grayscale channel
# the VGG16 backbone is randomly initialized (weights=None) and trained from scratch
input_tensor = Input(shape=(48, 48, 1))
cnn_base = VGG16(weights=None, include_top=False, input_tensor=input_tensor)

Option 2: Replicate Grayscale Channels

def preprocess_fer_image(image_path, target_size=(224, 224)):
    # Replicate the grayscale channel three times so that ImageNet-pretrained,
    # 3-channel backbones can be reused without changing their input layer
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    image = cv2.resize(image, target_size)
    image = np.stack((image,) * 3, axis=-1)
    image = image / 255.0
    return image

Model Setup and Training

Code: Creating and Training the Model

# Load the data
image_dir = '/path/to/fer2013/images'
X_data, y_data, class_names = create_video_clips(image_dir)

# Split data
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

# Create the CNN-RNN model (cnn_base here is the grayscale VGG16 from Option 1)
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(10, 48, 48, 1)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(64, dropout=0.5),
    layers.Dense(len(class_names), activation='softmax')
])

# Compile and train the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=15, batch_size=32, validation_data=(X_val, y_val))

Performance Evaluation

Code: Evaluating the Model

from sklearn.metrics import classification_report

# Predict on validation set
y_pred = model.predict(X_val)
y_pred_labels = np.argmax(y_pred, axis=1)
y_true_labels = np.argmax(y_val, axis=1)

# Classification report
report = classification_report(y_true_labels, y_pred_labels, target_names=class_names)
print(report)

5. Evaluation Metrics for Video Classification

Assess the model's performance using appropriate metrics; a short code sketch for computing them with scikit-learn follows the list below.

  • Accuracy: Percentage of correct predictions.

    $$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}}$$

  • Precision: Proportion of positive predictions that are correct.

    $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

  • Recall: Proportion of actual positives correctly identified.

    $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

  • F1-Score: Harmonic mean of precision and recall.

    $$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
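
Code: Computing the Metrics with scikit-learn

The sketch below shows one way to compute these metrics on the integer label arrays produced in Section 4 (y_true_labels, y_pred_labels); macro averaging across classes is an assumption here, and other averaging schemes are equally valid.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_true_labels, y_pred_labels)
precision = precision_score(y_true_labels, y_pred_labels, average='macro', zero_division=0)
recall = recall_score(y_true_labels, y_pred_labels, average='macro', zero_division=0)
f1 = f1_score(y_true_labels, y_pred_labels, average='macro', zero_division=0)

print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")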

6. Best Practices for Fine-Tuning CNN-RNN Models

  1. Preprocessing: Consistently resize and normalize video frames.

  2. Frame Sampling: Use keyframes to reduce computational load.

  3. Data Augmentation: Apply temporal augmentations like time-shifting.

    Example: Temporal Shift

    import random
    
    def temporal_shift(frames, max_shift=2):
        # Rotate the frame sequence by a random offset in [-max_shift, max_shift]
        shift = random.randint(-max_shift, max_shift)
        if shift != 0:
            # Python list rotation; use np.roll(frames, -shift, axis=0) for NumPy arrays
            frames = frames[shift:] + frames[:shift]
        return frames
    
  4. Batch Size & Sequence Length: Adjust based on memory constraints.

  5. Regularization: Use dropout or batch normalization to prevent overfitting (see the sketch below).
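
    Example: Regularized Classification Head

    This is a minimal sketch rather than a prescription: it assumes the cnn_base from Section 2.2 and adds dropout inside the LSTM plus a BatchNormalization/Dropout pair before the classifier.

    model = models.Sequential([
        layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
        layers.TimeDistributed(layers.GlobalAveragePooling2D()),
        layers.LSTM(128, dropout=0.3, recurrent_dropout=0.3),  # dropout on inputs and recurrent state
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(101, activation='softmax')  # adjust units to the number of classes
    ])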


7. Limitations of CNN-RNN Architecture & Future Directions

7.1 Limitations

  • Resource Intensive: Large video datasets require significant computational resources.
  • Sequential Processing: RNNs can be slow with long sequences.
  • Complexity: Requires careful hyperparameter tuning and architecture adjustments.

7.2 Future Directions

  • Transformers: Models like TimeSformer and Video Swin Transformer offer efficient processing of temporal data.
  • Efficient Architectures: Using models like MobileNet or EfficientNet can reduce computation.
  • 3D CNNs: Capture spatiotemporal features simultaneously.

8. Conclusion

Combining CNNs and RNNs for video classification provides a powerful method for learning spatial and temporal patterns in video data. By fine-tuning pre-trained CNNs and using RNNs to process temporal sequences, tasks like human action recognition and emotion detection become more accessible. Future advancements, including transformer models and efficient architectures, promise to enhance performance and efficiency in video classification.

© 2024 Dominic Kneup