Combining CNNs with RNNs for Video Classification


Introduction

Video classification is a complex task that requires understanding both spatial and temporal features. While Convolutional Neural Networks (CNNs) excel at extracting spatial information from individual frames, Recurrent Neural Networks (RNNs)—especially Long Short-Term Memory (LSTM) networks—are better suited for modeling the temporal dependencies between frames. By combining these two architectures, we can effectively handle the intricacies of video data.

CNNs are designed to capture local patterns and spatial hierarchies in images, while RNNs capture sequential patterns in data. The combined CNN-RNN architecture is commonly used for tasks such as action recognition, video captioning, and emotion detection in videos.


Table of Contents

  1. CNN-RNN Architecture for Video Classification
  2. TensorFlow Implementation
  3. PyTorch Implementation
  4. Best Practices for CNN-RNN Models
  5. Limitations and Future Directions
  6. Conclusion

1. CNN-RNN Architecture for Video Classification

The architecture typically involves using a CNN to process each frame and extract spatial features, followed by passing the sequence of features to an RNN to learn temporal dependencies across frames.

1.1 CNN for Spatial Feature Extraction

The CNN processes each video frame to extract spatial features, often using a pre-trained model such as ResNet or VGG. The output of the CNN is a feature vector for each frame, which is then passed to the RNN.

\text{CNN output for frame } t = f_{\text{CNN}}(x_t)

Where:

  • x_t is the frame at time step t,
  • f_{\text{CNN}} is the function learned by the CNN.

Clarification: CNNs capture spatial hierarchies and local patterns such as edges and textures, which are crucial for understanding video frames.
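
For illustration, here is a minimal sketch of per-frame feature extraction with a pretrained ResNet50 in Keras; the 224×224 input size, ImageNet weights, and random frames are assumptions for the example, and the implementations in Sections 2 and 3 instead train a small CNN from scratch:

import numpy as np
import tensorflow as tf

# Pretrained backbone without the classification head; global average pooling
# turns each frame into a single 2048-dimensional feature vector.
backbone = tf.keras.applications.ResNet50(
    weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3)
)
backbone.trainable = False  # Freeze the backbone; only the temporal model would be trained

# Hypothetical batch of 16 frames with pixel values in [0, 255]
frames = np.random.randint(0, 256, (16, 224, 224, 3)).astype(np.float32)
frames = tf.keras.applications.resnet50.preprocess_input(frames)

features = backbone(frames)  # Shape: (16, 2048), one feature vector per frame
print(features.shape)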

1.2 RNN/LSTM for Temporal Sequence Processing

Once the CNN extracts the spatial features from each frame, they are fed sequentially to an RNN (or LSTM). RNNs have a feedback loop that allows them to maintain a hidden state capturing information from previous time steps, enabling the network to model the temporal dependencies between the frames.

h_t = f_{\text{RNN}}(h_{t-1}, f_{\text{CNN}}(x_t))

Where:

  • h_t is the hidden state at time step t,
  • f_{\text{RNN}} is the function learned by the RNN.

After the last frame has been processed, the final hidden state summarizes the whole clip and is passed to a classifier to produce a prediction for the entire video.
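
To make the recurrence concrete, the NumPy sketch below steps a vanilla RNN cell over a sequence of per-frame feature vectors; the dimensions and random weights are purely illustrative, and the LSTM layers used in the implementations below learn these parameters during training:

import numpy as np

T, feat_dim, hidden_dim = 16, 128, 64          # Illustrative dimensions
features = np.random.randn(T, feat_dim)        # One CNN feature vector per frame
W_x = np.random.randn(hidden_dim, feat_dim) * 0.01
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.01
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                       # Initial hidden state h_0
for t in range(T):
    # h_t = f_RNN(h_{t-1}, f_CNN(x_t)) for a vanilla RNN cell
    h = np.tanh(W_x @ features[t] + W_h @ h + b)

# The final hidden state summarizes the whole clip and feeds a classifier
print(h.shape)  # (64,)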


2. TensorFlow Implementation

2.1 Data Preparation

Before building the model, we need to prepare the data by extracting frames from videos, preprocessing them, and arranging them into sequences.

Frame Extraction and Preprocessing

import cv2
import numpy as np
from tensorflow.keras.utils import to_categorical

def extract_frames(video_path, num_frames):
    video = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames).astype(int)

    for idx in frame_indices:
        video.set(cv2.CAP_PROP_POS_FRAMES, idx)
        success, frame = video.read()
        if success:
            frame = cv2.resize(frame, (64, 64))
            frame = frame / 255.0
            frames.append(frame)
        else:
            break

    video.release()
    frames = np.array(frames)
    if frames.shape[0] < num_frames:
        # Pad with zeros if not enough frames
        padding = np.zeros((num_frames - frames.shape[0], 64, 64, 3))
        frames = np.vstack((frames, padding))
    return frames

Creating the Dataset

import os

def load_video_dataset(video_dir, num_frames):
    X_data = []
    y_data = []
    class_names = sorted(os.listdir(video_dir))

    for label, class_name in enumerate(class_names):
        class_path = os.path.join(video_dir, class_name)
        for video_file in os.listdir(class_path):
            video_path = os.path.join(class_path, video_file)
            frames = extract_frames(video_path, num_frames)
            X_data.append(frames)
            y_data.append(label)

    X_data = np.array(X_data)
    y_data = to_categorical(y_data, num_classes=len(class_names))
    return X_data, y_data, class_names

Note: This code handles cases where videos have fewer frames than num_frames by padding with zeros.
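
One way to obtain the X_train, y_train, X_val, and y_val arrays used in the next subsection is to call load_video_dataset and split the result with scikit-learn; the directory path, frame count, and split ratio below are placeholders:

from sklearn.model_selection import train_test_split

# 'videos/' is a placeholder: one subdirectory per class, each containing video files
X_data, y_data, class_names = load_video_dataset('videos/', num_frames=16)

X_train, X_val, y_train, y_val = train_test_split(
    X_data, y_data, test_size=0.2, random_state=42
)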

2.2 CNN-RNN Architecture in TensorFlow

Building the Model

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn():
    cnn_model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.GlobalAveragePooling2D(),
    ])
    return cnn_model

def build_cnn_rnn(timesteps, num_classes):
    cnn_model = build_cnn()

    model = models.Sequential([
        layers.TimeDistributed(cnn_model, input_shape=(timesteps, 64, 64, 3)),
        layers.LSTM(128),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax')  # Number of classes for video classification
    ])
    return model

# Assuming you have loaded X_train, y_train, X_val, y_val
num_classes = y_train.shape[1]
timesteps = X_train.shape[1]

model = build_cnn_rnn(timesteps, num_classes)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Key Details:

  • TimeDistributed Layer: Applies the CNN to each frame independently within the sequence.
  • Global Average Pooling: Reduces each feature map to a single value, resulting in a feature vector.
  • LSTM Layer: Captures temporal dependencies across the sequence of frames.

2.3 Training the Model

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=8,
    validation_data=(X_val, y_val)
)
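
Optionally, Keras callbacks such as early stopping and model checkpointing help avoid overfitting on small video datasets; the patience value and checkpoint filename below are arbitrary choices:

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_cnn_rnn.keras', save_best_only=True),
]

history = model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=8,
    validation_data=(X_val, y_val),
    callbacks=callbacks
)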

2.4 Evaluating the Model

from sklearn.metrics import classification_report
import numpy as np

# Predict on validation set
y_pred = model.predict(X_val)
y_pred_labels = np.argmax(y_pred, axis=1)
y_true_labels = np.argmax(y_val, axis=1)

# Classification report
report = classification_report(y_true_labels, y_pred_labels, target_names=class_names)
print(report)
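
A confusion matrix complements the classification report by showing which classes are most often confused with each other:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true_labels, y_pred_labels)
print(class_names)
print(cm)  # Rows: true classes, columns: predicted classes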

3. PyTorch Implementation

3.1 Data Preparation

Frame Extraction and Preprocessing

import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

def extract_frames(video_path, num_frames):
    video = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames).astype(int)

    for idx in frame_indices:
        video.set(cv2.CAP_PROP_POS_FRAMES, idx)
        success, frame = video.read()
        if success:
            frame = cv2.resize(frame, (64, 64))
            frame = frame / 255.0
            frames.append(frame)
        else:
            break

    video.release()
    frames = np.array(frames)
    if frames.shape[0] < num_frames:
        padding = np.zeros((num_frames - frames.shape[0], 64, 64, 3))
        frames = np.vstack((frames, padding))
    return frames

class VideoDataset(Dataset):
    def __init__(self, video_dir, num_frames):
        self.video_paths = []
        self.labels = []
        self.class_names = sorted(os.listdir(video_dir))
        for label, class_name in enumerate(self.class_names):
            class_path = os.path.join(video_dir, class_name)
            for video_file in os.listdir(class_path):
                video_path = os.path.join(class_path, video_file)
                self.video_paths.append(video_path)
                self.labels.append(label)
        self.num_frames = num_frames

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        video_path = self.video_paths[idx]
        frames = extract_frames(video_path, self.num_frames)
        frames = np.transpose(frames, (0, 3, 1, 2))  # Convert to (T, C, H, W)
        frames = torch.from_numpy(frames).float()
        label = self.labels[idx]
        return frames, label
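
The model instantiation and training code in the next subsections assume that train_dataset, train_loader, and val_loader already exist. Here is a minimal sketch, assuming separate train/ and val/ directories that each contain one subdirectory per class; the paths and batch size are placeholders:

train_dataset = VideoDataset('videos/train', num_frames=16)
val_dataset = VideoDataset('videos/val', num_frames=16)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)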

3.2 CNN-RNN Architecture in PyTorch

Building the Model

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class CNNFeatureExtractor(nn.Module):
    def __init__(self):
        super(CNNFeatureExtractor, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),  # Output: (32, 64, 64)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # Output: (32, 32, 32)
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),  # Output: (64, 32, 32)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # Output: (64, 16, 16)
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # Output: (128, 16, 16)
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),  # Output: (128, 1, 1)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten to (batch_size, 128)
        return x

class CNNRNN(nn.Module):
    def __init__(self, num_classes):
        super(CNNRNN, self).__init__()
        self.cnn = CNNFeatureExtractor()
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=1, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        batch_size, timesteps, C, H, W = x.size()
        x = x.view(batch_size * timesteps, C, H, W)
        cnn_features = self.cnn(x)
        cnn_features = cnn_features.view(batch_size, timesteps, -1)
        lstm_out, _ = self.lstm(cnn_features)
        output = self.fc(lstm_out[:, -1, :])
        return output

# Instantiate the model
num_classes = len(train_dataset.class_names)
model = CNNRNN(num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Key Details:

  • Adaptive Average Pooling: Ensures a fixed output size from the CNN regardless of input dimensions.
  • Batch First: Set batch_first=True in the LSTM since inputs are shaped as (batch_size, timesteps, features).
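
The training loop in Section 3.3 runs on the CPU as written; moving the model and each batch to a GPU when one is available is a common extension, sketched below:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Inside the training and evaluation loops, move each batch to the same device:
# inputs, labels = inputs.to(device), labels.to(device)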

3.3 Training the Model

num_epochs = 10
model.train()
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}')
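
After training, the learned weights can be saved and later restored through the model's state_dict; the filename is a placeholder:

# Save the trained weights
torch.save(model.state_dict(), 'cnn_rnn_weights.pth')

# Later, restore them into a freshly constructed model
model = CNNRNN(num_classes)
model.load_state_dict(torch.load('cnn_rnn_weights.pth'))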

3.4 Evaluating the Model

from sklearn.metrics import classification_report

model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for inputs, labels in val_loader:
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

class_names = train_dataset.class_names
report = classification_report(all_labels, all_preds, target_names=class_names)
print(report)

4. Best Practices for CNN-RNN Models

  1. Pretrained Models:

    • Utilize pretrained CNN models (e.g., VGG16, ResNet50) to leverage learned features.
    • Adjust input layers to match your data’s frame size and channels.
  2. Frame Sampling Techniques:

    • Uniform Sampling: Select frames at consistent intervals.
    • Random Sampling: Randomly choose frames to increase data variability.
    • Key Frame Extraction: Use algorithms to select the most informative frames.
  3. Data Augmentation:

    • Spatial Augmentation: Apply techniques like rotation, flipping, and cropping.

    • Temporal Augmentation:

      import random
      import numpy as np

      def temporal_shift(frames, max_shift=2):
          # Circularly shift the frame order by a small random offset (np.roll wraps around)
          shift = random.randint(-max_shift, max_shift)
          if shift != 0:
              frames = np.roll(frames, shift, axis=0)
          return frames
      
  4. Hyperparameter Tuning:

    • Experiment with different sequence lengths, batch sizes, and learning rates.
    • Use validation sets to monitor performance and avoid overfitting.
  5. Regularization:

    • Incorporate dropout layers to reduce overfitting.
    • Apply weight decay (L2 regularization) in the optimizer, as shown in the sketch after this list.
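
As an illustration of point 5, the sketch below adds a dropout layer before the classifier of the PyTorch model from Section 3.2 and enables weight decay in the optimizer; the dropout probability and decay factor are typical starting values, not tuned recommendations:

import torch.nn as nn
import torch.optim as optim

class CNNRNNWithDropout(CNNRNN):
    def __init__(self, num_classes, dropout_p=0.5):
        super().__init__(num_classes)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x):
        batch_size, timesteps, C, H, W = x.size()
        x = x.view(batch_size * timesteps, C, H, W)
        cnn_features = self.cnn(x).view(batch_size, timesteps, -1)
        lstm_out, _ = self.lstm(cnn_features)
        out = self.dropout(lstm_out[:, -1, :])  # Dropout on the final hidden state
        return self.fc(out)

model = CNNRNNWithDropout(num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)  # L2 regularization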

5. Limitations and Future Directions

5.1 Limitations

  • Computational Complexity:

    • CNN-RNN models can be resource-intensive, especially with high-resolution videos and long sequences.
  • Sequential Processing:

    • RNNs process frames one at a time, which limits parallelization and makes both training and inference slow for long sequences.

5.2 Future Directions

  1. 3D CNNs:

    • Capture spatial and temporal features simultaneously.
    • Models like C3D and I3D are effective for video classification tasks.
  2. Transformer Models:

    • Leverage self-attention mechanisms to handle temporal dependencies.
    • Models like TimeSformer and Video Swin Transformer offer improved scalability.
  3. Efficient Architectures:

    • Utilize lightweight models such as MobileNet or EfficientNet to reduce computational demands.

6. Conclusion

Combining CNNs and RNNs offers a robust framework for tackling video classification tasks by leveraging the strengths of both architectures. While CNNs excel at spatial feature extraction, RNNs (or LSTMs) model the temporal dependencies between frames. This approach is well-suited for tasks like action recognition, emotion detection, and video scene understanding.

However, challenges like computational efficiency and handling large datasets exist. These can be mitigated with strategies like pretraining, frame sampling, and using efficient architectures. As research evolves, hybrid architectures like CNN-transformer models may become the new standard for video classification.

Experiment with both CNN-RNN models and newer architectures to find the best approach for your specific video classification tasks!

© 2024 Dominic Kneup