Implementation of CNN-RNN for Video Classification - Fine-Tuning with Action Recognition & Emotion Detection
Video classification tasks, like recognizing human actions or detecting emotions in short video clips, are critical in industries such as sports analytics, security surveillance, and entertainment. In this tutorial, we will fine-tune pre-trained CNN models, combine them with RNNs, and train them on real-world datasets like UCF-101 and FER 2013.
Table of Contents
- Data Preprocessing for Video Classification
- Fine-Tuning Pre-trained CNN Models
- CNN-RNN Architecture for Video Classification
- Real-World Examples
- Evaluation Metrics for Video Classification
- Best Practices for Fine-Tuning CNN-RNN Models
- Limitations of CNN-RNN Architecture & Future Directions
- Conclusion
1. Data Preprocessing for Video Classification
1.1 Frame Extraction
Videos consist of multiple frames, which are essentially still images. Before passing these to a CNN-RNN architecture, you need to extract individual frames from the video. Depending on the task, you may not need to process every frame but instead sample frames at a specified interval (e.g., one frame every second). This reduces the computational load and focuses on key frames.
Code: Frame Extraction with Padding
An important consideration in video frame extraction is handling cases where fewer frames are available than expected. Padding can be applied to ensure the correct number of frames is used for consistent model input.
import cv2
def extract_frames(video_path, frame_rate, num_frames):
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    frame_interval = max(int(fps / frame_rate), 1)
    frames = []
    count = 0
    while True:
        success, frame = video.read()
        if not success:
            break
        if count % frame_interval == 0:
            frames.append(frame)
        count += 1
        if len(frames) >= num_frames:
            break
    video.release()
    if len(frames) == 0:
        return None  # Return None if no frames were extracted
    # If fewer frames were captured, pad the sequence by repeating the last frame
    if len(frames) < num_frames:
        frames += [frames[-1]] * (num_frames - len(frames))
    return frames
Note: The frame_rate parameter controls how many frames per second are extracted, and padding ensures consistent input sizes.
1.2 Resizing
After extracting frames, resize them to a consistent dimension that matches the input size expected by the CNN model. A common size is 224 × 224 for models like ResNet or VGG.
Code: Frame Resizing
def resize_frames(frames, target_size=(224, 224)):
    resized_frames = [cv2.resize(frame, target_size) for frame in frames]
    return resized_frames
1.3 Normalization
Normalize pixel values to a range suitable for neural networks. For most pre-trained CNN models, pixel values should be scaled to the range [0, 1].
Code: Normalizing Frames
import numpy as np
def normalize_frames(frames):
    normalized_frames = [frame / 255.0 for frame in frames]
    return np.array(normalized_frames)
1.4 Combining Preprocessing Steps
Combine all preprocessing steps to ensure the video frames are extracted, resized, normalized, and padded if necessary.
Code: Video Preprocessing Pipeline
def preprocess_video(video_path, frame_rate=5, target_size=(224, 224), num_frames=10):
    frames = extract_frames(video_path, frame_rate, num_frames)
    if frames is None:
        return None
    resized_frames = resize_frames(frames, target_size)
    normalized_frames = normalize_frames(resized_frames)
    return normalized_frames
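A quick, hypothetical usage check of the pipeline (the file name sample_video.mp4 is only a placeholder):
# Hypothetical usage: run the pipeline on a local clip and inspect the output shape
clip = preprocess_video('sample_video.mp4', frame_rate=5, target_size=(224, 224), num_frames=10)
if clip is not None:
    print(clip.shape)  # Expected: (10, 224, 224, 3) with the defaults above
else:
    print("No frames could be extracted")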
2. Fine-Tuning Pre-trained CNN Models
Fine-tuning adapts a pre-trained CNN (such as ResNet or Inception), trained on a large dataset like ImageNet, for a new task involving video data. The CNN extracts spatial features from individual frames, while the RNN (or LSTM) captures temporal dependencies between those frames.
2.1 Why Fine-Tune?
- Pre-trained Knowledge: The CNN has already learned useful features like edges, textures, and shapes.
- Efficiency: Fine-tuning speeds up convergence, crucial when working with large video datasets.
- Data Requirements: Reduces the need for massive amounts of video data, suitable for domains with limited labeled data.
2.2 Setting Up Fine-Tuning in TensorFlow
Import a pre-trained CNN model (e.g., ResNet50) and remove its fully connected layers. Add RNN layers to process the sequence of features extracted from each frame.
Code: Pre-trained CNN + RNN in TensorFlow
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models
# Load the pre-trained ResNet50 model without the top layers
cnn_base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Freeze the pre-trained layers
for layer in cnn_base.layers:
    layer.trainable = False
# Create the CNN-RNN model
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(128, return_sequences=False),
    layers.Dense(101, activation='softmax')  # For UCF-101 dataset
])
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Summary of the model architecture
model.summary()
2.3 Explanation of Key Components
- TimeDistributed Layer: Applies the CNN to each frame independently (see the shape-check sketch after this list).
- Global Average Pooling: Reduces dimensionality while retaining spatial features.
- LSTM Layer: Captures temporal patterns across the video sequence.
- Freezing Layers: Prevents overfitting and reduces training time by keeping pre-trained weights fixed.
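To make the shapes concrete, here is a minimal sketch (assuming cnn_base and the imports from the snippet above) that passes a dummy batch through the first two layers:
import numpy as np
# Dummy batch: 2 videos, 10 frames each, 224x224 RGB
dummy_batch = np.random.rand(2, 10, 224, 224, 3).astype('float32')
# TimeDistributed applies the frozen ResNet50 to every frame separately
frame_features = layers.TimeDistributed(cnn_base)(dummy_batch)
print(frame_features.shape)  # (2, 10, 7, 7, 2048) for 224x224 inputs
# Global average pooling collapses each 7x7 feature map into one vector per frame
pooled = layers.TimeDistributed(layers.GlobalAveragePooling2D())(frame_features)
print(pooled.shape)          # (2, 10, 2048): the sequence the LSTM consumes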
2.4 Fine-Tuning Considerations
- Unfreezing Layers: Unfreeze deeper layers for additional fine-tuning as training progresses.
- Learning Rate Adjustment: Use a smaller learning rate to prevent drastic changes to pre-trained weights.
Code: Unfreezing Layers and Learning Rate Scheduling
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
# Unfreeze the last few layers of the CNN
for layer in cnn_base.layers[-10:]:
    layer.trainable = True
# Learning rate scheduler
def lr_scheduler(epoch, lr):
    if epoch > 10:
        return lr * 0.5
    return lr
# Compile the model with a lower initial learning rate
model.compile(optimizer=Adam(learning_rate=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model with the learning rate scheduler
model.fit(train_data, epochs=20, validation_data=val_data, callbacks=[LearningRateScheduler(lr_scheduler)])
3. CNN-RNN Architecture for Video Classification
Combining CNNs and RNNs involves using the CNN to process each video frame individually and passing the extracted features to an RNN to model temporal dependencies.
3.1 CNN for Spatial Feature Extraction
The CNN processes each video frame to extract spatial features:

f_t = CNN(x_t)

Where:
- x_t is the frame at time step t,
- CNN is the feature-extraction function learned by the CNN, and f_t is the resulting feature vector for that frame.
3.2 RNN/LSTM for Temporal Sequence Processing
The RNN processes the sequence of features extracted by the CNN:

h_t = RNN(f_t, h_{t-1})

Where:
- h_t is the hidden state at time step t,
- RNN is the state-update function learned by the RNN (an LSTM in this tutorial).
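A minimal sketch of these two stages in isolation, using random frames as a stand-in for a real preprocessed clip:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50
# Stand-in input: 10 frames of size 224x224x3 from a single video
frames = np.random.rand(10, 224, 224, 3).astype('float32')
# Stage 1: the CNN maps each frame x_t to a feature vector f_t
cnn = ResNet50(weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3))
features = cnn.predict(frames)                    # shape (10, 2048): one f_t per frame
# Stage 2: the LSTM consumes the sequence f_1..f_T and updates h_t at each step
lstm = layers.LSTM(128)
h_final = lstm(tf.expand_dims(features, axis=0))  # shape (1, 128): the final hidden state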
4. Real-World Examples
4.1 Example 1: UCF-101 Action Recognition
Dataset Overview
UCF-101 is a popular action recognition dataset with 13,320 video clips across 101 action categories.
Data Preprocessing
Extract frames from videos, resize them to 224 × 224, and normalize pixel values.
Code: Data Preprocessing for UCF-101
import os
import cv2
import numpy as np
from tensorflow.keras.utils import to_categorical
def load_ucf101_data(video_dir, frame_rate=5, target_size=(224, 224), num_frames=10):
    X_data = []
    y_data = []
    class_names = sorted(os.listdir(video_dir))
    for class_idx, class_name in enumerate(class_names):
        class_path = os.path.join(video_dir, class_name)
        for video_file in os.listdir(class_path):
            video_path = os.path.join(class_path, video_file)
            frames = preprocess_video(video_path, frame_rate, target_size, num_frames)
            if frames is not None:
                X_data.append(frames)
                y_data.append(class_idx)
    X_data = np.array(X_data)
    y_data = to_categorical(y_data, num_classes=len(class_names))
    return X_data, y_data, class_names
Model Setup and Training
Code: Creating and Training the Model
# Load the data
video_dir = '/path/to/ucf101/videos'
X_data, y_data, class_names = load_ucf101_data(video_dir)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)
# Create the CNN-RNN model
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(128, dropout=0.5),
    layers.Dense(len(class_names), activation='softmax')
])
# Compile and train the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, batch_size=8, validation_data=(X_val, y_val))
Performance Evaluation
Code: Evaluating the Model
from sklearn.metrics import classification_report
# Predict on validation set
y_pred = model.predict(X_val)
y_pred_labels = np.argmax(y_pred, axis=1)
y_true_labels = np.argmax(y_val, axis=1)
# Classification report
report = classification_report(y_true_labels, y_pred_labels, target_names=class_names)
print(report)
4.2 Example 2: FER 2013 Emotion Detection
Dataset Overview
FER 2013 consists of 35,887 grayscale images of facial expressions across 7 emotion categories.
Adapting Images to Video Clips
Since FER 2013 consists of images, create synthetic video clips by grouping images.
Code: Creating Video Clips from Images
def create_video_clips(image_dir, clip_length=10, target_size=(48, 48)):
    video_clips = []
    labels = []
    class_names = sorted(os.listdir(image_dir))
    for class_idx, class_name in enumerate(class_names):
        class_path = os.path.join(image_dir, class_name)
        image_files = sorted(os.listdir(class_path))
        for i in range(0, len(image_files) - clip_length + 1, clip_length):
            clip = []
            for j in range(clip_length):
                image_path = os.path.join(class_path, image_files[i + j])
                image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
                if image is None:
                    continue
                image = cv2.resize(image, target_size)
                image = image / 255.0
                image = np.expand_dims(image, axis=-1)
                clip.append(image)
            if len(clip) == clip_length:
                video_clips.append(np.array(clip))
                labels.append(class_idx)
    X_data = np.array(video_clips)
    y_data = to_categorical(labels, num_classes=len(class_names))
    return X_data, y_data, class_names
Adjusting the CNN Base Model for Grayscale Images
Use a CNN model that accepts grayscale images or replicate channels.
Option 1: Adjust Input Layer
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input
# Build VGG16 around a single-channel input; ImageNet weights require 3 channels,
# so the network is randomly initialized (weights=None) and trained from scratch
input_tensor = Input(shape=(48, 48, 1))
cnn_base = VGG16(weights=None, include_top=False, input_tensor=input_tensor)
Option 2: Replicate Grayscale Channels
# Replicate the grayscale channel three times so a 3-channel, ImageNet-pretrained CNN can be used
def preprocess_fer_image(image_path, target_size=(224, 224)):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    image = cv2.resize(image, target_size)
    image = np.stack((image,) * 3, axis=-1)
    image = image / 255.0
    return image
Model Setup and Training
Code: Creating and Training the Model
# Load the data
image_dir = '/path/to/fer2013/images'
X_data, y_data, class_names = create_video_clips(image_dir)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)
# Create the CNN-RNN model
model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(10, 48, 48, 1)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(64, dropout=0.5),
    layers.Dense(len(class_names), activation='softmax')
])
# Compile and train the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=15, batch_size=32, validation_data=(X_val, y_val))
Performance Evaluation
Code: Evaluating the Model
from sklearn.metrics import classification_report
# Predict on validation set
y_pred = model.predict(X_val)
y_pred_labels = np.argmax(y_pred, axis=1)
y_true_labels = np.argmax(y_val, axis=1)
# Classification report
report = classification_report(y_true_labels, y_pred_labels, target_names=class_names)
print(report)
5. Evaluation Metrics for Video Classification
Assess the model’s performance using appropriate metrics; a short example of computing them follows the list below.
- Accuracy: Percentage of correct predictions.
- Precision: Proportion of positive predictions that are correct.
- Recall: Proportion of actual positives correctly identified.
- F1-Score: Harmonic mean of precision and recall.
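These can be computed directly with scikit-learn; a minimal sketch, assuming the y_true_labels and y_pred_labels arrays from the evaluation snippets above:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# Macro averaging treats every class equally, which suits multi-class video labels
accuracy = accuracy_score(y_true_labels, y_pred_labels)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_labels, y_pred_labels, average='macro', zero_division=0)
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")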
6. Best Practices for Fine-Tuning CNN-RNN Models
- Preprocessing: Consistently resize and normalize video frames.
- Frame Sampling: Use keyframes to reduce computational load.
- Data Augmentation: Apply temporal augmentations like time-shifting.

Example: Temporal Shift
import random

def temporal_shift(frames, max_shift=2):
    # Rotate the list of frames by a random offset to simulate a slight time shift
    shift = random.randint(-max_shift, max_shift)
    if shift != 0:
        frames = frames[shift:] + frames[:shift]
    return frames

- Batch Size & Sequence Length: Adjust based on memory constraints.
- Regularization: Use dropout or batch normalization to prevent overfitting (see the sketch after this list).
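As a concrete example of the regularization point, here is a sketch (assuming cnn_base and the imports from Section 2.2) that adds batch normalization to the per-frame features and dropout both inside and after the LSTM:
regularized_model = models.Sequential([
    layers.TimeDistributed(cnn_base, input_shape=(None, 224, 224, 3)),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.TimeDistributed(layers.BatchNormalization()),    # normalize per-frame features
    layers.LSTM(128, dropout=0.3, recurrent_dropout=0.3),   # dropout on inputs and recurrent state
    layers.Dropout(0.5),                                    # dropout before the classifier
    layers.Dense(101, activation='softmax')                 # e.g., 101 classes for UCF-101
])
regularized_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])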
7. Limitations of CNN-RNN Architecture & Future Directions
7.1 Limitations
- Resource Intensive: Large video datasets require significant computational resources.
- Sequential Processing: RNNs can be slow with long sequences.
- Complexity: Requires careful hyperparameter tuning and architecture adjustments.
7.2 Future Directions
- Transformers: Models like TimeSformer and Video Swin Transformer offer efficient processing of temporal data.
- Efficient Architectures: Using models like MobileNet or EfficientNet can reduce computation.
- 3D CNNs: Capture spatiotemporal features simultaneously (a brief Conv3D sketch follows this list).
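For reference, a minimal Conv3D sketch (not part of the CNN-RNN pipeline above) showing how 3D convolutions slide over time as well as space; the input shape of 10 frames at 112 × 112 is only illustrative:
from tensorflow.keras import layers, models
model_3d = models.Sequential([
    # Kernels of size 3x3x3 convolve over (time, height, width) jointly
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation='relu',
                  input_shape=(10, 112, 112, 3)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),   # downsample space only
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation='relu'),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),   # downsample time and space
    layers.GlobalAveragePooling3D(),
    layers.Dense(101, activation='softmax')     # e.g., UCF-101 classes
])
model_3d.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])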
8. Conclusion
Combining CNNs and RNNs for video classification provides a powerful method for learning spatial and temporal patterns in video data. By fine-tuning pre-trained CNNs and using RNNs to process temporal sequences, tasks like human action recognition and emotion detection become more accessible. Future advancements, including transformer models and efficient architectures, promise to enhance performance and efficiency in video classification.