Object detection is a large field in computer vision, and one of the more important applications of computer vision “in the wild”.
Object detection isn’t as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It’s difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.
This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification.
Fortunately for the masses – Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5 which has been extended by other research and development teams into newer versions, such as YOLOv7.
In this short guide, we’ll be performing Pose Estimation (Keypoint Detection) in Python, with state-of-the-art YOLOv7.
Keypoints can be various points – parts of a face, limbs of a body, etc. Pose estimation is a special case of keypoint detection – in which the points are parts of a human body, and can be used to replace expensive position tracking hardware, enable over-the-air robotics control, and power a new age of human self expression through AR and VR.
YOLO and Pose Estimation
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) – and the deep learning community continued with open-sourced advancements in the following years.
Ultralytics’ YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is the beautifully simple and powerful API built around it. The project abstracts away the unnecessary details while allowing customizability, supports practically all usable export formats, and employs practices that make the entire project efficient.
YOLOv5 is still the staple project to build Object Detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case with YOLOR (You Only Learn One Representation) and YOLOv7, which builds on top of YOLOR (same author) and is the latest advancement in the YOLO methodology.
YOLOv7 isn’t just an object detection architecture – it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn’t standard with previous YOLO models. This isn’t surprising, since many object detection architectures were repurposed for instance segmentation and keypoint detection tasks earlier as well, due to the shared general architecture, with different outputs depending on the task. Even so, supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all two-stage detectors a couple of years ago.
This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed “bag-of-freebies”, which increased accuracy without increasing inference cost.
Let’s go ahead and install the project from GitHub:
! git clone https://github.com/WongKinYiu/yolov7.git
This creates a yolov7 directory under your current working directory, in which you’ll be able to find the basic project files:
%cd yolov7
!ls

/Users/macbookpro/jup/yolov7
LICENSE.md       detect.py        models           tools
README.md        export.py        paper            train.py
cfg              figure           requirements.txt train_aux.py
data             hubconf.py       scripts          utils
deploy           inference        test.py
Note: Google Colab Notebooks reset to the main working directory in the next cell, even after calling %cd dirname, so you’ll have to keep calling it in each cell you want an operation to be performed in. Local Jupyter Notebooks remember the change, so there’s no need to keep calling the command.
Whenever you run code with a given set of weights – they’ll be downloaded and stored in this directory. To perform pose estimation, we’ll want to download the weights for the pre-trained YOLOv7 model for that task, which can be found under the /releases/download/ tab on GitHub:
! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt
%cd ..

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  153M  100  153M    0     0  3742k      0  0:00:42  0:00:42 --:--:-- 4573k
/Users/macbookpro/jup
Great, we’ve downloaded the yolov7-w6-pose.pt weights file, which can be used to load and reconstruct a trained model for pose estimation.
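If you would rather fetch the weights from Python than from a shell command, a minimal sketch could look like the following. The helper name `download_if_missing` is hypothetical and not part of the YOLOv7 project. It relies only on the standard library and skips the download when the file is already present.

```python
import os
import urllib.request

# URL taken from the curl command above
WEIGHTS_URL = "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt"


def download_if_missing(url, path):
    """Download `url` to `path` unless the file already exists (hypothetical helper)."""
    if os.path.exists(path):
        # Nothing to do, reuse the file that is already on disk
        return path
    urllib.request.urlretrieve(url, path)
    return path


# Example call (downloads roughly 150 MB on first use):
# download_if_missing(WEIGHTS_URL, "yolov7/yolov7-w6-pose.pt")
```

Checking for the file first keeps repeated notebook runs from downloading the same weights again.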
Loading the YOLOv7 Pose Estimation Model
Let’s import the libraries we’ll need to perform pose estimation:
import torch
from torchvision import transforms

from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

import matplotlib.pyplot as plt
import cv2
import numpy as np
torch and torchvision are straightforward enough – YOLOv7 is implemented with PyTorch. The utils.datasets, utils.general and utils.plots modules come from the YOLOv7 project, and provide us with methods that help with preprocessing and preparing input for the model to run inference on. Amongst those are letterbox() to pad the image, non_max_suppression_kpt() to run the Non-Max Suppression algorithm on the initial output of the model and produce a clean output for our interpretation, as well as the output_to_keypoint() and plot_skeleton_kpts() methods to actually add keypoints to a given image, once they’re predicted.
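To make the padding step more concrete, here is a rough sketch of the arithmetic behind letterboxing: the image is scaled to fit the target size while keeping its aspect ratio, and the remaining space is padded only as far as the next multiple of the stride. This is a simplified illustration and not the project's `letterbox()` implementation. The function name `letterbox_shape` and the 480×640 example input are assumptions, and the defaults mirror the values passed to `letterbox()` later in this guide.

```python
def letterbox_shape(height, width, new_size=960, stride=64):
    """Return the (height, width) after an aspect-preserving resize and
    minimal padding to a multiple of `stride` (illustrative sketch)."""
    # Scale so that the longer side fits into new_size
    ratio = min(new_size / height, new_size / width)
    resized_h, resized_w = round(height * ratio), round(width * ratio)
    # Pad each dimension only up to the next multiple of the stride
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    return resized_h + pad_h, resized_w + pad_w


print(letterbox_shape(480, 640))  # (768, 960)
```

Padding to a stride multiple rather than to a full square keeps the input smaller, which means fewer grid cells for the network to process.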
We can load the model from the weight file with torch.load(). Let’s create a function to check if a GPU is available, load the model, put it in inference mode and move it to the GPU if available:
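def load_model():
    # Use the first GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # The checkpoint is a dictionary; the network is stored under the 'model' key.
    # Note: recent PyTorch releases may need weights_only=False to unpickle a full model.
    model = torch.load('yolov7/yolov7-w6-pose.pt', map_location=device)['model']
    # Put the model in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # Half precision is only used on the GPU
        model.half().to(device)
    return model

model = load_model()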
With the model loaded, let’s create a run_inference() method that accepts a string pointing to a file on our system. The method will read the image using OpenCV (cv2), pad it with letterbox(), apply transforms to it, and turn it into a batch (the model is trained on and expects batches, as usual):
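def run_inference(url):
    image = cv2.imread(url)
    # Resize and pad the image; letterbox() returns a tuple whose first element is the image
    image = letterbox(image, 960, stride=64, auto=True)[0]
    # Convert to a float tensor in [0, 1] with shape (C, H, W)
    image = transforms.ToTensor()(image)
    # Turn the image into a batch of one
    image = image.unsqueeze(0)
    # Match the model's device and precision (half precision on the GPU)
    weights = next(model.parameters())
    image = image.to(weights.device, dtype=weights.dtype)
    output, _ = model(image)
    return output, image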
Here, we’ve returned the transformed image (because we’ll want to extract the original and plot on it) and the outputs of the model. These outputs contain 45900 keypoint predictions, most of which overlap. We’ll want to apply Non-Max Suppression to these raw predictions, just as with Object Detection predictions (where many bounding boxes are predicted and then they’re “collapsed” given some confidence and IoU threshold). After suppression, we can plot each keypoint on the original image and display it:
def visualize_output(output, image):
    output = non_max_suppression_kpt(output,
                                     0.25,  # Confidence threshold
                                     0.65,  # IoU threshold
                                     nc=model.yaml['nc'],  # Number of classes
                                     nkpt=model.yaml['nkpt'],  # Number of keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    # Take the single image out of the batch and convert it to (H, W, C)
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    # Draw one skeleton per detected person
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
    plt.figure(figsize=(12, 12))
    plt.axis('off')
    plt.imshow(nimg)
    plt.show()
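The figure of 45900 raw predictions mentioned earlier can be reproduced with a little arithmetic. The sketch below assumes a letterboxed input of 768×960 pixels and a detection head with four output scales (strides 8, 16, 32 and 64) and three anchors per grid cell. These values are assumptions about the W6 pose model and the example image rather than something stated in this guide, and the function name `count_raw_predictions` is hypothetical.

```python
def count_raw_predictions(height, width, strides=(8, 16, 32, 64), anchors_per_cell=3):
    """Count the predictions emitted before Non-Max Suppression (illustrative)."""
    total = 0
    for stride in strides:
        # Each scale predicts on a grid of (height / stride) x (width / stride) cells
        total += (height // stride) * (width // stride) * anchors_per_cell
    return total


print(count_raw_predictions(768, 960))  # 45900
```

Most of these predictions describe the same few people at slightly different positions and scales, which is why the suppression step is needed before plotting.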
Now, for some input image, such as karate.jpg in the main working directory, we can run inference, perform Non-Max Suppression and plot the results with:
output, image = run_inference('./karate.jpg')
visualize_output(output, image)
This results in:
This is a fairly difficult image to infer! Most of the right arm of the practitioner on the right is hidden, and we can see that the model inferred that it is hidden and to the right of the body, missing that the elbow is bent and that a portion of the arm is in front. The practitioner on the left, who is much more clearly seen, is inferred correctly, even with a hidden leg.
As a matter of fact – a person sitting in the back, almost fully invisible to the camera has had their pose seemingly correctly estimated, just based on the position of the hips while sitting down. Great work on behalf of the network!
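If you want to work with the predicted coordinates rather than only plot them, each row can be split into named keypoints. The sketch below assumes the row layout used in `visualize_output()`, where the first seven values describe the detection and the rest are keypoints. It further assumes that every keypoint is an (x, y, confidence) triple in the standard COCO order of 17 body parts. The helper `split_keypoints` and the dummy row are hypothetical.

```python
# Standard COCO keypoint order (assumed to match the model's output)
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def split_keypoints(row, offset=7):
    """Map a flat detection row to {name: (x, y, confidence)} (hypothetical helper)."""
    values = list(row[offset:])
    return {
        name: tuple(values[3 * i: 3 * i + 3])
        for i, name in enumerate(COCO_KEYPOINTS)
    }


# Dummy row: 7 detection values followed by 17 * 3 keypoint values
dummy_row = [0.0] * 7 + [float(v) for v in range(51)]
print(split_keypoints(dummy_row)["left_eye"])  # (3.0, 4.0, 5.0)
```

With named keypoints, a low confidence value on a joint such as the hidden elbow above can be used to flag estimates that should not be trusted.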
In this guide – we’ve taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR, and further provides instance segmentation and keypoint detection capabilities beyond the standard object detection capabilities of most YOLO-based models.
We’ve then taken a look at how we can download released weight files, load them in to construct a model and perform pose estimation inference for humans, yielding impressive results.