Human Pose Estimation Technology Capabilities and Use Cases in 2022

2022-10-08


*Illustration: © IoT For All*

What is Human Pose Estimation?

Human Pose Estimation (HPE) is a task in computer vision that focuses on identifying the position of a human body in a specific scene. Most of the HPE methods are based on recording an RGB image with the optical sensor to detect body parts and the overall pose. This can be used in conjunction with other computer vision technologies for fitness and rehabilitation, augmented reality applications, and surveillance.

'Fitness applications and AI-driven coaches are some of the most obvious use cases for body pose estimation.' -MobiDev

The essence of the technology lies in detecting points of interest on the limbs, joints, and even the face of a human. These key points are used to produce a 2D or 3D representation of a human body model.

*2D representation of an Albert Einstein body pose*

These models are essentially maps of the body joints tracked during movement. They let a computer not only distinguish a person sitting from a person squatting, but also calculate the angle of flexion at a specific joint and tell whether a movement is performed correctly.
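As a minimal sketch of the flexion-angle calculation mentioned above, the angle at a joint can be derived from three key points using the dot product. The function name `joint_angle` and the sample coordinates are illustrative assumptions, not part of any specific HPE library.

```python
import math

def joint_angle(a, b, c):
    """Return the angle in degrees at key point b, formed by segments b->a and b->c.

    Points may be 2D (x, y) or 3D (x, y, z) tuples, as long as all three match.
    """
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.sqrt(sum(p * p for p in v1)) * math.sqrt(sum(q * q for q in v2))
    if norm == 0:
        raise ValueError("key points must not coincide with the joint")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))

# Hypothetical hip, knee, and ankle key points in image coordinates.
hip, knee, ankle = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)
print(round(joint_angle(hip, knee, ankle), 1))  # 90.0
```

Because the function works on plain tuples, the same code applies to 3D key points once a depth coordinate is available.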

There are three common types of human body models: skeleton-based, contour-based, and volume-based. The skeleton-based model is the most widely used in human pose estimation because of its flexibility: it consists of a set of joints, such as ankles, knees, shoulders, elbows, and wrists, together with the limb orientations that make up the skeletal structure of a human body.
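To make the skeleton-based model more concrete, here is a small sketch of how joints and limb orientations could be represented. The `Joint` class, the `LIMBS` list, and the joint names are assumptions chosen for illustration; real models define their own joint sets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Joint:
    name: str
    x: float
    y: float
    z: float = 0.0  # depth; stays 0.0 for a purely 2D skeleton

# Hypothetical limb definition: each limb connects a parent joint to a child joint.
LIMBS = [
    ("shoulder", "elbow"),
    ("elbow", "wrist"),
    ("hip", "knee"),
    ("knee", "ankle"),
]

def limb_vectors(joints, limbs=LIMBS):
    """Return the orientation vector of every limb whose two joints were detected."""
    by_name = {j.name: j for j in joints}
    vectors = {}
    for parent, child in limbs:
        if parent in by_name and child in by_name:
            p, c = by_name[parent], by_name[child]
            vectors[(parent, child)] = (c.x - p.x, c.y - p.y, c.z - p.z)
    return vectors

leg = [Joint("hip", 0.0, 0.0), Joint("knee", 0.0, 1.0), Joint("ankle", 0.5, 2.0)]
print(limb_vectors(leg))
```

Limbs with a missing joint are skipped, which mirrors the common situation where some key points are occluded.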

A skeleton-based model can be used for both 2D and 3D representation, and the two are generally used in conjunction. 3D human pose estimation gives application measurements better accuracy because it includes the depth coordinate in the calculation. Depth matters for the majority of movements, because the human body doesn’t move in a single 2D plane.

So now let’s look at how 3D human pose estimation works from a technical perspective and what such systems are currently capable of.

How 3D Human Pose Estimation Works

The overall flow of a body pose estimation system starts with capturing the initial data and uploading it for a system to process. As we’re dealing with motion detection, we need to analyze a sequence of images rather than a still photo since we need to extract how key points change during the movement pattern.

Once the footage is uploaded, the HPE system will detect and track the required key points for analysis. In a nutshell, different software modules are responsible for tracking 2D key points, creating a body representation, and converting it into 3D space. So, generally, when we speak about creating a body pose estimation model, we mean implementing two different modules, one for 2D and one for 3D.

*The difference between 2D and 3D pose estimation reconstructions*

So, for the majority of human pose estimation tasks, the flow will be broken into two parts:

Detecting and extracting 2D key points from the sequence of images. This entails using horizontal and vertical coordinates that build up a skeleton structure.
Converting the 2D key points into 3D by adding the depth dimension.

During this process, the application will make the required calculations to perform pose estimation.
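The two-part flow described above can be sketched as a pipeline that takes the 2D detector and the 3D lifting step as interchangeable functions. Everything here is a hypothetical illustration: `estimate_poses`, `fake_detector`, and `flat_lifter` are stand-ins, and a real system would plug in trained models instead.

```python
from typing import Callable, List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

def estimate_poses(
    frames: Sequence,
    detect_2d: Callable[[object], List[Point2D]],
    lift_to_3d: Callable[[List[Point2D]], List[Point3D]],
) -> List[List[Point3D]]:
    """Run the two stages in order: 2D key point detection, then lifting to 3D."""
    poses_2d = [detect_2d(frame) for frame in frames]   # stage 1: x and y per key point
    return [lift_to_3d(pose) for pose in poses_2d]      # stage 2: add the depth dimension

# Stand-in stages so the sketch runs without any model (assumptions, not real detectors).
def fake_detector(frame) -> List[Point2D]:
    return [(float(frame), 0.0), (float(frame), 1.0)]

def flat_lifter(pose: List[Point2D]) -> List[Point3D]:
    return [(x, y, 0.0) for x, y in pose]

print(estimate_poses([0, 1], fake_detector, flat_lifter))
```

Keeping the stages separate makes it possible to swap the 2D detector or the 3D module independently, which is how the experiments later in this article combine components from different models.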

Estimating human pose during exercise is just one example in the fitness industry. Some models can also detect key points on the human face and track head position, which can be applied for entertainment applications like Snapchat masks. But we’ll discuss the use cases of HPE later in the article.

You can check our demo to see how it works in a nutshell: just upload a short video of someone performing a movement and wait for processing to finish to see the pose analysis.

3D Pose Estimation Performance and Accuracy

Depending on the chosen algorithm, the HPE system will provide different performance and accuracy results. Let’s see how the two trade off in our experiment with two of the most popular human pose estimation models, VideoPose3D and BlazePose.

We tested the BlazePose and VideoPose3D models on the same hardware using a 5-second video with a resolution of 2160×3840 at 60 frames per second. VideoPose3D took a total of 8 minutes to process the video and produced good accuracy. In contrast, BlazePose processed 3-4 frames per second, which allows its use in real-time applications. But its accuracy, shown below, doesn’t meet the objectives of any HPE task.
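To put the two results on the same scale, the figures above can be converted into frames per second and total processing time. This is a back-of-the-envelope sketch that assumes the 8 minutes covers all frames of the clip and that the 3-4 FPS figure is an average over the whole video; the helper names are illustrative.

```python
def total_frames(duration_s: float, fps: float) -> float:
    """Number of frames in a clip of the given length and frame rate."""
    return duration_s * fps

def throughput_fps(frames: float, processing_seconds: float) -> float:
    """Average number of frames processed per second."""
    return frames / processing_seconds

frames = total_frames(5, 60)                           # 300 frames in the test video
videopose3d_fps = throughput_fps(frames, 8 * 60)       # 300 / 480 s = 0.625 FPS
blazepose_seconds = [frames / rate for rate in (3, 4)] # 100.0 s and 75.0 s

print(frames, videopose3d_fps, blazepose_seconds)
```

Under these assumptions, VideoPose3D handled well under one frame per second, while BlazePose would need roughly 75 to 100 seconds for the same clip.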

*VideoPose3D and BlazePose processing results*

The processing time depends on the movement complexity, video and lighting quality, and the 2D pose detector module. Although BlazePose and VideoPose3D use different 2D detectors, this stage appears to be the performance bottleneck in both cases.

One possible way to optimize HPE performance is to accelerate 2D key point detection. Existing 2D detectors can also be modified or augmented with post-processing stages to improve overall accuracy.
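As one example of what a post-processing stage could look like, the sketch below applies exponential smoothing to 2D key points across frames to reduce frame-to-frame jitter. This is an assumed, generic technique chosen for illustration, not the method used in the experiments described here; `smooth_keypoints` and `alpha` are hypothetical names.

```python
def smooth_keypoints(frames, alpha=0.5):
    """Exponentially smooth key points over time.

    frames: list of poses, each pose a list of (x, y) tuples in a fixed joint order.
    alpha:  weight of the current frame, in (0, 1]; lower values smooth more.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    previous = None
    for pose in frames:
        if previous is None:
            current = [tuple(point) for point in pose]  # first frame passes through
        else:
            current = [
                tuple(alpha * c + (1 - alpha) * p for c, p in zip(point, prev_point))
                for point, prev_point in zip(pose, previous)
            ]
        smoothed.append(current)
        previous = current
    return smoothed

# A single key point that jumps to x=10 for one frame and then returns.
noisy = [[(0.0, 0.0)], [(10.0, 0.0)], [(0.0, 0.0)]]
print(smooth_keypoints(noisy))  # [[(0.0, 0.0)], [(5.0, 0.0)], [(2.5, 0.0)]]
```

Smoothing trades responsiveness for stability, so the value of `alpha` would need tuning against the speed of the movements being tracked.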

Real-time 3D Human Pose Estimation

Whether we deal with a fitness app, a rehabilitation app, face masks, or surveillance, real-time processing is a key requirement. The performance of a model depends on the chosen algorithm and hardware, but the majority of existing open-source models either respond slowly or respond quickly at the expense of accuracy. So, is it possible to improve existing 3D human pose estimation models to achieve acceptable accuracy with real-time processing?

While models like BlazePose are able to provide real-time processing, their tracking accuracy is not suitable for commercial use or complex tasks. In our experiment, we tested the 2D component of BlazePose together with a modified 3D-pose-baseline model, using Python.

In terms of speed, our model achieves about 46 FPS on the above-mentioned hardware without video rendering, while the 2D pose detection model on its own produces key points at about 50 FPS. By comparison, the modified 3D baseline model can produce key points at about 780 FPS, so the 2D stage accounts for most of the processing time. Detailed information about the processing time spent in each part of our approach is presented below.
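The relationship between these figures can be checked with a short calculation. The sketch assumes the two stages run one after the other on each frame with no other overhead, which is a simplification; `combined_fps` is an illustrative helper.

```python
def combined_fps(*stage_fps: float) -> float:
    """Throughput of stages run sequentially: per-frame times add up."""
    return 1.0 / sum(1.0 / fps for fps in stage_fps)

time_2d = 1.0 / 50    # 0.02 s per frame for 2D key point detection
time_3d = 1.0 / 780   # about 0.0013 s per frame for the 3D baseline model

total = combined_fps(50, 780)
share_2d = time_2d / (time_2d + time_3d)

print(f"{total:.1f} FPS, 2D stage share: {share_2d * 100:.1f}%")  # 47.0 FPS, 2D stage share: 94.0%
```

The idealized result of about 47 FPS is close to the measured 46 FPS, and it shows that under this assumption roughly 94% of the per-frame time goes to 2D detection.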

*BlazePose 2D + 3D-pose-baseline performance in percent*

While this approach doesn’t guarantee reliability in complex scenarios with dim lighting or unusual poses, standard videos can be processed in real time. Generally, though, the accuracy of model predictions depends on the training and the chosen architecture. With the actual capabilities of human pose estimation in mind, we can look at some common business applications and general use cases for this technology.

Human Pose Estimation Use Cases

HPE can be considered a fairly mature technology, since groundwork already exists in application areas like fitness, rehabilitation, augmented reality, animation, gaming, robotics, and even surveillance. So now let’s talk about the existing use cases.

AI Fitness and Self-Coaching

Fitness applications and AI-driven coaches are some of the most obvious use cases for body pose estimation. A model built into a phone app can use the device camera as a sensor to record someone doing an exercise and analyze it.

By tracking the movement of the body, an exercise can be split into eccentric and concentric phases to analyze different angles of flexion and overall posture. This is done by tracking the key points and presenting the results as hints or graphical analysis. Processing can happen in real time or after some delay, giving the user analytics on the major movement patterns and body mechanics.
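A simple way to picture the phase split is to label each step of a joint-angle series by its direction of change. The sketch below assumes a squat, where a decreasing knee angle corresponds to the lowering (eccentric) phase and an increasing angle to the rising (concentric) phase; `label_phases` and the `tolerance` threshold are hypothetical.

```python
def label_phases(angles, tolerance=1.0):
    """Label each frame-to-frame transition of a knee-angle series (in degrees).

    Changes smaller than `tolerance` degrees are treated as a hold.
    """
    labels = []
    for previous, current in zip(angles, angles[1:]):
        delta = current - previous
        if delta < -tolerance:
            labels.append("eccentric")   # knee is flexing: lowering into the squat
        elif delta > tolerance:
            labels.append("concentric")  # knee is extending: standing back up
        else:
            labels.append("hold")
    return labels

knee_angles = [170, 140, 100, 100.5, 130, 170]
print(label_phases(knee_angles))
# ['eccentric', 'eccentric', 'hold', 'concentric', 'concentric']
```

For other exercises the mapping between angle direction and phase may be reversed, so the labels would need to be defined per exercise.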

Rehabilitation and Physiotherapy

Physiotherapy is another human activity tracking use case with similar rules of application. In the era of telemedicine, in-home consultations have become much more flexible and diverse, and AI technologies have enabled more complex forms of treatment to be delivered online.

The analysis of rehab activities applies the same concepts as fitness applications, but with stricter requirements for accuracy. Since we’re dealing with recovery from injury, these applications fall into the healthcare category. This means they have to meet the standards of the healthcare industry and the data protection laws of the country in question.

Augmented Reality

Augmented reality applications like virtual fitting rooms can benefit from human pose estimation as one of the most advanced methods of detecting and recognizing the position of a human body in space. This can be used in e-commerce, where shoppers cannot try clothes on before buying.

Human pose estimation can be applied to track key points on the human body and pass this data to the augmented reality engine that will fit clothes on the user. This can be applied to any body part and type of clothes, or even face masks. We’ve described our experience of using human pose estimation for virtual fitting rooms in a dedicated article.

Animation and Gaming

Game development is a tough industry with a lot of complex tasks that require knowledge of human body mechanics. Body pose estimation is widely used in the animation of game characters to simplify this process by transferring tracked key points in a certain position to the animated model.

The process resembles the motion tracking technology used in video production, but doesn’t require a large number of sensors placed on the model. Instead, we can use multiple cameras to detect the motion pattern and recognize it automatically. The captured data can then be transformed and transferred to the actual 3D model in the game engine.

Surveillance and Human Activity Analysis

Some surveillance cases don’t require spotting a crime in a crowd of people. Instead, cameras can be used to automate everyday processes like shopping at a grocery store.

Cashierless store systems like Amazon Go, for example, apply human pose estimation to understand whether a person took some item from a shelf. HPE is used in combination with other computer vision technologies, which allows Amazon to automate the process of checkout in its stores using a network of camera sensors, IoT devices, and

Human pose estimation is responsible for the part of the process where the actual area of contact with the product is not visible to the camera. So here, the HPE model analyzes the position of customers’ hands and heads to understand if they took the product from the shelf, or left it in place.

How to Train a Human Pose Estimation Model?

Human pose estimation is a machine learning technology, which means you’ll need data to train it. It handles the difficult task of detecting and recognizing multiple objects in the frame, with neural networks serving as the engine. Training a neural network requires enormous amounts of data, so the most practical approach is to start from available datasets such as the following:

HumanEva
COCO
MPII Human Pose
Human3.6M

The majority of these datasets are suitable for fitness and rehab applications with human pose estimation. But this doesn’t guarantee high accuracy for more unusual movements or for specific tasks like surveillance or multi-person pose estimation.
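To give a sense of what such training data looks like, the sketch below parses one person annotation in the COCO keypoint format, where 17 key points are stored as a flat list of `x, y, v` triplets and `v` is a visibility flag (0 for not labeled, 1 for labeled but not visible, 2 for labeled and visible). The function `parse_coco_keypoints` and the sample values are illustrative assumptions.

```python
# Key point order used by the COCO person keypoints annotations.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def parse_coco_keypoints(flat):
    """Convert a flat [x1, y1, v1, x2, y2, v2, ...] list into {name: (x, y, visible)}.

    Key points with v == 0 (not labeled) are skipped.
    """
    if len(flat) != 3 * len(COCO_KEYPOINTS):
        raise ValueError("expected 17 key points with 3 values each")
    parsed = {}
    for index, name in enumerate(COCO_KEYPOINTS):
        x, y, v = flat[3 * index: 3 * index + 3]
        if v > 0:
            parsed[name] = (x, y, v == 2)
    return parsed

# Made-up annotation: only the nose and the left shoulder are labeled.
sample = [0] * 51
sample[0:3] = [100, 50, 2]     # nose, visible
sample[15:18] = [80, 120, 1]   # left_shoulder, labeled but occluded
print(parse_coco_keypoints(sample))
```

Other datasets in the list use different joint sets and file layouts, so a loader like this would have to be written per dataset.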

For the rest of the cases, data collection is inevitable since a neural network will require quality samples to provide accurate object detection and tracking. Here, experienced data science and machine learning teams can be helpful, since they can provide consultancy on how to gather data, and handle the actual development of the model.

Artificial Intelligence
Augmented Reality
Fitness
Machine Learning

