THE SCIENCE BEHIND A HUMAN-CENTRIC COMPUTER VISION ENGINE™
Artificial Intelligence (AI) can help humans in countless ways, whether it is to improve performance, learn more skills, or live new experiences.
Computer vision, a major subfield of AI, involves translating images and video into a language that computers can understand. Specifically, human-centric computer vision is what enables computers to see, understand and respond to human images in the same way that people do. A human-centric computer vision engine works by:
- Acquiring images or videos from cameras for analysis
- Processing these images using deep learning to identify human motion, shape and intent
- Returning 3D human content that can be used in applications to deliver engaging, interactive experiences
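The three stages above can be sketched in Python. All function names here are hypothetical stand-ins: a real engine would wrap camera drivers and a trained deep-learning model rather than the placeholder arrays used below.

```python
import numpy as np

def acquire_frame(height=480, width=640):
    """Stand-in for camera capture: returns one RGB frame as an array."""
    return np.zeros((height, width, 3), dtype=np.uint8)

def estimate_pose(frame, num_joints=17):
    """Stand-in for the deep-learning stage: a real model would infer
    one (x, y, z) position per joint from the frame's pixels."""
    return np.zeros((num_joints, 3), dtype=np.float32)

def run_engine():
    frame = acquire_frame()          # 1. acquire images from a camera
    joints_3d = estimate_pose(frame) # 2. process with deep learning
    return joints_3d                 # 3. return 3D human content

pose = run_engine()
print(pose.shape)  # one (x, y, z) coordinate per tracked joint
```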
Computer vision uses deep learning models to emulate human sight. Deep learning is a type of machine learning that trains computers to perform human-like tasks, such as identifying humans in images, captioning images, or making predictions (for example, “the object in this image is 95% likely to be a person”).
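As a concrete illustration of such a prediction, classifiers typically turn a model's raw scores into class probabilities with a softmax; the class names and score values below are made up for illustration.

```python
import numpy as np

classes = ["person", "dog", "car"]
logits = np.array([4.2, 1.1, 0.3])  # hypothetical raw scores from a model

# Softmax: exponentiate and normalize so the scores sum to 1
probs = np.exp(logits) / np.exp(logits).sum()

best = int(np.argmax(probs))
print(f"{classes[best]}: {probs[best]:.0%}")  # highest-confidence class
```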
Deep learning uses a neural network architecture to process data. A neural network is designed to mimic the way the human brain analyzes and processes information. Neural networks are trained by feeding them thousands of pre-labeled images of a certain object, such as a person. During the training process, the network learns how to quickly make decisions. For example, it might start with “Is there a person in this image?”, “Where is the edge between the person and the surrounding background?”, “Is the person moving in a given scene?”, and so forth. During this process, the network is repeatedly refined to reach optimal performance and classify objects as accurately as possible.
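The refinement loop described above can be shown in miniature with a single-neuron “network” (logistic regression) trained on synthetic labeled data; real vision networks are far deeper, but they follow the same forward-pass / error / weight-update cycle. The features and labels below are synthetic stand-ins for pre-labeled images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "pre-labeled images": 2-feature vectors standing in for image
# features, labeled 1 (person present) or 0 (no person).
X = np.vstack([rng.normal(2.0, 1.0, (100, 2)),    # person-like samples
               rng.normal(-2.0, 1.0, (100, 2))])  # background-like samples
y = np.array([1] * 100 + [0] * 100)

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass: predictions
    grad_w = X.T @ (p - y) / len(y)         # error gradient w.r.t. weights
    grad_b = (p - y).mean()                 # error gradient w.r.t. bias
    w -= 0.5 * grad_w                       # refine the weights
    b -= 0.5 * grad_b

accuracy = (((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.0%}")
```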
Deep learning makes computer vision easier to work with by expanding what a camera and computer can accurately inspect and identify.
CAPTURING MOTION THROUGH HUMAN POSE ESTIMATION
Videos contain an enormous amount of information in the form of pixels, much of which is meaningless to a computer unless it can decode the data within the pixels. To do that, computers need to know which pixels go together and what they represent. The detection and tracking of pixels representing humans is known as human motion capture.
Human motion capture is the process of digitally recording the movements of a person. Human motion can be captured using any camera and input into a computer vision engine for processing. Input forms include red-green-blue (RGB) images, infrared (IR) images, or video feeds, which are treated as collections of images.
During processing, human pose estimation algorithms are used to estimate the configuration of a body (pose) in two dimensions (2D) or three dimensions (3D). 2D human pose estimation predicts the locations of body joints in the image plane, while 3D human pose estimation predicts the spatial arrangement of all the body joints as its final output.
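The relationship between the two outputs can be illustrated with a simple pinhole-camera projection: a 3D pose is a (J, 3) array of joint positions in space, and its 2D counterpart is the (J, 2) projection of those joints onto the image plane. The joint coordinates and camera parameters below are illustrative, not taken from any real engine.

```python
import numpy as np

# Three joints of a 3D pose: (X, Y, Z) positions in metres.
joints_3d = np.array([
    [0.0, 1.7, 3.0],   # head
    [0.0, 1.4, 3.0],   # neck
    [0.0, 0.9, 3.0],   # hip
])

focal = 500.0          # hypothetical focal length, in pixels
cx, cy = 320.0, 240.0  # hypothetical image center

# Pinhole projection: u = f * X / Z + cx,  v = f * Y / Z + cy
joints_2d = np.stack([
    focal * joints_3d[:, 0] / joints_3d[:, 2] + cx,
    focal * joints_3d[:, 1] / joints_3d[:, 2] + cy,
], axis=1)

print(joints_3d.shape, joints_2d.shape)  # (3, 3) (3, 2)
```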
Human pose estimation leverages joint skeletal tracking, which tracks a human body in a video by creating a virtual skeleton overlay. The skeleton consists of several skeletal joints and segments representing the body parts and limbs. The number of tracked joints can vary with the pixel resolution, which in turn depends on how far the individual is from the camera.
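One way to represent such a virtual skeleton is as a list of named joints plus the segments (joint pairs) that connect them. The 8-joint subset below is illustrative only; joint sets differ between engines and are usually larger.

```python
# Named joints of a minimal virtual-skeleton overlay.
JOINTS = ["head", "neck", "l_shoulder", "r_shoulder",
          "hip", "l_knee", "r_knee", "l_ankle"]

# Each segment links two joints, representing a body part or limb.
SEGMENTS = [
    ("head", "neck"),
    ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("neck", "hip"),
    ("hip", "l_knee"), ("hip", "r_knee"),
    ("l_knee", "l_ankle"),
]

# Convert named pairs to index pairs, as pose data is usually stored
# as an array with one row per joint.
index = {name: i for i, name in enumerate(JOINTS)}
edges = [(index[a], index[b]) for a, b in SEGMENTS]
print(len(JOINTS), "joints,", len(edges), "segments")
```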
The timeline of skeleton point and segment coordinates forms the digitized human motion data from which movement paths and trajectories can be estimated. For instance, joint-angle data can be used to infer the rotation of the hip, knee, and ankle joints. Together, these skeletal joint data allow us to analyze poses and movements and reconstruct human behaviors.
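As a sketch of this kind of inference, the flexion angle at a joint can be computed directly from three skeleton point coordinates; the hip, knee, and ankle positions below are hypothetical values for a single frame.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) between segments b->a and b->c."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against floating-point values just outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical single-frame coordinates (metres) for three joints:
hip, knee, ankle = [0.0, 1.0, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.0]
print(f"knee flexion: {joint_angle(hip, knee, ankle):.1f} degrees")
```

Tracking this angle across the timeline of frames yields the flexion trajectory used in gait and sports-performance analysis.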
A main advantage of human motion capture is that large amounts of human motion data can be processed within a few milliseconds. This enables applications to perform in real-time to support scenarios such as analyzing sports performance, encouraging physical exercise, improving posture and gait, reducing workplace injuries, and more. Without computer vision tools, motion capture would be limited to specialized studios with multiple cameras and actors wearing sensors.