The pursuit of self-improvement can seem never-ending, whether the goal is to perform better, learn more skills, or have new experiences. AI can help make possible what humans cannot do on their own. Computer vision, a major subfield of AI, involves translating real-world pixels into a representation that computers can understand and work with. Specifically, human-centric computer vision develops algorithmic tools that allow machines to detect and track humans. Human motion data can then be used to generate quantifiable insight into human behavior and to create interactive applications within real, augmented, or virtual worlds. In other words, human-centric computer vision provides the tools needed for machines to work and interact with humans.



Fundamental to human-centric computer vision is the ability of machines to see and understand human motion. Videos contain an enormous amount of information in the form of pixels, much of which is meaningless to a computer unless it can decode the data within the pixels. To do that, computers need to know which pixels go together and what they represent. The detection and tracking of pixels representing humans is known as human motion capture.

Human motion capture digitizes human motion, allowing machines to track or reconstruct human behavior. Its main advantage is that large amounts of human motion data can be processed within a few milliseconds. This enables applications that run in real time, such as movement analysis for sports and automation involving human-machine interactions. Motion capture is also increasingly used in health research and kinesiology to help people improve posture, gait, and other movements.

Motion capture is performed via joint skeletal tracking, which tracks humans in a video by creating a virtual skeleton overlay. The skeleton consists of several skeletal joints and segments, representing the body parts and limbs. The number of skeletal joints that can be tracked varies with the pixel resolution, which in turn depends on how far an individual is from the camera. The timeline of skeleton point and segment coordinates forms the digitized human motion data from which movement paths and trajectories can be estimated. For instance, joint angle data can be used to infer the rotation of the hip, knee, and ankle joints. Together, these skeletal joint data allow us to analyze pose and movements and reconstruct human behaviors.
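As a rough illustration of how joint angles can be derived from skeleton coordinates, the sketch below computes the angle at one joint from three 2D keypoints. The function name `joint_angle` and the sample coordinates are made up for this example; a real pipeline would take its keypoints from a pose estimation model and would often work in 3D.

```python
import math

def joint_angle(a, b, c):
    """Return the angle in degrees at joint b, formed by segments b->a and b->c.

    a, b, c are (x, y) keypoints, e.g. hip, knee, ankle (hypothetical inputs).
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("keypoints must not coincide with the joint")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / norm))
    return math.degrees(math.acos(cos_angle))

# Illustrative coordinates only: a leg bent at a right angle, then a straight leg.
bent_knee = joint_angle((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
straight_knee = joint_angle((0.0, 0.0), (0.0, 1.0), (0.0, 2.0))
```

Evaluating this per frame yields a time series of angles for each joint, which is one simple way to turn a skeleton timeline into movement data.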



Computer understanding of the real world is a complex task, made harder by the unpredictability of humans and the environments they are in. Visual human shape capture allows applications to locate and distinguish every individual in a video. It can be accomplished using human instance segmentation, a set of machine learning algorithms optimized for the detection and delimitation of human shapes. Human shape capture improves upon the simple method of segmenting with bounding boxes, in which an item is delimited by a rectangle. Instead, it uses a human pose skeleton, which is more precise and better suited to the segmentation of human shapes. Using a pose skeleton also makes the method more robust to occlusions, which occur when an individual is only partly visible, whether standing in a crowd or behind an object. Human instance segmentation also makes it possible to automatically count the individuals in a scene and to determine the size of large crowds with a high level of accuracy.
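To make the contrast between bounding boxes and instance masks concrete, here is a small sketch that assumes a segmentation model has already produced a label mask in which 0 is background and each person has a distinct integer ID. The mask values and the function names are invented for illustration; they do not come from any particular library.

```python
def count_instances(label_mask):
    """Count distinct non-zero instance IDs in a 2D label mask."""
    return len({v for row in label_mask for v in row if v != 0})

def bounding_box(label_mask, instance_id):
    """Return (row_min, col_min, row_max, col_max) for one instance, or None."""
    coords = [(r, c) for r, row in enumerate(label_mask)
              for c, v in enumerate(row) if v == instance_id]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)

def mask_fill_ratio(label_mask, instance_id):
    """Fraction of the bounding box that the instance mask actually covers."""
    box = bounding_box(label_mask, instance_id)
    if box is None:
        raise ValueError("instance not present in mask")
    r0, c0, r1, c1 = box
    box_area = (r1 - r0 + 1) * (c1 - c0 + 1)
    mask_area = sum(v == instance_id for row in label_mask for v in row)
    return mask_area / box_area

# Toy 4x6 label mask with two people (IDs 1 and 2); values are illustrative.
mask = [
    [0, 1, 0, 0, 2, 2],
    [1, 1, 1, 0, 2, 2],
    [0, 1, 0, 0, 0, 2],
    [0, 1, 0, 0, 0, 0],
]
people = count_instances(mask)
ratio_person_1 = mask_fill_ratio(mask, 1)
```

A fill ratio well below 1 indicates how much of a bounding box is not actually the person, which is the imprecision that shape-level segmentation avoids.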

Human segmentation is often performed in combination with background segmentation, which speeds up the process by separating human actors from the background environment. Background segmentation is especially useful in outdoor environments, where conditions can be stormy, foggy, or snowy. It allows for markerless motion capture almost anywhere, opening opportunities for unhindered motion capture of any activity and for any purpose. Without computer vision tools, motion capture would be limited to specialized studios with multiple cameras and actors wearing sensors.
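One of the simplest forms of background separation is frame differencing against a reference image of the empty scene. The sketch below assumes a static camera, grayscale frames stored as nested lists of intensities, and a hand-picked `threshold`; all of these are assumptions for illustration, and production systems generally rely on learned or adaptive background models instead.

```python
def foreground_mask(frame, background, threshold=25):
    """Mark pixels whose intensity differs from the background by more than threshold.

    frame and background are 2D lists of grayscale values with the same shape.
    """
    if len(frame) != len(background) or any(
        len(f_row) != len(b_row) for f_row, b_row in zip(frame, background)
    ):
        raise ValueError("frame and background must have the same shape")
    return [
        [abs(p - b) > threshold for p, b in zip(f_row, b_row)]
        for f_row, b_row in zip(frame, background)
    ]

# Illustrative 2x3 images: the middle column changes when a person enters.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [10, 180, 9]]
fg = foreground_mask(frame, background)
foreground_pixels = sum(sum(row) for row in fg)
```

Only the pixels flagged as foreground then need to be passed on to the more expensive human segmentation step.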

The main use of human shape capture is for the recording and reconstruction of highly detailed 3D human shapes that are in motion. This is done using mesh data, typically polygons of varying sizes, that can represent the human body in 3D and estimate its volume. Computer vision has progressed such that shape capture can generate detailed 3D human silhouettes of individuals wearing loose clothing, and of athletes performing at high speed. The creation of realistic human silhouettes is essential for applications like “try-on”, in which one can virtually try on clothes before purchasing. It is also used to enhance immersive reality experiences by reproducing the user’s silhouette within a virtual environment. And of course, motion and shape capture data are widely used by the entertainment industry to create virtual characters that move realistically, saving production time and effort.
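As an illustration of how a volume can be estimated from mesh data, the following sketch sums signed tetrahedron volumes over the triangles of a mesh. It assumes a closed (watertight) triangle mesh with consistently oriented faces; the tetrahedron used here is a stand-in for a body mesh, and the function name is hypothetical.

```python
def mesh_volume(vertices, faces):
    """Volume enclosed by a closed, consistently oriented triangle mesh.

    vertices: list of (x, y, z); faces: list of (i, j, k) vertex indices.
    Each triangle forms a tetrahedron with the origin; signed volumes are summed.
    """
    total = 0.0
    for i, j, k in faces:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # Scalar triple product a . (b x c) equals six times the signed volume.
        total += (
            ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx)
        )
    return abs(total) / 6.0

# Stand-in mesh: a tetrahedron with unit-length edges along the axes.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
volume = mesh_volume(vertices, faces)
```

The same calculation applies to a mesh with thousands of polygons, provided it has no holes and its faces are wound consistently.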



Intent capture involves recognizing gestures and activity from video and predicting the intention of the human actor. Although both types of action are considered purposeful, a gesture is a localized biomechanical movement, whereas an activity typically involves a full-body motion. Recognizing which actions are being performed within a context is key to determining the probable intentions of the individuals involved. For example, grasping a glass can lead to pouring, drinking, or simply passing it on.

Gesture recognition and the prediction of human intent are critical to the development of advanced human-computer interactions. It is this ability that allows computers to predict humans' future actions or movements. Imagine a robot in a room filled with people: without the ability to evaluate what the humans are doing and anticipate where they will go, the robot cannot initiate any action, as it would not know how to stay out of the way.

Activity recognition involves tracking an individual over time as they perform a series of actions. The machine learning model compares the ongoing action to the set of actions that it was trained on, allowing it to not only recognize the actions but also assess movement deviations by comparing against the average trajectory. For example, patients may be tracked during health rehabilitation to provide feedback on posture and progress.
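The comparison against an average trajectory can be sketched as follows. This example assumes that all trajectories are 2D paths of a single joint, already time-aligned and of equal length; real recordings would usually need temporal alignment first. The function names and sample values are illustrative only.

```python
import math

def mean_trajectory(trajectories):
    """Point-wise average of several equal-length 2D trajectories."""
    n = len(trajectories[0])
    if any(len(t) != n for t in trajectories):
        raise ValueError("trajectories must have the same length")
    count = len(trajectories)
    return [
        tuple(sum(t[i][d] for t in trajectories) / count for d in range(2))
        for i in range(n)
    ]

def deviations(observed, reference):
    """Per-frame Euclidean distance between an observed and a reference path."""
    if len(observed) != len(reference):
        raise ValueError("paths must have the same length")
    return [math.dist(p, q) for p, q in zip(observed, reference)]

# Illustrative data: two reference recordings and one observed movement.
reference_set = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 2), (1, 2), (2, 2)],
]
average = mean_trajectory(reference_set)
observed = [(0, 1), (1, 1), (2, 4)]
per_frame = deviations(observed, average)
rms = math.sqrt(sum(d * d for d in per_frame) / len(per_frame))
```

The per-frame values show where in the movement the deviation occurs, while the RMS gives a single summary score that could be tracked across rehabilitation sessions.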

Accurately interpreting the complex behaviors of humans allows machines to interact with them successfully by adapting appropriately to their actions, whether the goal is to improve automatic broadcasts or to create augmented reality experiences in which humans can interact with the environment.

Computer vision is opening opportunities for machines and humans to work together and reach new levels. It allows machines to see and understand humans, enabling meaningful human-machine interactions. Future advances in AI-based computer vision will make such interactions even more useful, helping humans continue to improve.
