Artificial Intelligence (AI) can help humans in countless ways, whether by improving performance, teaching new skills, or enabling new experiences.

Computer vision, a major subfield of AI, involves translating images and video into a language that computers can understand. Specifically, human-centric computer vision is what enables computers to see, understand and respond to human images in the same way that people do. A human-centric computer vision engine works by: 

  • Acquiring images or videos from cameras for analysis 
  • Processing these images using deep learning to identify human motion, shape and action 
  • Returning 3D human content that can be used in applications to deliver engaging, interactive experiences.
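In application code, the three steps above might be organized as a simple pipeline. The sketch below is illustrative only; `Frame`, `Skeleton3D`, and the function names are hypothetical, not part of any particular engine's API:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative placeholder for a decoded camera image.
Frame = bytes

@dataclass
class Skeleton3D:
    joints: List[Tuple[float, float, float]]  # one (x, y, z) per tracked joint

def acquire(camera) -> Frame:
    """Step 1: grab an image or video frame from a camera."""
    return camera.read()

def process(frame: Frame) -> List[Skeleton3D]:
    """Step 2: deep-learning inference to find human motion, shape, and action."""
    raise NotImplementedError("stands in for the neural-network models")

def render(skeletons: List[Skeleton3D]) -> None:
    """Step 3: hand 3D human content to the application for interactive use."""
    for skeleton in skeletons:
        pass  # animate an avatar, update analytics, etc.
```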

Computer vision uses deep learning models to emulate human sight. Deep learning is a type of machine learning that trains computers to perform human-like tasks, such as identifying humans in images, generating image captions, or making predictions (for example, that the object in an image is 95% likely to be classified as a person).
Deep learning uses a neural network architecture to process data. A neural network is designed to mimic the way the human brain analyzes and processes information. Neural networks are trained by feeding them thousands of pre-labeled images of a certain object, such as a person. During the training process, the network learns to make decisions quickly. For example, it might start with "Is there a person in this image?", then "Where is the edge between the person and the surrounding background?", "Is the person moving in a given scene?", and so forth. During this process, the network is repeatedly refined to reach optimal performance and classify objects as accurately as possible.
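The repeated refinement described above can be illustrated with a vastly simplified stand-in: a one-feature logistic classifier trained by gradient descent to answer "is there a person in this image?". Real networks have millions of parameters, but the refinement loop has the same shape. The feature and data here are toy values, not from any real system:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=500, lr=0.5):
    """Refine a single weight and bias over repeated passes through the
    labeled examples, as a stand-in for full network training."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(w * x + b)   # current prediction
            grad = p - y             # error signal drives the refinement
            w -= lr * grad * x
            b -= lr * grad
    return w, b

# Toy feature: fraction of "person-like" pixels in the image.
xs = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]
ys = [1, 1, 1, 0, 0, 0]           # 1 = person present, 0 = no person
w, b = train(xs, ys)
confidence = sigmoid(w * 0.88 + b)  # e.g. "likely to be a person"
```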
Deep learning makes computer vision easier to work with by expanding what a camera and computer can accurately inspect and identify. 



Videos contain an enormous amount of information in the form of pixels, much of which is meaningless to a computer unless it can decode the data within the pixels. To do that, computers need to know which pixels go together and what they represent. The detection and tracking of pixels representing humans is known as human motion capture. 

Human motion capture is the process of digitally recording the movements of a person. Human motion can be captured using any camera and input into a computer vision engine for processing. Input forms include red-green-blue (RGB) images, infrared (IR) images, or video feeds, which are treated as a collection of images.

During processing, human pose estimation algorithms are used to estimate the configuration of a body (pose) in two dimensions (2D) or three dimensions (3D). 2D human pose estimation predicts the locations of body joints in an image, while 3D human pose estimation predicts the spatial arrangement of all the body joints as its final output.
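In code, the distinction comes down to the output representation: 2D estimation yields a pixel coordinate per joint, while 3D estimation adds a spatial (depth) component. A minimal sketch, with illustrative joint names and a hypothetical lifting step:

```python
from typing import Dict, Tuple

# Joint names are illustrative; real models define their own keypoint sets.
Pose2D = Dict[str, Tuple[float, float]]         # joint -> (x, y) pixel location
Pose3D = Dict[str, Tuple[float, float, float]]  # joint -> (x, y, z) in space

def lift_to_3d(pose: Pose2D, depth: Dict[str, float]) -> Pose3D:
    """Sketch of the 2D-to-3D step: pair each predicted image location
    with an estimated per-joint depth to get a spatial arrangement."""
    return {joint: (x, y, depth[joint]) for joint, (x, y) in pose.items()}
```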

Human pose estimation leverages joint skeletal tracking, which tracks a human body in a video by creating a virtual skeleton overlay. The skeleton consists of several skeletal joints and segments, representing the body parts and limbs. The number of skeletal joints that can be tracked varies with the pixel resolution, which in turn depends on how far an individual is from the camera.

The timeline of skeleton point and segment coordinates forms the digitized human motion data from which movement paths and trajectories can be estimated. For instance, joint angle data can be used to infer the rotation of the hip, knee, and ankle joints. Together, these skeletal joint data allow us to analyze poses and movements and reconstruct human behaviors.
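For example, a joint rotation such as the knee angle can be computed directly from three skeleton point coordinates, as in this minimal sketch:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c,
    e.g. the knee angle from hip (a), knee (b), and ankle (c) coordinates."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return math.degrees(math.acos(dot / (n1 * n2)))

# A fully extended leg: hip directly above knee, knee above ankle,
# giving a knee angle of about 180 degrees.
hip, knee, ankle = (0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)
knee_angle = joint_angle(hip, knee, ankle)
```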

A main advantage of human motion capture is that large amounts of human motion data can be processed within a few milliseconds. This enables applications to perform in real-time to support scenarios such as analyzing sports performance, encouraging physical exercise, improving posture and gait, reducing workplace injuries, and more. Without computer vision tools, motion capture would be limited to specialized studios with multiple cameras and actors wearing sensors.


Monocular Motion Capture

The wrnch Engine is a monocular motion capture (MMC) system that estimates 3D poses from images of any person in view of any camera or video stream. It supports markerless motion capture, ensuring that the natural motion of the human body is captured and analyzed without requiring a person to wear cumbersome body sensors.
Designed using bespoke recurrent and convolutional neural networks trained on proprietary datasets, our monocular motion capture system tracks skeletal joints to estimate human poses and infers 3D human motion from 2D video frames in real-time.

Tracks key skeletal joints:
  • 25 body points
  • 21 hand points
  • 20 face points

Infers 3D human motion:
  • 30 body joints
  • 20 hand joints


The output of the neural network processing is a stream of 3D skeletons that can be used to animate 3D characters or analyze human poses and movement. Human digitization that previously took weeks of work by specialized teams in rented spaces using expensive equipment can now be done by any person with an RGB camera in real-time. 

Watch it in action


Understanding the real world is a complex task for a computer, and it is made harder by the unpredictability of humans and the environment they are in.

Human shape capture allows applications to locate and distinguish every individual in a video. It can be accomplished using human instance segmentation, a set of machine learning algorithms optimized for the detection and delimitation of human shapes. Some algorithms create rectangular bounding boxes to delineate each person in an image. In addition, a pose skeleton overlay can be used to better distinguish human instances under heavy occlusion, whether an individual is only partly visible, behind an object, or standing in a crowd. Finally, human instance segmentation allows for automatically counting the number of individuals in a scene and can determine the size of large crowds with a high level of accuracy.
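Counting individuals then reduces to counting confident per-person detections. A minimal sketch, with an illustrative bounding-box format and confidence threshold (not any particular engine's output format):

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def count_people(detections: List[Tuple[Box, float]],
                 min_conf: float = 0.5) -> int:
    """Count human instances from per-person bounding boxes, keeping
    only detections above a confidence threshold."""
    return sum(1 for _box, conf in detections if conf >= min_conf)

# Two confident person detections and one low-confidence false alarm.
detections = [
    ((0.0, 0.0, 50.0, 100.0), 0.97),
    ((60.0, 5.0, 110.0, 100.0), 0.91),
    ((200.0, 0.0, 220.0, 40.0), 0.30),
]
```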

Human instance segmentation is often performed in combination with background segmentation, which speeds up the process by subtracting the background environment from human actors. Background segmentation becomes quite useful in outdoor environments that can be stormy, foggy, or snowy. It allows for markerless motion capture everywhere, opening opportunities for unhindered motion capture of any activities and for all purposes.
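A minimal form of background segmentation is frame differencing: compare each pixel against a background model and keep only the pixels that changed. The sketch below works on small grayscale grids as nested lists; production systems use far more robust statistical models:

```python
def subtract_background(frame, background, threshold=30):
    """Per-pixel background subtraction on grayscale frames: pixels that
    differ from the background model by more than `threshold` are marked
    as foreground (1), everything else as background (0)."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

background = [[10, 10, 10],
              [10, 10, 10]]
frame      = [[10, 200, 10],   # a bright "person" column appears
              [10, 180, 10]]
mask = subtract_background(frame, background)
```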

The main use of human shape capture is for the recording and reconstruction of highly detailed 3D human shapes that are in motion. This is done using mesh data, typically polygons of varying sizes, that can represent the human body in 3D and estimate its volume.
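Given a closed triangle mesh of the body, its volume can be estimated by summing signed tetrahedron volumes, a standard computational-geometry technique (shown here generically, not as any vendor's specific method):

```python
def mesh_volume(vertices, triangles):
    """Estimate the volume enclosed by a closed triangle mesh by summing
    signed tetrahedron volumes (a discrete form of the divergence theorem)."""
    total = 0.0
    for i, j, k in triangles:
        (ax, ay, az) = vertices[i]
        (bx, by, bz) = vertices[j]
        (cx, cy, cz) = vertices[k]
        # signed volume of the tetrahedron (origin, a, b, c)
        total += (ax * (by * cz - bz * cy)
                  - ay * (bx * cz - bz * cx)
                  + az * (bx * cy - by * cx)) / 6.0
    return abs(total)

# A tetrahedron with unit-length legs encloses a volume of 1/6.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
volume = mesh_volume(verts, faces)
```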

Computer vision has progressed such that shape capture can generate detailed 3D human silhouettes of individuals wearing loose clothing, and of athletes performing at high speed. The creation of realistic human silhouettes is essential for applications like “try-on”, in which one can virtually try on clothes before purchasing. It is also used to enhance immersive reality experiences by reproducing the user’s silhouette within a virtual environment. And of course, motion and shape capture data are widely used by the entertainment industry to create virtual characters that move realistically, saving production time and effort.


Monocular Volumetric Capture

wrnch is developing a way to perform Monocular Volumetric Capture (MVC). MVC infers the 3D shape and appearance of persons in an image. We use deep neural networks to generate a rigged 3D mesh and surface textures from a single front-facing view. Our specialized fully-convolutional system is trained on a diverse dataset of body-scanned persons. The resulting 3D model captures proportions, clothing, and facial details, facilitating its use as a 3D avatar in virtual spaces or for analysis of body shape.

By allowing people to generate a textured 3D mesh from a single image, wrnch enables businesses to incorporate realistic 3D humans in digital worlds – where people will increasingly work, play and socialize.

While monocular volumetric capture (MVC) is still in development, you can watch a preview below. Stay connected for further information.

Watch it in action


Human action capture involves recognizing gestures and activity from video and predicting the intention of the human actor. Although both types of action are considered purposeful, a gesture is a localized biomechanical movement, whereas an activity typically involves full-body motion. Recognizing which actions are being performed within a context is key to determining the probable intentions of the individuals involved. For example, grasping a glass can lead to pouring, drinking, or simply passing it on.

Gesture recognition and the prediction of human intent is critical to the development of advanced human-computer interactions. It is this ability that allows computers to predict humans’ future actions or movement. Imagine the case of a robot situated in a room filled with people: without the ability to evaluate what humans are doing and anticipate where they will go, the robot cannot initiate any actions as it wouldn’t know how to stay out of the way.

Activity recognition involves tracking an individual over time as they perform a series of actions. The machine learning model compares the ongoing action to the set of actions that it was trained on, allowing it to not only recognize the actions but also assess movement deviations by comparing against the average trajectory. For example, patients may be tracked during health rehabilitation to provide feedback on posture and progress.
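Assessing movement deviation can be as simple as measuring the average distance between the observed joint trajectory and the reference (average) trajectory. A minimal sketch with illustrative 2D joint positions:

```python
import math

def mean_deviation(trajectory, reference):
    """Average point-wise distance between an observed joint trajectory and
    a reference trajectory. Both are equal-length lists of (x, y) joint
    positions over time, e.g. a patient's knee during a rehab exercise."""
    dists = [math.dist(p, q) for p, q in zip(trajectory, reference)]
    return sum(dists) / len(dists)

reference = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
observed  = [(0.3, 0.0), (0.3, 1.0), (0.3, 2.0)]  # consistently 0.3 off to the side
deviation = mean_deviation(observed, reference)
```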

Accurately interpreting the complex behaviors of humans allows machines to successfully interact with humans by adapting appropriately to their actions, whether that means improving automatic broadcasts or creating augmented reality experiences in which humans can interact with the environment.

Computer vision is opening opportunities for machines and humans to work together and reach new levels. It allows machines to see and understand humans, enabling meaningful human-machine interactions. The future of AI computer vision will make such interactions even more useful, helping humans continue to improve.


Monocular Action Capture

Monocular Action Capture (MAC) identifies activities and gestures of persons on camera. Our machine learning-based system recognizes poses and motions that signify particular actions or situations and streams out the data in real-time. The underlying temporal neural networks are trained on large task-specific datasets. Applications of this technology include gesture recognition and detection of falling persons.
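As a toy illustration of fall detection (a deliberately simplified heuristic, not wrnch's actual method), one can watch for a sudden drop in head height across a few frames of the skeleton stream:

```python
def detect_fall(head_heights, drop_ratio=0.5, window=3):
    """Toy heuristic: flag a fall when the head height drops below
    `drop_ratio` of its level `window` frames earlier. `head_heights`
    is the head's height above the floor (metres), one value per frame."""
    for t in range(window, len(head_heights)):
        before = head_heights[t - window]
        now = head_heights[t]
        if before > 0 and now < drop_ratio * before:
            return t  # frame index where the fall is detected
    return None

standing = [1.7, 1.7, 1.7, 1.7, 1.7]
falling  = [1.7, 1.7, 1.6, 0.9, 0.4]
```

A production system would instead feed the full pose sequence to a temporal neural network, as described above; the heuristic only conveys the idea of recognizing a situation from motion over time.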

While monocular action capture (MAC) is in early research and development, you can watch a preview below. Stay connected for further developments.  

Watch it in action


Discover human-centric computer vision with our reference application, wrnch CaptureStream. Available on the App Store for iOS devices.

Download Now