
Method And System For Motion Capturing And Retargeting Thereof

Abstract: This disclosure relates generally to a method and system for motion capturing and retargeting. Conventional methods for motion transfer use various supervised techniques and do not incorporate motion information, which is necessary for motion transfer. The present disclosure performs motion transfer from an incomplete, single-view depth video to a semantically similar target mesh. The method includes an unsupervised skeleton extraction that incorporates geometric information of a source object in the depth video, which helps in improved motion reconstruction and motion transfer. A motion skeleton is constructed based on rigidity constraints from the motion and structural information from a curve skeleton. The disclosed method helps in animating a virtual character based on a real performer by transferring the motion captured from the real performer to a polygon mesh of the virtual character. [To be published with FIG. 2]


Patent Information

Application #
202321036268
Filing Date
25 May 2023
Publication Number
48/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th floor, Nariman point, Mumbai 400021, Maharashtra, India

Inventors

1. MAHESHWARI, Shubh
Tata Consultancy Services Limited, 4 & 5th floor, PTI Building, No 4, Sansad Marg, Delhi 110001, Delhi, India
2. HEBBALAGUPPE, Ramya
Tata Consultancy Services Limited, 4 & 5th floor, PTI Building, No 4, Sansad Marg, Delhi 110001, Delhi, India
3. NARAIN, Rahul
IIT Delhi, Department of Computer Science and Engineering, Hauz Khas, New Delhi 110016, Delhi, India

Specification

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:

METHOD AND SYSTEM FOR MOTION CAPTURING AND RETARGETING THEREOF

Applicant

Tata Consultancy Services Limited
A company incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India

Preamble to the description:

The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202321036268, filed on May 25, 2023. The entire contents of the aforementioned application are incorporated herein by reference.

TECHNICAL FIELD
The disclosure herein generally relates to computer vision, and, more particularly, to a method and system for motion capturing and retargeting.

BACKGROUND
The growing demand for immersive animated content arises primarily from a wide range of applications such as entertainment, the metaverse, education, and augmented/virtual reality. Production of animation content requires modeling the geometry of characters of interest, rigging a skeleton to determine each character's degrees of freedom, and then animating their motion. The latter step is often performed by transferring motion captured from a real performer, whether a human actor, an animal, or even a puppet. The intermediate rigging step is tedious and labor-intensive, while motion capture requires an expensive multi-camera system; both factors hinder the large-scale adoption of 3-dimensional content creation.
Animation transfer is the process of transferring the motion captured from a real performer to an articulated character modelled as a polygon mesh. The source is a point cloud obtained from a single-view commodity sensor such as Kinect/Realsense, and the target is a user-defined polygon mesh without a predefined skeletal rig. Performing automatic animation transfer requires reconstructing the source motion from the point cloud and finding a correspondence between points on the source and target shapes before the source performance can be remapped to the target character. Furthermore, many applications often require animation content not just of humans but of other kinds of characters and objects such as animals, birds, and articulated mechanical objects, precluding the use of predefined human body models.
Monocular motion systems generally require a template before registering the incoming video. Templates are created by 360-degree static registration of the source object or by creating a parametric model from a large corpus of meshes. Unfortunately, these methods do not generalize to unknown objects. As a parallel approach, methods that rely on neural implicit representations have a comparatively high computation cost even when working on a single video, making them unsuitable for frugal motion transfer. Dynamic reconstruction based methods provide better results, but tracking errors can cause structural artifacts, resulting in issues with shape correspondence or skeleton extraction.
Without using a predefined template, establishing correspondence between parts of the partial source and the complete target objects becomes challenging. Some prior automatic correspondence methods do not provide accurate results when the input is sparse, noisy, and incomplete. A family of approaches based on surface-level correspondence, such as functional maps, incorporates geometry; however, these approaches are data intensive and do not explicitly account for the underlying shape topologies, which are critical to matching generic shapes.
A few recently developed deep learning approaches automatically rig character meshes and capture the motion of performing characters from single-view point cloud streams. However, these approaches require supervision at all stages. Furthermore, another deep learning approach introduced a skeleton-free technique to transfer poses between 3D characters; however, the target object is restricted to bipedal characters in T-pose, and the mocap data is limited to human motion.

SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for motion capturing and retargeting is provided. The method includes receiving (i) a monocular depth video of a source object captured by a single view depth camera, the monocular depth video comprising a plurality of depth image frames, and (ii) a target mesh associated with a target object. Further, the method includes extracting, by using a mesh extraction technique, a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames. Further, a motion trajectory corresponding to the source object is obtained by aligning the source mesh to the plurality of depth image frames using a non-rigid reconstruction technique. Further, the method includes applying a curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton. Then the method obtains a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory, and (iii) the set of joint coordinates based on a joint motion extraction technique. Further, the method includes embedding the motion skeleton to the target mesh to obtain an embedded target skeleton; and then retargeting the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh.
In another aspect, a system for motion capturing and retargeting is provided. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive (i) a monocular depth video of a source object captured by a single view depth camera, the monocular depth video comprising a plurality of depth image frames, and (ii) a target mesh associated with a target object. The one or more hardware processors are further configured to extract, by using a mesh extraction technique, a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames. Further, a motion trajectory corresponding to the source object is obtained by aligning the source mesh to the plurality of depth image frames using a non-rigid reconstruction technique. The one or more hardware processors are further configured to apply a curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton; obtain a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory, and (iii) the set of joint coordinates based on a joint motion extraction technique; embed the motion skeleton to the target mesh to obtain an embedded target skeleton; and then retarget the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh.
The source depth image frame is the first depth image frame among the plurality of depth image frames. The set of joint coordinates are obtained by initially applying a curve skeletonization technique on the source mesh to obtain a set of dense clusters. Further the set of joint coordinates are obtained from the set of dense clusters using a joint position extraction method.
The joint position extraction method is represented as
$$j_i^* = \arg\min_{j_i \in \mathbb{R}^3} \sum_{v \in S_i} \left( \lambda_{cpp} \left\|(j_i - v) \times n(v)\right\|_2^2 + \lambda_{eucl} \left\|j_i - v\right\|_2^2 \right) + \lambda_{smooth} \sum_{k \in N(i)} \left\|j_i - j_k\right\|_2^2$$
where $j_i$ is the joint coordinate, $S_i$ is the dense cluster associated with $j_i$, $v$ represents the coordinates of a vertex of the set of vertices of the dense clusters, $n(v)$ represents the vertex normal of that vertex, $\lambda_{cpp}$, $\lambda_{eucl}$ and $\lambda_{smooth}$ are hyperparameters, and $N(i)$ is the set of neighbors of a joint coordinate $i$.
The curve skeleton is obtained from the set of dense clusters using a local separator method. The motion skeleton is obtained by applying the curve splitting technique, which comprises recursively splitting, via the one or more hardware processors, each curve of the one or more curves in the curve skeleton into one or more bone segments having a set of joints associated with the motion skeleton, until a motion reconstruction cost is less than a predefined threshold. The step of obtaining the motion associated with the motion skeleton is preceded by reconstructing the motion by applying a skinning technique on at least one of (i) the motion trajectory and (ii) the motion skeleton.
In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to perform motion capturing and retargeting by receiving (i) a monocular depth video of a source object captured by a single view depth camera, the monocular depth video comprising a plurality of depth image frames, and (ii) a target mesh associated with a target object. Further, the computer readable program causes extracting, by using a mesh extraction technique, a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames. Further, a motion trajectory corresponding to the source object is obtained by aligning the source mesh to the plurality of depth image frames using a non-rigid reconstruction technique. Further, the computer readable program causes applying a curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton; obtaining a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory, and (iii) the set of joint coordinates based on a joint motion extraction technique; embedding the motion skeleton to the target mesh to obtain an embedded target skeleton; and then retargeting the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system for motion capturing and retargeting according to some embodiments of the present disclosure.
FIG. 2 illustrates a schematic diagram depicting broad-level process flow for motion capturing and retargeting according to some embodiments of the present disclosure.
FIG. 3A and FIG. 3B illustrate an exemplary flow diagram for a method for motion capturing and retargeting in accordance with some embodiments of the present disclosure.
FIG. 4 illustrates motion transfer results using the disclosed method on a dataset according to some embodiments of the present disclosure.
FIG. 5 illustrates a comparison of the disclosed method with various prior methods in accordance with some embodiments of the present disclosure.
FIG. 6 shows motion transfer results from real captured RGB Depth (RGBD) videos in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
3D skeleton extraction is an effective technique for motion transfer, shape correspondence, retrieval, and recognition. Traditionally, morphological operations like thinning/erosion or medial axis analysis were used for skeleton extraction. Some data-driven methods use neural networks to predict joint locations and the skeleton structure. However, these methods are not effective for skeletonization of incomplete surfaces. Furthermore, these methods create a skeleton using only the geometry. Another prior method (from the published paper titled "Neural marionette: Unsupervised learning of motion skeleton and latent dynamics from volumetric video" authored by Bae Jinseok, Jang Hojun, Min Cheol-Hui, Choi Hyungun, and Young Min Kim, published in Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022), February 2022) uses spatio-temporal information to extract key points from a complete mesh sequence. For similar object categories, it can also re-target motion, and generate and interpolate between poses. However, these methods do not generalize to single-view incomplete objects. Another prior method (from the published paper titled "Curve skeleton extraction from incomplete point cloud" authored by Andrea Tagliasacchi, Hao Zhang, and Daniel Cohen-Or, published in ACM Trans. Graph., 2009) extracts a curve skeleton from an incomplete point cloud using a rotational symmetry axis. The local separators method transforms the mesh into a graph structure and uses local separators to extract joints. However, these methods work on static 3D meshes and do not include motion information, which is necessary for animation transfer.
To incorporate motion information from single-view point clouds, another prior work (from the published paper titled "Unsupervised skeleton extraction and motion capture from 3d deformable matching" authored by Quanshi Zhang, Xuan Song, Xiaowei Shao, Ryosuke Shibasaki, and Huijing Zhao, published in Neurocomputing, 2013) proposes spectral clustering on the point cloud trajectory to find the joint points for the skeleton. Another prior work (from the published paper titled "3d articulated skeleton extraction using a single consumer-grade depth camera" authored by Xuequan Lu, Zhigang Deng, Jun Luo, Wenzhi Chen, Sai-Kit Yeung, and Ying He, published in Computer Vision and Image Understanding, 2019) optimizes the skeleton extraction procedure in a probabilistic framework using linear blend skinning. However, these methods do not utilize curve skeleton information, and consequently, embedding their skeletons into the target mesh results in artifacts.
A few prior works create a puppet instrument for motion retrieval, estimate wave properties from the skeleton, face, or hand tracked by off-the-shelf sensors, or use a smartphone to tie together the control of the character and the camera into a single interaction mechanism. However, these methods require dedicated sensors for tracking motion and are expensive.
For skeleton-based motion retargeting, a few prior works proposed motion puppetry systems to drive the motion of non-human characters. Similarly, a Kinect-based human skeleton was used in another prior method to transfer animation from humans to anthropomorphic (human-like, e.g., bear or monkey) characters. Another prior work proposed a skeleton-aware deep learning framework, which requires the skeletons to be homeomorphic, whereas another prior method proposed an interactive system to transfer animation between different creatures by manually assigning correspondence between their skeletons. A prior method (from the published paper titled "Automatic rigging and animation of 3d characters" authored by Ilya Baran and Jovan Popovic, published in ACM Trans. Graph., 2007) embeds the source skeleton into the target shape, calculates attachment weights, and transfers the bone transformation parameters to deform the target shape.
A few prior methods, such as shape completion and animation of people (SCAPE) and the Skinned Multi-Person Linear Model (SMPL) for human bodies, the hand Model with Articulated and Non-rigid defOrmations (MANO) for hands, Faces Learned with an Articulated Model and Expressions (FLAME) for facial expressions, and the Skinned Multi-Animal Linear Model (SMAL) for quadrupeds, discuss capturing motion from a monocular camera. Similarly, a few neural-based parametric prior models require a large corpus of complete meshes to create. Recent neural implicit-based methods for RGB and RGB-Depth (RGB-D) inputs provide an alternative to motion-capture-based systems, but their computation cost hinders their utilization for frugal motion capture; at the same time, they do not retarget motion to non-isometric objects.
The disclosed method, using the various embodiments discussed below, provides an approach to transfer motion from an incomplete, single-view depth video to a semantically similar target mesh, unlike prior works that assume the source to be noise-free and watertight. The disclosed method is an unsupervised technique for intra-category motion transfer from monocular depth videos to virtual characters of varying shapes. It is category-agnostic, permitting motion transfer between quadrupeds, between bipeds, and in general between any characters with similar topology. To handle sparse, incomplete depth video inputs and variations between source and target objects, the disclosed method uses skeletons as an intermediary representation between motion capture and transfer. A novel unsupervised skeleton extraction pipeline from a single-view depth sequence is disclosed that incorporates additional geometric information, resulting in better performance in motion reconstruction and transfer in comparison to prior methods.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system 100 for motion capturing and retargeting according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 102, communication interface(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 104 operatively coupled to the one or more hardware processors 102. The one or more hardware processors 102 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions 'processors' and 'hardware processors' may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
The I/O interface (s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) can include one or more ports for connecting a number of devices to one another or to another server. For example, a monocular depth video of a source object can be acquired from an external imaging device such as a single view depth camera, via the I/O interface 106.
The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
In an embodiment, the memory 104 includes a plurality of modules 108 (not shown) that can include modules implementing a mesh extraction technique, a curve splitting technique, a curve skeletonization technique, and the like. The plurality of modules includes programs or coded instructions that supplement applications or functions performed by the system 100 for executing the different steps involved in the process of motion capture and retargeting being performed by the system 100. The plurality of modules, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules may also be used as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules can be implemented in hardware, by computer-readable instructions executed by the one or more hardware processors 102, or by a combination thereof. The plurality of modules can include various sub-modules (not shown).
Further, the memory 104 may include a database or repository. The memory 104 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 102 of the system 100 and methods of the present disclosure. In an embodiment, the database may be external (not shown) to the system 100 and coupled via the I/O interface 106.
FIG. 2 illustrates a schematic diagram depicting the broad-level process flow for motion capturing and retargeting according to some embodiments of the present disclosure. As depicted in the figure, the monocular depth video of a source object is provided as input to the disclosed method. Further, a source mesh corresponding to the source object is extracted. At point (a) in FIG. 2, a non-rigid reconstruction technique (NRR) is applied to the source mesh to obtain a motion trajectory corresponding to the source object. Then, at point (b) in FIG. 2, a novel skeletonization technique, which includes a series of techniques such as a curve skeletonization technique, a curve splitting technique, and a skinning technique, is performed to obtain a curve skeleton, a motion skeleton, and a motion associated with the motion skeleton, named the skeleton motion in FIG. 2. The series of above-mentioned techniques is explained in detail in conjunction with FIG. 3A and FIG. 3B. Modules executing these techniques are present in the modules 108 of the system 100. Post performing the skeletonization technique to obtain a motion skeleton and a motion associated with the motion skeleton, the motion skeleton is embedded into a target mesh. Further, retargeting of the motion associated with the motion skeleton is performed along with the embedded target skeleton on the target mesh to obtain a retargeted motion on the target mesh.
FIG. 3A and FIG. 3B illustrate an exemplary flow diagram for a method 300 for motion capturing and retargeting in accordance with some embodiments of the present disclosure.
In an embodiment, the system 100 comprises one or more data storage devices or the memory 104 operatively coupled to the one or more hardware processor(s) 102 and is configured to store instructions for execution of the steps of the method 300 by the processor(s) or one or more hardware processors 102. The steps of the method 300 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagram as depicted in FIGS. 3A and 3B. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
At step 302 of the method 300, the one or more hardware processors 102 are configured to receive (i) the monocular depth video of a source object captured by the single view depth camera and (ii) a target mesh associated with a target object, wherein the monocular depth video comprises a plurality of depth image frames corresponding to a plurality of time frames. Generally, an RGB camera is considered; however, the single view depth camera is used in the experimentation of the disclosed method to obtain the monocular depth video. The monocular depth video is a noisy, single-view depth video with known camera parameters showing a single source object in motion. The target mesh is a clean, watertight mesh. There is no information about part correspondences between the monocular depth video and the target mesh. The benefits of using the single view depth camera are a reduction in the turnaround time for an animator and a large reduction in the cost of motion capture, with some tradeoff against the high-fidelity reconstruction achieved by sophisticated marker-based motion capture systems.
The single view depth camera setup provides a set of depth image frames $D = \{D^t \in \mathbb{R}^{H \times W}\}$, where $t$, $H$, and $W$ represent the timeframe, height, and width of the depth image frames respectively. The camera is presumed to be stationary during the recording, and the camera intrinsic parameters $K \in \mathbb{R}^{3 \times 3}$ are known. Using the camera parameters, every pixel $u \in \mathbb{R}^2$ in depth image $D^t$ is back-projected to create the point cloud $P = \{p_u\}$.
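As an illustration of this back-projection, the following NumPy sketch converts one depth frame to a point cloud under a standard pinhole model; the function name and the convention that invalid pixels carry zero depth are assumptions for illustration only, not part of the specification.

```python
import numpy as np

def back_project(depth, K):
    """Back-project a depth image D^t (H x W, in metres) to a 3D point
    cloud P = {p_u} using pinhole intrinsics K; pixels with zero depth
    are treated as invalid and dropped."""
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid (column, row)
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)       # H x W x 3 point grid
    return points.reshape(-1, 3)[depth.reshape(-1) > 0]
```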
At step 304 of the method 300, the one or more hardware processors 102 are configured to extract, by using the mesh extraction technique (such as point cloud techniques or volume mesh techniques), a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames. The source depth image frame is the first depth image frame among the plurality of depth image frames. For the monocular depth video of $T$ depth image frames, one frame in the video is chosen as the source $S$. The source mesh $M_S = \{V_S, F_S\}$ is obtained from the source depth image frame, where its vertices are $V_S = P^S$ and the faces $F_S$ are obtained by connecting adjacent pixels if the distance between their associated vertices is less than a fixed threshold (empirically defined as 0.05). Ideally, the source depth image frame is chosen as the one which best represents the rest state of the source object, minimizing noise, self-contact, and self-occlusions. However, the disclosed method uses the first depth image frame of the video as the source depth image frame.
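The face-construction rule described above (connecting adjacent pixels whose back-projected vertices lie within the 0.05 threshold) can be sketched as follows; splitting each 2x2 pixel block into two triangles is one plausible reading for illustration, not mandated by the specification.

```python
import numpy as np

def grid_mesh_faces(points, thresh=0.05):
    """Build triangle faces F_S over an H x W grid of back-projected
    points (H x W x 3), keeping a triangle only if all of its edges
    are shorter than thresh (empirically 0.05 in the disclosure)."""
    H, W, _ = points.shape
    idx = np.arange(H * W).reshape(H, W)
    flat = points.reshape(-1, 3)
    faces = []
    for i in range(H - 1):
        for j in range(W - 1):
            a, b = idx[i, j], idx[i, j + 1]
            c, d = idx[i + 1, j], idx[i + 1, j + 1]
            for tri in ((a, b, c), (b, d, c)):        # two triangles per quad
                p = flat[list(tri)]
                edges = np.linalg.norm(p - np.roll(p, 1, axis=0), axis=1)
                if np.all(edges < thresh):            # reject stretched faces
                    faces.append(tri)
    return np.asarray(faces)
```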
At step 306 of the method 300, the one or more hardware processors 102 are configured to obtain a motion trajectory corresponding to the source object by aligning the source mesh to the plurality of depth image frames using the non-rigid reconstruction technique. Here, the trajectory of the vertices of the source mesh $M_S$ is obtained by aligning it to all the future frames using non-rigid registration. To enforce spatial coherence, the motion field is represented using an embedded deformation graph $G = \{V_G, E_G, R_G^T, T_G^T\}$. Each node $g_j \in V_G$ is equipped with a time-varying rigid transformation, i.e., a rotation matrix $R_j^t \in \mathbb{R}^{3 \times 3}$ and a translation vector $T_j^t$ for each timeframe $t$. The trajectory for each vertex $v_i \in V_S$ at timeframe $t \in T$ is computed using the deformation graph as equation 1,
$$Traj_i^t = \sum_{j=0}^{N_G} W^G(i,j)\left(R_j^t (v_i - g_j) + g_j + T_j^t\right) \quad (1)$$
where $W^G(i,j)$ determines the influence of graph node $g_j$ on vertex $v_i$. It is defined as equation 2,
$$W^G(i,j) = \alpha\, e^{-\|v_i - g_j\|_2^2 / (2\sigma_{nc}^2)} \quad (2)$$
where $\alpha$ is a normalization constant so that the weights sum to one, and the node coverage parameter $\sigma_{nc}$ controls the weightage of multiple graph nodes on the vertices. Further, for each timeframe, the graph deformation parameters $\{R_G^T, T_G^T\}$ are estimated by optimizing an energy function $E(G)$.
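A compact sketch of equations 1 and 2 follows, assuming NumPy arrays for the graph nodes and per-frame transforms; the array layouts and function names are illustrative assumptions.

```python
import numpy as np

def graph_weights(verts, nodes, sigma_nc):
    """Equation (2): Gaussian influence of each graph node g_j on vertex
    v_i, normalised so that each vertex's weights sum to one."""
    d2 = ((verts[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma_nc ** 2))
    return w / w.sum(axis=1, keepdims=True)

def warp_vertices(verts, nodes, R_t, T_t, W):
    """Equation (1): trajectory of the source vertices at one timeframe,
    blending the per-node rigid transforms (R_t: N x 3 x 3, T_t: N x 3)."""
    local = verts[:, None, :] - nodes[None, :, :]        # v_i - g_j
    moved = np.einsum('jab,ijb->ija', R_t, local)        # R_j^t (v_i - g_j)
    moved += nodes[None, :, :] + T_t[None, :, :]         # + g_j + T_j^t
    return np.einsum('ij,ija->ia', W, moved)             # weighted blend
```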
At step 308 of the method 300, the one or more hardware processors 102 are configured to apply the curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton, wherein the set of joint coordinates and the curve skeleton are obtained from the source mesh. The skeletonization process as mentioned in FIG. 2 includes steps 308 and 310 of the method 300. The source mesh $M_S = \{V_S, F_S\}$ of the source object and its trajectory $Traj \in \mathbb{R}^{T \times |V_S| \times 3}$, where $T$ is the number of timeframes, are used for the skeletonization process. The output from the skeletonization process is the motion skeleton $MS = \{J_{MS}, B_{MS}\}$, where $J_{MS}$ is a set of joints in $\mathbb{R}^3$ and the bones $B_{MS}$ are the edges connecting the joints, along with its skeletal motion $SM \in \mathbb{R}^{T \times |J_{MS}| \times 3}$.
The motion skeleton represents the object's motion degrees of freedom as a hierarchy of nearly rigid parts (i.e., bones) connected by joints, while a curve skeleton is an abstracted one-dimensional representation of an object's geometry via a tree-structured network of curves. Curve skeletons naturally incorporate the notion of object parts and are therefore used in animation, shape retrieval, and shape correspondence. A curve skeleton CS is a graph, typically a tree, whose nodes are points $j_i \in \mathbb{R}^3$ that together approximate the medial axis of a given shape. The nodes with degree 1, degree 2, and degree $\geq 3$ are referred to as terminal, intermediate, and junction nodes respectively. Terminal and junction nodes together form the functional nodes. A path consisting of intermediate nodes between two functional nodes $i$ and $k$ is called a curve and denoted $C_{ik}$.
Additionally, each node $i$ in the curve skeleton is associated with a subset $S_i$ of the vertices of the original shape. These subsets form a partition of the set of shape vertices, i.e., they are disjoint and their union is the entire set $V_S$. Typically, each subset forms a cylindrical ring around the corresponding skeleton node, or, for an incomplete shape, a strip. By taking unions of these subsets, a curve region $CR_{ik} = \bigcup_{l \in C_{ik}} S_l$ is associated with each curve $C_{ik}$. These curve regions segment the shape into geometrically significant parts.
The curve skeleton is computed from the source mesh $M_S$ using the local separators method, which chooses the skeleton nodes so that each associated $S_i$ is a local separator, i.e., a subset which, if removed, would disconnect $M_S$ into two or more disjoint components. Further, the skeleton is pruned by removing curves of length $\leq 3$, and separate components are connected together based on the least edge-length variation and the boundary information of local separators in pixel space.
The set of joint coordinates are obtained by first applying the above-mentioned curve skeletonization technique on the source mesh to obtain a set of dense clusters. Further, the set of joint coordinates are obtained from the set of dense clusters using a joint position extraction method, represented as equation 3,
$$j_i^* = \arg\min_{j_i \in \mathbb{R}^3} \sum_{v \in S_i} \left( \lambda_{cpp} \left\|(j_i - v) \times n(v)\right\|_2^2 + \lambda_{eucl} \left\|j_i - v\right\|_2^2 \right) + \lambda_{smooth} \sum_{k \in N(i)} \left\|j_i - j_k\right\|_2^2 \quad (3)$$
where $j_i$ is the joint coordinate, $S_i$ is the dense cluster associated with $j_i$, $v$ represents the coordinates of a vertex in $S_i$, $n(v)$ represents the vertex normal at $v$, $\lambda_{cpp}$, $\lambda_{eucl}$ and $\lambda_{smooth}$ are hyperparameters, and $N(i)$ is the set of neighbors of joint $i$. Equation 3 is formulated so that each node lies close to the medial surface of the source object, by constraining the node toward the lines of the separator's vertex normals: the cross-product term vanishes when $j_i - v$ is parallel to $n(v)$. Hence the optimal joint coordinate of each node is obtained using equation 3.
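Since each term of equation 3 is quadratic in $j_i$ (for a unit normal, $\|(j_i - v) \times n(v)\|_2^2 = (j_i - v)^\top (I - n n^\top)(j_i - v)$), the minimiser for one joint, with its neighboring joints held fixed, solves a 3x3 linear system. The NumPy sketch below performs one such Gauss-Seidel-style update; the function name and hyperparameter values are illustrative assumptions, not taken from the specification.

```python
import numpy as np

def solve_joint(Sv, Sn, neighbours, lam_cpp=1.0, lam_eucl=0.1, lam_smooth=0.1):
    """One update of equation (3) for a single joint j_i with its
    neighbouring joints held fixed. Sv: cluster vertices (M x 3);
    Sn: unit vertex normals (M x 3); neighbours: current estimates of
    adjacent joints (K x 3). Hyperparameter values are placeholders."""
    I = np.eye(3)
    A = lam_smooth * len(neighbours) * I
    b = lam_smooth * neighbours.sum(axis=0)
    for v, n in zip(Sv, Sn):
        P = I - np.outer(n, n)        # ||(j - v) x n||^2 = (j-v)^T P (j-v)
        A += lam_cpp * P + lam_eucl * I
        b += lam_cpp * (P @ v) + lam_eucl * v
    return np.linalg.solve(A, b)      # minimiser of the quadratic objective
```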
The motion skeleton is obtained by applying the curve splitting technique on the set of joint coordinates and the curve skeleton. Using the curve splitting technique, each curve of the one or more curves in the curve skeleton is recursively split into one or more bone segments having a set of joints associated with the motion skeleton, until a motion reconstruction cost is less than a predefined threshold, wherein the coordinates corresponding to the set of joints are a subset of the set of joint coordinates. The explanation hereafter details the curve splitting technique. The curve regions $CR_{ik}$ of the curve skeleton do not incorporate any motion information; hence they may be subdivided into functional parts to obtain the motion skeleton. Each curve $C_{ik}$ is repeatedly split to reduce the reconstruction error between the original trajectory and its rigid approximation. The motion skeleton MS is initialized by defining a single bone $b$ for each curve $C_{ik}$ in CS. The curve region $CR_{ik}$ represents the motion cluster whose trajectory is governed by the rigid transformation of bone $b$. Let $R_b^t$, $T_b^t$ be the rotation and translation parameters for bone $b$ at timeframe $t$. The reconstruction cost is represented as equation 4,
$$RC(i,k) = \frac{1}{T} \sum_{t=1}^{T} \sum_{v \in CR_{ik}} \left\| Traj_v^t - (R_b^t v + T_b^t) \right\|_2^2 \quad (4)$$
For every bone, the intermediate nodes of its underlying curve are traversed, and the one at which splitting the curve would give the lowest reconstruction error is computed as in equation 5,
$$r^* = \arg\min_{r \in C_{ik}} \left( RC(i, r-1) + RC(r, k) \right) \quad (5)$$
If the relative change in RC is greater than a threshold $\epsilon_{split}$, the bone is replaced with two bones associated with curves $C_{ir}$ and $C_{rk}$ and connected at the joint $j_r$, and the process repeats. Otherwise, the original bone is kept.
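The splitting procedure of equations 4 and 5 can be sketched as below; the helper that fits one rigid transform per frame to a sub-curve's region (for example, by a per-frame Procrustes fit) is assumed, and the threshold value is an illustrative placeholder.

```python
import numpy as np

def reconstruction_cost(traj, verts, R, trans):
    """Equation (4): squared error between the tracked trajectory of a
    curve region and one bone's rigid motion, summed over vertices and
    averaged over frames. traj: F x M x 3 tracked positions;
    verts: M x 3 rest positions; R: F x 3 x 3; trans: F x 3."""
    pred = np.einsum('fab,mb->fma', R, verts) + trans[:, None, :]
    return ((traj - pred) ** 2).sum(-1).sum(-1).mean()

def split_curve(curve, cost, eps_split=0.05):
    """Equation (5): recursively split a curve (list of skeleton node ids)
    at the intermediate node minimising the summed cost of the two halves;
    keep the split only if the relative improvement exceeds eps_split.
    `cost` is an assumed helper that fits one bone to a sub-curve's
    region and evaluates equation (4)."""
    if len(curve) <= 2:                                  # no intermediate node
        return [curve]
    base = cost(curve)
    r = min(range(1, len(curve) - 1),
            key=lambda s: cost(curve[:s + 1]) + cost(curve[s:]))
    best = cost(curve[:r + 1]) + cost(curve[r:])
    if base > 0 and (base - best) / base > eps_split:    # relative change in RC
        return (split_curve(curve[:r + 1], cost, eps_split)
                + split_curve(curve[r:], cost, eps_split))
    return [curve]
```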
Upon generation of the motion skeleton at step 308, at step 310 of the method 300, the one or more hardware processors 102 are configured to obtain a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory and (iii) the set of joint coordinates based on a joint motion extraction technique, as described below. The step of obtaining the motion associated with the motion skeleton is preceded by reconstructing the motion by applying a skinning technique on at least one of (i) the motion trajectory and (ii) the motion skeleton. The deformation of the underlying skeleton is estimated, which leads to the articulated motion of the source object. This skeleton motion SM is later transferred to the target object during motion retargeting. The articulated motion $AM \in \mathbb{R}^{T \times |V_S| \times 3}$ is calculated by deforming each vertex $v$ using a weighted combination of bones as defined in equation 6,
$$AM_v^t = \sum_{b \in B_{MS}} W(v,b)\left(R_b^t v + T_b^t\right) \quad (6)$$
where $W(v,b)$ represents the influence of bone $b$ on vertex $v$. The skinning weights $W \in \mathbb{R}^{|V_S| \times |B_{MS}|}$ are computed by finding the optimal weights subject to the constraints
$$W(v,b) \geq 0 \quad (7)$$
$$\sum_{b \in B_{MS}} W(v,b) = 1 \quad (8)$$
$$\left|\{b \in B_{MS} : W(v,b) > 0\}\right| \leq 4 \quad (9)$$
Using the reconstructed motion, the location of each joint at timeframe $t$ is obtained as equation 10,
$$SM_j^t = j + \sum_{v \in S_j} \left(AM_v^t - v\right) \quad (10)$$
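A sketch of equations 6 and 10 follows, assuming the skinning weights have already been solved under constraints 7-9. Equation 10 as printed sums the cluster displacements; this sketch averages them over the cluster, which is one plausible reading of how a joint should follow its associated vertices.

```python
import numpy as np

def articulated_motion(verts, R, Tb, W):
    """Equation (6): linear blend skinning. verts: V x 3 rest positions;
    R: F x B x 3 x 3 and Tb: F x B x 3 per-bone rigid transforms over F
    frames; W: V x B weights (non-negative, rows summing to one, at most
    four non-zero entries, as in equations (7)-(9))."""
    rotated = np.einsum('fbij,vj->fbvi', R, verts)       # R_b^t v
    per_bone = rotated + Tb[:, :, None, :]               # + T_b^t
    return np.einsum('vb,fbvi->fvi', W, per_bone)        # weighted blend

def skeleton_motion(joints, clusters, AM, verts):
    """Equation (10): move each joint with the reconstructed displacement
    of its associated vertex cluster S_j (averaged here, see note above).
    joints: J x 3; clusters: list of J index arrays; AM: F x V x 3."""
    SM = np.empty((AM.shape[0], len(joints), 3))
    for i, Sj in enumerate(clusters):
        disp = AM[:, Sj] - verts[Sj]          # per-vertex displacement
        SM[:, i] = joints[i] + disp.mean(axis=1)
    return SM
```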
At step 312 of the method 300, the one or more hardware processors 102 are configured to embed the motion skeleton to the target mesh to obtain an embedded target skeleton. The motion skeleton MS is adapted to the target mesh M_T = {V_T ,F_T }. Then the adapted target skeleton TS is attached to M_T by calculating linear blend skinning weights W_T.
At step 314 of the method 300, the one or more hardware processors 102 are configured to retarget the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh. Initially, MS and TS are converted to rooted trees. The joint closest to the center of the source mesh, and its corresponding joint in TS, are taken to be the root joints. Parent-child relationships are defined for all other nodes accordingly. The motion of the target skeleton TM is computed as in equation 11,
$$TM_j^t = R_j^t (j - p) + TM_p^t \quad (11)$$
where $p$ is the parent of joint $j$.
The final re-targeted motion RM of the target mesh M_T is determined via linear blend skinning as in equation 12,
$$RM_v^t = \sum_{b \in B_{TS}} W(v,b)\left(R_j^t (v - p) + TM_p^t\right) \quad (12)$$
where $v$ is a vertex of $M_T$, and $b$ is the bone in the target skeleton which has joints $j$ and $p$ as its adjacent joints.
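A sketch of the forward-kinematics pass of equation 11 follows, assuming the rooted target skeleton is stored as a child-to-parent dictionary; the root trajectory and per-joint rotations are taken as given from the preceding steps, and all names are illustrative assumptions.

```python
import numpy as np

def topological_order(parent, root):
    """Order joints so that every parent precedes its children.
    parent: dict mapping each non-root joint id to its parent id."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)
    order, stack = [], [root]
    while stack:
        j = stack.pop()
        order.append(j)
        stack.extend(children.get(j, []))
    return order[1:]                          # root is handled separately

def target_joint_motion(parent, joints, R, root, root_traj):
    """Equation (11): propagate per-joint rotations down the rooted tree.
    joints: J x 3 rest positions; R: F x J x 3 x 3 rotations transferred
    from the source; root_traj: F x 3 positions of the root joint."""
    F = R.shape[0]
    TM = np.zeros((F, len(joints), 3))
    TM[:, root] = root_traj
    for j in topological_order(parent, root):
        offset = joints[j] - joints[parent[j]]            # rest bone vector
        TM[:, j] = np.einsum('fab,b->fa', R[:, j], offset) + TM[:, parent[j]]
    return TM
```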
EXPERIMENTAL RESULTS: Experiments were performed on an Alienware laptop with an Intel i7 CPU, 32GB RAM, and an 8GB Nvidia GeForce GTX 1080. DeformingThings4D, a dataset of non-rigidly deforming objects, was used for the experiments. For every example, a source depth video is computed by assigning a random camera view and calculating depth images using Blender Eevee. The disclosed method has been tested on a variety of model pairs with similar semantic structures. FIG. 4 illustrates motion transfer results using the disclosed method on the DeformingThings4D dataset according to some embodiments of the present disclosure. The disclosed method is robust to some degree of variation in poses.
The disclosed method is compared both qualitatively and quantitatively against a set of known unsupervised methods, namely Method 1 (from the published paper titled "Unsupervised skeleton extraction and motion capture from 3d deformable matching" authored by Quanshi Zhang, Xuan Song, Xiaowei Shao, Ryosuke Shibasaki, and Huijing Zhao, published in Neurocomputing, 2013), Method 2 (from the published paper titled "Robust and accurate skeletal rigging from mesh sequences" authored by Binh Huy Le and Zhigang Deng, published in ACM Transactions on Graphics, 2014), and Method 3 (from the published paper titled "Neural marionette: Unsupervised learning of motion skeleton and latent dynamics from volumetric video" authored by Bae Jinseok, Jang Hojun, Min Cheol-Hui, Choi Hyungun, and Young Min Kim, published in Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022), 2022). FIG. 5 illustrates the comparison of the disclosed method with Method 1, Method 2, and Method 3 in accordance with some embodiments of the present disclosure. Table 1 shows the quantitative performance scores: (i) the percentage of joints that were successfully embedded into the same shape, (ii) the reconstruction error between the re-targeted motion and the original motion, and (iii) the local pose error, i.e., the reconstruction error after rigidly aligning the re-targeted motion with the original motion at each timeframe.
Method | Percentage of Joints Embedded | Reconstruction Error | Local Pose Error
Method 1 | 31±21% | 0.91±0.32 | 0.36±0.18
Method 2 | 52±38% | 0.71±0.38 | 0.28±0.16
Method 3 | – | 0.51±0.25 | 0.23±0.07
Disclosed method | 84±22% | 0.37±0.23 | 0.18±0.09
Table 1
From FIG. 5 and Table 1, it is observed that the disclosed method provides better results than all three tested approaches, both qualitatively and quantitatively. As Method 3 uses a volumetric representation without preprocessing, it is unable to handle an incomplete point cloud sequence. Method 1 and Method 2 use only motion information without structural cues, resulting in a skeleton that is less effective for embedding.
FIG. 6 shows motion transfer results from real captured RGBD videos in accordance with some embodiments of the present disclosure. The disclosed method was experimented with on two examples. In (a) of FIG. 6, the motion of a doll puppeteered by a human, captured from a single view, was transferred back to the same object and to biped examples from the Models Resource dataset (from the published paper titled "Predicting animation skeletons for 3d articulated models via volumetric nets" authored by Zhan Xu, Yang Zhou, Evangelos Kalogerakis, and Karan Singh, published in International Conference on 3D Vision (3DV), 2019). In (b), a video of an adult moving his arms from the DeepDeform dataset (from the published paper titled "Learning non-rigid rgb-d reconstruction with semi-supervised data" authored by Aljaz Bozic, Michael Zollhofer, Christian Theobalt, and Matthias Nießner, published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020) was retargeted to a gorilla, with manual editing for plausible motion transfer.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure herein address the problem of motion transfer and retargeting. The disclosed method provides unsupervised motion transfer from a monocular depth video, captured using a single-view depth camera, to virtual characters modeled as polygonal meshes. The disclosed method also provides a novel skeletonization approach which helps in reducing the artifacts that arise through motion skeletonization. The skeletonization incorporates additional geometric information, which helps in better motion reconstruction and transfer as compared to prior methods.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
We Claim:
1. A processor implemented method comprising:
receiving (302), via one or more hardware processors, (i) a monocular depth video of a source object captured by a single view depth camera and (ii) a target mesh associated with a target object, wherein the monocular depth video comprises a plurality of depth image frames corresponding to a plurality of time frames;
extracting by using a mesh extraction technique (304), via the one or more hardware processors, a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames;
obtaining (306), via the one or more hardware processors, a motion trajectory corresponding to the source object by aligning the source mesh to the plurality of depth image frames using a non-rigid reconstruction technique;
applying (308), via the one or more hardware processors, a curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton, wherein the set of joint coordinates and the curve skeleton is obtained from the source mesh;
obtaining (310), via the one or more hardware processors, a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory and (iii) the set of joint coordinates based on a joint motion extraction technique;
embedding (312), via the one or more hardware processors, the motion skeleton to the target mesh to obtain an embedded target skeleton; and
retargeting (314), via the one or more hardware processors, the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh.
2. The method as claimed in claim 1, wherein the source depth image frame is the first depth image frame among the plurality of depth image frames.

3. The method as claimed in claim 1, wherein obtaining the set of joint coordinates comprises,
applying, via the one or more hardware processors, a curve skeletonization technique on the source mesh to obtain a set of dense clusters; and
obtaining, via the one or more hardware processors, the set of joint coordinates from the set of dense clusters using a joint position extraction method.

4. The method as claimed in claim 3, wherein the joint position extraction method is represented as
$$j_i^* = \arg\min_{j_i \in \mathbb{R}^3} \sum_{v \in S_i} \left( \lambda_{cpp} \left\|(j_i - v) \times n(v)\right\|_2^2 + \lambda_{eucl} \left\|j_i - v\right\|_2^2 \right) + \lambda_{smooth} \sum_{k \in N(i)} \left\|j_i - j_k\right\|_2^2$$
where $j_i$ is the joint coordinate, $S_i$ is the dense cluster associated with $j_i$, $v$ represents the coordinates of a vertex of the set of vertices of the dense clusters, $n(v)$ represents the vertex normal of that vertex, $\lambda_{cpp}$, $\lambda_{eucl}$ and $\lambda_{smooth}$ are hyperparameters, and $N(i)$ is the set of neighbors of a joint coordinate $i$.

5. The method as claimed in claim 1, wherein the curve skeleton is obtained from the set of dense clusters using a local separator method.

6. The method as claimed in claim 1, wherein obtaining the motion skeleton by applying the curve splitting technique comprises,
recursively splitting, via the one or more hardware processors, each curve of the one or more curves in the curve skeleton into one or more bone segments having a set of joints associated with the motion skeleton, until a motion reconstruction cost is less than a predefined threshold, wherein a set of coordinates corresponding to the set of joints is a subset of the set of joint coordinates.

7. The method as claimed in claim 1, wherein obtaining the motion associated with the motion skeleton is preceded by reconstructing the motion by applying a skinning technique on at least one of (i) the motion trajectory and (ii) the motion skeleton.

8. A system (100), comprising:
a memory (104) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (102) coupled to the memory (104) via the one or more communication interfaces (106), wherein the one or more hardware processors (102) are configured by the instructions to:
receive (i) a monocular depth video of a source object captured by a single view depth camera and (ii) a target mesh associated with a target object, wherein the monocular depth video comprises a plurality of depth image frames corresponding to a plurality of time frames;
extract by using a mesh extraction technique a source mesh corresponding to the source object in a source depth image frame amongst the plurality of depth image frames;
obtain a motion trajectory corresponding to the source object by aligning the source mesh to the plurality of depth image frames using a non-rigid reconstruction technique;
apply a curve splitting technique on at least one of (i) a set of joint coordinates and (ii) a curve skeleton comprising one or more curves to obtain a motion skeleton, wherein the set of joint coordinates and the curve skeleton is obtained from the source mesh;
obtain a motion associated with the motion skeleton from at least one of (i) the motion skeleton, (ii) the motion trajectory and (iii) the set of joint coordinates based on a joint motion extraction technique;
embed the motion skeleton to the target mesh to obtain an embedded target skeleton; and
retarget the motion associated with the motion skeleton along with the embedded target skeleton on the target mesh.

9. The system as claimed in claim 8, wherein the source depth image frame is the first depth image frame among the plurality of depth image frames.

10. The system as claimed in claim 8, wherein obtaining the set of joint coordinates comprises,
applying a curve skeletonization technique on the source mesh to obtain a set of dense clusters; and
obtaining the set of joint coordinates from the set of dense clusters using a joint position extraction method.

11. The system as claimed in claim 10, wherein the joint position extraction method is represented as
$$j_i^* = \arg\min_{j_i \in \mathbb{R}^3} \sum_{v \in S_i} \left( \lambda_{cpp} \left\|(j_i - v) \times n(v)\right\|_2^2 + \lambda_{eucl} \left\|j_i - v\right\|_2^2 \right) + \lambda_{smooth} \sum_{k \in N(i)} \left\|j_i - j_k\right\|_2^2$$
where $j_i$ is the joint coordinate, $S_i$ is the dense cluster associated with $j_i$, $v$ represents the coordinates of a vertex of the set of vertices of the dense clusters, $n(v)$ represents the vertex normal of that vertex, $\lambda_{cpp}$, $\lambda_{eucl}$ and $\lambda_{smooth}$ are hyperparameters, and $N(i)$ is the set of neighbors of a joint coordinate $i$.

12. The system as claimed in claim 8, wherein the curve skeleton is obtained from the set of dense clusters using a local separator method.

13. The system as claimed in claim 8, wherein obtaining the motion skeleton by applying the curve splitting technique comprises,
recursively splitting each curve of the one or more curves in the curve skeleton into one or more bone segments having a set of joints associated with the motion skeleton, until a motion reconstruction cost is less than a predefined threshold, wherein a set of coordinates corresponding to the set of joints is a subset of the set of joint coordinates.

14. The system as claimed in claim 8, wherein obtaining the motion associated with the motion skeleton is preceded by reconstructing the motion by applying a skinning technique on at least one of (i) the motion trajectory and (ii) the motion skeleton.

Dated this 25th Day of August 2023
Tata Consultancy Services Limited
By their Agent & Attorney

(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086

Documents

Application Documents

# Name Date
1 202321036268-STATEMENT OF UNDERTAKING (FORM 3) [25-05-2023(online)].pdf 2023-05-25
2 202321036268-PROVISIONAL SPECIFICATION [25-05-2023(online)].pdf 2023-05-25
3 202321036268-FORM 1 [25-05-2023(online)].pdf 2023-05-25
4 202321036268-DRAWINGS [25-05-2023(online)].pdf 2023-05-25
5 202321036268-FORM-26 [23-06-2023(online)].pdf 2023-06-23
6 202321036268-Proof of Right [25-08-2023(online)].pdf 2023-08-25
7 202321036268-FORM 18 [25-08-2023(online)].pdf 2023-08-25
8 202321036268-ENDORSEMENT BY INVENTORS [25-08-2023(online)].pdf 2023-08-25
9 202321036268-DRAWING [25-08-2023(online)].pdf 2023-08-25
10 202321036268-CORRESPONDENCE-OTHERS [25-08-2023(online)].pdf 2023-08-25
11 202321036268-COMPLETE SPECIFICATION [25-08-2023(online)].pdf 2023-08-25
12 Abstract1.jpg 2024-01-10
13 202321036268-FORM-26 [05-11-2025(online)].pdf 2025-11-05