Abstract: A method and a system for functional skill assessment using video-to-video adaptation for recognizing actions of a subject are provided. The method includes building a classifier based on a source dataset and a sparsely annotated dataset, and assessing the functional skill of a subject by applying the classifier to a video of the subject performing an action. Building the classifier includes training a model (Pθt(yt | xt)) on the sparsely annotated dataset comprising a plurality of classes; matching, for each of the plurality of classes of the sparsely annotated data, the source dataset class that maximizes a posterior likelihood; training a classifier on the source data using the matched source classes; and building the classifier by retraining the model using the matched source classes.
The subject matter relates to the field of automation of assessment tests;
more particularly, but not exclusively, the subject matter relates to functional skill
assessment of a subject for recognizing actions using video-to-video adaptation.
[0002] Skill assessment processes for subjects suffering from a mental disorder
(for example, Autism) typically involve giving instructions to a child and monitoring and
recording their responses as they occur. This requires a trained clinician to engage
with the child and perform mundane repetitive tasks such as recording the child’s
observations and human action responses to a fixed set of stimuli. With the advent of
tremendously powerful modern deep learning techniques, one can hope to automate
many such tasks, bringing affordability and improved access to the value chain of specific
mental disorder screening, diagnostics and behavioral treatment activities while
reducing the dependence on trained clinicians.
[0003] Human action recognition from videos refers to the identification and
classification of different action types in a video as and when they occur. Accurate
representation and classification of human actions is a challenging area of research in
computer vision which includes human detection, human pose estimation and
tracking human behavior. Human action recognition is typically solved in a
supervised machine learning setting. Most of the successful models employ
convolutional neural networks (CNN) as their backbone and can be broadly
categorized into three types - two stream networks, 3-Dimensional (3D)
Convolutional Neural Networks and convolutional long short-term memory (LSTM)
networks.
[0004] The two stream architecture has two CNNs, each of them separately
trained on image sequences (RGB) and optical flow sequences. The model averages
the predictions from a single RGB frame and a stack of multiple optical flow frames
after passing them through two CNNs which are pre-trained on large-scale static
image datasets. An extension of the two-stream model fuses the two streams after the
last CNN layer, showing improvement in performance. The bag of words modeling
approach ignores the temporal structure as it contains pooled predictions of
features extracted using frames across the video. A recurrent layer like LSTM can be
added to such a model, which can capture long range dependencies and temporal
ordering. 3D CNNs are convolutional networks with spatio-temporal filters that
create hierarchical representations of spatio-temporal data. They have many more
parameters as compared to 2D CNNs and are therefore harder to train. As these models
could not use static image datasets for pre-training, shallow custom architectures are
defined which are trained from scratch.
[0005] The two stream Inflated 3D CNN or I3D has a two stream architecture with
each stream trained with RGB and optical flow sequences separately. The I3D model can
be seen as an inflated version of a 2D architecture where the 2D filters and pooling kernels
are given one additional, temporal, dimension. The 3D model can be pre-trained on
static image datasets by repeating the weights of the 2D filters along the temporal dimension
and then rescaling them. I3D is one of the standard benchmarks on the UCF101 and HMDB51
datasets. Temporal Segment Networks or TSN is another example of two stream
networks, with a better ability to model long range temporal dependencies. Rather
than working on individual frames or a stack of contiguous frames, TSN works on
short snippets sparsely sampled from the entire video. Each of these sparse snippets is
predicted with a class and a final consensus is taken to predict the class of the video.
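The inflation described above can be sketched numerically. The following is a minimal numpy illustration (not the I3D implementation itself): a pre-trained 2D filter is repeated T times along a new temporal dimension and rescaled by 1/T, so that on a temporally constant clip the inflated filter reproduces the original 2D response.

```python
import numpy as np

def inflate_2d_filter(w2d, T):
    """Repeat a (kh, kw) filter T times along time and rescale by 1/T,
    mirroring the I3D bootstrap from 2D pre-trained weights."""
    return np.repeat(w2d[None, :, :], T, axis=0) / T

rng = np.random.default_rng(0)
T = 4
w2d = rng.standard_normal((3, 3))
w3d = inflate_2d_filter(w2d, T)

# On a "boring" video (one frame repeated over time), the 3D response
# at a patch equals the original 2D response on that frame.
patch = rng.standard_normal((3, 3))
video_patch = np.repeat(patch[None, :, :], T, axis=0)
assert np.isclose((w2d * patch).sum(), (w3d * video_patch).sum())
```

The rescaling by 1/T is what preserves the activation statistics of the pre-trained 2D network at initialization.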
[0006] The above techniques may be effective for use cases where the data sets
are huge. It may not be economically feasible to gather and annotate human action
data for building a data set for all minor and particular use cases, as such data is not only time
consuming to collect but may often result in an over-fit model that does not perform
well on unseen data. Further, there is abundant availability of large-scale public
datasets that contain thousands of annotated video clips corresponding to hundreds
of action classes.
[0007] In view of the foregoing, there is a need for an improved skill assessment
technique for recognizing the actions of a subject, one that may leverage large-scale
annotated datasets to improve the generalization abilities of classifiers built for
specific tasks.
SUMMARY
[0008] Accordingly, an improved technique for functional skill assessment of a
subject is provided using video-to-video adaptation for recognizing actions. In an
embodiment, the method includes building a classifier based on a source dataset and a
sparsely annotated dataset. Further, the functional skill of a subject is assessed by
applying the classifier to a video of the subject performing an action. The method for
building the classifier includes training a model (Pθt(yt | xt)) on the sparsely
annotated dataset comprising a plurality of classes; computing a posterior likelihood
for each of the classes of the source dataset using the trained model for each of the
plurality of classes of the sparsely annotated dataset; matching the source dataset
class that maximizes the posterior likelihood for each of the plurality of classes of the
sparsely annotated data; training a classifier on the source data using the matched source
classes; and building the classifier by retraining the model using the matched source classes.
[0009] In an embodiment, the source dataset is larger than the
sparsely annotated dataset; and wherein the source dataset comprises actions
performed by one or more subjects who are mentally fit.
[0010] In an embodiment, the sparsely annotated dataset comprises actions
performed by one or more subjects suffering from a mental disorder.
[0011] In an embodiment, a loss term is computed using a Frobenius norm (LDR =
|| EθT Eϕ – Ik ||F) based on the parameters of the models Pθt(yt | xt) and Pϕs(ys | xs).
Further, the classifier is built by retraining the model using the matched source classes
and the loss term.
[0012] In an embodiment, the source dataset and the sparsely annotated dataset
5 comprise pre-processed video data of one or more subjects performing at least one
similar action.
[0013] In another embodiment, a system for functional skill assessment for
recognizing actions of a subject using video-to-video adaptation is provided. The system
comprises at least one processor and a memory. The memory includes an application
program configured to perform operations to assess the functional skill of a subject.
The operations include building a classifier based on a source dataset and a sparsely
annotated dataset, and assessing the functional skill of a subject by applying the
classifier to a video of the subject performing an action. Building the
classifier includes training a model (Pθt(yt | xt)) on the sparsely annotated dataset
comprising a plurality of classes; computing a posterior likelihood for each of the
classes of the source dataset using the trained model for each of the plurality of
classes of the sparsely annotated dataset; matching the source dataset class that
maximizes the posterior likelihood for each of the plurality of classes of the sparsely
annotated data; training a classifier on the source data using the matched source classes; and
building the classifier by retraining the model using the matched source classes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Embodiments are illustrated by way of example in the Figures of the
accompanying drawings, in which like references indicate similar elements and in
which:
[0015] FIG. 1 illustrates an exemplary method 100 for assessing the functional
skill of a subject for recognizing actions in accordance with an embodiment;
[0016] FIG. 2 illustrates action classes in two different datasets in the optical flow
domain;
[0017] FIG. 3 illustrates an exemplary process flow diagram 200 for building a
classifier in accordance with an embodiment of the invention;
[0018] FIG. 4 illustrates an exemplary method 300 for building a classifier in
accordance with an embodiment;
[0019] FIG. 5 illustrates a block diagram of a system 400 in which an
embodiment for building a classifier may be implemented;
[0020] FIG. 6A illustrates an example data collection for a training dataset for
assessing subjects suffering with Autism, in accordance with an embodiment;
[0021] FIG. 6B illustrates an example summary of a training dataset for assessing
subjects suffering with Autism, in accordance with an embodiment;
[0022] FIG. 6C lists the number of samples in each Autism class according to the
position of the camera, in accordance with an exemplary embodiment of the
invention;
[0023] FIG. 7A visualizes the performance of I3D and TSN with iteration over
the source sample with distributional mode matching and directional regularization,
in accordance with an exemplary embodiment of the invention;
[0024] FIG. 7B tabulates validation accuracies on each disjoint fold having 10%
(or 20%) of the autism data, in accordance with an exemplary embodiment of the
invention; and
[0025] FIG. 7C depicts t-SNE visualizations of embeddings of I3D and TSN
using the 10% baseline autism model applied with distributional mode matching and
directional regularization, in accordance with an exemplary embodiment of the
invention.
DETAILED DESCRIPTION
[0026] The following detailed description includes references to the
accompanying drawings, which form a part of the detailed description. The drawings
show illustrations in accordance with example embodiments. These example
5 embodiments are described in enough detail to enable those skilled in the art to
practice the present subject matter. However, it will be apparent to one with ordinary
skill in the art that the present invention may be practiced without these specific
details. In other instances, well-known methods, procedures, components, and
networks have not been described in detail so as not to unnecessarily obscure aspects
10 of the embodiments. Within the scope of the detailed description and the teachings
provided herein, additional embodiments, application, features, and modifications are
certainly are recognized by a person skilled in the art. Therefore, the following
detailed description is not to be taken in a limiting sense.
[0027] In this document, the terms “a” or “an” are used, as is common in patent
documents, to include one or more than one. In this document, the term “or” is used
to refer to a nonexclusive “or,” such that “A or B” includes “A but not B,” “B but not
A,” and “A and B,” unless otherwise indicated.
[0028] The invention applies the idea of human action recognition to measure a
child’s responses, based on the stimulus provided by the clinician, in the areas of
imitation, listener response, fine motor and gross motor skills. These skills are used
exhaustively in making diagnostic outcomes for children on the Autism Spectrum
Disorder (ASD), among other mental disorders. In general, one needs very large
amounts of expert annotated data corresponding to the particular classes of actions
that one is aiming to recognize. Such data is not only time consuming to collect but often
results in an over-fit model that does not perform well on unseen data. However,
there is abundant availability of large-scale public datasets that contain thousands of
annotated video clips corresponding to hundreds of action classes. Further, the
invention leverages the availability of large scale annotated public datasets to
improve the generalization abilities of classifiers built for specific skill assessment
pertaining to mental disorders such as Autism, among others.
[0029] The invention adapts video data from a source distribution to build a
robust classifier on a target video data distribution, based on the similarities between
the source and sparsely annotated classes. It has been observed that, despite the
source and sparsely annotated classes being disjoint, there might exist semantic
similarity between some of the source and sparsely annotated classes under some
image transformation (such as optical flow). The invention discloses a technique for
improving the generalization ability of a neural model for the human action recognition
problem using distributional mode matching (DMM) with directional
regularization (DR), which capitalizes on the similarities between the source and
sparsely annotated domains. The technique as described further is used to assess the
functional skill of a subject suffering with a specific mental disorder including Autism.
[0030] FIG. 1 illustrates an exemplary method 100 for assessing the functional
skill of a subject for recognizing actions in accordance with an embodiment. At step
300 a classifier is built based on a source dataset and a sparsely annotated dataset.
The step 300 is discussed in further detail under FIG. 3 and other parts of the
detailed description. At step 104, a video of a subject performing an action is
recorded. Further, the recorded video undergoes pre-processing. At step 106, the
functional skill of the subject is assessed by applying the classifier to the pre-processed video of the subject.
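The three steps of method 100 can be sketched as a small pipeline. The pre-processing and the classifier below are hypothetical stand-ins (a mean-pooled clip feature and a linear scorer), not the trained models of the embodiment, and the class names are illustrative only.

```python
import numpy as np

ACTION_CLASSES = ["arms_up", "rolly_polly", "touch_head"]  # illustrative

def preprocess(video):
    """Stand-in for pre-processing (e.g., computing optical flow frames)."""
    return np.asarray(video, dtype=float)

def classify(frames, weights):
    """Stand-in classifier: linear scores over a mean-pooled clip feature."""
    feature = frames.mean(axis=0)  # pool the per-frame features over time
    return ACTION_CLASSES[int(np.argmax(weights @ feature))]

def assess(video, weights):
    # Step 104: record and pre-process; step 106: apply the classifier.
    return classify(preprocess(video), weights)

rng = np.random.default_rng(0)
video = rng.random((8, 5))                  # 8 frames of 5-dim features
weights = rng.random((len(ACTION_CLASSES), 5))
label = assess(video, weights)
```

In the embodiment, `classify` would be the classifier built at step 300 and `preprocess` the flow-domain transformation discussed below.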
[0031] In general, generalization across unseen data is one of the most important
problems in learning theory. Specifically, very high capacity models, such as deep
neural networks, tend to easily over-fit the training data, leading to poor performance
on test data. Several regularization techniques are routinely incorporated to address
the problem of overfitting. Most of them attempt to reduce the generalization error by
trading increased bias for reduced variance. Some of the popular approaches include
imposing a norm penalty on the network parameters, stochastic pruning of parameters,
normalization of activations and data augmentation.
[0032] FIG. 2 illustrates action classes in two different datasets in the optical flow
domain. One representative RGB and flow frame is depicted in each case. The
directional closeness of the optical flow frames can be observed despite the classes being
completely unrelated: (a)-(b) the Arms up action in a dataset is close to (c)-(d) the pull ups action in a
large annotated dataset called the ‘Kinetics dataset’. Further, (e)-(f) the rolly polly action in
the dataset is close to (g)-(h) the flic flac action in another annotated dataset called
the ‘HMDB51 dataset’.
[0033] In any given video, there exist transformations, such as optical flow,
which are non-unique mappings of the video space. This suggests that, given
multiple disjoint sets of action classes, there may be spaces (such as flow) where a
given pair of action classes may lie ’close’, albeit they represent different semantics in
the RGB-space. For example, the optical flow characteristics of a ’baseball-strike’
class and a ’cricket-slog’ class could be imagined to be close. Further, there may exist
large-scale, open data sets (e.g., Kinetics) that encompass a large number of
annotated videos for several action classes. Thus, if one can find the classes in the
open datasets that are ’close’ to a given class in the data of interest, then the videos
from the open dataset can potentially be used for augmentation, resulting in
regularization. The following disclosure describes the implementation of this
invention in a detailed manner.
NOTATIONS
[0034] Let X denote the sample space encompassing the elements of transformed
videos (e.g., optical flow). Let PS(xs) and Pt(xt) be two distributions on X,
respectively called the source and sparsely annotated data distributions. Suppose a
semantic labeling scheme is defined on both PS(xs) and Pt(xt). That is, let
YS = {ys1, ys2, …, ysN} and Yt = {yt1, yt2, …, ytM} be the source and sparsely annotated class
labels assigned to the samples of PS(xs) and Pt(xt) respectively, which in turn
define the joint distributions PS(xs, ys) and Pt(xt, yt). ‘N’ and ‘M’ are the respective
numbers of source and sparsely annotated classes.
[0035] Let DS = {(xs, ys)} and Dt = {(xt, yt)} denote the tuples of samples drawn
from the two joint distributions PS and Pt, respectively. Suppose a parametric
discriminative classifier (a deep neural network) is learned using Dt to obtain an estimate
of the conditional distribution Pθt(yt | xt), where ‘θ’ represents the parameters of the
neural network.
[0036] With these notations, we consider the case where the cardinality of Dt is
much less than that of DS, implying that the amount of supervised data for the
sparsely annotated distribution is much less than that for the source distribution. In
such a case, Pθt(yt | xt) trained on Dt is deemed to overfit and hence does not
generalize well. As discussed, if there exists a ysp ϵ YS that is ’close’ to ytq ϵ Yt, then
samples drawn from PS(xs | ys = ysp) can be used to augment the class ytq for
retraining the model Pθt(yt | xt). In the subsequent section, we describe a procedure to
find the ’closest’ ysp ϵ YS, given ytq ϵ Yt and a model Pθt(yt | xt) trained on Dt.
[0037] FIG. 3 illustrates an exemplary process flow diagram 200 for building a
classifier in accordance with an embodiment of the invention. Briefly, as illustrated,
first a classifier is trained on the sparsely annotated data, which is then used to match the
modes (classes) of the source data. After mode matching, the classifier is again
trained with the classes from the matched source data along with a directional
regularization loss.
DISTRIBUTIONAL MODE MATCHING (DMM)
[0038] Videos lie in a very high dimensional space and are, in general, of variable
length. Thus, standard vector distance metrics are not feasible for measuring the
closeness of two video objects.
[0039] Further, the objective here is to quantify the distance between the classes
as perceived by the discriminative model (classifier) Pθt(yt | xt) so that data
augmentation is sensible. Thus, we propose to use the maximum posterior likelihood
principle to define the closeness between two classes. Let X(ys = ysp) = {xs1, xs2, …,
xsl} denote the samples drawn from PS(xs | ys = ysp).
[0040] Now Pθt(yt | xsj) denotes the posterior distribution of
the sparsely annotated classes Yt given the jth feature vector from the source class
ysp. With this, a joint posterior likelihood Lyt|xs of a class ysp could be
defined as the likelihood of observing the sparsely annotated classes given the set of features
X(ys = ysp) drawn from that particular source class.
[0041] Mathematically,
Lyt|xs = Pθt(yt | xs1, xs2, …, xsl)    … (1)
[0042] where xsj, j ϵ {1, 2, …, l}, are from the class ysp. If it is assumed that the xsj
are drawn IID, one can express Eq. (1) as,
Lyt|xs = ∏j=1…l Pθt(yt | xsj)    … (2)
[0043] This is because the parameters ‘θ’ of the discriminator model created
using Dt are independent of X(ys = ysp) and are fixed during the evaluation of Lyt|xs,
which implies that yt | xsi is independent of xsj for all i ≠ j, thus leading to Eq. (2).
[0044] The posterior likelihood in Eq. (2) can be evaluated for every sparsely
annotated class yt = ytq, q ϵ {1, 2, …, M}; this quantity, denoted by Lytq|xs, is called the
sparsely annotated-class posterior likelihood corresponding to the features from
source class ysp under the learned classifier Pθt(yt | xt). Mathematically,
Lytq|xs = ∏j=1…l Pθt(yt = ytq | xsj)    … (3)
[0045] With this definition of the sparsely annotated-class posterior likelihood,
we define the matched source class ys*(q) to a given sparsely annotated
class ytq as follows:
ys*(q) = argmax ysp ϵ YS ∏j=1…l Pθt(yt = ytq | xsj), with xsj drawn from ysp    … (4)
[0046] Note that the definition of Lytq|xs is specific to a source-sparsely
annotated class pair and therefore all xsj in the objective function of the optimization
problem in Eq. (4) come from a particular source class. Thus, one can employ the
discriminative classifier trained on the sparsely annotated data to find the ’closest’
matching source class as the one that maximizes the posterior likelihood of observing
that class as the given sparsely annotated class under the classifier. Since every class
in the joint distribution can be looked upon as a ’mode’ and the goal here is to match the
classes (’modes’) in the joint distributions of the source and sparsely annotated
data, the procedure is called distributional mode matching.
[0047] Referring to FIG. 2, which demonstrates the idea of mode matching
through examples - representative frames of the RGB and the optical flow space
from two sparsely annotated classes (Arms-up and Rolly-polly) are shown with the
corresponding matched (using the aforementioned procedure) classes of two source
datasets. It is observed that the optical flow frames of the sparsely annotated and the
source classes have similar visual properties, indicating the closeness.
[0048] Once the matched source class is determined for every given sparsely
annotated class, the set of matched source classes is defined as ys* = {ys*1, ys*2, …,
ys*m}. Now, the discriminative classifier Pθt can be re-trained on the samples from the
source dataset corresponding to ys* in a supervised way, with the class labels being the
corresponding ytq for every ys*q. This procedure thus increases the quantity and variety
of the training data for Pθt.
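The re-labeling step can be sketched as below; `matches` is assumed to map each sparsely annotated class to its matched source class, as produced by the mode matching above, and the data containers are simplified stand-ins.

```python
def relabel_matched_source(source_data, matches):
    """Re-label the samples of each matched source class with the
    corresponding sparsely annotated (target) class, yielding augmented
    training pairs (sample, target label)."""
    augmented = []
    for target_class, source_class in matches.items():
        for sample in source_data[source_class]:
            augmented.append((sample, target_class))
    return augmented

# Toy example: two matched source classes augmenting two target classes.
source_data = {"pull_ups": ["clip_a", "clip_b"], "flic_flac": ["clip_c"]}
matches = {"arms_up": "pull_ups", "rolly_polly": "flic_flac"}
pairs = relabel_matched_source(source_data, matches)
# pairs == [("clip_a", "arms_up"), ("clip_b", "arms_up"), ("clip_c", "rolly_polly")]
```

The resulting pairs are appended to the sparsely annotated training set for re-training Pθt.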
DIRECTIONAL REGULARIZATION (DR)
[0049] The procedure of mode matching described in the previous section
effectively changes the semantic meaning of the matched source classes to that of
the sparsely annotated classes. Thus, it is possible to train a
classifier on the source data to discriminate between the matched source classes
ys* = {ys*1, ys*2, …, ys*m}. Suppose such a classifier is denoted by Pϕs(ys | xs), where ϕ are
the model parameters. It is assumed that Pϕs(ys | xs) and Pθt(yt | xt) have the same
architectural properties. Further, it is also assumed that the source dataset is larger
and more diverse compared to the sparsely annotated dataset. This implies that
Pϕs(ys | xs) has better generalization abilities compared to Pθt(yt | xt); this fact is
leveraged in improving the generalization capabilities of Pθt(yt | xt) using Pϕs(ys | xs).
[0050] Further, during the training of Pϕs(ys* | xs) with samples from ys*, it is
desirable that the separation achieved between the classes in ys* under the
classifier Pϕs(ys* | xs) is ’preserved’ during the training of Pθt(yt | xt) with samples
from ys*. The aforementioned property may be accomplished by imposing a
regularization term during the training of Pθt(yt | xt). Specifically, we propose to push
the Eigen directions of the parameter matrix θ towards those of the parameter matrix ϕ.
Note that ϕ is fixed during the training of Pθt(yt | xt). Intuitively, this implies that the
significant directions of the sparsely annotated parameters should follow those of the
source parameters.
[0051] Mathematically, let Mθ and Mϕ be two square matrices formed by
reshaping (without any preference to particular dimensions) the parameters θ and ϕ,
respectively. Suppose we perform an Eigen-value decomposition on Mθ and Mϕ to
obtain the Eigen vector matrices Eθ and Eϕ, respectively. Let Ẽθ and Ẽϕ denote the truncated
versions of Eθ and Eϕ with the first k significant Eigen vectors (k being a model hyper-parameter). Under
this setting, we desire the Eigen directions Ẽθ and Ẽϕ to be aligned.
Mathematically, if they are perfectly aligned, then
ẼθT Ẽϕ = Ik    … (5)
[0052] where Ik is a k-dimensional identity matrix and T denotes the transpose
operation. Thus, any deviation from the condition laid out in Eq. (5) is penalized by
minimizing the Frobenius norm of the deviation. This is referred to as the directional
regularization, denoted ‘LDR’, which is given by the following equation:
LDR = || ẼθT Ẽϕ – Ik ||F    … (6)
where || · ||F denotes the Frobenius norm of a matrix.
[0053] It shall be noted that this regularizer on θ is imposed during the training
of Pθt(yt | xt), ensuring that the directions of the separating hyperplanes of the classifier are
encouraged to follow those of the source classifier trained with the matched classes.
Finally, the objective function during the re-training of the classifier combines the
supervised classification loss with LDR, where ŷt is the predicted sparsely annotated class.
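A numerical sketch of Eq. (6) follows. Symmetrizing the reshaped matrices before the Eigen-decomposition is an assumption made here so that the decomposition stays real-valued; the patent specifies only an Eigen-value decomposition of Mθ and Mϕ.

```python
import numpy as np

def directional_regularization(theta, phi, k):
    """L_DR = || E_theta~^T E_phi~ - I_k ||_F  (Eq. 6).
    theta, phi: flat parameter vectors of equal, square-reshapable length."""
    n = int(np.sqrt(theta.size))
    M_theta = theta.reshape(n, n)
    M_phi = phi.reshape(n, n)
    # eigh on symmetrized matrices: an assumption for a stable real-valued
    # decomposition with orthonormal Eigen vectors.
    _, E_theta = np.linalg.eigh((M_theta + M_theta.T) / 2)
    _, E_phi = np.linalg.eigh((M_phi + M_phi.T) / 2)
    # eigh returns ascending eigenvalues; the last k columns are the
    # k most significant Eigen directions.
    Et, Ep = E_theta[:, -k:], E_phi[:, -k:]
    return np.linalg.norm(Et.T @ Ep - np.eye(k), ord="fro")

rng = np.random.default_rng(0)
theta = rng.standard_normal(16)
# Identical parameters give perfectly aligned directions, so L_DR is zero.
assert directional_regularization(theta, theta.copy(), k=2) < 1e-8
```

During training, ϕ (and hence Ẽϕ) is held fixed, and only θ is updated against this penalty.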
[0054] FIG. 4 illustrates an exemplary method 300 for building a classifier in
accordance with an embodiment of the invention. Assuming only a small amount of
data from the sparsely annotated distribution, in the proposed method 300, at step 302
the one or more processors of the system 400 train a classifier on the sparsely
annotated data with a plurality of classes. Further, at steps 304 and 306, the closest class
from the source distribution to each of the sparsely annotated classes is estimated using
the classifier. This is realized by computing the posterior likelihood for each of the
classes of the source data using the trained model for each of the plurality of classes of
the sparsely annotated data at step 304, and further matching the source class that
maximizes the posterior likelihood for each of the plurality of classes of the sparsely
annotated data at step 306.
[0055] Further, at step 308, the one or more processors of the system 400 train a
new (relatively robust) classifier on the samples from the source distribution with re-labeled source classes (matched with the sparsely annotated classes). At step 310, a
loss term using a Frobenius norm is computed based on the parameters of the models
Pθt(yt | xt) and Pϕs(ys | xs). Finally, the model is retrained to build a classifier using
samples of the matched source classes with the loss term (directional regularization). This
classifier is the final model for the sparsely annotated data.
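The loss evaluated during the retraining at steps 308-310 can be sketched as the supervised classification term plus the directional term. Treating the combination as an unweighted sum is an assumption made for this sketch, and the truncated Eigen-vector matrices are taken as given inputs.

```python
import numpy as np

def retraining_loss(probs, labels, E_theta_k, E_phi_k):
    """Cross-entropy on matched-source samples re-labeled with the target
    classes, plus the directional regularization L_DR of Eq. (6)."""
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    k = E_theta_k.shape[1]
    ldr = np.linalg.norm(E_theta_k.T @ E_phi_k - np.eye(k), ord="fro")
    return ce + ldr

# With perfectly aligned directions, only the classification term remains.
probs = np.array([[0.9, 0.1]])              # predicted target-class posteriors
loss = retraining_loss(probs, [0], np.eye(2), np.eye(2))
```

Here `loss` reduces to the cross-entropy of the single sample, since the aligned identity matrices make the directional term vanish.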
[0056] In an embodiment, the source dataset is larger than the
sparsely annotated dataset; and wherein the source dataset comprises actions
performed by one or more subjects who are mentally fit.
[0057] In an embodiment, the sparsely annotated dataset comprises actions
performed by one or more subjects suffering from a mental disorder.
[0058] In an embodiment, a loss term is computed using a Frobenius norm (LDR =
|| EθT Eϕ – Ik ||F) based on the parameters of the models Pθt(yt | xt) and Pϕs(ys | xs).
Further, the classifier is built by retraining the model using the matched source classes
and the loss term.
[0059] In an embodiment, the source dataset and the sparsely annotated dataset
comprise pre-processed video data of one or more subjects performing at least one
similar action.
[0060] FIG. 5 illustrates a system 400 in which aspects of the invention may be
implemented. As shown, the system 400 includes, without limitation, a central
processing unit (CPU) 410, a network interface 430, a bus 440, a memory 460 and
storage 450. The system 400 may also include an I/O device interface 420 connecting
I/O devices 470 (e.g., keyboard, display and mouse devices) to the system 400.
[0061] The CPU 410 retrieves and executes programming instructions stored in
the memory 460. Similarly, the CPU 410 stores and retrieves application data
residing in the memory 460. The bus 440 facilitates transmission, such as of
programming instructions and application data, between the CPU 410, I/O device
interface 420, storage 450, network interface 430, and memory 460. CPU 410 is
included to be representative of a single CPU, multiple CPUs, a single CPU having
multiple processing cores, and the like. Further, the memory 460 is generally
included to be representative of a random-access memory. The storage 450 may be a
disk drive storage device. Although shown as a single unit, the storage 450 may be a
combination of fixed and/or removable storage devices, such as tape drives,
removable memory cards or optical storage, network attached storage (NAS), or a
storage area network (SAN). Further, the system 400 is included to be representative of a
physical computing system as well as virtual machine instances hosted on a set of
underlying physical computing systems. Further still, although shown as a single
computing system, one of ordinary skill in the art will recognize that the
components of the system 400 shown in FIG. 5 may be distributed across multiple
computing systems connected by a data communications network.
[0062] As shown, the memory 460 includes an operating system 462, an
application interface 464 and one or more applications. The one or more applications
may include an application program configured to perform operations for building a
classifier. The operations performed by the CPU while executing instructions of the
application program include building a classifier based on a source dataset and a
sparsely annotated dataset; recording and pre-processing a video of a subject
performing an action; and thereafter assessing the functional skill of the subject by
applying the classifier to the pre-processed video of the subject. Further, the
operations for building the classifier include training a classifier on the sparsely
annotated samples; estimating the closest class from the source distribution to all the
sparsely annotated classes using the classifier; training a new (relatively robust)
classifier on the samples from the source distribution with re-labeled source classes
(matched with the sparsely annotated classes); and forming/building the final model/
classifier by using the samples of the matched source classes to re-train the classifier
along with directional regularization.
[0063] FIG. 6A illustrates an example data collection for a training dataset for
assessing subjects suffering with Autism, in accordance with an embodiment. FIG.
6B illustrates an example summary of a training dataset for assessing subjects
suffering with Autism, in accordance with an embodiment.
[0064] In an exemplary embodiment, a dataset that consists of subjects with Autism
is taken as the sparsely annotated dataset, while well known public action datasets like Kinetics and
HMDB51 (A Large Video Database for Human Motion Recognition) are considered
as source datasets. The Kinetics dataset with the Inflated 3D Convolutional Neural net
(I3D) can be used as a backbone model for mode matching. Further, TSN (the Temporal
Segment Networks framework) may be selected for working with the HMDB51 dataset.
[0065] During assessment, a clinician performs the functional assessment by
probing a child for age-appropriate listener response and imitation skills, by invoking
an instruction response and expecting the child to respond through a human action. For
example, in a listener response probing evaluation, a therapist may invoke a child to
identify his/her body parts or perform a simple or complex motor action. Similarly,
imitation skills may be evaluated by a clinician asking a child to imitate touching of a
body part or to imitate a simple or complex motor action. Eight representative human
action responses (summarized in FIG. 6B) are invoked through either listener response or
imitation instruction in this example.
[0066] Videos are recorded in a structured environment with the clinician facing the
child, and three synchronized cameras placed to record the videos. The first camera
faces the therapist, the second faces the student and the third may be positioned
laterally to both the therapist and the student (refer to FIG. 6A), in order to capture all
the angles of the probe trial individually, including the therapist’s instruction and the
child’s response. The videos may then be annotated by trained clinical psychologists.
The annotation details include the age of the child, the instruction delivery to perform an
action (audio and video), the instruction objective (imitation or listener reaction), and the
child’s response (success or failure). Videos are recorded with the same subjects but
with different clinicians for a predetermined period to ensure a large enough data set is
generated for a specific action class and also to ensure a wide representation of
multiple actions. The response of the child to a particular stimulus may be taken as a
human action response classification problem to be measured using a deep learning
model.
[0067] Traditionally in the assessment process, a clinician manually records the
child's observations and action responses to a stimulus, which is time-consuming and
inefficient. The need for developing a deep learning model arises because there are
multiple steps in the assessment process in which multiple instructions are invoked to
a child in a limited time frame, the child's responses are gauged through
demonstration of an action, and the responses are recorded. FIG. 6 depicts various
human actions like 'Arms up', 'Rolly polly', 'Lock hands', 'Touch head' during
training sessions by Indian (or American) clinicians with Indian (or American)
subjects for different positions of the recording camera.
[0068] The total number of samples per Autism class is shown in Table 1; the data
had enough variability. Every sample in the dataset has an Indian or American subject.
FIG. 6B enumerates the number of samples having an Indian or American subject in
every Autism class. FIG. 6B also shows the number of samples in the Autism dataset
on the basis of the age of the subject, i.e., whether the subject is below or above 5
years of age. FIG. 6C lists the number of samples in each Autism class when the
camera is facing either the subject or the clinician.
[0069] FIG. 7A shows the variation in accuracy when new source samples are
augmented with the Autism dataset in every iteration. In FIG. 7A (a), the validation
accuracy increases from the baseline in initial iterations. With distributional mode
matching the accuracy starts dropping after the 2nd iteration, while with directional
regularization it is still increasing after the 2nd iteration and starts dipping only from
the 3rd iteration onwards. This demonstrates the higher tolerance of directional
regularization to newer samples as compared to distributional mode matching: newer
samples are accepted with less surprise under directional regularization, which
enhances generalizability and performance. Similar behavior was observed with TSN,
as shown in FIG. 7A (b).
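The directional regularization referred to above is defined elsewhere in this specification as a Frobenius-norm penalty LDR = || EθTEϕ – Ik ||F between embedding matrices of the target and source models. A minimal numpy sketch of that term, with random placeholder matrices standing in for the actual model parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def directional_regularization(E_theta, E_phi):
    """L_DR = || E_theta^T @ E_phi - I_k ||_F : penalizes the target
    embedding directions for drifting away from the source directions."""
    k = E_theta.shape[1]
    return np.linalg.norm(E_theta.T @ E_phi - np.eye(k), ord="fro")

# Placeholder d x k embedding matrices for the two models.
d, k = 8, 4
E_phi = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal columns

# Perfectly aligned embeddings incur (numerically) zero penalty.
zero_penalty = directional_regularization(E_phi, E_phi)
print(round(zero_penalty, 6))  # 0.0

# A perturbed target embedding incurs a positive penalty.
E_theta = E_phi + 0.1 * rng.normal(size=(d, k))
print(directional_regularization(E_theta, E_phi) > 0)  # True
```

During re-training, such a term would be added to the classification loss, making the model less "surprised" by newly augmented source samples.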
[0070] FIG. 7B shows validation accuracies on each disjoint fold having 10% (or
20%) of the Autism data. The average accuracy with standard deviation for every
baseline model, along with distributional mode matching and directional
regularization, is disclosed for I3D and TSN. With this approach (DMM, DR, or
DMM+DR), the performance is much better than the baseline models, as may be
clearly noted from FIG. 7B. The number of source samples from Kinetics/HMDB51
augmented with the sparsely annotated Autism dataset is comparable to the number of
samples in the Autism classes. It is evident that the method is not biased towards
selection of a specific group of data for training.
[0071] This method also reduces the misclassification error, as shown in FIG. 7C (d).
8-dimensional embeddings from the penultimate layer of I3D and TSN are plotted
using t-Distributed Stochastic Neighbor Embedding (t-SNE) for the 10% baseline
model with and without distributional mode matching and directional regularization,
as shown in FIG. 7C. This ascertains that with the proposed invention the inter-class
separability of samples has increased while the intra-class separability has decreased.
This makes the model more confident towards samples belonging to a class.
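The inter-/intra-class separability observation above can be made concrete with a small sketch: given embeddings and labels, compare the mean distance between class centroids (inter-class) against the mean distance of samples to their own centroid (intra-class). The embeddings below are synthetic placeholders, not actual I3D/TSN features:

```python
import numpy as np

rng = np.random.default_rng(2)

def separability(E, y):
    """Return (inter, intra): mean pairwise distance between class
    centroids, and mean distance of samples to their own centroid."""
    classes = np.unique(y)
    cents = np.array([E[y == c].mean(0) for c in classes])
    inter = np.mean([np.linalg.norm(a - b)
                     for i, a in enumerate(cents) for b in cents[i + 1:]])
    intra = np.mean([np.linalg.norm(E[y == c] - cents[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    return inter, intra

# Placeholder 8-dimensional embeddings for two classes, before and after
# training with mode matching and directional regularization.
y = np.repeat([0, 1], 50)
centers = np.array([[1.0] * 8, [-1.0] * 8])
before = centers[y] + rng.normal(0, 1.5, size=(100, 8))
after = 1.5 * centers[y] + rng.normal(0, 0.5, size=(100, 8))

inter_b, intra_b = separability(before, y)
inter_a, intra_a = separability(after, y)
print(inter_a > inter_b and intra_a < intra_b)  # True
```

Higher inter-class and lower intra-class distances are exactly the behavior the t-SNE plots of FIG. 7C are described as showing.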
[0072] The processes described above are presented as a sequence of steps solely
for the sake of illustration. Accordingly, it is contemplated that some steps may be
added, some steps may be omitted, the order of the steps may be re-arranged, or
some steps may be performed simultaneously. The example embodiments described
herein may be implemented in an operating environment comprising software
installed on a computer, in hardware, or in a combination of software and hardware.
Some of the embodiments may also be directed to a sequence of instructions stored in
a computer readable medium, such that the sequence of instructions, when executed
by one or more processing devices, allows the processing devices to operate as
described herein. The computer readable medium can include any primary storage
devices or secondary storage devices.
[0073] Although embodiments have been described with reference to specific
example embodiments, it will be evident that various modifications and changes may
be made to these embodiments without departing from the broader spirit and scope of
the system and method described herein. Accordingly, the specification and
drawings are to be regarded in an illustrative rather than a restrictive sense.
[0074] Many alterations and modifications of the present invention will no doubt
become apparent to a person of ordinary skill in the art after having read the
foregoing description. It is to be understood that the phraseology or terminology
employed herein is for the purpose of description and not of limitation. It is to be
understood that while the description above contains many specifics, these should not
be construed as limiting the scope of the invention but as merely providing
illustrations of some of the presently preferred embodiments of this invention. Thus,
the scope of the invention should be determined by the appended claims and their
legal equivalents rather than by the examples given herein.
I/WE CLAIM:
1. A functional skill assessment method using video-to-video adaptation for
recognizing actions of a subject, the method comprising:
with at least one processor (410) of one or more computing devices (400):
building a classifier based on a source dataset and a sparsely
annotated dataset, the method for building the classifier comprising:
training a model (Pθt (yt | xt)) on the sparsely annotated dataset
comprising a plurality of classes;
computing a posterior likelihood for each of the classes of the
source dataset using the trained model for each of the plurality
of classes of the sparsely annotated data set;
matching the source dataset class that maximizes the posterior
likelihood for each of the plurality of classes of the sparsely
annotated data;
training a classifier on the source data using matched source
classes;
building the classifier by retraining the model using matched
source classes; and
assessing the functional skill of a subject by applying the classifier
to a video of the subject performing an action.
2. The method as claimed in claim 1, wherein the sparsely annotated dataset
comprises actions performed by one or more subjects suffering from a mental
disorder.
3. The method as claimed in claim 1, wherein the source dataset is bigger in size
as compared to sparsely annotated dataset; and wherein the source dataset
comprises actions performed by one or more subjects who are mentally fit.
4. The method as claimed in claim 1, further comprising: computing a loss term
using a Frobenius norm (LDR = || EθTEϕ – Ik ||F) based on the parameters of the
models Pθt (yt | xt) and Pϕs (ys | xs); and building the classifier by retraining the
model using matched source classes and loss term.
5. The method as claimed in claim 1 wherein the source dataset and the sparsely
annotated dataset comprise pre-processed video data of one or more subjects
performing at least one similar action.
6. A functional skill assessment system using video-to-video adaptation for
recognizing actions of a subject, the system comprising:
at least one processor (410); and
a memory (460), wherein the memory includes an application program
configured to perform operations to assess the functional skill of a subject,
the operations comprising:
build a classifier based on a source dataset and a sparsely annotated
dataset, the method for building the classifier comprising:
train a model (Pθt (yt | xt)) on the sparsely annotated dataset
comprising a plurality of classes;
compute a posterior likelihood for each of the classes of the
source dataset using the trained model for each of the plurality
of classes of the sparsely annotated data set;
match the source dataset class that maximizes the posterior
likelihood for each of the plurality of classes of the sparsely
annotated data;
train a classifier on the source data using matched source
classes;
build the classifier by retraining the model using matched
source classes; and
assess the functional skill of a subject by applying the classifier to a
video of the subject performing an action.
7. The system as claimed in claim 6, wherein the sparsely annotated dataset
comprises actions performed by one or more subjects suffering from a mental
disorder.
8. The system as claimed in claim 6, wherein the source dataset is bigger in size
as compared to sparsely annotated dataset; and wherein the source dataset
comprises actions performed by one or more subjects who are mentally fit.
9. The system as claimed in claim 6, wherein the operations performed by the
application program in the memory further comprise: compute a loss term
using a Frobenius norm (LDR = || EθTEϕ – Ik ||F) based on the parameters of the
models Pθt (yt | xt) and Pϕs (ys | xs); and build the classifier by retraining the
model using matched source classes and loss term.
10. The system as claimed in claim 6, wherein the source dataset and the sparsely
annotated dataset comprise pre-processed video data of one or more subjects
performing at least one similar action.