Abstract: ABSTRACT METHOD AND SYSTEM FOR PERSON IDENTIFICATION BY A MOBILE ROBOT This disclosure relates generally to a method and system for person identification. State-of-the-art methods for person identification are convolutional neural network (CNN) based learning of attribute features, semantic learning, and their combination. However, all these methods generally utilize facial features along with other descriptors, and the CNNs are trained by using images of the person stored in a database. Person identification while the person is followed by a mobile robot from behind has not yet been achieved. The disclosed method utilizes a combination of deep networks with the limited attribute features that are feasible to capture, as the mobile robot that follows a person is exposed only to the rear view of the person. The models in the present disclosure learn the attribute features on the fly to come up with strategies to both follow a target person based on the rear view and identify the target person once the person comes back into view. The present disclosure is also capable of re-identifying the target person upon re-appearance of the target person in the field of view (FOV) of the mobile robot. [To be published with FIG. 5]
Description:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR PERSON IDENTIFICATION BY A MOBILE ROBOT
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to deep network based in-motion subject identification, and, more particularly, to a method and system for person identification by a mobile robot.
BACKGROUND
[002] In computer vision applications, person identification involves identifying a person of interest at another time or place than the earlier observed time and place. Its applications range from tracking people through cameras to searching for them in large galleries, and from grouping photos in a photo album to visitor analysis in a retail store. Like many visual recognition problems, variations in pose, viewpoint, illumination, and occlusion make the identification challenging. Person re-identification (ReID) is another big challenge when a robot has to identify the same person after the person goes out of the robot's field of view (FOV) and comes back within the FOV after some time. Convolutional neural network (CNN) methods are widely used for ReID, especially in security systems where such methods match and recognize the identities of pedestrians captured by multiple cameras with non-overlapping views, which is significant to improve the efficiency of the security system. Owing to the low resolution of cameras, it is hard to obtain discriminative face features, so the current ReID methods are mainly based on visual features of the pedestrians, such as color and texture. In practice, changes in viewpoint, pose, and illumination among different camera views, as well as partial occlusions and background clutter, pose a great challenge to person ReID. Two principal person identification methods are feature representation and metric learning. Feature representation seeks to find features with stronger discrimination and better robustness to represent the pedestrians. Many kinds of features have been utilized for this, of which appearance features are the simplest and the most popular ones. Color, texture, and shape are the features that can be extracted for human appearance in feature representation, such as the HSV (hue, saturation, value) color histogram, local binary pattern (LBP) texture, and Gabor features, which are then used for re-identifying people based on similarity among pedestrian features. Attribute features are also widely used in person ReID. Common attributes include gender, length of hair, and clothing. These attributes are highly intuitive and understandable descriptors which have proved to be successful in several tasks, such as face recognition and activity recognition. Although attribute features are complicated in terms of extraction and expression, they contain rich semantic information and are more robust to illumination and viewpoint changes. Traditional approaches have focused on low-level features such as colors, shapes, and local descriptors. With the renaissance of deep learning, the CNN has dominated this field by learning features in an end-to-end fashion through various metric learning losses.
[003] However, utilizing the principles of feature representation and metric learning becomes more challenging when the robot follows the person by walking behind the person. In this situation, the robot can see only the back portion (rear view) of the person. While capturing the person from the back, facial feature recognition methods become impractical. While the robot follows a moving person, the camera (mounted on the robot) also moves with it. Sometimes, the moving person may go out of the field of view (FOV) of the robot, and the person may come back after some time. This creates a dynamic environment for person re-identification. Another challenge arises when the person is tagged to be followed (e.g. for surveillance purposes) but the person's image does not exist in the database. In such a situation, learning and extracting features from a prior image in the database becomes impossible. Conventional methods allow a CNN to learn from the images fed earlier during training, and the trained CNN is used for ReID. However, this is not the ideal case. Many times, a completely new person is required to be identified, and prior training by feeding the image to the CNN might not be feasible. Further, conventional methods generally process the images (or videos) of the person to be identified, wherein such images (or videos) are captured through a static camera mounted at surveillance points. However, in the case of a mobile robot, the camera is mounted on the mobile robot and moves along with the robot. This creates a dynamic and complex environment for person re-identification and person following.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method of identification of a target person by a mobile robot is provided. The method includes obtaining, by the mobile robot, an image of the target person to be followed by the mobile robot. The mobile robot follows the target person from behind and captures images through a camera mounted on the mobile robot. The RGB image is captured by the camera mounted on the mobile robot. The method further includes receiving information about the target person from a person database. The person database contains identifiers of the target person, including the name, along with corresponding features of that person in a machine learning model form. When the target person is unknown, the information is not fetched from the person database; instead, the features are learnt on the go and get stored in the database. The method further includes tagging the target person, wherein the tagging is executed through a voice command or manually by a click. If none of the person models is matched, the detected person is tagged as 'unknown person', followed if so instructed, and can be entered in the person model database if annotated by the user. The alternative way of person tagging happens in remote view scenarios via a UI, where the user tags the person among the many persons (identified by a deep learning methodology for object detection) visible on the screen, by clicking on the target person. The method further includes extracting a segmented mask of the target person from the image using a deep learning-based segmentation as a region of interest (ROI). The segmented mask is a region of interest comprising a plurality of features relevant to learn and detect the identity of the target person. The method further includes passing the segmented mask of the target person to a feature extractor model to extract a plurality of distinctive features. The feature extraction involves a CNN, wherein the image of the target person is provided to the CNN having multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map. A global pooling is applied to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features. Simultaneously, a horizontal pooling is applied to the feature map to reduce channel dimensions of the feature map. The output of both pooling operations is combined to obtain a summary of the entire image as well as the distinctive features of the target person. The method further includes processing the extracted features in three ways to obtain three models, wherein the first way involves processing of the entire mask features, the second way involves processing of partitioned mask features, and the third way involves processing of semantic mask features. The above three models are evaluated at testing time to give a majority voting on their individual confidence scores for target person re-identification. In the case of the first two deep models, the cosine similarity of the current person mask with the target person mask is computed in an embedding space.
In the semantic model, matching happens over a SPARQL query with an OPTIONAL clause to match current person features with target person features, and if the count of matches is above a threshold t, the target person is classified. The method further includes assigning a rank to each of the models based on the individual confidence score of each model in deriving similarity of the processed mask of the person in view with the target person of the person database. The method further includes identifying the target person with the model having the highest match.
[005] In another aspect, a system for identification of a target person by a mobile robot is provided. The system includes at least one memory storing programmed instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors and a person identification model operatively coupled to the corresponding at least one memory, wherein the system is configured to obtain, via the one or more hardware processors, an image of the target person to be followed by the mobile robot. The mobile robot follows the target person from behind and captures images through a camera mounted on the mobile robot. The RGB image is captured by the camera mounted on the mobile robot. Further, the system is configured to receive, via the one or more hardware processors, information about the target person from the person database. The person database contains identifiers of the target person, including the name, along with corresponding features of that person in a machine learning model form. When the target person is unknown, the information is not fetched from the person database; instead, the features are learnt on the go and get stored in the database. Further, the system is configured to tag, via the one or more hardware processors, the target person, wherein the tagging is executed through a voice command or manually by a click. If none of the person models is matched, the detected person is tagged as 'unknown person', followed if so instructed, and can be entered in the person model database if annotated by the user. The alternative way of person tagging happens in remote view scenarios via a UI, where the user tags the person among the many persons (identified by a deep learning methodology for object detection) visible on the screen, by clicking on the target person. Further, the system is configured to extract, via the one or more hardware processors, a segmented mask of the target person from the image using a deep learning-based segmentation as a region of interest (ROI). The segmented mask is a region of interest comprising a plurality of features relevant to learn and detect the identity of the target person. Further, the system is configured to pass, via the one or more hardware processors, the segmented mask of the target person to a feature extractor model to extract a plurality of distinctive features. The feature extraction involves a CNN, wherein the image of the target person is provided to the CNN having multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map. A global pooling is applied to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features. Simultaneously, a horizontal pooling is applied to the feature map to reduce channel dimensions of the feature map. The output of both pooling operations is combined to obtain a summary of the entire image as well as the distinctive features of the target person.
Further, the system is configured to process, via the one or more hardware processors, the extracted features in three ways to obtain three models, wherein the first way involves processing of the entire mask features, the second way involves processing of partitioned mask features, and the third way involves processing of semantic mask features. The above three models are evaluated at testing time to give a majority voting on their individual confidence scores for target person re-identification. In the case of the first two deep models, the cosine similarity of the current person mask with the target person mask is computed in an embedding space. In the semantic model, matching happens over a SPARQL query with an OPTIONAL clause to match current person features with target person features, and if the count of matches is above a threshold t, the target person is classified. Further, the system is configured to assign, via the one or more hardware processors, a rank to each of the models based on the individual confidence score of each model in deriving similarity of the processed mask of the person in view with the target person of the person database. Finally, the system is configured to identify the target person with the model having the highest match.
[006] In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer program for identification of a target person by a mobile robot is provided. The computer readable program, when executed on a computing device, causes the computing device to obtain, via the one or more hardware processors of the mobile robot, an image of the target person to be followed by the mobile robot. The robot follows the target person from behind and captures images through a camera mounted on the mobile robot. The RGB image is captured by the camera mounted on the mobile robot. The computer readable program, when executed on a computing device, causes the computing device to receive, via the one or more hardware processors, information about the target person from the person database. The person database contains identifiers of the target person, including the name, along with corresponding features of that person in a machine learning model form. When the target person is unknown, the information is not fetched from the person database; instead, the features are learnt on the go and get stored in the database. The computer readable program, when executed on a computing device, causes the computing device to tag, via the one or more hardware processors, the target person, wherein the tagging is executed through a voice command or manually by a click. If none of the person models is matched, the detected person is tagged as 'unknown person', followed if so instructed, and can be entered in the person model database if annotated by the user. The alternative way of person tagging happens in remote view scenarios via a UI, where the user tags the person among the many persons (identified by a deep learning methodology for object detection) visible on the screen, by clicking on the target person. The computer readable program, when executed on a computing device, causes the computing device to extract, via the one or more hardware processors, a segmented mask of the target person from the image using a deep learning-based segmentation as a region of interest (ROI). The segmented mask is a region of interest comprising a plurality of features relevant to learn and detect the identity of the target person. The computer readable program, when executed on a computing device, causes the computing device to pass, via the one or more hardware processors, the segmented mask of the target person to a feature extractor model to extract a plurality of distinctive features. The feature extraction involves a CNN, wherein the image of the target person is provided to the CNN having multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map. A global pooling is applied to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features. Simultaneously, a horizontal pooling is applied to the feature map to reduce channel dimensions of the feature map. The output of both pooling operations is combined to obtain a summary of the entire image as well as the distinctive features of the target person.
The computer readable program, when executed on a computing device, causes the computing device to process, via the one or more hardware processors, the extracted features in three ways to obtain three models, wherein the first way involves processing of the entire mask features, the second way involves processing of partitioned mask features, and the third way involves processing of semantic mask features. The above three models are evaluated at testing time to give a majority voting on their individual confidence scores for target person re-identification. In the case of the first two deep models, the cosine similarity of the current person mask with the target person mask is computed in an embedding space. In the semantic model, matching happens over a SPARQL query with an OPTIONAL clause to match current person features with target person features, and if the count of matches is above a threshold t, the target person is classified. Further, the computer readable program, when executed on a computing device, causes the computing device to assign, via the one or more hardware processors, a rank to each of the models based on the individual confidence score of each model in deriving similarity of the processed mask of the person in view with the target person of the person database. Finally, the computer readable program, when executed on a computing device, causes the computing device to identify, via the one or more hardware processors, the target person with the model having the highest match.
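By way of a purely illustrative, non-limiting sketch of the flow summarised above (tag, segment, extract features, score with a plurality of models, rank, and identify), the following Python skeleton may be considered. All function names, the toy scoring logic, and the threshold value are hypothetical placeholders and do not represent the claimed implementation.

```python
# A minimal, illustrative sketch (not the claimed implementation) of the flow
# summarised above: tag -> segment -> extract features -> score with several
# models -> rank -> identify. All names and the toy scoring are hypothetical.
import numpy as np


def extract_mask(rgb_frame: np.ndarray) -> np.ndarray:
    # Placeholder for deep-learning-based segmentation of the tagged person.
    return rgb_frame  # in practice: a segmented ROI of the target person


def extract_features(mask: np.ndarray) -> np.ndarray:
    # Placeholder for the CNN feature extractor (global + horizontal pooling).
    return mask.mean(axis=(0, 1))  # toy per-channel summary


def identify(frame, target_embedding, model_scorers, threshold=0.7):
    """Score one rear-view observation with a plurality of models and rank them."""
    features = extract_features(extract_mask(frame))
    scores = {name: scorer(features, target_embedding)
              for name, scorer in model_scorers.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_model, best_score = ranked[0]
    return best_score >= threshold, best_model, ranked


if __name__ == "__main__":
    cosine = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scorers = {"entire-mask": cosine, "partitioned-mask": cosine,
               "semantic": lambda a, b: 1.0}  # stand-in for a SPARQL match count
    frame = np.random.rand(480, 640, 3)
    target = extract_features(extract_mask(frame))
    print(identify(frame, target, scorers))
```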
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[009] FIG. 1 illustrates an exemplary system, also referred to as 'mobile robot', for person identification when the mobile robot and the person are in motion, according to some embodiments of the present disclosure.
[010] FIG. 2 illustrates the system architecture of the mobile robot including actuation and re-identification, according to some embodiments of the present disclosure.
[011] FIG. 3 illustrates a transaction sequence for the person identification, according to some embodiments of the present disclosure.
[012] FIG. 4 is a flow diagram of an illustrative method for person identification by the mobile robot, in accordance with some embodiments of the present disclosure.
[013] FIG. 5 illustrates the flow of input image data to the different models which process the information differently to come to inferences, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[014] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[015] As used herein, the terms "person", "people", "persons", "humans", and "users" are used interchangeably throughout the specification, and refer to the human being to be re-identified according to the method and system disclosed in the present invention.
[016] As used herein, terms or acronyms such as "deep network", "Convolutional Neural Network (CNN)", "CNN", "Neural Network (NN)", "Deep Neural Network (DNN)", "recurrent neural network", "RNN", and/or the like may be referenced interchangeably throughout the specification.
[017] With robots beginning to share spaces with humans in uncontrolled environments, social robot navigation has become an increasingly relevant area of robotics research. However, robots navigating around humans still face many challenges. Recent works have shown that it is insufficient for robots to simply consider humans as "dynamic obstacles". Person identification plays a key role in applications where a mobile robot needs to track its users over a long period of time, even if they are partially unobserved for some time, in order to follow them or be available on demand. A mobile robot may follow a person for multiple reasons, such as tracking, surveillance, or guidance. In the context of social robotics, a person detection and re-identification component is imperative. In order for a robot to be able to correctly identify a person, it must be able to precisely map and model the person's position. A person identification system involves assigning the same identifiers to the exact same people in a temporal series of images. The problem of person identification by a mobile robot is complex, as the performance of such a system can be strongly impacted by multiple factors. The main reason behind the complexity is that when the robot follows the person, mostly the back view is available and facial features can hardly be recognized. In the absence of prominent facial feature recognition, person identification becomes difficult. Designing a system that performs well in a variety of scenarios, such as identification and re-identification, implies creating a mechanism that is able to overcome problems such as the dynamic environment created when a mobile robot follows the person from behind. While following the person, the camera mounted on the robot is also moving. The mobile robot can capture images only while the person is within its field of view (FOV). As the person goes out of the FOV and then comes back within the FOV, the mobile robot has to be capable of re-identifying the person. To handle these types of problems, the disclosed invention provides a CNN-based method using attribute features as well as on-the-fly metric learning of features for person re-identification.
[018] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
[019] FIG. 1 illustrates an exemplary system 100, which is a 'mobile robot', for person identification by the mobile robot, according to some embodiments of the present disclosure.
[020] In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104, being hardware processors, can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions 'processors' and 'hardware processors' may be used interchangeably. The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting a number of devices to one another or to another server. The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the system 100 comprises a person identification model 110 that performs the person identification through various modules. In an embodiment, the person identification model 110 includes a gesture detection module 110A, a pose detection module 110B, a bone/joints detection module 110C, and an accessories/apparel detection module 110D, functionally connected to the network to receive the image/video stream from the camera 112 mounted on the mobile robot 100. In an embodiment, the memory 102 may include a database or repository. The memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database may be external (not shown) to the system 100 and coupled via the I/O interface 106. The memory 102 includes the gesture detection module 110A for learning and identifying movements of the hands, head, etc. of a person to be re-identified. The memory 102 further includes the pose detection module 110B that learns and identifies features by dividing the body into various parts, e.g. head-shoulder, upper body, and lower body. The memory 102 further includes the bone/joints detection module 110C that learns and identifies body movements by focusing on changes in joint positions while the person is on the move. The memory 102 further includes the accessories/apparel detection module 110D that learns and identifies apparel and accessories, and their color, texture, design, and pattern.
The memory 102 further includes a plurality of modules (not shown here) comprising programs or coded instructions that supplement applications or functions performed by the system 100 for person re-identification by the mobile robot. The plurality of modules, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules may also be used as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules can be implemented in hardware, as computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. The plurality of modules can include various sub-modules (not shown).
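By way of illustration only, the following sketch shows one possible way the person identification model 110 could compose the outputs of the sub-modules 110A-110D described above into a single feature description of the person in view. The interface and the merge strategy shown here are assumptions made for clarity and are not the actual implementation.

```python
# Illustrative composition of the person identification model 110 from its
# sub-modules (gesture 110A, pose 110B, bone/joints 110C, accessories 110D).
# The Protocol interface and the merge strategy are assumptions for clarity.
from typing import Dict, Protocol
import numpy as np


class DetectionModule(Protocol):
    def extract(self, frame: np.ndarray) -> Dict[str, float]:
        """Return named attribute features identified from a frame."""


class PersonIdentificationModel:
    def __init__(self, gesture, pose, bone_joints, accessories):
        self.modules = {"gesture": gesture, "pose": pose,
                        "bone_joints": bone_joints, "accessories": accessories}

    def describe(self, frame: np.ndarray) -> Dict[str, float]:
        # Merge the attribute features produced by every sub-module into a
        # single description of the person currently in view.
        features: Dict[str, float] = {}
        for name, module in self.modules.items():
            for key, value in module.extract(frame).items():
                features[f"{name}.{key}"] = value
        return features
```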
[021] FIG. 2 illustrates the system architecture for the mobile robot 100 including actuation and re-identification, according to some embodiments of the present disclosure.
[022] Consider a scenario, wherein a specific person of interest is to be tracked and identified by the mobile robot.
[023] At the step 202, the system 100 prompts a user to tag a person to be followed. Tagging is done at time instant t0 and the time gets recorded by the robot as t0. Tagging is the first step in allocating the mobile robot 100 to follow the person among a group of people. The user can tag the person in a plurality of ways. One way is to tag the person through a voice command. Another way is to tag the person through a user interface (UI). When the user gives the voice command by saying the name of the person to be tagged, the voice gets converted to text. The robot looks up the text in an existing person database and identifies the matching name in the database. The existing person database contains identifiers of a person including the name along with corresponding features of that person in a machine learning model form. The alternative way of person tagging happens in remote view scenarios via the UI, where the user tags the person among the many persons visible on screen, by clicking on the target person. This tagging via the UI utilizes a deep learning-based object detection algorithm. In an embodiment of the present invention, the YOLO 4 methodology is utilized for tagging the person. Upon clicking, the robot matches the features of the tagged person in the existing person database. In case the tagged name or the clicked person does not find a suitable match in the existing person database, the system 100 tags the person as 'unknown person'. The system 100 then prompts the user to manually annotate the unknown person to be followed. At the step 204, a segmented mask of the tagged person is extracted as a region of interest (ROI). After tagging, the ROI extraction is performed in both cases, whether the tagging is done through the voice command or through the UI. The extraction of the segmented mask of the tagged person is executed by using a deep learning-based segmentation algorithm. According to one embodiment of the present disclosure, the segmented mask of the target person is extracted (via YOLO segmentation) as the region of interest (ROI). At the step 206, the ROI thus obtained for the segmented mask is then passed through the feature extractor model to learn the model of the target person. When the target person is a new entry and does not exist in the person database, the learned features get stored in the person database. In case the target person is found to exist in the person database, the model learns the features and the person database gets refined by the further new observations of the target person. At the step 208, in the next observation by the robot at time instant (t0 + dt), the scene is processed to check whether the target person exists or not. This is done by checking the similarity of the tagged person in view with the models in the person database. As a result of the similarity checking, the robot action depends on two possibilities: the first possibility of having the target person in view, and the second possibility of not having the target person in view. At the step 210, the robot works on the first possibility. If the target person is still in view, i.e. the person exists within the field of view (FOV) region of the robot, the robot starts tracking the target person. In the tracking, the robot follows the mask of the target person in subsequent frames by trajectory planning. The trajectory planning algorithm guides the robot to keep a safe distance from the target person.
This prevents undesired instances of collision, damage to the robot machinery, or defeat of the purpose of surveillance. At the step 212, the robot works on the second possibility. If the target person is not in view, then the robot searches for locations to find the target person. The mobile robot 100 searches for the target person based on three calculations. First, the mobile robot estimates the last seen location of the target person. Second, the mobile robot 100 scans the trajectory of the target person from time t0 until its last seen location. Third, the mobile robot 100 assesses the observed scene context. According to an embodiment of the present disclosure, the actuation space of the mobile robot 100 is pre-defined. The actuation space comprises robot wheels that can move left, right, and forward. Also, a plurality of camera movements is configured in the mobile robot 100. The camera view can change in the left and right directions, can be zoomed in and zoomed out, and can be tilted up and down. While the RGB channel is used to process scenes, the depth image is used to locate free space and avoid collisions with obstacles.
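As a purely illustrative sketch of the tagging step described above, the snippet below converts a spoken name (already transcribed to text) into a database look-up and falls back to an 'unknown person' tag when no model matches. The speech-to-text conversion is assumed to happen elsewhere, and the database layout shown is a hypothetical placeholder rather than the disclosed implementation.

```python
# Illustrative sketch of the tagging step: a voice command converted to text is
# looked up in the person database; an unmatched person is tagged as
# 'unknown person' pending manual annotation by the user. The database layout
# is an assumption made for illustration only.
from typing import Dict, Optional, Tuple


def tag_by_voice(command_text: str,
                 person_db: Dict[str, dict]) -> Tuple[str, Optional[dict]]:
    """Return (tag, person_model) for the spoken name, or an 'unknown person' tag."""
    name = command_text.strip().lower()
    for known_name, model in person_db.items():
        if known_name.lower() == name:
            return known_name, model            # target person found in database
    return "unknown person", None               # prompt the user to annotate


# Example: a toy person database keyed by name, holding learned feature models.
person_db = {"Asha": {"embedding": [0.1, 0.4, 0.2]}}
print(tag_by_voice("asha", person_db))          # ('Asha', {...})
print(tag_by_voice("Ravi", person_db))          # ('unknown person', None)
```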
[024] FIG. 3 illustrates a transaction sequence for the person identification, according to some embodiments of the present disclosure.
[025] At step 302 of FIG. 3, the RGB image is passed to a person detector module along with the mobile robot's odometry position at time t0. The person detector module is a deep learning based convolutional neural network (CNN) that identifies person instances along with their bounding boxes and masks. Along with the RGB image, a depth perception can optionally also be given as input. In an embodiment of the present invention, the deep learning based convolutional neural network (CNN) is a COCO trained YOLO model. The YOLO model looks at the complete RGB image and uses just one CNN to predict the bounding boxes and the class probabilities. In this person detector module, the image is split into an SxS grid with each grid box having m bounding boxes. For each bounding box, YOLO outputs a class probability and offset values for the bounding box. The boxes having a class probability above a threshold value are selected and used to locate the objects in the image. Therefore, the person detector module processes the RGB image into person instances, the bounding boxes of the instances, and the masks of the instances. At step 304, the CNN is trained for the person re-identification. For training, the mask of the person, either tagged or detected via the models, is passed over a time window of S seconds to a feature extractor. For real-time identification of the person by the robot, a bounding box is placed in the centre of the FOV where the target person has to stand, and the robot collects images in the beginning to extract the target's features and train its model to re-identify. The main features extracted are the type of clothes and accessories by a cloth detection network, the ratio of body parts from PoseNet, visible body features (including head), skeletal features and gait patterns over this time window, and facial features (if the view is available). First, the single shot detector (SSD) creates bounding boxes for person detection via convolutional neural networks (CNNs) in a 2D image sequence. Then, the 2D coordinates are converted to 3D coordinates in a world space by reducing the point cloud to the volume of the human body based on the output of the SSD detector for estimating distances. Meanwhile, the color feature is extracted from the region of interest, also based on the output of the SSD detector. Then, distance and labeled color are obtained to identify and track the target person. The ratio of body parts is calculated by using PoseNet, which is a real-time pose detection technique to detect poses of a human being. It is a deep learning TensorFlow model that estimates human pose by detecting body parts such as elbows, hips, wrists, knees, and ankles, forming a skeleton structure of the pose by joining these points.
The feature extraction is performed by a process comprising the steps of: (a) providing the image of the target person to a convolutional neural network (CNN), wherein the CNN comprises multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map; (b) applying a global pooling to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features, wherein the reduced spatial dimensions with a C-dimensional vector represent a summary of the entire image; (c) simultaneously, applying a horizontal pooling to the feature map to reduce channel dimensions of the feature map by applying a 1x1 convolution to obtain local features, wherein the reduced channel dimensions with a c-dimensional vector represent horizontal parts of the image; (d) combining the C-dimensional vector representing the summary of the entire image and the c-dimensional vector representing the horizontal parts of the image to obtain the distinctive features of the target person. The extracted features are processed in three ways using three different models. In the first way, at step 306, the extracted features are fed to the model-1 (CNN v1). The extracted features in the form of RGBSD layers as a single image are given as input to the ConvNet. Similar to conventional CNN architectures, the network contains convolutional layers, fully-connected layers, and an output layer. The entire feature set is passed through a ViViT (Video Vision Transformer) to learn a 128-dimensional embedding for the target person. In the second way, at step 308, the mask of the person is partitioned into X parts by the model-2. In this case, X is set to 3, as head, middle body, and lower body (extracted from Hierarchical PoseNet information), and X sets of 128-dimensional embeddings are learned for the target person. The second model (CNN v2) uses two convolutional streams, where the input is the RGB channels for one stream and just the stereo depth image for the other. In the fully connected layer, the input is a combination of the flattened outputs from those two convolutional streams. In the third way, at step 310, a semantic model of the person is extracted by the model-3 (CNN v3). The third ConvNet (CNN v3) is a regular RGB image-based CNN. It has a similar structure to that of the first model. The semantic model of the person is extracted in a semantic web technology compliant format of RDF facts, by matching the existence of feature types in discrete form as seen in the person's mask (like hair color, cloth types with color, hair type, etc.). At step 312, the individual outputs of all the three models are evaluated at testing time to give a majority voting on their individual confidence scores for target person re-identification. In the case of the first two deep models, the cosine similarity of the current person mask with the target person mask is computed in the embedding space. In the semantic model, matching happens over a SPARQL (a standard query language and protocol for Linked Open Data on the web or for RDF triplestores) query with an OPTIONAL clause to match current person features with target person features, and if the count of matches is above a threshold t, the target person is classified. As a collaborative outcome of all the three models, the person is identified by the robot. If the person is new, then the system needs to learn the features for some time initially. The system does not have any background information about the person, and the model learns the features on the go.
So, only after 10 seconds (S = 10), the person model gets updated at intervals, and the feature updates of the model happen over a sliding overlapping window of segmented pixels from images of that person. These are fed to the embedding space to represent the target person. Therefore, the robot gets trained to identify the target person. At step 314, the trained robot can be utilized in a plurality of ways once it is able to identify the target person. The robot can track the target person. The robot can follow the target person. The robot can perform trajectory planning as well as actuation.
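As a minimal PyTorch sketch, illustrative only, of the feature extraction steps (a)-(d) described above: a small CNN backbone produces a feature map, global pooling yields a C-dimensional summary, a horizontal-pooling branch with a 1x1 convolution yields c-dimensional part-wise local features, and the two are combined. The backbone depth, channel sizes, and input resolution are assumptions and not the disclosed architecture.

```python
# Illustrative feature extractor: (a) CNN backbone -> feature map, (b) global
# pooling -> C-dimensional summary, (c) horizontal pooling + 1x1 convolution
# -> part-wise local features, (d) concatenation of both. Layer sizes are
# assumptions for illustration only.
import torch
import torch.nn as nn


class RearViewFeatureExtractor(nn.Module):
    def __init__(self, channels: int = 64, local_channels: int = 16):
        super().__init__()
        # (a) toy CNN backbone: convolution + pooling layers -> feature map
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # (c) 1x1 convolution used by the horizontal-pooling branch
        self.reduce = nn.Conv2d(channels, local_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(x)                       # (B, C, H, W) feature map
        # (b) global pooling: one value per channel -> C-dimensional summary
        global_feat = fmap.mean(dim=(2, 3))           # (B, C)
        # (c) horizontal pooling: average each row, then 1x1 conv to c channels
        rows = fmap.mean(dim=3, keepdim=True)         # (B, C, H, 1)
        local = self.reduce(rows).flatten(1)          # (B, c * H) part features
        # (d) combine the global summary with the horizontal part features
        return torch.cat([global_feat, local], dim=1)


# Usage: extract a distinctive feature vector from a segmented-mask crop.
extractor = RearViewFeatureExtractor()
mask_crop = torch.rand(1, 3, 128, 64)                 # (B, RGB, H, W)
features = extractor(mask_crop)
print(features.shape)                                  # torch.Size([1, 576])
```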
[026] FIG. 4 is a flow diagram of an illustrative method for person identification by a mobile robot, in accordance with some embodiments of the present disclosure.
[027] The steps of the method 400 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 through FIG. 5. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously. At step 402 of the method 400, the one or more hardware processors 104 are configured to obtain an image of the target person to be followed by the mobile robot. The RGB image is captured by the robot camera 112 mounted on the mobile robot 100. When the robot follows from the back, mostly the rear view is available to the robot. The camera 112 capturing the images is mounted on the mobile robot 100 following the person from behind; therefore, the camera 112 is not static while capturing images but remains dynamic. At step 404 of the method 400, the one or more hardware processors 104 are configured to receive information about the target person from the person database. The person database contains identifiers of the target person including the name along with corresponding features of that person in a machine learning model form. The system 100 performs a similarity check by scanning the database each time the robot captures the image of the person whom it is following. At step 406 of the method 400, the one or more hardware processors 104 are configured to tag the target person. The user has two ways to initialize person tagging at time t0. The user can specify a person by saying the name of the person, which is converted into text form for look-up in an existing person database. The person database contains identifiers of a person including the name along with corresponding features of that person in a machine learning model form. If none of the person models is matched, the detected person is tagged as 'unknown person', followed if so instructed, and can be entered in the person model database if annotated by the user. The alternative way of person tagging happens in remote view scenarios via the UI, where the user tags the person among the many persons (identified by a deep learning methodology such as YOLO (4) object detection) visible on the screen, by clicking on the target person. At step 408 of the method 400, the one or more hardware processors 104 are configured to extract a segmented mask of the target person from the image using a deep learning-based segmentation as a region of interest (ROI). The system 100 extracts the segmented mask via YOLO segmentation as the region of interest (ROI). The deep learning algorithm performs segmentation to obtain the segmented mask. The segmented mask comprises semantic information of the target person within the segmented mask. The segmented mask is a region of interest comprising a plurality of features relevant to learn and detect the identity of the target person. At step 410 of the method 400, the one or more hardware processors 104 are configured to pass the segmented mask of the target person to a feature extractor model to extract a plurality of distinctive features.
At training time, the segmented mask of the target person, either tagged or detected via the models, is passed over a time window of S seconds to the feature extractor. The main features extracted are the type of clothes and accessories by a cloth detection network, the ratio of body parts from PoseNet, visible body features (including head), skeletal features and gait patterns over this time window, and facial features (if the view is available). The segmented mask as the ROI is passed to the feature extractor model to learn the model of the target person if it is a new entry in the person database, else the person database model is refined by the new observations. At step 412 of the method 400, the one or more hardware processors 104 are configured to process the extracted features in three ways to obtain three models, wherein the first way involves processing of the entire mask features, the second way involves processing of partitioned mask features, and the third way involves processing of semantic mask features. In the first way, the entire feature set is passed through a ViViT (Video Vision Transformer) to learn a 128-dimensional embedding for the target person. In the second way, the mask of the target person is partitioned into X parts. In an exemplary case, X is fixed as 3, namely head, middle body, and lower body (extracted from Hierarchical PoseNet information), and X sets of 128-dimensional embeddings are learned for the target person. In the third way, the semantic model of the person is extracted in a semantic web technology compliant format of RDF facts, by matching the existence of feature types in discrete form as seen in the person's mask (like hair color, cloth types with color, hair type, etc.). At step 414 of the method 400, the one or more hardware processors 104 are configured to obtain the individual confidence scores of the plurality of models and the context. The above three models are evaluated at testing time to give a majority voting on their individual confidence scores for target person re-identification. In the case of the first two deep models, the cosine similarity of the current person mask with the target person mask is computed in the embedding space. In the semantic model, matching happens over a SPARQL query with an OPTIONAL clause to match current person features with target person features, and if the count of matches is above a threshold t, the target person is classified. If the person is new, then the system needs to learn the features for some time initially. At step 416 of the method 400, the one or more hardware processors 104 are configured to assign a rank to the models based on the individual confidence score of each model in deriving similarity of the mask of the person in view with that of the target person of the person database. At step 418 of the method 400, the one or more hardware processors 104 are configured to identify the target person with the model having the highest match. In an exemplary embodiment, it has been found that the target person's features cannot be learned immediately from just a few views, as background information wrongly biases the learned model. So, only after 10 seconds (S = 10), the target person model gets updated at intervals, and the feature updates of the model happen over a sliding overlapping window of segmented pixels from images of that person. These are fed to the embedding space to represent the target person. Finally, the one or more hardware processors 104 are configured to combine the similarity results of the three models to identify the tagged person as the target person.
All the three models cumulatively assess the individual characteristic features to identify the target person as a whole.
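As a purely illustrative sketch, with toy data, of the three-model decision described above: the two deep models are compared by cosine similarity in an embedding space, the semantic model counts discrete attribute matches via a SPARQL query with OPTIONAL clauses, and the models are then ranked on their confidence scores and majority-voted. The rdflib library is used here only as one possible RDF/SPARQL engine; the namespace, attribute names, embeddings, and thresholds are assumptions, not the disclosed implementation.

```python
# Illustrative three-model decision: cosine similarity for the two deep models,
# an OPTIONAL-clause SPARQL match count for the semantic model, then ranking
# and majority voting. Toy data and thresholds are assumptions for illustration.
import numpy as np
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/reid#")          # hypothetical namespace


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def semantic_matches(graph: Graph) -> int:
    """Count how many discrete attributes of the current mask match the target."""
    query = """
    PREFIX ex: <http://example.org/reid#>
    SELECT ?hair ?cloth ?shoe WHERE {
      ex:target a ex:Person .
      OPTIONAL { ex:target ex:hairColor  ?hair  . ex:current ex:hairColor  ?hair  . }
      OPTIONAL { ex:target ex:upperCloth ?cloth . ex:current ex:upperCloth ?cloth . }
      OPTIONAL { ex:target ex:footwear   ?shoe  . ex:current ex:footwear   ?shoe  . }
    }"""
    row = next(iter(graph.query(query)))
    return sum(value is not None for value in row)   # unbound OPTIONALs are None


# Toy RDF facts for the stored target person and the currently observed mask.
g = Graph()
g.add((EX.target, RDF.type, EX.Person))
for s, p, o in [(EX.target, EX.hairColor, "black"), (EX.target, EX.upperCloth, "blue shirt"),
                (EX.target, EX.footwear, "sneakers"), (EX.current, EX.hairColor, "black"),
                (EX.current, EX.upperCloth, "blue shirt"), (EX.current, EX.footwear, "sandals")]:
    g.add((s, p, Literal(o)))

# Toy 128-dimensional embeddings for model-1 (entire mask) and model-2 (parts).
rng = np.random.default_rng(0)
target_emb = rng.random(128)
current_full = rng.random(128)                       # model-1: entire-mask embedding
current_parts = rng.random(128)                      # model-2: partitioned-mask embedding

t_count, t_sim = 2, 0.8                              # assumed decision thresholds
confidences = {
    "model-1 (entire mask)": cosine(current_full, target_emb),
    "model-2 (partitioned mask)": cosine(current_parts, target_emb),
    "model-3 (semantic)": semantic_matches(g) / 3.0, # normalised match count
}
votes = sum(score >= (t_count / 3.0 if "semantic" in name else t_sim)
            for name, score in confidences.items())
ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
print("ranked models:", ranked)
print("identified as target person:", votes >= 2)    # majority voting
```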
[028] FIG. 5 illustrates the flow of input image data to the different models which process the information differently to come to inferences, according to some embodiments of the present disclosure.
[029] As illustrated in FIG. 5, at the step 502, the raw image is passed as an input through the person identification model 110 as the model 110 builds model-1 and model-2. Similarly, at the step 504, the partitioned image is passed as an input through the person identification model 110 as the model 110 builds model-2. The partitions are created by estimating the pose and skeletal features of the person detected in the image, and then separately sending the head portion, the middle body portion, and the lower body portion for further processing to build the model-2. At the step 506, a scene graph is obtained from the raw image in the form of relations of sub-features describing the person identified in the image. The semantic features are extracted, such as head color, bald or with hair, facial texture and features, having a tattoo or not on the skin, type of accessories and footwear, features of clothes, etc. The extracted semantic features are passed as a combination of textual concepts and image pixel convolutions to form CLIP embeddings, from which a graph is extracted that contains the relations of the entities detected along with the features in the scene graph. The graph passes (a) the node features, where each of the sub-entities detected is a node, and (b) the relationships between the nodes, to a Graph Convolutional Network (GCN) to get an n-dimensional embedding, where n is of the form 2^x, x being an integer. This embedding representation in a higher dimensional space is used to represent the model-3 of the specific person, which is compared with the input images of the person to compute similarity in vector space to re-identify the person if the similarity falls on the higher side. At the step 508, the image-level extracted features of the image are passed to an auto-encoder to learn a compressed representation of the image, and the same is passed to a deep network to validate the representation to reproduce similar types of images to give a model as output. For model-1 and model-2, the process is similar, except that model-2 takes as input three partitioned images of the raw image whilst model-1 takes the entire raw image. This model is compared with incoming images to check if the person detected in a new image has high similarity with the learnt model. It is to be noted that learning happens through a window of a stream of images initially, and subsequent classified images are appended to the input in a sliding window fashion for the online learning process.
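A minimal, illustrative sketch of the graph-embedding idea described above is given below: sub-entity nodes with feature vectors and their relations form a small graph, a single graph-convolution layer propagates information along the edges, and mean pooling yields an n-dimensional person embedding with n = 2^x. The node features here are random placeholders standing in for CLIP-style embeddings, and the single-layer GCN with toy adjacency is an assumption for clarity, not the actual scene-graph pipeline.

```python
# Illustrative scene-graph embedding: node features + adjacency -> one graph
# convolution -> mean pooling -> n-dimensional person embedding (n = 2^x).
# Node features are random placeholders standing in for CLIP-style embeddings.
import torch
import torch.nn as nn


class TinyGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalise the adjacency (with self-loops), then A_hat X W.
        a_hat = adj + torch.eye(adj.size(0))
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        a_norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(a_norm @ x))


# Toy scene graph: nodes = {head, upper body, lower body, footwear} with
# 32-dimensional placeholder features; edges encode "attached-to" relations.
node_feats = torch.rand(4, 32)
adj = torch.tensor([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float32)

gcn = TinyGCNLayer(in_dim=32, out_dim=128)            # n = 128 = 2^7
person_embedding = gcn(node_feats, adj).mean(dim=0)   # graph-level embedding
print(person_embedding.shape)                          # torch.Size([128])
```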
[030] The embodiments of the present disclosure herein address the unresolved problem of person identification by a mobile robot that follows the person from behind, where the mobile robot can rarely benefit from facial feature recognition. The present disclosure provides the person identification by simultaneously processing the image of the person through three different models and combining the outputs of all three models. This provides a more authentic and more validated identification of the person by the mobile robot. Further, the present disclosure provides person re-identification by the mobile robot. This situation arises when the person followed by the mobile robot moves out of the FOV and comes back within the FOV. The mobile robot re-identifies the target person based on the last seen location of the target person, the trajectory of the target person, and the observed scene context. Further, the present disclosure is suitable for surveillance purposes wherein the camera capturing the person(s) is in a mobile condition as it is mounted on the mobile robot.
[031] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[032] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[033] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[034] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[035] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[036] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
We Claim:
1. A processor implemented method (400) for person identification by a mobile robot, the method comprising the steps of:
obtaining (402), via one or more hardware processors of the mobile robot, an image of a target person to be followed by the mobile robot;
receiving (404), via the one or more hardware processors, information about the target person from a person database;
tagging (406), via the one or more hardware processors, the target person;
extracting (408), via the one or more hardware processors, a segmented mask of the target person from the image as a region of interest (ROI) using a deep learning-based segmentation;
extracting (410), by a feature extractor model executed via the one or more hardware processors, a plurality of mask features from the segmented mask of the target person;
processing (412), via the one or more hardware processors, the extracted features using a plurality of techniques to obtain three distinct models, wherein the plurality of techniques comprises (i) processing of the plurality of mask features to generate a first model, (ii) processing of partial mask features from among the plurality of features to generate a second model, and (iii) processing of semantic mask features to generate a third model, wherein the semantic mask features are obtained by a bounding box segmentation of the target person in the image;
obtaining (414), via the one or more hardware processors, an individual confidence score for each of the plurality of models in identifying the target person, wherein the first model and the second model compare cosine similarity of the processed mask features with the target person image in an embedding space, and the third model performs a match over a SPARQL query, and if a count of matches is above a threshold t, the target person is classified;
assigning (414), via the one or more hardware processors, a rank to each of the plurality of models based on the individual confidence score of each model in deriving similarity of the processed mask with the target person image; and
identifying (416), via the one or more hardware processors, the target person with the model having the highest match.
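As a purely illustrative, non-limiting sketch of how the scoring and ranking of steps (414) and (416) of claim 1 might be realized, the Python fragment below compares embeddings by cosine similarity for the first and second models, substitutes a simple attribute-count check for the SPARQL match of the third model, and selects the model with the highest confidence score; all function and variable names are hypothetical and are not part of the claims.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_models(query_emb_full, target_emb_full,
                 query_emb_part, target_emb_part,
                 query_attrs: dict, target_attrs: dict, t: int):
    """Hypothetical scoring of the three models described in claim 1.

    Models 1 and 2: cosine similarity of full-mask / partial-mask embeddings
    against the stored target embedding.
    Model 3: count of matching discrete semantic attributes, used here as a
    simple stand-in for the SPARQL match over a knowledge base."""
    scores = {
        "model_1_full_mask": cosine_similarity(query_emb_full, target_emb_full),
        "model_2_partial_mask": cosine_similarity(query_emb_part, target_emb_part),
    }
    matches = sum(1 for k, v in query_attrs.items() if target_attrs.get(k) == v)
    # Only classify via the third model when the match count exceeds threshold t;
    # normalize the count to [0, 1] so it is comparable to the cosine scores.
    scores["model_3_semantic"] = matches / max(len(query_attrs), 1) if matches > t else 0.0

    # Rank the models by confidence and identify with the best-matching model.
    best_model = max(scores, key=scores.get)
    return scores, best_model
```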
2. The method as claimed in claim 1, wherein when the target person exists within a field of view (FOV), the robot identifies the target person by way of trajectory planning; and when the target person re-appears in the FOV after moving out of the FOV, the robot re-identifies the target person based on a last seen location of the target person, a trajectory of the target person, and the observed scene context.
3. The method as claimed in claim 1, wherein the plurality of features extracted by the feature extractor model comprises (1) type of clothes and accessories, (2) ratio of body parts, (3) visible body features including head, (4) skeletal features, (5) gait patterns, and (6) partial facial features.
4. The method as claimed in claim 1, wherein the feature extraction comprises:
providing the image of the target person to a convolutional neural network (CNN), wherein the CNN comprises multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map;
applying a global pooling to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features, wherein the reduced spatial dimensions yield a C-dimensional vector that represents a summary of the entire image;
simultaneously, applying a horizontal pooling to the feature map to reduce channel dimensions of the feature map by applying a 1x1 convolution to obtain local features, wherein the reduced channel dimensions yield a c-dimensional vector that represents horizontal parts of the image;
combining the C-dimensional vector representing the summary of the entire image and the c-dimensional vector representing the horizontal parts of the image to obtain distinctive features of the target person.
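The feature extraction of claim 4 (and the corresponding claim 11) can be pictured with the following minimal PyTorch-style sketch; the backbone, channel sizes, and class name are assumptions introduced only for illustration and do not appear in the specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFeatureExtractor(nn.Module):
    """Minimal sketch of the global + horizontal pooling scheme of claim 4.

    `backbone` is any CNN returning a feature map of shape (B, C, H, W);
    the channel sizes and the backbone choice here are illustrative only."""

    def __init__(self, backbone: nn.Module, in_channels: int = 512, local_channels: int = 64):
        super().__init__()
        self.backbone = backbone
        # 1x1 convolution reduces the channel dimension for the local branch.
        self.reduce = nn.Conv2d(in_channels, local_channels, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(image)                               # (B, C, H, W)
        # Global branch: pool spatial dims to one value per channel -> C-dim vector.
        global_feat = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (B, C)
        # Local branch: 1x1 conv, then pool each horizontal stripe -> c values per row.
        local = self.reduce(fmap)                                 # (B, c, H, W)
        local_feat = F.adaptive_avg_pool2d(local, (local.shape[2], 1)).flatten(1)  # (B, c*H)
        # Combine the global summary and the horizontal part descriptors.
        return torch.cat([global_feat, local_feat], dim=1)
```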
5. The method as claimed in claim 1, wherein the robot follows the target person from behind, and wherein a depth image is used to locate a free space and subsequently avoid collisions with the target person and other obstacles.
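For claim 5, one simple, assumption-laden way to locate free space from a depth image is to treat image columns whose closest valid depth reading exceeds a clearance threshold as free; the helper below is hypothetical and only sketches that idea.

```python
import numpy as np

def find_free_space(depth_m: np.ndarray, min_clearance_m: float = 1.0) -> int:
    """Hypothetical free-space search over a depth image (in metres).

    Treats an image column as free when its closest valid depth reading
    exceeds the clearance, and returns the centre column index of the
    widest contiguous free band, or -1 if no column is free."""
    valid = np.nan_to_num(depth_m, nan=np.inf)
    column_min = valid.min(axis=0)                  # closest obstacle per column
    free = column_min > min_clearance_m             # boolean mask of free columns
    if not free.any():
        return -1
    best_start, best_len, start = 0, 0, None
    # Scan with a trailing False sentinel to close the final run of free columns.
    for i, f in enumerate(np.append(free, False)):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start + best_len // 2
```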
6. The method as claimed in claim 1, wherein when the target person is new and information of the target person does not exist in the person database, the robot learns the plurality of features on-the-go and updates the features over a sliding overlapping window of segmented pixels from the image of the target person; and wherein the segmented mask of the target person is then fed to the feature extractor model.
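Claim 6 describes learning features of a new person on-the-go over a sliding overlapping window; a minimal sketch of such a window, assuming the per-frame mask embeddings are already available and using a running mean as the appearance template, could look as follows (the class name and window size are illustrative).

```python
from collections import deque
import numpy as np

class OnTheGoFeatureStore:
    """Illustrative sliding-window update of a new target's features.

    Keeps the most recent `window` mask embeddings (overlapping, since the
    window advances one frame at a time) and exposes their running mean as
    the current appearance template."""

    def __init__(self, window: int = 30):
        self.embeddings = deque(maxlen=window)

    def update(self, mask_embedding: np.ndarray) -> np.ndarray:
        self.embeddings.append(mask_embedding)
        return np.mean(np.stack(self.embeddings), axis=0)  # current template
```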
7. The method as claimed in claim 1, wherein the plurality of models processes the extracted features by the plurality of techniques comprising:
(1) the first model processes the extracted features by passing the entire feature map through a Vision Transformer to learn a 128-dimensional embedding for the target person; (2) the second model processes the extracted features by partitioning the entire body into parts to learn a 128-dimensional embedding for the target person; and (3) the third model processes the extracted features by matching the existence of semantic features in a discrete form.
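For the second model of claim 7, partitioning the body into parts and learning a 128-dimensional embedding could, under the assumptions noted in the comments, be sketched as below; the first model would instead feed the full feature map to a Vision Transformer encoder, and the third model corresponds to the discrete attribute match already sketched after claim 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartEmbeddingModel(nn.Module):
    """Sketch of a part-based 128-dimensional embedding (second model of claim 7).

    The number of parts and layer sizes are illustrative assumptions."""

    def __init__(self, in_channels: int = 512, num_parts: int = 6, embed_dim: int = 128):
        super().__init__()
        self.num_parts = num_parts
        self.part_fc = nn.Linear(in_channels, embed_dim)
        self.fuse = nn.Linear(num_parts * embed_dim, embed_dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) -> average each of `num_parts` horizontal strips.
        strips = F.adaptive_avg_pool2d(fmap, (self.num_parts, 1))  # (B, C, P, 1)
        strips = strips.squeeze(-1).permute(0, 2, 1)               # (B, P, C)
        part_emb = self.part_fc(strips)                            # (B, P, 128)
        fused = self.fuse(part_emb.flatten(1))                     # (B, 128)
        return F.normalize(fused, dim=1)                           # unit-norm embedding
```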
8. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
obtain, an image of a target person to be followed by the mobile robot;
receive, information about the target person from a person database;
tag, the target person;
extract, a segmented mask of the target person from the image as a region of interest (ROI) using a deep learning-based segmentation;
extract, by a feature extractor model, a plurality of mask features from the segmented mask of the target person;
process, the extracted features using a plurality of techniques to obtain three distinct models, wherein the plurality of techniques comprises (i) processing of the plurality of mask features to generate a first model, (ii) processing of partial mask features from among the plurality of features to generate a second model, and (iii) processing of semantic mask features to generate a third model, wherein the semantic mask features are obtained by a bounding box segmentation of the target person in the image;
obtain, an individual confidence score for each of the plurality of models in identifying the target person, wherein the first model and the second model compare cosine similarity of the processed mask features with the target person image in an embedding space, and the third model performs a match over a SPARQL query, and if a count of matches is above a threshold t, the target person is classified;
assign, a rank to each of the plurality of models based on the individual confidence score of each model in deriving similarity of the processed mask with the target person image; and
identify the target person with the model having the highest match.
9. The system as claimed in claim 8, wherein when the target person exists within a field of view (FOV), the robot identifies the target person by way of trajectory planning; and when the target person re-appears in the FOV after moving out of the FOV, the robot re-identifies the target person based on a last seen location of the target person, a trajectory of the target person, and the observed scene context.
10. The system as claimed in claim 8, wherein the plurality of features extracted by the feature extractor model comprises (1) type of clothes and accessories, (2) ratio of body parts, (3) visible body features including head, (4) skeletal features, (5) gait patterns, and (6) partial facial features.
11. The system as claimed in claim 8, wherein the process of feature extraction comprises the steps of:
providing the image of the target person to a convolutional neural network (CNN), wherein the CNN comprises multiple layers of convolutions and pooling operations, and wherein the CNN learns to extract features from the image to obtain a feature map;
applying a global pooling to the feature map to reduce spatial dimensions of the feature map to a single value per channel to obtain global features, wherein the reduced spatial dimensions yield a C-dimensional vector that represents a summary of the entire image;
simultaneously, applying a horizontal pooling to the feature map to reduce channel dimensions of the feature map by applying a 1x1 convolution to obtain local features, wherein the reduced channel dimensions yield a c-dimensional vector that represents horizontal parts of the image;
combining the C-dimensional vector representing the summary of the entire image and the c-dimensional vector representing the horizontal parts of the image to obtain distinctive features of the target person.
12. The system as claimed in claim 8, wherein the robot follows the target person from behind, and wherein a depth image is used to locate a free space and subsequently avoid collisions with the target person and other obstacles.
13. The system as claimed in claim 8, wherein when the target person is new and information of the target person does not exist in the person database, the robot learns the plurality of features on-the-go and updates the features over a sliding overlapping window of segmented pixels from the image of the target person; and wherein the segmented mask of the target person is then fed to the feature extractor model.
14. The system as claimed in claim 8, wherein the plurality of models processes the extracted features by the plurality of techniques comprising:
(1) the first model processes the extracted features by passing the entire feature map through a Vision Transformer to learn a 128-dimensional embedding for the target person; (2) the second model processes the extracted features by partitioning the entire body into parts to learn a 128-dimensional embedding for the target person; and (3) the third model processes the extracted features by matching the existence of semantic features in a discrete form.
Dated this 31st Day of October 2023
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086