Abstract: METHOD AND SYSTEM FOR CONTEXT AWARE ENTITY EXTRACTION FOR MATCHING RESUMES AND JOB DESCRIPTIONS. State of the art techniques utilize standard Named Entity Recognition (NER) models that have limitations in extracting relevant entities from a context aware perspective. Further, they require at least a minimally structured query as input. A method and system for context aware entity extraction for matching resumes and JDs is provided. The method provides context aware NER and NLQ models that enable accurate entity extraction from resumes, JDs, and free flowing text. Standard NER and NLQ models are finetuned using a curated training dataset, generated automatically, to build context awareness and eliminate the manual effort required for data annotation. An option is provided to an end user to define an individual weightage for each entity of interest such as skill, location, experience and role, wherein a cumulative weightage in combination with a cosine similarity measure is used to find a list of matching resumes or JDs in accordance with an input request. [To be published with FIG. 1B]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR CONTEXT AWARE ENTITY
EXTRACTION FOR MATCHING RESUMES AND JOB DESCRIPTIONS
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the
manner in which it is to be performed.
TECHNICAL FIELD
[001] The embodiments herein generally relate to Natural language
processing and entity extraction, and more particularly, to a method and system for
context aware entity extraction for matching resumes and Job Descriptions (JDs).
BACKGROUND
[002] Intelligent automation has already penetrated the hiring and
recruitment domain. Considering the huge volumes of resumes and Job Descriptions
(JDs) flowing in, it is critical that the most relevant matches are identified, so as to
ensure right candidates are filtered and irrelevant matching is eliminated.
[003] Resumes or the JDs include multiple entities such as skills, role,
location, experience, and the like, that need to be compared appropriately in
accordance with the recruiter’s or candidate’s preferences. There may be finer
preferences and variations in requirement for each entity. Many existing machine
learning based systems that provide mapping of resumes to JDs are generic and
follow a preset logic to identify relevant resumes. However, it can be understood that
the requirement or preferences of a recruiter or a candidate can vary with a time
factor, an urgency factor, wherein the importance of each entity such as skill, location,
role, and experience may individually vary. Furthermore, the conventional resume to
JD matching solutions demand the query to be provided in at least a minimal
structured format, which affects the ease of utilizing the functionality, effectively
degrading the end user experience.
SUMMARY
[004] Embodiments of the present disclosure present technological
improvements as solutions to one or more of the above-mentioned technical
problems recognized by the inventors in conventional systems.
[005] For example, in one embodiment, a method for context aware entity
extraction for matching resumes and Job Descriptions (JDs) is provided.
[006] The method includes receiving a request, the request comprising one
of: a) a resume to be mapped with a JD list identified from a plurality of JDs, b) a
JD to be mapped with a resume list identified from a plurality of resumes and c) a
Natural language query to be mapped with one of i) the JD list and ii) the resume
list. Further, the method includes extracting a plurality of input entities from the
request to generate a structured query. The plurality of input entities comprise at
least one of skills, an experience in terms of range and text, a role, a location tagged
with multi-tier category. A context aware Named Entity Recognition (NER) model
extracts the plurality of input entities from the resume or the JD. A context aware
Natural Language Query (NLQ) model extracts the plurality of input entities from
the Natural Language Query. The context aware NER model and the context aware
NLQ model are trained by generating a curated training dataset.
[007] Further, the method includes querying an entity database using the
structured query constructed from the plurality of input entities, wherein the entity
database is a structured database comprising a plurality of entities, extracted from
each of the plurality of JDs and each of the plurality of resumes, by the trained context
aware NER model. The plurality of entities comprise the skills, the experience in
terms of range and text, the role, the location tagged with multi-tier category, a
person name tagged with geographical origin, organization names, educational
institutes, the email, and the phone.
[008] Furthermore, the method includes generating an output list in
response to the structured query, wherein the output list maps to the plurality of
input entities in accordance with a preset mapping criterion comprising one of a) an
exact match and b) a partial match. The output list comprises one of a) the JD list
mapping to the resume or the natural language query in accordance with the preset
mapping criterion and b) the resume list mapping to the JD or the natural language
query in accordance with the preset mapping criterion. The mapping is computed
in terms of a percentage match function based on, a) a cosine similarity between
each of the plurality of input entities with corresponding each of the plurality of
entities of each of the plurality of JDs or each of the plurality of resumes, and b) a
cumulative weightage based on an individual entity weightage dynamically defined
for each of the plurality of input entities by a user while generating the request.
[009] Furthermore, the method includes performing context aware training
by generating the curated training dataset by training a standard NER model and a
standard NLQ model using a standard training dataset. The context aware training
includes updating the entity database with a plurality of entity samples
corresponding to the plurality of entities, wherein the plurality of entity samples are
from a domain of interest. Further, it includes collecting a plurality of sample
sentences having variations in sentence structure from the domain of interest,
wherein a placeholder tagged with an index value is identified for one or more
entities among the plurality of entities within each of the plurality of sample
sentences. Further, it includes generating the curated training dataset comprising a list
of natural language training sample sentences, wherein the list is generated by
randomly inserting each of the plurality of entity samples in each of the plurality of
sample sentences in accordance with the placeholder. Furthermore, it includes
retraining the trained standard NER model and the trained standard NLQ model
with the curated training dataset comprising the list of natural language training
samples to generate the trained context aware NER model and the trained context
aware NLQ model, wherein the index value of the place holder in conjunction with
remaining words of the plurality of sample sentences in the list enables context
aware training.
[0010] In another aspect, a system for context aware entity extraction for
matching resumes and Job Descriptions (JDs) is provided. The system comprises a
memory storing instructions; one or more Input/Output (I/O) interfaces; and one or
more hardware processors coupled to the memory via the one or more I/O
interfaces, wherein the one or more hardware processors are configured by the
instructions to receive a request, the request comprising one of: a) a resume to be
mapped with a JD list identified from a plurality of JDs, b) a JD to be mapped with
a resume list identified from a plurality of resumes and c) a Natural language query
to be mapped with one of i) the JD list and ii) the resume list. Further, the one or more
hardware processors extract a plurality of input entities from the request to generate a
structured query. The plurality of input entities comprise at least one of skills, an
experience in terms of range and text, a role, a location tagged with multi-tier
category. A context aware Named Entity Recognition (NER) model extracts the
plurality of input entities from the resume or the JD. A context aware Natural
Language Query (NLQ) model extracts the plurality of input entities from the
Natural Language Query. The context aware NER model and the context aware
NLQ model are trained by generating a curated training dataset.
[0011] Further, the one or more hardware processors, query an entity
database using the structured query constructed from the plurality of input entities,
wherein the entity database is a structured database comprising a plurality of
entities, extracted from each of the plurality of JDs and each of the plurality of
resumes, by the trained context aware NER model. The plurality of entities
comprise the skills, the experience in terms of range and text, the role, the location
tagged with multi-tier category, a person name tagged with geographical origin,
organization names, educational institutes, the email, and the phone.
[0012] Furthermore, the one or more hardware processors generate an
output list in response to the structured query, wherein the output list maps to the
plurality of input entities in accordance with a preset mapping criterion comprising
one of a) an exact match and b) a partial match. The output list comprises one of a)
the JD list mapping to the resume or the natural language query in accordance with
the preset mapping criterion and b) the resume list mapping to the JD or the natural
language query in accordance with the preset mapping criterion. The mapping is
computed in terms of a percentage match function based on, a) a cosine similarity
between each of the plurality of input entities with corresponding each of the
plurality of entities of each of the plurality of JDs or each of the plurality of resumes,
and b) a cumulative weightage based on an individual entity weightage dynamically
defined for each of the plurality of input entities by a user while generating the
request.
[0013] Furthermore, the one or more hardware processors perform context
aware training by generating the curated training dataset by training a standard NER
model and a standard NLQ model using a standard training dataset. The context
aware training includes updating the entity database with a plurality of entity
samples corresponding to the plurality of entities, wherein the plurality of entity
samples are from a domain of interest. Further, it includes collecting a plurality of
sample sentences having variations in sentence structure from the domain of
interest, wherein a placeholder tagged with an index value is identified for one or
more entities among the plurality of entities within each of the plurality of sample
sentences. Further, it includes generating the curated training dataset comprising a list
of natural language training sample sentences, wherein the list is generated by
randomly inserting each of the plurality of entity samples in each of the plurality of
sample sentences in accordance with the placeholder. Furthermore, it includes
retraining the trained standard NER model and the trained standard NLQ model
with the curated training dataset comprising the list of natural language training
samples to generate the trained context aware NER model and the trained context
aware NLQ model, wherein the index value of the place holder in conjunction with
remaining words of the plurality of sample sentences in the list enables context
aware training.
[0014] In yet another aspect, there are provided one or more non-transitory
machine-readable information storage mediums comprising one or more
instructions, which when executed by one or more hardware processors cause a
method for context aware entity extraction for matching resumes and Job
Descriptions (JDs) to be performed.
[0015] The method includes receiving a request, the request comprising one
of: a) a resume to be mapped with a JD list identified from a plurality of JDs, b) a
JD to be mapped with a resume list identified from a plurality of resumes and c) a
Natural language query to be mapped with one of i) the JD list and ii) the resume
list. Further, the method includes extracting a plurality of input entities from the
request to generate a structured query. The plurality of input entities comprise at
least one of skills, an experience in terms of range and text, a role, a location tagged
with multi-tier category. A context aware Named Entity Recognition (NER) model
extracts the plurality of input entities from the resume or the JD. A context aware
Natural Language Query (NLQ) model extracts the plurality of input entities from
the Natural Language Query. The context aware NER model and the context aware
NLQ model are trained by generating a curated training dataset.
[0016] Further, the method includes querying an entity database using the
structured query constructed from the plurality of input entities, wherein the entity
database is a structured database comprising a plurality of entities, extracted from
each of the plurality of JDs and each of the plurality of resumes, by the trained context
aware NER model. The plurality of entities comprise the skills, the experience in
terms of range and text, the role, the location tagged with multi-tier category, a
person name tagged with geographical origin, organization names, educational
institutes, the email, and the phone.
[0017] Furthermore, the method includes generating an output list in
response to the structured query, wherein the output list maps to the plurality of
input entities in accordance with a preset mapping criterion comprising one of a) an
exact match and b) a partial match. The output list comprises one of a) the JD list
mapping to the resume or the natural language query in accordance with the preset
mapping criterion and b) the resume list mapping to the JD or the natural language
query in accordance with the preset mapping criterion. The mapping is computed
in terms of a percentage match function based on, a) a cosine similarity between
each of the plurality of input entities with corresponding each of the plurality of
entities of each of the plurality of JDs or each of the plurality of resumes, and b) a
cumulative weightage based on an individual entity weightage dynamically defined
for each of the plurality of input entities by a user while generating the request.
[0018] Furthermore, the method includes performing context aware training
by generating the curated training dataset by training a standard NER model and a
standard NLQ model using a standard training dataset. The context aware training
includes updating the entity database with a plurality of entity samples
corresponding to the plurality of entities, wherein the plurality of entity samples are
from a domain of interest. Further, it includes collecting a plurality of sample
sentences having variations in sentence structure from the domain of interest,
wherein a placeholder tagged with an index value is identified for one or more
entities among the plurality of entities within each of the plurality of sample
sentences. Further, it includes generating the curated training dataset comprising a list
of natural language training sample sentences, wherein the list is generated by
randomly inserting each of the plurality of entity samples in each of the plurality of
sample sentences in accordance with the placeholder. Furthermore, it includes
retraining the trained standard NER model and the trained standard NLQ model
with the curated training dataset comprising the list of natural language training
samples to generate the trained context aware NER model and the trained context
aware NLQ model, wherein the index value of the place holder in conjunction with
remaining words of the plurality of sample sentences in the list enables context
aware training.
[0019] It is to be understood that both the foregoing general description and
the following detailed description are exemplary and explanatory only and are not
restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] The accompanying drawings, which are incorporated in and
constitute a part of this disclosure, illustrate exemplary embodiments and, together
with the description, serve to explain the disclosed principles:
[0021] FIG. 1A is a functional block diagram of a system for context aware
entity extraction for matching resumes and Job Descriptions (JDs), in accordance
with some embodiments of the present disclosure.
[0022] FIG. 1B illustrates an architectural overview of the system of FIG. 1A,
in accordance with some embodiments of the present disclosure.
[0023] FIGS. 2A and 2B (collectively referred to as FIG. 2) is a flow diagram
illustrating a method for context aware entity extraction for matching resumes and
Job Descriptions (JDs), using the system of FIG. 1, in accordance with some
embodiments of the present disclosure.
[0024] FIG. 3 is a flow diagram illustrating a process for context aware
training of a standard NER model and a standard NLQ model by generating the
curated training dataset, using the system of FIG. 1, in accordance with some
embodiments of the present disclosure.
[0025] It should be appreciated by those skilled in the art that any block
diagrams herein represent conceptual views of illustrative systems and devices
embodying the principles of the present subject matter. Similarly, it will be
appreciated that any flow charts, flow diagrams, and the like represent various
processes which may be substantially represented in computer readable medium and
so executed by a computer or processor, whether or not such computer or processor
is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
[0026] Exemplary embodiments are described with reference to the
accompanying drawings. In the figures, the left-most digit(s) of a reference number
identifies the figure in which the reference number first appears. Wherever
convenient, the same reference numbers are used throughout the drawings to refer
to the same or like parts. While examples and features of disclosed principles are
described herein, modifications, adaptations, and other implementations are possible
without departing from the scope of the disclosed embodiments.
[0027] Existing resume to Job Description (JD) matching solutions utilize
standard Named Entity Recognition (NER) models to extract entities from
documents such as resumes or JDs. For example, the standard NER model can be
based on a spaCy™ model. The open source spaCy™ models are designed only to
identify entities like English origin names, and tier 1 locations (cities, countries)
and numbers. The accuracy is low as the model does not identify the tier 2 and tier
3 cities. The model does not identify the skills/technology.
[0028] However, there is possibly an overlap between entities such as
names of persons, names of locations etc., which requires a context aware extraction
of entities for best possible resume and JD matching. Further, requirements of an
end user looking for a resume or a JD may vary based on a time factor, an urgency factor
or the like, wherein he/she prefers a varying weightage to the entities of interest. It
is neither desirable nor practical to hard set the system settings for every changed
requirement from the end user. Furthermore, most existing solutions demand a
minimal structured input query to appropriately interpret the user requirement and
search for the right resume or candidate. This structured input requirement is a
technical limitation in the existing solutions. The user experience is enhanced when
he/she is able to input a free flowing text to extract required data. However, existing
Natural Language Query (NLQ) models, as they are, may not be a direct solution
for appropriate extraction of input entities from the natural language query, which
needs a context aware solution.
[0029] Embodiments herein provide a method and system for context aware
entity extraction for matching resumes and Job Descriptions (JDs). The method
provides a context aware NER model and NLQ model that enable accurate entity
extraction from resumes, JDs and natural language query or a free flowing text.
Standard NER models and NLQ models are finetuned using a curated training
dataset to build context awareness into the models. This makes the system 100 a
context-based rather than a data driven system. The method disclosed provides
automation in generating the curated training dataset and eliminates the manual
effort required for data annotation. Furthermore, the method provides an option to
define an individual weightage to each of the entity of interest such as skill, location,
experience and role, wherein a cumulative weightage in combination with a cosine
similarity measure is used to find a list of matching resumes or JDs in accordance with
an input request.
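The cumulative weightage in combination with a cosine similarity measure can be sketched as follows. This is a minimal illustrative sketch, not the actual implementation: the entity names, the bag-of-words token representation, and the weight normalisation are assumptions.

```python
# Minimal sketch of the percentage match function: per-entity cosine
# similarity combined with user-defined entity weightages.
# Entity names and bag-of-words vectors are illustrative assumptions.
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists as bag-of-words vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def percentage_match(input_entities, doc_entities, weightage):
    """Weighted cosine similarity over entity types, as a percentage.

    input_entities / doc_entities map an entity type (e.g. "skill",
    "location") to a list of tokens; weightage holds the individual entity
    weightage defined by the user while generating the request.
    """
    total = sum(weightage.values()) or 1.0
    score = sum((w / total) * cosine_similarity(input_entities.get(e, []),
                                                doc_entities.get(e, []))
                for e, w in weightage.items())
    return round(100.0 * score, 2)
```

For example, a request with skills ["html", "css"] and location ["manhattan"] matched against an identical profile scores 100.0, while a profile matching only on skills scores in proportion to the skill weightage.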
[0030] The method disclosed herein builds a context aware NER model,
which is an enhanced spaCy™ based model for the recruitment domain, and is
capable of identifying:
1) Skills/ Technology,
2) Tier 2 cities,
3) Asian - Indic Names,
4) Name of the organizations,
5) Educational Institutes,
6) Experience in range / in text, and
7) Intelligently identifying the difference between a Person Name and a Location in
case of overlapping words.
[0031] The method provides a single platform for a candidate or a recruiter to
enter his/her request in the form of a resume, a JD, or a free flowing text, and provides
a mapping comprising the most relevant resumes or JDs.
[0032] Referring now to the drawings, and more particularly to FIGS. 1A
through 3, where similar reference characters denote corresponding features
consistently throughout the figures, there are shown preferred embodiments and
these embodiments are described in the context of the following exemplary system
and/or method.
[0033] FIG. 1A is a functional block diagram of a system 100 for context
aware entity extraction for matching resumes and Job Descriptions (JDs), in
accordance with some embodiments of the present disclosure.
[0034] In an embodiment, the system 100 includes a processor(s) 104,
communication interface device(s), alternatively referred to as input/output (I/O)
interface(s) 106, and one or more data storage devices or a memory 102 operatively
coupled to the processor(s) 104. The system 100 with the one or more hardware
processors is configured to execute functions of one or more functional blocks of the
system 100.
[0035] Referring to the components of system 100, in an embodiment, the
processor(s) 104, can be one or more hardware processors 104. In an embodiment,
the one or more hardware processors 104 can be implemented as one or more
microprocessors, microcomputers, microcontrollers, digital signal processors,
central processing units, state machines, logic circuitries, and/or any devices that
manipulate signals based on operational instructions. Among other capabilities, the
one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100
can be implemented in a variety of computing systems including laptop computers,
notebooks, hand-held devices such as mobile phones, workstations, mainframe
computers, servers, and the like.
[0036] The I/O interface(s) 106 can include a variety of software and
hardware interfaces, for example, a web interface, a graphical user interface to
display the generated target images and the like and can facilitate multiple
communications within a wide variety of networks N/W and protocol types,
including wired networks, for example, LAN, cable, etc., and wireless networks,
such as WLAN, cellular and the like. In an embodiment, the I/O interface(s) 106
can include one or more ports for connecting to a number of external devices or to
another server or devices.
[0037] The memory 102 may include any computer-readable medium
known in the art including, for example, volatile memory, such as static random
access memory (SRAM) and dynamic random access memory (DRAM), and/or
non-volatile memory, such as read only memory (ROM), erasable programmable
ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0038] Further, the memory 102 includes a database 108 that comprises an
entity database, as depicted in FIG. 1B. Further, the memory 102 includes modules
such as a context aware NER model 110, a context aware NLQ model 112, and a
mapping module 114. The entity database stores a plurality of entities extracted from
a plurality of resumes and a plurality of JDs for a domain of interest such as the IT
industry, the auto industry, etc. The database 108 may also store a plurality of resumes and JDs
received by the system 100, a curated training dataset for training the context aware
NER model 110 and the context aware NLQ model 112, etc. The database 108 also
stores a script that enables automatically creating the curated training dataset for
context aware training. Further, the memory 102 may comprise information
pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the
system 100 and methods of the present disclosure. In an embodiment, the database
108 may be external (not shown) to the system 100 and coupled to the system via
the I/O interface 106. Functions of the components of the system 100 are explained
in conjunction with the system architecture depicted in FIG. 1B and flow diagram
of FIG. 2.
[0039] FIGS. 2A and 2B (collectively referred to as FIG. 2) is a flow
diagram illustrating a method for context aware entity extraction for matching
resumes and Job Descriptions (JDs), using the system of FIG. 1, in accordance with
some embodiments of the present disclosure.
[0040] In an embodiment, the system 100 comprises one or more data
storage devices or the memory 102 operatively coupled to the processor(s) 104 and
is configured to store instructions for execution of steps of the method 200 by the
processor(s) or one or more hardware processors 104. The steps of the method 200
of the present disclosure will now be explained with reference to the components or
blocks of the system 100 as depicted in FIG. 1A, FIG. 1B and the steps of flow
diagram as depicted in FIG. 2. Although process steps, method steps, techniques or
the like may be described in a sequential order, such processes, methods, and
techniques may be configured to work in alternate orders. In other words, any
sequence or order of steps that may be described does not necessarily indicate a
requirement that the steps be performed in that order. The steps of processes
described herein may be performed in any order practical. Further, some steps may
be performed simultaneously.
[0041] Referring to the steps of the method 200, at step 202 of the method
200, the one or more hardware processors 104 receive a request comprising one
of: a) a resume to be mapped with a JD list identified from a plurality of JDs, b) a
JD to be mapped with a resume list identified from a plurality of resumes and c) a
Natural language query to be mapped with one of i) the JD list and ii) the resume
list. Thus, the method disclosed provides a single platform for multiple types of
requests. Unlike traditional methods, the method disclosed also enables a free
flowing text or the natural language query as input, which eliminates the manual effort
for the end user to generate a structured query as input to the system 100, enabling
ease of use of the system 100, without any input restrictions.
[0042] At step 204 of the method 200, the one or more hardware processors
104 extract a plurality of input entities from the request to generate a structured
query. The plurality of input entities comprise at least one of skills, an experience
in terms of range, a role, a location tagged with tier 1 and tier 2 category (multi-tier
category) and the like. As depicted in the architectural overview of the system 100 in
25 FIG. 1B, the context aware Named Entity Recognition (NER) model 110 extracts
the plurality of input entities from the resume or the JD, while the context aware
Natural Language Query (NLQ) model 112 extracts the plurality of input entities
from the Natural Language Query. The context aware NER model 110 and the
context aware NLQ model 112 are trained by generating the curated training
dataset.
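As an illustrative sketch (not the actual model output), the entities emitted by the trained NER/NLQ model can be grouped into the structured query as follows. The entity labels and the (text, label) tuple format are assumptions introduced for illustration.

```python
# Hypothetical sketch: grouping (text, label) pairs emitted by the trained
# context aware NER/NLQ model into a structured query for the entity
# database. Labels and the tuple representation are assumptions.
def build_structured_query(entities):
    """Bucket extracted entities into query fields for the entity database."""
    query = {"skill": [], "location": [], "role": [], "experience": []}
    for text, label in entities:
        field = label.lower()
        if field in query:
            query[field].append(text.lower())
    return query

# Entities as the model might emit them for a free flowing query.
extracted = [("Data Scientist", "ROLE"), ("HTML", "SKILL"),
             ("Angular", "SKILL"), ("Manhattan", "LOCATION"),
             ("5+ yrs", "EXPERIENCE")]
structured_query = build_structured_query(extracted)
```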
[0043] FIG. 3 is a flow diagram illustrating a process 300 for context aware
training of a standard NER model and a standard NLQ model by generating the
curated training dataset, using the system of FIG. 1, in accordance with some
embodiments of the present disclosure. The steps involved in the context aware
training include:
a) Training (302) the standard NER model and the standard NLQ model using
a standard training dataset. The standard NER model can be a spaCy™ based
model.
b) Updating (304) the entity database (shown in FIG. 1B) with a plurality of
entity samples corresponding to the plurality of entities, wherein the plurality
of entity samples are from a domain of interest.
c) Collecting (306) a plurality of sample sentences having variations in
sentence structure from the domain of interest such as IT, automobile
industry, finance industry and the like. A placeholder tagged with an index
value is identified for one or more entities among the plurality of entities
within each of the plurality of sample sentences.
d) Generating (308) the curated training dataset comprising a list of natural
language training sample sentences. The list is generated by randomly
inserting each of the plurality of entity samples in each of the plurality of
sample sentences in accordance with the placeholder.
e) Retraining (310) the trained standard NER model and the trained standard
NLQ model with the curated training dataset comprising the list of natural
language training samples to generate the trained context aware NER model
and the trained context aware NLQ model. The index value of the place
holder in conjunction with remaining words of the plurality of sample
sentences in the list enables context aware training. The trained model is
capable of understanding the sequences of entities and contextual intelligence
to identify entities such as Candidate name, Technical and Business Skills,
Location, Job Role, Certifications, Educational Background, Project
experience, and Experience in years/months. The context aware NER model
110, which is an enhanced spaCy™ based model for the recruitment domain, is
capable of identifying:
1) Skills/ Technology,
2) Tier 2 cities,
3) Asian - Indic Names,
4) Name of the organizations,
5) Educational Institutes,
6) Experience in range / in text, and
7) Intelligently identifying the difference between a Person Name and a Location in
case of overlapping words.
The open source spaCy™ models are designed only to identify entities like
English origin names, and tier 1 locations (cities, countries) and numbers.
The accuracy is low as the model does not identify the tier 2 and tier 3 cities.
The model does not identify the skills/ technology.
The model training pipeline can be re-used for other domain dataset training
(minimal changes).
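Steps 302 to 310 above can be sketched as follows. This is a minimal illustrative sketch: the placeholder template syntax, the entity pools, and the span format are assumptions, not the actual automation script.

```python
# Minimal sketch of the automated curated-dataset generation: placeholder
# sentences are filled with entity samples, and character spans are recorded
# for NER training. Templates, pools, and labels are assumptions.
import random
import re

ENTITY_POOL = {
    "ROLE": ["Data Scientist", "Data analyst"],
    "LOCATION": ["Manhattan", "London"],
    "SKILL": ["HTML, CSS and Angular", "Java and Spring"],
}
TEMPLATES = [
    "Looking for {ROLE} employees skilled in {SKILL} at {LOCATION}",
    "Fetch me {ROLE} employees located in {LOCATION} skilled in {SKILL}",
]
PLACEHOLDER = re.compile(r"\{(ROLE|LOCATION|SKILL)\}")

def fill_template(template, rng):
    """Fill placeholders left to right, recording (start, end, label) spans."""
    result, spans, pos = "", [], 0
    for m in PLACEHOLDER.finditer(template):
        result += template[pos:m.start()]
        value = rng.choice(ENTITY_POOL[m.group(1)])
        spans.append((len(result), len(result) + len(value), m.group(1)))
        result += value
        pos = m.end()
    return result + template[pos:], {"entities": spans}

def generate_dataset(n, seed=0):
    """Generate n annotated natural language training sample sentences."""
    rng = random.Random(seed)
    return [fill_template(rng.choice(TEMPLATES), rng) for _ in range(n)]
```

Filling placeholders left to right keeps each recorded span consistent with the final sentence, since later insertions never shift text before them.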
[0044] The steps 302 to 310 of the process 300, for generating the curated
training dataset, are explained below with examples. The method disclosed
provides an automated pipeline execution for steps 302 to 310, which can be
performed for any domain of interest of the end user, such as an IT domain, an
auto industry domain, a management domain, a finance domain or the like. The
pipeline automatically prepares the curated training dataset, annotates it, and then
trains the custom NER models over a standard pre-trained model for context aware
training.
Step 1: Data Curation / Storage: 100+ generalized sentences with placeholders for
Role, Location, Experience, and Skills are gathered from an internal recruitment
domain database.
Example: Looking for <Role> employees skilled in <Skills> at <Location>
Step 2: Annotated data creation: A regular expression based shuffling and custom
looping mechanism is designed to map the entities Role, Location, Experience and
Skills from the database in sequence to generate natural language sentences for
training.
Example: Looking for <Role> employees skilled in <Skills> at <Location>
Entity Samples from IT domain:
Role – Data Scientist, Data analyst
Location – Manhattan, London
Skills - HTML, CSS and Angular
Automatically Generated Sentences:
a) Looking for Data Scientist employees skilled in HTML, CSS, and Angular
at Manhattan. (natural language first training sample sentence)
b) Similarly, for the same set of entities above, another sentence can be:
Fetch me data scientist employees located in Manhattan skilled in HTML,
css and Angular with 5+ yrs of exp. (natural language second
training sample sentence)
Step 3: NER Training Ready Data: The automation script is designed to identify
the starting index and ending index of the placeholder words (Role, Location,
Experience and Skills). Training on these sequenced sentences, with index values and
placement of words, helps the machine acquire the contextual knowledge to
identify 'HTML' as being related to 'SKILLS' from the context during the NLQ query.
Annotation Generation Rule:
'Looking for Data Scientist employees skilled in HTML, CSS and Angular
at Manhattan', {'entities': [(Start character value, End character value, 'Entity
Type')]}
Final Annotated Sentence, ready for training:
'Looking for Data Scientist employees skilled in HTML, CSS and Angular
at Manhattan', {'entities': [(13, 26, 'ROLE'), (74, 82, 'LOCATION'), (49, 52,
'SKILLS'), (55, 57, 'SKILLS'), (63, 69, 'SKILLS')]}
[0045] In the example above ‘HTML’ is identified with index values 49 to
52, and these characters are mapped to SKILLS. Similarly, ‘Manhattan’ has been
marked as LOCATION by mentioning that characters from 74 to 82 are mapped as
a Location.
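The placeholder bookkeeping of Steps 2 and 3 can be sketched in Python. This is a minimal illustration, not the patent's actual automation script: the `{role}`-style template format, the `annotate` function, and the entity names are assumptions, and the character offsets follow Python's end-exclusive slicing convention, which may differ slightly from the indices printed above.

```python
import re

# Sketch of Steps 2-3: fill a placeholder template with entity samples
# and record character offsets as spaCy-style entity annotations.
# The template syntax and helper name are illustrative assumptions.
def annotate(template, entities):
    """Fill {placeholders} and return (sentence, {'entities': [(start, end, LABEL)]})."""
    sentence = ""
    spans = []
    # re.split with a capturing group keeps the {placeholder} delimiters.
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            key = part[1:-1]
            value = entities[key]
            spans.append((len(sentence), len(sentence) + len(value), key.upper()))
            sentence += value
        else:
            sentence += part
    return sentence, {"entities": spans}

text, ann = annotate(
    "Looking for {role} employees skilled in {skills} at {location}.",
    {"role": "Data Scientist",
     "skills": "HTML, CSS and Angular",
     "location": "Manhattan"},
)
# Each recorded span slices back to exactly the inserted entity text.
for start, end, label in ann["entities"]:
    print(label, start, end, repr(text[start:end]))
```

In a fuller pipeline the multi-valued SKILLS string would additionally be split into one span per skill, as in the annotated sentence shown above.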
Step 4: Training NER model on Annotated Data: As more and more such
sentence scenarios are added, the standard NER and NLQ models are trained
in a context aware manner to identify ‘skills’ without actually remembering the
actual word, but based upon the context in which it has been written. Thus, training
with this curated training dataset creates the trained context aware NER model 110 and
the context aware NLQ model 112.
[0046] The curated training dataset so generated is first pushed into the
training pipeline for hyper-parameter tuning. The best parameter values are
evaluated through accuracy and loss. These hyper-parameter values are used for
training the standard NER and NLQ models to generate the context aware NER model
110 and the context aware NLQ model 112. Since the method disclosed provides the
context aware NLQ model, input to the system can be straightforward, simple
free-flowing text, as described in the examples below.
[0047] Examples of free flowing text:
1) Fetch me <role> skilled in <skills> located at <location> with <experience> years
2) looking for <role> at <location> having <experience> yrs of experience with
knowledge in <skills> (identifies yrs as years)
3) <experience> years of experience associate located in <location> having
skills of <skills>
4) Fetch me data scientist employees located in Manhattan skilled in HTML, css
and Angular with 5+ yrs of exp. (5+ identified as experience)
In the above example 4) the method disclosed herein is trained to recognize
intent/context from the input sentence, considering short forms and spelling
mistakes. The trained context aware NLQ model 112 can identify the sequence or
placement of intent in the above NLQ sentence:
Role – Data Scientist
Location - Manhattan
Skills - HTML, CSS/css and Angular (the term ‘css’, whether in capital or small
letters, is considered a skill in the context of the other words identified in
the sentence)
Experience - 5+ yrs
5) Located at ‘New York’ or lives in ‘Manhattan’ or associate from ‘Washington'
district. In these examples the context of the sentences tells the context aware NLQ
model 112 that the words in quotes (‘ ’) are locations.
6) Contextual to skills: trained in ‘html’, ‘css’ or skilled in ‘java’, ‘python’,
‘angular’ or have knowledge in ‘sql’ and ‘analytics’. In this example the context
of the sentences is interpreted by the context aware NLQ model 112 to understand that
the words in quotes (‘ ’) are skills/technologies. This makes the system 100 a
context-based rather than a data-driven system.
7) For a query including experience, the system 100 is able to identify experience
mentioned in any form such as text, range, or numerals, such as: Total experience and
(8-10 yrs, 12 + yrs, more than 10 yrs).
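The experience mentions in example 7) — numeric ranges, '+' forms, and textual forms — lend themselves to a small rule-based pass. The patterns below are an illustrative sketch, not the patent's actual extraction grammar:

```python
import re

# Illustrative patterns for experience mentions in free-flowing text:
# ranges ("8-10 yrs"), plus forms ("12 + yrs", "5+ yrs"),
# and textual forms ("more than 10 yrs"). Order matters: longer
# alternatives are tried before the bare number.
EXPERIENCE_PATTERN = re.compile(
    r"(?:more than\s+\d+|\d+\s*-\s*\d+|\d+\s*\+|\d+)\s*(?:yrs?|years?)",
    re.IGNORECASE,
)

def extract_experience(text):
    """Return all experience mentions found in the text."""
    return [m.group(0) for m in EXPERIENCE_PATTERN.finditer(text)]

print(extract_experience("Total experience 8-10 yrs"))        # ['8-10 yrs']
print(extract_experience("candidates with 12 + yrs"))         # ['12 + yrs']
print(extract_experience("more than 10 yrs preferred"))       # ['more than 10 yrs']
print(extract_experience("skilled in css with 5+ yrs of exp"))  # ['5+ yrs']
```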
[0048] Referring back to step 204, once the plurality of input entities are
extracted from the request, which is either the resume, the JD or the Natural
language query (free flowing text), an optimized structured query such as an SQL
query is automatically generated. The same is depicted in FIG. 1B. Thus, at step
206 of the method 200, the one or more hardware processors 104 query the entity
database (depicted inside the database 108 in FIG. 1B) using the structured query
constructed from the plurality of input entities. The entity database is a structured
database comprising a plurality of entities, extracted from each of the plurality of
JDs and each of the plurality of resumes, by the trained context aware NER model.
The plurality of entities comprise the skills, the experience in terms of range and
text, the role, the location tagged with tier 1, tier 2 and tier 3 (multi-tier)
category, a person name tagged with geographical origin, organization names,
educational institutes, the email and the phone.
[0049] A response is received to the structured query from the entity
database. Then at step 208 of the method 200, the one or more hardware processors
104 generate an output list in response to the structured query. The output list maps
to the plurality of input entities in accordance with a preset mapping criterion
comprising one of a) an exact match and b) a partial match. The output list
comprises one of a) the JD list mapping to the resume or the natural language query
in accordance with the preset mapping criterion and b) the resume list mapping to
the JD or the natural language query in accordance with the preset mapping
criterion.
[0050] The mapping is computed by the mapping module 114 in terms of a
percentage match function based on a) a cosine similarity between each of the
plurality of input entities with corresponding each of the plurality of entities of each
of the plurality of JDs or each of the plurality of resumes, and b) a cumulative
weightage, based on an individual weightage dynamically defined for each of the
plurality of input entities by a user while generating the request.
[0051] The cosine similarity plays an important role when the recruiter tries to
search for a matching job for the candidate or an eligible candidate for a new job
opening. The algorithm considers four major parameters for similarity mapping:
1) Role, 2) Location, 3) Skills, and 4) Experience.
[0052] The percentage mapping formula that provides the cumulative
weightage for mapping is:
Final % = ((ROLE Cosine Similarity Factor * individual weightage %) +
(LOCATION Cosine Similarity Factor * individual weightage %) + (SKILLS
Cosine Similarity Factor * individual weightage %) + (EXPERIENCE Cosine
Similarity Factor * individual weightage %)).
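The cumulative-weightage formula of paragraph [0052] can be sketched in a few lines; the function name and argument layout are illustrative assumptions:

```python
# Sketch of the percentage mapping formula: each parameter's cosine
# similarity factor is scaled by its recruiter-defined weightage and summed.
def final_match_percent(similarities, weightages):
    """similarities, weightages: dicts keyed by ROLE/LOCATION/SKILLS/EXPERIENCE."""
    return sum(similarities[k] * weightages[k] for k in weightages)

weightages = {"ROLE": 20, "LOCATION": 40, "SKILLS": 30, "EXPERIENCE": 10}
# Similarity factors as stated for Candidate C in the worked example below:
candidate_c = {"ROLE": 0.57735027, "LOCATION": 1.0,
               "SKILLS": 0.70710678, "EXPERIENCE": 1.0}
print(round(final_match_percent(candidate_c, weightages), 2))  # 82.76
```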
[0053] The exact match or partial match may be specified by the end user
as per the requirement. The exact match of skill to skill, location to location,
or role to role is in accordance with equation 2 below. During a partial
match, say for example the JD has a requirement for 'Azure Cloud Engineer' and the
candidate has experience in 'Azure', then the system 100 via the mapping module
114 performs a partial match based upon the matching word 'Azure', as provided in
equations 1 and 3 below. A vector distance between similar words (words related to
each other) gives the match % for words which are related or closely associated.
For example, if the JD has a requirement for 'HTML' and the candidate has 'CSS' as a
skill, because HTML and CSS are closely associated, the vector distance between
these two words will be smaller. Hence, they would be considered a near match.
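The partial-match factors in the worked example below (0.57735027 ≈ 1/√3, 0.70710678 = 1/√2) are consistent with a cosine similarity over binary word-occurrence vectors. The sketch below is one plausible realization under that assumption, not necessarily the patent's exact vectorization:

```python
import math

# Cosine similarity between two phrases treated as binary bag-of-words
# vectors: shared tokens divided by the geometric mean of phrase lengths.
def token_cosine(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

# 'Azure Cloud Engineer' vs 'Azure': one shared token out of 3 and 1 -> 1/sqrt(3)
print(round(token_cosine("Azure Cloud Engineer", "Azure"), 8))  # 0.57735027
# An exact match gives 1.0
print(token_cosine("London", "London"))  # 1.0
```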
Example: A job opening with the below details:
Role - AWS DevOps Engineer
Location - London
Experience - 5+
Skills - Jenkins, Google Cloud
Weightage set by the recruiter as in Table 1 below (totals to 100):
Table 1:
ROLE 20
LOCATION 40
SKILLS 30
EXPERIENCE 10
[0054] The flexibility to change the weightage has been provided to the
recruiter each time a new search is made. As per the weightage in the example above,
the recruiter is more focused on ‘location’ for this search. Three eligible candidates
in the database 108 are identified as in Table 2:
Table 2:
Candidate      ROLE                  LOCATION     SKILLS                          EXPERIENCE
Candidate A    Data Analyst          London       Power BI, Excel, SQL, Jenkins   8 years
Candidate B    Web Developer         Manchester   Angular, JAVA, HTML, CSS        15 years
Candidate C    AWS DevOps Engineer   London       Azure, Jenkins, AWS             6.5 years
[0055] Equation 1: Cosine Similarity Score for Candidate A:
Role - 0.0
Location - 1.0 (Exact Match)
Experience - 1.0
Skills - 0.70710678 (Partial Cosine Match)
Final % = ((0.0 * 20) + (1 * 40) + (1 * 10) + (0.7071 * 30))
= ((0) + (40) + (10) + (21.21))
= 71.21 %
[0056] Equation 2: Cosine Similarity Score for Candidate B:
Role - 0.0
Location - 0.0
Experience - 1.0
Skills - 0.0
Final % = ((0.0 * 20) + (0 * 40) + (1 * 10) + (0.0 * 30))
= ((0) + (0) + (10) + (0))
= 10.00 %
[0057] Equation 3: Cosine Similarity Score for Candidate C:
Role - 0.57735027 (Partial Match)
Location - 1.0 (Exact Match)
Experience - 1.0
Skills - 0.70710678 (Partial Match)
Final % = ((0.5773 * 20) + (1 * 40) + (1 * 10) + (0.7071 * 30))
= ((11.546) + (40) + (10) + (21.21))
= 82.76 %
Table 4 below provides the mapping list in accordance with the cumulative
weightage and cosine similarity, wherein candidates are arranged in
descending order of the match percentage.
Table 4:
Candidate      % Match   Ranking
Candidate C    82.76     1
Candidate A    71.21     2
Candidate B    10.00     3
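The ranking above can be reproduced by applying the weightages to the stated cosine-similarity factors for each candidate; note that Candidate A's Experience factor of 1.0 contributes 1.0 × 10 = 10 points. A minimal sketch:

```python
# Rank candidates using the stated cosine-similarity factors and the
# recruiter's weightages (ROLE 20, LOCATION 40, SKILLS 30, EXPERIENCE 10).
weightages = {"ROLE": 20, "LOCATION": 40, "SKILLS": 30, "EXPERIENCE": 10}
factors = {
    "Candidate A": {"ROLE": 0.0, "LOCATION": 1.0,
                    "SKILLS": 0.70710678, "EXPERIENCE": 1.0},
    "Candidate B": {"ROLE": 0.0, "LOCATION": 0.0,
                    "SKILLS": 0.0, "EXPERIENCE": 1.0},
    "Candidate C": {"ROLE": 0.57735027, "LOCATION": 1.0,
                    "SKILLS": 0.70710678, "EXPERIENCE": 1.0},
}

scores = {
    name: round(sum(f[k] * weightages[k] for k in weightages), 2)
    for name, f in factors.items()
}
# Descending order of match percentage, as in the mapping list.
ranking = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, scores[name])
```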
[0058] This system provides the liberty of an exact match or a partial match,
along with the weightage consideration, so that the recruiter can sort and rank the
candidates for shortlisting and requirement fulfillment.
[0059] The written description describes the subject matter herein to enable
any person skilled in the art to make and use the embodiments. The scope of the
subject matter embodiments is defined by the claims and may include other
modifications that occur to those skilled in the art. Such other modifications are
intended to be within the scope of the claims if they have similar elements that do
not differ from the literal language of the claims or if they include equivalent
elements with insubstantial differences from the literal language of the claims.
[0060] It is to be understood that the scope of the protection is extended to
such a program and in addition to a computer-readable means having a message
therein; such computer-readable storage means contain program-code means for
implementation of one or more steps of the method, when the program runs on a
server or mobile device or any suitable programmable device. The hardware device
can be any kind of device which can be programmed including e.g. any kind of
computer like a server or a personal computer, or the like, or any combination
thereof. The device may also include means which could be e.g. hardware means
like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate
array (FPGA), or a combination of hardware and software means, e.g. an ASIC and
an FPGA, or at least one microprocessor and at least one memory with software
processing components located therein. Thus, the means can include both hardware
means, and software means. The method embodiments described herein could be
implemented in hardware and software. The device may also include software
means. Alternatively, the embodiments may be implemented on different hardware
devices, e.g. using a plurality of CPUs.
[0061] The embodiments herein can comprise hardware and software
elements. The embodiments that are implemented in software include but are not
limited to, firmware, resident software, microcode, etc. The functions performed by
various components described herein may be implemented in other components or
combinations of other components. For the purposes of this description, a
computer-usable or computer-readable medium can be any apparatus that can comprise, store,
communicate, propagate, or transport the program for use by or in connection with
the instruction execution system, apparatus, or device.
[0062] The illustrated steps are set out to explain the exemplary
embodiments shown, and it should be anticipated that ongoing technological
development will change the manner in which particular functions are performed.
These examples are presented herein for purposes of illustration, and not limitation.
Further, the boundaries of the functional building blocks have been arbitrarily
defined herein for the convenience of the description. Alternative boundaries can be
defined so long as the specified functions and relationships thereof are appropriately
performed. Alternatives (including equivalents, extensions, variations, deviations,
etc., of those described herein) will be apparent to persons skilled in the relevant
art(s) based on the teachings contained herein. Such alternatives fall within the scope
of the disclosed embodiments. Also, the words “comprising,” “having,”
“containing,” and “including,” and other similar forms are intended to be equivalent
in meaning and be open ended in that an item or items following any one of these
words is not meant to be an exhaustive listing of such item or items, or meant to be
limited to only the listed item or items. It must also be noted that as used herein and
in the appended claims, the singular forms “a,” “an,” and “the” include plural
references unless the context clearly dictates otherwise.
[0063] Furthermore, one or more computer-readable storage media may be
utilized in implementing embodiments consistent with the present disclosure. A
computer-readable storage medium refers to any type of physical memory on which
information or data readable by a processor may be stored. Thus, a
computer-readable storage medium may store instructions for execution by one or more
processors, including instructions for causing the processor(s) to perform steps or
stages consistent with the embodiments described herein. The term “computer-readable
medium” should be understood to include tangible items and exclude
carrier waves and transient signals, i.e., be non-transitory. Examples include random
access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile
memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known
physical storage media.
[0064] It is intended that the disclosure and examples be considered as
exemplary only, with a true scope of disclosed embodiments being indicated by the
following claims.
We Claim:
1. A processor implemented method (200) for context aware entity extraction
for matching resumes and Job Descriptions (JDs), the method comprising:
receiving (202), via one or more hardware processors, a request, the
request comprising one of: a) a resume to be mapped with a JD list identified
from a plurality of JDs, b) a JD to be mapped with a resume list identified
from a plurality of resumes and c) a Natural language query to be mapped
with one of i) the JD list and ii) the resume list;
extracting (204), via the one or more hardware processors, a
plurality of input entities from the request to generate a structured query,
wherein the plurality of input entities comprise at least one
of skills, an experience in terms of range and text, a role, a
location tagged with multi-tier category,
wherein a context aware Named Entity Recognition (NER)
model extracts the plurality of input entities from the resume
or the JD,
wherein a context aware Natural Language Query (NLQ)
model extracts the plurality of input entities from the Natural
Language Query, and
wherein the context aware NER model and the context aware
NLQ model are trained by generating a curated training
dataset;
querying (206), via the one or more hardware processors, an entity
database using the structured query constructed from the plurality of input
entities, wherein the entity database is a structured database comprising a
plurality of entities, extracted from each of the plurality of JDs and each of
the plurality of resumes, by the trained context aware NER model, wherein the
plurality of entities comprise the skills, the experience in terms of range and
text, the role, the location tagged with multi-tier category, a person name
tagged with geographical origin, organization names, educational institutes,
the email and the phone; and
generating (208), via the one or more hardware processors, an output
list in response to the structured query, wherein the output list maps to the
plurality of input entities in accordance with a preset mapping criterion
comprising one of a) an exact match and b) a partial match,
wherein the output list comprises one of a) the JD list
mapping to the resume or the natural language query in
accordance with the preset mapping criterion and b) the
resume list mapping to the JD or the natural language query
in accordance with the preset mapping criterion,
wherein the mapping is computed in terms of a percentage
match function based on,
a) a cosine similarity between each of the plurality of
input entities with corresponding each of the plurality
of entities of each of the plurality of JDs or each of
the plurality of resumes, and
b) a cumulative weightage based on an individual
entity weightage dynamically defined for each of the
plurality of input entities by a user while generating
the request.
2. The method as claimed in claim 1, wherein a context aware training for the
context aware NER model and the context aware NLQ model by generating
the curated training dataset comprises:
training (302) a standard NER model and a standard NLQ model
using a standard training dataset;
updating (304) the entity database with a plurality of entity samples
corresponding to the plurality of entities, wherein the plurality of entity
samples are from a domain of interest;
collecting (306) a plurality of sample sentences having variations in
sentence structure from the domain of interest, wherein a placeholder tagged
with an index value is identified for one or more entities among the plurality
of entities within each of the plurality of sample sentences;
generating (308) the curated training dataset comprising a list of
natural language training sample sentences, wherein the list is generated by
randomly inserting each of the plurality of entity samples in each of the
plurality of sample sentences in accordance with the placeholder; and
retraining (310) the trained standard NER model and the trained
standard NLQ model with the curated training dataset comprising the list of
natural language training samples to generate the trained context aware
NER model and the trained context aware NLQ model, wherein the index
value of the placeholder in conjunction with the remaining words of the
plurality of sample sentences in the list enables context aware training.
3. A system (100) for context aware entity extraction for matching resumes
and Job Descriptions (JDs), the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the
one or more I/O interfaces (106), wherein the one or more hardware
processors (104) are configured by the instructions to:
receive a request, the request comprising one of: a) a resume to be
mapped with a JD list identified from a plurality of JDs, b) a JD to be
mapped with a resume list identified from a plurality of resumes and c) a
Natural language query to be mapped with one of i) the JD list and ii) the
resume list;
extract a plurality of input entities from the request to generate a
structured query,
wherein the plurality of input entities comprise at least one
of skills, an experience in terms of range and text, a role, a
location tagged with multi-tier category,
wherein a context aware Named Entity Recognition (NER)
model extracts the plurality of input entities from the resume
or the JD,
wherein a context aware Natural Language Query (NLQ)
model extracts the plurality of input entities from the Natural
Language Query, and
wherein the context aware NER model and the context aware
NLQ model are trained by generating a curated training
dataset;
query an entity database using the structured query constructed from
the plurality of input entities, wherein the entity database is a structured
database comprising a plurality of entities, extracted from each of the
plurality of JDs and each of the plurality of resumes, by the trained context
aware NER model, wherein the plurality of entities comprise the skills, the
experience in terms of range and text, the role, the location tagged with
multi-tier category, a person name tagged with geographical origin,
organization names, educational institutes, the email and the phone; and
generate an output list in response to the structured query, wherein
the output list maps to the plurality of input entities in accordance with a
preset mapping criterion comprising one of a) an exact match and b) a partial
match,
wherein the output list comprises one of a) the JD list
mapping to the resume or the natural language query in
accordance with the preset mapping criterion and b) the
resume list mapping to the JD or the natural language query
in accordance with the preset mapping criterion,
wherein the mapping is computed in terms of a percentage
match function based on,
a) a cosine similarity between each of the plurality of
input entities with corresponding each of the plurality
of entities of each of the plurality of JDs or each of
the plurality of resumes, and
b) a cumulative weightage based on an individual
entity weightage dynamically defined for each of the
plurality of input entities by a user while generating
the request.
4. The system as claimed in claim 3, wherein the one or more hardware
processors are configured to perform context aware training by generating
the curated training dataset by:
training a standard NER model and a standard NLQ model using a
standard training dataset;
updating the entity database with a plurality of entity samples
corresponding to the plurality of entities, wherein the plurality of entity
samples are from a domain of interest;
collecting a plurality of sample sentences having variations in
sentence structure from the domain of interest, wherein a placeholder tagged
with an index value is identified for one or more entities among the plurality
of entities within each of the plurality of sample sentences;
generating the curated training dataset comprising a list of natural
language training sample sentences, wherein the list is generated by
randomly inserting each of the plurality of entity samples in each of the
plurality of sample sentences in accordance with the placeholder; and
retraining the trained standard NER model and the trained standard
NLQ model with the curated training dataset comprising the list of natural
language training samples to generate the trained context aware NER model
and the trained context aware NLQ model, wherein the index value of the
placeholder in conjunction with the remaining words of the plurality of sample
sentences in the list enables context aware training.