Abstract: The present disclosure introduces a system (102) and method (300) for job description parsing and analysis through a seamless integration of advanced technologies. Leveraging web scraping techniques, the system (102) collects diverse job descriptions from various sources, encompassing industries such as IT, Finance, Pharma, and Energy. Named Entity Recognition (NER) tools, such as the Tecoholic NER annotator, ensure precise identification and categorization of essential segments within the job descriptions. The system consolidates the annotated data into a structured JSON file, streamlining access for subsequent deep learning operations. Further enhancements involve annotation trimming for data cleanliness and format conversion using SpaCy, while multi-level training and testing refine a SpaCy RoBERTa NER model, enabling proficient recognition and extraction of specific entities across industries and languages. Rigorous testing validates the efficacy of the model, positioning this system as an advanced tool for automating job description analysis and offering invaluable insights for recruitment processes and industry trend analysis.
Description:
TECHNICAL FIELD
[0001] The present disclosure pertains to the field of human resources management. More specifically, it relates to a system and method for intelligent parsing and analysis of job descriptions using a SpaCy NER RoBERTa model, facilitating enhanced recruitment and workforce management processes.
BACKGROUND
[0002] Background description includes information that may be useful in understanding the present disclosure. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed disclosure, or that any publication specifically or implicitly referenced is prior art.
[0003] Job descriptions (JDs) serve as crucial documents that outline the responsibilities, qualifications, and expectations associated with a specific job role within an organization. These documents are instrumental in the recruitment and hiring processes, acting as a bridge of communication between employers and potential candidates. A well-crafted JD not only provides a clear understanding of the role but also helps set expectations for both parties involved in the hiring process. It typically includes details such as job title, responsibilities, qualifications, skills required, work environment, and other relevant information.
[0004] The requirement for an accurate and comprehensive JD is paramount for effective recruitment and organizational planning. A well-defined JD not only attracts suitable candidates but also assists recruiters in identifying the most qualified individuals for a particular role. It serves as a foundational document that aligns hiring decisions with organizational goals and ensures a transparent and standardized approach to the recruitment process. Additionally, a clear and detailed JD can contribute to employee satisfaction and retention by setting realistic expectations and helping individuals understand their roles within the larger context of the organization.
[0005] In the manual analysis of JDs, the challenges arise from the inherent limitations of human interpretation. Individuals tasked with creating or assessing JDs may inadvertently introduce biases, and the interpretation of certain terms or phrases can vary widely. This subjectivity can lead to inconsistencies and potential misunderstandings between employers and job seekers. Moreover, the manual process can be time-consuming and prone to errors, particularly when dealing with a large volume of JDs. The need for more streamlined and accurate methods of JD analysis has become evident, prompting the exploration of automated solutions that leverage technology to enhance the efficiency and effectiveness of the recruitment and hiring processes.
[0006] There is, therefore, a need for an improved solution that improves the efficiency, accuracy, and scalability of the analysis process, while mitigating the challenges associated with both manual and existing automated methodologies.
OBJECTS OF THE PRESENT DISCLOSURE
[0007] Some of the objects of the present disclosure, which at least one embodiment herein satisfies are as listed herein below.
[0008] It is an object of the present disclosure to provide a system that enhances efficiency of job description (JD) analysis, addressing the limitations of manual methods.
[0009] It is an object of the present disclosure to provide a system that ensures accuracy and consistency in extracting key information from diverse JDs, mitigating the inherent subjectivity and variability associated with manual analysis.
[0010] It is an object of the present disclosure to provide a system that significantly improves time efficiency in JD analysis, where automation allows the system to process a large volume of JDs in a shorter time frame, streamlining recruitment workflows and facilitating faster decision-making.
[0011] It is an object of the present disclosure to provide a system that demonstrates scalability, adapting to changes in the job market and industry-specific terminology, and also efficiently handles a diverse range of JDs across various sectors, ensuring relevance and adaptability over time.
[0012] It is an object of the present disclosure to provide a system that reduces biases in JD analysis; by relying on predefined rules and automated algorithms, the system minimizes the potential for biases introduced during manual analysis, promoting fair and impartial assessments.
[0013] It is an object of the present disclosure to provide a system that enhances context understanding within JDs by leveraging advanced Natural Language Processing (NLP) techniques, enabling nuanced analysis of language and better extraction of relevant information.
[0014] It is an object of the present disclosure to provide a system that adapts to evolving industry-specific terminology, ensuring effectiveness in capturing the latest trends and requirements across various job markets.
[0015] It is an object of the present disclosure to provide a system with a user-friendly interface.
[0016] It is an object of the present disclosure to provide a system that prioritizes accessibility, making it easy for recruiters and stakeholders to navigate and use, contributing to a seamless and positive user experience.
[0017] It is an object of the present disclosure to provide a system that streamlines recruitment processes. Automation of JD parsing facilitates quicker decision-making, reduces manual effort, and enhances overall efficiency in talent acquisition.
[0018] It is an object of the present disclosure to provide a system that generates structured data outputs, which are leveraged for data-driven insights into job market trends, skill requirements, and other valuable information, contributing to informed decision-making in recruitment strategies.
SUMMARY
[0019] Various aspects of the present disclosure pertain to the field of Natural Language Processing (NLP) and Machine Learning. More specifically, the disclosure relates to a system and method for intelligent parsing and analysis of job descriptions using SpaCy NER RoBERTa models to extract, categorize, and understand key information in diverse professional contexts, facilitating enhanced recruitment and workforce management processes.
[0020] An aspect of the present disclosure pertains to a system for job description (JD) parsing and analysis, comprising a processor-memory configuration executing instructions to collect, store, and process job descriptions. The system includes a Named Entity Recognition (NER) annotator tool to identify segments within the job descriptions, creating structured data. This annotated data is further processed, trimmed, and converted into a SpaCy format. The system incorporates a training unit for NER RoBERTa model training, testing, and storage. The model is tested at two levels using unseen data, with the results annotated and merged.
[0021] In an aspect, the segments in the system include 26 job-related details, such as job title, number of vacancies, college category, company brief, company name, job location, availability, salary range, salary type, qualification, hard skill, soft skill, domain, work mode, mode of employment, roles and responsibilities, perk and benefit, level of role, preferred qualification, salary currency, total experience, relevant experience, job description summary, experience skill, preferred skill, and remaining.
[0022] In an aspect, the NER RoBERTa model performs tokenization and subword tokenization, enhancing its ability to extract information from diverse job descriptions.
[0023] In an aspect, the system provides a graphical user interface (GUI) for user-friendly inspection of job descriptions that provides users with a convenient way to visually inspect and navigate through job descriptions presented in the SpaCy format.
[0024] Another aspect of the present disclosure pertains to a method for job description parsing and analysis that includes a systematic process executed by one or more processors and a training unit. Initially, a diverse set of job descriptions is collected and stored electronically. The processors then extract each job description, utilizing a Named Entity Recognition (NER) annotator tool to identify specific segments within the text files. The annotated data is merged into a structured file with trimmed entity annotations. Following this, the merged data is converted into a SpaCy format. Subsequently, a NER RoBERTa model is trained by a dedicated training unit using the SpaCy-formatted merged job description data. The model undergoes a rigorous testing process, including a first level of testing with unseen data and a second level of testing with received unseen data. This method integrates advanced techniques to enhance job description parsing and analysis, ensuring accuracy and efficiency in information extraction.
[0025] Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which numerals represent like components.
BRIEF DESCRIPTION OF DRAWINGS
[0026] The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in, and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure, and together with the description, serve to explain the principles of the present disclosure.
[0027] FIG. 1 illustrates an exemplary network architecture of the proposed system for parsing and analysis of job descriptions, in accordance with an embodiment of the present disclosure.
[0028] FIG. 2 illustrates an exemplary architecture of the proposed system for parsing and analysis of job descriptions, in accordance with an embodiment of the present disclosure.
[0029] FIG. 3 illustrates an exemplary view of a flow diagram of the proposed method for parsing and analysis of job descriptions, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION
[0030] The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[0031] References to “an embodiment”, “an exemplary embodiment”, “one example”, “an example”, “for instance”, and so on, indicate that the embodiment(s) or example(s) may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element, or limitation. Furthermore, repeated use of the phrase “in an embodiment” does not necessarily refer to the same embodiment.
[0032] Definitions: The following terms shall have, for the purposes of this application, the meanings set forth below.
[0033] RoBERTa refers to a "Robustly Optimised BERT Pretraining Approach," a natural language processing (NLP) model developed by Facebook AI. Built upon the Transformer architecture, it is designed to understand and generate human-like text. RoBERTa Base is the foundational model of RoBERTa, pre-trained on a large corpus of text data, serving as the basis for more specialized models.
[0034] Embodiments of the present disclosure relate to the field of Natural Language Processing (NLP) and Machine Learning. More specifically, the disclosure relates to a system and method for intelligent parsing and analysis of job descriptions using SpaCy NER RoBERTa models to extract, categorize, and understand key information in diverse professional contexts, facilitating enhanced recruitment and workforce management processes.
[0035] An embodiment of the present disclosure pertains to a system for job description (JD) parsing and analysis, comprising a processor-memory configuration executing instructions to collect, store, and process job descriptions. The system includes a Named Entity Recognition (NER) annotator tool to identify segments within the job descriptions, creating structured data. This annotated data is further processed, trimmed, and converted into a SpaCy format. The system incorporates a training unit for NER RoBERTa model training, testing, and storage. The model is tested at two levels using unseen data, with the results annotated and merged.
[0036] In an embodiment, the segments in the system include 26 job-related details, such as job title, number of vacancies, college category, company brief, company name, job location, availability, salary range, salary type, qualification, hard skill, soft skill, domain, work mode, mode of employment, roles and responsibilities, perk and benefit, level of role, preferred qualification, salary currency, total experience, relevant experience, job description summary, experience skill, preferred skill, and remaining.
[0037] In an embodiment, the NER RoBERTa model performs tokenization and subword tokenization, enhancing its ability to extract information from diverse job descriptions.
[0038] In an embodiment, the system provides a graphical user interface (GUI) for user-friendly inspection of job descriptions. The graphical user interface provides users with a convenient way to visually inspect and navigate through job descriptions presented in the SpaCy format.
[0039] Another embodiment of the present disclosure pertains to a method for job description parsing and analysis that includes a systematic process executed by one or more processors and a training unit. Initially, a diverse set of job descriptions is collected and stored electronically. The processors then extract each job description, utilizing a Named Entity Recognition (NER) annotator tool to identify specific segments within the text files. The annotated data is merged into a structured file with trimmed entity annotations. Following this, the merged data is converted into a SpaCy format. Subsequently, a NER RoBERTa model is trained by a dedicated training unit using the SpaCy-formatted merged job description data. This method integrates advanced techniques to enhance job description parsing and analysis, ensuring accuracy and efficiency in information extraction.
[0040] The manner in which the proposed system works is described in further detail in conjunction with FIGs. 1 to 3. It may be noted that these figures are only illustrative, and should not be construed to limit the scope of the subject matter in any manner.
[0041] FIG. 1 illustrates an exemplary network architecture of the proposed system for parsing and analysis of job descriptions, in accordance with an embodiment of the present disclosure.
[0042] In an embodiment, referring to FIG. 1, a system 102 for parsing and analysis of job descriptions is disclosed. The system 102 is connected to a network 106, which is further connected to at least one computing device 104-1, 104-2, … 104-N (collectively referred to as computing device 104, herein). The communication network 106 may include, but not be limited to, a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof.
[0043] In an embodiment, the computing device 104 may communicate with the system 102 via a set of executable instructions residing on any operating system to access the job description parsing and analysis functionality. In an embodiment, the one or more computing devices 104 may include, but not be limited to, any electrical, electronic, electromechanical, or computing equipment, or a combination of one or more of the above devices, such as a mobile phone, smartphone, Virtual Reality (VR) device, Augmented Reality (AR) device, laptop, general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device. It may be appreciated that the computing devices 104 are not restricted to the mentioned devices, and various other devices may be used.
[0044] The system 102 is configured to collect, store, and process job descriptions. The system includes a Named Entity Recognition (NER) annotator tool 112 to identify segments within the job descriptions, creating structured data. This annotated data is further processed, trimmed, and converted into a digital format (i.e. SpaCy format). The system incorporates a training unit 114 for NER RoBERTa model training, testing, and storage, and the trained model is stored on a server 110. The model is tested at two levels using unseen data, with the results annotated and merged.
[0045] The server 110 is communicatively coupled to the system by the network 106 and acts as a centralized storage location for this trained model. Storing the model on a server allows other components or modules of the system, as well as external applications, to access and utilize the trained NER RoBERTa model. This centralization ensures that the trained model is easily retrievable, scalable, and can be deployed for various tasks without the need to retrain it every time. Additionally, storing the model on a server facilitates collaboration and integration with other systems that may require the capabilities of this trained NER model.
[0046] In an embodiment, the system 102 provides a graphical user interface (GUI) (not shown) for user-friendly inspection of job descriptions. The graphical user interface provides users with a convenient way to visually inspect and navigate through job descriptions presented in the SpaCy format.
[0047] In another embodiment, the computing device 104 is equipped to host and display this graphical user interface, providing users with an intuitive and convenient way to inspect and navigate through the job descriptions. Users can interact with the system, view parsed job information, and explore the details presented in the SpaCy format through this graphical interface. The computing device 104 serves as a platform for the graphical user interface, enabling effective communication and interaction between users and the system.
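By way of a non-limiting illustration, one possible way to render such a visual inspection is sketched below using SpaCy's built-in displacy visualizer; the model path, sample text, and output file name are assumptions for illustration only and do not represent the actual implementation of the graphical user interface.

    # Illustrative sketch only: render entity highlights of a parsed JD as HTML so that
    # it can be inspected from a browser on the computing device 104.
    # "output/model-best" and the sample sentence are assumed placeholders.
    import spacy
    from spacy import displacy

    nlp = spacy.load("output/model-best")  # trained NER pipeline (assumed path)
    doc = nlp("Senior Data Engineer - Remote - 2 vacancies - Salary: USD 120,000 per annum")

    html = displacy.render(doc, style="ent", page=True)  # highlighted entities as an HTML page
    with open("jd_entities.html", "w", encoding="utf-8") as f:
        f.write(html)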
[0048] Although FIG. 1 shows exemplary components of the network architecture 100, in other embodiments, the network architecture 100 may include fewer components, different components, differently arranged components, or additional functional components than depicted in FIG. 1. Additionally, or alternatively, one or more components of the network architecture 100 may perform functions described as being performed by one or more other components of the network architecture 100.
[0049] FIG. 2 illustrates an exemplary architecture of the proposed system for parsing and analysis of job descriptions, in accordance with an embodiment of the present disclosure.
[0050] In an aspect, referring to FIG. 2, a system 102 may comprise one or more processor(s) 202 (interchangeably referred to as processor 202, hereinafter). The processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, edge or fog microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the processor 202 may be configured to fetch and execute computer-readable instructions stored in a memory 204 of the system 102. The memory 204 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 204 may comprise any non-transitory storage device including, for example, volatile memory such as Random Access Memory (RAM), or non-volatile memory such as Erasable Programmable Read-Only Memory (EPROM), flash memory, and the like.
[0051] The system 102 may include an interface(s) 206. The interface(s) 206 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 206 may facilitate communication to/from the system 102. The interface(s) 206 may also provide a communication pathway for one or more components of the system 102. Examples of such components include, but are not limited to, processing unit/engine(s) 208 and a database 210.
[0052] In an embodiment, the processing unit/engine(s) 208 may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) 208. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) 208 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) 208 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) 208. In such examples, the system 102 may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system 102 and the processing resource. In other examples, the processing engine(s) 208 may be implemented by electronic circuitry.
[0053] In an embodiment, the database 210 may include data that may be either stored or generated as a result of functionalities implemented by any of the components of the processor 202 or the processing engine 208. In an embodiment, the database 210 may be separate from the system 102.
[0054] In an exemplary embodiment, the processing engine 208 may include one or more engines selected from any of a data collection module 212, a data compilation module 214, a data extraction module 216, an annotation module 218, a merging module 220, an annotation trimming module 222, a data loading module 224, a training and testing module 226, and other module(s) 228. The other module(s) 228 have functions that may include, but are not limited to, testing, storage, and peripheral functions, such as a wireless communication unit for remote operation, an audio unit for alerts, and the like.
[0055] In an embodiment, the data collection module 212 may be configured to collect a plurality of job descriptions (interchangeably referred to as JDs data) from a set of sources employing advanced techniques including web scraping from job platforms. This data gathering process covers different industries or sectors, such as IT, Finance, Pharma, and Energy, ensuring a diverse and representative collection of job descriptions for analysis. For instance, the data collection module 212 retrieves job descriptions from diverse and multiple sources. These sources may include various online platforms, websites, or databases where job postings and descriptions are available.
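As a non-limiting sketch of this data collection step, the snippet below scrapes job description text from a listings page and compiles it into a CSV file for the data compilation module 214; the URL, CSS selectors, and column names are hypothetical placeholders rather than details of any actual job platform.

    # Hypothetical example: the URL and CSS selectors below are placeholders, not a real job platform.
    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape_job_descriptions(listing_url):
        response = requests.get(listing_url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        jobs = []
        for card in soup.select("div.job-card"):                 # placeholder selector
            jobs.append({
                "title": card.select_one("h2.title").get_text(strip=True),
                "description": card.select_one("div.description").get_text(" ", strip=True),
                "sector": "IT",                                   # e.g. IT, Finance, Pharma, Energy
            })
        return jobs

    # Compile the scraped JDs into a CSV file for subsequent parsing and lexical analysis.
    with open("job_descriptions.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "description", "sector"])
        writer.writeheader()
        writer.writerows(scrape_job_descriptions("https://example.com/jobs"))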
[0056] In an embodiment, the data compilation module 214 may be configured to store the collected plurality of job descriptions in an electronic file, such as a CSV (comma-separated values) file, facilitating subsequent parsing and lexical analysis.
[0057] In an embodiment, the data extraction module 216 may be configured to extract each job description from the CSV file and store corresponding information in text files separately. Job descriptions are typically organized in rows or records within the CSV file, and the data extraction module processes these entries.
[0058] In an embodiment, the annotation module 218 may be configured to utilize a Named Entity Recognition (NER) annotator tool 112, such as the Tecoholic Named Entity Recognition (NER) annotator, that enables annotation and identification of a first set of segments (interchangeably referred to as entities, herein) within the job descriptions in the text files. This empowers manual annotation of job description data, encompassing identification and categorization of the specific first set of segments or entities. The first set of segments includes, but is not limited to, job title, number of vacancies, college category, about the company, company name, job location, availability (immediate joiner, notice period), salary range, salary type (daily, weekly, monthly, or annually), essential qualification, hard skills, soft skills, domain (IT, Finance, Pharma, Energy), work mode (onsite, remote, hybrid), mode of employment (part-time, full-time, freelance, internship), roles and responsibilities, perks and benefits, level of role (junior, senior, associate, mid, C-level, director, VP), preferred or desired qualification, salary currencies, overall experience, relevant experience, job description summary, experience skills, preferred skills, and remaining.
[0059] In an exemplary implementation, for effective training, the Tecoholic NER annotator tool is employed, which allows for precise manual annotation of JDs data. The NER annotator tool 112 systematically identifies and labels specific segments or entities within the JDs. These contain vital elements:
a) Job Title: Specific title or position mentioned in the job description.
b) Number of Vacancies: Quantity of job openings available for the mentioned position.
c) College Category: Categorization of educational institutions attended by candidates, if mentioned in the job description.
d) About the Company: A brief overview of the hiring company.
e) Company Name: Name of the organization or company offering the job.
f) Job Location: Geographical location where the job is based.
g) Availability: Information regarding the candidate's availability, such as whether they are an immediate joiner or have a notice period.
h) Salary range: Range of the salary.
i) Salary Type: Specifies whether the salary is provided on a daily, weekly, monthly, or annual basis.
j) Essential Qualification: Minimum educational qualifications required for the job.
k) Hard Skills: Specific technical skills and competencies required for the job.
l) Soft Skills: Non-technical skills and personal attributes that are beneficial for the role.
m) Domain: Sector or industry the job belongs to, such as IT, Finance, Pharma, or Energy.
n) Work Mode: Indicates whether the job is onsite, remote, or a hybrid arrangement.
o) Mode of Employment: Specifies whether the job is part-time, full-time, freelance, or an internship.
p) Roles and Responsibilities: Duties and tasks associated with the job role.
q) Perks and Benefits: Benefits or additional offerings provided to the candidate as part of the job.
r) Level of Role: Indicates the position's seniority, such as junior, senior, associate, mid, C-level, director, or VP.
s) Preferred/Desired Qualification: Educational qualifications or skills that are preferred but not mandatory.
t) Salary Currencies: Specifies the currency in which the salary is denominated.
u) Overall Experience: Candidate's total professional work experience.
v) Relevant Experience: Experience directly related to the job role in question.
w) Job description Summary: A concise summary of the job role and its primary responsibilities.
x) Experience Skills: Skills gained through professional experience.
y) Preferred Skills: Additional skills that are preferred by the hiring company but not mandatory.
z) Remaining: Any information that is not covered by the above tags.
[0060] The meticulously annotated data is then neatly structured and saved in JSON format. This annotation process ensures that the NER model can accurately identify and extract these specific entities from job descriptions during subsequent automated processing.
[0061] In an embodiment, the merging module 220 may be configured to merge the generated annotated job description data of the plurality of job descriptions into a first file (i.e. a single JSON file), and the first file includes structured information on the first set of segments. This consolidated JSON file contains structured information regarding job description segments, streamlining access and manipulation for subsequent deep learning operations.
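A minimal sketch of such a merging step is shown below, assuming each annotation export is a JSON list of (text, entities) items as produced by the annotator tool; the directory and file names are illustrative only.

    # Illustrative sketch: consolidate per-JD annotation exports into a single JSON file.
    # The "annotations/" directory and output file name are assumptions.
    import glob
    import json

    merged = []
    for path in glob.glob("annotations/*.json"):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))           # each item: [text, {"entities": [[start, end, label], ...]}]

    with open("merged_annotations.json", "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)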
[0062] In an embodiment, the annotation trimming module 222 may be configured to conduct entity annotation trimming by removing leading and trailing white spaces from the first set of segments of the merged job description data, ensuring precision and preparing the JDs data for training. In an exemplary embodiment, the system employs a 'trim_entity_spans' function for cleaning entity annotations. This function operates on a list of data in SpaCy JSON format, where each item is a tuple containing text and a dictionary of entities. Its purpose is to enhance data cleanliness by removing leading and trailing white spaces from entity spans. This step is crucial to ensure the accuracy of NER models, as extraneous white spaces in entity annotations can impact performance. The function utilizes regular expressions to identify and eliminate invalid span tokens, specifically white spaces. The resulting cleaned data is stored in the 'cleaned_data' list, with each item maintaining the tuple structure of text and a dictionary of cleaned entities. In case of errors during processing, the system prints an error message and seamlessly proceeds to the next data item. The finalized cleaned data is then returned as 'Processed_Annotated_data.'
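The following is a minimal sketch of such a 'trim_entity_spans' function consistent with the description above; the exact implementation employed by the system may differ.

    import re

    def trim_entity_spans(data):
        """Remove leading/trailing white spaces from entity spans in SpaCy-style annotated data."""
        invalid_span_tokens = re.compile(r"\s")          # white space treated as an invalid span token
        cleaned_data = []
        for item in data:
            try:
                text, annotations = item
                valid_entities = []
                for start, end, label in annotations["entities"]:
                    while start < end and invalid_span_tokens.match(text[start]):
                        start += 1                        # trim leading white space
                    while end > start and invalid_span_tokens.match(text[end - 1]):
                        end -= 1                          # trim trailing white space
                    valid_entities.append([start, end, label])
                cleaned_data.append((text, {"entities": valid_entities}))
            except Exception as exc:                      # on error, report and continue with the next item
                print(f"Skipping item due to error: {exc}")
        return cleaned_data

    # Example usage on one annotated item with stray spaces around the entity span:
    sample = [("Job Title:  Data Scientist ", {"entities": [[10, 27, "JOB_TITLE"]]})]
    Processed_Annotated_data = trim_entity_spans(sample)   # span becomes [12, 26, "JOB_TITLE"]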
[0063] In an embodiment, the data loading module 224 may be configured to convert the first file of the merged job description data into a SpaCy format. The cleaned SpaCy-formatted file containing annotated job description data is loaded into a SpaCy RoBERTa NER model, i.e. a cutting-edge deep-tech solution renowned for its capabilities in Natural Language Processing (NLP) tasks.
[0064] In an embodiment, the training and testing module 226 may be configured to train the NER RoBERTa model (i.e. level 1 training) on the SpaCy-formatted merged job description data via the training and testing unit 114.
[0065] In an embodiment, RoBERTa's functionality is applied to job description parsing as described below.
[0066] Text Extraction: Initially, job descriptions are in raw text format.
[0067] Tokenization: The text from the job description needs to be split into individual tokens or words. RoBERTa uses subword tokenization, which breaks text into smaller units like words or subwords to handle various languages and out-of-vocabulary words. In an example, RoBERTa's subword tokenization is a sophisticated technique that allows the model to process text efficiently, handle different languages, manage complex words, and adapt to various writing styles. It is a crucial step that enables the model to work with text data effectively.
[0068] Job description: When RoBERTa receives a piece of raw text from a job description, the text needs to be segmented into smaller, manageable units. Instead of treating the text as one continuous string, tokenization splits it into discrete tokens.
[0069] Word-Level and Subword-Level Tokenization: Tokenization can occur at different levels. RoBERTa uses subword tokenization, which goes beyond simple word-level tokenization. In subword tokenization, words are further broken down into smaller units, often subword pieces. This is particularly useful for handling complex words, compound words, or languages with extensive morphology.
[0070] Vocabulary Handling: RoBERTa uses a predefined vocabulary of subword tokens. These tokens are like puzzle pieces that can be used to represent any word in a language. The model tries to find the best combination of these subword tokens to represent the original words in a text. This approach allows RoBERTa to handle words that may not be present in its vocabulary, effectively addressing the issue of out-of-vocabulary words.
[0071] Multilingual Support: RoBERTa can effectively tokenize and understand text in various languages, making the system versatile in a global context.
[0072] Context Preservation: Tokenization preserves contextual meaning of words and phrases. By breaking text into meaningful units, RoBERTa can understand the relationships between words and their positions in a sentence, which is essential for comprehending the semantics and nuances of language.
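As a non-limiting illustration of this subword tokenization, the snippet below uses the publicly available Hugging Face tokenizer for the RoBERTa base model; the sample sentence is an invented job description fragment.

    # Illustrative only: shows how a JD fragment is broken into subword pieces.
    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    tokens = tokenizer.tokenize("Hiring a Kubernetes-savvy MLOps engineer, hybrid work mode")
    print(tokens)
    # Compound or out-of-vocabulary words (e.g. "Kubernetes", "MLOps") are represented as
    # combinations of known subword tokens rather than a single unknown token.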
[0073] Information Extraction: RoBERTa can be used for extracting specific information from the Job description text.
[0074] Named Entity Recognition (NER): RoBERTa can help identify entities such as JOB_TITLE, NO_OF_VACANCIES, COLLEGE_CATEGORY, ABOUT_THE_COMPANY, COMPANY_NAME, JOB_LOCATION, AVAILABILITY (IMMEDIATE JOINER, NOTICE PERIOD), SALARY_RANGE, SALARY_TYPE (DAILY, WEEKLY, MONTHLY, OR ANNUALLY), ESSENTIAL_QUALIFICATION, HARD_SKILLS, SOFT_SKILLS, DOMAIN (IT, ENERGY, FINANCE, PHARMA), WORK_MODE (ONSITE, REMOTE, HYBRID), MODE_OF_EMPLOYMENT (PART TIME, FULL TIME, FREELANCER, INTERNSHIP), ROLES_AND_RESPONSIBILITIES, PERKS_AND_BENEFITS, LEVEL_OF_ROLE (JUNIOR, SENIOR, ASSOCIATE, MID, C-LEVEL, DIRECTOR, VP), PREFERED/DESIRED QUALIFICATION, SALARY_CURRENCIES, OVERALL_EXPERIENCE, RELEVANT_EXPERIENCE, JOB_DESCRIPTION_SUMMARY, EXPERIENCE_SKILLS, and PREFERED_SKILLS, with any other information grouped into 'Remaining', within the job description text as outlined below:
JOB_TITLE: RoBERTa can identify and extract the job title mentioned in the JD, which is crucial for understanding the role.
NO_OF_VACANCIES: RoBERTa can recognize and extract the number of job openings mentioned in the JD.
COLLEGE_CATEGORY: RoBERTa can pinpoint and extract information about the preferred college category or educational background specified in the JD.
ABOUT_THE_COMPANY: RoBERTa can extract the description of the hiring company, allowing job seekers to learn more about the organization.
COMPANY_NAME: RoBERTa can recognize and extract the name of the hiring company from the JD.
JOB_LOCATION: RoBERTa can identify and extract the specific geographic location or locations where the job is located.
AVAILABILITY: RoBERTa can extract information regarding whether the company is looking for immediate joiners or candidates with specific notice periods.
SALARY_RANGE: RoBERTa can recognize and extract the salary range or expected compensation mentioned in the JD.
SALARY_TYPE: RoBERTa can extract information about how the salary is paid, whether it's daily, weekly, monthly, or annually.
ESSENTIAL_QUALIFICATION: RoBERTa can identify and extract the essential qualifications required for the job.
HARD_SKILLS: RoBERTa can pinpoint and extract specific technical or job-related skills mentioned as requirements.
SOFT_SKILLS: RoBERTa can extract information about non-technical skills or interpersonal qualities desired for the role.
DOMAIN: RoBERTa can recognize and categorize the industry or field of work associated with the job.
WORK_MODE: RoBERTa can identify and extract information about whether the job is onsite, remote, or a hybrid work arrangement.
MODE_OF_EMPLOYMENT: RoBERTa can extract details about the type of employment, whether it's part-time, full-time, freelance, or an internship.
ROLES_AND_RESPONSIBILITIES: RoBERTa can identify and extract the specific duties and responsibilities associated with the job.
PERKS_AND_BENEFITS: RoBERTa can recognize and extract information about additional benefits or incentives offered by the company.
LEVEL_OF_ROLE: RoBERTa can extract the level of the job position, such as junior, senior, associate.
PREFERRED/DESIRED QUALIFICATION: RoBERTa can identify and extract qualifications and skills that are preferred but not mandatory.
SALARY_CURRENCIES: RoBERTa can extract the currency in which the salary is mentioned.
OVERALL_EXPERIENCE: RoBERTa can identify and extract the total years of professional experience required for the job.
RELEVANT_EXPERIENCE: RoBERTa can extract information about the number of years of experience specifically related to the job.
JOB_DESCRIPTION_SUMMARY: RoBERTa can identify and summarize the key points in the job description, providing a quick overview.
REMAINING: Any other information that doesn't fit into the aforementioned categories is grouped under "Remaining," ensuring a comprehensive analysis of the Job description text.
[0075] Additionally, after extracting information from the job description, RoBERTa assists in structuring the extracted information into a formatted structure, such as JSON or a database, facilitating easier search and analysis of the JDs data. Moreover, RoBERTa's contextual language understanding aids in resolving ambiguities and comprehending the context of the information provided in the job description. Furthermore, RoBERTa is trained in a diverse range of languages, making the system suitable for job descriptions in multiple languages, thereby enhancing the flexibility of the job description parsing system.
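A minimal sketch of structuring the extracted entities into JSON is given below; the pipeline object 'nlp' is assumed to be the trained model and 'jd_text' an input job description string.

    # Illustrative sketch: group recognised entities by label and serialise them as JSON.
    import json
    from collections import defaultdict

    def entities_to_json(nlp, jd_text):
        doc = nlp(jd_text)
        structured = defaultdict(list)
        for ent in doc.ents:
            structured[ent.label_].append(ent.text)   # e.g. {"JOB_TITLE": [...], "HARD_SKILLS": [...]}
        return json.dumps(structured, ensure_ascii=False, indent=2)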
[0076] In an embodiment, the training and testing unit 114 may be further configured to store the trained NER RoBERTa model on a server. Through this level 1 training, the model acquires the ability to recognize and extract key segments of job descriptions with remarkable precision, owing to the transformative architecture of the RoBERTa model, deeply rooted in the world of deep learning and NLP.
[0077] In an exemplary embodiment, the process of model training is a critical step in developing high-performing linguistic models. To facilitate this, the scikit-learn library's model_selection module offers a robust tool known as the train_test_split function. This function plays a pivotal role in partitioning datasets into distinct training and testing subsets, a foundational practice in NLP research and development. The Processed_Annotated_data from the Trim Entity Span step is taken for the training.
[0078] test_size=0.3: This parameter determines the proportion of the data that is allocated to the testing set. In this case, the test size is set to 0.3, which means 30% of the entire dataset will be used as the testing set, while the remaining 70% will be used as the training set. Adjusting this parameter allows control of the size of training and testing sets.
[0079] train, test: These are two variables used to store the resulting datasets after the split.
[0080] train: This variable contains a training dataset, which typically includes a majority of data (70% in this case), and the training dataset is used to train the model.
[0081] test: This variable contains a testing dataset, which consists of a smaller portion of data (30% in this case), and the testing dataset is used to evaluate the performance of the trained model by testing on data that hasn't been seen during training.
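A minimal sketch of this split is shown below; it assumes that Processed_Annotated_data is the cleaned list produced by the trimming step, and the random seed is an assumption added for reproducibility.

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(
        Processed_Annotated_data,   # list of (text, {"entities": [...]}) items from the trimming step
        test_size=0.3,              # 30% held out for testing, 70% used for training
        random_state=42,            # assumed seed for reproducibility; not specified in the description
    )
    print(len(train), len(test))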
[0082] Get_SpaCy_doc function: The get_SpaCy_doc function takes two arguments: file (a filename for error logging) and data (an iterable containing text and entity annotation data).
[0083] Inside the function: The function initialises SpaCy's English language model with large word vectors ('en_core_web_lg') and creates an empty DocBin to store processed documents. It iterates through each item in the data iterable, which is expected to be a collection of text and entity annotations. For each text and annotation pair, it processes the text using SpaCy (nlp.make_doc(text)) and extracts the entity annotations. It then iterates through the entity annotations and checks for overlapping entities by examining character indices; if an overlap is found, the entity is skipped. For non-overlapping entities, it creates SpaCy Span objects representing the entities and appends them to the ents list. If the creation of a Span fails, an error is logged to the specified file. After processing all entities in the text, the list of entity Spans is assigned to the ents attribute of the processed Doc object. Finally, the processed Doc object is added to the DocBin.
[0084] Outside the function: A file variable is defined with the value 'error.txt', which is the filename used for error logging, and the get_SpaCy_doc function is used to process the training data (train) and the testing data (test). The processed documents are saved in SpaCy's binary format ('.spacy') for both the training and testing datasets.
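A reconstruction of the get_SpaCy_doc routine consistent with the above description is sketched below; variable names, the overlap check, and file paths follow the description but are not verbatim source code.

    import spacy
    from spacy.tokens import DocBin

    def get_SpaCy_doc(file, data):
        nlp = spacy.load("en_core_web_lg")            # English pipeline with large word vectors
        db = DocBin()                                 # container for the processed documents
        for text, annotations in data:
            doc = nlp.make_doc(text)
            ents = []
            seen_indices = set()
            for start, end, label in annotations["entities"]:
                # Skip entities whose character indices overlap an already accepted span.
                if any(i in seen_indices for i in range(start, end)):
                    continue
                span = doc.char_span(start, end, label=label, alignment_mode="strict")
                if span is None:                      # Span creation failed: log the error and move on
                    with open(file, "a", encoding="utf-8") as err:
                        err.write(f"Skipping entity [{start}, {end}, {label}] in: {text[:60]}\n")
                    continue
                seen_indices.update(range(start, end))
                ents.append(span)
            doc.ents = ents                           # attach the entity spans to the Doc
            db.add(doc)
        return db

    file = "error.txt"
    get_SpaCy_doc(file, train).to_disk("train_data.spacy")   # 'train' and 'test' come from the split above
    get_SpaCy_doc(file, test).to_disk("test_data.spacy")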
[0085] Model Training: A SpaCy NLP (Natural Language Processing) model is trained using the SpaCy train command. The training process is specified by a configuration file (config.cfg) that defines various settings, the model architecture and hyperparameters, and the data paths used to save the trained model. The training data, in SpaCy's binary format, is loaded from 'train_data.spacy', the testing data, also in SpaCy's binary format, is loaded from 'test_data.spacy', and a GPU device is used for the training.
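As a non-limiting illustration, the training run described above could be launched either through the command line or through SpaCy v3's Python training helper, as sketched below; the output directory and GPU identifier are assumptions.

    # Equivalent CLI form (assumed paths):
    #   python -m spacy train config.cfg --output ./output \
    #       --paths.train train_data.spacy --paths.dev test_data.spacy --gpu-id 0
    from spacy.cli.train import train

    train(
        "config.cfg",
        output_path="./output",                       # best model saved as ./output/model-best
        use_gpu=0,                                    # GPU device used for training; -1 for CPU
        overrides={
            "paths.train": "train_data.spacy",
            "paths.dev": "test_data.spacy",
        },
    )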
[0086] Config.cfg: This configuration file (config.cfg) defines various settings for the SpaCy training process. The key sections and settings in the configuration file are defined as follows:
[0087] Specifies general settings for the NLP model as detailed below:
a) NLP Section: Specifies the language of the model (English) and the processing pipeline, which includes a transformer model and a named entity recognition (NER) component.
b) Components Section: Defines the NER-related components. The configuration for the NER component, "ner", specifies that the component is for named entity recognition; "spacy.ner_scorer.v1" specifies the scorer used for NER; and "spacy.TransitionBasedParser.v2" specifies the model architecture, with hidden_width = 64.
c) Transformer Component: The configuration for the "transformer" component specifies that the component is based on a transformer model, with 4096 as the maximum batch size for training.
d) Training Section: Contains settings related to the training process: the number of gradient accumulations before an update is 3, the dropout rate is 0.1, the maximum number of training steps is 20000, the patience is 1600, and the frequency of model evaluation during training is 200.
e) Training Optimizer Section: Contains the configuration for the optimizer used during training; the Adam optimizer and its hyperparameters are defined, along with L2 weight decay settings and a gradient clipping threshold of 1.0.
f) Learning Rate Scheduling: "warmup_linear.v1" specifies a linear learning rate warmup, and the initial learning rate is defined.
[0088] Additionally, the training and testing unit 114 performs a first level of testing on the NER RoBERTa model with a set of unseen data to evaluate its performance. For instance, the set of unseen data is provided in PDF or DOC format, and the text is extracted from the set of unseen data and passed to the model to obtain predictions for further analysis. During level 1 testing, a second set of segments is extracted from the trained NER RoBERTa model, and the extracted second set of segments is annotated. For instance, segments relating to education are further annotated to obtain Required and Preferred Qualification [Level, Field of Study]. The annotated segments are then merged into a second file, and the second file is converted into a digital format (i.e. SpaCy format).
[0089] In an embodiment, the training and testing unit 114 trains the trained NER RoBERTa model (i.e. level 2 training) from the SpaCy-formatted second file in JSON format, and stores the trained NER RoBERTa model on the server 110. The server 110, a hub of intelligence, securely stores both the Level 1 and Level 2 trained models, ready to empower recruitment processes with unparalleled accuracy and efficiency. Further, a second level of testing is performed on the NER RoBERTa model with the received set of unseen data. Once the models for level 1 and level 2 are finalised, they can be used together to extract useful information. For instance, the trained models can process new job descriptions and provide structured output.
[0090] In an exemplary embodiment, testing includes steps as detailed below:
[0091] Loading the Trained Customised SpaCy RoBERTa NER Model: The first step in this process involves loading 'Model-Best' from the trained SpaCy NER model collection. This model has been previously trained to understand the structure, context, and language of text from various kinds of job descriptions. PyPDF2 is used for PDF handling to convert PDF to text.
[0092] A specific PDF-to-text conversion function, named `extract_text_from_pdf`, is defined to extract text from a PDF document (a sketch is provided after the list below). This function takes the path to a PDF file as input and performs the following steps:
i) Opens the PDF file in binary read mode.
ii) Initialises an empty string to store the extracted text.
iii) Iterates through each page in the PDF document.
iv) For each page, extracts the text and appends it to the 'text' variable.
v) Finally, returns the accumulated text from all pages as a single string.
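A minimal sketch of such a function, following steps i) to v) above and using the PyPDF2 PdfReader interface, is given below; the exact source implementation may differ.

    from PyPDF2 import PdfReader

    def extract_text_from_pdf(pdf_path):
        text = ""                                     # ii) empty string to accumulate the extracted text
        with open(pdf_path, "rb") as pdf_file:        # i) open the PDF file in binary read mode
            reader = PdfReader(pdf_file)
            for page in reader.pages:                 # iii) iterate through each page in the document
                text += page.extract_text() or ""     # iv) extract the page text and append it
        return text                                   # v) return the accumulated text as a single string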
[0093] Processing Text with SpaCy NLP: The loaded SpaCy NLP model is used to process the extracted text stored in the `text` variable. This includes breaking down the text into its constituent parts (tokens) and identifying various linguistic features, including named entities. The model is trained to extract the following segments from a Job Description:
JOB_TITLE, NO_OF_VACANCIES, COLLEGE_CATEGORY, ABOUT_THE_COMPANY, COMPANY_NAME, JOB_LOCATION, AVAILABILITY (IMMEDIATE JOINER, NOTICE PERIOD), SALARY_RANGE, SALARY_TYPE (DAILY, WEEKLY, MONTHLY, OR ANNUALLY), ESSENTIAL_QUALIFICATION, HARD_SKILLS, SOFT_SKILLS, DOMAIN (IT, ENERGY, FINANCE, PHARMA), WORK_MODE (ONSITE, REMOTE, HYBRID), MODE_OF_EMPLOYMENT (PART TIME, FULL TIME, FREELANCER, INTERNSHIP), ROLES_AND_RESPONSIBILITIES, PERKS_AND_BENEFITS, LEVEL_OF_ROLE (JUNIOR, SENIOR, ASSOCIATE, MID, C-LEVEL, DIRECTOR, VP), PREFERED/DESIRED QUALIFICATION, SALARY_CURRENCIES, OVERALL_EXPERIENCE, RELEVANT_EXPERIENCE, JOB_DESCRIPTION_SUMMARY, EXPERIENCE_SKILLS, and PREFERED_SKILLS, with other information grouped into REMAINING.
[0094] Iterating Through Named Entities: The process iterates through the named entities identified within the processed text. For each named entity found in the text, the model prints two pieces of information: the text of the entity itself and the label associated with that entity.
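By way of a non-limiting example, the testing flow described above may be sketched as follows, assuming 'output/model-best' is the path of the trained model and 'sample_jd.pdf' an unseen job description; extract_text_from_pdf is the function sketched earlier.

    import spacy

    nlp = spacy.load("output/model-best")             # assumed path of the trained 'Model-Best'

    text = extract_text_from_pdf("sample_jd.pdf")     # PDF-to-text conversion (sketched above)
    doc = nlp(text)                                   # tokenization and named entity recognition

    for ent in doc.ents:                              # print each entity's text and its label
        print(ent.label_, "->", ent.text)             # e.g. JOB_TITLE -> Senior Data Engineer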
[0095] This training and testing module 226 ensures accurate extraction of information, unveiling insights and paving the way for potential model enhancements.
[0096] FIG. 3 illustrates an exemplary view of a flow diagram of the proposed method for parsing and analysis of job descriptions, in accordance with some embodiments of the present disclosure.
[0097] In an embodiment, a method 300 for parsing and analysis of job descriptions is disclosed. At step 302, a processor 202 collects a plurality of job descriptions (JDs) from a set of sources in CSV format, via web scraping from job platforms. This comprehensive data-gathering process spans sectors such as, but not limited to, IT, Finance, Pharma, and Energy, enriching the dataset with a broad spectrum of expertise. The JDs are in PDF or DOC format.
[0098] At step 304, the processor 202, stores the collected plurality of job descriptions in an electronic file such as in CSV format. For example, CSV is a common and widely used file format for storing tabular data, and it is suitable for representing structured information such as job descriptions. Each job description, along with its associated details, is organized in rows and columns within the CSV file. Storing the job descriptions in this electronic format facilitates subsequent processing and analysis, as CSV files are easily readable by both humans and computer systems. This step ensures a systematic and organized storage of the collected data, setting the stage for further stages in the parsing and analysis process.
[0099] At step 306, the processor 202 extracts each job description from the electronic file and stores corresponding information in text files separately. This step includes converting structured tabular data from the CSV file into a format that retains the textual content and specific details of each job description. By storing the information in text files, the processor creates a more flexible and accessible format for subsequent processing. In one example, text files are commonly used for handling unstructured data and are suitable for natural language processing tasks. This separation of job descriptions into individual text files prepares the data for further analysis and annotation, allowing for efficient utilization of natural language processing techniques in subsequent steps of the method.
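A minimal sketch of this extraction step is shown below; the CSV file name, the 'description' column name, and the output file naming scheme are assumptions for illustration.

    import csv
    from pathlib import Path

    out_dir = Path("jd_text_files")
    out_dir.mkdir(exist_ok=True)

    with open("job_descriptions.csv", newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            # One text file per job description, e.g. jd_0001.txt, jd_0002.txt, ...
            (out_dir / f"jd_{i:04d}.txt").write_text(row["description"], encoding="utf-8")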
[00100] At step 308, the processor 202 utilises a Named Entity Recognition (NER) annotator tool 112 for annotation and identification of a first set of segments within the plurality of job descriptions in the text files. The first set of segments includes, but is not limited to, job title, number of vacancies, college category, company brief, company name, job location, availability, salary range, salary type, qualification, hard skill, soft skill, domain, work mode, mode of employment, roles and responsibilities, perk and benefit, level of role, preferred qualification, salary currency, total experience, relevant experience, job description summary, experience skill, preferred skill, and remaining.
[00101] At step 310, the processor 202 merges the generated annotated job description data of the plurality of job descriptions into a first file. This merging process results in a structured file format, specifically JSON (JavaScript Object Notation), which is commonly used for storing and exchanging structured data. The first file includes information related to the first set of segments, which are specific elements identified through Named Entity Recognition (NER) within the job descriptions. These segments encompass various details such as job title, number of vacancies, company details, location, salary information, qualifications, skills, and more. Additionally, the step includes entity annotation trimming, a process of refining the annotated data by removing any unnecessary leading and trailing white spaces from the identified segments. This trimming ensures data precision and cleanliness, which is crucial for the subsequent stages of the analysis. The resulting first file, in JSON format, serves as a well-organized and cleaned dataset, ready for further processing and training of the Named Entity Recognition model.
[00102] At step 312, the processor 202 converts the previously generated first file, containing merged and annotated job description data in JSON format, into a format compatible with spaCy, a popular natural language processing (NLP) library. This format conversion is essential to prepare the data for subsequent processing using spaCy's capabilities. For instance, SpaCy has its own data structures and formats for representing annotated text, and converting the data into spaCy format involves adapting the information to fit these structures. Typically, spaCy uses binary formats (e.g., ".spacy") to store processed documents, allowing for efficient and optimized handling during NLP tasks.
[00103] Further, conversion to spaCy format enables the utilization of spaCy's functionalities, such as training Named Entity Recognition (NER) models and performing linguistic analyses on the job description data. This processed spaCy-formatted data becomes an input for training the NER RoBERTa model in the next steps of the method.
[00104] At step 314, a NER RoBERTa model is trained (i.e. level 1 training) from the SpaCy-formatted merged job description data. The training is conducted by a training and testing unit 114. The key steps in this process are as follows:
[00105] Level 1 Training: The NER RoBERTa model is trained using the spaCy-formatted merged job description data generated in the previous steps. RoBERTa is a transformer-based model known for its effectiveness in natural language understanding tasks.
[00106] Storage of Trained Model: Once the NER RoBERTa model is trained, it is stored on a server 110. This allows for easy access and deployment when needed for subsequent tasks.
[00107] First Level Testing: The trained NER RoBERTa model undergoes a preliminary testing phase using a set of unseen data. This step assesses the model's performance on new and unseen job description information.
[00108] Extraction and Annotation: The second set of segments is extracted from the trained NER RoBERTa model. These segments likely correspond to various entities identified during the NER process, such as education-related information (e.g., required and preferred qualifications). The specific example mentioned, focusing on the second segment of education and further annotating for required and preferred qualifications (Level, Field of Study), highlights the detailed nature of the information extraction process, emphasizing the model's ability to discern specific attributes within the job descriptions.
[00109] Merging and Conversion: The annotated information extracted in the previous step is merged into a second file, which is then converted into spaCy format. This prepares the data for further analysis and processing using spaCy's capabilities.
[00110] Additionally, the training and testing include a second level of training (referred to as level 2 training) for the Named Entity Recognition (NER) RoBERTa model. The key steps in this process are as follows:
[00111] Retraining the Model: The previously trained NER RoBERTa model, which is stored after the first level of training, is used as the starting point for the second level of training. This involves refining the model based on the accumulated knowledge from the initial training and testing phases.
[00112] Data Source: The training and testing unit takes the SpaCy-formatted second file, which contains annotated information from the first level of training, as the input data source for the second level of training. This file likely contains enriched information about various entities identified in job descriptions.
[00113] Storage of Updated Model: After completing the second level of training, the refined NER RoBERTa model is stored again on the server. This updated model now incorporates additional insights gained from the second level of training.
[00114] Second Level Testing: Following the second training phase, a second level of testing is performed. This involves evaluating the model's performance using a set of unseen data, providing a more comprehensive assessment of the model's accuracy and generalization capabilities.
[00115] The iterative nature of the training process (level 1 and level 2) allows the NER RoBERTa model to continuously improve its ability to recognize and annotate entities in job descriptions. This adaptability is crucial for handling diverse and evolving language patterns in job postings and ensuring the model's effectiveness over time.
[00116] Thus, the present disclosure provides a system and method that seamlessly integrates data acquisition, text processing, annotation, and advanced NER model training, offering a robust solution for parsing and analyzing job descriptions. By combining multiple technologies, such as Named Entity Recognition and RoBERTa models, the system ensures precise extraction of key information, providing users with a comprehensive and efficient tool for job-related insights.
[00117] The disclosed methods and systems, as illustrated in the ongoing description or any of its components, may be embodied in the form of a computer system. Typical examples of a computer system include a general-purpose computer, a programmed microprocessor, a micro-controller, a peripheral integrated circuit element, and other devices, or arrangements of devices that are capable of implementing the steps that constitute the method of the disclosure.
[00118] The computer system includes a computer, an input device, a display unit, and the internet. The computer further includes a microprocessor. The microprocessor is connected to a communication bus. The computer also includes a memory. The memory may be RAM or ROM. The computer system further comprises a storage device, which may be an HDD or a removable storage drive such as a floppy-disk drive, an optical-disk drive, and the like. The storage device may also be a means for loading computer programs or other instructions onto the computer system. The computer system also includes a communication unit. The communication unit allows the computer to connect to other databases and the internet through an input/output (I/O) interface, allowing the transfer as well as reception of data from other sources. The communication unit may include a modem, an Ethernet card, or other similar devices that enable the computer system to connect to databases and networks, such as LAN, MAN, WAN, and the internet. The computer system facilitates input from a user through input devices accessible to the system through the I/O interface.
[00119] To process input data, the computer system executes a set of instructions stored in one or more storage elements. The storage elements may also hold data or other information, as desired. The storage element may be in the form of an information source or a physical memory element present in the processing machine.
[00120] The programmable or computer-readable instructions may include various commands that instruct the processing machine to perform specific tasks, such as steps that constitute the method of the disclosure. The systems and methods described can also be implemented using only software programming or only hardware, or using a varying combination of the two techniques. The disclosure is independent of the programming language and the operating system used in the computers. The instructions for the disclosure can be written in various programming languages, including, but not limited to, ‘C’, ‘C++’, ‘Visual C++’, ‘Visual Basic’, ‘Java’, and ‘Python’. Further, the software may be in the form of a collection of separate programs, a program module that is part of a larger program, or a portion of a program module, as discussed in the ongoing description. The software may also include modular programming in the form of object-oriented programming. The processing of input data by the processing machine may be in response to user commands, the results of previous processing, or a request made by another processing machine. The disclosure can also be implemented in various operating systems and platforms, including, but not limited to, ‘Unix’, ‘DOS’, ‘Android’, ‘Symbian’, and ‘Linux’.
[00121] The programmable instructions can be stored and transmitted on a computer-readable medium. The disclosure can also be embodied in a computer program product comprising a computer-readable medium, or with any product capable of implementing the above methods and systems, or the numerous possible variations thereof.
[00122] Various embodiments of the system and method for parsing and analysis of job descriptions have been disclosed. However, it should be apparent to those skilled in the art that modifications in addition to those described are possible without departing from the inventive concepts herein. The embodiments, therefore, are not restrictive, except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be understood in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps, in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, used, or combined with other elements, components, or steps that are not expressly referenced.
[00123] Those having ordinary skill in the art will appreciate that the systems, modules, and sub-modules have been illustrated and explained to serve as examples and should not be considered limiting in any manner. It will be further appreciated that the variants of the above-disclosed system elements, modules, and other features and functions, or alternatives thereof, may be combined to create other different systems or applications.
[00124] Those skilled in the art will appreciate that any of the aforementioned steps and/or system modules may be suitably replaced, reordered, or removed, and additional steps and/or system modules may be inserted, depending on the needs of a particular application. In addition, the systems of the aforementioned embodiments may be implemented using a wide variety of suitable processes and system modules, and are not limited to any particular computer hardware, software, middleware, firmware, microcode, and the like.
[00125] The claims can encompass embodiments for hardware and software or a combination thereof.
[00126] It will be appreciated that variants of the above disclosed, and other features and functions or alternatives thereof, may be combined into many other different systems or applications. Presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
ADVANTAGES OF THE PRESENT DISCLOSURE
[00127] The present disclosure provides a system that improves efficiency of job description (JD) analysis, addressing the limitations inherent in manual methods.
[00128] The present disclosure provides a system that guarantees accuracy and consistency in extracting essential information from diverse JDs, mitigating the subjectivity and variability associated with manual analysis.
[00129] The present disclosure provides a system that significantly enhances the time efficiency of JD analysis; through automation, the system processes a substantial volume of JDs in a shorter timeframe, streamlining recruitment workflows and expediting decision-making.
[00130] The present disclosure provides a system that demonstrates scalability by adapting to changes in the job market and industry-specific terminology, and efficiently handles a diverse range of JDs across various sectors, ensuring ongoing relevance and adaptability.
[00131] The present disclosure provides a system that minimizes biases in JD analysis; by relying on predefined rules and automated algorithms, the system reduces the potential for biases introduced during manual analysis, promoting fair and impartial assessments.
[00132] The present disclosure provides a system that enhances context understanding within JDs by leveraging advanced Natural Language Processing (NLP) techniques, improving nuanced language analysis and the extraction of relevant information.
[00133] The present disclosure provides a system that adapts to evolving industry-specific terminology, ensuring effectiveness in capturing the latest trends and requirements across various job markets.
[00134] The present disclosure provides a system with a user-friendly interface, prioritizing accessibility for recruiters and stakeholders, and contributing to a seamless and positive user experience.
[00135] The present disclosure provides a system that streamlines recruitment processes. The automation of JD parsing facilitates quicker decision-making, reduces manual effort, and enhances overall efficiency in talent acquisition.
[00136] The present disclosure provides a system that generates structured data outputs, which are leveraged for data-driven insights into job market trends, skill requirements, and other valuable information, contributing to informed decision-making in recruitment strategies.
Claims:We Claim:
1. A system (102) for job description (JD) parsing and analysis, the system comprising:
one or more processors (202) coupled with a memory (204), wherein said memory (204) stores instructions which when executed by the one or more processors (202) cause the system (102) to:
collect a plurality of job descriptions from a set of sources;
store the collected plurality of job descriptions in an electronic file;
extract each job description from the electronic file and store corresponding information in text files separately;
utilize a Named Entity Recognition (NER) annotator tool (112) that enables annotation and identification of a first set of segments within the plurality of job descriptions in the text files;
merge the generated annotated job description data of the plurality of job descriptions into a first file, wherein the first file comprises structured information on the first set of segments, and conduct entity annotation trimming by removing leading and trailing white spaces from the first set of segments of the merged job description data; and
convert the first file of the merged job description data into a digital format; and
a training and testing unit configured to train a NER RoBERTa model from the digitally formatted merged job description data, wherein the training and testing unit is further configured to:
store the trained NER RoBERTa model on a server;
perform a first level of testing on the NER RoBERTa model with a set of unseen data;
extract a second set of segments from the trained NER RoBERTa model;
annotate the extracted second set of segments, and merge into a second file;
convert the second file into a digital format;
train the trained NER RoBERTa model from the digitally formatted second file, and store the trained NER RoBERTa model on the server; and
perform a second level of testing on the NER RoBERTa model with the received set of unseen data.
2. The system (102) as claimed in claim 1, wherein the first set of segments comprises any or a combination of job title, number of vacancies, college category, company brief, company name, job location, availability, salary change, salary type, qualification, hard skill, soft skill, domain, work mode, mode of employment, roles and responsibilities, perk and benefit, level of role, preferred qualification, salary currency, total experience, relevant experience, job description summary, experience skill, preferred skill, and remaining.
3. The system (102) as claimed in claim 1, wherein the digital format is SpaCy format.
4. The system (102) as claimed in claim 1, wherein the NER RoBERTa model is further configured to:
receive raw text from a plurality of job descriptions;
perform tokenization, wherein the received raw text is segmented into a set of words; and
conduct subword tokenization, wherein the segmented set of words is broken down into a set of subword pieces to enable the NER RoBERTa model to extract information of the set of segments from the received plurality of job descriptions.
5. The system (102) as claimed in claim 1, further comprising:
a graphical user interface (GUI) configured to enable one or more users to visually inspect and navigate through the plurality of job descriptions presented in the digital format.
6. A method (300) for job description (JD) parsing and analysis, comprising the steps of:
collecting (302), by one or more processors (202), a plurality of job descriptions from a set of sources;
storing (304), by the one or more processors (202), the collected plurality of job descriptions in an electronic file;
extracting (306), by the one or more processors (202), each job description from the electronic file and storing corresponding information in text files separately;
utilizing (308), by the one or more processors (202), a Named Entity Recognition (NER) annotator tool for annotation and identification of a first set of segments within the plurality of job descriptions in the text files;
merging (310), by the one or more processors (202), the generated annotated job description data of the plurality of job descriptions into a first file, wherein the first file comprises structured information on the first set of segments, and conducting entity annotation trimming by removing leading and trailing white spaces from the first set of segments of the merged job description data;
converting (312), by the one or more processors (202), the first file of the merged job description data into a digital format;
training and testing (314), by a training and testing unit (114), a NER RoBERTa model from the digitally formatted merged job description data, wherein the training and testing unit is further configured to:
storing the trained NER RoBERTa model on a server;
performing a first level of testing on the NER RoBERTa model with a set of unseen data;
extracting a second set of segments from the trained NER RoBERTa model;
annotating the extracted second set of segments, and merging into a second file;
converting the second file into a digital format;
training the trained NER RoBERTa model from the digitally formatted second file, and storing the trained NER RoBERTa model on the server; and
performing a second level of testing on the NER RoBERTa model with the received set of unseen data.
7. The method as claimed in claim 6, wherein the first set of segments comprises any or a combination of job title, number of vacancies, college category, company brief, company name, job location, availability, salary change, salary type, qualification, hard skill, soft skill, domain, work mode, mode of employment, roles and responsibilities, perk and benefit, level of role, preferred qualification, salary currency, total experience, relevant experience, job description summary, experience skill, preferred skill, and remaining.
8. The method as claimed in claim 6, wherein the NER RoBERTa model is further configured to:
receiving raw text from a plurality of job descriptions;
performing tokenization, wherein the received raw text is segmented into a set of words; and
conducting subword tokenization, wherein the segmented set of words is broken down into a set of subword pieces for enabling the NER RoBERTa model to extract information of the first set of segments from the received plurality of job descriptions.
| # | Name | Date |
|---|---|---|
| 1 | 202411008739-FORM-9 [08-02-2024(online)].pdf | 2024-02-08 |
| 2 | 202411008739-FORM-26 [08-02-2024(online)].pdf | 2024-02-08 |
| 3 | 202411008739-FORM FOR SMALL ENTITY(FORM-28) [08-02-2024(online)].pdf | 2024-02-08 |
| 4 | 202411008739-FORM FOR SMALL ENTITY [08-02-2024(online)].pdf | 2024-02-08 |
| 5 | 202411008739-FORM 1 [08-02-2024(online)].pdf | 2024-02-08 |
| 6 | 202411008739-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [08-02-2024(online)].pdf | 2024-02-08 |
| 7 | 202411008739-EVIDENCE FOR REGISTRATION UNDER SSI [08-02-2024(online)].pdf | 2024-02-08 |
| 8 | 202411008739-DRAWINGS [08-02-2024(online)].pdf | 2024-02-08 |
| 9 | 202411008739-COMPLETE SPECIFICATION [08-02-2024(online)].pdf | 2024-02-08 |
| 10 | 202411008739-FORM 3 [01-03-2024(online)].pdf | 2024-03-01 |
| 11 | 202411008739-MSME CERTIFICATE [02-03-2024(online)].pdf | 2024-03-02 |
| 12 | 202411008739-FORM28 [02-03-2024(online)].pdf | 2024-03-02 |
| 13 | 202411008739-FORM 18A [02-03-2024(online)].pdf | 2024-03-02 |
| 14 | 202411008739-ENDORSEMENT BY INVENTORS [02-03-2024(online)].pdf | 2024-03-02 |
| 15 | 202411008739-FER.pdf | 2024-03-11 |
| 16 | 202411008739-FER_SER_REPLY [03-05-2024(online)].pdf | 2024-05-03 |
| 17 | 202411008739-ABSTRACT [03-05-2024(online)].pdf | 2024-05-03 |
| 18 | 202411008739-US(14)-HearingNotice-(HearingDate-18-07-2024).pdf | 2024-06-20 |
| 19 | 202411008739-Correspondence to notify the Controller [29-06-2024(online)].pdf | 2024-06-29 |
| 20 | 202411008739-Written submissions and relevant documents [01-08-2024(online)].pdf | 2024-08-01 |
| 21 | 202411008739-PatentCertificate11-10-2024.pdf | 2024-10-11 |
| 22 | 202411008739-IntimationOfGrant11-10-2024.pdf | 2024-10-11 |
| 1 | SearchHistory(21)E_06-03-2024.pdf | |