
System And Method For Resume Parsing And Information Extraction

Abstract: The present disclosure introduces a system (102) and a method (600) for resume parsing and information extraction, to enhance recruitment and human resources management processes. Through utilization of advanced techniques, such as text extraction, font size analysis, and Named Entity Recognition (NER), the system extracts, categorizes, and annotates resume data effectively. The system utilizes deep learning models for accurate information extraction and standardizes the data for further analysis. The system (102) evaluates the performance of sentence-level and word-level Natural Language Processing (NLP) models on unseen data to ensure precision. Additionally, the system (102) integrates the outputs of these models to extract comprehensive information from resumes, which is further processed using clustering techniques for grouping based on similarities. Moreover, the proposed system offers a versatile and efficient solution for resume parsing, promising improved accuracy and effectiveness in recruitment and HR management. [FIGs. 1 and 6 are the reference figures]


Patent Information

Application #
Filing Date
07 May 2024
Publication Number
20/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

AVUA INTERNATIONAL PRIVATE LIMITED
E 272 Phase, 8 A, 2nd Floor, Sector 75, Sahibzada Ajit Singh Nagar, Punjab 160071

Inventors

1. CHOUDHARY, Adit
House No. 866 / Sector - 5, Urban Estate, Kurukshetra - 136118 , Haryana
2. MANDLOI, Mohit
H-4/4, Shankar hill town, Toranagallu, bellary, karnataka - 583123
3. PAWAR, Sathwik
236-4, Shivathmika, Hiriyangadi, Opp. Shri Durgaparameshwari Temple, Karkala, Udupi, Karnataka - 574104
4. KUMAR, Bharath
71 Gollahalli, JP Nagar 9th Phase, Anjanapura Banglore South, Karnataka - 560062

Specification

Description:

TECHNICAL FIELD
[0001] The present disclosure pertains to the field of human resources management. More specifically, it relates to a system and method for resume parsing and information extraction utilizing advanced techniques for efficient extraction and categorization of resume data. This facilitates streamlined recruitment, talent acquisition, and human resources management processes.

BACKGROUND
[0002] Background description includes information that may be useful in understanding the present disclosure. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed disclosure, or that any publication specifically or implicitly referenced is prior art.
[0003] Resume parsing is a technology used in the recruitment and human resources (HR) industry to extract and organize information from job applicants' resumes or CVs (Curriculum Vitae) into a structured format that can be easily stored, searched, and analyzed by applicant tracking systems (ATS) or HR software. Traditionally, resume parsing has been performed manually, requiring recruiters to review and extract relevant details such as work experience, education, skills, and contact information from each resume. However, with the proliferation of digital resumes and the increasing volume of job applications, manual parsing has become impractical and time-consuming.
[0004] Existing resume parsing solutions attempt to automate this process by employing various techniques such as keyword matching, rule-based algorithms, and template-based extraction. While these solutions provide some level of automation, they often lack the accuracy and flexibility required to handle the diverse formats and structures of modern resumes effectively. Additionally, traditional parsing methods may struggle to interpret complex resume layouts, unconventional section headings, or non-standardized language usage, leading to errors and inaccuracies in the extracted information.
[0005] Furthermore, the reliance on outdated parsing techniques limits the scalability and adaptability of existing systems, particularly in industries with dynamic job requirements and evolving candidate profiles. Recruiters are faced with the challenge of manually verifying and correcting parsing errors, resulting in delays in the hiring process and increased administrative burden.
[0006] In light of these challenges, there is a growing demand for advanced resume parsing solutions that leverage cutting-edge technologies such as natural language processing (NLP), machine learning, and semantic analysis. These technologies enable the development of intelligent parsing systems capable of accurately interpreting and extracting information from diverse resume formats, including those with complex layouts, unconventional structures, and non-standardized language usage.
[0007] By harnessing the power of NLP and machine learning, modern resume parsing systems can automate the extraction and categorization of resume content with unprecedented accuracy and efficiency. These systems not only streamline the recruitment process by reducing manual effort and eliminating parsing errors but also provide recruiters with valuable insights into candidates' qualifications, experiences, and skills. Additionally, advanced parsing techniques enable the integration of parsed resume data with other HR systems, facilitating seamless candidate management and talent acquisition workflows.
[0008] Therefore, there is a need for an improved solution that utilizes advanced technologies to effectively address the limitations of existing solutions and significantly enhance the efficiency and efficacy of recruitment and human resources management processes.

OBJECTS OF THE PRESENT DISCLOSURE
[0009] Some of the objects of the present disclosure, which at least one embodiment herein satisfies, are listed herein below.
[0010] It is an object of the present disclosure to provide a system that revolutionizes resume data management by ensuring data accuracy and enhancing the reliability of extracted information.
[0011] It is an object of the present disclosure to provide a system that empowers users with the Named Entity Recognition (NER) annotator tool for precise data annotation, facilitating the identification and categorization of essential resume segments.
[0012] It is an object of the present disclosure to provide a system that enhances recruitment processes, job matching, and personal career insights by delivering a state-of-the-art resume parsing solution that saves time, improves accuracy, and maximizes the utility of resume data.
[0013] It is an object of the present disclosure to provide a system that simplifies the data collection process by employing advanced techniques, including user submissions, to create a diverse and extensive dataset covering various sectors and resume formats.
[0014] It is an object of the present disclosure to provide a system that utilizes cutting-edge deep learning models for accurate information extraction and mapping, ensuring the model's precision in recognizing key resume components.
[0015] It is an object of the present disclosure to provide a system that standardizes and enriches resume data through the extraction of text-based information from resumes and the application of semantic analysis, making it more valuable for subsequent analysis.
[0016] It is an object of the present disclosure to provide a system that offers comprehensive, accurate, and versatile solutions that streamline the entire resume processing pipeline, improving efficiency and productivity in recruitment and human resources management processes.

SUMMARY
[0017] Various aspects of present disclosure pertain to the field of human resources management. More specifically, it relates to a system and method for resume parsing and information extraction utilizing advanced techniques for efficient extraction and categorization of resume data. This system streamlines resume processing, enhances data accuracy, and facilitates efficient recruitment and human resources management processes.
[0018] An aspect of the present disclosure pertains to a resume parsing system that includes an input unit for receiving resumes from various sources and a controller with processors and memory. The system extracts text from each resume, identifies headings based on font size, and segments the text into predefined categories. It then extracts sentence-based headings, assigns labels to sentences, and annotates data using a Named Entity Recognition (NER) tool for training a sentence-level Natural Language Processing (NLP) model. Additionally, it extracts word-based headings and annotates segments for training a word-level NLP model. A training and testing unit evaluates the performance of these models on unseen data. The controller integrates outputs of both NLP models to extract information from resumes.
[0019] In an aspect, the NER tool for word-based headings utilizes machine learning techniques.
[0020] In an aspect, the output of the word-level NLP model aids in extracting key phrases for keyword-based search and analysis.
[0021] In an aspect, merging outputs of both NLP models generates comprehensive candidate profiles with structured and unstructured information. Furthermore, integrated outputs undergo clustering techniques to group resumes based on extracted similarities.
[0022] Another aspect of the present disclosure pertains to a method for parsing resumes that includes multiple steps including, receiving a set of resumes, extracting text from each resume, checking font size to identify headings, segmenting the text into predefined categories, and extracting sentence-based headings. These headings are merged to create a first dataset with labeled categories. The method further involves utilizing a Named Entity Recognition (NER) annotator tool to annotate the data for training a sentence-level Natural Language Processing (NLP) model.
[0023] Additionally, word-based headings are extracted and annotated for training a word-level NLP model. The performance of both NLP models is evaluated on unseen data using a training and testing unit. Finally, the outputs of the sentence-level and word-level NLP models are integrated to extract information from the received set of resumes.
[0024] In an aspect, the method may include converting the resumes into PDF format.
[0025] In an aspect, the NER annotator tool leverages machine learning techniques for annotating word-based headings. The output of the word-level NLP model aids in extracting key phrases for keyword-based search and analysis.
[0026] Furthermore, merging the outputs of both NLP models generates comprehensive candidate profiles with structured and unstructured information. Integrated outputs are processed using clustering techniques to group resumes based on similarities in the extracted information. Moreover, this method streamlines resume parsing, enhances data accuracy, and facilitates efficient recruitment processes.
[0027] Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.

BRIEF DESCRIPTION OF DRAWINGS
[0028] The accompanying drawings are included to provide a further understanding of the present disclosure, and are incorporated in, and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure, and together with the description, serve to explain the principles of the present disclosure.
[0029] FIG. 1 illustrates an exemplary network architecture of proposed system for resume parsing, in accordance with an embodiment of the present disclosure.
[0030] FIG. 2 illustrates an exemplary architecture of proposed system for resume parsing, in accordance with an embodiment of the present disclosure.
[0031] FIG. 3 illustrates an exemplary flow chart of data collection process of proposed system, in accordance with some embodiments of the present disclosure.
[0032] FIG. 4 illustrates an exemplary flowchart of data annotation of proposed system, in accordance with some embodiments of the present disclosure.
[0033] FIG. 5 illustrates an exemplary flowchart of internal working of training process, of proposed system, in accordance with some embodiments of the present disclosure.
[0034] FIG. 6 illustrates an exemplary view of a flow diagram of proposed method for parsing resumes, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION
[0035] The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[0036] References to “an embodiment”, “an exemplary embodiment”, “an example”, “for instance”, and so on, indicate that the embodiment(s) or example(s) may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element, or limitation. Furthermore, repeated use of the phrase “in an embodiment” does not necessarily refer to the same embodiment.
[0037] Embodiment of present disclosure relates to the field of human resources management. More specifically, it relates to a system and method for resume parsing and information extraction utilizing advanced techniques for efficient extraction and categorization of resume data. This system streamlines resume processing, enhances data accuracy, and facilitates efficient recruitment and human resources management processes.
[0038] An aspect of the present disclosure pertains to a resume parsing system that includes an input unit for receiving resumes from various sources and a controller with processors and memory. The system extracts text from each resume, identifies headings based on font size, and segments the text into predefined categories. It then extracts sentence-based headings, assigns labels to sentences, and annotates data using a Named Entity Recognition (NER) tool for training a sentence-level Natural Language Processing (NLP) model. Additionally, it extracts word-based headings and annotates segments for training a word-level NLP model. A training and testing unit evaluates the performance of these models on unseen data. The controller integrates outputs of both NLP models to extract information from resumes.
[0039] Additionally, the system can convert resumes into PDF format.
[0040] In an aspect, the NER tool for word-based headings utilizes machine learning techniques.
[0041] In an aspect, the output of the word-level NLP model aids in extracting key phrases for keyword-based search and analysis.
[0042] In an aspect, merging outputs of both NLP models generates comprehensive candidate profiles with structured and unstructured information. Furthermore, integrated outputs undergo clustering techniques to group resumes based on extracted similarities.
[0043] Another aspect of the present disclosure pertains to a method for parsing resumes that includes multiple steps including, receiving a set of resumes, extracting text from each resume, checking font size to identify headings, segmenting the text into predefined categories, and extracting sentence-based headings. These headings are merged to create a first dataset with labeled categories. The method further involves utilizing a Named Entity Recognition (NER) annotator tool to annotate the data for training a sentence-level Natural Language Processing (NLP) model.
[0044] Additionally, word-based headings are extracted and annotated for training a word-level NLP model. The performance of both NLP models is evaluated on unseen data using a training and testing unit. Finally, the outputs of the sentence-level and word-level NLP models are integrated to extract information from the received set of resumes.
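The integration step described above can be sketched in code. The merge policy below is purely illustrative (the disclosure does not specify one): word-level entities are grouped under the segment label that the sentence-level model assigned to their enclosing sentence. All field and label names are hypothetical.

```python
def merge_outputs(sentence_segments, word_entities):
    """Merge sentence-level segment labels with word-level entities into a
    single candidate profile. Hypothetical merge policy: each entity is
    filed under the segment of the sentence it occurred in."""
    profile = {}
    for seg_label, sentences in sentence_segments.items():
        profile[seg_label] = {"text": sentences, "entities": []}
    for seg_label, entity in word_entities:
        profile.setdefault(seg_label, {"text": [], "entities": []})
        profile[seg_label]["entities"].append(entity)
    return profile

# Illustrative outputs from the two models (not from the disclosure).
sentence_segments = {
    "work_experience": ["Data Scientist at Acme Corp, 2019-2022."],
    "skills": ["Proficient in Python and SQL."],
}
word_entities = [
    ("work_experience", ("JOB_TITLE", "Data Scientist")),
    ("work_experience", ("COMPANY", "Acme Corp")),
    ("skills", ("SKILL", "Python")),
]
profile = merge_outputs(sentence_segments, word_entities)
print(len(profile["work_experience"]["entities"]))  # 2
```

The resulting profile carries both the unstructured sentence text and the structured entities for each segment, matching the "structured and unstructured information" described above.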
[0045] In an aspect, NER annotator tool leverages machine learning techniques for annotating word-based headings. The output of the word-level NLP model aids in extracting key phrases for keyword-based search and analysis.
[0046] Furthermore, merging the outputs of both NLP models generates comprehensive candidate profiles with structured and unstructured information. Integrated outputs are processed using clustering techniques to group resumes based on similarities in the extracted information.
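The clustering step can be illustrated with a minimal sketch. The disclosure does not name a specific algorithm; greedy grouping by Jaccard similarity over extracted skill sets is used here purely for illustration, and the profile names and threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two skill sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_resumes(profiles, threshold=0.5):
    """Greedily group candidate profiles whose extracted skills overlap
    by at least `threshold` with a cluster's first member (a stand-in
    for the clustering techniques mentioned in the disclosure)."""
    clusters = []
    for name, skills in profiles:
        for cluster in clusters:
            if jaccard(skills, cluster[0][1]) >= threshold:
                cluster.append((name, skills))
                break
        else:
            clusters.append([(name, skills)])
    return clusters

profiles = [
    ("A", ["python", "sql", "ml"]),
    ("B", ["python", "sql", "dl"]),
    ("C", ["accounting", "excel"]),
]
clusters = group_resumes(profiles)
print([[name for name, _ in c] for c in clusters])  # [['A', 'B'], ['C']]
```

A production system would more likely embed whole profiles and apply k-means or hierarchical clustering, but the grouping principle is the same.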
[0047] The manner in which the proposed system works is described in further detail in conjunction with FIGs. 1 to 3. It may be noted that these figures are only illustrative and should not be construed to limit the scope of the subject matter in any manner.
[0048] FIG. 1 illustrates an exemplary network architecture of proposed system for resume parsing, in accordance with an embodiment of the present disclosure.
[0049] In an embodiment, referring to FIG. 1, a system 102 for resume parsing is disclosed. The system 102 is connected to a network 106, which is further connected to an input unit 104 to receive a set of resumes from candidates from a set of sources. The network 106 may include, but not be limited to, a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof.
[0050] In an embodiment, the input unit 104 can be a computing device that may communicate with the system 102 via a set of executable instructions residing on any operating system to submit the set of resumes. In an embodiment, the computing device 104 may include, but not be limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more of the above devices, such as a mobile phone, smartphone, Virtual Reality (VR) device, Augmented Reality (AR) device, laptop, general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device. It may be appreciated that the computing device 104 may not be restricted to the mentioned devices and various other devices may be used.
[0051] In an exemplary embodiment, the input unit 104, serves as a pivotal platform for facilitating user interaction with the system through a graphical user interface (GUI). As the initial point of contact for users, the input unit provides a user-friendly interface that allows individuals to input commands, queries, or data into the system. Through this interface, users can easily communicate their requirements, preferences, or instructions to the system, initiating various processes such as resume parsing, data analysis, or system configuration. By integrating a graphical user interface, the input unit enhances the user experience by presenting information in a visual and intuitive manner. Users can interact with the system using familiar graphical elements such as buttons, menus, and forms, making it easier to navigate and operate the system effectively. Additionally, the GUI provides feedback to users, such as status updates, prompts, or error messages, ensuring clear communication between the user and the system. Furthermore, the input unit facilitates bidirectional communication, allowing the system to provide responses, results, or feedback to users in real-time. This interactive capability enables users to monitor the progress of ongoing tasks, review output, and make informed decisions based on the information presented by the system through the GUI.
[0052] This input unit 104 acts as the initial point of interaction between the system and the external sources providing the resumes. It is configured to handle the reception of resumes from diverse channels or platforms, which could include online job portals, email submissions, or other application interfaces. This could be a physical or virtual entity responsible for collecting and aggregating resumes from multiple candidates. The resumes may come in various formats such as PDF, Word documents, or even structured data formats. By incorporating an input unit specifically for receiving resumes, the system ensures that it can efficiently gather candidate information from different channels or sources. This streamlines the subsequent processing and analysis stages within the system, enabling it to handle a diverse range of resume inputs from various candidates and sources.
[0053] In an embodiment, the sources to collect resumes can be online job boards, company websites, and social media platforms serving as common channels for resume submission. Additionally, recruitment agencies and headhunters often act as intermediaries, sourcing resumes from their network and presenting them to hiring companies. Moreover, referrals from existing employees and professional networking platforms contribute to the pool of resume sources. As technology evolves, newer avenues such as resume databases, applicant tracking systems (ATS), and online resume submission portals continue to emerge, offering recruiters and employers additional channels for accessing candidate resumes.
[0054] In an embodiment, the system 102 may be configured to utilise the pymupdf library to extract text from each resume, enabling the conversion of resume documents into machine-readable text format. Subsequently, the system 102 checks the font size of the extracted text to discern headings, identifying and labeling text with larger font sizes as headings. Following this, the extracted text is segmented into predefined categories to structure the information systematically. Moreover, sentence-based headings are extracted from each resume and consolidated into a first dataset, with each row representing an individual sentence. The system assigns labels to each sentence to denote its category, encoding these labels into integer values for computational efficiency.
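The font-size heading heuristic above can be sketched as follows. In pymupdf, `page.get_text("dict")` yields text spans carrying a `size` attribute; the snippet below assumes such spans have already been extracted and applies a hypothetical threshold relative to the median (body-text) size, since the disclosure does not fix a specific rule.

```python
def label_headings(spans, ratio=1.2):
    """Flag spans whose font size exceeds the body-text size by `ratio`
    as headings. The median span size is taken as the body-text size;
    both the median heuristic and the ratio are illustrative choices."""
    if not spans:
        return []
    sizes = sorted(s["size"] for s in spans)
    body_size = sizes[len(sizes) // 2]  # median approximates body text
    return [{**s, "is_heading": s["size"] >= body_size * ratio}
            for s in spans]

# Spans of the kind pymupdf's page.get_text("dict") would produce
# (hand-written here; values are illustrative).
spans = [
    {"text": "WORK EXPERIENCE", "size": 14.0},
    {"text": "Software engineer at Acme Corp.", "size": 10.0},
    {"text": "Built data pipelines in Python.", "size": 10.0},
    {"text": "EDUCATION", "size": 14.0},
    {"text": "B.Tech, Computer Science", "size": 10.0},
]
labelled = label_headings(spans)
headings = [s["text"] for s in labelled if s["is_heading"]]
print(headings)  # ['WORK EXPERIENCE', 'EDUCATION']
```

The labeled headings then delimit the segments into which the remaining text is grouped.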
[0055] Additionally, the system 102 employs a Spacy Roberta Named Entity Recognition (NER) annotator tool to annotate the stored data in the first dataset using JavaScript Object Notation (JSON) format, facilitating the training of a sentence-level Natural Language Processing (NLP) model. Furthermore, the system extracts word-based headings from each resume and annotates each segment using the NER annotator tool, merging the annotated segments into a second dataset in JSON format for training a word-level NLP model.
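The JSON annotation records mentioned above might look like the following. The entity labels, offsets, and sentence are illustrative only (not taken from the disclosure); the shape mirrors the common spaCy-style `(text, entities)` convention where entities are character-offset triples.

```python
import json

# One annotated sentence as a JSON record. Offsets are character
# positions into `text`; the labels are hypothetical segment tags.
text = "Worked as Data Scientist at Infosys from 2019 to 2022."
record = {
    "text": text,
    "entities": [
        [10, 24, "JOB_TITLE"],   # "Data Scientist"
        [28, 35, "COMPANY"],     # "Infosys"
    ],
}

# Sanity-check that the offsets slice out the intended surface strings.
for start, end, label in record["entities"]:
    print(label, "->", text[start:end])

json_line = json.dumps(record)  # one line of the training dataset
```

Files of such records constitute the first and second datasets used to train the sentence-level and word-level models.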
[0056] In an embodiment, a training and testing unit 112 is configured to evaluate the performance of both the trained sentence-level NLP model and the word-level NLP model on unseen data, ensuring the effectiveness and accuracy of the parsing process. These models have been trained using labeled data and are then evaluated on unseen data to determine their effectiveness and accuracy in parsing resumes. This evaluation ensures that both models can process and analyze new, unseen resumes with accuracy comparable to the performance observed during their training phase. By testing these models on previously unseen data, the system can verify their robustness and generalizability, which are crucial factors in ensuring reliable performance in real-world scenarios. The server 110 is communicatively coupled to the system by the network 106 and acts as a centralized storage location for the trained models.
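Evaluation on unseen data typically reduces to comparing predicted entity spans against gold annotations. The sketch below computes micro precision/recall/F1 over exact span matches; this is one common NER metric, not necessarily the disclosure's exact evaluation procedure, and the example spans are fabricated for illustration.

```python
def span_f1(gold, pred):
    """Micro precision/recall/F1 over exact entity spans.
    `gold` and `pred` are sets of (start, end, label) tuples."""
    tp = len(gold & pred)  # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(10, 24, "JOB_TITLE"), (28, 35, "COMPANY"), (41, 45, "YEAR")}
pred = {(10, 24, "JOB_TITLE"), (28, 35, "COMPANY"), (0, 6, "SKILL")}
p, r, f = span_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Held-out resumes never seen during training are scored this way to confirm the models generalize.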
[0057] In an embodiment, a server 120 is communicatively coupled to the system 102 through the network 106. The server 120 serves as a centralized storage repository for storing the trained models and associated data. This server acts as a centralized hub where all the trained models, annotated datasets, and other relevant information are stored in an organized and accessible manner. By storing the trained models and data on a centralized server, the system ensures easy access and retrieval of this information whenever needed. This centralization also facilitates collaboration and scalability, as multiple users or systems can access and utilize the stored data simultaneously. Additionally, having the server for storage enhances data security and integrity, as it allows for implementing robust access control mechanisms and backup procedures to safeguard the stored information against unauthorized access, loss, or corruption.
[0058] In an exemplary embodiment, avuaRParser (i.e. resume parser system) is a versatile resume parsing tool designed to cater to various formats of resumes, including single-column and two-column layouts. Developed using a customized Spacy NER RoBERTa based model, avuaRParser offers a user-friendly experience as an end-to-end web application, accessible across multiple platforms such as websites, Android apps, and iOS devices.
[0059] The system has been meticulously trained and tested on resumes from diverse sectors including Technology, Pharma, Finance, and Energy. With over 6800 manually annotated data points, avuaRParser continues to undergo refinement for even better performance.
[0060] avuaRParser parses the resume information into 19 different segments:
1. Skills: Knowledge of languages, tools, algorithms, etc.
2. Work Experience: Experience of employment, i.e., the job title, company name, location, roles and responsibilities, etc.
3. Personal Details: Mobile number, online profiles (LinkedIn, GitHub, etc.), address, nationality, etc.
4. Education: Educational background, i.e., 10th, 12th, Diploma, Bachelors, Masters, Doctorate
5. Internships: Experience of employment as an intern
6. Certifications: Professional certificates or training-related information
7. Extracurricular: Activities performed by the candidate that fall outside the normal curriculum of school, college, or university education
8. Achievements: Winning prizes, etc.
9. Projects: Personal projects done
10. Remaining: Any remaining information which could not be tagged into other tags
11. Research/Publications: Information about research papers, etc.
12. Conference/Seminar: Any conferences or seminars organised or attended
13. Areas of Interest: Focus or curiosity in a particular subject area
14. Languages: Proficiency in languages known, i.e., English, Hindi, Spanish, etc.
15. Patents: Any patents registered
16. Profile Summary: Brief description of the candidate
17. Hobbies: What the candidate prefers to do in their spare time
18. References
19. Thesis
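A parsed resume mapped onto these segments could be represented as a JSON-like record. The field names and values below are hypothetical (only a few of the 19 segments are shown), purely to illustrate the output shape.

```python
import json

# Hypothetical shape of one parsed-resume record; keys follow the
# segment names listed above (subset shown, values fabricated).
parsed = {
    "personal_details": {"mobile": "+91-9000000000",
                         "linkedin": "linkedin.com/in/example"},
    "skills": ["Python", "SQL", "Machine Learning"],
    "work_experience": [
        {"job_title": "Data Scientist", "company": "Acme Corp",
         "years": "2019-2022"}
    ],
    "education": [{"degree": "B.Tech", "field": "Computer Science"}],
    "remaining": [],
}
print(sorted(parsed.keys()))
# ['education', 'personal_details', 'remaining', 'skills', 'work_experience']
serialized = json.dumps(parsed, indent=2)  # how a record might be stored
```

Each segment that the models fail to populate simply stays empty (or falls into "Remaining"), so downstream consumers can rely on a stable schema.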
[0061] Further, avuaRParser employs segmentation information extraction for both work experience and education, alleviating the tedious process of autofilling job and educational information. With a remarkable F1 score of 86% to date, avuaRParser also integrates LDA analysis after NER to verify the accuracy of extracted segments, ensuring precision and reliability in resume parsing.
[0062] Although FIG. 1 shows exemplary components of the network architecture 100, in other embodiments, the network architecture 100 may include fewer components, different components, differently arranged components, or additional functional components than depicted in FIG. 1. Additionally, or alternatively, one or more components of the network architecture 100 may perform functions described as being performed by one or more other components of the network architecture 100.
[0063] FIG. 2 illustrates an exemplary architecture of proposed system for resume parsing, in accordance with an embodiment of the present disclosure.
[0064] In an aspect, referring to FIG. 2, a system 102 may comprise one or more controller(s) 202 (interchangeably referred to as controller 202, hereinafter). The controller 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, edge or fog microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the controller 202 may be configured to fetch and execute computer-readable instructions stored in a memory 204 of the system 102. The memory 204 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 204 may comprise any non-transitory storage device including, for example, volatile memory such as Random Access Memory (RAM), or non-volatile memory such as Erasable Programmable Read-Only Memory (EPROM), flash memory, and the like.
[0065] The system 102 may include an interface(s) 206. The interface(s) 206 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 206 may facilitate communication to/from the system 102. The interface(s) 206 may also provide a communication pathway for one or more components of the system 102. Examples of such components include, but are not limited to, processing unit/engine(s) 208 and a database 210.
[0066] In an embodiment, the processing unit/engine(s) 208 may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) 208. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) 208 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) 208 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) 208. In such examples, the system 102 may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system 102 and the processing resource. In other examples, the processing engine(s) 208 may be implemented by electronic circuitry.
[0067] In an embodiment, the database 210 may include data that may be either stored or generated as a result of functionalities implemented by any of the components of the controller 202 or the processing engine 208. In an embodiment, the database 210 may be separate from the system 102.
[0068] In an exemplary embodiment, the processing engine 208 may include one or more engines selected from any of a resume collection module 212, a text extraction module 214, a segmentation module 216, an annotation module 218, a training and testing module 220, an information extraction module 222, and, other module(s) 224. The other module(s) 224 having functions that may include but are not limited to testing, storage, and peripheral functions, such as wireless communication unit for remote operation, audio unit for alerts and the like.
[0069] In an embodiment, the resume collection module 212 may be configured to receive a set of resumes from candidates from a set of sources from an input unit 104, as shown in FIG. 3. The input unit 104 acts as the initial point for receiving resumes from candidates through various sources or platforms, accommodating resumes in different formats. The received data, such as CVs, is consolidated and stored on a server (110). The system loads PDF or DOC files individually and extracts text, converting it into a single format, typically .txt format, before storing the data. This process ensures uniformity and ease of access to the extracted text data for further processing and analysis.
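The consolidation step above, loading each PDF or DOC file and writing a uniform .txt copy, can be sketched as follows. The format-specific extractor (e.g. a pymupdf wrapper for PDFs) is injected as a callable so the pipeline itself stays format-agnostic; the file names and stub extractor here are illustrative only.

```python
import tempfile
from pathlib import Path

def consolidate(files, extract, out_dir):
    """Convert each incoming resume to a uniform .txt file on the server.
    `extract` is a format-specific text extractor (e.g. wrapping pymupdf
    for PDFs or a DOC reader); injected here so this sketch stays
    runnable without third-party libraries."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for f in files:
        f = Path(f)
        text = extract(f)                       # format-specific step
        target = out_dir / (f.stem + ".txt")    # uniform output format
        target.write_text(text, encoding="utf-8")
        written.append(target.name)
    return written

# Demo with a stub extractor standing in for the real PDF/DOC readers.
with tempfile.TemporaryDirectory() as tmp:
    names = consolidate(["cv_one.pdf", "cv_two.docx"],
                        extract=lambda f: f"text of {f.name}",
                        out_dir=tmp)
print(names)  # ['cv_one.txt', 'cv_two.txt']
```

Storing everything as .txt gives the downstream segmentation and annotation modules one predictable input format regardless of how the resume arrived.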
[0070] Upon receiving these resumes, the system undergoes a preprocessing step where the resumes are converted into PDF format. This conversion ensures uniformity in the format, making it easier for subsequent processing steps.
[0071] In an exemplary embodiment, consider a recruitment manager utilizing the proposed system to streamline the hiring process. Through the system's user interface, accessible via an input unit like an online dashboard, the manager uploads resumes received from multiple channels—such as online job portals, email submissions, and internal application platforms. Once uploaded, the system automatically converts these resumes into a standardized PDF format, ensuring consistency for further processing.
[0072] In this embodiment, the text extraction module 214 may be configured to process resumes effectively. Firstly, it utilizes the pymupdf library to extract text from each resume. pymupdf is a Python library commonly used for reading and extracting text from PDF files. This library allows the text extraction module 214 to access the textual content within the resumes and prepare it for further analysis. Once the text is extracted, the text extraction module 214 proceeds to check the font size of each extracted text. This is achieved through font size detection algorithms that analyze the characteristics of the text, such as its size relative to the page dimensions. Text with a larger font size is identified as a heading. This step ensures accurate identification of headings within the resumes, which is crucial for structuring the content effectively. After identifying the headings, the text extraction module 214 extracts sentence-based headings from each resume. This process includes parsing the text to identify sentences that function as headings, typically by analyzing punctuation and grammatical structure. These sentence-based headings are then merged to create a first dataset in a predefined file format, such as CSV or JSON. In this dataset, each row represents an individual sentence, providing a structured format for further analysis.
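The font-size heuristic described above can be sketched as follows. This is a minimal illustration only: the `find_headings` function and the 1.2x threshold are hypothetical choices, and in practice the (font size, text) span pairs would come from PyMuPDF's `page.get_text("dict")` output rather than being hard-coded.

```python
def find_headings(spans, heading_ratio=1.2):
    """Classify text spans as headings or body text.

    spans: list of (font_size, text) pairs, such as the span records
    PyMuPDF returns from page.get_text("dict"). Any span whose font
    size meets heading_ratio times the median size is treated as a
    heading, on the assumption that body text dominates the page.
    """
    sizes = sorted(size for size, _ in spans)
    median = sizes[len(sizes) // 2]
    headings = [t for s, t in spans if s >= median * heading_ratio]
    body = [t for s, t in spans if s < median * heading_ratio]
    return headings, body

# Illustrative span data mimicking an extracted resume page.
spans = [
    (16.0, "Work Experience"),
    (11.0, "Senior Software Engineer at ABC Company"),
    (11.0, "Developed software applications using Java and Python."),
    (16.0, "Education"),
    (11.0, "Bachelor's Degree in Computer Science"),
]
headings, body = find_headings(spans)
# headings → ["Work Experience", "Education"]
```

The median-size heuristic is deliberately simple; a production system might also consider bold weight, capitalization, or position on the page.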
[0073] Similarly, the text extraction module 214 extracts word-based headings from each resume using techniques like tokenization and pattern matching. Word-based headings are individual words or phrases that serve as headings within the resume content. These word-based headings are also merged to create a second dataset in the same predefined file format as the first dataset. This ensures that both sentence-based and word-based headings are captured and structured for analysis.
[0074] Additionally, the text extraction module 214 assigns labels to each sentence indicating the category to which it belongs. These categories could represent different sections or topics within the resume, such as work experience, education, or skills. The labels are then encoded into integer values, which facilitates efficient processing and analysis of the data.
[0075] In an exemplary implementation, extracted text from resumes can be categorized into three predefined sections: "Work Experience," "Education," and "Skills." Now, we need to assign labels to each segment indicating which category it belongs to. For each segment of text, the system assigns a label based on its category. For instance:
[0076] "Work Experience" segments are labeled as 0
[0077] "Education" segments are labeled as 1
[0078] "Skills" segments are labeled as 2
[0079] Encoding into Integer Values: Once the labels are assigned, they are encoded into integer values. In this example:
[0080] "Work Experience" label (0) is encoded as 0
[0081] "Education" label (1) is encoded as 1
[0082] "Skills" label (2) is encoded as 2
[0083] Example Encoding: Let's say we have the following segments of text:
[0084] "Work Experience: Senior Software Engineer at ABC Company"
[0085] "Education: Bachelor's Degree in Computer Science"
[0086] "Skills: Proficient in Python, Java, and SQL"
[0087] After assigning labels and encoding them into integer values, the segments would look like this:
[0088] "Work Experience: Senior Software Engineer at ABC Company" (label: 0, encoded value: 0)
[0089] "Education: Bachelor's Degree in Computer Science" (label: 1, encoded value: 1)
[0090] "Skills: Proficient in Python, Java, and SQL" (label: 2, encoded value: 2)
[0091] Using integer encoding simplifies data representation and computation within the system. It reduces memory usage and processing time, as integer values are more compact and easier to manipulate than text labels. Additionally, integer encoding enables efficient algorithms for data analysis, such as machine learning models, which often require numerical input. By encoding labels into integer values, the system ensures streamlined processing and analysis of resume data, facilitating various downstream tasks such as classification, clustering, and predictive modeling.
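The label-assignment and integer-encoding steps above reduce to a small mapping. In this sketch the category names and ordering follow the example in the preceding paragraphs, while the `encode_segments` helper name is illustrative.

```python
# Hypothetical category set; the real system derives its categories
# from the headings detected in each resume.
CATEGORIES = ["Work Experience", "Education", "Skills"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}

def encode_segments(segments):
    """Replace each segment's category name with its integer label."""
    return [(text, LABEL_TO_ID[category]) for category, text in segments]

segments = [
    ("Work Experience", "Senior Software Engineer at ABC Company"),
    ("Education", "Bachelor's Degree in Computer Science"),
    ("Skills", "Proficient in Python, Java, and SQL"),
]
encoded = encode_segments(segments)
# encoded → [("Senior Software Engineer at ABC Company", 0),
#            ("Bachelor's Degree in Computer Science", 1),
#            ("Proficient in Python, Java, and SQL", 2)]
```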
[0092] In an embodiment, the segmentation module 216 may be configured to segment the extracted text into predefined categories. For example, once text has been extracted from several resumes, it needs to be categorized into different sections such as "Work Experience," "Education," "Skills," and "Certifications." The segmentation module analyzes the extracted text and identifies patterns or keywords that indicate different sections within the resumes. For instance, it may look for phrases like "Work Experience," "Education Background," or "Technical Skills" to determine the boundaries of each category. Based on the predefined categories set by the system, the segmentation module assigns each segment of text to the appropriate category. For example, if a segment contains information about the candidate's work history, it will be categorized under "Work Experience".
[0093] Once the segmentation is complete, the segmentation module 216 organizes the segmented text into structured datasets. Each dataset corresponds to a predefined category, and it contains the text segments assigned to that category. For instance, all segments related to work experience will be grouped together in a dataset labeled "Work Experience".
[0094] In an exemplary implementation, a resume may include the following segments of text:
[0095] "Work Experience: Senior Software Engineer at ABC Company"
[0096] "Education: Bachelor's Degree in Computer Science"
[0097] "Skills: Proficient in Python, Java, and SQL"
[0098] After segmentation, the text will be organized into predefined categories:
[0099] Work Experience: "Senior Software Engineer at ABC Company"
[00100] Education: "Bachelor's Degree in Computer Science"
[00101] Skills: "Proficient in Python, Java, and SQL"
[00102] By segmenting the extracted text into predefined categories, the segmentation module ensures that the information from resumes is organized systematically, making it easier for further processing and analysis within the system.
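One minimal way to realize the keyword-based segmentation described above is sketched below. The keyword lists, the colon-prefix convention, and the `segment` helper are illustrative assumptions, not the system's actual matching rules.

```python
# Illustrative keyword lists; the system's actual phrase patterns may differ.
SECTION_KEYWORDS = {
    "Work Experience": ["work experience", "employment history"],
    "Education": ["education", "education background"],
    "Skills": ["skills", "technical skills"],
}

def segment(lines):
    """Assign each line to the section opened by the most recent heading."""
    sections = {name: [] for name in SECTION_KEYWORDS}
    current = None
    for line in lines:
        head, _, rest = line.partition(":")
        match = next((name for name, kws in SECTION_KEYWORDS.items()
                      if head.strip().lower() in kws), None)
        if match:
            current = match
            if rest.strip():
                sections[current].append(rest.strip())
        elif current:
            sections[current].append(line.strip())
    return sections

sections = segment([
    "Work Experience: Senior Software Engineer at ABC Company",
    "Education: Bachelor's Degree in Computer Science",
    "Skills: Proficient in Python, Java, and SQL",
])
# sections["Education"] → ["Bachelor's Degree in Computer Science"]
```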
[00103] In an embodiment, the annotation module 218 may be configured to utilize a Named Entity Recognition (NER) annotator tool, i.e., the Tecoholic tool, to annotate stored data in the first dataset, which is typically in JavaScript Object Notation (JSON) format, as shown in FIG. 4. The ".txt" files are received one at a time and then loaded into the Tecoholic tool. In this tool, tags are created for annotation purposes, likely to categorize or label specific sections of the text. Annotations are made across 36 different tags, presumably covering various aspects or categories of the text. After annotating the data with these tags, it is exported in JSON format, which is commonly used for structured data interchange. Finally, the annotated data is stored, likely for further processing, analysis, or integration into a larger system.
[00104] In an example, each tag represents a specific aspect, category, or attribute of the text being annotated. For example, in the context of resume parsing or text analysis, these tags could represent different types of information such as job titles, skills, education, experience levels, and so on. By applying these tags to the text, it becomes easier to categorize and extract relevant information from the documents. These tags help structure the data and enable more efficient processing and analysis downstream in the system.
[00105] This annotation process is required for training a sentence-level Natural Language Processing (NLP) model. NER is a technique used in NLP to identify and classify named entities (such as names of persons, organizations, locations, etc.) within a text corpus.
[00106] In an exemplary embodiment, a resume segment includes: "Worked as a Software Engineer at XYZ Corp in San Francisco." Using NER, the annotator tool identifies and annotates entities such as "Software Engineer" (as a job title) and "XYZ Corp" (as the employer's name), along with "San Francisco" (as a location). These annotations provide valuable information for training the sentence-level NLP model, enabling it to recognize and understand different entities within resume texts.
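The structure such an annotation produces can be illustrated with a small helper that converts surface-string annotations into character-offset spans, the shape most NER training formats use. The `annotate` helper and label names here are hypothetical; annotator tools such as Tecoholic export a comparable JSON structure.

```python
def annotate(text, entities):
    """Convert (surface string, label) pairs into character-offset
    entity spans, a structure similar to common NER training formats."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start != -1:
            spans.append((start, start + len(surface), label))
    return {"text": text, "entities": spans}

record = annotate(
    "Worked as a Software Engineer at XYZ Corp in San Francisco.",
    [("Software Engineer", "JOB_TITLE"),
     ("XYZ Corp", "ORG"),
     ("San Francisco", "LOC")],
)
# record["entities"] holds (start, end, label) spans into record["text"].
```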
[00107] Furthermore, the annotation module 218 also annotates each segment using the NER annotator tool and merges the annotated segments into the second dataset, also in JSON format, for training a word-level NLP model. This word-level NLP model utilizes machine learning techniques, including NER, to detect and label relevant segments within the resume data. By applying machine learning algorithms, the NER annotator tool can accurately identify and classify various elements within the resumes, such as job titles, skills, and qualifications.
[00108] In an exemplary embodiment, a resume segment includes: "Proficient in Python, Java, and SQL." The NER annotator tool identifies and labels "Python," "Java," and "SQL" as programming languages. These labeled segments are then merged into the second dataset, which is used to train the word-level NLP model. This model leverages machine learning techniques to extract key phrases from resumes, facilitating keyword-based search and analysis of profiles of the candidates.
[00109] Furthermore, outputs of both the sentence-level and word-level NLP models are merged to generate comprehensive profiles of the candidates. This integration combines structured and unstructured information extracted from the resumes, providing a holistic view of the candidate's skills, experiences, and qualifications.
[00110] In an exemplary embodiment, the sentence-level NLP model identifies job roles and experiences, while the word-level NLP model extracts specific skills and competencies. By merging these outputs, the system creates comprehensive profiles that include structured information (e.g., job titles, companies) and unstructured information (e.g., skills, responsibilities), enabling recruiters to gain deeper insights into candidates' profiles.
[00111] Further, trimming entity annotations is performed to maintain data accuracy and precision. This refines the structured information obtained during the resume data annotation process. The primary objective of this embodiment is to ensure that the identified entities, such as skills, work experiences, and educational qualifications, are precisely delineated. It achieves this by meticulously removing any extraneous leading and trailing white spaces from the entity spans. This precision is essential as it prepares the data for the subsequent phase of training the Parser Model.
[00112] Consider an example that includes the following skills section: Skills:   Java, Python, Data Analysis, Artificial intelligence. In this example, the entity annotation for the "Skills" section would include the entity span "Java, Python, Data Analysis, Artificial intelligence", but with leading spaces before "Java" and trailing spaces after "Artificial intelligence". These spaces, although seemingly innocuous, can introduce inaccuracies when the data is processed further. Trimming therefore involves a process that automatically removes any unnecessary leading or trailing spaces from the entity spans. In this case, the trimmed entity span would be "Java, Python, Data Analysis, Artificial intelligence", devoid of extraneous spaces. The significance of this embodiment becomes evident when we consider the downstream applications of the structured data. The accuracy and precision of entity spans are paramount, especially when the data is used to train artificial intelligence models, natural language processing algorithms, or other data-driven technologies. Here are several exemplary key aspects and embodiments of the entity annotation trimming process:
[00113] Whitespace Removal: The core function of this embodiment is to eliminate unnecessary whitespace characters from entity spans. These characters may result from variations in resume formatting or text extraction processes.
[00114] Preserving Entity Content: While trimming removes extraneous spaces, it ensures that the content of the entities remains intact. In the example mentioned earlier, "Java" and "Artificial intelligence" are preserved as individual skills.
[00115] Consistency Across Resumes: Entity annotation trimming maintains consistency across different resumes within the dataset. Regardless of how skills or other entities are formatted in various resumes, the trimming process ensures a standardized representation.
[00116] Data Integrity: Ensuring the accuracy and integrity of the data is crucial for downstream applications. Trimming entities helps prevent inaccuracies that could arise from inconsistent formatting.
[00117] Preparation for Parser Model: The trimmed and standardized data is well-prepared for the subsequent phase of training the Parser Model. A clean and precise dataset is essential for the model to learn and generalize effectively.
[00118] Automated Trimming: In practical implementations, the entity annotation trimming process is often automated through scripts or software components. These embodiments are designed to efficiently process large volumes of data.
[00119] Quality Control: Implementing quality control checks is another embodiment. This involves verifying that the trimming process has been applied consistently and accurately across all entities in the dataset.
[00120] Consider a scenario where a company is using the system to analyze a diverse pool of candidate resumes. These resumes may vary in formatting and structure, and the skills sections might have inconsistent whitespace usage. By applying entity annotation trimming, the HR team can ensure that all skills are consistently represented, regardless of how they appear in different resumes.
[00121] Moreover, the impact of entity annotation trimming extends to downstream processes, such as skills matching or skills-based candidate ranking. When the data is used to match candidate skills with job requirements or rank candidates based on their qualifications, precision is paramount. Inaccuracies introduced by extraneous spaces could lead to incorrect matching or ranking outcomes.
[00122] Beyond skills, entity annotation trimming applies to all annotated entities within the resume, including but not limited to work experiences, educational qualifications, certifications, and more. Each entity's precision is crucial in creating a reliable and effective dataset for various HR and recruitment tasks.
[00123] The entity annotation trimming process represents a critical embodiment in resume parser's workflow, focused on maintaining data accuracy and precision. It ensures that entity spans are free from extraneous leading and trailing spaces, preparing the data for the subsequent phase of training the Parser Model. This meticulous attention to detail enhances the quality and reliability of the structured data, enabling accurate and effective HR and recruitment processes while ensuring data integrity across diverse resumes.
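In code, the trimming operation described above reduces to shrinking each entity's character offsets past any leading or trailing whitespace, leaving the content itself untouched. A minimal sketch follows; the `trim_entity` name and the `SKILLS` label are illustrative.

```python
def trim_entity(text, start, end, label):
    """Shrink (start, end) so the entity span excludes any leading or
    trailing whitespace while leaving the entity content intact."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end, label

text = "Skills:   Java, Python, Data Analysis, Artificial intelligence  "
# Hypothetical raw annotation covering everything after "Skills:",
# including the stray spaces introduced during extraction.
start, end, label = trim_entity(text, 7, len(text), "SKILLS")
# text[start:end] → "Java, Python, Data Analysis, Artificial intelligence"
```

Because only the offsets move, the trimmed span stays consistent with the original document text, which matters when the spans later feed a model that indexes into that text.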
[00124] In an embodiment, the training and testing module 220 may be configured to evaluate the performance of the trained sentence-level NLP model and the word-level NLP model on received unseen data by a training and testing unit 112. As shown in FIG. 5, an exemplary training process within the system is a comprehensive and iterative procedure aimed at enhancing the model's precision in recognizing and extracting critical segments from resumes. This multifaceted training process involves several key steps to ensure the model's proficiency. It starts with the preparation of input data, which includes annotated training and testing data formatted in DocBin format. This data serves as the foundational dataset for training and evaluating the model's performance. Following data preparation, the process entails loading the pre-trained weights of the base model, representing the initial knowledge and language patterns acquired by the model from a vast corpus of text data.
[00125] Once the pre-trained model is in place, it is employed to generate embeddings for the input data. These embeddings are vector representations that encapsulate semantic and contextual information from the text, playing a pivotal role in subsequent Named Entity Recognition (NER) tasks. The embeddings are then passed to the NER component, which focuses on identifying and categorizing specific segments or entities within the resume data, such as skills, work experience, education, and more. Leveraging contextual information from the embeddings, the NER component makes accurate predictions. The next step involves comparing the entity predictions generated by the NER component with the ground truth labels from the annotated training data.
[00126] This comparison is integral to calculating a loss function, which quantifies the dissimilarity between the predicted and actual entity labels, with the objective of minimizing this loss function during training. To optimize model parameters efficiently, the training process is typically executed in batches, with each batch involving processing a subset of the data, allowing for better parameter updates through techniques like backpropagation. Ensuring model progress and preventing overfitting are achieved through periodic evaluations, with the model undergoing evaluation using the test data after a set number of steps, typically every 200 steps as configured.
[00127] During evaluation, the entity predictions are compared to the ground truth labels in the test data to compute evaluation metrics, with one widely used metric being the F1 score, which strikes a balance between precision and recall, offering a comprehensive measure of the model's performance in accurately identifying entities of interest in resumes. The training process is inherently iterative, encompassing multiple epochs or training cycles, with the model fine-tuning its parameters in each cycle to align more effectively with the training data. This iterative approach allows the model to gradually enhance its competence in extracting information from resumes.
[00128] During this embodiment, the model's pre-trained parameters are adjusted to align with the intricacies of resume data, fine-tuning its understanding of named entities in resumes. The embodiment of annotated data utilization involves the integration of the meticulously annotated resumes gathered earlier in the workflow, with this data serving as the training dataset, enabling the model to learn and adapt based on real-world examples. Fine-tuning also includes hyperparameter tuning, with an embodiment involving optimizing hyperparameters such as learning rates, batch sizes, or model architectures to achieve the best performance for resume parsing. In scenarios where the resume parser is designed for specific industries or roles, an embodiment could involve domain-specific fine-tuning, such as emphasizing recognizing technical skills relevant to the IT field for IT job applications.
[00129] To boost accuracy further, an embodiment might involve the creation of ensemble models, with these models combining the predictions of multiple fine-tuned NER models, each trained on different subsets of annotated data, often resulting in enhanced performance. In dynamic environments where new resumes and entity variations are frequent, an embodiment could focus on incremental training, periodically retraining the model with newly annotated data to stay up-to-date and adapt to evolving resume structures.
[00130] Consider a hiring platform that utilizes the proposed resume parser. This platform caters to a diverse range of industries, from technology and finance to healthcare and engineering, with job applicants submitting resumes in various formats, each with unique structures and layouts. In this scenario, model training becomes pivotal, with the platform leveraging fine-tuning to adapt the NER model to the specific requirements of different industries, such as recognizing programming languages, software tools, and technical certifications for IT job positions, and medical qualifications, certifications, and clinical experience for healthcare resumes.
[00131] Furthermore, the platform employs transfer learning as a key embodiment, with the model initially learning from a vast corpus of general text, gaining a fundamental understanding of language, and then adapting to the nuances of resume parsing, such as differentiating between general mentions of skills and specific skills listed in a resume's "Skills" section. Hyperparameter tuning is yet another embodiment contributing to the model's effectiveness, with the platform experimenting with different hyperparameters, such as the learning rate or the number of layers in the model architecture, to fine-tune the model for optimal performance in identifying named entities within resumes. Model training elevates the NER model from a generic entity recognizer to a specialized resume analysis tool capable of pinpointing crucial segments within job applicants' resumes with remarkable precision. The embodiment of transfer learning, along with domain-specific fine-tuning and hyperparameter optimization, ensures that the model adapts to diverse industries and evolving resume structures, making it a powerful asset for efficient and accurate HR and recruitment processes.
[00132] The model training phase is a transformative embodiment within the resume parser's workflow, where the NER model evolves to become a specialized resume analysis engine. Through fine-tuning and transfer learning, the model gains a deep understanding of resume-specific named entities, enhancing its accuracy and effectiveness in HR and recruitment processes. The embodiment represents the convergence of state-of-the-art deep learning and NLP techniques with real-world resume parsing needs, ensuring that the model can seamlessly navigate the complexities of job application documents.
[00133] Further, the training and testing module 220 assesses the effectiveness and accuracy of both the trained sentence-level Natural Language Processing (NLP) model and the word-level NLP model when presented with unseen data. This evaluation process is required to verify the models' performance and ensure their reliability in real-world applications.
[00134] In an exemplary embodiment, the system has been trained using a dataset of resumes to develop both the sentence-level and word-level NLP models. The training and testing module then uses a separate set of unseen resumes, which were not part of the training data, to evaluate how well these models perform when presented with new information. For the sentence-level NLP model, the evaluation involves analyzing its ability to accurately understand and interpret entire sentences within the resumes. This includes tasks such as identifying job roles, extracting work experiences, and categorizing skills.
[00135] Similarly, for the word-level NLP model, the evaluation focuses on assessing its proficiency in recognizing and extracting specific keywords, phrases, or entities from the resumes. This could include identifying technical skills, educational qualifications, or certifications mentioned within the text.
[00136] Further, the training and testing module 220 compares the output generated by both models against the ground truth or expected results to determine their performance metrics. These metrics may include accuracy, precision, recall, and F1-score, among others. By evaluating the performance of both the sentence-level and word-level NLP models on unseen data, the training and testing module provides valuable insights into the models' strengths, weaknesses, and effectiveness. This allows for iterative improvements to the models and ensures that they are optimized for accurate resume parsing and information extraction in real-world scenarios.
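The comparison against ground truth described above can be sketched as strict span-level scoring: an entity counts as correct only if its offsets and label both match. The entity tuples below are illustrative.

```python
def entity_scores(predicted, gold):
    """Strict span-level precision, recall, and F1.

    predicted, gold: sets of (start, end, label) tuples; a prediction
    is a true positive only when offsets and label match exactly.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 17, "JOB_TITLE"), (21, 29, "ORG"), (33, 46, "LOC")}
pred = {(0, 17, "JOB_TITLE"), (21, 29, "ORG"), (50, 55, "SKILL")}
p, r, f1 = entity_scores(pred, gold)
# Two of three predictions are correct, so p = r = f1 = 2/3.
```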
[00137] The information extraction module 222 may be configured to integrate the outputs of both the sentence-level Natural Language Processing (NLP) model and the word-level NLP model to extract comprehensive information from the received resumes. By integrating the outputs of these two models, the system can leverage their respective strengths to enhance the extraction process.
[00138] In an exemplary embodiment, the sentence-level NLP model excels at understanding the context and structure of the resumes, while the word-level NLP model is more adept at identifying specific keywords and entities within the text. The information extraction module utilizes the combined output of both models to create a more holistic representation of the resume content.
[00139] For instance, the sentence-level NLP model may identify broader themes such as work experience, education history, and skills, while the word-level NLP model extracts specific details like programming languages, project names, or job titles mentioned within the resumes.
[00140] Once the outputs of both models are integrated, the information extraction module further processes the data using clustering techniques. These techniques group together resumes that exhibit similarities in the extracted information, allowing for efficient organization and categorization of the dataset.
[00141] For example, resumes with similar job roles, skill sets, or educational backgrounds may be clustered together based on the extracted information. This clustering process enables recruiters or HR professionals to quickly identify relevant candidates or patterns within the resume dataset, thereby streamlining the recruitment and selection process.
[00142] By integrating the outputs of the sentence-level and word-level NLP models and applying clustering techniques, the information extraction module facilitates the extraction of valuable insights from the resumes, leading to more informed decision-making in recruitment and human resources management.
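As one illustrative (not prescriptive) clustering approach, resumes represented as sets of extracted tokens can be grouped greedily by Jaccard similarity. The 0.5 threshold and single-pass strategy are simplifying assumptions; a production system might instead cluster embedding vectors with k-means or hierarchical methods.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_resumes(profiles, threshold=0.5):
    """Greedy single-pass clustering: each profile (a set of extracted
    skill/role tokens) joins the first cluster whose seed profile is
    at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []  # list of (seed_tokens, member_indices)
    for idx, tokens in enumerate(profiles):
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((tokens, [idx]))
    return [members for _, members in clusters]

profiles = [
    {"python", "java", "sql"},
    {"python", "sql", "django"},
    {"nursing", "clinical care"},
]
groups = cluster_resumes(profiles)
# groups → [[0, 1], [2]]: the two software profiles cluster together.
```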
[00143] In an exemplary implementation, in the recruitment process of XYZ Corp, a resume parsing system is implemented to handle the influx of resumes from various sources. Resumes are received from job portals, email submissions, and other platforms, ensuring that all candidate submissions are captured and processed uniformly. Upon receipt, the resumes are automatically converted into PDF format to maintain consistency for further processing. The system utilizes the pymupdf library to extract text from each resume, enabling the subsequent analysis of the content. Font size analysis is performed to identify headings such as "Work Experience," "Education," and "Skills," allowing for the segmentation of the extracted text into predefined categories based on these headings.
[00144] Once segmented, labels indicating the category of each sentence are assigned and encoded into integer values, facilitating efficient processing and analysis of the data. Additionally, a Named Entity Recognition (NER) annotator tool is employed to annotate the data in JSON format, aiding in the training of a sentence-level Natural Language Processing (NLP) model. Word-based headings, such as specific skills or technologies mentioned in the resumes, are also extracted and annotated using the NER annotator tool. These annotated segments are merged into a dataset for training a word-level NLP model, utilizing machine learning techniques to detect and label relevant segments effectively.
[00145] In an exemplary embodiment, all annotated data files are merged into a single JSON (JavaScript Object Notation) file. The resulting consolidated JSON file serves as a centralized repository containing structured information about various resume segments. This consolidation not only simplifies access to the data but also streamlines its manipulation, making it highly conducive for subsequent deep learning operations. Each annotated resume undergoes a meticulous process where specific segments or entities are identified and categorized. These segments encompass a wide range of information, including but not limited to skills, work experiences, education details, certifications, and more.
[00146] For instance, imagine a scenario where multiple annotated resumes are utilized. One of these resumes might have the following structured data:
{
    "Skills": ["Java", "Python", "Data Analysis"],
    "Work Experience": [
        {
            "Job Title": "Software Developer",
            "Company": "TechCorp",
            "Dates": "Jan 2018 - Present",
            "Description": "Developed software applications using Java and Python."
        },
        {
            "Job Title": "Data Analyst",
            "Company": "DataTech",
            "Dates": "Jun 2015 - Dec 2017",
            "Description": "Performed data analysis and generated insights."
        }
    ],
    "Education": [
        {
            "Degree": "Bachelor of Science in Computer Science",
            "Institution": "Tech University",
            "Graduation Date": "May 2015"
        }
    ]
}
[00147] Another annotated resume may have its structured data represented similarly. In this embodiment, the data merging process involves combining the structured data from all annotated resumes into a single JSON file. The resulting consolidated JSON file provides a comprehensive view of the dataset, with each resume's information categorized under specific segments.
[00148] This consolidated JSON file serves as a centralized repository that encapsulates the structured information obtained from all annotated resumes. It streamlines access to this information and simplifies its manipulation, which is essential for subsequent deep-learning operations. These embodiments can be designed to handle diverse data formats and ensure that the resulting JSON file maintains the structured hierarchy of resume segments.
[00149] For example, imagine a company's HR department using the resume parser system to analyze a vast pool of candidate resumes. After annotation, the HR team needs a unified view of candidate skills, experiences, and qualifications to make informed hiring decisions. The data merging embodiment simplifies this process by creating a single JSON file that aggregates all relevant information from the annotated resumes. Moreover, this embodiment is instrumental in maintaining data integrity and consistency. By consolidating structured data into a centralized repository, it becomes easier to verify the accuracy and completeness of the information. Any discrepancies or missing data can be addressed before moving on to the next phase of analysis.
[00150] Additionally, the structured nature of the consolidated JSON file lends itself well to deep learning operations. It provides a clean and organized dataset that can be readily fed into artificial intelligence models, neural networks, or other advanced algorithms. This dataset serves as the foundation for training models that can automate tasks such as skills matching, candidate ranking, and personalized recommendations.
[00151] The data merging ensures data consistency, integrity, and accessibility, ultimately enhancing the effectiveness and utility of the resume parser in various applications, including but not limited to talent acquisition and HR processes.
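The merging step can be sketched as reading each per-resume annotation file and writing one consolidated JSON list. The `merge_annotations` helper and the file names are illustrative; only the overall pattern (many JSON files in, one structured file out) reflects the process described above.

```python
import json
import tempfile
from pathlib import Path

def merge_annotations(json_paths, out_path):
    """Combine per-resume annotation files into one consolidated JSON
    list, preserving each resume's segment hierarchy intact."""
    merged = [json.loads(Path(p).read_text()) for p in json_paths]
    Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged

# Illustrative usage with two tiny annotation files in a temp directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "r1.json").write_text(json.dumps({"Skills": ["Java", "Python"]}))
(tmp / "r2.json").write_text(json.dumps({"Skills": ["SQL"]}))
merged = merge_annotations([tmp / "r1.json", tmp / "r2.json"], tmp / "all.json")
# merged is a list of two resume dicts; all.json holds the same data.
```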
[00152] The performance of both the sentence-level and word-level NLP models is evaluated on unseen data to ensure the effectiveness of the parsing process. Metrics such as accuracy, precision, and recall are measured to validate the performance of the models. The information extraction module integrates outputs from both NLP models to extract comprehensive information from the resumes. This integrated data, including structured and unstructured information, is then utilized to generate profiles of the candidates. Clustering techniques are applied to group resumes based on similarities in the extracted information, aiding in candidate selection and analysis.
[00153] Ultimately, the insights provided by the parsing system streamline the recruitment process, improve candidate selection, and enhance workforce management practices at XYZ Corp.
[00154] FIG. 6 illustrates an exemplary view of a flow diagram of proposed method for parsing resumes, in accordance with some embodiments of the present disclosure.
[00155] In an embodiment, a method 600 for parsing resumes is disclosed. At step 602, an input unit 104 receives a set of resumes from candidates from a set of sources, such as job portals, career websites, or direct submissions from the candidates.
[00156] At step 604, the controller 202 extracts text from each resume using a pymupdf library. This step includes accessing the content within each resume document and extracting the textual information contained within. To facilitate this process, the method includes an additional step wherein the controller 202 converts the received set of resumes into PDF format before extracting text from each resume. This conversion ensures uniformity in the format of the resumes, as different candidates may submit their resumes in various file formats such as DOC, DOCX, or PDF. By standardizing the format to PDF, the parsing system ensures compatibility and consistency in the extraction process. Once the resumes are converted to PDF, the controller 202 utilizes the pymupdf library to extract the text, effectively retrieving the textual content from each resume document.
[00157] At step 606, the controller 202 checks the font size of the extracted text to identify headings, labeling the extracted text having a larger font size as a heading. This includes analyzing the textual content extracted from the resumes to identify portions of text that serve as headings or subheadings. By examining the font size of the text, the system can distinguish between regular text and the larger text typically used for headings. Text with a larger font size is indicative of headings that signify different sections of the resume, such as "Work Experience," "Education," "Skills," etc. Upon identifying text with a larger font size, the controller 202 labels this extracted text as a heading. This labeling categorizes the text based on its formatting characteristics, allowing for the segmentation of the resume content into distinct sections. By labeling text with a larger font size as headings, the system effectively identifies and delineates the structural elements of the resumes.
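The font-size heuristic described above can be sketched as follows. A minimal illustration is given below, assuming span data of the form (text, font_size) arrives from an upstream extractor (for example, PyMuPDF's structured text output); the threshold value and field names are illustrative assumptions, not taken from the disclosure.

```python
# Illustrative sketch: label spans whose font size exceeds the body-text
# size as headings. The 11.0-point threshold is an assumption.

def label_headings(spans, body_size=11.0):
    """Label each (text, size) span as a 'heading' or 'body' by font size."""
    labeled = []
    for text, size in spans:
        role = "heading" if size > body_size else "body"
        labeled.append({"text": text, "size": size, "role": role})
    return labeled

spans = [
    ("Work Experience", 16.0),
    ("Software Engineer at Acme Corp, 2020-2023", 11.0),
    ("Education", 16.0),
    ("B.Tech, Computer Science", 11.0),
]
labeled = label_headings(spans)
headings = [s["text"] for s in labeled if s["role"] == "heading"]
```

In practice the body-text size would itself be estimated per document (for example, as the most frequent span size), since resumes vary widely in typography.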
[00158] At step 608, the controller 202 segments the extracted text into predefined categories.
[00159] At step 610, the controller 202 extracts sentence-based headings from each resume. This includes identifying sentences within the resumes that serve as headings or subheadings for different sections of the document, such as "Work Experience," "Education," "Skills," etc. Sentence-based headings typically provide concise descriptions or summaries of the content that follows, guiding the reader through the structure of the resume. Once the sentence-based headings are identified, the controller 202 merges these extracted headings to create a first dataset. This dataset is structured in a pre-defined file format, such as CSV (Comma-Separated Values), which is commonly used for storing tabular data. In this dataset, each row corresponds to an individual sentence-based heading extracted from the resumes.
[00160] By creating a first dataset containing sentence-based headings, the system organizes and consolidates the structural elements of the resumes into a standardized format. This facilitates further processing and analysis of the resume content, allowing for easier navigation and interpretation of the information by subsequent stages of the parsing process. Additionally, structuring the headings in a dataset enables efficient storage and retrieval of the data, enhancing the effectiveness of the resume parsing system in extracting relevant information for recruitment and human resources management purposes.
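Assembling the first dataset as described above can be sketched with the standard csv module. The column names and resume identifiers below are illustrative assumptions; the disclosure specifies only that the format is row-per-heading CSV.

```python
# Illustrative sketch: write one sentence-based heading per CSV row.
import csv
import io

headings = [
    ("resume_001", "Work Experience"),
    ("resume_001", "Education"),
    ("resume_002", "Skills"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["resume_id", "sentence_heading"])  # assumed column names
writer.writerows(headings)                          # one heading per row
csv_text = buf.getvalue()
```

Writing to an in-memory buffer here keeps the sketch self-contained; a real pipeline would write to a file on disk instead.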
[00161] At step 612, the controller 202 assigns labels to each sentence indicating its category and encodes the labels into integer values. Encoding the labels into integer values involves representing each category with a unique numerical identifier. For example, "Work Experience" may be assigned the integer value 0, "Education" may be assigned 1, and so on. This encoding process transforms the categorical labels into numerical representations, making them suitable for processing and analysis by machine learning algorithms and other computational methods.
[00162] By assigning labels and encoding them into integer values, the system standardizes the representation of resume categories, enabling efficient organization and analysis of the resume content. This standardized format facilitates subsequent stages of the parsing process, such as training machine learning models or conducting statistical analyses, by providing a structured and numerical representation of the resume data. Additionally, encoding the labels into integer values enhances the computational efficiency of the system, as numerical data can be processed more efficiently than categorical data, ultimately contributing to the effectiveness of the resume parsing system in extracting and categorizing relevant information for recruitment and human resources management purposes.
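The label assignment and integer encoding can be sketched as follows. The category list mirrors the examples in the text; the specific mapping and sample sentences are illustrative assumptions.

```python
# Illustrative sketch: map category labels to unique integer identifiers,
# then encode labeled sentences as (text, integer_label) pairs.

CATEGORIES = ["Work Experience", "Education", "Skills", "Projects"]
label_to_id = {label: i for i, label in enumerate(CATEGORIES)}

sentences = [
    ("Led a team of five engineers.", "Work Experience"),
    ("B.Tech in Computer Science.", "Education"),
    ("Python, SQL, Docker.", "Skills"),
]
encoded = [(text, label_to_id[label]) for text, label in sentences]
```

Keeping the mapping in a single dictionary also gives the inverse lookup needed to decode model predictions back into category names.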
[00163] At step 614, the controller 202 utilizes a Named Entity Recognition (NER) annotator tool to annotate the data stored in the first dataset, in JavaScript Object Notation (JSON) format, for training a sentence-level Natural Language Processing (NLP) model. The NER annotator tool, which utilizes machine learning techniques to detect and label the relevant segments, is also used for annotating the word-based headings.
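A JSON annotation record of the kind produced by common NER annotator tools can be sketched as below, using character-offset entity spans. The disclosure does not specify the tool's exact schema, so this layout and the entity labels are assumptions for illustration.

```python
# Illustrative sketch: one JSON training record with character-offset
# entity annotations, in the style used by common NER annotator tools.
import json

text = "Work Experience: Software Engineer at Acme Corp"
record = {
    "text": text,
    "entities": [
        # [start, end, label] character offsets into `text`
        [0, 15, "SECTION_HEADING"],
        [17, 34, "JOB_TITLE"],
    ],
}
serialized = json.dumps(record)
restored = json.loads(serialized)
```

Storing offsets rather than the entity strings themselves keeps annotations unambiguous even when the same phrase occurs more than once in a resume.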
[00164] At step 616, the controller 202 extracts word-based headings from each resume. These headings appear in bold, capitalized, or otherwise emphasized text within the resume and represent key sections such as "Work Experience," "Education," "Skills," etc. The extraction process includes scanning the text of each resume to identify and extract these word-based headings. The controller 202 utilizes techniques such as pattern matching or heuristics to locate and isolate the headings within the document. Once identified, the word-based headings are extracted and stored for further processing.
[00165] In an exemplary embodiment, word-based headings assist in structuring resume content and facilitating its interpretation. By extracting these headings, the system establishes the organizational framework of the resumes, enabling subsequent stages of parsing to focus on specific sections of interest. Additionally, word-based headings serve as anchors for navigating the resume content and locating relevant information quickly and efficiently.
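One plausible form of the pattern-matching heuristic mentioned above is sketched below: short lines that match a known section name, or that are title-cased with no punctuation or digits, are treated as word-based headings. The exact rules are not given in the disclosure, so the regex and known-heading list are assumptions.

```python
# Illustrative sketch: detect word-based headings by pattern matching.
import re

# Short runs of letters and spaces only (no digits or punctuation).
HEADING_RE = re.compile(r"^[A-Z][A-Za-z ]{0,30}$")

def extract_word_headings(lines, known={"WORK EXPERIENCE", "EDUCATION", "SKILLS"}):
    headings = []
    for line in lines:
        stripped = line.strip()
        if stripped.upper() in known or (
            HEADING_RE.match(stripped)
            and stripped.istitle()
            and len(stripped.split()) <= 3
        ):
            headings.append(stripped)
    return headings

lines = [
    "WORK EXPERIENCE",
    "Software Engineer at Acme Corp from 2020 to 2023",
    "Education",
    "B.Tech, Computer Science",
]
found = extract_word_headings(lines)
```

In a full system this text heuristic would be combined with the formatting signals the disclosure mentions (bold or emphasized text) for higher precision.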
[00166] At step 618, the controller 202 annotates each segment using the Named Entity Recognition (NER) annotator tool. This annotation process involves identifying and labeling specific entities or segments within the text data extracted from resumes, similar to the process described earlier. These entities may include names, organizations, skills, dates, or any other relevant information present in the resume content.
[00167] After annotating each segment, the controller 202 merges the annotated segments into a second dataset in JSON format. This dataset is structured similarly to the first dataset but contains annotations at a more granular level, focusing on individual segments rather than entire sentences. The JSON format allows for the representation of hierarchical data structures, facilitating the storage and transmission of annotated segments and their associated attributes.
[00168] The annotated segments stored in the second dataset are used for training a word-level Natural Language Processing (NLP) model. This model is specifically designed to analyze and understand the meaning and context of individual words or phrases within the resume text. By training the word-level NLP model with annotated data, the system enhances its ability to extract key phrases and relevant information from resumes.
[00169] Furthermore, the output of the word-level NLP model is utilized to extract key phrases from resumes, facilitating keyword-based search and analysis of profiles of the candidates. This process involves identifying and extracting important terms or phrases from the resume content, which can be used for various purposes such as matching candidates with job openings, identifying relevant skills or qualifications, or conducting keyword-based searches across a database of resumes.
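The keyword-based search over extracted key phrases can be sketched as follows. The phrase sets stand in for word-level NLP model output; the resume identifiers and matching rule (every query term must be present) are illustrative assumptions.

```python
# Illustrative sketch: keyword search over per-candidate key-phrase sets.

profiles = {
    "resume_001": {"python", "machine learning", "docker"},
    "resume_002": {"java", "spring", "sql"},
    "resume_003": {"python", "sql", "airflow"},
}

def search(profiles, query_terms):
    """Return resume ids whose key phrases contain every query term."""
    wanted = {t.lower() for t in query_terms}
    return sorted(rid for rid, phrases in profiles.items() if wanted <= phrases)

matches = search(profiles, ["Python", "SQL"])
```

Requiring all terms (set containment) gives conjunctive matching; a ranking variant could instead score candidates by the number of overlapping terms.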
[00170] Further, the method 600 includes a step of merging the outputs of the sentence-level NLP model and the word-level NLP model for generating comprehensive profiles of the candidates, including both structured and unstructured information extracted from the received set of resumes. These comprehensive profiles offer a holistic view of the candidates, incorporating both broad categories of information and specific details relevant to their qualifications and experiences.
[00171] The generation of comprehensive profiles enables recruiters and human resources professionals to gain a more nuanced understanding of candidates' backgrounds and capabilities. It facilitates more informed decision-making in the recruitment process, allowing organizations to identify the most suitable candidates for specific roles based on a comprehensive evaluation of their profiles.
[00172] At step 620, the controller 202 evaluates the performance of the trained sentence-level NLP model and the word-level NLP model on received unseen data using a training and testing unit. This evaluation process is essential for assessing the effectiveness and accuracy of the NLP models in processing and analyzing resume data.
[00173] The evaluation includes testing the trained NLP models on a separate set of unseen resumes or data samples that were not used during the training phase. This unseen data serves as a benchmark for evaluating how well the NLP models generalize to new and unseen resume content.
[00174] The controller 202 utilizes a training and testing unit to conduct the evaluation process systematically. This unit is responsible for splitting the available data into two subsets: one for training the models and another for testing their performance. The training subset is used to fine-tune the parameters of the NLP models, while the testing subset, which is withheld during training, is used to assess their performance on new data.
[00175] During the evaluation, the controller measures various performance metrics such as accuracy, precision, recall, and F1-score to evaluate how well the NLP models perform in extracting and categorizing information from the unseen resumes. These metrics provide insights into the models' ability to correctly identify and classify different resume components, such as work experience, education, skills, etc. By evaluating the performance of both the sentence-level and word-level NLP models, the system ensures that the parsing process is robust and reliable across different levels of granularity.
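The metrics named above (accuracy, precision, recall, F1-score) can be computed as sketched below for a single category; the sample labels are illustrative, and a full evaluation would aggregate these per-category scores (for example, macro-averaged) across all resume sections.

```python
# Illustrative sketch: evaluation metrics for one category ("Skills")
# computed from true vs. predicted section labels on held-out data.

def evaluate(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["Skills", "Education", "Skills", "Work", "Skills"]
y_pred = ["Skills", "Skills", "Skills", "Work", "Education"]
metrics = evaluate(y_true, y_pred, positive="Skills")
```

Here the model makes one false positive and one false negative for "Skills", so precision and recall both come out at 2/3 on this toy sample.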
[00176] At step 622, the controller 202 integrates outputs of the sentence-level NLP model and the word-level NLP model to extract information from the received set of resumes. Further, the method includes processing the integrated outputs of the sentence-level NLP model and the word-level NLP model using clustering techniques to group the set of resumes based on similarities in the extracted information. Clustering the resumes based on similarities in the extracted information allows for more efficient organization and analysis of the resume data. This enables recruiters and human resources professionals to identify groups of candidates with similar skills, experiences, or qualifications, facilitating targeted recruitment efforts and talent acquisition strategies.
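The disclosure does not name a specific clustering algorithm, so the grouping step can be sketched with one simple choice: greedy grouping by Jaccard similarity over each candidate's extracted skill set. The threshold and sample data are illustrative assumptions.

```python
# Illustrative sketch: group resumes whose extracted skill sets are
# similar, using Jaccard similarity with a greedy threshold rule.

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(resumes, threshold=0.5):
    """Greedily assign each resume to the first sufficiently similar cluster."""
    clusters = []  # each cluster: list of (resume_id, skills)
    for rid, skills in resumes.items():
        for group in clusters:
            if jaccard(skills, group[0][1]) >= threshold:
                group.append((rid, skills))
                break
        else:
            clusters.append([(rid, skills)])
    return [[rid for rid, _ in group] for group in clusters]

resumes = {
    "r1": {"python", "sql", "docker"},
    "r2": {"python", "sql", "airflow"},
    "r3": {"java", "spring"},
}
groups = cluster(resumes)
```

A production system would more likely vectorize the integrated model outputs and apply a standard algorithm such as k-means; the greedy rule here simply keeps the sketch dependency-free.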
[00177] Thus, the present disclosure provides a system and method for resume parsing and information extraction that offers a transformative approach to resume management, significantly enhancing the efficiency and effectiveness of recruitment processes. By ensuring data accuracy through advanced techniques such as the Named Entity Recognition (NER) annotator tool, the system empowers users to extract, categorize, and analyze essential resume segments with precision. Furthermore, the system revolutionizes data collection by simplifying the process and creating a diverse dataset covering various sectors and resume formats. Through the utilization of cutting-edge deep learning models and semantic analysis, the system standardizes and enriches resume data, making it more valuable for subsequent analysis. These advancements streamline the entire resume processing pipeline, saving time, improving accuracy, and maximizing the utility of resume data. As a result, organizations and individuals benefit from enhanced recruitment processes, improved job matching, and deeper insights into personal career trajectories, ultimately driving greater efficiency and productivity in human resources management.
[00178] The disclosed methods and systems, as illustrated in the ongoing description or any of its components, may be embodied in the form of a computer system. Typical examples of a computer system include a general-purpose computer, a programmed microprocessor, a micro-controller, a peripheral integrated circuit element, and other devices, or arrangements of devices that are capable of implementing the steps that constitute the method of the disclosure.
[00179] The computer system includes a computer, an input device, a display unit, and the internet. The computer further includes a microprocessor. The microprocessor is connected to a communication bus. The computer also includes a memory. The memory may be RAM or ROM. The computer system further comprises a storage device, which may be a HDD or a removable storage drive such as a floppy-disk drive, an optical-disk drive, and the like. The storage device may also be a means for loading computer programs or other instructions onto the computer system. The computer system also includes a communication unit. The communication unit allows the computer to connect to other databases and the internet through an input/output (I/O) interface, allowing the transfer as well as reception of data from other sources. The communication unit may include a modem, an Ethernet card, or other similar devices that enable the computer system to connect to databases and networks, such as, LAN, MAN, WAN, and the internet. The computer system facilitates input from a user through input devices accessible to the system through the I/O interface.
[00180] To process input data, the computer system executes a set of instructions stored in one or more storage elements. The storage elements may also hold data or other information, as desired. The storage element may be in the form of an information source or a physical memory element present in the processing machine.
[00181] The programmable or computer-readable instructions may include various commands that instruct the processing machine to perform specific tasks, such as steps that constitute the method of the disclosure. The systems and methods described can also be implemented using only software programming or only hardware, or using a varying combination of the two techniques. The disclosure is independent of the programming language and the operating system used in the computers. The instructions for the disclosure can be written in all programming languages, including, but not limited to, ‘C’, ‘C++’, ‘Visual C++’, ‘Visual Basic’, ‘Java’, ‘Python’. Further, software may be in the form of a collection of separate programs, a program module containing a larger program, or a portion of a program module, as discussed in the ongoing description. The software may also include modular programming in the form of object-oriented programming. The processing of input data by the processing machine may be in response to user commands, the results of previous processing, or from a request made by another processing machine. The disclosure can also be implemented in various operating systems and platforms, including, but not limited to, ‘Unix’, ‘DOS’, ‘Android’, ‘Symbian’, and ‘Linux’.
[00182] The programmable instructions can be stored and transmitted on a computer-readable medium. The disclosure can also be embodied in a computer program product comprising a computer-readable medium, or with any product capable of implementing the above methods and systems, or the numerous possible variations thereof.
[00183] Various embodiments of the system and method for resume parsing have been disclosed. However, it should be apparent to those skilled in the art that modifications in addition to those described are possible without departing from the inventive concepts herein. The embodiments, therefore, are not restrictive, except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be understood in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps, in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or used, or combined with other elements, components, or steps that are not expressly referenced.
[00184] Those having ordinary skills in the art will appreciate that the systems, modules, and sub-modules have been illustrated and explained to serve as examples and should not be considered limiting in any manner. It will be further appreciated that the variants of the above disclosed system elements, modules, and other features and functions, or alternatives thereof, may be combined to create other different systems or applications.
[00185] Those skilled in the art will appreciate that any of the aforementioned steps and/or system modules may be suitably replaced, reordered, or removed, and additional steps and/or system modules may be inserted, depending on the needs of a particular application. In addition, the systems of the aforementioned embodiments may be implemented using a wide variety of suitable processes and system modules, and are not limited to any particular computer hardware, software, middleware, firmware, microcode, and the like.
[00186] The claims can encompass embodiments for hardware and software, or a combination thereof.
[00187] It will be appreciated that variants of the above disclosed, and other features and functions or alternatives thereof, may be combined into many other different systems or applications. Presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.

ADVANTAGES OF THE PRESENT DISCLOSURE
[00188] The present disclosure provides a system that revolutionizes resume data management by guaranteeing data accuracy, thereby enhancing the reliability of extracted information.
[00189] The present disclosure provides a system that empowers users with the Named Entity Recognition (NER) annotator tool, allowing for precise data annotation and facilitating the identification and categorization of essential resume segments.
[00190] The present disclosure provides a system that enhances recruitment processes, job matching, and personal career insights by delivering a state-of-the-art resume parsing solution that saves time, improves accuracy, and maximizes the utility of resume data.
[00191] The present disclosure provides a system that simplifies the data collection process by utilizing advanced techniques, such as user submissions, to create a diverse and extensive dataset covering various sectors and resume formats.
[00192] The present disclosure provides a system that utilizes cutting-edge deep learning models for accurate information extraction and mapping that ensures the model's precision in recognizing key resume components.
[00193] The present disclosure provides a system that standardizes and enriches resume data through the extraction of text-based information and the application of semantic analysis, this makes it more valuable for subsequent analysis.
[00194] The present disclosure provides a system that offers comprehensive, accurate, and versatile solutions to streamline the entire resume processing pipeline, this improves efficiency and productivity in recruitment and human resources management processes.
CLAIMS

We Claim:
1. A resume parsing system (100) comprising:
an input unit (102) configured to receive a set of resumes from candidates from a set of sources;
a controller (202) in communication with the input unit (102), and the controller (202) comprising one or more processors, wherein the one or more processors are operatively coupled with a memory, the memory storing instructions executable by the one or more processors to:
extract text from each resume using a pymupdf library;
check font size of each extracted text to identify headings, and label the extracted text having a larger font size as a heading;
segment the extracted text into predefined categories;
extract sentence-based headings from each resume and merge the extracted sentence-based headings to create a first dataset in a pre-defined file format, wherein each row contains an individual sentence;
assign labels to each sentence indicating the category and encode the labels into integer values;
utilize a Named Entity Recognition (NER) annotator tool to annotate stored data in the first dataset, in JavaScript Object Notation (JSON) format for training a sentence-level Natural Language Processing (NLP) model;
extract word-based headings from each resume and merge the extracted word-based headings to create a second dataset in the pre-defined file format; and
annotate each segment using the NER annotator tool, and merge the annotated segment into the second dataset in the JSON format for training a word-level Natural Language Processing (NLP) model; and
a training and testing unit (112) configured to evaluate performance of the trained sentence-level NLP model and the word-level NLP model on received unseen data;
wherein the controller is further configured to integrate outputs of the sentence-level NLP model and the word-level NLP model to extract information from the received set of resumes.

2. The system (100) as claimed in claim 1, wherein the received set of resumes is converted into pdf format prior to extracting text from each resume.

3. The system (100) as claimed in claim 1, wherein the NER annotator tool used for annotating the word-based headings utilizes machine learning techniques to detect and label the relevant segments.

4. The system (100) as claimed in claim 1, wherein the output of the word-level NLP model is utilized to extract key phrases from resumes, facilitating keyword-based search and analysis of profiles of the candidates.

5. The system (100) as claimed in claim 1, wherein the outputs of the sentence-level NLP model and the word-level NLP model are merged to generate comprehensive profiles of the candidates, including both structured and unstructured information extracted from the received set of resumes.

6. The system (100) as claimed in claim 1, wherein the integrated outputs of the sentence-level NLP model and the word-level NLP model are further processed using clustering techniques to group the set of resumes based on similarities in the extracted information.

7. A method (600) for parsing resumes, comprising the steps of:
receiving (602), a set of resumes from candidates from a set of sources by an input unit;
extracting (604), by a controller, text from each resume using a pymupdf library;
checking (606), by the controller, font size of each extracted text to identify headings, and labeling the extracted text having a larger font size as a heading;
segmenting (608), by the controller, the extracted text into predefined categories;
extracting (610), by the controller, sentence-based headings from each resume and merging the extracted sentence-based headings to create a first dataset in a pre-defined file format, wherein each row contains an individual sentence;
assigning (612), by the controller, labels to each sentence indicating the category and encoding the labels into integer values;
utilizing (614), by the controller, a Named Entity Recognition (NER) annotator tool to annotate stored data in the first dataset, in JavaScript Object Notation (JSON) format for training a sentence-level Natural Language Processing (NLP) model;
extracting (616), by the controller, word-based headings from each resume and storing in a second dataset;
annotating (618), by the controller, each segment using the NER annotator tool, and merging the annotated segment into the second dataset in JSON format for training a word-level Natural Language Processing (NLP) model;
evaluating (620) performance of the trained sentence-level NLP model and the word-level NLP model on received unseen data using a training and testing unit; and
integrating (622) outputs of the sentence-level NLP model and the word-level NLP model to extract information from the received set of resumes.

8. The method as claimed in claim 7, further comprising the step of converting, by the controller, the received set of resumes into PDF format prior to extracting text from each resume.

9. The method as claimed in claim 7, wherein the NER annotator tool used for annotating the word-based headings utilizes machine learning techniques for detecting and labeling the relevant segments.

10. The method as claimed in claim 7, further comprising the step of utilizing the output of the word-level NLP model to extract key phrases from resumes, facilitating keyword-based search and analysis of profiles of the candidates.

11. The method as claimed in claim 7, further comprising the step of merging the outputs of the sentence-level NLP model and the word-level NLP model for generating comprehensive profiles of the candidates, including both structured and unstructured information extracted from the received set of resumes.

12. The method as claimed in claim 7, further comprising the step of processing the integrated outputs of the sentence-level NLP model and the word-level NLP model using clustering techniques to group the set of resumes based on similarities in the extracted information.

Documents

Application Documents

# Name Date
1 202411035999-POWER OF AUTHORITY [07-05-2024(online)].pdf 2024-05-07
2 202411035999-MSME CERTIFICATE [07-05-2024(online)].pdf 2024-05-07
3 202411035999-FORM28 [07-05-2024(online)].pdf 2024-05-07
4 202411035999-FORM-9 [07-05-2024(online)].pdf 2024-05-07
5 202411035999-FORM FOR SMALL ENTITY(FORM-28) [07-05-2024(online)].pdf 2024-05-07
6 202411035999-FORM 3 [07-05-2024(online)].pdf 2024-05-07
7 202411035999-FORM 18A [07-05-2024(online)].pdf 2024-05-07
8 202411035999-FORM 1 [07-05-2024(online)].pdf 2024-05-07
9 202411035999-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [07-05-2024(online)].pdf 2024-05-07
10 202411035999-ENDORSEMENT BY INVENTORS [07-05-2024(online)].pdf 2024-05-07
11 202411035999-DRAWINGS [07-05-2024(online)].pdf 2024-05-07
12 202411035999-COMPLETE SPECIFICATION [07-05-2024(online)].pdf 2024-05-07
13 202411035999-IntimationUnderRule24C(4).pdf 2025-02-05