Abstract: Embodiments of the present disclosure relate to a method and system for extracting text from an engineering drawing for performing accurate OCR. Initially, an image of the engineering drawing comprising a plurality of components is received. Each of the plurality of components in the image is classified as either a textual component or a non-textual component. At least one word element is identified for the textual components from the plurality of components based on segmentation of the plurality of components. The segmentation is performed by drawing a plurality of horizontal edge projections of a predefined length for each of the textual components. Further, the textual components are identified as being associated with the at least one word element when the horizontal edge projection of each of the textual components overlaps with an adjacent textual component. The at least one word element is provided as extracted text for performing OCR on the engineering drawing. Figure 4
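The segmentation step summarized above can be illustrated with a minimal sketch. It is not the applicant's implementation: the `Box` tuple layout, the function names `projection_reaches` and `group_words`, and the union-find grouping are all assumptions made for illustration. The sketch assumes that each textual component is already available as a rectangular contour, and that a projection "overlaps" an adjacent component when a horizontal strip of the predefined length, drawn rightward from the contour's right edge, intersects that component's rectangle.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (x, y, width, height) of a component's rectangular contour.
Box = Tuple[int, int, int, int]


def projection_reaches(src: Box, dst: Box, length: int) -> bool:
    """Return True if a strip of `length` drawn rightward from the right
    edge of `src` intersects the rectangle `dst`."""
    sx, sy, sw, sh = src
    dx, dy, dw, dh = dst
    right = sx + sw
    # Horizontal test: dst starts within the strip and extends past src's right edge.
    overlaps_x = dx <= right + length and dx + dw > right
    # Vertical test: the two rectangles share some rows.
    overlaps_y = sy < dy + dh and dy < sy + sh
    return overlaps_x and overlaps_y


def group_words(boxes: List[Box], length: int) -> List[List[Box]]:
    """Group textual components into word elements (assumed grouping rule)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of components joined by a projection.
    for i, src in enumerate(boxes):
        for j, dst in enumerate(boxes):
            if i != j and projection_reaches(src, dst, length):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Order components left to right within a word, and words top to bottom.
    words = [sorted(group) for group in groups.values()]
    return sorted(words, key=lambda word: (word[0][1], word[0][0]))
```

With four 10-pixel-wide boxes on one line at x = 0, 12, 40 and 52, and a fifth box on a lower line, a projection length of 5 yields three word elements, because the 18-pixel gap between the second and third boxes is longer than the projection.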
Claims: We claim:
1. A method to extract text from an engineering drawing for performing Optical Character Recognition (OCR), wherein the method comprises:
receiving, by a text extraction system (101), an image (206) of an engineering drawing (102) comprising a plurality of components;
classifying, by the text extraction system (101), each of the plurality of components in the image (206) to be one of a textual component and a non-textual component;
identifying, by the text extraction system (101), at least one word element (209) for textual components from the plurality of components based on segmentation of the plurality of components, wherein the segmentation comprises:
drawing a plurality of horizontal edge projections of a predefined length (210) for each of the textual components; and
identifying the textual components to be associated with the at least one word element (209) when horizontal edge projection of each of the textual components overlaps with adjacent textual component; and
providing, by the text extraction system (101), the at least one word element (209) as extracted text for performing OCR on the engineering drawing (102).
2. The method as claimed in claim 1, wherein the classification of each of the plurality of components is performed using a deep learning classifier, wherein the deep learning classifier is trained using a plurality of predefined textual components and a plurality of predefined non-textual components.
3. The method as claimed in claim 1, wherein classifying of each of the plurality of components comprises:
converting the image (206) to a gray-scale image;
drawing a rectangular boundary for each of the plurality of components upon the conversion; and
determining probability (211) of each of the plurality of components to be the textual component, wherein the probability (211) is compared with a predefined threshold (212) to classify the plurality of components to be one of the textual component and the non-textual component.
4. The method as claimed in claim 3, wherein a component from the plurality of components associated with the probability (211) greater than the predefined threshold (212) is classified to be the textual component and a component from the plurality of components associated with the probability (211) one of lesser than and equal to the predefined threshold (212) is classified to be the non-textual component.
5. The method as claimed in claim 1, wherein the predefined length (210) is equal to an adaptive threshold associated with the image (206), wherein the adaptive threshold is an average of distances between rectangular contours of sequential components from the plurality of components.
6. The method as claimed in claim 1, wherein each of the plurality of horizontal edge projections is drawn from a right edge of a rectangular contour associated with a corresponding textual component.
7. A text extraction system to extract text from an engineering drawing for performing Optical Character Recognition (OCR), comprising:
a processor (105); and
a memory (107) communicatively coupled to the processor (105), wherein the memory (107) stores processor-executable instructions, which, on execution, cause the processor (105) to:
receive an image (206) of an engineering drawing (102) comprising a plurality of components;
classify each of the plurality of components in the image (206) to be one of a textual component and a non-textual component;
identify at least one word element (209) for textual components from the plurality of components based on segmentation of the plurality of components, wherein the segmentation comprises:
draw a plurality of horizontal edge projections of a predefined length (210) for each of the textual components; and
identify the textual components to be associated with the at least one word element (209) when horizontal edge projection of each of the textual components overlaps with adjacent textual component; and
provide the at least one word element (209) as extracted text for performing OCR on the engineering drawing (102).
8. The text extraction system (101) as claimed in claim 7, wherein the classification of each of the plurality of components is performed using a deep learning classifier, wherein the deep learning classifier is trained using a plurality of predefined textual components and a plurality of predefined non-textual components.
9. The text extraction system (101) as claimed in claim 7, wherein classifying of each of the plurality of components comprises:
converting the image (206) to a gray-scale image;
drawing a rectangular boundary for each of the plurality of components upon the conversion; and
determining probability (211) of each of the plurality of components to be the textual component, wherein the probability (211) is compared with a predefined threshold (212) to classify the plurality of components to be one of the textual component and the non-textual component.
10. The text extraction system (101) as claimed in claim 9, wherein a component from the plurality of components associated with the probability (211) greater than the predefined threshold (212) is classified to be the textual component and a component from the plurality of components associated with the probability (211) one of lesser than and equal to the predefined threshold (212) is classified to be the non-textual component.
11. The text extraction system (101) as claimed in claim 7, wherein the predefined length (210) is equal to an adaptive threshold associated with the image (206), wherein the adaptive threshold is an average of distances between rectangular contours of sequential components from the plurality of components.
12. The text extraction system (101) as claimed in claim 7, wherein each of the plurality of horizontal edge projections is drawn from a right edge of a rectangular contour associated with a corresponding textual component.
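The adaptive threshold recited in claims 5 and 11 can be illustrated with a short sketch. The claims do not define how "sequential" components are ordered or how a distance between two contours is measured, so the following rests on labeled assumptions: contours are hypothetical `(x, y, width, height)` tuples lying on a single text line, they are ordered by their left edge, the distance is the horizontal gap from one contour's right edge to the next contour's left edge, and overlapping contours contribute a gap of zero. The function name `adaptive_threshold` is likewise illustrative.

```python
from typing import List, Tuple

# Hypothetical layout: (x, y, width, height) of a component's rectangular contour.
Box = Tuple[int, int, int, int]


def adaptive_threshold(boxes: List[Box]) -> float:
    """Average horizontal gap between consecutive contours (assumed reading
    of the claimed adaptive threshold)."""
    ordered = sorted(boxes)  # assumption: left-to-right order on one text line
    gaps = [
        max(0, nxt[0] - (cur[0] + cur[2]))  # assumption: overlaps count as zero
        for cur, nxt in zip(ordered, ordered[1:])
    ]
    # With fewer than two contours there is no gap to average.
    return sum(gaps) / len(gaps) if gaps else 0.0
```

For three 10-pixel-wide contours at x = 0, 12 and 40, the gaps are 2 and 18, so this sketch returns 10.0 as the projection length.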
Dated this 31st day of August, 2018
R Ramya Rao
Of K&S Partners
Agent for the Applicant
IN/PA-1607
Description:
TECHNICAL FIELD
The present subject matter relates in general to Optical Character Recognition (OCR) systems and more particularly, but not exclusively, to a method and system for extracting text from engineering drawings for performing OCR.
| # | Name | Date |
|---|---|---|
| 1 | 201841032861-STATEMENT OF UNDERTAKING (FORM 3) [31-08-2018(online)].pdf | 2018-08-31 |
| 2 | 201841032861-REQUEST FOR EXAMINATION (FORM-18) [31-08-2018(online)].pdf | 2018-08-31 |
| 3 | 201841032861-POWER OF AUTHORITY [31-08-2018(online)].pdf | 2018-08-31 |
| 4 | 201841032861-FORM 18 [31-08-2018(online)].pdf | 2018-08-31 |
| 5 | 201841032861-FORM 1 [31-08-2018(online)].pdf | 2018-08-31 |
| 6 | 201841032861-DRAWINGS [31-08-2018(online)].pdf | 2018-08-31 |
| 7 | 201841032861-DECLARATION OF INVENTORSHIP (FORM 5) [31-08-2018(online)].pdf | 2018-08-31 |
| 8 | 201841032861-COMPLETE SPECIFICATION [31-08-2018(online)].pdf | 2018-08-31 |
| 9 | Abstract_201841032861.jpg | 2018-09-03 |
| 10 | 201841032861-Request Letter-Correspondence [05-09-2018(online)].pdf | 2018-09-05 |
| 11 | 201841032861-Power of Attorney [05-09-2018(online)].pdf | 2018-09-05 |
| 12 | 201841032861-Form 1 (Submitted on date of filing) [05-09-2018(online)].pdf | 2018-09-05 |
| 13 | 201841032861-Proof of Right (MANDATORY) [17-09-2018(online)].pdf | 2018-09-17 |
| 14 | Correspondence by Agent_Form30,Form1_24-09-2018.pdf | 2018-09-24 |
| 15 | 201841032861-PETITION UNDER RULE 137 [25-08-2021(online)].pdf | 2021-08-25 |
| 16 | 201841032861-FORM 3 [25-08-2021(online)].pdf | 2021-08-25 |
| 17 | 201841032861-FER_SER_REPLY [25-08-2021(online)].pdf | 2021-08-25 |
| 18 | 201841032861-FER.pdf | 2021-10-17 |
| 19 | 201841032861-US(14)-HearingNotice-(HearingDate-06-02-2024).pdf | 2024-01-09 |
| 20 | 201841032861-POA [15-01-2024(online)].pdf | 2024-01-15 |
| 21 | 201841032861-FORM 13 [15-01-2024(online)].pdf | 2024-01-15 |
| 22 | 201841032861-Correspondence to notify the Controller [15-01-2024(online)].pdf | 2024-01-15 |
| 23 | 201841032861-AMENDED DOCUMENTS [15-01-2024(online)].pdf | 2024-01-15 |
| 24 | 201841032861-FORM-26 [06-02-2024(online)].pdf | 2024-02-06 |
| 25 | 201841032861-Written submissions and relevant documents [21-02-2024(online)].pdf | 2024-02-21 |
| 26 | 201841032861-FORM 3 [21-02-2024(online)].pdf | 2024-02-21 |
| 27 | 201841032861-PatentCertificate27-02-2024.pdf | 2024-02-27 |
| 28 | 201841032861-IntimationOfGrant27-02-2024.pdf | 2024-02-27 |
| 29 | 201841032861-FORM 4 [10-09-2024(online)].pdf | 2024-09-10 |
| 1 | searchE_25-02-2021.pdf | |