Abstract: A method and device for extracting images from PDF documents are disclosed. The method includes performing a text recognition process on a PDF document that includes one or more images. The text recognition process replaces the one or more images with a plurality of contiguous newlines. The method further includes storing a location of each of the one or more images within the PDF document based on occurrence of the plurality of contiguous newlines within the PDF document. The method includes converting each page of the PDF document to an image format in order to generate an image document corresponding to the PDF document. The method further includes extracting each of the one or more images from the image document based on the location stored for each of the one or more images within the PDF document. FIG. 3
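The location-storing step in the abstract depends on finding where the text recognition output contains runs of contiguous newlines. The following is a minimal sketch of that detection only, assuming the recognized text of each page is already available as a string. The function name `find_newline_runs` and the `min_run` threshold of three newlines are hypothetical choices for illustration and are not taken from the disclosure.

```python
import re
from typing import List, Tuple


def find_newline_runs(page_text: str, min_run: int = 3) -> List[Tuple[int, int]]:
    """Return (character_offset, newline_count) for each run of newlines.

    Assumption: a run of at least `min_run` consecutive newlines marks a
    place where an image was replaced during text recognition.
    """
    pattern = re.compile(r"\n{%d,}" % min_run)
    return [(m.start(), len(m.group())) for m in pattern.finditer(page_text)]


def locate_candidates(pages: List[str], min_run: int = 3) -> List[dict]:
    """Collect candidate image positions across pages (page numbers are 1-based)."""
    candidates = []
    for page_number, text in enumerate(pages, start=1):
        for offset, count in find_newline_runs(text, min_run):
            candidates.append(
                {"page": page_number, "offset": offset, "newlines": count}
            )
    return candidates
```

The offsets here are positions in the recognized text, not page coordinates. Mapping a run to the coordinates of an image on the page is a separate step that this sketch does not attempt.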
Claims: WE CLAIM:
1. A method for extracting images from Portable Document Format (PDF) documents, the method comprising:
performing, by an image extraction device, a text recognition process on a PDF document comprising one or more images, wherein the text recognition process replaces the one or more images with a plurality of contiguous newlines;
storing, by the image extraction device, a location of each of the one or more images within the PDF document based on occurrence of the plurality of contiguous newlines within the PDF document;
converting, by the image extraction device, each page of the PDF document to an image format in order to generate an image document corresponding to the PDF document; and
extracting, by the image extraction device, each of the one or more images from the image document based on the location stored for each of the one or more images within the PDF document.
2. The method of claim 1, wherein at least one of the one or more images is a vector graphic image.
3. The method of claim 1, wherein storing a location of an image from the one or more images within the PDF document comprises associating a location metadata with the PDF document, wherein the location metadata comprises information related to the location of the image.
4. The method of claim 1, wherein a location of an image from the one or more images comprises a page number of a page including the image and coordinates of corners of the image within the page.
5. The method of claim 4, wherein extracting an image from the image document comprises incrementally scanning a page of the image document comprising the image based on the coordinates of the image.
6. The method of claim 5, wherein the scanning comprises tracing the contour of the image based on the coordinates of corners of the image within the page of the image document in at least one of a square and a rectangle pattern.
7. The method of claim 1 further comprising storing each of the one or more images in a predefined format in response to extracting each of the one or more images from the image document.
8. The method of claim 7, wherein an extracted image is tagged with an associated location metadata indicating location of the extracted image within the PDF document.
9. The method of claim 1, wherein the text recognition process is performed using an Open source Computer Vision (OpenCV) tool.
10. An image extraction device for extracting images from Portable Document Format (PDF) documents, the image extraction device comprising:
at least one processor;
a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, cause the processor to:
perform a text recognition process on a PDF document comprising one or more images, wherein the text recognition process replaces the one or more images with a plurality of contiguous newlines;
store a location of each of the one or more images within the PDF document based on occurrence of the plurality of contiguous newlines within the PDF document;
convert each page of the PDF document to an image format in order to generate an image document corresponding to the PDF document; and
extract each of the one or more images from the image document based on the location stored for each of the one or more images within the PDF document.
11. The image extraction device of claim 10, wherein at least one of the one or more images is a vector graphic image.
12. The image extraction device of claim 10, wherein to store a location of an image from the one or more images within the PDF document, the processor instructions further cause the processor to associate a location metadata with the PDF document, wherein the location metadata comprises information related to the location of the image.
13. The image extraction device of claim 10, wherein a location of an image from the one or more images comprises a page number of a page including the image and coordinates of corners of the image within the page.
14. The image extraction device of claim 13, wherein to extract an image from the image document, the processor instructions further cause the processor to incrementally scan a page of the image document comprising the image based on the coordinates of the image.
15. The image extraction device of claim 14, wherein to scan, the processor instructions further cause the processor to trace the contour of the image based on the coordinates of corners of the image within the page of the image document in at least one of a square and a rectangle pattern.
16. The image extraction device of claim 10, wherein the processor instructions further cause the processor to store each of the one or more images in a predefined format in response to extracting each of the one or more images from the image document.
17. The image extraction device of claim 16, wherein an extracted image is tagged with an associated location metadata indicating location of the extracted image within the PDF document.
18. The image extraction device of claim 10, wherein the text recognition process is performed using an Open source Computer Vision (OpenCV) tool.
Dated this 24th day of May, 2017
Swetha SN
Of K&S Partners
Agent for the Applicant
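The location described in claims 4 and 13 (a page number plus the coordinates of the image corners) and the extraction and tagging described in claims 5 to 8 can be illustrated with a small sketch. It assumes each page of the image document is a plain grid of pixel values and that every image occupies an axis-aligned rectangle. The names `ImageLocation`, `crop`, and `extract_images` are hypothetical, and the rectangular crop stands in for the incremental scan and contour trace, which the claims do not specify in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Page = List[List[int]]  # one rasterized page: rows of pixel values


@dataclass(frozen=True)
class ImageLocation:
    """Hypothetical location record: page number plus a bounding rectangle."""
    page_number: int  # 1-based page index
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    def corners(self) -> List[Tuple[int, int]]:
        """Corner coordinates as (x, y), clockwise from the top-left."""
        return [
            (self.left, self.top),
            (self.right, self.top),
            (self.right, self.bottom),
            (self.left, self.bottom),
        ]


def crop(page: Page, loc: ImageLocation) -> Page:
    """Copy the rectangle given by `loc` out of the page, row by row."""
    return [row[loc.left:loc.right] for row in page[loc.top:loc.bottom]]


def extract_images(image_document: List[Page],
                   locations: List[ImageLocation]) -> List[Dict]:
    """Crop every stored location and tag the result with its location."""
    extracted = []
    for loc in locations:
        page = image_document[loc.page_number - 1]
        extracted.append({"pixels": crop(page, loc), "location": loc})
    return extracted
```

Keeping the location record alongside the cropped pixels means each extracted image carries the page and position it came from, in line with the tagging described in claims 8 and 17.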
Description: TECHNICAL FIELD
This disclosure relates generally to extracting images from documents, and more particularly to a method and device for extracting images from Portable Document Format (PDF) documents.
| # | Name | Date |
|---|---|---|
| 1 | Power of Attorney [24-05-2017(online)].pdf | 2017-05-24 |
| 2 | Form 5 [24-05-2017(online)].pdf | 2017-05-24 |
| 3 | Form 3 [24-05-2017(online)].pdf | 2017-05-24 |
| 4 | Form 18 [24-05-2017(online)].pdf_578.pdf | 2017-05-24 |
| 5 | Form 18 [24-05-2017(online)].pdf | 2017-05-24 |
| 6 | Form 1 [24-05-2017(online)].pdf | 2017-05-24 |
| 7 | Drawing [24-05-2017(online)].pdf | 2017-05-24 |
| 8 | Description(Complete) [24-05-2017(online)].pdf_577.pdf | 2017-05-24 |
| 9 | Description(Complete) [24-05-2017(online)].pdf | 2017-05-24 |
| 10 | REQUEST FOR CERTIFIED COPY [25-05-2017(online)].pdf | 2017-05-25 |
| 11 | abstract 201741018278 .jpg | 2017-05-25 |
| 12 | 201741018278-Proof of Right (MANDATORY) [13-07-2017(online)].pdf | 2017-07-13 |
| 13 | Correspondence by Agent_Form 1_18-07-2017.pdf | 2017-07-18 |
| 14 | 201741018278-REQUEST FOR CERTIFIED COPY [19-07-2017(online)].pdf | 2017-07-19 |
| 15 | 201741018278-FER.pdf | 2020-06-26 |
| 16 | 201741018278-PETITION UNDER RULE 137 [25-12-2020(online)].pdf | 2020-12-25 |
| 17 | 201741018278-OTHERS [25-12-2020(online)].pdf | 2020-12-25 |
| 18 | 201741018278-FORM 3 [25-12-2020(online)].pdf | 2020-12-25 |
| 19 | 201741018278-FER_SER_REPLY [25-12-2020(online)].pdf | 2020-12-25 |
| 20 | 201741018278-DRAWING [25-12-2020(online)].pdf | 2020-12-25 |
| 21 | 201741018278-COMPLETE SPECIFICATION [25-12-2020(online)].pdf | 2020-12-25 |
| 22 | 201741018278-CLAIMS [25-12-2020(online)].pdf | 2020-12-25 |
| 23 | 201741018278-US(14)-HearingNotice-(HearingDate-01-03-2023).pdf | 2023-02-10 |
| 24 | 201741018278-POA [17-02-2023(online)].pdf | 2023-02-17 |
| 25 | 201741018278-FORM 13 [17-02-2023(online)].pdf | 2023-02-17 |
| 26 | 201741018278-Correspondence to notify the Controller [17-02-2023(online)].pdf | 2023-02-17 |
| 27 | 201741018278-AMENDED DOCUMENTS [17-02-2023(online)].pdf | 2023-02-17 |
| 28 | 201741018278-Written submissions and relevant documents [16-03-2023(online)].pdf | 2023-03-16 |
| 29 | 201741018278-FORM-26 [16-03-2023(online)].pdf | 2023-03-16 |
| 30 | 201741018278-PatentCertificate28-03-2023.pdf | 2023-03-28 |
| 31 | 201741018278-IntimationOfGrant28-03-2023.pdf | 2023-03-28 |
| 1 | SearchStrategyMatrixE_26-06-2020.pdf | |