Abstract: A method and device for classifying Uniform Resource Locators (URLs) based on content in corresponding websites are disclosed. The method includes extracting, by a network device, a plurality of website contents from a website associated with a URL based on Optical Character Recognition (OCR). The method further includes classifying, by the network device, each of the plurality of website contents into a plurality of webpage categories based on machine learning. The method further includes simulating, by the network device, user actions for the plurality of website contents, based on a webpage category associated with each of the plurality of website contents. The method further includes determining, by the network device, an access classification for the URL based on results of simulating the user actions and machine learning. FIG. 2
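
The four steps summarized above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the specification: every name (`PageContent`, `extract_contents`, `classify_url`, and so on) is hypothetical, and the OCR engine, the machine-learning classifier, the action simulators, and the final decision model are passed in as plain callables because the source does not name any particular library or model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageContent:
    """One piece of website content (hypothetical container)."""
    text: str           # text recovered from a page image by OCR
    category: str = ""  # webpage category assigned by the classifier


def extract_contents(images: List[bytes],
                     ocr: Callable[[bytes], str]) -> List[PageContent]:
    """Step 1: run the supplied OCR callable over each captured page image."""
    return [PageContent(text=ocr(image)) for image in images]


def classify_contents(contents: List[PageContent],
                      classifier: Callable[[str], str]) -> List[PageContent]:
    """Step 2: assign a webpage category to each extracted content."""
    for content in contents:
        content.category = classifier(content.text)
    return contents


def simulate_actions(contents: List[PageContent],
                     simulators: Dict[str, Callable[[PageContent], str]]) -> List[str]:
    """Step 3: run the simulated user action registered for each category.

    Contents whose category has no registered simulator are skipped.
    """
    results = []
    for content in contents:
        simulate = simulators.get(content.category)
        if simulate is not None:
            results.append(simulate(content))
    return results


def classify_url(images: List[bytes],
                 ocr: Callable[[bytes], str],
                 classifier: Callable[[str], str],
                 simulators: Dict[str, Callable[[PageContent], str]],
                 decide: Callable[[List[str]], str]) -> str:
    """Steps 1 to 4: return an access classification for the URL."""
    contents = classify_contents(extract_contents(images, ocr), classifier)
    results = simulate_actions(contents, simulators)
    # Step 4: the decision model maps simulation results to one classification.
    return decide(results)


if __name__ == "__main__":
    # Toy stand-ins only; real OCR and trained models would replace these.
    label = classify_url(
        images=[b"Enter your password"],
        ocr=lambda image: image.decode("utf-8"),
        classifier=lambda text: "login page" if "password" in text.lower() else "other",
        simulators={"login page": lambda content: "accepted_fake_credentials"},
        decide=lambda results: "phishing" if "accepted_fake_credentials" in results else "unclassified",
    )
    print(label)
```

The rule used in the toy `decide` callable (a login page that accepts fake credentials is treated as phishing) is an illustrative assumption, not a rule stated in the source.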
Claims:
WE CLAIM:
1. A method of classifying Uniform Resource Locators (URLs) based on content in corresponding websites, the method comprising:
extracting, by a network device, a plurality of website contents from a website associated with a URL based on Optical Character Recognition (OCR);
classifying, by the network device, each of the plurality of website contents into a plurality of webpage categories based on machine learning;
simulating, by the network device, user actions for the plurality of website contents, based on a webpage category associated with each of the plurality of website contents; and
determining, by the network device, an access classification for the URL based on results of simulating the user actions and machine learning.
2. The method of claim 1, further comprising identifying at least one URL attribute associated with the URL, wherein the at least one URL attribute comprises registration date of an associated domain, age of the associated domain, replica of the associated domain, or ownership of the associated domain.
3. The method of claim 2, wherein determining the access classification for the URL further comprises:
evaluating the URL based on the at least one URL attribute to determine the access classification; and
assigning risk and threat scores to the URL for each of the at least one URL attribute and each of the results of simulating the user actions.
4. The method of claim 1, wherein extracting the plurality of website contents comprises:
capturing at least one image associated with each webpage in the website; and
performing OCR on each of the at least one image associated with each webpage in the website.
5. The method of claim 4, wherein classifying each of the plurality of website contents into the plurality of webpage categories comprises:
identifying, based on machine learning, each of the plurality of website contents from the at least one image associated with each webpage; and
determining, based on machine learning, a webpage category from the plurality of webpage categories for each of the plurality of website contents, in response to identifying.
6. The method of claim 1, wherein the plurality of webpage categories comprises at least one of a login page, download instruction page, financial information collecting page, and personal information gathering page.
7. The method of claim 1, wherein the plurality of website contents comprises at least one of images, textual description, input fields, buttons, audio, video, embedded documents, paragraphs, links, file paths, forms, Hyper Text Markup Language (HTML) geolocations, or HTML references.
8. The method of claim 1, further comprising performing at least one action corresponding to the URL based on the determined access classification.
9. The method of claim 8, wherein the at least one action comprises blocking the URL, granting access to the URL based on user role in an organization, reporting the URL as harmful, or sending a recommendation to downstream network devices to allow or block the URL.
10. The method of claim 1, wherein the access classification comprises at least one of phishing classification, adult classification, organization specific restriction classification, malware classification, ransomware classification, brokerage or trading classification, business or economy classification, file storage or sharing classification, financial services classification, malicious botnet Command and Control (CnC) classification, malicious malware sources classification, piracy/copyright violation classification, proxy avoidance classification, or software download classification.
11. A network device for classifying Uniform Resource Locators (URLs) based on content in corresponding websites, the device comprising:
a processor; and
a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, cause the processor to:
extract a plurality of website contents from a website associated with a URL based on Optical Character Recognition (OCR);
classify each of the plurality of website contents into a plurality of webpage categories based on machine learning;
simulate user actions for the plurality of website contents, based on a webpage category associated with each of the plurality of website contents; and
determine an access classification for the URL based on results of simulating the user actions and machine learning.
12. The network device of claim 11, wherein the processor instructions further cause the processor to identify at least one URL attribute associated with the URL, wherein the at least one URL attribute comprises registration date of an associated domain, age of the associated domain, replica of the associated domain, or ownership of the associated domain.
13. The network device of claim 12, wherein to determine the access classification for the URL, the processor instructions further cause the processor to:
evaluate the URL based on the at least one URL attribute to determine the access classification; and
assign risk and threat scores to the URL for each of the at least one URL attribute and each of the results of simulating the user actions.
14. The network device of claim 11, wherein to extract the plurality of website contents, the processor instructions further cause the processor to:
capture at least one image associated with each webpage in the website; and
perform OCR on each of the at least one image associated with each webpage in the website.
15. The network device of claim 14, wherein to classify each of the plurality of website contents into the plurality of webpage categories, the processor instructions further cause the processor to:
identify, based on machine learning, each of the plurality of website contents from the at least one image associated with each webpage; and
determine, based on machine learning, a webpage category from the plurality of webpage categories for each of the plurality of website contents, in response to identifying.
16. The network device of claim 11, wherein the plurality of website contents comprises at least one of images, textual description, input fields, buttons, audio, video, embedded documents, paragraphs, links, file paths, forms, Hyper Text Markup Language (HTML) geolocations, or HTML references.
17. The network device of claim 11, wherein the processor instructions further cause the processor to perform at least one action corresponding to the URL based on the determined access classification.
18. The network device of claim 17, wherein the at least one action comprises blocking the URL, granting access to the URL based on user role in an organization, reporting the URL as harmful, or sending a recommendation to downstream network devices to allow or block the URL.
19. The network device of claim 11, wherein the access classification comprises at least one of phishing classification, adult classification, organization specific restriction classification, malware classification, ransomware classification, brokerage or trading classification, business or economy classification, file storage or sharing classification, financial services classification, malicious botnet Command and Control (CnC) classification, malicious malware sources classification, piracy/copyright violation classification, proxy avoidance classification, or software download classification.
Dated this 26th day of April, 2018
Swetha SN
Of K&S Partners
Agent for the Applicant
IN/PA-2123
Description:
TECHNICAL FIELD
This disclosure relates generally to classifying Uniform Resource Locators (URLs) and more particularly to a method and device for classifying URLs based on content in corresponding websites.
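
As a rough illustration of how the scoring and action selection recited in claims 2, 3, 8, and 9 might fit together, the sketch below assigns one score per URL attribute and one per simulation result, sums them, and maps the total to an action. It is an assumption-laden example rather than the disclosed method: the weights, the thresholds (`block_at`, `review_at`), the 30-day "young domain" cutoff, the result strings, and the `security_analyst` role are all hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UrlAttributes:
    """URL attributes named in claim 2 (field names are hypothetical)."""
    domain_age_days: int  # age of the associated domain
    is_replica: bool      # domain appears to imitate another domain
    owner_known: bool     # ownership of the domain could be established


def attribute_scores(attrs: UrlAttributes,
                     young_domain_days: int = 30) -> Dict[str, float]:
    """Assign one risk score per URL attribute (weights are assumptions)."""
    return {
        "domain_age": 1.0 if attrs.domain_age_days < young_domain_days else 0.0,
        "replica": 1.0 if attrs.is_replica else 0.0,
        "ownership": 0.0 if attrs.owner_known else 0.5,
    }


def simulation_scores(results: List[str]) -> Dict[str, float]:
    """Assign one score per distinct simulation result (labels are assumptions)."""
    suspicious = {"accepted_fake_credentials", "triggered_download"}
    return {result: (1.0 if result in suspicious else 0.0) for result in results}


def total_risk(attrs: UrlAttributes, results: List[str]) -> float:
    """Sum the per-attribute and per-result scores into one risk value."""
    return (sum(attribute_scores(attrs).values())
            + sum(simulation_scores(results).values()))


def choose_action(risk: float, user_role: str,
                  block_at: float = 2.0, review_at: float = 1.0) -> str:
    """Map a risk value to an action, granting access by role in the middle band."""
    if risk >= block_at:
        return "block"
    if risk >= review_at:
        # Middle band: only a hypothetical privileged role is allowed through.
        return "allow" if user_role == "security_analyst" else "block"
    return "allow"


if __name__ == "__main__":
    attrs = UrlAttributes(domain_age_days=5, is_replica=True, owner_known=False)
    risk = total_risk(attrs, ["accepted_fake_credentials"])
    print(risk, choose_action(risk, user_role="employee"))
```

A simple additive score is used here only because it is easy to follow; the source states that machine learning is involved in determining the access classification, so a trained model could take the place of the fixed weights.
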
| # | Name | Date |
|---|---|---|
| 1 | 201841015809-STATEMENT OF UNDERTAKING (FORM 3) [26-04-2018(online)].pdf | 2018-04-26 |
| 2 | 201841015809-REQUEST FOR EXAMINATION (FORM-18) [26-04-2018(online)].pdf | 2018-04-26 |
| 3 | 201841015809-POWER OF AUTHORITY [26-04-2018(online)].pdf | 2018-04-26 |
| 4 | 201841015809-FORM 18 [26-04-2018(online)].pdf | 2018-04-26 |
| 5 | 201841015809-FORM 18 [26-04-2018(online)]-1.pdf | 2018-04-26 |
| 6 | 201841015809-FORM 1 [26-04-2018(online)].pdf | 2018-04-26 |
| 7 | 201841015809-DRAWINGS [26-04-2018(online)].pdf | 2018-04-26 |
| 8 | 201841015809-DECLARATION OF INVENTORSHIP (FORM 5) [26-04-2018(online)].pdf | 2018-04-26 |
| 9 | 201841015809-COMPLETE SPECIFICATION [26-04-2018(online)].pdf | 2018-04-26 |
| 10 | 201841015809-Request Letter-Correspondence [30-07-2018(online)].pdf | 2018-07-30 |
| 11 | 201841015809-Power of Attorney [30-07-2018(online)].pdf | 2018-07-30 |
| 12 | 201841015809-Form 1 (Submitted on date of filing) [30-07-2018(online)].pdf | 2018-07-30 |
| 13 | 201841015809-Proof of Right (MANDATORY) [21-09-2018(online)].pdf | 2018-09-21 |
| 14 | Correspondence by Agent_Form1_26-09-2018.pdf | 2018-09-26 |
| 15 | 201841015809-REQUEST FOR CERTIFIED COPY [01-10-2018(online)].pdf | 2018-10-01 |
| 16 | 201841015809-PETITION UNDER RULE 137 [29-04-2021(online)].pdf | 2021-04-29 |
| 17 | 201841015809-FORM 3 [29-04-2021(online)].pdf | 2021-04-29 |
| 18 | 201841015809-FER_SER_REPLY [29-04-2021(online)].pdf | 2021-04-29 |
| 19 | 201841015809-FER.pdf | 2021-10-17 |
| 20 | 201841015809-PatentCertificate29-12-2023.pdf | 2023-12-29 |
| 21 | 201841015809-IntimationOfGrant29-12-2023.pdf | 2023-12-29 |
| 22 | 201841015809-PROOF OF ALTERATION [18-03-2024(online)].pdf | 2024-03-18 |
| 1 | TPO201841015809E_26-10-2020.pdf | |