Method And System For Context Free Grammar Based Domain Name

Method And System For Context Free Grammar Based Domain Name Generation For Defensive Registration

Abstract: Organizations operating within a given country register their domain names in different top-level domains or employ different words in second-level domains. Cyber attackers can exploit these inconsistencies to generate new domains that look confusingly similar to benign domains. An attacker can then register these domains, if available, and use them for malicious purposes. Embodiments herein provide a method and system for context-free grammar (CFG) based domain name generation for defensive registration. The system is configured to model inconsistencies present in benign domain names of different organizations’ websites using the CFG and to generate various candidate domain names. A decision-making flowchart is used to classify the candidate domain names into defensive, malicious, and suspicious categories. Further, the candidate domain names are mapped with benign domain names and ranked using a Probabilistic Context-Free Grammar (PCFG) model. The system then recommends the ranked candidate domain names for defensive registration to curb abuse from domain squatters. [To be published with FIG. 2]

Patent Information

Application #

Filing Date

19 April 2021

Publication Number

42/2022

Publication Type

INA

Invention Field

COMMUNICATION

Status

kcopatents@khaitanco.com

Parent Application

Patent Number

Legal Status

Grant Date

2025-02-13

Renewal Date

Applicants

Tata Consultancy Services Limited

Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. KUMAR, Neeraj

Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013

2. GHEWARI, Sukhada

Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013

3. TUPSAMUDRE, Harshal

Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013

4. LODHA, Sachin Premsukh

Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention: METHOD AND SYSTEM FOR CONTEXT-FREE GRAMMAR BASED DOMAIN NAME GENERATION FOR DEFENSIVE REGISTRATION
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to the field of domain name generation and, more specifically, to a method and system for context-free grammar (CFG) based domain name generation for defensive registration.
BACKGROUND
[002] In today’s digital era, a large number of users rely on banking websites to perform financial transactions. The widespread adoption of online banking and the monetary value associated with each user’s account make banking domains a potential target for domain squatting. In addition, due to the ever-increasing adoption of online banking and the financial nature of the transactions involved, the banking sector is one of the top hotspots for cyber-attacks. According to various studies, the banking services industry has the highest cost of cybercrime. The costliest types of attacks for banks are denial of service, phishing, social engineering, and malicious insiders. The Anti-Phishing Working Group (APWG) report published in Q4 2020 identified a total of 637,302 phishing websites, of which 22.50% were aimed at financial institutions, making it one of the most targeted industry sectors.
[003] In order to perform such attacks, the attacker strategically registers domain names that are confusingly similar to those belonging to popular brands of any business sector. This practice is popularly known as domain squatting. Domain squatting includes activities such as impersonating the original websites to steal traffic, and distributing advertisements and malware. Different business sectors use different words in their second-level domains (SLDs), in a different order, and register them in different top-level domains (TLDs). The attacker can exploit these inconsistencies to produce new domain names resembling the structure of benign domain names and register them for malicious purposes, which could adversely affect banks, both financially and reputationally. This problem is not limited to banking organizations but can be found in different business sectors such as webmail, healthcare, e-commerce, and social networking. Across the online services offered by these business sectors, it is observed that organizations do not follow any common strategy or pattern while registering their domains, nor any common defensive domain registration strategy. This leaves a gap for attackers to exploit these inconsistencies by registering domains that look confusingly similar to benign domains and to steal sensitive information, which leads to monetary loss, reputation loss and data loss.
SUMMARY
[004] Embodiments of the disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method and system for context-free grammar (CFG) based domain name generation for defensive registration is provided.
[005] In one aspect, a processor-implemented method of context-free grammar (CFG) based domain name generation for defensive registration is provided. The method includes one or more steps such as receiving a plurality of benign uniform resource locators (URLs) of a business sector from a user, and parsing the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL. Further, the method includes analyzing the collected one or more URL components to identify at least one inconsistency present in at least one of the plurality of benign URLs and learning a context-free grammar (CFG) of the identified at least one inconsistency to generate one or more candidate domains. Herein, the at least one inconsistency includes a difference in usage of words and order of words in SLDs, and in TLDs, across the plurality of received benign URLs. Furthermore, the method includes collecting one or more details of each of the generated one or more candidate domains by crawling WHOIS information, Domain Name System (DNS) information, and webpage information, mapping the one or more details of the generated candidate domains with the received plurality of benign URLs to rank the one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model, and recommending one or more candidate domains for defensive registration. Furthermore, the method includes categorizing the one or more generated candidate domains into defensive, suspicious, malicious, and unrelated categories based on the collected one or more details of each of the one or more candidate domains and a predefined decision-making flowchart for each category.
[006] In another aspect, a system for context-free grammar (CFG) based domain name generation for defensive registration is provided. The system includes an input/output interface configured to receive a plurality of benign uniform resource locators (URLs) of a business sector from a user, one or more hardware processors and at least one memory storing a plurality of instructions, wherein the one or more hardware processors are configured to execute the plurality of instructions stored in the at least one memory.
[007] Further, the system is configured to parse each of the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL. The collected one or more URL components are analyzed to identify at least one inconsistency present in at least one of the plurality of benign URLs. Herein, the at least one inconsistency includes a difference in usage of words and order of words of the SLD, and in the TLD, of the plurality of benign URLs. Furthermore, the system is configured to learn a context-free grammar of the identified at least one inconsistency to generate one or more candidate domains. The one or more details of each of the generated one or more candidate domains are collected by crawling WHOIS information, DNS information, and webpage information, and the one or more details of the generated candidate domains are mapped with the received plurality of benign URLs. Further, the system is configured to rank the mapped one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model and recommend, via the input/output interface, one or more candidate domains for defensive registration.
[008] In yet another aspect, one or more non-transitory machine-readable information storage mediums are provided comprising one or more instructions, which when executed by one or more hardware processors cause a method of context-free grammar (CFG) based domain name generation for defensive registration to be performed. The method includes one or more steps such as receiving a plurality of benign uniform resource locators (URLs) of a business sector from a user, and parsing the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL. Further, the method includes analyzing the collected one or more URL components to identify at least one inconsistency present in at least one of the plurality of benign URLs and learning a context-free grammar of the identified at least one inconsistency to generate one or more candidate domains. Herein, the at least one inconsistency includes a difference in usage of words and order of words in SLDs, and in TLDs, across the plurality of received benign URLs. Furthermore, the method includes collecting one or more details of each of the generated one or more candidate domains by crawling WHOIS information, DNS information, and webpage information, mapping the one or more details of the generated candidate domains with the received plurality of benign URLs to rank the one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model, and recommending one or more candidate domains for defensive registration. Furthermore, the method includes categorizing the one or more generated candidate domains into defensive, suspicious, malicious, and unrelated categories based on the collected one or more details of each of the one or more candidate domains and a predefined decision-making flowchart for each category.
[009] It is to be understood that the foregoing general descriptions and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[010] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[011] FIG. 1 illustrates an exemplary system for context-free grammar (CFG) based domain name generation for defensive registration, according to an embodiment of the present disclosure.
[012] FIG. 2 is a block diagram of the system for context-free grammar (CFG) based domain name generation for defensive registration, according to an embodiment of the present disclosure.
[013] FIG. 3 is a flowchart to illustrate context-free grammar (CFG) based domain name generation for defensive registration, according to an embodiment of the present disclosure.
[014] FIG. 4 (a) & (b) are decision-making flowcharts to illustrate categorization of the one or more candidate domains into different categories, according to an embodiment of the present disclosure.
[015] FIG. 5 is a flow diagram to illustrate ranking and recommendation of generated candidate domains for defensive registration, according to an embodiment of the present disclosure.
[016] FIG. 6 is a flow diagram to illustrate a method of context-free grammar (CFG) based domain name generation for defensive registration, in accordance with some embodiments of the present disclosure.
[017] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes, which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
[018] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[019] The embodiments herein provide a method and a system for context-free grammar (CFG) based domain name generation for defensive registration. It has been observed that internet domain names registered by organizations do not follow any common pattern or uniform strategy, which could lead to combosquatting, TLDsquatting and other forms of domain squatting abuse. For instance, banks operating in India register their domain names in different top-level domains (TLDs), such as onlinesbi.com for State Bank of India and iobnet.co.in for Indian Overseas Bank, or employ different words in second-level domains (SLDs), such as online in onlinesbi.com and net in iobnet.co.in. An attacker can exploit these inconsistencies and generate four new domains, two by exchanging the top-level domains to produce iobnet.com and onlinesbi.co.in, and two by exchanging the words to produce onlineiob.co.in and netsbi.com. Therefore, it is the responsibility of the banks to anticipate such domains before attackers do and to defensively register them to prevent any fraudulent activities.
[020] Herein, the system is configured to model inconsistencies present in benign domains of any organization’s websites using the CFG and to use it to generate one or more candidate domain names that look similar to the benign domains. It is to be noted that the disclosure herein is generic: it is not restricted to modeling inconsistencies present in the benign domains of banking websites, but can be leveraged in any business sector, such as social networking websites, e-commerce websites, online/cloud service websites, gaming websites, webmail websites and government websites. It is to be noted that hereinafter the terms benign Uniform Resource Locators (URLs) and benign domains are used interchangeably. A decision-making flowchart is used to classify the generated one or more candidate domain names into defensive, malicious, suspicious, and unrelated categories. It is to be noted that three new forms of domain squatting, namely comboTLDsquatting, fullname squatting, and brandname squatting, are also identified.

[021] Referring now to the drawings, and more particularly to FIG. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[022] FIG. 1 illustrates a block diagram of a system (100) for context-free grammar (CFG) based domain name generation for defensive registration, in accordance with an example embodiment. Although the present disclosure is explained considering that the system (100) is implemented on a server, it may be understood that the system (100) may comprise one or more computing devices (102), such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system (100) may be accessed through one or more input/output interfaces 104-1, 104-2... 104-N, collectively referred to as I/O interface (104). Examples of the I/O interface (104) may include, but are not limited to, a user interface, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation, and the like. The I/O interface (104) is communicatively coupled to the system (100) through a network (106).
[023] In an embodiment, the network (106) may be a wireless or a wired network, or a combination thereof. In an example, the network (106) can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network (106) may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network (106) may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices. The network devices within the network (106) may interact with the system (100) through communication links.

[024] The system (100) supports various connectivity options such as BLUETOOTH®, USB, ZigBee, and other cellular services. The network environment enables connection of various components of the system (100) using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system (100) is implemented to operate as a stand-alone device. In another embodiment, the system (100) may be implemented to work as a loosely coupled device to a smart computing environment. Further, the system (100) comprises at least one memory with a plurality of instructions, one or more databases (112), and one or more hardware processors (108) which are communicatively coupled with the at least one memory to execute a plurality of modules (114) therein. Further, the plurality of modules (114) comprises a data processing module (116), a CFG based domain generator (118), a data crawler (120), a domain categorization module (122), and a domain recommendation module (124). The components and functionalities of the system (100) are described further in detail.
[025] Referring to FIG. 2, a block diagram (200) of the system (100) for context-free grammar (CFG) based domain name generation for defensive registration is illustrated. The one or more I/O interfaces (104) are configured to receive one or more benign Uniform Resource Locators (URLs) of a predefined organization. The one or more I/O interfaces (104) are also configured to recommend at least one candidate domain name back to the user for defensive registration, to curb abuse from domain squatters.
[026] It is to be noted that a URL comprises a scheme, a hostname, and a path. The hostname identifies the machine connected to the web that contains the requested resource. The hostname is further sub-divided into two parts: subdomain and domain. The domain is composed of a top-level domain (TLD) and a second-level domain (SLD). The path identifies the location of the requested resource on the host machine. The path is sub-divided into three components: directory, file name and arguments. In one example, the hostname of the URL https://netbanking.hdfcbank.com/netbanking/?_ga=2.263388364 is broken down as follows:

Hostname: netbanking.hdfcbank.com
a) Subdomain: netbanking
b) Domain: hdfcbank.com
c) SLD: hdfcbank
d) TLD: com
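The breakdown above can be sketched in a few lines of Python. This is a minimal illustration, not the parser of the disclosure: the function name `split_url` and the small `MULTI_LABEL_TLDS` set are assumptions introduced here, and a real system would more likely consult the full Public Suffix List to recognise TLDs such as co.in.

```python
from urllib.parse import urlsplit

# Assumption: a tiny hand-written set of two-label TLDs taken from the
# examples in this description; it stands in for a full suffix list.
MULTI_LABEL_TLDS = {"co.in", "net.in"}


def split_url(url):
    """Split a URL's hostname into subdomain, domain, SLD and TLD."""
    hostname = urlsplit(url).hostname or ""
    labels = hostname.split(".")
    # Treat the last two labels as the TLD when they form a known suffix.
    tld_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_TLDS else 1
    tld = ".".join(labels[-tld_len:])
    sld = labels[-tld_len - 1] if len(labels) > tld_len else ""
    subdomain = ".".join(labels[:-tld_len - 1])
    domain = f"{sld}.{tld}" if sld else tld
    return {
        "hostname": hostname,
        "subdomain": subdomain,
        "domain": domain,
        "sld": sld,
        "tld": tld,
    }


parts = split_url("https://netbanking.hdfcbank.com/netbanking/?_ga=2.263388364")
```

For the example URL, `parts` holds the same five values as the list above; the scheme and path are discarded because only the hostname is used in the later steps.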
[027] It is to be noted that the data processing module (116) of the system (100) uses its input to identify differences in the second-level domain (SLD) and top-level domain (TLD) of each received benign URL. Herein, the differences can be: usage of different words, or of words in a different order, in the SLD as compared to other benign domains; different TLDs as compared to other benign domains; and different words in the SLD together with different TLDs as compared to other benign domains. Further differences are the usage of the fullname and the usage of the brandname, as compared to other benign domains. The identified differences are the inconsistencies that are present in the benign domains and are used to generate one or more candidate domain names. Different inconsistencies and squatting types are described below in Table 1.

No. | Domain Squatting Type | Candidate Domain Squatting Instances
1 | Combosquatting | bobebanking.com, ucoibanking.com, onlinebob.com, sbiibanking.com, ibankingsbi.com, bobonline.com, sbionline.com and ibankingbob.com
2 | TLDsquatting | hdfcbank.co.in, hdfcbank.net.in, obconline.com, obconline.net.in, indianbank.com and indianbank.co.in
3 | ComboTLDsquatting | mahanet.co.in and iobconnect.in
4 | Fullname squatting | indianoverseasbank.com, statebankofindia.com
5 | Brandname squatting | sbi.com and uco.com
Table 1
[028] Herein, in Table 1, the generated one or more candidate domains are instances of combosquatting, TLDsquatting, comboTLDsquatting, fullname squatting and brandname squatting domain names, out of which comboTLDsquatting, brandname squatting and fullname squatting are newly identified. Furthermore, the one or more inconsistencies present in the benign domains are explained as follows:
1.) Different words in SLDs: Bank of Baroda (BOB) uses the word “ibanking” in its SLD, whereas United Commercial Bank (UCO) uses the word “ebanking”. The attacker can exchange these two words to produce two combosquatting domains, namely bobebanking.com and ucoibanking.com, and register them (if available) for malicious purposes.
2.) Different words in different order: State Bank of India (SBI) uses the
word “online” before the brand name sbi in its SLD whereas Bank of Baroda (BOB)
uses a different word “ibanking” after the brand name bob. In this case, the attacker
can exchange the words as well as their positions to obtain six combosquatting
domains, namely onlinebob.com, sbiibanking.com, ibankingsbi.com,
bobonline.com, sbionline.com and ibankingbob.com.
3.) Different TLDs: HDFC Bank, Oriental Bank of Commerce (OBC) and
Indian Bank are registered in different TLDs, .com, co.in, and net.in respectively.
The attacker can generate six TLDsquatting domains, hdfcbank.co.in,
hdfcbank.net.in, obconline.com, obconline.net.in, indianbank.com and
indianbank.co.in, and register them (if available).
4.) Different words and TLDs: Indian Overseas Bank (IOB) and Bank of Maharashtra use different TLDs (co.in, in) as well as different words (net, connect) in their SLDs. The attacker can generate six potential candidates by exchanging words and TLDs: iobnet.in, iobconnect.co.in, mahanet.in, mahaconnect.co.in, mahanet.co.in and iobconnect.in. The last two instances are a combination of both combosquatting and TLDsquatting (as words and TLDs are both different). This new form of domain squatting is referred to as comboTLDsquatting.
5.) Difference in usage of full name: Bank of India uses its full organization name in its online domain bankofindia.com, whereas Oriental Bank of Commerce (OBC), Bank of Baroda (BOB) and State Bank of India (SBI) use acronyms (obconline.co.in, bobibanking.com and onlinesbi.com). The attacker can register full-name domains such as indianoverseasbank.com, bankofbarodaibanking.com and statebankofindia.com, if available, for malicious purposes. This new form of domain squatting is referred to as fullname squatting.
6.) Difference in usage of brand name: Most banks use a word alongside their brand name in SLDs, for instance onlinesbi.com and ucoebanking.com. The attacker can register just the brand names (sbi.com and uco.com), if available, and use them for malicious purposes. This new form of domain squatting is referred to as brandname squatting.
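The exchange of words and TLDs in item 4 above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `exchange_candidates` is hypothetical, the benign domain mahaconnect.in for Bank of Maharashtra is inferred from the words and TLDs listed in item 4, and the brand is assumed to precede the word in the SLD.

```python
from itertools import product


def exchange_candidates(benign):
    """Label every word/TLD exchange for each brand.

    `benign` maps a brand to the (word, tld) pair of its real domain.
    """
    words = sorted({word for word, _ in benign.values()})
    tlds = sorted({tld for _, tld in benign.values()})
    labelled = {}
    for brand, (own_word, own_tld) in benign.items():
        for word, tld in product(words, tlds):
            if (word, tld) == (own_word, own_tld):
                continue  # this combination is the benign domain itself
            if word != own_word and tld != own_tld:
                kind = "comboTLDsquatting"  # both word and TLD exchanged
            elif word != own_word:
                kind = "combosquatting"  # only the word exchanged
            else:
                kind = "TLDsquatting"  # only the TLD exchanged
            labelled[f"{brand}{word}.{tld}"] = kind
    return labelled


# Assumed benign domains: iobnet.co.in and mahaconnect.in.
benign = {"iob": ("net", "co.in"), "maha": ("connect", "in")}
candidates = exchange_candidates(benign)
```

With these two inputs the sketch yields the six candidates named in item 4, and the two labelled comboTLDsquatting are mahanet.co.in and iobconnect.in.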
[029] CFGs have been used extensively in the study of natural languages, where they are used to generate strings with particular structures. CFGs are useful in the automatic generation of domain names that resemble benign domain names. A CFG is a four-tuple (V, Σ, P, S) where:
• V is a set of non-terminal symbols, wherein the non-terminals are typically represented by capital letters.
• Σ is a set of terminals, wherein the terminals are represented by lower-case letters.
• P is a set of production rules and each rule is of the form α → (Σ ∪ V)*, where α ∈ V.
• S ∈ V is the start symbol.
[030] It is to be noted that once the CFG is available, the CFG based domain generator (118) of the system (100) is configured to generate one or more new strings by starting from the start symbol S and repeatedly replacing non-terminals according to the production rules of the grammar until no non-terminals are left. The language L of the CFG G is the set of strings composed only of terminals from Σ that are derivable from the start symbol S, i.e., L = {w ∈ Σ* | G generates w starting from S}.
[031] In one example, consider two Indian banking organizations, State Bank of India (SBI) and Indian Overseas Bank (IOB), and their respective online banking domain names, onlinesbi.com and iobnet.co.in. The production rules obtained by processing these two domain names are shown below:
S → C.T | B.T | F.T

wherein the CFG models inconsistencies in the TLDs and SLDs of the two banks, SBI and IOB. It is to be noted that SBI uses the word online in its SLD and is registered in the TLD .com, whereas IOB uses a different word, net, in its SLD and is registered in a different TLD, .co.in. Further, SBI’s SLD contains the word followed by the brand name, i.e., onlinesbi, whereas IOB’s SLD contains the brand name followed by the word, i.e., iobnet. This is modeled by the production rules of the non-terminal C → BW | WB. The terminals in the CFG are the words appearing on the right-hand side of the rules associated with the non-terminals T, B, W, and F.
[032] Further, the one or more received benign URLs are processed to create the CFG. In the above example, the resulting grammar consists of six non-terminal symbols: S represents the start symbol, C represents second-level domains (SLDs), T represents top-level domains (TLDs), F represents the organization’s full name, B represents the brand name used in SLDs and W represents words that are used along with brand names. The production rules for the non-terminals S and C are pre-determined and fixed. Specifically, the start symbol S can be substituted in three different ways, C.T, B.T and F.T, and the non-terminal C can be substituted in two different ways, BW and WB. The production rules of C consider only a single word, since most combosquatting domains are constructed by adding a single word to the original brand name. The production rules for the remaining non-terminals T, B, W, and F are learned from the benign domain name data. The rule C.T generates both combosquatting and comboTLDsquatting domains, the rule B.T generates brandname squatting domains, and the rule F.T generates fullname squatting domains. It is to be noted that TLDsquatting domains are generated by all the rules.
[033] Referring to FIG. 3, a flowchart (300) illustrates the CFG based candidate domain generation and ranking based on the PCFG model. Firstly, the CFG based domain generator (118) of the system (100) is configured to learn the CFG of the at least one inconsistency found in the benign domains to generate one or more candidate domains. Herein, the CFG based domain generator (118) takes as input a set of 3-tuples, each consisting of a benign domain name represented by d, the fullname of the organization (brand) represented by f, and the acronym of the organization name, if any, represented by a. The output is a CFG that models the inconsistencies present in the benign domains. Initially, the set of terminals is empty. All grammar productions are stored in a dictionary P with a non-terminal as key and a set of substitution rules as value. The production rules for the start symbol S and the non-terminal C are predefined. To learn production rules for the remaining four non-terminals (i.e., T, B, W and F), the system processes each tuple (d, f, a). First, it splits every domain (taking the domain “onlinesbi.com” as an instance for explanation) into two parts, SLD (onlinesbi) and TLD (com). The strings f (State Bank of India) and a (SBI) are used to further separate the SLD (onlinesbi) into two parts, brand b (sbi) and word w (online). To split the SLD into brand (b) and word (w), the commonality between the brand name and the domain is leveraged. From the brand name, the system derives its possible acronyms, along with the fullname of the brand. The system first checks whether the fullname of the brand is present in the domain. If it is not, it checks whether any acronym of that brand is present in the domain. Lastly, if neither is present, substring matching between the domain and the brand name is performed in order to get the longest common substring. In this way, the brand (b) is identified, and the word (w) is obtained by removing the brand from the SLD. Here, in the above example, for both domains, sbi and iob are the brands in the form of acronyms. Further, the system adds the TLD as a possible substitution for T, the brand as a possible substitution for B, the word as a possible substitution for W and the fullname of the brand as a possible substitution for F. Further, the strings TLD, brand, word and fullname of brand are added to the set of terminals.
[034] Further, the resulting grammar could be ambiguous since the brand name b extracted from the SLD and the fullname f of the corresponding organization can coincide, i.e., b = f. Hence, the same candidate domain can be generated using two different production rules, S → B.T and S → F.T. The ambiguity is resolved as follows: before adding b to P[B], the CFG based domain generator (118) checks whether b (brandname) equals f (fullname). If it does, it adds b to P[F] instead of adding it to P[B]. This way, P[B] contains all brand names except fullnames (e.g., acronyms, substrings) and P[F] contains only fullnames. Finally, the required CFG is obtained.
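The learning procedure of paragraphs [033] and [034] can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: the names `split_domain`, `find_brand`, `learn_grammar` and `MULTI_LABEL_TLDS` are introduced here, the two-label TLD set is a stand-in for a real suffix list, the input is assumed to be a registered domain without subdomain, and the longest-common-substring step is delegated to Python's `difflib`, which is one possible choice among several.

```python
from difflib import SequenceMatcher

# Assumption: small stand-in for a public suffix list.
MULTI_LABEL_TLDS = {"co.in", "net.in"}


def split_domain(domain):
    """Split a registered domain into (sld, tld)."""
    labels = domain.lower().split(".")
    n = 2 if ".".join(labels[-2:]) in MULTI_LABEL_TLDS else 1
    return ".".join(labels[:-n]), ".".join(labels[-n:])


def find_brand(sld, fullname, acronym):
    """Try the fullname, then the acronym, then the longest common substring."""
    full = fullname.lower().replace(" ", "")
    if full in sld:
        return full
    if acronym and acronym.lower() in sld:
        return acronym.lower()
    matcher = SequenceMatcher(None, sld, full, autojunk=False)
    match = matcher.find_longest_match(0, len(sld), 0, len(full))
    return sld[match.a:match.a + match.size]


def learn_grammar(tuples):
    """Learn P from (domain, fullname, acronym) tuples."""
    # Rules for S and C are fixed; T, B, W and F are learned below.
    P = {
        "S": {("C", ".", "T"), ("B", ".", "T"), ("F", ".", "T")},
        "C": {("B", "W"), ("W", "B")},
        "T": set(), "B": set(), "W": set(), "F": set(),
    }
    for domain, fullname, acronym in tuples:
        sld, tld = split_domain(domain)
        full = fullname.lower().replace(" ", "")
        brand = find_brand(sld, fullname, acronym)
        word = sld.replace(brand, "", 1) if brand else sld
        P["T"].add(tld)
        P["F"].add(full)
        # Ambiguity resolution: a brand equal to the fullname stays in F only.
        if brand and brand != full:
            P["B"].add(brand)
        if word:
            P["W"].add(word)
    return P


grammar = learn_grammar([
    ("onlinesbi.com", "State Bank of India", "SBI"),
    ("iobnet.co.in", "Indian Overseas Bank", "IOB"),
])
```

For the two example banks this yields the terminals com and co.in for T, sbi and iob for B, online and net for W, and the two concatenated fullnames for F.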
[035] Further, after the learned grammar G is obtained, the CFG based domain generator (118) of the system (100) is configured to generate candidate domains of combosquatting, TLDsquatting, comboTLDsquatting, brandname squatting and fullname squatting. The Generate and Concatenate Candidate Domains procedure of the CFG based domain generator (118) takes as its input a non-terminal α (starting with the start symbol S) and the grammar G, and returns all strings that can be derived starting from the non-terminal α using the production rules of G. The procedure begins by checking whether the symbol α is indeed a non-terminal in G. If it is not, it returns an empty set. If the symbol α is a non-terminal, it iterates through all production rules associated with α. For each production rule α → R, where R is a string of terminals and non-terminals of the learned grammar, the system produces the set of strings derivable from R and stores it in a set A. Initially the set A is empty. The system then scans the string R from left to right and checks for a non-terminal symbol β. If it encounters a non-terminal symbol β, it finds all possible strings generated by β. Subsequently, all strings derived from β are concatenated with the strings stored in the set A. The system uses a concatenate utility, which concatenates every string of a set S1 with every string of a set S2. At the end, all possible strings generated from the rule α → R are available in the set A, which is subsequently combined with the set L. In this way, all possible strings (domain names) derivable from the start symbol S are produced and stored in the set L.
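A compact sketch of this recursive generation is given below. It is illustrative only: the function names `generate` and `concatenate` and the dictionary layout are assumptions, and the `GRAMMAR` shown is a reconstruction from the SBI and IOB example in the description, not a listing taken from the disclosure. One small departure is that the working set starts from the empty string rather than from an empty set, so that the first concatenation has something to extend.

```python
def concatenate(s1, s2):
    """Join every string of s1 with every string of s2."""
    return {x + y for x in s1 for y in s2}


def generate(symbol, grammar):
    """Return all terminal strings derivable from `symbol`."""
    if symbol not in grammar:
        return set()  # the symbol is not a non-terminal
    language = set()
    for rhs in grammar[symbol]:
        derived = {""}  # strings derived from the part of rhs scanned so far
        for sym in rhs:  # left-to-right scan of the right-hand side
            if sym in grammar:  # non-terminal: expand it recursively
                derived = concatenate(derived, generate(sym, grammar))
            else:  # terminal: append it as is
                derived = concatenate(derived, {sym})
        language |= derived
    return language


# Assumed grammar for onlinesbi.com and iobnet.co.in; each right-hand
# side is a tuple of symbols, and dictionary keys are the non-terminals.
GRAMMAR = {
    "S": [("C", ".", "T"), ("B", ".", "T"), ("F", ".", "T")],
    "C": [("B", "W"), ("W", "B")],
    "T": [("com",), ("co.in",)],
    "B": [("sbi",), ("iob",)],
    "W": [("online",), ("net",)],
    "F": [("statebankofindia",), ("indianoverseasbank",)],
}
BENIGN = {"onlinesbi.com", "iobnet.co.in"}

# Remove the benign domains that the grammar also generates.
candidates = generate("S", GRAMMAR) - BENIGN
```

Because the grammar is non-recursive, the recursion always terminates. Using sets means a domain reachable through two different derivations is kept only once.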
[036] Further, the grammar also generates benign domains, which are removed from the final generated candidate domain set L. As the grammar is non-recursive, the recursion depth of the strings generated by different rules of the obtained grammar is finite. Specifically, strings generated using the rule S → C.T have depth three and strings generated using the rules S → B.T and S → F.T have depth two. Let |α| represent the number of terminal strings derived from the non-terminal α ∈ V. As the resulting grammar could be ambiguous, the total number of domains n that can be generated beginning from the start symbol S is at most |S|.

In the above derivation, both the number of brand names and organization names are equal to the number of tuples in the input set I, i.e., | B | = | F | = | I |. However, the set of generated domains L also contains benign domains from I. Hence, such domains are removed from L.
[037] In one example, the CFG models inconsistencies in the TLDs and SLDs of two benign domains, those of State Bank of India (SBI) and Indian Overseas Bank (IOB). SBI uses the word online in its SLD and is registered in the TLD .com, whereas IOB uses a different word, net, in its SLD and is registered in a different TLD, .co.in. The system learns the CFG for SBI and IOB and generates 24 domains, of which 2 are the benign domains themselves. The remaining 22 are similar looking domains that could potentially be registered by an attacker for malicious purposes. Each of the candidate domains is an instance of combosquatting, comboTLDsquatting, brandname squatting, TLDsquatting or fullname squatting. A few examples of the 22 newly generated domains, along with their squatting types, are given below:
• Combosquatting: sbionline.com, sbinet.com, iobonline.co.in
• TLDsquatting: onlinesbi.co.in, iobnet.com
• ComboTLDsquatting: sbionline.co.in, sbinet.co.in, onlineiob.com
• Fullname squatting: statebankofindia.com, statebankofindia.co.in, indianoverseasbank.com
• Brandname squatting: sbi.com, sbi.co.in, iob.com, iob.co.in
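The counts in the SBI and IOB example can be reproduced with a short Python sketch. The table `BANKS`, the helper names, and in particular the labelling rule in `squatting_type` are assumptions inferred from the listed examples (a combo SLD under the bank's own TLD is treated as combosquatting, under the other TLD as comboTLDsquatting); the specification does not spell the rule out.

```python
from collections import Counter
from itertools import product

# brand name -> (full name, benign SLD, benign TLD), as implied by the example
BANKS = {
    "sbi": ("statebankofindia", "onlinesbi", "com"),
    "iob": ("indianoverseasbank", "iobnet", "co.in"),
}
WORDS = ["online", "net"]
TLDS = ["com", "co.in"]


def squatting_type(brand: str, sld: str, tld: str) -> str:
    """Label a non-benign candidate; the rule is inferred, not authoritative."""
    fullname, benign_sld, benign_tld = BANKS[brand]
    if sld == brand:
        return "brandname squatting"
    if sld == fullname:
        return "fullname squatting"
    if sld == benign_sld:                 # same SLD, so only the TLD differs
        return "TLDsquatting"
    return "combosquatting" if tld == benign_tld else "comboTLDsquatting"


def generate_labelled():
    """Return (number of generated domains, {candidate domain: squatting type})."""
    labelled = {}
    total = 0
    for brand, (fullname, benign_sld, benign_tld) in BANKS.items():
        slds = {brand, fullname}                          # B and F
        for word in WORDS:
            slds.update({brand + word, word + brand})     # C -> BW | WB
        for sld, tld in product(sorted(slds), TLDS):
            total += 1
            if (sld, tld) == (benign_sld, benign_tld):
                continue                                  # drop the benign domain
            labelled[f"{sld}.{tld}"] = squatting_type(brand, sld, tld)
    return total, labelled


total, labelled = generate_labelled()
counts = Counter(labelled.values())
```

Under these assumptions `total` is 24 and `labelled` holds the 22 non-benign candidates, matching the figures given in the example.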
[038] In the preferred embodiment, the data crawler (120) of the system (100) is configured to collect one or more details of each of the generated one or more candidate domains by crawling WHOIS information, DNS information, and webpage information. A WHOIS library is used to extract WHOIS data for the given domains. The WHOIS data provides the domain creation date, updated date and expiration date, and the city, state, and country of the registrant. The DNS information determines whether the domain resolves to an IP address. The webpage information contains the final URL, HTML body, HTTP status code and a screenshot of the page corresponding to the domain.
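As an illustration of the DNS part of this step, the following Python sketch checks whether a domain resolves to an IP address. It is only one possible approach and not the crawler (120) itself: the function name `resolve_ip` and the injectable `resolver` parameter are assumptions, introduced so that the check can be exercised without network access.

```python
import socket
from typing import Callable, Optional


def resolve_ip(domain: str,
               resolver: Callable[[str], str] = socket.gethostbyname
               ) -> Optional[str]:
    """Return the IPv4 address the domain resolves to, or None if it does not."""
    try:
        return resolver(domain)
    except OSError:        # socket.gaierror is a subclass of OSError
        return None


def fake_resolver(domain: str) -> str:
    """Stand-in resolver for offline use; the mapping is purely illustrative."""
    if domain == "live.example":
        return "192.0.2.1"
    raise socket.gaierror("name not known")


live_ip = resolve_ip("live.example", fake_resolver)
dead_ip = resolve_ip("dead.example", fake_resolver)
```

In real use the default resolver would be left in place, for example `resolve_ip("sbionline.co.in")`.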
[039] Further, the data categorization module (122) of the system (100) is used to categorize the one or more candidate domains into four different categories, namely defensive, suspicious, malicious and unrelated, using the collected one or more details of each of the one or more candidate domains and a predefined decision-making flowchart (refer FIG. 4(a) & (b)) for each category.
[040] Defensive: Firstly, the decision-making flowchart (refer FIG. 4(a) of (400)) is used to compare the WHOIS record of the candidate domain with that of the benign domain. If they match, then the system marks the candidate domain as defensive. If the WHOIS record is redacted and the domain redirects to the benign website, then the system also considers it as defensive. Such domains are proactively registered by an authoritative domain owner to curb abuse from domain squatters. The defensive category is subcategorized into the six types given below:
a) Expired: If the expiry date of a domain is earlier than the current date, then the domain is classified as expired. As expired domains are potentially dangerous, organizations should be vigilant of the expired domains and proactively register them before attackers do.
b) Not Live: The system verifies if the domain resolves to some IP address (via DNS lookup). If it does not, then the domain is defensively registered but not being used.
c) Redirection to benign: The domain resolves to an IP address and redirects to the benign website.
d) Server throws error: The website page shows an error (e.g., HTTP status code 404).
e) Coinciding: The domain hosts the same web page content as that of the benign domain.
f) Content does not match: The web page has different content from that of the benign web page or it is blank.
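The six defensive subtypes can be expressed as a small decision function. The Python sketch below is only an approximation: FIG. 4(a) is not reproduced in this text, so the order of the checks, the `DefensiveDetails` record and its field names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DefensiveDetails:
    """Hypothetical record of crawled details for a defensively registered domain."""
    expiry_date: Optional[date]
    resolves: bool
    redirects_to_benign: bool
    http_status: Optional[int]
    same_content_as_benign: bool


def defensive_subtype(d: DefensiveDetails, today: date) -> str:
    """Assign one of the six subtypes; the check order is an assumption."""
    if d.expiry_date is not None and d.expiry_date < today:
        return "expired"
    if not d.resolves:
        return "not live"
    if d.redirects_to_benign:
        return "redirection to benign"
    if d.http_status is not None and d.http_status >= 400:
        return "server throws error"
    if d.same_content_as_benign:
        return "coinciding"
    return "content does not match"


example = DefensiveDetails(
    expiry_date=date(2030, 1, 1),
    resolves=True,
    redirects_to_benign=False,
    http_status=404,
    same_content_as_benign=False,
)
subtype = defensive_subtype(example, today=date(2024, 1, 1))
```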
[041] A decision-making flowchart (refer FIG. 4(b) of (400)) for the suspicious, malicious, and unrelated categories is considered when the organization name and address fields in the WHOIS record of the candidate domain do not match those of the benign domain.
[042] Suspicious: If the candidate domain does not resolve to an IP address, or if it resolves to an IP address but the page hosted by the website is blank or displays an error message, then the candidate domain is categorized as a suspicious domain name.
[043] Malicious: If a valid page of the candidate domain exists, then the data categorization module (122) of the system (100) analyzes the page using an image hashing based technique to determine whether it displays any fraudulent content or not. The malicious category is further subdivided into the following subcategories:
a) Expired: If the expiry date is earlier than the current date, then the domain is classified as expired and it is available for registration.
b) Phishing: The page poses as a reputed brand and deceives users into entering their personal sensitive information such as username, password, or PIN.
c) Social Engineering: The page displays surveys, scams and malicious downloads that trick users into giving away their information.
d) Ad Parking: The page displays advertisement related links provided by commercial parking vendors such as parking crew and park logic. Ad parking pages are considered malicious because they are involved in malware propagation, click fraud, mal-advertising practices, traffic spam, fake antivirus warnings and traffic stealing.
e) Domain for sale: The generated domain is put up for sale on an auction website. Such domains, when visited, redirect users to unwanted pages and hence are considered malicious.
[044] Unrelated: If the page of the candidate domain does not contain any malicious content, then the system calls the page unrelated (seemingly benign and unrelated websites).
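The top-level categorization described in paragraphs [040] to [044] can be sketched as a single function. This Python example is a simplification under stated assumptions: the `CandidateDetails` record and its field names are hypothetical, and the outcome of the image hashing based analysis is passed in as a boolean rather than computed.

```python
from dataclasses import dataclass


@dataclass
class CandidateDetails:
    """Hypothetical summary of the WHOIS, DNS and webpage details of a candidate."""
    whois_matches_benign: bool
    whois_redacted: bool
    redirects_to_benign: bool
    resolves: bool
    page_blank_or_error: bool
    fraudulent_content: bool   # result of the page analysis, supplied by the caller


def categorize(c: CandidateDetails) -> str:
    """Return defensive, suspicious, malicious or unrelated."""
    if c.whois_matches_benign or (c.whois_redacted and c.redirects_to_benign):
        return "defensive"
    if not c.resolves or c.page_blank_or_error:
        return "suspicious"
    if c.fraudulent_content:
        return "malicious"
    return "unrelated"


parked = CandidateDetails(
    whois_matches_benign=False,
    whois_redacted=False,
    redirects_to_benign=False,
    resolves=True,
    page_blank_or_error=False,
    fraudulent_content=True,
)
category = categorize(parked)
```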
[045] Referring to FIG. 5, a flow diagram (500) illustrates ranking of the generated candidate domains using a Probabilistic Context-Free Grammar (PCFG) model. The domain recommendation module (124) of the system (100) is configured to rank and recommend candidate domains for defensive registration. The system (100) maps the collected one or more details of each of the one or more candidate domains with the received one or more benign domains. Further, each of the plurality of mapped candidate domains is ranked using the PCFG model. The system (100) then recommends, via the input/output interface, at least one candidate domain for defensive registration to curb abuse from domain squatters.
[046] It would be noted that the PCFG model is a Context-Free Grammar with probabilities assigned to the rules such that the sum of the probabilities of all rules expanding the same non-terminal is equal to 1. To convert the CFG into a PCFG model, the probability of occurrence of each rule has to be calculated from real observations. These probabilities represent the underlying distribution of terminals and rules in benign domains. The sum of probabilities of all possible domains generated by the PCFG model is also 1.
[047] In another embodiment, for the rule S -> F.T | BW.T | WB.T | B.T, the probability of occurrence of each of the four alternatives is predefined to be 0.25. For rules that start with the non-terminals T, B, W and F and go to terminal symbols, the probability of each individual rule is calculated as the ratio of the number of occurrences of the terminal symbol to the total number of benign domains with a non-blank value for that terminal. The probability of occurrence is given by:

P(variable) = |variable| / |all variables|

where || represents cardinality. The generated PCFG model with assigned probabilities for each rule is used to calculate the probability of a newly generated candidate domain using the multiplication rule:

P(A and B) = P(A) * P(B)

The multiplication rule holds true for two variables A and B if they are independent of each other. Any generated domain is a chain of substitutions in different productions, and each substitution is independent of the others.
[048] It is to be noted that the PCFG models the probabilities associated with each production rule. The domain recommendation module (124) of the system (100) is configured to find the top candidate domains for each bank for defensive registration. The flow diagram (500) of FIG. 5 uses the PCFG model to rank the domains in decreasing order of probability of occurrence, which is based on the benign domains. Herein, for each non-terminal in V, the sum of the probabilities of all its possible production rules is 1. It is to be noted that probabilities are rounded to 3 decimal places.
[049] The productions S -> F.T, S -> B.T, S -> BW.T and S -> WB.T are predefined and not related to the distribution of benign domains. Hence, these four rules are treated as equally likely, each with probability 0.25. For production rules with F and B on the LHS, all terminals on the RHS are also equally likely. Assume there are 38 Indian benign banks in the input dataset. If each fullname appears once for every bank in the input dataset, then the probability of each fullname is 1/38 ~ 0.026. If the brand name B is present for only 24 banks, then the probability associated with each brand name becomes 1/24 ~ 0.042. As the number of words and TLDs is limited, it is more informative to observe their distribution in benign domains.
[050] In another instance, if there are 22 valid words found in the Indian benign banks’ domains and the word online occurs 7 times out of 22, then its probability becomes 7/22 ~ 0.318. Similarly, if the TLD .co.in occurs in 10 banks out of 38, then its probability becomes 10/38 ~ 0.26.
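The relative-frequency estimate used in these examples can be sketched in Python as follows. The function name `rule_probabilities` is an assumption, and the placeholder entries `other-word` and `other-tld` merely stand in for the remaining observations, whose actual values are not given in the text.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, Iterable


def rule_probabilities(observed: Iterable[str]) -> Dict[str, Fraction]:
    """Probability of each terminal = its count / number of non-blank observations."""
    values = [v for v in observed if v]        # ignore blank values
    counts = Counter(values)
    total = len(values)
    return {term: Fraction(n, total) for term, n in counts.items()}


# 22 word observations, 7 of which are "online" (the rest are placeholders)
word_probs = rule_probabilities(["online"] * 7 + ["other-word"] * 15)

# 38 TLD observations, 10 of which are "co.in" (the rest are placeholders)
tld_probs = rule_probabilities(["co.in"] * 10 + ["other-tld"] * 28)

p_online = round(float(word_probs["online"]), 3)
p_co_in = round(float(tld_probs["co.in"]), 3)
```

Exact fractions are used so that the probabilities of the rules for one non-terminal sum to exactly 1; rounding is applied only when the values are reported.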
[051] In another example, consider the candidate domain onlinekotak.co.in. Its derivation is as follows:
1) S -> WB.T [0.25]
2) W -> online [0.318]
3) B -> kotak [0.042]
4) T -> co.in [0.26]
Rules 1 to 4 are applied in order to derive onlinekotak.co.in. Multiplying the respective probabilities of the subcomponents gives the probability score for the whole domain: 0.25 × 0.318 × 0.042 × 0.26 = 0.000868, i.e., approximately 0.00087. This score for the whole domain is used to compare against other candidate domains.
[052] Furthermore, the higher the probability, the more closely the domain follows the observed trend, and the higher the risk of it being registered by malicious actors to trick users into believing it is benign. Hence, it must be defensively registered by the bank. If a less popular word such as ibanking had been used instead of online, the probability score would have dropped drastically (0.00022), making the domain a less likely candidate to suggest for defensive registration.
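The scoring and ranking step can be illustrated with a short Python sketch that multiplies the rule probabilities of the onlinekotak.co.in derivation. The table layout and the function name `derivation_probability` are assumptions; the probabilities are the ones quoted in the example, and the name ibankingkotak.co.in is a hypothetical label for the ibanking variant, whose score of 0.00022 is taken from the text rather than computed.

```python
import math
from typing import Dict, List, Tuple

# Probabilities of the four predefined start rules
START_RULES: Dict[str, float] = {"F.T": 0.25, "BW.T": 0.25, "WB.T": 0.25, "B.T": 0.25}

# Terminal rule probabilities quoted in the example
TERMINALS: Dict[str, Dict[str, float]] = {
    "W": {"online": 0.318},
    "B": {"kotak": 0.042},
    "T": {"co.in": 0.26},
}


def derivation_probability(start_rule: str, choices: List[Tuple[str, str]]) -> float:
    """Multiply the start-rule probability by each chosen terminal rule probability."""
    return START_RULES[start_rule] * math.prod(
        TERMINALS[nonterminal][terminal] for nonterminal, terminal in choices
    )


score = derivation_probability(
    "WB.T", [("W", "online"), ("B", "kotak"), ("T", "co.in")]
)

# Rank candidates in decreasing order of score
scores = {"onlinekotak.co.in": score, "ibankingkotak.co.in": 0.00022}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The multiplication relies on the independence of the individual substitutions, as stated for the multiplication rule.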
[053] FIG. 6 illustrates a processor-implemented method (600) for context-free grammar (CFG) based domain name generation for defensive registration.
[054] Initially, at the step (602), receiving, via an input/output interface, a plurality of benign uniform resource locators (URLs) of a business sector from a user, wherein each of the plurality of benign URLs includes a benign domain.
[055] At the next step (604), parsing, via the one or more hardware processors, the received plurality of benign URLs to collect one or more URL components. The one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL.
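The parsing step can be sketched in Python with the standard library. This is an illustrative approach, not the parser of the system (100): the suffix tuple `KNOWN_TLDS`, the function name `split_hostname` and the example URLs are assumptions, and a production system would more likely consult a maintained public suffix list so that multi-label suffixes such as co.in are recognized.

```python
from typing import Tuple
from urllib.parse import urlparse

# Illustrative suffixes only; the text treats "co.in" as a TLD
KNOWN_TLDS = ("co.in", "com", "in")


def split_hostname(url: str) -> Tuple[str, str]:
    """Return (SLD, TLD) for the hostname of the given URL."""
    host = urlparse(url).hostname or ""
    # Try the longest suffix first so that "co.in" wins over "in"
    for tld in sorted(KNOWN_TLDS, key=len, reverse=True):
        suffix = "." + tld
        if host.endswith(suffix):
            rest = host[: -len(suffix)]
            sld = rest.rsplit(".", 1)[-1]     # label immediately left of the TLD
            return sld, tld
    raise ValueError(f"unrecognized TLD in hostname: {host!r}")


sbi_parts = split_hostname("https://www.onlinesbi.com/login")
iob_parts = split_hostname("https://www.iobnet.co.in")
```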
[056] At the next step (606), analyzing, via the one or more hardware processors, the collected one or more URL components to identify at least one inconsistency present in at least one of the plurality of benign URLs. The at least one inconsistency includes differences in the usage and order of words in SLDs, and in TLDs, in each of the plurality of received benign URLs.
[057] At the next step (608), learning, via the one or more hardware processors, a context free grammar of the identified at least one inconsistency to generate one or more candidate domains.
[058] At the next step (610), collecting, via the one or more hardware processors, one or more details of each of the generated one or more candidate domains by crawling WHOIS information, DNS information, and webpage information.
[059] At the next step (612), mapping, via the one or more hardware processors, the one or more details of generated candidate domains with the received plurality of benign URLs.
[060] At the next step (614), ranking, via the one or more hardware processors, the mapped one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model.
[061] At the last step (616), recommending, via the input/output interface, one or more candidate domains for defensive registration.
[062] In yet another embodiment, the method (600) comprises categorizing, via the one or more hardware processors, the one or more generated candidate domains into defensive, suspicious, malicious, and unrelated categories based on the collected one or more details of each of the one or more candidate domains and a predefined decision-making flowchart for each category.
[063] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[064] The embodiments of the present disclosure herein address the unresolved problem of domain squatting abuse and defensive registration of domain names. Therefore, embodiments herein provide a system and method for context-free grammar (CFG) based domain name generation for defensive registration. Herein, the system is configured to model inconsistencies in domain names using the CFG and to use it to generate one or more candidate domains. A decision-making flowchart is used to classify the generated one or more candidate domains into defensive, malicious, suspicious, and unrelated categories. It is to be noted that three new forms of domain squatting, namely comboTLDsquatting, fullname squatting, and brandname squatting, are also identified.
[065] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[066] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[067] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[068] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

We Claim:
1. A processor-implemented method (600) comprising:
receiving (602), via an input/output interface, a plurality of benign uniform resource locators (URLs) of a business sector from a user, wherein each of the plurality of benign URLs include a benign domain;
parsing (604), via the one or more hardware processors, the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL;
analyzing (606), via the one or more hardware processors, the collected one or more URL components to identify at least one inconsistency present in at least one of the pluralities of benign URLs, wherein the at least one inconsistency includes difference in usage of words and order of words in SLDs, and TLDs in each of the plurality of received benign URLs;
learning (608), via the one or more hardware processors, a Context Free Grammar (CFG) of the identified at least one inconsistency to generate one or more candidate domains;
collecting (610), via the one or more hardware processors, one or more details of each of the generated one or more candidate domains by crawling WHOIS information, Domain Name System (DNS) information, and webpage information;
mapping (612), via the one or more hardware processors, the one or more details of generated each of the one or more candidate domains with the received plurality of benign URLs;
ranking (614), via the one or more hardware processors, the mapped one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model; and
recommending (616), via the input/output interface, one or more candidate domains for defensive registration.
2. The processor-implemented method (600) of claim 1, wherein one or more generated candidate domains are instances of combosquatting, comboTLDsquatting, brandname squatting, TLDsquatting and fullname squatting.
3. The processor-implemented method (600) of claim 1, wherein the one or more generated candidate domains are similar looking to the plurality of benign domains.
4. The processor-implemented method (600) of claim 1, further comprising:
categorizing, via the one or more hardware processors, the one or more generated candidate domains in defensive, suspicious, malicious, and unrelated category based on the collected one or more details of each of the one or more candidate domains and a predefined decision-making flowchart for each category.
5. The processor-implemented method (600) of claim 4, further comprising:
providing, via the one or more hardware processors, the predefined decision-making flowchart for defensive, malicious, suspicious, and unrelated category.
6. A system (100) comprising:
an input/output interface (104) to receive a plurality of benign uniform resource locators (URLs) of a business sector from a user, wherein each of the plurality of benign URLs include a benign domain;

one or more hardware processors (108);
a memory in communication with the one or more hardware processors (108), wherein the one or more hardware processors are configured to execute programmed instructions stored in the memory, to:
parse the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL;
analyze the collected one or more URL components to identify at least one inconsistency present in at least one of the pluralities of benign URLs, wherein the at least one inconsistency includes difference in usage of words and order of words in SLDs, and TLDs in each of the plurality of received benign URLs;
learn a Context Free Grammar (CFG) of the identified at least one inconsistency to generate one or more candidate domains;
collect one or more details of each of the generated one or more candidate domains by crawling WHOIS information, Domain Name System (DNS) information, and webpage information;
map the one or more details of generated each of the one or more candidate domains with the received plurality of benign URLs;
rank the mapped one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model; and
recommend, via the input/output interface, one or more candidate domains for defensive registration.
7. A non-transitory computer readable medium storing one or more instructions which when executed by one or more processors on a system, cause the one or more processors to perform method comprising:
receiving (602), via an input/output interface, a plurality of benign uniform resource locators (URLs) of a business sector from a user, wherein each of the plurality of benign URLs include a benign domain;
parsing (604), via the one or more hardware processors, the received plurality of benign URLs to collect one or more URL components, wherein the one or more URL components include a hostname comprising a second-level domain (SLD) and a top-level domain (TLD) of the URL;
analyzing (606), via the one or more hardware processors, the collected one or more URL components to identify at least one inconsistency present in at least one of the pluralities of benign URLs, wherein the at least one inconsistency includes difference in usage of words and order of words in SLDs, and TLDs in each of the plurality of received benign URLs;
learning (608), via the one or more hardware processors, a Context Free Grammar (CFG) of the identified at least one inconsistency to generate one or more candidate domains;
collecting (610), via the one or more hardware processors, one or more details of each of the generated one or more candidate domains by crawling WHOIS information, Domain Name System (DNS) information, and webpage information;
mapping (612), via the one or more hardware processors, the one or more details of generated each of the one or more candidate domains with the received plurality of benign URLs;
ranking (614), via the one or more hardware processors, the mapped one or more candidate domains using a Probabilistic Context-Free Grammar (PCFG) model; and
recommending (616), via the input/output interface, one or more candidate domains for defensive registration.

Documents

Application Documents

#	Name	Date
1	202121018054-STATEMENT OF UNDERTAKING (FORM 3) [19-04-2021(online)].pdf	2021-04-19
2	202121018054-REQUEST FOR EXAMINATION (FORM-18) [19-04-2021(online)].pdf	2021-04-19
3	202121018054-PROOF OF RIGHT [19-04-2021(online)].pdf	2021-04-19
4	202121018054-FORM 18 [19-04-2021(online)].pdf	2021-04-19
5	202121018054-FORM 1 [19-04-2021(online)].pdf	2021-04-19
6	202121018054-FIGURE OF ABSTRACT [19-04-2021(online)].jpg	2021-04-19
7	202121018054-DRAWINGS [19-04-2021(online)].pdf	2021-04-19
8	202121018054-DECLARATION OF INVENTORSHIP (FORM 5) [19-04-2021(online)].pdf	2021-04-19
9	202121018054-COMPLETE SPECIFICATION [19-04-2021(online)].pdf	2021-04-19
10	Abstract1.jpg	2021-10-19
11	202121018054-FORM-26 [22-10-2021(online)].pdf	2021-10-22
12	202121018054-FER.pdf	2022-12-06
13	202121018054-FER_SER_REPLY [10-04-2023(online)].pdf	2023-04-10
14	202121018054-CLAIMS [10-04-2023(online)].pdf	2023-04-10
15	202121018054-PatentCertificate13-02-2025.pdf	2025-02-13
16	202121018054-IntimationOfGrant13-02-2025.pdf	2025-02-13

Search Strategy

1	SearchStrategyE_06-12-2022.pdf