Claims:
1. A processor implemented method (200), the method comprising:
receiving, by one or more hardware processors, an input text to be matched against a plurality of patterns compiled in an optimized Deterministic Finite Automaton (DFA), wherein the input text comprises a plurality of input symbols, (202), and wherein the optimized DFA is constructed by:
sorting the plurality of patterns based on a lexicographic order to obtain a plurality of sorted patterns, wherein each of the plurality of patterns comprises the plurality of symbols;
constructing a Lex Trie for the plurality of sorted patterns, wherein the Lex Trie is a tree data structure comprising a plurality of nodes and a plurality of directed edges, wherein the plurality of nodes comprises at least one parent node and a plurality of child nodes, and wherein the plurality of child nodes corresponding to a parent node is numbered consecutively;
constructing a DFA based on the Lex Trie, wherein each of the plurality of nodes associated with the Lex Trie are retained in the DFA by including an output value and a failure pointer, and wherein the failure pointer is utilized when there is an invalid transition;
constructing the optimized DFA based on the DFA, by:
computing an out degree corresponding to each of the plurality of nodes of the DFA;
segmenting each of the plurality of nodes into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree;
assigning a transition value for each of the plurality of nodes with one child based on a node number associated with a corresponding child node;
allocating a bit map array for each of the plurality of nodes with more than one, wherein the bit map array comprises a plurality of bit positions corresponding to each of the plurality of symbols, and wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition;
obtaining a smallest next value for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node;
matching, by the one or more hardware processors, each of the plurality of input symbols with the plurality of symbols compiled in the optimized DFA by traversing the optimized DFA until reaching end of the input text, (204) by:
initializing a current node of the optimized DFA, wherein the current node is a start node initially;
obtaining the out degree of the current node, wherein a failure transition is followed when the out degree of the current node is zero;
computing a match value based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node, wherein the current input symbol is a first input symbol from the plurality of input symbols;
obtaining the transition value associated with the current node when the out degree is one;
computing the transition value associated with the current node based on the corresponding smallest next value when the out degree is greater than one, by:
obtaining the bit position corresponding to the current input symbol from the bit map array associated with the current node;
computing an offset by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node;
computing the transition value by adding the offset with the smallest next value associated with the current node;
updating the current node by swapping the current node with the transition value;
updating the current input symbol by swapping the current input symbol with a next symbol of the current input symbol; and
displaying, by the one or more hardware processors, a plurality of matched patterns associated with the input text (206).
2. The method as claimed in claim 1, wherein each of the plurality of nodes is associated with the node number, and wherein the node with a least node number is the start node.
3. The method as claimed in claim 1, wherein each directed edge from the plurality of edges is associated with a transition between a pair of nodes from the plurality of nodes, and wherein the pair of nodes comprises a parent node and a child node.
4. The method as claimed in claim 1, wherein each of the plurality of nodes with one child is associated with the out degree one, wherein the plurality of end nodes is associated with the out degree zero, and wherein each of the plurality of nodes with more than one child is associated with the out degree greater than one.
5. The method as claimed in claim 1, wherein the out degree corresponding to a node is a number of outgoing edges from the node.
6. The method as claimed in claim 1, wherein if (u, v) is a directed edge from a node u to node v from the plurality of nodes, then node u is a parent node and node v is a child node.
7. The method as claimed in claim 1, wherein a failure transition is followed when the match value is zero based on the failure pointer associated with the current node and, and wherein a corresponding matched symbol is displayed when the match value is one.
8. A system (100) comprising:
at least one memory (104) storing programmed instructions;
one or more Input /Output (I/O) interfaces (112); and
one or more hardware processors (102) operatively coupled to the at least one memory (104), wherein the one or more hardware processors (102) are configured by the programmed instructions to:
receive an input text to be matched against a plurality of patterns compiled in an optimized Deterministic Finite Automaton (DFA), wherein the input text comprises a plurality of input symbols, and wherein the optimized DFA is constructed by:
sorting the plurality of patterns based on a lexicographic order to obtain a plurality of sorted patterns, wherein each of the plurality of patterns comprises the plurality of symbols;
constructing a Lex Trie for the plurality of sorted patterns, wherein the Lex Trie is a tree data structure comprising a plurality of nodes and a plurality of directed edges, wherein the plurality of nodes comprises at least one parent node and a plurality of child nodes, and wherein the plurality of child nodes corresponding to a parent node is numbered consecutively;
constructing a DFA based on the Lex Trie, wherein each of the plurality of nodes associated with the Lex Trie are retained in the DFA by including an output value and a failure pointer, and wherein the failure pointer is utilized when there is an invalid transition;
constructing the optimized DFA based on the DFA, by:
computing an out degree corresponding to each of the plurality of nodes of the DFA;
segmenting each of the plurality of nodes into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree;
assigning a transition value for each of the plurality of nodes with one child based on a node number associated with a corresponding child node;
allocating a bit map array for each of the plurality of nodes with more than one, wherein the bit map array comprises a plurality of bit positions corresponding to each of the plurality of symbols, and wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition;
obtaining a smallest next value for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node; and
match each of the plurality of input symbols with the plurality of symbols compiled in the optimized DFA by traversing the optimized DFA until reaching end of the input text, by:
initializing a current node of the optimized DFA, wherein the current node is a start node initially;
obtaining the out degree of the current node, wherein a failure transition is followed when the out degree of the current node is zero;
computing a match value based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node, wherein the current input symbol is a first input symbol from the plurality of input symbols;
obtaining the transition value associated with the current node when the out degree is one;
computing the transition value associated with the current node based on the corresponding smallest next value when the out degree is greater than one, by:
obtaining the bit position corresponding to the current input symbol from the bit map array associated with the current node;
computing an offset by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node;
computing the transition value by adding the offset with the smallest next value associated with the current node;
updating the current node by swapping the current node with the transition value;
updating the current input symbol by swapping the current input symbol with a next symbol of the current input symbol; and
display a plurality of matched patterns associated with the input text.
9. The system as claimed in claim 8, wherein each of the plurality of nodes is associated with the node number, and wherein the node with a least node number is the start node.
10. The system as claimed in claim 8, wherein each directed edge from the plurality of edges is associated with a transition between a pair of nodes from the plurality of nodes, and wherein the pair of nodes comprises a parent node and a child node.
11. The system as claimed in claim 8, wherein each of the plurality of nodes with one child is associated with the out degree one, wherein the plurality of end nodes is associated with the out degree zero, and wherein each of the plurality of nodes with more than one child is associated with the out degree greater than one.
12. The system as claimed in claim 8, wherein the out degree corresponding to a node is a number of outgoing edges from the node.
13. The system as claimed in claim 8, wherein if (u, v) is a directed edge from a node u to node v from the plurality of nodes, then node u is a parent node and node v is a child node.
14. The system as claimed in claim 8, wherein a failure transition is followed when the match value is zero based on the failure pointer associated with the current node and, and wherein a corresponding matched symbol is displayed when the match value is one.
, Description:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR LEX TRIE BASED DETERMINISTIC FINITE AUTOMATON
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of pattern matching and, more particular, to a method and system for Lex Trie based Deterministic Finite Automaton.
BACKGROUND
Pattern matching is a method of checking whether a specific sequence of symbols or a pattern exist in an input text. In pattern matching, every symbol or character of an input text is compared against the stored patterns. The patterns can be stored using many data structures including array, trees and the like. However, array data structure consumes more time for search operation which is a vital operation in pattern matching and hence tree data structure is preferred over array data structure. Deterministic Finite Automaton (DFA) is an application of tree data structure which is widely used for storing patterns. However, DFA consumes more memory.
Conventional methods mainly use Aho-Corasick (AC) algorithm based DFA for storing the patterns which consumes more memory. Some other convention methods use optimized AC algorithms including bit map AC, parallel AC algorithm, merge Finite Node Machine (FSM) and the like. However, the memory consumption is more even in optimized AC algorithms which increases the space complexity. Hence constructing a space efficient DFA and matching the patterns in an efficient manner is challenging.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for Lex Trie based Deterministic Finite Automaton. The method includes receives, by one or more hardware processors, an input text to be matched against a plurality of patterns compiled in a Deterministic Finite Automaton (DFA), wherein the input text comprises a plurality of input symbols, wherein the DFA is constructed by: (i) sorting the plurality of patterns based on a lexicographic order to obtain a plurality of sorted patterns, wherein each of the plurality of patterns comprises the plurality of symbols (ii) constructing a Lex Trie for the plurality of sorted patterns, wherein the Lex Trie is a tree data structure comprising a plurality of nodes and a plurality of directed edges and, wherein a plurality of child nodes corresponding to a parent node is numbered consecutively (iii) constructing a DFA based on the Lex Trie, wherein each of the plurality of nodes of the DFA is associated with an output value and a failure pointer, wherein the failure pointer is utilized when there is an invalid transition (iv) constructing the optimized DFA based on the DFA, by: (a) computing an out degree corresponding to each of the plurality of nodes (b) segmenting each of the plurality of nodes into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree (c) assigning a transition value for each of the plurality of nodes with one child based on a node number associated with a corresponding child node (d) allocating a bit map array for each of the plurality of nodes with more than one child , wherein the bit map array comprises a plurality of bit positions corresponding to each of the plurality of symbols, wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition (e) obtaining a smallest next value for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node. The method further includes matching, by one or more hardware processors, each of the plurality of input symbols by traversing the DFA until reaching end of the input text, by: (i) initializing a current node of the DFA, wherein the current node is a start node initially (ii) obtaining the out degree of the current node, wherein a failure transition is followed if the out degree is zero (iii) computing a match value based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node, wherein the current input symbol is the first input symbol from the plurality of input symbols (iv) obtaining the transition value associated with the current node when the out degree is one (v) computing the transition value associated with the current node based on the corresponding smallest next value when the out degree is greater than one, by: (a) obtaining the bit position corresponding to the current input symbol from the bit map array associated with the current node (b) computing an offset by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node (c) computing the transition value by adding the offset with the smallest next value associated with the current node (vi) updating the current node by swapping the current node with the transition value and (vii) updating the current input symbol by swapping the current input symbol with a next symbol of the current input symbol. Finally, the method further includes displaying by one or more hardware processors, a plurality of matched patterns associated with the input text.
In another aspect, a system for Lex Trie based Deterministic Finite Automaton is provided. The system includes at least one memory storing programmed instructions, one or more Input /Output (I/O) interfaces, and one or more hardware processors operatively coupled to the at least one memory, wherein the one or more hardware processors are configured by the programmed instructions to receive an input text to be matched against a plurality of patterns compiled in a Deterministic Finite Automaton (DFA), wherein the input text comprises a plurality of input symbols, wherein the DFA is constructed by: (i) sorting the plurality of patterns based on a lexicographic order to obtain a plurality of sorted patterns, wherein each of the plurality of patterns comprises the plurality of symbols (ii) constructing a Lex Trie for the plurality of sorted patterns, wherein the Lex Trie is a tree data structure comprising a plurality of nodes and a plurality of directed edges and, wherein a plurality of child nodes corresponding to a parent node is numbered consecutively (iii) constructing a DFA based on the Lex Trie, wherein each of the plurality of nodes of the DFA is associated with an output value and a failure pointer, wherein the failure pointer is utilized when there is an invalid transition (iv) constructing the optimized DFA based on the DFA, by: (a) computing an out degree corresponding to each of the plurality of nodes (b) segmenting each of the plurality of nodes into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree (c) assigning a transition value for each of the plurality of nodes with one child based on a node number associated with a corresponding child node (d) allocating a bit map array for each of the plurality of nodes with more than one child , wherein the bit map array comprises a plurality of bit positions corresponding to each of the plurality of symbols, wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition (e) obtaining a smallest next value for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node. Further, the one or more hardware processors are configured by the programmed instructions to match each of the plurality of input symbols by traversing the DFA until reaching end of the input text, by: (i) initializing a current node of the DFA, wherein the current node is a start node initially (ii) obtaining the out degree of the current node, wherein a failure transition is followed if the out degree is zero (iii) computing a match value based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node, wherein the current input symbol is the first input symbol from the plurality of input symbols (iv) obtaining the transition value associated with the current node when the out degree is one (v) computing the transition value associated with the current node based on the corresponding smallest next value when the out degree is greater than one, by: (a) obtaining the bit position corresponding to the current input symbol from the bit map array associated with the current node (b) computing an offset by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node (c) computing the transition value by adding the offset with the smallest next value associated with the current node (vi) updating the current node by swapping the current node with the transition value and (vii) updating the current input symbol by swapping the current input symbol with a next symbol of the current input symbol. Finally, the one or more hardware processors are configured by the programmed instructions display a plurality of matched patterns associated with the input text.
In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer program for method and system for Lex Trie based Deterministic Finite Automaton is provided. The computer readable program, when executed on a computing device, causes the computing device to receive an input text to be matched against a plurality of patterns compiled in a Deterministic Finite Automaton (DFA), wherein the input text comprises a plurality of input symbols, wherein the DFA is constructed by: (i) sorting the plurality of patterns based on a lexicographic order to obtain a plurality of sorted patterns, wherein each of the plurality of patterns comprises the plurality of symbols (ii) constructing a Lex Trie for the plurality of sorted patterns, wherein the Lex Trie is a tree data structure comprising a plurality of nodes and a plurality of directed edges and, wherein a plurality of child nodes corresponding to a parent node is numbered consecutively (iii) constructing a DFA based on the Lex Trie, wherein each of the plurality of nodes of the DFA is associated with an output value and a failure pointer, wherein the failure pointer is utilized when there is an invalid transition (iv) constructing the optimized DFA based on the DFA, by: (a) computing an out degree corresponding to each of the plurality of nodes (b) segmenting each of the plurality of nodes into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree (c) assigning a transition value for each of the plurality of nodes with one child based on a node number associated with a corresponding child node (d) allocating a bit map array for each of the plurality of nodes with more than one child , wherein the bit map array comprises a plurality of bit positions corresponding to each of the plurality of symbols, wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition (e) obtaining a smallest next value for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node. Further, the computer readable program, when executed on a computing device, causes the computing device to match each of the plurality of input symbols by traversing the DFA until reaching end of the input text, by: (i) initializing a current node of the DFA, wherein the current node is a start node initially (ii) obtaining the out degree of the current node, wherein a failure transition is followed if the out degree is zero (iii) computing a match value based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node, wherein the current input symbol is the first input symbol from the plurality of input symbols (iv) obtaining the transition value associated with the current node when the out degree is one (v) computing the transition value associated with the current node based on the corresponding smallest next value when the out degree is greater than one, by: (a) obtaining the bit position corresponding to the current input symbol from the bit map array associated with the current node (b) computing an offset by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node (c) computing the transition value by adding the offset with the smallest next value associated with the current node (vi) updating the current node by swapping the current node with the transition value and (vii) updating the current input symbol by swapping the current input symbol with a next symbol of the current input symbol. Finally, the computer readable program, when executed on a computing device, causes the computing device to display a plurality of matched patterns associated with the input text.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 is a functional block diagram of a system for Lex Trie based Deterministic Finite Automaton (DFA), according to some embodiments of the present disclosure.
FIGS. 2A and 2B are exemplary flow diagrams for a method for Lex Trie based DFA implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIGS. 3A and 3B are exemplary Lex Tries for the method for Lex Trie based DFA implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIG. 3C is an exemplary DFA with failure pointers for the method for Lex Trie based DFA implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIG. 4A is an exemplary flow diagram for constructing an optimized DFA, in accordance with some embodiments of the present disclosure.
FIG. 4B is a portion of the optimized DFA with bit map array constructed using the method for Lex Trie based DFA as depicted in FIGS. 2A and 2B, in accordance with some embodiments of the present disclosure.
FIGS. 5A and 5B are exemplary flow diagrams for pattern matching using the optimized DFA, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
Embodiments herein provide a method and system for Lex Trie based Deterministic Finite Automaton (DFA). The system for Lex Trie based DFA constructs an optimized DFA with minimum memory requirement and performs pattern matching in an efficient manner using the optimized DFA. Initially, a plurality of patterns are sorted in a lexicographic order and compiled into a Lex Trie. A significant feature associated with the Lex Trie is that all child nodes of a parent node are numbered consecutively. The consecutive numbering is utilized to reduce the memory requirement associated with each node. Further, the Lex Trie is converted into a DFA by including failure pointers for each node and marking a plurality of end nodes. Further, the DFA is optimized based on an out degree associated with each node to obtain an optimized DFA. Here, optimization includes creating a bit map array only for nodes with the out degree greater than 1, which reduces memory requirement. Further, an input text is matched against the plurality of patterns stored in the optimized DFA by traversing the optimized DFA based on a transition value. The transition value can be a next child if the out degree of a node is one. The transition value is computed based on the values associated with the bit map array and a smallest next value associated with each node of the optimized DFA if the out degree of the node is greater than one.. Finally, a plurality of matched patterns associated with the input text are displayed.
Referring now to the drawings, and more particularly to FIGS. 1 through 5B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 is a functional block diagram of a system 100 for Lex Trie based DFA, according to some embodiments of the present disclosure. The system 100 includes or is otherwise in communication with hardware processors 102, at least one memory such as a memory 104, an I/O interface 112. The hardware processors 102, memory 104, and the Input /Output (I/O) interface 112 may be coupled by a system bus such as a system bus 108 or a similar mechanism. In an embodiment, the hardware processors 102 can be one or more hardware processors.
The I/O interface 112 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a printer and the like. Further, the interface 112 may enable the system 100 to communicate with other devices, such as web servers and external databases.
The I/O interface 112 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the I/O interface 112 may include one or more ports for connecting a number of computing systems with one another or to another server computer. The I/O interface 112 may include one or more ports for connecting a number of devices to one another or to another server.
The one or more hardware processors 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, node machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 102 is configured to fetch and execute computer-readable instructions stored in the memory 104.
The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 106 and a pattern analysis unit 114. The memory 104 also includes a data repository (or repository) 110 for storing data processed, received, and generated by the plurality of modules 106 and the pattern analysis unit 114.
The plurality of modules 106 include programs or coded instructions that supplement applications or functions performed by the system 100 for Lex Trie based DFA. The plurality of modules 106, amongst other things, can include routines, programs, objects, components, and data structures, which performs particular tasks or implement particular abstract data types. The plurality of modules 106 may also be used as, signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 106 can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 102, or by a combination thereof. The plurality of modules 106 can include various sub-modules (not shown). The plurality of modules 106 may include computer-readable instructions that supplement applications or functions performed by the system 100 for Lex Trie based DFA.
The data repository (or repository) 110 may include a plurality of abstracted piece of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 106 and the modules associated with the pattern analysis unit 114. The data repository may also include the plurality of patterns used for constructing the Lex Trie associated with the method for Lex Trie based DFA.
Although the data repository 110 is shown internal to the system 100, it will be noted that, in alternate embodiments, the data repository 110 can also be implemented external to the system 100, where the data repository 110 may be stored within a database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1) and/or existing data may be modified and/or non-useful data may be deleted from the database (not shown in FIG. 1). In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS).
FIGS. 2A and 2B are exemplary flow diagrams for a processor implemented method for Lex Trie based DFA implemented by the system of FIG. 1, according to some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 104 operatively coupled to the one or more hardware processor(s) 102 and is configured to store instructions for execution of steps of the method 200 by the one or more hardware processors 102. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of flow diagram as depicted in FIG. 2A and FIG. 2B. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof.
At step 202 of the method 200, the one or more hardware processors 102 receive an input text to be matched against the plurality of patterns compiled in an optimized DFA, wherein the input text comprises a plurality of input symbols.
Initially, the plurality of patterns are sorted based on a lexicographic order to obtain a plurality of sorted patterns. Each of the plurality of patterns includes the plurality of symbols. Further, the Lex Trie is constructed using the plurality of sorted patterns. The Lex Trie is a tree data structure including a plurality of nodes and a plurality of directed edges. The plurality of nodes includes at least one parent node and a plurality of child nodes. Each of the plurality of nodes is associated with a node number, wherein the node with a least node number is the start node. Each directed edge from the plurality of edges is associated with a transition between a pair of nodes from the plurality of nodes. The pair of nodes includes a parent node and a child node. For example, if (u, v) is a directed edge from a node u to node v from the plurality of nodes, then node u is a parent node and node v is a child node. The plurality of child nodes of a parent node are numbered consecutively.
Further, the DFA is constructed based on the Lex Trie. Here, each of the plurality of nodes of the Lex Trie are retained by including an output value and a failure pointer. The failure pointer is utilized when there is an invalid transition. Further, the optimized DFA is constructed based on the DFA.
The method of constructing the optimized DFA is explained as follows:
Initially, an out degree corresponding to each of the plurality of nodes of the DFA is computed and each of the plurality of nodes are segmented into a plurality of end nodes, a plurality of nodes with one child and a plurality of nodes with more than one child based on the corresponding out degree. Each of the plurality of nodes with one child is associated with the out degree one. The plurality of end nodes is associated with the out degree zero and each of the plurality of nodes with more than one child is associated with the out degree greater than one. Further, a transition value for each of the plurality of nodes with one child is assigned based on the node number associated with a corresponding child node. Further, a bit map array is allocated for each of the plurality of nodes with more than one child. The bit map array includes a plurality of bit positions (refer Table 2) corresponding to each of the plurality of symbols, wherein a bit set to one corresponds to a valid transition and a bit set to zero corresponds to an invalid transition. A smallest next value is obtained for each of the plurality of nodes with more than one child based on the node number associated with a corresponding smallest child node.
At step 204 of the method 200, the one or more hardware processors 102 match each of the plurality of input symbols with the plurality of symbols or patterns compiled in the optimized DFA by traversing the optimized DFA until reaching end of the input text, i.e., until processing all symbols of the input text. The method of matching the input pattern using the optimized DFA is explained as follows: Initially, the start node is assigned as a current node and updated during each iteration. The out degree of the current node is obtained. A failure transition is followed when the out degree of the current node is zero. Further, a match value is computed based on a comparison between a current input symbol from the plurality of input symbols and a plurality of transition symbols associated with the current node. A transition symbol is a symbol associated with a transition from a parent node to a child node. The current input symbol is the first input symbol (left most symbol) from the plurality of input symbols. . The transition value associated with the current node is also obtained when the out degree is one. The transition value associated with the current node is computed based on the corresponding smallest next value when the out degree is greater than one.
The method of computing the transition value when the out degree is greater than one is explained as follows. The bit position corresponding to the current input symbol from the bit map array (shown in Table 2) associated with the current node is obtained. An offset is computed by counting a number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node. The transition value is computed by adding the offset with the smallest next value associated with the current node. Further, the current node is updated by swapping the current node with the transition value and the current input symbol is updated by swapping the current input symbol with a next symbol of the current input symbol.
At step 206 of the method 200, the one or more hardware processors 102 display a plurality of matched patterns associated with the input text when the end of input text is reached. The above steps 202 till 206 can be better understood by way of following example. It is to be understood by a person having ordinary skill in the art and/or person skilled in the art that such example shall not be construed as limiting the scope of the present disclosure.
The pattern analysis unit 114, when executed by the one or more processors of the system 100, receives the input text to be matched against the plurality of patterns compiled in the optimized DFA, wherein the input text comprises the plurality of input symbols (also referred as ‘letters’ and may be interchangeably used herein). For example, an input text can be “landing in india” and the corresponding pattern to be matched is “india”. The words “pattern” and “keyword” are interchangeable used throughout the document.
The optimized DFA is an Aho-Corasick (AC) based pattern matching algorithm which examines every character in the text O(1)times and detects patterns (if any) simultaneously in O(l.?_(k=1)^r¦|P_k | ) time, where r is the number of patterns found in the input text of length l and |P_k |is the length of the kth matched pattern. The optimized DFA is a fast-multi-pattern matching algorithm that searches all keywords in the input text simultaneously. It works by constructing a pattern matching machine from a finite set of keywords Y = {y1,y2,…….,ym} to be matched, wherein each keyword or pattern y is composed of the plurality of symbols. The plurality of symbols includes alpha numeric symbols and other special symbols.
The construction of the optimized DFA is explained as follows. Initially, the plurality of patterns is sorted based on the lexicographic order to obtain the plurality of sorted patterns. Each of the plurality of patterns includes the plurality of symbols. Further, the Lex Trie is constructed using the plurality of sorted patterns Y' . The Lex Trie is the tree data structure including the plurality of nodes and the plurality of directed edges. The plurality of nodes comprises at least one parent node and a plurality of child nodes.. Each of the plurality of nodes is associated with the node number, wherein the node with the least node number is the start node. Each directed edge from the plurality of edges is associated with the transition between the pair of nodes from the plurality of nodes. The pair of nodes includes the parent node and the child node. For example, if (u,v) is a directed edge from a node u to node v from the plurality of nodes, then node u is a parent node and node v is a child node . The plurality of child nodes of a parent node is numbered consecutively.
FIGS. 3A and 3B are exemplary Lex Tries for the method for Lex Trie based DFA implemented by the system 100 of FIG. 1, in accordance with some embodiments of the present disclosure. Now referring to FIG. 3A, the plurality of lexicographically sorted patterns Y= {america, albania, india, indonesia} are compiled in the Lex Trie. Here, node 0 is the start node, and the node 0, node 1 and node 8 are having two children and are numbered consecutively. Now referring to FIG. 3B, the compiled patterns are Y = {anazom.com, anazom.fr, am.com, dbaz.com, qayqal.com, qaytm.com}.
In an embodiment, the construction of the Lex Trie is explained as follows: Initially, the Lex Trie includes a single node which is the start node 0. Further, the plurality of patterns are sorted in the lexicographic order. Subsequently, each keyword y is entered into the Lex Trie, by adding a directed path that begins at the start node (node 0). The node at which the path terminates is associated with keyword y as its output. New nodes and edges are added to the Lex Trie so that starting from the start node there is a unique path corresponding to each keyword in Y. This step results in two data-structures, goto function that maps a node s and an input symbol a ? S to either another node s’ or the message fail, and output function that maps a node s to a set of outputs {o_1,o_2……..o_t}.
For example, consider the set of sorted patterns Y = {albania, america india, indonesia}. Now referring to FIG 3A, initially, the first pattern “albania” is included in the Lex Trie. Initially, the transition starts from node 0. The first symbol ‘a’ is obtained and included in the Lex Trie by creating a new node 1 and an edge is formed from node 0 to node 1 for the symbol ‘a’. Further, the symbol ‘l’ is obtained and a new node 2 is created and an edge is formed from node 2 to node 3. Similarly the symbols ‘b’, ‘a’, ‘n’, ‘i’, and ‘a’ are inserted into the Lex Trie. Further, the pattern “america” is inserted into the Lex Trie as follows: The first symbol ‘a’ is obtained and the Lex Trie is traversed from node 0 to check whether the symbol ‘a’ exists. Here, a transition exists from node 0 to node 1 for the symbol ‘a’. Hence no new node is created and node 1 is reached. The next symbol ‘m’ is taken and checked whether any transition exists from node 1 with the symbol ‘m’. Here, no such transition exists. Hence a new node 4 is created. Further an edge is created for ‘m’ from the node 1 to node 4. Similarly, all the remaining symbols associated with the pattern “america’are compiled into the Lex Trie. Further, the patterns “india” and “indonesia” are also inserted into the Lex Trie in the similar manner. In Lex Trie, all the child nodes of the parent nodes are numbered consecutively.
In an embodiment, since the child (next) nodes of each parent node in the Lex Trie are numbered consecutively in the order in which the corresponding input symbols appear in the bitmap, then instead of storing all child (next) nodes, it is needed to store only the child (next) node with the smallest node number. As a result, the memory required for the bitmap node is greatly reduced. More formally, if there are k valid transitions from a node s on input symbolsa_1,a_2……a_k as indicated by set bitsb_a1,b_a2…..b_ak in the bitmap, wherea_i
= l then
10: insert y_ii.substr(0, l) into trie ?L ?_trie /*prefix yi[0, l-1]*/
11: end if
12: end for
13: end for
14: return Ltrie
15: end procedure
16:
17: procedure CreateOptimizedStates
18: Input: Unoptimized Aho-Corasick Machine Aunopt consisting of goto
(lex trie), fail and output functions
19: Output: Optimized Aho-Corasick Machine Aopt consisting of bitmap-lex
and lex states
20: S = number of states in Aunopt
21: states = new Array(S)
22: bitmap_table = {}
23: output_map = {}
24: for s ? A_unoptt do
25: states[s] = {}
26: states[s]:fail = Aunopt.fail[s]
27: if state s has more than one transitions then
28: states[s].symbol = #
29: states[s].smallest_next = smallestNextState(s)
30: bitmap_table[s] = createBitmap(Aunopt.goto(s))
31: else if state s has one transition then
32: states[s].symbol = a € S on which transition exists
33: states[s].smallest_next = smallestNextState(s)
34: else /* state is a leaf node */
35: states[s].symbol = $
36: states[s].smallest_next = null
37: end if
38: if state s is an output state then
39: output_map[s] = 0
40: for o ? A_unoptt.output(s) do
41: output_map[s] = max(o.length, output_map[s])
42: end for
43: end if
44: end for
45: end procedure
50: ?L ?_trie = CreateLexTrie(D )
51: A_unopt = Compute the failure pointer each state in Ltrie
52: CreateOptimizedStates(A_unopt)
53: end procedure
Further, the pattern analysis unit114, when executed by one or more processors of the system 100, matches each of the plurality of input symbols with the plurality of symbols compiled in the optimized DFA by traversing the optimized DFA until end of the input text is reached, i.e., until all symbols of the input text are processed. The method of matching the input pattern using the optimized DFA is explained using FIGS. 5A and 5B.
FIGS. 5A and 5B are exemplary flow diagrams for pattern matching using the optimized DFA, in accordance with some embodiments of the present disclosure. Now, referring to FIG 5A and 5B, initially, the current node is initialized at step 502. The current node is the start node initially and updated during each iteration. Further, the match value is computed based on the comparison between the current input symbol from the plurality of input symbols and the plurality of transition symbols associated with the current node at step 504. The current input symbol is the first input symbol from the plurality of input symbols. The match value is checked at step 506. A corresponding failure transition is followed when the match value is zero at step 508. The out degree of the current node is obtained when the match value is one at step 510. The out degree of the current node is checked at step 511 and if the out degree is 1, step 512 is taken. If the out degree is zero, step 514 is taken and if the out degree is greater than 1, step 516 is taken. The transition value associated with the current node is also obtained when the out degree is one at step 512. The matched pattern is displayed at step 514 if the corresponding out degree is zero. The transition value associated with the current node is computed based on the corresponding smallest next value when the out degree is greater than one from steps 516 to step 520.
The bit position corresponding to the current input symbol from the bit map array associated with the current node is obtained at step 516. The offset is computed at step 518 by counting the number of bits set to one till the bit position of the current input symbol in the corresponding bit map array of the current node.
The transition value is computed by adding the offset with the smallest next value associated with the current node at step 520. Further, the current node is updated by swapping the current node with the transition value at step 522 and the current input symbol is updated by swapping the current input symbol with the next symbol of the current input symbol at step 524.
For example, referring to the Table 2 and FIG. 4B, assuming the input text as “pa”. The first input symbol is ‘p’ and the first input symbol ‘p’ matches with a symbol of the node 0. The number of ones before the bit position of the symbol ‘p’ is 2. Hence the offset is 2. The smallest first value of the node 0 is node 1. Hence the offset is added with the smallest next of the node 0 to obtain the transition value. Here, the transition value is 2+1= 3. Hence the transition from node 0 is to node 3 by taking the symbol ‘p’. Now, the current node is updated to node 3 and the input symbol is ‘a’ (next symbol of the input text). At the current node 3, the out degree of node 3 is obtained and the out degree is 1. Hence the transition value from the current node 3 is the node number of the child node of node 3, which is node 7. Now the current node is updated to node 6 and the next input symbol is taken. Here, the next input symbol is null. Further, the out degree of node 6 is obtained, which is zero. The zero out degree indicates that it is an end node. Hence the output corresponding to the node 7 “pa” is displayed.
For example, if the input text is “ald”. Initially, the current node is node 0. The input symbol is ‘a’. The input symbol ‘a’ is matched with the transition symbols associated with the node 0 and there is a match. Hence the out degree of node 0 is obtained. The out degree of node 0 is 3. Hence the bit map position of the input symbol ‘a’ is obtained and the number of bits set to one before the bit position of ‘a’ is obtained. Here, the number of bits set to one before the bit position of ‘a’ is zero. Hence the offset is zero. The offset is added with the smallest next value of the node 0, which is node 1. Hence the offset is added with the smallest next value, 0+1=1. Hence node 1 is the transition value from node 0. Now the current node is updated to node 1 and the input symbol is ‘l’. The input symbol ‘l’ is matched with the symbols associated with the node 1 and there is a match. Now the out degree of node 1 is obtained, which is 1 and hence the transition value is node 4. Now the current node is updated to node 4 and the input symbol is updated to ‘d’. Now the input symbol ‘d’ is matched with the transition symbol of node 4 and there is no match. Now a failure transition from the node 4 is followed. The corresponding failure transition for node 4 is node 0 and the matching is continued for the next input text.
Further, the pattern analysis unit 114, when executed by one or more processors of the system 100 displays the plurality of matched patterns associated with the input text when the end of input text is reached, ie., all the input symbols associated with the input text are processed.
In an embodiment, an exemplary pseudo code for matching the input text against the plurality of patterns compiled in the optimized DFA is shown below and such pseudo code shall not be construed as limiting the scope of the present disclosure:
1: procedure Match
2: Input: input text x =a_1,a_2,,,,a_nwhere ai ? S and optimized Aho- Corasick pattern matching machine A_opt
3: Output: set of keywords found in t
4: s = 0
5: i = 0
6: matches = new Set()
7: p = -1= null
8: while i < n do
9: a = x[i]
10: if state[s].symbol ? S then
11: if state[s].symbol == a
12: s = state[s].smallest_next
13: i += 1
14: else
15: s = state[s].fail
16: else if state[s].symbol == ‘$’ then
17: s = state[s].fail
18: else if state[s].symbol == ‘#’ then
19: bitmap = bitmap_table[s]
20: if bit ba is set in bitmap then
21: s = state[s].smallest_next + offset(bitmap, a)
22: i += 1
23: else
24: s = state[s].fail
25: end if
end if
26: if p == 0 and s == 0
i += 1
end if
p = s
27: if s ? output_map then
28: matches.add(output_map[s])
34: end if
35: end while
36: return matches
37: end procedure
The present disclosure is more memory efficient than the bitmap implementation. Although maintaining the variable symbol for each state requires additional memory, a large amount of memory is saved by storing bit map only for the states with the out degree greater than 1. The formula for computing the total memory requirement for the present disclosure is represented in equation (2) to equation (6).
S.f.(c.|S| + a)+S.(1-f).a ………………..(2)
S.f.(c.|S| + a)+S.(1-f).a