Abstract: A method (600) and a system (100) for searching code using a plurality of search techniques are disclosed. A processor (104) receives a query from a user device. A set of keywords associated with the query and a query context of the query are determined, the query including a functional specification, one or more code languages, code configuration and log information. A set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages are determined. A customized parser associated with the one or more code languages is determined. A set of code chunks is determined by parsing the one or more codebases. A search repository is created by indexing each of the set of code chunks. The search repository is searched using the plurality of search techniques to determine preliminary outputs. An output is determined from the preliminary outputs. (To be published with FIG. 1)
DESCRIPTION
Technical Field
[0001] This disclosure relates generally to data processing systems and more particularly to a method and system for searching code using a plurality of search techniques.
BACKGROUND
[0002] Code search is a complex process because every programming language has its own syntax, keywords and structure. Two code snippets written in different programming languages may therefore perform the same function while using different keywords and structure. Relying solely on a keyword based search technique may not be effective, as keywords vary from language to language. Similarly, a narrow focus on syntactic search can overlook other critical aspects, such as semantics or runtime behavior, while searching for code. For instance, a syntactic search might miss semantically equivalent code written in a different language and thereby narrow the scope of the search. Therefore, there is a requirement for an effective methodology that allows searching code across various code bases.
SUMMARY OF THE INVENTION
[0003] In an embodiment, a method of searching code using a plurality of search techniques is disclosed. The method may include receiving, by a processor, a query from a user device. In an embodiment, the query may include: a functional specification, one or more code languages, a set of code configuration and log information. The method may further include determining, by the processor, a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data. The method may further include determining, by the processor, a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context. Further, the method may include determining, by the processor, a customized parser associated with the one or more code languages based on the set of metrics. The method may further include determining, by the processor, a set of code chunks by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords. Further, the method may include creating, by the processor, a search repository by indexing each of the set of code chunks based on the set of keywords and the query context. In an embodiment, each of the plurality of search techniques may correspond to a predefined indexing technique from a plurality of predefined indexing techniques. It may be noted that the indexing may be performed using the corresponding predefined indexing techniques corresponding to each of the plurality of search techniques. The method may further include searching, by the processor, the search repository using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs. 
It may be noted that each of the set of preliminary outputs may be tagged using a tagging information and a search technique from the plurality of search techniques. The method may further include determining, by the processor, an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques.
[0004] In another embodiment, a system for searching code using a plurality of search techniques is disclosed. The system may include a processor, and a memory communicatively coupled to the processor, wherein the memory stores processor-executable instructions, which on execution, cause the processor to receive a query from a user device. In an embodiment, the query may include: a functional specification, one or more code languages, a set of code configuration and log information. Further, the processor may determine a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data. The processor may further determine a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context. Further, the processor may determine a customized parser associated with the one or more code languages based on the set of metrics. The processor may further determine a set of code chunks by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords. Further, the processor may create a search repository by indexing each of the set of code chunks based on the set of keywords and the query context. In an embodiment, each of the plurality of search techniques may correspond to a predefined indexing technique from a plurality of predefined indexing techniques. In an embodiment, the indexing may be performed using the corresponding predefined indexing techniques corresponding to each of the plurality of search techniques. Further, the processor may search the search repository using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs. 
It may be noted that each of the set of preliminary outputs may be tagged using a tagging information and a search technique from the plurality of search techniques. The processor may further determine an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques.
[0005] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[0007] FIG. 1 is a block diagram of an exemplary system for searching code using a plurality of search techniques, in accordance with an embodiment of the present disclosure.
[0008] FIG. 2 is a functional block diagram of various modules within a computing device of FIG. 1, configured to search code using a plurality of search techniques, in accordance with an embodiment of the present disclosure.
[0009] FIG. 3 illustrates an exemplary set of preliminary outputs, in accordance with an embodiment of the present disclosure.
[0010] FIG. 4 illustrates an exemplary set of referenced output, in accordance with an embodiment of the present disclosure.
[0011] FIG. 5 illustrates an exemplary output table determined based on the exemplary referenced output of FIG. 4, in accordance with an embodiment of the present disclosure.
[0012] FIG. 6 illustrates a flow diagram depicting a methodology of searching code using a plurality of search techniques, in accordance with an embodiment of the present disclosure.
[0013] FIG. 7 illustrates a flow diagram of a methodology of generating an output from the set of preliminary outputs determined in FIG. 6, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS
[0014] Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims. Additional illustrative embodiments are listed.
[0015] Further, the phrases “in some embodiments”, “in accordance with some embodiments”, “in the embodiments shown”, “in other embodiments”, and the like, mean a particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present disclosure and may be included in more than one embodiment. In addition, such phrases do not necessarily refer to the same embodiments or different embodiments. It is intended that the following detailed description be considered exemplary only, with the true scope and spirit being indicated by the following claims.
[0016] Since each programming language has a unique syntax, a purely syntactic search technique may focus narrowly on the textual aspects of code. This may risk overlooking important semantic relationships and runtime behaviors while searching code bases. Therefore, the current solution of utilizing multiple search techniques as discussed in detail in the current disclosure overcomes the problems discussed above.
[0017] Referring now to FIG. 1, a block diagram of an exemplary code searching system 100 for searching code using a plurality of search techniques is illustrated, in accordance with an embodiment of the present disclosure. The code searching system 100 may include a computing device 102, an external device 112, and a data server 114 communicatively coupled to each other through a wired or wireless communication network 110. The computing device 102 may include a processor 104, a memory 106, and an input/output (I/O) device 108. In an embodiment, examples of processor(s) 104 may include, but are not limited to, microcontrollers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), system-on-chip (SoC) components, or any other suitable programmable logic devices. Examples of processor(s) 104 may include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, Nvidia®, FortiSOC™ system on a chip processors or other future processors. In an embodiment, the memory 106 may be a computer-readable medium (CRM) that may store non-transitory computer-readable instructions that, when executed by the processor 104, cause the processor 104 to determine an output from the set of preliminary outputs, as discussed in more detail below. In an embodiment, the memory 106 may be a non-volatile memory or a volatile memory. Examples of non-volatile memory may include, but are not limited to, a flash memory, a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), and an Electrically Erasable PROM (EEPROM) memory. Further, examples of volatile memory may include, but are not limited to, Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM).
[0018] In an embodiment, the I/O device 108 may include a variety of interface(s), for example, interfaces for data input and output devices, and the like. The I/O device 108 may facilitate inputting of instructions by a user communicating with the computing device 102. In an embodiment, the I/O device 108 may be wirelessly connected to the computing device 102 through wireless network interfaces such as Bluetooth®, infrared, or any other wireless radio communication known in the art. In an embodiment, the I/O device 108 may be connected to a communication pathway for one or more components of the computing device 102 to facilitate the transmission of inputted instructions and output results of data generated by various components such as, but not limited to, processor(s) 104 and memory 106.
[0019] In an embodiment, the data server 114 may be implemented in a cloud or as a physical database and may store data of one or more code bases. In an embodiment, the data server 114 may store data input by an external device 112 or output generated by the computing device 102. In an embodiment, the external device 112 may include any other data, such as, but not limited to, historical data and parsers, necessary for the code searching system 100 to search code using a plurality of search techniques.
[0020] In an embodiment, the communication network 110 may be a wired or a wireless network or a combination thereof. The communication network 110 can be implemented as one of the different types of networks, such as but not limited to, ethernet IP network, intranet, local area network (LAN), wide area network (WAN), the internet, Wi-Fi, LTE network, CDMA network, 5G and the like. Further, the communication network 110 can either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further, the communication network 110 can include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0021] In an embodiment, the computing device 102 may receive a request or a query for searching code from an external device 112 (also referred to as user device 112) that may be operated by a user. In an embodiment, the computing device 102 and the external device 112 may be a computing system, including but not limited to, a smart phone, a laptop computer, a desktop computer, a notebook, a workstation, a server, a portable computer, a handheld, or a mobile device. In an embodiment, the computing device 102 may be, but not limited to, in-built into the external device 112 or may be a standalone computing device.
[0022] In an embodiment, the computing device 102 may perform various processing for searching code using the plurality of search techniques. By way of an example, the computing device 102 may receive a query from a user device 112. In an embodiment, the query may include, but not limited to, a functional specification, one or more code languages, a set of code configuration and log information. The computing device 102 may then determine a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data. In an embodiment, the set of keywords and the query context are determined based on time information and user information using an AI model trained based on the historical data and domain data associated with a plurality of code languages. It is to be noted that the historical data may include, but not limited to a historical query, a historical set of keywords, a set of historical query context and historical user information. Further, the computing device 102 may determine a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context.
[0023] The computing device 102 may then determine a customized parser associated with the one or more code languages based on the set of metrics. The computing device 102 may further determine a set of code chunks by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords. The computing device 102 may then create a search repository by indexing each of the set of code chunks based on the set of keywords and the query context. In an embodiment, each of the plurality of search techniques may correspond to a predefined indexing technique from a plurality of predefined indexing techniques. It is to be noted that the indexing may be performed using the corresponding predefined indexing techniques corresponding to each of the plurality of search techniques. In an embodiment, the search repository may be searched using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs. It is to be noted that each of the set of preliminary outputs is tagged using a tagging information and a search technique from the plurality of search techniques. Further, the computing device 102 may determine an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques.
[0024] Referring now to FIG. 2, a functional block diagram of the computing device 102 of the code searching system 100 of FIG. 1 is illustrated, in accordance with the embodiments of the present disclosure. FIG. 2 is explained in conjunction with FIG. 1. The memory 106 of the computing device 102 may include an input receiving module 202, a Large Language Model (LLM) module 204, a metrics and structural metadata determination module 206, a parser determination module 208, a code chunks determination module 210, a repository creation module 212, a repository search module 214, and an output generation module 218. The repository search module 214 further includes a tagging module 216.
[0025] The input receiving module 202 may receive a query as an input from the user device 112. The query may include, but is not limited to, a functional specification, one or more code languages, a set of code configuration and log information. In one example, the functional specification may include information such as, but not limited to, documentation that describes a structure, behavior, and/or an intended functionality of systems, subsystems, components, or features. Such documentation may include textual descriptions, diagrams, architectural outlines, interface definitions, design documents or the like. Further, the query may include one or more code languages such as, but not limited to, Java, JavaScript, .NET, C, C++ and so on. Further, the set of code configuration may include, but is not limited to, build and deployment configuration (for example, files or scripts used to define compilation, packaging, dependency management, and deployment workflows). Examples of the set of code configuration may include a pom.xml file in Java, a package.json file in JavaScript, a Makefile in C, C++, C#, Jenkinsfile scripts used for continuous integration (CI) and deployment with Jenkins, a Dockerfile for containerization and so on. Additionally, the query may include one or more log files, such as system logs, application logs, error logs, or defect logs. It is to be noted that these logs may contain structured or unstructured information pertaining to system behavior, exceptions, stack traces, runtime errors and so on.
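By way of a non-limiting illustration, the four parts of the query described above may be held in a simple record. The following is a minimal sketch only; the class name CodeSearchQuery, its field names, and the sample values (including the log line) are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeSearchQuery:
    """Hypothetical container for the parts of a query (names are assumptions)."""
    functional_specification: str
    code_languages: List[str]
    # Paths to build/deployment files, e.g. pom.xml or package.json.
    code_configuration: List[str] = field(default_factory=list)
    # Raw log lines or log file paths supplied with the query.
    log_information: List[str] = field(default_factory=list)


# Illustrative values only; the log line is invented for this sketch.
query = CodeSearchQuery(
    functional_specification="fetch data from database and write to a file",
    code_languages=["Java"],
    code_configuration=["pom.xml"],
    log_information=["ERROR TaxCalculator.performAnalysis: connection timed out"],
)
```

Keeping the optional parts as empty lists by default allows a query that carries only a functional specification and a language to be represented with the same record.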
[0026] The LLM module 204 may determine a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data. The LLM module 204 may also determine the set of keywords and the query context based on time information and user information using the LLM trained based on the historical data and domain data associated with a plurality of code languages. Further, the historical data may include a historical query, a historical set of keywords, a set of historical query context and historical user information.
[0027] The LLM module 204 may process the query to derive the set of keywords as search text by eliminating probable human errors and natural language constructs from the query. In an example, the set of keywords associated with the query may include, but is not limited to, keywords related to a source language or a target language of the code to be searched. In other words, the set of keywords determined in response to the query may include, but is not limited to, file extensions indicating the programming languages (such as .java for Java, .c or .cpp for C/C++, and .cs for .NET) and technologies or tools such as Apache Tomcat, Jasper, Spring Boot, and RxJava. It is to be noted that this set of keywords may help in identifying the technical context and dependencies related to the query. Further, the LLM module 204 may determine the query context of the query using the LLM based on the historical data. In an example, the age of an entity changes over time; hence, the source indexing time, the query indexing time, and their influence on a data element need to be considered as context in order to obtain the right search result. Thus, the query context may determine the influence of one or more factors, or combinations of factors, such as, but not limited to, domain specific, usage specific, and/or time specific factors, on the priority, ranking and relevance of the code search. In another example, the query context may improve the search relevancy by combining the keyword based search with environment based factors, termed as context. The context may be derived using the historical data including, but not limited to, a historical query, a historical set of keywords, a set of historical query context and historical user information. It may be noted that the query context may utilize the historical data as feedback to influence a current code search.
[0028] Further, the metrics and structural metadata determination module 206 may determine a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context determined by the LLM module 204. In an embodiment, the set of metrics may include, but is not limited to, lines of code (LOC), complexity information, dependency information, framework information, and language information. In an example, the LOC may indicate the total and effective lines of code in the one or more code bases. In another example, the complexity information may be indicative of code complexity such as structural or cyclomatic complexity, Halstead complexity, and so on. Further, examples of dependency information may include, but are not limited to, libraries, modules, or third party packages, and examples of framework information may include frameworks (such as Spring Boot, ASP.NET) associated with the one or more code languages.
[0029] Further, the metrics and structural metadata determination module 206 may determine the set of structural metadata associated with the one or more codebases such as, but not limited to, application information, module information, class information, and function information. This set of structural metadata defines the fundamental components needed to form the basic structure of the codebase. Further, the functional components identified in the codebase need to be grouped based on their logical or contextual relationships. It is to be noted that these groupings reflect the unique characteristics of different programming paradigms, such as functional programming, object-oriented programming, or object-based languages. In one example, class information may be determined in case of object-oriented programming, as functional components are grouped into classes within files, which may include private or nested classes. In another example, function information may be determined in case of functional programming, where functions are usually grouped by file structure. These files or classes form modules, which may be organised under namespaces and may be determined as module information.
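By way of a non-limiting illustration, some of the metrics named above (total and effective LOC, an approximate cyclomatic complexity, and dependency information) may be computed as sketched below. The sketch assumes Python source and uses Python's standard ast module as a stand-in for whichever analyzer an embodiment uses; the function name compute_metrics, the dictionary keys, and the sample source are hypothetical. The complexity value is a rough approximation (one plus the number of decision nodes) rather than a full McCabe analysis.

```python
import ast

# Node types counted as decision points for the approximate complexity.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def compute_metrics(source: str) -> dict:
    """Return illustrative metrics for one Python source string."""
    lines = source.splitlines()
    # Effective LOC: skip blank lines and full-line comments.
    effective = [ln for ln in lines if ln.strip() and not ln.strip().startswith("#")]

    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))

    dependencies = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            dependencies.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            dependencies.add(node.module)

    return {
        "loc_total": len(lines),
        "loc_effective": len(effective),
        "complexity": 1 + decisions,
        "dependencies": sorted(dependencies),
    }


# Hypothetical sample source used only to exercise the sketch.
SAMPLE = '''import os
from json import dumps

# compute tax
def tax(amount, rate):
    if amount < 0:
        return 0
    return amount * rate
'''
metrics = compute_metrics(SAMPLE)
```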
[0030] As discussed earlier, some code languages support functional programming whereas others support object-oriented programming. Further, there are a few code languages which are object-based but not fully object oriented. However, each of the code languages may have particular structure and a portion of code in form of blocks and statements may implement a similar functionality yet may have different structure.
[0031] These portions of code need to be grouped into a context in which they are related to each other. In the object-oriented paradigm, that context is usually a class, whereas in functional programming the portions of code are mostly grouped into a file structure. In the object-oriented paradigm, a file can have multiple classes embedded in it, and there may even be private classes that are accessible only within the current context. Further, these classes or files can be organized into modules so that they can be imported by or exported to external modules. On top of this, there may be logical namespaces, which can be private to a module or span across different modules. Thus, this information may be determined as the set of structural metadata that may be considered while building the final structure of the code being searched.
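By way of a non-limiting illustration, the grouping of functions into classes and of classes and functions into a module may be recorded as sketched below. The sketch assumes Python source and the standard ast module; the function name structural_metadata, the module name "billing", and the sample source are hypothetical.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def structural_metadata(module_name: str, source: str) -> dict:
    """Collect module, class, and function information for one source file."""
    tree = ast.parse(source)
    classes = {}
    functions = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            # Class information: methods grouped under their class.
            classes[node.name] = [
                item.name for item in node.body if isinstance(item, FUNCTION_NODES)
            ]
        elif isinstance(node, FUNCTION_NODES):
            # Function information: free functions grouped by file.
            functions.append(node.name)
    return {"module": module_name, "classes": classes, "functions": functions}


# Hypothetical sample source used only to exercise the sketch.
SAMPLE = '''class TaxCalculator:
    def performAnalysis(self):
        pass

def helper():
    pass
'''
metadata = structural_metadata("billing", SAMPLE)
```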
[0032] Further, the parser determination module 208 may determine a customized parser associated with the one or more code languages based on the set of metrics. In one example, the customized parser for programming languages such as Java, .NET, or C may be generated using parser generator frameworks such as, but not limited to, ANTLR, JavaCC, Tree-sitter and the like. It is to be noted that these frameworks may support multiple languages and allow for the creation of language specific parsers. In another example, the customized parser may be generated using a combination of grammars that describe the syntax and structure of each language. The grammar type may include, but is not limited to, PEG (Parsing Expression Grammar), LL, LR, LALR, or CFG (Context-Free Grammar). It is to be noted that the customized parser may ensure accurate parsing and extraction of language-specific code composition from the one or more codebases. Since each programming language has a distinct syntax and structural pattern, a custom parser may be implemented for each code language to accurately extract code artifacts from the one or more codebases based on the set of metrics associated with the one or more codebases of each of the one or more code languages. Thus, the customized parser may be generated using a parser generator framework that generates homogeneous output, supports many languages and adheres to a common API for collection of data across the one or more codebases.
[0033] The code chunks determination module 210 may use the customized parser determined by the parser determination module 208 to determine a set of code chunks by parsing the one or more codebases based on the set of structural metadata elements and the set of keywords. The customized parser may utilize the set of structural metadata elements and the set of keywords to parse the one or more codebases in order to place each code chunk in its defined bucket, which helps filtering at the time of the actual search. The code chunks determination module 210 may determine structured code chunks, rather than ad hoc code chunks, to avoid complications in retrieval and co-referencing later.
[0034] It is to be noted that each chunking strategy is tailored to the use case, for example, modules for component level search and methods for functionality level search. Thus, the code chunks determination module 210 may maintain and preserve the hierarchy of code chunks in the set of code chunks to ensure that each code chunk has context relevant to its scope and that ambiguity regarding hierarchy and nested code structures is reduced. In one example, the set of code chunks may include metadata and function definitions, such as names, parameters, and return types. In another example, the set of code chunks may also be represented as nodes in a graph or tree, capturing the relationships and hierarchy among functions. These nodes may further indicate function calls, dependencies, or control flow between different parts of the code.
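By way of a non-limiting illustration, selecting a parser from the language information in the set of metrics, with every parser adhering to a common call signature, may be sketched as below. The registry PARSERS, the functions parse_python, parse_lines and select_parser, and the "language" key are hypothetical. The sketch does not use ANTLR, JavaCC or Tree-sitter; Python's standard ast module stands in for a generated parser.

```python
import ast
from typing import Callable, Dict, List

# Common API assumed for this sketch: source text in, list of artifacts out.
Parser = Callable[[str], List[str]]


def parse_python(source: str) -> List[str]:
    """Return names of top-level classes and functions in Python source."""
    return [
        node.name
        for node in ast.parse(source).body
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def parse_lines(source: str) -> List[str]:
    """Language-agnostic fallback: return the non-blank lines unchanged."""
    return [line for line in source.splitlines() if line.strip()]


# Hypothetical registry mapping a language name to its parser.
PARSERS: Dict[str, Parser] = {"python": parse_python}


def select_parser(metrics: dict) -> Parser:
    """Pick a parser from the language information in the metrics."""
    language = str(metrics.get("language", "")).lower()
    return PARSERS.get(language, parse_lines)


parser = select_parser({"language": "Python"})
artifacts = parser("class TaxCalculator:\n    pass\n")
```

Because every entry in the registry shares one signature, downstream chunking code can call the selected parser without knowing which language it handles.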
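By way of a non-limiting illustration, hierarchy-preserving chunking may be sketched as below, where each chunk records its enclosing chunk so that nested structures remain unambiguous. The sketch assumes Python source and the standard ast module (Python 3.8 or later for end_lineno); the CodeChunk record, its field names, and the sample source are hypothetical. Definitions nested inside control-flow blocks are not visited in this simplified version.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class CodeChunk:
    qualified_name: str    # e.g. "billing.TaxCalculator.performAnalysis"
    kind: str              # "class" or "function"
    parent: str            # qualified name of the enclosing chunk or module
    parameters: List[str]  # positional parameter names (empty for classes)
    start_line: int
    end_line: int


def chunk_source(module_name: str, source: str) -> List[CodeChunk]:
    """Split source into class and function chunks, keeping parent links."""
    chunks: List[CodeChunk] = []

    def visit(node: ast.AST, parent: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = f"{parent}.{child.name}"
                chunks.append(
                    CodeChunk(name, "class", parent, [], child.lineno, child.end_lineno)
                )
                visit(child, name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = f"{parent}.{child.name}"
                params = [arg.arg for arg in child.args.args]
                chunks.append(
                    CodeChunk(name, "function", parent, params, child.lineno, child.end_lineno)
                )
                visit(child, name)

    visit(ast.parse(source), module_name)
    return chunks


# Hypothetical sample source used only to exercise the sketch.
SAMPLE = '''class TaxCalculator:
    def performAnalysis(self, year):
        def load():
            pass
        return load()
'''
chunks = chunk_source("billing", SAMPLE)
```

The parent field allows the chunks to be viewed as nodes of a tree, which corresponds to the graph or tree representation mentioned above.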
[0035] Once the set of code chunks is determined by the code chunks determination module 210, the module may send this information to the repository creation module 212. The repository creation module 212 may index each of the set of code chunks based on the set of keywords and the query context using one or more predefined indexing techniques from a plurality of predefined indexing techniques. It may be noted that each of the plurality of indexing techniques corresponds to a search technique from the plurality of search techniques. In one example, the predefined indexing techniques may include, but are not limited to, a B-tree search indexing technique, a vector storage indexing technique, a cache indexing technique, an embedding indexing technique and so on. It may be noted that each of the indexing techniques may utilize indexes as specialized data structures for indexing the set of code chunks to build a search repository that allows fast and efficient traversal and retrieval of information. It may also be noted that each indexing technique requires the data to be processed and stored in a specific format. It is to be noted that different types of searches may require different types of indexing techniques. For example, a keyword based search technique may utilize a B-tree based indexing technique or an inverted index, which may map each token to the files it appears in. In another example, a semantic search technique may utilize a vector storage based indexing technique in order to search based on a similarity of the tokens.
[0036] In an embodiment, ML models such as, but not limited to, natural language processing (NLP) models may be used by the repository creation module 212 to index each of the set of code chunks based on the set of keywords and the query context using at least one predefined indexing technique from a plurality of predefined indexing techniques. Thus, the search repository created by indexing the set of code chunks enables storage of data in a way that speeds up searching of the one or more code bases. Further, trained ML models may be used to generate embeddings that may be stored in the search repository as a vector store. Also, the ML models may be pre-trained on various programming languages to understand the semantics of the one or more codebases. In one embodiment, the repository creation module 212 may index each of the code chunks by tagging a corresponding code chunk from the set of code chunks using one or more keywords from the set of keywords.
[0037] The repository search module 214 may then search the search repository using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs. The plurality of search techniques may include, but is not limited to, keyword-based search, semantic search, and so on. Further, the repository search module 214 may include a tagging module 216 that may tag each of the set of preliminary outputs using tagging information and a search technique from the plurality of search techniques. Further, the tagging information may include an index corresponding to a corresponding predefined indexing technique and a search technique from the plurality of search techniques, a metric from the set of metrics and a structural metadata from the set of structural metadata. In one embodiment, the searching may be iteratively performed by using an output from one search technique as input for other techniques. Further, the set of metrics may include, but is not limited to, lines of code (LOC), and so on.
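By way of a non-limiting illustration, an inverted index that maps each token to the files it appears in, and a keyword based search over that index, may be sketched as below. The function names, the simple lowercase tokenizer, and the sample file contents are hypothetical; a B-tree based index is not shown.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of file paths that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path, content in files.items():
        for token in tokenize(content):
            index[token].add(path)
    return index


def keyword_search(index: Dict[str, Set[str]], keywords: Iterable[str]) -> Set[str]:
    """Return the files that contain every keyword (logical AND)."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


# Hypothetical file contents used only to exercise the sketch.
FILES = {
    "tax/TaxCalculator.java": "class TaxCalculator { void performAnalysis() {} }",
    "hr/Payroll.java": "class Payroll { void run() {} }",
}
index = build_inverted_index(FILES)
hits = keyword_search(index, ["TaxCalculator", "performAnalysis"])
```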
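By way of a non-limiting illustration, storing embeddings in a vector store and retrieving code chunks by similarity may be sketched as below. As a loudly stated assumption, the sketch does not use a trained ML model: a bag-of-tokens count vector stands in for a learned embedding, and cosine similarity stands in for the similarity measure. The class VectorStore, the functions embed and cosine, and the sample chunk descriptions are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: token counts (NOT a trained model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """Minimal in-memory store of (chunk id, vector) rows."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, chunk_id: str, text: str) -> None:
        self._rows.append((chunk_id, embed(text)))

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        query_vector = embed(query)
        scored = [(cosine(query_vector, vec), cid) for cid, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(cid, round(score, 3)) for score, cid in scored[:top_k]]


# Hypothetical chunk descriptions used only to exercise the sketch.
store = VectorStore()
store.add("TaxCalculator.performAnalysis", "fetch data from database and write to a file")
store.add("Payroll.run", "compute monthly salary for employees")
results = store.search("read data from database", top_k=1)
```

In an embodiment that uses a pre-trained model, only the embed function would change; the storage and ranking steps would remain as sketched.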
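By way of a non-limiting illustration, running several search techniques over the same repository and tagging each preliminary output with its search technique, its index, a metric (LOC) and structural metadata (module and file) may be sketched as below. The record types, the two scoring functions, the technique-to-index mapping, and the sample chunks with their LOC values are hypothetical; the token overlap scorer is only a stand-in for a semantic search technique.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Chunk:
    module: str
    file: str
    loc: int
    text: str


@dataclass(frozen=True)
class PreliminaryOutput:
    module: str            # structural metadata
    file: str              # structural metadata
    loc: int               # metric
    search_technique: str  # technique that produced this output
    index: str             # index used by that technique
    score: float


def keyword_match(chunk: Chunk, keywords: List[str]) -> float:
    """Fraction of keywords that occur in the chunk text."""
    text = chunk.text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def token_overlap(chunk: Chunk, keywords: List[str]) -> float:
    """Jaccard overlap of tokens; a stand-in for semantic similarity."""
    chunk_tokens = set(chunk.text.lower().split())
    query_tokens = {t for keyword in keywords for t in keyword.lower().split()}
    union = chunk_tokens | query_tokens
    return len(chunk_tokens & query_tokens) / len(union) if union else 0.0


# Hypothetical mapping: search technique -> (index name, scoring function).
TECHNIQUES: Dict[str, Tuple[str, Callable[[Chunk, List[str]], float]]] = {
    "keyword": ("inverted_index", keyword_match),
    "semantic": ("vector_store", token_overlap),
}


def search_repository(chunks: List[Chunk], keywords: List[str]) -> List[PreliminaryOutput]:
    """Run every technique and tag each hit with its technique and index."""
    outputs: List[PreliminaryOutput] = []
    for technique, (index_name, scorer) in TECHNIQUES.items():
        for chunk in chunks:
            score = scorer(chunk, keywords)
            if score > 0:
                outputs.append(
                    PreliminaryOutput(
                        chunk.module, chunk.file, chunk.loc, technique, index_name, score
                    )
                )
    return outputs


# Hypothetical repository contents used only to exercise the sketch.
CHUNKS = [
    Chunk("tax", "TaxCalculator.java", 120, "TaxCalculator performAnalysis fetch data from database"),
    Chunk("hr", "Payroll.java", 80, "Payroll run compute salary"),
]
outputs = search_repository(CHUNKS, ["TaxCalculator", "database"])
```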
[0038] In one example, a query may include class/function or a description about the functionality to be searched such as:
“TaxCalculator
performAnalysis()
fetch data from database and write to a file”.
[0039] Accordingly, the query context may be combined with the set of keywords to produce a more refined and relevant set of preliminary outputs. For example, when a domain such as human resources or procurement is combined with specific keywords such as the class name TaxCalculator, the computing device 102 may filter the outputs more precisely and may return targeted code artifacts such as Income Tax or Goods and Services Tax (GST) logic instead of unrelated tax logic.
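The refinement described above may be illustrated with a small sketch that filters candidate artifacts by both a keyword and a domain taken from the query context. The artifact names, the domain labels, and the `refine` function are hypothetical and serve only to show the filtering idea.

```python
# Hypothetical candidate artifacts, each labeled with an assumed domain.
ARTIFACTS = [
    {"name": "IncomeTaxCalculator", "domain": "human resource"},
    {"name": "GstTaxCalculator", "domain": "procurement"},
    {"name": "TaxCalculatorTestStub", "domain": "testing"},
]


def refine(artifacts, keyword, domain):
    """Keep artifacts whose name contains the keyword and whose domain matches."""
    kw = keyword.lower()
    return [a["name"] for a in artifacts
            if kw in a["name"].lower() and a["domain"] == domain]


hr_results = refine(ARTIFACTS, "TaxCalculator", "human resource")
procurement_results = refine(ARTIFACTS, "TaxCalculator", "procurement")
```

With the keyword alone, all three artifacts would match; adding the domain narrows the result to the one artifact relevant to that context.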
[0040] Referring now to FIG. 3, an exemplary set of preliminary outputs is illustrated, in accordance with an exemplary embodiment of the present disclosure. As can be seen, each of the plurality of search techniques may generate one or more preliminary outputs depicted in Table 300A and Table 300B. Each of the intermediate preliminary outputs 300A and 300B may be fetched using a corresponding indexing technique. It may be noted that each of the intermediate preliminary outputs 300A and 300B may have a common factor, i.e., the tagging information that may be used for tagging or labeling at the time of preparing the search repository, for example, the set of structural information including, but not limited to, module information 302A, 302B, file information 304A, 304B, etc. Further, the tagging information of each of the intermediate preliminary outputs 300A and 300B may include a search technique from the plurality of search techniques 306A, 306B and line of code 308A, 308B.
[0041] Referring back to FIG. 2, the output determination module 218 may generate a set of referenced output from the set of preliminary outputs 300A and 300B based on the tagging information. Referring now to FIG. 4, an exemplary set of referenced output 400 is illustrated, in accordance with the exemplary embodiment of the present disclosure. Each of the set of preliminary outputs 300A and 300B may be referenced based on the tagging information to determine the set of referenced output 400 including the corresponding tagging information such as, but not limited to, module information 302A, 302B, file information 304A, 304B, search techniques 306A, 306B and line of code 308A, 308B.
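One possible reading of the referencing described above is a grouping of preliminary outputs by their shared tagging information. The sketch below assumes that the tuple (module, file, lines) identifies a location and that the referenced output lists the search techniques that found it; the field names and sample rows are hypothetical.

```python
from collections import defaultdict


def reference_outputs(preliminary):
    """Group preliminary outputs by (module, file, lines) and collect techniques."""
    grouped = defaultdict(set)
    for hit in preliminary:
        key = (hit["module"], hit["file"], hit["lines"])
        grouped[key].add(hit["technique"])
    return {key: sorted(techniques) for key, techniques in grouped.items()}


# Hypothetical rows standing in for Tables 300A (keyword) and 300B (semantic).
preliminary = [
    {"module": "module 1", "file": "file 1", "lines": "10-20", "technique": "keyword"},
    {"module": "module 2", "file": "file 3", "lines": "5-9", "technique": "keyword"},
    {"module": "module 1", "file": "file 1", "lines": "10-20", "technique": "semantic"},
]
referenced = reference_outputs(preliminary)
```

A location found by more than one technique appears once in `referenced`, with every technique that matched it recorded against it.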
[0042] Referring back to FIG. 2, the output determination module 218 may determine an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques. In an example, each of the plurality of search techniques may be assigned a predefined weight in a range of about 0 to 1. Thus, each of the set of preliminary outputs in the referenced output 400 may be ranked based on a number of preliminary outputs determined corresponding to each of the set of keywords using each of the plurality of search techniques. Further, the output determination module 218 may determine an output from the referenced output 400 based on the referencing, the ranking and the predefined weights.
[0043] Referring now to FIG. 5, an exemplary output table 500 determined based on the exemplary referenced output 400 of FIG. 4 is illustrated, in accordance with the exemplary embodiment of the present disclosure. The output table 500 depicts a ranked list of the set of preliminary outputs based on the number of preliminary outputs determined corresponding to each of the set of keywords using each of the plurality of search techniques and the predefined weights assigned to each of the plurality of search techniques. Accordingly, the ranked output 500 may be determined by ranking each of the preliminary outputs using the predefined weights assigned to each of the plurality of search techniques. For instance, if a file is matched by both the semantic search technique and the keyword search technique, the preliminary outputs determined using the semantic search technique may be ranked based on the number of occurrences of the keyword.
[0044] Accordingly, in the output table 500, the output “module 1, file 1, method 1 and line 10-20” is ranked highest as method 1 may have a high predefined weight and the result may have a high number of keyword matches and a high semantic similarity. It is to be noted that ranking the outputs helps ensure that the most contextually accurate and relevant code snippets appear at the top of the search output.
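The weighted ranking described above may be sketched as follows. The disclosure does not fix a scoring formula, so the rule used here (sum over techniques of predefined weight multiplied by match count) is an assumption, as are the function name, the weights, and the sample data.

```python
def rank_outputs(referenced, weights):
    """Score each location as sum(weight[technique] * match count), highest first."""
    scored = []
    for location, hits in referenced.items():
        score = sum(weights.get(technique, 0.0) * count
                    for technique, count in hits.items())
        scored.append((location, round(score, 3)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical referenced output: match counts per technique for each location.
referenced = {
    ("module 2", "file 3", "5-9"): {"keyword": 1},
    ("module 1", "file 1", "10-20"): {"keyword": 3, "semantic": 1},
}
# Hypothetical predefined weights in the range 0 to 1.
weights = {"keyword": 0.4, "semantic": 0.9}
ranked = rank_outputs(referenced, weights)
```

Under these assumed weights, the location matched by both techniques scores 0.4 × 3 + 0.9 × 1 = 2.1 and is ranked above the location matched by a single keyword hit, which scores 0.4.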
[0045] It should be noted that all such aforementioned modules 202-218 may be represented as a single module or a combination of different modules. Further, as will be appreciated by those skilled in the art, each of the modules 202-218 may reside, in whole or in part, on one device or multiple devices in communication with each other, such as on a computer-readable medium (CRM). In some embodiments, each of the modules 202-218 may be implemented as a dedicated hardware circuit comprising a custom application-specific integrated circuit (ASIC) or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. Each of the modules 202-218 may also be implemented in a programmable hardware device such as a field programmable gate array (FPGA), programmable array logic, programmable logic device, and so forth. Alternatively, each of the modules 202-218 may be implemented in software or non-transitory computer-readable instructions for execution by various types of processors (e.g. processor 104). An identified module of executable code may, for instance, include one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executables of an identified module or component need not be physically located together but may include disparate instructions stored in different locations which, when joined logically together, include the module and achieve the stated purpose of the module. Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
[0046] Referring to FIG. 6, a flow diagram 600 depicting a methodology of searching code using a plurality of search techniques is illustrated, in accordance with some embodiments of the present disclosure. In an embodiment, the flow diagram 600 may include a plurality of steps that may be performed by the processor 104 for searching code using a plurality of search techniques.
[0047] At step 602, a query may be received from a user device including, but not limited to, a functional specification, one or more code languages, a set of code configuration and log information. At step 604, a set of keywords associated with the query and a query context of the query may be determined using a large language model (LLM) and historical data. The set of keywords and the query context may also be determined based on time information and user information using an AI model trained based on the historical data and domain data associated with a plurality of code languages. Further, the historical data may include a historical query, a historical set of keywords, a set of historical query context and historical user information. At step 606, a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages may be determined based on an analysis of the one or more codebases using the set of keywords and the query context. The set of metrics may include, but is not limited to, a line of code (LOC), complexity information, dependency information, framework information, and language information. Further, the set of structural metadata may include, but is not limited to, application information, module information, class information, and function information. At step 608, a customized parser associated with the one or more code languages may be determined based on the set of metrics. The customized parsers may include, but are not limited to, parsers for Java, .NET, C, etc. At step 610, a set of code chunks may be determined by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords. In one example, the set of code chunks may include metadata and function definitions, such as names, parameters, and return types. 
In another example, the set of code chunks may also be represented as nodes in a graph or tree, capturing the relationships and hierarchy among functions. At step 612, a search repository may be created by indexing each of the set of code chunks based on the set of keywords and the query context using at least one predefined indexing technique from a plurality of predefined indexing techniques. Each of the plurality of predefined indexing techniques may correspond to a search technique from the plurality of search techniques. Further, the predefined indexing techniques may include, but are not limited to, BTree search index, vector storage index, cache index and embedding index. In one embodiment, each of the set of code chunks is indexed based on the set of keywords by tagging a corresponding code chunk from the set of code chunks using one or more keywords from the set of keywords.
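The chunking of step 610 may be illustrated with a minimal sketch in which Python's standard `ast` module stands in for a customized parser. The disclosure names parsers for Java, .NET and C; Python is used here only for illustration, and the sample source, the `chunk_functions` name, and the chunk fields are assumptions.

```python
import ast

# Hypothetical codebase fragment to be chunked.
SOURCE = '''
class TaxCalculator:
    def perform_analysis(self, income: float) -> float:
        return income * 0.1

def write_report(path: str) -> None:
    pass
'''


def chunk_functions(source):
    """Return one chunk per function with its name, parameters, return type and lines."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            chunks.append({
                "name": node.name,
                "params": [arg.arg for arg in node.args.args],
                # ast.unparse requires Python 3.9 or later.
                "returns": ast.unparse(node.returns) if node.returns else None,
                "lines": (node.lineno, node.end_lineno),
            })
    # Order chunks by their position in the source.
    return sorted(chunks, key=lambda chunk: chunk["lines"][0])


chunks = chunk_functions(SOURCE)
```

Each chunk records the function definition metadata mentioned above (name, parameters, return type) together with its line range, which can later serve as a line-of-code tag when the chunk is indexed.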
[0048] At step 614, the search repository may be searched using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs. Further, the plurality of search techniques may include, but is not limited to, a keyword search technique, a semantic search technique, and so on. It may be noted that each of the set of preliminary outputs may be tagged using tagging information and a search technique from the plurality of search techniques. At step 616, an output from the set of preliminary outputs may be determined based on the tagging information and a predefined weight associated with each of the plurality of search techniques. It is to be noted that the predefined weight may be a user-defined weight in a range of about 0 to 1.
[0049] Referring to FIG. 7, a flow diagram 700 of a methodology of generating an output from the set of preliminary outputs determined in step 614 of flow diagram 600 of FIG. 6, is illustrated, in accordance with some embodiments of the present disclosure. In an embodiment, the method 700 may include a plurality of steps that may be performed by the processor 104 for generating an output from the set of preliminary outputs.
[0050] At step 702, each of the set of preliminary outputs may be referenced based on the tagging information including, but not limited to, an index corresponding to a corresponding predefined indexing technique and a search technique from the plurality of search techniques, a metric from the set of metrics and a structural metadata from the set of structural metadata. At step 704, each of the set of preliminary outputs may be ranked based on a number of preliminary outputs determined corresponding to each of the set of keywords using each of the plurality of search techniques. Accordingly, at step 708, the output from the set of preliminary outputs may be determined based on the referencing, the ranking and the predefined weight.
[0051] Thus, the disclosed method 600 and system 100 try to overcome the technical problem of searching one or more codebases for code using a plurality of search techniques. Due to the unique syntax of different programming languages, a single search technique may prove inefficient for code exploration across multiple codebases. The disclosed method 600 and system 100 may address these limitations by utilizing multiple search techniques simultaneously or hierarchically to search code across multiple codebases based on a single user query.
[0052] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method 600 and system 100, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps enable the following solutions to the existing problems in conventional technologies. Further, the claimed steps bring an improvement in the functioning of the device itself as the claimed steps provide a technical solution to a technical problem.
[0053] The specification has described a method 600 and system 100 for searching code using a plurality of search techniques. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0054] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor 104 may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) 104 to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0055] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS
I/We Claim:
1. A method (600) of searching code using a plurality of search techniques, the method (600) comprising:
receiving, by a processor (104), a query from a user device,
wherein the query comprises: a functional specification, one or more code languages, a set of code configuration and log information;
determining, by the processor (104), a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data;
determining, by the processor (104), a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context;
determining, by the processor (104), a customized parser associated with the one or more code languages based on the set of metrics;
determining, by the processor (104), a set of code chunks by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords;
creating, by the processor (104), a search repository by indexing each of the set of code chunks based on the set of keywords and the query context using at least one predefined indexing technique from a plurality of predefined indexing techniques,
wherein each of the plurality of predefined indexing techniques corresponds to a search technique from the plurality of search techniques;
searching, by the processor (104), the search repository using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs,
wherein each of the set of preliminary outputs is tagged using a tagging information and a search technique from the plurality of search techniques; and
determining, by the processor (104), an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques.
2. The method (600) as claimed in claim 1, wherein the set of keywords and the query context are determined based on time information and user information using an AI model trained based on the historical data and domain data associated with a plurality of code languages, and
wherein the historical data comprises a historical query, a historical set of keywords, a set of historical query context and historical user information.
3. The method (600) as claimed in claim 1, wherein the set of metrics comprises: line of code (LOC), complexity information, dependency information, framework information, and language information.
4. The method (600) as claimed in claim 1, wherein the set of structural metadata comprises application information, module information, class information, and function information.
5. The method (600) as claimed in claim 1, wherein the tagging information comprises: an index corresponding to a corresponding predefined indexing technique and a search technique from the plurality of search techniques, a metric from the set of metrics and a structural metadata from the set of structural metadata.
6. The method (600) as claimed in claim 5, comprising:
referencing, by the processor (104), each of the set of preliminary outputs based on the tagging information; and
ranking, by the processor (104), each of the set of preliminary outputs based on a number of preliminary outputs determined corresponding to each of the set of keywords using each of the plurality of search techniques,
wherein the output from the set of preliminary outputs is determined based on the referencing, the ranking and the predefined weights.
7. The method (600) as claimed in claim 1, wherein each of the set of code chunks is indexed based on the set of keywords by:
tagging, by the processor (104), a corresponding code chunk from the set of code chunks using one or more keywords from the set of keywords.
8. A system (100) for searching code using a plurality of search techniques, the system (100) comprising:
a processor (104);
a memory (106) communicably coupled to the processor (104), wherein the memory (106) stores processor-executable instructions, which, on execution, cause the processor (104) to:
receive a query from a user device,
wherein the query comprises: a functional specification, one or more code languages, a set of code configuration and log information;
determine a set of keywords associated with the query and a query context of the query using a large language model (LLM) and historical data;
determine a set of metrics and a set of structural metadata associated with one or more codebases of each of the one or more code languages based on an analysis of the one or more codebases using the set of keywords and the query context;
determine a customized parser associated with the one or more code languages based on the set of metrics;
determine a set of code chunks by parsing the one or more codebases using the customized parser based on the set of structural metadata elements and the set of keywords;
determine a search repository by indexing each of the set of code chunks based on the set of keywords and the query context,
wherein each of the plurality of search techniques corresponds to a predefined indexing technique, and
wherein the indexing is performed using the corresponding predefined indexing techniques corresponding to each of the plurality of search techniques;
search the search repository using each of the plurality of search techniques based on the corresponding indexing techniques, the set of keywords and the query context to determine a set of preliminary outputs,
wherein each of the set of preliminary outputs is tagged using a tagging information and a search technique from the plurality of search techniques; and
determine an output from the set of preliminary outputs based on the tagging information and a predefined weight associated with each of the plurality of search techniques.
9. The system (100) as claimed in claim 8, wherein the set of keywords and the query context are determined based on time information and user information using an AI model trained based on the historical data and domain data associated with a plurality of code languages, and
wherein the historical data comprises a historical query, a historical set of keywords, a set of historical query context and historical user information.
10. The system (100) as claimed in claim 8, wherein the set of metrics comprises: line of code (LOC), complexity information, dependency information, framework information, and language information.
11. The system (100) as claimed in claim 8, wherein the set of structural metadata comprises application information, module information, class information, and function information.
12. The system (100) as claimed in claim 8, wherein the tagging information comprises: an index corresponding to a corresponding predefined indexing technique and a search technique from the plurality of search techniques, a metric from the set of metrics and a structural metadata from the set of structural metadata.
13. The system (100) as claimed in claim 12, wherein the processor-executable instructions, on execution, further cause the processor (104) to:
reference each of the set of preliminary outputs based on the tagging information; and
rank each of the set of preliminary outputs based on a number of preliminary outputs determined corresponding to each of the set of keywords using each of the plurality of search techniques,
wherein the output from the set of preliminary outputs is determined based on the referencing, the ranking and the predefined weight.
14. The system (100) as claimed in claim 8, wherein each of the set of code chunks is indexed based on the set of keywords by:
tagging a corresponding code chunk from the set of code chunks using one or more keywords from the set of keywords.
| # | Name | Date |
|---|---|---|
| 1 | 202511060759-STATEMENT OF UNDERTAKING (FORM 3) [25-06-2025(online)].pdf | 2025-06-25 |
| 2 | 202511060759-REQUEST FOR EXAMINATION (FORM-18) [25-06-2025(online)].pdf | 2025-06-25 |
| 3 | 202511060759-REQUEST FOR EARLY PUBLICATION(FORM-9) [25-06-2025(online)].pdf | 2025-06-25 |
| 4 | 202511060759-PROOF OF RIGHT [25-06-2025(online)].pdf | 2025-06-25 |
| 5 | 202511060759-POWER OF AUTHORITY [25-06-2025(online)].pdf | 2025-06-25 |
| 6 | 202511060759-FORM-9 [25-06-2025(online)].pdf | 2025-06-25 |
| 7 | 202511060759-FORM 18 [25-06-2025(online)].pdf | 2025-06-25 |
| 8 | 202511060759-FORM 1 [25-06-2025(online)].pdf | 2025-06-25 |
| 9 | 202511060759-FIGURE OF ABSTRACT [25-06-2025(online)].pdf | 2025-06-25 |
| 10 | 202511060759-DRAWINGS [25-06-2025(online)].pdf | 2025-06-25 |
| 11 | 202511060759-DECLARATION OF INVENTORSHIP (FORM 5) [25-06-2025(online)].pdf | 2025-06-25 |
| 12 | 202511060759-COMPLETE SPECIFICATION [25-06-2025(online)].pdf | 2025-06-25 |