Abstract: TITLE: A method (200) and system (100) of text categorization to identify software build failure causes. Abstract The present disclosure proposes a method (200) and system (100) of text categorization to identify build failure causes in a software or continuous integration pipeline. The system (100) comprises a processor (10) in communication with at least one computing unit (20), the processor (10) running a trained machine learning model. The processor (10) is configured to extract a set of features from failure build logs and configuration files of the at least one computing unit (20). It then selects and ranks a set of sub-features for each feature based on a semantic connection and a hit rate. A value of weight is then assigned to each feature based on the occurrence of its sub-features in the failure build log and the summation of all sub-features corresponding to all features. The build failure cause is identified based on the value of weight assigned to the features. Figure 1.
Description:Complete Specification:
The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed
Field of the invention
[0001] The present disclosure relates to the field of text mining in machine learning tasks for classification. More specifically, the invention relates to a method and system of text categorization to identify software build failure causes in a software or continuous integration pipeline.
Background of the invention
[0002] Software projects that require continuous integration and deployment onto real-world systems are widely used to continuously build, verify, and deploy changes introduced by software developers. In software engineering, a "software build" is defined as a process of converting source code files into standalone software artifact(s) that can be run on an end user computer. However, owing to errors in the software code, a software build may sometimes fail. In software terms this means that, due to an error in the project code, the project does not compile and run properly.
[0003] Text classification plays an important role in identifying errors among software builds. There are several supervised learning methods for text classification, a popular one being Naive Bayes. For the selected features, assigning the weight is an important task as it directly impacts the strength of the model designed. Most techniques in the state of the art use a binary formula when assigning values to the features and a semantic weight relationship algorithm for chaining between the features. There is a need for a new approach where semantic weight relationship scenarios are handled with better accuracy.
Brief description of the accompanying drawings
[0004] An embodiment of the invention is described with reference to the following accompanying drawings:
[0005] Figure 1 depicts a system (100) for text categorization to identify software build failure causes;
[0006] Figure 2 illustrates method steps (200) for text categorization to identify software build failure causes.
Detailed description of the drawings
[0007] Figure 1 depicts a system (100) for text categorization to identify software build failure causes. The system (100) comprises a processor (10) in communication with at least one computing unit (20).
[0008] The computing unit (20) is configured to process software builds. The computing unit (20) can be any computing device/module that is capable of running and executing a software build. Some embodiments of the system may also include an interface (30) in communication with the one or more computing units (20) to monitor build failure causes. The processor (10) can either be a logic circuitry or a software program that responds to and processes logical instructions to produce a meaningful result. A hardware processor (10) may be implemented as any or a combination of: one or more microchips or integrated circuits interconnected using a parent board, hardwired logic, software stored by a memory device and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA). The processor (10) runs a trained artificial intelligence module.
[0009] An AI module with reference to this disclosure can be explained as a component which runs a model. A model can be defined as a reference or an inference set of data, which uses different forms of correlation matrices. Using these models and the data from these models, correlations can be established between different types of data to arrive at some logical understanding of the data. A person skilled in the art would be aware of the different types of AI models such as linear regression, naïve Bayes classifier, support vector machine, neural networks and the like. It must be understood that this disclosure is not specific to the type of model being executed in the AI module and can be applied to any AI module irrespective of the AI model being executed.
[0010] A person skilled in the art will also appreciate that the AI module may be implemented as a set of software instructions, a combination of software and hardware, or any combination of the same. For example, a neural network mentioned hereinafter can be software residing in the system (100) or the cloud, or embodied within an electronic chip. Such neural network (101) chips are specialized silicon chips, which incorporate AI technology and are used for machine learning.
[0011] The processor (10) is configured to: extract a set of features from failure build logs and configuration files of the at least one computing unit (20); select a set of sub-features for each feature based on a semantic connection; rank the set of sub-features based on a hit rate of each of the sub-features to eliminate some sub-features; assign a value of weight to each feature based on the occurrence of its sub-features in the failure build log and the summation of all sub-features corresponding to all features; and identify the build failure cause based on the value of weight assigned to the features. The hit rate is the frequency of usage of a sub-feature by a corresponding feature. The processor (10) is further configured to: pre-process the failure build logs and configuration files to remove stop words and HTML tags; and extract the set of features from the pre-processed data using a backward elimination technique.
[0012] As used in this application, the terms "component," "system," and "module" are intended to refer to a computer-related entity or an entity related to, or that is part of, an operational apparatus with one or more specific functionalities, wherein such entities can be either hardware, a combination of hardware and software, software, or software in execution. As yet another example, interface(s) can include input/output (I/O) components as well as associated processor, application, or Application Programming Interface (API) components.
[0013] As used in this application, if a log contains a specific error, the attributes that contribute towards that error are called features. The sub-features are the different scenarios pointing towards the same error classification or feature. In general, sub-features are extracted for a feature based on the semantic weightage that represents their importance in relation to the feature found in the build log.
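Purely for illustration, and not as a prescribed data structure, features and their sub-features extracted from a failure build log might be represented as follows. The names anticipate the Tool A / Platform W, X, Y, Z example used later in this description; the Tool B and Tool C sub-conditions are hypothetical.

```python
# Hypothetical mapping of features (attributes contributing to an error) to
# their candidate sub-features (scenarios pointing towards that error).
features = {
    "Tool A": ["Platform W", "Platform X", "Platform Y", "Platform Z"],
    "Tool B": ["Config missing", "License expired"],   # assumed sub-conditions
    "Tool C": ["Timeout", "Crash"],                      # assumed sub-conditions
}
```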
[0014] It should be understood at the outset that, although exemplary embodiments are illustrated in the figures and described below, the present disclosure should in no way be limited to the exemplary implementations and techniques illustrated in the drawings and described below.
[0015] Figure 2 illustrates method steps (200) for text categorization to identify software build failure causes. The method steps are executed by the system (100) and its components that have been elucidated in accordance with figure 1.
[0016] Method step 201 comprises extracting a set of features by means of a processor (10) from failure build logs and configuration files from at least one computing unit (20). In an embodiment, the AI module within the processor (10) is trained to extract the features. Extracting the set of features comprises pre-processing the failure build logs and configuration files to remove stop words and HTML tags. The set of features is then extracted from the pre-processed data using a backward elimination technique. The backward elimination technique is based on information gain. Backward elimination is a feature selection technique wherein those features that do not have a significant effect on the outcome are eliminated.
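The disclosure does not prescribe a particular implementation of the pre-processing or of the information-gain-based elimination. The following Python sketch illustrates one possible realisation under the assumption that each feature is a binary presence/absence indicator over a set of labelled historical failure logs; the stop-word list, threshold and function names are hypothetical, and only a simplified single-pass filter by information gain is shown rather than a fully iterative backward elimination.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of"}  # assumed minimal stop list

def preprocess(log_text):
    """Remove HTML tags and stop words from a build log (assumed plain-text format)."""
    text = re.sub(r"<[^>]+>", " ", log_text)          # strip HTML tags
    tokens = re.findall(r"[a-zA-Z_]+", text.lower())  # simple word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

def entropy(labels):
    """Shannon entropy of a list of failure-cause labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Information gain of a binary feature (present/absent) w.r.t. the labels."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def eliminate_features(feature_matrix, labels, min_gain=0.01):
    """Keep only features whose information gain reaches a chosen threshold.
    A full backward elimination would re-evaluate after each removal; this
    single-pass filter is a simplification for illustration."""
    return {name: values for name, values in feature_matrix.items()
            if information_gain(values, labels) >= min_gain}
```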
[0017] To put this in context, in an actual use case the method and the corresponding system (100) are deployed for one of the build types that focuses on a static code check. The static code check has to conform to certain AUTOSAR standards such as MISRA 2012 and MISRA 2014. For this particular use case, tools that are frequently used, such as Code Guide (Tool A), ProcMan (Tool B) and Artus (Tool C), are collected as attributes. Hence these tools are extracted as features, as they are attributes that contribute to the failure. Out of these features (tools), only those that have a significant effect on the outcome, i.e. successful execution of the static code check, are retained.
[0018] Method step 202 comprises selecting a set of sub-features for each feature based on a semantic connection with the feature by means of the processor (10). Taking a cue from the previous example, Tool A is an attribute which has many sub-conditions such as Platform W, X, Y, Z. Here Platform W, X, Y have more semantic weightage, hence only W, X, Y are extracted as sub-features for the feature Tool A. A person skilled in the art will appreciate that the AI module is continuously updated or re-trained with the latest information, and if during that time the sub-condition Z for the attribute Tool A gains more weightage, a re-ranking is performed so that the sub-conditions chained up for feature Tool A become Z, W, X if Y's semantic weight is lower than that of W and X.
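The disclosure leaves open how the semantic weightage between a feature and its candidate sub-conditions is computed by the AI module. As a minimal sketch, assuming those weights are already available as scores (for example produced by the trained AI module), the selection of sub-features for Tool A could look as follows; the numeric weights and the threshold are purely illustrative.

```python
# Assumed semantic weights for the sub-conditions of feature "Tool A"; in
# practice these scores would come from the trained AI module, not constants.
semantic_weight = {
    "Platform W": 0.71,
    "Platform X": 0.68,
    "Platform Y": 0.83,
    "Platform Z": 0.22,
}

def select_sub_features(candidates, weights, threshold=0.5):
    """Keep only sub-conditions with sufficient semantic connection to the feature."""
    return [c for c in candidates if weights.get(c, 0.0) >= threshold]

sub_features = select_sub_features(list(semantic_weight), semantic_weight)
print(sub_features)  # ['Platform W', 'Platform X', 'Platform Y'] -- Z is dropped
```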
[0019] Method step 203 comprises ranking the set of sub-features by means of the processor (10) based on a hit rate of each of the sub-features to eliminate some sub-features. The hit rate is the frequency of usage of a sub-feature by a corresponding feature. Taking a cue from the previous example, the number of times Y is referred to is the highest, followed by X and then W. So the sub-features are ranked Y, X, W. A threshold for the number of sub-features is defined, say two; hence W is eliminated.
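A minimal sketch of the hit-rate ranking follows, assuming the occurrences of each sub-feature are counted from mentions in the failure build logs; the mention counts and the threshold of two sub-features mirror the example above.

```python
from collections import Counter

def rank_by_hit_rate(sub_features, log_mentions, keep=2):
    """Rank sub-features by how often they appear in the failure build logs
    and keep only the top `keep` of them (the threshold from the example)."""
    counts = Counter(m for m in log_mentions if m in sub_features)
    ranked = [name for name, _ in counts.most_common()]
    return ranked[:keep]

# Toy mentions in which Platform Y is referred to most often, then X, then W,
# so with a threshold of two sub-features Platform W is eliminated.
mentions = ["Platform Y"] * 5 + ["Platform X"] * 3 + ["Platform W"]
print(rank_by_hit_rate({"Platform W", "Platform X", "Platform Y"}, mentions))
# ['Platform Y', 'Platform X']
```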
[0020] Method step 204 comprises assigning a value of weight to each feature based on the occurrence of its corresponding sub-features in the failure build log and the summation of all sub-features corresponding to all features by means of the processor (10). For example, a total of 3 features are extracted and there are a total of 8 sub-features across all the features. Now platform Y corresponding to Tool A occurs 5 times and all sub-features combined occur 9 times, so the weightage assigned to Tool A (feature) based on platform Y (sub-feature) would be 5/9.
[0021] Method step 205 comprises identifying the build failure cause based on the value of weight assigned to the features. The feature having the maximum weightage is predicted as a probable reason for the failure.
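The weight assignment of step 204 and the identification of step 205 can be illustrated together. The sketch below reproduces the 5/9 arithmetic of the worked example; the occurrence counts for Tool B and Tool C are assumed values chosen only so that all sub-features combined occur 9 times.

```python
def assign_weights(occurrences):
    """occurrences maps feature -> {sub_feature: count in the failure build log}.
    Each feature's weight is the sum of its sub-feature occurrences divided by
    the total of all sub-feature occurrences across all features."""
    total = sum(c for subs in occurrences.values() for c in subs.values())
    return {feat: sum(subs.values()) / total for feat, subs in occurrences.items()}

def identify_failure_cause(weights):
    """The feature with the maximum weight is predicted as the probable cause."""
    return max(weights, key=weights.get)

# Numbers from the worked example: Platform Y (Tool A) occurs 5 times and all
# sub-features combined occur 9 times, so Tool A receives a weight of 5/9.
occurrences = {
    "Tool A": {"Platform Y": 5},
    "Tool B": {"Config missing": 3},  # assumed count
    "Tool C": {"Timeout": 1},         # assumed count
}
weights = assign_weights(occurrences)
print(weights["Tool A"])                # 0.555... (= 5/9)
print(identify_failure_cause(weights))  # 'Tool A'
```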
[0022] A person skilled in the art will appreciate that while these method steps describe only a series of steps to accomplish the objectives, these methodologies may be implemented with custom modification to the system (100) disclosed.
[0023] The proposed method and system (100) of text categorization identify software build failure causes with greater accuracy. The key to this accuracy lies in the manner in which weightage is assigned to each extracted feature. Legacy feature extraction techniques are not able to achieve dimensionality reduction properly, whereas with this approach a minimal set of features provides maximum coverage to classify failure causes accurately.
[0024] It must be understood that the embodiments explained in the above detailed description are only illustrative and do not limit the scope of this invention. Any modification to method and system (100) of text categorization to identify software build failure causes are envisaged and form a part of this invention. The scope of this invention is limited only by the claims.
Claims: We Claim:
1. A method (200) of text categorization to identify software build failure causes, the method comprising:
extracting (201) a set of features by means of a processor (10) from failure build logs and configuration files from at least one computing unit (20);
selecting (202) a set of sub features for each feature based on a semantic connection with the feature by means of the processor (10);
ranking (203) the set of sub features by means of the processor (10) based on a hit rate of each of the sub features to eliminate some sub features;
assigning (204) a value of weight to each feature based on the occurrence of its corresponding sub-features in the failure build log and summation of all sub-features corresponding to all features by means of the processor (10);
identifying (205) the build failure cause based on the assigned value of weight to the features.
2. The method (200) of text categorization to identify software build failure causes as claimed in claim 1, wherein extracting the set of features comprises:
pre-processing the failure build logs and configuration files to remove stop words and HTML tags;
extracting the set of features from the pre-processed data using a backward elimination technique.
3. The method (200) of text categorization to identify software build failure causes as claimed in claim 1, wherein the hit rate is the frequency of usage of a sub-feature by a corresponding feature.
4. A system (100) for text categorization to identify software build failure causes, the system (100) comprising a processor (10) in communication with at least one computing unit (20), the computing unit (20) configured to process software builds, characterized in that:
the processor (10) configured to:
extract a set of features from failure build logs and configuration files of the at least one computing unit (20);
select a set of sub features for each feature based on the semantic connection;
rank the set of sub features based on a hit rate of each of the sub features to eliminate some sub features;
assign a value of weight to each feature based on the occurrence of its sub-features in the failure build log and summation of all sub-features corresponding to all features;
identify the build failure cause based on the assigned value of weight to the features.
5. The system (100) for text categorization to identify software build failure causes as claimed in claim 4, wherein the processor (10) is further configured to:
pre-process the failure build logs and configuration files to remove stop words and HTML tags;
extract the set of features from the pre-processed data using a backward elimination technique.
6. The system (100) for text categorization to identify software build failure causes as claimed in claim 4, wherein the hit rate is the frequency of usage of a sub-feature by a corresponding feature.