Method And System For Scalable Acceleration Of Data Processing Pipeline
Abstract:
The present disclosure provides scalable acceleration of data processing in a Machine Learning (ML) pipeline, which is unavailable in conventional methods. Initially, the system receives a dataset and a data processing code. A plurality of sample datasets are obtained based on the received dataset using a sampling technique. A plurality of performance parameters corresponding to each of the plurality of sample datasets are obtained based on the data processing code using a profiling technique. A plurality of scalable performance parameters corresponding to each of a plurality of larger datasets are predicted based on the plurality of performance parameters and the data processing code using a curve fitting technique. Simultaneously, a plurality of anti-patterns are located in the data processing code using a pattern matching technique. Finally, an accelerated code is recommended based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique.
[To be published with FIG. 4]
Inventors
1. MISHRA, Mayank
Tata Consultancy Services Limited, Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West, Maharashtra 400607, India
2. BHOWMICK, Archisman
Tata Consultancy Services Limited, Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West, Maharashtra 400607, India
3. SINGHAL, Rekha
Tata Consultancy Services Limited, Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West, Maharashtra 400607, India
Specification
Claims:
WE CLAIM:
1. A processor implemented method (200), the method comprising:
receiving (202), by one or more hardware processors, a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another;
obtaining (204), by the one or more hardware processors, a plurality of sample datasets based on the received dataset and a predefined data sampling size using a sampling technique;
obtaining (206), by the one or more hardware processors, a plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique;
predicting (208), by the one or more hardware processors, a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique;
simultaneously locating (210), by the one or more hardware processors, a plurality of anti-patterns in the data processing code using a pattern matching technique, wherein an anti-pattern is a short term solution prone to adverse consequences during a long term usage; and
recommending (212), by the one or more hardware processors, a plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique, wherein the accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
2. The method as claimed in claim 1, wherein the method of recommending the accelerated code based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using the accelerated code recommendation technique comprises:
receiving the plurality of scalable performance parameters corresponding to each of the plurality of larger datasets and the plurality of anti-patterns;
identifying a plurality of performance bottlenecks associated with the data processing code based on the plurality of scalable performance parameters and the located plurality of anti-patterns using a second profiling technique, wherein each of the plurality of performance bottlenecks are identified based on a change in computing time corresponding to an increase in data size;
selecting the plurality of super-linear bottlenecks from the plurality of performance bottlenecks by sorting the plurality of performance bottlenecks in descending order based on a corresponding computing time, wherein the plurality of performance bottlenecks with the computing time greater than a predefined threshold are selected, and wherein each of the plurality of super-linear bottlenecks causes a scalability problem; and
recommending a plurality of accelerated codes corresponding to each of the plurality of super-linear bottlenecks based on an accelerated code lookup table, wherein the accelerated code lookup table comprises a plurality of bottlenecks and the plurality of accelerated codes corresponding to each of the plurality of bottlenecks.
3. The method as claimed in claim 1, wherein the plurality of performance parameters comprises a computing time, a memory transfer time, a number of processor cores, a cache memory hierarchy, a cache size and a memory bandwidth.
4. The method as claimed in claim 1, wherein the plurality of scalable performance parameters comprise a scalable computing time and a scalable memory transfer time.
5. The method as claimed in claim 1, wherein the pattern matching technique comprises a regular expression based pattern matching and a keyword based pattern matching.
6. A system (100) comprising:
at least one memory (104) storing programmed instructions; one or more Input/Output (I/O) interfaces (112); and one or more hardware processors (102) operatively coupled to the at least one memory (104), wherein the one or more hardware processors (102) are configured by the programmed instructions to:
receive a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another;
obtain a plurality of sample datasets based on the received dataset and a predefined data sampling size using a sampling technique;
obtain a plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique;
predict a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique;
simultaneously locate a plurality of anti-patterns in the data processing code using a pattern matching technique, wherein an anti-pattern is a short term solution prone to adverse consequences during a long term usage; and
recommend a plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique, wherein the accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
7. The system of claim 6, wherein the method of recommending the accelerated code based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using the accelerated code recommendation technique comprises:
receiving the plurality of scalable performance parameters corresponding to each of the plurality of larger datasets and the plurality of anti-patterns;
identifying a plurality of performance bottlenecks associated with the data processing code based on the plurality of scalable performance parameters and the located plurality of anti-patterns using a second profiling technique, wherein each of the plurality of performance bottlenecks are identified based on a change in computing time corresponding to an increase in data size;
selecting the plurality of super-linear bottlenecks from the plurality of performance bottlenecks by sorting the plurality of performance bottlenecks in descending order based on a corresponding computing time, wherein the plurality of performance bottlenecks with the computing time greater than a predefined threshold are selected, and wherein each of the plurality of super-linear bottlenecks causes a scalability problem; and
recommending a plurality of accelerated codes corresponding to each of the plurality of super-linear bottlenecks based on an accelerated code lookup table, wherein the accelerated code lookup table comprises a plurality of bottlenecks and the plurality of accelerated codes corresponding to each of the plurality of bottlenecks.
8. The system of claim 6, wherein the plurality of performance parameters comprises a computing time, a memory transfer time, a number of processor cores, a cache memory hierarchy, a cache size and a memory bandwidth.
9. The system of claim 6, wherein the plurality of scalable performance parameters comprise a scalable computing time and a scalable memory transfer time.
10. The system of claim 6, wherein the pattern matching technique comprises a regular expression based pattern matching and a keyword based pattern matching.
Dated this 14th day of December 2021
Tata Consultancy Services Limited
By their Agent
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN/PA-1086
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR SCALABLE ACCELERATION OF DATA PROCESSING PIPELINE
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of big data and, more particularly, to a method and system for scalable acceleration of data processing pipeline.
BACKGROUND
Machine Learning (ML) is a data-driven approach and is widely used to automate applications. Data pre-processing is a key step in data-driven approaches, especially in the ML pipeline. The data pre-processing includes data cleaning, data transformation, data joins, data visualization for feature identification, and finally, building features for model training. Generally, the data pre-processing consumes a significant portion of the time in the development cycle of the ML pipeline.
Conventionally, data pre-processing operations are implemented using a programming language and are tested for functional correctness and performance compliance using a dataset of smaller size. However, most of the performance bottlenecks are invisible when tested with a smaller sized dataset. Therefore, performance issues emerge when these pre-processing operations are executed on datasets of larger sizes (for example, rows in billions or trillions). In such cases, the approach followed by programmers is to freeze the ML pipeline and fix the performance degradation by adding additional hardware or by changing some portion of the programming code using static analysis. This may lead to an increase in data pre-processing time. Thus, existing methods have limitations in detecting these bottlenecks early, on smaller datasets, which is a hurdle in creating preventive solutions to overcome such challenges.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for scalable acceleration of data processing pipeline is provided. The method includes receiving, by one or more hardware processors, a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another. Further, the method includes obtaining, by one or more hardware processors, a plurality of sample datasets based on the received dataset and a predefined data sampling size using a sampling technique. Furthermore, the method includes obtaining, by one or more hardware processors, a plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique. Furthermore, the method includes predicting, by one or more hardware processors, a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique. Furthermore, the method includes simultaneously locating, by one or more hardware processors, a plurality of anti-patterns in the data processing code using a pattern matching technique, wherein an anti-pattern is a short term solution prone to adverse consequences during a long term usage. Finally, the method includes recommending, by one or more hardware processors, a plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique, wherein the accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
In another aspect, a system for scalable acceleration of data processing pipeline is provided. The system includes at least one memory storing programmed instructions, one or more Input/Output (I/O) interfaces, and one or more hardware processors operatively coupled to the at least one memory, wherein the one or more hardware processors are configured by the programmed instructions to receive a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another. Further, the one or more hardware processors are configured by the programmed instructions to obtain a plurality of sample datasets based on the received dataset and a predefined data sampling size using a sampling technique. Furthermore, the one or more hardware processors are configured by the programmed instructions to obtain a plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique. Furthermore, the one or more hardware processors are configured by the programmed instructions to predict a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique. Furthermore, the one or more hardware processors are configured by the programmed instructions to simultaneously locate a plurality of anti-patterns in the data processing code using a pattern matching technique, wherein an anti-pattern is a short term solution prone to adverse consequences during a long term usage. Finally, the one or more hardware processors are configured by the programmed instructions to recommend a plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique, wherein the accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer program for scalable acceleration of data processing pipeline is provided. The computer readable program, when executed on a computing device, causes the computing device to receive a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another. Further, the computer readable program, when executed on a computing device, causes the computing device to obtain a plurality of sample datasets based on the received dataset and a predefined data sampling size using a sampling technique. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to obtain a plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to predict a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to simultaneously locate a plurality of anti-patterns in the data processing code using a pattern matching technique, wherein an anti-pattern is a short term solution prone to adverse consequences during a long term usage. Finally, the computer readable program, when executed on a computing device, causes the computing device to recommend a plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique, wherein the accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 is a functional block diagram of a system for scalable acceleration of data processing pipeline, in accordance with some embodiments of the present disclosure.
FIG. 2 is an exemplary flow diagram illustrating a method for scalable acceleration of data processing pipeline, implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIG. 3 illustrates a bottleneck inversion problem experienced by conventional methods, in accordance with some embodiments of the present disclosure.
FIG. 4 is a functional architecture for the processor implemented method for scalable acceleration of data processing pipeline implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIGS. 5A through 5E illustrate experimental results depicting performance of the processor implemented method for scalable acceleration of data processing pipeline implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments.
Embodiments herein provide a method and system for scalable acceleration of data processing pipeline by automatically recommending accelerated codes to be used for larger data sizes. Initially, the system receives a dataset and a data processing code. The data processing code is implemented to transform the dataset from one format to another. Further, a plurality of sample datasets are obtained based on the received dataset and a predefined data sampling size using a sampling technique. After sampling, a plurality of performance parameters corresponding to each of the plurality of sample datasets are obtained based on the data processing code using a profiling technique. After profiling, a plurality of scalable performance parameters corresponding to each of a plurality of larger datasets are predicted based on the plurality of performance parameters and the data processing code using a curve fitting technique. Simultaneously, a plurality of anti-patterns are located in the data processing code using a pattern matching technique. An anti-pattern is a short term solution prone to adverse consequences during long term usage. Finally, a plurality of accelerated codes are recommended based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique. The accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
Referring now to the drawings, and more particularly to FIGS. 1 through 5E, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 is a functional block diagram of a system 100 for scalable acceleration of data processing pipeline, according to some embodiments of the present disclosure. The system 100 includes or is otherwise in communication with hardware processors 102, at least one memory such as a memory 104, and an I/O interface 112. The hardware processors 102, the memory 104, and the Input/Output (I/O) interface 112 may be coupled by a system bus such as a system bus 108 or a similar mechanism. In an embodiment, the hardware processors 102 can be one or more hardware processors.
The I/O interface 112 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a printer and the like. Further, the I/O interface 112 may enable the system 100 to communicate with other devices, such as web servers, and external databases.
The I/O interface 112 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the I/O interface 112 may include one or more ports for connecting several computing systems with one another or to another server computer. The I/O interface 112 may include one or more ports for connecting several devices to one another or to another server.
The one or more hardware processors 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, node machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 102 is configured to fetch and execute computer-readable instructions stored in the memory 104.
The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 106. The memory 104 also includes a data repository (or repository) 110 for storing data processed, received, and generated by the plurality of modules 106.
The plurality of modules 106 include programs or coded instructions that supplement applications or functions performed by the system 100 for scalable acceleration of data processing pipeline. The plurality of modules 106, amongst other things, can include routines, programs, objects, components, and data structures, which performs particular tasks or implement particular abstract data types. The plurality of modules 106 may also be used as, signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 106 can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 102, or by a combination thereof. The plurality of modules 106 can include various sub-modules (not shown). The plurality of modules 106 may include computer-readable instructions that supplement applications or functions performed by the system 100 for scalable acceleration of data processing pipeline. In an embodiment, plurality of modules 106 includes a sampling module (shown in FIG. 4), a profiling module (shown in FIG. 4), a prediction module (shown in FIG. 4), an anti-pattern identification module (shown in FIG. 4) and an accelerated code recommendation module (shown in FIG. 4).
The data repository (or repository) 110 may include a plurality of abstracted piece of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 106.
Although the data repository 110 is shown internal to the system 100, it will be noted that, in alternate embodiments, the data repository 110 can also be implemented external to the system 100, where the data repository 110 may be stored within a database (repository 110) communicatively coupled to the system 100. The data contained within such external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1) and/or existing data may be modified and/or non-useful data may be deleted from the database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS).
FIG. 2 is an exemplary flow diagram illustrating a method 200 for scalable acceleration of data processing pipeline implemented by the system of FIG. 1, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more data storage devices or the memory 104 operatively coupled to the one or more hardware processor(s) 102 and is configured to store instructions for execution of steps of the method 200 by the one or more hardware processors 102. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagram as depicted in FIG. 2. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof.
At step 202 of the method 200, the one or more hardware processors 102 are configured by the programmed instructions to receive a dataset and a data processing code, wherein the data processing code is implemented to transform the dataset from one format to another. In an embodiment, the data processing code is a data pre-processing code used in automated ML pipelines.
At step 204 of the method 200, the sampling module executed by the one or more hardware processors 102 is configured by the programmed instructions to obtain the plurality of sample datasets based on the received dataset and a predefined data sampling size (data size) using a sampling technique. In an embodiment, the predefined data sampling size is larger than the size of the cache memory. In an embodiment, the predefined data sampling size refers to a data size reduction factor. For example, if k denotes the data size reduction factor, then k ∈ {1, 1/2, 1/4, 1/8, 1/16, …}. In an embodiment, the sampling technique used in the present disclosure is random sampling without replacement. For temporal data, the sampling technique preserves the temporal order of individual data points.
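By way of illustration only, the following minimal sketch shows one way the sampling step could be realized when the dataset is held in a pandas DataFrame; the function name generate_sample_datasets, the default reduction factors and the temporal_col parameter are assumptions made for this example and are not prescribed by the disclosure.

import pandas as pd

def generate_sample_datasets(df: pd.DataFrame,
                             reduction_factors=(1/16, 1/8, 1/4, 1/2),
                             temporal_col=None,
                             seed=42):
    """Draw sample datasets by random sampling without replacement.

    For temporal data, the original order of the sampled data points is
    preserved by re-sorting on the timestamp column.
    """
    samples = {}
    for k in reduction_factors:
        sample = df.sample(frac=k, replace=False, random_state=seed)
        if temporal_col is not None:
            sample = sample.sort_values(temporal_col)  # preserve temporal order
        samples[k] = sample
    return samples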
At step 206 of the method 200, the profiling module executed by the one or more hardware processors 102 is configured by the programmed instructions to obtain the plurality of performance parameters corresponding to each of the plurality of sample datasets based on the data processing code using a first profiling technique. In an embodiment, the plurality of performance parameters includes a computing time, a memory transfer time, a number of processor cores, a cache memory hierarchy, a cache size and a memory bandwidth. The plurality of performance parameters are alternatively represented as “profile”.
In an embodiment, the first profiling technique includes Scalene, cProfiler, mProfiler, and the like. The output of the profiling technique is in a tuple form, and such a tuple is generated for every line of the data processing code for a particular input data size. The list of such tuples for the complete data processing code for a particular data size is termed the "profile" of the code. The profile is denoted by P_k (processing profile), where k denotes the data size reduction factor and k ∈ {1, 1/2, 1/4, 1/8, 1/16, …}.
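A minimal, illustrative sketch of how a coarse profile P_k could be collected for each sample dataset is given below. A real deployment would rely on a line-level profiler such as Scalene or cProfile; this sketch, by assumption, only records whole-run computing time and peak memory using the standard time and tracemalloc modules, and the names profile_processing_code and process_fn are illustrative.

import time
import tracemalloc

def profile_processing_code(process_fn, sample_datasets):
    """Collect a coarse profile P_k for each sampled dataset.

    process_fn is the data processing code wrapped as a callable taking a
    sample dataset; sample_datasets maps the reduction factor k to a sample.
    """
    profiles = {}
    for k, sample in sample_datasets.items():
        tracemalloc.start()
        start = time.perf_counter()
        process_fn(sample)                          # run the data processing code
        elapsed = time.perf_counter() - start
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        profiles[k] = {"data_size": len(sample),
                       "compute_time_s": elapsed,
                       "peak_memory_bytes": peak_mem}
    return profiles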
At step 208 of the method 200, the prediction module executed by the one or more hardware processors 102 is configured by the programmed instructions to predict the plurality of scalable performance parameters corresponding to each of a plurality of larger datasets based on the plurality of performance parameters and the data processing code using a curve fitting technique. In an embodiment, the plurality of scalable performance parameters includes a scalable computing time and a scalable memory transfer time. In an embodiment, the curve fitting technique used in the present disclosure includes scipy.optimize.curve_fit and numpy.polyfit. The curve fitting technique builds models for predicting the processing time and assesses the scalability bottlenecks when the data size is increased. The present disclosure considers only positive polynomial, exponential, and logarithmic functions for curve fitting, as the processing time of any data processing program increases monotonically with an increase in data size.
The process of curve fitting is explained as follows: given k tuples (s_i, t_i), where i ∈ (1, 2, …, k) and s_i, t_i denote the data size and processing time for the i-th tuple, curve fitting is performed using a set of mathematical functions F. The mathematical function f, where f ∈ F, that results in the least fitting error is selected as the representative curve function for the given tuples. The function f can then be utilized to predict the processing time for data sizes larger than the sizes for which the input tuples are provided. Table I shows the predicted processing time values derived using the mentioned approach.
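As an illustration of this fitting and selection step, the sketch below fits a small set of candidate monotonically increasing functions with scipy.optimize.curve_fit, keeps the one with the least squared fitting error, and extrapolates to larger data sizes. The candidate set, the convention of expressing sizes as multiples of the smallest sample, and the function names are assumptions made for this example.

import numpy as np
from scipy.optimize import curve_fit

# Candidate monotonically increasing model functions f(s); the data size s is
# assumed to be expressed as a multiple of the smallest sample (1, 2, 4, 8, ...).
CANDIDATES = {
    "linear":      lambda s, a, b: a * s + b,
    "quadratic":   lambda s, a, b: a * s ** 2 + b,
    "exponential": lambda s, a, b: a * np.exp(b * s),
    "logarithmic": lambda s, a, b: a * np.log(s) + b,
}

def fit_and_predict(sizes, times, larger_sizes):
    """Fit every candidate, keep the least-error one, and extrapolate."""
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    best_name, best_params, best_err = None, None, np.inf
    for name, f in CANDIDATES.items():
        try:
            params, _ = curve_fit(f, sizes, times, maxfev=10000)
        except RuntimeError:
            continue  # this candidate did not converge
        err = float(np.sum((f(sizes, *params) - times) ** 2))
        if err < best_err:
            best_name, best_params, best_err = name, params, err
    if best_name is None:
        raise RuntimeError("no candidate function could be fitted")
    f = CANDIDATES[best_name]
    return best_name, {s: float(f(s, *best_params)) for s in larger_sizes}

# Example with the Op2 row of Table I (n/8, n/4, n/2, n expressed as 1, 2, 4, 8):
# fit_and_predict([1, 2, 4, 8], [9.6, 23, 81, 327], [16, 32, 128])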
In an embodiment, the scalable processing time (the scalable computing time) and memory usage (the scalable memory transfer time) are computed for every line of the code, yielding the predicted profiles Pr_k for different values of k. The curve fitting technique also considers other parameters in the profiles P_k, such as the number of cores, the cache hierarchy, the cache size and the memory bandwidth. Further, the processing time as well as the memory utilization for larger data sizes (for example, 2n, 4n and 16n in Table I) are predicted (called predicted profiles or the plurality of scalable performance parameters).
For example, Table I shows the processing time profiles P_k as well as their predicted profiles Pr_k associated with an ML pipeline including several data processing operations such as Op1, Op2, and Op3. Table I shows that the processing time taken by all three operations increases with an increase in data size. However, the rate of increase is different for each operation. Op1 consumes the highest processing time (20 sec) among the three (twice that of Op2 and thrice that of Op3) when the data size is n/8; however, when the data size is increased to n, Op2 starts taking about four times more time than Op1. The ratio between the processing times taken by Op1 and Op3 is reduced to 1.3 when the data size is n, from the earlier ratio of 3 when the data size was n/8. The n (data size) in this example is 500k, which is about a 5% sample of the whole dataset made available.
Table I
Bottleneck   Observed Time (Sec)                        Predicted Time (Sec)
             P_(n/8)    P_(n/4)    P_(n/2)    P_n        Pr_2n     Pr_4n     Pr_16n
Op1          20         34         50         77         114       168       358
Op2          9.6        23         81         327        1350      5600      97000
Op3          6.7        11         24         60         158       425       3150
FIG. 3 illustrates the bottleneck inversion problem given in Table I for the method for scalable acceleration of data processing pipeline, implemented by the system of FIG. 1, in accordance with some embodiments of the present disclosure. Now referring to FIG. 3, the X axis represents the data size and the Y axis represents time in seconds. Both the X axis and the Y axis are in the log domain. FIG. 3 shows that the bottlenecks change with data size. The line graphs 302, 304 and 306 indicate the variation in time corresponding to the variation in data size associated with Op1, Op2, and Op3, respectively. The points 308 and 310 are called the bottleneck inversion points. The piece of code which is the biggest bottleneck (Op1) when the data size is n/8 (denoted as 0.125 in the graph) is predicted to be the smallest bottleneck when the data size reaches its actual value of 16 x n (8 to 10 million data points). This changing bottleneck problem is called the Bottleneck Inversion Problem.
At step 210 of the method 200, the anti-pattern identification module executed by the one or more hardware processors 102 is configured by the programmed instructions to simultaneously locate the plurality of anti-patterns in the data processing code using a pattern matching technique. An anti-pattern is a short term solution, prone to adverse consequences during long term usage. In an embodiment, the pattern matching technique includes a regular expression based pattern matching and a keyword based pattern matching.
For example, the plurality of anti-patterns includes loops, dead code, lambda functions, file I/O operations, and data frame operations. The pattern matching technique uses regular expressions and keyword searches for finding bad code designs; it is referred to as the anti-pattern finder and is a modular script. The pattern matching technique can be extended for new anti-patterns. The pattern matching technique (anti-pattern finder) provides the locations of the found anti-patterns in the form of a list of tuples (location metadata). This list of tuples is called the "Anti-pattern list (APL)". For example, ITER refers to an anti-pattern type in which the "iterrows" operation is used with data frames; an example of the ITER anti-pattern is present in Table III. Another example is MEM, which refers to the repeated data operation anti-pattern; an example of the MEM anti-pattern is present in Table VI.
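A minimal sketch of a regular expression and keyword based anti-pattern finder is shown below; the specific signatures and the names ANTIPATTERN_SIGNATURES and find_antipatterns are illustrative assumptions rather than the exact patterns used by the disclosure.

import re

# Illustrative keyword / regular-expression signatures for anti-pattern types.
ANTIPATTERN_SIGNATURES = {
    "ITER": re.compile(r"\.iterrows\s*\("),                 # loop iteration over rows
    "LLF":  re.compile(r"\.apply\s*\(\s*lambda"),           # lengthy lambda operation
    "EMC":  re.compile(r"\.apply\s*\(.*axis\s*=\s*1"),      # possible excessive memory copy
    "MEM":  re.compile(r"\.loc\s*\[.*=="),                  # repeated data selection
}

def find_antipatterns(source_code: str):
    """Return the Anti-pattern list (APL) as (line_number, antipattern_type) tuples."""
    apl = []
    for lineno, line in enumerate(source_code.splitlines(), start=1):
        for name, pattern in ANTIPATTERN_SIGNATURES.items():
            if pattern.search(line):
                apl.append((lineno, name))
    return apl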
At step 212 of the method 200, the accelerated code recommendation module executed by the one or more hardware processors 102 is configured by the programmed instructions to recommend the plurality of accelerated codes based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using an accelerated code recommendation technique. The accelerated code recommendation technique identifies a plurality of super-linear bottlenecks from the plurality of anti-patterns and recommends the corresponding accelerated code.
In an embodiment, the method of recommending the accelerated code based on the plurality of anti-patterns and the predicted plurality of scalable performance parameters using the accelerated code recommendation technique receives the plurality of scalable performance parameters corresponding to each of the plurality of larger datasets and the plurality of anti-patterns. Further, a plurality of performance bottlenecks associated with the data processing code are identified based on the plurality of scalable performance parameters and the located plurality of anti-patterns using a second profiling technique. In an embodiment, the second profiling technique includes Scalene, cProfiler, mProfiler, and the like. Each of the plurality of performance bottlenecks is identified based on a change in computing time (or a change in memory requirement) corresponding to an increase in data size. After identifying the plurality of performance bottlenecks, a plurality of super-linear bottlenecks are selected from the plurality of performance bottlenecks by sorting the plurality of performance bottlenecks in descending order based on a corresponding computing time. The plurality of performance bottlenecks with the computing time greater than a predefined threshold are selected as the plurality of super-linear bottlenecks. Each of the plurality of super-linear bottlenecks causes a scalability problem. Finally, the plurality of accelerated codes corresponding to each of the plurality of super-linear bottlenecks are recommended based on an accelerated code lookup table. The accelerated code lookup table comprises the plurality of bottlenecks and the plurality of accelerated codes corresponding to each of the plurality of bottlenecks. For example, Tables III through VII illustrate some example super-linear bottlenecks and the corresponding accelerated codes.
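By way of a non-limiting example, the sketch below outlines how the recommendation step could combine the predicted per-line profiles with the anti-pattern list and an accelerated code lookup table; the data structures line_profiles and lookup_table, the threshold parameter and the function name are assumptions made for illustration.

def recommend_accelerated_code(line_profiles, apl, lookup_table, time_threshold_s):
    """Recommend accelerated code for crucial super-linear bottlenecks.

    line_profiles: {line_number: {"growth": "super-linear" or "at-most-linear",
                                  "predicted_time_s": float}}
    apl:          list of (line_number, antipattern_type) tuples from the finder.
    lookup_table: {antipattern_type: accelerated code snippet or template}.
    """
    antipattern_by_line = dict(apl)
    # keep only super-linear bottlenecks that also match a located anti-pattern
    bottlenecks = [(line, prof) for line, prof in line_profiles.items()
                   if prof["growth"] == "super-linear" and line in antipattern_by_line]
    # sort in descending order of predicted computing time
    bottlenecks.sort(key=lambda item: item[1]["predicted_time_s"], reverse=True)
    recommendations = []
    for line, prof in bottlenecks:
        if prof["predicted_time_s"] <= time_threshold_s:
            continue  # below the predefined threshold; not a crucial bottleneck
        pattern = antipattern_by_line[line]
        recommendations.append((line, pattern, lookup_table.get(pattern)))
    return recommendations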
In an embodiment, each line of code of the processing code is categorized as either "at-most-linear" or "super-linear" depending on how the processing time changes when the data size is increased. The lines of code categorized as "super-linear" are the ones that cause the scalability problem. When the data size is small, such super-linear bottlenecks' processing time or memory requirement might be relatively smaller than that of other (even linear) bottlenecks. However, for larger data sizes, the super-linear bottlenecks can surpass other bottlenecks in the time and resources required for processing. The operation Op2 in Table I (plotted in FIG. 3) is one such super-linear bottleneck.
In an embodiment, the processing code associated with each of the plurality of crucial performance bottlenecks is replaced with an efficient code (accelerated code). This code replacement can change the bottleneck's nature from super-linear to at-most-linear; such cases occur when the bottleneck is due to sub-optimal coding. However, in certain cases it is impossible to change the bottleneck's nature, namely when the algorithm involved is super-linear in nature. In such cases, replacing individual operations, for example replacing a 'loop iteration' with unrolled loops, simple instructions with vectorized instructions, and converting lists to arrays, can reduce the bottleneck's absolute processing time and memory requirement. The nature of the bottleneck might remain the same (super-linear); however, the curve is pushed towards larger data sizes, which may be well beyond the size of the total data used.
In an embodiment, replacing the code snippet associated with each of the plurality of crucial performance bottlenecks with efficient and faster operations (accelerated code) accelerates the overall data processing operation. Although at-most-linear bottlenecks are not as severe as scalability bottlenecks, reducing their processing time and memory requirement still reduces the overall processing time and memory requirement of the data processing.
Table II provides a glimpse of the accelerations achieved by the present disclosure for the anti-patterns over a data frame containing 500k data points. Table II represents some of the most common anti-patterns which cause bottlenecks in the data processing code of ML pipelines.
Table II
Antipattern                         Time Taken (Sec)                          Speedup
                                    Before acceleration   After acceleration
Loop iterrows (ITER)                740                    0.0046             160,000x
Excessive Memory Copy (EMC)         60                     0.69               87x
Lengthy Operations (LLF)            215                    1.86               115x
Repeated Data Operations (MEM)      327                    1.3                250x
Now referring to Table II, some of the example anti-patterns considered in the present disclosure include Loop iterrows (ITER), Excessive Memory Copy (EMC), Lengthy Operations (LLF) and Repeated Data Operations (MEM).
Loop iterrows (ITER - Loop Iteration over Rows): In order to process data points one by one using loops, Pandas (a Python library) provides an iterable object over the data frame through the iterrows function call. Table III shows a sample code snippet with a loop using iterrows, performing the operation of multiplying the entries of the column price by the entries of the column discount and storing the result in a new column price_final. There is often an if-else condition also involved, as shown in the code snippet. The code in Table III takes around 740 seconds when executed over data containing around 500k rows and 27 columns. When this code is replaced with the corresponding vectorized operation, as shown in the accelerated code snippet in Table III, the time is reduced to just 4.6 milliseconds, i.e., a speedup of almost 160,000x is achieved. It should be noted that the if-else condition is handled by the "where" clause. Further, the code snippet was experimented without the "where" clause by putting the value of the discount column as 0 when there is no discount, and it turned out to be even faster. The code with the loop and iterrows, without if-else, took 701 seconds while the vectorized code took just 2 milliseconds; the vectorized code is almost 2 times faster when "where" is not used. Further, in another embodiment, the vectorized instructions were replaced with a division operation in a similar situation, and the results were similar to those for the multiplication operation mentioned. The conversion from loop-iterrows to vectorized code works when the data to be operated upon is numeric (int, float, etc.) with arithmetic operations involved. When the data is non-numeric (string, textual), the vectorization of code does not give equivalent performance as with numeric data. In such cases, apply(function_name) is used. The experimentation proved that both the vectorized code and the code written using the apply function have a similar performance on non-numeric data. An option is to map the string data to numeric data whenever possible. For example, in the transactional data set, the nature of the interaction between a user and any product is captured as strings with values "view", "order", etc. If these strings are mapped to numeric values "view" = 1, "order" = 2, the processing involving comparison of interaction values can be significantly accelerated as vectorized instructions can be employed (an illustrative mapping sketch is given after Table III).
Table III
code snippet with loop using iterrows (Slow code)
for index, rows in df.iterrows():
    if df.loc[index, 'discount_flag'] == 1:
        df.loc[index, 'price_final'] = df.loc[index, 'price'] * (1.0 - df.loc[index, 'discount'])
    else:
        df.loc[index, 'price_final'] = df.loc[index, 'price']
Vectorized code snippet (accelerated code)
import numpy as np
df['price_final'] = np.where(df['discount_flag'] == 1,
                             df['price'] * (1.0 - df['discount']),
                             df['price'])
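As noted above for non-numeric data, mapping interaction strings to numeric codes allows vectorized instructions to be employed; the following small sketch illustrates such a mapping (the toy data and column names are assumptions, not taken from the disclosure).

import numpy as np
import pandas as pd

# Toy transactional data; column names are illustrative only.
df = pd.DataFrame({"interaction": ["view", "order", "view", "cart", "order"]})

# Map string interaction types to numeric codes so vectorized instructions apply.
interaction_map = {"view": 1, "cart": 2, "order": 3}
df["interaction_code"] = df["interaction"].map(interaction_map)

# Vectorized comparison on the numeric codes (e.g., flag ordered rows).
df["is_order"] = np.where(df["interaction_code"] == interaction_map["order"], 1, 0)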
Excessive memory copying (EMC): A common usage in data frame operations is to process certain columns (of every row) of a data frame and store the processed result in a new column inside the function. This may lead to excessive memory transfer and hence increased execution time. For example, in Table IV, the function is applied to every row of the data frame "df" and the newly generated data is stored in a new column 'new_column' of the data frame "df" inside the function. Here, the whole data frame is transferred twice, for sending to and receiving from the function, with every function call. The slow code snippet shown in Table IV takes around 77.4 seconds on a data frame of size 500k rows and 27 columns. The corresponding accelerated code given in Table IV takes only 3 seconds for the same amount of data, i.e., a speedup of 25x. It can be seen that, though the new data is still created inside the function, it is not stored in the new column of the data frame inside the function. Instead, this new data is returned and stored in the new data frame column outside the function.
Table IV
code snippet EMC (Slow code)
def function(x, attr):
    data = []
    for col in attr:
        data.append({'name': col, 'value': x[col]})
    x['new_column'] = data
    return x
df = df.apply(function, args=[attr,], axis=1)
Accelerated code
def function(x, attr):
    data = []
    for col in attr:
        data.append({'name': col, 'value': x[col]})
    return data
df['new_column'] = df.apply(function, args=[attr,], axis=1)
Lengthy operation as a Lambda Function (LLF): The apply method coupled with the lambda operation provides an easy way to apply several operations over a data frame and is often misused. Table V shows an example of such misuse, which takes around 747 seconds for a data frame of size 2k rows and 27 columns. Here, the lambda function is employed to create a list of product attributes with the key as "productId". The code creates a new dictionary for every lambda function call. The creation of a large number of dictionaries causes unnecessary overhead. It can be seen that "productId" is repeated in the "df" data frame, which causes the same dictionary to be created again and again, as the product attributes do not change in this use case. Note that "prod_df" is a separate data frame which contains "productId" and "Product_attr" as columns.
In an embodiment, instead of using a lengthy operation with the lambda based approach, a separate function "fn" is defined in the present disclosure which utilizes a single dictionary for the whole operation, as shown in Table V. The dictionary "d" is created only once, and product attribute entries are added only once per "productId". The function "fn" thus performs the same operation; however, it takes only 33 milliseconds after the dictionary is created in 20 milliseconds. The speedup is 14,000x.
Table V
LLF (Slow code)
df['products'] = df['products'].apply(lambda x:
    [prod_df.set_index('productId').Product_attr.to_dict()[z] for z in x])
Accelerated code
t1 = tuple(prod_df['productId'])
t2 = tuple(prod_df['Product_attr'])
d = dict(zip(t1, t2))
def fn(x):
    l = [d[z] for z in x]
    return l
df['products'] = df['products'].apply(lambda x: fn(x))
Repeated Data operations (MEM): The code snippet shown in Table VI shows a typical operation involving a data selection operation over data frames. The issue here is that the function (line 1 in the code snippet showing the slow code) is called multiple times for the same values of the arguments (a, b, c, d). The data is selected from the data frame df using the values of the passed arguments and returned as a list. This code takes around 327 seconds for a data frame of size 500k rows. The time can be reduced by eliminating the repeated computations for the same arguments and caching the results for later usage. This is similar to "memoization", and certain libraries in Python provide facilities for this. The present disclosure used one such library, known as "lru_cache", which gave a 2x speedup (165 seconds). Upon analysis, it is found that, on average, the arguments a and b are repeated only 1.96 times, hence only the 2x speedup. In cases where the average repetition of arguments is k, there is an expected speedup of kx.
In an embodiment, higher speedups can be achieved if the memoization is performed beforehand, i.e., even before the actual function is called. Table VI shows the code snippet where a dictionary d is created beforehand with (a, b) as key and (c, d) as value. The dictionary creation requires a single iteration over the data frame, and it only takes 0.59 seconds. Once the dictionary is created, it can be used inside the function to serve the requests, as shown in Table VI. After creating the dictionary, the function requires only 0.71 seconds to execute, which is an acceleration of 250x. The acceleration factor includes the time taken to create the dictionary as well (327/(0.59 + 0.71) = 250). There were two more similar cases in the ML pipeline implemented by the present disclosure, wherein the operations were repeated. The speedups achieved by creating a dictionary were 46x (reduced from 60 seconds to 1.28 seconds) and 90x (reduced from 215 seconds to 2.37 seconds). An illustrative sketch of both memoization approaches is given after Table VI.
Table VI
MEM (Slow code)
def function(a, b, c, d):
    new_df = df.loc[(df['A'] == a) & (df['B'] == b)]
    item_list = new_df.loc[df['C']
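Since the accelerated variant of Table VI is not reproduced above, the following minimal sketch illustrates the two memoization options discussed: caching repeated calls with functools.lru_cache, and building a dictionary keyed on (a, b) beforehand. The toy data frame, column names and function names are illustrative assumptions.

from functools import lru_cache
import pandas as pd

# Toy data frame standing in for the 500k-row data frame df.
df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4], "C": [10, 20, 30]})

# Option 1: cache repeated calls with the same arguments (memoization).
@lru_cache(maxsize=None)
def select_items(a, b):
    new_df = df.loc[(df["A"] == a) & (df["B"] == b)]
    return tuple(new_df["C"])

# Option 2: build the dictionary keyed on (a, b) beforehand in a single pass,
# then serve the look-ups from it inside the processing function.
lookup = {key: tuple(group["C"]) for key, group in df.groupby(["A", "B"])}

def select_items_fast(a, b):
    return lookup.get((a, b), ())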