Abstract: Embedded systems, such as drones, employing deep learning models are often resource constrained. Traditional approaches have utilized input buffering and network offloading/partitioning techniques to overcome various limitations but have failed to handle scenarios where the data arrival rate and input data volume increase over time, thus affecting storage and network usage. Embodiments of the present disclosure provide systems and methods that implement layer-wise partitioning of neural networks, and fusion thereof, on a resource constrained system where such a model cannot be loaded in its entirety at a time. A best set of layers is determined for processing at a time in order to obtain the best inference latency. A lightweight optimization model is derived for online fusion of the layers of the already partitioned network on a device, based on its memory and dynamic system load. [To be published with FIG. 3]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
PRE-PARTITIONING AND FUSING OF NEURAL NETWORK
LAYERS FOR EXECUTION IN, AND OPTIMIZING
RESOURCE CONSTRAINED SYSTEMS
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD [001] The disclosure herein generally relates to neural network partitioning and fusion techniques, and, more particularly, to pre-partitioning and fusing of neural network layers for execution in and optimizing resource constrained systems.
BACKGROUND [002] With increased focus on in situ analytics, artificial intelligence (AI) algorithms are getting deployed on embedded devices at the network edge. The growing popularity of deep learning (DL) and inference, largely due to the minimization of feature engineering and the availability of pre-trained models and fine-tunable datasets, especially in image and video analytics, has made these de facto standards. However, the embedded devices employing these models are often resource constrained and fail to handle scenarios where the data arrival rate and input data volume increase over a given time period. This has a direct effect on the storage and network usage of such devices, rendering traditional strategies of input buffering and network offloading inadequate. Partitioning and partial execution of the DL inference phase enable inelastic embedded systems to support varying sensing rates and large data volumes.
SUMMARY [003] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method for optimizing resource constrained systems. The method comprises obtaining, via one or more hardware processors, information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the
input data comprises at least one of an input data rate and an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network; creating, via the one or more hardware processors, based on the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold; determining, via the one or more hardware processors, a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) storage capacity of the memory comprised in the resource constrained system, based on the obtained information, and executing the determined set of layers to obtain a desired output pertaining to the specific requirement of the application; and upon executing the determined set of layers, mapping, via the one or more hardware processors, one or more corresponding partitioned blocks from the created set of partitioned blocks to the determined set of layers, wherein an analytics report is generated based on the mapping and the execution time of each layer comprised in the determined set of layers.
[004] In an embodiment, the method further comprises profiling the analytics report into the resource constrained system for subsequent layer partitioning and fusing for execution thereof in the neural network comprised in the resource constrained system.
[005] In an embodiment, the method further comprises: fine-tuning one or more hyperparameters of the neural network based on the analytics report; and predicting execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system.
[006] In an embodiment, the step of determining, a set of layers from a plurality of layers comprised in the neural network comprises: dynamically performing, during an execution of a layer, a comparison of storage and memory of the resource constrained system with a pre-defined threshold; and backtracing within the determined set of layers, based on the comparison to determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution.
[007] In an embodiment, the method further comprises predicting completion time of partial execution of the determined set of layers based on the analytics report.
[008] In another aspect, there is provided a system for optimizing resource constrained systems. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: obtain information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the input data comprises at least one of an input data rate and an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network; create, based on the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold; determine a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) storage capacity of the memory comprised in the resource
constrained system, based on the obtained information, and execute the determined set of layers to obtain a desired output pertaining to the specific requirement of the application; and upon executing the determined set of layers, map one or more corresponding partitioned blocks from the created set of partitioned blocks to the determined set of layers, wherein an analytics report is generated based on the mapping and the execution time of each layer comprised in the determined set of layers.
[009] In an embodiment, the one or more hardware processors are further configured by the instructions to profile the analytics report into the resource constrained system for subsequent layer partitioning and fusing for execution thereof in the neural network comprised in the resource constrained system.
[010] In an embodiment, the one or more hardware processors are further configured by the instructions to: fine-tune one or more hyperparameters of the neural network based on the analytics report; and predict execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system.
[011] In an embodiment, the one or more hardware processors determine, the set of layers from the plurality of layers comprised in the neural network by: dynamically performing, during an execution of a layer, a comparison of storage and memory of the resource constrained system with a pre-defined threshold; and backtracing within the determined set of layers, based on the comparison to determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution.
[012] In an embodiment, the one or more hardware processors are further configured by the instructions to predict completion time of partial execution of the determined set of layers based on the analytics report.
[013] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or
more instructions which when executed by one or more hardware processors cause optimizing of resource constrained systems by obtaining, via the one or more hardware processors, information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the input data comprises at least one of an input data rate and an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network; creating, via the one or more hardware processors, based on the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold; determining, via the one or more hardware processors, a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) storage capacity of the memory comprised in the resource constrained system, based on the obtained information, and executing the determined set of layers to obtain a desired output pertaining to the specific requirement of the application; and upon executing the determined set of layers, mapping, via the one or more hardware processors, one or more corresponding partitioned blocks from the created set of partitioned blocks to the determined set of layers, wherein an analytics report is generated based on the mapping and the execution time of each layer comprised in the determined set of layers.
[014] In an embodiment, the instructions which when executed by the one or more hardware processors further cause profiling the analytics report into the resource constrained system for subsequent layer partitioning and fusing for
execution thereof in the neural network comprised in the resource constrained system.
[015] In an embodiment, the instructions which when executed by the one or more hardware processors further cause fine-tuning one or more hyperparameters of the neural network based on the analytics report; and predicting execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system.
[016] In an embodiment, the step of determining, a set of layers from a plurality of layers comprised in the neural network comprises: dynamically performing, during an execution of a layer, a comparison of storage and memory of the resource constrained system with the pre-defined threshold; and backtracing within the determined set of layers, based on the comparison to determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution.
[017] In an embodiment, the instructions which when executed by the one or more hardware processors further cause predicting completion time of partial execution of the determined set of layers based on the analytics report.
[018] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS [019] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[020] FIG. 1 depicts a normalized layer-wise execution latency versus size of the output blob for a typical convolution neural network (CNN).
[021] FIG. 2 depicts an exemplary block diagram of a system for pre-partitioning and fusing of neural network layers for execution in and optimizing
resource constrained systems, in accordance with an embodiment of the present disclosure.
[022] FIG. 3 depicts an exemplary flow chart for pre-partitioning and fusing of neural network layers for execution in and optimizing resource constrained systems of FIG. 2 in accordance with an embodiment of the present disclosure.
[023] FIG. 4A depicts a normalized layer-wise execution latency and size of an intermediate storage output after processing each layer, in accordance with an embodiment of the present disclosure.
[024] FIG. 4B depicts a layer-wise partitioning of deep inference, in accordance with an embodiment of the present disclosure.
[025] FIG. 5 depicts a partial inference and a full inference modelled as a queuing system, in accordance with an embodiment of the present disclosure.
[026] FIGS. 6A-6D depict graphical representations illustrating storage load of full inference versus partial inference with respect to sensing period of a Residual Network (ResNet) - a convolutional neural network, in accordance with an embodiment of the present disclosure.
[027] FIG. 7A depicts a graphical representation illustrating average delay per inference for the ResNet - convolutional neural network, in accordance with an embodiment of the present disclosure.
[028] FIG. 7B depicts a graphical representation illustrating storage load with respect to sensing rate for a Visual Geometry Group (VGG) – a convolutional neural network, in accordance with an embodiment of the present disclosure.
[029] FIG. 7C depicts a graphical representation illustrating average delay per inference for the VGG convolutional neural network, in accordance with an embodiment of the present disclosure.
[030] FIG. 8 depicts buffer copy overhead of partitioned inference, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[031] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[032] Edge computing is emerging as a viable framework for analyzing and generating actionable knowledge from the huge data generated in a connected digital world. For instance, in the literature 'Deep Learning', The MIT Press, authored by Ian Goodfellow, Yoshua Bengio and Aaron Courville in 2016, hereinafter referred to as Ian et al., it is determined that DL algorithms have grown in popularity due to the minimization of complex feature engineering, resulting in faster time to market, and the availability of pre-trained models that can be fine-tuned on relatively smaller datasets to achieve good results, specifically in image and video analytics. Running DL inference, often called Deep Inference (DI), is becoming realizable on embedded systems primarily due to the reduced size and resource requirements of several Edge-targeted DI models and custom hardware supporting DI.
[033] Embedded systems employing these DI models are often constrained in nature, with capacity just enough to handle most of the inference workloads. Production ready DI models are benchmarked on such platforms and have a maximum inference rate support (e.g., several images classified per second). In worst case scenarios, there can be a temporary surge in the frames-per-second requirement owing to temporal variations (e.g., glacier motion analysis, as discussed by Jane K Hart et al., in 'Surface melt driven summer diurnal and winter multi-day stick-slip motion and till sedimentology', Nature Communications), simultaneous analysis of multi-sensor data (e.g., at a sensor edge gateway as discussed in 'Adaptive fusion for Diurnal Moving Object Detection' by Sohail Nadimi in 2004), simultaneous usage in multiple application scenarios (e.g., a drone doing crowd monitoring and surveillance but taking part in hazard detection or search and rescue in an emergency, as discussed in 'Persistent surveillance with small unmanned aerial vehicles (sUAV)' by Abdulrahman Almarzooqi, Abdullah Al Saadi Al Mansoori, Slim Sayadi, Issacniwas Swamidoss, Ohood Al Nuaimi et al. in 2018), etc. The current state-of-the-art techniques involve input buffering, offloading the input over the network (e.g., refer 'Adaptive resource management and scheduling for cloud computing' by Florin Pop et al.) and down-sampling the input, which directly affect storage, network usage and DI accuracy, respectively. Changes in functional and non-functional properties of input data over time may result in both accuracy loss and performance degradation. While there have been efforts (e.g., refer 'Continual Lifelong Learning with Neural Networks' by German I. Parisi et al.) to tackle the accuracy fall issue, the requirement of adapting to a different performance profile remains unaddressed.
[034] On the other hand, ready-made pre-trained deep learning models, such as deep convolutional neural network (DCNN) models, can be used for image, speech, and text classification/prediction. These models give better accuracy than state-of-the-art shallow learning models. The models are often big and may not fit in the runtime memory of sensors and other constrained devices. There is a need to execute DL models and perform in-situ analytics. The current technology has methods for optimized loading of the weight matrix of a deep learning pre-trained model to suit available memory and computing resources. However, there may not be straightforward ways to do inference using a model that is much larger than the runtime memory of an embedded device that may be required to perform in-situ analytics. One technique to do such inference is to break a model such that parts of the model can be loaded in the runtime memory, and the results of the partial operations can be stored in the persistent memory of the device and fed in when the next part of the model is executed. Such breaking of a model can be done using layer-wise partitioning, which is sequential in nature as the input flows through a DL inference model, getting transformed at each layer to output the desired result. Depending on the memory capacity of the embedded device performing the inference, a set of layers is executed at a time. Thus, not only layer-wise partitioning but also fusion of a set of layers is required to do the DL inference, so that the inference takes the minimum possible time to execute for a specific embedded device. At the same time, the storage space in the device should not overflow when full inference cannot be completed. Such tasks can be done for a specific workload by an engineer with niche skills, on a case-to-case basis, but no method or system is available to automate a broad range of such tasks that require running artificial intelligence-based algorithms in embedded systems.
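By way of a non-limiting illustrative example, the following sketch outlines the chunked execution described above, wherein only one part of the model is loaded into runtime memory at a time and the intermediate output is persisted for the next part. The load_block() helper, the block identifiers and the checkpoint path are hypothetical placeholders and shall not be construed as limiting the scope of the present disclosure.

import pickle

def run_partitioned_inference(block_ids, load_block, input_data,
                              checkpoint_path="/tmp/intermediate.pkl"):
    # Execute a model that is larger than runtime memory, one block at a time.
    data = input_data
    for block_id in block_ids:
        block = load_block(block_id)            # load only this block's layers/weights
        for layer_fn in block:                  # a block is a list of callable layers
            data = layer_fn(data)
        with open(checkpoint_path, "wb") as f:  # persist the intermediate output
            pickle.dump(data, f)
        del block                               # free runtime memory before the next block
        with open(checkpoint_path, "rb") as f:  # resume from persistent storage
            data = pickle.load(f)
    return data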
[035] Therefore, there is a need for an in-situ partition of the execution of a Deep Neural Network on an embedded platform that enables the system to handle peak incoming data rate, even when engineered for an average incoming data rate scenario, wherein the system can maximally utilize the limited amount of storage available, without losing data and provide handles to the user to select and maximize/minimize or balance a set of performance targets, viz. inference rate, power usage, storage usage, etc.
[036] In this regard, embodiments of the present disclosure provide systems and methods to address the above problem by using partial DI, i.e., doing only a part of the DI during a temporary surge in the frames-per-second requirement and either storing the intermediate output for processing in a leaner period or offloading the same to a networked server. Such partial inference can be achieved by using layer-wise partitioning of a deep neural network graph. In DI, input data flows through a pre-determined computation graph, getting transformed into a class/feature representation (usually compressed) through a set of weighted, non-linear transfer functions. The computation graphs are directed and acyclic, with a precedence constraint, which states that a layer cannot be processed before it gathers output from all its preceding layers. Each of these layers has a corresponding execution latency and output data size associated with it, which can be profiled offline on a given platform. FIG. 1 depicts a normalized layer-wise execution latency versus size of the output blob for a typical convolution neural network (CNN).
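By way of a non-limiting example, the following sketch illustrates how such offline, per-layer profiling of execution latency and output size may be performed on a given platform. The layers are modelled here as plain Python callables, and sizeof_output() is a hypothetical stand-in for measuring the size of a layer's output blob.

import sys
import time

def sizeof_output(blob):
    # Stand-in for measuring the intermediate output blob size;
    # for a numpy array this would simply be blob.nbytes.
    return getattr(blob, "nbytes", sys.getsizeof(blob))

def profile_layers(layers, sample_input):
    # Run one sample through the graph and record latency and output size per layer.
    profile = []
    data = sample_input
    for index, layer_fn in enumerate(layers):
        start = time.perf_counter()
        data = layer_fn(data)
        latency = time.perf_counter() - start
        profile.append({"layer": index,
                        "latency_s": latency,
                        "output_bytes": sizeof_output(data)})
    return profile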
[037] Given an inference graph, a sensing rate and an entry point in the graph, the goal of the present disclosure is to maximize the number of layers that can be processed before a new input arrives, keeping the output size as small as possible. In other words, the inference system (also referred to as the system of the present disclosure) stores/offloads encoded features instead of the input, and this encoded output comes out of a specific layer of the DI graph, depending on the inference rate requirement and storage/network bandwidth availability. In the present disclosure, the use of DI graph partitioning is investigated to achieve partial inference with an aim to address the following:
1. Is partitioning of the DI graph beneficial for partial inference in terms of the latency-output size tradeoff? Is the same benefit available across different popular CNN based object detection models?
2. Through partial inference, the systems and methods of the present disclosure aim to complete inference without losing input data, albeit finishing the task in a delayed fashion. How much delay do the partial inference tasks experience in case of real-world CNN based object detection systems? Is there an overall betterment in inference throughput due to partial inference?
3. Is partial inference supported by the popular DL frameworks and runtimes?
[038] After demonstrating the efficacy of partitioning and partial inference, the systems and methods of the present disclosure present the design and evaluation of a low-overhead partial DI algorithm which enables devices (e.g., resource constrained systems) to handle a time varying data arrival rate along with large sized inputs (e.g., image, video, etc.). The technique as implemented by the embodiments of the present disclosure achieves the goal of forward propagation through the maximum number of layers while at the same time minimizing the output size.
[039] As a sideband, on-device layer-wise partitioned DI models can be used to perform inference on devices with an ensemble of on-board hardware accelerators without any loss of accuracy.
[040] The idea of leveraging the power of DI on embedded devices was first systematically analysed by Lane et al. in 2015, which underlines three perspectives, namely accuracy, resource requirement and scalability. All these are still major trade-off points while taking a DL model to an embedded device. Reduction of the size of standard DL models to bring down the resource requirement, keeping the inference accuracy as close as possible to the original, has been a major focus area during the past few years. Network compression by pruning and sharing redundant connections, reduction of bits-per-weight through quantization, decomposition of convolution kernels to low rank tensors through factorization, modeling the esoteric knowledge of a larger model to a smaller one through distillation, etc., are some of the effective technique(s) as known in the art and as discussed by Yu Cheng et al., in 'A survey of Model compression and acceleration for deep neural networks' in 2017. Alongside the structural transformations of the DL model, the state of the art for AI friendly embedded hardware has also evolved rapidly as discussed in literature (e.g., refer 'Domain-specific accelerator design and profiling for deep learning applications' by Andrew Bartolo et al., in 2019). More specifically, in this literature, the increased importance of architecture aware software that is capable of squeezing performance from an ensemble of traditional processors and massively parallel custom chipsets is summarized. The bounds on energy requirement, fabrication area and latency of a custom MAC accelerator are established by developing an open source DL execution simulator on NVDLA. For hardware acceleration of CNNs on FPGA and ASIC, the inherent parallelism of convolution algorithms, linear algebra operations and processor-memory dataflow, along with weight storage schemes, are mapped to the architecture of the underlying specialized hardware. Frameworks such as fpgaConvNet automate the mapping of CNN models onto FPGAs, allowing the user to choose desired optimization goals including latency and throughput. Apart from exploiting the parallelism in convolution and fully connected CNN layers, the partitioning of a neural network has also been discussed previously. Even after embedding a DL model on a device with a certain throughput-latency characteristic, there is no existing mechanism for the device to adapt its inference to a varying arrival rate of large inputs like images and video frames.
[041] Referring now to the drawings, and more particularly to FIGS. 1 through 8, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[042] FIG. 2 depicts an exemplary block diagram of a system 100 for pre-partitioning and fusing of neural network layers for execution in and optimizing resource constrained systems, in accordance with an embodiment of the present disclosure. The system 100 may also be referred as ‘resource constrained system’, ‘optimization system’, ‘embedded device’, ‘resource constrained device’ and may be interchangeably used hereinafter. In one embodiment, examples of resource constrained systems comprise, but are not limited to, Internet of Things (IoT) devices, drones, light weight robots, mobile robots, and the like. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[043] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface,
and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[044] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises information, for example, profiling parameters of resource constrained devices, input data that is going to be processed, wherein the input data comprises information such as a specific requirement associated with an application (e.g., applications comprise but are not limited to image analysis, speech analysis, video analysis, and the like), wherein the input data comprises at least one of an input data rate and an input data size, and wherein the input data pertains to a specific requirement (e.g., say image classification within a stipulated time) of an application/domain, an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network, historical data comprising information on layers partitioned and fused for similar, near similar, and identical systems in the past, an analytics report comprising mapping of determined layers for execution with partitioned blocks in the neural network, and the like. In an embodiment, the memory 102 may store (or stores) one or more techniques (e.g., the neural network, partitioning technique, fusing technique, and the like). The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. More specifically, information pertaining to provisioning the neural network in the resource constrained system and testing new incoming data thereof, and the like, may be stored in the memory 102. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
[045] FIG. 3, with reference to FIGS. 1-2, depicts an exemplary flow chart for pre-partitioning and fusing of neural network layers for execution in and optimizing resource constrained systems 100 of FIG. 2, in accordance with an embodiment of the present disclosure. In an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 of FIG. 2, and the flow diagram as depicted in FIG. 3. Object detection models are meta architectures that use CNNs, such as ResNet, as feature extractors to detect and localize objects inside bounding boxes. For example, Faster R-CNN and SSD are two state-of-the-art object detection architectures. While running an object detection model in a fire rescue scenario using a Raspberry Pi based drone with a USB connected inference accelerator, it was observed by the present disclosure through experiments that the inference algorithm was missing frames when it had to detect faces of known people. The face detection algorithm required a higher frame rate to get a clear facial view, more than the inference rate supported by the drone platform. Using input buffering as a workaround, it was found by the present disclosure through experiments that the limited on-board storage is often not enough if the input size is large and the period of surge in incoming data rate is long. In this regard, at step 302 of the present disclosure, the one or more hardware processors 104 obtain information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the input data comprises at least one of an input data rate and an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, and an execution time of each layer comprised in the neural network. The memory as referred herein may be at least one of an on-board memory of the resource constrained system or an external memory (e.g., secondary storage component) connected via interfaces to the resource constrained system 100. Further, the intermediate storage output of each layer refers to the layer output that is to be stored before giving it to the next layer. The execution time of each layer comprised in the neural network may comprise layer latency information or execution information of the layer, in one example embodiment.
[046] For instance, the profiling parameters comprise functional requirements (e.g., accuracy of classification, classifications per second) and non-functional requirements (e.g., hardware budget, storage, memory, processor, power, and weight for resource constrained systems such as drones) of the application. A suitable model that meets the functional requirements is also obtained as part of the information, for object detection and localization, and for person detection, counting and identification. The resource constrained device may comprise an IoT device (smart camera / camera with Edge computer) and the model/partitioned blocks is/are transformed for it. Further, the input data as mentioned above may refer to running a sample workload of the application, and the average data arrival rate for different tasks is determined or obtained from domain knowledge comprised in the memory 102 of the system 100. The intermediate storage output of each layer comprised in a neural network of the resource constrained system and an execution time of each layer comprised in the neural network are obtained by executing the model, wherein the layer wise execution latency and intermediate data output for each layer are determined.
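By way of a non-limiting example, the information obtained at step 302 may be held in a simple container such as the sketch below. The field names (e.g., memory_bytes, input_rate_hz) are hypothetical and are chosen only to mirror the profiling parameters, the input data characteristics and the per-layer measurements described above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class LayerProfile:
    latency_s: float            # execution time of the layer on the target platform
    output_bytes: int           # intermediate storage output after this layer

@dataclass
class DeviceProfile:
    memory_bytes: int           # size of the on-board (or attached) memory
    storage_bytes: int          # persistent storage available for intermediates
    processing_speed_hz: float  # processing speed of the device

@dataclass
class WorkloadInfo:
    input_rate_hz: float        # data arrival rate of the application
    input_bytes: int            # size of one input sample (e.g., one frame)
    layer_profiles: List[LayerProfile] = field(default_factory=list)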
[047] FIG. 4A depicts a normalized layer-wise execution latency and size of an intermediate storage output after processing each layer, in accordance with an embodiment of the present disclosure. The aim of the present disclosure is to process the maximum number of layers with minimal intermediate storage. Directed Acyclic Graph (DAG) scheduling on task precedence graphs as known in the art can be used to model such tree-structured inference graphs with temporal dependency, though such algorithms are best suited for inherently parallel tasks on multiple processors. Due to the sequential nature of the DI graph, the current assignment problem can be modelled as a chain-structured pipeline on a single system at different temporal windows, as known in the art. Given a sequence of layer execution latencies {li}, a sequence of layer output sizes {oi}, the current sensing period δ and an entry point in the inference graph, the aim is to select the maximum number of consecutive layers S such that the cumulative execution latency Lc = ∑i∈S li satisfies the constraint Lc ≤ δ, where Lc is the cumulative execution latency of the selected layers. Using a dynamic programming strategy, a sequential selection of layers guarantees minimum overall execution latency. However, as shown in FIG. 4B, if there is a requirement to minimize intermediate storage as well, a modified strategy is required. More specifically, FIG. 4B depicts a layer-wise partitioning of deep inference, in accordance with an embodiment of the present disclosure.
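By way of a non-limiting example, the selection objective formalized above may be realized by a simple scan over the profiled latencies, as in the sketch below; the latencies are assumed to come from the offline profile, and the 100 ms example values are hypothetical.

def select_layers(latencies, start, delta):
    # Include consecutive layers from 'start' while the cumulative latency Lc
    # stays within the sensing period delta.
    cumulative = 0.0
    i = start
    while i < len(latencies) and cumulative + latencies[i] <= delta:
        cumulative += latencies[i]
        i += 1
    return list(range(start, i)), cumulative   # selected layer indices and Lc

# Example: select_layers([0.04, 0.03, 0.05, 0.02], start=0, delta=0.10)
# returns ([0, 1], 0.07), i.e., two layers fit within a 100 ms sensing period.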
[048] With 250 milliseconds and 100 milliseconds being the per-frame DI time and the sensing period respectively, the partitioning in row 3 results in less storage usage than that of row 2. With hundreds of layers in CNNs and different optimization objectives, handling such problems in an embedded middleware is nontrivial.
[049] At step 304 of the present disclosure, the one or more hardware processors 104 create, using the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold. For instance, the pre-defined threshold is, say, 5 minutes of execution time for a specific requirement (e.g., say image classification) of an application (e.g., an image detection/classification model/technique). Therefore, the pre-partitioned blocks have one or more layers whose cumulative execution time is either less than 5 minutes, greater than 5 minutes or equal to 5 minutes. In other words, fusing of layers includes creating pre-partitioned models (herein referred to as partitioned blocks) of different layer granularity from the full model (e.g., the neural network comprised in the system 100), wherein the model is partitioned layer by layer or more than one layer is fused together. The individual partitions of the pre-partitioned neural network comprise a set of layers that have a cumulative execution latency of similar (or near similar) order, based on the data arrival interval.
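By way of a non-limiting example, the following sketch groups profiled layers into pre-partitioned blocks whose cumulative execution time is of the same order as the pre-defined threshold (e.g., the expected data arrival interval); a trailing block may remain below the threshold, matching the three cases described above. The function name and the dictionary layout are hypothetical.

def create_partitioned_blocks(latencies, threshold):
    # Fuse consecutive layers until the block's cumulative latency reaches
    # (or just exceeds) the pre-defined threshold, then start a new block.
    blocks, current, cumulative = [], [], 0.0
    for index, latency in enumerate(latencies):
        current.append(index)
        cumulative += latency
        if cumulative >= threshold:
            blocks.append({"layers": current, "cumulative_latency": cumulative})
            current, cumulative = [], 0.0
    if current:                                # trailing block below the threshold
        blocks.append({"layers": current, "cumulative_latency": cumulative})
    return blocks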
[050] At step 306 of the present disclosure, the one or more hardware processors 104 determine a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) the storage capacity of the memory comprised in the resource constrained system, based on the obtained information, and further execute the determined set of layers to obtain a desired output pertaining to the specific requirement of the application. For instance, if the application is image processing, the specific requirement may be classification of images, and the desired output generated by the determined set of layers may be the image classification. During runtime, the system 100 tries to maximize the number of layers that can be executed for a partial inference within the gap between the arrival of the current and the next sample. Partial inference betters the state-of-the-art approach(es) by eliminating the need of buffering or dropping frames altogether. By default, it tries to fit as many layers as possible, and if it is dynamically determined that the device (such as the drone as mentioned above) has available storage and memory less than a threshold, the algorithm (or partitioning technique) tries to backtrack and checks the effect on storage and latency if the current layer is not included.
[051] Below is illustrated a pseudo code of the partitioning technique as implemented by the system 100 of the present disclosure, by way of an example, and it shall not be construed as limiting the scope of the present disclosure:
Pseudo code for inference graph partitioning:
1. Function findPartitions(int start)
2.   i ← start;
3.   while cumuLatency ≤ sensingPeriod and i ≤ nLayers do
4.     cumuLatency ← cumuLatency + getNextLayerLatency(i);
5.     i ← i + 1
6.   end
7.   if storageAlert is True then
8.     i ← i + 1
9.     latencyLoss ← getNextLayerLatency(i);
10.    storageGain ← 0;
11.    optimalStorage ← getNextLayerStorage(i);
12.    while (α * cumuLatency / 100) ≥ latencyLoss do
13.      storageGain ← storageGain + getNextLayerStorage(i) - getNextLayerStorage(i-1);
14.      if storageGain ≥ (β * optimalStorage / 100) then
15.        return start, i-1;
16.      else
17.        i ← i - 1;
18.        latencyLoss ← latencyLoss + getNextLayerLatency(i);
19.      end
20.    end
21.  end
22.  else
23.    return start, i-1;
24.  end
25. end
[052] The above pseudo code of the partitioning technique, when implemented by the present disclosure, by default maximizes the number of layers processed within a deadline (e.g., the time before a new input arrives) and minimizes the intermediate storage if there is a storage constraint, which can be controlled using the storage alert flag. The multi-objective nature of the pseudo code of the partitioning technique as implemented by the present disclosure can result in two types of solutions, where an additional layer may or may not result in an increased size of the intermediate storage output. In case of the former, careful study of layer latency versus intermediate storage size as depicted in FIG. 4A reveals that one general trend in most CNNs is that the intermediate storage output size generally reduces with the increase in layer index, i.e., there is less chance of pareto optimal solutions being generated with the default objective of processing as many layers as possible in lines 1-4. For the latter case, a pareto optimal or non-dominated solution is generated only when a selection of an intermediate convolution layer after a pooling layer occurs.
[053] Based on these findings, the systems and methods of the present disclosure provide hyperparameters α and β for setting the relative weightage of processing more layers and of storage gain, respectively. This is a standard practice in case of pareto optimality. In the present disclosure, the targets of those hyperparameters are defined upfront, which is non-standard. In lines 12-15 of the above pseudo code (also referred to as the partitioning algorithm and interchangeably used hereinafter), the partitioning technique backtracks and finds the partition that has the least output size within a given latency bound, based on these hyperparameters. In other words, determining the set of layers from the layers comprised in the neural network includes dynamically performing, during execution of one or more layer(s), a comparison of the storage and memory acquired in the resource constrained system with the pre-defined threshold (e.g., say 5 minutes as described above). The storage and memory acquired in the resource constrained system during the execution enables the system 100 to backtrace within the determined set of layers and determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution. For instance, when a layer is executed, data storage information and memory utilization information are recorded, which get compared with the corresponding pre-defined thresholds (e.g., a pre-defined data storage for a layer during execution, and a pre-defined memory for the same layer during execution). If the storage and memory utilization is more than the pre-defined threshold, then the partitioning technique enables the system 100 to backtrack within the determined set of layers and determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution. In other words, an insight on whether a specific layer can be included or excluded for execution and partitioning is obtained. Alternatively, layers requiring (i) maximum input data, which needs execution of more data, and (ii) more time for processing may not be considered for partition and execution thereof, as that can affect the overall performance of the resource constrained systems.
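By way of a non-limiting example, one possible runnable interpretation of the above partitioning pseudo code is sketched below. It assumes that layer latencies and intermediate output sizes come from the offline profile; alpha bounds the latency that may be sacrificed while backtracking (as a percentage of the cumulative latency) and beta is the minimum relative storage gain that justifies dropping layers. The function and parameter names are illustrative only.

def find_partition(latencies, output_sizes, start, sensing_period,
                   storage_alert=False, alpha=20.0, beta=20.0):
    # Default objective: fit as many layers as possible within the sensing period.
    cumulative, i = 0.0, start
    while i < len(latencies) and cumulative + latencies[i] <= sensing_period:
        cumulative += latencies[i]
        i += 1
    end = i - 1                                  # last layer included so far
    if not storage_alert or end < start:
        return start, end
    # Backtrack: trade a bounded amount of latency for a smaller intermediate output.
    best_end, best_size = end, output_sizes[end]
    latency_loss, j = 0.0, end
    while j > start and latency_loss + latencies[j] <= alpha * cumulative / 100.0:
        latency_loss += latencies[j]
        j -= 1
        storage_gain = output_sizes[end] - output_sizes[j]
        if storage_gain >= beta * output_sizes[end] / 100.0 and output_sizes[j] < best_size:
            best_end, best_size = j, output_sizes[j]
    return start, best_end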
[054] Due to the observation that a pooling layer with a much smaller intermediate output size is available within 2-3 layers of a convolution layer, both goals can be achieved with a constant number of iterations of the nested while loop in line 12 of the pseudo code. This renders the above partitioning technique O(n) with very little overhead. Partial Inference (PI) scheme:
[055] For evaluating the performance of the PI scheme, the neural network comprised in the system of the present disclosure is modeled as a queuing system as known in the art. Inference tasks arrive when a new sensing event happens, and in the current setup, it is assumed by the present disclosure that the sensing rate (arrival rate λ) is dependent on time or a situation. For a given platform and a DI graph, it is assumed by the systems and methods of the present disclosure that (i) the inference rate (service rate μ) is a deterministic constant value, (ii) there is only one processing unit which abstracts the embedded device processor and accelerators that serve the inference task (single server model), and (iii) the system has a finite buffer capacity modelling the SD card based storage in a typical embedded device.
[056] The above system can be modeled as a {G/D/1/L} queue (as shown in FIG. 5), where the arrival distribution G is observed to be time varying and L is the maximum number of inference tasks that can be queued or be in a processing state. More specifically, FIG. 5, with reference to FIGS. 1 through 4B, depicts both partial inference and full inference modelled as a queuing system, in accordance with an embodiment of the present disclosure. As the arrival rate distribution here is non-standard (time varying and observed from past data), the current model cannot be derived analytically, and a customized discrete event simulation is developed by the systems and methods of the present disclosure to investigate the queueing behavior. In the simulation setup, the term sensing period is defined as a deadline within which the system should take up a new inference task to maintain the steady state of the queue. The queue length starts increasing when this sensing period becomes shorter than the time required to complete a single inference task. The above problem is solved by using a partitioned inference.
[057] In full inference (FI), a complete cycle is covered if the current sensing period is longer than the time required for a single inference. In the opposite case, processing cycles are skipped, and the frame is queued for a future leaner period. PI also works in a similar fashion, except that a subset of layers is processed and only the intermediate features are queued in case the sensing period is shorter than the per-frame inference time. In the current implementation, it is assumed that the sensing period can be statistically predicted from history. However, the system works in an event driven manner, i.e., the PI stops as new data arrives. The pseudo code of the partitioning technique of the present disclosure for finding a partition executes once there is a change in the sensing period. The frequency of invoking the above partitioning technique pseudo code reduces with better prediction and, as described above, such predictions are both important and possible in many embedded AI applications.
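By way of a non-limiting example, the queuing setup described above may be explored with a small SimPy based discrete event sketch such as the one below, in which tasks arrive with time varying inter-arrival gaps, a single server processes them with a deterministic service time, and at most queue_limit tasks may wait (finite buffer). The arrival gaps, service time and buffer limit used here are hypothetical placeholders and do not reproduce the actual simulator of the present disclosure.

import simpy

def source(env, server, gaps, service_time, queue_limit, stats):
    # Generate one inference task per sensing event.
    for gap in gaps:
        yield env.timeout(gap)
        if len(server.queue) >= queue_limit:
            stats["dropped"] += 1              # finite buffer: the task is lost
            continue
        env.process(inference_task(env, server, service_time, stats))

def inference_task(env, server, service_time, stats):
    arrival = env.now
    with server.request() as req:
        yield req
        yield env.timeout(service_time)        # deterministic per-task inference time
    stats["delays"].append(env.now - arrival)

env = simpy.Environment()
server = simpy.Resource(env, capacity=1)       # single processing unit
stats = {"delays": [], "dropped": 0}
gaps = [0.1] * 50 + [0.3] * 50                 # a surge followed by a leaner period
env.process(source(env, server, gaps, service_time=0.25, queue_limit=100, stats=stats))
env.run()
print("average delay:", sum(stats["delays"]) / len(stats["delays"]),
      "dropped:", stats["dropped"])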
[058] Upon executing the determined set of layers and upon performing the above inference, at step 308 of the present disclosure, the one or more hardware processors 104 map one or more corresponding partitioned blocks from the created set of partitioned blocks to the determined set of layers. In other words, the determined set of layers is then mapped to the respective partitioned block from the created set of partitioned blocks. Based on the selected number of layers in a partition, a pre-partitioned block (e.g., a block of the neural network comprised in the system 100) is mapped with a granularity of partition less than or equal to the required selection.
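By way of a non-limiting example, the mapping at step 308 may be sketched as below, where the runtime-selected layers are matched against pre-partitioned blocks whose granularity does not exceed the required selection. The blocks are assumed to follow the layout of the create_partitioned_blocks() sketch given earlier.

def map_layers_to_blocks(selected_layers, blocks):
    # Keep every pre-partitioned block that fits entirely inside the selected layers,
    # i.e., whose partition granularity is less than or equal to the required selection.
    selected = set(selected_layers)
    return [block for block in blocks if set(block["layers"]).issubset(selected)]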
[059] Further, the system 100 generates an analytics report based on the mapping, and the execution time of each layer comprised in the determined set of layers, in one example embodiment. The analytics report further enables the
system 100 to predict the completion time of partial execution of the determined set of layers. In other words, prediction of the data arrival rate/size based on historical data (also referred to as information comprised in the analytics report) can be done to get a better prediction of the deadline for partial execution.
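By way of a non-limiting example, such a completion-time prediction may be sketched as below, where the remaining layers' profiled latencies are summed and combined with the slack observed per sensing period in the recent history. The report structure (layer_profiles, idle_per_period_s, sensing_period_s) is a hypothetical placeholder.

def predict_completion_time(report, last_executed_layer):
    # Work left: profiled latency of every layer not yet executed.
    remaining = sum(entry["latency_s"] for entry in report["layer_profiles"]
                    if entry["layer"] > last_executed_layer)
    # Average idle (leaner) time available per sensing period in recent history.
    slack = sum(report["idle_per_period_s"]) / len(report["idle_per_period_s"])
    if slack <= 0:
        return float("inf")                    # no leaner periods: cannot catch up
    periods_needed = remaining / slack
    return periods_needed * report["sensing_period_s"]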
[060] The analytics report is further profiled into the resource constrained system 100, wherein the analytics report acts as training data to train the resource constrained system. In other words, the analytics report profiled in the trained resource constrained system is used for subsequent layer partitioning and fusing for execution thereof in the neural network comprised in the resource constrained system, for better utilization of the components comprised in the neural network/resource constrained system. In other words, the analytics report may be profiled for an incoming requirement from a domain/application (image analysis, speech analysis, video analysis, and the like), wherein layers are fused based on the identified partitions, and the fused layers are further sequentially executed to obtain the application's/domain's output. It is to be understood by a person skilled in the art or person having ordinary skill in the art that the above domains/applications shall not be construed as limiting the scope of the present disclosure.
[061] Furthermore, the hyperparameters α and β of the neural network are fine-tuned based on the analytics report. When the hyperparameters α and β of the neural network are fine-tuned, this enables prediction of the execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system. In other words, the approach can be used as a standalone system/module which can be either integrated within the resource constrained system or externally connected to the resource constrained system via interfaces, wherein the fine-tuned neural network can actually predict the execution time of partitioned blocks for another resource constrained system (e.g., say device B) which may have one or more components (say component x, component y, component z, and the like) that are either similar or identical to one or more components (e.g., component a, component b, component c, and the like) comprised in the resource constrained system (say device A). In other words, the profiling of DL models (or the neural network) on resource constrained systems (or embedded devices) can be logged and used as a training set so that, in future, the system 100 can predict the runtime of the neural network on similar hardware (processor, memory, accelerator) without actually running it. In typical conventional Deep Neural Network scenarios, fine-tuned parameters are used to modify the objective of the DNN, i.e., accuracy of classification/forecasting. However, in the present disclosure, the fine-tuned parameters are used for one or more non-functional requirements, e.g., inference time and storage/memory usage. In other words, fine-tuned parameters in the present disclosure are utilized by the system 100 for improving the throughput of inference and storage usage. Partial Inference (PI) Simulation and Evaluation:
[062] Systems and methods of the present disclosure simulate and evaluate the PI approach and the partitioning technique (refer to the pseudo code) by taking real-world data (layer-wise execution latency and output size) of standard CNNs used in object detection. A discrete event simulator was developed by the embodiments of the present disclosure and comprised in the system 100 of FIG. 2, based on SimPy: Discrete event simulation for Python, as known in the art, wherein different arrival rates and inference times were simulated by the simulator of the present disclosure. After experimenting with several CNNs, namely ResNet, VGG, AlexNet, etc., the hyperparameters α and β were set at 20% each on a Raspberry Pi with an 8-megapixel image sensor and 16 gigabyte storage.
[063] Full inference (FI) and partial inference (PI) were compared in a simulated setup. The two metrics used for comparison were the temporal variation of storage required and the average delay experienced by the inference tasks. PI performs better than the standard FI scheme in almost all scenarios. Apart from the duration of the experiments, the distribution of sensing periods (FIGS. 6A and 6D) plays an important role in the partitioning technique's performance. It can be observed from FIGS. 6A through 6D that with an equal distribution of sensing periods, the partitioning technique as implemented by the embodiments of the present disclosure and its systems and methods guarantees less storage usage, as depicted in FIG. 6A.
[064] With a more lenient sensing period distribution, as shown in FIG. 6D (only 30% of sensing periods are less than the per-frame inference latency), FI can handle the pending inference tasks eventually and sometimes performs better than the PI of the present disclosure. More specifically, FIGS. 6A-6D, with reference to FIGS. 1 through 5, depict graphical representations illustrating the storage load of full inference versus partial inference with respect to the sensing period of the Residual Network (ResNet) convolution neural network, in accordance with an embodiment of the present disclosure. Though the present disclosure depicts an example of ResNet/VGG/CNN, it is to be understood by a person having skill in the art or having ordinary skill in the art that the use of ResNet (or such a neural network) for implementation (or as implemented) in the present disclosure shall not be construed as limiting the scope of the present disclosure, and any neural network can be exploited by the systems and methods of the present disclosure for layer-wise partitioning and fusing thereof for execution in resource constrained systems and optimization of the systems thereof. The overhead associated with running the partition finding technique can be a reason for this. Similar efficacy of the partitioning technique of the present disclosure is observed in case of the average delay in FIG. 7A over the FI, in which the average delays are not suitable for even a delayed inference in practical scenarios. The results for similar experiments performed on VGG are shown in FIGS. 7B and 7C. More specifically, FIG. 7A, with reference to FIGS. 1 through 6D, depicts a graphical representation illustrating average delay per inference for a Residual Network (ResNet) - a convolutional neural network, in accordance with an embodiment of the present disclosure. FIG. 7B, with reference to FIGS. 1 through 7A, depicts a graphical representation illustrating storage load with respect to sensing rate for a Visual Geometry Group (VGG) - a convolutional neural network, in accordance with an embodiment of the present disclosure. FIG. 7C, with reference to FIGS. 1 through 7B, depicts a graphical representation illustrating average delay per inference for the VGG convolutional neural network, in accordance with an embodiment of the present disclosure. It can be observed that in case of VGG the FI performs better than PI initially, but the storage and average delay grow and become worse eventually. The execution latency of the initial layers of VGG is longer than that of ResNet. The large number of ResNet layers with shorter latency can be compactly packed in a given sensing period, thus increasing resource utilization. This can be a reason why the overhead of finding a partition does not become significant in case of ResNet.
[065] However, to obtain practical benefits using the PI of the present disclosure, embedded DL frameworks such as TensorFlow Lite, NVIDIA TensorRT, etc., as known in the art, need to provide an optimized mechanism for partial inference. As an example, as depicted in FIG. 8, the current Caffe implementation (e.g., refer 'Convolutional Architecture for Fast Feature Embedding' by Yangqing Jia et al., 2014) uses prototxt-based partitioning and transfers buffers between layers. More specifically, FIG. 8, with reference to FIGS. 1 through 7C, depicts the buffer copy overhead of partitioned inference, in accordance with an embodiment of the present disclosure. The buffer copy overhead is mitigated by the implementation of the systems and methods of the present disclosure (refer step 308, which is the mapping of partitioned blocks to the determined set of layers and is one of many ways, in one example).
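As a purely illustrative sketch, and not the Caffe code path referred to above, the following shows how mapping pre-fused blocks to the determined set of layers allows the intermediate activation to be handed forward in memory instead of being copied between per-layer partitions; Block and run_partitioned are hypothetical helpers introduced only for this example and do not correspond to any framework API.

from typing import Callable, List

class Block:
    """A fused group of consecutive layers that is executed as one unit."""
    def __init__(self, name: str, run: Callable):
        self.name, self.run = name, run

def run_partitioned(blocks: List[Block], frame):
    activation = frame
    for block in blocks:
        # no serialise/deserialise step between blocks: the tensor object is reused in memory
        activation = block.run(activation)
    return activation

# toy stand-ins for fused layer groups of a small CNN
blocks = [
    Block("conv1-conv4", lambda x: [v * 0.5 for v in x]),
    Block("conv5-fc", lambda x: sum(x)),
]
print(run_partitioned(blocks, frame=[1.0, 2.0, 3.0]))   # -> 3.0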
[066] Embodiments of the present disclosure and the systems and methods thereof investigated the use of partial deep learning inference to enhance performance and reduce resource usage of an AI-enabled embedded device (e.g., an AI-enabled resource constrained system) in the face of time-varying input data arrival rates. The systems and methods of the present disclosure implement layer-wise partitioned execution of Deep Inference (DI) to achieve partial inference. The partitioning technique as implemented by the systems and methods of the present disclosure can automatically schedule a subset of layers in the DI pipeline based on the current inference task processing deadline for the system. It is expected that such a partitioned execution method of the present disclosure aids in scenarios where the input data size is large and the frames-per-second requirement is variable, which is especially true for image and video analytics at the network edge. The present disclosure has modelled the techniques implemented herein as a queuing system and evaluated them using workloads of popular CNN-based object detection models in a simulation framework. The graphs in the figures highlight that the efficacy of a partitioned layer-wise inference scheme can be leveraged in real embedded systems by the implementation of the systems and methods of the present disclosure (refer step 308, which is the mapping of partitioned blocks to the determined set of layers and is one of many ways, in one example).
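In one non-limiting illustrative sketch of the scheduling idea summarised above, the next set of layers may be chosen as the longest run of remaining layers whose cumulative execution time fits the current deadline and whose intermediate outputs fit the available storage; the per-layer profile below is assumed, and this greedy prefix selection is only one possible realisation rather than the optimisation necessarily employed by the present disclosure.

def select_layers(exec_time, out_size_mb, start, deadline_s, free_storage_mb):
    """Return (start, end) indices of layers [start, end) to execute in the current window."""
    end, spent = start, 0.0
    while end < len(exec_time):
        t = spent + exec_time[end]
        # stop when the next layer would miss the deadline or its output would not fit in storage
        if t > deadline_s or out_size_mb[end] > free_storage_mb:
            break
        spent, end = t, end + 1
    return start, end

# assumed per-layer profile of a small CNN on the embedded device
exec_time = [0.05, 0.04, 0.06, 0.03, 0.08, 0.02]   # seconds per layer
out_size  = [12.0, 9.5, 6.0, 4.0, 1.5, 0.5]        # MB of intermediate output per layer
print(select_layers(exec_time, out_size, start=0, deadline_s=0.15, free_storage_mb=16.0))  # -> (0, 3)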
[067] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[068] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments
may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[069] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[070] The illustrated steps are set out to explain the exemplary
embodiments shown, and it should be anticipated that ongoing technological
development will change the manner in which particular functions are performed.
These examples are presented herein for purposes of illustration, and not
limitation. Further, the boundaries of the functional building blocks have been
arbitrarily defined herein for the convenience of the description. Alternative
boundaries can be defined so long as the specified functions and relationships
thereof are appropriately performed. Alternatives (including equivalents,
extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[071] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure.
A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[072] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method for optimizing resource constrained
systems, the method comprising:
obtaining (302), via one or more hardware processors, information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the input data comprises at least one of an input data rate, an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network;
creating (304), via the one or more hardware processors, based on the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold;
determining (306) via the one or more hardware processors, a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) storage capacity of the memory comprised in the resource constrained system, based on the obtained information, and executing the determined set of layers to obtain a desired output pertaining to the specific requirement of the application; and
upon executing the determined set of layers, mapping (308), via the one or more hardware processors, one or more corresponding partitioned blocks from the created set of the partitioned blocks to the determined set of layers, wherein an analytics report is generated based on the mapping and the execution time of each layer comprised in the determined set of layers.
2. The processor implemented method as claimed in claim 1, further comprising profiling the analytics report into the resource constrained system for subsequent layer partitioning and fusing for execution thereof in the neural network comprised into the resource constrained system.
3. The processor implemented method as claimed in claim 1, further comprising:
fine-tuning one or more hyperparameters of the neural network based on the analytics report; and
predicting execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system.
4. The processor implemented method as claimed in claim 1, wherein the
step of determining, a set of layers from a plurality of layers comprised in the
neural network comprises:
dynamically performing, during an execution of a layer, a comparison of storage and memory of the resource constrained system with the pre-defined threshold; and
backtracing within the determined set of layers, based on the comparison to determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution.
5. The processor implemented method as claimed in claim 1, further comprising predicting completion time of partial execution of the determined set of layers based on the analytics report.
6. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
obtain information comprising (i) one or more profiling parameters of a resource constrained system, wherein the one or more profiling parameters comprise at least one of size of a memory comprised in the resource constrained system, a processing time, and a processing speed, (ii) an input data pertaining to a specific requirement associated with an application, wherein the input data comprises at least one of an input data rate, an input data size, and (iii) an intermediate storage output of each layer comprised in a neural network of the resource constrained system, an execution time of each layer comprised in the neural network;
create, based on the obtained information, a set of partitioned blocks using a plurality of layers comprised in the neural network, wherein each of the set of partitioned blocks comprises one or more layers having a cumulative execution time (i) less than a pre-defined threshold, (ii) greater than the pre-defined threshold, or (iii) equal to the pre-defined threshold;
determine a set of layers from a plurality of layers comprised in the neural network that satisfy (i) a current execution time specified in the specific requirement associated with the application and (ii) storage capacity of the memory comprised in the resource constrained system, based on the obtained information, and execute the determined set of layers to obtain a desired output pertaining to the specific requirement of the application; and
map one or more corresponding partitioned blocks from the created set to the determined set of layers, wherein an analytics report is generated based on the mapping and the execution time of each layer comprised in the determined set of layers.
7. The system of claim 6, wherein the one or more hardware processors are
further configured by the instructions to profile the analytics report into the resource constrained system for subsequent layer partitioning and fusing for
execution thereof in the neural network comprised into the resource constrained system.
8. The system of claim 6, wherein the one or more hardware processors are
further configured by the instructions to:
fine-tune one or more hyperparameters of the neural network based on the analytics report; and
predict execution time of the one or more corresponding partitioned blocks for another resource constrained system having one or more components that are either similar or identical to one or more components comprised in the resource constrained system.
9. The system of claim 6, wherein the one or more hardware processors
determine, the set of layers from the plurality of layers comprised in the neural
network by:
dynamically performing, during an execution of a layer, a comparison of storage and memory of the resource constrained system with the pre-defined threshold; and
backtracing within the determined set of layers, based on the comparison to determine a combined effect of storage and execution latency for inclusion and exclusion of the layer in execution.
10. The system of claim 6, wherein the one or more hardware processors are
further configured by the instructions to predict completion time of partial
execution of the determined set of layers based on the analytics report.