Optimized Communication In A Distributed Parallel Computing Environment

Abstract: System and method(s) for increasing processing capabilities of a computing system in a distributed parallel computing environment by optimizing the communication are described. In said implementation, a number of nodes for parallel computing are determined. The node may represent a processor capable of computing in a distributed parallel computing environment. A matrix representing data may be divided into tiled matrices based on the number of nodes, and a block set may be formed based on the principle of block design and the number of nodes. Each block of the block set may represent one node from amongst several nodes of the distributed parallel computing environment. Further, the tiled matrices may be distributed among the plurality of nodes based on the formed block set to optimize communication for distributed processing.

Patent Information

Application #
2104/MUM/2011
Filing Date
25 July 2011
Publication Number
05/2013
Publication Type
INA
Invention Field
COMMUNICATION
Status
Parent Application
Patent Number
Legal Status
Grant Date
2021-02-19
Renewal Date

Applicants

TATA CONSULTANCY SERVICES LIMITED
NIRMAL BUILDING,9TH FLOOR, NARIMAN POINT,MUMBAI 400021, MAHARASHTRA,INDIA

Inventors

1. SAPRE, SHREENIWAS NARHAR
TATA CONSULTANCY SERVICES, ABHILASH SOFTWARE DEVELOPMENT CENTRE,PLOT NO.96, EPIP INDUSTRIAL AREA WHITEFIELD,BANGALORE-560066,INDIA

Specification

FORM 2
THE PATENTS ACT,1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
1. Title of the invention: OPTIMIZED COMMUNICATION IN A DISTRIBUTED PARALLEL COMPUTING ENVIRONMENT
2. Applicant(s)
NAME | NATIONALITY | ADDRESS
TATA CONSULTANCY SERVICES LIMITED | Indian | Nirmal Building, 9th Floor, Nariman Point, Mumbai-400021, Maharashtra, India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
The present subject matter relates, in general, to distributed parallel computing and, in particular, to optimized communication in a distributed parallel computing environment.
BACKGROUND
Engineers, scientists, mathematicians, and educators across a diverse range of industries solve engineering and scientific problems using large complex models involving huge computations. Such complex models are generally deployed using computer processing techniques in computing environments. Typically, the mathematical models and quantitative analysis to be performed require massive amounts of calculations and thus often use high performance computing systems, such as super computers, to analyze and solve the problems of various scientific disciplines.
For such high performance computational requirements, although the use of a single high-performance computer is possible in principle, such an approach may require tremendously long processing time and sophisticated hardware components. Conventionally, to achieve substantial computing results within considerably less computing time, distributed parallel processing techniques are employed. For example, a complex computation may complete faster if the computation is divided into portions, and the portions are simultaneously executed on a number of computing devices or processing systems.
The use of distributed systems for parallel computing is beneficial for practical reasons. For example, it may be more cost-efficient to obtain the desired level of performance by using a cluster of several low-end computing systems, in comparison with a single high-end computing system. Therefore, for complex and time-consuming computational problems, multiple computing systems are linked together to form a distributed computing environment. The distributed computing environment divides large computations into smaller computational blocks for processing by separate computing systems, thus shortening the time required for obtaining a result of the computation. The multiple computing devices in the distributed computing environment either communicate over a network or utilize on-chip communications to share data produced in the computing device with another computing device to produce the final result.

SUMMARY
This summary is provided to introduce concepts related to optimized
communication in a distributed parallel computing environment, which is further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
In one embodiment of the present subject matter, method(s) and a system for
increasing processing capabilities of a computing system in a distributed parallel computing environment by optimizing the communication are described. In said implementation, a number of nodes for parallel computing are determined. The node may represent a processor capable of computing in a distributed parallel computing environment. A matrix representing data may be divided into tiled matrices based on the number of nodes, and a block set may be formed based on principles of block design and the number of nodes. Each block of the block set may represent one node from amongst several nodes of the distributed parallel computing environment. Further, the tiled matrices may be distributed among the plurality of nodes based on the formed block set to optimize communication for distributed processing.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is described with reference to the accompanying figures.
In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
Fig. 1 illustrates a distributed parallel computing environment, in accordance with
an embodiment of the present subject matter.
Fig. 2 illustrates components of the communication optimizing system, in
accordance with an implementation of the present subject matter.
Fig. 3 illustrates an exemplary method to divide a matrix and a vector, and
distribute the matrix and the vector to multiple computing devices in a distributed parallel computing environment, in accordance with an implementation of the present subject matter.

DETAILED DESCRIPTION
System(s) and method(s) for optimizing communication in a distributed parallel
computing environment are described. The systems and methods described herein can be implemented on a variety of devices, such as a server, a desktop computer, a notebook or a portable computer, a mainframe computer, a mobile computing device, and the like.
Typically, complex computational models require huge quantitative analysis and
processing, which is generally achieved by computing devices with huge computational power, such as high performance computers. Although such high performance computers are sufficient to process the complex models, the use of distributed parallel computing often allows quicker and more cost effective solutions.
Distributed parallel computing can be understood as parallel computing carried
out in a distributed computing environment having a plurality of computing devices. Distributed parallel computing techniques are conventionally used as a means to improve the processing capacity of the distributed parallel computing environment for producing results. Known techniques of distributed parallel computing work on the principle of breaking a large complex problem into smaller problems. In order to successfully produce the results of the complex computations in the distributed parallel computing environment, computational data of the complex problem is divided and distributed among different computing devices in the distributed parallel computing environment for computation. These computing devices, distributed over the network, process the received data and transfer the processed data among each other within the distributed parallel computing environment. Based on the communication of processed data among the computing devices, the final result of the complex computation is computed.
However, the distributed parallel processing techniques are complex and the
computation of the final result requires high bandwidth communication between the computing devices forming a part of the distributed computing environment. Further, the communication between the computing devices is dependent on the network through which the computing devices communicate. Since the network is not always reliable and there is some latency involved in communicating results between different computing devices, the time consumed in computing a complex problem is greatly dependent on the communication between the computing devices. Furthermore, with the high complexity of the computations to be executed, the communication between the computing devices in the distributed parallel computing environment may sometimes lead to random communication across the network, thereby causing congestion.
According to an embodiment of the present subject matter, systems and methods
for optimizing communication in a distributed parallel computing environment are described. The computing devices forming a part of the distributed parallel computing environment are hereinafter referred to as nodes. According to an aspect of the present subject matter, the inter-node communication required for carrying out complex computations is optimized. The computing systems that can implement the described method(s) include, but are not limited to, desktop computers, hand-held devices, laptops or other portable computers, mobile phones, and the like. Further, the inter-node communication may happen through different networks, such as an on-chip network, a telephone network, a computer network, a wireless sensor network, and a peer-to-peer network. Such networks may include, but are not limited to, Digital Subscriber Line (DSL) access networks; fixed telephone line plain old telephone system (POTS) networks; cable broadband networks; Integrated Services Digital Network (ISDN); and Public Switched Telephone Networks (PSTN).
The present subject matter is further described with reference to a complex
matrix-vector multiplication problem, which is used only as an example for the ease of explanation, and it will be understood that the concepts explained in context thereof can be extended to other complex computations as well. Additionally, for exemplary purposes, the complex problems involving matrix-vector multiplication are described with respect to computational fluid dynamics. However, it would be understood that the described systems and methods may be implemented in different fields of high performance computing (HPC), such as flow analysis, weather forecasting, prototyping, digital communication, bio-technology, image processing, and the like.
The optimization of communication between different nodes in a distributed
parallel computing environment is associated with effective distribution of data amongst the nodes, thereby minimizing the requirement of inter-node communication. According to an implementation, matrix data associated with the matrix-vector multiplication is optimally distributed between the nodes. In said implementation, to produce results based on the optimally distributed matrix data, each node communicates with only a subset of the total nodes. Further, instead of the entire vector data being stored at each node for computation, the vector data is also optimally distributed amongst the nodes. Furthermore, a schedule for vector communication between the nodes is created, which reduces the effective inter-node communications, thus in effect reducing the complexity of inter-node communication.
In said implementation, the matrix-vector multiplication may be computed for the
computation of results of linear equations in a computational fluid dynamics analysis. It would be appreciated by those skilled in the art that such complex matrix-vector multiplication, which requires an iterative approach for the computation, might require huge computational power, and that the use of distributed parallel computing for such purposes may reduce the time of computation. However, although the technique of distributed computing may provide results in reduced time, the time can still be considerably reduced further. Therefore, the use of the described methods and systems in problems related to fluid dynamics may optimize the communication between computing nodes, thereby reducing the overall computation time and latency.
In operation, the matrix data may be distributed among the nodes based on the
principle of block design. The principle of block design might be fundamentally understood to be similar to the principle of Singer difference sets and projective geometry concepts. In one implementation, the distribution of matrix data is based on an incidence relation between the blocks and objects in a particular block design. The matrix data is divided into different tiles based on a number of nodes, which are identified based on the block design imposed by a projective space. Each tile may represent a sub-matrix obtained by a process of division of the matrix, and the number of nodes might represent the number of computing devices.
In one implementation, the projective space utilized is PG(2, q), where q is a
power of a prime, i.e., q = p^r for a prime p and a positive integer r. The number of nodes is identified based on the number q. In said implementation, the number of nodes (n) is represented by the following exemplary relation:

n = q² + q + 1
The matrix data is divided into n² tiles, which represent sub-matrices, by
segregating the matrix data into an (n x n) grid, where n represents the identified number of nodes. For example, in a matrix with 99 rows and 99 columns, if the identified number of nodes is 3, the matrix is divided into 9 tiles of 33 x 33 each, where each new row contains 3 such tiles and the complete matrix includes 3 such new rows. Hence, in one implementation, each tile obtained by the division of the matrix data is represented as a square sub-matrix of the original matrix. It would be understood by those skilled in the art that in situations when the number of rows or columns is not a perfect multiple of the number of nodes (n), the last tiles of each row and each column might not form a square matrix. For example, if a matrix comprising 100 rows and 100 columns is divided based on 3 nodes, the rows would be divided into 3 sections and the columns would also be divided into 3 sections. In such a scenario, the last tile of the first two new rows would contain 34 columns and 33 rows. Similarly, the last tile of the first two new columns would contain 33 columns and 34 rows. However, the last tile of the last new row and last new column would have 34 columns and 34 rows, as would be understood by a person skilled in the art.
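For illustration only, the tiling step described above can be sketched in Python; the helper names and the use of NumPy are assumptions of this sketch, not part of the specification. The grid splits the rows and columns into n sections, with the last section absorbing any remainder, exactly as in the 100 x 100 example:

```python
import numpy as np

def number_of_nodes(q):
    """Number of nodes identified from the projective space PG(2, q)."""
    return q * q + q + 1  # n = q^2 + q + 1

def tile_matrix(matrix, n):
    """Split `matrix` into an n x n grid of tiled sub-matrices.

    The first n - 1 sections in each direction receive floor(size / n)
    rows or columns; the last section absorbs the remainder, so the
    trailing tiles may be slightly larger, as described above.
    """
    rows, cols = matrix.shape
    row_cuts = [(rows // n) * k for k in range(1, n)]
    col_cuts = [(cols // n) * k for k in range(1, n)]
    return [np.split(strip, col_cuts, axis=1)
            for strip in np.split(matrix, row_cuts, axis=0)]

# A 100 x 100 matrix on 3 nodes: the last tile of the first two tile-rows
# has 33 rows and 34 columns, and the bottom-right tile is 34 x 34.
tiles = tile_matrix(np.zeros((100, 100)), 3)
assert tiles[0][2].shape == (33, 34) and tiles[2][2].shape == (34, 34)
```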
The tiles thus obtained are distributed among different blocks to form a block set,
where each block represents an identified node. It will be understood that since the matrix is divided into n² tiles, the n² tiles are to be distributed to the n identified nodes, and therefore, each node is provided with exactly n tiles for computational purposes to balance the load across all nodes. As described earlier, for the purpose of evaluation of the final result of the matrix-vector multiplication, inter-node communication is required; therefore, in one implementation, the distribution of different tiles to different nodes is done to optimize the required communication between the identified nodes.
To distribute the tiles among different nodes, block design based on Singer
difference set methods is utilized. A block design may contain blocks and objects where each block may contain a few of the available objects. In one implementation, the number of blocks and the number of objects are equal to the number of identified nodes. If the block design has n objects, n would represent the number of identified nodes, and each block of the block design would contain q+1 objects. As explained earlier, the relation between the number of identified nodes n and the number q is n = q² + q + 1, and the nodes are identified based on the number q.
In such a defined block design, the number of objects is equal to n, each object is
present in exactly q+1 blocks, and each pair of objects is present in exactly one block. For example, for a number q=3, the identified number of nodes and objects would be equal to 13, and therefore, there would also be 13 blocks in the block design forming a block set. Each block would contain q+1, i.e., 4 objects. Once such a block design is achieved for the identified number of nodes n, the tiles are distributed among the nodes based on the block design.
In one implementation, each diagonal tile (i, i) is allocated to the computing node
i, and each tile (i, j) is distributed based on the block design. As mentioned earlier, in the said block design, n objects are distributed into n blocks. Since a unique block contains a particular pair of objects, the pair in each block signifies the tile index to be distributed to the computing node. For example, when 'object 1' and 'object 2' are present in block 3, tile (1, 2) and tile (2, 1) are distributed to the node 3. Similarly, based on each existing pair of objects in different blocks, tiles are assigned to the nodes in the distributed parallel computing environment.
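A minimal sketch of this allocation rule, assuming the block set is supplied as a mapping from block (node) index to its set of objects; the function name `assign_tiles` is illustrative, not part of the specification:

```python
def assign_tiles(blocks):
    """Map each tile index (i, j) to the node that holds it.

    A diagonal tile (i, i) goes to node i; an off-diagonal tile (i, j)
    goes to the unique block containing both object i and object j.
    """
    n = len(blocks)
    owner = {(i, i): i for i in range(1, n + 1)}
    for b, objs in blocks.items():
        for i in objs:
            for j in objs:
                if i != j:
                    owner[(i, j)] = b  # each pair occurs in exactly one block
    return owner
```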
Further, as mentioned earlier, the vector to be multiplied with the matrix as part of
the matrix-vector multiplication is also distributed among the various nodes in the distributed parallel computing environment. In one implementation, the vector is divided into n different sections, where n is the number of identified nodes. Each node i is allotted the i-th section of the vector, and therefore, each node is allotted some tiles of the matrix data and a portion of the vector as part of the matrix-vector multiplication.
It should be noted that the description merely illustrates the principles of the
present subject matter. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described herein, embody the principles of the present subject matter and are included within its spirit and scope. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the present subject matter and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the present subject matter, as well as specific examples thereof, are intended to encompass equivalents thereof.
The manner in which the systems and methods for optimizing communication in a
distributed parallel computing environment is implemented shall be explained in detail with respect to Figures 1-3. While aspects of the described systems and methods for optimizing communication in a distributed parallel computing environment can be implemented in any number of different computing devices, environments, and/or configurations, the embodiments are described in the context of the following exemplary system(s).
Fig. 1 illustrates a distributed parallel computing environment 100 implementing
a communication optimizing system 102, according to an embodiment of the present subject matter. The distributed parallel computing environment 100 includes the communication
optimizing system 102 and multiple computing devices 104-1, 104-2, 104-3, ..., 104-N coupled
to the communication optimizing system 102 via a network 106. It will be understood that a computing device might have huge computational power and may be a super computer 104-1, or might have limited computational power and may be a smart phone 104-N. For the purpose of
explanation and clarity, the computing devices 104-1, 104-2, 104-3, ..., 104-N are collectively
referred to as computing device(s) 104. Further, the computing device(s) 104 may be interchangeably referred to as the node(s) 104.
The network 106 may be a wireless network, wired network, or a combination
thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, telecom network, electrical network, local area network (LAN), wide area network (WAN), Virtual Private Network (VPN), internetwork, Global Area Network (GAN), the Internet, and such. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other.
The communication optimizing system 102 and the computing devices 104 can be
implemented as any of a variety of conventional computing devices including, for example, servers, a desktop PC, a notebook or a portable computer, a workstation, a mainframe computer, a mobile computing device, an entertainment device, and an Internet appliance. Although the computing devices 104 are shown to be connected through a physical network 106, it would be appreciated by those skilled in the art that the communication optimizing system 102 and the computing devices 104 may be distributed locally or across one or more geographic locations and can be physically or logically connected to each other.

The communication optimizing system 102 includes, amongst other things, a
block design module 110. The block design module 110 can also be provided in an external storage media, which may interface with the communication optimizing system 102. In one implementation, the block design module 110 determines the distribution of matrix tiles among different nodes 104. According to an implementation of the present subject matter, matrix tiles are formed by dividing a matrix into multiple sub-matrices, and the matrix may contain data spread across multiple rows and columns.
Each matrix may be associated with a field of operation and would contain data
associated with that particular field. For example, a matrix may be defined for the field of fluid dynamics and may include data related to the parameters of fluid dynamics, such as pressure at different cross-sections at different speeds, or pressure at different cross-sections at varied altitudes. As described earlier, for efficient computing in a distributed parallel computing environment, the computing problem is divided and distributed to different computing devices. Therefore, the matrix data comprising the data related to fluid dynamics is divided into multiple sub-matrices and may be distributed among several nodes 104. For the sake of clarity, the sub-matrices formed by division of a matrix would be referred to as tiled matrices hereinafter.
In one implementation, upon identifying the tiled matrices, the block design
module 110 may form a block set with different blocks based on Singer difference sets and projective geometry concepts. In said implementation, the blocks thus formed may include the tiled matrices to be distributed to the nodes 104. Further, each block within the block set may indicate the node 104 to which the tiled matrices present in the block can be distributed. For example, the block design module 110 may form a block set with five blocks where each block includes five tiled matrices. It would be understood that to provide five tiled matrices to each block, a matrix is divided into twenty five sub-matrices. In said example, the blocks may be numbered as block 1, block 2, block 3, block 4, and block 5. The block 4 may indicate the five tiled matrices present in the block to be provided to the node 4, the block 2 may indicate the five tiled matrices present in the block to be provided to the node 2, and so on.
Fig. 2 illustrates exemplary components of the communication optimizing system
102, according to an embodiment of the present subject matter. The communication optimizing system 102 includes interface(s) 202, one or more processor(s) 204, and a memory 206 coupled to the processor(s) 204.
The interfaces 202 may include a variety of software and hardware interfaces, for
example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, and a printer. Further, the interfaces 202 may enable the communication optimizing system 102 to communicate with other computing devices, such as the computing devices 104 (not shown). The interfaces 202 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the interfaces 202 may include one or more ports for connecting a number of computing devices 104 to each other or to other communication optimizing systems. In one implementation, the communication optimizing system 102 communicates with the nodes 104 (not shown) via the interfaces 202.
The processor 204 can be a single processing unit or a number of units, all of
which could include multiple computing units. The processor 204 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 204 is configured to fetch and execute computer-readable instructions and data stored in the memory 206.
The functions of the various elements shown in the figures, including any
functional blocks labeled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

The memory 206 may include any computer-readable medium known in the art
including, for example, volatile memory such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 includes module(s) 208 and data 210. The modules 208, amongst other things, include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types. The data 210 serves, amongst other things, as a repository for storing data processed, received and generated by one or more of the modules 208. The modules 208 further include, for example, a matrix tiling module 212, the block design module 110, and other module(s) 214. The other modules 214 may include programs that supplement applications on the communication optimizing system 102, for example, programs in the operating system. The data 210 includes data generated as a result of the execution of one or more modules 208.
In one implementation, the communication optimizing system 102 optimizes the
communication between different nodes 104 (not shown) for a matrix-vector multiplication computation in a distributed parallel environment, for computation of data related to fluid dynamics. It would be understood by those skilled in the art that the matrix would include multiple rows and multiple columns, whereas the vector would include only one column with multiple rows.
The matrix tiling module 212 is configured to identify the number of nodes
required for the matrix-vector multiplication based on the projective geometry concepts. In one implementation, the projective space utilized is PG(2, q) to optimize the communication between the computing nodes. In said implementation, q is a power of some prime, and may be represented as q = p^r for a prime p and a positive integer r. The matrix tiling module 212 identifies the optimum number of nodes based on the relation:

n = q² + q + 1 ... (1)
In the above mentioned relation (1), n represents the number of nodes which are
computed based on the number q. The matrix tiling module 212 utilizes the identified number of nodes (n) and divides the matrix into sub-matrices based on the number n. The matrix tiling module 212 may form an n x n grid on the matrix to form n² smaller sub-matrices referred to as tiled matrices. For example, for a matrix with 169 rows and 169 columns, if the identified number of nodes is 13, the matrix tiling module 212 would divide the 169 rows into 13 sections, and similarly, the 169 columns into 13 sections. This division would result in the formation of 169 tiled matrices with the matrix being divided into 13 new rows and 13 new columns. Hence, in one implementation, each tile obtained by the division of the matrix would contain 13 rows and 13 columns, thereby forming square sub-matrices of the original matrix. As already explained, it would be understood by those skilled in the art that in situations when the number of rows or columns is not a perfect multiple of the number of nodes (n), the last tiles of each row and each column might not form a square matrix.
The tiled matrices are to be distributed among different nodes for the purpose of
distributed parallel computing. As described earlier, for the purpose of evaluation of the final result of the matrix-vector multiplication, inter-node communication is required, therefore, the distribution of different tiled matrices to different nodes is done to optimize the required communication between the identified nodes. In one implementation, the block design module 110 distributes the tiled matrices to different nodes based on the principle(s) of block design utilizing the Singer difference set method. The block design module 110 distributes the tiled matrices among different blocks, where each block represents an identified node.
To distribute the tiled matrices, a block set with multiple blocks is first
determined. The block set may contain blocks and objects where each block may contain a few of the available objects. According to the principle(s) of block design, the block design module 110 forms a block set containing a number of blocks equal to the number of identified nodes for the computation. For example, if the identified number of nodes is 13, based on the number q (q=3), the block design module 110 would form 13 blocks. For the sake of clarity, it may be considered that the blocks thus formed are numbered as block 1, block 2, ..., up to block 13. Further, in one implementation, to provide objects to the blocks, the block design module 110 distributes objects to the formed blocks of the block set, where the block design includes n objects, equal to the number of blocks and identified nodes. Further, it may be considered that the n objects are numbered as OB1, OB2, ..., up to OBn.
According to one implementation of the present subject matter, the block design
module 110 distributes each object to exactly q+1 blocks and assigns only q+1 objects to each block. Therefore, since the n objects are distributed among the n blocks and each block has more than one object, it would be understood that one object is distributed to more than one block. Further, it would also be understood that the distribution of objects among the blocks of the block set does not include the distribution of the matrix and tiled matrices, and the block set with different blocks is formed to complete the block design. For example, if the matrix tiling module 212 identifies the number of nodes to be equal to 13, based on q=3, the block design module 110 would form 13 blocks, each representing one node. Such a block design will have 13 objects with each block being assigned exactly 4 (q+1) objects, and each object would be present in exactly 4 (q+1) blocks. The distribution of such 13 objects into 13 different blocks has been explained in greater detail with respect to Table 1.

Blocks    Objects
Block 1   OB7   OB8   OB10  OB3
Block 2   OB8   OB9   OB11  OB4
Block 3   OB9   OB10  OB12  OB5
Block 4   OB10  OB11  OB13  OB6
Block 5   OB11  OB12  OB1   OB7
Block 6   OB12  OB13  OB2   OB8
Block 7   OB13  OB1   OB3   OB9
Block 8   OB1   OB2   OB4   OB10
Block 9   OB2   OB3   OB5   OB11
Block 10  OB3   OB4   OB6   OB12
Block 11  OB4   OB5   OB7   OB13
Block 12  OB5   OB6   OB8   OB1
Block 13  OB6   OB7   OB9   OB2
Table 1
As shown in Table 1, block 1 to block 13 are distributed with 13 objects, where
each block has exactly 4 objects. Since each object is present in exactly 4 blocks, each object in Table 1 is shown in exactly 4 blocks. For example, object 7 (OB7) is present in block 1, block 5, block 11 and block 13. Further, the block design module 110 distributes the objects in such a manner that each pair is present in exactly one block, and therefore, any pair of objects is shown in exactly one block in Table 1. For example, the pair of objects object 7 (OB7) and object 9 (OB9) is present only in the block 13. Similarly, the pair of objects object 2 (OB2) and object 4 (OB4) is present only in the block 8. Further, the objects present in a block form pairs which are not present in any other block, i.e., in block 1, object 7 (OB7), object 8 (OB8), object 10 (OB10), and object 3 (OB3) form pairs of objects, such as 'OB7 and OB8', 'OB8 and OB10', 'OB10 and OB3', 'OB3 and OB7', 'OB8 and OB7', 'OB10 and OB8', 'OB3 and OB10', 'OB7 and OB3', 'OB7 and OB10', 'OB8 and OB3', 'OB10 and OB7', and 'OB3 and OB8'. Similarly, the objects of each block may form 12 individual pairs which are not present in any other block of the block design. Although the distribution has been explained with respect to 13 blocks, it would be understood by those skilled in the art that a similar approach is utilized for distribution of n objects among n blocks, where n = q² + q + 1.
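By way of illustration, the rows of Table 1 can be reproduced as cyclic shifts, modulo 13, of the planar (Singer) difference set {1, 2, 4, 10}, which is the object set of block 8. The sketch below is an illustrative reconstruction under that assumption, and it verifies the two properties just described:

```python
from itertools import combinations

def singer_block_set(base, n, base_block=8):
    """Form the n blocks as cyclic shifts of a planar difference set.

    Objects are numbered 1..n; `base_block` fixes which block carries
    the unshifted base set so that the output matches Table 1.
    """
    return {b: {((x + b - base_block - 1) % n) + 1 for x in base}
            for b in range(1, n + 1)}

blocks = singer_block_set({1, 2, 4, 10}, 13)
assert blocks[1] == {7, 8, 10, 3}  # first row of Table 1

# Each object appears in exactly q + 1 = 4 blocks ...
assert all(sum(o in objs for objs in blocks.values()) == 4
           for o in range(1, 14))

# ... and each of the C(13, 2) = 78 pairs appears in exactly one block.
pair_count = {}
for objs in blocks.values():
    for pair in combinations(sorted(objs), 2):
        pair_count[pair] = pair_count.get(pair, 0) + 1
assert len(pair_count) == 78 and set(pair_count.values()) == {1}
```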
As described earlier, each block is indicative of one node from among the
identified number of nodes. Therefore, in one example, block 1 may represent node 1, block 2 may represent node 2, and so on. In one implementation, the matrix tiling module 212 distributes the tiled matrices among the identified nodes based on the block design formed by the block design module 110. To distribute the tiled matrices among different nodes, the matrix tiling module 212 may distribute each diagonal tile (i, i) to the block i, i.e., node i, and each tile (i, j) is allocated based on the block design. As described earlier, each pair of objects within a block is unique to that block and is not present in any other block of the block design. Therefore, for each pair of objects present in a block, the tiled matrix (i, j) formed by the pair is distributed to the block.
For example, if the identified number of blocks is 13, the block design module
110 would form a block design similar to that shown in Table 1. In such a scenario, the matrix tiling module 212 would allocate tiled matrices to all the blocks. The block 5 would include the tiles based on the unique pairs of objects it contains. Therefore, block 5 would include tiled matrices (11,12), (11,1), (11,7), (12,11), (12,1), (12,7), (1,11), (1,12), (1,7), (7,11), (7,12), and (7,1). As described earlier, the matrix is divided into n² tiles; the matrix tiling module 212 distributes the n² tiles to the n identified nodes, and therefore, each node is provided with exactly n tiles for computational purposes to balance the load across all nodes.
In the above described example where the identified number of blocks is 13, the
number of tiles (i, j) distributed to each block based on the block design is 12. Further, each tile (i, i) is also distributed to the block i. Therefore, in a block design with 13 blocks, 13 tiles are distributed to each block, which includes distribution of twelve tiles based on the block design and distribution of one tile based on the block number.
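Combining the two sketches above gives a direct check of this load balance; again an illustrative verification under the same assumptions, not part of the specification:

```python
from collections import Counter

owner = assign_tiles(blocks)  # helpers from the sketches above
load = Counter(owner.values())
assert all(load[b] == 13 for b in blocks)  # n = 13 tiles at every node

# Node 5 holds its diagonal tile plus the twelve tiles formed by the
# pairs of objects {11, 12, 1, 7} present in block 5, as listed above.
node5 = {t for t, b in owner.items() if b == 5}
assert {(5, 5), (11, 12), (1, 7), (7, 1)} <= node5
```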

For the purpose of matrix-vector multiplication, the block design module 110
divides a vector into multiple blocks to be distributed among the various nodes in the distributed parallel computing environment. In one implementation, the vector is divided into different blocks based on the number of identified nodes. For example, if the matrix tiling module 212 identifies n number of nodes, the block design module 110 may distribute the vector into n
blocks and each vector block may be represented as V1, V2, V3, ..., Vn. For example, as already
explained, in a situation where the identified number of nodes is 13 and the matrix to be divided into tiles has 169 columns, the matrix tiling module 212 would form 169 tiles where each tile would include 13 columns. Further, the vector to be multiplied with the matrix would have 169 rows, as would be understood by a person skilled in the art. In such a scenario, the block design module 110 may divide the vector into 13 blocks where each block would include 13 rows and the blocks may be represented as V1, V2, V3, ..., V13.
The block design module 110, apart from forming a block design for the matrix
and dividing the vector into blocks, distributes the vector blocks to different nodes. In one implementation, the vector block Vi is distributed to the node i. For example, if the block design module 110 divides the vector into 10 blocks, the number of identified nodes would also be 10. In such a scenario, the vector block V3 may be distributed to node 3, vector block V7 may be distributed to node 7, and so on. It would be understood by those skilled in the art that in a matrix-vector multiplication, when a matrix M with multiple rows and multiple columns is multiplied with a vector V with a single column and multiple rows, the result produced, that is, vector R, would contain a single column and multiple rows. Further, it would also be understood by those skilled in the art that for the purpose of matrix-vector multiplication in a distributed parallel environment, to produce one row Ri of the complete result, the entire row i of the matrix is to be multiplied with the vector V. For this purpose, the tile (i, i) of row i is multiplied with the vector block Vi to produce a partial result for Ri. Further, each tile (i, j) is multiplied with vector block Vj to produce the remaining partial results for Ri.
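The per-node products and the final reduction may be sketched as follows, assuming `tiles` maps 1-based grid positions (i, j) to sub-matrices and `vec_blocks[j]` is the j-th vector section; in an actual deployment each node would hold only its own tiles and fetch the few vector blocks it needs:

```python
import numpy as np

def distributed_matvec(tiles, vec_blocks, owner, n):
    """Assemble R = M @ V from per-node partial products.

    The node owning tile (i, j) multiplies it with vector block V_j,
    producing a partial result for R_i; summing the partials for each
    i and concatenating the sections yields the full result vector.
    """
    partials = {i: [] for i in range(1, n + 1)}
    for (i, j), node in owner.items():
        partials[i].append(tiles[(i, j)] @ vec_blocks[j])  # work of `node`
    return np.concatenate([sum(partials[i]) for i in range(1, n + 1)])
```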
As explained before, according to an implementation of the present subject
matter, the tiled matrices of the matrix are distributed to different nodes based on the block design formed by the block design module 110. In such a distribution, each block contains objects where each block represents a node. The number of blocks is identified based on the number q, and the number of objects in each block is also identified based on the number q.

Since, according to one implementation, each block includes exactly q+1 objects and the pairs of objects present in the block identify the distribution of tiles (i, j), all possible column indices j that a block may include are limited to the objects present in the block. Therefore, the block would include tiled matrices (i, j), where j may take only q + 1 values. For example, if the identified number q is 3, the identified number of blocks is 13 and each block would include exactly 4 (3+1) objects. Table 1 shows the distribution of thirteen objects to thirteen blocks according to the block design implemented. The block 8 of such a block design includes object 1, object 2, object 4 and object 10. In such a scenario, the tiled matrices included in block 8 would include tile (8,8) and tiles formed by pairs of objects 1, 2, 4, and 10, i.e., tile (1,2), tile (2,1), tile (2,4), tile (4,10), and so on. Therefore, it is evident that block 8, apart from the diagonal tile with column 8, only includes tiled matrices with columns 1, 2, 4, and 10. It would be understood by those skilled in the art that in a representation of a tiled matrix as tile (i, j), i represents the row and j represents the column.
Since each block i includes tiled matrices with exactly q+1 distinct columns and one vector block Vi, for the
purpose of computation of the matrix-vector multiplication result, each node may require q+1 vector blocks from other nodes to compute the partial results for the tiled matrices present with columns other than i. Further, each node i may provide the vector block Vi to q+1 other nodes for the computation of partial results, as would be understood by those skilled in the art. Therefore, it requires only 2(q+1) vector block communications to complete the computation at one node.
For example, a block with tile (2,2) and other tiles including columns 4, 8, 9, and
11 would include the vector block V2. For the computation of the partial result of R2, the matrix tiling module 212 would multiply the tile (2,2) with the vector block V2. However, to multiply the other tiles with the respective vector blocks, the node 2 would require inputs of vector blocks from nodes 4, 8, 9 and 11. Similarly, nodes 6, 8, 9 and 13 may require V2 for the computation of partial results. Hence, the node may communicate 8, i.e., 2(q+1), times for the complete computations at the node.
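With the block set of Table 1, this 2(q+1) bound can be checked directly: node i fetches the q+1 vector blocks named by the objects in its own block and sends Vi to the q+1 nodes whose blocks contain object i. An illustrative check, reusing `blocks` from the earlier sketch:

```python
q = 3
for i in blocks:
    fetch = blocks[i]  # vector blocks node i must receive
    send_to = {b for b, objs in blocks.items() if i in objs}
    assert len(fetch) + len(send_to) == 2 * (q + 1)  # 8 communications

# For node 2 this reproduces the example above: it fetches V4, V8, V9
# and V11, and supplies V2 to nodes 6, 8, 9 and 13.
assert blocks[2] == {4, 8, 9, 11}
assert {b for b, objs in blocks.items() if 2 in objs} == {6, 8, 9, 13}
```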
As would be understood by those skilled in the art, the communication
between nodes is dependent on the number q; therefore, the complexity of communication is reduced to 2(q+1). Further, since q is roughly proportional to √n, the complexity of communication can be effectively considered to be as low as √n. Therefore, such a reduction of the number of communications provides efficient and time effective results of matrix-vector computation with a high computation to communication ratio.
Fig. 3 illustrates an exemplary method 300 for optimizing communication in a
distributed parallel computing environment, in accordance with an implementation of the present subject matter. According to an aspect, the concepts of optimizing communication in the distributed parallel computing environment are described with reference to matrix-vector multiplication.
The exemplary method may be described in the general context of computer
executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
The order in which the methods are described is not intended to be construed as a
limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternative method. Additionally, individual blocks may be deleted from the methods without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. The method is explained with reference to a communication optimizing system; however, it will be understood that the method 300 can be implemented for a plurality of communication optimizing systems.
At block 302, a number of nodes of the distributed parallel computing
environment to be utilized for matrix-vector multiplication is identified. In a distributed parallel computing environment, there can be multiple nodes available for computing. The number of nodes is identified based on the relation:

n = q² + q + 1

where n represents the number of nodes and q is a power of a prime number. In one implementation, the number q is determined in such a manner that the number n is not greater than the number of rows or columns in the matrix. In one implementation, the matrix tiling module 212 identifies the number of nodes based on the number q.
At block 304, the matrix is divided into sub-matrices to form tiled matrices based
on the number of identified nodes. Further, the vector is also divided into vector blocks based on the number of identified nodes. In one implementation, the matrix tiling module 212 divides the matrix into an n x n grid, where n is the number of identified nodes. The division of the matrix into sub-matrices may form n² tiled matrices. Similarly, the vector to be multiplied with the matrix is divided into vector blocks. In one implementation, the matrix tiling module 212 forms vector blocks based on the number of identified nodes. The matrix tiling module 212 may divide the vector into equal vector blocks similar to the manner in which the division of the matrix into tiled matrices is achieved.
At block 306, a block set is formed based on the number of identified nodes and the
principles of block design. In one implementation, the number of blocks included in the block set is equal to the number of identified nodes, and each block represents a node. The blocks further include objects, where the number of objects is equal to the number of identified nodes. The objects are distributed to the blocks of the block set based on the block design principle. In said implementation, each object is allocated to exactly q+1 blocks, and only q+1 objects are allocated to each block. Further, each pair of objects is distributed to exactly one block. Based on the distribution criteria, the blocks are formed by apportioning the total number of objects into a plurality of blocks, each block containing q+1 objects. In one implementation, the block design module 110 forms the blocks based on the number of identified nodes and the principles of the Singer difference set.
At block 308, the tiled matrices are distributed to different nodes based on the
formed blocks of the block set. As described in Fig. 2, each block formed by the block design module 110 contains objects according to the principle of block design. The tiled matrices are distributed to the nodes based on the pairs of objects present in the block representing the node. The details of the tiled matrix distribution have been explained with reference to Fig. 2 and, therefore, the details of the same are omitted here for the sake of brevity. In one implementation, the matrix tiling module 212 distributes the tiled matrices to each node based on the blocks formed.
At block 310, the vector blocks formed by dividing the vector are distributed to
different nodes based on the number of identified nodes. In one implementation, the matrix tiling module 212 distributes one vector block to each node based on the number of the vector block and the node. For example, the vector block V1 is distributed to node 1 and vector block V6 is distributed to node 6.
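Tying blocks 302 to 310 together, an end-to-end sketch under the assumptions of the earlier snippets (q = 3, n = 13, and a 169 x 169 matrix) might read:

```python
import numpy as np

q = 3
n = number_of_nodes(q)                        # block 302: n = 13 nodes
M = np.random.rand(13 * n, 13 * n)            # 169 x 169 matrix data
V = np.random.rand(13 * n)                    # vector with 169 rows

grid = tile_matrix(M, n)                      # block 304: tiled matrices
tiles = {(i, j): grid[i - 1][j - 1]
         for i in range(1, n + 1) for j in range(1, n + 1)}
vec_blocks = dict(enumerate(np.split(V, n), start=1))  # ... vector blocks

blocks = singer_block_set({1, 2, 4, 10}, n)   # block 306: block set
owner = assign_tiles(blocks)                  # block 308: tiles to nodes

# Blocks 308/310: each node computes its partials, which reduce to R.
R = distributed_matvec(tiles, vec_blocks, owner, n)
assert np.allclose(R, M @ V)                  # agrees with the direct product
```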
Although embodiments for methods and systems for communication optimizing
system have been described in a language specific to structural features and/or methods, it is to be understood that the invention is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary embodiments for optimizing communication.

I/We Claim:
1. A method for increasing processing capabilities of a computing system in a distributed parallel computing environment, the method comprising:
determining a number of nodes for parallel computing, wherein a node represents a processor of a distributed parallel computing environment, and wherein the number of nodes is at least seven;
obtaining data in the form of a matrix and a vector for parallel computing;
dividing the matrix representing data into tiled matrices based on the number of nodes;
forming a block set based on the principle of block design, 2-dimensional projective space, and the number of nodes, wherein each block of the block set represents one node from amongst a plurality of nodes of the distributed parallel computing environment; and
distributing the tiled matrices among the plurality of nodes based on the formed block set.
2. The method as claimed in claim 1, wherein the method further comprises dividing a vector into a plurality of vector blocks based on the number of nodes to distribute the plurality of vector blocks among the plurality of nodes.
3. The method as claimed in claim 1, wherein the number of nodes is determined based on the principle of Singer Difference Set.
4. The method as claimed in claim 1, wherein the determining of the number of nodes is based on a quadratic function of a power of a prime number.
5. The method as claimed in claim 4, wherein a communication between the plurality of nodes is based on the power of the prime number.

6. A communication optimizing system (102) for increasing processing capabilities of a
computing system in a distributed parallel computing environment, the communication
optimizing system (102) comprising:
a processor (204);
a memory (206) coupled to the processor (204), the memory (206) comprising:
a matrix tiling module (212) configured to divide a matrix representing data into tiled matrices based on a number of nodes, wherein the number of nodes is at least seven;
a block design module (110) configured to form a block set based on the principle of block design utilizing a Singer difference set, and distribute the tiled matrices among the number of nodes based on the block set.
7. The communication optimizing system (102) as claimed in claim 6, wherein the block design module (110) is further configured to divide a vector into a plurality of vector blocks based on the number of nodes.
8. The communication optimizing system (102) as claimed in claim 6, wherein the matrix tiling module (212) determines the number of nodes based on the principle of Singer difference set.
9. The communication optimizing system (102) as claimed in claim 7, wherein the block design module (110) is further configured to distribute the plurality of vector blocks among the number of nodes.
10. The communication optimizing system (102) as claimed in claim 6, wherein a block in the block set represents one node from amongst a plurality of nodes of the distributed parallel computing environment.
11. A computer-readable medium having computer-executable instructions that when executed perform acts comprising:
determining a number of nodes for parallel computing, wherein a node represents a processor of a distributed parallel computing environment, and wherein the number of nodes is at least two;
dividing a matrix into tiled matrices based on the number of nodes;
forming a block set based on the principle of block design, 2-dimensional projective space, and the number of nodes, wherein each block of the block set represents one node from amongst a plurality of nodes of the distributed parallel computing environment; and
distributing the tiled matrices among the plurality of nodes based on the formed block set.

Documents

Orders

Section Controller Decision Date

Application Documents

# Name Date
1 2104-MUM-2011-OTHERS [25-05-2018(online)].pdf 2018-05-25
2 2104-MUM-2011-FER_SER_REPLY [25-05-2018(online)].pdf 2018-05-25
3 2104-MUM-2011-US(14)-HearingNotice-(HearingDate-20-10-2020).pdf 2021-10-03
4 2104-MUM-2011-IntimationOfGrant19-02-2021.pdf 2021-02-19
5 2104-MUM-2011-PatentCertificate19-02-2021.pdf 2021-02-19
6 2104-MUM-2011-Written submissions and relevant documents [22-10-2020(online)].pdf 2020-10-22
7 ABSTRACT1.jpg 2018-08-10
8 2104-MUM-2011-POWER OF ATTORNEY(27-9-2011).pdf 2018-08-10
9 2104-MUM-2011-AFFIDAVIT(28-3-2012).pdf 2018-08-10
10 2104-mum-2011-claims.pdf 2018-08-10
11 2104-MUM-2011-CORRESPONDENCE (28-3-2012).pdf 2018-08-10
12 2104-MUM-2011-CORRESPONDENCE(27-9-2011).pdf 2018-08-10
13 2104-MUM-2011-CORRESPONDENCE(28-3-2012).pdf 2018-08-10
14 2104-mum-2011-correspondence.pdf 2018-08-10
15 2104-mum-2011-description(complete).pdf 2018-08-10
16 2104-mum-2011-drawing.pdf 2018-08-10
17 2104-MUM-2011-FORM 1(28-3-2012).pdf 2018-08-10
18 2104-MUM-2011-FORM 1 (28-3-2012).pdf 2018-08-10
19 2104-MUM-2011-FER.pdf 2018-08-10
20 2104-mum-2011-form 1.pdf 2018-08-10
21 2104-MUM-2011-FORM 13(28-3-2012).pdf 2018-08-10
22 2104-mum-2011-form 18.pdf 2018-08-10
23 2104-mum-2011-form 2(title page).pdf 2018-08-10
24 2104-mum-2011-form 2.pdf 2018-08-10
25 2104-mum-2011-form 3.pdf 2018-08-10
26 2104-MUM-2011-FORM 5(28-3-2012).pdf 2018-08-10
27 2104-MUM-2011-PETITION UNDER RULE-137(28-3-2012).pdf 2018-08-10
28 2104-mum-2011-abstract.pdf 2018-08-10
29 2104-MUM-2011-Correspondence to notify the Controller [24-09-2020(online)].pdf 2020-09-24
30 2104-MUM-2011-ABSTRACT [25-05-2018(online)].pdf 2018-05-25
31 2104-MUM-2011-CLAIMS [25-05-2018(online)].pdf 2018-05-25
32 2104-MUM-2011-COMPLETE SPECIFICATION [25-05-2018(online)].pdf 2018-05-25
33 2104-MUM-2011-CORRESPONDENCE [25-05-2018(online)].pdf 2018-05-25
34 2104-MUM-2011-RELEVANT DOCUMENTS [27-09-2022(online)].pdf 2022-09-27
35 2104-MUM-2011-RELEVANT DOCUMENTS [26-09-2023(online)].pdf 2023-09-26

Search Strategy

1 2104-MUM-2011_29-09-2017.pdf

ERegister / Renewals

3rd: 25 Feb 2021 (from 25/07/2013 to 25/07/2014)
4th: 25 Feb 2021 (from 25/07/2014 to 25/07/2015)
5th: 25 Feb 2021 (from 25/07/2015 to 25/07/2016)
6th: 25 Feb 2021 (from 25/07/2016 to 25/07/2017)
7th: 25 Feb 2021 (from 25/07/2017 to 25/07/2018)
8th: 25 Feb 2021 (from 25/07/2018 to 25/07/2019)
9th: 25 Feb 2021 (from 25/07/2019 to 25/07/2020)
10th: 25 Feb 2021 (from 25/07/2020 to 25/07/2021)
11th: 25 Feb 2021 (from 25/07/2021 to 25/07/2022)
12th: 01 Jul 2022 (from 25/07/2022 to 25/07/2023)
13th: 04 Jul 2023 (from 25/07/2023 to 25/07/2024)
14th: 12 Jul 2024 (from 25/07/2024 to 25/07/2025)
15th: 11 Jul 2025 (from 25/07/2025 to 25/07/2026)