
A System For Decoding Of Low Density Parity Check (LDPC) Encoded Data

Abstract: A parallel system for decoding of low density parity check (LDPC) encoded data having length L in a bit stream, 'L' being given by the function q*(p^(2s) + p^s + 1), where p is a prime number, s is any positive integer and q is a factor of (p^s + 1), said system comprising: means to select a block of the bit stream of said length L, said block containing a plurality of bits; a parity check matrix having a plurality of elements; a plurality of parity check nodes for implementing each row of said parity check matrix and for updating the information of each bit; a plurality of bit nodes adapted to receive said update information from said parity check nodes, having means to calculate the bit probability, and further adapted to transmit the calculated bit probability data to said parity check nodes; and a logic control means adapted to iterate information between said parity check nodes and bit nodes until decoding is complete.


Patent Information

Application #
Filing Date
31 January 2007
Publication Number
39/2008
Publication Type
INA
Invention Field
ELECTRONICS
Status
Parent Application
Patent Number
Legal Status
Grant Date
2016-11-25
Renewal Date

Applicants

TATA CONSULTANCY SERVICES LIMITED
BOMBAY HOUSE, 24, HOMI MODY STREET, MUMBAI-400 001,

Inventors

1. SHARMA HRISHIKESH
Embedded System Innovation Lab, Tata Consultancy Services Ltd., Bangalore.

Specification

FORM-2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
PROVISIONAL SPECIFICATION
(See section 10 and Rule 13)
COMPUTER SYSTEMS
TATA CONSULTANCY SERVICES LIMITED
An Indian Company
of Bombay House, 24, Homi Mody Street, Mumbai-400 001,
Maharashtra, India

THE FOLLOWING SPECIFICATION DESCRIBES THE INVENTION

Field of invention
This invention relates to computer systems.
Particularly, this invention relates to a computer system for communication.
Typically, this invention envisages a hardware system for decoding of a regular, split-extended low density parity-check (LDPC) coded bit stream, applicable to both wireless and wireline communication systems.
Background of Invention
Introduction
It is known that digital data transmission suffers from channel influences such as:
1) Severe (multipath) fading in terrestrial mobile radio communications
2) Very low signal-to-noise ratio for satellite communications due to high path loss and limited transmit power in the downlink
3) Compressed data (e.g. audio and video signals) is very sensitive to
transmission errors
Hence a transmitted message, especially over a long distance, may not be received as it was sent. In order to make out the meaning of what was sent, we therefore need to detect, and if possible correct, errors. Wireless communication systems are especially prone to errors, and hence a robust error-correction system at the receiving end becomes almost a necessity. In recent times, such wireless communication systems have been widely deployed to provide different services. Popular technologies such as ad-hoc networking (Wireless LAN) and infrastructure-based networking (3G) all require error-correction.
An approach to deal with error correction first encodes each unit of information (such as a byte) into a unit (called a word) derived from a p-ary alphabet (the binary alphabet with 0 and 1 is most used). The error detection or correction work at the receiver end then becomes mapping the current unit of information received to the "closest" p-ary codeword, thus recovering the actual unit of information sent. This process is known as the encoding/decoding process in digital communication.
It was the invention of turbo codes and turbo product codes that moved the limit of achievable coding gain much closer to the Shannon limit for low and high code rates, respectively. However, the iterative nature of the decoding algorithms for turbo codes and turbo product codes presents significant implementation challenges. The search for better iterative decoding algorithms led to the (re-)discovery of low-density parity-check (LDPC) codes. These codes are simpler to decode when compared to other codes. For example, unlike turbo and turbo product codes, it is possible to decode LDPC codes using a block-parallel algorithm rather than a block-serial algorithm. Further, using the message-passing algorithm, LDPC decoders require an order of magnitude fewer arithmetic computations than equivalent turbo decoders. Finally, it was also established that LDPC codes are asymptotically superior to turbo codes with respect to coding gain.
Simplicity of decoder implementation coupled with better performance has made these codes candidates for usage in the recent generation of digital transmission systems. For example, in the digital satellite broadcasting standard (DVB-S2), a long LDPC code has been chosen for error-correction of downlink data. A similar case is being made in wireless LAN technology, as mentioned earlier. Because of this thrust, optimal systems related to LDPC decoding have a very high potential of being deployed.
Prior Art:
In the prior art, most of the work has been done on designing semi-parallel architectures for such decoders. This includes decoders for both regular and irregular codes, though designs for regular codes are more common. Most of the work suggests a strong correlation between the choice of the codes and the decoder design; hence the importance of specialized designs is quite evident. Shu Lin and Marc Fossorier introduced not only the projective geometry codes, but also the row-/column-splitting technique, primarily to boost the code rate. They also performed decoder simulations using generic architectures to study the error-performance. Later, Engling Yeo, Venkat Anantharam et al. came up with a 'staggered schedule' for decoding, which they applied to decoding split codes as well.
The approach as envisaged in accordance with this invention differs from the prior art in the sense that it optimally uses the classical flooding schedule for decoding split codes. It is also, in some sense, a better architecture because of the use of memories with scheduling such that there are no memory access conflicts. The resulting design is thus still fast compared to other semi-parallel designs. This architectural invention extends a prior patent in the same area by the author for non-split projective geometry LDPC codes.
Objects of the Invention:
One object of the invention is to provide a system which can easily replace any error-detection/correction component of a decoder for communication systems where regular LDPC codes have been used.
Another object of this invention is to provide a decoding system which has a synchronous design.
Yet another object of this invention is to provide a system for decoding with error-correction facility.
An additional object of this invention is to provide a system for communication with decreased complexity of design and increased speed/throughput.
Summary of the Invention:
In general, the system in accordance with this invention can be used in any decoder for communication systems where regular, split-extended LDPC codes have been used. Column-splitting of a structured code is a technique that improves the error-performance, code-rate and convergence performance of LDPC decoding. The decoder system exploits the projective geometry regularity in a certain subset of Tanner graphs. However, following the industry practice of joint code-decoder design, the system has been designed specifically with LDPC codes arising out of projective planes in mind. The decoder is fully generic in the sense that it can be used to decode codes of any length, provided the length is of the type q*(p^(2s) + p^s + 1), where p is a prime number, s any positive integer and q a factor of (p^s + 1).
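By way of illustration only, the following Python sketch (not part of the specification; the function names are our own) enumerates the decodable block lengths for a given prime p and positive integer s:

    # Illustrative sketch: enumerate valid code lengths
    # L = q * (p^(2s) + p^s + 1), with q a factor of (p^s + 1).

    def divisors(x):
        return [d for d in range(1, x + 1) if x % d == 0]

    def valid_lengths(p, s):
        """Yield (q, L) pairs for a given prime p and positive integer s."""
        n = p ** (2 * s) + p ** s + 1   # points/lines of the projective plane
        for q in divisors(p ** s + 1):  # q must divide the column weight
            yield q, q * n

    # Example: p = 3, s = 1 gives n = 13 and column weight 4, so q in {1, 2, 4}.
    for q, L in valid_lengths(3, 1):
        print(f"q = {q}: decodable block length L = {L}")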
This invention envisages using parallel hardware architecture for a split-extended LDPC decoding system at the receiving end of a communication system. Particularly, this invention envisages a method and apparatus for decoding split-extended LDPC-encoded bitstream and includes a system for improving the decoding time-performance and memory access efficiency for such bitstream.
The apparatus envisaged in accordance with this invention works on a highly generic architectural template so far used in other systems for other computation problems. It is suitable for designing many parallel systems. The underlying mathematical structure of regular LDPC codes is isomorphic with the structure of the architectural template, and hence the envisaged novel decoder system substantially improves the performance.
The belief propagation algorithm for decoding an LDPC-coded bitstream passes messages between so-called bit (variable) and check nodes. These messages are stored before the other nodes access and process them, in order to have synchronous execution. However, it is the lack of any structural regularity in these essentially random codes that creates a major challenge for building a practical LDPC decoder. Hence the envisaged system focuses on decoding regular codes only. In regular codes, each node in the Tanner graph passes the same number of messages in a phase of decoding. Further, by choosing projective geometry regular codes to lend additional regularity, it becomes possible to design a schedule in such a way that there are no memory or processor conflicts, as our system shows. In fact, various other "nice properties" that one generally seeks during parallel system design, such as load-balancing and efficient data-routing, are also achieved. This regularity can also be used to reduce the wiring and layout complexity in the final stages of hardware design.
Brief Description of the Drawings:
Figure 1 relates to a Tanner Graph corresponding to p = 3, s = 1
Figure 2 relates to a Split Tanner Graph corresponding to p = 3, s = 1, q = 2
Figure 3 relates to the Basic Architecture of Split-LDPC Decoding System
Figure 4 relates to the Quasi-cyclicity in Parity Matrix split by factor of 2
Figure 5 relates to an Example grouping/numbering for generated Perfect Pattern
Figure 6 relates to an Interconnect Network for p = 5
Figure 7 relates to a Bit Node Architecture
Figure 8 relates to a Check Node Architecture for Sign Processing

Figure 9 relates to a Check Node Architecture for Magnitude Processing
Detailed Description of the Drawings:
Figure 1 illustrates the Tanner graph representation of the code, for understanding the belief propagation decoding algorithm. Each bit of the codeword is connected (hence related) to the constraints (in the form of parity-checks) that the bits must satisfy in order to form a valid codeword. During the iterative decoding process, messages are exchanged between the bits and the constraints (parity checks) on the edges of the graph. The steps in this algorithm can be briefly put down as follows:
1) All bit-nodes and their outgoing bit-messages are initialized to the intrinsic information, represented as a log-likelihood ratio of the received symbol:

l_j = ln( P(x_j = 0 | y_j) / P(x_j = 1 | y_j) )
2) The messages are sent to the parity-check nodes along the edges of the Tanner graph.
3) At the parity-check nodes, computation is performed separately on the sign and the magnitude part of each message. The XOR of the sign bits of the incoming bit messages at a parity-check node forms the parity check result for the row. The sign of each outgoing check message for each edge of the graph is formed by further XORing the sign of the incoming bit message of that edge with the row parity check result.
Further, an intermediate row parity reliability and the subsequent outgoing check message reliabilities are computed in the logarithmic domain. Using the transform f(x) = -ln(tanh(x/2)), which is its own inverse, the intermediate row parity reliability is:

ln(l_i) = SUM over j of f(l_ij)

where ln(l_i) is the log of the parity reliability for row i, and l_ij is the reliability of the message sent to parity-check node i by bit node j. The outgoing check message reliabilities (magnitudes) are calculated using ln(l_i) as:

l_ik = f( ln(l_i) - f(l_ik) )

where l_ik is the reliability of the check message from parity-check node i to variable node k.
4) These messages are now sent to the bit nodes along the edges of the Tanner graph.
5) At the bit nodes, an estimate of the decoded bit is updated by summing up the intrinsic information and the extrinsic information received along each edge (the check message log-likelihoods). The decoded bit is taken to be the sign of the summation, for this round of iteration. Outgoing bit messages for the next decoding iteration are then formed using the same summation. However, each edge is updated without using the incoming message on the corresponding graph edge. This is easily implemented by subtracting the contribution of each edge from the group sum.
6) Repeat steps 2)-5) until a termination condition is met. From the above description of the algorithm, it is clear that, over the various iterations, the updating of extrinsic information is where most of the time is spent. The system in accordance with this invention focuses on optimizing that. A software sketch of the above steps is given below.
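By way of illustration, a minimal software sketch of this message-passing procedure follows (an assumption-laden model of the computation, not the claimed hardware; H is a small 0/1 parity-check matrix given as a list of rows, and llr holds the intrinsic log-likelihood ratios of the received bits):

    import math

    def f(x):
        # The self-inverse transform f(x) = -ln(tanh(x/2)), clipped to
        # avoid ln(0) and overflow at the extremes.
        x = min(max(x, 1e-9), 30.0)
        return -math.log(math.tanh(x / 2.0))

    def decode(H, llr, max_iter=50):
        m, n = len(H), len(llr)
        edges = [(i, j) for i in range(m) for j in range(n) if H[i][j]]
        msg_bit = {e: llr[e[1]] for e in edges}   # bit -> check messages
        msg_chk = {e: 0.0 for e in edges}         # check -> bit messages
        hard = [0 if v >= 0 else 1 for v in llr]
        for _ in range(max_iter):
            # Check-node phase: sign (XOR) and magnitude (log domain) separately.
            for i in range(m):
                row = [e for e in edges if e[0] == i]
                row_sign = 1
                for e in row:
                    row_sign *= 1 if msg_bit[e] >= 0 else -1
                total = sum(f(abs(msg_bit[e])) for e in row)
                for e in row:
                    sign = row_sign * (1 if msg_bit[e] >= 0 else -1)
                    msg_chk[e] = sign * f(total - f(abs(msg_bit[e])))
            # Bit-node phase: accumulate, then subtract each edge's own input.
            acc = list(llr)
            for e in edges:
                acc[e[1]] += msg_chk[e]
            for e in edges:
                msg_bit[e] = acc[e[1]] - msg_chk[e]
            hard = [0 if a >= 0 else 1 for a in acc]
            if all(sum(H[i][j] * hard[j] for j in range(n)) % 2 == 0
                   for i in range(m)):
                break                             # all parity checks satisfied
        return hard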
The projective geometry LDPC codes can be extended by splitting each column of their parity-check matrix into multiple columns. The new parity-check matrix, having a larger number of bits and hence changed properties, is considered a separate code.
Let g_0, g_1, ..., g_(n-1) denote the columns of the parity-check matrix H made from a projective plane, where n = (p^(2s) + p^s + 1). Each column of H is then split into the same number of columns. The set of "1"s of the original (regular) column is distributed among the new columns as follows. Let q be a positive integer such that 2 <= q <= g, and q is also a factor of g, where g is the original column weight. Dividing g by q, we have g = q x g_ext. Split each column g_i of H into columns g_i,0, g_i,1, ..., g_i,(q-1), such that all q columns have column weight g_ext. The distribution of the "1"s of g_i into g_i,0, g_i,1, ..., g_i,(q-1) is carried out in a rotating manner. In the first rotation, the first "1" of g_i is put in g_i,0 at the same vertical location, the second "1" of g_i is put in g_i,1 at the same vertical location, and so on. In the second rotation, the (q + 1)th "1" of g_i is put in g_i,0 at the same vertical location, the (q + 2)th "1" of g_i is put in g_i,1 at the same vertical location, and so on. This rotating distribution of the "1"s of g_i continues until all the "1"s of g_i have been distributed into the new columns.
The above column splitting results in a new parity-check matrix H_ext with q*n columns, which has the following important structural properties:
(1) Each row still has weight (p^s + 1)
(2) Each column has weight g_ext = g / q = (p^s + 1) / q
By using different factors of g, different extended finite-geometry LDPC codes can be obtained. A sketch of this splitting procedure is given below.
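As an illustration of the above procedure, the following Python sketch (our own, under the assumption that H is given as a dense 0/1 list-of-lists) performs the rotating column-split:

    def split_columns(H, q):
        rows, cols = len(H), len(H[0])
        Hext = [[0] * (cols * q) for _ in range(rows)]
        for j in range(cols):
            one_rows = [i for i in range(rows) if H[i][j]]
            assert len(one_rows) % q == 0, "q must divide the column weight"
            for k, i in enumerate(one_rows):
                # The k-th '1' of column j goes to new column j*q + (k mod q),
                # at the same vertical location i (the rotating distribution).
                Hext[i][j * q + (k % q)] = 1
        return Hext

Each row of Hext keeps its original weight, while each new column has weight g/q, matching properties (1) and (2) above.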
If one arranges the rows of a regular LDPC parity-check matrix derived from projective planes into a single square circulant sub-matrix and then splits each column into a fixed number of new columns, the resultant extended code is in quasi-cyclic form. In this patent work, quasi-cyclic column-split codes are required in the design of the decoder system.
Figure 2 illustrates a corresponding split Tanner graph. The design uses a distributed memory configuration by pushing all required inputs of a node into its local memory efficiently. Hence, communication issues such as waiting and data corruption are avoided. Direct wiring has been used because of the locality of communication. As mentioned before, designing efficient access to memory is the crux of this system design.
In parallel computing, there is seminal work done by Karmarkar relating to the design of systems which arise out of projective geometry considerations. In that work, Karmarkar defined two units of system behaviour: perfect access patterns and perfect access sequences. The main focus was to define a unit amount of processing requiring stimulation of the whole parallel system, such that it automatically results in a coherent, conflict-free operation and uniform, balanced use of all the resources such as processors, memories and the wires connecting them. Such units were called perfect access patterns. If such units are combined, then the resulting computation is still balanced with respect to resource usage. Such an aggregation was called a perfect access sequence. It has been proved by the authors that the behaviour of a column-split regular LDPC decoder while doing extrinsic information updating is made up of such perfect access patterns and sequences. To accommodate the separate nature of bit-node and check-node processing within a single iteration, the authors have extended Karmarkar's design template. This was made possible because the bit nodes and check nodes work in a non-overlapping fashion in time. We design in such a way that the memory blocks that these nodes access for operand input and output are non-common, thus fully partitioning the computation in time into two sub-computations. Although we have used 2-dimensional geometry for the design, due to splitting, the bipartite Tanner graph is non-symmetric with respect to a flip along a horizontal line running through the middle of the graph. However, the number of nodes in one row of the Tanner graph is a multiple of the number of nodes in the other row. With respect to memory configuration, the approach is to use the small local memory available on a processing node itself. Thus the problem of scheduling for perfect access patterns/sequences in LDPC decoding gets decomposed into two related-by-a-factor scheduling problems.
The apparatus in accordance with this invention is shown in Figure 3 of the accompanying drawings.
Since the perfect sequence behaviour is still guaranteed at the end of each phase as well as scans (introduced later), the focus of this design is to ensure a perfect pattern behaviour within the perfect sequence. The quasi-cyclicity suggests that one might not get a (total) perfect pattern at the atomic level of behaviour of the system.
However, the authors found that the semi-structure manifests not in the form of semi-perfection of the access pattern, but in the form of perfection due to staggered rotation. Figure 4 illustrates this fact.
In fact, there are q equivalence classes of chains of movement of a matrix element, as it shifts itself into the next row so as to give rise to a quasi-cyclic code's matrix. When q representatives of the q equivalence classes of chains of movement, moving in a staggered fashion, are put together, they cover all the numbers between 0 and (p^(2s) + p^s + 1). This fact is leveraged in designing a perfect pattern.
For the scheduling algorithm for this setup, recall that in the bit-node processing, the major processing (the same on all nodes) on q*(p^(2s) + p^s + 1) nodes involves adding the extrinsic information collected over all the connections of the particular bit-node to check nodes. Suppose that r is a factor of the out-degree (p^s + 1)/q of the bit-nodes. In the worst case, (p^s + 1)/q turns out to be a prime number. In that case, we either take r as 1, or as (p^s + 1)/q itself, depending on other implementation concerns. By grouping the edges going out of such a split node by r, one can make them rotate symmetrically.
Since the other row of nodes has not been split, the number of memory banks where the updated extrinsic information will be stored remains (p^(2s) + p^s + 1), and their arity is (p^s + 1) > r. If one aggregates the number of times these r edges will try to read/write from each memory block, one can see that it is uniform and its value is 'r' (or 'q*r'). Thus one obtains a perfect pattern behaviour. By grouping such perfect patterns, one gets a perfect sequence. As for check-node processing, the major processing involves adding functionally transformed input messages collected over all the connections of the particular check-node to bit nodes. Hence the number of processing elements remains (p^(2s) + p^s + 1). However, the number of memory blocks of the opposite type increases by a factor to q*(p^(2s) + p^s + 1), and their arity decreases correspondingly to (p^s + 1)/q. In order to have uniform memory block access, we need to read/write 's*q' inputs for XORing or addition at a time in each cycle, where 's' is any integer such that s*q divides (p^s + 1). A short sketch of this parameter selection is given below.
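The following Python sketch (illustrative only; note that the exponent is named s_exp here to avoid clashing with the schedule parameter s) enumerates the admissible values of r and s for given p, s and q:

    def schedule_params(p, s_exp, q):
        g = p ** s_exp + 1                 # original column weight (p^s + 1)
        assert g % q == 0, "q must divide (p^s + 1)"
        g_ext = g // q                     # out-degree of each split bit node
        # r must divide the split out-degree; s*q must divide (p^s + 1).
        r_candidates = [r for r in range(1, g_ext + 1) if g_ext % r == 0]
        s_candidates = [s for s in range(1, g + 1) if g % (s * q) == 0]
        return g_ext, r_candidates, s_candidates

    # Example: p = 5, s = 1, q = 2 -> g = 6, g_ext = 3, r in {1, 3}, s in {1, 3}.
    print(schedule_params(5, 1, 2))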
Figure 5 illustrates grouping for perfect pattern generation. This figure also illustrates the fact that, since the multiplicative group is cyclic, a "horizontal right shift" by one unit length of each group of a line and two points yields the next group of a line and two points.
Before designing the decoder system, one also needs to fix the length of the code, since our design is quite generic. A regular LDPC code's Tanner graph can be carved out from any projective plane. Many parameters (numerical properties) of codes can then be derived for each instance of a code. In iterative decoding, the main parameters used for shortlisting one for architectural design are the block size q*(p^(2s) + p^s + 1) and the number of iterations required. It is up to the user of this patent work to fix the codes before using the envisaged system.
Further, before the schedule is generated, one needs to design the interconnect network for a particular value of the prime number p. The question that naturally arises here is: what are the (indices of) points that are linked to a given line number/index, for example, 2? Initially one needs to go through an elaborate process of vector space construction for a non-split matrix, which is an indirect method of answering this question. Given a non-split matrix, it is easy to write a program to split it. Figure 6 shows a pictorial representation of a non-split interconnect for p = 5. A sketch of this construction is given below.
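By way of illustration, the following Python sketch (our own; it uses the standard homogeneous-coordinate model of the projective plane PG(2, p) over GF(p)) builds the non-split incidence between lines and points; point a lies on line l iff their dot product is 0 mod p:

    from itertools import product

    def pg2_points(p):
        pts, seen = [], set()
        for v in product(range(p), repeat=3):
            if v == (0, 0, 0) or v in seen:
                continue
            # Mark all scalar multiples, so each 1-D subspace appears once.
            for c in range(1, p):
                seen.add(tuple((c * x) % p for x in v))
            pts.append(v)
        return pts

    def incidence(p):
        pts = pg2_points(p)      # by duality, the same set indexes the lines
        return {li: [pi for pi, a in enumerate(pts)
                     if sum(x * y for x, y in zip(a, l)) % p == 0]
                for li, l in enumerate(pts)}

    # For p = 5: 31 lines, each containing 6 points -- the row weight (p + 1).
    inc = incidence(5)
    print(len(inc), {len(v) for v in inc.values()})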
In the decoding process based on belief propagation, we observe that for bit node computation, addition and subtraction are the two required operators. Similarly, for check node computation, addition, subtraction and XOR are the operators. Of course the check node computation is preceded/succeeded by transforming the inputs in the log domain.
In the envisaged architecture, each bit-node operator has to take (p^s + 1)/q inputs before it can produce an output. The inputs to these operators are grouped 'r' at a time. Hence an 'r'-ary operation of one type is feasible every cycle on all bit-node processing units for decoding purposes. Similarly, for check-node processing, an 's*q'-ary operation is feasible every cycle on all check-node processing units. This maps directly onto the balanced schedule of an iteration found in the perfect sequence.
Further, in the context of LDPC decoder dynamics, one needs to decide the order in which processing nodes are activated (fired). One also needs to decide in which order inputs are taken in and outputs are computed. This order is referred to as the schedule of the decoder. The proposed decoding system uses the flooding type of schedule. In this scheme, the nodes of the same type are all updated (e.g. the bit nodes) and then the nodes of the other type are all updated to finish the iteration. Each of these two periods is called a phase. In the envisaged system, one needs to further make sure that in every cycle the scheduled processing units are performing the same operation.
The implementation of each type of node involves more than one hardware unit (e.g., gates). The behaviour of the units is mainly arithmetic and logical operations, though the check nodes require complex transformation units for implementing f(x) and f^-1(x) defined previously. We assume here that all these hardware units perform their operation within one machine cycle, which is a practical assumption. If a particular operation runs over multiple cycles, then that operation must be pipelined.
Figure 7 depicts an adder as part of the system in accordance with this invention. For the bit-node architecture, we use one (r+1)-input adder with a feedback element on the bit-node side as the first hardware unit. It has a memory (register) in the feedback path to store the intermediate value of the addition. Apart from that, a register is provided to store the accumulated sum of extrinsic information.
We also use 'r' 2-input subtractors on the bit-node side. One of the inputs of these subtractors is connected to a register that stores the sum of all its inputs. Because of these changes, the input gathering in the accumulation scan becomes r-at-a-time, while the output calculation in the output scan also becomes r-at-a-time.
Depending on the quantization levels we choose, the adder and subtractors have to be of multi-bit precision.
Figure 8 and Figure 9 illustrate the check node architecture for the system in
accordance with this invention.
For the check-node processing, the sign and magnitude processing happen separately. The corresponding coefficients have to be derived from the same message value, by taking the leading bit(s) as the sign, and the remainder as the magnitude. Also, a log-domain transformation is required before magnitude processing, and a reverse transformation after processing. Thus, for sign processing, we use one (q*s + 1)-input XOR gate with a feedback element on the bit-node side as the first hardware unit. It has a memory (register) in the feedback path to store the intermediate value of the XORing. XORing a bit twice with another bit nullifies the effect of the duplicated bit. We use this property to remove the sign bit of a particular edge associated to a node from the group of sign bits of all edges related to that node. The 2-input XOR gates are used for that purpose.
Both the sign and magnitude processing parts have a memory (register) each in the feedback path to store the intermediate value of the main operator. Apart from that, a register is provided to store the accumulated sign bit (or the addition of log-likelihood ratios). The 's*q' 2-input subtracters for the magnitude processing each have one of their inputs connected to a register that stores the sum of all its inputs.
Depending on the quantization levels we choose, the adder, the XOR gates, the f(x) and f^-1(x) units, and the subtracters have to be of multi-bit precision. The implementation of the functions f(x) and f^-1(x) is one of the trickier aspects of check node implementation. Both are non-linear functions. When the quantization levels are smaller, these mathematical functions can be approximated using the first few terms of their infinite expansion. In that case, an ASIC unit can be put in using a few logic gates. Otherwise, an SRAM-based LUT block can also be used for the implementation; a sketch of such a LUT appears below. To mitigate the effect of extra bits generated during addition, a saturation unit can be added wherever required. Within a node, one can use higher-precision logic gates and internal busses to minimize the round-off error. So we may use higher precision (bit-width) adders and subtracters in the bit node architecture. In that case, the saturation blocks are placed at the outputs of each subtractor. For magnitude processing on the check node side, we place the saturation blocks just before the blocks implementing f^-1(x).
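The following Python sketch (illustrative only; the 6-bit width and 0.25 quantization step are assumptions, not values from this specification) builds such a LUT for f(x) = -ln(tanh(x/2)); since the function is its own inverse, the same table serves f^-1(x):

    import math

    BITS, STEP = 6, 0.25
    MAX_CODE = (1 << BITS) - 1

    def build_f_lut():
        lut = []
        for code in range(1 << BITS):
            x = max(code * STEP, STEP / 2)          # avoid f(0) = infinity
            y = -math.log(math.tanh(x / 2.0))
            lut.append(min(int(round(y / STEP)), MAX_CODE))  # saturate
        return lut

    F_LUT = build_f_lut()
    # Usage: a check node transforms a quantized magnitude m as F_LUT[m],
    # accumulates, then applies F_LUT again on the way out (inverse transform).
    print(F_LUT[:8])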
The layout of the memories, where the reliability messages sent out by the various processing units get stored, is as follows. In the design, the row-regularity is (p^s + 1), while the column-regularity is (p^s + 1)/q. But even though the number of bit-nodes (only) has increased, the number of check-memory blocks remains (p^(2s) + p^s + 1), and the size of each memory block remains (p^s + 1) as well. Let l_ij denote the reliability message that is sent to check node i from bit node j. Similarly, let l_ij* denote the reliability message that is sent to bit node i from check node j. Suppose that the '1's in the j-th column (corresponding to bit-node j) are placed at the set of row numbers {a_j0, a_j1, ..., a_j((p^s - 1)/2)}, where a_ji is a number between 0 and (p^(2s) + p^s). Then the message l_ij is stored in the memory block whose index is a_ji. A similar formulation works for the assignment of the messages {l_ij*}, even though the number of memory blocks in which they are to be stored increases. Here, we use a column in the split code matrix H_ext for that bit node to find out the indices of assignment. The size of each data object within a memory block depends on the quantization level of the messages. As discussed before, we use the vector space construction to identify which memory blocks (points in the projective plane) are linked to which processing units (lines in the projective plane) of both types. In the split-code case, we schedule r-ary operations on the bit-processing units in each cycle. The q*(p^(2s) + p^s + 1) bit nodes need r accesses each from (p^(2s) + p^s + 1) memory blocks, hence the number of ports per memory block is q*r. Hence the memory blocks are all q*r-port memories. The value of 'r' should be selected carefully to keep the value of q*r low. Luckily, it has been found analytically that most of the split-codes have 2 or 3 as a factor (candidate value of r). For check-processing units, as pointed out earlier, s*q inputs are required at a time. Hence s-port memories are required for reading and writing.
Each memory stores (p^s + 1) messages in some order. Our choice of order is one based on the connectivity of the (reverse) memory to the bit-processing units. We denote the columns of the matrix H_ext as lines, and the rows as points. Suppose point a_j is represented as an intersection of lines {l_j0, l_j1, ..., l_j(p^s)}, where l_ji is a number between 0 and (q*p^(2s) + q*p^s + 1). Without loss of generality, we assume that these numbers are linearly ordered, i.e. l_j0 < l_j1 < ... < l_j(p^s). If they are not, we can always reorder them. Then the message corresponding to processing unit number l_j0 is stored in the 0th location of the block, that for l_j1 in the 1st location, and so on. Such an ordering eases address generation, if it were not to be a look-up table implementation. A similar formulation applies for the storage of the messages output by the check nodes in the other set of memory blocks. A sketch of this mapping is given below.
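The following Python sketch (our own simplification, assuming a dense 0/1 matrix H) builds the message-to-memory-block mapping and the local address ordering just described; one memory block per row/point, with local addresses following the linear order of the connected processing-unit indices:

    def build_memory_map(H):
        rows, cols = len(H), len(H[0])
        mem = {}                               # (block, local_addr) -> (i, j)
        for block in range(rows):              # one memory block per row/point
            units = sorted(j for j in range(cols) if H[block][j])
            for addr, j in enumerate(units):   # l_j0 at location 0, l_j1 at 1, ...
                mem[(block, addr)] = (block, j)
        return mem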
It is known that the access to the memory blocks is structured. One needs to further answer the question: if the l_j0-th processing unit working on some binary operation accesses the j-th memory block, then in which cycle does it access it, and at which location (local address)? It follows for check node computation in the accumulation scan that, since we read/write 's*q' messages at a time, the m-th message associated with the processing unit l_ji is accessed from the j-th memory block's i-th location in cycle number floor(m / (s*q)). Because of this ring structure, the complexity of the hardware is significantly reduced. By doing a simple modulo addition and then doing a search in a LUT, one can find which out of the 0 to (k-1) points actually moved into the j-th memory location. That becomes the value of m.
However, for bit node computation in the accumulation scan, we need to re-answer the same question differently. A similar question will arise on the output scan boundaries of both phases, since the number of memories in which the output is stored is not equal to the number of processing units. Suppose the line/processing unit l_j0 is represented as an ordered collection of points {a_j00, a_j01, ..., a_j0((p^s - 1)/2)}. The order is a total order, i.e. a_j00 < a_j01 < ... < a_j0((p^s - 1)/2). Hence j is equal to a_j0k for some k. Also let (l_ji - l_j0) = q_i (remember that l_j0 < l_j1 < ... < l_j(p^s)). Then, by reordering the point set of l_ji, we can represent it as {a_j00 + q_i, a_j01 + q_i, ..., a_j0(p^s) + q_i}. The addition here is a modulo-p^s addition. Because of the modulo addition, the total order a_j00 < a_j01 < ... < a_j0(p^s) gets shifted in a circular way on the modulo 'ring'. If we represent it by {a_ji0, a_ji1, ..., a_ji(p^s)}, it looks like a_ji0 < a_ji1 < ... < a_ji(x-1) > a_jix < a_ji(x+1) < ... < a_ji(p^s). One of these numbers, say a_jim, is equal to j, since l_ji is connected to j, by definition. Since we read/write 'r' messages at a time, the m-th message associated with the processing unit l_ji is accessed from the j-th memory block's i-th location in cycle number floor(m / r). Again, to practically start address generation, one should choose l_j0 such that it is 0 (i.e., the 0th processing unit). A similar reverse formulation works for the output scan of check-node processing.
There are two ways by which processing units can access operands stored in the memories in a particular cycle. In the first way, the processing units themselves calculate/generate and place the address/location using an extra bus, for a memory access. One can alternatively build and embed an address generator within the memory block, which places 'r' (or 's*q') data objects on the 'q*r' (or 's') ports (or alternatively, allows 'r'/'s*q' data objects to be stored at 'q*r'/'s' locations), given the cycle number. Hence for each memory block, one will require one address generation block. Yet again, there are two ways of implementing this address generation block (wherever it is). Given the Tanner graph, one can pre-calculate the schedule and implement it using a LUT, as sketched below. Alternatively, one can use custom logic to design a counter-type logic based on the 'structure' in the access, which can act as the address generation block. The choice of implementation method is left open to the user of this patent work.
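A minimal sketch of such a LUT-based address generator in Python (assuming, as the text allows, that the per-cycle access schedule has been precomputed at design time):

    def make_address_generator(access_schedule):
        # access_schedule[cycle] is the precomputed list of local addresses;
        # in hardware this table would be burned into a small ROM/LUT.
        def addresses(cycle):
            return access_schedule[cycle % len(access_schedule)]
        return addresses

    # Example with r = 2 and a block of 6 messages read two at a time:
    gen = make_address_generator([[0, 1], [2, 3], [4, 5]])
    print(gen(0), gen(1), gen(4))   # -> [0, 1] [2, 3] [2, 3]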
The switch design for the architecture is a little involved. Inwards, each bit-processing unit has (p^s + 1)/q ports, while each check-processing unit has (p^s + 1) ports. Similarly, the corresponding memory blocks (the ones close to the bit-processing units) have 'q*r' ports at the interface level to the check-nodes, while the others have (p^s + 1) ports at their interface level. The line decoding to switch 'q*r' (or 's') ports to these numbers is different for the two scans. In the accumulation scan, each bit-processing unit reads 'r' inputs at a time, and hence only 'q*r' ports of the corresponding memory blocks need to be decoded. In the output scan of the bit-processing units, again, 'q*r' ports of the opposite memory blocks, out of (p^s + 1), need to be decoded. A similar reverse argument holds for the check-processing units. The remaining ports in each cycle can be turned off by the decoder. On similar lines, the switch element in each processing unit can 'select', turn 'on', and map its ports.
With this much architectural information, the schedule of the envisaged system is now presented. Within each node's computation, one can find two perfect patterns (sub-computations) which do not temporally overlap. These patterns are named the 'accumulation scan' and the 'output scan' respectively.
In the accumulation scan, one accumulates the sum of all input messages, in a suitably transformed domain. The architecture of the bit nodes and check nodes shows a corresponding adder element at the beginning (left) for this scan. For the sign processing on a check node, the accumulation is achieved by using an XOR gate. Further, each memory block is read either 's' or 'q*r' times within the same cycle by various processing units; hence all of its ports remain busy. However, within a 'phase', the accumulation scan needs only one type of memory, and hence the set of memories of the other type is idle during this scan.
In the output scan, we read the messages corresponding to edges, again either 's' or 'q*r' at a time, subtract/remove them from the accumulation, inverse transform them wherever needed, and output/store them in the memories corresponding to the other type of nodes. The architecture of the bit nodes and check nodes shows corresponding subtracter elements at the end (right) for this scan. For the sign processing on a check node, the effect of subtraction is achieved by using an XOR gate. While the reading of these coefficients happens from one type of memory, the storage happens in the other type of memory. For example, the bit nodes read from the bit node memory blocks during the output scan, and write into the check node memory blocks. It is proven that this way, both types of memories and all their ports are busy every cycle within the output scan. This happens irrespective of the 'phase' to which the output scan belongs.
Having provided the hardware structure and the schedule to stimulate the various elements, one can now understand the decoding process in detail. Decoding each frame takes many decoding iterations; here we discuss the details of a single iteration. First, the memory register in the feedback path of the adder (M_intr) is initialized with the intrinsic information of the current frame. All bit nodes then first accumulate the sum of their input messages, by adding them 'r' messages at a time. The messages are read in from the appropriate bit-memory blocks using suitable addressing. The intermediate value of the addition is accumulated in the memory M_intr. Since there are (p^s + 1)/q input messages per node, the accumulation scan needs (p^s + 1)/(q*r) cycles to finish. At the end of the accumulation scan, the other memory register M_acc is selected, and the sum is dumped into it.
Next, each of the bit node processing units starts reading 'r' messages at a time, and subtracting them from the sum using 'r' subtracters in parallel. The output of these subtracters is stored into the appropriate check-memory blocks. If a saturation unit is inserted, then the saturation is done just before storing the bit node outputs.
Once this phase is over, the check nodes take over the processing for the next phase of the same data frame. Both the intermediate memory registers S_intr and M_intr are initialized with 0. The messages are read in from the appropriate check-memory blocks using suitable addressing. The sign and magnitude are segregated, and handed over to different units for processing. The sign unit accumulates the XOR of the signs of all input messages in its corresponding accumulation scan, 's*q' at a time. The magnitude unit first transforms all coefficients into the f(x) (log) domain, before it similarly starts accumulation 's*q' at a time. Both of them use the corresponding intermediate accumulation memories, S_intr and M_intr, to store the intermediate results. Like bit-node processing, the accumulation scans of the sign and magnitude processing take (p^s + 1)/s cycles, working in parallel, to finish. At the end of the accumulation scan, the other memory registers, S_acc and M_acc respectively, are selected, and the sum is dumped into them. Next, both the sign and magnitude units of each node start reading 's*q' messages at a time. The sign bits and magnitudes are yet again segregated and routed. The magnitude part is (forward) transformed before being routed. The sign unit does an XOR of the 's*q' signs using 's*q' XOR gates in parallel, to nullify the effect of these signs in the accumulation. The magnitude unit subtracts the 's*q' magnitudes from the accumulation using 's*q' subtractors in parallel. The magnitude parts are then inverse transformed and combined with the appropriate sign output for a particular edge. They are then stored back into the appropriate bit-memory blocks. If a saturation unit is inserted, then the saturation is done just before storing the check node outputs.
Since the sign processing in the check node is lightweight when compared to the magnitude processing, it is possible that the accumulation and output scans of the two units are not synchronized. The real need for synchronizing them arises when the sign and magnitude parts of a coefficient are being combined, but one part has not arrived. In such a case, one needs to implement a rendezvous and insert appropriate buffers, most probably for storing the sign unit outputs.
Thus the first phase requires 2*(p^s + 1)/(q*r) cycles to finish, while the second phase requires 2*(p^s + 1)/s cycles. These numbers, however, are a coarse approximation.
The iterations (each a sum of two phases) stop when the parity checks are satisfied. There are two ways of checking them as per the literature. Alternatively, one stops after a fixed number of iterations.
The memories encountered in the overall system are certain registers within each node, and the bit-memory and check-memory blocks. Not all of these memory units are active in all machine cycles. Hence, wherever possible, chip select signals should be used. The design of the chip select signals turns out to be quite straightforward.
The control logic design for the envisaged system will be such that whenever a hardware unit is scheduled, the corresponding data operands are ready at its inputs. Hence the fetch-decode-execute paradigm gets reduced to just 'execute'. One can say that the internal control logic of the processing units (nodes) is deterministic as well as simple. Two parts of the control logic, namely the address generation unit and the switch, have been covered already. Another feature of the control logic is appropriate control signal generation (enabling) and distribution (sending) at the right cycle to the right unit(s). The system clock may be switched off for some of the elements that are idle during certain periods; implementing this power-saving mode requires clock-gating signals. Given that the whole schedule is determinable at design time itself (static scheduling), one can even design the various types of chip select signals that are required for memory synchronization. There will also be other signals, such as the frame start signal, which signals the start of a new iteration. Finally, a group of signals will need to be designed to handle exceptions, overflows etc. Whether the required control logic is microprogrammed using a control store, or implemented as hardwired control circuitry, is a prerogative of the implementer of this patent work.
Since the system design is based on proven properties of projective geometries, simulations are not strictly required. However, a detailed complexity and cost analysis was done, and the results are discussed now. The parameters based on which the comparison is done are:
1. Processing Unit parameters. Code length n, gate count and their fan-in were chosen for analysis.
2. Interconnect parameters. The number of edges/connections in the Tanner graph was chosen for analysis.
3. Memory parameters. The number of memory blocks and their width, and the number of memory registers per processing unit, are primary parameters. Apart from these, the number of memory accesses per cycle per block (and aggregated over an iteration), the number of read and write accesses per iteration, and the bandwidth utilization are further parameters. The corresponding utilization factor indicates what percentage of this access capacity is being used in a particular cycle.
4. Timing parameters. The number of cycles taken by a few aggregate units of time, namely the fold, the phase, and the iteration, are primary parameters here. Another important parameter is the system throughput.
The envisaged decoder system requires J = (p^(2s) + p^s + 1) processing units of the check-node type, and q*(p^(2s) + p^s + 1) of the bit-node type. Hence the overall number of processing units is (q + 1)*(p^(2s) + p^s + 1).
The bit-processing nodes have (r + 1) gates: one adder and 'r' subtractors. The check-processing nodes contain 2(q*s + 1) gates: (q*s + 1) XOR gates, one adder and q*s subtractors. We are assuming here that the functions f(x) and f^-1(x) are implemented using look-up tables. Hence they do not add to the gate count, but do add to area-related considerations.
Each bit node in our architecture is connected to both bit- and check-memory blocks. Hence the number of wires is 2*(p^(2s) + p^s + 1)*(p^s + 1). Turning to the memory layout, there are (p^(2s) + p^s + 1) bit-memory and q*(p^(2s) + p^s + 1) check-memory blocks. The width of the bit-memory blocks is (p^s + 1) data objects, and each is an s-port memory. The width of the check-memory blocks is (p^s + 1), and each is a 'q*r'-port memory. There are also two parallel registers in each bit-processing unit, and four registers in each check-processing unit. They count up to (2q + 4)*(p^(2s) + p^s + 1) memory registers. The memory bandwidth utilization is dependent on the type of scan that is going on. In the accumulation scan, one set of memory blocks (the opposite ones) is idle, and so are the connections to those blocks. In the output scan, the utilization is full, for both types of memory blocks are busy. In every iteration there are four scans, two of each type. Each memory type remains idle in one out of these four scans. For a bit-processing unit, either 'r' or '2*r' out of 2*(p^s + 1) connections get excited in a cycle (see the assumption below). Since this figure is the same over each cycle, the communication bandwidth utilization factor is either r/(2*(p^s + 1)) or r/(p^s + 1). Similarly, in the next phase, the communication bandwidth utilization factor for the check-processing units is either s*q/(2*(p^s + 1)) or s*q/(p^s + 1).
Assuming that both types of processing units take one cycle to execute each step of a scan, the number of cycles that execution of a phase (consisting of two scans) of either type takes is then (p^s + 1)/2 + (p^s + 1)/2 = (p^s + 1). The iteration time is thus 2*(p^s + 1). If N is the number of iterations to decode each frame of data, then the throughput is q*(p^(2s) + p^s + 1)*fs / (2*N*(p^s + 1)), where fs is the system frequency.
The number of cycles for execution of the 'bit' phase is (p^s + 1)/(q*r) + (p^s + 1)/(q*r) = 2*(p^s + 1)/(q*r). The number of cycles for execution of the 'check' phase is (p^s + 1)/s + (p^s + 1)/s = 2*(p^s + 1)/s. The iteration time is thus T = 2*(p^s + 1)/(q*r) + 2*(p^s + 1)/s. If N is the number of iterations to decode each frame of data, then the throughput is q*(p^(2s) + p^s + 1)*fs / (N*T), where fs is the system frequency and T the iteration time calculated above. One can see that the throughput has increased approximately by a factor of r. Some lower bounds on parameters related to area and speed are introduced as follows. One parameter is the gate count, which can crudely represent one part of the area; the other part is the area taken by the interconnect wires. Generally the complexity is expressed in terms of the code length J (or n) in the literature. By analytical means, we have found that, for large values of n, O(n*sqrt(n)) is a loose lower bound for the area. Corresponding lower bounds on the decoding time, involving some positive constant e, and on the throughput were derived similarly. A sketch of the cycle-count and throughput estimates above is given below.
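By way of illustration only, the following Python sketch (not part of the specification; the function name, the example values, and the assumption of one scan step per machine cycle are ours) evaluates the cycle counts and the throughput formula above:

    def throughput(p, s_exp, q, r, s, N, fs):
        n = p ** (2 * s_exp) + p ** s_exp + 1
        bit_phase = 2 * (p ** s_exp + 1) / (q * r)   # accumulation + output scan
        check_phase = 2 * (p ** s_exp + 1) / s
        T = bit_phase + check_phase                  # cycles per iteration
        block_len = q * n                            # bits decoded per frame
        return block_len * fs / (N * T)              # decoded bits per second

    # Example: p = 5, s = 1, q = 2, r = 3, schedule parameter s = 3,
    # 20 iterations per frame, 100 MHz clock.
    print(f"{throughput(5, 1, 2, 3, 3, 20, 100e6):,.0f} bits/s")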
While considerable emphasis has been placed herein on the various components of the preferred embodiment and the interrelationships between the component parts of the preferred embodiment, it will be appreciated that many alterations can be made and that many modifications can be made in the preferred embodiment without departing from the principles of this invention. These and other changes in the preferred embodiment as well as
other embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.
Dated this 31st day of January, 2007.

Documents

Orders

Section Controller Decision Date

Application Documents

# Name Date
1 176-MUM-2007-RELEVANT DOCUMENTS [28-09-2023(online)].pdf 2023-09-28
2 Other Patent Document [15-07-2016(online)].pdf 2016-07-15
3 176-MUM-2007-POWER OF ATTORNEY--(21-07-2016).pdf 2016-07-21
4 176-MUM-2007-RELEVANT DOCUMENTS [26-09-2022(online)].pdf 2022-09-26
5 176-MUM-2007-RELEVANT DOCUMENTS [29-09-2021(online)].pdf 2021-09-29
6 176-MUM-2007-POWER OF ATTORNEY-(21-07-2016).pdf 2016-07-21
7 176-MUM-2007-RELEVANT DOCUMENTS [29-03-2020(online)].pdf 2020-03-29
8 176-MUM-2007-CORRESPONDENCE--(21-07-2016).pdf 2016-07-21
9 176-MUM-2007-RELEVANT DOCUMENTS [23-03-2019(online)].pdf 2019-03-23
10 176-MUM-2007-CORRESPONDENCE-(21-07-2016).pdf 2016-07-21
11 176-MUM-2007-FORM 2(GRANTED)-(25-11-2016).pdf 2016-11-25
12 176-mum-2007-abstract(31-1-2008).pdf 2018-08-09
13 176-MUM-2007-CORRESPONDENCE(IPO)-(DECISION)-(25-11-2016).pdf 2016-11-25
14 176-MUM-2007-Abstract-070115.pdf 2018-08-09
15 176-MUM-2007-CORRESPONDENCE(IPO)-(25-11-2016).pdf 2016-11-25
16 176-mum-2007-claims(31-1-2008).pdf 2018-08-09
17 176-MUM-2007-Claims-070115.pdf 2018-08-09
18 Form 27 [20-03-2017(online)].pdf 2017-03-20
19 176-MUM-2007-CORRESPONDENCE(16-4-2009).pdf 2018-08-09
20 176-MUM-2007-RELEVANT DOCUMENTS [28-03-2018(online)].pdf 2018-03-28
21 176-mum-2007-correspondence(31-1-2008).pdf 2018-08-09
22 176-MUM-2007_EXAMREPORT.pdf 2018-08-09
23 176-MUM-2007-CORRESPONDENCE(IPO)-(FER)-(27-1-2014).pdf 2018-08-09
24 176-MUM-2007-Power of Attorney-070115.pdf 2018-08-09
25 176-MUM-2007-CORRESPONDENCE(IPO)-(HEARING NOTICE)-(11-6-2016).pdf 2018-08-09
26 176-MUM-2007-OTHERS-070115.pdf 2018-08-09
27 176-mum-2007-correspondence-received.pdf 2018-08-09
28 176-mum-2007-form-3.pdf 2018-08-09
29 176-mum-2007-description(complete)-(31-1-2008).pdf 2018-08-09
30 176-mum-2007-form-26.pdf 2018-08-09
31 176-MUM-2007-DESCRIPTION(PROVISIONAL).pdf 2018-08-09
32 176-mum-2007-form-2.pdf 2018-08-09
33 176-mum-2007-form-1.pdf 2018-08-09
34 176-mum-2007-drawing(31-1-2008).pdf 2018-08-09
35 176-mum-2007-drawings.pdf 2018-08-09
36 176-MUM-2007-Form 5-070115.pdf 2018-08-09
37 176-MUM-2007-Examination Report Reply Recieved-070115.pdf 2018-08-09
38 176-mum-2007-form 5(31-1-2008).pdf 2018-08-09
39 176-mum-2007-form 1(21-3-2007).pdf 2018-08-09
40 176-MUM-2007-Form 3-070115.pdf 2018-08-09
41 176-MUM-2007-Form 1-070115.pdf 2018-08-09
42 176-MUM-2007-Form 2(Title Page)-070115.pdf 2018-08-09
43 176-mum-2007-form 13(16-4-2009).pdf 2018-08-09
44 176-MUM-2007-FORM 2(TITLE PAGE)-(PROVISIONAL).pdf 2018-08-09
45 176-MUM-2007-FORM 18(16-4-2009).pdf 2018-08-09
46 176-mum-2007-form 2(title page)-(complete)-(31-1-2008).pdf 2018-08-09
47 176-MUM-2007-FORM 2(PROVISIONAL).pdf 2018-08-09
48 176-mum-2007-form 2(complete)-(31-1-2008).pdf 2018-08-09

E-Register / Renewals

3rd: 24 Mar 2017

From 31/01/2009 - To 31/01/2010

4th: 24 Mar 2017

From 31/01/2010 - To 31/01/2011

5th: 24 Mar 2017

From 31/01/2011 - To 31/01/2012

6th: 24 Mar 2017

From 31/01/2012 - To 31/01/2013

7th: 24 Mar 2017

From 31/01/2013 - To 31/01/2014

8th: 24 Mar 2017

From 31/01/2014 - To 31/01/2015

9th: 24 Mar 2017

From 31/01/2015 - To 31/01/2016

10th: 24 Mar 2017

From 31/01/2016 - To 31/01/2017

11th: 24 Mar 2017

From 31/01/2017 - To 31/01/2018

12th: 16 Jan 2018

From 31/01/2018 - To 31/01/2019

13th: 16 Jan 2019

From 31/01/2019 - To 31/01/2020

14th: 15 Jan 2020

From 31/01/2020 - To 31/01/2021

15th: 26 Feb 2021

From 31/01/2021 - To 31/01/2022

16th: 14 Dec 2021

From 31/01/2022 - To 31/01/2023

17th: 06 Dec 2022

From 31/01/2023 - To 31/01/2024

18th: 08 Jan 2024

From 31/01/2024 - To 31/01/2025

19th: 11 Dec 2024

From 31/01/2025 - To 31/01/2026