Abstract: A semi-parallel system for decoding low-density parity-check (LDPC) encoded data of length L in a bit stream, 'L' being given by the function (p^(2s) + p^s + 1), where p is a prime number and s is any positive integer, said system comprising: means to select a block of the bit stream of said length L, said block containing a plurality of bits; a parity check matrix having a plurality of elements; multiplication means to multiply each of said bits with said elements, to give a resultant bit-vector having a plurality of bits; a plurality of parity check nodes for implementing each row of said product matrix and updating information of each bit; a plurality of bit nodes adapted to receive said updated information from said parity check nodes, having means to calculate the bit probability and further adapted to transmit the calculated bit probability data to said parity check nodes; and a logic control means adapted to iterate information between said parity check nodes and bit nodes until decoding is complete.
FORM-2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
PROVISIONAL SPECIFICATION
(See section 10 and Rule 13)
COMPUTER SYSTEMS
TATA CONSULTANCY SERVICES LIMITED
An Indian Company
of Bombay House, 24, Homi Mody Street, Mumbai-400 001,
Maharashtra, India
THE FOLLOWING SPECIFICATION DESCRIBES THE INVENTION
Field of invention
This invention relates to computer systems.
Particularly, this invention relates to a computer system for communication.
Typically, this invention envisages a hardware system for decoding of low density parity-check (LDPC)-coded bit stream applicable to both wireless and wireline communication systems.
Background of the Invention:
Introduction
It is known that digital data transmission suffers from channel influence such as:
1) Severe (multipath) fading in terrestrial mobile radio communications.
2) Very low signal-to-noise ratio for satellite communications due to high path loss and limited transmit power in the downlink.
3) Compressed data (e.g. audio and video signals) is very sensitive to transmission errors.
Hence a message transmitted, especially over a long distance, may not be received exactly as it was sent. To recover the meaning of what was sent, it is therefore necessary to detect and, if possible, correct errors.
Wireless communication systems are especially prone to errors, and hence a robust error-correction system at the receiving end is almost a necessity. In recent times, such wireless communication systems have been widely deployed to provide different services. Popular technologies, both ad-hoc networking (Wireless LAN) and infrastructure-based (3G), all require error-correction.
One approach to error correction first encodes each unit of information (such as a byte) into a unit (called a word) drawn from a p-ary alphabet (the binary alphabet with 0 and 1 is the most used). Error detection or correction at the receiver end then maps each received unit of information to the "closest" p-ary codeword, thereby recovering the actual unit of information sent. This process is known as the encoding/decoding process in digital communication.
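As a toy illustration of this nearest-codeword principle, the following sketch maps a received binary word to the codeword at minimum Hamming distance. The four-word codebook is hypothetical and is not a code defined in this specification.

```python
# Toy nearest-codeword decoder: map a received binary word to the
# codeword at minimum Hamming distance. The codebook is a made-up
# 4-word example, not a code defined in this specification.

CODEBOOK = ["0000000", "1011100", "0101110", "1110010"]

def hamming_distance(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def decode(received: str) -> str:
    # Pick the codeword closest to the received word.
    return min(CODEBOOK, key=lambda c: hamming_distance(received, c))

print(decode("1011110"))  # -> "1011100" (the single flipped bit is corrected)
```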
It was the invention of turbo codes and turbo product codes that moved the limit of achievable coding gain much closer to the Shannon limit for low and high code rates, respectively. However, the iterative nature of the decoding algorithms for turbo codes and turbo product codes presents significant implementation challenges. The search for better iterative decoding algorithms led to the (re-)discovery of low-density parity-check (LDPC) codes. These codes are simpler to decode than other codes. For example, unlike turbo and turbo product codes, it is possible to decode LDPC codes using a block-parallel algorithm rather than a block-serial algorithm. Further, using the message-passing algorithm, LDPC decoders require an order of magnitude fewer arithmetic computations than equivalent turbo decoders. Finally, it was also established that LDPC codes are asymptotically superior to turbo codes with respect to coding gain.
Simplicity of implementation of decoder coupled with better performance has made these codes candidates for usage in the recent generation of digital transmission systems. For example, in digital satellite broadcasting standard (DVB-S2), a long LDPC code has been chosen for error-correction in downlink data. A similar case is being made in wireless LAN technology, as has been mentioned earlier. Because of this thrust, optimal systems related to LDPC decoding have a very high potential of being deployed.
Prior Art:
In the prior art, most of the work has been done on designing semi-parallel architectures for such decoders. This includes decoders for both regular and irregular codes, though designs for regular codes are more common. Most of the work suggests a strong correlation between the choice of codes and the decoder design; hence the importance of specialized designs is quite evident. Shu Lin and Marc Fossorier introduced projective geometry codes. Andrew Blanksby and Chris Howland pioneered a 1024-bit, rate-1/2 fully parallel generic decoder implementation. Parallel decoding of LDPC codes suffers from relatively large chip area. Apart from that, it also suffers from scalability issues. Scalability captures the ease of using the same architecture for extensions of the application that may require higher throughputs, increased code block sizes, higher edge degrees of nodes, etc. A folded architecture addresses these drawbacks of a parallel architecture at the cost of a reduction in speed/throughput and an increase in the complexity of the control logic. Other noticeable works in folded architecture design are those of Mansour and Shanbhag, and Boutillon et al.
In parallel computing, there is seminal work by Karmarkar relating to the design of systems that arise out of projective geometry considerations. Karmarkar defined two units of system behaviour: perfect access patterns and perfect access sequences. The main focus was to define a unit amount of processing, requiring stimulation of the whole parallel system, such that it automatically results in coherent conflict-free operation and uniform, balanced use of all the resources such as processors, memories and the wires connecting them. Such units were called perfect access patterns. If such units are combined, the resulting computation is still balanced with respect to resource usage. Such an aggregation was called a perfect access sequence. It has been proved by the authors that the behaviour of a regular LDPC decoder while updating extrinsic information is made up of such perfect access patterns and sequences. To accommodate the separate natures of bit-node and check-node processing within a single iteration, the authors have extended Karmarkar's design template. This is possible because the bit nodes and check nodes work in a non-temporally-overlapping fashion. The system in accordance with this invention is designed in such a way that the memory blocks that these nodes access for operand input and output are not shared, thus fully partitioning the computation temporally into two sub-computations.
Objects of the Invention:
One object of the invention is to provide a system which can easily replace any error-correcting component of decoder for communication systems where regular LDPC codes have been used.
Another object of this invention is to provide a decoding system which has a synchronous design.
Yet another object of this invention is to provide a system for decoding with error-correction facility.
An additional object of this invention is to provide a system for communication with decreased complexity of design and increased area efficiency.
Summary of the Invention:
In general, the system in accordance with this invention can be used in any decoder for communication systems where regular LDPC codes have been used. The decoder system has a folded architecture. Semi-parallel, or folded, architectures arise in hardware design due to the trade-off between area and time. The decoder system exploits the projective geometry regularity in a certain subset of Tanner graphs. However, following the industry practice of joint code-decoder design, the system has been designed specifically with LDPC codes arising out of projective planes in mind. The decoder is fully generic in the sense that it can be used to decode codes of any length, provided the length is of the type (p^(2s) + p^s + 1), where p is a prime number and s any positive integer.
This invention envisages a semi-parallel hardware architecture for an LDPC decoding system at the receiving end of a communication system. Particularly, this invention envisages a method and apparatus for decoding an LDPC-encoded bitstream, and includes a system for improving the decoding time-performance and memory access efficiency for such a bitstream.
The apparatus envisaged in accordance with this invention works on a highly generic architectural template so far used in other systems for other computation problems. It is suitable for designing many parallel systems. The underlying mathematical structure of regular LDPC codes is isomorphic to the structure of the architectural template, and hence the envisaged novel decoder system substantially improves the performance.
The approach for decoding envisaged in the system in accordance with the invention differs from the prior art in the sense that we optimally use the classical flooding schedule for decoding codes in a folded way. It is also, in some sense, a better architecture because we use memories with scheduling such that there are no memory access conflicts. Thus the resulting design is still fast compared to other semi-parallel designs. This architectural invention extends a prior patent in the same area by the author for fully-parallel projective geometry LDPC codes.
The belief propagation algorithm for decoding an LDPC-coded bitstream passes messages between so-called bit (variable) and check nodes. These messages are stored before the other nodes access and process them, in order to have synchronous execution. However, it is the lack of any structural regularity in these essentially random codes that creates a major challenge for building a practical LDPC decoder. Hence the envisaged system focuses on decoding regular codes only. In regular codes, each node in the Tanner graph passes the same number of messages in a phase of decoding. Further, by choosing projective geometry regular codes to lend additional regularity, it becomes possible to design a schedule in such a way that there are no memory or processor conflicts, as our system shows. In fact, various other "nice properties" that one generally seeks during parallel system design, such as load-balancing and efficient data-routing, are also achieved. This regularity can also be used to reduce the wiring and layout complexity in the final stages of hardware design.
Brief Description of the Drawings:
Figure 1 relates to a Tanner Graph corresponding to p = 3, s = 1
Figure 2 relates to the Basic Architecture of Folded-LDPC Decoding System
Figure 3 relates to an Example numbering for generated Perfect Pattern
Figure 4 relates to an Interconnect Network for p = 5
Figure 5 relates to an Example grouping of two perfect access patterns
Figure 6 relates to a Bit Node Architecture
Figure 7 relates to a Check Node Architecture for Sign Processing
Figure 8 relates to a Check Node Architecture for Magnitude Processing
Figure 9 relates to a Memory Layout for First Perfect Pattern Generation Scheme
Detailed Description of the Drawings:
Figure 1 illustrates the Tanner graph representation of the code, for understanding the belief propagation decoding algorithm. Each bit of the codeword is connected (hence related) to the constraints (in the form of parity-checks) that the bits must satisfy in order to form a valid codeword. During the iterative decoding process, messages are exchanged between the bits and the constraints (parity checks) along the edges of the graph. The steps in this algorithm can be briefly put down as follows:
1) All bit-nodes and their outgoing bit-messages are initialized to the intrinsic information, represented as a log-likelihood ratio of the received symbol.
2) The messages are sent to the parity-check nodes along the edges of the Tanner graph.
3) At the parity-check nodes, computation is performed separately on the sign and the magnitude part of each message. The XOR of the sign bits of the incoming bit messages at a parity-check node forms the parity check result for the row. The sign of each outgoing check message for each edge of the graph is formed by further XORing the sign of the incoming bit message of that edge with the row parity check result.
Further, an intermediate row parity reliability is computed, and the subsequent outgoing check message reliabilities are computed in the logarithmic domain as:

ln(λ_i) = Σ_{j ∈ N(i)} ln(tanh(λ_ij / 2))

where ln(λ_i) is the log of the parity reliability for row i, and λ_ij is the reliability of the message sent to parity-check node i by bit node j. The outgoing check message reliabilities (magnitudes) are calculated using ln(λ_i) as:

λ_ik = 2 · tanh⁻¹( exp( ln(λ_i) − ln(tanh(λ_ik / 2)) ) )

where λ_ik is the reliability of the check message from parity-check node i to variable node k.
4) These messages are now sent to the bit nodes along the edges of the Tanner graph.
5) At the bit nodes, the estimate of each decoded bit is updated by summing up the intrinsic information and the extrinsic information received along each edge (the check message log-likelihoods). The decoded bit is taken to be the sign of the summation for this round of iteration. Outgoing bit messages for the next decoding iteration are then formed using the same summation. However, each edge is updated without using the incoming message on the corresponding graph edge. This is easily implemented by subtracting the contribution of each edge from the group sum.
6) Repeat steps 2)-5) until a termination condition is met.
From the above description of the algorithm, it is clear that over the various iterations, the updating of extrinsic information is where most of the time is spent. Hence, the system envisaged in accordance with this invention focuses on optimizing that.
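The following behavioural sketch shows one flooding iteration of steps 2) to 5). It is a floating-point model for clarity, assuming the tanh-rule form of the check update; it is not the fixed-point hardware described hereinafter.

```python
import math

# One flooding iteration of belief propagation. H is the parity check
# matrix as a list of 0/1 rows, intrinsic holds the channel LLRs, and
# messages are kept in dicts keyed by the edge (check i, bit j).

def check_node_update(H, bit_msgs):
    check_msgs = {}
    for i, row in enumerate(H):
        edges = [j for j, h in enumerate(row) if h]
        for k in edges:
            # Product of tanh(L/2) over all other edges of check i.
            prod = 1.0
            for j in edges:
                if j != k:
                    prod *= math.tanh(bit_msgs[(i, j)] / 2.0)
            prod = max(min(prod, 0.999999), -0.999999)  # keep atanh finite
            check_msgs[(i, k)] = 2.0 * math.atanh(prod)
    return check_msgs

def bit_node_update(H, intrinsic, check_msgs):
    bit_msgs, hard_decision = {}, []
    for j in range(len(H[0])):
        edges = [i for i, row in enumerate(H) if row[j]]
        total = intrinsic[j] + sum(check_msgs[(i, j)] for i in edges)
        hard_decision.append(0 if total >= 0 else 1)  # sign of the summation
        for i in edges:
            # Step 5: exclude the incoming message on this very edge.
            bit_msgs[(i, j)] = total - check_msgs[(i, j)]
    return bit_msgs, hard_decision
```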
Let the length of the projective geometry code be J such that J = (p^(2s) + p^s + 1). Let q be a factor of J. Let γ = (p^s + 1) be the column weight. The design uses a distributed memory configuration by efficiently pushing all required inputs of a node into its local memory. Hence, communication issues such as waiting and data corruption are avoided. Direct wiring has been used because of the locality of communication. As will be clear later, it is possible to fold in such a way that certain (overlaid) nodes always access the same set of memories out of the J/q memory blocks. Hence, the connections remain static. Since both the processing units and the memory blocks have γ i/o ports on them, the ports are the ones that are dedicatedly cross-connected by wiring. As mentioned before, designing efficient access to memory is the crux of the system design in accordance with this invention.
The apparatus in accordance with this invention is shown in Figure 2 of the accompanying drawings.
A semi-parallel architecture represents a part of the Tanner graph. It consists of a smaller number of processing units implementing either the check or the bit node functionality. It also has a set of memory blocks to store the messages, and wires to realize the graph connectivity. A Tanner graph is foldable (i.e. a semi-parallel design is feasible) if and only if the corresponding parity check matrix H is structured in the way described below.
The folding can occur at two levels, and hence there are many ways of folding. At the lower level, one can try to fold the computation of one node, and then define various types of folding that become visible in the Tanner graph. At the higher level, one can consider folding the computation graph in the normal graphical sense. For example, the 1st unit performs the 1st bit node computation in one cycle, then the 5th bit node computation in the next cycle, and so on. The nature of the envisaged system is such that it already deals with first-level folding in the form of the perfect access pattern. Recall that a perfect access pattern stimulates only a fraction of the edges per node in a cycle. At the higher level, based on the (different or same) factors chosen for folding both rows of nodes, the folding can be vastly different. This patent work chooses uniform folding as the option. Further, in the envisaged architecture, all nodes of one type (e.g. bit nodes) finish their computation fully before the nodes of the other type (e.g. check nodes) take over. The phase boundary is then the normal boundary when all the computations of one type finish. Hence the other set of nodes folds in much the same way as the first one.
Let one fold both sets of nodes by a factor of q in our projective-plane based Tanner graph. There are then J/q processing units of either type. Let one also take J/q memory blocks. Then the size of each memory block required is (J*γ) / (J/q) = q*γ. By elaborate proof construction for generalization, it was found that any factor of J can lead to folding such that the corresponding schedule is of the perfect pattern/sequence nature.
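A minimal sketch of this folding arithmetic, assuming the example parameters p = 2, s = 2 and the folding factor q = 3 (so J = 21 = 3 × 7); these values are illustrative only:

```python
# Folding arithmetic for a projective-plane LDPC decoder.

def fold_parameters(p: int, s: int, q: int):
    J = p**(2 * s) + p**s + 1        # code length
    gamma = p**s + 1                 # column weight
    assert J % q == 0, "q must be a factor of J"
    units_per_type = J // q          # processing units of each type
    block_words = q * gamma          # (J * gamma) / (J / q) messages per block
    return J, gamma, units_per_type, block_words

print(fold_parameters(2, 2, 3))      # -> (21, 5, 7, 15)
```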
Figure 3 shows an illustration of such numbering and grouping for perfect pattern generation. Suppose l_00 is the first processing unit in the 0th fold, and let it be connected to the {a_000, a_001, ..., a_00p^s} memory blocks. It is assumed that the Tanner graph has been sorted and renumbered so that the cyclicity is in explicit form. Let one also take the first two connections, a_000 and a_001. Either they are equal or they are not. Now, as we make a horizontal shift J/q times, in both lines (processing units) and points (memory blocks),
1. l_00 covers all the J/q processing units present in the fold.
2. By virtue of modulo-(J/q) addition by 1, J/q times, the first point a_000 covers all the values between 0 and J/q. Similarly, the second point a_001 covers all the values between 0 and J/q as well.
We take modulo-(J/q) addition because we have only that many memory blocks. One can immediately see that all perfect sequence axioms are satisfied.
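A small sketch of this horizontal-shift generation, with hypothetical starting connections a_000 = 0 and a_001 = 2 and J/q = 7. Every unit index and every memory index appears exactly once per column, which is the balance the perfect-sequence axioms demand:

```python
# Horizontal-shift generation of one balanced pattern within a fold.
# Shifting both the unit index and the two memory indices by 1
# modulo J/q, J/q times, touches every unit and every block once.

def shift_pattern(a0: int, a1: int, units_per_fold: int):
    pattern = []
    for t in range(units_per_fold):
        unit = t                             # l_00 shifted t times
        mem_a = (a0 + t) % units_per_fold
        mem_b = (a1 + t) % units_per_fold
        pattern.append((unit, mem_a, mem_b))
    return pattern

for step in shift_pattern(0, 2, 7):          # hypothetical a_000=0, a_001=2
    print(step)
```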
There are two ways in which the 2-input computations done by a fold can be combined. One may combine the γ/2 2-input computations done by each processing unit in one fold only, repeat such computations and their combinations for all remaining (q - 1) folds, and combine such patterns. Alternatively, one may first combine the 2-input computations done by each processing unit in all the q folds, and then combine all such perfect patterns. The choice is left to the implementer; it is an important design decision that has to be made at the very beginning. For the scheduling algorithm for this setup, recall that in the bit-node processing, the major processing (the same on all nodes) involves adding extrinsic information collected over all the connections of the particular bit-node to check nodes. By taking, for example, two inputs at a time for addition, we can schedule a binary operation on each bit-node processing unit in (every) one cycle. Similarly, in the check-node processing, the major processing involves adding functionally transformed input messages collected over all the connections of the particular check-node to bit nodes.
Hence one can again find a perfect access pattern in check-node processing as well. An illustration of such numbering and grouping for perfect pattern generation is depicted in Figure 3. One can observe that since the multiplicative group is cyclic, a "horizontal right shift" by one unit length, of each group of one line and two points, yields the next group of one line and two points.
Before designing the decoder system, there is a need to fix the length of the code, since the design in accordance with this invention is quite generic. A regular LDPC code's Tanner graph can be carved out from any projective plane. Many parameters (numerical properties) of codes can then be derived for each instance of a code. In iterative decoding, the main parameters used for shortlisting a code for architectural design are the block size (p^(2s) + p^s + 1) and the number of iterations required. For a folded architecture, one needs to decide on the bigger and the smaller (embedded) projective plane simultaneously. This is because different values of J will have different factors, q, that one will need to work with. Analytically, it was found that there are only very few codes for which a factorization does not exist. For most of the remaining, 3 or 7 are the dominant small factors. It is up to the user of this patent work to fix the codes before using the envisaged system. Further, before the schedule is generated, one needs to design the interconnect network for a particular value of the prime number p. The question that naturally arises here is: what are the (indices of) points that are linked to a given line number/index, for example, 2? Initially one needs to go through an elaborate process of vector space construction for a non-folded matrix, which is an indirect method of answering this question. Given a non-folded matrix, it is easy to write a program to fold it using modulo-(J/q) addition. A pictorial representation of a non-folded interconnect for p = 5 is given in Figure 4.
In the decoding process based on belief propagation, we observe that for the bit node computation, addition and subtraction are the two required operators. Similarly, for the check node computation, addition, subtraction and XOR are the operators. Of course, the check node computation is preceded/succeeded by transformation of the inputs into/out of the log domain.
In the envisaged architecture, each operator has to take (p^s + 1) inputs before it can produce an output. The inputs to these operators are grouped two at a time. Hence a binary operation of one type is feasible every cycle on all processing units of one type for decoding purposes. We can now clearly map this onto the balanced schedule of an iteration found in the perfect sequence. The most important advantage of using binary operations is that we are able to use off-the-shelf dual-port memories to fetch two operands at a time. By keeping the number of ports limited to two, one can keep the cost, area and power consumption in control. The final advantage is being able to use 3-input hardware units such as adders, as discussed later. A hardware unit with a higher degree of fan-in is significantly more complex, at times requires custom logic, and blows up the cost of producing a decoder.
Except for the prime number p = 2, for all other prime numbers, perfect sequences can be formed in one of the following two ways. We use here the "horizontal shift" property discussed earlier. Since a line is an unordered set of points, we reorder each such set in such a way that we get multiple units of two points showing the "horizontal shift" property. By folding the number of nodes and the number of edges, we move towards a three-level symmetry. While exciting 2 inputs at a time, each group of J/q processing units shows memory access balance within a single cycle. Across q such cycles, all the q groups show balance. These balanced patterns from these q cycles combine to form a perfect pattern as found in fully-parallel computation, when combined temporally. Finally, all such (combined) perfect patterns form a balanced perfect sequence. The latter combination is illustrated in Figure 5. The execution of a perfect sequence thus takes multiple cycles. For the prime number 2, (p^s + 1) is divisible by 3 for odd values of s. Hence one can design a scheduling algorithm for this number as well.
Further, in the context of LDPC decoder dynamics, one needs to decide the order in which processing nodes are activated (fired). One also needs to decide in which order inputs are taken in and outputs are computed. This order is referred to as the schedule of the decoder. The proposed decoding system uses the flooding type of schedule. In this scheme, the nodes of one type are all updated (e.g. the bit nodes) and then the nodes of the other type are all updated to finish the iteration. Each of these two periods is called a phase. In the envisaged system, one needs to further make sure that in every cycle the scheduled processing units are performing the same operation.
The implementation of each type of node involves more than one hardware unit (e.g., gates). The behaviour of the units is mainly arithmetic and logical operations, though the check nodes require complex transformation units for implementing the functions f(x) and f^(-1)(x) discussed below. We assume here that all these hardware units perform their operation within one machine cycle, which is a practical assumption. If a particular operation runs over multiple cycles, then that operation must be pipelined.
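As an illustration of a table-based transformation unit, the following sketch assumes f(x) = -ln(tanh(x/2)), which is its own inverse, together with a hypothetical 16-level uniform quantization over (0, 8]; neither choice is fixed by this specification.

```python
import math

# SRAM-style lookup table for the check node transform
# f(x) = -ln(tanh(x/2)). The table is filled once at design time.

LEVELS, MAX_MAG = 16, 8.0        # illustrative quantization choices
STEP = MAX_MAG / LEVELS

def f(x: float) -> float:
    return -math.log(math.tanh(x / 2.0))

# One entry per quantization bin, evaluated at the bin midpoint.
LUT = [f((i + 0.5) * STEP) for i in range(LEVELS)]

def f_lut(magnitude: float) -> float:
    # Index the table with the quantized magnitude.
    idx = min(int(magnitude / STEP), LEVELS - 1)
    return LUT[idx]

print(f_lut(1.0), f(1.0))        # table value vs. exact value
```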
Figure 6 depicts an adder as part of the system in accordance with this invention. For the bit-node architecture, we use one 3-input adder with a feedback element on the bit-node side as the first hardware unit. It has at least one memory (register) in the feedback path to store the intermediate value of the addition. Apart from that, at least one register is provided to store the accumulated sum of extrinsic information.
We also use two 2-input subtractors on the bit-node side. One of the inputs of each of these subtractors is connected to a register that stores the sum of all its inputs. Because of this, the input gathering in the accumulation scan becomes 2-at-a-time, while the output calculation in the output scan also becomes 2-at-a-time.
However, the two different ways of combining balanced patterns for each fold (discussed earlier) have an impact on the count of M_intr and M_acc registers. Suppose one chooses the second way, where one combines patterns of multiple folds first. In this case, each processing unit would have done only partial computation (e.g. accumulation). Further, nodes of the non-folded Tanner graph are overlaid on these processing units q times. Hence, we need to store q copies of the partial computation (say, the accumulation). This means q M_intr registers per bit processing unit are required. Similarly, q M_acc registers per bit processing unit are required, to store the q summations that it has performed. Apart from this, two switches need to be incorporated to handle the choice of 1-out-of-q registers depending on the value of the machine cycle. In the first option, partial computation does not happen. Hence multiple M_intr registers per bit processing unit are not required. However, the overlay is still there, hence q M_acc registers per bit processing unit are required, along with a corresponding switch element.
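A behavioural sketch of one bit-node processing unit under these definitions, in floating point and with hypothetical inputs; the per-fold register switching and saturation are omitted for brevity:

```python
# One bit-node processing unit: an accumulation scan through a
# 3-input adder (two messages plus the feedback register M_intr per
# cycle), then an output scan where two 2-input subtractors remove
# each edge's own contribution from the stored sum M_acc.

def bit_node_process(intrinsic: float, messages: list) -> list:
    m_intr = intrinsic                        # feedback register, initialized
    for c in range(0, len(messages), 2):      # accumulation scan, 2 at a time
        m_intr = m_intr + messages[c] + messages[c + 1]
    m_acc = m_intr                            # dump the sum into M_acc
    outputs = []
    for c in range(0, len(messages), 2):      # output scan, 2 subtractors
        outputs.append(m_acc - messages[c])
        outputs.append(m_acc - messages[c + 1])
    return outputs

# gamma = p^s + 1 edge messages; four hypothetical values here.
print(bit_node_process(0.5, [1.0, -2.0, 0.25, 0.75]))
```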
Depending on the quantization levels we choose, the adder and subtractors have to be of multi-bit precision.
Figure 7 and Figure 8 illustrate the check node architecture for the system in accordance with this invention. For the check-node processing, the sign and magnitude processing happen separately. The corresponding coefficients have to be derived from the same message value, by taking the leading bit(s) as the sign and the remainder as the magnitude. Also, a log-domain transformation is required before magnitude processing, and the reverse transformation after processing. Thus, for sign processing, we use one 3-input XOR gate with a feedback element on the check-node side as the first hardware unit. It has a memory (register) in the feedback path to store the intermediate value of the XORing.
XORing a bit twice with another bit nullifies the effect of the duplicated bit. We use this property to remove the sign bit of a particular edge associated with a node from the group of sign bits of all edges related to that node. The 2-input XOR gates are used for that purpose.
Both the sign and magnitude processing parts have at least one memory (register) each in the feedback path to store the intermediate value of the main operator. Apart from that, at least one register is provided to store the accumulated sign bit (or the addition of log-likelihood ratios). The two 2-input subtractors for the magnitude processing each have one of their inputs connected to a register that stores the sum of all their inputs.
Depending on the quantization levels as chosen, the adder, the XOR gates, f(x) and f^(-1)(x), and the subtractors have to be of multi-bit precision. The implementation of the functions f(x) and f^(-1)(x) is one of the trickier aspects of check node implementation. Both these functions are non-linear. When the quantization levels are few, these mathematical functions can be approximated using the first few terms of their infinite expansion. In that case, an ASIC unit can be put in using a few logic gates. Otherwise, an SRAM-based LUT block can also be used for the implementation. To mitigate the effect of extra bits generated during addition, a saturation unit can be added wherever required. Within a node, one can use higher-precision logic gates and internal busses to minimize the round-off error. So we may use higher precision (bit-width) adders and subtractors in the bit node architecture. In that case, the saturation blocks are placed at the outputs of each subtractor. For magnitude processing on the check node side, we place the saturation blocks just before the blocks implementing f^(-1)(x). To figure out where a message will get stored, let us take the bit node j in the unfolded Tanner graph, whose output messages are {λ_ij}. Suppose line l_j is represented as a collection of points {a_j0, a_j1, ..., a_jp^s}, where 0 ≤ a_ji < (p^(2s) + p^s + 1). Then the message λ_ij is stored in memory block (a_ji modulo-(J/q)). Since λ_ij denotes the reliability message that is sent to check node i from bit node j, these messages are stored in check-memory blocks. A special case may arise when messages corresponding to two consecutive edges get stored in the same memory block. This happens for λ_ij and λ_(i+1)j, when (a_ji modulo-(J/q)) = (a_j(i+1) modulo-(J/q)), even when a_ji ≠ a_j(i+1). The whole architecture still works because the number of memory accesses remains balanced in every fold.
Since our interconnect graph is symmetric with respect to a flip over a line passing through the middle of the graph, exactly the same scheme can be used to place the messages sent by check nodes to the bit nodes.
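A minimal sketch of this placement rule, using a hypothetical line of a projective plane with J = 21 and q = 3. Note how two points of the line collide in block 0, which is exactly the special case discussed above:

```python
# Placement rule: the message travelling through point a_ji lands in
# memory block (a_ji mod J/q). The line below is hypothetical.

def memory_block(point_index: int, J: int, q: int) -> int:
    return point_index % (J // q)

J, q = 21, 3
line_points = [0, 4, 14, 16, 17]                     # hypothetical line l_j
print([memory_block(a, J, q) for a in line_points])  # -> [0, 4, 0, 2, 3]
```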
The layout of the memories, where the reliability messages sent out by the various processing units get stored, is as follows. Individually, the memory blocks are all dual-port memories. The J/q processing units per fold generate γ*J/q outputs, so each folded pattern needs to deal with γ*J/q messages. Also, one needs to store all the messages generated by the q folds, because we are dealing with a flooding schedule. Hence, overall, γ*J messages need to be stored in J/q memory blocks, and the size/width of each block is γ*q, i.e. q(p^s + 1). Earlier, it was pointed out that there are two different ways in which one can combine the 2-input computations done by a fold. The memory layout depends on which option one chooses.
Suppose one chooses the latter approach. Here, the 2-input computations done by each processing unit in all the q folds are combined first, and then all such perfect patterns are combined. In such a case, a colour-coded version of the memory layout is shown in Figure 9.
In this approach, the first-level substructure arises by making γ/2 bins within each block, one bin for each full (non-folded, rolled-out) perfect pattern. The bins are arranged in linear order with respect to the full perfect patterns. Within a perfect pattern, each processing unit accesses two memory locations. It may access them either in the same memory block, or in different memory blocks. In the former case, the ith out of (J/q) processing units of the kth fold (0 ≤ i < J/q, 0 ≤ k < q) may store both its messages in the pth memory block. If the index of the perfect pattern is l, then these two messages are in the lth bin of the pth memory block, in two consecutive locations. The offset of these locations from the start of the bin is 2k and (2k + 1). In the latter case, the ith out of (J/q) processing units of the kth fold may store exactly one message in the pth memory block. If the index of the perfect pattern is l, then (one of the two) messages is in the lth bin of the pth memory block. If the i'th out of the processing units of the kth fold also accesses the pth memory block, and if i < i', then the offset of the location of the message corresponding to the ith processing unit from the start of the bin is 2k, while that of the i'th is (2k + 1).
One gets a dual layout when the other approach is considered.
The size of each data object within a memory block depends on the quantization level of the messages. As discussed before, we use the vector space construction to identify which memory blocks (points in the projective plane) are linked to which processing units (lines in the projective plane) of both types.
It is known that the access to the memory blocks is structured. One needs to further answer the following question in the context of a single fold: if the l_j0th processing unit working on some binary operation accesses the jth memory block, then in which cycle does it access it, and at which location (local address)? Since we read/write two messages at a time, the mth message associated with the processing unit l_jj is accessed from the jth memory block's ith location in a cycle number determined by the ring structure of the schedule. Because of this ring structure, the complexity of the hardware is significantly reduced. By doing a simple modulo addition and then doing a search in a LUT, one can find which out of the 0 to (k-1) points actually moved into the jth memory location. That becomes the value of m.
There are two ways in which processing units can access operands stored in the memories in a particular cycle. In the first way, the processing units themselves calculate/generate and place the address/location, using an extra bus, for a memory access. Alternatively, one can build and embed an address generator within the memory block, which, given the cycle number, places two data objects on the two ports (or alternatively allows two data objects to be stored at two locations). Hence for each memory block, one will require one address generation block. Yet again, there are two ways of implementing this address generation block (wherever it resides). Given the Tanner graph, one can pre-calculate and implement it using a LUT. Alternatively, one can use custom logic to design counter-type logic based on the 'structure' in the access, which can act as the address generation block. The choice of implementation method is left open to the user of this patent work. The address generation requirements are apparent from the memory layout and the flow of time as depicted in Figure 9. Since one can combine balanced patterns for a fold in two different ways to form a perfect sequence, the requirements also correspondingly differ. One still needs to answer:
1. The ith out of (J/q) processing units of the kth fold accesses one or both of its messages in the pth memory block for the lth perfect pattern. How does one get the value of p?
2. How does one know whether one or both of the messages are going to be stored in the same memory block?
To answer both questions, let the non-folded Tanner graph be re-ordered to exhibit the rotational symmetry. In that case, l_j0 = k*q + i. Let the corresponding point set be {a_j00, a_j01, ..., a_j0p^s : a_j00 < a_j01 < ... < a_j0p^s}. The lth perfect pattern implies that the two edges connecting to points a_j0(2l) and a_j0(2l+1) will be excited. Let p = [a_j0(2l) modulo-(J/q)] and p' = [a_j0(2l+1) modulo-(J/q)]. Then the memory blocks accessed by the ith out of (J/q) processing units of the kth fold for the lth perfect pattern are found in the appropriate bins of the pth and p'th memory blocks.
It is possible that p = p', due to the modulo operation. In that case, both the messages corresponding to the lth perfect pattern access are found in the same memory block.
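A sketch of this computation for a hypothetical sorted point set, returning p, p' and whether both messages land in the same block:

```python
# Blocks touched by the i-th unit of the k-th fold during the l-th
# perfect pattern: p = a_j0(2l) mod (J/q), p' = a_j0(2l+1) mod (J/q).

def pattern_blocks(points, l: int, J: int, q: int):
    p_blk = points[2 * l] % (J // q)
    p_prime = points[2 * l + 1] % (J // q)
    return p_blk, p_prime, p_blk == p_prime   # same-block special case?

points = sorted([0, 4, 14, 16, 17])     # a_j00 < a_j01 < ... (hypothetical)
print(pattern_blocks(points, 0, 21, 3))  # pattern l=0 -> (0, 4, False)
print(pattern_blocks(points, 1, 21, 3))  # pattern l=1 -> (0, 2, False)
```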
To practically start the address generation, one should choose l_j0 such that it is 0 (i.e., the 0th processing unit).
Both the processing units and the memories have (p^s + 1) ports. In each cycle, they operate using only two ports, and over the cycles they thus cover the whole set of (p^s + 1) ports. Hence the switch is distributed across the architecture; each processing unit and each memory block has its own switch. Because of the folding, an additional level of sub-structure can be found. If one takes the second way of combining patterns, then for a processing unit, the function of the switch is to 'select' and turn 'on' two out of the (p^s + 1) ports within a block of q cycles (or γ/2 cycles) of each non-folded pattern. Hence it needs custom 2-to-(p^s + 1) line decoding logic. The remaining ports in each cycle can be turned off by the decoder. On similar lines for the first option, the switch element in each memory block 'selects', turns 'on', and maps two out of the (p^s + 1) ports to its two internal ports (we are using dual-port memories). We need similar line decoding logic for this switch element.
With this much architectural information, the schedule of the envisaged system is now presented for the first option of folding. Within each node's computation, one can find two perfect patterns (sub-computations) which do not temporally overlap. These patterns are named the 'accumulation scan' and the 'output scan' respectively.
In the accumulation scan, one accumulates the sum of all input messages, in a suitably transformed domain. The architecture of the bit nodes and check nodes shows a corresponding adder element at the beginning (left) for this scan. For the sign processing on the check node, the accumulation is achieved by using an XOR gate. Further, each memory block of the fold is read two times within the same cycle by various processing units. Hence both of its ports remain busy. However, within a 'phase', the accumulation scan needs only one type of memory, and hence the set of memories of the other type is idle during this scan.
In the output scan, one reads the messages corresponding to the edges, again two at a time, subtracts/removes them from the accumulation, inverse transforms them wherever needed, and outputs/stores them in the memories corresponding to the other type of node. The architecture of the bit nodes and check nodes shows corresponding subtractor elements at the end (right) for this scan. For the sign processing on the check node, the effect of subtraction is achieved by using an XOR gate. While the reading of these coefficients happens from one type of memory, the storage happens in the other type of memory. For example, the bit nodes read from the bit-memory blocks during the output scan, and write into the check-memory blocks. It is proven that in this way, both the memories and each of their ports are busy every cycle within the output scan. This happens irrespective of the 'phase' to which the output scan belongs. Having provided the hardware structure and (one of the two possible) schedule(s) to stimulate the various elements, one can now understand the decoding process in detail.
Decoding each frame takes many decoding iterations. Here we discuss the details of a single iteration. First, the memory register in the feedback path of the adder (M_intr) is initialized with the intrinsic information of the current frame. In the first folding option, all bit nodes of a particular fold then first accumulate the sum of their input messages, by adding them two messages at a time. The messages are read in from the appropriate bit-memory blocks using suitable addressing. The intermediate value of the addition is accumulated in the memory M_intr. Since there are (p^s + 1) input messages per node, the accumulation scan needs (p^s + 1)/2 cycles to finish. At the end of the accumulation scan, the other memory register M_acc meant for that particular fold is selected, and the sum is dumped into it.
By re-initializing M_intr for the next fold, and doing the accumulation in a similar way, the computation proceeds till all bit nodes are covered. Next, each of the bit node processing units within a particular fold starts reading two messages at a time, and subtracting them from the sum using two subtractors in parallel. The output of these subtractors is stored into the appropriate check-memory blocks. If a saturation unit is inserted, then the saturation is done just before storing the bit node outputs.
Once this phase is over, the check nodes take over the processing for the next phase of the same data frame. Both the intermediate memory registers S_intr and M_intr are initialized with 0. The messages are read in from the appropriate check-memory blocks using suitable addressing. The sign and magnitude are segregated, and handed over to different units for processing. The sign unit accumulates the XOR of the signs of all input messages in its corresponding accumulation scan, two at a time. The magnitude unit first transforms all coefficients into the ln[tanh(x/2)] domain, before it similarly starts accumulating two at a time. Both of them use the correspondingly selected intermediate accumulation memories, S_intr and M_intr, to store the intermediate results. Like the bit-node processing, the accumulation scans of the sign and magnitude processing take (p^s + 1)/2 cycles, working in parallel, to finish. At the end of the accumulation scan, the other memory registers meant for this fold's processing, S_acc and M_acc respectively, are selected, and the sums are dumped into them.
By re-initializing S_intr and M_intr for the next fold, and doing the accumulation in a similar way, the computation proceeds till all check nodes are covered. Next, both the sign and magnitude units of each node start reading two messages at a time. The sign bits and magnitudes are yet again segregated and routed. The magnitude part is (forward) transformed before being routed. The sign unit XORs the two signs using two XOR gates in parallel, to nullify the effect of those signs in the accumulation. The magnitude unit subtracts the two magnitudes from the accumulation using two subtractors in parallel. The magnitude parts are then inverse transformed and combined with the appropriate sign output for a particular edge. They are then stored back into the appropriate bit-memory blocks. If a saturation unit is inserted, then the saturation is done just before storing the check node outputs.
Since the sign processing in the check node is lightweight compared to the magnitude processing, it is possible that the accumulation and output scans of the two units are not synchronized. The real need for synchronizing them arises when the sign and magnitude parts of a coefficient are being combined, but one part has not arrived. In such a case, one needs to implement a rendezvous and insert appropriate buffers, most probably for storing the sign unit outputs.
Thus each of the two phases requires 2*q*(p^s + 1)/2, i.e. q*(p^s + 1), cycles to finish. This number, however, is a coarse approximation.
The iterations (each the sum of the two phases) stop when the parity checks are satisfied; there are two ways of checking them as per the literature. Alternatively, one stops after a fixed number of iterations.
The memories one encounters are certain registers within each node (depending on the choice of combination scheme), and the bit-memory and check-memory blocks. Not all of these memory units are active in all machine cycles. Hence, wherever possible, chip select signals should be used. The design of the chip select signal turns out to be quite straightforward.
As an example, in the second scheme, there are q M_intr and q M_acc memory registers in the bit node for the first phase. Both sets are labelled using a one-to-one association with the folding index (0 to q-1). Between the 0th and the (γ/2 - 1)th perfect pattern of the accumulation scan, each M_intr is selected once every q cycles, in linear order based on their index, during every fold computation worth one cycle. This is because exactly one node per fold gets overlaid on the processing unit. For the last perfect pattern of the accumulation scan, each M_acc is selected once every q cycles, in linear order based on their index, during every fold computation worth one cycle, to store the final sum. A similar selection of M_acc repeats during the output scan. The bit-memory blocks are read continuously from the 0th to the [q(p^s + 1) - 1]th cycle. The check-memory blocks are written into from the [q(p^s + 1)/2]th cycle to the [q(p^s + 1) - 1]th cycle.
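A sketch of these chip-select windows for one bit phase, assuming the earlier illustrative parameters p = 2, s = 2, q = 3 (a phase length of 15 cycles):

```python
# Chip-select windows per bit phase: bit-memory blocks are read from
# cycle 0 to q*(p^s+1)-1; check-memory blocks are written from cycle
# q*(p^s+1)/2 onward, during the output scan.

def chip_select(cycle: int, p: int, s: int, q: int):
    phase_len = q * (p**s + 1)
    bit_mem_read = 0 <= cycle < phase_len
    check_mem_write = phase_len // 2 <= cycle < phase_len
    return bit_mem_read, check_mem_write

for c in (0, 7, 8, 14):                   # p=2, s=2, q=3: phase_len = 15
    print(c, chip_select(c, 2, 2, 3))
```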
A similar selection happens for the sign- and magnitude-processing registers in the check processing units during the second phase as well.
The control logic design for the envisaged system will be such that whenever a hardware unit is scheduled, the corresponding data operands are ready at its inputs. Hence the fetch-decode-execute paradigm gets reduced to just 'execute'. One can say that the internal control logic of the processing units (nodes) is deterministic as well as simple. Two parts of the control logic, namely the address generation unit and the switch, have been covered already. Another feature of the control logic is the generation (enabling) and distribution (sending) of the appropriate control signals at the right cycle to the right unit(s). The system clock may be switched off for some of the elements that are idle during certain periods; implementing this power-saving mode requires signals. Given that the whole schedule is determinable at design time itself (static scheduling), one can even design the various types of chip select signals that are required for memory synchronization. There will also be other signals, such as the frame start signal, which signals the start of a new iteration. Finally, a group of signals will need to be designed to handle exceptions, overflows etc. Whether the required control logic is microprogrammed using a control store, or implemented as hardwired control circuitry, is a prerogative of the implementer of this patent work.
This decoder architecture can be scaled to handle higher block sizes. Since our Tanner graph is based on projective geometry, changing the value of J implies a corresponding change in the value of p^s. This means that the set of factors (q) also changes. Hence only the individual units, such as the bit-processing unit etc., can be carried over to the new architecture. The address generation units also carry over, when configured using a new value of p^s. The memory block size increases, though the internal structure remains the same. The switches need to be redesigned, though.
Since the system design is based on proven properties of projective geometries, simulations are not strictly required. However, a detailed complexity and cost analysis was done, and the results are discussed now.
The parameters based on which comparison is done are:
1. Processing Unit parameters. Code length n, gate count and their fan-in were chosen for analysis.
2. Interconnect parameters. The number of edges/connections in Tanner graph was chosen for analysis.
3. Memory parameters. The number of memory blocks and their width, and the number of memory registers per processing unit are the primary parameters. Apart from this, the number of memory accesses per cycle per block (and aggregated over an iteration), the number of read and write accesses per iteration, and the bandwidth utilization are further parameters. The corresponding utilization factor indicates what percentage of this access capacity is being used in a particular cycle.
4. Timing parameters. The number of cycles taken by a few aggregate units of time, namely the fold, the phase, and the iteration, are the primary parameters here. Another important parameter is the system throughput. The envisaged decoder system requires J/q = (p^(2s) + p^s + 1)/q processing units of the check-node type, and as many of the bit-node type. Hence the overall number of processing units is 2*(p^(2s) + p^s + 1)/q.
The bit-processing nodes have 3 gates: one adder and two subtractors. The check-processing nodes contain 6 gates: three XOR gates, one adder and two subtractors. We are assuming here that the functions f(x) and f^(-1)(x) are implemented using look-up tables. Hence they do not add to the gate count, but do add to area-related considerations.
Each bit node in our architecture is connected to both bit- and check-memory blocks (and similarly for the check nodes). Hence the number of wires is 4*(p^(2s) + p^s + 1)/q * (p^s + 1).
Turning to the memory layout, there are (p^(2s) + p^s + 1)/q bit- as well as check-memory blocks. The width of both kinds of memory block is q*(p^s + 1) data objects, and they are dual-port memories. For the first option of scheduling, there are also (q + 1) parallel registers in each bit- as well as check-processing unit. They add up to 2*(q + 1)*(p^(2s) + p^s + 1)/q memory registers. The memory bandwidth utilization depends on the type of scan that is going on. In the accumulation scan, one set of memory blocks (the opposite ones) is idle, and so are the connections to those blocks. In the output scan, the utilization is full, for both types of memory blocks are busy. In every iteration there are four scans, two of each type. Each memory type remains idle in one out of these four scans.
For each processing unit, either two or four out of the 2*(p^s + 1) connections get excited in a cycle. Hence the communication bandwidth utilization factor remains either 2/(p^s + 1) or 4/(p^s + 1).
The number of cycles for execution of the 'bit' phase is q*[(p^s + 1)/2 + (p^s + 1)/2] = q*(p^s + 1). The iteration time is thus 2*q*(p^s + 1). If N is the number of iterations to decode each frame of data, then the throughput is

(p^(2s) + p^s + 1) * f_s / (N * 2 * q * (p^s + 1))

where f_s is the system frequency. One can see that the throughput has decreased approximately by a factor of q, while area is saved by approximately the same factor.
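A sketch of this throughput estimate, evaluated at a hypothetical operating point (p = 2, s = 2, q = 3, N = 20 iterations, f_s = 100 MHz):

```python
# Throughput of the folded decoder: one iteration takes 2*q*(p^s + 1)
# cycles, and each decoded frame carries J = p^(2s) + p^s + 1 bits.

def throughput_bits_per_second(p: int, s: int, q: int, N: int, f_s: float) -> float:
    J = p**(2 * s) + p**s + 1
    iteration_cycles = 2 * q * (p**s + 1)
    return J * f_s / (N * iteration_cycles)

print(throughput_bits_per_second(2, 2, 3, 20, 100e6))   # -> 3500000.0
```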
Some lower bounds on parameters related to area and speed are introduced as follows. One parameter is the gate count, which can crudely represent one part of the area; the other part is the area taken by the interconnect wires. Generally, the complexity is expressed in the literature in terms of the code length J (or n). By analytical means, we have found that for large values of n, the lower bound on decoding time is O(√n), with correspondingly loose lower bounds holding for the area and for the throughput.
While considerable emphasis has been placed herein on the various components of the preferred embodiment and the interrelationships between the component parts of the preferred embodiment, it will be appreciated that many alterations can be made and that many modifications can be made in the preferred embodiment without departing from the principles of this invention. These and other changes in the preferred embodiment as well as other embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.
Dated this 31st day of January, 2007.