Abstract: The present invention refers to a method for MapReducing the processing of an Electronic Data Interchange (EDI) document (1), the method comprising the following steps: a. mapping the EDI document (1) into a plurality of intermediate documents (10, 11); b. processing the intermediate documents (10, 11) to produce a plurality of intermediate results (20-23); c. reducing the plurality of intermediate results (20-23) to produce a plurality of reduced intermediate results (30, 31); and d. reducing the reduced intermediate results (30, 31) to produce a final result (2) representing the processed EDI document (1).
Method and server cluster for MapReducing the processing of large documents
1. Technical Field
The present invention relates to a method, a server cluster and a computer pro-
gram for MapReducing the processing of large documents, for example Electronic
Data Interchange (EDI) documents.
2. The Prior Art
Modern software applications in an enterprise environment are typically struc-
tured into sub-programs each performing certain subtasks of the software applica-
tion. Typically, huge amounts of data have to be processed by such applications,
for example in the field of communication between applications of different en-
terprises, wherein large documents have to be sent and processed.
Such applications are often executed on integration servers, an example of which
is the webMethods Integration Server of applicant. The Integration Server sup-
ports a graphical programming model FLOW, which is used for defining the
processing logic of an Integration Server. FLOW allows for the graphical defini-
tion of a plurality of FLOW services as "black box" services as well as pipelines
between the FLOW services, which serve to pass data from outputs of one FLOW
service to inputs of another FLOW service. Since FLOW is a graphical programming language, it relieves the developer of writing complex and error-prone conventional code. FLOW services may be used for processing any kind of information and for performing various kinds of computations.
A common approach known from the prior art for processing large documents by
an Integration Server is to process the contents of the document in a sequential
manner. However, since the size of the documents may be in the range of Giga-
bytes, such a sequential processing is very time-consuming and processing-
intensive and may require special high-end hardware, whose maintenance is cost-
ly and complex.
Another approach known from the prior art is to employ a broker, which distributes the large documents to instances of an Integration Server in order to achieve
some parallel processing. However, this approach requires additional and often
complex messaging middleware for the communication between the broker and
the Integration Server instances, which typically imposes high network bandwidth
requirements and results in a high consumption of resources. Furthermore, this
approach typically involves processing multiple large documents by the broker
and the Integration Server instances, wherein each large document is still processed by a single Integration Server in a sequential manner.
Furthermore, in the field of processing large input sets of data, a programming
model and associated framework called MapReduce is known from the document
"MapReduce: Simplified Data Processing on Large Clusters" by J. Dean et al. of
Google, Inc. (OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, December, 2004). A user-defined map function
takes an input pair and produces a set of intermediate key/value pairs. The Ma-
pReduce library groups together all intermediate values associated with the same
intermediate key and passes them to a user-defined reduce function. The reduce
function accepts an intermediate key and a set of values. It merges together the
values to form a possibly smaller set of values. Typically zero or one output value
is produced per reduce invocation. The intermediate values are supplied to the
user's reduce function via an iterator. This allows for handling lists of values that
are too large to fit in memory. Programs written in this programming model may
be automatically executed in parallel on different machines by the framework.
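To make the programming model described above concrete, the following is a minimal, self-contained Java sketch of the canonical word-count illustration; all class and method names are chosen for this sketch only and are not taken from the MapReduce library:

import java.util.*;

// Minimal sketch of the MapReduce programming model: map() emits intermediate
// key/value pairs, the framework groups them by key, and reduce() merges the
// values for each key (word count as the usual illustration).
public class WordCountSketch {

    // User-defined map function: one input record -> intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // User-defined reduce function: all values of one key -> (usually) one value.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) sum += values.next();
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be", "to map and to reduce");

        // "Shuffle" phase: group all intermediate values by their key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce phase: one reduce invocation per intermediate key.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + reduce(e.getKey(), e.getValue().iterator()));
        }
    }
}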
However, applying the MapReduce programming model to an existing application requires an in-depth adaptation of the programming logic of the application
to conform to the MapReduce programming model. Furthermore, MapReduce is
intended for the field of search engines, where specialized tasks such as counting
words in huge collections of documents, building graph structures of web links
and the like are common.
One concrete example of the processing of large documents is Electronic Data
Interchange (EDI). EDI relates to the transmission of structured messages between applications by electronic means. EDI is typically used to transmit large
documents such as invoices or purchase orders between applications of different
enterprises. A number of standardized formats for the structured messages are
known in the art, e.g. ANSI X12, UCS, VICS, UN/EDIFACT, ODETTE and
EANCOM. Processing such large EDI documents typically involves the above-
mentioned disadvantages.
The technical problem underlying the present invention is therefore in one aspect to provide a method and a system for processing large documents, in particular EDI documents, with less processing time and computing effort, thereby at least partly overcoming the above explained disadvantages of the prior art. Another but related technical problem underlying the present invention is to provide a method and a system for processing the input of a FLOW service with less processing time and computing effort, which is furthermore flexibly adaptable to existing programming logic with minimal adaptation efforts.
3. Summary of the Invention
According to one aspect, the invention relates to a method for MapReducing the
processing of an Electronic Data Interchange (EDI) document. In the embodiment
of claim 1, the method comprises the steps of:
a. mapping the EDI document into a plurality of intermediate documents;
b. processing the intermediate documents to produce a plurality of intermediate
results;
c. reducing the plurality of intermediate results to produce a plurality of reduced
intermediate results; and
d. reducing the reduced intermediate results to produce a final result representing
the processed EDI document.
The first aspect of the present invention is based on the realisation that the concept of MapReducing can not only be used in the context of search engines, but also advantageously for the processing of EDI documents in an enterprise environment. Accordingly, a large EDI document is at first mapped, i.e. split, into multiple intermediate documents. The mapping, i.e. splitting, is preferably performed
such that each resulting intermediate document has an approximately equally
sized payload, i.e. so that it consumes a comparable amount of processing time
and / or processing resources, when being processed in the further steps of the
method.
The intermediate documents are then processed to produce a plurality of intermediate results, which is preferably performed in parallel to improve the processing
performance in terms of overall processing time. Furthermore, since the EDI
document is mapped to a plurality of, typically smaller, intermediate documents,
the intermediate documents may be processed by commodity hardware, i.e. there
is no need to employ specialized high-end hardware.
After the processing of the intermediate documents, the resulting intermediate
results are reduced to produce a plurality of reduced intermediate results. Reducing means collating the related intermediate results into one reduced intermediate
result. Related in this context means that two or more intermediate results stem
from the same original EDI document.
Finally, the reduced intermediate results are reduced in a further step to produce a
final result. This method step typically involves adequately combining the reduced intermediate results, in order to obtain the final result of the processing of
the EDI document.
The described two-step reducing is especially advantageous, if the reducing steps
are performed on different physical machines in order to achieve a parallelization.
In this case, since the intermediate results are already reduced before being sent to
another machine, which performs the second reducing step, valuable network
bandwidth can be saved, since less results have to be transmitted between the machines. Furthermore, this aspect is especially advantageous if the reducing steps
are commutative (i.e. A operation B is equivalent to B operation A) and / or associative (i.e. A operation (B operation C) is equivalent to (A operation B) operation
C). Consequently, the reducing steps may be performed in parallel in any order.
Another advantage associated with a two-step reducing is that the load, i.e. the processing time for performing the reduce step, may be shared between commodity machines, rather than one machine doing the reduce step over a large set. The second
reduce step may thus be performed over a smaller set of intermediate results.
In another aspect of the present invention, the EDI document (1) may be mapped
such that each of the intermediate documents (10, 11) comprises at least one of a
plurality of interchange envelopes, at least one of a plurality of functional group
envelopes and / or at least one of a plurality of transaction set envelopes of the
EDI document (1). Accordingly, the mapping, i.e. splitting, may be performed at
one of the boundaries defined by the structure of the EDI document, i.e. on transaction set envelope level, functional group envelope level and / or interchange envelope level. It typically depends upon the end user to define at what boundary the EDI document is to be split based on the structure of the EDI document, for example depending on whether the functional groups and / or interchange envelopes contain an optimal number of transactions to define a reasonably sized payload.
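Purely for illustration, the following Java sketch shows a split at interchange-envelope (ISA/IEA) boundaries; the segment terminator and the simplistic parsing are assumptions made for this sketch, since a real X12 parser would derive the delimiters from the fixed-length ISA segment itself:

import java.util.ArrayList;
import java.util.List;

// Simplified sketch: split an X12-style EDI document into intermediate documents,
// one per interchange envelope (everything from an ISA segment to its IEA segment).
// Assumes '~' as segment terminator purely for illustration.
public class EdiInterchangeSplitter {

    static List<String> splitAtInterchangeBoundaries(String ediDocument) {
        List<String> intermediateDocuments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String segment : ediDocument.split("~")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) continue;
            current.append(trimmed).append("~");
            if (trimmed.startsWith("IEA")) {          // end of one interchange envelope
                intermediateDocuments.add(current.toString());
                current.setLength(0);
            }
        }
        return intermediateDocuments;
    }

    public static void main(String[] args) {
        String edi = "ISA*00*...~GS*PO*...~ST*850*0001~SE*2*0001~GE*1*1~IEA*1*000000001~"
                   + "ISA*00*...~GS*IN*...~ST*810*0002~SE*2*0002~GE*1*2~IEA*1*000000002~";
        splitAtInterchangeBoundaries(edi).forEach(System.out::println);
    }
}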
Steps a. and d., i.e. the mapping and the final reducing, may be performed by a
master server of a server cluster and steps b. and c., i.e. processing and reducing
the intermediate documents or intermediate results, respectively, may be per-
formed by a plurality of slave servers of the server cluster. Each slave server may
process one or more intermediate documents and reduce one or more intermediate
results. Performing the processing by a plurality of slave servers is especially ad-
vantageous, since the processing of the intermediate documents can be highly
parallelized. The master and slave servers are preferably distinct physical ma-
chines communicating over a network connection.
For example, if the processing task is to add a list of numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, the master server may delegate the intermediate documents {1, 2} to a slave node 1, {3, 4} to a slave node 2, {5, 6} to a slave node 3, {7, 8} to the slave node 1, {9, 10} to the slave node 2 and {11, 12} to the slave node 3. At the slave node 1, the intermediate results would then be the sums of the intermediate documents: 3 corresponding to {1, 2} and 15 corresponding to {7, 8}. The reduce step on the slave node 1 would then add 3 and 15, resulting in 18. Accordingly, the reduce step on the slave node 2 would add 7 and 19 to yield 26 and the slave node 3 would add 11 and 23 into 34. Consequently, only three reduced intermedi-
ate results, 18, 26, 34, would have to be transferred back to the master server, in-
stead of transferring 3, 15, 7, 19, 11, 23. The final reduce step performed on the
master server would then yield 78 (18+26+34), which is the desired result.
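For illustration only, the following sequential Java sketch reproduces the two-step reduction of the numeric example above in a single process; in the actual invention the first reduce step runs on the slave servers and the final reduce step on the master server:

import java.util.List;
import java.util.Map;

// Sequential illustration of the two-step reduce: each "slave" reduces its own
// intermediate results first, so only one value per slave has to travel back to
// the master for the final reduce.
public class TwoStepReduceExample {
    public static void main(String[] args) {
        // Intermediate documents as delegated to the three slave nodes.
        Map<String, List<List<Integer>>> delegation = Map.of(
                "slave1", List.of(List.of(1, 2), List.of(7, 8)),
                "slave2", List.of(List.of(3, 4), List.of(9, 10)),
                "slave3", List.of(List.of(5, 6), List.of(11, 12)));

        int finalResult = 0;
        for (Map.Entry<String, List<List<Integer>>> slave : delegation.entrySet()) {
            // First reduce step, performed locally on each slave.
            int reducedIntermediateResult = 0;
            for (List<Integer> intermediateDocument : slave.getValue()) {
                int intermediateResult = intermediateDocument.stream().mapToInt(Integer::intValue).sum();
                reducedIntermediateResult += intermediateResult;
            }
            System.out.println(slave.getKey() + " returns " + reducedIntermediateResult);
            // Second reduce step, performed on the master over the collected values.
            finalResult += reducedIntermediateResult;
        }
        System.out.println("final result = " + finalResult); // 18 + 26 + 34 = 78
    }
}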
In another aspect, the method may further comprise the step of sending the inter-
mediate documents to the slave servers by an asynchronous invocation from the
master server. Accordingly, the master server takes the large EDI document and
delegates the processing of the intermediate documents to the slave servers. The
EDI document itself preferably stays with the master server. Asynchronous invo-
cation means that once the master server invokes, i.e. triggers, the processing of a
slave server, which is preferably performed by a thread pool of the master server,
the master server threads do not wait for the slave servers to finish their process-
ing (which would be a synchronous invocation), but that the master server may
immediately proceed with its own processing, i.e. subsequently invoking further
slave servers. This concept further increases the processing speed of the pre-
sent invention, since there are no master server resources which are blocked (i.e.
waiting for the slave servers), thus resulting in a faster delegation of tasks to the
slave servers.
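The following Java sketch illustrates the asynchronous delegation described above with a thread pool; the processOnSlave() method is a placeholder assumed for this sketch and stands in for the actual remote invocation of a slave server:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of asynchronous delegation: a thread pool on the master submits one task
// per intermediate document and returns immediately, so no master thread blocks
// while a slave is processing.
public class AsyncDelegationSketch {

    static int processOnSlave(String slave, List<Integer> intermediateDocument) {
        // Placeholder for the remote call; here we simply sum the numbers.
        return intermediateDocument.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService masterThreadPool = Executors.newFixedThreadPool(4);
        List<List<Integer>> intermediateDocuments =
                List.of(List.of(1, 2), List.of(3, 4), List.of(5, 6));

        // Asynchronous invocation: submit() does not wait for the slave to finish.
        List<Future<Integer>> pending = new ArrayList<>();
        for (int i = 0; i < intermediateDocuments.size(); i++) {
            final List<Integer> doc = intermediateDocuments.get(i);
            final String slave = "slave" + (i % 2 + 1);
            Callable<Integer> task = () -> processOnSlave(slave, doc);
            pending.add(masterThreadPool.submit(task));
        }

        // Only when all work has been delegated does the master collect the results.
        int total = 0;
        for (Future<Integer> f : pending) total += f.get();
        System.out.println("collected result = " + total);
        masterThreadPool.shutdown();
    }
}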
Alternatively, the EDI document may be stored in a distributed file system acces-
sible to the slave servers and the method may comprise the further step of send-
ing, by the master server, a reference to the intermediate documents to the slave
servers by an asynchronous invocation. If the slave servers are connected to the
distributed file system over direct connections, this aspect may speed up the proc-
essing considerably, since it is not necessary to send the EDI document or the in-
termediate documents over a slow network connection. In this case, only a reference to the EDI document and / or to the portions of the EDI document which are supposed to be processed by the slave (i.e. the intermediate documents) has to be passed to the slave nodes. Co-location (i.e. providing a reference to the portions which actually reside on the slave nodes themselves) is especially advantageous, since it greatly reduces bandwidth consumption: no EDI data transfer happens between the machines.
When processing the intermediate documents, the slave servers preferably store
the intermediate results locally, either in memory or in a persistent file system,
which are then collected by the master server.
Furthermore, each of the intermediate results may comprise an identifier relating
the respective intermediate result to the EDI document. Each of these intermediate
invocation results may be tracked to the original invocation by the use of an iden-
tifier. The identifier may e.g. be a counter which is increased with every original
invocation with a large EDI document. The identifier may be used to allow for
asynchronous behaviour when the master server calls the slave servers. This as-
pect may free the delegating threads at the master server (which in synchronous
mode would have waited for the slave servers to perform their processing), thus
leading to a better resource utilization at the master server and indirectly leading
to more parallelization. When the master server delegates the intermediate documents to the slave servers in an asynchronous manner, the slave servers thus have a
means to track their obtained intermediate results back to the original invocation
from the master server. For example, if there is a large EDI document to be proc-
essed, an identifier "12345" may be created for the invocation. The method may
pass this identifier to the slave servers, while delegating the intermediate docu-
ments to the slave servers. This helps in relating all the intermediate results to the
original EDI document in the subsequent reduce steps, as at the slave servers the
intermediate results may be maintained with this identifier.
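A small Java sketch of such an identifier is given below; the counter-based scheme and the class and method names are assumptions chosen for this sketch:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the invocation identifier: the master increments a counter per original
// EDI document and passes the identifier along with every delegated intermediate
// document, so that intermediate results on the slaves can later be related back
// to the original invocation.
public class InvocationTracker {
    private static final AtomicLong invocationCounter = new AtomicLong(12344);

    // On the master: create a new identifier for one large EDI document.
    static long newInvocationId() {
        return invocationCounter.incrementAndGet();   // e.g. 12345 for the next document
    }

    // On a slave: intermediate results keyed by the identifier they belong to.
    private final Map<Long, List<Object>> intermediateResults = new ConcurrentHashMap<>();

    void storeIntermediateResult(long invocationId, Object result) {
        intermediateResults.computeIfAbsent(invocationId, id -> new ArrayList<>()).add(result);
    }

    List<Object> resultsFor(long invocationId) {
        return intermediateResults.getOrDefault(invocationId, List.of());
    }
}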
Additionally or alternatively, a processing logic adapted for performing the proc-
essing of the slave servers in step b. may be distributed to the slave servers during
runtime. Accordingly, the slave servers do not have to have copies of the executa-
bles, i.e. the processing logic, which process the EDI document. The executables
may for example be comprised in a library of the master server and spread to the
slave servers at runtime. The spreading is preferably performed depending on the
executables needed for the current EDI document. Any peer to peer framework or
proprietary mechanism may be used to share the executables.
Furthermore, the present invention relates to a server cluster comprising a master
server and a plurality of slave servers adapted for performing any of the methods
presented above.
In yet another aspect of the present invention, a method for MapReducing the
processing of at least one input of a FLOW service is provided. In the embodiment of claim 9, the method comprises the steps of:
a. mapping the at least one input of the FLOW service into a plurality of
intermediate inputs by a mapper service;
b. executing a plurality of instances of the FLOW service, the instances of
the FLOW service processing the intermediate inputs to produce a plu-
rality of intermediate results;
c. reducing the intermediate results into a plurality of reduced intermediate
results by a plurality of first reducer services; and
d. reducing the reduced intermediate results to produce a final output of
the FLOW service from the reduced intermediate results by a second
reducer service.
Accordingly, a FLOW service, either a newly created or an existing FLOW ser-
vice, does not process its inputs in a sequential manner, but the processing of the
FLOW service is effectively "parallelized" by the above method. Therefore, the
inputs of the FLOW service are not directly fed into the FLOW service as com-
monly performed, but are first split by a mapper service into a plurality of inter-
mediate inputs. In an embodiment the FLOW service itself is "cloned", i.e. the
intermediate inputs are processed by a plurality of instances of the FLOW service,
preferably in parallel. The resulting intermediate results are then reduced by a
plurality of first reducer services in order to obtain one reduced intermediate result
for each instance of the FLOW service. Finally, the reduced intermediate results
are reduced by a second reducer service in order to provide the final output of the
FLOW service. Preferably, the second reducer service is based on the same implementation as the first plurality of reducer services, i.e. all reducing steps are
performed by instances of the same reducer service. In the following, the terms
"reducer service" and "instance of the reducer service" are used synonymously for
the sake of clarity. It is to be noted that the overall input and output of the FLOW
service stays the same, only the processing is parallelized.
In one aspect, the mapper service and the second reducer service are executed on a master server of a server cluster, while the plurality of instances of the FLOW service and the plurality of first reducer services are executed on a plurality of slave servers of the server cluster.
In another aspect, an input signature of the mapper service conforms to an input
signature of the FLOW service. Additionally or alternatively, an output signature
of the reducer service conforms to an output signature of the FLOW service. An
input signature (or an output signature) preferably defines the number and type of
arguments provided as input (or as output) of a service, hence defining the inter-
face of the service.
Due to the fact that the input signature of the mapper service preferably conforms
to the input signature of the FLOW service to be parallelized, any existing FLOW
service may be connected to a mapper service with the same input signature. Fur-
thermore, any existing FLOW service may be connected to a reducer service with
a conforming output signature, which means that any existing FLOW service may
be embedded in the present method without the need to adapt its input or output
signature or internal processing logic. This is especially advantageous, since it
highly increases the flexibility and applicability of the present invention. An ex-
ample of input and output signature is presented in the detailed description below.
In yet another aspect, at least one input of the FLOW service may comprise an
Electronic Data Interchange (EDI) document. Hence, the FLOW service is in this
aspect preferably adapted for processing the EDI document. When the FLOW
service is parallelized, an especially efficient processing of the EDI document
may be achieved similar to the aspects presented further above. However, it
should be appreciated that FLOW services are not at all restricted to processing
EDI documents. On the contrary, FLOW services are suitable for processing any
kind of documents, such as XML documents for example. Furthermore, not only
documents may be processed by FLOW services, but any kind of computation
logic may be implemented.
The present invention also relates to a server cluster comprising a master server
and a plurality of slave servers adapted for performing any of the above presented
methods.
Lastly, a computer program is provided comprising instructions adapted for im-
plementing any of the above described methods.
4. Short Description of the Drawings
In the following detailed description, presently preferred embodiments of the in-
vention are further described with reference to the following figures:
Fig. 1: A schematic overview of an embodiment of the present invention;
Fig. 2: A schematic overview of a master server and a plurality of slave
servers according to an embodiment of the present invention;
Fig. 3: An overview of the structure of an EDI document;
Fig. 4: An exemplary FLOW service and its related inputs and outputs;
Fig. 5: Another exemplary FLOW service for processing an EDI document;
Fig. 6: An overview of a graphical user interface for specifying the proper-
ties of a MapReduced FLOW service; and
Fig. 7: A class diagram of an exemplary implementation of a method of the
present invention.
5. Detailed Description
In the following, a presently preferred embodiment of the invention is described
with respect to the processing of a large EDI document by a server cluster accord-
ing to the present invention. A server cluster, also referred to as a grid and schematically shown in Fig. 2, is a distributed computing platform which allows for
parallel processing. It is typically composed of a cluster of networked, loosely
coupled computers acting in concert to perform very large computing or data in-
tensive tasks. It should be appreciated that processing an EDI document is only
one of a wide variety of scenarios for the present invention and that any other
types of documents may be processed. Furthermore, not only document process-
ing may be advantageously achieved by the present invention, but any kind of
complex computations, as will be demonstrated in further exemplary embodi-
ments below.
The general structure of an EDI document is schematically depicted in Fig. 3,
which shows the structure as defined for example by the ANSI ASC X12 stan-
dard. Accordingly, an EDI document comprises any number of transactions,
which are grouped by various envelopes. On the innermost level, a transaction set
is identified by the ST/SE segments shown in Fig. 3. The ST segment preferably
comprises a transaction set ID, a control number and an optional implementation
convention reference. The SE segment preferably comprises the number of in-
cluded segments in the transaction set and the same control number as the ST
segment. The second level of enveloping is the functional group envelope. Its
purpose is to group similar types of transaction sets within a transmission. ANSI
ASC X12 defines a number of business processes for grouping similar transaction
sets, like Planning Schedule (830), Purchase Order (850), Purchase Order Ac-
knowledgment (855), Purchase Order Change (865), Order Status Inquiry (869) or
Order Status Report (870).
The outermost level is the interchange envelope that is defined by ISA and IEA
segments (see Fig. 3). An Interchange envelope preferably encloses the data from
one sender to one receiver. The ISA segment is preferably a fixed length segment.
Some items contained in the ISA/IEA segments are structured mailbox addresses
of the sender and receiver, interchange control numbers, counts of the functional
groups within the interchange envelope, time/date stamps and the version of the
interchange envelope.
Typical ways to process such an EDI document might be to map the data of the
EDI document to another format (e.g., the format that a back-end system requires)
or to map data from the EDI document to the inputs of a FLOW service, as further
outlined below.
Traditional EDI processing typically processes one transaction at a time. If the
EDI document size is in the order of hundreds of megabytes or gigabytes, this
processing is very time consuming. To somewhat alleviate this disadvantage, typically a cluster of high-end servers is deployed to process each of a plurality of
EDI documents in parallel. The employment of high-end servers, however, has
severe disadvantages, e.g. an increased complexity if hardware/software fails dur-
ing the processing and an increased cost of ownership for maintaining the high-
end servers.
The present invention defines a method and server cluster for parallelizing the processing on the EDI document level. As can be seen in Fig. 1, a master server M1 at first receives a large EDI document 1. The EDI document is then mapped, i.e. split, on the interchange envelope boundaries into a plurality of in-
termediate documents 10, 11. However, it should be appreciated that an EDI doc-
ument may in further embodiments of the present application as well be split for
example at the functional group envelope level or even at the transaction set enve-
lope level, depending on the type of EDI document.
Even more fine-grained approaches are possible with the present invention, for
example splitting the EDI document at the single transaction level, if the transac-
tions in the EDI document are independent entities. As a result the document
could be mapped (chunked) into very small portions leading to a high level of
parallelization.
After splitting the EDI document, the master server M1 delegates the intermediate documents 10, 11 to a plurality of slave servers S1, S2 for processing. The slave
servers S1, S2 process the intermediate documents 10, 11 and produce intermedi-
ate results 20-23. It should be noted that each processing of one intermediate
document may result in multiple intermediate results, as further explained below.
The intermediate results 20-23 are then reduced by each of the slave servers S1,
S2 in order to preferably obtain one reduced intermediate result 30, 31 per slave
server S1, S2.
When the master server M1 has finished delegating the intermediate documents to the slave servers S1, S2, it preferably issues reduce calls on each of the slave servers S1, S2. The delegation is preferably invoked in an asynchronous manner, so that the master server M1, i.e. its threads, may proceed with its processing and
does not have to wait for each slave server S1, S2 to finish execution, as already
explained above.
The reduce calls issued by the master server M1 trigger the slave servers S1, S2 to send their respective reduced intermediate results 30, 31 back to the master server M1. The master server M1 then issues another reduce call for collating the col-
lected reduced intermediate results 30, 31 into one final output 2. The output 2
then represents the processed EDI document 1.
It is to be noted that, since the slave servers S1, S2 each process only a portion of
the overall EDI document 1, there is no need for specialized high-end hardware.
Commodity machines may be used as slave servers, which greatly reduces the
cost of the overall architecture.
The processing of the master and slave servers is preferably performed by a num-
ber of services. Particularly preferred is an embodiment where the servers are
webMethods Integration Servers. The webMethods Integration Server is at the core of the webMethods portfolio of products of the applicant. It is a Java-based, multi-platform enterprise integration engine supporting the execution of services to per-
form integration logic such as data mapping and communication with other sys-
tems. The Integration Server provides a graphical programming model FLOW
that is used for performing common integration tasks such as mapping, invoking
other services, looping and branching. Some of the Integration Server features
include writing graphical FLOW and Java services, defining and modifying doc-
uments and mapping logic, testing, debugging and executing services, creating
and configuring web services and editing adapter services and notifications.
Fig. 4 depicts an exemplary simple FLOW service "sampleFlowService" which
takes two integers "input1" and "input2" and provides two outputs "multiplyInts-
Result" (the result of a multiplication of the two input integers) and "addmtsRe-
sult" (the result of an addition of the two input integers). When executing the ex-
emplary FLOW service on the Integration Server, the user may be provided with a
dialog to enter values for the inputs and another dialog may be presented which
comprises the computation results. Fig. 4 shows a graphical user interface pref-
erably used by the developer for specifying the mapping between the inputs, the
FLOW service and the outputs.
Another example of FLOW service processing is to count the occurrences of
words in a file. A common approach without parallelization would be to read the
file line by line, to add the word as a key in a HashMap and the count as a value in
the HashMap. First the HashMap is queried for the key and if the query returns
"null", the count is put as 1. Otherwise the original count is retrieved and it will be
incremented and put back in to the HashMap. When the mapper Ml 0 and reducer
services R10, R11, R20 are written, the mapper service may produce smaller files
as the output and the reducer services only combine the output HashMap with a
final HashMap. Accordingly, the input/output signature of the original FLOW
service which does the word count remains the same and only the logic of the
mapper and the reducer operation have to be written. This is an especially advan-
tageous aspect of the present invention, as further explained below.
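The word-count approach described above can be sketched in plain Java as follows; countWords() stands in for the original FLOW service, whose signature stays unchanged, while reduceInto() stands in for the added reducer that merges the per-chunk HashMaps into a final HashMap (all names are chosen for this sketch):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the word-count example: the "FLOW service" counts words
// in one chunk; only the reducer, which merges the per-chunk HashMaps into one
// final HashMap, is added for parallelization.
public class WordCountFlowSketch {

    // Stands in for the original FLOW service: count word occurrences in one chunk.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                Integer old = counts.get(word);          // query the HashMap for the key
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Reducer: combine one intermediate HashMap into the final HashMap.
    static void reduceInto(Map<String, Integer> finalCounts, Map<String, Integer> intermediate) {
        intermediate.forEach((word, count) -> finalCounts.merge(word, count, Integer::sum));
    }

    public static void main(String[] args) {
        // Two "smaller files" as produced by the mapper service.
        List<List<String>> chunks = List.of(
                List.of("to be or not to be"),
                List.of("to map and to reduce"));

        Map<String, Integer> finalCounts = new HashMap<>();
        for (List<String> chunk : chunks) reduceInto(finalCounts, countWords(chunk));
        System.out.println(finalCounts);
    }
}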
Yet another example of a FLOW service is the processing of an EDI document.
Fig. 5 shows an exemplary FLOW service "mainService", which takes an EDI file
name as input. It converts the EDI file format to an internal webMethods format
by calling the service "ediToValues" also depicted in Fig. 5. As an output, it re-
turns if the input EDI file is valid as a whole after the conversion. It may further
indicate the consumed processing time for execution of the service (not shown in
Fig. 5). The input/output signature of the FLOW service "ediToValues" is struc-
tured as follows: It accepts either an input "ediFilcName" (the file name of the
EDI document) or an input "edidata" (the actual EDI data itself represented as
string) which are mutually exclusive. If a "printTime" input is set, the time taken
to execute the service will be printed out. A single output "isValid" will be output
indicating if the EDI document is valid or not after the conversion.
Since processing the above described FLOW service "ediToValues" sequentially
consumes a great amount of processing time and resources, it is demonstrated in
the following how the method of the present invention is applied onto this existing
FLOW service in order to efficiently and flexibly "parallelize" it.
Referring to Fig. 6, in the properties panel of the FLOW service "ediToValues",
the developer may provide the following properties:
• Mapper service: a valid Integration Server service for mapping the input
data
• Reducer service: a valid Integration Server service for reducing the output
data
• Grid enabled: set to "true"
• Throttle: the maximum number of parallel executions including the master
and the slave servers
• Policy: this property specifies whether to hold the intermediate results of
the slave servers in memory (if they are of negligible size), or persist them
in a file
The present invention then uses the above-specified mapper service M10 (see Fig. 1) to perform the mapping of the EDI document 1 into the plurality of intermediate documents 10, 11. The input and output signature of the mapper service M10 preferably follows certain rules (a sketch follows the list below):
• The input of the mapper service is preferably the same as the FLOW ser-
vice being "parallelized" ("ediToValues" in the example). In this case, the
mapper service accepts an input "ediFileName" which matches with the
input of the service "ediToValues".
• The output of the mapper service is preferably wrapped in an Integration
Server document with name "servicelnputData". The content of "ser-
vicelnputData" is preferably the input of the "parallelized" FLOW service.
In the example, the output "edidata" of the mapper service matches with
the input of the service "ediToValues".
• Furthermore, the output of the mapper service preferably provides a boo-
lean "isLastSplit". The mapper service sets this value to "true" when it
processes the last mapping step. The mapper service may then be repeat-
edly called until this value is set to "true".
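The following plain-Java sketch illustrates a mapper service following these rules; a Map<String, Object> stands in for the Integration Server pipeline (IData), and the pre-split chunks and all class names are assumptions made for this sketch:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a mapper service following the signature rules above. The file-reading
// and chunking logic is purely illustrative; a real mapper would derive the chunks
// from the file named by "ediFileName".
public class EdiMapperServiceSketch {
    private final List<String> chunks;   // pre-split interchange envelopes
    private int nextChunk = 0;

    EdiMapperServiceSketch(List<String> chunks) { this.chunks = chunks; }

    // Input: "ediFileName" (same as the FLOW service "ediToValues").
    // Output: "serviceInputData" wrapping the input of the parallelized service
    //         ("edidata") plus the boolean "isLastSplit".
    Map<String, Object> map(Map<String, Object> pipeline) {
        String ediFileName = (String) pipeline.get("ediFileName"); // matches the FLOW service input

        Map<String, Object> serviceInputData = new HashMap<>();
        serviceInputData.put("edidata", chunks.get(nextChunk));    // one intermediate document
        nextChunk++;

        Map<String, Object> output = new HashMap<>();
        output.put("serviceInputData", serviceInputData);
        output.put("isLastSplit", nextChunk == chunks.size());     // true on the last mapping step
        return output;
    }
}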
The input and output signature of the reducer service R10, R11, R20 preferably
also follows certain rules (a sketch follows the list below):
• The input of the reducer service is wrapped in an Integration Server docu-
ment list called "reducelnputData". The document list is preferably an ar-
ray of documents. The content of each entry in the document list may be
the output of the FLOW service to be "parallelized". In the example, an
input "isValid" of the reducer service matches with the output of the ser-
vice "ediToValues".
• The input of the reducer service may further provide a boolean "isLastRe-
duceStep". This value is set to true if the reducer processes the last reduce
call. This can be used to perform cleanup activities in the reducer service.
• The output of the reduce service should be the output of the service that is
to be "parallelized". In the example, the output "isValid" matches with the
output of the service "ediToValues".
As can be seen, the input and output signatures of the mapper and reducer services
defined above conform to the input and output signature of the FLOW service.
This has the advantage that any existing FLOW service may be easily "parallel-
ized", since neither the signature nor the internal processing of the FLOW service
itself have to be adapted. Instead, the mapper and reducer services are simply
"plugged in" before and after the FLOW service, respectively.
The above presented approach may be especially advantageously applied, if the
reduce operations are associative and commutative. For example, when calculating the number of prime numbers in a range of 1 to 100, two input splits may be
employed; the first input split being 1 to 50 and the second input split being 51 to
100. The intermediate outputs in this example would be "x" and "y" representing
the number of prime numbers in both splits, respectively. The reduce operation
would do the addition, which is associative and commutative.
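This prime-count example can be sketched in a few lines of Java; the split into two ranges and the addition as reduce operation follow the description above, while the helper names are chosen for this sketch:

import java.util.stream.IntStream;

// Sketch of the prime-count example: two input splits are processed independently
// and the reduce operation is a simple addition, which is both associative and
// commutative, so the reduce steps may run in any order.
public class PrimeCountExample {

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) if (n % d == 0) return false;
        return true;
    }

    // Intermediate result for one split: number of primes in [from, to].
    static long countPrimes(int from, int to) {
        return IntStream.rangeClosed(from, to).filter(PrimeCountExample::isPrime).count();
    }

    public static void main(String[] args) {
        long x = countPrimes(1, 50);    // intermediate output of the first split
        long y = countPrimes(51, 100);  // intermediate output of the second split
        System.out.println("primes in 1..100 = " + (x + y)); // reduce step: addition
    }
}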
The above-presented signature conformance is one of the advantages of the pre-
sent invention over the conventional MapReduce algorithm known from the prior
art. While the conventional MapReduce map step written by the user takes an in-
put pair and produces a set of intermediate key/value pairs, the mapper service on
the Integration Server according to the present invention follows a standard signa-
ture and only "chunks", i.e. splits, the input data. Furthermore the conventional
MapReduce map step is typically run on slaves that take an input pair and produce
a set of intermediate key/value pairs, wherein the mapper service on the Integra-
tion Server preferably executes on the master server M1, which then delegates the
chunked input data to the slave servers S1, S2 for executing the actual services.
This is especially flexible and results in an easy development and maintenance of FLOW services for a number of reasons: in the conventional MapReduce algorithm, there is no service which is "parallelized"; rather, it is a programming construct defined through mappers and reducers, which perform the desired operation. Unlike on the Integration Server, there is thus no service which corresponds to the desired operation. This makes the claimed method understandable and especially user-friendly.
Concerning the conventional MapReduce reduce step written by the user, it takes an intermediate key and a set of values for the key and merges the values to form
a possibly smaller set of values. On the contrary, when the reducer service is exe-
cuted on the Integration Server according to the present invention, the master server M1 preferably issues a reduce call to all the slave servers S1, S2 to collate the related intermediate results. When the master server M1 gets back the results from the slave servers S1, S2 after the reduce operation in each of the slave servers S1, S2, it internally combines the results into one final result on the master server M1. This essentially makes the reduce operation a two-step process performed first on the slave servers S1, S2 and then on the master server M1, which
saves network bandwidth and thus leads to a further decreased processing time
and better utilization of resources, as already explained above.
Further features of the server cluster of the present invention are possible. The
master Integration Server may for example maintain a configuration file which
comprises a list of available slave servers. It may comprise the required informa-
tion needed by the master to delegate the processing to slave servers. This simple
facility can be easily extended to achieve a dynamic identification of the slave
nodes. For example, when a slave server starts up, it may broadcast its identifica-
tion to all the machines in the server cluster and the master server can identify the
slave server as a potential slave.
In the following, an exemplary Java implementation of an embodiment of the pre-
sent invention is presented, the main components of which are depicted in Fig. 7.
However, it should be appreciated that the present invention is neither restricted to
the programming language Java nor to the concrete implementation shown in the
following.
The class JobClient shown in Fig. 7 serves for defining a "job", which represents
one execution of a processing of data according to the present invention. An ex-
emplary implementation of JobClient is shown in the following code listing:
As can be seen, when a new instance of JobClient is created by invoking its con-
structor (cf. p. 21, line 1), it takes as input the parameters mapper (the mapper
implementation to be used for the current job), reducer (the reducer implementa-
tion to be used), throttle (the number of desired parallel service executions) and
isPersistMapIntResult (whether the intermediate results should be stored in the
memory of the slaves or in a persistent file system). When invoking the submit-
Job()-method (cf. p. 21, line 30), this method takes a pipeline parameter of type
IData, which preferably comprises the input data to be processed, e.g. the data of
the EDI document. submitJob() then creates a new JobInProgress instance and
invokes its executeAndTrackJob()-method.
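The original code listing of JobClient is not reproduced in this text. Based on the description above, a hypothetical sketch could look as follows; Mapper, Reducer and Pipeline are placeholder types standing in for the actual classes (e.g. IData), and all details beyond the members mentioned in the text are assumptions:

// Hypothetical sketch of JobClient, consistent with the description above.
interface Mapper { /* splits the input into intermediate inputs */ }
interface Reducer { /* collates intermediate results */ }
interface Pipeline { /* stands in for the IData pipeline */ }

class JobClient {
    private final Mapper mapper;                    // mapper implementation for this job
    private final Reducer reducer;                  // reducer implementation for this job
    private final int throttle;                     // number of desired parallel service executions
    private final boolean isPersistMapIntResult;    // keep intermediate results in memory or on disk

    JobClient(Mapper mapper, Reducer reducer, int throttle, boolean isPersistMapIntResult) {
        this.mapper = mapper;
        this.reducer = reducer;
        this.throttle = throttle;
        this.isPersistMapIntResult = isPersistMapIntResult;
    }

    // One execution of a processing of data, e.g. of one large EDI document.
    void submitJob(Pipeline pipeline) {
        JobInProgress job = new JobInProgress(mapper, reducer, throttle, isPersistMapIntResult, pipeline);
        job.executeAndTrackJob();
    }
}

// Minimal stub so the sketch is self-contained; the real class is described below.
class JobInProgress {
    JobInProgress(Mapper m, Reducer r, int throttle, boolean persist, Pipeline p) { }
    void executeAndTrackJob() { /* split, execute, reduce */ }
}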
An exemplary implementation of JobInProgress is shown in the following code
listing:
As can be seen, JobInProgress's run()-method in this example comprises the
main code for processing the input file, i.e. the steps of splitting (cf. p. 24, line
29), executing the "parallelized" flow services (cf. p. 25, line 8) and reducing (cf.
p. 26, line 24).
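Again, the original listing is not reproduced here; a hypothetical sketch of JobInProgress, consistent with the description above, is given below (all method names except run() and executeAndTrackJob() are assumptions):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run() splits the input, executes the "parallelized" FLOW
// service instances on the intermediate inputs and finally reduces the results.
class JobInProgressSketch implements Runnable {

    void executeAndTrackJob() {
        run();                                    // could also be handed to a worker thread
    }

    @Override
    public void run() {
        List<String> intermediateInputs = split();               // mapping step (mapper service)
        List<Object> intermediateResults = new ArrayList<>();
        for (String input : intermediateInputs) {
            intermediateResults.add(executeFlowService(input));  // delegated to slave servers
        }
        Object finalResult = reduce(intermediateResults);        // two-step reduce in the real setup
        System.out.println("final result: " + finalResult);
    }

    private List<String> split() { return List.of("chunk-1", "chunk-2"); }
    private Object executeFlowService(String input) { return input.length(); }
    private Object reduce(List<Object> results) { return results; }
}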
An exemplary implementation of MapTask, which performs the mapping, is
shown in the following code listing:
As can be seen, when a MapTask is executed, i.e. when its run()-method is in-
voked, MapTask invokes the remoteInvoke()-method (p. 30, line 21) of the RPC
class, which takes three input parameters: hostSelector.getHostInfoEntry(), ser-
viceName and taskInput. taskInput is an attribute inherited from the superclass
AbstractTask and preferably comprises the input to be processed, e.g. the data of
the EDI document.
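A hypothetical sketch of MapTask, consistent with the description above, is given below; the HostSelector, HostInfoEntry and RPC types are stubs, and only the call RPC.remoteInvoke(hostSelector.getHostInfoEntry(), serviceName, taskInput) and the inherited taskInput attribute are taken from the text:

// Hypothetical sketch of MapTask and the stubs it relies on.
class HostInfoEntry { String host = "slave1"; int port = 5555; }

class HostSelector {
    // e.g. round-robin over the slave servers listed in the configuration file
    HostInfoEntry getHostInfoEntry() { return new HostInfoEntry(); }
}

class RPC {
    static Object remoteInvoke(HostInfoEntry host, String serviceName, Object taskInput) {
        // Placeholder for the actual remote invocation of the named service on a slave.
        return "intermediate result from " + host.host;
    }
}

class MapTaskSketch implements Runnable {
    private final HostSelector hostSelector = new HostSelector();
    private final String serviceName = "ediToValues";   // the service being "parallelized"
    private final Object taskInput;                      // inherited from AbstractTask in the real code

    MapTaskSketch(Object taskInput) { this.taskInput = taskInput; }

    @Override
    public void run() {
        // Delegate the chunked input to a slave server for processing.
        Object result = RPC.remoteInvoke(hostSelector.getHostInfoEntry(), serviceName, taskInput);
        System.out.println(result);
    }
}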
An exemplary implementation of RPC and its remoteInvoke()-method is shown in
the following code listing:
Both MapTask and ReduceTask have the abstract class AbstractTask as super-
class, i.e. they inherit its attributes and set- and -get-methods, which are shown in
the following exemplary code listing of AbstractTask:
As can be seen, AbstractTask itself implements the interface Task, an exemplary
implementation of which is shown in the following code listing:
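The original listings of AbstractTask and Task are not reproduced in this text; the following hypothetical sketch is consistent with the description above, and everything beyond the taskInput attribute, its accessors and the Task interface itself is an assumption:

// Hypothetical sketch of the Task interface and the AbstractTask superclass.
interface Task {
    void run();                          // executed when the task is scheduled (assumption)
}

abstract class AbstractTask implements Task {
    protected Object taskInput;          // the data to be processed, e.g. an intermediate EDI document
    protected String serviceName;        // the service to be executed for this task (assumption)

    public void setTaskInput(Object taskInput) { this.taskInput = taskInput; }
    public Object getTaskInput() { return taskInput; }

    public void setServiceName(String serviceName) { this.serviceName = serviceName; }
    public String getServiceName() { return serviceName; }

    // run() is left to the concrete subclasses MapTask and ReduceTask.
}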
A number of further infrastructure classes and interfaces are required in the exem-
plary implementation of the present invention, which are shown in the following
code listings:
We Claim
1. A method for MapReducing the processing of an Electronic Data Inter-
change (EDI) document (1), the method comprising the following steps:
a. mapping the EDI document (1) into a plurality of intermediate docu-
ments (10, 11);
b. processing the intermediate documents (10, 11) to produce a plurality of
intermediate results (20-23);
c. reducing the plurality of intermediate results (20-23) to produce a plu-
rality of reduced intermediate results (30, 31); and
d. reducing the reduced intermediate results (30, 31) to produce a final re-
sult (2) representing the processed EDI document (1).
2. The method of claim 1, wherein the EDI document (1) is mapped such that
each of the intermediate documents (10, 11) comprises at least one of a plu-
rality of interchange envelopes, at least one of a plurality of functional
group envelopes and / or at least one of a plurality of transaction set enve-
lopes of the EDI document (1).
3. The method of claim 1 or 2, wherein steps a. and d. are performed by a
master server (M1) of a server cluster and wherein steps b. and c. are per-
formed by a plurality of slave servers (S1, S2) of the server cluster, each
slave server (S1, S2) processing one or more intermediate documents (10,
11) and reducing one or more intermediate results (20-23).
4. The method of claim 3, further comprising the step of sending the interme-
diate documents (10, 11) to the slave servers (S1, S2) from the master
server (M1) by an asynchronous invocation.
5. The method of claim 3, wherein the EDI document (1) is stored in a distrib-
uted file system accessible to the slave servers (S1, S2) and wherein the
method further comprises the step of sending a reference to the intermediate
documents (10, 11) to the slave servers (S1, S2) from the master server
(M1) by an asynchronous invocation.
6. The method of any of the preceding claims, wherein each of the intermedi-
ate results (20-23) comprises an identifier relating the respective intermedi-
ate result (20-23) to the EDI document (1).
7. The method of any of the preceding claims, wherein a processing logic for
performing the processing of the slave servers (S1, S2) in step b. is distrib-
uted to the slave servers (S1, S2) during runtime.
8. A server cluster comprising a master server (M1) and a plurality of slave
servers (S1, S2) adapted for performing a method of any of the claims 1 - 7.
9. A method for MapReducing the processing of at least one input (1) of a
FLOW service, the method comprising the steps of:
a. mapping the at least one input (1) of the FLOW service into a plurality
of intermediate inputs (10, 11) by a mapper service (M10);
b. executing a plurality of instances (F10, F10') of the FLOW service, the
instances (F10, F10') of the FLOW service processing the intermediate
inputs (10, 11) to produce a plurality of intermediate results (20-23);
c. reducing the intermediate results (20-23) into a plurality of reduced in-
termediate results (30, 31) by a plurality of first reducer services (R10,
R11); and
d. reducing the reduced intermediate results (30, 31) to produce a final
output (2) of the FLOW service from the reduced intermediate results
(30, 31) by a second reducer service (R20).
10. The method of claim 9, wherein the mapper service (M10) and the second
reducer service (R20) are executed on a master server (M1) of a server clus-
ter and wherein the plurality of instances (F10, F10') of the FLOW service
and the plurality of first reducer services (R10, R11) are executed on a plu-
rality of slave servers (S1, S2) of the server cluster.
11. The method of claim 9 or 10, wherein an input signature of the mapper ser-
vice (M10) conforms to an input signature of the FLOW service.
12. The method of any of claims 9 - 11, wherein an output signature of the reducer ser-
vices (R10, R11, R20) conforms to an output signature of the FLOW ser-
vice.
13. The method of any of claims 9 - 12, wherein at least one input of the FLOW ser-
vice comprises an Electronic Data Interchange (EDI) document (1).
14. A server cluster comprising a master server (M1) and a plurality of slave
servers (S1, S2) adapted for performing a method of any of the claims 9-13.
15. A computer program comprising instructions adapted for implementing a
method of any of the preceding claims 1-7 or 9-13.