Abstract: The new paradigm on the cloud platform is the computing model for large data-intensive applications. The cloud is maintained using distributed computing frameworks capable of handling and processing large amounts of data. Of all the cloud frameworks available, Hadoop MapReduce (HMR) is the most widely adopted owing to its ease of deployment, scalability and open-source nature. However, the Hadoop MapReduce platform suffers from a number of drawbacks. The preconfigured memory allocator for Hadoop jobs leads to issues of buffer concurrency amongst jobs and heavy disk read seeks. These memory allocator issues increase the makespan time and induce high input/output (I/O) overheads. The jobs scheduled on Hadoop cloud environments do not consider parameters such as memory requirements and multi-core environments for linear scalability, affecting performance. In Hadoop, the reduce tasks are started only after completion of all map tasks. Hadoop assumes homogeneous map execution times on the premise of homogeneously distributed data, which does not hold in practice. The assumed homogeneous map execution times and the serial execution strategy leave idle those map workers (and their memory resources) that have completed their tasks and are waiting for the other map workers to finish. In cloud environments, where organizations/users are charged according to the (storage, compute and communication) resources utilized, these issues inflate costs in addition to degrading performance. Hadoop platforms do not support flexible pricing. Scalability is an issue owing to the cluster-based nature of Hadoop platforms. Processing of streaming data is also an issue with Hadoop. To overcome these drawbacks, the Memory Optimized Hadoop MapReduce framework (MOHMR) is presented. The MOHMR adopts a parallel execution strategy under a multi-core environment. In conventional HMR the memory is freed only when all map tasks are completed; in MOHMR the memory is freed as soon as two or more map workers complete their tasks. The adoption of such an execution strategy reduces unutilized map worker resources. To evaluate the performance of MOHMR, various Hadoop-based MapReduce frameworks are considered. The performance of MOHMR is evaluated in terms of execution time and memory utilization. The overall results show that MOHMR attains significant performance improvement over the existing Hadoop-based HMR framework for diverse applications such as bioinformatics and stream and non-stream applications. Further, this work presents a makespan model to describe the operations of the parallel HMR (PHMR) under the HDInsight cloud computing environment. The PHMR makespan model enhances resource utilization by adopting parallel execution of map and reduce operations, considering the multi-core environments available with virtual computing workers. Experiments are conducted on the Microsoft Azure HDInsight cloud platform with stream, non-stream and bioinformatics applications to evaluate the performance of the PHMR framework over the existing computing model. The outcome shows significant performance improvements in terms of execution time and computation cost. A good correlation is seen between practical and theoretical execution, showing that the proposed memory- and I/O-optimized PHMR framework is robust, scalable, cost-efficient and supports dynamic analysis on the HDInsight cloud computing environment.
Claims:
We claim:
1. A novel Memory Optimized scheduler for the Hadoop MapReduce Framework (MOHMR) which consists of three architectural layers.
2. The first layer of the said framework describes the system model; then the sequential Hadoop MapReduce makespan model is described; last, the parallel Hadoop MapReduce makespan model is presented.
3. The proposed system is evaluated in terms of makespan time and memory utilization.
4. The proposed I/O optimization minimizes the number of reads and writes.
5. The proposed Parallel Hadoop MapReduce (PHMR) framework is evaluated in terms of makespan time and memory utilization.
6. The proposed system has low complexity and low maintenance cost.

Description: Complete Specification
BACKGROUND OF THE INVENTION
MapReduce is the preferred computing framework used in large data analysis and processing applications. Hadoop is a widely used MapReduce framework across different communities owing to its open-source nature. However, a critical challenge for cloud service providers is meeting the user task Service Level Agreement (SLA) requirement (task deadline).
There are some inventions proposed in the existing art which may resolve the above said issue.
For example, a Chinese patent application CN201410673548 describes a MapReduce computation process optimization method. The process optimization method disperses the output time, decreases the transient network traffic flow, reduces the occupancy rate of a local disk and improves the MapReduce computation process.
US patent US9323580B2 discusses a method for resource optimization of map/reduce computing in a computing cluster. The method can include receiving a computational problem for processing in a map/reduce module, subdividing the computational problem into a set of sub-problems, and mapping a selection of the sub-problems in the set to respective nodes in a computing cluster. Clustered systems provide a natural infrastructure for use in modern MapReduce computing.
US patent application US20170315848 optimizes MapReduce jobs using a resource supply-demand based approach.
DESCRIPTION OF THE PROPOSED INVENTION:
MOHMR presents a refinement of MR by adopting thread-based execution for managing and utilizing memory resources efficiently. The architecture of MOHMR is shown in Figure 1. The MOHMR runs inside a virtual machine (VM), which is one of the main architectural design differences between MOHMR and HMR; the rest of the upper-layer architecture of HMR is retained.
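As an illustration of the early-release strategy described above, in which a map worker's memory is returned as soon as its task completes rather than after the whole map phase ends, the following sketch may help. It is a hypothetical Python model (the function names, per-task buffers and pool size are assumptions for exposition, not the patent's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_maps_with_early_release(splits, map_fn, workers=4):
    """Run map tasks as threads; free each worker's buffer as soon as
    its own task completes, instead of waiting for all map tasks."""
    # Hypothetical per-task buffers standing in for map-side memory.
    buffers = {i: bytearray(len(s)) for i, s in enumerate(splits)}
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(map_fn, s): i for i, s in enumerate(splits)}
        for fut in as_completed(futures):
            i = futures[fut]
            results[i] = fut.result()
            del buffers[i]  # early release: memory returned before the map phase ends
    return [results[i] for i in range(len(splits))]
```

In conventional HMR the `del buffers[i]` step would, in effect, happen only after the loop completes; releasing it inside the loop models the reduction of unutilized map worker resources.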
Advantages:-
The advantages of using the memory optimized Hadoop MapReduce framework are described below.
1. The memory optimized Hadoop MapReduce framework allows dynamic memory usage between the Map and Reduce phase executions.
2. Thread-based task execution is considered, aiding the realization of global memory management.
3. Achieves good results for execution of small real genomic sequence alignment and supports execution of both stream and non-stream applications.
4. The MOHMR framework model adopts a cloud computing platform; thus MOHMR is scalable in nature.
5. The proposed MOHMR framework model reduces execution time and utilizes memory resources more efficiently for executing stream and non-stream applications when compared with state-of-the-art Hadoop-based parallel computing models.
Fig 1: The Architecture of Proposed Memory Optimized Hadoop MapReduce Computation Model
Parallel Hadoop Makespan Model: -
This work presents a Parallel Hadoop MapReduce (PHMR) makespan model for utilizing otherwise unutilized slots more efficiently. First, a system model is described. Then, the sequential Hadoop MapReduce makespan model is described. Last, the proposed parallel Hadoop MapReduce makespan model is presented.
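A minimal illustrative makespan sketch may clarify the contrast between the sequential and parallel strategies. The symbols below are assumptions for exposition, not the patent's exact formulation: $M$ map tasks run over $s_m$ map slots with mean duration $\bar{t}_m$, and $R$ reduce tasks run over $s_r$ reduce slots with mean duration $\bar{t}_r$.

```latex
% Sequential HMR: reduce starts only after the last map wave finishes.
T_{\mathrm{seq}} \approx \left\lceil \frac{M}{s_m} \right\rceil \bar{t}_m
                 + \left\lceil \frac{R}{s_r} \right\rceil \bar{t}_r

% Parallel PHMR (illustrative): shuffle/reduce work overlaps the remaining
% map waves, so only the final reduce computation trails the map phase.
T_{\mathrm{par}} \approx \left\lceil \frac{M}{s_m} \right\rceil \bar{t}_m + \bar{t}_r
```

Under this sketch $T_{\mathrm{par}} \le T_{\mathrm{seq}}$ always, with strict improvement whenever more than one reduce wave is needed ($R > s_r$).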
Fig 2: The Work Flow of the Proposed Memory Optimization Hadoop Map Reduce
Figure 2 presents an I/O optimization for the HMR framework that minimizes disk seeks and attains better parallel I/O performance.
The proposed I/O scheduler is composed of two components, namely a ReadWorker (RW) and a CleanWorker (CW), which are accountable for read and write operations respectively. In the proposed model, I/O operations are of two types: active and passive I/O. Active I/O reads the input from the Hadoop Distributed File System (HDFS), writes the final output to HDFS, and also writes the Map jobs' intermediate output to attain fault tolerance. Active I/O has higher selectivity than passive I/O.
The CleanWorker cleans the data in the different pools. Requests with high selectivity are satisfied first. For requests with the same selectivity, the CleanWorker polls their pools and writes one block per instance in a round-robin manner, which this work terms interleaved I/O.
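The write ordering just described can be sketched as a toy model (the pool layout and function name below are assumptions for illustration, not the patent's implementation):

```python
from collections import deque

def clean_worker_order(pools):
    """pools: {selectivity: [list of per-request block lists]}.
    Returns the flat order in which blocks would be written:
    higher-selectivity requests first; among equal-selectivity
    requests, one block per pool per round (interleaved I/O)."""
    order = []
    for sel in sorted(pools, reverse=True):        # high selectivity first
        queues = [deque(blocks) for blocks in pools[sel]]
        while any(queues):
            for q in queues:                        # poll pools round-robin
                if q:
                    order.append(q.popleft())       # one block per pool per pass
    return order
```

For example, an active request with two blocks is flushed entirely before two equal-selectivity passive requests, whose blocks are then interleaved one per round.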
Advantages:-
The advantages of using the parallel Hadoop MapReduce framework are described below.
1. The parallel Hadoop MapReduce framework allows parallel execution using the multi-core environment within a virtual computing machine.
2. A mathematical proof of the parallel Hadoop MapReduce framework is presented, attaining good (accurate) correlation between the theoretical and experimental makespan.
3. The parallel Hadoop MapReduce framework can perform analysis of various stream and non-stream data-intensive and scientific algorithms and applications.
4. The parallel Hadoop MapReduce framework offers scalable computing due to the adoption of a cloud computing platform.
5. The proposed parallel Hadoop MapReduce framework model reduces execution time and cost for computing data-intensive and scientific algorithms when compared with state-of-the-art Hadoop-based parallel computing models.
Physical Hardware and Software Components:-
1. Parallel computing framework:
In a high-performance computing environment for parallel programming, the study of parallel design patterns is essential. Even though a lot of research has been carried out in the field of automatic compilation, there is still a limit to the extent of parallelization that can be identified and extracted.
2. Shared memory and thread approach:
The tasks share a common address space to perform read and write operations asynchronously in a shared-memory programming model. Independently operating multiple processors share the same memory resources. During implementation on a shared-memory platform, the user program variables are translated into actual global memory addresses by the native compilers.
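A minimal sketch of this shared-memory thread model, assuming a simple shared counter (illustrative only; not part of the claimed framework): multiple threads read and write the same address space, and a lock serializes the asynchronous updates so that no increment is lost.

```python
import threading

counter = {"value": 0}        # shared state visible to all threads
lock = threading.Lock()

def worker(iterations):
    for _ in range(iterations):
        with lock:            # serialize asynchronous read-modify-write
            counter["value"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter["value"] is now 4 * 1000 regardless of thread interleaving
```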
3. Parallel virtual machine:
The PVM system is another implementation of a functionally complete message-passing model. It is designed to link computing resources, for example in a heterogeneous network, and to provide developers with a parallel platform. As the message-passing model seemed to be the paradigm of choice in the late 1980s, the PVM project was created in 1989 to allow developers to exploit distributed computing across a wide variety of computer types. A key concept in PVM is that it makes a collection of computers appear as one large virtual machine.
4. Dryad:
Dryad is a well-known distributed computation platform for coarse-grained data-parallel algorithms and tasks. The Dryad distributed computing model takes its input as a directed acyclic graph (DAG), where the edges depict the channels over which information streams from one vertex to the next and the vertices depict computation tasks. In the high-performance computing adaptation, DryadLINQ, the information is kept in (or partitioned to) Windows shared directories on local virtual computing machines, and a meta-information record is used to deliver a description of the information replication and distribution.
5. Multi-core, GPU, FPGA:
A completely different approach in the parallel programming model context is the use of shared memory, in which every processor has direct access to the memory of every other processor in the system; a message-passing layer can also be built upon this shared-memory model. This class of multiprocessor is also called a Scalable Shared Memory Multiprocessor (SSMP). Like the development of the MPI specification, the development of OpenMP started for one simple reason: portability.
6. MapReduce framework:
To handle huge amounts of data, the traditional MySQL approach was not sufficient. Thus, for handling such data, that is, big data, Google introduced the MapReduce framework, a distributed processing and programming model written in the Java programming language. Map and Reduce are the two important tasks in the MapReduce algorithm. Map takes a set of data and transforms it into a different set of data, where individual elements are broken into tuples (key/value pairs). The output of the map task then becomes the input of the reduce task, which combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the map task is always performed first and the reduce task afterwards. The main benefit of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, mappers and reducers are the data-processing primitives. Dividing a data-processing application into mappers and reducers is sometimes nontrivial. But once an application is written in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
Reduce stage: The combination of the shuffle stage and the reduce stage is called the reduce stage. The reducer's task is to process the data that comes from the mapper; the output of the mapper is the input of the reducer. The newly produced set of output is stored in the HDFS (Hadoop Distributed File System). The Map and Reduce tasks are sent to the appropriate servers in the cluster during the MapReduce job. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data between nodes around the cluster.
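The Map → shuffle → Reduce flow described above can be sketched as a toy single-process word count (illustrative Python, not Hadoop's actual Java API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map: fragment a record into (key, value) tuples
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    # Reduce: combine all tuples sharing a key into one smaller tuple
    return (key, sum(values))

def map_reduce(records):
    intermediate = [kv for r in records for kv in map_phase(r)]
    intermediate.sort(key=itemgetter(0))          # shuffle: group tuples by key
    return [reduce_phase(k, [v for _, v in grp])
            for k, grp in groupby(intermediate, key=itemgetter(0))]
```

In Hadoop the same three steps run distributed: mappers on data-local nodes, the shuffle across the network, and reducers writing the final output to HDFS.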
7. Hadoop Framework
Apache Hadoop has a design comparable to Google's MR runtime: it accesses data through the Hadoop Distributed File System, which maps all the local storage of the virtual processing machines to a single hierarchical file namespace, enabling the data to be distributed over all of the data/virtual computing machines. The Hadoop Distributed File System likewise replicates (keeps duplicate copies of) the data on different virtual computing machines, so that failure of any virtual computing node containing a part of the data will not affect the analysis that uses it. Hadoop schedules the MR jobs depending on data locality, enhancing the overall I/O data rate.
8. Cloud computing framework:
During the day on which you are reading this, a larger amount of information will be generated than the amount of data contained in all written articles across the globe. The Internet Data Generation Center (IDGC) assessed the growth of information to be of a factor of three hundred from 2005 to 2020, expecting it to increase from one hundred and thirty exabytes to twenty thousand exabytes. This data surge transforms businesses, which now exploit large-scale data collection for processing data-intensive and scientific applications (discovery), and it aids the research community in moving toward a new direction and standard: Data Science (DS). Consequently, HPC algorithms are required to scale and distribute their computation so as to deal with uncontrollable volumes, high velocity and varieties of data. These difficulties are related to what is termed the "Big Data phenomenon". One development which accelerated the Big Data revolution, and which rose alongside it, is the distributed computing platform known as the cloud computing (CC) framework. The expansive, multiservice-oriented cloud computing platform, which enables pooling of computation and data as well as on-demand scaling (i.e., offering service on demand), provides an attractive option to support Big Data processing. Cloud computing platforms offer scalable, versatile and robust infrastructure managed by an outsourced service provider. They thus enable subscribers or clients to avoid the overhead of purchasing and managing complex distributed hardware. In this manner, clients can focus on leasing and scaling their services for better utilization of cloud resources, as dictated by the application's Quality of Service (QoS) (computation) requirements.
Cloud computing is one of the most rapidly emerging technologies of the present day and a service framework that facilitates on-demand services for data storage, computation and highly robust network systems.
| # | Name | Date |
|---|---|---|
| 1 | 202241018195-FORM 1 [29-03-2022(online)].pdf | 2022-03-29 |
| 2 | 202241018195-DRAWINGS [29-03-2022(online)].pdf | 2022-03-29 |
| 3 | 202241018195-COMPLETE SPECIFICATION [29-03-2022(online)].pdf | 2022-03-29 |
| 4 | 202241018195-FORM-9 [05-12-2022(online)].pdf | 2022-12-05 |
| 5 | 202241018195-FORM 18 [02-02-2024(online)].pdf | 2024-02-02 |
| 6 | 202241018195-FORM 18 [09-08-2024(online)].pdf | 2024-08-09 |
| 7 | 202241018195-FER.pdf | 2025-10-15 |
| 1 | 202241018195_SearchStrategyNew_E_202241018195_1E_23-09-2025.pdf | 2025-09-23 |