
A Real Time Data Pipeline System Using Parquet Format

Abstract: A real-time data pipeline for Parquet conversion on an analytical engine platform using real-time sink connectors (105). The system comprises modular software components (104) whose classes are used by the sink connectors (105); the modular software components (104) convert the data file to Parquet format, which supports optimized storage and analytics operations on the data. The system further comprises a schema registry (107) for building custom connectors with the help of the modular software components (104). A schema defines the structure and format of a data record, and the registry is a logical container that stores and organizes schemas. The schema registry (107) is also used to load the data into any other sink system.


Patent Information

Application #: 202321084867
Filing Date: 12 December 2023
Publication Number: 10/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email:
Parent Application:

Applicants

TECH MAHINDRA LIMITED
Tech Mahindra Limited, Phase III, Rajiv Gandhi Infotech Park Hinjewadi, Pune - 411057, Maharashtra, India

Inventors

1. JAIN, Anant Kumar
B3,503, Rohan Abhilasha,Wagholi, Pune - 412207, Maharashtra, India

Specification

Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of Invention:
A REAL-TIME DATA PIPELINE SYSTEM USING PARQUET FORMAT

Applicant:
TECH MAHINDRA LIMITED
A company Incorporated in India under the Companies Act, 1956
Having address:
Tech Mahindra Limited, Phase III, Rajiv Gandhi Infotech Park Hinjewadi,
Pune - 411057, Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application does not claim priority from any patent application.
TECHNICAL FIELD
[002] The present subject matter described herein, in general, relates to a real-time data pipeline system. Particularly, the present subject matter relates to a system that loads real-time data into an analytical engine in Parquet format using connectors and modular software components.
BACKGROUND
[003] Currently, one of the major tasks is to build a real-time data lake on cloud platforms holding a huge amount of data. This real-time big data comes with its own storage, scalability, and performance challenges.
[004] Further, the real-time data lake quickly faces challenges of its own as the amount of data grows rapidly. Due to the experimental and iterative nature of data science and the way a typical data lake processes data, many files are created on the cluster. It is not uncommon to encounter clusters with millions of different files. Currently, most clusters are managed using various techniques; however, the data in these clusters still needs to be managed in a systematic manner.
[005] The data lake is a logical place to store practically unlimited amounts of data of any format and schema, and it is relatively inexpensive and massively scalable due to the use of commodity hardware. An important premise of the data lake is that data will be there and available to users when it is needed. If there is no fast way to find the stored data, this impediment can defeat the purpose of having the data lake. If the users of the data lake always know exactly which files they need and understand the content of the files well, they can access the required data.
[006] Today, open-source platforms such as Apache Kafka are used for building real-time streaming data pipelines and applications, and may be offered as a fully managed service that makes it easy to build a data lake and run applications that process streaming data. Said platform allows a connector to be configured and deployed using modular software components with just a few clicks. The connect service provisions the required resources and sets up the cluster. It continuously monitors the health and delivery state of connectors, manages the underlying hardware, and auto-scales connectors to match changes in throughput. As a result, one can focus resources on building applications rather than on managing infrastructure. The modular software components migrate the connectors without code changes. The components support compatible clusters as sources and sinks; these clusters can be self-managed as long as the modular software components can privately connect to them. The software modular component creates JAR files that contain the implementation of one or more connectors, transforms, or converters. However, the JAR files follow a fixed standard format and are not format-free, which makes the system less efficient and less scalable.
[007] Hence, to overcome the aforesaid drawbacks, an efficient real-time data pipeline system is required that converts data into Parquet format by using modular software components and connectors.
OBJECTS OF THE INVENTION
[008] The main object of the present disclosure is to provide a real-time data pipeline system that provides for Parquet conversion on the analytical engine or platform by using connectors.
[009] Yet another object of the present disclosure is to provide the real-time data pipeline system to build the data lake on cloud platforms with a huge amount of data.
[0010] Yet another object of the present disclosure is to provide the real-time data pipeline system with a modular software component that performs lightweight logic such as transformation, format conversion, or filtering of data before delivering the data to a destination.
SUMMARY OF THE INVENTION
[0011] Before the present system is described, it is to be understood that this application is not limited to the particular machine, device or system, as there can be multiple possible embodiments that are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is for the purpose of describing the particular versions or embodiments only, and is not intended to limit the scope of the present application. This summary is provided to introduce aspects related to building a real-time data pipeline in parquet format using MSK Connect, and the aspects are further elaborated below in the detailed description. This summary is not intended to identify essential features of the proposed subject matter nor is it intended for use in determining or limiting the scope of the proposed subject matter.
[0012] The real-time data pipeline system provides for Parquet conversion on the analytical engine or platform using real-time sink connectors. The system comprises a computer-readable medium containing executable program instructions that, when executed by a data processing system, perform the steps of: creating modular software components (104) with specific classes at an analytical engine capable of accepting modular software components (104); receiving a stream of data in real time at said engine from a source database (101) via a source connector (102); permitting access to the stream of data to manage and enforce schemas on the stream of data stored within a schema registry (107) of said analytical engine and to determine a format of the data at said engine; and receiving a request from said modular software components (104) to transfer the stream of data from the analytical engine to an external database (106) in Parquet format via a sink connector (105), wherein the modular software component (104) derives the schemas from the stream of data stored in the schema registry (107), processes the data from the schema registry (107), and uses the specific classes to finally load the data in Parquet format on the external database (106).
[0013] In another embodiment, the present invention provides that the modular software component (104) is a plugin.
[0014] In another embodiment, the present invention provides that the schema registry (107) comprises a plurality of clusters configured for storing the real-time stream of data as data chunks in different formats.
[0015] In yet another embodiment, the present disclosure provides that enforcing the schema comprises checking a format of the stream of data stored in the plurality of clusters as data chunks with different schemas and maintaining compatibility of the data chunks in the schema registry (107).
[0016] In another embodiment, the present invention provides that the software component process comprises integrating the schemas of the stream of data into the specific class of the software modular component.
[0017] In yet another embodiment, the present disclosure provides that the specific class is configured to define a predetermined standard used for accessing each data record stored in the cluster.
[0018] In another embodiment, the present invention provides that the modular software component (104) automatically scales the sink connector (105) to adjust said connector for changes in the data.
[0019] In yet another embodiment, the present disclosure provides that the modular software component (104) is configured to perform transformation, format conversion, or filtering of data before delivering the data to the external database (106).
[0020] In another embodiment, the present invention provides that the source connector (102) is configured to pull data from a data source and push this data into the cluster, and the sink connector (105) is configured to pull data from the cluster and push this data into the external database (106).
[0021] In yet another embodiment, the present disclosure provides that the modular software component (104) is configured to build custom connectors in the schema registry (107).
BRIEF DESCRIPTION OF DRAWING
[0022] The foregoing summary, as well as the following detailed description of embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, example constructions of the disclosure are shown in the present document; however, the disclosure is not limited to the specific methods and devices disclosed in the document and the drawing. The detailed description is described with reference to the following accompanying figure.
[0023] Figure 1 illustrates a block diagram of an analytical engine platform using modular software components (i.e., plugin), real-time sink connectors, and real-time source connectors for parquet conversion of external data sources, in accordance with an embodiment of the present subject matter.
[0024] The figure depicts various embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures illustrated herein may be employed without departing from the principles of the disclosure described herein.

DETAILED DESCRIPTION OF THE INVENTION
[0025] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words "comprising", "having", and "including", and other forms thereof, are intended to be equivalent in meaning and open-ended, in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that, as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Although any devices and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary devices and methods are now described. The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.
[0026] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure is not intended to be limited to the embodiments illustrated, but is to be accorded the widest scope consistent with the principles and features described herein.
[0027] Following is a list of elements and reference numerals used to explain various embodiments of the present subject matter.
Reference Numeral Element Description
100 Real-time data pipeline system
101 Source database
102 Source connector
103 Intermittent storage
104 Modular software components or Plugin
105 Sink connector
106 External database
107 Schema Registry

[0028] In the present invention, to replicate an OLTP (online transaction processing) database from a source database in real time on an analytical engine, source connectors are used as producers to pull the incremental changes. The source connector is further integrated with the analytical engine cluster by loading that data into the cluster. Thereon, a sink connector pulls the data from these clusters and loads it onto an external database in Parquet format. In order to load the data in Parquet format, a software modular component is created containing the Parquet converter classes used at the sink connector. This represents a fault-tolerant, scalable, and performant solution for Parquet conversion and for building the data lake in Parquet using a unique software modular component. The unique software modular component provides a real-time, fault-tolerant, scalable, stable, and performant solution, with the schema registry providing the additional benefits of schema storage, schema evolution, and schema validation.
[0029] Before describing the working or operation of the features of the present invention, an introduction to the features used in the invention is first provided below.
[0030] Parquet File Format: Parquet is an efficient, structured, column-oriented (also called columnar storage), compressed, binary file format. Parquet supports several compression codecs, including Snappy, GZIP, deflate, and BZIP2. Structured file formats such as RCFile, Avro, SequenceFile, and Parquet offer better performance with compression support, which reduces the size of the data on disk and consequently the I/O and CPU resources required to deserialize the data. Parquet is a file format designed to support fast data processing for complex data, with several notable characteristics. Unlike row-based formats such as CSV or Avro, Parquet is column-oriented, meaning the values of each table column are stored next to each other rather than those of each record. Further, Parquet is open-source, meaning it is free to use under the Apache Hadoop license, and is compatible with most Hadoop data processing frameworks. Parquet is easily accessible regardless of the choice of data processing framework, data model, or programming language. Further, the data stored in the Parquet format is self-describing: in addition to the data, the Parquet file contains metadata, including the schema and structure. Each file stores both the data and the standards used for accessing each record, making it easier to decouple services that write, store, and read Parquet files.
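As an illustration of the column-oriented, compressed, self-describing layout described above, the following minimal Java sketch writes Avro records to a Parquet file with Snappy compression using the open-source parquet-avro library. The record schema and file name are illustrative assumptions, not part of the specification.

// Illustrative sketch: writing Avro records to a Snappy-compressed Parquet file.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema travels with the file: Parquet stores it in the footer metadata,
        // which is what makes the format self-describing.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("orders.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1L);
            rec.put("amount", 99.5);
            writer.write(rec); // values are laid out column by column on disk
        }
    }
}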
[0031] Software Modular Component (104): The software modular component is a plugin created on an analytical engine that makes it easy for developers to stream data to and from the analytical engine. Said modular component is an open-source framework for connecting analytical engine clusters with external systems such as databases. With said component's fully managed connectors built for the analytical engine, data can be moved into, or pulled from, popular databases. Further, said components can deploy connectors in analytical engines for streaming change logs from databases into the engine clusters, or deploy an existing connector with no code changes. The modular software components automatically scale to adjust for changes in the database.
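A minimal sketch of how such a plugin could be packaged is given below, assuming a Kafka Connect style converter interface (org.apache.kafka.connect.storage.Converter); the class name and the method bodies are illustrative placeholders rather than the actual component of the invention. Compiled into a JAR, such a class becomes a plugin that the connector runtime can load.

import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

// Illustrative plugin skeleton; a real component would delegate to its SerDe classes.
public class ParquetPluginConverter implements Converter {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // e.g. read the schema registry location from the connector configuration
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // serialize a record produced by a source connector
        throw new UnsupportedOperationException("illustrative skeleton");
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // deserialize bytes for a sink connector before format conversion
        throw new UnsupportedOperationException("illustrative skeleton");
    }
}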
[0032] Connectors (102, 105): Connectors built within the analytical engine integrate external systems such as databases with an analytical engine such as Apache Kafka by continuously copying streaming data from a data source into the analytical engine cluster, or continuously copying data from the cluster into an external database or data sink. A connector can also perform logic such as transformation, format conversion, or filtering of data before delivering the data to a destination. The connectors comprise a) source connectors (102), which pull data from a data source and push this data into the cluster, and b) sink connectors (105), which pull data from the cluster and push this data into a data sink.
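By way of illustration, a sink connector in a Kafka Connect style runtime is typically described by a small key/value configuration that names the connector class, the stream to consume, and the converter to use. The sketch below expresses such a configuration in Java; the connector and converter class names are hypothetical placeholders.

import java.util.Map;

public class SinkConnectorConfigSketch {
    // Illustrative configuration handed to the connector runtime at start-up.
    static Map<String, String> sinkConfig() {
        return Map.of(
            "connector.class", "com.example.ParquetSinkConnector",  // hypothetical sink connector (105)
            "tasks.max", "2",                                        // degree of parallelism
            "topics", "orders",                                      // stream held in intermittent storage (103)
            "value.converter", "com.example.ParquetPluginConverter"  // plugin class from the sketch above
        );
    }
}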
[0033] Database (101, 106): The database in the present matter is an SQL Server database or any other database, which holds the historical data and the real-time transactions reflected in the database tables. The goal is to load all the historical data and the real-time transaction data from the source database (101), using the real-time data pipeline, into the external database (106) and build a data lake.
[0034] Schema Registry (107): A schema defines the structure and format of a data record in the database; in other words, the schema is a versioned specification for reliable data. The schema registry is a logical container for schemas in the analytical engine. Registries allow for the organization of schemas as well as the management of access control over data. Each schema may have multiple versions. Versioning is governed by a compatibility rule that is applied to a schema, and requests to register new schema versions are checked against this rule by the schema registry. The schema version is used to determine the compatibility of registering new versions of a schema. When a schema is first created, the default checkpoint will be the first version, with a compatibility mode that allows control over how schemas can or cannot evolve over time. These modes form the contract between applications producing and consuming data. When a new version of a schema is submitted to the registry, the compatibility rule is applied to the schema, and it is then determined whether the new version can be accepted.
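A minimal sketch of such a compatibility check is given below, using Apache Avro's built-in SchemaCompatibility helper as a stand-in for the registry's compatibility rule; the two schema versions and the backward-compatibility policy shown are illustrative assumptions. Adding a field with a default value keeps the new reader schema able to read data written with the old schema, which is why the check reports the versions as compatible.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class CompatibilityCheckSketch {
    public static void main(String[] args) {
        // Version 1 of a record schema.
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"}]}");
        // Version 2 adds an optional field with a default value.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"NEW\"}]}");

        // Backward check: can a reader using v2 consume data written with v1?
        SchemaCompatibilityType result = SchemaCompatibility
            .checkReaderWriterCompatibility(v2, v1)
            .getType();
        System.out.println("Backward compatible: "
            + (result == SchemaCompatibilityType.COMPATIBLE));
    }
}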
[0035] Figure 1 illustrates a block diagram of the real-time data pipeline system (100). The present invention discloses a real-time data pipeline that converts external data files into Parquet format by using a software modular component and connectors. A connection is established with a real-time source connector (102) corresponding to an external source, also referred to as the source database, which periodically records the position of each event generated for each database transaction in the database transaction log. The source connector starts reading the tables in the scope of replication after the maximum log sequence number in the transaction log. Further, this periodically committed log sequence number (also known as an offset) ensures that no events are lost in case of connector failure when a network failure or node rebalancing happens. The connector generates events corresponding to each database read (r), insert (i), update (u), and delete (d), and the events are streamed to intermittent storage (103), which is hosted on the analytical engine. The intermittent storage (103) reliably stores the messages/events received from the source database via the source connector for a configurable retention period. The source connector (102) serializes and converts the events into Avro format, thereby enabling data compression and encryption, and sends the Avro data to the intermittent storage (103). The schema of the records is registered by the source connector (102) in the schema store present on the analytical engine. A schema store is a centralized place on an analytical engine to store, discover, and evolve data stream schemas. A schema defines the structure and format of a data record and is used for the validation of records before loading them into intermittent storage (103). Further, in the present invention, while pulling data from external sources, the source connector obtains the schema of each record with the help of schema registry libraries and validates it against the schema versions present in the schema registry. If there is a change in the structure of a record, i.e., its schema does not exist in the schema registry, the connector will identify the schema change and register a new version of the schema of the record in the schema registry based on the compatibility checks. These schemas are also validated at the sink connector when it pulls the data from the intermittent storage. This ensures the integrity of the data.
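The following Java sketch illustrates such a change event: a database operation is modelled as an Avro record carrying an operation code (r, i, u, or d), the committed offset (log sequence number), and a payload, and is then serialized to compact Avro bytes before being sent to the intermittent storage (103). The schema and field names are illustrative assumptions.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class ChangeEventSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"ChangeEvent\",\"fields\":["
          + "{\"name\":\"op\",\"type\":\"string\"},"       // r, i, u or d
          + "{\"name\":\"offset\",\"type\":\"long\"},"     // committed log sequence number
          + "{\"name\":\"payload\",\"type\":\"string\"}]}");

        GenericRecord event = new GenericData.Record(schema);
        event.put("op", "u");                              // an update captured from the transaction log
        event.put("offset", 42L);
        event.put("payload", "{\"id\":1,\"amount\":99.5}");

        // Serialize the event to Avro binary before it reaches intermittent storage.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();
        System.out.println("Serialized Avro event: " + out.size() + " bytes");
    }
}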
[0036] Each schema has multiple versions depending on changes in the record format. Versioning is governed by a compatibility rule (such as backward or previous-schema-version compatibility checks, future or forward schema-version compatibility checks, etc.) that is applied to a schema. Requests to register new schema versions are checked against this rule by the schema store before they can succeed. Further, record-format converter and serde (serializer and deserializer) libraries are present in the modular software component. These libraries are used by the source connector to extract the schema from the records, serialize and deserialize the data, and register the schema in the schema registry.
[0037] Further, in the present invention, the real-time sink connector (105), independent of the analytical engine, consumes data from the intermittent storage (103). It makes use of the classes present in the software modular component (104). The software modular component is made communicative with, or is coupled to, the real-time sink connector.
[0038] The modular software component (104) is a collection of serialization and deserialization converter classes (SerDe) using which data is serialized, deserialized, and converted from one format to another, such as to Parquet. It is an integration of classes from multiple platforms.
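A minimal sketch of one such SerDe class is given below: it deserializes Avro bytes back into a generic record so that the converter classes can then re-encode the record as Parquet (as in the writer sketch under paragraph [0030]). The class name and wiring are illustrative; in practice the schema would be looked up from the schema registry (107).

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroDeserializerSketch {
    private final Schema schema;

    public AvroDeserializerSketch(Schema schema) {
        this.schema = schema; // illustrative: normally resolved via the schema registry (107)
    }

    // Turn Avro bytes pulled from intermittent storage back into a generic record.
    public GenericRecord deserialize(byte[] avroBytes) throws Exception {
        return new GenericDatumReader<GenericRecord>(schema)
            .read(null, DecoderFactory.get().binaryDecoder(avroBytes, null));
    }
}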
[0039] The real-time sink connector (105) consumes data from the intermittent storage (103), validates the data using the schema versions stored in the schema store, deserializes and converts the data from Avro to Parquet, and loads the data to the external database or target systems (106), such as object storage, SQL databases, third-party cloud data warehouses, or NoSQL databases. Further, the source connector (102), plugin (104), sink connector (105), and intermittent storage (103) may all be run inside Docker containers and can be independent of the analytical engine, except for the schema store present on the engine.
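An illustrative skeleton of such a sink task, assuming a Kafka Connect style runtime (org.apache.kafka.connect.sink.SinkTask), is given below. The put() method receives batches of records, which a real task would validate against the schema registry, convert to Parquet using the plugin's classes, and write to the external database (106); the ParquetTargetWriter helper mentioned in the comments is hypothetical.

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class ParquetSinkTaskSketch extends SinkTask {
    @Override
    public String version() { return "0.1-sketch"; }

    @Override
    public void start(Map<String, String> props) {
        // open the target location (e.g. object storage path) and load the plugin's SerDe classes
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // validate against the schema registry, convert Avro -> Parquet, buffer the row
            // ParquetTargetWriter.write(record);  // hypothetical helper
        }
    }

    @Override
    public void stop() {
        // flush buffered rows and close the Parquet writer
    }
}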
[0040] The analytical engine of the subject matter may be any engine such as AWS S3, AWS Snowflake, or Hadoop.
Equivalents
[0041] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for the sake of clarity.
[0042] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims), are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to", the term "having" should be interpreted as "having at least", the term "includes" should be interpreted as "includes but is not limited to", etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present.
[0043] Although implementations of the real-time data pipeline system (100) have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features described. Rather, the specific features are disclosed as examples of implementations of the real-time data pipeline system (100).
Claims:

1. A computer-readable medium containing executable program instructions that, when executed by a data processing system, perform the steps of:
creating modular software components (104) with specific classes at an analytical engine capable of accepting modular software components (104);
receiving a stream of data in real time at said engine from a source database (101) via a source connector (102);
permitting access to the stream of data to manage and enforce schemas on the stream of data stored within a schema registry (107) of said analytical engine and determine a format of the data at said engine; and
receiving a request from said modular software components (104) to transfer the stream of data from the analytical engine to an external database (106) in Parquet format via a sink connector (105);
wherein the modular software component (104) derives the schemas from the stream of data stored in the schema registry (107), processes the data from the schema registry (107), and uses the specific classes to finally load the data in Parquet format on the external database (106).

2. The system as claimed in claim 1, wherein said modular software component (104) is a plugin.

3. The system as claimed in claim 1, wherein the schema registry (107) comprises a plurality of clusters configured for storing the real-time stream of data as data chunks in different formats.

4. The system as claimed in claim 1 or 3, wherein enforcing the schema comprises checking a format of the stream of data stored in the plurality of clusters as data chunks with different schemas and maintaining compatibility of the data chunks in the schema registry (107).

5. The system as claimed in claim 4, wherein the processing by the software component comprises integrating the schemas of the stream of data into the specific class of the software modular component.

6. The system as claimed in claim 1, wherein the specific class is configured to define a predetermined standard used for accessing each data record stored in the cluster.

7. The system as claimed in claim 1, wherein the modular software component (104) automatically scales the sink connector (105) to adjust said connector for changes in the data.

8. The system as claimed in claim 1, wherein the modular software component (104) is configured to perform transformation, format conversion, or filtering data before delivering the data to the external database (106).

9. The system as claimed in claim 1, wherein the source connector (102) is configured to pull data from the source database and push this data into the cluster, and the sink connector (105) is configured to pull data from the cluster and push this data into the external database (106).

10. The system as claimed in claim 1, wherein the modular software component (104) is configured to build custom connectors in the schema registry (107).

Documents

Application Documents

# Name Date
1 202321084867-STATEMENT OF UNDERTAKING (FORM 3) [12-12-2023(online)].pdf 2023-12-12
2 202321084867-REQUEST FOR EXAMINATION (FORM-18) [12-12-2023(online)].pdf 2023-12-12
3 202321084867-FORM 18 [12-12-2023(online)].pdf 2023-12-12
4 202321084867-FORM 1 [12-12-2023(online)].pdf 2023-12-12
5 202321084867-FIGURE OF ABSTRACT [12-12-2023(online)].pdf 2023-12-12
6 202321084867-DRAWINGS [12-12-2023(online)].pdf 2023-12-12
7 202321084867-DECLARATION OF INVENTORSHIP (FORM 5) [12-12-2023(online)].pdf 2023-12-12
8 202321084867-COMPLETE SPECIFICATION [12-12-2023(online)].pdf 2023-12-12
9 202321084867-FORM-9 [16-12-2023(online)].pdf 2023-12-16
10 Abstact.jpg 2024-01-06
11 202321084867-FORM-26 [19-02-2024(online)].pdf 2024-02-19
12 202321084867-Proof of Right [23-05-2024(online)].pdf 2024-05-23
13 202321084867-FER.pdf 2025-06-04
14 202321084867-FORM 3 [12-07-2025(online)].pdf 2025-07-12

Search Strategy

1 202321084867_SearchStrategyNew_E_202321084867E_08-05-2025.pdf