
Real Time Updatable Splined Learned Index

Abstract: Existing learned indexes require retraining of data models whenever a new data entry is to be made. This is time-consuming, effort-intensive, and costly, especially in applications that require frequent insertion and deletion of data entries. The disclosure herein generally relates to learned indexes, and, more particularly, to a real-time updatable splined learned index. The real-time updatable splined learned index uses an Overflow Array (OFA) to accommodate new data entries. The system disclosed herein provides algorithms to perform insertion and lookup in the real-time updatable splined learned index. Further, the real-time updatable splined learned index is configurable based on characteristics of the hardware components on which it is implemented. [To be published with FIG. 2]


Patent Information

Application #
Filing Date
22 June 2021
Publication Number
52/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
kcopatents@khaitanco.com
Parent Application
Patent Number
Legal Status
Grant Date
2025-01-02
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. MISHRA, Mayank
Tata Consultancy Services Limited Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West Maharashtra India 400607
2. SINGHAL, Rekha
Tata Consultancy Services Limited Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West Maharashtra India 400607

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention: REAL-TIME UPDATABLE SPLINED LEARNED INDEX
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to learned indexes, and, more particularly, to a real-time updatable splined learned index.
BACKGROUND
[002] Machine Learning (ML) algorithm-based systems are used in a wide range of applications, to accelerate performance and to automate processes. The ML based systems handle extensive data processing, to arrive at conclusions, to generate predictions, and accordingly trigger specific actions. The data may be real-time as well as non-real-time, stored in appropriate storage spaces. For the systems to be efficient, it is important to ensure faster access to the data stored in the databases.
[003] In the ML based approaches, data models may have to be updated frequently, and this may require simultaneous inserts and deletes in the data model. Data indexing is a traditional approach, which optimizes performance of databases by minimizing the number of disk accesses required while processing a query. Machine learning algorithms have accelerated data access through the 'learned index', where a set of data items is indexed by a model learned on pairs of a data key and the corresponding record's position in memory. Traditional learned index approaches have the disadvantage that they require retraining of the data model every time a new data entry is inserted. As data insertion (and deletion) operations are quite frequent in most applications, the retraining becomes time-consuming, expensive, and effort-intensive.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method is provided. In this method, a new data entry to be inserted into a spline based learned index is collected as input, via one or more hardware processors, wherein the new data entry comprises key and record position information as a (k_new, p_N) tuple. Further, a spline point that is a lower bound to the value of k_new in the spline based learned index is identified via the one or more hardware processors. Further, if any free space is available out of a space quota allocated in an Overflow Array (OFA) for the spline point that has been identified as the lower bound, it is determined that the new data entry is to be accommodated in the OFA of the spline based learned index. Further, an absolute index of the free space is calculated and the value of k_new is inserted into the free space, via the one or more hardware processors. Further, an overflow count value indicating the current available space in the space quota allocated for the spline point is updated after inserting the value of k_new into the free space, via the one or more hardware processors.
[005] In another aspect, a method of searching for an entry in the spline based learned index is provided. Searching for a data entry in the spline based learned index comprises the following steps. Initially, a search query is received as input, wherein the search query comprises a key k of the data entry to be looked up in the spline based learned index. In response to receiving the search query, a lower bound spline point of a spline points pair of the key k is located. Further, a search is performed in a Data Array (DA) and in the overflow array of the spline based learned index, in parallel. The search in the DA includes the following steps. Initially, an approximate interpolation position p_i is obtained in the DA. Further, one of a binary search or a linear search is applied in an area between a lower error boundary (p_i - max_error) and an upper error boundary (p_i + max_error), to obtain the position of k or the position of the key immediately smaller than the value of k. Further, a linear scan is performed over the DA starting from the obtained position till the position where the keys are identical to k. Performing the search in the overflow array includes the following steps. Initially, the presence of any new entry in the space quota allocated for the lower bound spline point in the overflow array is determined based on the value of an overflow count of the lower bound spline point. Further, a linear scan is performed in the space quota allocated for the lower bound spline point in the overflow array, if the presence of new entries is detected.

[006] In yet another aspect, a system is provided. The system includes one or more hardware processors, an I/O interface, and a memory storing a plurality of instructions. The plurality of instructions, when executed, cause the one or more hardware processors to initially collect a new data entry to be inserted into a spline based learned index, as input, wherein the new data entry comprises key and record position information as a (k_new, p_N) tuple. Further, a spline point that is a lower bound to the value of k_new in the spline based learned index is identified by the system. Further, if any free space is available out of a space quota allocated in an Overflow Array (OFA) for the spline point that has been identified as the lower bound, the system determines that the new data entry is to be accommodated in the OFA of the spline based learned index. Further, an absolute index of the free space is calculated and the value of k_new is inserted into the free space, by the system. Further, an overflow count value indicating the current available space in the space quota allocated for the spline point is updated after inserting the value of k_new into the free space, by the system.
[007] In yet another aspect, the system searches the spline based learned index for finding a data entry. Searching in the spline based learned index by the system includes the following steps. Initially, a search query is received as input, wherein the search query comprises a key k of the data entry to be looked up in the spline based learned index. In response to receiving the search query, a lower bound spline point of a spline points pair of the key k is located. Further, a search is performed in a Data Array (DA) and in the overflow array of the spline based learned index, in parallel. The search in the DA includes the following steps. Initially, an approximate interpolation position p_i is obtained in the DA. Further, one of a binary search or a linear search is applied in an area between a lower error boundary (p_i - max_error) and an upper error boundary (p_i + max_error), to obtain the position of k or the position of the key immediately smaller than the value of k. Further, a linear scan is performed over the DA starting from the obtained position till the position where the keys are identical to k. Performing the search in the overflow array includes the following steps. Initially, the presence of any new entry in the space quota allocated for the lower bound spline point in the overflow array is determined based on the value of an overflow count of the lower bound spline point. Further, a linear scan is performed in the space quota allocated for the lower bound spline point in the overflow array, if the presence of new entries is detected.
[008] In yet another aspect, a non-transitory computer readable medium is provided. The non-transitory computer readable medium includes a plurality of instructions, which when executed by one or more hardware processors, cause the following steps. Initially, a new data entry to be inserted into a spline based learned index is collected as input, via the one or more hardware processors, wherein the new data entry comprises key and record position information as a (k_new, p_N) tuple. Further, a spline point that is a lower bound to the value of k_new in the spline based learned index is identified via the one or more hardware processors. Further, if any free space is available out of a space quota allocated in an Overflow Array (OFA) for the spline point that has been identified as the lower bound, it is determined that the new data entry is to be accommodated in the OFA of the spline based learned index. Further, an absolute index of the free space is calculated and the value of k_new is inserted into the free space, via the one or more hardware processors. Further, an overflow count value indicating the current available space in the space quota allocated for the spline point is updated after inserting the value of k_new into the free space, via the one or more hardware processors.
[009] In yet another aspect, a non-transitory computer readable medium for searching for an entry in the spline based learned index is provided. Searching for the data entry in the spline based learned index comprises the following steps. Initially, a search query is received as input, wherein the search query comprises a key k of the data entry to be looked up in the spline based learned index. In response to receiving the search query, a lower bound spline point of a spline points pair of the key k is located. Further, a search is performed in a Data Array (DA) and in the overflow array of the spline based learned index, in parallel. The search in the DA includes the following steps. Initially, an approximate interpolation position p_i is obtained in the DA. Further, one of a binary search or a linear search is applied in an area between a lower error boundary (p_i - max_error) and an upper error boundary (p_i + max_error), to obtain the position of k or the position of the key immediately smaller than the value of k. Further, a linear scan is performed over the DA starting from the obtained position till the position where the keys are identical to k. Performing the search in the overflow array includes the following steps. Initially, the presence of any new entry in the space quota allocated for the lower bound spline point in the overflow array is determined based on the value of an overflow count of the lower bound spline point. Further, a linear scan is performed in the space quota allocated for the lower bound spline point in the overflow array, if the presence of new entries is detected.
[010] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[011] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[012] FIG. 1 illustrates an exemplary system for real-time updatable splined learned index, according to some embodiments of the present disclosure.
[013] FIG. 2 is a flow diagram depicting steps involved in the process of real-time insert in the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.
[014] FIG. 3 is a flow diagram depicting steps involved in the process of lookup for an entry in the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.
[015] FIG. 4 is a flow diagram depicting steps involved in the process of determining space quota for each spline point, in an Overflow Array of the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.

[016] FIGS. 5A and 5B (collectively referred to as FIG. 5) is a flow diagram depicting steps involved in the process of determining value of a threshold of array size, of the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.
[017] FIG. 6 is a flow diagram depicting steps involved in the process of selecting one of a linear search and a binary search for lookup in the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.
[018] FIGS. 7A, 7B, and 7C depict example architecture, data insertion, and data lookup of the real-time updatable splined learned index, respectively, using the system of FIG. 1, according to some embodiments of the present disclosure.
[019] FIGS. 8A, 8B, and 8C are example graphs depicting performance of insert operation in the real-time updatable splined learned index, by the system of FIG. 1, according to some embodiments of the present disclosure.
[020] FIGS. 9A, 9B, and 9C are example graphs depicting performance of lookup operation in the real-time updatable splined learned index, by the system of FIG. 1, according to some embodiments of the present disclosure.
[021] FIG. 10A is an example graph depicting lookup time for range queries using the real-time updatable splined learned index, by the system of FIG. 1, according to some embodiments of the present disclosure.
[022] FIG. 10B is an example graph depicting throughput of the real-time updatable splined learned index for mix of concurrent inserts and lookup, by the system of FIG. 1, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[023] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are

described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[024] Faster data access is an essential requirement for Machine Learning (ML) based systems to be efficient. In the ML based approaches, data models may have to be updated frequently, and this may require simultaneous inserts and deletes in the data model. Data indexing is a traditional approach, which optimizes performance of databases by minimizing the number of disk accesses required while processing a query. Machine learning algorithms have accelerated data access through the 'learned index', where a set of data items is indexed by a model learned on pairs of a data key and the corresponding record's position in memory.
[025] RadixSpline, one of the traditional learned index approaches, supports an insert operation to facilitate insertion of a new data entry into a data model. However, RadixSpline requires retraining of the data model. This poses challenges because most process intensive applications require frequent insertion and deletion of entries to and from the data models, and retraining the data model every time becomes time-consuming, expensive, and effort-intensive.
[026] Further, one of the state-of-the-art methods addressing the problem of insertion of new data entries into data models discusses the use of caches to make the index structures insertion tolerant. This method uses an additional cache in which new data entries can be made. However, in this approach, the system searches for available cache space and inserts new data at random locations. The absence of a structured/formalized insertion process reduces the effectiveness of the data insertion. Also, in this approach, data lookup involves searching in main memory first, and then searching in the cache only if the entry being looked up is not found in the main memory. The lookup becomes time-consuming in this sequential approach, with an increase in the size of the main memory.
[027] The method and system of the real-time updatable splined learned index disclosed herein supports insertion of new data entries and lookup of entries,

by providing an appropriate structure of the splined learned index and mechanisms for the insertion and lookup. The real-time updatable splined learned index structure, and algorithms provided for the insertion and lookup of data entries are explained herein.
Table 1: Acronyms/Functions and descriptions

[028] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 10B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[029] FIG. 1 illustrates an exemplary system for real-time updatable splined learned index, according to some embodiments of the present disclosure. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any

devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
[030] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, and the like. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
[031] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[032] Further, the memory 102 includes a database 108 that stores all data associated with operations such as, but not limited to, insertion and lookup of data in the real-time updatable splined learned index. For example, the database stores a plurality of instructions which, when executed by the one or more hardware processors 104, cause execution of various processes associated with various functions of the real-time updatable splined learned index. In another example, the real-time updatable splined learned index and all associated data are hosted in the database 108. Further, information on new data entry insertions attempted, results of insertion (success/fail), attempted data lookups, and so on, at least for a period of time, may be stored as historical information. Functions of the components of the system 100 are explained in conjunction with FIG. 2 through FIG. 7C.

[033] FIG. 2 is a flow diagram depicting steps involved in the process of real-time insert in the real-time updatable splined learned index, using the system of FIG. 1, according to some embodiments of the present disclosure.
[034] To understand the insert operation better, the structure of the real-time updatable splined learned index is explained with respect to FIG. 7A. As seen in FIG. 7A, the real-time updatable splined learned index includes a Data Array (DA), a Spline-point Array (SA), a Radix Table (RT), and an Overflow Array (OFA). Consider a data array having (key, record) pairs. The data is sorted on keys and stored from positions pos_0 to pos_N-1, such that key_i is stored at position pos_i, and if key_i ≤ key_j, then pos_i ≤ pos_j. The Cumulative Distribution Function (CDF) of the (key, position) pairs is learnt by creating a spline-based model. In this model, spline points are learnt over the CDF curve. The spline points are stored in an array, as (spline key, position, overflow count) tuples, sorted on their key values. The array of spline points is accessed via the radix table (RT), such that the RT maps the most-significant bits (MSB) of a given data key to the closest spline point in the array of spline points. The radix table's size is 2^radix_bits, where radix_bits is a configurable input parameter that determines the number of MSB bits used in the radix table. For example, consider a spline point whose key has its first 3 bits as 010. There can be multiple eligible spline points to be pointed to by a radix table entry (all having the same first radix_bits in their respective keys); however, the spline point with the smallest key is selected. Similar to the SA, the OFA also stores (key, position, overflow count) tuples for newly inserted data (also referred to as new data entries, which may also be represented in terms of keys present in the data entry). The real-time updatable splined learned index architecture is such that every spline point in the SA has a fixed number of slots in the OFA. By inserting the new data entries in the OFA, the original data structure in the DA remains unaffected. This approach helps in performing data insertions without having to retrain the data model.
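The layout described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class and field names are invented for the sketch, OFA slots are shown as simple (key, position) pairs, the key width is kept small, and the per-spline-point quota is an assumed parameter.

```python
from dataclasses import dataclass, field

@dataclass
class SplinePoint:
    key: int
    position: int             # position learnt on the CDF for this spline key
    overflow_count: int = 0   # slots already used in this point's OFA quota

@dataclass
class SplinedIndex:
    radix_bits: int   # number of MSB bits used by the radix table
    quota: int        # fixed number of OFA slots reserved per spline point
    key_bits: int = 8 # total key width (small here, for the example)
    data: list = field(default_factory=list)         # DA: (key, record), sorted on key
    spline: list = field(default_factory=list)       # SA: SplinePoint entries, sorted on key
    radix_table: dict = field(default_factory=dict)  # RT: MSB prefix -> SA index
    overflow: list = field(default_factory=list)     # OFA: `quota` slots per spline point

    def build(self):
        # The RT maps the radix_bits most-significant bits of a key to the
        # smallest-keyed spline point sharing that prefix.
        for i, sp in enumerate(self.spline):
            prefix = sp.key >> (self.key_bits - self.radix_bits)
            self.radix_table.setdefault(prefix, i)  # first (smallest key) wins
        # The single shared OFA: num spline points times quota empty slots.
        self.overflow = [None] * (len(self.spline) * self.quota)
```

For instance, with 8-bit keys and radix_bits = 2, spline keys 64 and 96 share the MSB prefix 01, so the radix table entry for that prefix points at the spline point with key 64.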
[035] Various steps in the data insertion process are depicted in FIG. 2, and are explained with reference to FIG. 7B.
[036] At step 202, the system 100 collects a new data entry to be inserted into the spline based learned index, as input for a real-time/dynamic data insertion operation to be performed. The 'new data entry' is alternately referred to as 'new data' or 'data' throughout the description. The data is in (k_new, p_N) tuple format, where k_new is the value of a key, and p_N represents the record position (alternately referred to as 'position') of the key. p_N may be associated with the first non-filled location in the DA.
At step 204, the system 100 identifies a spline point that is a lower bound to k_new, from the spline point array SA. The system 100 may use the get spline point function to identify the spline point, by using a methodology used by the RadixSpline approach. Each spline point has only a fixed space (referred to as a "space quota") allocated in the OFA. At the time of attempting a data insertion, the system 100 has to verify that the OFA can accommodate the new data entry, based on space availability in the space quota allocated for the spline point that has been identified as the lower bound spline point to k_new. The system 100 verifies the space availability by checking the value of the overflow_count of the spline point. The value of the overflow_count indicates/represents the available space in the space quota. If the overflow_count indicates that there is no free space in the space quota of the spline point identified as the lower bound spline point (i.e., the overflow_count is FULL), the insert attempt may be terminated, and an error may be returned. If the system determines at step 206 that free space is available in the space quota, then at step 208, the system 100 calculates an absolute index of the free space, and inserts the value of k_new into the free space. The absolute index is determined as follows. There is a single overflow array which is shared by the spline points, and each spline point has a dedicated region (a part) of the overflow array reserved for it. The total size of the overflow array is the number of spline points times the space reserved for one spline point. Numbering the spline points from 1 and assuming the array index begins at 0, the region reserved for spline point s begins at (s - 1) times the space reserved for one spline point, and the next free slot in that region is at (s - 1) times the space reserved for one spline point, plus overflow_count. The space reserved for spline point s is full when overflow_count equals the space reserved for one spline point, i.e., when the last valid offset (the space reserved for one spline point, minus 1) has already been filled. After inserting the value of k_new into the free space, at step 210, the system 100 updates the value of the overflow count of the spline point, to capture the latest information on the available free space.
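The insert flow of steps 202 through 210 can be sketched as below. The sketch uses 0-based spline-point numbering (so the region base is s × quota rather than (s − 1) × quota) and treats overflow_count as the number of filled slots in the quota. The function and field names are illustrative; get_spline_point in the disclosure is radix-table assisted, while here a plain scan stands in for it.

```python
def insert(spline, overflow, quota, k_new, p_new):
    """Insert (k_new, p_new) into the OFA, per steps 202-210 of FIG. 2."""
    # Step 204: lower-bound spline point for k_new (stand-in for get_spline_point).
    s = max(i for i, sp in enumerate(spline) if sp['key'] <= k_new)
    sp = spline[s]
    # Step 206: is any free space left in this spline point's quota?
    if sp['overflow_count'] >= quota:
        raise OverflowError('space quota for spline point is FULL')
    # Step 208: absolute index of the free slot in the shared OFA.
    idx = s * quota + sp['overflow_count']
    overflow[idx] = (k_new, p_new)
    # Step 210: record that one more slot of the quota is now used.
    sp['overflow_count'] += 1
    return idx
```

The sketch assumes the first spline point in the SA lower-bounds every inserted key; a key below all spline keys would make the `max(...)` generator empty.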
[038] In an embodiment, the keys newly inserted into the OFA are not maintained in order. However, all the keys inserted for a particular spline point are smaller than the smallest of the keys inserted for the next spline point in the SA. Also, as the space in the OFA corresponding to any two adjoining spline points is bounded by Os, the spline based learned index can support only as many insert operations between any two adjoining spline points before retraining. This approach helps to keep the lookup time unaffected. In another embodiment, when a large amount of data is to be processed, multiple OFAs may be used in parallel to distribute load, which in turn helps in achieving faster insertion even while facilitating a large number of concurrent insert operations.
[039] In another embodiment, the real-time updatable splined learned index supports concurrent insert operations corresponding to different pairs of consecutive spline-points because of non-overlapping overflow spaces in the OFA. Only the set of inserts between the same pair of spline-points get serialized. The real-time updatable splined learned index can also support multi-threading to support a large number of uniformly distributed concurrent real-time insert operations.
[040] In the case of skewed insert operations, the OFA space corresponding to certain consecutive spline-point pairs gets filled up more quickly than others. Keeping more than a bounded size Os per spline-pair may slow down lookup operations for the newly inserted keys. In such scenarios, retraining of the real-time updatable splined learned index may be initiated by the system 100 to rebuild the index. In yet another embodiment, the real-time updatable splined learned index is configurable to allow users to obtain more data elements in a range of interest, by supporting allocation of OFAs proportional to the number of expected insert operations. In an alternate embodiment, the system 100 can be configured to learn an insertion distribution from historical data and then use linear regression to allocate an appropriate size of overflow arrays, with the assumption that the same pattern as that in the historical data repeats in future insert operations as well. The insert operation is depicted in FIG. 7B.
[041] Once data is stored in the real-time updatable splined learned index, the system 100 supports a lookup operation/action to search and retrieve specific data from the real-time updatable splined learned index. Various steps involved in the lookup operation are depicted in FIG. 3.
[042] The system 100 initiates a lookup operation upon receiving a search query as input, at step 302, which includes a key k of an entry to be looked up. In the real-time updatable splined learned index, post insert operation, the key k has a pair of spline points around it, i.e., one on either side. Upon receiving the search query, at step 304, the system 100 locates a lower bound spline point of the spline points pair of the key k. The system 100 may locate the lower bound spline point by using the get spline point function. Further, at step 306, the system 100 performs the search in the Data Array (DA) and in the OFA, in parallel. The search in the DA is executed in step 308, and the search in the OFA is depicted in step 310.
[043] Various steps involved in the search in the DA are covered in steps 308a through 308c. At step 308a, the system 100 obtains an approximate interpolation position p_i of k in the DA, using an interpolation function. At step 308b, the system 100 applies a linear search or a binary search in an area between a lower error boundary and an upper error boundary of the DA, to obtain the position of k or the position of the key immediately smaller than the value of k. The lower error boundary is represented by p_i - max_error and the upper error boundary is represented by p_i + max_error. Further, at step 308c, the system 100 performs a linear scan over the DA starting from the obtained position till a position where the keys are identical to k. The approach used by the system 100 for making a selection between the linear search and the binary search is depicted in FIG. 6. At step 602, the system 100 determines the size of the area between the lower error boundary and the upper error boundary as a search area. Further, at step 604, the system 100 compares the determined search area with a threshold of array size. The system 100 selects the linear search at step 606, if the search area is less than the threshold of array size. Similarly, if the search area exceeds the threshold of array size, then the binary search is selected by the system 100, at step 608.
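Steps 602 through 608 reduce to a single comparison of the error window's size against the threshold of array size. A minimal sketch, with illustrative names and with Python's bisect standing in for whatever binary search the implementation actually uses:

```python
import bisect

def search_window(data, k, p_i, max_error, threshold_T):
    """Return the index of k, or of the key immediately smaller than k,
    within [p_i - max_error, p_i + max_error] (steps 308b and 602-608)."""
    lo = max(0, p_i - max_error)
    hi = min(len(data), p_i + max_error + 1)
    if hi - lo < threshold_T:
        # Step 606: small search area -> linear search (cache friendly).
        pos = lo
        for i in range(lo, hi):
            if data[i] <= k:
                pos = i
            else:
                break
        return pos
    # Step 608: search area exceeds the threshold -> binary search.
    return max(bisect.bisect_right(data, k, lo, hi) - 1, lo)
```

Both branches return the same index; they differ only in cost, which is exactly what the threshold of array size 'T' calibrates.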
[044] Various steps involved in the search in the OFA are covered in steps 310a and 310b. To avoid unnecessary search in the OFA, and thereby unnecessary delays in the lookup process, the system 100 performs the search in the OFA only if there are any entries present in the space quota of a spline point. To perform this check, at step 310a, the system 100 determines the presence of any new data entry in the space quota allocated for the lower bound spline point in the OFA, based on the value of the overflow count of the lower bound spline point. If the overflow count indicates the presence of any new data entries, then at step 310b, the system 100 performs a linear scan in the space quota allocated for the lower bound spline point, in the OFA, to look up the key k. The lookup operation is depicted in FIG. 7C.
[045] As the searches in the DA and the OFA are performed in parallel, the searches together return the search result with respect to the key k, wherein the result is a consolidated list of positions of k. The parallel search helps reduce possible delay in the lookup.
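One way to realize the parallel search is with two threads whose results are consolidated (an illustrative sketch; the disclosure does not mandate a particular threading mechanism):

```python
from concurrent.futures import ThreadPoolExecutor

def lookup(k, search_da_fn, search_ofa_fn):
    """Run the DA and OFA searches for key k in parallel and consolidate
    the positions returned by both into a single sorted list."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        da_future = pool.submit(search_da_fn, k)
        ofa_future = pool.submit(search_ofa_fn, k)
        return sorted(da_future.result() + ofa_future.result())
```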
[046] In an embodiment, the size of the OFA is determined based on characteristics of the hardware on which the real-time updatable splined learned index is implemented. A variety of hardware processors are available in the market, from different manufacturers and with different specifications. When the same approach is used irrespective of the type of hardware, results may vary. However, if the real-time updatable splined learned index is configured by considering the hardware characteristics, results can be improved. Various steps involved in the process of determining the size of the OFA based on the hardware characteristics are depicted in FIG. 4.
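Per steps 402 through 408 of FIG. 4 (as also recited in claim 4), the per-spline-point space quota in the OFA is 2 * max_error, capped by the hardware-derived threshold of array size T. The following sketch uses T - 1 for the capped case as an illustrative assumption, since the disclosure only requires "a value less than the threshold":

```python
def space_quota_size(max_error, threshold_T):
    """Size of the space quota allocated per spline point in the OFA
    (steps 402-408 of FIG. 4)."""
    candidate = 2 * max_error
    if candidate >= threshold_T:
        # Step 406: a value less than the threshold of array size
        # (T - 1 is an illustrative choice, not mandated by the disclosure).
        return threshold_T - 1
    return candidate                 # step 408: 2 * max_error fits under T
```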
[047] At step 402, the system 100 determines a threshold of array size ‘T’, based on the characteristics of the hardware used to implement the splined learned index. Various steps involved in the process of determining the threshold of array size ‘T’ are depicted in FIG. 5. At step 502, the system 100 initializes a plurality of parameters including a Random_Full_Array, a decrement value, a current_size, a minimum_value, and the threshold of array size ‘T’. Further, at step 504, the system

100 creates two arrays, i.e., a Random_Array and a Sorted_Array, with a size equal to the current_size. The Random_Array and the Sorted_Array may be created in the memory 102, in a space different from where the DA and the OFA are hosted. The Random_Array and the Sorted_Array are temporary arrays, which may be deleted once the threshold of array size ‘T’ is determined. Further, at step 506, the system 100 moves a plurality of items of size equal to the current_size, from the Random_Full_Array to the Random_Array and the Sorted_Array.
[048] Further, at step 508, the system 100 moves a random subset of the plurality of items in the Random_Array to a Test_Array. The system 100 then sorts the content of the Sorted_Array, and at step 510 performs a binary search over the Sorted_Array and finds the mean time taken for the search, wherein the mean time represents the value of time_binary. The system 100 also performs a linear search over the Random_Array and finds the mean time taken for the search, wherein the mean time represents the value of time_linear.
[049] At step 512, the system 100 compares the values of time_linear and time_binary. If the value of time_linear is less than or equal to that of time_binary, then at step 514 the system 100 determines the value of the threshold of array size T as equal to the initialized value of current_size. If the value of time_linear exceeds the value of time_binary, then at step 516, the system 100 determines the value of the threshold of array size T as equal to an updated value of current_size, wherein the difference between the initialized value of the current_size and a decrement_value provides the updated value of current_size. In an embodiment, the smallest value of the decrement_value is 1 and the largest is the current_size. An appropriate value of the decrement_value may be selected to speed up the process of finding ‘T’. For example, when the decrement_value is kept as 1, the time taken is longer, whereas when the decrement_value is 5, the time taken is comparatively shorter.
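Interpreting steps 502 through 516 as a loop that shrinks current_size until the linear search catches up with the binary search, the procedure of FIG. 5 may be sketched as follows (the initial values and the Test_Array subset size are illustrative assumptions):

```python
import bisect
import random
import time

def determine_threshold_T(random_full_array, decrement_value=5,
                          current_size=256, minimum_size=2):
    """Determine the threshold of array size 'T' (FIG. 5) by timing a
    binary search on a Sorted_Array against a linear search on a
    Random_Array of the same size."""
    while current_size > minimum_size:
        # Steps 504/506: populate Random_Array and Sorted_Array.
        random_array = random.sample(random_full_array, current_size)
        sorted_array = sorted(random_array)
        # Step 508: move a random subset of the items to a Test_Array.
        test_array = random.sample(random_array, min(32, current_size))

        # Step 510: mean binary-search time over the Sorted_Array.
        start = time.perf_counter()
        for item in test_array:
            bisect.bisect_left(sorted_array, item)
        time_binary = (time.perf_counter() - start) / len(test_array)

        # Step 510: mean linear-search time over the Random_Array.
        start = time.perf_counter()
        for item in test_array:
            for value in random_array:
                if value == item:
                    break
        time_linear = (time.perf_counter() - start) / len(test_array)

        # Steps 512-514: T is the size at which linear search is as fast.
        if time_linear <= time_binary:
            return current_size
        current_size -= decrement_value  # step 516: shrink and retry
    return minimum_size
```

Because the result depends on measured timings, T varies with the hardware on which the procedure runs, which is the intended hardware-aware behaviour.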
[050] Various steps in method 200 may be performed in the same order as depicted in FIG. 2, or in any alternate order that is technically feasible. In another embodiment, one or more steps in method 200 may be skipped as needed.
Experimental Results:-

[051] In the experiments, SOSD, a benchmark for learned indexes, was used to evaluate the performance of the real-time updatable spline based learned index in terms of lookup time and memory usage. The SOSD benchmark does not have a mechanism for evaluating insert operations. Hence, the SOSD benchmark was extended to evaluate the time taken by the learned index to update itself for an insert operation. The real-time updatable spline based learned index was also evaluated for throughput on concurrent insert and lookup operations using multithreading.
Datasets and System Configuration:-
[052] For the experiments, the following datasets were used: amzn (Amazon® data containing book popularity), wiki (Wikipedia® edit timestamps), osm (OpenStreetMap cell IDs), and face (IDs of Facebook® users), each containing 200 million 64-bit keys in sorted order to benchmark learned indexes. The type of CDF impacts the number of spline points and hence the model size used by the real-time updatable spline based learned index, which is given in Table 2 for the amzn and face datasets. The real-time updatable spline based learned index was evaluated on a 56 core Intel Xeon server with 64 GB RAM running Ubuntu 18.04.
Table 2. Comparison of the real-time updatable spline based learned index and RadixSpline methods using the SOSD benchmark

| Dataset          | Radix bits | Max_error | Spline-points | RadixSpline (MB) | Real-time updatable spline based learned index (MB) | #max inserts |
|------------------|------------|-----------|---------------|------------------|-----------------------------------------------------|--------------|
| face (200M keys) | 20         | 2         | 31.5M         | 508.49           | 1981.37                                             | 94.5M        |
|                  | 23         | 35        | 1.89M         | 63.81            | 267.98                                              | 13.2M        |
|                  | 20         | 265       | 0.23M         | 7.98             | 44.33                                               | 2.3M         |
| wiki (200M keys) | 27         | 8         | 2.1M          | 506.86           | 669.70                                              | 10.5M        |
|                  | 23         | 40        | 140K          | 31.91            | 47.72                                               | 1M           |
|                  | 20         | 250       | 18K           | 3.99             | 6.61                                                | 170K         |
| osm (200M keys)  | 27         | 7         | 6.8M          | 507              | 929                                                 | 27M          |
|                  | 24         | 50        | 824K          | 63.09            | 152                                                 | 5.7M         |
|                  | 21         | 365       | 109K          | 7.99             | 24.8                                                | 1.09M        |
| amzn (200M keys) | 25         | 2         | 27M           | 505              | 1784                                                | 82M          |
|                  | 22         | 45        | 410K          | 14.95            | 59                                                  | 2.8M         |
|                  | 12         | 135       | 62K           | 1                | 9.57                                                | 558K         |
| amzn (800M keys) | 21         | 8         | 31M           | 509.85           | 2951.2                                              | 158M         |
|                  | 25         | 50        | 3.5M          | 123.3            | 508.50                                              | 25M          |
|                  | 20         | 380       | 117K          | 3.97             | 22.03                                               | 1.1M         |
Model Building:-
[053] The real-time updatable spline based learned index, like RadixSpline, uses a single-pass build process and takes the same time to build. The only difference is that the real-time updatable spline based learned index also allocates space for the Overflow Array (OFA) for future inserts. The size information for the allocation (number of spline-points, max_error) is available after the single pass itself. During the experiments, it was observed that the build times of RadixSpline and the real-time updatable spline based learned index are the same. The build time of RadixSpline (0.5 to 2.5 sec for different datasets) was observed to be much lower than that of RMI, and similar to those of ART and BTree; the same observation was made for the real-time updatable spline based learned index as well. Table 2 compares the model sizes of the real-time updatable spline based learned index and RadixSpline for different configurations of radix_bits and max_error. The real-time updatable

spline based learned index has the larger model size of the two for all configurations, because it also reserves space for future inserts. The column "#max inserts" denotes the maximum number of inserts possible for a given model size. This is directly proportional to the number of spline-points and to the log of the configured max_error.
Insert Operations:-
[054] It was observed, in FIGS. 8A, 8B, and 8C, that the insert operation latency is around 50 ns and increases with the model size. As observed in Table 2, the increase in the model size of the real-time updatable spline based learned index is due to the increase in the number of spline-points; a large number of spline points implies more time for the binary search within the spline-points array SA.
Insert vs Lookup Latency:
[055] An observation made during the experiments was that the insert latency (50 ns, FIG. 8A) in the real-time updatable spline based learned index was lower than the lookup latency (270 ns, FIG. 9A). This appears counter-intuitive as, notionally, inserts require lookups. However, in the real-time updatable spline based learned index, an insert only requires a lookup for the correct spline-point and not for the position of the key. It does not involve the interpolation, binary-search, and scanning operations needed for a lookup. Hence, the time taken by an insert operation was found to be less than the lookup time in the real-time updatable spline based learned index, as evident from FIGS. 8A through 8C and FIGS. 9A through 9C.
Concurrent Inserts:-
[056] During the experiments, the throughput of the real-time updatable spline based learned index was evaluated for executing concurrent inserts using multiple threads. FIG. 8C shows that the throughput of concurrent inserts increases with the number of threads. However, the throughput decreases with an increase in the model size, as seen in FIGS. 8A and 8B, due to the increase in the latency of a single insert operation with the increase in model size.
Lookup Operations:-

[057] The lookup operations were evaluated after the build operation as well as after insertions. The lookup operation of the real-time updatable spline based learned index has two components, so that the lookup spans newly inserted data as well. The first component is the same as that experienced by RadixSpline. To add complexity, lookups were presented for data keys that may have been inserted later, i.e., after model creation. As shown in FIGS. 9A through 9C, because of the design of the real-time updatable spline based learned index, and the parallel search across the OFA and the DA, the lookup time remains the same as that of RadixSpline. The lookup time of the real-time updatable spline based learned index decreases with an increase in model size. Further, with an increase in the number of threads, the lookup throughput increases due to the parallel searches across the OFA and the DA.
Range Queries:
[058] Range lookups were performed using the real-time updatable spline based learned index. As shown in FIG. 10A, it was found that the latency incurred is around 300 ns per range lookup, which is close to the single key lookup latency (FIG. 9A). The range lookups were performed with just one key lookup followed by sequential scans of the OFA and the DA. Sequential scanning of the arrays was found to be very fast, and hence incurs minimal extra latency.
Concurrent Insert and Lookup Operations:
[059] FIG. 10B shows the throughput achieved for a mixed load of concurrent lookups and inserts (50:50 load split). It was observed that a mixed throughput of more than 25 Mops per second was obtained. The throughput does not change much with an increase in model size. This is because the decrease in throughput for concurrent inserts (FIGS. 8B and 8C) is balanced out by the increase in throughput due to concurrent lookups (FIGS. 9B and 9C).
[060] The real-time updatable spline based learned index employs a temporary memory to update the index in real time, in constant time. To keep the lookup time unaffected by the inserts, the real-time updatable spline based learned index bounds the size of the temporary memory to O(log(N/Nsp)), where Nsp is the number of spline-points learned on the data set of size ‘N’. The real-time updatable spline based learned index employs multithreading to increase the throughput for

concurrent insert and lookup operations. The performance of the real-time updatable spline based learned index for lookup, range queries, and insert operations was evaluated using the SOSD benchmark. The experimental results indicated that the real-time updatable spline based learned index can look up a key in 270 ns and perform a range query in 300 ns, while an insert operation takes only 50 ns. It was also observed that the real-time updatable spline based learned index can support 40 million concurrent insert and lookup operations.
[061] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[062] The embodiments of the present disclosure herein address the unresolved problem of real-time data insertion in learned indexes. The embodiments thus provide a mechanism to perform real-time insertions in learned indexes, using an Over Flow Array. Moreover, the embodiments herein further provide a mechanism to search and retrieve results from the learned index.
[063] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware

means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[064] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[065] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[066] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A

computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[067] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method (200), comprising:
collecting (202) a new data entry to be inserted to a spline based learned index, as input, via one or more hardware processors, wherein the new data entry comprises key and record position information as a (knew, pN) tuple; identifying (204) a spline point that is a lower bound to value of knew, in the spline based learned index, via the one or more hardware processors; determining (206) that the new data entry is to be accommodated in an overflow array (OFA) of the spline based learned index, if any free space is available out of a space quota allocated for the spline point that has been identified as the lower bound, in the overflow array (OFA), via the one or more hardware processors;
calculating (208) an absolute index of the free space and inserting the value of knew to the free space, via the one or more hardware processors; and updating (210) an overflow count value indicating current available space in the space quota allocated for the spline point, after inserting the value of knew to the free space, via the one or more hardware processors.
2. The method as claimed in claim 1, wherein the spline based learned index
is searched for finding a data entry, wherein searching for the data entry in
the spline based learned index comprises:
receiving (302) a search query as input, wherein the search query
comprises a key k of the data entry to be looked up in the spline based
learned index;
locating (304) a lower bound spline point of a spline points pair of the
key k;
performing (306) a search in a Data Array (DA) and in the overflow
array of the spline based learned index, in parallel, in response to
receiving the search query,
wherein performing the search in the DA (308) comprises:

obtaining (308a) an approximate interpolation position pi; applying (308b) one of a binary search or a linear search in an area between a lower error boundary (pi-max_error) and an upper error boundary (pi+max_error), and obtaining position of k or position of a key immediately smaller to the value of k; and
performing (308c) a linear scan over the DA starting from the position of p till the position where the keys are identical to k, and wherein performing the search in the overflow array (310) comprises: determining (310a) presence of any new entry in the space quota allocated for the lower bound spline point in the overflow array, based on value of an overflow count of the lower bound spline point; and
performing (310b) the linear scan in the space quota allocated for the lower bound spline point in the overflow array, if the presence of new entries is detected.
3. The method as claimed in claim 1, wherein the new data entry is rejected if there is no free space in the space quota allocated for the spline point that has been identified as the lower bound, in the overflow array (OFA), via the one or more hardware processors.
4. The method as claimed in claim 1, wherein size of the overflow array (OFA) is dynamically determined by:
determining (402) a threshold of array size based on a plurality of
characteristics of the one or more hardware processors used to
implement the splined learned index;
comparing (404) value of (2 * max_error) and the threshold of array
size;
determining (406) the size of the space quota allocated for the spline
points in the overflow array as a value less than the threshold of array
size, if the value of (2 * max_error) is greater than or equal to the

threshold of array size, wherein max_error represents a maximum interpolation error while defining a plurality of the spline points in the spline based learned index; and
determining (408) the size of the space quota allocated for the spline points in the overflow array as a value equal to (2 * max_error), if the value of (2 * max_error) is less than the threshold of array size.
5. The method as claimed in claim 4, wherein determining the threshold of array size comprises:
initializing (502) a plurality of parameters comprising a
Random_Full_Array, a decrement value, a current_size, a
minimum_size, and the threshold of array size ‘T’, with value
selected based on the plurality of characteristics of the one or more
hardware processors used to implement the splined learned index;
creating (504) a Random_Array and a Sorted_Array, wherein size
of the Random_Array and the Sorted_Array is equal to the
current_size;
moving (506) a plurality of items having size equal to the
current_size, from the Random_Full_Array to the Random_Array
and the Sorted_Array;
sorting content of the Sorted_Array;
moving (508) a random subset of the plurality of items in the
Random_Array to a Test_Array;
for the plurality of items in the Test_Array,
performing binary search over “Sorted Array” and finding mean time taken for the search, wherein the mean time represents value of time_binary (510); and
performing linear search over “Random_array” and finding mean time taken for the search, wherein the mean time represents value of time_linear (510); comparing (512) time_binary and time_linear;

determining (514) value of the threshold of array size T as equal to initialized value of current_size if time_linear is less than or equal to time_binary; and
determining (516) value of the threshold of array size T as equal to an updated value of current_size if time_linear is greater than the time_binary, wherein the updated value of current_size is obtained as a difference between the initialized value of the current_size and a decrement_value.
6. The method as claimed in claim 2, wherein one of the binary search or the
linear search is selected for application in the area between the lower error
boundary and the upper error boundary, comprises:
determining (602) size of the area between the lower error boundary
and the upper error boundary as a search area;
comparing (604) the determined search area with a threshold of
array size;
selecting (606) the linear search if the search area is less than the
threshold of array size; and
selecting (608) the binary search if the search area is exceeding the
threshold of array size.
7. A system (100), comprising:
one or more hardware processors (104);
an I/O interface (106); and
a memory (102) storing a plurality of instructions, wherein the plurality of
instructions when executed, cause the one or more hardware processors to: collect a new data entry to be inserted to a spline based learned index, as input, wherein the new data entry comprises key and record position information as a (knew, pN) tuple;

identify a spline point that is a lower bound to value of knew, in the
spline based learned index;
determine that the new data entry is to be accommodated in an
overflow array (OFA) of the spline based learned index, if any free
space is available out of a space quota allocated for the spline point
that has been identified as the lower bound, in the overflow array
(OFA);
calculate an absolute index of the free space and inserting the value
of knew to the free space; and
update an overflow count value indicating current available space in
the space quota allocated for the spline point, after inserting the
value of knew to the free space.
8. The system as claimed in claim 7, wherein the system searches the spline based learned index for finding a data entry, by:
receiving a search query as input, wherein the search query comprises
a key k of the data entry to be looked up in the spline based learned
index;
locating a lower bound spline point of a spline points pair of the key k;
performing a search in a Data Array (DA) and in the overflow array of
the spline based learned index, in parallel, in response to receiving the
search query,
wherein performing the search in the DA comprises:
obtaining an approximate interpolation position pi;
applying one of a binary search or a linear search in an area
between a lower error boundary (pi-max_error) and an upper error
boundary (pi+max_error), and obtaining position of k or position
of a key immediately smaller to the value of k; and
performing a linear scan over the DA starting from the position of
p till the position where the keys are identical to k, and
wherein performing the search in the overflow array comprises:

determining presence of any new entry in the space quota allocated for the lower bound spline point in the overflow array, based on value of an overflow count of the lower bound spline point; and performing the linear scan in the space quota allocated for the lower bound spline point in the overflow array, if the presence of new entries is detected.
9. The system as claimed in claim 7, wherein the system rejects the new data entry if there is no free space in the space quota allocated for the spline point that has been identified as the lower bound, in the overflow array (OFA).
10. The system as claimed in claim 7, wherein the system dynamically determines the size of the overflow array (OFA) by:
determining a threshold of array size based on a plurality of characteristics of the one or more hardware processors used to implement the splined learned index;
comparing value of (2 * max_error) and the threshold of array size; determining the size of the space quota allocated for the spline points in the overflow array as a value less than the threshold of array size, if the value of (2 * max_error) is greater than or equal to the threshold of array size, wherein max_error represents a maximum interpolation error while defining a plurality of the spline points in the spline based learned index; and
determining the size of the space quota allocated for the spline points in the overflow array as a value equal to (2 * max_error), if the value of (2 * max_error) is less than the threshold of array size.
11. The system as claimed in claim 10, wherein the system determines the
threshold of array size by:
initializing a plurality of parameters comprising a
Random_Full_Array, a decrement value, a current_size, a

minimum_size, and the threshold of array size ‘T’, with value
selected based on the plurality of characteristics of the one or more
hardware processors used to implement the splined learned index;
creating a Random_Array and a Sorted_Array, wherein size of the
Random_Array and the Sorted_Array is equal to the current_size;
moving a plurality of items having size equal to the current_size,
from the Random_Full_Array to the Random_Array and the
Sorted_Array;
sorting content of the Sorted_Array;
moving a random subset of the plurality of items in the
Random_Array to a Test_Array;
for the plurality of items in the Test_Array,
performing binary search over “Sorted Array” and finding mean time taken for the search, wherein the mean time represents value of time_binary; and
performing linear search over “Random_array” and finding mean time taken for the search, wherein the mean time represents value of time_linear; comparing time_binary and time_linear;
determining value of the threshold of array size T as equal to initialized value of current_size if time_linear is less than or equal to time_binary; and
determining value of the threshold of array size T as equal to an updated value of current_size if time_linear is greater than the time_binary, wherein the updated value of current_size is obtained as a difference between the initialized value of the current_size and a decrement_value.
12. The system as claimed in claim 8, wherein the system selects one of the binary search or the linear search for application in the area between the lower error boundary and the upper error boundary, by:

determining size of the area between the lower error boundary and
the upper error boundary as a search area;
comparing the determined search area with a threshold of array size;
selecting the linear search if the search area is less than the threshold
of array size; and
selecting the binary search if the search area is exceeding the
threshold of array size.
13. A spline based learned index, comprising:
a Data Array (DA) storing data entries in the spline based learned index
in the form of a (key, record) tuple;
a Spline-Point Array (SA) storing a plurality of spline points defined
for the data entries in the DA, wherein each of the plurality of spline
points is stored as a (key, position, overflow_count) tuple;
a Radix Table (RT) storing an array of indexes to the plurality of spline
points in the Spline-Point Array; and
an Over Flow Array (OFA), wherein the OFA is used to store new data
entries as (key, position) tuples.

Documents

Application Documents

# Name Date
1 202121027966-STATEMENT OF UNDERTAKING (FORM 3) [22-06-2021(online)].pdf 2021-06-22
2 202121027966-REQUEST FOR EXAMINATION (FORM-18) [22-06-2021(online)].pdf 2021-06-22
3 202121027966-PROOF OF RIGHT [22-06-2021(online)].pdf 2021-06-22
4 202121027966-FORM 18 [22-06-2021(online)].pdf 2021-06-22
5 202121027966-FORM 1 [22-06-2021(online)].pdf 2021-06-22
6 202121027966-FIGURE OF ABSTRACT [22-06-2021(online)].jpg 2021-06-22
7 202121027966-DRAWINGS [22-06-2021(online)].pdf 2021-06-22
8 202121027966-DECLARATION OF INVENTORSHIP (FORM 5) [22-06-2021(online)].pdf 2021-06-22
9 202121027966-COMPLETE SPECIFICATION [22-06-2021(online)].pdf 2021-06-22
10 202121027966-FORM-26 [22-10-2021(online)].pdf 2021-10-22
11 Abstract1..jpg 2021-12-07
12 202121027966-FER.pdf 2023-07-10
13 202121027966-OTHERS [23-10-2023(online)].pdf 2023-10-23
14 202121027966-FER_SER_REPLY [23-10-2023(online)].pdf 2023-10-23
15 202121027966-DRAWING [23-10-2023(online)].pdf 2023-10-23
16 202121027966-CLAIMS [23-10-2023(online)].pdf 2023-10-23
17 202121027966-US(14)-HearingNotice-(HearingDate-25-10-2024).pdf 2024-09-26
18 202121027966-US(14)-ExtendedHearingNotice-(HearingDate-05-11-2024)-1030.pdf 2024-10-21
19 202121027966-Correspondence to notify the Controller [04-11-2024(online)].pdf 2024-11-04
20 202121027966-Written submissions and relevant documents [08-11-2024(online)].pdf 2024-11-08
21 202121027966-PatentCertificate02-01-2025.pdf 2025-01-02
22 202121027966-IntimationOfGrant02-01-2025.pdf 2025-01-02

Search Strategy

1 SearchStrategy202121027966E_10-07-2023.pdf

ERegister / Renewals

3rd: 10 Jan 2025 (From 22/06/2023 To 22/06/2024)
4th: 10 Jan 2025 (From 22/06/2024 To 22/06/2025)
5th: 08 May 2025 (From 22/06/2025 To 22/06/2026)