System And Method For Selectivity Estimation Using Dependent Column

System And Method For Selectivity Estimation Using Dependent Column Information For Generating Optimized Query Plan

Abstract: Joint selectivity estimation of dependent columns in database planner using multidimensional histograms is disclosed. A method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram, the method comprises receiving at least one dependency information defining dependencies between the tables and the associated columns; identifying, based on the dependency information received, at least two table recurrently joined and used based on at least a column; collecting the two table recurrently joined and used based on the column identified as joint statistics; calculating, using at least a histogram technique, at least a joint histogram based on the joint statistics collected; storing the joint histogram calculated in at least one system table, thereby maintaining at least a histogram cache of the joint histogram; and estimating the joint selectivity for the tables by using the joint histogram stored in the system table. (TO BE PUBLISHED WITH FIGURE 1&3)

Patent Information

Application #

Filing Date

08 August 2016

Publication Number

06/2018

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

cal@patentindia.com

Parent Application

Patent Number

Legal Status

Grant Date

2022-08-31

Renewal Date

Applicants

HUAWEI TECHNOLOGIES INDIA PVT. LTD.

SYNO 37, 46, 45/3, 45/4 ETC., KNO 1540, Kundalahalli Village, Bengaluru, Karnataka – 560 037, India

Inventors

1. KUMAR Dilip

SYNO 37, 46, 45/3, 45/4 ETC., KNO 1540, Kundalahalli Village, Bengaluru, Karnataka – 560 037, India

2. SREEKANTAIAH, Nirmala

SYNO 37, 46, 45/3, 45/4 ETC., KNO 1540, Kundalahalli Village, Bengaluru, Karnataka – 560 037, India

3. BEHERA, Mahesh Kumar

SYNO 37, 46, 45/3, 45/4 ETC., KNO 1540, Kundalahalli Village, Bengaluru, Karnataka – 560 037, India

Specification

Claims:
1. A data processing system, comprising:
a processor;
a memory coupled to the processor for executing a plurality of modules present in the memory, the plurality of modules comprising:
a receiving module configured to receive at least one dependency information defining dependencies between the tables and the associated columns;
a statistics collector module configured to:
identify, based on the dependency information received, at least two table recurrently joined and used based on at least a column;
collect the two table recurrently joined and used based on the column identified as joint statistics;
a joint histogram module configured to:
calculate, using at least a histogram technique, at least a joint histogram based on the joint statistics collected;
store the joint histogram calculated in at least one system table, thereby maintain at least a histogram cache of the joint histogram; and
estimate the joint selectivity for the tables by using the joint histogram stored in the system table.

2. The data processing system as claimed in claim 1, wherein the receiving module is further configured to:
receive the dependency information by manually defining the dependencies between the tables and the associated columns by triggering at least an ALTER SYSTEM command, preferably a ALTER SYSTEM SET JOINSELECT (Table 1 (column 1), Table 2(column 2), Table 3(column 3)...) command.

3. The data processing system as claimed in claim 1, wherein the receiving module is further configured to:
receive the dependency information defining the dependencies between the tables and the associated columns based on at least a database query, on receipt of the database query, thereby:
identify one or more join indexes having one or more join index predicates in the database query received, the one or more join index(es) respectively defines the dependencies between the tables and the associated columns;
collect the statistics of the join indexes, and storing the statistics in at least one system table.

4. The data processing system as claimed in claim 1, wherein the receiving module is further configured to store the dependency information defining the dependencies between the tables and the associated columns in at least a system table.

5. The data processing system as claimed in claim 1, wherein the histogram technique is preferably selected from a Multidimensional Histogram (MHIST) technique, an equi-depth histogram technique or any combination thereof.

6. The data processing system as claimed in claim 1, wherein the histogram cache of the joint histogram is maintained using a table id and a column id.

7. The data processing system as claimed in claim 1, wherein the selectivity of the column is estimated based on at least a histogram value of other column using the dependency coefficient.

8. The data processing system as claimed in claim 1 is characterized in joint selectivity estimation of dependent columns in database planner using the joint or multidimensional histograms.

9. A method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram, the method comprising:
receiving at least one dependency information defining dependencies between the tables and the associated columns;
identifying, based on the dependency information received, at least two table recurrently joined and used based on at least a column;
collecting the two table recurrently joined and used based on the column identified as joint statistics;
calculating, using at least a histogram technique, at least a joint histogram based on the joint statistics collected;
storing the joint histogram calculated in at least one system table, thereby maintaining at least a histogram cache of the joint histogram;
estimating the joint selectivity for the tables by using the joint histogram stored in the system table.

10. The method as claimed in claim 9, wherein the method comprises receiving the dependency information by manually defining the dependencies between the tables and the associated columns by triggering at least an ALTER SYSTEM command, preferably a ALTER SYSTEM SET JOINSELECT (Table 1 (column 1), Table 2(column 2), Table 3(column 3)...) command.

11. The method as claimed in claim 9, wherein comprises receiving the dependency information defining the dependencies between the tables and the associated columns based on at least a database query, on receipt of the database query, the method further comprises:
identifying one or more join indexes having one or more join index predicates in the database query received, the one or more join index(es) respectively defines the dependencies between the tables and the associated columns;
collecting the statistics of the join indexes, and storing the statistics in at least one system table.

12. The method as claimed in claim 9 further comprises storing the dependency information defining the dependencies between the tables and the associated columns in at least a system table.

13. The method as claimed in claim 9, wherein the histogram technique is preferably selected from a Multidimensional Histogram (MHIST) technique, an equi-depth histogram technique or any combination thereof.

14. The method as claimed in claim 9, wherein the histogram cache of the joint histogram is maintained using a table id and a column id.

15. The method as claimed in claim 9, wherein estimating the selectivity of the column is based on at least a histogram value of other column using the dependency coefficient.

16. The method as claimed in claim 9 is characterized in estimating joint selectivity of dependent columns in database planner using the joint or multidimensional histograms.
, Description:TECHNICAL FIELD

The present subject matter described herein, in general, relates to database query optimizers, and more particularly, to systems and methods for multi-dimensional selectivity estimation for multiple tables using dependent column information for generating optimized query plan.

BACKGROUND

In a traditional database system there are many ways in which a database query can be executed. Where the query comprises several query predicates and references several database tables, there are several options as to the order in which these query predicates can be evaluated. A database query engine comprises an optimizer which determines the most cost effective ordering of execution of these query predicates within a database query. In a traditional optimizer, cardinality estimation for single table predicates and joins only uses demographic information. The database query engine typically rely on query size estimator, in order to evaluate the potential costs of the alternative plan and select the best one. Generally histograms are used to calculate the alternative plans. In histograms the data set partitioned into buckets, each approximated by aggregate statistics. Each bucket consist of a bounding box and a tuple frequency value. Uniformity is assumed inside buckets.

In case of multidimensional queries, the queries have selectivity dependency on more than one attributes and estimation size depends upon joint data distribution. Most of the databases including Postgres, assumes all the attributes are independent, hence it calculates individual selectivity for each attribute and simply multiply the frequency. However, this assumption can lead to completely wrong estimation if attributes are dependent. Also, if statistics is not very accurate then plan will not be the most optimal plan and will impact the execution time.

For example consider below query:
Select name from employee e, salary s where e.age >35 and s.salary >3500000 and e.eid==s.eid;
It is natural for the salary attribute of the Employee relation to be ‘strongly’ dependent on the age attribute (i.e., higher/lower salaries mostly going to older/younger people).
Example of selectivity Estimation by Independent assumption:
Suppose selectivity of age >35 is 0.4 and selectivity of salary >3500000 is 0.3 then the joint selectivity will be calculated as 0.4*0.3 = 0.12.But in this case it is understood from above that almost all the employees which are having salary 3500000 are having age >35, so actual joint selectivity is ~ 0.3 not 0.12.

A multi-dimensional histogram for selectivity estimation is well researched topic, but the existing techniques have used the multi-dimensional histogram for dependent column in a single table. However, there can be huge number of tables and as a well-known problem their column and finding all permutations will be in millions.

SUMMARY

This summary is provided to introduce concepts related to systems and methods for multi-dimensional selectivity estimation for multiple tables using dependent column information for generating optimized query plan, and are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.

In view of the technical problems as recited in the background section above, and in order to achieve an efficient, economical, and query plans, there exists a dire need in query optimizers for better selectivity estimation for dependent columns of multiple tables.

In order to provide a technical solution to the above mentioned technical problems, one aspect of the present invention is to provide a system and method to provide join selectivity estimation for multiple tables.

Another aspect of the present invention is to provide system and method to use multidimensional histograms for joint selectivity estimation of dependent columns in database planner.

Another aspect of the present invention is to provide a system and method to provide an option for the user using which user can provide column dependency, and if user has declared columns as dependent then only join selectivity can be calculated for the columns.

Yet another aspect of the present invention is to collect the statistics from each query and identify which columns from which tables are used together in join and which are dependent.

Yet another aspect of the present invention is to calculate the joint histograms for the columns what user have identified as dependent or automatically detected using statistics, for example in view of above scenario:
CREATE TABLE employee (eid int, age int, name varchar);
CREATE TABLE payment (eid int, salary float);
SET JOINTSELECT (employee.age, payement.salary).

Yet another aspect of the present invention is to store the joint histogram in a memory storage and cache can be built using table ID and column ID.

Still another aspect of the present invention is if at run time during planning if a condition for more than two columns is fired then the present invention enables to search in the cache and if joint histogram available then directly use these.

Accordingly, in one implementation, a data processing system is disclosed. The system comprises a processor; a memory coupled to the processor for executing a plurality of modules present in the memory. The plurality of modules comprises a receiving module configured to receive at least one dependency information defining dependencies between the tables and the associated columns; a statistics collector module configured to identify, based on the dependency information received, at least two table recurrently joined and used based on at least a column; collect the two table recurrently joined and used based on the column identified as joint statistics; a joint histogram module configured to calculate, using at least a histogram technique, at least a joint histogram based on the joint statistics collected; store the joint histogram calculated in at least one system table, thereby maintaining at least a histogram cache of the joint histogram; and estimate the joint selectivity for the tables by using the joint histogram stored in the system table.

In one implementation, a method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram is disclosed. The method comprises:
• receiving at least one dependency information defining dependencies between the tables and the associated columns;
• identifying, based on the dependency information received, at least two table recurrently joined and used based on at least a column;
• collecting the two table recurrently joined and used based on the column identified as joint statistics;
• calculating, using at least a histogram technique, at least a joint histogram based on the joint statistics collected;
• storing the joint histogram calculated in at least one system table, thereby maintaining at least a histogram cache of the joint histogram;
• estimating the joint selectivity for the tables by using the joint histogram stored in the system table.

As compared to the prior-art techniques, the present invention provides join selectivity estimation for multiple table based on joint statistics. Joint Statistics calculation will provide better planning when dependent columns are involved in the query, this will give huge performance benefits. According to the present invention, a user/admin can decide for which columns Joint statistics to be collected, this will save space and overhead of collecting many joint histograms. As compared to the conventional techniques, the present invention enables using multidimensional histograms for joint selectivity estimation of dependent columns in database planner.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.

Figure 1 illustrates a data processing system 100, in accordance with an embodiment of the present subject matter.

Figure 2 illustrates a data server200in accordance with an embodiment of the present subject matter.

Figure 3 illustrates a block diagram showing various building blocks of the invention, in accordance with an embodiment of the present subject matter.

Figure 4 illustrates a method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram, in accordance with an embodiment of the present subject matter.

Figure 5 illustrates a behavior of existing technique and new technique (the present invention) when increasing the number of buckets, in accordance with an embodiment of the present subject matter.

Figure 6 (a), (b), and (c) illustrates an example for the joint histogram calculation, in accordance with an embodiment of the present subject matter.

It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The following clearly describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.

The invention can be implemented in numerous ways, including as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Systems and methods for selectivity estimation using dependent column information for generating optimized query plan are disclosed.

While aspects are described for Systems and methods for selectivity estimation using dependent column information for generating optimized query plan, the present invention may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary systems, apparatus, and methods.

Accordingly, in one implementation, the present invention enables to provide better selectivity estimation for dependent columns of multiple tables. The present invention provides an option to user using which user may provide column dependency, and if user has declared columns as dependent then only join selectivity may be calculated for the columns. Alternatively, the present invention collects the statistics from each query and identify which columns from which tables are used together in join and which are dependent. The present invention may then calculate the joint histograms for the columns what user may have identified as dependent or automatically detected using statistics(such as, using standard MHIST or equi-depth histogram techniques).The joint histograms data may be kept in some memory storage and cache may be built using table id and column id.At run time during planning if condition is there for more than two columns then it can be searched in the cache and if joint histogram available then directly use these.

In one implementation, the user may provide the joint selection as the below provided example:
CREATE TABLE employee (eid int, age int, name varchar);
CREATE TABLE payment (eid int, salary float);
SET JOINTSELECT (employee.age, payement.salary);

Referring now to figure 1, the figure 1 illustrates a data processing system100, in accordance with an embodiment of the present subject matter.

Referring now to figure 2, the figure 2 illustrates a data server200 in accordance with an embodiment of the present subject matter.

Although the present subject matter is explained considering that the data processing system 100 or data server 200 for selectivity estimation using dependent column information for generating optimized query plan, it may be understood that the data processing system 100 or data server 200 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the data processing system 100 or data server 200 may be accessed by multiple users through one or more user devices (not shown) or applications residing on the user devices. Examples of the data processing system 100 or data server 200 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, and a workstation, routers, servers. The data processing system 100 or data server 200are communicatively coupled to the other devices (not shown) through a network (not shown), for example, there may be multiple apparatuses connected to the data processing system 100 or data server 200 accessing the database by sending a database query to fetch the result/ data stored in the database.

In one implementation, the network may be a wireless network, a wired network or a combination thereof. The network can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.

In one implementation, the data processing system 100 or data server 200 may include at least one processor 102/202, an input/output (I/O) interface 104 / 204, and a memory 106 / 206. The at least one processor 102 / 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 102 / 202 is configured to fetch and execute computer-readable instructions stored in the memory 106 / 206.

The I/O interface 104 / 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 104 / 204may allow the data processing system 100 or data server 200to interact with a user directly or through the client devices (not shown). Further, the I/O interface 104 / 204may enable the apparatus 100/200to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 104 / 204can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 104 / 204may include one or more ports for connecting a number of devices to one another or to another server.

The memory 106 / 206 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.

The modules include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types.

In one implementation, as shown in figure1, the data processing system 100 is disclosed. The data processing system 100 may include a processor 102 and a memory 106 coupled to the processor for executing a plurality of modules present in the memory. The modules may include a receiving module108, a statistics collector module 110, and a joint histogram module 112. The data processing system 100may further include a database (no shown) storing the data in the form of tables and rows. The data processing system 100 may further include system tables 114 storing different types of data from the database.

In one implementation, the receiving module 108 is configured to receive at least one dependency information defining dependencies between the tables and the associated columns. The statistics collector module110 is configured to identify, based on the dependency information received, at least two table recurrently joined and used based on at least a column, and collect the two table recurrently joined and used based on the column identified as joint statistics. The joint histogram module 112 is configured to calculate, using at least a histogram technique, at least a joint histogram 118 based on the joint statistics collected, store the joint histogram calculated in at least one system table 114, thereby maintain at least a histogram cache 116 of the joint histogram, and estimate the joint selectivity for the tables by using the joint histogram stored in the system table.

In one implementation, as shown in figure 2, the database server 200is disclosed. As shown in figure 2, the database server may include a processor 202, and interface 204, a memory 206 coupled to the processor and a disk 208. The memory 206 and the disk 208 are coupled to the query optimizer 106. The query optimizer may perform the similar functions as described in the various modules a receiving module 108, a statistics collector module 110, and a joint histogram module 112 above. In order to avoid the complexity in the understanding of the present invention, the repetition of the function of the various modules is avoided in this paragraph.

In one implementation, the receiving module is further configured to receive the dependency information by manually defining the dependencies between the tables and the associated columns by triggering at least an ALTER SYSTEM command, preferably a ALTER SYSTEM SET JOINSELECT (Table 1 (column 1), Table 2(column 2), Table 3(column 3)...) command.

In one implementation, the receiving module is further configured to receive the dependency information defining the dependencies between the tables and the associated columns based on at least a database query, on receipt of the database query, thereby identify one or more join indexes having one or more join index predicates in the database query received, the one or more join index(es) respectively defines the dependencies between the tables and the associated columns; collect the statistics of the join indexes, and storing the statistics in at least one system table.

In one implementation, the receiving module is further configured to store the dependency information defining the dependencies between the tables and the associated columns in at least a system table114.

In one implementation, the histogram technique is preferably selected from a Multidimensional Histogram (MHIST) technique, an equi-depth histogram technique or any combination thereof. However, it may be understood by the person skilled in the art that any existing algorithm for histogram may be used, the specific mentioning of the name or not mentioning the name of the histogram technique shall not restrict the scope of the present invention.

In one implementation, the histogram cache116 of the joint histogram is maintained using a table id and a column id.

In one implementation, the data processing system enables to achieve the selectivity of the column is estimated based on at least a histogram value of other column using the dependency coefficient.

In one implementation, the data processing system enables to achieve the joint selectivity estimation of dependent columns in database planner using the joint or multidimensional histograms.

Referring now to figure 3, figure 3 illustrates a block diagram showing various building blocks of the invention, in accordance with an embodiment of the present subject matter.

In one implementation, figure 3 illustrates how table dependency is identified and where it is stored in the system.

In one implementation, as shown in figure 3, the present invention provides two different mechanisms using which table dependency is identified. First mechanism is a user set joint table. The user set table dependency is based on a database query received from the user, and dependency will be stored in the system tables. In one implementation, the user may manually define dependency between tables and their columns using ALTER SYSTEM command (a new DDL command will be added) i.e., ALTER SYSTEM SET JOINSELECT (T1(c1), T2(c2), T3(c3)...). Second mechanism is by collecting statistics for depend items. In one implementation, the collecting statistics for depend items occurs during Input query, whenever any query is given for execution and which involves join then statistics collector will Stores the information in system tables. Using the second mechanism, the system collects the statistics in background and identify few table dependency which are frequently used together and collect join histogram for those.

As shown in figure 3, after the table dependency is identified, the statistics collector (background worker) block comes into picture. In one implementation, this block maintains a storage and background thread to keep analyzing these data (the table dependency identified) and over the period of time when it will identify which tables are joined frequently and based on which columns, after it whichever tables are joined together will be selected for joint statistics collections and joint histograms will be generated for those tables. According to the present invention, there may be two techniques for generating joint histograms. The first technique according to the present invention can directly calculate joint histogram which used has defined the dependency. However, for system generated data it need to analyze and declare the dependency and hence second technique after dependency is identified collects the joint histograms.

As shown in figure 3, after the statistics collector (background worker) has collected the data / declared the dependency, the joint histogram block comes into picture. According to the present invention, the joint histogram uses the standard algorithm for identifying the multidimensional histogram and store in histogram cache.

According to the present invention, the joint statistics calculation will provide better planning when dependent columns are involved in the query, this will give huge performance benefits. Further, the user/admin may decide for which columns joint statistics to be collected, this will save space and overhead of collecting many joint histograms.

Figure 4 illustrates a method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram. The method may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.

The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method or alternate methods. Additionally, individual blocks may be deleted from the method without departing from the protection scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method may be considered to be implemented in the above described data processing system100 or database server 200.

Referring now to figure 4, the figure 4 illustrates a method of joint selectivity estimation for at least two tables using at least a multi-dimensional histogram.

At block 402, at least one dependency information defining dependencies between the tables and the associated columns is received. In one implementation, the dependency information may be provided by manually defining the dependencies between the tables and the associated columns by triggering at least an ALTER SYSTEM command, preferably a ALTER SYSTEM SET JOINSELECT (Table 1 (column 1), Table 2(column 2), Table 3(column 3)...) command. In one implementation, the dependency information defining the dependencies between the tables and the associated columns may be automatically identified be the system based on at least a database query, on receipt of the database query. In one embodiment, when the database query is received, the system identifies one or more join indexes having one or more join index predicates in the database query received, the one or more join index(es) respectively defines the dependencies between the tables and the associated columns, and then collect the statistics of the join indexes, and store the statistics in at least one system table. In one implementation, the dependency information defining the dependencies between the tables and the associated columns are stored in at least a system table.

At block 404, on receipt of the dependency information, at least two tables recurrently joined and used based on at least a column are identified.

At block 406, the two table recurrently joined and used based on the column identified are collected as joint statistics.

At block 408, at least a joint histogram based on the joint statistics collected is calculated. The joint histogram may be calculated using the histogram technique is preferably selected from a Multidimensional Histogram (MHIST) technique, an equi-depth histogram technique or any combination thereof.

At block 410, the joint histogram calculated is stored in at least one system table.

At block 412, at least a histogram cache of the joint histogram is maintained. In one implementation, the histogram cache of the joint histogram is maintained using a table id and a column id.

At block 414,the joint selectivity for the tables are estimated by using the joint histogram stored in the system table. In one implementation, the selectivity of the column is based on at least a histogram value of other column using the dependency coefficient. In one implementation, the joint selectivity of dependent columns in database planner is estimated using the joint or multidimensional histograms.

Referring now to figure 5, the figure 5 illustrates a behavior of existing technique and new technique (the present invention) when increasing the number of buckets, in accordance with an embodiment of the present subject matter. Figure 5 depicts how the existing technique and new technique behave by increasing the number of buckets.

As standard available optimizer techniques, such as AVI, always considers each attribute as independent, so even when the number of histogram buckets are increased, the AVI estimation error is not reduced, because even after increasing the buckets, the accuracy for independent attribute is increased but the method of joining them is not correct for dependent attribute so error % is same which is shown in the uppermost line in the graph of figure 5.

Also, in case of the MHIST-2 by increasing the number of bucket Error in MHIST estimation reduced drastically because it calculate the joint statistics of the attributes and as we increase the number of buckets accuracy will increase and error will reduced which is shown in the below line in the graph of figure 5.

To understand the working of the present invention, please consider that there are two tables for some organization where one table maintains employee information and other table maintains finance related information.
CREATE TABLE employee (eid int, age int, name varchar);
CREATE TABLE payment (eid int, salary float);
Now consider the query where user need to identify the salary information of some of the senior members, as stated below:
SET JOINTSELECT (employee.age, payement.salary).
SELECT E.eid, E.name, P.salary
FROM employee As E JOIN payment As P
On E.eid=P.eid
WHERE E.age > 40 and P.salary > 50000;

It may be understood and noted that in general optimizer (in absence of the idea), this will be considered as two independent clause and there selectivity will be calculated independentlyi.e., suppose selectivity of age > 40 is 0.25 (25% employee are of age > 40)and selectivity of salary > 50000 is 0.20 (20% of employee are of sal > 50000).

So optimizer will consider the equal spread means 25% employees are > age 40 and out of which 20% is of salary > 50000 so total selectivity= 0.25*0.20 = 0.050 => 5 %.

This can be true for independent columns but here salary is highly dependent on age and experience so it is possible that all 20 % employees which has salary > 50000, are of age > 40, means actual selectivity should be 20% not 5 %, so if optimizer make plan by considering 5% then plan will be very inefficient.

Now using the present invention the user can define relationship between two columns or system can identify the relationship, the same is shown with the dotted ellipse in figure 6 (a):
i. Either User can define using below command ALTER SYSTEM SET JOINSELECT (employee (age), payment(salary)); OR
ii. System can see the conditions in the query and store some statistics that which tables are used together and store the dependency in system table.

Also, according to the present invention, the background statistics collector takes the information from Sys Table that which all columns are dependent and create the Join Statistics and store in Histograms tables, the same is shown with the dotted ellipse in figure 6 (b).

Now Next time when user execute the query, say:
SET JOINTSELECT (employee.age, payement.salary).
SELECT E.eid, E.name, P.salary
FROM employee As E JOIN payment As P
On E.eid=P.eid
WHERE E.age > 40 and P.salary > 50000;
as per the general optimizer process (available in the prior art), it needs to find the statistics for selectivity estimation, in addition to the according to the present invention, on top of finding the statistics for selectivity estimation the present invention enables to search for Joint statistics from the histogram tables and plan as per the joint histograms and plan will be much accurate, the same is shown with the dotted ellipse in figure 6 (c).

Apart from what is explained above, the present invention also include the below mentioned advantages:
• Joint Statistics calculation as provided in the present invention will provide better planning when dependent columns are involved in the query, this will give huge performance benefits.

• User/Admin can decide for which columns Joint statistics to be collected, this will save space and overhead of collecting many joint histograms.
• The present invention provides usage of multidimensional histograms for joint selectivity estimation of dependent columns in database planner.
• In database system when there is query with condition on multiple dependent columns of same or different tables, the present invention enables the user to define dependency using a DDL commands in order to collect joint statistics for those columns.
• According to the present invention, the join statistics will be stored in memory and a cache will be maintained to search the statistics.
• Selectivity of one column can be predicted based on the histogram value of other column using the dependency coefficient.
• Generally histograms are store in system tables for entry of the tables because it is limited to one table, according to the present invention the histograms are spared across tables.

A person of ordinary skill in the art may be aware that in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular applications and design constraint conditions of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present invention.

It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or a part of the steps of the methods described in the embodiment of the present invention. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

Although implementations for systems and methods for multi-dimensional selectivity estimation for multiple tables using dependent column information for generating optimized query plan have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of the systems and methods for multi-dimensional selectivity estimation for multiple tables using dependent column information for generating optimized query plan.

Documents

Application Documents

#	Name	Date
1	Power of Attorney [08-08-2016(online)].pdf	2016-08-08
2	Form 3 [08-08-2016(online)].pdf	2016-08-08
3	Form 18 [08-08-2016(online)].pdf_62.pdf	2016-08-08
4	Form 18 [08-08-2016(online)].pdf	2016-08-08
5	Drawing [08-08-2016(online)].pdf	2016-08-08
6	Description(Complete) [08-08-2016(online)].pdf	2016-08-08
7	abstract 201641027073 .jpg	2016-10-07
8	Other Patent Document [08-02-2017(online)].pdf	2017-02-08
9	Correspondence by Agent_Form1_14-02-2017.pdf	2017-02-14
10	201641027073-PA [28-02-2018(online)].pdf	2018-02-28
11	201641027073-ASSIGNMENT DOCUMENTS [28-02-2018(online)].pdf	2018-02-28
12	201641027073-8(i)-Substitution-Change Of Applicant - Form 6 [28-02-2018(online)].pdf	2018-02-28
13	Correspondence by Agent _Deed Of Assignment_26-03-2018.pdf	2018-03-26
14	201641027073-FER.pdf	2020-02-20
15	201641027073-OTHERS [10-06-2020(online)].pdf	2020-06-10
16	201641027073-FER_SER_REPLY [10-06-2020(online)].pdf	2020-06-10
17	201641027073-CLAIMS [10-06-2020(online)].pdf	2020-06-10
18	201641027073-Correspondence to notify the Controller [01-01-2021(online)].pdf	2021-01-01
19	201641027073-Written submissions and relevant documents [18-01-2021(online)].pdf	2021-01-18
20	201641027073-US(14)-HearingNotice-(HearingDate-04-01-2021).pdf	2021-10-17
21	201641027073-Response to office action [31-08-2022(online)].pdf	2022-08-31
22	201641027073-PatentCertificate31-08-2022.pdf	2022-08-31
23	201641027073-IntimationOfGrant31-08-2022.pdf	2022-08-31

Search Strategy

1	83thfileTPOsearchstrategy_20-02-2020.pdf
2	83thfileinpass_20-02-2020.pdf