Abstract: The present subject matter relates to a computer implemented method for stratified sampling of a database. The method includes receiving at least one stratification parameter indicating a type of data to be selected in the database. The method further includes obtaining a stopping parameter indicating a number of data records to be included in each of at least one stratum, creating the at least one stratum based on the at least one stratification parameter and the stopping parameter, and extracting data from the at least one stratum.
FORM 2
THE PATENTS ACT, 1970 (39 of 1970) & THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
1. Title of the invention: STRATIFIED SAMPLING OF A DATABASE
2. Applicant(s)
NAME: TATA CONSULTANCY SERVICES LIMITED
NATIONALITY: Indian
ADDRESS: Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021, India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it is to be
performed.
TECHNICAL FIELD
[0001] The present subject matter, in general, relates to database sampling and, in particular,
relates to methods and systems for stratified sampling of a database.
BACKGROUND
[0002] Database sampling generally relates to a process by which a subset of data within the
database is extracted, and can be commonly used in applications for testing data intensive activities, such as, data mining, query cost evaluation, and software testing. Database sampling is utilized generally to reduce testing and computation time in databases, which can contain substantially vast portions of data.
[0003] For example, in the field of software testing, the complete testing process of a
software application, generally referred to as testing lifecycle, includes iterative testing processes that account for a large amount of testing activities. With the rise in the trend to develop modular and large integrated software applications, the complexity of testing the software has increased. Software testing is generally performed based on test data, which is typically a copy or clone of entire production data within the databases, as the production data corresponds to actual operational data. However, copying the entire production data and keeping it in different test environments may lead to increased space requirements. Therefore, a subset of the data, which is representative of the entire database, is extracted from the databases via sampling techniques, and utilized as test data for software testing.
SUMMARY
[0004] This summary is provided to introduce concepts related to stratified sampling of a
database, and the concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[0005] In one implementation, a method includes receiving at least one stratification
parameter indicating a type of data to be selected in the database, obtaining a stopping parameter indicating a number of data records to be included in each stratum, creating the at least one stratum based on the at least one stratification parameter and the stopping parameter, and extracting data from the at least one stratum.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The detailed description is described with reference to the accompanying figures. In
the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
[0007] Fig. 1 illustrates a network environment implementing a stratified sampling system,
in accordance with an embodiment of the present subject matter.
[0008] Fig. 2 illustrates a stratified sampling system for sampling a database, in accordance
with an embodiment of the present subject matter.
[0009] Fig. 3 illustrates a method for stratified sampling of data, in accordance with an
embodiment of the present subject matter.
DETAILED DESCRIPTION
[0010] Systems and methods for stratified sampling of a database are described herein. The
systems and methods can be implemented in a variety of computing devices, such as laptops, desktops, workstations, tablet-PCs, smart phones, notebooks or portable computers, tablet computers, mainframe computers, mobile computing devices, entertainment devices, computing platforms, internet appliances and similar systems. However, a person skilled in the art will comprehend that the embodiments of the present subject matter are not limited to any particular computing system, architecture or application device, as they may be adapted to take advantage of new computing systems and platforms as they become accessible.
[0011] Data sampling is generally a process by which samples or subsets of data within
databases are selected and extracted for purposes, such as data mining, query cost evaluation, and software testing. For software testing, the sampling of data enables testing times to be considerably reduced by allowing for a small portion of data to be selected and utilized, rather than testing the software on a complete copy of a production database. In the process of data sampling, the selection of data within the database can be facilitated in a variety of ways. The manner in which data is selected is primarily influenced by a desire to minimize bias in the selection procedure. In order to reduce said bias in the selection procedure, the data is generally selected randomly from a given database, by a process known as random, or probability
sampling. The data, thus randomly selected, is generally considered to be reflective of the characteristics of the entire database.
[0012] In random sampling techniques, each element or record of a database can be assigned
an equal probability of selection, and the selection depends on the total number of records that needs to be selected. Random sampling generally ensures fairness to a large extent as there is usually no bias in the selection process. However, random sampling techniques can be prone to sampling error, where due to the probability or randomness factor, the selected sample may not be reflective of the characteristics of the data in the database. For example, a certain section of the population might not be adequately represented in the sample, which could further lead to inefficiencies in software testing, where the dependency on the extracted sample may be substantially high. Therefore, it is important for processes, such as software testing, to utilize data samples that are extracted from smaller populations. This effect is more pronounced for clusters or groups of data with a substantially larger number of elements in them. Moreover, random sampling techniques can be limited in operation in that they are usually query based. For example, the database that needs to be sampled generally has to adhere to certain known query standards, such as SQL standards in order to be sampled.
[0013] In order to overcome the limitations of the random sampling process, certain
modifications and improvements were made thereto, which resulted in a systematic sampling process, and a stratified sampling process. The systematic sampling process includes deriving a sampling fraction, with which data can be selected in a database. The systematic sampling process is not a true random sampling process per se, as it involves a certain degree of systematic arrangement. However, since there is no conscious control of exactly which data is selected, the process can be considered to involve a degree of randomness.
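The systematic sampling process described above can be illustrated with a minimal sketch. It assumes that the sampling fraction is expressed as a step size (every `step`-th record is selected) and that the only random element is the start offset. The function name `systematic_sample` and its parameters are hypothetical and are not part of the present subject matter.

```python
import random


def systematic_sample(records, step, start=None, rng=None):
    """Select every `step`-th record, beginning at a start offset.

    The sampling fraction is 1/step. If `start` is not given, it is drawn
    at random from 0..step-1, which is the only random part of the process.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    if start is None:
        rng = rng or random.Random()
        start = rng.randrange(step)
    if not 0 <= start < step:
        raise ValueError("start must lie in the range 0..step-1")
    return records[start::step]


# Twenty records, sampling fraction 1/5, fixed start offset for repeatability.
records = list(range(1, 21))
sample = systematic_sample(records, step=5, start=2)
print(sample)
```

Once the start offset is fixed, the selection is fully determined by the order of the records, which is why the process is not a true random sampling process.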
[0014] Sampling techniques can be required to cater to specific testing requirements, such as
regression testing, where bias towards larger clusters of data can occur. Stratified sampling includes grouping or stratifying a database into clusters or buckets of data. These clusters can contain an equal number of elements, or an unequal number of elements. Thereafter, the random sampling process can be performed within the smaller clusters.
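As an illustrative sketch of the general stratified sampling technique outlined above, the records can first be grouped into clusters and a random sample can then be drawn within each cluster. The function `stratified_random_sample`, the grouping key, and the sample data are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_random_sample(records, key, per_stratum, seed=0):
    """Group records by `key`, then draw up to `per_stratum` records at
    random from each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    return {
        label: rng.sample(members, min(per_stratum, len(members)))
        for label, members in strata.items()
    }


# Hypothetical table: 90 'male' records and 10 'female' records.
customers = [
    {"id": i, "gender": "male" if i % 10 else "female"} for i in range(100)
]
sample = stratified_random_sample(
    customers, key=lambda r: r["gender"], per_stratum=5
)
print({label: len(members) for label, members in sample.items()})
```

In this sketch the smaller cluster contributes as many records as the larger one, whereas a purely random draw of 10 records from the whole table could under-represent it.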
[0015] The present subject matter describes systems and methods for stratified sampling of a
database. In one implementation, the database can be divided into one or more clusters of
different data types, for example, via a User Interface (UI) of the system. In one example, these clusters can be referred to as strata and a single cluster can be referred to as a stratum. By dividing said database into the strata based on the type of data, a substantially high degree of homogeneity can be achieved within each of the strata. In one implementation, in order to divide the data into the homogeneous strata, the data can be primarily categorized based on one or more stratification parameters. The stratification parameters can be said to be indicative of a data type and can be of numeric, word string, time or date type. In one example, upper and lower limits of the stratification parameters can be specified. In a further example, for a database consisting of customer details, such as age, occupation, working hours and gender, for a particular city, the stratification parameters can include criteria, such as an age group and a gender. Therefore, in said example, if the stratification parameters include the criteria ‘25 to 28 years of age’, and ‘male’, the database can be categorized into strata to include only data records that fulfill the specified limits of the stratification parameters. In one example, the term ‘data record’ is indicative of the data present in a row of a database table. In a further example, the stratification parameters can be indicative of a number of strata to be created. For example, 3 sets of stratification parameters can be provided, which indicate 3 strata to be created, where the data records filled in each of the 3 strata are selected based on the stratification parameters. In one example, the strata can be formed within the database.
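One possible way to represent the stratification parameters of the above example is sketched below. The class `Criterion`, the helper `fulfils`, the treatment of the upper and lower limits as inclusive, and the sample rows are all assumptions made for illustration; the specification does not prescribe a particular representation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Criterion:
    """One stratification parameter applied to a single column."""
    column: str
    lower: Optional[Any] = None   # inclusive lower limit (assumed), if any
    upper: Optional[Any] = None   # inclusive upper limit (assumed), if any
    equals: Optional[Any] = None  # exact value, e.g. a word string

    def matches(self, record: dict) -> bool:
        value = record.get(self.column)
        if value is None:
            return False
        if self.equals is not None and value != self.equals:
            return False
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value > self.upper:
            return False
        return True


def fulfils(record: dict, criteria: list) -> bool:
    """A data record qualifies for a stratum only if it meets every criterion."""
    return all(c.matches(record) for c in criteria)


# The criteria '25 to 28 years of age' and 'male' from the example above.
stratum_criteria = [
    Criterion("age", lower=25, upper=28),
    Criterion("gender", equals="male"),
]
rows = [
    {"age": 26, "gender": "male", "occupation": "clerk"},
    {"age": 31, "gender": "male", "occupation": "driver"},
    {"age": 27, "gender": "female", "occupation": "nurse"},
]
selected = [r for r in rows if fulfils(r, stratum_criteria)]
print(selected)
```

Because the limits are compared with the ordinary ordering operators, the same sketch would also accept date or time values as limits, provided the column holds comparable values.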
[0016] Furthermore, according to the present subject matter, in addition to the stratification
parameter, a stopping parameter can be defined to create the strata. In one implementation, the stopping parameter is indicative of a number of data records to be included in each of the strata. In other words, the maximum number of data records to be included in each of the strata is defined by the stopping parameter.
[0017] In one implementation, the one or more stratification parameters, along with the
stopping parameter are applied during the creation of the strata. The strata thus created are based on the type of data, and not only on the number of data records in each of the strata. The method is also beneficial in that a substantially high degree of homogeneity can be imparted to each stratum, and subsequent processing, which utilizes said sampling data, can be rendered substantially free of sampling error. As described earlier, sampling error refers to the probability or randomness factor, by which the selected sample may not be reflective of the characteristics of the data in the database. Moreover, even databases of larger size can be efficiently sampled with reduced sampling error by the method according to the present subject matter. Therefore, according to the present subject matter, data coverage, such as for testing purposes, is improved with the least possible number of records. This is beneficial in that data driven testing is considered to be useful when maximum coverage is ensured by substantially smaller amounts of data.
[0018] In one implementation, the stratified sampling according to the present subject matter
can be carried out at a database table level. For example, the database table can be selected based on a testing requirement, or a required data type, further to which the strata can be created. Once the database table is selected, the stratification parameters can be specified, and the stopping parameter can be defined. Based on one or more stratification parameters, corresponding columns in the database table can be selected to create the strata. The database table can then be scanned, for example, row-wise, based on the one or more stratification parameters, and the stopping parameter. In one example, upon scanning of each row of the database table against the stratification parameters, it can be verified whether the data records in the row fulfill the stratification parameters or not. If the data records in the row fulfill the stratification parameters, the data records therein are selected for the corresponding stratum. Further to the stratification parameter, it can then be verified whether the stopping parameter is achieved or not. If the number of data records already filled in the stratum has reached the number specified in the stopping parameter, then the data record in the particular row that is being scanned is not filled in the stratum. However, if the number of data records in the stratum is less than the number specified in the stopping parameter, that data record can be included in the respective stratum being created. The process is repeated in this manner until the defined stopping parameter for each of the one or more strata is achieved, or an end of the database table is reached. In this manner, a plurality of strata can be formed of homogeneous data with a specific size per stratum. It would be understood that data across strata would not be homogeneous. Homogeneity is achieved within a specific stratum. Therefore, depending on the size of the database, the size of each stratum can be increased or decreased accordingly.
Moreover, the samples selected from said homogeneous strata can be said to be substantially representative of the characteristics of the entire population contained in the stratum. Furthermore, the above mentioned process can be cascaded to other tables related to the database table under consideration. In one example, the system according to the present subject matter can be utilized with any kind of database, including, but not restricted to, databases where query standards, such as Structured Query Language (SQL) standards, do not apply. An example of such a database is a Virtual Storage Access Method (VSAM) database on mainframes.
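The row-wise scan described above can be sketched for a single stratum as follows. The function `create_stratum`, the predicate `aged_25_to_28`, and the sample table are hypothetical; the sketch only assumes that the rows can be read one after another, so it does not depend on the database supporting a query language.

```python
def create_stratum(rows, predicate, stopping_parameter):
    """Scan rows in order and fill one stratum.

    A row is added when it fulfils the stratification parameters (the
    `predicate`). Scanning ends when the stratum holds `stopping_parameter`
    records or when the end of the table is reached.
    """
    stratum = []
    rows_scanned = 0
    for row in rows:
        if len(stratum) >= stopping_parameter:
            break  # stopping parameter achieved
        rows_scanned += 1
        if predicate(row):
            stratum.append(row)
    return stratum, rows_scanned


def aged_25_to_28(row):
    """Hypothetical stratification parameter: age within 25 to 28."""
    return 25 <= row["age"] <= 28


table = [{"age": a} for a in (22, 26, 27, 40, 25, 28, 26)]

# Stopping parameter of 3: the scan ends before the end of the table.
stratum, scanned = create_stratum(table, aged_25_to_28, stopping_parameter=3)
print([r["age"] for r in stratum], scanned)

# Stopping parameter of 10: the end of the table is reached first.
full, scanned_all = create_stratum(table, aged_25_to_28, stopping_parameter=10)
print([r["age"] for r in full], scanned_all)
```

The stopping parameter acts as an upper bound on the stratum size: when fewer qualifying rows exist than the stopping parameter allows, the stratum simply contains all of them.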
[0019] These and other advantages of the present subject matter would be described in
greater detail in conjunction with the following figures. While aspects of described systems and methods for stratified sampling of a database can be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system(s).
[0020] Fig. 1 illustrates a network environment 100 implementing a stratified sampling
system 102, according to an embodiment of the present subject matter. In the network environment 100, the stratified sampling system 102 is connected to a network 104. Furthermore, a database 105, and one or more client devices 108-1, 108-2...108-N, collectively referred to as client devices 108, are also connected to the network 104.
[0021] The stratified sampling system 102 can be implemented as any computing device
connected to the network 104. In one example, the stratified sampling system 102 can be embodied integrally with the database 105. In one instance, the stratified sampling system 102 may be implemented as mainframe computers, workstations, personal computers, desktop computers, multiprocessor systems, laptops, network computers, minicomputers, servers and the like. In addition, the stratified sampling system 102 may include multiple servers to perform mirrored tasks for users, thereby relieving congestion or minimizing traffic.
[0022] Furthermore, the stratified sampling system 102 can be connected to the client
devices 108 through the network 104. Examples of the client devices 108 include, but are not limited to personal computers, desktop computers, smart phones, PDAs, and laptops. Communication links between the client devices 108 and the stratified sampling system 102 are enabled through a desired form of connections, for example, via dial-up modem connections, cable links, digital subscriber lines (DSL), wireless or satellite links, or any other suitable form of communication.
[0023] Moreover, the network 104 may be a wireless network, a wired network, or a
combination thereof. The network 104 can also be an individual network or a collection of many such individual networks interconnected with each other and functioning as a single large
network, e.g., the internet or an intranet. The network 104 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet and such. The network 104 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other. Further, the network 104 may include network devices, such as network switches, hubs, routers, host bus adapters (HBAs), for providing a link between the stratified sampling system 102 and the client devices 108. The network devices within the network 104 may interact with the stratified sampling system 102 and the client devices 108 through communication links.
[0024] In one implementation, the stratified sampling system 102 includes a request
processing module 112 and an analysis module 114. In one implementation, the request processing module 112 can be configured to receive at least one stratification parameter, from users of the client devices 108. In one example, the one or more stratification parameters are indicative of a data type to be selected in a database table of a database, such as the database 105, for creation of one or more strata. In one example the database 105 can be a production database, or a data warehouse.
[0025] In one implementation, the request processing module 112 can be further configured
to obtain a stopping parameter that is indicative of a number of data records to be included per stratum. For example, the stopping parameter can include a numeric value of 10, where upon selection of 10 data records from the selected database table, further data records will not be selected. As indicated above, the stratified sampling system 102 can be configured to initially apply the one or more stratification parameters, along with which the stratified sampling system 102 can be configured to further apply the stopping parameter to determine how many of the selected data records are to be included in each of the strata thus created.
[0026] In one implementation, the analysis module 114 can be configured to create the one
or more strata based on the one or more stratification parameters and the stopping parameters. The manner in which the analysis module 114 creates the one or more strata is described in the detailed description pertaining to Fig. 2.
[0027] In one implementation, the user may utilize the client devices 108 to provide the one
or more stratification parameters and the stopping parameters. In one example, the client devices 108 can be provided with the UI, via which the user can provide the one or more stratification parameters and the stopping parameters as described earlier. In another example, the client devices can be configured to access a UI hosted at a third party location in order to provide the parameters.
[0028] Fig. 2 illustrates the stratified sampling system 102, in accordance with an
embodiment of the present subject matter. In said implementation, the stratified sampling system 102 includes one or more processor(s) 202, interface(s) 204, and a memory 206 coupled to the processor 202. The processor 202 can be a single processing unit or a number of units, all of which could also include multiple computing units. The processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 202 is configured to fetch and execute computer-readable instructions and data stored in the memory 206.
[0029] The interfaces 204 may include a variety of software and hardware interfaces, for
example, interface for peripheral device(s), such as a keyboard, a mouse, an external memory, and a printer. Further, the interfaces 204 may enable the stratified sampling system 102 to communicate with other computing devices, such as web servers and external data repositories in the communication network (not shown in the figure). The interfaces 204 may facilitate multiple communications within a wide variety of protocols and networks, such as a network, including wired networks, e.g., LAN, cable, etc., and wireless networks, e.g., WLAN, cellular, satellite, etc. The interfaces 204 may include one or more ports for connecting the stratified sampling system 102 to a number of computing devices.
[0030] The memory 206 may include any computer-readable medium known in the art
including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 also includes module(s) 208 and data 210.
[0031] The module(s) 208 include routines, programs, objects, components, data structures,
etc., which perform particular tasks or implement particular abstract data types. In one implementation, the module(s) 208 includes the request processing module 112, the analysis module 114, and other module(s) 216. The other module(s) 216 may include programs or coded instructions that supplement applications and functions of the stratified sampling system 102.
[0032] On the other hand, the data 210, inter alia serves as a repository for storing data
processed, received, and generated by one or more of the module(s) 208. The data 210 includes for example, the request processing data 220, analysis data 222, and other data 224. The other data 224 includes data generated as a result of the execution of one or more modules in the module(s) 208.
[0033] In one implementation, the stratified sampling system 102 can be configured to create
one or more strata that include data selected from a database, such as the database 105. The stratified sampling system 102 can be configured to operate at a database table level, where the data can be selected from a database table, which in turn can be selected based on a testing requirement. For example, if the testing requirement requires customer data, the required database table containing said data can be selected. In one example, the stratified sampling system 102 can be configured to cascade the stratification process to other relational database tables (parent and child).
[0034] In one implementation the stratified sampling system 102 includes the request
processing module 112. The request processing module 112 can be configured to receive one or more stratification parameters, for example, via a User Interface (UI) of the stratified sampling system 102. The stratification parameters can be used to indicate a type of data to be selected from the database table selected in the manner described earlier. In one implementation, the stratification parameter can be numeric criteria, or a range of the same, time or date criteria, or a range of the same, or a text string. For example, if a database table is composed of customer data including names, addresses, age, gender, date of entry, and salary, the stratification parameter can include specific values or ranges of values as mentioned above (as either numeric, word string, time, or date type) indicating which of the customer data is to be selected for a particular stratum. In one example, the user might specify different strata to be created based on varying stratification parameters. This means that each stratum that is created can be based on
different stratification parameters, and the number of strata to be formed can be directly or indirectly dependent on the stratification parameters.
[0035] In one example, a maximum number of strata to be created can be defined in the
stratified sampling system 102. The maximum number of strata to be formed can be defined to prevent unnecessarily high usage of computational resources during the stratification process. For instance, if a column ‘annual income’ has the least value of ‘0’ and the highest value in billions of USD, a default criterion may define that the maximum number of strata that can be formed is 10,000. This prevents a user from choosing small ranges, such as 0-500 USD, 500-1000 USD and so on, and thereby defining an exceptionally high number of strata that may consume substantially large computational resources. In one example, the request processing module 112 can be further configured to gather the number of strata to be created, and store the stopping parameter and the number of strata to be created in the request processing data 220. In another example, the same stratification parameter(s) can be applicable for the creation of all the strata. The stratification parameters are sensitive towards a type of data, which is beneficial in providing homogeneity to the stratification process, which is further substantially beneficial to the end testing process, for example, a software testing process. The request processing module 112 can be further configured to store the one or more stratification parameters in the request processing data 220.
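The limit on the number of strata can be illustrated with a short sketch that derives range-based strata for a numeric column and rejects a request that would exceed the limit. The function `range_strata`, the constant `MAX_STRATA`, and the choice to raise an error (rather than, say, widen the ranges) are assumptions made for illustration.

```python
import math

MAX_STRATA = 10_000  # default limit, taken from the example in the text


def range_strata(lowest, highest, width, max_strata=MAX_STRATA):
    """Return (lower, upper) bounds of consecutive ranges of size `width`
    covering lowest..highest.

    Raises ValueError when more than `max_strata` ranges would be needed.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    count = max(1, math.ceil((highest - lowest) / width))
    if count > max_strata:
        raise ValueError(
            f"{count} strata requested; the limit is {max_strata}"
        )
    return [
        (lowest + i * width, min(lowest + (i + 1) * width, highest))
        for i in range(count)
    ]


# A modest request is accepted.
bounds = range_strata(0, 2000, 500)
print(bounds)

# Ranges of 500 USD over incomes up to one billion USD would need
# 2,000,000 strata, so the request is rejected.
try:
    range_strata(0, 1_000_000_000, 500)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```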
[0036] In one implementation, the request processing module 112 can be configured to
obtain a stopping parameter, for example, via the UI of the stratified sampling system 102. The stopping parameter can be indicative of a number of data records to be included in each of the strata to be created by the stratified sampling system 102. For example, the stopping parameter can include a numeric value of 10, where upon selection of 10 data records from the selected database table, no further selection will take place. As indicated above, the stratified sampling system 102 can be configured to apply the one or more stratification parameters to categorize the data in the selected database table, along with which the stratified sampling system 102 can be configured to further apply the stopping parameter to determine how many of the selected data records are to be included in each of the strata thus created. In one example, the application of the stratification parameters can be understood as defining placeholders to indicate where to fill data records that fulfill the stratification parameters. In one example, the user may specify certain
additional rules and constraints with which to apply the one or more stratification parameters and the stopping parameter. In said example, the user might specify all the strata to have an equal number of data records, i.e., the same stopping parameter will be applicable to all the strata. In another example, the user might specify different stopping parameters for different strata.
[0037] In one implementation, the analysis module 114 can be configured to create the one
or more strata based on the one or more stratification parameters, the stopping parameter, and the number of strata to be created. In one example, the number of strata to be formed can be dependent on the stratification parameters. In another example, the number of strata to be formed can be defined by a user. As described earlier, one or more stratification parameters can be defined in order to create the strata, and the maximum number of stratification parameters that can possibly be defined may depend on the number of columns in a given database table. For example, in a database table containing 5 rows and 3 columns, where the rows represent individual data records, and the columns indicate a type of data, various stratification parameters can be defined. In said example, the columns can be ‘age’, ‘pin code’, and ‘salary’. The stratification parameters as indicated earlier can be of numeric, word string, time, or date type. In one example, upper and lower limits of the stratification parameters can be specified. In the above example, the stratification parameters can include an age range between 25 to 30, a pin code, say 500001, and salaries between 20,000 and 30,000. Therefore, 1 stratum can be created based on the above mentioned stratification parameters, where rows that fulfill these stratification parameters will be selected and filled in the stratum. Furthermore, in another example, one stratification parameter can be an age 18 and salary 20,000, and a further stratification parameter can be an age 20, and salary 25,000. In this case, two strata can be created with the data records that fulfill each of the stratification parameters. Moreover, the stratification process can continue until the end of the table is reached, i.e., in this case, till the 5th row, or up till the stopping parameter for each stratum is achieved. The stopping parameter as described earlier is a value that defines how many data records to include per stratum.
[0038] During operation, the user can select a database table, for example, in the database
105, from which the strata are to be created. The selection of the database table can be based on a type of data sought, which can further be stratified according to the present subject matter to achieve substantially high levels of homogeneity in the one or more strata. In one example,
information pertaining to the selection of the database tables can be stored in the stratified sampling system 102. In another example, the analysis module 114 can be configured to store said information in the analysis data 222, or the other data 224.
[0039] Furthermore, the analysis module 114 can be configured to fetch rules pertaining to
the creation of the one or more strata, viz. the one or more stratification parameters, the stopping parameter, and the number of strata to be created from the request processing data 220. Furthermore, the analysis module 114 can be configured to scan each row of the selected database table based on the one or more stratification parameters. In case more than one stratum has been configured in the manner described earlier, the analysis module 114 can be configured to fill the plurality of strata in parallel. Each row is scanned by the analysis module 114, which can be further configured to determine the number of records that fall in a stratum based on the stopping parameter. For example, if the stopping parameter is achieved for a stratum, no new records are added to it by the analysis module 114. Furthermore, if the stopping parameter is achieved for the plurality of the strata, then the analysis module 114 can be configured to stop scanning. In one example, the analysis module 114 can be configured to store the strata in the analysis data 222. For similar testing requirements, the user can refer to the stored strata in the analysis data 222, for example, via functionality of the UI, for regression testing purposes.
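A minimal sketch of filling several strata during one scan, with scanning stopped once every stratum has achieved its stopping parameter, is given below. The names `fill_strata` and `sex_and_age`, the dictionary-based rule format, and the census rows are hypothetical. In this sketch "in parallel" means that all strata are filled during the same pass over the rows, not that multiple threads are used, and a row may enter more than one stratum if the rules overlap.

```python
def fill_strata(rows, strata_rules, stopping_parameters):
    """Fill all strata in one pass over `rows`.

    strata_rules:        stratum name -> predicate(row) -> bool
    stopping_parameters: stratum name -> maximum number of records
    """
    strata = {name: [] for name in strata_rules}
    # Strata that can still accept records.
    open_strata = {n for n in strata_rules if stopping_parameters[n] > 0}
    rows_scanned = 0
    for row in rows:
        if not open_strata:
            break  # stopping parameter achieved for every stratum
        rows_scanned += 1
        for name in list(open_strata):
            if strata_rules[name](row):
                strata[name].append(row)
                if len(strata[name]) >= stopping_parameters[name]:
                    open_strata.discard(name)
    return strata, rows_scanned


def sex_and_age(sex, low, high):
    """Build a hypothetical rule: given sex, age within low..high."""
    return lambda row: row["sex"] == sex and low <= row["age"] <= high


census = [
    {"id": 0, "sex": "male", "age": 21},
    {"id": 1, "sex": "male", "age": 30},
    {"id": 2, "sex": "female", "age": 22},
    {"id": 3, "sex": "male", "age": 24},
    {"id": 4, "sex": "female", "age": 25},
    {"id": 5, "sex": "male", "age": 23},
    {"id": 6, "sex": "female", "age": 20},
]
rules = {
    "male_20_25": sex_and_age("male", 20, 25),
    "female_20_25": sex_and_age("female", 20, 25),
}
strata, scanned = fill_strata(
    census, rules, {"male_20_25": 2, "female_20_25": 2}
)
print({name: [r["id"] for r in recs] for name, recs in strata.items()}, scanned)
```

Because the rows are visited in a fixed order, repeating the scan with the same parameters yields the same strata, and raising a stopping parameter only extends a stratum with later rows without changing the records already selected.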
[0040] In an operational example, a database table containing census information (age, sex,
address, occupation, etc. of citizens), can be sampled using the system according to the present subject matter. If information regarding males, aged 20 to 25, residing in a particular geographical location is required, the request processing module 112 can be configured in a manner as described earlier, to receive relevant stratification parameters. In this case, the stratification parameters can include ‘male’, an upper and lower numerical parameter, viz. the age group 20 to 25, and the geographical location, such as a pin code. Based on the above received stratification parameters, the analysis module 114 of the stratified sampling system 102 can be configured to select the relevant columns in the database table, in this case, a sex column, an age column, and an address pin code column. Furthermore, as described earlier, the request processing module 112 can be configured to obtain the stopping parameter, indicative of the number of data records to include per stratum. In this case, if the end requirement is 10 data records, the required stopping parameter can be 10.
[0041] During operation, the analysis module 114 scans each row of the database table
according to the stratification parameters received earlier. Furthermore, all data records fitting the stratification parameters, i.e., all males, between the ages 20 to 25, staying within the pin codes specified, will be selected within the stratum. In one example, a placeholder can be defined for each of the strata defined by the stratification parameters. The placeholders will specify for each stratum, where to fill or populate the data records fulfilling the stratification parameters. Moreover, as the stopping parameter is set at 10, the analysis module 114 can be configured to stop the creation of strata once the stopping parameter of 10 is achieved for each of the strata.
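The census example can be sketched as follows. The records, the column names (`sex`, `age`, `pin`), and the helper names `in_stratum` and `fill_stratum` are hypothetical, and the age range is assumed to be inclusive at both ends; the list returned by `fill_stratum` plays the role of the placeholder for the stratum.

```python
from itertools import islice

# Hypothetical census records; the column names are assumptions.
census = [
    {"name": "R1", "sex": "male", "age": 22, "pin": "400021"},
    {"name": "R2", "sex": "female", "age": 23, "pin": "400021"},
    {"name": "R3", "sex": "male", "age": 31, "pin": "400021"},
    {"name": "R4", "sex": "male", "age": 20, "pin": "400001"},
    {"name": "R5", "sex": "male", "age": 25, "pin": "400021"},
    {"name": "R6", "sex": "male", "age": 24, "pin": "400021"},
]


def in_stratum(record, sex, age_low, age_high, pins):
    """Return True if the record fulfils all stratification parameters."""
    return (record["sex"] == sex
            and age_low <= record["age"] <= age_high
            and record["pin"] in pins)


def fill_stratum(table, stopping_parameter, **criteria):
    """Select matching records in table order, stopping at the stopping parameter."""
    matching = (r for r in table if in_stratum(r, **criteria))
    # islice stops consuming the table once enough records are selected.
    return list(islice(matching, stopping_parameter))


stratum = fill_stratum(census, 10, sex="male", age_low=20, age_high=25,
                       pins={"400021"})
```

With a stopping parameter of 10 and only three matching records in this small table, the scan simply runs to the end of the table.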
[0042] In one example, the analysis module 114 of the stratified sampling system 102 can be
configured to continue performing the above mentioned process until the end of the database table is reached, or until the requisite number of strata has been created with the number of records specified in the stopping parameter. In one example, the analysis module 114 can be configured to perform a single pass through the database table, until the criteria set by the one or more stratification parameters, the stopping parameter, and the number of strata to be created are fulfilled. In other words, the analysis module 114 of the stratified sampling system 102 can be configured to maintain a count of the data records selected per stratum until the stopping parameter is achieved. Once the strata have been created, the user can utilize the data contained therein for data intensive uses, such as data mining, query cost evaluation, and software testing. In order to utilize said data, for example, for the testing of software, data needs to be sampled or extracted from the strata.
[0043] In one example, the analysis module 114 can be configured to perform a sampling
process, by which data can be extracted or sampled, from the strata that are created in the manner described above. For example, the homogeneity of data in the one or more strata can ensure that data sampled therefrom is representative of the entire data population of said stratum. Furthermore, due to the configuration of the stratified sampling system 102, the stratification process is rendered deterministic and predictable in nature. Due to this deterministic nature, databases can be repeatedly sampled without fear of overlapping samples. For repeated sampling, the stopping parameter can be increased; because the stratification parameters are maintained, the categorization criteria of the data records remain the same, and only new data records get appended to the strata, thereby providing predictability to the stratification process. For example, if the stopping parameter in one sampling requirement is specified to be 5, the stratified sampling system 102 can perform a single pass through the database table and select the first 5 data records that fulfill the one or more stratification parameters. At a later stage, for the same sampling requirement, if a further 5 data records are to be selected for the same stratification parameters, the stratified sampling system 102 can be configured to ignore the 5 data records that were already selected in the previous sampling and perform the single pass scanning after the 5th data record in the database table. In other words, the stratified sampling system 102 can be configured to operate in an incremental manner. The stratified sampling system 102 is designed to provide data without compromising the quality of test data. Moreover, the stratified sampling system 102 is designed to work with databases of considerably larger size without bias.
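The incremental behaviour can be sketched as below. This is one possible reading, assuming that the table has a stable row order and that the system remembers the position at which the previous pass stopped; the function name `sample_increment` and the `resume_at` parameter are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def sample_increment(table: Sequence[dict],
                     matches: Callable[[dict], bool],
                     count: int,
                     resume_at: int = 0) -> Tuple[List[dict], int]:
    """Select up to `count` matching rows, starting at position `resume_at`.

    Returns the selected rows and the position at which a later pass
    should resume, so that earlier selections are never revisited.
    """
    selected: List[dict] = []
    position = resume_at
    while position < len(table) and len(selected) < count:
        row = table[position]
        if matches(row):
            selected.append(row)
        position += 1
    return selected, position


# Hypothetical table: rows with an even id fulfil the stratification parameter.
table = [{"id": i} for i in range(20)]
is_match = lambda row: row["id"] % 2 == 0

first, position = sample_increment(table, is_match, 5)
second, position = sample_increment(table, is_match, 5, resume_at=position)
```

Because the second call resumes where the first one stopped, the two samples cannot overlap as long as the row order of the table is unchanged between calls.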
[0044] Furthermore, the stratified sampling system 102 can be configured to operate with
minimal user inputs. This means that a user with substantially low levels of information can also leverage the stratified sampling system 102 effectively. Furthermore, the strata are formed in a homogenized manner by utilizing the one or more stratification parameters and the stopping parameter. In other words, the stratified sampling system 102 can be configured to create strata based on actual content of the database tables, which can be more effective in fields, such as software testing.
[0045] Fig. 3 illustrates a method 300 for stratified sampling of a database, according to one
embodiment of the present subject matter. The method 300 may be implemented in a variety of computing systems in several different ways. For example, the method 300, described herein, may be implemented using the stratified sampling system 102, as described above.
[0046] The method 300, completely or partially, may be described in the general context of
computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. A person skilled in the art will readily recognize that steps of the method can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of the described method 300.
[0047] The order in which the method 300 is described is not intended to be construed as a
limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternative method. Additionally, individual blocks may be deleted from the method without departing from the spirit and scope of the subject matter described herein. Furthermore, the methods can be implemented in any suitable hardware, software, firmware, or combination thereof. It will be understood that even though the method 300 is described with reference to the stratified sampling system 102, the description may be extended to other systems as well.
[0048] At block 302, in order to create one or more strata, one or more stratification
parameters can be received, for example, from a user. The user can specify the one or more stratification parameters that are indicative of a type of data to be selected in a database table, and a number of strata to be created, in a manner described earlier. In one example, a selection of the database table can be driven by an end result testing requirement. For example, if the testing requirement is financial transaction information pertaining to savings bank accounts of customers, then the user can select a database table containing data pertaining to the same from a database. In one example, the analysis module 114 can be configured to select the database table of interest based on a user input, and store information pertaining to said selection, for example, in the analysis data 222, or the other data 224, of the stratified sampling system 102. Furthermore, the stratification parameters can be dependent on a type of testing to be performed. The stratification parameters can include numeric criteria or a range of the same, date or time criteria or a range of the same, and a text string. For example, a set of stratification parameters can specify males between the ages of 25 and 30, with an address having pin codes between 500001 and 500020, within a time period extending from the year 1995 to 2005. In such a case, the stratification parameters can act as categorization parameters, so that data in the database is categorized according to whether it fulfills the stratification parameters. In one example, the request processing module 112 of the stratified sampling system 102 can be configured to receive the one or more stratification parameters in a manner as previously described.
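The kinds of stratification parameters named above (numeric ranges, date ranges, and text strings) can be represented, for illustration, as small criterion objects. The class names `RangeCriterion` and `TextCriterion`, the column names, and in particular the date column `registered_on` are assumptions; the specification does not state which column the period 1995 to 2005 applies to. Ranges are assumed to be inclusive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, List


@dataclass(frozen=True)
class RangeCriterion:
    """Numeric, date, or time range; both bounds are assumed inclusive."""
    column: str
    low: Any
    high: Any

    def matches(self, row: Dict[str, Any]) -> bool:
        return self.low <= row[self.column] <= self.high


@dataclass(frozen=True)
class TextCriterion:
    """Exact match against a text string."""
    column: str
    value: str

    def matches(self, row: Dict[str, Any]) -> bool:
        return row[self.column] == self.value


def row_matches(row: Dict[str, Any], criteria: List[Any]) -> bool:
    """A row is categorized into the stratum only if it fulfils every criterion."""
    return all(criterion.matches(row) for criterion in criteria)


# The example from the text, with hypothetical column names.
criteria = [
    TextCriterion("sex", "male"),
    RangeCriterion("age", 25, 30),
    RangeCriterion("pin", 500001, 500020),
    RangeCriterion("registered_on", date(1995, 1, 1), date(2005, 12, 31)),
]
```

Keeping each criterion as a separate object lets the same row check serve numeric, date, and text parameters without special cases.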
[0049] At block 304, in furtherance to receiving the one or more stratification parameters, a
stopping parameter can be obtained. The stopping parameter, as described earlier, is indicative of a number of data records to be included in each of the one or more strata that are to be created. In one example, the stopping parameter can be a numerical value indicating the number of data records to be included during the creation of the one or more strata. In a further example, a common stopping parameter for all the strata can be obtained, or a different stopping parameter can be obtained for each of the plurality of strata. Furthermore, a numerical value indicating a number of strata to be created can also be obtained. For example, the user may restrict the number of strata to be created. In an example, the request processing module 112 of the stratified sampling system 102 can be configured to obtain the stopping parameter, and the number of strata to be created in a manner as previously described.
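The two ways of supplying a stopping parameter (one common value, or one value per stratum), together with an optional limit on the number of strata, can be normalized into a single mapping. The following is a sketch under assumptions: the function name `resolve_stopping_parameters` is hypothetical, and restricting the number of strata by keeping the first ones in the given order is only one possible policy.

```python
from typing import Dict, List, Mapping, Optional, Union


def resolve_stopping_parameters(strata_names: List[str],
                                stopping: Union[int, Mapping[str, int]],
                                max_strata: Optional[int] = None) -> Dict[str, int]:
    """Return the stopping parameter to apply to each stratum."""
    # Optionally restrict the number of strata to be created.
    names = strata_names if max_strata is None else strata_names[:max_strata]
    if isinstance(stopping, int):
        # A common stopping parameter applies to all strata.
        return {name: stopping for name in names}
    # Otherwise a different stopping parameter is given for each stratum.
    return {name: stopping[name] for name in names}
```

For example, `resolve_stopping_parameters(["a", "b", "c"], 5)` assigns 5 to every stratum, while passing a mapping with `max_strata=2` keeps only the first two strata with their own values.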
[0050] At blocks 306 and 308, the one or more strata can be created based on the one or
more stratification parameters and the stopping parameter. At the block 306, the one or more stratification parameters can be applied to the selected database table in order to create the one or more strata. In case more than one stratum is to be created based on the stratification parameters, each of the plurality of strata can be created in parallel. This means that each of the rows of the database table can be scanned against the stratification parameters together. In one example, the analysis module 114 of the stratified sampling system 102 can be configured to scan the rows of the database table based on the one or more stratification parameters.
[0051] At the block 308, the rows of the database can be scanned until the stopping
parameter has been achieved, or until the end of the database is reached. For example, if the stopping parameter defines a numerical value of 5, then during the scanning of the database table, the number of data records already selected for each stratum can be measured against the stopping parameter as if it were a threshold parameter. If the data record of the current row being scanned corresponds to a stratum for which enough records exist, i.e., the stopping parameter is already achieved for this stratum, the data record is not selected. Furthermore, this process can be performed until each of the number of strata defined in block 304 contains the number of data records defined in the stopping parameter, or the end of the database table is reached.
[0052] At a block 310, data can be extracted from the one or more strata that have been
created according to the process described in the blocks 302 to 308. The data extracted from the one or more strata can be considered to be substantially representative of the characteristics of the entire stratum. In this manner, a relatively high efficiency of testing can be facilitated.
[0053] At a block 312, the process described in the blocks 302 to 308 can be cascaded to
other database tables. For example, the process described in the blocks 302 to 308 can be cascaded to database tables that bear a parent/child relationship to the database table selected in the block 302. In said manner, the process described above can effectively be extended to other database tables in the database, thereby substantially reducing sampling and testing times.
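One possible way to cascade a sample to a child table is to follow the key relationship between the tables, as sketched below. This interpretation is an assumption; the specification only states that the process can be extended to tables bearing a parent/child relationship. The function name `cascade_to_child` and the table and column names are hypothetical.

```python
from typing import Dict, List


def cascade_to_child(parent_sample: List[Dict],
                     child_table: List[Dict],
                     parent_key: str,
                     foreign_key: str) -> List[Dict]:
    """Keep only the child rows that reference a sampled parent record."""
    sampled_keys = {row[parent_key] for row in parent_sample}
    return [row for row in child_table if row[foreign_key] in sampled_keys]


# Hypothetical parent sample (accounts) and child table (transactions).
accounts_sample = [{"account_id": 1}, {"account_id": 3}]
transactions = [
    {"txn": 10, "account_id": 1},
    {"txn": 11, "account_id": 2},
    {"txn": 12, "account_id": 3},
    {"txn": 13, "account_id": 1},
]
child_sample = cascade_to_child(accounts_sample, transactions,
                                "account_id", "account_id")
```

Restricting the child table to rows that reference sampled parents keeps the extracted test data referentially consistent across the two tables.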
[0054] Although implementations of stratified sampling of a database have been described in
language specific to structural features and/or methods, it is to be understood that the present subject matter is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as implementations for stratified sampling of a database.
I/We claim:
1. A computer implemented method for stratified sampling of a database, the method
comprising:
receiving at least one stratification parameter indicating a type of data to be selected in the database;
obtaining a stopping parameter indicating a number of data records to be included in each of at least one stratum;
creating the at least one stratum based on the at least one stratification parameter and the stopping parameter; and
extracting data from the at least one stratum.
2. The method as claimed in claim 1, wherein the creating further comprises:
selecting a database table from which the at least one stratum is to be created; and
scanning each row of the database table to determine whether the stopping parameter has been achieved.
3. The method as claimed in claim 2, wherein the scanning further comprises scanning until one of an end of the database is reached, and the stopping parameter is achieved.
4. The method as claimed in claim 1 further comprising defining a number of strata to be created.
5. A stratified sampling system (102) for stratified sampling of a database, the stratified sampling system (102) comprising:
a processor (202); and
a memory (206) coupled to the processor (202), the memory (206) comprising:
a request processing module (112) configured to:
receive at least one stratification parameter indicating a type of data to be selected in the database; and
obtain a stopping parameter indicating a number of data records to be included in each of at least one stratum; and
an analysis module (114) configured to:
create the at least one stratum based on the at least one stratification parameter and the stopping parameter; and
extract data from the at least one stratum.
6. The stratified sampling system (102) as claimed in claim 5, wherein the request processing module (112) is further configured to gather a number of strata to be created.
7. The stratified sampling system (102) as claimed in claim 5, wherein the analysis module (114) is further configured to:
scan each row of a selected database table based on at least one of the stratification parameter, the stopping parameter, and the number of strata to be created.
8. The stratified sampling system (102) as claimed in claim 7, wherein the analysis module (114) is further configured to continue scanning until one of an end of the selected database table is reached, and the gathered number of strata is created with the number of records specified in the stopping parameter.
9. The stratified sampling system (102) as claimed in claim 5, wherein the at least one stratification parameter is at least one of numeric criteria, date criteria, time criteria, and a text string.
10. A computer-readable medium having embodied thereon a computer program for executing a method comprising:
receiving at least one stratification parameter indicating a type of data to be selected in a database;
obtaining a stopping parameter indicating a number of data records to be included in each of at least one stratum;
creating the at least one stratum based on the at least one stratification parameter and the stopping parameter; and
extracting data from the at least one stratum.
| # | Name | Date |
|---|---|---|
| 1 | 680-MUM-2012-FORM 3.pdf | 2018-08-11 |
| 2 | 680-MUM-2012-FORM 26(25-4-2012).pdf | 2018-08-11 |
| 3 | 680-MUM-2012-FORM 2.pdf | 2018-08-11 |
| 4 | 680-MUM-2012-FORM 18(16-3-2012).pdf | 2018-08-11 |
| 5 | 680-MUM-2012-FORM 1(22-8-2012).pdf | 2018-08-11 |
| 6 | 680-MUM-2012-CORRESPONDENCE(25-4-2012).pdf | 2018-08-11 |
| 7 | 680-MUM-2012-CORRESPONDENCE(22-8-2012).pdf | 2018-08-11 |
| 8 | 680-MUM-2012-CORRESPONDENCE(16-3-2012).pdf | 2018-08-11 |
| 9 | 680-MUM-2012-FER.pdf | 2018-10-15 |
| 10 | 680-MUM-2012-OTHERS [15-04-2019(online)].pdf | 2019-04-15 |
| 11 | 680-MUM-2012-FER_SER_REPLY [15-04-2019(online)].pdf | 2019-04-15 |
| 12 | 680-MUM-2012-COMPLETE SPECIFICATION [15-04-2019(online)].pdf | 2019-04-15 |
| 13 | 680-MUM-2012-CLAIMS [15-04-2019(online)].pdf | 2019-04-15 |
| 14 | 680-MUM-2012-ABSTRACT [15-04-2019(online)].pdf | 2019-04-15 |
| 15 | 680-MUM-2012-HearingNoticeLetter-(DateOfHearing-25-10-2019).pdf | 2019-10-14 |
| 16 | 680-MUM-2012-Correspondence to notify the Controller (Mandatory) [22-10-2019(online)].pdf | 2019-10-22 |
| 17 | 680-MUM-2012-Written submissions and relevant documents (MANDATORY) [07-11-2019(online)].pdf | 2019-11-07 |
| 18 | 680-MUM-2012-PatentCertificate16-06-2021.pdf | 2021-06-16 |
| 19 | 680-MUM-2012-IntimationOfGrant16-06-2021.pdf | 2021-06-16 |
| 20 | 680-MUM-2012-RELEVANT DOCUMENTS [26-09-2023(online)].pdf | 2023-09-26 |
| 1 | search_680mum2012_11-10-2018.pdf |