High Dimensional Stratified Sampling

Abstract: In one aspect a processing device of an information processing system is operative to perform high dimensional stratified sampling of a database comprising a plurality of records arranged in overlapping sub groups. For a given record the processing device determines which of the sub groups the given record is associated with and for each of the sub groups associated with the given record checks if a sampling rate of the sub group is less than a specified sampling rate. If the sampling rate of each of the sub groups is less than the specified sampling rate the processing device samples the given record and otherwise does not sample the given record. The determine check and sample operations are repeated for additional records and samples resulting from the sample operations are processed to generate information characterizing the database. Other aspects of the invention relate to determining which records to sample through iterative optimization of an objective function that may be based for example on a likelihood function of the sampled records.

Patent Information

Application #

Filing Date

24 January 2013

Publication Number

47/2014

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

Parent Application

Applicants

ALCATEL LUCENT

3 avenue Octave Gréard F 75007 Paris

Inventors

1. CHEN Aiyou

53 Ethan Drive Apartment 1A New Providence NJ 07974

2. XIONG Ming

19 Elmara Drive Bridgewater NJ 08807

Specification

HIGH-DIMENSIONAL STRATIFIED SAMPLING Field of the Invention The present invention relates generally to the field of information processing, and more particularly relates to techniques for stratified sampling of records associated with a database of an information processing system. Background of the Invention Large databases often include millions of records or more, with each record having many attributes. Statistical operations may be performed on such databases using sampling techniques that generally involve selecting records at random from the database. The selected records may then be analyzed to generate statistics characterizing the complete set of records in the database. In order to ensure that the resulting statistics accurately characterize the database, stratified sampling techniques may be used. In stratified sampling, the database records are separated into sub-groups or "strata," and one or more records are then randomly selected from each of the sub groups for analysis. An example of a conventional stratified sampling technique is described in U.S. Patent Application Publication No. 2002/0198863, entitled "Stratified Sampling of Data in a Database System." A problem with conventional stratified sampling techniques is that such techniques typically attempt to separate the records into mutually exclusive sub-groups, and can therefore only consider a limited number of attributes. The number of attributes per record is generally referred to as the "dimensionality" of the database, and the conventional stratified sampling techniques are practical only in low dimensionality situations. However, many modern databases, such as those used to track connection data in telecommunication applications, have a very high dimensionality. Consider by way of example a database that stores N records, each with K attributes, where each attribute takes mk discrete values, 1< k < K . If K is small, one can simply concatenate the attributes in order to partition the database into mutually exclusive sub-groups. The number of sub-groups in this case is given by However, as K gets larger, this approach is impractical. For example, if mk = 5 and K = , then there are nearly 107 subgroups, many of which will contain no records or only a small number of records. In this type of high dimensionality context, conventional stratified sampling techniques are unable to provide an appropriate stratified sample for each of the K attributes. The problem is apparent in numerous information processing applications, including large scale database integration and maintenance, data mining, data warehousing, query processing, telecommunication network traffic analysis, opinion polls, etc. Summary of the Invention Illustrative embodiments of the present invention provide high-dimensional stratified sampling techniques that are suitable for use in applications in which both the number N of records and the number^ of attributes per record are large. These embodiments include sequential and optimal high-dimensional stratified sampling algorithms. The former is particularly useful for online sampling, while the latter is particularly useful for offline or periodical sampling, although both can also be used in a wide variety of other sampling applications. In accordance with one aspect of the invention, a processing device of an information processing system is operative to perform high-dimensional stratified sampling of a database comprising a plurality of records arranged in overlapping sub-groups. For a given record, the processing device determines which of the sub-groups the given record is associated with, and for each of the sub-groups associated with the given record, checks if a sampling rate of the sub group is less than a specified sampling rate. If the sampling rate of each of the sub-groups is less than the specified sampling rate, the processing device samples the given record, and otherwise does not sample the given record. The determine, check and sample operations are repeated for additional records, and samples resulting from the sample operations are processed to generate information characterizing the database. In accordance with another aspect of the invention, a processing device of an information processing system performs high-dimensional stratified sampling of a database comprising a plurality of records arranged in overlapping sub-groups by optimizing an objective function characterizing which of the plurality of records are to be sampled. The objective function may be based, for example, on a likelihood function of the sampled records, and more specifically may be based on a binomial-normal approximation of a likelihood function of the sampled records. The optimization of the objective function is performed by iteratively updating components of a binary indicator that specifies whether or not respective ones of the plurality of records are sampled. The processing device samples particular ones of the plurality of records based on values of the updated components of the binary indicator which optimize the obj ective function, and the resulting samples are processed to generate information characterizing the database comprising the sub-groups of records. The illustrative embodiments provide significant advantages over conventional approaches. For example, the sequential and optimal high-dimensional stratified sampling processes in the illustrative embodiments can be used to generate reliable, unbiased samples with minimal computing and memory requirements. These and other features and advantages of the present invention will become more apparent from the accompanying drawings and the following detailed description. Brief Description of the Drawings FIG. 1 is a block diagram of an information processing system implementing highdimensional stratified sampling in an illustrative embodiment of the invention. FIG. 2 shows a more detailed view of a processing device of the FIG. 1 system. FIG. 3 is a flow diagram of a sequential high-dimensional stratified sampling process in an illustrative embodiment of the invention. FIG. 4 is a flow diagram of an optimal high-dimensional stratified sampling process in an illustrative embodiment of the invention. FIG. 5 shows a simple example of a set of connection records in a network traffic application in which the high-dimensional stratified sampling processes of FIGS. 3 or 4 may be applied. FIG. 6 is a set of plots comparing estimation error as a function of sampling rate for sequential and optimal high-dimensional stratified sampling with that of conventional random sampling. FIG. 7 shows multiple sets of plots each comparing estimation error as a function of the number of sub-groups for sequential and optimal high-dimensional stratified sampling with that of conventional random sampling. Detailed Description of the Invention The present invention will be illustrated herein in conjunction with exemplary information processing systems, processing devices and high-dimensional stratified sampling techniques. It should be understood, however, that the invention is not limited to use with the particular types of systems, devices and techniques disclosed. For example, aspects of the present invention can be implemented in a wide variety of other information processing system configurations, using processing devices and process steps other than those described in conjunction with the illustrative embodiments. FIG. 1 shows an information processing system 100 comprising a controller 102 coupled via a network 104 to a database system 105 that includes a plurality of servers 106-1, 106-2, . . . 106-N, also denoted Server 1, Server 2, . . . ServerN. Each of the servers 106 has an associated database 108. These databases store records or other data objects that are accessed by the controller 102 via the network 104. The controller 102 in this embodiment comprises a sampling module 110 that is configured to implement one or more high-dimensional stratified sampling techniques to be described in greater detail below. The sampling module 110 utilizes the highdimensional stratified sampling technique(s) to process sets of records that are separated into sub-groups that are not necessarily mutually exclusive. The records processed by the sampling module 110 may be received from data sources 112 or retrieved from one or more of the databases 108 of the database system 105. The resulting stratified samples may be stored by the controller 102 in a sample database 114. Although shown in the figure as being separate from the database system 105, system elements such as controller 102 and sample database 114 may alternatively be implemented within the database system 105. The controller 102 may comprise at least a portion of a computer or any other type of processing device suitable for communicating with the database system 105 over network 104. For example, the controller may comprise a portable or laptop computer, mobile telephone, personal digital assistant (PDA), wireless email device, television set-top box (STB), or other communication device. The network 104 may comprise a wide area network such as the Internet, a metropolitan area network, a local area network, a cable network, a telephone network, a satellite network, as well as portions or combinations of these or other networks. In other embodiments, the sampling module 110 may be implemented in one or more of the servers 106 or their associated databases 108, or in a separate centralized controller coupled to one or more of these elements. It is also possible to implement the sampling module in a distributed manner with portions of the module being arranged in respective ones of the devices 102, 106 or 108 or subsets thereof. The databases 108 need not be in any particular configuration, and the term "database" as used herein is therefore intended to be construed broadly so as to encompass any number of different arrangements of stored records. Referring now to FIG. 2, one possible implementation of the controller 102 of the system 100 is shown. In this embodiment, the controller comprises a processor 200 coupled to a memory 202, and further comprises network interface circuitry 204. The memory 202 is assumed to store records 205 or portions thereof for processing by the sampling module 110. The stored records 205 may be received from data sources 112 or retrieved from the database system 105 over the network 104. The sampling module 110 of the controller 102 in this implementation comprises a sub-group identification module 210, a sampling rate determination module 212, a sampling decision module 214, an optimization module 215, and a set of counters 220 including counters 222 that count the number of records per sub-group and counters 224 that count the number of samples per sub-group. The operation of these modules and counters will be described in greater detail below in conjunction with FIGS. 3 and 4 . The processor 200 may be implemented as a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC) or other type of processing device, as well as portions or combinations of such devices. The memory 202 may comprise an electronic random access memory (RAM), a read-only memory (ROM), a disk-based memory, or other type of storage device, as well as portions or combinations of such devices. The processor and memory may be used in storage and execution of one or more software programs for high-dimensional stratified sampling, as well as for performing related operations, such as those associated with storage and processing of records. The modules 210, 212, 214 and 215 may therefore be implemented at least in part using such software programs. The memory 202 may be viewed as an example of what is more generally referred to herein as a computer program product or still more generally as a computer-readable storage medium that has executable program code embodied therein. Other examples of computer-readable storage media may include disks or other types of magnetic or optical media, in any combination. The processor 200, memory 202 and interface circuitry 204 may comprise well-known conventional circuitry suitably modified to operate in the manner described herein. Also, the various modules shown in FIG. 2 may be viewed as examples of circuitry used to implement the associated functionality. For example, portions of such circuitry may comprise matrix multiplication circuitry or other types of arithmetic logic circuitry. Conventional aspects of such circuitry are well known to those skilled in the art and therefore will not be described in detail herein. It is to be appreciated that an information processing system and associated controller as disclosed herein may be implemented using components and modules other than those specifically shown in the exemplary arrangements of FIGS. 1 and 2 . The operation of the system 100 in illustrative embodiments will now be described with reference to the flow diagrams of FIGS. 3 and 4 . These flow diagrams illustrate respective sequential and optimal high-dimensional stratified sampling techniques. It will be assumed for these embodiments that the sampling techniques are applied to a database that stores N records, each with K attributes, where each attribute takes mk discrete values, 1< k

Documents

Application Documents

#	Name	Date
1	787-delnp-2013-Correspondence Others-(31-01-2013).pdf	2013-01-31
2	SPECIFICATION.pdf	2013-02-06
3	GPOA.pdf	2013-02-06
4	FORM 5.pdf	2013-02-06
5	FORM 3.pdf	2013-02-06
6	FIGURES.pdf	2013-02-06
7	787-DELNP-2013.pdf	2013-03-01
8	787-delnp-2013-Form-3-(20-06-2013).pdf	2013-06-20
9	787-delnp-2013-Correspondence-Others-(20-06-2013).pdf	2013-06-20
10	787-DELNP-2013-Form-3-(26-02-2014).pdf	2014-02-26
11	787-DELNP-2013-Correspondence-Others-(26-02-2014).pdf	2014-02-26
12	787-DELNP-2013-Form-3-(22-07-2014).pdf	2014-07-22
13	787-DELNP-2013-Correspondence-Others-(22-07-2014).pdf	2014-07-22
14	787-DELNP-2013-Form 3-311014.pdf	2014-11-27
15	787-DELNP-2013-Correspondence-311014.pdf	2014-11-27
16	787-DELNP-2013-FER.pdf	2018-11-20
17	787-DELNP-2013-AbandonedLetter.pdf	2019-10-19

Search Strategy

1	searchstrat_19-11-2018.pdf