Specification
HIGH-DIMENSIONAL STRATIFIED SAMPLING
Field of the Invention
The present invention relates generally to the field of information processing, and more
particularly relates to techniques for stratified sampling of records associated with a database of
an information processing system.
Background of the Invention
Large databases often include millions of records or more, with each record having many
attributes. Statistical operations may be performed on such databases using sampling techniques
that generally involve selecting records at random from the database. The selected records may
then be analyzed to generate statistics characterizing the complete set of records in the database.
In order to ensure that the resulting statistics accurately characterize the database, stratified
sampling techniques may be used. In stratified sampling, the database records are separated into
sub-groups or "strata," and one or more records are then randomly selected from each of the sub
groups for analysis. An example of a conventional stratified sampling technique is described in
U.S. Patent Application Publication No. 2002/0198863, entitled "Stratified Sampling of Data in a
Database System."
A problem with conventional stratified sampling techniques is that such techniques
typically attempt to separate the records into mutually exclusive sub-groups, and can therefore
only consider a limited number of attributes. The number of attributes per record is generally
referred to as the "dimensionality" of the database, and the conventional stratified sampling
techniques are practical only in low dimensionality situations. However, many modern
databases, such as those used to track connection data in telecommunication applications, have a
very high dimensionality.
Consider by way of example a database that stores N records, each with K attributes,
where each attribute takes mk discrete values, 1< k < K . If K is small, one can simply
concatenate the attributes in order to partition the database into mutually exclusive sub-groups.
The number of sub-groups in this case is given by However, as K gets larger, this
approach is impractical. For example, if mk = 5 and K = , then there are nearly 107 subgroups,
many of which will contain no records or only a small number of records. In this type of
high dimensionality context, conventional stratified sampling techniques are unable to provide an
appropriate stratified sample for each of the K attributes. The problem is apparent in numerous
information processing applications, including large scale database integration and maintenance,
data mining, data warehousing, query processing, telecommunication network traffic analysis,
opinion polls, etc.
Summary of the Invention
Illustrative embodiments of the present invention provide high-dimensional stratified
sampling techniques that are suitable for use in applications in which both the number N of
records and the number^ of attributes per record are large. These embodiments include
sequential and optimal high-dimensional stratified sampling algorithms. The former is
particularly useful for online sampling, while the latter is particularly useful for offline or
periodical sampling, although both can also be used in a wide variety of other sampling
applications.
In accordance with one aspect of the invention, a processing device of an information
processing system is operative to perform high-dimensional stratified sampling of a database
comprising a plurality of records arranged in overlapping sub-groups. For a given record, the
processing device determines which of the sub-groups the given record is associated with, and
for each of the sub-groups associated with the given record, checks if a sampling rate of the sub
group is less than a specified sampling rate. If the sampling rate of each of the sub-groups is less
than the specified sampling rate, the processing device samples the given record, and otherwise
does not sample the given record. The determine, check and sample operations are repeated for
additional records, and samples resulting from the sample operations are processed to generate
information characterizing the database.
In accordance with another aspect of the invention, a processing device of an information
processing system performs high-dimensional stratified sampling of a database comprising a
plurality of records arranged in overlapping sub-groups by optimizing an objective function
characterizing which of the plurality of records are to be sampled. The objective function may
be based, for example, on a likelihood function of the sampled records, and more specifically
may be based on a binomial-normal approximation of a likelihood function of the sampled
records. The optimization of the objective function is performed by iteratively updating
components of a binary indicator that specifies whether or not respective ones of the plurality of
records are sampled. The processing device samples particular ones of the plurality of records
based on values of the updated components of the binary indicator which optimize the obj ective
function, and the resulting samples are processed to generate information characterizing the
database comprising the sub-groups of records.
The illustrative embodiments provide significant advantages over conventional
approaches. For example, the sequential and optimal high-dimensional stratified sampling
processes in the illustrative embodiments can be used to generate reliable, unbiased samples with
minimal computing and memory requirements.
These and other features and advantages of the present invention will become more
apparent from the accompanying drawings and the following detailed description.
Brief Description of the Drawings
FIG. 1 is a block diagram of an information processing system implementing highdimensional
stratified sampling in an illustrative embodiment of the invention.
FIG. 2 shows a more detailed view of a processing device of the FIG. 1 system.
FIG. 3 is a flow diagram of a sequential high-dimensional stratified sampling process in
an illustrative embodiment of the invention.
FIG. 4 is a flow diagram of an optimal high-dimensional stratified sampling process in an
illustrative embodiment of the invention.
FIG. 5 shows a simple example of a set of connection records in a network traffic
application in which the high-dimensional stratified sampling processes of FIGS. 3 or 4 may be
applied.
FIG. 6 is a set of plots comparing estimation error as a function of sampling rate for
sequential and optimal high-dimensional stratified sampling with that of conventional random
sampling.
FIG. 7 shows multiple sets of plots each comparing estimation error as a function of the
number of sub-groups for sequential and optimal high-dimensional stratified sampling with that
of conventional random sampling.
Detailed Description of the Invention
The present invention will be illustrated herein in conjunction with exemplary
information processing systems, processing devices and high-dimensional stratified sampling
techniques. It should be understood, however, that the invention is not limited to use with the
particular types of systems, devices and techniques disclosed. For example, aspects of the
present invention can be implemented in a wide variety of other information processing system
configurations, using processing devices and process steps other than those described in
conjunction with the illustrative embodiments.
FIG. 1 shows an information processing system 100 comprising a controller 102 coupled
via a network 104 to a database system 105 that includes a plurality of servers 106-1, 106-2, . . .
106-N, also denoted Server 1, Server 2, . . . ServerN. Each of the servers 106 has an associated
database 108. These databases store records or other data objects that are accessed by the
controller 102 via the network 104. The controller 102 in this embodiment comprises a sampling
module 110 that is configured to implement one or more high-dimensional stratified sampling
techniques to be described in greater detail below. The sampling module 110 utilizes the highdimensional
stratified sampling technique(s) to process sets of records that are separated into
sub-groups that are not necessarily mutually exclusive. The records processed by the sampling
module 110 may be received from data sources 112 or retrieved from one or more of the
databases 108 of the database system 105. The resulting stratified samples may be stored by the
controller 102 in a sample database 114. Although shown in the figure as being separate from
the database system 105, system elements such as controller 102 and sample database 114 may
alternatively be implemented within the database system 105.
The controller 102 may comprise at least a portion of a computer or any other type of
processing device suitable for communicating with the database system 105 over network 104.
For example, the controller may comprise a portable or laptop computer, mobile telephone,
personal digital assistant (PDA), wireless email device, television set-top box (STB), or other
communication device.
The network 104 may comprise a wide area network such as the Internet, a metropolitan
area network, a local area network, a cable network, a telephone network, a satellite network, as
well as portions or combinations of these or other networks.
In other embodiments, the sampling module 110 may be implemented in one or more of
the servers 106 or their associated databases 108, or in a separate centralized controller coupled
to one or more of these elements. It is also possible to implement the sampling module in a
distributed manner with portions of the module being arranged in respective ones of the devices
102, 106 or 108 or subsets thereof.
The databases 108 need not be in any particular configuration, and the term "database" as
used herein is therefore intended to be construed broadly so as to encompass any number of
different arrangements of stored records.
Referring now to FIG. 2, one possible implementation of the controller 102 of the system
100 is shown. In this embodiment, the controller comprises a processor 200 coupled to a
memory 202, and further comprises network interface circuitry 204. The memory 202 is
assumed to store records 205 or portions thereof for processing by the sampling module 110.
The stored records 205 may be received from data sources 112 or retrieved from the database
system 105 over the network 104. The sampling module 110 of the controller 102 in this
implementation comprises a sub-group identification module 210, a sampling rate determination
module 212, a sampling decision module 214, an optimization module 215, and a set of counters
220 including counters 222 that count the number of records per sub-group and counters 224 that
count the number of samples per sub-group. The operation of these modules and counters will
be described in greater detail below in conjunction with FIGS. 3 and 4 .
The processor 200 may be implemented as a microprocessor, a microcontroller, an
application-specific integrated circuit (ASIC) or other type of processing device, as well as
portions or combinations of such devices. The memory 202 may comprise an electronic random
access memory (RAM), a read-only memory (ROM), a disk-based memory, or other type of
storage device, as well as portions or combinations of such devices. The processor and memory
may be used in storage and execution of one or more software programs for high-dimensional
stratified sampling, as well as for performing related operations, such as those associated with
storage and processing of records. The modules 210, 212, 214 and 215 may therefore be
implemented at least in part using such software programs. The memory 202 may be viewed as
an example of what is more generally referred to herein as a computer program product or still
more generally as a computer-readable storage medium that has executable program code
embodied therein. Other examples of computer-readable storage media may include disks or
other types of magnetic or optical media, in any combination.
The processor 200, memory 202 and interface circuitry 204 may comprise well-known
conventional circuitry suitably modified to operate in the manner described herein. Also, the
various modules shown in FIG. 2 may be viewed as examples of circuitry used to implement the
associated functionality. For example, portions of such circuitry may comprise matrix
multiplication circuitry or other types of arithmetic logic circuitry. Conventional aspects of such
circuitry are well known to those skilled in the art and therefore will not be described in detail
herein.
It is to be appreciated that an information processing system and associated controller as
disclosed herein may be implemented using components and modules other than those
specifically shown in the exemplary arrangements of FIGS. 1 and 2 .
The operation of the system 100 in illustrative embodiments will now be described with
reference to the flow diagrams of FIGS. 3 and 4 . These flow diagrams illustrate respective
sequential and optimal high-dimensional stratified sampling techniques. It will be assumed for
these embodiments that the sampling techniques are applied to a database that stores N records,
each with K attributes, where each attribute takes mk discrete values, 1< k
Documents
Application Documents
| # |
Name |
Date |
| 1 |
787-delnp-2013-Correspondence Others-(31-01-2013).pdf |
2013-01-31 |
| 2 |
SPECIFICATION.pdf |
2013-02-06 |
| 3 |
GPOA.pdf |
2013-02-06 |
| 4 |
FORM 5.pdf |
2013-02-06 |
| 5 |
FORM 3.pdf |
2013-02-06 |
| 6 |
FIGURES.pdf |
2013-02-06 |
| 7 |
787-DELNP-2013.pdf |
2013-03-01 |
| 8 |
787-delnp-2013-Form-3-(20-06-2013).pdf |
2013-06-20 |
| 9 |
787-delnp-2013-Correspondence-Others-(20-06-2013).pdf |
2013-06-20 |
| 10 |
787-DELNP-2013-Form-3-(26-02-2014).pdf |
2014-02-26 |
| 11 |
787-DELNP-2013-Correspondence-Others-(26-02-2014).pdf |
2014-02-26 |
| 12 |
787-DELNP-2013-Form-3-(22-07-2014).pdf |
2014-07-22 |
| 13 |
787-DELNP-2013-Correspondence-Others-(22-07-2014).pdf |
2014-07-22 |
| 14 |
787-DELNP-2013-Form 3-311014.pdf |
2014-11-27 |
| 15 |
787-DELNP-2013-Correspondence-311014.pdf |
2014-11-27 |
| 16 |
787-DELNP-2013-FER.pdf |
2018-11-20 |
| 17 |
787-DELNP-2013-AbandonedLetter.pdf |
2019-10-19 |
Search Strategy
| 1 |
searchstrat_19-11-2018.pdf |