Apparatus And Method For Detecting And Removing Outliers Using Sensitivity Score

Abstract: A method for detecting outliers is provided, the method comprising: receiving a digitized text corpus comprising a plurality of data points; identifying k clusters of the plurality of data points; sampling a data point among the plurality of data points as a first cluster center of the k clusters; determining a sampling probability of each of the remaining data points of the plurality of data points; sampling the next cluster center based on the sampling probability, and iterating the determining of the sampling probability and the sampling of the next cluster center until k cluster centers are sampled; generating a weightage for each of the k cluster centers; determining sensitivity scores of the data points belonging to each of the k cluster centers; and labeling a data point having a sensitivity score greater than a threshold value as an outlier and removing the outlier from the digitized text corpus. FIG. 1

Patent Information

Application #:
Filing Date: 28 September 2018
Publication Number: 14/2020
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: bangalore@knspartners.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-02-14
Renewal Date:

Applicants

WIPRO LIMITED

Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAMESHWAR PRATAP YADAV

BF102, Second Floor, Janakpuri 110058, New Delhi

Specification

Claims: WE CLAIM:

1. An apparatus for detecting outliers of a digitized text corpus, the apparatus comprising:
a memory for storing instructions; and
a processor that, when executing the instructions, performs a method, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X = \{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining sampling probability of each of remaining data points of the plurality of data points, wherein the remaining data points indicate a difference between the plurality of data points and the sampled data point and the sampling probability indicates a probability of each of the remaining data points to be sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability and iterating the process of determining sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to each of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier or a second parameter of the digitized text corpus by analyzing data points without the outlier.
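
The weightage and sensitivity-score steps of claim 1 can be sketched as follows. The claims do not state a scoring formula, so the three-term score below is an assumption patterned after bounds used in coreset constructions: a point's own distance relative to the mean distance, the mean distance of its cluster, and a term that grows as the cluster weightage shrinks. The function name, the constants, the `alpha` parameter, and the use of unsquared distances are all hypothetical.

```python
def sensitivity_scores(labels, distances, weights, alpha=1.0):
    """Score each point from its cluster's weightage (illustrative formula only).

    labels[i]    index of the cluster center nearest to point i
    distances[i] distance from point i to that center
    weights[j]   number of points assigned to center j (the "weightage")
    """
    n = len(distances)
    # Average cost contribution over all points; fall back to 1.0 when every
    # distance is zero so the ratios below become 0 instead of dividing by zero.
    mean_cost = (sum(distances) / n) or 1.0

    # Total distance accumulated inside each cluster.
    cluster_cost = [0.0] * len(weights)
    for j, d in zip(labels, distances):
        cluster_cost[j] += d

    scores = []
    for j, d in zip(labels, distances):
        own_term = alpha * d / mean_cost
        cluster_term = 2.0 * alpha * cluster_cost[j] / (weights[j] * mean_cost)
        rarity_term = 4.0 * n / weights[j]  # small clusters score higher
        scores.append(own_term + cluster_term + rarity_term)
    return scores
```

Under this assumed formula, a point that sits far from its center in a sparsely populated cluster receives the largest score, which matches the intuition that such a point is a likely outlier.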

2. The apparatus of claim 1, wherein the k clusters of the plurality of data points are identified by a spherical k-means clustering algorithm applied to the plurality of data points.
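
A minimal sketch of spherical k-means, given for orientation only: points and centers are scaled to unit length, each point joins the center with the highest cosine similarity, and each center becomes the normalized sum of its members. The function names, the fixed iteration count, and the caller-supplied initial centers are assumptions; the sketch also assumes non-zero vectors whose per-cluster sums do not cancel to zero (for example, non-negative term-weight vectors).

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def spherical_kmeans(points, init_centers, iterations=10):
    """Cluster points by direction; returns (unit-length centers, labels)."""
    unit_points = [normalize(p) for p in points]
    centers = [normalize(c) for c in init_centers]
    labels = [0] * len(unit_points)
    for _ in range(iterations):
        # Assignment step: for unit vectors the dot product is the cosine
        # similarity, so pick the center with the largest dot product.
        labels = [
            max(range(len(centers)),
                key=lambda j: sum(a * b for a, b in zip(p, centers[j])))
            for p in unit_points
        ]
        # Update step: normalized sum of the members; an empty cluster
        # keeps its previous center.
        for j in range(len(centers)):
            members = [p for p, lab in zip(unit_points, labels) if lab == j]
            if members:
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return centers, labels
```

The returned labels come from the last assignment step, so they refer to the centers as they stood before the final update.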

3. The apparatus of claim 1, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.

4. The apparatus of claim 1, wherein the sampling probability is determined by calculating angular distance between the sampled data point and each of the remaining data points and normalizing the calculated angular distance with a sum of all distances between the sampled data point and the remaining data points.
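
The normalization in claim 4 can be illustrated with a short sketch. Two points are assumptions rather than statements of the claims: the first center is drawn uniformly at random, and once more than one center has been sampled, each point's distance is taken to its nearest sampled center. All function names are hypothetical, and the vectors are assumed to be non-zero with at least one point distinct in direction from the sampled centers.

```python
import math
import random

def angular_distance(u, v):
    """Angle between two non-zero vectors, scaled to the range [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cosine) / math.pi

def sampling_probabilities(points, centers):
    """Distance of each point to its nearest center, divided by the sum of distances."""
    dists = [min(angular_distance(p, c) for c in centers) for p in points]
    total = sum(dists)
    return [d / total for d in dists]

def sample_centers(points, k, rng):
    """Draw k centers: the first uniformly, the rest with distance-proportional probability."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        probs = sampling_probabilities(points, centers)
        centers.append(rng.choices(points, weights=probs, k=1)[0])
    return centers
```

Because an already sampled center has a distance of (essentially) zero to itself, its probability of being drawn again is negligible, so points far from every current center are favored as the next center.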

5. The apparatus of claim 1, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.
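
One way to write the quantity described in claim 5 in symbols is given below. The notation is assumed here and not taken from the specification: $d$ denotes the distance measure, $Q$ ranges over candidate sets of k cluster centers, and $d(x, Q)$ is the cost contribution of a data point $x$ with respect to $Q$.

```latex
\sigma(x) \;=\; \max_{Q}\;
  \frac{d(x, Q)}{\dfrac{1}{n}\sum_{i=1}^{n} d(x_i, Q)},
\qquad
d(x, Q) \;=\; \min_{q \in Q} d(x, q)
```

Read this way, a data point has high sensitivity when some choice of centers makes its cost contribution large relative to the average cost contribution over all n data points.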

6. The apparatus of claim 1, wherein the number of data points belonging to each of the k cluster centers is determined by assigning each of the plurality of data points to a nearest cluster center among the k cluster centers and determining distance between each of the plurality of data points and the corresponding nearest cluster center.
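
The assignment and counting described in claim 6 can be sketched as follows. The use of angular distance here mirrors claim 4 and is an assumption for this step; the function names and the returned tuple layout are hypothetical.

```python
import math

def angular_distance(u, v):
    """Angle between two non-zero vectors, scaled to the range [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cosine) / math.pi

def assign_and_weigh(points, centers):
    """Assign each point to its nearest center and count points per center.

    Returns (labels, distances, weights), where weights[j] is the number of
    points whose nearest center is centers[j].
    """
    labels, distances = [], []
    weights = [0] * len(centers)
    for p in points:
        d_all = [angular_distance(p, c) for c in centers]
        nearest = min(range(len(centers)), key=d_all.__getitem__)
        labels.append(nearest)
        distances.append(d_all[nearest])
        weights[nearest] += 1
    return labels, distances, weights
```

For example, with centers `[[1.0, 0.0], [0.0, 1.0]]` and points `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]`, the first center receives a weightage of 2 and the second a weightage of 1.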

7. The apparatus of claim 1, wherein the threshold value is determined based on a number of outliers that a user decided to detect and remove from the digitized text corpus.
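
A sketch of how a threshold might be derived from a user-chosen number of outliers, as in claim 7. The rule used here is an assumption: the threshold is set to the (m+1)-th largest score, so that at most m points score strictly above it. The function names are hypothetical.

```python
def threshold_for(scores, num_outliers):
    """Return the (num_outliers + 1)-th largest score as the threshold."""
    if not 0 <= num_outliers < len(scores):
        raise ValueError("num_outliers must be in [0, len(scores))")
    return sorted(scores, reverse=True)[num_outliers]

def split_outliers(points, scores, num_outliers):
    """Label points scoring above the threshold as outliers and remove them.

    Returns (kept, outliers, threshold).
    """
    threshold = threshold_for(scores, num_outliers)
    kept = [p for p, s in zip(points, scores) if s <= threshold]
    outliers = [p for p, s in zip(points, scores) if s > threshold]
    return kept, outliers, threshold
```

When several points tie at the threshold score, fewer than the requested number of points are removed, because only scores strictly greater than the threshold are labeled as outliers.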

8. A method for detecting outliers of a digitized text corpus, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X = \{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining sampling probability of each of remaining data points of the plurality of data points, wherein the remaining data points indicate a difference between the plurality of data points and the sampled data point and the sampling probability indicates a probability of each of the remaining data points to be sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability and iterating the process of determining sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to each of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier, or a second parameter of the digitized text corpus by analyzing data points without the outlier.

9. The method of claim 8, wherein identifying k clusters of the plurality of data points further comprises:
applying, by the processor, a spherical k-means clustering algorithm to the plurality of data points.

10. The method of claim 8, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.

11. The method of claim 8, wherein determining sampling probability of each of remaining data points of the plurality of data points further comprises:
calculating, by the processor, angular distance between the sampled data point and each of the remaining data points; and
normalizing, by the processor, the calculated angular distance with a sum of all distances between the sampled data point and the remaining data points.

12. The method of claim 8, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.

13. The method of claim 8, wherein generating weightage of each of the k cluster centers further comprises:
assigning, by the processor, each of the plurality of data points to a nearest cluster center among the k cluster centers; and
determining, by the processor, distance between each of the plurality of data points and the corresponding nearest cluster center.

14. The method of claim 8, wherein the threshold value is determined based on a number of outliers that a user decided to detect and remove from the digitized text corpus.

Dated this 28th day of September, 2018

R Ramya Rao
Of K&S Partners
Agent for the Applicant
IN/PA-1607
Description: FIELD
Apparatuses, methods and systems consistent with the present disclosure relate generally to detecting and removing outliers, and more particularly, to apparatuses, methods and systems that detect outliers from a text corpus using a dynamically determined sensitivity score.

Documents

Application Documents

#	Name	Date
1	201841036828-STATEMENT OF UNDERTAKING (FORM 3) [28-09-2018(online)].pdf	2018-09-28
2	201841036828-REQUEST FOR EXAMINATION (FORM-18) [28-09-2018(online)].pdf	2018-09-28
3	201841036828-POWER OF AUTHORITY [28-09-2018(online)].pdf	2018-09-28
4	201841036828-FORM 18 [28-09-2018(online)].pdf	2018-09-28
5	201841036828-FORM 1 [28-09-2018(online)].pdf	2018-09-28
6	201841036828-DRAWINGS [28-09-2018(online)].pdf	2018-09-28
7	201841036828-DECLARATION OF INVENTORSHIP (FORM 5) [28-09-2018(online)].pdf	2018-09-28
8	201841036828-COMPLETE SPECIFICATION [28-09-2018(online)].pdf	2018-09-28
9	201841036828-Request Letter-Correspondence [09-10-2018(online)].pdf	2018-10-09
10	201841036828-Power of Attorney [09-10-2018(online)].pdf	2018-10-09
11	201841036828-Form 1 (Submitted on date of filing) [09-10-2018(online)].pdf	2018-10-09
12	201841036828-Proof of Right (MANDATORY) [20-12-2018(online)].pdf	2018-12-20
13	Correspondence by Agent_Form30,Form1_31-12-2018.pdf	2018-12-31
14	201841036828-RELEVANT DOCUMENTS [04-10-2021(online)].pdf	2021-10-04
15	201841036828-PETITION UNDER RULE 137 [04-10-2021(online)].pdf	2021-10-04
16	201841036828-OTHERS [04-10-2021(online)].pdf	2021-10-04
17	201841036828-Information under section 8(2) [04-10-2021(online)].pdf	2021-10-04
18	201841036828-FORM 3 [04-10-2021(online)].pdf	2021-10-04
19	201841036828-FER_SER_REPLY [04-10-2021(online)].pdf	2021-10-04
20	201841036828-DRAWING [04-10-2021(online)].pdf	2021-10-04
21	201841036828-CORRESPONDENCE [04-10-2021(online)].pdf	2021-10-04
22	201841036828-COMPLETE SPECIFICATION [04-10-2021(online)].pdf	2021-10-04
23	201841036828-CLAIMS [04-10-2021(online)].pdf	2021-10-04
24	201841036828-FER.pdf	2021-10-17
25	201841036828-PatentCertificate14-02-2024.pdf	2024-02-14
26	201841036828-IntimationOfGrant14-02-2024.pdf	2024-02-14
27	201841036828-PROOF OF ALTERATION [02-05-2024(online)].pdf	2024-05-02

Search Strategy

1	2021-04-0616-24-42E_06-04-2021.pdf