Apparatus And Method For Detecting And Removing Outliers Using Sensitivity Score

Abstract: A method for detecting outliers is provided, the method comprising: receiving a digitized text corpus comprising a plurality of data points; identifying k clusters of the plurality of data points; sampling a data point among the plurality of data points as a first cluster center of the k clusters; determining a sampling probability of each of the remaining data points of the plurality of data points; sampling the next cluster center based on the sampling probability, and iterating the process of determining the sampling probability and the process of sampling the next cluster center until k cluster centers are sampled; generating a weightage for each of the k cluster centers; determining sensitivity scores of the data points belonging to each of the k cluster centers; and labeling a data point having a sensitivity score greater than a threshold value as an outlier and removing the outlier from the digitized text corpus. FIG. 1
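
The iterative center-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names `angular_distance` and `sample_centers` are hypothetical, the first center is drawn uniformly at random, and each remaining point is weighted by its angular distance to the nearest already-sampled center. Those last two choices are assumptions made only so the example is complete.

```python
import math
import random


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def sample_centers(points, k, rng):
    """Return the indices of k sampled cluster centers."""
    # Assumption: the first center is drawn uniformly at random.
    centers = [rng.randrange(len(points))]
    while len(centers) < k:
        remaining = [i for i in range(len(points)) if i not in centers]
        # Assumption: use the distance to the nearest center sampled so far.
        dists = [
            min(angular_distance(points[i], points[c]) for c in centers)
            for i in remaining
        ]
        total = sum(dists)
        if total == 0.0:
            # All remaining points coincide with a center: fall back to uniform.
            probs = [1.0 / len(remaining)] * len(remaining)
        else:
            # Normalize each distance by the sum of all distances.
            probs = [d / total for d in dists]
        centers.append(rng.choices(remaining, weights=probs, k=1)[0])
    return centers


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
centers = sample_centers(points, 3, random.Random(0))
```

Points far from every sampled center receive a larger probability, so the k centers tend to spread across the corpus rather than concentrate in one dense region.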

Patent Information

Application #:
Filing Date: 28 September 2018
Publication Number: 14/2020
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: bangalore@knspartners.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-02-14
Renewal Date:

Applicants

WIPRO LIMITED
Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAMESHWAR PRATAP YADAV
BF102, Second Floor, Janakpuri 110058, New Delhi

Specification

Claims: WE CLAIM:

1. An apparatus for detecting outliers of a digitized text corpus, the apparatus comprising:
a memory for storing instructions; and
a processor that, when executing the instructions performs a method, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X = \{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining sampling probability of each of remaining data points of the plurality of data points, wherein the remaining data points indicate a difference between the plurality of data points and the sampled data point and the sampling probability indicates a probability of each of the remaining data points to be sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability and iterating the process of determining sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to each of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier or a second parameter of the digitized text corpus by analyzing data points without the outlier.

2. The apparatus of claim 1, wherein the k clusters of the plurality of data points are identified by spherical k-means clustering algorithm applied to the plurality of data points.

3. The apparatus of claim 1, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.

4. The apparatus of claim 1, wherein the sampling probability is determined by calculating angular distance between the sampled data point and each of the remaining data points and normalizing the calculated angular distance with a sum of all distances between the sampled data point and the remaining data points.

5. The apparatus of claim 1, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.

6. The apparatus of claim 1, wherein the number of data points belonging to each of the k cluster centers is determined by assigning each of the plurality of data points to a nearest cluster center among the k cluster centers and determining distance between each of the plurality of data points and the corresponding nearest cluster center.

7. The apparatus of claim 1, wherein the threshold value is determined based on a number of outliers that a user decided to detect and remove from the digitized text corpus.

8. A method for detecting outliers of a digitized text corpus, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X = \{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining sampling probability of each of remaining data points of the plurality of data points, wherein the remaining data points indicate a difference between the plurality of data points and the sampled data point and the sampling probability indicates a probability of each of the remaining data points to be sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability and iterating the process of determining sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to each of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier, or a second parameter of the digitized text corpus by analyzing data points without the outlier.

9. The method of claim 8, wherein identifying k clusters of the plurality of data points further comprises:
applying, by the processor, spherical k-means clustering algorithm to the plurality of data points.

10. The method of claim 8, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.

11. The method of claim 8, wherein determining sampling probability of each of remaining data points of the plurality of data points further comprises:
calculating, by the processor, angular distance between the sampled data point and each of the remaining data points; and
normalizing, by the processor, the calculated angular distance with a sum of all distances between the sampled data point and the remaining data points.

12. The method of claim 8, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.

13. The method of claim 8, wherein generating weightage of each of the k cluster centers further comprises:
assigning, by the processor, each of the plurality of data points to a nearest cluster center among the k cluster centers; and
determining, by the processor, distance between each of the plurality of data points and the corresponding nearest cluster center.

14. The method of claim 8, wherein the threshold value is determined based on a number of outliers that a user decided to detect and remove from the digitized text corpus.
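
The weightage, scoring and removal steps recited in the claims can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the claims as reproduced here give no closed-form sensitivity score, so the formula below (a point's distance to its nearest center divided by the average distance, plus n divided by the weightage of its center) is a hypothetical stand-in, and `score_and_remove` is an illustrative name. The threshold is set implicitly by the number of outliers the user asks to remove.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def score_and_remove(points, centers, num_outliers):
    """Return (kept points, removed outliers, per-point scores)."""
    n = len(points)
    assign, dist = [], []
    for p in points:
        # Assign each point to its nearest cluster center.
        d = [angular_distance(p, c) for c in centers]
        j = min(range(len(centers)), key=d.__getitem__)
        assign.append(j)
        dist.append(d[j])
    # Weightage of a center: the number of points assigned to it.
    weights = [assign.count(j) for j in range(len(centers))]
    avg_cost = sum(dist) / n
    # Hypothetical score: relative cost plus a term that grows for small clusters.
    scores = [
        (dist[i] / avg_cost if avg_cost > 0 else 0.0) + n / weights[assign[i]]
        for i in range(n)
    ]
    # Removing the top num_outliers scores is equivalent to choosing a threshold
    # just below the num_outliers-th largest score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    outliers = set(ranked[:num_outliers])
    kept = [p for i, p in enumerate(points) if i not in outliers]
    removed = [points[i] for i in sorted(outliers)]
    return kept, removed, scores


points = [
    (1.0, 0.0), (1.0, 0.05), (1.0, -0.05), (0.98, 0.02),
    (0.0, 1.0), (0.05, 1.0), (-0.05, 1.0), (-1.0, -0.2),
]
centers = [(1.0, 0.0), (0.0, 1.0)]
kept, removed, scores = score_and_remove(points, centers, num_outliers=1)
```

In this toy input the point `(-1.0, -0.2)` lies far from both centers, so it receives the largest score and is the one removed.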

Dated this 28th day of September, 2018

R Ramya Rao
Of K&S Partners
Agent for the Applicant
IN/PA-1607
Description: FIELD
Apparatuses, methods and systems consistent with the present disclosure relate generally to detecting and removing outliers, and more particularly, to apparatuses, methods and systems that detect outliers from a text corpus using dynamically determined sensitivity score.

Documents

Application Documents

# Name Date
1 201841036828-STATEMENT OF UNDERTAKING (FORM 3) [28-09-2018(online)].pdf 2018-09-28
2 201841036828-REQUEST FOR EXAMINATION (FORM-18) [28-09-2018(online)].pdf 2018-09-28
3 201841036828-POWER OF AUTHORITY [28-09-2018(online)].pdf 2018-09-28
4 201841036828-FORM 18 [28-09-2018(online)].pdf 2018-09-28
5 201841036828-FORM 1 [28-09-2018(online)].pdf 2018-09-28
6 201841036828-DRAWINGS [28-09-2018(online)].pdf 2018-09-28
7 201841036828-DECLARATION OF INVENTORSHIP (FORM 5) [28-09-2018(online)].pdf 2018-09-28
8 201841036828-COMPLETE SPECIFICATION [28-09-2018(online)].pdf 2018-09-28
9 201841036828-Request Letter-Correspondence [09-10-2018(online)].pdf 2018-10-09
10 201841036828-Power of Attorney [09-10-2018(online)].pdf 2018-10-09
11 201841036828-Form 1 (Submitted on date of filing) [09-10-2018(online)].pdf 2018-10-09
12 201841036828-Proof of Right (MANDATORY) [20-12-2018(online)].pdf 2018-12-20
13 Correspondence by Agent_Form30,Form1_31-12-2018.pdf 2018-12-31
14 201841036828-RELEVANT DOCUMENTS [04-10-2021(online)].pdf 2021-10-04
15 201841036828-PETITION UNDER RULE 137 [04-10-2021(online)].pdf 2021-10-04
16 201841036828-OTHERS [04-10-2021(online)].pdf 2021-10-04
17 201841036828-Information under section 8(2) [04-10-2021(online)].pdf 2021-10-04
18 201841036828-FORM 3 [04-10-2021(online)].pdf 2021-10-04
19 201841036828-FER_SER_REPLY [04-10-2021(online)].pdf 2021-10-04
20 201841036828-DRAWING [04-10-2021(online)].pdf 2021-10-04
21 201841036828-CORRESPONDENCE [04-10-2021(online)].pdf 2021-10-04
22 201841036828-COMPLETE SPECIFICATION [04-10-2021(online)].pdf 2021-10-04
23 201841036828-CLAIMS [04-10-2021(online)].pdf 2021-10-04
24 201841036828-FER.pdf 2021-10-17
25 201841036828-PatentCertificate14-02-2024.pdf 2024-02-14
26 201841036828-IntimationOfGrant14-02-2024.pdf 2024-02-14
27 201841036828-PROOF OF ALTERATION [02-05-2024(online)].pdf 2024-05-02

Search Strategy

1 2021-04-0616-24-42E_06-04-2021.pdf

ERegister / Renewals

3rd: 02 May 2024

From 28/09/2020 - To 28/09/2021

4th: 02 May 2024

From 28/09/2021 - To 28/09/2022

5th: 02 May 2024

From 28/09/2022 - To 28/09/2023

6th: 02 May 2024

From 28/09/2023 - To 28/09/2024

7th: 02 May 2024

From 28/09/2024 - To 28/09/2025

8th: 25 Sep 2025

From 28/09/2025 - To 28/09/2026