
Foundation Model AI Safety Guardrails

Abstract: DYNAMIC THREAT MITIGATION OF GENERATIVE ARTIFICIAL INTELLIGENCE MODELS. The disclosure relates to a method and system for dynamically mitigating threats of generative Artificial Intelligence (AI) models. Conventional systems often suffer from inefficiencies due to sequentially applying threat detection checks, leading to unnecessary preprocessing and increased computational demands. Additionally, such systems typically focus only on input data, neglecting potential threats in outputs. The disclosed system and method address these drawbacks by employing a hierarchical structure of macro and nano classifiers. The system utilizes macro classifiers for broad initial threat categorization, followed by specialized nano classifiers for detailed analysis of specific threat subtypes, thereby optimizing processing time and computational resources. The system operates in real time, applying predefined moderation rules to both input and output data to ensure comprehensive threat mitigation. Additionally, continuous telemetry data updates refine the nano classifiers and threat identification mechanisms, maintaining high accuracy and adaptability. The disclosed method enhances the safety, efficiency, and reliability of generative AI models. [To be published with FIG. 5]
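The two-tier flow the abstract describes can be sketched as follows. This is a minimal illustration only: the keyword rules, threat-type names, sub-type names, and threshold are hypothetical stand-ins, not the patented classifiers.

```python
# Illustrative sketch: macro classifiers give one coarse score per threat type,
# and nano classifiers run only for the types the macro pass flags, avoiding
# the cost of running every detailed check on every request.

MACRO_THRESHOLD = 0.5  # hypothetical cut-off for the first threat probability score

def macro_classify(text):
    """Broad first pass: one probability per threat type (stand-in rules)."""
    lowered = text.lower()
    return {
        "pii": 0.9 if "ssn" in lowered else 0.1,
        "toxicity": 0.8 if "hateful" in lowered else 0.1,
    }

def nano_classify(text, threat_type):
    """Detailed second pass: detect sub-types within one flagged threat type."""
    subtype_markers = {
        "pii": {"ssn_leak": "ssn", "email_leak": "@"},
        "toxicity": {"hate_speech": "hateful"},
    }
    lowered = text.lower()
    return [sub for sub, marker in subtype_markers.get(threat_type, {}).items()
            if marker in lowered]

def screen(text):
    """Run nano classifiers only for macro-flagged threat types."""
    flagged = [t for t, score in macro_classify(text).items()
               if score >= MACRO_THRESHOLD]
    return {t: nano_classify(text, t) for t in flagged}

print(screen("my ssn is 123-45-6789"))  # {'pii': ['ssn_leak']}
```

Benign text skips the nano stage entirely, which is the efficiency claim the abstract makes against sequential pipelines.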


Patent Information

Application #
Filing Date
10 August 2023
Publication Number
07/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

INFOSYS LIMITED
44, Infosys Avenue, Electronics City, Hosur Road, Bangalore, 560100, Karnataka, India

Inventors

1. Syed Ahmed
#4, 11th main Friends colony, S.T.Bed, Koramangala 4th Block Bangalore – 560047, Karnataka, India
2. Ritarshi Chakraborty
264 Ashoke Road Flat – 1B, Gangulybagan, Kolkata 700084, West Bengal, India

Specification

We Claim:
1. A processor implemented method (500) for dynamically mitigating threats of a generative
Artificial Intelligence (AI) model, the method comprising:
receiving (502), via one or more hardware processors, data associated with the generative
AI model at a user interface (UI) of a computing device, the data associated with one or more
attributes, wherein the data comprises at least one of an input data and an output data;
applying (504), via the one or more hardware processors, one or more macro classifiers to
the data to determine, in real time, presence of one or more types of threats from amongst a
plurality of types of threats, wherein each macro classifier of the one or more macro classifiers is
capable of computing a first threat probability score associated with a type of threat from amongst
the plurality of types of threats based on the one or more attributes associated with the data;
dynamically configuring (506), via the one or more hardware processors, a threat detection
model comprising one or more nano classifiers to detect one or more sub-types of threats associated
with the one or more types of threats in the data;
selectively moderating (508) the data, via the one or more hardware processors, based on
one or more predefined rules corresponding to each of the one or more sub-types of threats to obtain
a moderated data; and
validating (510), via the one or more hardware processors, the moderated data to determine
one of presence and absence of the one or more sub-types of threats in the moderated data.
2. The processor implemented method of claim 1, wherein each attribute from amongst the
one or more attributes comprises one of nature of the input data, nature of the output data, usage
history and context associated with the data, and similarity with past violations and threats.
3. The processor implemented method of claim 1, wherein dynamically configuring the threat
detection model comprises:
selecting, from a database, one or more nano classifiers from amongst a plurality of nano classifiers
selectively trained to detect the one or more sub-types of threats in the data;
computing, for each of the one or more sub-types of threats, a second threat probability
score by the one or more nano classifiers; and
comparing, for each of the one or more sub-types of the threats, the second threat
probability score with a predefined threshold value of the second threat probability score to detect
the presence of the one or more sub-types of the threats in the data.
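The threshold comparison in claim 3 can be sketched in a few lines; the sub-type names and the 0.7 threshold below are hypothetical, not values from the specification.

```python
def detect_subtypes(second_scores, threshold=0.7):
    """Flag each sub-type whose second threat probability score meets the
    predefined threshold; a score below the threshold means 'not present'."""
    return {sub: score >= threshold for sub, score in second_scores.items()}

print(detect_subtypes({"email_leak": 0.92, "phone_leak": 0.30}))
# {'email_leak': True, 'phone_leak': False}
```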
4. The processor implemented method of claim 1, wherein a type of threat from amongst the
plurality of types of threats is one of a prompt injection threat, a jailbreak threat, a profanity threat,
a toxicity threat, a Personal Identifiable Information (PII) leakage threat, an Intellectual Property
(IP) violation threat, an organization policy and role-based threat, a hallucination threat, security
attacks, and sensitive information leakage threat.
5. The processor implemented method of claim 1, wherein the one or more nano classifiers
comprises at least one of one or more Machine Learning (ML) models, one or more deep learning
models, one or more transfer learning models, one or more rule-based repositories, one or more
datasets and dictionaries, one or more custom and finetuned models, one or more knowledge
databases, one or more Retrieval Augmented Generation (RAG) models, and a reminder generation
model.
6. The processor implemented method of claim 1, wherein moderating the data comprises
performing at least one of filtering and rephrasing at least a portion of the data.
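The "filtering and rephrasing" of claim 6 could look like the following sketch; the SSN-shaped pattern, the `[REDACTED]` token, and the rephrasing rule are illustrative moderation rules, not rules taken from the specification.

```python
import re

def moderate_portion(text):
    """Filter: mask SSN-shaped numbers. Rephrase: soften an imperative.
    Both rules are hypothetical examples of predefined moderation rules."""
    filtered = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", text)
    return filtered.replace("you must", "you may want to")

print(moderate_portion("you must send SSN 123-45-6789"))
# you may want to send SSN [REDACTED]
```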
7. The processor implemented method of claim 1, wherein validating the moderated data
comprises iteratively:
computing, for each of the one or more sub-types of threats in the moderated data, the
second threat probability score;
comparing the second threat probability score with the predefined threshold value of the
second threat probability score; and
moderating the moderated data, until the second threat probability score is determined to
be less than the predefined threshold value of the second threat probability score, wherein the
second threat probability score being less than the predefined threshold value is indicative of
absence of the one or more sub-types of threats in the moderated data.
8. The processor implemented method of claim 7, further comprising:
restricting moderating the moderated data upon determining the second threat probability
score greater than or equal to the predefined threshold value of the second threat probability score
for a predefined number of iterations; and
rendering details of restricting the moderated data on the UI.
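Claims 7 and 8 together describe a bounded moderate-and-revalidate loop: re-moderate while the sub-type score stays at or above the threshold, and restrict after a predefined number of failed iterations. A sketch, with hypothetical scoring and moderation functions:

```python
def validate_loop(data, score_fn, moderate_fn, threshold=0.7, max_iters=3):
    """Re-moderate while the second threat probability score stays at or
    above the threshold (claim 7); restrict after max_iters attempts (claim 8)."""
    for _ in range(max_iters):
        if score_fn(data) < threshold:
            return data, "clean"      # sub-type absent
        data = moderate_fn(data)      # moderate the moderated data again
    status = "clean" if score_fn(data) < threshold else "restricted"
    return data, status               # 'restricted' details would be rendered on the UI

# Stand-in scoring/moderation: any text containing 'leak' is threatening.
score = lambda s: 0.9 if "leak" in s else 0.1
scrub = lambda s: s.replace("leak", "[removed]")
print(validate_loop("possible leak here", score, scrub))
# ('possible [removed] here', 'clean')
```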
9. The processor implemented method of claim 1, further comprising:
tracking a telemetry status of the data; and
updating the one or more nano classifiers, the threat detection model, and one or more
policies associated with an entity implementing the generative AI model based on the telemetry
status of the data.
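The telemetry feedback of claim 9 can be sketched as a simple update step; the per-sub-type miss counts, the thresholds, and the 0.05 adjustment are hypothetical, chosen only to show classifier parameters being refined from telemetry.

```python
def apply_telemetry(telemetry, thresholds):
    """Fold telemetry feedback (hypothetical per-sub-type missed-detection
    counts) back into nano-classifier thresholds: lower a threshold slightly
    for each miss so that sub-type is flagged more readily next time."""
    for sub, misses in telemetry.items():
        if misses > 0 and sub in thresholds:
            thresholds[sub] = round(max(0.1, thresholds[sub] - 0.05 * misses), 2)
    return thresholds

print(apply_telemetry({"ssn_leak": 2}, {"ssn_leak": 0.7}))  # {'ssn_leak': 0.6}
```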
10. A system (200) for dynamically mitigating threats of a generative Artificial Intelligence
(AI) model, the system comprising:
one or more hardware processors (202); and
a memory (204) communicatively coupled to the one or more hardware processors,
wherein the memory stores processor-executable instructions, which, on execution, cause the one
or more hardware processors to:
receive data associated with a generative AI model at a user interface (UI) of a
computing device, the data associated with one or more attributes;
apply one or more macro classifiers to the data to determine, in real time, presence
of one or more types of threats from amongst a plurality of types of threats, wherein each macro
classifier of the one or more macro classifiers is capable of computing a first threat probability
score associated with a type of threat from amongst the plurality of types of threats based on the
one or more attributes associated with the data;
dynamically configure a threat detection model comprising one or more nano
classifiers to detect one or more sub-types of threats associated with the one or more types of threats
in the data;
selectively moderate the data based on one or more predefined rules corresponding
to each of the one or more sub-types of threats to obtain a moderated data; and
validate the moderated data to determine one of presence and absence of the one or
more sub-types of threats in the moderated data.

Documents

Application Documents

# Name Date
1 202341053821-STATEMENT OF UNDERTAKING (FORM 3) [10-08-2023(online)].pdf 2023-08-10
2 202341053821-PROVISIONAL SPECIFICATION [10-08-2023(online)].pdf 2023-08-10
3 202341053821-PROOF OF RIGHT [10-08-2023(online)].pdf 2023-08-10
4 202341053821-POWER OF AUTHORITY [10-08-2023(online)].pdf 2023-08-10
5 202341053821-FORM 1 [10-08-2023(online)].pdf 2023-08-10
6 202341053821-DRAWINGS [10-08-2023(online)].pdf 2023-08-10
7 202341053821-DECLARATION OF INVENTORSHIP (FORM 5) [10-08-2023(online)].pdf 2023-08-10
8 202341053821-Power of Attorney [17-05-2024(online)].pdf 2024-05-17
9 202341053821-Form 1 (Submitted on date of filing) [17-05-2024(online)].pdf 2024-05-17
10 202341053821-Covering Letter [17-05-2024(online)].pdf 2024-05-17
11 202341053821-FORM 18 [08-08-2024(online)].pdf 2024-08-08
12 202341053821-DRAWING [08-08-2024(online)].pdf 2024-08-08
13 202341053821-CORRESPONDENCE-OTHERS [08-08-2024(online)].pdf 2024-08-08
14 202341053821-COMPLETE SPECIFICATION [08-08-2024(online)].pdf 2024-08-08