
Method And System For Compressing And Tuning Large Language Models

Abstract: A method (300) and a system (100) of compressing and tuning large language models (LLMs) are disclosed. A processor (104) receives an LLM, a pruning ratio, an initial rank, and a set of target layers from a plurality of layers of the LLM. A dependency-wise pruning of the LLM is performed based on the pruning ratio. A rank-based factorization of the LLM is performed based on the initial rank to generate factorized weights. A pruned LLM is determined based on the dependency-wise pruning. The pruned LLM is updated by injecting one or more additional layers to one or more corresponding layers of the pruned LLM to generate a compressed LLM. The compressed LLM is fine-tuned for a specific domain or for a specific task by fine-tuning the factorized weights for the additional layers of the compressed LLM based on domain-specific training data or task-specific training data, respectively. FIG. 1


Patent Information

Application #: 202441022184
Filing Date: 22 March 2024
Publication Number: 39/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

L&T TECHNOLOGY SERVICES LIMITED
DLF IT SEZ Park, 2nd Floor – Block 3, 1/124, Mount Poonamallee Road, Ramapuram, Chennai - 600 089, Tamil Nadu, India

Inventors

1. SUDHIR BHADAURIA
25, Madhav Park 3, Vastral Road, Pranami Nagar, Ahmedabad, Gujarat, India - 382418
2. VIKRAM SUBRAMANI
II/21, Melkottai, Guruvinayanapalli, Bargur, Krishnagiri, Tamil Nadu, India - 635120

Specification

Description: PLEASE REFER THE ATTACHMENT; Claims: PLEASE REFER THE ATTACHMENT

WE CLAIM:
1. A method (400) of compressing and tuning a large language model (LLM), the method
(400) comprising:
receiving (402), via a model compression and tuning device (102), an LLM, a pruning
ratio, an initial rank, and a set of target layers from a plurality of layers of the LLM;
performing (404), via the model compression and tuning device (102), a dependency-wise pruning of the LLM based on the pruning ratio to generate a pruned LLM;
performing (412), via the model compression and tuning device (102), a rank-based
factorization of the LLM based on the initial rank to generate factorized weights for each of
the set of target layers of the LLM; and
updating (424), via the model compression and tuning device, the pruned LLM by
injecting one or more additional layers to one or more corresponding layers of the pruned LLM
to generate a compressed LLM,
wherein the one or more additional layers are based on the factorized weights
for each of the set of target layers of the LLM.
2. The method (400) of claim 1, comprising:
fine-tuning (434), via a model compression and tuning device, the compressed LLM
for a specific domain or for a specific task by fine-tuning the factorized weights for the one or
more additional layers of the compressed LLM based on domain-specific training data or task-specific training data, respectively.
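
For illustration only (not part of the claims), the following minimal Python/PyTorch sketch shows one way the fine-tuning of claim 2 could be realized: the pruned base model stays frozen and only the injected factorized weights are updated on domain- or task-specific data. The parameter-name convention (".A"/".B"), the loss, and the training loop are assumptions.

import torch

def finetune_factorized_weights(compressed_llm, dataloader, lr=1e-4, steps=100):
    # Freeze everything, then re-enable gradients only for the injected factorized weights.
    for name, p in compressed_llm.named_parameters():
        p.requires_grad = name.endswith((".A", ".B"))  # naming convention assumed here

    trainable = [p for p in compressed_llm.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    # Domain-specific or task-specific training data is assumed to arrive as (inputs, labels) batches.
    for _, (inputs, labels) in zip(range(steps), dataloader):
        loss = torch.nn.functional.cross_entropy(compressed_llm(inputs), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
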
3. The method (400) of claim 1, wherein performing the dependency-wise pruning comprises:
grouping (406), via the model compression and tuning device, dependent layers from
the plurality of layers of the LLM, based on one or more parameters, into a set of groups;
determining (408), via the model compression and tuning device, a similarity between
each of the set of groups based on a cosine distance among them; and
determining (410), via the model compression and tuning device, a number of
connections to be pruned from each of the set of groups based on the similarity and the pruning
ratio.
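
For illustration only, a minimal sketch of the budget-allocation idea in claim 3 follows, assuming each group of dependent layers is summarized by a flattened weight vector and that similarity is measured against the centroid of all groups via cosine distance; the grouping rule and the allocation formula are assumptions, not taken from the specification.

import torch
import torch.nn.functional as F

def allocate_prune_counts(group_weights, pruning_ratio):
    # group_weights: dict mapping group name -> flattened weight tensor of that group.
    names = list(group_weights)
    n = min(w.numel() for w in group_weights.values())
    vecs = torch.stack([group_weights[k].flatten()[:n] for k in names])

    # Similarity of each group to the centroid of all groups (cosine distance = 1 - cosine).
    centroid = vecs.mean(dim=0, keepdim=True)
    similarity = F.cosine_similarity(vecs, centroid, dim=1).clamp(min=0.0)

    # Groups that look more redundant (higher similarity) absorb a larger share
    # of the global pruning budget implied by the pruning ratio.
    share = (similarity + 1e-8) / (similarity + 1e-8).sum()
    budget = int(pruning_ratio * sum(w.numel() for w in group_weights.values()))
    return {name: int(share[i] * budget) for i, name in enumerate(names)}

# Example: two groups of dependent layers, global pruning ratio of 30%.
groups = {"attn_block_0": torch.randn(4096), "mlp_block_0": torch.randn(8192)}
print(allocate_prune_counts(groups, pruning_ratio=0.3))
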
4. The method (400) of claim 1, wherein performing the rank-based factorization comprises:
applying (414), via the model compression and tuning device, a singular value
decomposition on each of the set of target layers to generate singular value decomposition matrices (SVDMs) for each of the set of target layers, wherein the SVDMs comprise initial
factorized weights for a given layer;
determining (416), via the model compression and tuning device, a rank for each of the
set of target layers based on application of a pre-defined algorithm on singular values from the
corresponding SVDMs, wherein the singular values are arranged in a ranked order; and
normalizing (420), via the model compression and tuning device, the rank for each of
the set of target layers based on the initial rank to determine the factorized weights for each of
the set of target layers of the LLM.
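
For illustration only, the following sketch shows one possible realization of claim 4 in Python/PyTorch, where the "pre-defined algorithm" is assumed to be an energy-retention threshold on the ranked singular values, and normalization is assumed to cap each layer's rank at the initial rank.

import torch

def rank_factorize(weight, initial_rank, energy=0.95):
    # Step 414: singular value decomposition; S is returned in descending (ranked) order.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)

    # Step 416: pick the smallest rank retaining `energy` of the spectral energy (assumed rule).
    cum = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int((cum < energy).sum().item()) + 1

    # Step 420: normalize the per-layer rank against the user-supplied initial rank (assumed rule).
    rank = min(rank, initial_rank)

    # Factorized weights: weight ≈ A @ B, with A of shape (out, rank) and B of shape (rank, in).
    A = U[:, :rank] * S[:rank]
    B = Vh[:rank, :]
    return A, B

W = torch.randn(768, 768)
A, B = rank_factorize(W, initial_rank=64)
print(A.shape, B.shape)  # torch.Size([768, 64]) torch.Size([64, 768])
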
5. The method (400) of claim 4, wherein performing the rank-based factorization comprises:
down-sampling (422), via the model compression and tuning device, the factorized
weights for each of the set of target layers of the LLM based on the pruning ratio to compress
the factorized weights for each of the set of target layers of the LLM.
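
For illustration only, a small sketch of the down-sampling step in claim 5, assuming the rank dimension of the factorized weights is truncated in proportion to the pruning ratio, keeping the leading components (those associated with the largest singular values); the proportional rule is an assumption.

import torch

def downsample_factors(A, B, pruning_ratio):
    # Keep only the leading (1 - pruning_ratio) fraction of the rank dimension.
    kept = max(1, int(A.shape[1] * (1.0 - pruning_ratio)))
    return A[:, :kept], B[:kept, :]

A, B = torch.randn(768, 64), torch.randn(64, 768)
A_ds, B_ds = downsample_factors(A, B, pruning_ratio=0.3)
print(A_ds.shape, B_ds.shape)  # rank 64 -> 44
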
6. The method (400) of claim 1, wherein updating the pruned LLM comprises:
generating (428), via the model compression and tuning device, an initial output for
each of the one or more corresponding layers of the pruned LLM;
generating (430), via the model compression and tuning device, an additional output
for each of the one or more additional layers; and
determining (432), via the model compression and tuning device, an output for each of
the one or more corresponding layers of the pruned LLM based on the initial output and the
additional output.
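
For illustration only, a LoRA-style module is sketched below as one way claim 6 could be realized: the corresponding layer of the pruned LLM produces the initial output, the injected low-rank layer produces the additional output, and the two are summed to form the layer's output. The additive combination and the module layout are assumptions.

import torch
import torch.nn as nn

class InjectedLayer(nn.Module):
    def __init__(self, pruned_layer: nn.Linear, A: torch.Tensor, B: torch.Tensor):
        super().__init__()
        self.pruned_layer = pruned_layer            # corresponding layer of the pruned LLM
        self.A = nn.Parameter(A.clone())            # factorized weights, shape (out, rank)
        self.B = nn.Parameter(B.clone())            # factorized weights, shape (rank, in)

    def forward(self, x):
        initial_output = self.pruned_layer(x)        # step 428: initial output
        additional_output = x @ self.B.T @ self.A.T  # step 430: additional output
        return initial_output + additional_output    # step 432: combined output

layer = InjectedLayer(nn.Linear(768, 768), torch.randn(768, 16), torch.randn(16, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
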
7. A system (100) for compressing and tuning a large language model (LLM), comprising:
a processor (104);
a memory (106) communicably coupled to the processor (104), wherein the memory
(106) stores processor-executable instructions, which, on execution, cause the
processor (104) to:
receive an LLM, a pruning ratio, an initial rank, and a set of target layers from
a plurality of layers of the LLM;
perform a dependency-wise pruning of the LLM based on the pruning ratio to
generate a pruned LLM;
perform a rank-based factorization of the LLM based on the initial rank to
generate factorized weights for each of the set of target layers of the LLM; and
update the pruned LLM by injecting one or more additional layers to one or
more corresponding layers of the pruned LLM to generate a compressed LLM,
wherein the one or more additional layers are based on the factorized
weights for each of the set of target layers of the LLM.
8. The system (100) as claimed in claim 7, wherein the processor (104) is configured to:
fine-tune the compressed LLM for a specific domain or for a specific task by fine-tuning
the factorized weights for the one or more additional layers of the compressed LLM based on
the domain-specific training data or task-specific training data respectively.
9. The system (100) as claimed in claim 7, wherein to perform the dependency-wise pruning,
the processor (104) is configured to:
group the dependent layers from the plurality of layers of the LLM, based on one or
more parameters, into a set of groups;
determine a similarity between each of the set of groups based on a cosine distance
among them; and
determine a number of connections to be pruned from each of the set of groups based
on the similarity and the pruning ratio.
10. The system (100) as claimed in claim 7, wherein to perform the rank-based factorization,
the processor (104) is configured to:
apply a singular value decomposition on each of the set of target layers to generate
singular value decomposition matrices (SVDMs) for each of the set of target layers,
wherein the SVDMs comprise initial factorized weights for a given layer;
determine a rank for each of the set of target layers based on application of a pre-defined
algorithm on singular values from the corresponding SVDMs,
wherein the singular values are arranged in a ranked order;
normalize the rank for each of the set of target layers based on the initial rank to
determine the factorized weights for each of the set of target layers of the LLM.
11. The system (100) as claimed in claim 10, wherein to perform the rank-based factorization,
the processor (104) is configured to:
down-sample the factorized weights for each of the set of target layers of the LLM
based on the pruning ratio to compress the factorized weights for each of the set of target layers of
the LLM.
12. The system (100) as claimed in claim 7, wherein to update the pruned LLM, the processor
(104) is configured to:
generate an initial output for each of the one or more corresponding layers of the pruned
LLM;
generate an additional output for each of the one or more additional layers; and
determine an output for each of the one or more corresponding layers of the pruned
LLM based on the initial output and the additional output.

Documents

Application Documents

# Name Date
1 202441022184-STATEMENT OF UNDERTAKING (FORM 3) [22-03-2024(online)].pdf 2024-03-22
2 202441022184-REQUEST FOR EXAMINATION (FORM-18) [22-03-2024(online)].pdf 2024-03-22
3 202441022184-PROOF OF RIGHT [22-03-2024(online)].pdf 2024-03-22
4 202441022184-POWER OF AUTHORITY [22-03-2024(online)].pdf 2024-03-22
5 202441022184-FORM 18 [22-03-2024(online)].pdf 2024-03-22
6 202441022184-FORM 1 [22-03-2024(online)].pdf 2024-03-22
7 202441022184-DRAWINGS [22-03-2024(online)].pdf 2024-03-22
8 202441022184-DECLARATION OF INVENTORSHIP (FORM 5) [22-03-2024(online)].pdf 2024-03-22
9 202441022184-COMPLETE SPECIFICATION [22-03-2024(online)].pdf 2024-03-22
10 202441022184-Form 1 (Submitted on date of filing) [22-04-2024(online)].pdf 2024-04-22
11 202441022184-Covering Letter [22-04-2024(online)].pdf 2024-04-22
12 202441022184-FORM 3 [24-07-2024(online)].pdf 2024-07-24