A Hybrid CNN-LSTM Model for Analysing Student Behaviour and Engagement in AI-Assisted Classrooms

Abstract: The invention relates to a Hybrid CNN-LSTM model for analysing student behaviour and engagement in AI-assisted classrooms. The system comprises an Edge Vision Device (30) integrated with a Raspberry Pi (31), HD Camera (37), Wi-Fi Modem (36), Battery (33), Solar Panel (35), and Data Storage (40), connected to a Cloud Server (40). The device captures real-time classroom video streams of students and preprocesses data locally using lightweight Convolutional Neural Networks (CNNs) to extract facial and body gesture features. The preprocessed numerical data is transmitted to the Cloud Server (40), where CNNs perform spatial analysis of expressions and gestures, and Long Short-Term Memory (LSTM) networks model temporal behaviour patterns over time. The hybrid CNN-LSTM model generates engagement scores and performance predictions, which are displayed through a Display (34) and web-based dashboard. The system provides a cost-effective, non-intrusive, and energy-efficient solution for real-time monitoring of student engagement and learning performance in classroom environments.

Patent Information

Application #

Filing Date

13 September 2025

Publication Number

40/2025

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

UTTARANCHAL UNIVERSITY

ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

Inventors

1. MANISHA SAINI

ASSISTANT PROFESSOR, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

2. DR. SAURABH DHYANI

ASSISTANT PROFESSOR, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

Claims

1. A Hybrid CNN-LSTM Model for analysing student behaviour and engagement in AI-assisted classrooms comprising an Edge Vision Device (30), a Cloud Server (40), a Raspberry Pi (31), a Power Supply (32), a Battery (33), a Display (34), a Solar Panel (35), an Onboard Wi-Fi Modem (36), an HD Camera (37), a Keyboard (38), a Mouse (39), and Data Storage (40).

2. The system as claimed in claim 1, wherein the Edge Vision Device (30) is configured with the Raspberry Pi (31), HD Camera (37), SD card, Wi-Fi Module (36), Battery (33), and Solar Panel (35) to capture classroom video streams and preprocess data locally.

3. The system as claimed in claim 1, wherein the HD Camera (37) captures facial expressions and body gestures of students in real time for behavioral and engagement monitoring.

4. The system as claimed in claim 1, wherein the Edge Vision Device (30) extracts features locally using lightweight Convolutional Neural Networks (CNNs) to convert captured video frames into numerical data.

5. The system as claimed in claim 1, wherein the Data Storage (40) stores numerical data generated by the Edge Vision Device (30) before transmission to the Cloud Server (40).

6. The system as claimed in claim 1, wherein the Onboard Wi-Fi Modem (36) transmits preprocessed numerical data from the Edge Vision Device (30) to the Cloud Server (40).

7. The system as claimed in claim 1, wherein the Cloud Server (40) processes the transmitted data using CNN models for extracting spatial features including posture landmarks and facial expressions.

8. The system as claimed in claim 1, wherein the Cloud Server (40) further processes temporal patterns in student behaviour using Long Short-Term Memory (LSTM) networks.

9. The system as claimed in claim 1, wherein the hybrid CNN-LSTM model generates outputs comprising engagement scores and performance predictions for students based on classroom behavioural data.

10. The system as claimed in claim 1, wherein the Display (34) visualises analysed results including student performance scores and engagement levels for institutional administration and educators through a web-based dashboard.

Specification

Description: FIELD OF THE INVENTION
This invention relates to a Hybrid CNN-LSTM Model for Analysing Student Behaviour and Engagement in AI-Assisted Classrooms.
BACKGROUND OF THE INVENTION
Traditional methods of evaluating student performance rely heavily on manual observation, which results in delayed assessments and subjective grading and fails to capture real-time emotional and engagement states. In active classrooms, teachers often struggle to monitor each student’s confusion, attention and interest levels effectively. There is also a lack of intelligent systems that can process non-verbal cues such as body gestures and facial expressions for dynamic performance analysis. Existing solutions are too expensive, invasive, or limited to post-session analytics. Moreover, most classrooms lack automation tools that can respond adaptively to students’ behaviour in real time. There is a growing need for a non-intrusive, cost-effective and scalable system for monitoring and enhancing learning outcomes. Addressing this gap requires a system that integrates deep learning with an IoT-based platform for continuous monitoring of student behaviour.
Numerous efforts have been made to assess student engagement and performance in the evolving era of educational technology. Traditionally, educators have relied on classroom participation and on summative and subjective observation to gauge student learning outcomes. While these methods are effective to some extent, they lack objectivity and immediacy, particularly in dynamic classroom environments. The integration of digital learning platforms with data analytics has opened new opportunities for measuring student activity. However, these tools are limited to monitoring academic interactions such as quiz scores, screen time or login frequency, and they overlook the non-verbal behavioural cues that play a vital role in understanding student motivation, attention and cognitive state.
Existing artificial intelligence-based systems for monitoring students primarily use keystroke analysis, biometric sensors and screen monitoring. These approaches often intrude on student privacy, require expensive hardware, or are constrained to virtual learning environments. Furthermore, the facial expression recognition tools used for online proctoring are often one-dimensional: they focus only on detecting distraction or anomalies and do not account for the holistic behavioral patterns that are essential for accurate analysis of performance in real classrooms. Although computer vision and affective computing have advanced significantly, their application in the classroom environment is still in its infancy. Most studies in this domain are limited to controlled environments with small datasets and do not adequately capture the variability and diversity of student behavior in real-world classrooms. Additionally, many models are designed to analyze isolated gestures or static frames, ignoring the temporal dimension of learning behaviors that unfold over time. The lack of integrated spatiotemporal models, such as those combining Convolutional Neural Networks and Long Short-Term Memory networks, limits the ability of such systems to understand the behavioral patterns that are essential for predicting student performance and engagement levels.
Moreover, there is minimal research addressing the need for energy-efficient, cost-effective and scalable classroom monitoring systems deployable in resource-constrained educational environments. With the increasing interest in edge computing and IoT, there is a clear research gap in developing solar-powered, portable AI devices that can operate autonomously and transmit data on student behavior to the cloud for advanced analysis. Therefore, there exists a significant research gap in building a real-time, non-intrusive, AI-powered system that integrates both facial expression and body gesture analysis using deep learning, especially in active classroom settings. Such a system would offer educators meaningful insights into student engagement, enabling timely pedagogical interventions and contributing to improved learning outcomes.
US10490096B2: The Learner Interaction Monitoring Systems (LiMS) is a web-based application that can interface with any web-based course delivery platform to transform the online learning environment into an active observer of learner engagement. The LiMS ‘event capture model’ collects detailed real-time data on learner behavior in self-directed online learning environments, and interprets these data by drawing on behavioral research. The LiMS offers education and training managers in corporate contexts a valuable tool for the evaluation of learner performance and course design. By allowing more detailed demonstration of ROI in education and training, LiMS allows managers to make the case for web-based courseware that reflects appropriate and evidence-based instructional design, rather than budgetary constraints.
RESEARCH GAP: The proposed system differs by using edge vision devices to analyze physical student behavior in real classrooms, unlike LiMS which monitors online interactions.
US10467919B2: Embodiments can provide dynamic assessment by an expert system software (ESS) module. The ESS module may determine a minimum set of diagnostic questions to identify a student's strengths and/or weaknesses for a given subject matter. Each diagnostic question may be a unique question variant generated from a dynamic question template. The ESS module may automatically and dynamically build a personal lesson plan that includes lessons on topics of the subject matter that the student has not mastered. The ESS module may dynamically modify the personal lesson plan to change a lesson and/or add lesson(s) considered by the ESS module as necessary for the student to master the subject matter. Given the ESS module's knowledge on various components of a question such as equation(s) used to produce a correct answer to the question, the ESS module may generate image(s), customized hint(s), customized explanation(s), and/or examine student solutions in step-by-step fashion.
RESEARCH GAP: The proposed system differs by focusing on real-time behavioral analysis using vision-based deep learning, whereas the ESS module emphasizes adaptive content generation and diagnostic questioning.
None of the prior art indicated above, either alone or in combination, discloses what the present invention discloses. This invention relates to a Hybrid CNN-LSTM Model for Analysing Student Behaviour and Engagement in AI-Assisted Classrooms.
SUMMARY OF THE INVENTION
This summary is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the invention.
This summary is intended neither to identify key or essential inventive concepts of the invention nor to determine the scope of the invention.
Disclosed herein is a Hybrid CNN-LSTM Model for Analysing Student Behaviour and Engagement in AI-Assisted Classrooms, comprising an Edge Vision Device (30), a Cloud Server (40), a Raspberry Pi (31), a Power Supply (32), a Battery (33), a Display (34), a Solar Panel (35), an Onboard Wi-Fi Modem (36), an HD Camera (37), a Keyboard (38), a Mouse (39) and Data Storage (40).
The system is a smart classroom monitoring system comprising a camera-enabled device, a cloud server, and a dashboard backed by a machine learning model. The system captures real-time body gestures and facial expressions of the students. The device preprocesses the captured data and transmits the resulting numerical data to the cloud server. A hybrid deep learning model analyses performance and engagement. The results are visualised for institutional administration and educators through a web-based interface.
The system includes an edge vision device comprising a Raspberry Pi, an HD camera, an SD card, a Wi-Fi module, and a battery that is continuously charged by an attached solar panel. The device is designed to capture video of the students present in the classroom and to extract facial and gesture data locally using lightweight CNN models. It converts this data into numerical form and transmits it to the cloud server for further processing.
The proposed device uses an algorithm that integrates CNNs and LSTM networks to analyze the spatial and temporal features of student behavior. The CNNs process video frames to extract posture and emotion data. The LSTM models analyze sequences over time to identify engagement trends. The algorithm outputs performance scores and engagement levels for educational use.
To further clarify advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The illustrated embodiments of the subject matter will be understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of devices, systems, and methods that are consistent with the subject matter as claimed herein, wherein:
FIGURE 1: SYSTEM ARCHITECTURE
The figures depict embodiments of the present subject matter for the purposes of illustration only. A person skilled in the art will easily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
The detailed description of various exemplary embodiments of the disclosure is described herein with reference to the accompanying drawings. It should be noted that the embodiments are described herein in such details as to clearly communicate the disclosure. However, the amount of details provided herein is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present disclosure as defined by the appended claims.
It is also to be understood that various arrangements may be devised that, although not explicitly described or shown herein, embody the principles of the present disclosure. Moreover, all statements herein reciting principles, aspects, and embodiments of the present disclosure, as well as specific examples, are intended to encompass equivalents thereof.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In addition, the descriptions of "first", "second", “third”, and the like in the present invention are used for the purpose of description only, and are not to be construed as indicating or implying their relative importance or implicitly indicating the number of technical features indicated. Thus, features defining "first" and "second" may include at least one of the features, either explicitly or implicitly.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
This invention proposes a novel, intelligent system for real-time analysis of student performance in active classrooms using deep learning algorithms focused on body gestures and facial expressions. The system is designed as a cost-effective, non-intrusive and energy-efficient solution. It combines edge computing with cloud-based deep learning models to provide actionable insights into student behavioral and engagement trends. This smart monitoring system aims to address the limitations of traditional assessment methods, which often rely on manual observation, academic scores or delayed evaluations that do not reflect students’ real-time emotional involvement and understanding during lessons. At the core of the system is a Raspberry Pi-based device integrated with an HD camera, SD card, mouse, keyboard and Wi-Fi module. The entire unit is powered by a battery that is continuously charged by a solar panel, which makes the system sustainable and deployable in remote places that lack reliable energy sources. Any number of devices can be placed in a classroom, as required, to capture video streams of the students during instructional sessions. These video feeds are processed locally to extract key visual features of the students, such as body posture points and facial landmarks. Deep learning models extract features such as eye movement, mouth openness, arm position, head tilt and leaning behavior and convert them into numeric data. Transmitting only this numeric data, rather than raw video or images, minimizes bandwidth usage and preserves student privacy.
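To illustrate why transmitting numeric features instead of raw frames reduces bandwidth, the following minimal sketch packs one feature record into a fixed-size binary payload and compares it with an uncompressed frame. The record layout, the field names, the value ranges and the 1280x720 frame size are assumptions made for illustration; the specification does not define a wire format.

```python
import struct
from dataclasses import dataclass, astuple


@dataclass
class FrameFeatures:
    """Hypothetical per-student feature record; field names are illustrative."""
    timestamp: float       # seconds since the session started
    eye_movement: float
    mouth_openness: float
    arm_position: float
    head_tilt: float
    leaning: float


# Assumed layout: one little-endian double (timestamp) plus five 32-bit floats.
RECORD_FORMAT = "<d5f"


def pack_features(features: FrameFeatures) -> bytes:
    """Serialise a feature record into a compact binary payload."""
    return struct.pack(RECORD_FORMAT, *astuple(features))


def unpack_features(payload: bytes) -> FrameFeatures:
    """Rebuild the feature record on the receiving side."""
    return FrameFeatures(*struct.unpack(RECORD_FORMAT, payload))


record = FrameFeatures(12.5, 0.25, 0.5, 0.75, -0.125, 0.0)
payload = pack_features(record)

# Uncompressed RGB frame at an assumed 1280x720 resolution, 3 bytes per pixel.
raw_frame_bytes = 1280 * 720 * 3
print(len(payload), "bytes per record versus", raw_frame_bytes, "bytes per raw frame")
```

Under these assumptions each record occupies 28 bytes, whereas a single uncompressed frame occupies about 2.7 MB. A real deployment would also compress video, so the actual saving would be smaller than this raw comparison suggests.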
Once generated, the numerical data is transmitted in real time to the cloud server via the Wi-Fi module. The cloud server preprocesses the data: it fills in missing values and refines values that are ambiguous or incorrect. The CNN component performs spatial analysis by classifying the facial expression and gesture-based features from each frame into emotional categories such as confused, focused, attentive or bored. The LSTM component processes sequences of these feature vectors over time to capture temporal patterns and behavioral trends, and can therefore distinguish between transient distractions and sustained disengagement. This spatiotemporal modelling approach is intended to improve the accuracy of performance assessment by accounting for the dynamic nature of student behavior throughout a lesson.
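The cloud-side preprocessing described above could look roughly like the sketch below. It assumes that missing values arrive as None and that every feature is normalised to the range 0 to 1; neither assumption comes from the specification, and the function name clean_sequence is hypothetical.

```python
from typing import List, Optional, Sequence


def clean_sequence(
    rows: Sequence[Sequence[Optional[float]]],
    low: float = 0.0,
    high: float = 1.0,
) -> List[List[float]]:
    """Fill missing values and clamp out-of-range values in a feature sequence.

    A missing entry (None) takes the most recent cleaned value of the same
    feature, or the midpoint of the valid range if there is no earlier row.
    """
    cleaned: List[List[float]] = []
    default = (low + high) / 2.0
    for row in rows:
        new_row: List[float] = []
        for j, value in enumerate(row):
            if value is None:
                # Forward-fill from the previous cleaned row when one exists.
                value = cleaned[-1][j] if cleaned else default
            # Clamp implausible readings into the assumed valid range.
            new_row.append(min(max(value, low), high))
        cleaned.append(new_row)
    return cleaned


received = [[None, 0.2], [0.4, None], [1.7, -0.3]]
print(clean_sequence(received))
```

Forward-filling is only one possible imputation strategy; interpolation or dropping incomplete rows would be equally valid choices depending on how the downstream model was trained.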
The preprocessed data is sent to the machine learning model, which is pretrained on existing datasets and on labelled classroom video annotated by educational experts to ensure contextual relevance. The model processes the data and predicts the students’ behaviour and their level of engagement in the classroom. The predictions are then sent to the institution’s dashboard, helping the administration better understand student behaviour and work towards each student’s improvement.
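As a rough illustration of how predicted categories might be turned into a dashboard entry, the sketch below averages per-frame category weights into a 0 to 100 engagement score and serialises the result as JSON. The category weights, the payload fields and the function names are assumptions; the specification does not state how scores are computed or what the dashboard receives.

```python
import json
from collections import Counter
from typing import Dict, List

# Illustrative weights only; the specification does not define this mapping.
CATEGORY_WEIGHTS: Dict[str, float] = {
    "focused": 1.0,
    "attentive": 0.8,
    "confused": 0.4,
    "bored": 0.0,
}


def engagement_score(labels: List[str]) -> float:
    """Average the category weights over a window and scale to 0-100."""
    if not labels:
        raise ValueError("at least one label is required")
    total = sum(CATEGORY_WEIGHTS[label] for label in labels)
    return round(100.0 * total / len(labels), 1)


def dashboard_payload(student_id: str, labels: List[str]) -> str:
    """Build a JSON document that a dashboard could display."""
    return json.dumps(
        {
            "student_id": student_id,
            "engagement_score": engagement_score(labels),
            "category_counts": dict(Counter(labels)),
        },
        sort_keys=True,
    )


window = ["focused", "focused", "bored", "attentive"]
print(dashboard_payload("student-01", window))
```

The identifier "student-01" is a placeholder. How students are identified, if at all, is a design decision the specification leaves open.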
The algorithm used by the device is a hybrid deep learning model combining CNNs and LSTM networks. The CNN extracts spatial features, such as posture landmarks and facial expressions, from each video frame. These features are then passed to the LSTM, which captures behavioural trends and temporal patterns over time. This spatiotemporal fusion allows the system to distinguish between brief distractions and sustained engagement or confusion. The model is pretrained on gesture and emotion datasets and fine-tuned with real classroom data. It outputs engagement scores and performance predictions based on the observed patterns.
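The temporal half of the hybrid model can be illustrated with the standard LSTM cell equations, written here in plain Python so that the gating is visible. This is a teaching sketch, not the trained model: the weights are placeholders supplied by the caller, and the layer size, parameter layout and function names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Tuple[Matrix, Vector]]  # gate name -> (weights, bias)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_step(x: Sequence[float], h_prev: Vector, c_prev: Vector,
              params: Params) -> Tuple[Vector, Vector]:
    """Advance the cell by one time step and return (hidden, cell) state."""
    z = list(x) + list(h_prev)  # input concatenated with previous hidden state

    def gate(name, activation):
        weights, bias = params[name]
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(weights, bias)]

    i = gate("input", sigmoid)        # how much new information to admit
    f = gate("forget", sigmoid)       # how much old memory to keep
    o = gate("output", sigmoid)       # how much memory to expose
    g = gate("candidate", math.tanh)  # proposed new memory content
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def run_lstm(sequence: Sequence[Sequence[float]], hidden_size: int,
             params: Params) -> Vector:
    """Feed a sequence of feature vectors through the cell; return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h
```

Because the cell state carries information from one step to the next, a pattern that persists across many frames moves the hidden state further than the same pattern seen in a single frame. That property is what would let a trained model separate a brief distraction from sustained disengagement.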
Algorithm Used
BEGIN
    // Step 1: System Initialization
    Initialize RaspberryPi with Camera, Wi-Fi Module, Battery, SD Card
    Connect to Cloud Server
    Load CNN model (for facial and pose features)
    Load LSTM model (for temporal behavior analysis)

    LOOP While Classroom Session is Active:
        // Step 2: Video Frame Capture
        CaptureFrame = Camera.read_frame()
        Timestamp = get_current_time()
        // Step 3: Preprocessing
        FaceDetected = detect_face(CaptureFrame)
        PoseDetected = detect_pose(CaptureFrame)
        IF FaceDetected AND PoseDetected:
            // Step 4: Feature Extraction
            FacialFeatures = CNN.extract_facial_features(CaptureFrame)
            PoseFeatures = CNN.extract_pose_features(CaptureFrame)
            FeatureVector = concatenate(FacialFeatures, PoseFeatures)
            // Step 5: Store Feature Locally
            Save FeatureVector with Timestamp to SD Card
            // Step 6: Transmit Feature to Cloud
            Send FeatureVector to Cloud Server via Wi-Fi
        END IF
        WAIT for next frame (e.g., delay 0.5s)
    END LOOP
    // Step 7: Cloud Processing
    CloudServer: RECEIVE FeatureVectors from Raspberry Pi
    Sequence = accumulate_feature_vectors_over_time()
    EngagementPrediction = LSTM.predict(Sequence)
    PerformanceScore = calculate_score(EngagementPrediction)
    // Step 8: Display Results
    UpdateInstitutionDashboard(EngagementPrediction, PerformanceScore)
END
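The edge-side loop (Steps 2 to 6) and the cloud-side accumulation (Step 7) can be sketched in Python as follows. Camera access, the CNN feature extractors, SD card storage and the Wi-Fi upload are passed in as callables because the specification does not name concrete libraries; all function and parameter names here are hypothetical, and the non-overlapping windowing is one assumed way to build LSTM input sequences.

```python
import time
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Features = List[float]
Record = Tuple[float, Features]  # (timestamp, concatenated feature vector)


def run_edge_loop(
    frames: Iterable[object],
    extract_face: Callable[[object], Optional[Features]],
    extract_pose: Callable[[object], Optional[Features]],
    store: Callable[[Record], None],
    send: Callable[[Record], None],
    clock: Callable[[], float] = time.time,
    delay_s: float = 0.5,
) -> int:
    """Process frames until the source is exhausted; return how many were sent."""
    sent = 0
    for frame in frames:
        timestamp = clock()
        face = extract_face(frame)  # None means no face was detected
        pose = extract_pose(frame)  # None means no pose was detected
        if face is not None and pose is not None:
            record = (timestamp, face + pose)  # concatenate both feature lists
            store(record)  # Step 5: keep a local copy
            send(record)   # Step 6: transmit to the cloud server
            sent += 1
        if delay_s > 0:
            time.sleep(delay_s)  # wait before reading the next frame
    return sent


def accumulate(records: Sequence[Record], window: int) -> List[List[Features]]:
    """Group time-ordered feature vectors into non-overlapping windows."""
    ordered = sorted(records, key=lambda record: record[0])
    vectors = [features for _, features in ordered]
    return [vectors[k:k + window]
            for k in range(0, len(vectors) - window + 1, window)]
```

Each window returned by accumulate would then be passed to the temporal model. Frames in which either the face or the pose is missing are skipped, matching the IF condition in Step 3 of the listing.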
ADVANTAGES OF THE INVENTION:
The key advantages of the proposed system are as follows:
• The proposed system uses an HD camera and deep learning models to monitor the body gestures and facial expressions of students in real time. It provides immediate feedback on emotional and attention states, which helps teachers adjust their instruction to improve student understanding.
• Unlike biometric scanners or wearable devices, the system collects behavioral data passively without interrupting classroom activities. It transmits only numerical features rather than raw video or personal data, which protects privacy and allows students to act naturally.
• Equipped with a battery and solar panel, the device operates independently of the power grid. It supports continuous monitoring even in low-resource or rural schools, making the system both accessible and sustainable for broader deployment.
• Because a Raspberry Pi and open-source software are used, hardware and maintenance costs stay low. The modular design allows quick installation across multiple classrooms, and institutions can use the device without significant changes to their infrastructure.
• The hybrid CNN-LSTM model analyses both gesture and facial data over time. This allows the system to detect meaningful patterns in student engagement and predict performance trends. Early signs of disengagement or confusion can trigger timely teacher intervention.

Documents

Application Documents

#	Name	Date
1	202511087164-STATEMENT OF UNDERTAKING (FORM 3) [13-09-2025(online)].pdf	2025-09-13
2	202511087164-REQUEST FOR EARLY PUBLICATION(FORM-9) [13-09-2025(online)].pdf	2025-09-13
3	202511087164-POWER OF AUTHORITY [13-09-2025(online)].pdf	2025-09-13
4	202511087164-FORM-9 [13-09-2025(online)].pdf	2025-09-13
5	202511087164-FORM FOR SMALL ENTITY(FORM-28) [13-09-2025(online)].pdf	2025-09-13
6	202511087164-FORM 1 [13-09-2025(online)].pdf	2025-09-13
7	202511087164-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [13-09-2025(online)].pdf	2025-09-13
8	202511087164-EVIDENCE FOR REGISTRATION UNDER SSI [13-09-2025(online)].pdf	2025-09-13
9	202511087164-EDUCATIONAL INSTITUTION(S) [13-09-2025(online)].pdf	2025-09-13
10	202511087164-DRAWINGS [13-09-2025(online)].pdf	2025-09-13
11	202511087164-DECLARATION OF INVENTORSHIP (FORM 5) [13-09-2025(online)].pdf	2025-09-13
12	202511087164-COMPLETE SPECIFICATION [13-09-2025(online)].pdf	2025-09-13