An AI-Powered Video and Voice Recognition Device for Early Screening of Autism Spectrum Disorder in Infants and Toddlers

Abstract: The invention relates to an AI-powered video and voice recognition device for early screening of Autism Spectrum Disorder (ASD) in infants and toddlers. The system comprises a Smart Edge Vision Data Collector Device (70) integrated with a Raspberry Pi (71), High Definition Camera (78), and Microphone (76) for capturing short sessions of a child’s natural behavior. The captured video and audio data are temporarily stored and transmitted via Wi-Fi Modem (75) to a Cloud Server (60) for advanced processing. The video stream is analyzed using Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to extract facial expressions, gaze direction, and behavioral patterns, while audio signals are processed using Recurrent Neural Networks (RNN) or Transformer models to analyze pitch, prosody, and response latency. Multimodal data fusion enhances prediction reliability, generating an autism risk score categorized as low, medium, or high. The results are displayed on a local Display (77), enabling timely and accessible early ASD detection.

Patent Information

Application #

Filing Date

13 September 2025

Publication Number

40/2025

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

UTTARANCHAL UNIVERSITY

ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

Inventors

1. SHIVANI PANT

UTTARANCHAL INSTITUTE OF TECHNOLOGY, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

2. RAJESH SINGH

UTTARANCHAL INSTITUTE OF TECHNOLOGY, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

3. ANITA GEHLOT

UTTARANCHAL INSTITUTE OF TECHNOLOGY, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

4. RAHUL MAHALA

LAW COLLEGE DEHRADUN, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

5. BABITA RAWAT

UTTARANCHAL INSTITUTE OF MANAGEMENT, UTTARANCHAL UNIVERSITY, ARCADIA GRANT, P.O. CHANDANWARI, PREMNAGAR, DEHRADUN - 248007, UTTARAKHAND, INDIA

Claims

1. An AI-based system for early screening of Autism Spectrum Disorder (ASD) in infants and toddlers comprising a Cloud Server (60), a Smart Edge Vision Data Collector Device (SEVDCD) (70), a Model Training Module (90), a Raspberry Pi (71), a Power Supply (72), a Battery (73), a Power Source (74), an Onboard Wi-Fi Modem (75), a Microphone (76), a Display (77), a High Definition Camera (78), a Keyboard (79), a Mouse (79a), and Data Storage (79b).

2. The system as claimed in claim 1, wherein the High-Definition Camera (78) and the Microphone (76) are configured to record short sessions of a child’s behavior, typically of about five minutes, in a naturalistic setting such as home or daycare.

3. The system as claimed in claim 1, wherein the system integrates computer vision, audio analysis, and machine learning within an embedded platform based on the Raspberry Pi (71).

4. The system as claimed in claim 1, wherein the Raspberry Pi (71) is configured to temporarily store the captured video and audio data in the Data Storage (79b).

5. The system as claimed in claim 1, wherein the Onboard Wi-Fi Modem (75) is configured to transmit the stored data to the Cloud Server (60) via a network connection.

6. The system as claimed in claim 1, wherein the Cloud Server (60) is configured to process video data by dissecting the stream into frames and applying a Convolutional Neural Network (CNN) to extract features including facial expression, gaze direction, and gesture patterns.

7. The system as claimed in claim 1, wherein the Cloud Server (60) is further configured to process audio data to extract features including pitch, prosody, energy distribution, and response latency using a Recurrent Neural Network (RNN) or Transformer model.

8. The system as claimed in claim 1, wherein the system employs multimodal data fusion to combine visual and auditory features, thereby enhancing the robustness and reliability of prediction.

9. The system as claimed in claim 1, wherein the deep learning model employs a hybrid approach that combines a Long Short-Term Memory (LSTM) network with the Convolutional Neural Network (CNN) to analyze temporal and spatial aspects of child behavior.

10. The system as claimed in claim 1, wherein the Display (77) is configured to present an autism risk score categorized as low, medium, or high based on the processed video and audio data, thereby providing immediate and interpretable feedback.

Specification

Description: FIELD OF THE INVENTION
This invention relates to an AI-powered video and voice recognition device for early screening of Autism Spectrum Disorder in infants and toddlers.
BACKGROUND OF THE INVENTION
In recent years, children affected by autism have faced a range of difficulties in daily life. Early detection is crucial because timely intervention helps autistic children throughout their lives. Traditional methods of early detection face significant challenges, including delayed intervention and subjective assessment of behavior. These methods are time-consuming, require real-time observation by an expert, and may not capture behavioral signs at early stages. There is therefore an increasing demand for automated, accessible, and non-invasive systems that can detect autism spectrum disorder (ASD) at an early stage in everyday settings. Integrating artificial intelligence with vision technologies offers a way to observe facial expressions, response time, vocal patterns, and eye gaze in children. To fill this gap, the invention proposes an AI-driven, low-cost embedded system that combines cloud-based processing with a Raspberry Pi for early screening. The proposed system aims to support timely intervention by analyzing audio and visual cues to enable early identification, with the goal of improving developmental outcomes.
Autism is a complex neurodevelopmental condition that emerges in early childhood and mainly affects behavior, social interaction, and communication. Children with autism benefit from early detection and early intervention. Existing approaches rely mainly on questionnaires such as M-CHAT, structured observational assessments such as ADOS, clinical observation, and behavior-based assessments. These traditional approaches take considerable time, which delays detection, especially in rural areas where resources are limited. In addition, signs such as delayed imitation and unusual intonation, rhythm, loudness, and voice quality are difficult to identify without focused and continuous monitoring.
In the last few years, artificial intelligence combined with computer vision has made it possible to detect behavioral patterns associated with ASD automatically. Some studies have introduced deep learning models that analyze body gestures, gaze patterns, and facial expressions from video. Other systems use audio-based cues such as prosody, delayed speech, and failure to respond to one's name. However, many of these methods require a controlled environment or specialized tools such as EEG devices, multimodal sensor arrays, and eye-tracking devices. Such tools are impractical for home monitoring or large-scale monitoring.
Moreover, traditional AI models are trained on very limited datasets that do not account for linguistic, age-related, and cultural factors in child behavior. A further limitation of existing systems is that none integrates data processing and cloud-based AI analysis with inexpensive hardware. A reliable solution for early prediction of ASD in low-resource settings is therefore needed.
Additionally, wearable devices used to collect physiological data may be uncomfortable or invasive for children, which reduces compliance and can alter the child's behavior. The research community has yet to exploit the full capability of inexpensive vision-based devices built with open-source tools and a Raspberry Pi, which would allow screening to be carried out naturally in daycare and at home.
Hence, there exists a research gap in developing a vision-based, low-cost, AI-driven autism screening tool that is deployed at the edge and is capable of cloud-based analysis and continuous data transfer. Such a system should capture real-time audio and visual behavioral data from young children, analyze the data using a robust and reliable ML model, and provide relevant information to clinicians and parents. Filling this gap would support early detection for autistic children and enable timely intervention, including in regions with limited resources.
AU2016204969B2: This invention provides methods and biomarkers for diagnosing autism by identifying cellular metabolites differentially produced in autistic patient samples versus non-autistic controls. Methods for identifying a unique profile of metabolites present or secreted in brain tissue, cerebrospinal fluid, plasma, or biofluids of autistic samples are described herein. The individual metabolites or a pattern of secreted metabolites provide metabolic signatures of autism, which can be used to provide a diagnosis thereof.
RESEARCH GAP: The proposed patent differs by using AI-based video and voice behavior analysis for non-invasive early autism screening, whereas the cited invention focuses on biochemical metabolite profiling for diagnosis from biofluids.
US11268966B2: Methods for identifying metabolic signatures in blood plasma which are unique to autism are described herein. Samples are analyzed using multiple chromatographic-mass spectrometry-based techniques to orthogonally measure a broad range of small molecular weight metabolites differentially produced in autistic patient samples versus non-autistic control samples. These individual metabolites or a panel of such metabolites serve as metabolic signatures of autism. Such metabolic signatures are used in diagnostic methods to accurately identify individuals with autism spectrum disorder (ASD).
RESEARCH GAP: The proposed patent differs by employing a multimodal AI system analyzing facial and vocal behaviors in natural settings, while the cited invention uses lab-based mass spectrometry to detect blood-based metabolic biomarkers for ASD diagnosis.
None of the prior art indicated above, either alone or in combination, discloses what the present invention discloses. This invention relates to an AI-powered video and voice recognition device for early screening of Autism Spectrum Disorder in infants and toddlers.
SUMMARY OF THE INVENTION
This summary is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the invention.
This summary is neither intended to identify key or essential inventive concepts of the invention nor intended to determine the scope of the invention.
Disclosed herein is an AI-based system for early screening of Autism Spectrum Disorder (ASD) in infants and toddlers, comprising a Cloud Server (60), a Smart Edge Vision Data Collector Device (SEVDCD) (70), a Model Training Module (90), a Raspberry Pi (71), a Power Supply (72), a Battery (73), a Power Source (74), an Onboard Wi-Fi Modem (75), a Microphone (76), a Display (77), a High Definition Camera (78), a Keyboard (79), a Mouse (79a), and Data Storage (79b).
The system uses a high-definition camera and a sensitive microphone to record a short session—typically five minutes—of a child's behavior in a naturalistic setting such as home or daycare.
The system integrates computer vision, audio analysis, and machine learning within an embedded platform based on a Raspberry Pi.
The Raspberry Pi is configured to temporarily store the captured audio and video data.
The onboard Wi-Fi modem is configured to transmit the stored data to a remote cloud server via a network connection.
The system dissects the video stream into individual frames and applies a Convolutional Neural Network (CNN) to each frame to extract spatial features like facial expression, gaze direction, and gesture patterns.
Multimodal data fusion enhances the robustness and reliability of prediction by integrating both visual and auditory cues.
The deep learning model uses a hybrid approach that combines a Long Short-Term Memory (LSTM) network with a Convolutional Neural Network (CNN) to examine visual behavior in children.
To further clarify advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The illustrated embodiments of the subject matter will be understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of devices, systems, and methods that are consistent with the subject matter as claimed herein, wherein:
FIGURE 1 OVERALL ARCHITECTURE
FIGURE 2 SMART EDGE VISION DATA COLLECTOR DEVICE
The figures depict embodiments of the present subject matter for the purposes of illustration only. A person skilled in the art will easily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
The detailed description of various exemplary embodiments of the disclosure is described herein with reference to the accompanying drawings. It should be noted that the embodiments are described herein in such detail as to clearly communicate the disclosure. However, the amount of detail provided herein is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present disclosure as defined by the appended claims.
It is also to be understood that various arrangements may be devised that, although not explicitly described or shown herein, embody the principles of the present disclosure. Moreover, all statements herein reciting principles, aspects, and embodiments of the present disclosure, as well as specific examples, are intended to encompass equivalents thereof.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In addition, the descriptions of "first", "second", “third”, and the like in the present invention are used for the purpose of description only, and are not to be construed as indicating or implying their relative importance or implicitly indicating the number of technical features indicated. Thus, features defining "first" and "second" may include at least one of the features, either explicitly or implicitly.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The proposed AI-powered video and voice recognition device is a low-cost, non-invasive system designed for the early screening of Autism Spectrum Disorder (ASD) in infants and toddlers. It integrates computer vision, audio analysis, and machine learning within an embedded platform based on a Raspberry Pi. The mechanism begins with data acquisition, wherein the system uses a high-definition camera and a sensitive microphone to record a short session—typically five minutes—of a child's behavior in a naturalistic setting such as home or daycare. The video captures facial expressions, eye gaze, and gestures, while the audio records speech patterns, response to name, and prosodic features.
Once the data is collected, it is temporarily stored on the Raspberry Pi and then securely transmitted to a cloud server via a Wi-Fi connection. The cloud infrastructure handles the computationally intensive tasks of data preprocessing and inference. For video, the system dissects the stream into individual frames and applies a Convolutional Neural Network (CNN) to each frame to extract spatial features like facial expression, gaze direction, and gesture patterns. These sequential features are then passed to a Long Short-Term Memory (LSTM) network, which captures the temporal dynamics of behavior—essential in recognizing ASD indicators like lack of sustained eye contact or repetitive actions.
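The temporal step described above can be illustrated with a minimal sketch in plain Python. It assumes the CNN has already reduced each frame to a small feature vector, and it runs a single LSTM cell over that sequence to produce one summary vector per session. The feature size, hidden size, helper names, and randomly initialized weights are illustrative assumptions, not values from the specification; a deployed model would use trained weights and a deep learning framework.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
GateParams = Dict[str, Tuple[Matrix, Vector]]  # gate name -> (weights, bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights: Matrix, vec: Vector, bias: Vector) -> Vector:
    """Compute weights @ vec + bias using plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def make_params(input_size: int, hidden_size: int, seed: int = 0) -> GateParams:
    """Hypothetical random weights standing in for a trained model."""
    rng = random.Random(seed)
    cols = input_size + hidden_size
    return {
        gate: ([[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(hidden_size)],
               [0.0] * hidden_size)
        for gate in ("i", "f", "o", "g")
    }


def lstm_step(x: Vector, h: Vector, c: Vector,
              params: GateParams) -> Tuple[Vector, Vector]:
    """One LSTM time step: returns the new hidden state and cell state."""
    z = x + h  # concatenate current input with previous hidden state
    i = [sigmoid(a) for a in affine(params["i"][0], z, params["i"][1])]  # input gate
    f = [sigmoid(a) for a in affine(params["f"][0], z, params["f"][1])]  # forget gate
    o = [sigmoid(a) for a in affine(params["o"][0], z, params["o"][1])]  # output gate
    g = [math.tanh(a) for a in affine(params["g"][0], z, params["g"][1])]  # candidate
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def encode_session(frame_features: List[Vector], params: GateParams,
                   hidden_size: int) -> Vector:
    """Run the LSTM over per-frame features; return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in frame_features:
        h, c = lstm_step(x, h, c, params)
    return h


# Three frames with three made-up CNN features each.
frames = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1], [0.7, 0.1, 0.6]]
params = make_params(input_size=3, hidden_size=2)
summary = encode_session(frames, params, hidden_size=2)
print(summary)
```

Because the cell state carries information from frame to frame, the final hidden state depends on the order of the frames, which is the property the specification relies on for cues such as sustained eye contact or repetitive actions.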
Simultaneously, the audio stream is preprocessed to reduce noise and extract relevant features such as Mel-frequency cepstral coefficients (MFCCs), pitch variations, energy distribution, and response latency. These features are input into either a Recurrent Neural Network (RNN) or a Transformer model, which identifies anomalies in speech rhythm, vocal tone, or delayed verbal responses, which are behavioral markers often associated with ASD.
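Two of the audio features named above, energy distribution and response latency, can be sketched with the standard library alone. The example below is a simplified illustration under stated assumptions: it treats the first frame whose mean energy exceeds a fixed threshold after a prompt (such as the child's name being called) as the start of the response. The function names, frame length, threshold, and the synthetic signal are hypothetical; the specification does not say how latency is measured, and MFCC and pitch extraction would normally come from a signal-processing library and are not shown.

```python
import math
from typing import List, Optional


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(s * s for s in samples[start:start + frame_len]) / frame_len
            for start in range(0, len(samples) - frame_len + 1, frame_len)]


def response_latency(samples: List[float], sample_rate: int,
                     prompt_end_s: float, frame_len: int = 100,
                     threshold: float = 0.01) -> Optional[float]:
    """Seconds from the end of the prompt to the first frame whose energy
    exceeds the threshold, or None if no such frame exists."""
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        start_s = idx * frame_len / sample_rate
        if start_s >= prompt_end_s and energy > threshold:
            return start_s - prompt_end_s
    return None


# Synthetic recording: 2 s of silence, then 1 s of a 50 Hz tone, at 1 kHz.
RATE = 1000
signal = [0.0] * (2 * RATE) + [
    0.5 * math.sin(2 * math.pi * 50 * n / RATE) for n in range(RATE)
]
latency = response_latency(signal, RATE, prompt_end_s=0.5)
print(latency)  # prints 1.5
```

A fixed energy threshold is sensitive to background noise, which is one reason the noise-reduction step precedes feature extraction in the described pipeline.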
The outputs from the video and audio analysis pipelines are fused using a feature concatenation method. This multimodal data fusion enhances the robustness and reliability of prediction by integrating both visual and auditory cues. The combined feature set is then passed through a fully connected classification layer, which applies a softmax or sigmoid activation to compute an autism risk score. This score is categorized into levels such as low, medium, or high risk and transmitted back to the Raspberry Pi. The result is displayed on the local interface, providing immediate, interpretable feedback to caregivers or clinicians.
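The fusion and classification step can be sketched as follows, assuming the softmax variant with one output per risk level. The feature vectors, weight matrix, and bias values are made-up placeholders chosen only to make the arithmetic easy to follow; in the described system they would come from the trained video and audio branches and the trained classification layer.

```python
import math
from typing import List, Sequence, Tuple

RISK_LABELS = ("low", "medium", "high")


def fuse(video_features: Sequence[float],
         audio_features: Sequence[float]) -> List[float]:
    """Feature-level fusion by simple concatenation."""
    return list(video_features) + list(audio_features)


def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax with the usual max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused: Sequence[float], weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> Tuple[str, List[float]]:
    """Fully connected layer followed by softmax over the three risk levels."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return RISK_LABELS[probs.index(max(probs))], probs


# Hypothetical branch outputs and layer parameters (not trained values).
video_vec = [1.0, 0.0]
audio_vec = [0.5]
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

label, probs = classify(fuse(video_vec, audio_vec), weights, bias)
print(label, probs)
```

Concatenation keeps the two modalities separate until the classification layer, so the layer's weights decide how much each visual or auditory feature contributes to each risk level.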
The system's design ensures child-friendliness, real-time feedback, and affordability. Its edge-cloud hybrid architecture allows for scalable deployment across rural and under-resourced settings, making early autism screening accessible, efficient, and evidence-based.
The proposed approach is a hybrid deep learning model that combines a Long Short-Term Memory (LSTM) network with a Convolutional Neural Network (CNN) to examine visual behavior in children. The CNN extracts features such as facial expression, eye gaze pattern, and gestures from the video, while the LSTM captures the temporal order of these behaviors to identify autism-related patterns such as repetitive movement and lack of eye contact. Audio features such as tone, response time, and pitch are processed through a Recurrent Neural Network (RNN) or Transformer model to detect atypical vocal cues. The audio and video outputs are fused to calculate an overall autism risk score. The model is trained on a labeled dataset and then deployed on the cloud server, so that real-time monitoring is possible, with the result shown on the display. The algorithm used is as follows:
Algorithm Used
BEGIN

// Step 1: Data Capture on Raspberry Pi Device
Initialize Camera
Initialize Microphone
Record Video and Audio for Fixed Duration (e.g., 5 minutes)
Save Video_Audio_File locally

// Step 2: Send to Cloud for Processing
Establish Wi-Fi Connection
Upload Video_Audio_File to Cloud Server

// Step 3: Cloud - Preprocessing
Split Video into Frames
FOR each Frame:
    Detect and align Face
    Extract facial features using CNN
END FOR
Store Video_Feature_Sequence

Extract audio stream from file
Clean audio using noise reduction
Extract audio features (MFCCs, pitch, energy, response latency)

// Step 4: Behavioral Modeling
Input Video_Feature_Sequence into LSTM
Input Audio_Features into RNN or Transformer

// Step 5: Multimodal Fusion and Prediction
Concatenate Output_Features from LSTM and RNN or Transformer
Feed to FullyConnectedLayer
Apply Softmax or Sigmoid to get Autism_Risk_Score

// Step 6: Return Result to Device
Send Autism_Risk_Score to Raspberry Pi
Display Result to User Interface (e.g., Low/Medium/High Risk)

END
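Step 5 allows a sigmoid output as an alternative to softmax, and Step 6 shows the result as Low/Medium/High Risk. A minimal sketch of that mapping is given below. The cut-off values 0.33 and 0.66 are placeholders chosen for illustration only; the specification does not state thresholds, and real cut-offs would have to be set through clinical validation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a logit to [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def risk_category(score: float, medium_from: float = 0.33,
                  high_from: float = 0.66) -> str:
    """Bucket a risk score in [0, 1] into the three displayed levels.

    The default thresholds are hypothetical placeholders.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= high_from:
        return "High Risk"
    if score >= medium_from:
        return "Medium Risk"
    return "Low Risk"


# Example logits from a hypothetical final layer.
for logit in (-3.0, 0.0, 3.0):
    score = sigmoid(logit)
    print(f"{score:.3f} -> {risk_category(score)}")
```

Keeping the thresholds as parameters makes it possible to recalibrate the displayed categories without retraining the model.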
ADVANTAGES OF THE INVENTION:
The key advantages of the system proposed are as follows:
• The system uses a high-definition camera and microphone and avoids wearable devices that may cause the child discomfort. It gathers data in a natural and familiar environment, which supports more accurate observation. This non-intrusive approach improves compliance and applicability.
• The AI-powered system analyzes bodily gestures, auditory data, gaze patterns, and facial expressions to assess autism-related characteristics. It supports early autism detection in children as young as 12-24 months, well before conventional clinical diagnosis. This enables timely intervention that enhances developmental outcomes.
• The system is affordable and cost-effective because it uses components such as a Raspberry Pi and open-source software. Its simple design supports use in any environment, such as homes, hospitals, or rural health centers. This makes autism screening possible even in under-resourced places.
• Heavy processing tasks such as model inference and data fusion are performed in the cloud, making the system fast and accurate. Results are displayed immediately to caregivers or clinicians, making the system both efficient and accessible.
• Audio and video inputs are fused and processed through a classification layer, which gives the system access to a richer feature set and improves the reliability and accuracy of autism detection compared with single-channel systems. This better replicates the evaluation carried out by trained professionals.
Documents

Application Documents

#	Name	Date
1	202511087163-STATEMENT OF UNDERTAKING (FORM 3) [13-09-2025(online)].pdf	2025-09-13
2	202511087163-REQUEST FOR EARLY PUBLICATION(FORM-9) [13-09-2025(online)].pdf	2025-09-13
3	202511087163-POWER OF AUTHORITY [13-09-2025(online)].pdf	2025-09-13
4	202511087163-FORM-9 [13-09-2025(online)].pdf	2025-09-13
5	202511087163-FORM FOR SMALL ENTITY(FORM-28) [13-09-2025(online)].pdf	2025-09-13
6	202511087163-FORM 1 [13-09-2025(online)].pdf	2025-09-13
7	202511087163-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [13-09-2025(online)].pdf	2025-09-13
8	202511087163-EVIDENCE FOR REGISTRATION UNDER SSI [13-09-2025(online)].pdf	2025-09-13
9	202511087163-EDUCATIONAL INSTITUTION(S) [13-09-2025(online)].pdf	2025-09-13
10	202511087163-DRAWINGS [13-09-2025(online)].pdf	2025-09-13
11	202511087163-DECLARATION OF INVENTORSHIP (FORM 5) [13-09-2025(online)].pdf	2025-09-13
12	202511087163-COMPLETE SPECIFICATION [13-09-2025(online)].pdf	2025-09-13