
Image Caption Generator Using Deep Learning

Abstract: The problem of automatically generating descriptive sentences for images has gained rising interest in natural language processing (NLP). Image captioning is a fundamental task that requires a correct understanding of images and the capability to generate descriptive sentences with proper grammatical structure. Describing a picture calls for well-structured English phrases, and automatically describing image content is particularly helpful for visually impaired people. The proposed method is a hybrid system that uses a multilayer Convolutional Neural Network (CNN) to generate keywords that describe the image and a Long Short-Term Memory (LSTM) network to assemble those keywords into well-formed sentences. CNNs have proven so effective that they are the standard approach for virtually any prediction problem that takes image data as input. The convolutional neural network compares the given image against a large dataset of training images, and an accurate description is then generated using the trained captions. The LSTM is used to avoid the poor-prediction problem that arises with traditional approaches. The model is trained so that, given an input image, it produces captions that closely describe the image. The efficiency of the model is demonstrated using the Flickr8K dataset.


Patent Information

Application #
202241052113
Filing Date
13 September 2022
Publication Number
38/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
jagadisha.n83@gmail.com
Parent Application

Applicants

Jagadisha
#7, Mantrada Huchappa Road, Magadi Main Road, Kamakshipalya
JAGADISHA N
#7, Mantrada Huchappa Road, Kamakshipalya
YOGESH KUMARAN S
19/39, 10th C Main, 6th Block, Rajajinagar, Bangalore
PRASHANTHA G R
Associate Professor, Department of CSE, Jain Institute of Technology, Davanagere
PRASAD M R
Associate Professor, Department of CSE, Vidyavardhaka College of Engineering, P.B. No. 206, Kannada Sahithya Parishath Rd, III Stage, Gokulam, Mysuru
RAVI KUMAR J
Assistant Professor, Department of CSE, Dr. AIT, Bangalore

Inventors

1. Jagadisha
#7, Mantrada Huchappa Road, Magadi Main Road, Kamakshipalya
2. JAGADISHA N
#7, Mantrada Huchappa Road, Kamakshipalya
3. YOGESH KUMARAN S
19/39, 10th C Main, 6th Block, Rajajinagar, Bangalore
4. PRASHANTHA G R
Associate Professor, Department of CSE, Jain Institute of Technology, Davanagere
5. PRASAD M R
Associate Professor, Department of CSE, Vidyavardhaka College of Engineering, P.B. No. 206, Kannada Sahithya Parishath Rd, III Stage, Gokulam, Mysuru
6. RAVI KUMAR J
Assistant Professor, Department of CSE, Dr. AIT, Bangalore

Specification

Description:
1. The proposed methodology
1.1 Implementation Approaches
The goal of this work is to track objects autonomously, without any prior information about the objects. The moving object is detected using a mixture of Gaussians (MOG). Methods such as Gaussian averaging, temporal median filtering, and eigenbackgrounds can also be used for background subtraction, but MOG is more accurate. The process of detecting and tracking multiple moving objects is summarized in the algorithm. More recently, it has been shown that deep learning models are able to achieve good results in the field of caption prediction. Instead of requiring complex data editing or a pipeline of specially designed models, a single end-to-end model can be defined that predicts a caption when an image is provided. To test our model, we measure its performance on the Flickr8K dataset.
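Although the specification itself contains no code, the detection step just described can be illustrated with a minimal Python sketch using OpenCV's MOG2 background subtractor; the video path and the blob-area threshold below are placeholders, not values taken from the specification.

    import cv2

    cap = cv2.VideoCapture("input_video.mp4")   # placeholder path
    # Mixture-of-Gaussians background model (MOG2 variant)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)          # per-pixel foreground mask
        # Keep only sufficiently large foreground blobs as moving objects
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) > 500:        # assumed noise threshold
                x, y, w, h = cv2.boundingRect(c)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("detection", frame)
        if cv2.waitKey(30) & 0xFF == 27:        # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()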

1.2 Data Cleaning and Preprocessing:
Data preprocessing is done in two parts: the images and the corresponding captions are cleaned and pre-processed separately. Image preprocessing is done by feeding the input data to the VGG16 application of the Keras API running on top of TensorFlow. VGG16 is pre-trained on ImageNet, which lets us train on the images faster with the help of transfer learning. Our program starts by loading the text file and the image file into separate variables; the text file is stored in a string. This string is manipulated to create a dictionary that maps each image to a list of 5 descriptions. The main data-cleaning tasks are removing punctuation marks, converting the whole text to lowercase, removing stop words, and removing words that contain numbers. Further, a vocabulary of all unique words from all the descriptions is created, which is later used to generate captions for test images. We also need to append the <start> and <end> identifiers to each caption, since these act as indicators that tell the LSTM where a caption starts and where it ends.
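As a concrete illustration of this cleaning step, a hedged Python sketch follows: it lowercases each description, strips punctuation, drops words that contain numbers, and wraps each caption in start/end tokens. The token names startseq/endseq and the layout of the captions dictionary (image id mapped to a list of 5 strings) are assumptions; stop-word removal is omitted for brevity.

    import string

    def clean_captions(captions):
        # captions: dict mapping image id -> list of 5 raw descriptions (assumed layout)
        table = str.maketrans("", "", string.punctuation)
        for image_id, desc_list in captions.items():
            for i, desc in enumerate(desc_list):
                words = desc.lower().translate(table).split()
                words = [w for w in words if w.isalpha()]  # removes words with numbers
                # Wrap with start/end markers so the LSTM can delimit captions
                desc_list[i] = "startseq " + " ".join(words) + " endseq"
        return captions

    def build_vocabulary(captions):
        # Vocabulary of all unique words across all cleaned descriptions
        vocab = set()
        for desc_list in captions.values():
            for desc in desc_list:
                vocab.update(desc.split())
        return vocab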

1.3 Dataset Description:
We have used the Flickr8K dataset downloaded from Kaggle. The dataset consists of 8,000 images, and for every image there are 5 captions. The 5 captions per image help capture the various possible scenarios. The dataset has a predefined training set, Flickr_8k.trainImages.txt (6,000 images), development set, Flickr_8k.devImages.txt (1,000 images), and test set, Flickr_8k.testImages.txt (1,000 images). The images were chosen from six varied Flickr groups and do not contain any well-known personalities or places; they were manually selected to show a variety of scenes.
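Loading one of these predefined splits is straightforward; the sketch below assumes the standard distribution format of one image filename per line.

    def load_image_set(split_file):
        # Each line holds one image filename; strip the extension to
        # obtain the image id used as a dictionary key
        with open(split_file) as f:
            return {line.strip().rsplit(".", 1)[0] for line in f if line.strip()}

    train_ids = load_image_set("Flickr_8k.trainImages.txt")  # 6,000 images
    dev_ids   = load_image_set("Flickr_8k.devImages.txt")    # 1,000 images
    test_ids  = load_image_set("Flickr_8k.testImages.txt")   # 1,000 images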

1.4 Extraction of feature vectors
A feature vector (or simply feature) is a numerical representation in matrix form containing information about an object's important characteristics; an example is the intensity value of each pixel of the image. Rather than training a network from scratch, we use transfer learning: a pre-trained model that has already been trained on a large dataset is used to extract features for our task. We use the VGG16 model, which has been trained on the ImageNet dataset with 1,000 different classes to classify. We can import this model directly from keras.applications; the keras.applications.vgg16 module makes it extremely easy to use in our code. Since the VGG16 model was originally built for ImageNet, only small changes are needed to integrate it with our model. One thing to notice is that the VGG16 model takes a 224*224*3 image as input. We remove the last classification layer and take the 4,096-dimensional output of the penultimate fully connected layer as the feature vector.
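The extraction step might look like the following sketch, which uses only the standard Keras API (VGG16, preprocess_input, Model); the image directory is a placeholder.

    import os
    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.preprocessing.image import load_img, img_to_array
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet")
    # Drop the 1000-way classification layer; output the penultimate
    # fully connected layer (4,096-dimensional) as the feature vector
    extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

    def extract_features(image_dir):
        features = {}
        for name in os.listdir(image_dir):
            # VGG16 expects 224x224 RGB input
            img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
            x = img_to_array(img)
            x = preprocess_input(np.expand_dims(x, axis=0))
            features[name.rsplit(".", 1)[0]] = extractor.predict(x, verbose=0)[0]
        return features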
Long Short-Term Memory: The LSTM translates the features and objects extracted by the Convolutional Neural Network (CNN) into a natural sentence, i.e., a caption that describes the image in a suitable way.
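One common way to wire these two components together is the "merge" architecture sketched below, where the image feature and the partial caption are encoded separately and summed before predicting the next word. The layer sizes, vocab_size, and max_length here are assumptions, not values taken from the specification.

    from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
    from tensorflow.keras.models import Model

    def build_caption_model(vocab_size, max_length):
        # Image-feature branch: compress the 4,096-d VGG16 vector to 256-d
        img_in = Input(shape=(4096,))
        img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

        # Partial-caption branch: embed the word indices and run an LSTM
        seq_in = Input(shape=(max_length,))
        seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
        seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

        # Merge both branches and predict the next word over the vocabulary
        merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
        out = Dense(vocab_size, activation="softmax")(merged)

        model = Model(inputs=[img_in, seq_in], outputs=out)
        model.compile(loss="categorical_crossentropy", optimizer="adam")
        return model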
1.5 Testing Approach

Software testing is the process used to help identify the correctness, completeness, security, and quality of developed computer software. It includes executing the program or application with the intent of finding errors. Quality is not an absolute; it is value to some person. With that in mind, testing can never completely establish the correctness of arbitrary computer software; it furnishes a criticism or comparison of the state and behavior of the product against a specification. Because code is the only product that can be executed and whose actual behavior can be observed, testing is the phase where errors remaining from all previous phases must be detected. The program under test is executed with a set of test cases, and its output for those test cases is evaluated to determine whether the program performs as expected. Testing forms the first step in determining the errors in a program, and its success in revealing errors depends critically on the test cases.
Claims:
1. The proposed innovation claims search engine optimization to make more sense of photos.
2. The proposed innovation may be helpful for visually impaired people, who are unable to envision visuals, and in self-driving cars.
3. The proposed innovation claims that the model should generate a suitable caption using neural network and natural language processing techniques.
4. The proposed innovation claims to reduce manual work and hence reduce human error.

Documents

Application Documents

# Name Date
1 202241052113-COMPLETE SPECIFICATION [13-09-2022(online)].pdf 2022-09-13
2 202241052113-STATEMENT OF UNDERTAKING (FORM 3) [13-09-2022(online)].pdf 2022-09-13
3 202241052113-DECLARATION OF INVENTORSHIP (FORM 5) [13-09-2022(online)].pdf 2022-09-13
4 202241052113-REQUEST FOR EARLY PUBLICATION(FORM-9) [13-09-2022(online)].pdf 2022-09-13
5 202241052113-DRAWINGS [13-09-2022(online)].pdf 2022-09-13
6 202241052113-FORM-9 [13-09-2022(online)].pdf 2022-09-13
7 202241052113-FORM 1 [13-09-2022(online)].pdf 2022-09-13