
A Novel Method For Training Neural Networks On Very Large Images

Abstract: A method of training neural networks with very large images and providing inference, comprising the steps of taking an input image; dividing the input image into patches, wherein each patch is processed as an independent image and passed as a batch to the neural network model for processing; performing model training while ensuring that the patches are covered over the course of iterations; and performing model inference. Reference Figures: 3(a) and 3(b).


Patent Information

Application #: 202331006398
Filing Date: 31 January 2023
Publication Number: 31/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE

Applicants

TECHNOLOGY INNOVATION IN EXPLORATION & MINING FOUNDATION
3rd Floor, i2h Tower (Institute Innovation Hub), IIT(ISM) Dhanbad, Jharkhand - 826004

Inventors

1. Deepak Kumar Gupta
c/o Deepak Distributors, Dharamshala Road, Gandhi Nagar, Basti, Uttar Pradesh-272001
2. Dilip Kumar Prasad
12, Pirpukur Road, Sarada Moni Park, Kolkata 700070, West Bengal, India
3. Arnav Santosh Chavan
C-501, Panchavati Dham CHS, Ashokvan, Dahisar East, Mumbai 400068, Maharashtra
4. Gowreesh Mago
327-A, Rishi Nagar, Backside Kali Mata Mandir, Ludhiana, Punjab 141001

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003

COMPLETE SPECIFICATION
(See section 10 and rule 13)

A METHOD OF TRAINING NEURAL NETWORKS WITH VERY LARGE IMAGES AND PROVIDING INFERENCE

TECHNOLOGY INNOVATION IN EXPLORATION & MINING FOUNDATION
a company incorporated in India, having address
at 3rd Floor, i2h Tower (Institute Innovation Hub), IIT(ISM) Dhanbad, Jharkhand - 826004

The following specification particularly describes the invention and the manner in which it is to be performed.
FIELD OF THE INVENTION
The present invention relates to a method of training neural networks with very large images and providing inference.

BACKGROUND OF THE INVENTION
Convolutional neural networks (CNNs) are considered among the most vital ingredients behind the rapid developments in the field of computer vision. This can be attributed to their capability of extracting very complex information, far beyond what can be obtained from standard computer vision methods (See the research papers, Khan, A., Sohail, A., Zahoora, U., and Qureshi, A. S. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 53:5455-5516, 2020; Li, Z., Liu, F., Yang, W., Peng, S., and Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, pp. 1-21, 2021; Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., and Farhan, L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8, 2021).
With the recent technological developments, very large images are obtained from data acquisition in the fields of microscopy (See research papers, Khater, I. M., Nabi, I. R., and Hamarneh, G. A review of super-resolution single-molecule localization microscopy cluster analysis and quantification methods. Patterns, 1(3):100038, 2020; Schermelleh, L., Ferrand, A., Huser, T., Eggeling, C., Sauer, M., Biehlmaier, O., and Drummen, G. P. Super-resolution microscopy demystified. Nature Cell Biology, 21(1):72-84, 2019), medical imaging (See research paper, Aggarwal, R., Sounderajah, V., Martin, G., Ting, D. S., Karthikesalingam, A., King, D., Ashrafian, H., and Darzi, A. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digital Medicine, 4(1):65, 2021), and earth sciences (See research papers, Huang, Y., Chen, Z.-x., Tao, Y., Huang, X.-z., and Gu, X.-f. Agricultural remote sensing big data: Management and applications. Journal of Integrative Agriculture, 17(9):1915-1931, 2018; Amani, M., Ghorbanian, A., Ahmadi, S. A., Kakooei, M., Moghimi, A., Mirmazloumi, S. M., Moghaddam, S. H. A., Mahdavi, S., Ghahremanloo, M., Parsian, S., et al. Google earth engine cloud computing platform for remote sensing big data applications: A comprehensive review. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13:5326-5350, 2020), among others. Recently, there has been a drive to use deep learning methods in these fields as well. In particular, several deep learning methods have been proposed to handle images from the microscopy domain (See research papers, Orth, A., Schaak, D., and Schonbrun, E. Microscopy, meet big data. Cell Systems, 4(3):260-261, 2017; Dankovich, T. M. and Rizzoli, S. O. Challenges facing quantitative large-scale optical super-resolution, and some simple solutions. iScience, 24(3):102134, 2021; Sekh, A. A., Opstad, I. S., Birgisdottir, A. B., Myrmel, T., Ahluwalia, B. S., Agarwal, K., and Prasad, D. K. Learning nanoscale motion patterns of vesicles in living cells. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14014-14023, 2020; Sekh, A. A., Opstad, I. S., Godtliebsen, G., Birgisdottir, Å. B., Ahluwalia, B. S., Agarwal, K., and Prasad, D. K. Physics-based machine learning for subcellular segmentation in living cells. Nature Machine Intelligence, 3(12):1071-1080, 2021); however, the big data challenge of applying CNNs to analyze such images is immense, as shown in Figure 1.
Figure 1 shows a nanoscopy image (left) of a mouse kidney cryosection, approximately 1/12th of the area of a single field-of-view of the microscope, chosen to illustrate the level of detail at different scales. The bottom right images show that the smallest features of relevance in the image can be as small as a few pixels (here 5-8 pixels for the holes) (See the research paper, Villegas-Hernández, L. E., Dubey, V., Nystad, M., Tinguely, J.-C., Coucheron, D. A., Dullo, F. T., Priyadarshi, A., Acuña, S., Ahmad, A., Mateos, J. M., et al. Chip-based multimodal super-resolution microscopy for histological investigations of cryopreserved tissue sections. Light: Science & Applications, 11(1):1-17, 2022).
High content nanoscopy involves taking nanoscopy images of several adjacent fields-of-view and stitching them side-by-side to have a full perspective of the biological sample, such as a patient's tissue biopsy, put under the microscope. There is information at multiple scales embedded in these microscopy images (See research paper, Villegas-Hernández, L. E., Dubey, V., Nystad, M., Tinguely, J.-C., Coucheron, D. A., Dullo, F. T., Priyadarshi, A., Acuña, S., Ahmad, A., Mateos, J. M., et al. Chip-based multimodal super-resolution microscopy for histological investigations of cryopreserved tissue sections. Light: Science & Applications, 11(1):1-17, 2022), with the smallest scale of features being only a few pixels in size. Indeed, such image dimensions and levels of detail are a challenge for the existing CNNs.
Existing deep learning models using CNNs are predominantly trained and tested in a relatively low-resolution regime (less than 300×300 pixels). This is partly because the widely used image benchmarking datasets, such as ILSVRC (the ImageNet dataset) (See ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2010) for classification and PASCAL VOC (See the research paper, Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303-308, 2009) for object detection/segmentation, consist of low-resolution images in a similar range, and most of the existing research has been directed towards achieving state-of-the-art (SOTA) results on these or similar datasets. Using these models on high-resolution images leads to quadratic growth of the associated activation size: for example, going from 512×512 to 4096×4096 pixels increases the pixel count, and hence the activation memory, by a factor of 64. This in turn leads to a massive increase in the training compute as well as the memory footprint. Further, when the available GPU memory is limited, such large images cannot be processed by CNNs.
Very few existing works address the issue of handling very large images using CNNs. The most common approach among these is to reduce the resolution of the images through downscaling. However, this can lead to a significant loss of information associated with small-scale features, and it can adversely affect the semantic context of the image. An alternate strategy is to divide the image into overlapping or non-overlapping tiles and process the tiles sequentially. However, this approach does not ensure that the semantic link across the tiles is preserved, and it can hinder the learning process. Several similar strategies exist that attempt to learn the information contained in large images; however, their failure to capture the global context limits their use.
The majority of existing works employ pixel-level segmentation masks, which are not always available. For example, the research papers Iizuka, O., Kanavati, F., Kato, K., Rambeau, M., Arihiro, K., and Tsuneki, M. Deep learning models for histopathological classification of gastric and colonic epithelial tumours. Scientific Reports, 10(1):1504, Jan 2020. ISSN 2045-2322. doi: 10.1038/s41598-020-58467-9. URL https://doi.org/10.1038/s41598-020-58467-9; and Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Venugopalan, S., Timofeev, A., Nelson, P. Q., Corrado, G. S., et al. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442, 2017, describe performing patch-level classification based on labels created from patchwise segmentation masks available for the whole slide images (WSI), and then feeding the result to an RNN to obtain the final WSI label.
The research paper Braatz, J., Rajpurkar, P., Zhang, S., Ng, A. Y., and Shen, J. Deep learning-based sparse whole-slide image analysis for the diagnosis of gastric intestinal metaplasia. arXiv preprint arXiv:2201.01449, 2022, discloses the use of goblet cell segmentation masks to perform patch-level feature extraction. However, these approaches require labelled segmentation data, are computationally expensive, offer only limited feature learning, and suffer from higher error propagation.
Another set of methods focuses on building a compressed latent representation of the large input images using existing pretrained models or unsupervised learning approaches. For example, the research paper Lai, Z.-F., Zhang, G., Zhang, X.-B., and Liu, H.-T. High-resolution histopathological image classification model based on fused heterogeneous networks with self-supervised feature representation. BioMed Research International, 2022:8007713, Aug 2022. ISSN 2314-6133. doi: 10.1155/2022/8007713. URL https://doi.org/10.1155/2022/8007713, describes using a U-Net autoencoder to encode patches and stacking the encodings into a cube, which is then fed to another module to obtain slide-level predictions.
The research paper Tellez, D., Litjens, G. J. S., van der Laak, J. A., and Ciompi, F. Neural image compression for gigapixel histopathology image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43:567-578, 2018, describes the use of different encoding strategies, including reconstruction error minimization, contrastive learning and adversarial feature learning, to map high-resolution patches to a lower-dimensional vector. The research paper Tellez, D., Höppener, D., Verhoef, C., Grünhagen, D., Nierop, P., Drozdzal, M., Laak, J., and Ciompi, F. Extending unsupervised neural image compression with supervised multitask learning. In Medical Imaging with Deep Learning, pp. 770-783. PMLR, 2020, further describes the use of multi-task learning to obtain better representations of patches than their unsupervised counterparts. The limitation of these methods is that the encoding network created from unsupervised learning is not always strongly representative of the target task.
There exist several methods that use pretrained models derived from other tasks as feature extractors, with the output then fed to a classifier. For example, methods include using the Cancer-Texture Network (CAT-Net) and Google Brain (GB) models as feature extractors (See the research paper, Kosaraju, S. C., Park, J., Lee, H., Yang, J., and Kang, M. Deep learning-based framework for slide-based histopathological image analysis. Scientific Reports, 12, Nov 2022. doi: 10.1038/s41598-022-23166-0. URL https://doi.org/10.1038/s41598-022-23166-0), or additionally using similar datasets for fine-tuning (See the research paper, Brancati, N., Pietro, G. D., Riccio, D., and Frucci, M. Gigapixel histopathological image analysis using attention-based neural networks. IEEE Access, 9:87552-87562, 2021). Although these methods benefit from transfer learning, such two-stage decoupled pipelines propagate errors through under-represented features, and the performance of the model on the target task is hampered.
Several research works have focused on identifying the right patches from the large images and using them in a compute-effective manner to classify the whole image. For example, the research paper, Naik, N., Madani, A., Esteva, A., Keskar, N. S., Press, M. F., Ruderman, D., Agus, D. B., and Socher, R. Deep learning-enabled breast cancer hormonal receptor status determination from base-level H&E stains. Nature Communications, 11(1):5727, Nov 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-19334-3. URL https://doi.org/10.1038/s41467-020-19334-3, describes constructing a latent space using randomly selected tiles; however, this approach does not preserve the semantic coherence across the tiles and fails to extract features that are spread across multiple tiles. The research paper, Campanella, G., Hanna, M. G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., Busam, K. J., Brogi, E., Reuter, V. E., Klimstra, D. S., and Fuchs, T. J. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301-1309, 2019, describes a multi-instance learning approach, assigning labels to the top-K probability patches for classification. Further, the research papers, Pinckaers, H., van Ginneken, B., and Litjens, G. Streaming convolutional neural networks for end-to-end learning with multi-megapixel images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(03):1581-1590, Mar 2022. ISSN 1939-3539. doi: 10.1109/TPAMI.2020.3019563; and Huang, S.-C., Chen, C.-C., Lan, J., Hsieh, T.-Y., Chuang, H.-C., Chien, M.-Y., Ou, T.-S., Chen, K.-H., Wu, R.-C., Liu, Y.-J., Cheng, C.-T., Huang, Y.-J., Tao, L.-W., Hwu, A.-F., Lin, I.-C., Hung, S.-H., Yeh, C.-Y., and Chen, T.-C. Deep neural network trained on gigapixel images improves lymph node metastasis detection in clinical settings. Nature Communications, 13(1):3347, Jun 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-30746-1. URL https://doi.org/10.1038/s41467-022-30746-1, propose patch-based training making use of streaming convolutional networks. The research paper, Sharma, Y., Shrivastava, A., Ehsan, L., Moskaluk, C. A., Syed, S., and Brown, D. E. Cluster-to-conquer: A framework for end-to-end multi-instance learning for whole slide image classification. In International Conference on Medical Imaging with Deep Learning, 2021, describes clustering similar patches and performing cluster-aware sampling for WSI and patch classification.
The research paper, Cordonnier, J.-B., Mahendran, A., Dosovitskiy, A., Weissenborn, D., Uszkoreit, J., and Unterthiner, T. Differentiable patch selection for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2351-2360, 2021, describes a patch scoring mechanism and a patch aggregator network for the final prediction; however, it performs downsampling for patch scoring, which may cause loss of patch-specific features important for WSI analysis.
The research paper, Papadopoulos, A., Korus, P., and Memon, N. Hard-attention for scalable image classification. Advances in Neural Information Processing Systems, 34:14694-14707, 2021, describes progressively increasing the resolution, localizing the regions of interest, and dropping the rest, equivalent to performing hard adaptive attention.
The research paper, DiPalma, J., Suriawinata, A. A., Tafe, L. J., Torresani, L., and Hassanpour, S. Resolution-based distillation for efficient histology image classification. Artificial Intelligence in Medicine, 119:102136, 2021. ISSN 0933-3657. doi: 10.1016/j.artmed.2021.102136. URL https://www.sciencedirect.com/science/article/pii/S0933365721001299, describes training a teacher model at high resolution and performing knowledge distillation for the same model at a lower resolution.
The research paper, Katharopoulos, A. and Fleuret, F. Processing megapixel images with deep attention-sampling models. In International Conference on Machine Learning, pp. 3282-3291. PMLR, 2019, describes performing attention sampling on a downsampled image and deriving an unbiased estimator for the gradient update. However, this method involves downsampling for attention, which may lose some vital information.
With the recent popularity of Transformer-based methods for vision tasks, the research paper, Chen, R. J., Chen, C., Li, Y., Chen, T. Y., Trister, A. D., Krishnan, R. G., and Mahmood, F. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 16144-16155, June 2022, discloses a self-supervised learning objective for pre-training large-scale vision transformers at varying scales. The method involves a hierarchical vision transformer which leverages the natural hierarchical structure inherent in WSIs. However, the method requires a massive pre-training stage, which is not always feasible. Also, the method is specific to WSIs rather than more general image classification, and involves training multiple large-scale transformers.
Therefore, the object of the present invention is to solve one or more of the aforementioned issues.

SUMMARY OF THE INVENTION
A method of training neural networks with very large images and providing inference is described. The method comprises the steps of taking an input image; dividing the input image into patches, wherein each patch is processed as an independent image and passed as a batch to the neural network model for processing; performing model training while ensuring that the patches are covered over the course of iterations; and performing model inference.

BRIEF DESCRIPTION OF THE DRAWINGS
Reference will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figure(s). These figure(s) are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that this is not intended to limit the scope of the invention to these particular embodiments.
Figure 1 shows an example of nanoscopy image;
Figure 2 shows a performance comparison of a standard CNN and the present invention for the task of classification of UltraMNIST digits of size 512×512 pixels using the ResNet50 model according to an embodiment of the present invention;
Figure 3(a) shows steps of filling of Z block according to an embodiment of the present invention;
Figure 3(b) shows steps of model update and model inference according to an embodiment of the present invention;
Figure 4 shows a sample PANDA and UltraMNIST dataset images used for training according to an embodiment of the present invention; and
Figure 5 shows sample PANDA images along with their latent space Z.

DETAILED DESCRIPTION OF THE INVENTION
A method of training neural networks with very large images and providing inference is disclosed. The present invention provides a CNN training method that can train networks with high-resolution images. In the present invention, rather than performing gradient-based updates on an entire image at once, model updates are performed on only small parts of the image at a time, while ensuring that the majority of the image is covered over the course of iterations. Even though only a portion of the image is used at each step, the model remains trainable end-to-end.

The present invention provides the construction, or filling, of the Z block, a deep latent representation of the full input image. Irrespective of which parts of the input are used to perform model updates, Z builds an encoding of the full image based on information acquired for its different parts over the previous few update steps.

Figure 3(a) explains the use of the Z block. As shown, Z is primarily an encoding of an input image X obtained using any given model parameterized with weights θ_1. The input image is divided into m×n patches, and each patch is processed as an independent image using f_(θ_1). The size of Z is always enforced to be m×n×s, such that patch x_ij in the input space corresponds to the respective 1×1×s segment in the Z block.
The process of Z-filling spans multiple steps, where every step involves sampling k patches and their respective positions from X and passing them as a batch to the model for processing. The output of the model, combined with the positions, is then used to fill the respective parts of Z. Once all the m×n patches of X are sampled, the filled form of Z is obtained. The process of filling Z is used during the model training as well as the inference stage. To build an end-to-end CNN model, a small subnetwork comprising convolutional and fully connected layers is added; it processes the information contained in Z and transforms it into a vector of c probabilities, as desired for the task of classification. The cost of adding this small sub-network is generally negligible. A minimal sketch of this Z-filling step is given below.
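The following is a hedged, illustrative sketch of the Z-filling step in PyTorch (the framework used later in the experiments). The helper name `fill_z` and the assumption that the encoder `f_theta1` maps a batch of patches to s-dimensional codes are for illustration only, not the verbatim implementation of the invention.

```python
import torch

@torch.no_grad()  # as stated above, no gradients are stored while filling Z
def fill_z(f_theta1, image, patch_size, k, s):
    """Encode all m x n patches of `image` (C, H, W) into a Z block (m, n, s)."""
    C, H, W = image.shape
    m, n = H // patch_size, W // patch_size
    Z = torch.zeros(m, n, s)
    positions = [(a, b) for a in range(m) for b in range(n)]
    for start in range(0, len(positions), k):   # k patches per step, as a batch
        batch_pos = positions[start:start + k]
        patches = torch.stack([
            image[:, a * patch_size:(a + 1) * patch_size,
                     b * patch_size:(b + 1) * patch_size]
            for a, b in batch_pos
        ])
        codes = f_theta1(patches)               # one s-dim encoding per patch
        for (a, b), z in zip(batch_pos, codes):
            Z[a, b] = z                         # fill the matching 1x1xs segment
    return Z
```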

Figure 3(b) shows the steps of model training and inference. During training, model components θ_1 as well as θ_2 are updated. Based on a fraction of patches sampled from the input image, the respective encodings are computed using the latest state of θ_1, and the output is used to update the corresponding entries in the already filled Z. The partially updated Z is then used to compute the loss function value, and the model parameters are updated through back-propagation, as sketched below.
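A sketch of one such inner step follows, assuming the Z block was filled beforehand. Gradients flow only through the freshly sampled entries, while the remaining entries of Z are detached, matching the description above; all helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def inner_step(f_theta1, g_theta2, image, Z, label, patch_size, k):
    """One inner iteration: refresh k entries of Z and compute the loss."""
    m, n, s = Z.shape
    idx = torch.randperm(m * n)[:k]          # sample k patch positions
    rows, cols = idx // n, idx % n
    patches = torch.stack([
        image[:, a * patch_size:(a + 1) * patch_size,
                 b * patch_size:(b + 1) * patch_size]
        for a, b in zip(rows.tolist(), cols.tolist())
    ])
    codes = f_theta1(patches)                # gradients flow only through these
    Z = Z.detach().index_put((rows, cols), codes)  # refresh the sampled entries
    logits = g_theta2(Z.permute(2, 0, 1).unsqueeze(0))  # (1, c) class scores
    return F.cross_entropy(logits, label.view(1)), Z
```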

MATHEMATICAL FORMULATION
A detailed mathematical formulation and description of the model training and inference steps is provided below.

Let f_θ : ℝ^(M×N×C) → ℝ^c denote a CNN-based model parameterized by θ that takes an input image X of spatial size M×N with C channels and computes the probability of it belonging to each of the c pre-defined classes. To train this model, the following optimization problem is solved:
min_θ Σ_({X,y}∈D) L(f(θ;X), y), - Equation 1
where {X,y} ∈ D refers to the data samples used to train the network and L(·) denotes the loss function associated with the training. Traditionally, this problem is solved in deep learning using the mini-batch gradient descent approach, where updates are performed at every step using only a fraction of the data samples.

Gradient Descent (GD)
Gradient descent in deep learning involves performing model updates using the gradients computed for the loss function over one or more image samples. With updates performed over one sample at a time, referred to as the stochastic gradient descent method, the model update at the i-th step can be mathematically stated as
θ^((i)) = θ^((i-1)) - α dL/dθ^((i-1)), - Equation 2
where α denotes the learning rate. However, performing model updates over one sample at a time leads to very slow convergence, especially because of the noise induced by the continuously changing descent direction. This issue is alleviated in the mini-batch gradient descent method, where at every step the model weights are updated using the average of the gradients computed over a batch of samples, denoted here as S. Based on this, the update can be expressed as
θ^((i)) = θ^((i-1)) - (α/N(S)) Σ_(X∈S) dL^((X))/dθ^((i-1)), - Equation 3
where N(S) denotes the size of the batch used. As can be seen in Equation 3, if the size of the image samples X ∈ S is very large, this leads to large memory requirements for the respective activations, and under limited compute availability only small values of N(S), sometimes even just 1, fit into the GPU memory. This clearly demonstrates the limitation of the gradient descent method when handling large images. This issue is alleviated by the present invention.
The present invention avoids model updates on an entire image sample in one go; rather, it computes gradients using only part of the image and then updates the model parameters. In this regard, the model update step can be stated as
θ^((i,j)) = θ^((i,j-1)) - (α/(k·N(S_i))) Σ_(X∈S_i) Σ_(p∈P_(X,j)) dL^((X,p))/dθ^((i,j-1)). - Equation 4
In the context of deep learning, i here refers to the index of the mini-batch iteration within a certain epoch. Further, j denotes the inner iterations, where at every inner iteration, k patches are sampled from the input image X (denoted as P_(X,j)) and the gradient-based updates are performed as stated in Equation 4. Note that for any iteration i, multiple inner iterations are run, ensuring that the majority of samples from the full set of patches obtained from the tiling of X are explored.
In Equation 4, θ^((i,0)) denotes the initial model used to start running the inner iterations on S_i and is equal to θ^((i-1,ζ)), the final model state after ζ inner iterations of patch-level updates using S_(i-1). A step-by-step model update process is provided in Process 1. As described earlier, the present invention uses an additional sub-network that looks at the full latent encoding Z for any input image X. Thus, the parameter set θ is extended as θ = [θ_1, θ_2]^T, where the base CNN model is f_(θ_1) and the additional sub-network is denoted as g_(θ_2).
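For concreteness, the split θ = [θ_1, θ_2] could be realized as below. The ResNet50 backbone with its classifier replaced, and the 4-layer convolutional head with 256 channels, follow the implementation details reported later; the exact layer arrangement is nonetheless an assumption of this sketch.

```python
import torch.nn as nn
import torchvision

class PatchEncoder(nn.Module):       # f_theta1: patch -> 1x1xs code
    def __init__(self, s=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, s)  # s-dim code head
        self.net = backbone
    def forward(self, patches):     # (k, 3, p, p) -> (k, s)
        return self.net(patches)

class LatentHead(nn.Module):         # g_theta2: (B, s, m, n) -> (B, c)
    def __init__(self, s=256, c=10):
        super().__init__()
        layers, ch = [], s
        for _ in range(4):           # 4 conv layers, 256 channels each
            layers += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU()]
            ch = 256
        self.conv = nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(256, c)  # c class scores for classification
    def forward(self, Z):
        return self.fc(self.conv(Z).flatten(1))
```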
Process 1 describes model training over one batch of B images, denoted as 𝒳 ∈ ℝ^(B×M×N×C). As the first step of the model training process, Z corresponding to each X ∈ 𝒳 is initialized. The process of filling Z is described in Process 2. For patch x_ab, the respective Z[a,b,:] is updated using the output obtained from f_(θ_1). θ_1 is loaded from the last state obtained during the model update on the previous batch of images. During the filling of Z, no gradients are stored for backpropagation.

Process 1: Model Training for 1 iteration
Next the model update process is performed over a series of ? inner-iterations, where at every step j?{1,2,…,?},k patches are sampled per image X?X and the respective parts of Z are updated. Next, the partly updated Z is processed with the additional sub-network ?_2 to compute the class probabilities and the corresponding loss value. Based on the computed loss, gradients are backpropagated to perform update of ?_1 and ?_2. The frequency of model updates in the inner iterations is controlled through an additional term ?. Similar to how a batch size of 1 in mini-batch gradient descent introduces noise and adversely affects the convergence process, it is observed that gradient update per inner-iteration leads to sometimes poor convergence. Thus, gradient accumulation is introduced over ? steps and update the model accordingly. Gradients are allowed to backpropagate only through those parts of Z that are active at the j^"th " inner-iteration. During inference phase, Z is filled using the optimized f_(?_1^* ) as stated in Process 2 and then the filled version of Z is used to compute the class probabilities for input X using g_(?_2^* ).
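Putting the pieces together, a minimal sketch of Process 1 for one batch might look as follows, with ζ written as `zeta` and the accumulation length ε as `eps`; `fill_z` and `inner_step` are the illustrative helpers sketched earlier, and the loop structure is an assumption consistent with the description above. Here `opt` would be, for example, `torch.optim.Adam(list(f1.parameters()) + list(g2.parameters()), lr=1e-3)`.

```python
def train_batch(f1, g2, opt, images, labels, patch_size, k, s, zeta, eps):
    """One outer iteration (Process 1) over a batch of images."""
    # Step 1: fill Z for every image with gradients disabled (Process 2).
    Zs = [fill_z(f1, x, patch_size, k, s) for x in images]
    opt.zero_grad()
    for j in range(1, zeta + 1):                 # zeta inner iterations
        loss = 0.0
        for i, (x, y) in enumerate(zip(images, labels)):
            step_loss, Z_new = inner_step(f1, g2, x, Zs[i], y, patch_size, k)
            Zs[i] = Z_new.detach()               # keep refreshed entries, drop graph
            loss = loss + step_loss
        (loss / (eps * len(images))).backward()  # accumulate gradients
        if j % eps == 0:                         # update theta_1, theta_2 every eps steps
            opt.step()
            opt.zero_grad()
```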

Process 2: Filling of the Z block (referred to as Z-filling)
EXPERIMENTAL ANALYSIS
The efficacy of the present invention is demonstrated through numerical experiments on two benchmark datasets comprising large images with features at multiple scales.
Experimental setup
Datasets
For the experiments, two datasets, UltraMNIST and PANDA, are used. For the purpose of this study, the focus is on large-scale images with resolutions ranging from 512 to 4096 pixels.

UltraMNIST
This is a synthetic dataset generated using MNIST digits. For constructing an image, 3-5 digits are sampled such that the total sum of the digits is less than 10. Thus, an image can be assigned a label corresponding to the sum of the digits contained in it. Each of the 10 classes (0-9) has 1000 samples, making the dataset sufficiently large. The variation used in this dataset is an adapted version of the original data presented in the research paper, Gupta, D. K., Bamba, U., Thakur, A., Gupta, A., Sharan, S., Demir, E., and Prasad, D. K. UltraMNIST classification: A benchmark to train CNNs for very large images. arXiv, 2022. Since the digits vary significantly in size and are placed far from each other, this dataset fits well in terms of learning semantic coherence in an image. Moreover, it poses the challenge that downscaling the images leads to a significant loss of information. While an even higher resolution could have been chosen, it is later demonstrated that the chosen image size is sufficient to demonstrate the superiority of the present invention over the conventional gradient descent method.
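As an aside, the generation recipe just described can be sketched as follows. The random placement and scaling logic here are assumptions for illustration, since the exact construction follows Gupta et al. (2022).

```python
import random
import numpy as np
from PIL import Image

def make_ultramnist_sample(mnist_images, mnist_labels, size=512):
    """Compose 3-5 MNIST digits whose values sum to less than 10 on one canvas."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    while True:                                   # resample until the digit sum < 10
        idx = random.sample(range(len(mnist_labels)), random.randint(3, 5))
        if sum(int(mnist_labels[i]) for i in idx) < 10:
            break
    for i in idx:
        scale = random.randint(28, size // 4)     # digits vary widely in size
        digit = np.array(Image.fromarray(mnist_images[i]).resize((scale, scale)))
        x, y = random.randint(0, size - scale), random.randint(0, size - scale)
        canvas[y:y + scale, x:x + scale] = np.maximum(
            canvas[y:y + scale, x:x + scale], digit)   # overlay, keep brightest
    label = sum(int(mnist_labels[i]) for i in idx)     # class label = digit sum
    return canvas, label
```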

PANDA
The Prostate cANcer graDe Assessment (PANDA) Challenge (See the research paper, Bulten, W., Kartasalo, K., Chen, P.-H. C., Ström, P., Pinckaers, H., Nagpal, K., Cai, Y., Steiner, D. F., van Boven, H., Vink, R., et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nature Medicine, 28(1):154-163, 2022) provides one of the largest publicly available datasets of histopathological images, which scale to very high resolution. It is important to mention that, unlike the other aforementioned approaches, no masks are used. Therefore, the complete task boils down to taking a high-resolution input image and classifying it into 6 categories based on the International Society of Urological Pathology (ISUP) grade groups. The images are downscaled to 4096×4096 resolution, applying white bordering to avoid distortion. These are further downscaled to 512×512 and 2048×2048 for demonstrating the performance of both the baseline and the present invention.
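The described preprocessing (downscaling with white bordering to avoid distortion) might be realized roughly as below; the centering and interpolation choices are assumptions of this sketch.

```python
from PIL import Image

def pad_white_and_resize(img: Image.Image, target: int = 4096) -> Image.Image:
    """Pad to a square with a white border, then downscale to target x target."""
    w, h = img.size
    side = max(w, h)
    square = Image.new("RGB", (side, side), (255, 255, 255))  # white canvas
    square.paste(img, ((side - w) // 2, (side - h) // 2))     # center the slide
    return square.resize((target, target), Image.BILINEAR)
```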
CNN models
Two CNN architectures, ResNet50 (See the research paper, He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016) and MobileNetV2 (See the research paper, Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018), are used.
ResNet50 is a popular network from the residual networks family and forms the backbone of several models used in a variety of computer vision tasks (such as object detection and tracking). The working of the present invention is primarily demonstrated on this model.
MobileNetV2 is a lightweight architecture commonly employed on edge devices, and the performance of MobileNetV2 with large images under limited-memory scenarios is demonstrated.

Implementation details
All models are trained for 100 epochs with the Adam optimizer and a peak learning rate of 1e-3. A learning rate warm-up for 2 epochs starting from 0, followed by a linear decay over the remaining 98 epochs down to half the peak learning rate, was employed. The latent classification head consists of 4 convolutional layers with 256 channels each. In the case of PANDA, gradient accumulation is performed over inner iterations for better convergence. To verify that the results are better not merely because of an increase in parameters (coming from the classification head), the baselines are also extended with a similar head; GD-extended, for MobileNetV2 on UltraMNIST, refers to the baseline extended with this head. The stated schedule can be expressed as a simple function, as sketched below.
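A hedged reading of the schedule above: linear warm-up from 0 to the 1e-3 peak over 2 epochs, then linear decay over 98 epochs to half the peak. The exact interpolation endpoints are assumptions. This can be plugged into PyTorch via `torch.optim.lr_scheduler.LambdaLR(opt, lambda e: lr_at_epoch(e) / 1e-3)`.

```python
def lr_at_epoch(epoch: int, peak: float = 1e-3, warmup: int = 2,
                total: int = 100) -> float:
    if epoch < warmup:
        return peak * epoch / warmup            # linear warm-up from 0 to peak
    frac = (epoch - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return peak * (1.0 - 0.5 * frac)            # linear decay, ends at peak / 2
```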
In the low-memory case, as demonstrated in the UltraMNIST experiments, the original backbone architecture is first trained separately for 100 epochs. This provides a better initialization for the backbone, which is then used in the present invention (referred to as PatchGD).
For the baseline in PANDA at 2048 resolution, another study involved gradient accumulation over images, performed for the same number of images that can be fed when the percent sampling is 10%, i.e., 14, since a 2048×2048 image with a patch size of 128 and 10% sampling can have a maximum batch size of 14 under a 16 GB memory constraint. That is to say, the baseline can virtually process a batch of 14 images. This, however, was not optimal, and the peak accuracy reported occurred in the initial epochs due to the loading of the model pre-trained on the lower resolution, after which the metrics remained stagnant (accuracy: 32.11%, QWK: 0.357).
The same hyperparameters were used across experiments for a fair comparison.
Classification accuracy and quadratic weighted kappa (QWK) are reported for the PANDA dataset. PyTorch is the framework of choice for implementing both the baselines and PatchGD. Memory constraints of 4 GB, 16 GB and 24 GB were imposed to mimic popular deep learning GPU memory limits. Latency is calculated on a 40 GB A100 GPU, completely filling the GPU memory.
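For reference, QWK can be computed with scikit-learn's quadratically weighted Cohen's kappa; this mirrors the metric reported for PANDA, though it is an assumption that the original evaluation used this exact routine.

```python
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    """Quadratic weighted kappa between integer grade labels and predictions."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```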
Results
UltraMNIST
Figure 2 shows the performance of the present invention on UltraMNIST. More detailed results are presented in Tables 1 and 2. For both architectures, the present invention outperforms the standard gradient descent method (referred to as GD) by large margins. The present invention employs an additional subnetwork g_(θ_2), and it could be argued that the reported gains are due to it. For this purpose, the base CNN architectures used in GD were extended, and the respective performance scores are reported in Tables 1 and 2 as GD-extended.

Table 1: Performance scores for standard gradient descent (referred to as GD) and the present invention (referred to as PatchGD) obtained using the ResNet50 architecture on the task of UltraMNIST classification with images of size 512×512.

Table 2: Performance scores for standard gradient descent (referred to as GD) and the present invention (referred to as PatchGD) on the task of UltraMNIST classification with images of size 512×512 obtained using the MobileNetV2 architecture.

For both architectures, the present invention outperforms GD as well as GD-extended by large margins. For ResNet50, the performance difference is even higher under a low memory constraint. At 4 GB, while GD appears unstable, with a performance dip of more than 11% compared to the 16 GB case, the present invention is significantly more stable.
For MobileNetV2, the difference between the present invention and GD is even higher in the 16 GB case, clearly showing that the present invention blends well with lightweight models such as MobileNetV2. For MobileNetV2, going from 16 GB to 4 GB shows no drop in model performance, which demonstrates that MobileNetV2 can work well with GD even in low-memory conditions. Nevertheless, the present invention still performs significantly better. The underlying reason for this gain is that, since the present invention facilitates operating with partial images, the activations are small and more images per batch are permitted. It is also observed that the performance scores of GD-extended are inferior even to those of GD. ResNet50 and MobileNetV2 are optimized architectures, and it is speculated that the addition of plain convolutional layers in the head of the network is not well suited to them, due to which the overall performance is adversely affected.

PANDA
Table 3 presents the results obtained on the PANDA dataset for three different image resolutions. For all experiments, the number of images used per batch is maximized while ensuring that the memory constraint is not violated. For images of 512×512, GD as well as PatchGD deliver approximately similar performance scores (for both accuracy and QWK) at the 16 GB memory limit. However, for the same memory constraint, when images of size 2048×2048 (2K) pixels are used, the performance of GD drops by approximately 10%, while PatchGD shows a boost of 9% in accuracy. Two factors are important in creating such a big gap between the performance of GD and PatchGD. First, due to the significantly increased activation size for higher-resolution images, GD faces a batch-size bottleneck and only 1 image per batch is permitted. Note that, to stabilize it, gradient accumulation across batches was experimented with; however, it did not help. Alternatively, hierarchical training was performed, where the model trained on the lower resolution was used as the initial model for the higher resolution. To alleviate the issue of using only 1 image per batch, a higher memory limit was considered. Another reason for the low performance is that the optimized receptive field of ResNet50 is not suited to higher-resolution images, which leads to non-optimal performance.
For an increased batch size at 2K resolution, running quantized networks at half precision with increased memory was considered (see Table 3). At half precision, the performance of GD improves; however, it is still significantly lower than that of PatchGD. A similar observation is made for 4K images: PatchGD performs better, and the performance improves further when a patch size of 256 is used. Clearly, from the results reported on the PANDA dataset, it is evident that PatchGD is significantly better than GD in terms of accuracy as well as QWK when it comes to handling large images in an end-to-end manner. The latency of both methods during inference was also reported, and PatchGD performs almost on par with GD. The reason is that, unlike GD, the activations produced by PatchGD are smaller, and the speed gained from this balances the slowness induced by patch-wise processing of the images. Clearly, for applications that demand handling large images while also aiming at real-time inference, the present invention provides an optimum solution.

Additional study
As demonstrated in the earlier experiments, the present invention (referred to as PatchGD) performs significantly better than its counterpart. A brief study of some of the hyperparameters involved in PatchGD is presented here. Table 4 shows the influence of patch sampling on the overall performance of PatchGD. The sampling fraction per inner-iteration is varied, as well as the fraction of samples considered in total for an image in a certain iteration. It is observed that keeping the sampling fraction per inner-iteration small helps achieve better accuracy. This is counter-intuitive, since smaller fractions provide less context of the image in one go. It is speculated that, similar to mini-batch gradient descent, avoiding too large a patch batch size induces regularization noise, which in turn improves the convergence process. However, this aspect needs to be studied in more detail for a better understanding. It is also observed that the fraction of the image seen in one overall pass in PatchGD does not generally affect the performance, unless it is low. For lower fractions, it is hard for the model to build the global context and the convergence is sub-optimal.
The influence of the gradient accumulation length parameter ε on PatchGD is also briefly studied, and the results are reported in Table 6. It is observed that performing a gradient-based model update per inner iteration leads to superior performance for the chosen experiment. However, the choice of ε depends on the number of inner steps ζ. For large values of ζ, ε values greater than 1 are favored. For example, for the case of processing 2K resolution images with a patch size of 128×128, ε=ζ worked well. However, an empirical relation between ε and ζ is still to be identified.

Table 3: Performance scores obtained using ResNet50 on the PANDA dataset for gradient descent (GD) and Patch Gradient Descent (PatchGD). In the case of the 512 image size, 10% sampling leads to only one patch; hence 30% of patches are chosen.

Table 4: Sampling ablation on the PANDA dataset. The memory limit is 16 GB; the image size and patch size are 2048 and 128, respectively.


Table 5: Influence of different numbers of gradient accumulation steps ε on the performance of MobileNetV2 for the UltraMNIST classification task.

Table 6: Influence of different numbers of gradient accumulation steps ε on the performance of MobileNetV2 for the UltraMNIST classification task.

INDUSTRIAL APPLICABILITY, AND ECONOMIC SIGNIFICANCE
The present invention enables training CNNs with large images in an end-to-end manner and providing inference even when only limited GPU memory/hardware is available. The experimental results demonstrate the efficacy of the present invention in handling large images as well as in operating under low-memory conditions; in all scenarios, the present invention outperforms standard gradient descent by significant margins.
Due to its inherent ability to work with small fractions of a given image, the present invention (referred to as PatchGD) is scalable to small GPUs, where training on the original full-scale images may not even be possible. Further, PatchGD reworks the existing CNN training pipeline in a very simple manner, which makes it compatible with any existing CNN architecture. Moreover, its simple design allows it to benefit from the pre-training of standard CNNs on low-resolution data.

The foregoing description of the invention has been set out merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the disclosure.

CLAIMS
I/We Claim:
1. A method of training neural networks with very large images and providing inference, comprising the steps of:
(a) taking an input image;
(b) dividing the input image into patches, wherein each patch is processed as an independent image, and passing the patches as a batch to the neural network model for processing;
(c) performing model training and ensuring that the patches are covered over the course of iterations; and
(d) performing model inference.

2. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein step (b) comprises:
using a Z block, wherein Z is an encoding of an input image X obtained using any given model parameterized with weights θ_1;
filling the Z block, wherein the process of Z-filling spans multiple steps comprising:
sampling k patches and their respective positions from X and passing them as a batch to the model for processing;
filling the respective parts of Z using the output of the model combined with the positions; and
once all the patches of X are sampled, obtaining the filled form of Z.

3. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein the size of Z is enforced to be m×n×s, such that patch x_ij in the input space corresponds to the respective 1×1×s segment in the Z block.

4. The method of training neural networks with very large images and providing inference as claimed in claims 2 and 3, wherein once all the m×n patches of X are sampled, the filled form of Z is obtained.

5. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein the process of filling Z is used during both the model training and inference stages.

6. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein, to build an end-to-end CNN model, a small subnetwork comprising convolutional and fully connected layers is added, which processes the information contained in Z and transforms it into a vector of c probabilities as desired for the task of classification.

7. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein, during training, model components θ_1 and θ_2 are updated.

8. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein, based on a fraction of patches sampled from the input image, the respective encodings are computed using the latest state of θ_1 and the output is used to update the corresponding entries in the already filled Z.

9. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein the partially updated Z is used to further compute the loss function value and the model parameters are updated through back-propagation.

10. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein the method is performed on small GPU hardware in an end-to-end manner.

11. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein the model is updated by introducing gradient accumulation over multiple steps.

12. The method of training neural networks with very large images and providing inference as claimed in claim 11, wherein gradients are allowed to backpropagate only through those parts of Z that are active at the given inner-iteration.

13. The method of training neural networks with very large images and providing inference as claimed in claim 2, wherein the method further employs an additional subnetwork g_(θ_2).

14. The method of training neural networks with very large images and providing inference as claimed in claim 11, wherein the gradient accumulation is introduced over ε steps and the model is updated accordingly, and gradients are allowed to backpropagate only through those parts of Z that are active at the j-th inner-iteration.

15. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein, during the inference phase, Z is filled using the optimized f_(θ_1^*) and the filled version of Z is then used to compute the class probabilities for input X using g_(θ_2^*).

16. The method of training neural networks with very large images and providing inference as claimed in claim 1, wherein f_θ : ℝ^(M×N×C) → ℝ^c denotes a CNN-based model parameterized by θ that takes an input image X of spatial size M×N and C channels, and computes the probability of it belonging to each of the c pre-defined classes.

Dated this the 31st day of January 2023.
For TECHNOLOGY INNOVATION IN EXPLORATION & MINING FOUNDATION

----------------------------------------------------------
Nishidh Patel
Registered Patent Agent: IN/PA/2399
SINGHANIA & CO. LLP


ABSTRACT
A METHOD OF TRAINING NEURAL NETWORKS WITH VERY LARGE IMAGES AND PROVIDING INFERENCE

A method of training neural networks with very large images and providing inference, comprising the steps of taking an input image; dividing the input image into patches, wherein each patch is processed as an independent image and passed as a batch to the neural network model for processing; performing model training while ensuring that the patches are covered over the course of iterations; and performing model inference.
Reference Figures: Figures 3(a) and 3(b)

Figure 3(a)

Figure 3(b)

