Abstract: Traditional food allergen identification relies mainly on in vivo and in vitro experiments, which are often time-consuming and costly. Artificial intelligence (AI)-driven rapid food allergen identification overcomes both drawbacks and is becoming an efficient auxiliary tool. To overcome the limited accuracy of traditional machine learning models in predicting the allergenicity of food allergens, this work introduces a transformer deep learning model with a self-attention mechanism together with ensemble learning models (represented by the Light Gradient Boosting Machine (LightGBM) and eXtreme Gradient Boosting (XGBoost)). To highlight the advantages of the proposed method, several commonly used machine learning models were also selected as baseline classifiers. Five-fold cross-validation showed that the deep model achieved the highest AUC (0.9400), outperforming the ensemble learning and baseline algorithms, but it required pre-training and had the highest training cost. Comparing the characteristics of the transformer model and the boosting models shows that the two types of models have complementary advantages, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.
FIELD OF THE INVENTION
This invention relates to the identification of food sensitivities and intolerances such as gluten
sensitivity, lactose intolerance, and FODMAP sensitivity.
BACKGROUND OF THE INVENTION
Food allergies induce inflammation in the body when certain food proteins are ingested, inhaled, or touched. Food allergy has gained increasing attention in recent years because of its clinical implications. Extra-intestinal symptoms of food allergy include angioedema, rashes, and dermatitis, and it can also induce rhinitis, conjunctivitis, recurrent oral ulcers, bronchial asthma, allergic purpura, tachycardia, headache, dizziness, and anaphylactic shock. The rising prevalence of food allergies, together with the positive association between food allergies and respiratory tract disorders, poses a major health threat: studies show that people with food allergies have a higher prevalence of respiratory disorders than those without.

Food allergens are antigen molecules that provoke immune system responses. Almost all food allergens are proteins, most of which are water-soluble glycoproteins. According to the UN Food and Agriculture Organization, fish, eggs, milk, crustaceans, soybeans, peanuts, almonds, and wheat contain more than 900 allergens. More than 180 foods have been recognized as allergenic, yet many more remain unidentified. Identifying and detecting food allergens is therefore crucial. Serology and cytology are traditional allergen identification procedures, and identification approaches can also be divided into in vivo and in vitro procedures. In vitro tests include ELISA, simulated gastrointestinal digestion, the histamine release test, western blotting, and allergen adsorption experiments. Traditional identification methods produce solid results, but long experimental periods and high costs hinder high-throughput, high-speed food allergen prediction.
SUMMARY OF THE INVENTION
This work proposes to adopt the pre-trained BERT deep learning model and novel ensemble learning models, represented by LightGBM and XGBoost, to predict the allergenicity of food allergens. Extensive experiments showed excellent results that were superior to previous studies: the AUC value of BERT reached 0.9400, and its accuracy reached 0.8850. Experiments were also conducted to compare and analyze the characteristics of the two approaches. Although the BERT model performed best in this task, it requires higher training costs and more training time, whereas LightGBM and XGBoost need only short training times and give relatively accurate predictions on the small food allergen dataset. This provides guidance for the application scenarios of the different models. As far as we know, this is the first reported study using the above methods to predict the allergenicity of food allergens, which provides inspiration for the more rapid prediction of food allergens in the future.
BRIEF DESCRIPTION OF THE INVENTION
“Food sensitivity”, as used herein, refers to an unpleasant reaction to certain foods or constituents of certain foods, for example, reactions such as but not limited to gastrointestinal symptoms and extra-intestinal symptoms. Examples of food sensitivities include but are not limited to gluten sensitivity and FODMAP sensitivity. “Food intolerance”, as used herein, refers to a reaction that occurs when the body lacks a particular enzyme needed to digest a food or a constituent of a food. A non-limiting example of food intolerance is lactose intolerance. Examples of reactions that the body develops include but are not limited to gastrointestinal symptoms and extra-intestinal symptoms.

One major issue involves the difficulty of accurately tracking food intake. But beyond the actual tracking of food, there is a tremendous amount of noise inherent in the data. One can try to create a series of scores that track food ingestion against symptom expression, and then create normal distribution curves for how people physically respond to known ingestions. That alone is hard enough to do, but when the complexity of food tracking is added on top of it, coupled with normal diets ranging widely from person to person, it becomes almost impossible to measure these things reliably. There are many reasons this may not work: the food tracking itself may fail (an important but solvable issue), people may not really want to track their dietary intake carefully, they may not track it accurately, and even if they do all of that correctly and with appropriate technologies, their data will still be so diverse as to be extremely difficult to interpret in almost all cases. So, tracking a range of common symptoms, from gastrointestinal symptoms to extra-intestinal symptoms such as fatigue and headache, and linking them back to very imprecise data will yield very imprecise results. The way people typically figure things out in “real life” is to systematically eliminate specific dietary constituents and then re-introduce them to see whether they cause a problem.
Described herein is a formalized version of that: a series of N=1 cross-over experiments, with each individual user serving as their own control. The approach described herein also completely removes the need to track any meals in an application; it removes the intractable noise in usual meals that undermines effective measurement models, and it provides a standard stimulus tied to a standard score. This also distinguishes it from existing applications in the art.

With the development of artificial intelligence technology, machine learning has gradually been applied to the prediction of protein functions, and good results have been obtained. Using a labeled protein database, researchers train networks to study the primary or secondary structure of proteins and discover the in-depth relationships between the sequence structure, physical and chemical properties of the proteins and their functions. The well-trained network can then be used to predict unknown proteins. In the early years of research on allergenicity prediction, researchers performed supervised learning on extracted peptide sequence features and selected support vector machines (SVMs) with linear kernel functions for classification, obtaining relatively accurate results. Classifiers such as K-nearest neighbour (K-NN) were also adopted to predict the sensitization of allergens. Besides, some researchers have extracted pseudo-amino acid composition (PseAAC) features of allergens and employed an SVM classifier to predict a protein's allergenicity. In recent years, deep learning models such as deep neural networks (DNNs) have also been used for the identification of allergens, and they behave better than traditional machine learning methods. Meanwhile, online servers developed using various machine learning algorithms have been successively reported, which greatly improves work efficiency and facilitates high-throughput prediction. The efficiency of prediction methods based on machine learning algorithms is much higher than that of in vivo and in vitro experiments. Furthermore, the accuracy of the predictions keeps improving with the refinement and optimization of the models, and DNNs are becoming the mainstream tool for food allergen prediction.

Bidirectional Encoder Representations from Transformers (BERT) is mainly used for natural language processing (NLP) and is currently rarely applied to peptide or protein function prediction. We found that it can extract high-dimensional features from peptide sequences, which makes it a novel prediction method. The Convolutional Recurrent Neural Network (CRNN) is used for end-to-end recognition of text sequences of indefinite length: instead of first cutting out individual characters, it converts text recognition into a sequence-dependent learning problem, and it has also been reported to play a role in protein function prediction. The novel ensemble learning paradigm is becoming one of the mainstream methods for improving machine learning performance; it has shown superior performance compared to traditional classifiers in text classification, disease diagnosis and other fields, but its application to peptide sequence classification has rarely been seen. The three methods above all provide novel ideas for improving the accuracy and performance of machine learning algorithms that predict the allergenicity of food allergens.
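As an illustration of the early approach described above, the following is a minimal sketch, assuming scikit-learn, of amino acid composition featurization paired with a linear-kernel SVM. The composition() featurizer is a simplified stand-in for PseAAC, and the sequences and labels are illustrative placeholders rather than real data.

```python
# Minimal sketch: amino acid composition features + a linear-kernel SVM,
# in the spirit of the early allergenicity predictors described above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    n = max(len(seq), 1)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

# Placeholder sequences and labels (1 = allergen, 0 = non-allergen).
train_seqs = ["MKTLLLTLVVVTIVCLDLGYT", "MAFSAEDVLKEYDRRRRMEA",
              "MKVLILACLVALALARE", "MSTNPKPQRKTKRNTNRRPQ"]
train_labels = [1, 0, 1, 0]

X = [composition(s) for s in train_seqs]
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, train_labels)
print(clf.predict([composition("MQVLILACLVALAIA")]))
```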
In this work, we originally introduced BERT, a novel pre-training model from the field of natural language processing, into the allergenicity prediction of food allergens. An independent attention mechanism is adopted in each layer, so compared to traditional Recurrent Neural Networks (RNNs), our network can capture longer-distance dependencies more efficiently. Additionally, in order to compare the characteristics of the deep learning model and the ensemble learning models in this task, two novel ensemble learning models, LightGBM and XGBoost, were employed with 5-fold cross-validation. The results showed that, for the dataset in this work, the introduced ensemble learning models (LightGBM, XGBoost) were better than the baseline classifiers but did not perform as well as deep learning on certain evaluation indicators such as accuracy. However, the convenience brought by their short training times makes them suitable for certain specific environments. The novel self-attention mechanism of BERT, with its superior performance, has great potential for larger-scale data training in the near future.
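The 5-fold cross-validation comparison can be set up as in the following sketch, assuming the lightgbm, xgboost and scikit-learn packages; the feature matrix X and labels y are random placeholders standing in for a prepared allergen dataset (e.g., featurized with the composition() function sketched earlier).

```python
# Sketch of stratified 5-fold cross-validation with AUC scoring for the
# two ensemble models. X and y are placeholders for real features/labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.random((200, 20))            # placeholder feature matrix
y = rng.integers(0, 2, 200)          # placeholder binary labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in [("LightGBM", LGBMClassifier()),
                    ("XGBoost", XGBClassifier(eval_metric="logloss"))]:
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.4f}")
```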
DEEP LEARNING MODELS
BERT is a self-supervised method for pre-training deep transformer encoders, which can be fine-tuned for different downstream tasks after pre-training. BERT is optimized for two training objectives, masked language modelling (MLM) and next sentence prediction (NSP), and only large unlabeled datasets are needed for its training. As a novel deep learning model, BERT is commonly used in the field of NLP and has rarely been applied to the study of food allergen prediction. The architecture of BERT is a multi-layer transformer structure. The transformer is an encoder-decoder structure formed by stacking several encoders and decoders. The encoder consists of Multi-Head Attention and a feed-forward neural network, and it converts the input protein sequence into a feature vector (Figure 2). The input of the decoder is the output of the encoder together with the previously predicted results; the decoder is composed of Masked Multi-Head Attention and a feed-forward neural network and outputs the conditional probability of the final result (Figure 2). A highlight of BERT is the use of Multi-Head Attention, which splits a word vector across N heads. Since the allergen sequence is mapped into a high-dimensional space in the form of multi-dimensional vectors, the Multi-Head Attention mechanism enables the model to learn different characteristics in each subspace, and the information learned in adjacent subspaces is similar, which is more reasonable than mapping the entire space together.
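To make the mechanism concrete, the following is a minimal numpy sketch of scaled dot-product self-attention, the core operation inside the Multi-Head Attention blocks described above; the dimensions and random weights are illustrative only.

```python
# Minimal numpy sketch of scaled dot-product self-attention: every
# position in the sequence attends to every other position at once.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) embeddings; Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise residue-residue affinities
    return softmax(scores) @ V               # weighted mixture of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 10, 16, 8            # toy dimensions
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (10, 8)
```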
In this work, we employed the BERT-Base, Uncased pre-trained model, which transfers a large number of operations otherwise deployed in specific downstream NLP tasks into the pre-trained word vectors. After obtaining the word vectors used by BERT, a multi-layer perceptron (MLP) was added on top of them. In this experiment, each amino acid character was separated by a space, and the amino acid sequence was cut so that the amino acid chain formed a string of a fixed length, which was used as the basic input structure. The model parameters were 12 layers, 768 hidden units, 12 heads and 110M parameters. In order to break through the bottleneck of low accuracy encountered by traditional allergen prediction methods, we designed a deep learning model with a novel self-attention transformer structure and improved tree ensemble models to predict the allergenicity of food allergens, which were superior to the machine learning methods employed in previous similar works. The work provides new ideas for future food allergen screening. As far as we know, this is the first reported work to introduce the BERT deep model and the LightGBM and XGBoost ensemble models into the allergen prediction task. In this section, we compare and analyze the characteristics of the proposed models and discuss their application scenarios, which should facilitate future model selection.

In the BERT deep learning model, the advantage of introducing self-attention is that it can connect two long-range dependent features in a sequence. A recurrent neural network (RNN) structure may need more time steps to accumulate and react to such dependencies, so the self-attention mechanism also improves the parallelism of the network. The input of this research is protein sequences of different lengths; self-attention can ignore the distance between amino acids and directly calculate their dependence relationships. This helps the model learn the internal structure of protein sequences well, making it better and more efficient than traditional natural language processing algorithms. Meanwhile, the BERT model employed in our work has been pre-trained, and a large number of operations done in downstream natural language processing tasks are transferred to the pre-trained word vectors. This not only improves the efficiency of allergen sequence recognition but also gives the model more powerful generalization ability. The architecture of BERT is based on multi-layer bidirectional transformer encoding, where "bidirectional" means that when the model processes a certain word (amino acid), it can use both the preceding and the following words (amino acids) at the same time, which is different from traditional RNNs. The above advantages all highlight the great potential of BERT to accurately predict food allergens. In this study, the AUC of the BERT model reached 0.94, better than all ensemble learning models (the best of which reached 0.9105) and previously reported machine learning models (the best of which reached 0.8529). The high AUC value shows its powerful predictive ability. In terms of recognition accuracy, BERT reached 0.8850, which was also clearly excellent, better than LightGBM (0.8686) and XGBoost (0.8186). This benefits from the unique advantages of the transformer architecture, which surpasses the boosting ensemble models in the task of food allergen prediction. However, it cannot be ignored that pre-training requires a large amount of data of various types, which leads to the high cost of transfer learning.
On the other hand, the BERT model has a large number of parameters and requires a long training time, which also places strict requirements on computing equipment.
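The preprocessing and fine-tuning setup described above can be sketched as follows, assuming the HuggingFace transformers library; spacing the residues makes each amino acid its own token for the bert-base-uncased vocabulary, and the classification head shown is untrained, so its outputs are meaningful only after fine-tuning on labeled allergen data.

```python
# Sketch: space-separated amino acids fed to a pre-trained BERT with a
# two-class head (allergen vs. non-allergen), assuming HF transformers.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

seq = "MKTLLLTLVVVTIVCLDLGYT"
spaced = " ".join(seq)                  # "M K T L ..." -> one token per residue
inputs = tokenizer(spaced, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))    # class probabilities (head untrained)
```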
The novel ensemble learning models also performed well in the task of food allergen identification. For example, LightGBM is a novel GBDT algorithm framework with many advantages. The first is Gradient-based One-Side Sampling (GOSS): instead of computing gradients over all sample points, the algorithm samples the instances on which gradients are computed. The second is Exclusive Feature Bundling (EFB), which bundles certain mutually exclusive features together to reduce the dimensionality of the feature space. In addition, using a leaf-wise growth strategy for iteration reduces errors as much as possible and yields better accuracy. Based on the above characteristics, LightGBM needs shorter training time and has a better learning effect than traditional machine learning algorithms for food allergen prediction. In this research, the average prediction accuracy of the model was 0.8686, the F1 score was 0.8681, and the AUC reached 0.9105, which showed that it can accurately predict the allergenicity of a test sequence after small-scale training. Additionally, as a novel ensemble learning model, XGBoost has been widely used in many fields. It should be emphasized that rigorous allergenicity prediction studies still need to be verified by in vitro wet experiments (such as ELISA), which will be further improved in the future.
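The LightGBM features discussed above map directly onto parameters of its Python API, as in the following sketch; the values are illustrative rather than the tuned settings of this work, and newer LightGBM versions also expose GOSS via data_sample_strategy="goss".

```python
# Sketch of how GOSS and leaf-wise growth are configured in LightGBM.
# Values are illustrative, not the tuned settings used in this work.
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    boosting_type="goss",   # GOSS: keep large-gradient samples, subsample the rest
    top_rate=0.2,           # fraction of large-gradient samples kept
    other_rate=0.1,         # fraction of small-gradient samples drawn
    num_leaves=31,          # leaf-wise growth: always split the highest-gain leaf
    n_estimators=200,
    learning_rate=0.05,
)
# EFB (feature bundling) is enabled by default via enable_bundle=True.
# model.fit(X_train, y_train); model.predict_proba(X_test)
```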
What is claimed is:
1. This also distinguishes it from existing applications in the art. With the development of artificial intelligence technology, machine learning has gradually been applied to the prediction of protein functions, and good results have been obtained.
2. Furthermore, the accuracy of the predictions keeps improving with the refinement and optimization of the models.
3. DNNs are becoming the mainstream tool for food allergen prediction.
4. Beyond the actual tracking of food, there is a tremendous amount of noise inherent in the data.
5. We employed the BERT-Base, Uncased pre-trained model, which transfers a large number of operations deployed in specific downstream NLP tasks to the pre-trained word vectors.
6. The input of this research is protein sequences of different lengths. Self-attention can ignore the distance between amino acids and directly calculate their dependence relationships.
7. It can help learn the internal structure of protein sequences well, which is better and more efficient than traditional natural language processing algorithms. Meanwhile, the BERT model employed in our work has been pre-trained, and a large number of operations done in the downstream tasks of natural language processing are transferred to the pre-trained word vectors.