The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Computationally annotating proteins with molecular functions is a difficult problem, made even harder by the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate the context of amino acids, thereby modeling underlying principles of protein sequences that are insensitive to the species of origin. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to a detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species of varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.


Relationship between term-centric performance, InterPro annotation similarity and term frequency
Differences in class imbalance might be an important confounding factor in the relationship between term-centric performance and the percentage of proteins annotated with a term that share a domain/family/superfamily annotation (the structural similarity of the proteins that perform a particular function). We therefore set out to establish whether the percentage of proteins that share an InterPro annotation remains significantly related to performance when the number of positive examples for each term is taken into account. First, the correlation between the number of positive examples in the training set and in the test set was very strong (ρ = 0.89, p-value < 1e-16), so we only used the training frequency in the following. We used linear regression to model the term-centric ROCAUC (y) as a function of the term's absolute frequency in the training set (f) and the structural similarity (s). We log-transformed f and then centered and scaled the three variables of interest. We fitted five different models that capture different possible relationships between the variables: (1) an intercept-only baseline, (2) similarity only, (3) frequency only, (4) frequency and similarity, and (5) frequency, similarity and their interaction. Our aim was to determine whether structural similarity gives any additional information about term-centric performance if the term frequency is already known. If it does not, we expect the third model to perform best, while if it does, the fourth model, the fifth model, or both will outperform the third. For all models we assumed ε ∼ N(0, σ²) (standard linear regression) and used the standard normal distribution as prior for b_0, b_1, b_2 and b_12, and the exponential distribution with λ = 1 as prior for σ. We compared the models using the leave-one-out log-likelihood estimated with Pareto smoothing, which also takes into account differences in overfitting risk due to differences in the number of parameters [4].
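The model-comparison idea can be sketched with a plain least-squares stand-in. The study fits the five models in a Bayesian framework and ranks them by Pareto-smoothed leave-one-out log-likelihood; the sketch below instead uses exact leave-one-out residuals from the OLS hat matrix, which approximates the same ranking logic. The data are simulated placeholders (not the study's values), and the five design matrices follow the intercept-only / s-only / f-only / additive / interaction structure implied by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated placeholder data: per-term log frequency f, structural
# similarity s, and term-centric ROCAUC y (all standardized, as in the text).
n = 200
f = rng.normal(size=n)
s = rng.normal(size=n)
y = 0.4 * s + rng.normal(scale=0.5, size=n)  # signal only in s, by construction

def loo_loglik(X, y):
    """Exact leave-one-out Gaussian log-likelihood for OLS via the hat matrix."""
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    resid = y - H @ y
    loo_resid = resid / (1.0 - np.diag(H))    # exact leave-one-out residuals
    sigma = resid.std(ddof=X.shape[1])        # plug-in noise scale
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - 0.5 * (loo_resid / sigma) ** 2))

one = np.ones(n)
designs = {
    "1: intercept":   np.column_stack([one]),
    "2: s":           np.column_stack([one, s]),
    "3: f":           np.column_stack([one, f]),
    "4: f + s":       np.column_stack([one, f, s]),
    "5: f + s + f*s": np.column_stack([one, f, s, f * s]),
}
loo = {name: loo_loglik(X, y) for name, X in designs.items()}
```

With signal only in s, the s-containing models dominate the frequency-only model, mirroring the domain-similarity result in the text.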
Results are shown in Table S. We found that the model with only domain similarity (model 2) is ranked highest, suggesting that term frequency does not confound this association. The posterior mean of b_2 in model 2 is +0.43 ± 0.05, meaning that an increase of one standard deviation in domain similarity leads to an expected increase of about 0.43 standard deviations in ROCAUC. The relationship for family similarity is more complicated, but all three models that include s outperform the model that only uses term frequency. In addition, the coefficient of s is reliably larger than 0 in all these models. These results show that a higher percentage of proteins with a shared family within a GO term is indeed associated with increased term-centric performance, even after correcting for term frequency. For the superfamily case, the two models that include both GO term properties describe the variation in ROCAUC best, with the interaction model being slightly better. In both cases, the coefficient of s was reliably positive (0.15 ± 0.07 for model 5 and 0.18 ± 0.06 for model 4), showing that superfamily similarity is likewise an independent predictor of performance.
If we group the terms by GO category and repeat the analysis, we obtain the following. For domains, as in the ungrouped case, domain similarity without term frequency is the best predictor of performance and has a reliably positive coefficient, with posterior mean and variance equal to 0.62 and 0.18, respectively. For families, term frequency is now an important confounder, as the model that contains both variables and an interaction term is the most predictive of performance. In this model, the positive association of s is confirmed by its reliably positive coefficient, with posterior mean 0.47, posterior variance 0.17 and p(b_1 > 0) = 0.998. For the superfamilies, we had found no significant association with performance on the aggregate dataset, and this is confirmed by the inability of all models to significantly outperform the baseline model with only an intercept.
We obtained very similar model rankings and weights when replacing our leave-one-out log likelihood with the Widely Applicable Information Criterion [4]. Finally, to ensure that these findings do not rely on the choice of prior distributions, we repeated the experiments by changing the prior of the weights to N (0, 10) and/or the prior of σ to an exponential with λ = 0.2 and obtained nearly identical posterior distributions (data omitted for brevity).
Together, these results show that the associations we find between similarity of InterPro features and term-centric performance still hold even when correcting for the class imbalance across terms.

Baseline selection
The head-to-head comparison (Table S.6) shows that PSI-BLAST performs better than BLAST for all species except rat (the species closest to our training species). As far as coverage is concerned, PSI-BLAST's ability to find a hit for many more proteins in distant species (i.e. its higher coverage) makes it the more relevant baseline. We therefore do not include the BLAST results, since they do not provide a stronger baseline here. Next, we compared two PSI-BLAST-based options, one based on the top hit (i.e. the maximum sequence identity between the target protein and all PSI-BLAST hits) and one based on frequency (i.e. the fraction of hits that have a term). A head-to-head comparison showed that frequency-based PSI-BLAST outperformed top-hit-based PSI-BLAST in terms of protein-centric F1 for most species, while having nearly 100% coverage for all species. The performance advantage mostly concerned the more distant species. In terms of term-centric ROCAUC, the two methods performed similarly. Given that the frequency-based PSI-BLAST baseline has better coverage and a better ability to predict for distant species, both of key interest for cross-species predictions, we selected it as our baseline method.
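The frequency-based transfer can be sketched in a few lines: for a query protein, each GO term is scored by the fraction of the query's PSI-BLAST hits annotated with it. The hit identifiers and annotation map below are hypothetical stand-ins for real PSI-BLAST output and GOA records.

```python
from collections import Counter

def frequency_transfer(hit_ids, annotations):
    """Score each GO term by the fraction of PSI-BLAST hits annotated with it.

    hit_ids:     list of database proteins hit by the query
    annotations: dict mapping protein id -> set of GO terms
    Returns a dict mapping GO term -> score in [0, 1].
    """
    counts = Counter(term for h in hit_ids for term in annotations.get(h, ()))
    return {term: c / len(hit_ids) for term, c in counts.items()}

# Toy usage: two of three hits carry GO:0003824, one carries GO:0005515
annotations = {"hitA": {"GO:0003824"},
               "hitB": {"GO:0003824", "GO:0005515"},
               "hitC": set()}
scores = frequency_transfer(["hitA", "hitB", "hitC"], annotations)
```

The scores are already in [0, 1], so they can be thresholded exactly like the predicted probabilities of the embedding-based model.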

Including phylogeny-based annotations in the cross-species datasets does not alter results
A recent paper by Wei et al. [5] demonstrated that phylogeny-based annotations (i.e. those with evidence codes 'IBA', 'IBD', 'IKR', 'IRD') have poor accuracy, even lower than that of fully automated annotations with the IEA evidence code. Including them as ground truth in our cross-species training set could therefore have a large impact. Indeed, IBA is a very common evidence code in our dataset, although the other three codes combined comprise less than 0.01% of each species' annotations. We repeated our cross-species experiment excluding all annotations with evidence codes IBA, IBD, IKR, and IRD, following the same training and evaluation procedure as before. The results, shown in Figure S.6, indicate that excluding these evidence codes has only a minor effect on the final results. The overall pattern of performance across species remains the same (for protein-centric F1, Pearson's ρ = 0.990, p < 1e-4; for term-centric ROCAUC, ρ = 0.997, p < 1e-6; similar values for Spearman's ρ), and the absolute performance values change by no more than 4.1% for the F1 score and 1.1% for the term-centric ROCAUC. At the same time, the confidence intervals remain relatively similar despite the reduction in the number of test proteins.
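The exclusion step itself amounts to a filter over annotation records; a minimal sketch, assuming a hypothetical (protein, GO term, evidence code) tuple layout:

```python
# Phylogeny-based GO evidence codes excluded in this sensitivity analysis
PHYLO_CODES = {"IBA", "IBD", "IKR", "IRD"}

def drop_phylogeny_annotations(records):
    """Keep only annotations whose evidence code is not phylogeny-based.

    records: iterable of (protein_id, go_term, evidence_code) tuples.
    """
    return [r for r in records if r[2] not in PHYLO_CODES]

# Hypothetical records: one IBA annotation is removed, the rest are kept
records = [("P1", "GO:0003824", "EXP"),
           ("P1", "GO:0005515", "IBA"),
           ("P2", "GO:0003824", "IEA")]
kept = drop_phylogeny_annotations(records)
```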
A possible reason for the small differences is that the large number of IBA misannotations reported by Wei et al. refers to UniProt releases from 2018 and 2019, whereas we used a 2020 release. As Wei et al. mention, their method is used to continuously search for misannotations, so a significant portion of them may have been fixed in the meantime.

Cross-species protein-centric performance positively correlates with protein sequence identity
Using the BLAST top hit of every protein to compute its sequence identity to the training set (the ratio of the number of exact matches to the total alignment length), we observed a very strong positive correlation between protein-centric SeqVec-based molecular function prediction performance and the average sequence identity per species (Spearman correlation: 0.96, p-value: 4.5e-05) (Figure 5D). From the distributions of sequence identity, we observed that with increasing divergence time the distributions shift from high to low sequence identity (Figure S.7). The previously observed deviant performance of Yeast could partially be explained by this correlation, as Yeast had the lowest average sequence identity (although very similar to that of A. thaliana). The performance of A. thaliana in relation to its average sequence identity was more in line with the other test species, indicating that Yeast is the species breaking the trend. Note that proteins in the 'twilight zone' have below 30% sequence identity, which results in limited structural similarity and decreases the likelihood of conserved protein function [1,2].
As sequence identity and divergence time between species are related, we repeated the MLP experiments evaluating only proteins from the twilight zone. While average sequence identity was very similar across all species after applying this constraint, the spread in performance remained large (Figure S.5D). Additionally, the species with the highest sequence identity no longer had the best performance (Figure S.5E), and we no longer observed a statistically significant correlation between performance and sequence identity (Spearman correlation: -0.21, p-value: 0.64). These findings illustrate that for proteins in the twilight zone, sequence identity is no longer indicative of performance. As Yeast had the highest fraction of proteins in the twilight zone, this possibly explains its low performance (Figure S.7).
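The per-species correlation check can be reproduced with a rank-based correlation; a self-contained sketch, with hypothetical per-species values and no tie handling for simplicity:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks (ties ignored)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Hypothetical per-species mean sequence identity vs. protein-centric F1:
# mostly monotone, so the rank correlation is strongly positive.
identity = [0.62, 0.55, 0.48, 0.41, 0.33, 0.31]
f1       = [0.58, 0.55, 0.50, 0.47, 0.40, 0.42]
rho = spearman_rho(identity, f1)
```

Restricting the same computation to proteins with identity below 0.30 would mimic the twilight-zone analysis described above.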

Protein-centric metrics
For protein-centric evaluation, we calculated the F1-scores of all test proteins. For a decision threshold t ∈ [0, 1], which converts the predicted probabilities into binary class labels, let T_o denote the set of true annotations of protein o and P_o(t) the set of GO terms predicted for protein o at threshold t. The precision pr_o(t) and recall rc_o(t) of protein o are then

pr_o(t) = |T_o ∩ P_o(t)| / |P_o(t)|,  rc_o(t) = |T_o ∩ P_o(t)| / |T_o|,

from which the F1-score F1_o(t) of every protein follows.
To calculate the average F1-score over all proteins, we first calculated the average precision and recall over all proteins for a fixed threshold t using:

pr(t) = (1 / U(t)) · Σ_{o : |P_o(t)| > 0} pr_o(t)  (1)

rc(t) = (1 / V) · Σ_{o=1}^{V} rc_o(t)  (2)

Note that we calculated the precision only over the U(t) proteins for which at least one prediction was made above threshold t, and the recall over all V proteins. Consequently, we obtained the prediction coverage using:

C(t) = U(t) / V  (3)

which represents the fraction of proteins with at least one predicted annotation at threshold t, regardless of the prediction being false or true. Finally, we calculated the average F1-score over all proteins using:

F1(t) = 2 · pr(t) · rc(t) / (pr(t) + rc(t))  (4)

and the maximum F1-score over all thresholds using:

F_max = max_t F1(t)  (5)

To recreate a real-case scenario in the cross-species experiments, in which no validation set is available for the test species, we determined the threshold t resulting in the highest F_max score on the mouse validation set. Using this fixed threshold, we calculated the F1-score in the other species. By applying that threshold, we also calculated the semantic distance (SD), defined based on the remaining uncertainty (ru) and misinformation (mi) as shown in equations 6-8, where τ indexes the GO terms:

ru(t) = (1 / V) · Σ_o Σ_{τ ∈ T_o \ P_o(t)} IC(τ)  (6)

mi(t) = (1 / V) · Σ_o Σ_{τ ∈ P_o(t) \ T_o} IC(τ)  (7)

SD(t) = sqrt(ru(t)² + mi(t)²)  (8)

The term information content (IC) was calculated by pooling the entire cross-species dataset (training, validation and test sets).
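The protein-centric pipeline above (per-protein precision and recall, averaging precision over covered proteins only, and maximizing F1 over thresholds) can be sketched as follows; the prediction and annotation dictionaries are hypothetical stand-ins:

```python
def fmax_score(pred_scores, true_terms, thresholds):
    """CAFA-style protein-centric Fmax.

    pred_scores: dict protein -> dict GO term -> predicted probability
    true_terms:  dict protein -> non-empty set of true GO terms
    Precision is averaged over the U(t) proteins with at least one
    prediction above t; recall is averaged over all V proteins.
    Returns (Fmax, threshold achieving it).
    """
    V = len(true_terms)
    best_f1, best_t = 0.0, None
    for t in thresholds:
        precisions, recall_sum = [], 0.0
        for o, T_o in true_terms.items():
            P_o = {g for g, p in pred_scores.get(o, {}).items() if p >= t}
            recall_sum += len(P_o & T_o) / len(T_o)
            if P_o:                                   # protein counts toward U(t)
                precisions.append(len(P_o & T_o) / len(P_o))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)        # average over U(t)
        rc = recall_sum / V                           # average over all V
        if pr + rc > 0:
            f1 = 2 * pr * rc / (pr + rc)
            if f1 > best_f1:
                best_f1, best_t = f1, t
    return best_f1, best_t

# Toy example with hypothetical proteins and GO terms
pred  = {"p1": {"GO:a": 0.9, "GO:b": 0.4}, "p2": {"GO:a": 0.8}}
truth = {"p1": {"GO:a"}, "p2": {"GO:a", "GO:b"}}
fmax, best_t = fmax_score(pred, truth, thresholds=[0.3, 0.5, 0.7])
```

In the cross-species setting described above, the threshold would be fixed on the mouse validation set rather than re-optimized per test species.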

Term-centric metrics
For term-centric evaluation, we calculated the area under the ROC curve (ROCAUC score) of all GO terms. The average ROCAUC score was obtained by averaging over all the evaluated GO terms.
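The per-term ROCAUC can be computed directly as the probability that a randomly chosen positive protein scores above a randomly chosen negative one (the normalized Mann-Whitney U statistic); a minimal sketch with hypothetical scores, counting ties as one half:

```python
def rocauc(scores, labels):
    """ROCAUC as P(score_pos > score_neg), with ties counted as 1/2."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy term: both positives ranked above both negatives -> perfect separation
auc = rocauc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Averaging this quantity over all evaluated GO terms gives the reported term-centric score.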