Transmembrane proteins (TMPs) reside in different membranes and provide gates between the inner and outer sides of cells or organelles. From a structural point of view, regions embedded in the membrane are highly ordered, whereas tails and connecting loops may contain flexible segments that can serve as linkers or binding sites, or exhibit short linear motifs.
MemDis is the first disorder prediction method that is specific to membrane proteins. We considered the special characteristics of transmembrane proteins both when constructing the dataset and when training the method.
MemDis was trained on X-ray crystallography data: we used the MobiDB database as the source and selected only those proteins where 90% of the structures were in agreement. Electron densities, and therefore coordinates, of residues and segments that cannot adopt a stable structure are missing from the final structure, which gives a complementary indication of protein disorder. We selected these regions along with ordered segments and trained convolutional neural networks (CNNs) and bidirectional long short-term memory (LSTM) networks to predict disorder propensity (Figure 1).
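The idea of reading disorder from missing coordinates can be illustrated with a minimal sketch. It assumes one particular reading of the 90% agreement criterion (a residue gets a label only when at least 90% of the available structures agree on whether it is resolved). The exact rule used for MemDis is not given here, and the function name and input layout are hypothetical.

```python
from typing import List, Optional

AGREEMENT = 0.9  # assumed interpretation of the 90% agreement criterion


def consensus_labels(observed: List[List[bool]],
                     agreement: float = AGREEMENT) -> List[Optional[int]]:
    """Per-residue consensus over several structures of the same protein.

    observed[s][i] is True when residue i has coordinates in structure s.
    Returns 1 (disordered: coordinates missing), 0 (ordered: resolved)
    or None (the structures do not agree strongly enough).
    """
    n_structures = len(observed)
    labels: List[Optional[int]] = []
    for column in zip(*observed):  # iterate over residues
        missing = column.count(False) / n_structures
        if missing >= agreement:
            labels.append(1)
        elif 1 - missing >= agreement:
            labels.append(0)
        else:
            labels.append(None)
    return labels


if __name__ == "__main__":
    # Ten hypothetical structures of a three-residue fragment.
    structures = ([[True, False, True]] * 5
                  + [[True, False, False]] * 4
                  + [[True, True, False]])
    print(consensus_labels(structures))
```

Residues labelled `None` would simply be left out of training under this assumption.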
Figure 1: Data preparation for the training of MemDis. First, we selected protein fragments based on the available PDB information. Extracellular distant (distance from membrane >15 AA), extracellular proximal (<15 AA), intracellular distant and intracellular proximal residues from these fragments were fed into the appropriate CNN, also considering information from residues within 5 AA of the residue of interest. The LSTM was trained on the full-length protein fragments, considering the preceding 10 AA.
MemDis separates different subcellular localizations based on protein topology: tail and loop segments of membrane proteins are further divided into intracellular and extracellular, membrane-proximal and membrane-distant regions. We trained a separate CNN for each localization, as the segments are exposed to different environments, which can affect folding. In addition, MemDis considers membrane protein specific features, such as topology and the length of membrane segments (which carries information about the tilt of the segments), among others.
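The division into four region classes can be sketched as follows, using the 15-residue cutoff and the 5-residue flanking window from the Figure 1 caption. This is an illustration only: the topology alphabet (`I`, `M`, `O`), the treatment of a distance of exactly 15 as distant, the `X` padding character and all function names are assumptions, not details of the MemDis implementation.

```python
from typing import List, Optional

PROXIMAL_CUTOFF = 15  # residues from the membrane (Figure 1 caption)
WINDOW = 5            # flanking residues on each side (Figure 1 caption)


def distance_to_membrane(topology: str) -> List[int]:
    """Sequence distance from each position to the nearest 'M' residue."""
    n = len(topology)
    dist = [n + 1] * n  # placeholder larger than any real distance
    last = None
    for i, t in enumerate(topology):  # nearest membrane residue to the left
        if t == "M":
            last = i
        if last is not None:
            dist[i] = i - last
    last = None
    for i in range(n - 1, -1, -1):  # nearest membrane residue to the right
        if topology[i] == "M":
            last = i
        if last is not None:
            dist[i] = min(dist[i], last - i)
    return dist


def assign_regions(topology: str) -> List[Optional[str]]:
    """Label each non-membrane residue with one of four region classes."""
    side = {"I": "intracellular", "O": "extracellular"}
    labels: List[Optional[str]] = []
    for t, d in zip(topology, distance_to_membrane(topology)):
        if t == "M":
            labels.append(None)  # membrane residues are not assigned
        else:
            # Assumption: a distance of exactly 15 counts as distant.
            reach = "proximal" if d < PROXIMAL_CUTOFF else "distant"
            labels.append(f"{side[t]}_{reach}")
    return labels


def window(sequence: str, i: int, flank: int = WINDOW, pad: str = "X") -> str:
    """Residue i with `flank` neighbours on each side, padded at the ends."""
    padded = pad * flank + sequence + pad * flank
    return padded[i:i + 2 * flank + 1]


if __name__ == "__main__":
    topo = "I" * 20 + "M" * 20 + "O" * 3
    print(assign_regions(topo)[0], assign_regions(topo)[19])
    print(window("ACDEFGHIKLMN", 6))
```

Each labelled residue, together with its window, would then be routed to the network trained for that region class.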
We evaluated MemDis on an independent test set that shares no similar sequences with the training and validation sets. We measured the accuracy of popular disorder prediction methods and compared them to the sensitive and specific settings of MemDis. MemDis outperforms currently available state-of-the-art methods on this membrane protein specific dataset (Table 1). Although some methods offer similar or slightly higher specificity, they barely predict any disordered residues. In contrast, MemDis produces the highest MCC, AUC and balanced accuracy values.
Method | True Positive | False Positive | False Negative | True Negative | Balanced Accuracy | Sensitivity | Specificity | Matthews Correlation Coefficient | Positive Predictive Value | F1 Score | Segment Overlap | Area Under Curve |
Disembl rem465 | 1460 | 313 | 3199 | 5124 | 0.63 | 0.31 | 0.94 | 0.34 | 0.82 | 0.45 | 0.72 | 0.77 |
IUPred long | 1441 | 264 | 3218 | 5173 | 0.63 | 0.31 | 0.95 | 0.35 | 0.85 | 0.45 | 0.63 | 0.78 |
Disembl hot loops | 2350 | 1086 | 2309 | 4351 | 0.65 | 0.50 | 0.80 | 0.32 | 0.68 | 0.58 | 0.67 | 0.74 |
IUPred short | 1650 | 340 | 3009 | 5097 | 0.65 | 0.35 | 0.94 | 0.37 | 0.83 | 0.50 | 0.67 | 0.79 |
Espritz DisProt | 391 | 228 | 4268 | 5209 | 0.52 | 0.08 | 0.96 | 0.09 | 0.63 | 0.15 | 0.57 | 0.67 |
Espritz NMR | 2217 | 762 | 2442 | 4675 | 0.67 | 0.48 | 0.86 | 0.37 | 0.74 | 0.58 | 0.72 | 0.75 |
Espritz X-ray | 1758 | 351 | 2901 | 5086 | 0.66 | 0.38 | 0.94 | 0.38 | 0.83 | 0.52 | 0.73 | 0.76 |
GlobPlot | 1476 | 611 | 3183 | 4826 | 0.60 | 0.32 | 0.89 | 0.25 | 0.71 | 0.44 | 0.60 | 0.45 |
MemDis specific | 2220 | 291 | 2439 | 5146 | 0.71 | 0.48 | 0.95 | 0.49 | 0.88 | 0.62 | 0.76 | 0.84 |
MemDis sensitive | 3258 | 1030 | 1401 | 4407 | 0.75 | 0.70 | 0.81 | 0.51 | 0.76 | 0.73 | 0.78 | 0.83 |
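The threshold-dependent columns of the table follow directly from the four confusion-matrix counts. The short sketch below uses the standard definitions of these metrics; the function name is illustrative. Segment overlap and AUC cannot be derived from the counts alone and are not covered.

```python
from math import sqrt
from typing import Dict


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "ppv": ppv,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Counts taken from the "MemDis specific" row.
    for name, value in confusion_metrics(2220, 291, 2439, 5146).items():
        print(f"{name}: {value:.2f}")
```

Applied to the counts in the "MemDis specific" row, the rounded values match the corresponding table entries.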
We also checked a handful of well-defined examples where the output of MemDis is supported by evidence from the literature (Figure 2).
Figure 2: Case studies using the specific setting on the MemDis server. A) Syntaxin-1A is a nervous system protein that plays a role in the fusion of synaptic vesicles with the plasma membrane via formation of the SNARE complex. Munc18a controls SNARE assembly through its interaction with the syntaxin N-peptide, which is disordered. The protein also contains a linker region between the Habc and SNARE domains (18337752). B) Integrin alpha-IIb is a receptor protein with a disordered cytosolic tail exhibiting short linear motifs proposed to play a role in SARS-CoV-2 infection (33436497, 33436498). C) Stannin is a small bitopic transmembrane protein in which a flexible linker connects the CXC metal-binding motif and the 14-3-3-zeta binding domain (16246365). D) GPCRs are a large family of receptor proteins with 7 transmembrane helices. The N- and C-terminal regions and the third intracellular loop (ICL3) are considered to be disordered. Their C-terminal and ICL3 segments mediate interactions with signaling partners. The role of the N-terminal sites is not fully understood; however, they exhibit many PTM sites, suggesting that modifications might occur during sorting (25198166).