MemDis

Description of the server

Transmembrane proteins (TMPs) are located in different membranes and provide gates between the inner and outer sides of cells or organelles. From a structural point of view, regions embedded in the membrane are highly ordered; however, tails and connecting loops may contain flexible segments that can serve as linkers or binding sites, or that carry short linear motifs.

MemDis is the first disorder prediction method that is specific to membrane proteins. Both when constructing the dataset and when training our method, we considered the special characteristics of transmembrane proteins.

MemDis was trained on X-ray crystallography data: we used the MobiDB database as the source and selected only those proteins where 90% of the structures were in agreement. Electron densities, and therefore coordinates, of residues and segments that cannot adopt a stable structure are missing from the final structure, which gives a complementary indication of protein disorder. We selected these regions along with ordered segments and trained convolutional neural networks (CNNs) and bidirectional long short-term memory (LSTM) networks to predict disorder propensity (Figure 1).
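The idea of reading disorder from missing coordinates can be illustrated with a small sketch. It is not the MemDis pipeline: the input encoding ('-' for a residue without coordinates), the function name consensus_labels and the per-residue reading of the 90% agreement criterion are all assumptions made for illustration.

```python
def consensus_labels(observations, min_agreement=0.9):
    """Derive per-residue labels from several structures of one protein.

    observations: equal-length strings, one per structure, aligned to the
    same sequence. '-' marks a residue with no coordinates (hypothetical
    encoding); any other character marks an observed residue.
    Returns a string of 'D' (disordered), 'O' (ordered) or '?' (structures
    disagree too much to assign a label).
    """
    if not observations:
        raise ValueError("at least one structure is required")
    length = len(observations[0])
    if any(len(obs) != length for obs in observations):
        raise ValueError("all observation strings must have the same length")

    n = len(observations)
    labels = []
    for column in zip(*observations):
        missing = sum(1 for c in column if c == "-")
        observed = n - missing
        if missing >= min_agreement * n:
            labels.append("D")   # missing in (almost) every structure
        elif observed >= min_agreement * n:
            labels.append("O")   # resolved in (almost) every structure
        else:
            labels.append("?")   # conflicting evidence, left unlabeled
    return "".join(labels)


# Three hypothetical structures of the same six-residue fragment.
labels = consensus_labels(["--AAAA", "--AAA-", "-AAAAA"])
```

In this toy input only the first residue is missing in every structure, so it alone is labeled disordered; the positions where the structures disagree are left unlabeled rather than forced into a class.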

Figure 1: Data preparation for the training of MemDis. First, we selected protein fragments based on the available PDB information. Extracellular distant (distance from the membrane >15 AA), extracellular proximal (<15 AA), intracellular distant and intracellular proximal residues from these fragments were fed into the appropriate CNN, also considering information from residues within 5 AA of the residue of interest. The LSTM was trained on the full-length protein fragments, considering the preceding 10 AA.

MemDis separates different subcellular localizations based on protein topology: tail and loop segments of membrane proteins are further divided into intracellular and extracellular, membrane-proximal and membrane-distant regions. We trained a separate CNN for each localization, because the segments are exposed to different environments that can affect folding. In addition, MemDis considers features specific to membrane proteins, such as topology, the length of membrane segments (which carries information about the tilt of the segments) and more.
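The partition into localizations can be sketched as follows, using the 15-residue cutoff given in the Figure 1 caption. This is a minimal illustration under stated assumptions, not the server's code: the topology encoding ('I' intracellular, 'O' extracellular, 'M' membrane), the function name label_regions, measuring distance as the number of residues along the sequence, and treating a distance of exactly 15 as proximal are all choices made here.

```python
def label_regions(topology, proximal_cutoff=15):
    """Assign each residue to a localization class.

    topology: string of 'I', 'O' and 'M' characters (hypothetical encoding).
    Distance is the sequence distance to the nearest membrane residue;
    residues at distance <= proximal_cutoff are called proximal (assumption,
    the boundary case is not defined in the text).
    """
    n = len(topology)
    if "M" not in topology:
        raise ValueError("topology contains no membrane segment")

    # Two-pass sweep: distance to the nearest 'M' on the left, then the right.
    far = n + 1
    dist = [0 if s == "M" else far for s in topology]
    for i in range(1, n):
        dist[i] = min(dist[i], dist[i - 1] + 1)
    for i in range(n - 2, -1, -1):
        dist[i] = min(dist[i], dist[i + 1] + 1)

    labels = []
    for state, d in zip(topology, dist):
        if state == "M":
            labels.append("membrane")
            continue
        side = "intracellular" if state == "I" else "extracellular"
        zone = "proximal" if d <= proximal_cutoff else "distant"
        labels.append(f"{side}_{zone}")
    return labels


# Toy bitopic protein: 20-residue inner tail, 20-residue helix, 5-residue outer tail.
regions = label_regions("I" * 20 + "M" * 20 + "O" * 5)
```

Each of the four non-membrane classes would then be routed to its own network, which is the reason for computing the labels per residue rather than per segment.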

We evaluated MemDis on an independent test set that does not share similar sequences with the training and validation sets. We measured the accuracy of popular disorder prediction methods and compared them with the sensitive and specific settings of MemDis. MemDis outperforms the currently available state-of-the-art methods on this membrane-protein-specific dataset (Table 1). Although some methods offer similar or slightly higher specificity, they do so at the cost of low sensitivity (0.08 to 0.38 in Table 1). In contrast, MemDis produces the highest MCC, AUC and balanced accuracy values.

Method	True Positive	False Positive	False Negative	True Negative	Balanced Accuracy	Sensitivity	Specificity	Matthews Correlation Coefficient	Positive Predictive Value	F1 Score	Segment Overlap	Area Under Curve
Disembl rem465	1460	313	3199	5124	0.63	0.31	0.94	0.34	0.82	0.45	0.72	0.77
IUPred long	1441	264	3218	5173	0.63	0.31	0.95	0.35	0.85	0.45	0.63	0.78
Disembl hot loops	2350	1086	2309	4351	0.65	0.50	0.80	0.32	0.68	0.58	0.67	0.74
IUPred short	1650	340	3009	5097	0.65	0.35	0.94	0.37	0.83	0.50	0.67	0.79
Espritz DisProt	391	228	4268	5209	0.52	0.08	0.96	0.09	0.63	0.15	0.57	0.67
Espritz NMR	2217	762	2442	4675	0.67	0.48	0.86	0.37	0.74	0.58	0.72	0.75
Espritz X-ray	1758	351	2901	5086	0.66	0.38	0.94	0.38	0.83	0.52	0.73	0.76
GlobPlot	1476	611	3183	4826	0.60	0.32	0.89	0.25	0.71	0.44	0.60	0.45
MemDis specific	2220	291	2439	5146	0.71	0.48	0.95	0.49	0.88	0.62	0.76	0.84
MemDis sensitive	3258	1030	1401	4407	0.75	0.70	0.81	0.51	0.76	0.73	0.78	0.83
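The count-based columns of Table 1 follow directly from the four confusion-matrix entries. The sketch below recomputes them with the standard definitions; the function name confusion_metrics is illustrative and not part of MemDis. Segment overlap and AUC cannot be derived from the counts alone, since they need the predicted segments and the per-residue scores.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on disordered residues
    specificity = tn / (tn + fp)            # recall on ordered residues
    ppv = tp / (tp + fp)                    # positive predictive value
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "ppv": ppv,
        "f1": f1,
    }


# Counts from the "MemDis sensitive" row of Table 1.
m = confusion_metrics(tp=3258, fp=1030, fn=1401, tn=4407)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the values for this row agree with those reported in the table.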

We also examined a handful of well-defined examples where the output of MemDis is supported by evidence from the literature (Figure 2).

Figure 2: Case studies using the specific setting of the MemDis server. A) Syntaxin-1A is a nervous system protein that plays a role in the fusion of synaptic vesicles with the plasma membrane via formation of the SNARE complex. Munc18a controls SNARE assembly through its interaction with the syntaxin N-peptide, which is disordered. The protein also contains a linker region between the Habc and SNARE domains (18337752). B) Integrin alpha-IIb is a receptor protein with a disordered cytosolic tail that carries short linear motifs proposed to play a role in SARS-CoV-2 infection (33436497, 33436498). C) Stannin is a small bitopic transmembrane protein in which a flexible linker connects the CXC metal-binding motif and the 14-3-3-zeta binding domain (16246365). D) GPCRs are a large family of receptor proteins with 7 transmembrane helices. The N- and C-terminal regions and the third intracellular loop (ICL3) are considered to be disordered. The C-terminal and ICL3 segments mediate interactions with signaling partners. The role of the N-terminal region is not fully understood; however, it contains many PTM sites, suggesting that modifications might occur during sorting (25198166).