Communication

HyperCys: A Structure- and Sequence-Based Predictor of Hyper-Reactive Druggable Cysteines

Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Straße 9, 79104 Freiburg, Germany
*
Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2023, 24(6), 5960; https://doi.org/10.3390/ijms24065960
Received: 13 February 2023 / Revised: 15 March 2023 / Accepted: 17 March 2023 / Published: 22 March 2023
(This article belongs to the Special Issue Early-Stage Drug Discovery: Advances and Challenges)

Abstract

The cysteine side chain has a free thiol group, making it the amino acid residue most often covalently modified by small molecules possessing weakly electrophilic warheads, thereby prolonging on-target residence time and reducing the risk of idiosyncratic drug toxicity. However, not all cysteines are equally reactive or accessible. Hence, to identify targetable cysteines, we propose a novel ensemble stacked machine learning (ML) model to predict hyper-reactive druggable cysteines, named HyperCys. First, the pocket, conservation, structural and energy profiles, and physicochemical properties of (non)covalently bound cysteines were collected from both protein sequences and 3D structures of protein–ligand complexes. Then, we established the HyperCys ensemble stacked model by integrating six different ML models: k-nearest neighbors, support vector machine, light gradient boost machine, multi-layer perceptron classifier, random forest, and logistic regression as the meta-classifier. Finally, the results for different feature group combinations were compared based on the classification accuracy for hyper-reactive cysteines and other metrics. The results show that the accuracy, F1 score, recall score, and ROC AUC values of HyperCys are 0.784, 0.754, 0.742, and 0.824, respectively, after performing 10-fold CV with the best window size. Compared to traditional ML models with only sequence-based features or only 3D structural features, HyperCys is more accurate at predicting hyper-reactive druggable cysteines. It is anticipated that HyperCys will be an effective tool for discovering new potential reactive cysteines in a wide range of nucleophilic proteins and will provide an important contribution to the design of targeted covalent inhibitors with high potency and selectivity.

1. Introduction

The design of targeted covalent inhibitors (TCIs) has received increasing attention in the last decade. Because covalent inhibition offers greater selectivity and stronger inhibition together with lower dosage and reduced drug resistance, TCIs are increasingly becoming a focal point in the field of kinase-targeted cancer therapies. This approach has since been expanded to other therapeutic areas, including autoimmunity, neurology, cardiovascular diseases, gastrointestinal disorders, and inflammation [1,2,3,4,5]. Typical reactive amino acids are serine, threonine, tyrosine, lysine, and, most importantly, cysteine, which is one of the two sulfur-containing amino acids. Because cysteine is the most nucleophilic of the 20 canonical amino acid residues, covalent inhibitors targeting reactive cysteines are becoming increasingly significant in drug discovery and development.
In early cysteine reactivity prediction studies, sequence-based approaches were widely utilized because the structural coverage of proteins possessing reactive cysteines was still largely incomplete, and high-quality databases of covalent 3D protein–ligand complexes were unavailable at that time [6,7]. Advances in technology and proteomics led to the development of quantitative activity-based protein profiling (ABPP), which allows for the high-throughput identification of reactive cysteines in proteins and the quantification of their reactivity [8]. The wealth of such data in the literature provides a benchmark data set for the prediction of cysteine reactivity using sequence-based machine learning (ML) algorithms. In addition, several tools have been developed for functional cysteine prediction, such as DeepCys and Cy-preds [9,10]. However, these methods usually globally characterize the structural cysteines’ (i.e., stable disulfide-bonded cysteines) function in the proteome. Although several computational methods are available to predict functional cysteines, the core problem of finding residues that can be targeted and are suitable for assembling highly reactive warheads is not yet solved. In addition, sequence-based prediction models do not consider the spatial environment’s influence on the reactive cysteines.
As more and more 3D structures of proteins are solved and deposited in the Protein Data Bank (PDB), several studies on the structure-based predictions of targetable cysteine have been reported [11,12]. In 2017, Zhang et al. developed a support vector machine (SVM) model to predict the reactivity of cysteines, for a given protein structure, suitable for TCI design. This analysis pointed out that covalently modified cysteines have unique features compared to cysteines without covalent ligand attachments, including lower acid dissociation constant (pKa), larger solvent-accessible surface area (SASA), and higher frequencies of hydrogen bonding; all of these favor covalent bond formation. Moreover, the authors concluded that the number and type of amino acids surrounding the reactive cysteine affect covalent bond formation.
The sequence-based approach allows for the characterization of cysteine conservation profiles, secondary structural profiles, and site-specific energy profiles [13]. Moreover, the structure-based approach allows for the consideration of environmental effects on reactive cysteines. Both approaches have different functions. Therefore, based on the integration of these two methods, higher-quality benchmark datasets, more relevant features, and a robust method for the prediction of reactive cysteines could be developed. The recently reported CovPDB is the most comprehensive database of covalent protein complexes manually annotated from the Protein Data Bank (PDB) to date, with cumulative information on the individual proteins and ligands [14,15]. The nucleophilic proteins with targetable cysteines in this database were used as a benchmark dataset for developing the reactive cysteine prediction ML algorithm. Protein sequences were downloaded from PDB according to the chain ID in which the covalent cysteines were located, and (non)covalent cysteines in the detected binding pockets were then accurately collected and labelled. Based on a comprehensive analysis of the 3D structures and sequences of the (non)covalent cysteines, we developed a stacked ensemble model for predicting the hyper-reactive cysteines, called HyperCys. We believe this work is an important step in theoretical protocols for hyper-reactive druggable cysteine reactivity prediction and will contribute to the design of TCIs with high potency and selectivity. Figure 1 illustrates the workflow of the proposed method.

2. Results and Discussion

2.1. Feature Analysis

2.1.1. Feature Statistical Analysis

Statistical analysis (Table 1) shows that the mean pKa was 11.18 ± 2.0 pK units for druggable cysteines (Cov-Set) and 11.34 ± 1.8 pK units for the cysteines in the NonCov-Set, indicating only a small difference in this reactivity-related property. In contrast, the covalent cysteines are more exposed to solvent: the mean SASA was 28.62 ± 20.73 in the Cov-Set versus 19.01 ± 21.61 in the NonCov-Set (medians of 21.53 and 12.15, respectively), which raises the likelihood that a drug molecule can reach and bind the targeted residue.

2.1.2. Feature Importance Analysis

The following ML classifiers were used in this study: KNN, SVM, LGBM, MLP, LR, and RF. Each classifier was tuned to find its optimum hyperparameters and configuration. Training was performed with scikit-learn, whose GridSearch approach was combined with 10-fold CV to optimize the hyperparameters of each classifier under consideration. The optimum parameters obtained for each ML model after GridSearch are summarized in Table S1.
Feature variables play an important role in creating predictive models, whether for regression or classification. Feature importance is a technique that assigns a relevance score to every feature variable and can be used to decide which features are least or most important for predicting the target variable. We calculated the feature importance of each feature group using the RF classifier to determine which feature profile contributes the most to our proposed model and to improve model performance. The grouped feature importance of each feature profile used in this investigation to predict hyper-reactive cysteines is shown in Figure 2A. The SPP feature group (the SASA and pKa combination) is the most significant, followed by the CP, AAC, PP, SSP, and EP groups. In addition, Figure 2B shows that structure-based feature groups contribute more to the model than sequence-based ones.
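The grouped importance described above can be illustrated with a minimal sketch. It assumes that the importance of a group is the sum of the RF impurity-based importances of its member features; the column layout, the synthetic data, and the helper names `group_columns` and `grouped_importance` are hypothetical and not taken from the HyperCys code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Group sizes follow Table 5; the contiguous column order is an assumption.
GROUP_SIZES = {"SPP": 18, "PP": 4, "AAC": 12, "CP": 41, "EP": 1, "SSP": 4}


def group_columns(group_sizes):
    """Map each feature group to a contiguous block of column indices."""
    columns, start = {}, 0
    for name, size in group_sizes.items():
        columns[name] = list(range(start, start + size))
        start += size
    return columns


def grouped_importance(model, columns):
    """Sum the per-feature importances of a fitted forest within each group."""
    importances = model.feature_importances_
    return {name: float(importances[idx].sum()) for name, idx in columns.items()}


# Synthetic stand-in with the same shape as the benchmark (304 samples, 80 features).
X, y = make_classification(n_samples=304, n_features=80, n_informative=10,
                           random_state=0)
columns = group_columns(GROUP_SIZES)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
scores = grouped_importance(rf, columns)

# Print the groups from most to least important.
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Because the data here are synthetic, the resulting ranking carries no biological meaning; only the aggregation step is illustrated.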

2.2. Training ML Classifiers and Generating an Ensemble Stack Model

2.2.1. Different ML Models

We evaluated each model with 10-fold CV; the results are reported in Table 2. After training and assessing the individual ML classifiers, we developed a stacked model that aggregates the predictions of the base classifiers as a final step. This meta-model was trained using LR. The binary classification ensemble stack model achieved an accuracy of 0.758, an F1 score of 0.721, a recall score of 0.692, and a ROC AUC score of 0.818, as shown in Table 2. With all feature groups and ws = 1, the stacked model (HyperCys) has the highest accuracy of all models, although the margin over LGBM (0.757) and MLP (0.756) is small. Its F1 score is close to the best individual value (MLP, 0.723), while MLP (0.845) and LGBM (0.821) reach a higher ROC AUC.

2.2.2. Different Window Sizes

The improved PSSM profile used in this paper incorporates both evolutionary information and local environmental information. To optimize the window size for predicting hyper-reactive druggable cysteines in a protein, we developed the HyperCys model using window sizes from 1 to 23. The 21-amino-acid window gave the best performance, with a maximum accuracy of 0.784 and an ROC AUC of 0.824 (Table S2), and was therefore selected for the final model (Figure S1).

2.3. Application of an Ensemble Stacked Model Generates Accurate Predictions with Different Feature Groups

Table 3 shows that incorporating additional feature groups into the input descriptors increases HyperCys’ performance. As the feature sets are expanded in order of feature importance, the ROC AUC, F1 score, and accuracy increase for most models. With sequence-based features alone, the models reach accuracies of only 0.556–0.638, so purely sequence-based methods are subject to large uncertainties. The structure-based features perform considerably better than the sequence-based features, which is consistent with the RF feature importance analysis. For HyperCys, combining 3D structural features with the sequence-based features raised the accuracy from 0.626 (sequence-based only) to 0.784, and all feature profiles contribute to the final ROC AUC of 0.824. The combined model predicts hyper-reactive cysteines in targets more accurately than ML models that use only sequence-based or only 3D structural features.
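The incremental comparison of feature groups can be sketched as follows. This is only an illustration under stated assumptions: the groups are added in the importance order reported in Section 2.1.2, the columns are assumed to be stored contiguously in that order, a single logistic regression pipeline stands in for the full set of models, and the data are synthetic. The function name `incremental_scores` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# (group, size) pairs in decreasing importance; sizes follow Table 5.
ORDERED_GROUPS = [("SPP", 18), ("CP", 41), ("AAC", 12),
                  ("PP", 4), ("SSP", 4), ("EP", 1)]


def incremental_scores(X, y, ordered_groups, n_splits=10, seed=0):
    """Add one feature group at a time and report mean CV accuracy."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    results, used, start = [], [], 0
    for name, size in ordered_groups:
        used.extend(range(start, start + size))  # columns of all groups so far
        start += size
        acc = cross_val_score(model, X[:, used], y, cv=cv,
                              scoring="accuracy").mean()
        results.append((name, len(used), float(acc)))
    return results


# Synthetic stand-in data (304 samples, 80 features).
X, y = make_classification(n_samples=304, n_features=80, n_informative=10,
                           random_state=0)
results = incremental_scores(X, y, ORDERED_GROUPS)
for name, n_features, acc in results:
    print(f"+{name:<3} -> {n_features:2d} features, accuracy {acc:.3f}")
```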

3. Materials and Methods

3.1. Data Source

The covalent cysteine dataset (Cov-Set) was collected from the CovPDB database, a manually curated collection of high-resolution covalent protein–ligand (cP–L) complexes that we had previously created [14]. Other cysteines on the same target constitute the noncovalent cysteine dataset (NonCov-Set). The sequences of each target were extracted from the PDB according to the chain IDs, and the (non)covalent cysteines in the detected binding pockets were carefully annotated. In the CovPDB, there are 959 cP–L complexes in which the ligands’ targets are cysteines. For the benchmark dataset, we used CD-HIT to remove redundant sequences sharing ≥50% sequence identity [16]. Finally, 304 cysteines were included in this study, comprising 135 covalent and 169 noncovalent cysteines from 147 unique cP–L complexes. More information about the dataset statistics is given in Table 4.

3.2. Descriptors

In this study, we collected cysteine descriptors at two levels. Firstly, the physico-chemical properties of (non)covalent cysteines were extracted from high-resolution cP–L complexes. Specifically:
(1) Fpocket was used to determine whether or not amino acids are located in detectable binding pockets, and only those cysteines located in pockets were retained [17]. Pocket features, such as druggability, hydrophobicity, and polarity, were collected.
(2) With the help of a PyMOL (Schrödinger LLC, USA) Python script, the total number of surrounding amino acid residues within 4, 6, 8, and 10 Å of cysteine and the respective number of polar and hydrophobic amino acids, according to the hydrophobic/polar (H/P) classification by Kamtekar et al., were calculated [18].
(3) The pKa of each cysteine was calculated because it affects the rate of covalent modification. In theory, cysteines with low pKa values are more vulnerable to covalent alteration [11]. Furthermore, we calculated SASA, a measure of how much of the area of a molecule is available to the solvent. A cysteine with a high SASA value may easily be polarized, increasing the chance for a ligand with a proper warhead to form a covalent bond. The pKa and SASA values for each cysteine and its surrounding amino acids within 4, 6, 8, and 10 Å of the cysteine were calculated using PROPKA 3 and freeSASA, respectively [19,20].
(4) Depth was used to calculate the depth of cysteine burial [21].
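The neighbour counts in item (2) can be sketched without PyMOL, working directly on atomic coordinates. The sketch below makes several assumptions that are not stated in the text: distances are measured from the cysteine sulfur atom to the closest atom of each residue, the hydrophobic and polar sets follow one reading of the binary code of Kamtekar et al., and residues in neither set count only toward the total. The function name `count_neighbors` is hypothetical.

```python
import numpy as np

# Assumed H/P sets based on the binary patterning of Kamtekar et al.;
# the authors' exact assignment of all 20 residues is not given in the text.
HYDROPHOBIC = {"PHE", "LEU", "ILE", "MET", "VAL"}
POLAR = {"GLU", "ASP", "LYS", "ASN", "GLN", "HIS"}


def count_neighbors(sg_xyz, residues, radii=(4.0, 6.0, 8.0, 10.0)):
    """Count residues with any atom within each radius of the cysteine sulfur.

    residues: list of (residue_name, coords) where coords has shape (n_atoms, 3).
    Returns {radius: {"Total": int, "H": int, "P": int}}.
    """
    sg_xyz = np.asarray(sg_xyz, dtype=float)
    # Closest-atom distance of every residue to the sulfur atom.
    nearest = [
        (name, np.linalg.norm(np.asarray(coords, dtype=float) - sg_xyz, axis=1).min())
        for name, coords in residues
    ]
    counts = {}
    for radius in radii:
        inside = [name for name, dist in nearest if dist <= radius]
        counts[radius] = {
            "Total": len(inside),
            "H": sum(name in HYDROPHOBIC for name in inside),
            "P": sum(name in POLAR for name in inside),
        }
    return counts


# Toy example: three single-atom "residues" placed along the x axis.
toy_residues = [
    ("LEU", [[3.0, 0.0, 0.0]]),
    ("ASP", [[7.0, 0.0, 0.0]]),
    ("GLY", [[9.5, 0.0, 0.0]]),
]
toy_counts = count_neighbors([0.0, 0.0, 0.0], toy_residues)
print(toy_counts)
```

The twelve values returned for the four radii correspond in layout to the AAC features listed in Table 5 (Total, H, and P at 4, 6, 8, and 10 Å).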
Secondly, in order to create an effective ML model, we extracted various features in cysteine conservation, structural information, and site-specific energy profiles derived from the sequence information. Specifically:
(1) The position-specific scoring matrix (PSSM) corresponding to each protein was generated mainly with the PSI-BLAST tool [22]. PSSM-based feature descriptors have successfully been applied to improve the performance of various predictors of protein attributes. The PSSM consists of position-specific conservation scores of amino acids. For each query protein sequence, PSI-BLAST was used to search the non-redundant NCBI Protein Database (https://ftp.ncbi.nlm.nih.gov/blast/db/, accessed on 15 June 2021) in three iterations with an E-value < 10^−3 to generate a PSSM. We calculated the L × 20 PSSM for each protein sequence, where L is the length of the sequence. Each residue is represented by a 20-dimensional integer-valued vector, as shown in Figure 3. The PSSM is generated by counting the frequency of each amino acid observed at each position of the aligned sequences, and each entry is usually expressed as the log-likelihood ratio of the frequencies of the 20 amino acids. According to the principle of additivity, the similarity of any given sequence to a known motif can be measured by calculating the sum of the likelihood scores of the amino acids that actually occur at each position.
For an element P_{i,j} in the PSSM, its value indicates the probability that the amino acid at position i of the sequence mutates to the j-th amino acid during the evolutionary process. If the value is positive, it indicates a higher probability; otherwise, it indicates a lower probability [23].
(2) The conserved evolutionary information provided by the PSSM was expanded to compute one monogram (MG) feature matrix (1 × L) and one bigram (BG) feature matrix (20 × L) for each sequence. These monogram and bigram features were extracted from the PSSM-updated consensus sequence and represent the probability that one amino acid is replaced by another [24].
(3) DisPredict2 v1.0 and SPINE-X v1.0 were used to predict accessible surface area (ASA) and secondary structure (SS) probabilities for helices, coils, and β-strands from sequence information alone [13,25].
(4) The position-specific estimated energy (PSEE) score for each amino acid was calculated using the method described by Iqbal et al. [13]. PSEE scores are generally used to detect the presence of functional binding regions of proteins.
Overall, we collected 34 features from 3D structures of high-resolution cP–L complexes and 46 features from their sequences, giving 80 features in total (Table 4). Features are summarized in Table 5.
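The log-likelihood scoring and the additivity principle described for the PSSM in item (1) of the sequence-level descriptors can be illustrated with a toy example. This is not how PSI-BLAST computes its matrices: the sketch assumes ungapped aligned sequences, a uniform background frequency of 1/20, and a simple additive pseudocount, all chosen for illustration only. The names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {amino_acid: log2 odds} row per alignment column.

    Assumes ungapped sequences of equal length and a uniform background.
    """
    length = len(aligned_seqs[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            row[aa] = math.log2(freq / background)  # positive: enriched
        pssm.append(row)
    return pssm


def score(pssm, seq):
    """Additivity: sum the position-specific scores of the observed residues."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


toy_alignment = ["ACD", "ACE", "ACD", "GCD"]
toy_pssm = build_pssm(toy_alignment)
print(round(toy_pssm[1]["C"], 2), round(toy_pssm[1]["W"], 2))
print(score(toy_pssm, "ACD") > score(toy_pssm, "WWW"))
```

A conserved residue (C in the second column) receives a positive score and an unobserved one a negative score, which matches the interpretation of positive and negative PSSM elements given above.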

3.3. Feature Window Size Selection

The structural state of a residue is not dictated solely by the amino acid residue itself but is also influenced by nearby residues. To capture information from nearby residues, we applied a smoothed PSSM encoding scheme with sliding window sizes (ws) from 3 to 23 amino acids in steps of 2. The feature vector of a residue α_i is represented by the summation of the ws surrounding row vectors (V_{i−(ws−1)/2}, ..., V_i, ..., V_{i+(ws−1)/2}). For the N terminus and C terminus of a protein, (ws − 1)/2 zero vectors, each consisting of 20 zero elements, are appended to the head or tail of the PSSM profile. For example, as shown in Figure 4, when Vi = 6, ws = 5, the smoothed PSSM becomes (−2) + (−2) + (−3) + 3 + (−2) = −6 at the first position, followed by residues calculated in the same way [23].
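The smoothing scheme can be written compactly with NumPy. The following is a minimal sketch of the windowed summation with zero padding as described above; the function name `smooth_pssm` and the toy matrix are illustrative and not taken from the HyperCys repository.

```python
import numpy as np


def smooth_pssm(pssm, ws):
    """Sum each PSSM row with its (ws - 1) / 2 neighbours on either side.

    pssm: array of shape (L, 20); ws: odd window size.
    Zero rows are added at both termini so the output keeps shape (L, 20).
    """
    if ws < 1 or ws % 2 == 0:
        raise ValueError("ws must be a positive odd integer")
    pssm = np.asarray(pssm, dtype=float)
    half = (ws - 1) // 2
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    return np.stack([padded[i:i + ws].sum(axis=0) for i in range(pssm.shape[0])])


# Toy PSSM with 7 residues; only the first column is filled for readability.
toy = np.zeros((7, 20))
toy[:, 0] = [-2, -2, -3, 3, -2, 1, 0]
smoothed = smooth_pssm(toy, ws=5)
print(smoothed[:, 0])
```

For the third residue, the window covers the values −2, −2, −3, 3, and −2, reproducing the sum of −6 from the worked example; for the first residue, two zero rows are padded in front, so only the first three values contribute.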

3.4. Machine Learning Methods

We used a stacking-based ML technique to construct HyperCys, an approach that has recently been applied successfully to diverse bioinformatics challenges [26,27]. Stacking is an ensemble ML technique that combines the outputs of several models, trained at different stages, into a new model. Stacking is expected to produce more accurate results than standalone ML approaches because combining the information from several prediction models reduces the generalization error. Six machine learning models, namely k-nearest neighbors (kNN), logistic regression (LR), support vector machine (SVM), light gradient boost machine (LGBM), multi-layer perceptron classifier (MLP), and random forest (RF), are used in the proposed system to build the stacking model.
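A stacked ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`. The sketch rests on assumptions: five base learners with LR as the meta-classifier (one reading of the abstract), scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM to avoid an extra dependency, default hyperparameters instead of the tuned values in Table S1, synthetic data, and a 3-fold outer CV to keep the demonstration fast (the study uses 10-fold CV). The helper name `build_stack` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stack(seed=0):
    """Five base learners whose out-of-fold predictions feed an LR meta-model."""
    base = [
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=seed))),
        # Stand-in for LightGBM (assumption, to stay within scikit-learn).
        ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
        ("mlp", make_pipeline(StandardScaler(),
                              MLPClassifier(max_iter=500, random_state=seed))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
    ]
    meta = LogisticRegression(max_iter=1000)
    # cv=5: the meta-model is trained on 5-fold out-of-fold base predictions.
    return StackingClassifier(estimators=base, final_estimator=meta, cv=5)


# Synthetic stand-in data; not the HyperCys benchmark.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           random_state=0)
stack = build_stack()
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
mean_acc = cross_val_score(stack, X, y, cv=outer_cv, scoring="accuracy").mean()
print(f"stacked accuracy: {mean_acc:.3f}")
```

Training the meta-model on out-of-fold predictions rather than on predictions for the training samples themselves is what keeps the second stage from simply memorizing overfitted base outputs.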
Tuning the hyperparameters of the models is one of the most difficult aspects of constructing ML models. This is because different hyperparameter settings can result in varying levels of accuracy. In order to acquire the best parameter tuning, each learning model was subjected to a GridSearch along with 10-fold cross-validation (CV).
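Grid search combined with 10-fold CV can be sketched with scikit-learn's `GridSearchCV`. The parameter grid below is purely illustrative (the tuned values are reported in Table S1 and are not reproduced here), only the SVM is shown, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with the same number of samples as the benchmark.
X, y = make_classification(n_samples=304, n_features=20, n_informative=6,
                           random_state=0)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Illustrative grid only; not the grid used by the authors.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
    "svc__kernel": ["rbf"],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern would be repeated for each classifier with its own grid before the tuned models are combined in the stack.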

3.5. Performance Evaluation

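Accuracy refers to the proportion of samples that the model predicts correctly (including TP and TN) among all samples in the population, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. It is defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$$
The F1 score is a statistical measure of the accuracy of binary classification (or multitask dichotomous) models. It takes into account both the precision and the recall of the classification model and is their harmonic mean:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of actual positives that the model identifies correctly, defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The receiver operating characteristic area under the curve (ROC AUC) measures the area beneath the ROC curve. It reflects the ability of the model to rank a randomly chosen positive sample above a randomly chosen negative sample.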
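As a small worked example, the metrics above can be computed directly from confusion-matrix counts. The counts and scores below are invented for illustration and are unrelated to the results of this study. The F1 expression used in the code, 2TP / (2TP + FP + FN), is algebraically equivalent to the harmonic mean of precision and recall, and the ROC AUC is computed by pairwise ranking, with ties counted as one half.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented counts: 60 TP, 90 TN, 20 FP, 30 FN.
print(accuracy(60, 90, 20, 30))          # 150 correct out of 200
print(round(recall(60, 30), 3))          # 60 of 90 positives found
print(round(f1_score(60, 20, 30), 3))    # 120 / 170
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3]))  # 3 of 4 pairs ranked correctly
```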

4. Conclusions

Herein, we report the development of the first ensemble stacked model (HyperCys) for predicting hyper-reactive druggable cysteines based on both protein 3D structures and sequences. This method is powerful, as it blends a heterogeneous group of algorithms to expose distinct yet complementary aspects of the data. This study used ML algorithms to determine the optimal combination of protein features. The ensemble algorithm is based on six algorithms that have previously been shown to be effective classification algorithms. It is anticipated that HyperCys will be an effective tool for discovering new potential reactive cysteines for a wide range of targeted proteins.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms24065960/s1.

Author Contributions

Conceptualization, M.G. and S.G.; methodology, M.G. and S.G.; software, M.G.; validation, M.G.; formal analysis, M.G.; investigation, M.G. and S.G.; resources, M.G.; data curation, M.G.; writing—original draft preparation, M.G.; writing—review and editing, S.G.; visualization, M.G.; supervision, S.G.; project administration, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

The financial support of the China Scholarship Council [Grant No. 201908080143] is gratefully acknowledged.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are openly available in a public repository. The data that support the findings of this study are openly available at https://github.com/mingjie-tech/HyperCys_v1.0 (accessed on 1 January 2023).

Acknowledgments

We would like to thank Aurélien F. A. Moumbock for his early contribution to the HyperCys project. We acknowledge support by the Open Access Publication Fund of the University of Freiburg.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zarrin, A.A.; Bao, K.; Lupardus, P.; Vucic, D. Kinase inhibition in autoimmunity and inflammation. Nat. Rev. Drug Discov. 2021, 20, 39–63.
  2. Zhao, Y.; Svensson, F.; Steadman, D.; Frew, S.; Monaghan, A.; Bictash, M.; Moreira, T.; Chalk, R.; Lu, W.; Fish, P.V.; et al. Structural Insights into Notum Covalent Inhibition. J. Med. Chem. 2021, 64, 11354–11363.
  3. Herrmann, J.; Ciechanover, A.; Lerman, L.O.; Lerman, A. The ubiquitin–proteasome system in cardiovascular diseases—A hypothesis extended. Cardiovasc. Res. 2004, 61, 11–21.
  4. Gobert, A.P.; Boutaud, O.; Asim, M.; Zagol-Ikapitte, I.A.; Delgado, A.G.; Latour, Y.L.; Finley, J.L.; Singh, K.; Verriere, T.G.; Allaman, M.M.; et al. Dicarbonyl electrophiles mediate inflammation-induced gastrointestinal carcinogenesis. Gastroenterology 2021, 160, 1256–1268.
  5. He, H.; Jiang, H.; Chen, Y.; Ye, J.; Wang, A.; Wang, C.; Liu, Q.; Liang, G.; Deng, X.; Jiang, W.; et al. Oridonin is a covalent NLRP3 inhibitor with strong anti-inflammasome activity. Nat. Commun. 2018, 9, 2550.
  6. Wang, H.; Chen, X.; Li, C.; Liu, Y.; Yang, F.; Wang, C. Sequence-based prediction of cysteine reactivity using machine learning. Biochemistry 2018, 57, 451–460.
  7. Guang, X.; Guo, Y.; Xiao, J.; Wang, X.; Sun, J.; Xiong, W.; Li, M. Predicting the state of cysteines based on sequence information. J. Theor. Biol. 2010, 267, 312–318.
  8. Weerapana, E.; Wang, C.; Simon, G.M.; Richter, F.; Khare, S.; Dillon, M.B.; Bachovchin, D.A.; Mowen, K.; Baker, D.; Cravatt, B.F. Quantitative reactivity profiling predicts functional cysteines in proteomes. Nature 2010, 468, 790–795.
  9. Nallapareddy, V.; Bogam, S.; Devarakonda, H.; Paliwal, S.; Bandyopadhyay, D. DeepCys: Structure-based multiple cysteine function prediction method trained on deep neural network: Case study on domains of unknown functions belonging to COX2 domains. Proteins Struct. Funct. Bioinform. 2021, 89, 745–761.
  10. Soylu, I.; Marino, S.M. Cy-preds: An algorithm and a web service for the analysis and prediction of cysteine reactivity. Proteins Struct. Funct. Bioinform. 2016, 84, 278–291.
  11. Zhang, W.; Pei, J.; Lai, L. Statistical analysis and prediction of covalent ligand targeted cysteine residues. J. Chem. Inf. Model. 2017, 57, 1453–1460.
  12. Ferrè, F.; Clote, P. DiANNA 1.1: An extension of the DiANNA web server for ternary cysteine classification. Nucleic Acids Res. 2006, 34, W182–W185.
  13. Iqbal, S.; Hoque, M.T. Estimation of position specific energy as a feature of protein residues from sequence alone for structural classification. PLoS ONE 2016, 11, e0161452.
  14. Gao, M.; Moumbock, A.F.A.; Qaseem, A.; Xu, Q.; Günther, S. CovPDB: A high-resolution coverage of the covalent protein–ligand interactome. Nucleic Acids Res. 2022, 50, D445–D450.
  15. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242.
  16. Fu, L.; Niu, B.; Zhu, Z.; Wu, S.; Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28, 3150–3152.
  17. Le Guilloux, V.; Schmidtke, P.; Tuffery, P. Fpocket: An open source platform for ligand pocket detection. BMC Bioinform. 2009, 10, 168.
  18. Kamtekar, S.; Schiffer, J.M.; Xiong, H.; Babik, J.M.; Hecht, M.H. Protein design by binary patterning of polar and nonpolar amino acids. Science 1993, 262, 1680–1685.
  19. Olsson, M.H.; Søndergaard, C.R.; Rostkowski, M.; Jensen, J.H. PROPKA3: Consistent treatment of internal and surface residues in empirical pKa predictions. J. Chem. Theory Comput. 2011, 7, 525–537.
  20. Mitternacht, S. FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 2016, 5, 189.
  21. Tan, K.P.; Nguyen, T.B.; Patel, S.; Varadarajan, R.; Madhusudhan, M.S. Depth: A web server to compute depth, cavity sizes, detect potential small-molecule ligand-binding cavities and predict the pKa of ionizable residues in proteins. Nucleic Acids Res. 2013, 41, W314–W321.
  22. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
  23. Cheng, C.W.; Su, E.C.Y.; Hwang, J.K.; Sung, T.Y.; Hsu, W.L. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinform. 2008, 9, S6.
  24. Iqbal, S.; Hoque, M.T. DisPredict: A predictor of disordered protein using optimized RBF kernel. PLoS ONE 2015, 10, e0141551.
  25. Faraggi, E.; Zhang, T.; Yang, Y.; Kurgan, L.; Zhou, Y. SPINE X: Improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J. Comput. Chem. 2012, 33, 259–267.
  26. Bhasuran, B.; Murugesan, G.; Abdulkadhar, S.; Natarajan, J. Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. J. Biomed. Inform. 2016, 64, 1–9.
  27. Cao, Z.; Pan, X.; Yang, Y.; Huang, Y.; Shen, H.B. The lncLocator: A subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics 2018, 34, 2185–2194.
Figure 1. The machine learning workflow of this study.
Figure 2. Feature importance for different feature profiles based on the RF classifier: (A) for six feature groups and (B) for structure-based (SPP + AAC + PP) and sequence-based (CP + SSP + EP), separately.
Figure 3. The schematic of a PSSM.
Figure 4. Examples of (A) standard PSSM and (B) smoothed PSSM.
Table 1. Statistical analysis of pKa and SASA of cysteines in NonCov-Set and Cov-Set datasets.
| Feature | Groups | Min | Q1 | Median | Q3 | Max | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| SASA | NonCov-Set | 0.09 | 5.54 | 12.15 | 24.45 | 112.51 | 19.01 | 21.61 |
| SASA | Cov-Set | 4.18 | 15.359 | 21.53 | 37.415 | 122.09 | 28.62 | 20.73 |
| pKa | NonCov-Set | 5 | 10.32 | 11.19 | 12.13 | 19.76 | 11.34 | 1.8 |
| pKa | Cov-Set | 6.79 | 9.99 | 10.88 | 12.02 | 20.01 | 11.18 | 2.0 |
Table 2. Performance of six ML models for the prediction of hyper-reactive druggable cysteines involving all feature groups after 10-fold CV (ws = 1).
| Model | ACC | F1 | RECALL | ROC AUC |
|---|---|---|---|---|
| KNN | 0.685 | 0.613 | 0.571 | 0.771 |
| LR | 0.740 | 0.704 | 0.701 | 0.812 |
| SVM | 0.734 | 0.701 | 0.704 | 0.804 |
| LGBM | 0.757 | 0.711 | 0.691 | 0.821 |
| RF | 0.738 | 0.685 | 0.677 | 0.809 |
| MLP | 0.756 | 0.723 | 0.676 | 0.845 |
| HyperCys | 0.758 | 0.721 | 0.692 | 0.818 |
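Table 3. ML models’ accuracy performance for different feature combinations with the best window size (ws = 21).
| Feature Combination | Metric | KNN | LR | SVM | LGBM | RF | MLP | HyperCys |
|---|---|---|---|---|---|---|---|---|
| sequence-based | ACC | 0.608 | 0.632 | 0.638 | 0.565 | 0.592 | 0.556 | 0.626 |
| sequence-based | F1 | 0.521 | 0.563 | 0.520 | 0.486 | 0.493 | 0.014 | 0.496 |
| sequence-based | RECALL | 0.482 | 0.534 | 0.445 | 0.468 | 0.488 | 0.008 | 0.385 |
| sequence-based | ROC AUC | 0.624 | 0.661 | 0.674 | 0.583 | 0.574 | 0.656 | 0.670 |
| structure-based | ACC | 0.684 | 0.707 | 0.721 | 0.760 | 0.743 | 0.694 | 0.757 |
| structure-based | F1 | 0.577 | 0.637 | 0.669 | 0.718 | 0.681 | 0.603 | 0.714 |
| structure-based | RECALL | 0.496 | 0.589 | 0.654 | 0.698 | 0.674 | 0.542 | 0.681 |
| structure-based | ROC AUC | 0.718 | 0.783 | 0.765 | 0.836 | 0.819 | 0.781 | 0.823 |
| structure- & sequence-based | ACC | 0.704 | 0.747 | 0.710 | 0.770 | 0.741 | 0.783 | 0.784 |
| structure- & sequence-based | F1 | 0.633 | 0.704 | 0.669 | 0.722 | 0.674 | 0.727 | 0.754 |
| structure- & sequence-based | RECALL | 0.570 | 0.680 | 0.669 | 0.683 | 0.653 | 0.743 | 0.742 |
| structure- & sequence-based | ROC AUC | 0.777 | 0.801 | 0.799 | 0.847 | 0.814 | 0.852 | 0.824 |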
Table 4. Overview of the dataset statistics.
| Attributes | Count |
|---|---|
| cP–L complexes | 147 |
| Different covalent mechanisms | 14 |
| Pre-reactive ligands | 134 |
| Warheads | 40 |
| Ligand types | 6 |
| Nucleophilic proteins | 147 |
| Protein classes | 9 |
| Collected features | 80 |
| Covalent cysteines | 135 |
| Noncovalent cysteines | 169 |
Table 5. Feature description.
| Feature Type | Feature Group | Size | Feature |
|---|---|---|---|
| Structure-based | SASA-PKA Profile (SPP) | 18 | 10Å.pka.total; 10Å.pka.ave; 10Å.SASA.total; 10Å.SASA.ave; 8Å.pka.total; 8Å.pka.ave; 8Å.SASA.total; 8Å.SASA.ave; 6Å.pka.total; 6Å.pka.ave; 6Å.SASA.total; 6Å.SASA.ave; 4Å.pka.total; 4Å.pka.ave; 4Å.SASA.total; 4Å.SASA.ave; SASA; pKa |
| Structure-based | Pocket Profile (PP) | 4 | Atom Depth; Drug Score; Hydrophobicity Score; Polarity Score |
| Structure-based | Amino Acid Composition (AAC) | 12 | 10Å.Total; 10Å.H; 10Å.P; 8Å.Total; 8Å.H; 8Å.P; 6Å.Total; 6Å.H; 6Å.P; 4Å.Total; 4Å.H; 4Å.P |
| Sequence-based | Conservation Profile (CP) | 41 | 20 PSSMs; monograms; 20 bigrams |
| Sequence-based | Energy Profile (EP) | 1 | PSEE |
| Sequence-based | Secondary Structural Profile (SSP) | 4 | Helix probability; Beta-Strand probability; Coil probability; ASA |

Gao, M.; Günther, S. HyperCys: A Structure- and Sequence-Based Predictor of Hyper-Reactive Druggable Cysteines. Int. J. Mol. Sci. 2023, 24, 5960. https://doi.org/10.3390/ijms24065960
