Protein-DNA interface hotspots prediction based on fusion features of embeddings of protein language model and handcrafted features

Bibliographic Details
Published in: Computational biology and chemistry, 2023-12, Vol. 107, p. 107970-107970, Article 107970
Main Authors: Li, Xiang, Wang, Gang-Ao, Wei, Zhuoyu, Wang, Hong, Zhu, Xiaolei
Format: Article
Language: English
Description
Summary: Identifying hotspot residues at protein-DNA binding interfaces plays a crucial role in areas such as drug discovery and disease treatment. Experimental methods such as alanine scanning mutagenesis can determine hotspot residues on protein-DNA interfaces, but they are inefficient and costly, so efficient and accurate computational methods for predicting hotspot residues are needed. Several computational methods have been developed; however, they rely mainly on handcrafted features, which may not capture all the information in a protein. We therefore propose a model called PDH-EH, which fuses embeddings extracted from a protein language model (PLM) with handcrafted features. After extracting a total of 1141 feature dimensions, we used mRMR (minimum redundancy maximum relevance) to select the optimal feature subset. Based on this subset, several learning algorithms, including Random Forest, Support Vector Machine, and XGBoost, were used to build models. Cross-validation on the training dataset shows that the Random Forest model achieves the highest AUROC. Further evaluation on the independent test set shows that our model outperforms existing state-of-the-art models. Our analysis also demonstrates the effectiveness and interpretability of the embeddings extracted from the PLM. The code and datasets used in this study are available at: https://github.com/lixiangli01/PDH-EH.

• The embeddings of a PLM are used in protein-DNA hotspot prediction.
• The effectiveness of the PLM embeddings is demonstrated in our study.
• The proposed model, PDH-EH, outperforms the other state-of-the-art models.
• Through the attention mechanism, the predictive results can be interpreted.
ISSN: 1476-9271, 1476-928X
DOI: 10.1016/j.compbiolchem.2023.107970
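
Note: The summary mentions fusing handcrafted features with PLM embeddings and then selecting a subset with mRMR, but it does not say how either step is configured. The following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that fusion is plain concatenation of feature columns and that both relevance and redundancy are scored with absolute Pearson correlation (mRMR is often defined with mutual information instead). The toy data and all names (`handcrafted`, `embedding`, `mrmr_select`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mrmr_select(columns, labels, k):
    """Greedy mRMR: repeatedly add the column with the best
    (relevance to labels) minus (mean redundancy with already chosen columns)."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumption): two latent signals jointly decide the 0/1 label.
random.seed(0)
n = 300
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
labels = [1 if x + y > 0 else 0 for x, y in zip(a, b)]

# Hypothetical "handcrafted" block: signal a plus a near-duplicate of it.
handcrafted = [a, [x + random.gauss(0, 0.05) for x in a]]
# Hypothetical "embedding" block: signal b plus an uninformative column.
embedding = [b, [random.gauss(0, 1) for _ in range(n)]]

# Fusion here is simple column-wise concatenation of the two blocks.
fused = handcrafted + embedding

selected = mrmr_select(fused, labels, k=2)
print("selected column indices:", selected)
```

In this toy setup the near-duplicate column is penalised for redundancy, so the two selected columns should cover both latent signals rather than the same one twice. In a full pipeline the selected columns would then be passed to a classifier such as Random Forest and scored by cross-validated AUROC.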