A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications

Xiao geng Wan; Xin ying Tan; Jun Cao

doi:10.11648/j.cbb.20241201.13

Research Article | Peer-Reviewed

Published in Computational Biology and Bioinformatics (Volume 12, Issue 1)

Received: 14 August 2024; Accepted: 7 September 2024; Published: 23 September 2024

Abstract

Protein evolutionary classification from amino acid sequences is an active research topic in computational biology and bioinformatics. The amino acid composition and arrangement of a protein sequence carry hints about its evolutionary origin, yet extracting features from an amino acid sequence into a numerical vector remains a challenging problem. Traditional feature methods extract sequence information either from individual amino acids or from kmers, and their classification accuracy is limited. To improve the accuracy of protein evolutionary classification, six new features defined on separated amino acid pairs are proposed, in which the composition, arrangement, and physical properties of the different combinations of separated amino acid pairs are considered. Unlike the usual treatment of amino acid pairs, the new features describe pairs of amino acids separated by spatial intervals in the sequence, which may better reflect the spatial relationships and characteristics of the paired amino acids. To test the new features, five standard protein evolutionary classification examples are employed, in which the proposed features are compared with classical protein sequence features such as averaged property factors (APF), natural vector (NV), and pseudo amino acid composition (PseAAC), as well as the kmer versions of these features. The area under the precision-recall curve (AUPRC) analysis shows that the new features are effective for evolutionary classification and outperform traditional protein sequence features based on individual amino acids and kmers. Parameter analysis of the separated amino acid pair features and the kmer features shows that medium or longer amino acid pair intervals and kmer lengths may achieve higher classification accuracy. From this analysis, the newly proposed separated amino acid pairs with spatial intervals are shown to be efficient units for extracting protein sequence features, and they may capture richer evolutionary information than individual amino acids and kmers.
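The abstract does not define the six features, so the following is only a minimal sketch of the general idea of a separated amino acid pair: two residues taken from the same sequence with a fixed number of positions between them. It computes the relative frequency of each ordered pair for one interval. The function name `separated_pair_composition`, the `gap` parameter, and the 400-dimensional layout are assumptions made for illustration and are not the authors' definitions.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in a fixed order (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_VALID = set(AMINO_ACIDS)


def separated_pair_composition(sequence, gap):
    """Return a 400-dimensional vector of ordered-pair frequencies.

    A pair (a, b) is counted when b occurs exactly gap + 1 positions
    after a, i.e. `gap` residues lie between them. gap = 0 gives the
    ordinary adjacent dipeptide composition.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    counts = Counter()
    for i in range(len(sequence) - step):
        a, b = sequence[i], sequence[i + step]
        # Skip pairs that involve non-standard symbols such as 'X'.
        if a in _VALID and b in _VALID:
            counts[a + b] += 1
    total = sum(counts.values())
    # Normalise to frequencies; an all-zero vector means no valid pairs.
    return [
        counts[x + y] / total if total else 0.0
        for x, y in product(AMINO_ACIDS, repeat=2)
    ]
```

Concatenating the vectors obtained for several values of `gap` would give one possible multi-interval representation. Whether this matches the composition feature used in the paper cannot be determined from the abstract alone.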

Published in	Computational Biology and Bioinformatics (Volume 12, Issue 1)
DOI	10.11648/j.cbb.20241201.13
Page(s)	18-31
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Protein Sequence, Features, Amino Acid Pair, Evolutionary Classification

Cite This Article

APA Style

Wan, X. G., Tan, X. Y., Cao, J. (2024). A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Computational Biology and Bioinformatics, 12(1), 18-31. https://doi.org/10.11648/j.cbb.20241201.13

ACS Style

Wan, X. G.; Tan, X. Y.; Cao, J. A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Comput. Biol. Bioinform. 2024, 12(1), 18-31. doi: 10.11648/j.cbb.20241201.13

AMA Style

Wan XG, Tan XY, Cao J. A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Comput Biol Bioinform. 2024;12(1):18-31. doi: 10.11648/j.cbb.20241201.13

@article{10.11648/j.cbb.20241201.13,
  author = {Xiao geng Wan and Xin ying Tan and Jun Cao},
  title = {A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications},
  journal = {Computational Biology and Bioinformatics},
  volume = {12},
  number = {1},
  pages = {18-31},
  doi = {10.11648/j.cbb.20241201.13},
  url = {https://doi.org/10.11648/j.cbb.20241201.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.cbb.20241201.13},
 year = {2024}
}

TY  - JOUR
T1  - A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications
AU  - Xiao geng Wan
AU  - Xin ying Tan
AU  - Jun Cao
Y1  - 2024/09/23
PY  - 2024
N1  - https://doi.org/10.11648/j.cbb.20241201.13
DO  - 10.11648/j.cbb.20241201.13
T2  - Computational Biology and Bioinformatics
JF  - Computational Biology and Bioinformatics
JO  - Computational Biology and Bioinformatics
SP  - 18
EP  - 31
PB  - Science Publishing Group
SN  - 2330-8281
UR  - https://doi.org/10.11648/j.cbb.20241201.13
VL  - 12
IS  - 1
ER  -

Author Information

Xiao geng Wan

Department of Mathematics, Beijing University of Chemical Technology, Beijing, China

http://orcid.org/0000-0002-1048-9810

Xin ying Tan

The Fourth Medical Center, PLA General Hospital, Beijing, China

http://orcid.org/0009-0000-2118-2489

Jun Cao

Faculty of Environment and Life, Beijing University of Technology, Beijing, China

http://orcid.org/0000-0003-0211-9227
