Subscribe to Bioinformatics Oxford Journals feed

Optimal Selection of Suitable Templates in Protein Interface Prediction

Mon, 21/08/2023 - 5:30am
Abstract
Motivation: Molecular-level classification of protein-protein interfaces can greatly assist in functional characterization and rational drug design. The most accurate protein interface predictions rely on finding homologous proteins with known interfaces, since most interfaces are conserved within the same protein family. The accuracy of these template-based prediction approaches depends on the correct choice of suitable templates. Choosing the right templates in the immunoglobulin superfamily (IgSF) is challenging because its members share low sequence identity and display a wide range of alternative binding sites despite structural homology.
Results: We present a new approach to predict protein interfaces. First, template-specific, informative evolutionary profiles are established using a mutual information-based approach. Next, based on the similarity of residue-level conservation scores derived from the evolutionary profiles, a query protein is hierarchically clustered with all available template proteins in its superfamily with known interface definitions. Once clustered, a subset of the most closely related templates is selected, and an interface prediction is made. These initial interface predictions are subsequently refined by extensive docking. This method was benchmarked on 51 IgSF proteins and can predict non-trivial interfaces of IgSF proteins with an average and median F-score of 0.64 and 0.78, respectively. We also provide a way to assess the confidence of the results. The average and median F-scores increase to 0.8 and 0.81, respectively, if 27% of low-confidence cases and 17% of medium-confidence cases are removed. Lastly, we provide residue-level interface predictions, protein complexes, and confidence measurements for singletons in the IgSF.
Availability: Source code is freely available at https://gitlab.com/fiserlab.org/interdct_with_refinement
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
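
The F-scores quoted in the interface prediction abstract above compare a predicted set of interface residues with a known one. The following is a minimal sketch of that kind of score, assuming the F-score is the usual harmonic mean of precision and recall (F1); the abstract does not state the exact variant, and the function name and residue numbers are hypothetical.

```python
def interface_f_score(predicted, actual):
    """Return (precision, recall, F-score) for two sets of residue indices."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # No overlap: all three scores are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    # Assumed definition: harmonic mean of precision and recall (F1).
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical residue numbers, for illustration only.
predicted_interface = {10, 11, 12, 15, 40}
known_interface = {11, 12, 15, 16}
precision, recall, f_score = interface_f_score(predicted_interface, known_interface)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

Three of the five predicted residues are correct and three of the four known residues are recovered, so the score rewards both a compact prediction and good coverage of the true interface.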

StarPep Toolbox: An Open-Source Software to Assist Chemical Space Analysis of Bioactive Peptides and Their Functions using Complex Networks

Mon, 21/08/2023 - 5:30am
Abstract
Motivation: Antimicrobial peptides (AMPs) are promising molecules to treat infectious diseases caused by multidrug-resistant pathogens, some types of cancer, and other conditions. Computer-aided strategies are efficient tools for the high-throughput screening of AMPs.
Results: This report highlights StarPep Toolbox, an open-source and user-friendly software to study the bioactive chemical space of AMPs using complex network-based representations, clustering, and similarity-searching models. The novelty of this research lies in the combination of network science and similarity-searching techniques, distinguishing it from conventional methods based on machine learning and other computational approaches. The network-based representation of the AMP chemical space presents promising opportunities for peptide drug repurposing, development, and optimization. This approach could serve as a baseline for the discovery of a new generation of therapeutic peptides.
Availability: All underlying code and installation files are accessible through GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/StarPep) under the Apache 2.0 license.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Dynamic applicability domain (dAD): compound-target binding affinity estimates with local conformal prediction

Fri, 18/08/2023 - 5:30am
Abstract
Motivation: Increasing efforts are being made in the field of machine learning to advance the learning of robust and accurate models from experimentally measured data and enable more efficient drug discovery processes. The prediction of binding affinity is one of the most frequent tasks of compound bioactivity modelling. Learned models for binding affinity prediction are assessed by their average performance on unseen samples, but point predictions are typically not provided with a rigorous confidence assessment. Approaches such as the conformal predictor framework equip conventional models with a more rigorous assessment of confidence for individual point predictions. In this paper, we extend the inductive conformal prediction (ICP) framework for interaction data, in particular the compound-target binding affinity prediction task. The new framework is based on dynamically defined calibration sets that are specific for each testing pair and provides prediction assessment in the context of calibration pairs from its compound-target neighbourhood, enabling improved estimates based on the local properties of the prediction model.
Results: The effectiveness of the approach is benchmarked on several publicly available datasets and tested in realistic use-case scenarios with increasing levels of difficulty on a complex compound-target binding affinity space. We demonstrate that in such scenarios, the novel approach, which combines the applicability domain paradigm with the conformal prediction framework, produces superior confidence assessment with valid and more informative prediction regions compared to other state-of-the-art conformal prediction approaches.
Availability: The dataset and code are available on GitHub (https://github.com/mlkr-rbi/dAD).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
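
The dAD abstract above builds on inductive conformal prediction (ICP). The following is a minimal sketch of plain ICP for regression, assuming absolute residuals as the nonconformity score and a single fixed calibration set; it is not the dAD method, which by the abstract's description selects calibration pairs dynamically from each test pair's neighbourhood. The function name and all numbers are hypothetical.

```python
import math


def icp_interval(calibration_residuals, point_prediction, significance=0.25):
    """Symmetric prediction interval from absolute calibration residuals.

    calibration_residuals: |observed - predicted| on held-out calibration data.
    significance: target error rate; the interval aims for 1 - significance coverage.
    """
    scores = sorted(calibration_residuals)
    n = len(scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - significance)).
    rank = math.ceil((n + 1) * (1 - significance))
    if rank > n:
        # Too few calibration points to bound the error at this level.
        return -math.inf, math.inf
    half_width = scores[rank - 1]
    return point_prediction - half_width, point_prediction + half_width


# Hypothetical absolute errors of an affinity model on seven calibration pairs.
residuals = [0.4, 0.1, 0.9, 0.2, 0.7, 0.3, 0.5]
low, high = icp_interval(residuals, point_prediction=6.0, significance=0.25)
print(f"75% prediction interval: [{low:.1f}, {high:.1f}]")
```

A locally calibrated variant would pass only the residuals of calibration pairs judged similar to the test pair, so the interval width reflects how the model behaves in that region rather than on average.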

Predicted structural proteome of Sphagnum divinum and proteome-scale annotation

Thu, 17/08/2023 - 5:30am
Abstract
Motivation: Sphagnum-dominated peatlands store a substantial amount of terrestrial carbon. The genus is undersampled and under-studied. No experimental crystal structure from any Sphagnum species exists in the Protein Data Bank and fewer than 200 Sphagnum-related genes have structural models available in the AlphaFold Protein Structure Database. Tools and resources are needed to help bridge these gaps, and to enable the analysis of other structural proteomes now made possible by accurate structure prediction.
Results: We present the predicted structural proteome (25,134 primary transcripts) of S. divinum computed using AlphaFold, structural alignment results of all high-confidence models against an annotated non-redundant crystallographic database of over 90,000 structures, a structure-based classification of putative Enzyme Commission (EC) numbers across this proteome, and the computational method to perform this proteome-scale structure-based annotation.
Availability: All data and code are available in public repositories, detailed at https://github.com/BSDExabio/SAFA. The structural models of the S. divinum proteome have been deposited in the ModelArchive repository at https://modelarchive.org/doi/10.5452/ma-ornl-sphdiv.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Adjusting for gene-specific covariates to improve RNA-seq analysis

Thu, 17/08/2023 - 5:30am
Abstract
Summary: This paper proposes a novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable. In this context, we propose a rejection rule that accounts for heterogeneity among tests by employing two distinct types of null probabilities. We establish a pFDR estimator for a given rejection rule by following Storey’s q-value framework. A condition on a type 1 error posterior probability is provided that equivalently characterizes our rejection rule. We also present a suitable procedure for selecting a tuning parameter through cross-validation that maximizes the expected number of hypotheses declared significant. A simulation study demonstrates that our method is comparable to or better than existing methods across realistic scenarios. In data analysis, we find support for our method’s premise that the null probability varies with a gene-specific covariate variable.
Supplementary information: The online supplementary material includes proofs of theorems and results of additional simulations and data analysis.
Categories: Bioinformatics Trends
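
The RNA-seq abstract above follows Storey's q-value framework. As background, here is a minimal sketch of the standard, covariate-free q-value computation with a fixed tuning parameter; it is not the covariate-adjusted rejection rule the paper proposes, and the function names, the choice `lam=0.5`, and the p-values are hypothetical.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimate the null proportion as #{p > lam} / (m * (1 - lam)), capped at 1."""
    m = len(p_values)
    return min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))


def q_values(p_values, lam=0.5):
    """Return one q-value per p-value, in the original order."""
    m = len(p_values)
    pi0 = storey_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


# Hypothetical per-gene p-values, for illustration only.
p = [0.001, 0.01, 0.02, 0.04, 0.3, 0.6, 0.8, 0.9]
print("pi0 estimate:", storey_pi0(p))
print("q-values:", [round(x, 3) for x in q_values(p)])
```

Here a single null proportion is shared by all genes; the abstract's premise is that this proportion instead varies with a covariate such as gene length.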

TranSyT, an innovative framework for identifying transport systems

Thu, 17/08/2023 - 5:30am
Abstract
Motivation: The importance and rate of development of genome-scale metabolic models have been growing for the last few years, increasing the demand for software solutions that automate several steps of this process. However, since TRIAGE’s release, software development for the automatic integration of transport reactions into models has stalled.
Results: Here we present the Transport Systems Tracker (TranSyT). Unlike other transport systems annotation software, TranSyT does not rely on manual curation to expand its internal database, which is derived from highly curated records retrieved from the Transporters Classification Database and complemented with information from other data sources. TranSyT compiles information regarding transporter families and proteins, and derives reactions into its internal database, making it available for rapid annotation of complete genomes. All transport reactions have GPR associations and can be exported with identifiers from four different metabolite databases. TranSyT is currently available as a plugin for merlin v4.0 and an app for KBase.
Availability: TranSyT web service: https://transyt.bio.di.uminho.pt/; GitHub for the tool: https://github.com/BioSystemsUM/transyt; GitHub with examples and instructions to run TranSyT: https://github.com/ecunha1996/transyt_paper.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

iDeLUCS: A deep learning interactive tool for alignment-free clustering of DNA sequences

Thu, 17/08/2023 - 5:30am
Abstract
Summary: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS) that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of ~20%, and the two specialized algorithms by an average of ~12%, on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabelled DNA sequences.
Availability and implementation: iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Bayesian Multi-task Learning for Medicine Recommendation Based on Online Patient Reviews

Tue, 08/08/2023 - 5:30am
Abstract
Motivation: We propose a drug recommendation model that integrates information from both structured data (patient demographic information) and unstructured texts (patient reviews). It is based on multitask learning to predict review ratings of several satisfaction-related measures for a given medicine, where related tasks can learn from each other for prediction. The learned models can then be applied to new patients for drug recommendation. This is fundamentally different from most recommender systems in e-commerce, which do not work well for new customers (referred to as the cold-start problem). To extract information from review texts, we employ both topic modeling and sentiment analysis. We further incorporate variable selection into the model via Bayesian LASSO, which aims to filter out irrelevant features. To the best of our knowledge, this is the first Bayesian multitask learning method for ordinal responses. We are also the first to apply multitask learning to medicine recommendation. The sample code and data are made available on GitHub.
Results: We evaluate the proposed method on two sets of drug reviews involving 17 drugs related to depression or high blood pressure. Overall, our method performs better than existing benchmark methods in terms of accuracy and AUC. It is effective even with a small sample size and only a few available features, and is more robust to possible non-informative covariates. Owing to the model's explainability, insights generated from our model may serve as a useful reference for doctors. In practice, however, a final decision should be carefully made by combining the information from the proposed recommender with doctors’ domain knowledge and past experience.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning

Mon, 07/08/2023 - 5:30am
Abstract
Motivation: Few-shot learning (FSL) that can effectively perform named entity recognition in low-resource scenarios has attracted growing attention, but it has not yet been widely studied in the biomedical field. In contrast to high-resource domains, biomedical named entity recognition (BioNER) often encounters limited human-labeled data in real-world scenarios, leading to poor generalization performance when training on only a few labeled instances. Recent approaches either leverage cross-domain high-resource data or fine-tune the pre-trained masked language model using limited labeled samples to generate new synthetic data; these approaches are prone to domain shift problems or yield low-quality synthetic data. Therefore, in this paper, we study a more realistic scenario, i.e., few-shot learning for BioNER.
Results: Leveraging the domain knowledge graph, we propose knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on similar semantic relations of neighbor nodes. In addition, by introducing a question prompt, we cast BioNER as a question answering (QA) task and propose prompt contrastive learning to improve the robustness of the model by measuring the mutual information (MI) between query-answer pairs. Extensive experiments conducted on various few-shot settings show that the proposed framework achieves superior performance. In particular, in a low-resource scenario with only 20 samples, our approach substantially outperforms recent state-of-the-art (SoTA) models on four benchmark datasets, achieving an average improvement of up to 7.1% F1.
Availability: Our source code and data are available at https://github.com/cpmss521/KGPC.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

TALAIA: A 3D visual dictionary for protein structures

Mon, 07/08/2023 - 5:30am
Abstract
Summary: Graphical analysis of the molecular structure of proteins can be very complex. Full-atom representations retain most geometric information but are generally crowded, and key structural patterns can be challenging to identify. Non-full-atom representations can be more instructive on physicochemical aspects but insufficiently detailed regarding shapes (e.g., entity bean-like models in coarse-grained approaches) or simple properties of amino acids (e.g., representation of superficial electrostatic properties). TALAIA aims to provide another layer of structural representations. It is a visual dictionary where a unique object, with differentiated shapes and colors, represents each amino acid. It makes it easier to spot crucial molecular information, including patches of amino acids or key interactions between side chains. Most conventions used in TALAIA are standard in chemistry and biochemistry, so experimentalists and modelers can rapidly grasp the meaning of any TALAIA depiction.
Motivation: The work aims to offer a visual grammar that combines simple representations of amino acids while retaining their general geometry and physicochemical properties.
Results: We propose a tool that renders protein structures and encodes structure and physicochemical aspects as a simple visual grammar. The approach is fast, highly informative, and intuitive, allowing the identification of possible interactions, hydrophobic patches, and other characteristic structural features at first glance.
Availability: https://github.com/insilichem/talaia
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Enhancing Cryo-EM Maps With 3D Deep Generative Networks For Assisting Protein Structure Modeling

Mon, 07/08/2023 - 5:30am
Abstract
Motivation: The tertiary structures of an increasing number of biological macromolecules have been determined using cryo-electron microscopy (cryo-EM). However, there are still many cases where the resolution is not high enough to model the molecular structures with standard computational tools. If the resolution obtained is near the empirical borderline (3–4.5 Å), improvement in the map quality facilitates improved structure modeling.
Results: We report EM-GAN, a novel approach that modifies an input cryo-EM map to assist protein structure modeling. The method uses a 3D generative adversarial network (GAN) that has been trained on high- and low-resolution density maps to learn the density patterns, and modifies the input map to enhance its suitability for modeling. The method was tested extensively on a dataset of 65 EM maps in the resolution range of 3 Å to 6 Å and showed substantial improvements in structure modeling using popular protein structure modeling tools.
Availability: https://github.com/kiharalab/EM-GAN, Google Colab: https://tinyurl.com/3ccxpttx
Categories: Bioinformatics Trends

NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

Mon, 07/08/2023 - 5:30am
Abstract
Summary: Oxford Nanopore Technologies' (ONT) sequencing platform offers an excellent opportunity to perform real-time analysis during sequencing. This feature allows for early insights into experimental data and accelerates a potential decision-making process for further analysis, which can be particularly relevant in the clinical context. Although some tools for the real-time analysis of DNA-sequencing data already exist, there is currently no application available for differential transcriptome data analysis designed for scientists or physicians with limited bioinformatics knowledge. Here we introduce NanopoReaTA, a user-friendly real-time analysis toolbox for RNA sequencing data from ONT. Sequencing results from a running or finished experiment are processed through an R Shiny-based graphical user interface (GUI) with an integrated Nextflow pipeline for whole transcriptome or gene-specific analyses. NanopoReaTA provides visual snapshots of a sequencing run in progress, thus enabling interactive sequencing and rapid decision-making that could also be applied to clinical cases.
Availability: GitHub: https://github.com/AnWiercze/NanopoReaTA; Zenodo: https://doi.org/10.5281/zenodo.8099825
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

PACT: A pipeline for analysis of circulating tumor DNA

Mon, 07/08/2023 - 5:30am
Abstract
Motivation: Detection of genomic alterations in circulating tumor DNA (ctDNA) is currently used for active clinical monitoring of cancer progression and treatment response. While methods for analysis of small mutations are more developed, strategies for detecting structural variants (SVs) in ctDNA are limited. Additionally, reproducibly calling small-scale mutations, copy number alterations, and SVs in ctDNA is challenging due to the lack of unified tools for these different classes of variants.
Results: We developed a unified pipeline for the analysis of ctDNA (PACT) that accurately detects SVs and consistently outperformed similar tools when applied to simulated, cell line, and clinical data. We provide PACT in the form of a Common Workflow Language pipeline which can be run by popular workflow management systems in high-performance computing environments.
Availability: PACT is freely available at https://github.com/ChrisMaherLab/PACT
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CoCoNat: a novel method based on deep-learning for coiled-coil prediction

Fri, 04/08/2023 - 5:30am
Abstract
Motivation: Coiled-coil domains (CCD) are widespread in all organisms and perform several crucial functions. Given their relevance, the computational detection of coiled-coil domains is very important for protein functional annotation. State-of-the-art prediction methods include the precise identification of coiled-coil domain boundaries, the annotation of the typical heptad repeat pattern along the coiled-coil helices as well as the prediction of the oligomerization state.
Results: In this paper we describe CoCoNat, a novel method for predicting coiled-coil helix boundaries, residue-level register annotation and oligomerization state. Our method encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement. A final neural network (NN) predicts the oligomerization state. When tested on a routinely adopted blind test set, CoCoNat obtains a performance superior to the current state of the art for both residue-level and segment-level coiled-coil detection. CoCoNat significantly outperforms the most recent state-of-the-art methods on register annotation and prediction of oligomerization states.
Availability: CoCoNat web server is available at https://coconat.biocomp.unibo.it. Standalone version is available on GitHub at https://github.com/BolognaBiocomp/coconat.
Categories: Bioinformatics Trends

ProtoCell4P: An Explainable Prototype-based Neural Network for Patient Classification Using Single-cell RNA-seq

Fri, 04/08/2023 - 5:30am
Abstract
Motivation: The rapid advance in single-cell RNA sequencing (scRNA-seq) technology over the past decade has provided a rich resource of gene expression profiles of single cells measured on patients, facilitating the study of many biological questions at the single-cell level. One intriguing research direction is to study the single cells that play critical roles in the phenotypes of patients, which has the potential to identify the cells and genes driving the disease phenotypes. To this end, deep learning models are expected to encode the single-cell information well and achieve precise prediction of patients’ phenotypes using scRNA-seq data. However, there are critical challenges in designing deep learning models for classifying patient samples because (1) the samples collected in the same dataset contain a variable number of cells (some samples might have only hundreds of cells sequenced while others could have thousands), and (2) the number of samples available is typically small and the expression profile of each cell is noisy and extremely high-dimensional. Moreover, the black-box nature of existing deep learning models makes it difficult for researchers to interpret the models and extract useful knowledge from them.
Results: We propose a prototype-based and cell-informed model for patient phenotype classification, termed ProtoCell4P, that can alleviate the problems of sample scarcity and the varying number of cells by leveraging cell knowledge through representatives of cells (called prototypes), and precisely classify the patients by adaptively incorporating information from different cells. Moreover, this classification process can be explicitly interpreted by identifying the key cells for decision making and by further summarizing the knowledge of cell types to unravel the biological nature of the classification. Our approach is explainable at single-cell resolution and can identify the key cells in each patient’s classification. The experimental results demonstrate that our proposed method can effectively deal with patient classifications using single-cell data and outperforms the existing approaches. Furthermore, our approach is able to uncover the association between cell types and biological classes of interest from a data-driven perspective.
Availability: https://github.com/Teddy-XiongGZ/ProtoCell4P
Categories: Bioinformatics Trends

SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets

Fri, 04/08/2023 - 5:30am
Abstract
Motivation: The emergence of high-throughput experiments and high-resolution computational predictions has led to an explosion in the quality and volume of protein sequence annotations at proteomic scales. Unfortunately, sanity checking, integrating, and analyzing complex sequence annotations remains logistically challenging and introduces a major barrier to entry for even superficial integrative bioinformatics.
Results: To address this technical burden, we have developed SHEPHARD, a Python framework that trivializes large-scale integrative protein bioinformatics. SHEPHARD combines an object-oriented hierarchical data structure with database-like features, enabling programmatic annotation, integration, and analysis of complex datatypes. Importantly, SHEPHARD is easy to use and enables a Pythonic interrogation of large-scale protein datasets with millions of unique annotations. We use SHEPHARD to examine three orthogonal proteome-wide questions relating protein sequence to molecular function, illustrating its ability to uncover novel biology.
Availability: We provide SHEPHARD as both a stand-alone software package (https://github.com/holehouse-lab/shephard) and a Google Colab notebook with a collection of precomputed proteome-wide annotations (https://github.com/holehouse-lab/shephard-colab).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Ionmob: A Python Package for Prediction of Peptide Collisional Cross-Section Values

Fri, 04/08/2023 - 5:30am
Abstract
Motivation: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments is useful to improve coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property dependent on the ion’s mass, charge and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, driven also by post-translational modifications (PTMs) of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of various machine-learning techniques; however, the workflow engineering part was of secondary importance. For the sake of applicability, such a tool should be generic, data-driven and easily adaptable to individual workflows for experimental design and data processing.
Results: We created ionmob, a Python-based framework for data preparation, training, and prediction of collisional cross-section values of peptides. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21,000 unique phosphorylated peptides and ≈17,000 MHC ligand sequences and charge state pairs, we expand upon the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS to increase confidence in identified peptides by applying methods of re-scoring and demonstrate that predicted CCS values complement existing predictors for that task.
Availability: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.
Categories: Bioinformatics Trends

Flame (v2.0): advanced integration and interpretation of functional enrichment results from multiple sources

Fri, 04/08/2023 - 5:30am
Abstract
Functional enrichment is the process of identifying implicated functional terms from a given input list of genes or proteins. In this article, we present Flame (v2.0), a web tool which offers a combinatorial approach through merging and visualizing results from widely used functional enrichment applications while also allowing various flexible input options. In this version, Flame utilizes the aGOtool, g:Profiler, WebGestalt and Enrichr pipelines and presents their outputs separately or in combination following a visual analytics approach. For intuitive representations and easier interpretation, it uses interactive plots such as parameterizable networks, heatmaps, bar charts and scatter plots. Users can also: (i) handle multiple protein/gene lists and analyze union and intersection sets simultaneously through interactive UpSet plots, (ii) automatically extract genes and proteins from free text through text-mining and Named Entity Recognition (NER) techniques, (iii) upload single nucleotide polymorphisms (SNPs) and extract their relative genes or (iv) analyze multiple lists of differentially expressed proteins/genes after selecting them interactively from a parameterizable volcano plot. Whereas the previous version supported 197 organisms, Flame (v2.0) currently allows enrichment for 14,436 organisms.
Availability: Web application: http://flame.pavlopouloslab.info; code: https://github.com/PavlopoulosLab/Flame; Docker: https://hub.docker.com/r/pavlopouloslab/flame
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors

Thu, 03/08/2023 - 5:30am
Abstract
Motivation: Next Generation Sequencing technologies make it possible to detect rare genetic variants in individual patients. Currently, more than a dozen software tools and web services have been created to predict the pathogenicity of variants that change amino acid residues. Despite considerable efforts in this area, at the moment there is no ideal method to classify pathogenic and harmless variants, and assessments of pathogenicity are often contradictory. In this article, we propose to describe amino acid residue substitutions using the structural formulas of protein peptides rather than a single-letter code. This allowed us to investigate the effectiveness of a chemoinformatics approach to assessing the pathogenicity of variants associated with amino acid substitutions.
Results: The structure-activity relationships analysis relying on protein-specific data and atom-centric substructural multilevel neighborhoods of atoms (MNA) descriptors of molecular fragments appeared to be suitable for predicting the pathogenic effect of single amino acid variants. An MNA-based Naïve Bayes classifier algorithm and ClinVar and humsavar data were used to create structure-activity relationship models for 10 proteins. The performance of the models was compared with 11 different prediction tools: eight individual (SIFT 4G, Polyphen2 HDIV, MutationAssessor, PROVEAN, FATHMM, MVP, LIST-S2, MutPred) and three consensus (M-CAP, MetaSVM, MetaLR). The accuracy of the MNA-based method varies across the proteins (AUC: 0.631-0.993; MCC: 0.191-0.891). It was similar in comparisons with both the other individual predictors and third-party protein-specific predictors. For several proteins (BRCA1, BRCA2, COL1A2, and RYR1), the performance of the MNA-based method was outstanding, capable of capturing the pathogenic effect of structural changes in amino acid substitutions.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

BERTrand—peptide: TCR binding prediction using Bidirectional Encoder Representations from Transformers augmented with random TCR pairing

Thu, 03/08/2023 - 5:30am
Abstract
Motivation: The advent of T cell receptor (TCR) sequencing experiments allowed for a significant increase in the amount of peptide:TCR binding data available, and a number of machine learning models have appeared in recent years. High-quality prediction models for a fixed epitope sequence are feasible, provided enough known binding TCR sequences are available. However, their performance drops significantly for previously unseen peptides.
Results: We prepare the dataset of known peptide:TCR binders and augment it with negative decoys created using healthy donors’ T-cell repertoires. We employ deep learning methods commonly applied in Natural Language Processing (NLP) to train a peptide:TCR binding model with a degree of cross-peptide generalization (0.69 AUROC). We demonstrate that BERTrand outperforms the published methods when evaluated on peptide sequences not used during model training.
Availability: The datasets and the code for model training are available at https://github.com/SFGLab/bertrand
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
