IsoFrog: a Reversible Jump Monte Carlo Markov Chain feature selection-based method for predicting isoform functions
Abstract
Motivation: A single gene may yield several isoforms with different functions through alternative splicing. Continuous efforts are devoted to developing machine learning methods to predict isoform functions. However, existing methods do not consider the relevance of each feature to specific functions and ignore the noise caused by irrelevant features. We therefore hypothesize that constructing a feature selection framework to extract the function-relevant features might help improve model accuracy in isoform function prediction.
Results: In this paper, we present a feature selection-based approach named IsoFrog to predict isoform functions. First, IsoFrog adopts a Reversible Jump Monte Carlo Markov Chain (RJMCMC)-based feature selection framework to assess the importance of each feature to gene functions. Second, a sequential feature selection (SFS) procedure is applied to select a subset of function-relevant features. This strategy screens the relevant features for the specific function while eliminating irrelevant ones, improving the effectiveness of the input features. The selected features are then input into our proposed method, modified domain-invariant partial least squares (MdiPLS), which prioritizes the most likely positive isoform for each positive MIG and utilizes diPLS for isoform function prediction. Tested on three datasets, our method achieves superior performance over six state-of-the-art methods, and the RJMCMC-based feature selection framework outperforms three classic feature selection methods. We expect this proposed methodology will promote the identification of isoform functions and further inspire the development of new methods.
Availability and implementation: IsoFrog is freely available at https://github.com/genemine/IsoFrog.
Supplementary information: Supplementary data are available at Bioinformatics online.
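The abstract does not describe the SFS procedure in detail, so the following Python sketch shows only a generic greedy forward selection, not IsoFrog's implementation. The feature names, the `gains` table and the additive `toy_score` function are hypothetical stand-ins for a real validation metric.

```python
def sequential_forward_selection(features, score):
    """Greedily add the feature that raises score() the most; stop when nothing helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every candidate extension of the current subset.
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best:
            break  # no remaining feature improves the score
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Hypothetical per-feature contributions: f3 adds noise, f4 adds nothing.
gains = {"f1": 0.30, "f2": 0.20, "f3": -0.05, "f4": 0.0}


def toy_score(subset):
    """Stand-in for a cross-validated accuracy measure (assumption)."""
    return sum(gains[f] for f in subset)


selected, best = sequential_forward_selection(gains, toy_score)
print(selected, best)
```

In this toy setting the procedure keeps the two informative features and stops before adding the uninformative or noisy ones, which is the behaviour the abstract attributes to screening out irrelevant features.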
Categories: Bioinformatics Trends
diseaseGPS: auxiliary diagnostic system for genetic disorders based on genotype and phenotype
Abstract
Summary: Next-generation sequencing has brought opportunities for the diagnosis of genetic disorders due to its high-throughput capabilities. However, the majority of existing methods were limited to only sequencing candidate variants, and the process of linking these variants to a diagnosis of genetic disorders still required medical professionals to consult databases. Therefore, we introduce diseaseGPS, an integrated platform for the diagnosis of genetic disorders that combines both phenotype and genotype data for analysis. It offers not only a user-friendly GUI web application for those without a programming background, but also scripts that can be executed in batch mode for bioinformatics professionals. The genetic and phenotypic data are integrated using the ACMG-Bayes method and a novel phenotypic similarity method to prioritize candidate genetic disorders. diseaseGPS was evaluated on 6085 cases from the Deciphering Developmental Disorders project and 187 cases from Shanghai Children’s Hospital. The results demonstrated that diseaseGPS performed better than other commonly used methods.
Availability: diseaseGPS is freely accessible at https://diseasegps.sjtu.edu.cn, with source code at https://github.com/BioHuangDY/diseaseGPS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Accessibility of covariance information creates vulnerability in Federated Learning frameworks
Abstract
Motivation: Federated Learning (FL) is gaining traction in various fields, such as healthcare, because it enables integrative data analysis without sharing sensitive data. However, the risk of data leakage caused by malicious attacks must be considered. In this study, we introduce a novel attack algorithm that relies on being able to compute sample means and sample covariances and to construct known linearly independent vectors on the data owner side.
Results: We show that these basic functionalities, which are available in several established FL frameworks, are sufficient to reconstruct privacy-protected data. Additionally, the attack algorithm is robust to defense strategies that involve adding random noise. We demonstrate the limitations of existing frameworks and propose potential defense strategies, analyzing the implications of using differential privacy. The novel insights presented in this study will aid in the improvement of FL frameworks.
Availability and Implementation: The code examples are provided on GitHub (https://github.com/manuhuth/Data-Leakage-From-Covariances.git). The CNSIM1 data set used in the manuscript is available within the DSData R package (https://github.com/datashield/DSData/tree/main/data).
Supplementary information: Mathematical proofs and further information are available online.
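The abstract does not spell out the attack, so the following Python sketch only illustrates why mean and covariance queries can leak individual values. It assumes (our assumption, not the paper's stated construction) that the analyst can place indicator vectors e_k alongside a private column x and request cov(x, e_k). Because cov(x, e_k) = (x_k - mean(x)) / (n - 1), every x_k follows from two aggregate statistics. All function names are hypothetical and belong to no FL framework.

```python
def sample_mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sample_mean(xs), sample_mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def reconstruct(mean_x, covs, n):
    """Invert cov(x, e_k) = (x_k - mean_x) / (n - 1) for every k."""
    return [mean_x + (n - 1) * c for c in covs]


# Data-owner side: the raw column is never returned directly (toy values).
private = [4.0, 7.5, 1.0, 9.5]
n = len(private)

# Analyst side: known, linearly independent indicator vectors (assumption).
indicators = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]

# Only aggregate statistics cross the boundary.
mean_x = sample_mean(private)
covs = [sample_cov(private, e_k) for e_k in indicators]

recovered = reconstruct(mean_x, covs, n)
print(recovered)
```

The recovered list matches the private column up to floating-point error, even though only a mean and n covariances were disclosed.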
Categories: Bioinformatics Trends
RNA 3D structure modeling by fragment assembly with small angle X-ray scattering restraints
Abstract
Structure determination is a key step in the functional characterization of many non-coding RNA molecules. High-resolution RNA 3D structure determination efforts, however, are not keeping up with the pace of discovery of new non-coding RNA sequences. This increases the importance of computational approaches and low-resolution experimental data, such as those from Small Angle X-ray Scattering (SAXS) experiments. We present RNA Masonry, a computer program and a web service for fully automated modeling of RNA 3D structures. It assembles RNA fragments into geometrically plausible models that meet user-provided secondary structure constraints, restraints on tertiary contacts and SAXS data. We illustrate the method description with detailed benchmarks and its application to structural studies of viral RNAs with SAXS restraints.
Availability: The program web server is available at http://iimcb.genesilico.pl/rnamasonry. The source code is available at https://gitlab.com/gchojnowski/rnamasonry.
Supplementary information: Detailed benchmarks of the method using simulated and experimental Small Angle X-ray Scattering data are available at Bioinformatics online.
Categories: Bioinformatics Trends
dRFEtools: Dynamic recursive feature elimination for omics
Abstract
Motivation: Advances in technology have generated larger omics datasets with potential applications for machine learning. In many datasets, however, cost and limited sample availability result in far more features than observations. Moreover, biological processes are associated with networks of core and peripheral genes, while traditional feature selection approaches capture only core genes.
Results: To overcome these limitations, we present dRFEtools, which implements dynamic recursive feature elimination (RFE), reducing computational time with high accuracy compared to standard RFE, expanding dynamic RFE to regression algorithms, and outputting the subsets of features that hold predictive power with and without peripheral features. dRFEtools integrates with scikit-learn (the popular Python machine learning platform) and thus provides new opportunities for dynamic RFE in large-scale omics data while enhancing its interpretability.
Availability: dRFEtools is freely available on PyPI at https://pypi.org/project/drfetools/ or on GitHub at https://github.com/LieberInstitute/dRFEtools, implemented in Python 3, and supported on Linux, Windows, and Mac OS.
Supplementary information: Supplementary data are available at Bioinformatics online and https://github.com/LieberInstitute/dRFEtools_manuscript.
Categories: Bioinformatics Trends
Joint embedding of biological networks for cross-species functional alignment
Abstract
Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions (PPIs) to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.
Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA’s embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.
Availability: https://github.com/ylaboratory/ETNA
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
VAPEX: an interactive web server for the deep exploration of natural virus and phage genomes
Abstract
Motivation: Studying the genetic makeup of viruses and phages through genome analysis is crucial for comprehending their function in causing diseases, advancing medicine, tracing their evolutionary history, monitoring the environment, and creating innovative biotechnologies. However, accessing the necessary data can be challenging due to a lack of dedicated comparative genomic tools and because viral and phage databases are often outdated. Moreover, many wet bench experimentalists may not have the computational proficiency required to manipulate large amounts of genomic data.
Results: We have developed VAPEX (Virus And Phage EXplorer), a web server which is supported by a database and features a user-friendly web interface. This tool enables users to easily perform various genomic analysis queries on all natural viruses and phages that have been fully sequenced and are listed in the NCBI compendium. One of VAPEX's key strengths is producing visual depictions of fully resolved synteny maps. VAPEX can display a vast array of orthologous gene classes simultaneously through the use of symbolic representation. Additionally, VAPEX can fully analyze user-submitted viral and phage genomes, including those that have not yet been annotated.
Availability and implementation: VAPEX can be accessed from all current web browsers such as Chrome, Firefox, Edge, Safari and Opera. VAPEX is freely accessible at https://archaea.i2bc.paris-saclay.fr/vapex/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Somatic mutation effects diffused over microRNA dysregulation
Abstract
Motivation: As an important player in transcriptome regulation, microRNAs may effectively diffuse somatic mutation impacts to broad cellular processes and ultimately manifest disease and dictate prognosis. Previous studies that tried to correlate mutation with gene expression dysregulation neglected to adjust for the disparate multitudes of false positives associated with unequal sample sizes and uneven class balancing scenarios.
Results: To properly address this issue, we developed a statistical framework to rigorously assess the extent of mutation impact on microRNAs in relation to a permutation-based null distribution of a matching sample structure. Carrying out the framework in a pan-cancer study, we ascertained 9008 protein-coding genes with statistically significant mutation impacts on miRNAs. Of these, the collective miRNA expression for 83 genes showed significant prognostic power in nine cancer types. For example, in lower-grade glioma, 10 genes’ mutations broadly impacted miRNAs, all of which showed prognostic value with the corresponding miRNA expression. Our framework was further validated with functional analysis and augmented with rich features including the ability to analyze miRNA isoforms; aggregative prognostic analysis; advanced annotations such as mutation type, regulator alteration, somatic motif, and disease association; and instructive visualization such as mutation OncoPrint, Ideogram, and interactive mRNA-miRNA network.
Availability: http://innovebioinfo.com/Database/TmiEx/MutMix.php
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Cell-connectivity-guided trajectory inference from single-cell data
Abstract
Motivation: Single-cell RNA-sequencing enables cell-level investigation of cell differentiation, which can be modelled using trajectory inference methods. While tremendous effort has been put into designing these methods, inferring accurate trajectories automatically remains difficult. Therefore, the standard approach involves testing different trajectory inference methods and picking the trajectory giving the most biologically sensible model. As the default parameters are often suboptimal, their tuning requires methodological expertise.
Results: We introduce Totem, an open-source, easy-to-use R package designed to facilitate inference of tree-shaped trajectories from single-cell data. Totem generates a large number of clustering results, estimates their topologies as minimum spanning trees, and uses them to measure the connectivity of the cells. Besides enabling automatic selection of an appropriate trajectory, cell connectivity makes it possible to visually pinpoint branching points and milestones relevant to the trajectory. Furthermore, testing different trajectories with Totem is fast, easy, and does not require in-depth methodological knowledge.
Availability: Totem is available as an R package at https://github.com/elolab/Totem.
Supplementary information: Supplementary data are available at Bioinformatics online.
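Totem itself is an R package and the abstract does not detail how connectivity is computed, so the Python sketch below covers only the minimum-spanning-tree step for a single clustering. It assumes each cluster is summarized by a centroid and that Euclidean distance is the edge weight; the centroid coordinates are made up, and `minimum_spanning_tree` is a hypothetical helper, not a Totem function.

```python
from collections import Counter
from math import dist


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns edges as (i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the shortest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical cluster centroids in a 2D embedding.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
edges = minimum_spanning_tree(centroids)

# Clusters touching three or more tree edges are candidate branching points.
degree = Counter(node for edge in edges for node in edge)
branch_points = [node for node, d in degree.items() if d >= 3]
print(sorted(edges), branch_points)
```

Repeating this over many clusterings and aggregating the resulting trees is the part the abstract attributes to Totem; that aggregation is not reproduced here.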
Categories: Bioinformatics Trends
DEP2: an upgraded comprehensive analysis toolkit for quantitative proteomics data
Abstract
Summary: Mass spectrometry (MS)-based proteomics has become the most powerful approach to study the proteome of given biological and clinical samples. Advancements in sample preparation and MS detection have extended the application of proteomics, but have also brought new demands on data analysis. An appropriate proteomics data analysis workflow mainly requires quality control, hypothesis testing, functional mining, and visualization. Although there are numerous tools for each process, an efficient and universal tandem analysis toolkit to obtain a quick overall view of various proteomics data is still urgently needed. Here, we present DEP2, an updated version of DEP, which we previously established, for proteomics data analysis. We amended the analysis workflow by incorporating alternative approaches to accommodate diverse proteomics data, introducing peptide-protein summarization and coupling biological function exploration. In summary, DEP2 is a well-rounded toolkit designed for protein- and peptide-level quantitative proteomics data. It features a more flexible differential analysis workflow and includes a user-friendly Shiny application to facilitate data analysis.
Availability and implementation: DEP2 is available at https://github.com/mildpiggy/DEP2, released under the MIT license. For further information and usage details, please refer to the package website at https://mildpiggy.github.io/DEP2/.
Categories: Bioinformatics Trends
libSBOLj3: A graph-based library for design and data exchange in synthetic biology
Abstract
Summary: The Synthetic Biology Open Language version 3 data standard provides a graph-based approach to exchange information about biological designs. The new data model has major updates and offers several features for software tools. Here, we present libSBOLj3 to facilitate data exchange and provide interoperability between computer-aided design and automation tools using this standard. The library adopts a graph-based approach. Tool developers can extend these graphs with application-specific information and use detailed validation reports to identify errors and interoperability issues and apply best practice rules.
Availability and Implementation: The libSBOLj3 library is implemented in Java and can be downloaded or used as a Maven dependency. The open-source project, code examples and documentation about accessing and using the library are available via GitHub at https://github.com/SynBioDex/libSBOLj3.
Categories: Bioinformatics Trends
Gonomics: Uniting high performance and readability for genomics with Go
Abstract
Summary: Many existing software libraries for genomics require researchers to pick between competing considerations: the performance of compiled languages and the accessibility of interpreted languages. Go, a modern compiled language, provides an opportunity to address this conflict. We introduce Gonomics, an open-source collection of command line programs and bioinformatic libraries implemented in Go that unites readability and performance for genomic analyses. Gonomics contains packages to read, write, and manipulate a wide array of file formats (e.g. FASTA, FASTQ, BED, BEDPE, SAM, BAM, and VCF), and can convert and interface between these formats. Furthermore, our modular library structure provides a flexible platform for researchers developing their own software tools to address specific questions. These commands can be combined and incorporated into complex pipelines to meet the growing need for high-performance bioinformatic resources.
Availability and implementation: Gonomics is implemented in the Go programming language. Source code, installation instructions, and documentation are freely available at https://github.com/vertgenlab/gonomics.
Categories: Bioinformatics Trends
MULGA, a unified multi-view graph autoencoder-based approach for identifying drug-protein interaction and drug repositioning
Abstract
Motivation: Identifying drug-protein interactions (DPIs) is a critical step in drug repositioning, which allows reuse of approved drugs that may be effective for treating a different disease and thereby alleviates the challenges of new drug development. Although a great variety of computational approaches for DPI prediction have been proposed, key challenges, such as extendable and unbiased similarity calculation, heterogeneous information utilization and reliable negative sample selection, remain to be addressed.
Results: To address these issues, we propose a novel, unified multi-view graph autoencoder framework, termed MULGA, for both DPI and drug repositioning predictions. MULGA features: (i) a multi-view learning technique to effectively learn authentic drug affinity and target affinity matrices; (ii) a graph autoencoder to infer missing DPIs; and (iii) a new “guilty-by-association”-based negative sampling approach for selecting highly reliable non-DPIs. Benchmark experiments demonstrate that MULGA outperforms state-of-the-art methods in DPI prediction, and the ablation studies verify the effectiveness of each proposed component. Importantly, we highlight the top drugs shortlisted by MULGA that target the spike glycoprotein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), offering additional insights into and potentially useful treatment options for COVID-19. Together with the availability of datasets and source code, we envision that MULGA can be explored as a useful tool for DPI prediction and drug repositioning.
Availability and implementation: MULGA is publicly available for academic purposes at https://github.com/jianiM/MULGA/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
CellAnn: A comprehensive, super-fast, and user-friendly single-cell annotation web server
Abstract
Motivation: Single-cell sequencing technology has become routine in studying many biological problems. A core step of analyzing single-cell data is the assignment of cell clusters to specific cell types. Reference-based methods have been proposed for predicting cell types for single-cell clusters. However, limited scalability and a lack of preprocessed reference datasets prevent them from being practical and easy to use.
Results: Here we introduce a reference-based cell annotation web server, CellAnn, which is super-fast and easy to use. CellAnn contains a comprehensive reference database with 204 human and 191 mouse single-cell datasets. These reference datasets cover 32 organs. Furthermore, we developed a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which is superior to the existing methods with higher accuracy and higher scalability. Finally, CellAnn is an online tool that integrates all the procedures in cell annotation, including reference searching, transferring cell labels, visualizing results, and harmonizing cell annotation labels. Through the user-friendly interface, users can identify the best annotation by cross-validating with multiple reference datasets. We believe that CellAnn can greatly facilitate single-cell sequencing data analysis.
Availability and implementation: The web server is available at www.cellann.io, and the source code is available at https://github.com/Pinlyu3/CellAnn_shinyapp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Pretrained Transformer Models for Predicting the Withdrawal of Drugs from the Market
Abstract
Motivation: The process of drug discovery is notoriously complex, costing an average of 2.6 billion dollars and taking approximately 13 years to bring a new drug to the market. The success rate for new drugs is alarmingly low (around 0.0001%), and severe adverse drug reactions (ADRs) frequently occur, some of which may even result in death. Early identification of potential ADRs is critical to improve the efficiency and safety of the drug development process.
Results: In this study, we employed pretrained large language models (LLMs) to predict the likelihood of a drug being withdrawn from the market due to safety concerns. Our method achieved an area under the curve (AUC) of over 0.75 through cross-database validation, outperforming classical machine-learning models and graph-based models. Notably, our pretrained LLMs successfully identified over 50% of the drugs that were subsequently withdrawn, when predictions were made on a subset of drugs with inconsistent labeling between the training and test sets.
Availability: The code and datasets are available at https://github.com/eyalmazuz/DrugWithdrawn.
Supplementary information: Supplementary data associated with this research are available at Bioinformatics online.
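For readers unfamiliar with the reported metric, the Python sketch below computes the area under the ROC curve through its pairwise interpretation: the probability that a randomly chosen positive (withdrawn) drug receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented for illustration, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = withdrawn, 0 = still on the market.
labels = [1, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.35, 0.20, 0.80]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives the reported value of over 0.75 its scale.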
Categories: Bioinformatics Trends
Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression
Abstract
Motivation: Gene set enrichment methods are a common tool to improve the interpretability of gene lists as obtained, for example, from differential gene expression analyses. They are based on computing whether dysregulated genes are located in certain biological pathways more often than expected by chance. Gene set enrichment tools rely on pre-existing pathway databases such as KEGG, Reactome, or the Gene Ontology. These databases are increasing in size and in the number of redundancies between pathways, which complicates the statistical enrichment computation.
Results: We address this problem and develop a novel gene set enrichment method, called pareg, which is based on a regularized generalized linear model and directly incorporates dependencies between gene sets related to certain biological functions, for example, due to shared genes, in the enrichment computation. We show that pareg is more robust to noise than competing methods. Additionally, we demonstrate the ability of our method to recover known pathways as well as to suggest novel treatment targets in an exploratory analysis using breast cancer samples from TCGA.
Availability and Implementation: pareg is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/pareg.html) as well as on https://github.com/cbg-ethz/pareg. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here.
Supplementary information: Supplementary data are available at Bioinformatics online.
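The "more often than expected by chance" computation mentioned in the motivation is commonly done with a one-sided hypergeometric test, sketched below in Python. This is the classic per-pathway baseline that treats each pathway independently; it is not pareg's regularized regression model. The function name and the gene counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_universe, n_pathway, n_hits, n_overlap):
    """P(X >= n_overlap) when n_hits genes are drawn without replacement
    from n_universe genes, of which n_pathway belong to the pathway."""
    total = comb(n_universe, n_hits)
    upper = min(n_pathway, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes measured, 5 in the pathway,
# 4 dysregulated genes, 3 of which fall in the pathway.
p = enrichment_pvalue(n_universe=20, n_pathway=5, n_hits=4, n_overlap=3)
print(p)
```

Because overlapping pathways share genes, running this test once per pathway yields correlated p-values, which is the redundancy problem the abstract says pareg models directly.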
Categories: Bioinformatics Trends
metGWAS 1.0: An R workflow for network-driven over-representation analysis between independent metabolomic and meta-genome-wide association studies
Abstract
Motivation: Combining GWAS and metabolomics provides a quantitative approach to pinpoint metabolic pathways and genes linked to specific diseases; however, such analyses require both genomics and metabolomics datasets from the same individuals/samples. In most cases, this approach is not feasible due to high costs, lack of technical infrastructure, unavailability of samples, and other factors. Therefore, an unmet need exists for a bioinformatics tool that can identify gene loci-associated polymorphic variants for metabolite alterations seen in disease states using standalone metabolomics.
Results: Here, we developed a bioinformatics tool, metGWAS 1.0, that integrates independent GWAS data from the GWAS database and standalone metabolomics data using a network-based systems biology approach to identify novel disease/trait-specific metabolite-gene associations. The tool was evaluated using standalone metabolomics datasets extracted from two metabolomics-GWAS case studies. It discovered both the observed and novel gene loci with known single nucleotide polymorphisms when compared to the original studies.
Availability and implementation: The developed metGWAS 1.0 framework is implemented in an R pipeline and available at: https://github.com/saifurbd28/metGWAS-1.0.
Categories: Bioinformatics Trends
crosshap: R package for local haplotype visualization for trait association analysis
Abstract
Summary: GWAS excels at harnessing dense genomic variant datasets to identify candidate regions responsible for producing a given phenotype. However, GWAS and traditional fine-mapping methods do not provide insight into the complex local landscape of linkage that contains and has been shaped by the causal variant(s). Here, we present ‘crosshap’, an R package that performs robust density-based clustering of variants based on their linkage profiles to capture haplotype structures in a local genomic region of interest. Following this, ‘crosshap’ is equipped with visualization tools for choosing optimal clustering parameters (ɛ) before producing an intuitive figure that provides an overview of the complex relationships between linked variants, haplotype combinations, phenotype and metadata traits.
Availability: The ‘crosshap’ package is freely available under the MIT license and can be downloaded directly from CRAN with R > 4.0.0. The development version is available on GitHub alongside issue support (https://github.com/jacobimarsh/crosshap). Tutorial vignettes and documentation are available (https://jacobimarsh.github.io/crosshap/).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
MSDRP: a deep learning model based on multi-source data for predicting drug response
Abstract
Motivation: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g., drug structure), without considering the relationships between drugs and biological entities (e.g., targets, diseases and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines.
Results: In this paper, we propose a deep learning framework, named MSDRP, for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines, and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion (SNF) algorithms, outperforming some state-of-the-art models in all performance measures for all experiments. The experimental results of the de novo test and the independent test demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationale for using feature vectors derived from drug similarity matrices from multi-source data to represent drugs and the interpretability of our model.
Availability: The code of MSDRP is available at https://github.com/xyzhang-10/MSDRP.
Categories: Bioinformatics Trends
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
Abstract
Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
Availability: MashMap3 is available at https://github.com/marbl/MashMap
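To make the quantities in the motivation concrete, the Python sketch below computes the exact Jaccard similarity on k-mer sets and a classic window-minimizer sample of the same sequences. It uses lexicographic order in place of a hash function (a simplification on our part) and does not implement the minmer scheme or any MashMap code; the sequences and the parameters k and w are arbitrary.

```python
def kmers(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """|A intersect B| / |A union B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minimizers(seq, k, w):
    """Smallest k-mer (lexicographic order) in every window of w consecutive k-mers."""
    mers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(mers[i:i + w]) for i in range(len(mers) - w + 1)}


a, b = "ACGTACGT", "ACGTTCGT"
exact = jaccard(kmers(a, 3), kmers(b, 3))
sampled = jaccard(minimizers(a, 3, 3), minimizers(b, 3, 3))
print(exact, sampled)
```

On this toy pair the minimizer-sampled sets give a different value from the exact k-mer Jaccard. A single example does not demonstrate statistical bias, but it shows the kind of gap that an unbiased sampling scheme is meant to remove in expectation.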
Categories: Bioinformatics Trends