IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on single cell sequencing data
Abstract
Motivation: Single-cell sequencing (SCS) enables exploring the pathways and processes of cells and cell populations. However, there is a paucity of pathway enrichment methods designed to tolerate the high noise and low gene coverage of this technology. When gene expression data are noisy and signals are sparse, testing pathway enrichment based on gene expression may not yield statistically significant results, which is particularly problematic when detecting the pathways enriched in less abundant cells that are vulnerable to disturbances.
Results: In this project, we developed Weighted Concept Signature Enrichment Analysis (WCSEA), a method specialized for pathway enrichment analysis from single-cell transcriptomics (scRNA-seq). WCSEA takes a broader approach to assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes, which we term the universal concept signature, to tolerate the high noise and low coverage of this technology. We then incorporated WCSEA into an R package called “IndepthPathway” so that biologists can broadly apply this method to pathway analysis based on bulk and single-cell sequencing data. Through simulating technical variability and dropouts in gene expression characteristic of scRNA-seq, as well as benchmarking on a real dataset of matched single-cell and bulk RNA-seq data, we demonstrate that IndepthPathway presents outstanding stability and depth in pathway enrichment results under stochasticity of the data, and thus will substantially improve the scientific rigor of pathway analysis for single-cell sequencing data.
Availability: The IndepthPathway R package is available through: https://github.com/wangxlab/IndepthPathway.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends
Deep single-cell RNA-seq data clustering with graph prototypical contrastive learning
Abstract
Motivation: Single-cell RNA sequencing enables researchers to study cellular heterogeneity at the single-cell level. To this end, identifying cell types with clustering techniques becomes an important task for downstream analysis. However, challenges of scRNA-seq data, such as pervasive dropout phenomena, hinder obtaining robust clustering outputs. Although existing studies try to alleviate these problems, they fall short of fully leveraging the relationship information and mainly rely on reconstruction-based losses that depend heavily on the data quality, which is sometimes noisy.
Results: This work proposes a graph-based prototypical contrastive learning method, named scGPCL. Specifically, scGPCL encodes the cell representations using Graph Neural Networks on a cell-gene graph that captures the relational information inherent in scRNA-seq data, and introduces prototypical contrastive learning to learn cell representations by pushing apart semantically dissimilar pairs and pulling together similar ones. Through extensive experiments on both simulated and real scRNA-seq data, we demonstrate the effectiveness and efficiency of scGPCL.
Availability and implementation: Code is available at https://github.com/Junseok0207/scGPCL
Supplementary information: Supplementary data is attached.
Categories: Bioinformatics Trends
Online bias-aware disease module mining with ROBUST-Web
Abstract
Summary: We present ROBUST-Web, which implements our recently presented ROBUST disease module mining algorithm in a user-friendly web application. ROBUST-Web features seamless downstream disease module exploration via integrated gene set enrichment analysis, tissue expression annotation, and visualization of drug-protein and disease-gene links. Moreover, ROBUST-Web includes bias-aware edge costs for the underlying Steiner tree model as a new algorithmic feature, which allow correcting for study bias in protein-protein interaction networks and further improve the robustness of the computed modules.
Availability and implementation: Web application: https://robust-web.net. Source code of web application and Python package with new bias-aware edge costs: https://github.com/bionetslab/robust-web, https://github.com/bionetslab/robust_bias_aware.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Pygenomics: manipulating genomic intervals and data files in Python
Abstract
Summary: We present pygenomics, a Python package for working with genomic intervals and bioinformatic data files. The package implements interval operations, provides both an API and a CLI, and supports reading and writing data in widely used bioinformatic formats, including BAM, BED, GFF3 and VCF. The source code of pygenomics is provided with in-source documentation and type annotations and adheres to the functional programming paradigm. These features facilitate seamless integration of pygenomics routines into scripts and pipelines. The package is implemented in pure Python using only its standard library and includes a property-based testing framework. A comparison of pygenomics with other Python bioinformatic packages with respect to features and performance is presented. The performance comparison covers operations with genomic intervals, read alignments, and genomic variants and demonstrates that pygenomics is suitable for computationally efficient analysis.
Availability and Implementation: The source code is available at https://gitlab.com/gtamazian/pygenomics.
Supplementary information: Supplementary data are available at Bioinformatics online.
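To make "interval operations" concrete, the following is a minimal standard-library sketch of two common ones, merging and intersecting half-open intervals on a single sequence. It does not use or reproduce the pygenomics API; the names `Interval`, `merge_intervals` and `intersect` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Half-open interval [start, end) on one sequence, as in the BED convention.
# This type alias is an illustrative assumption, not a pygenomics type.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals into sorted, disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlap of two intervals, or None if they share no position."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


if __name__ == "__main__":
    print(merge_intervals([(5, 10), (1, 3), (2, 6), (12, 15)]))
    print(intersect((1, 10), (5, 20)))
```

Because the intervals are half-open, two intervals that merely touch (for example `(1, 5)` and `(5, 9)`) are merged by `merge_intervals` but have no intersection.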
Categories: Bioinformatics Trends
Mutate and Observe: Utilizing Deep Neural Networks to Investigate the Impact of Mutations on Translation Initiation
Abstract
Motivation: The primary regulatory step for protein synthesis is translation initiation, which makes it one of the fundamental steps in the central dogma of molecular biology. In recent years, a number of approaches relying on deep neural networks (DNNs) have demonstrated superb results for predicting translation initiation sites. These state-of-the-art results indicate that DNNs are indeed capable of learning complex features that are relevant to the process of translation. Unfortunately, most research efforts that employ DNNs provide only shallow insights into the decision-making processes of the trained models and lack highly sought-after novel, biologically relevant observations.
Results: By improving upon the state-of-the-art DNNs and large-scale human genomic datasets in the area of translation initiation, we propose an innovative computational methodology to get neural networks to explain what was learned from data. Our methodology, which relies on in silico point mutations, reveals that DNNs trained for translation initiation site detection correctly identify well-established biological signals relevant to translation, including (1) the importance of the Kozak sequence, (2) the damaging consequences of ATG mutations in the 5’ untranslated region (UTR), (3) the detrimental effect of premature stop codons in the coding region, and (4) the relative insignificance of cytosine mutations for translation. Furthermore, we delve deeper into the beta-globin gene and investigate various mutations that lead to the beta-thalassemia disorder. Finally, we conclude our work by laying out a number of novel observations regarding mutations and translation initiation.
Availability: For data, models, and code, visit github.com/utkuozbulak/mutate-and-observe
Supplementary information: Supplementary data are available at Bioinformatics online.
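The general idea of in silico point mutation can be sketched as follows: enumerate every single-base substitution of a sequence, score each mutant with a model, and record the change relative to the unmutated sequence. This is a minimal illustration under stated assumptions, not the authors' pipeline; `toy_score` is a hypothetical stand-in for a trained DNN, and all function names are invented for this example.

```python
from typing import Callable, Dict, Iterator, Tuple

NUCLEOTIDES = "ACGT"


def point_mutants(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, new_base, mutated_sequence) for every single-base substitution."""
    for pos, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, alt, seq[:pos] + alt + seq[pos + 1:]


def mutation_effects(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Map each (position, new_base) to the score change: mutant minus reference."""
    baseline = score(seq)
    return {(pos, alt): score(mut) - baseline for pos, alt, mut in point_mutants(seq)}


def toy_score(seq: str) -> float:
    """Hypothetical model stand-in: 1.0 if the sequence contains ATG, else 0.0."""
    return 1.0 if "ATG" in seq else 0.0


if __name__ == "__main__":
    effects = mutation_effects("CCATGG", toy_score)
    # Positions whose mutation lowers the score are the ones the "model" relies on.
    sensitive = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
    print(sensitive)
```

With a real model in place of `toy_score`, the same loop would yield a per-position sensitivity map; positions where mutations change the score strongly are the ones the model treats as informative.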
Categories: Bioinformatics Trends
GraphscoreDTA: optimized graph neural network for protein-ligand binding affinity prediction
Abstract
Motivation: Computational approaches for identifying protein-ligand binding affinity can greatly facilitate drug discovery and development. At present, many deep learning-based models have been proposed to predict protein-ligand binding affinity and have achieved significant performance improvements. However, protein-ligand binding affinity prediction still faces fundamental challenges. One challenge is that the mutual information between proteins and ligands is hard to capture. Another challenge is how to find and highlight the important atoms of the ligands and residues of the proteins.
Results: To address these challenges, we develop a novel graph neural network strategy with Vina distance optimization terms (GraphscoreDTA) for predicting protein-ligand binding affinity, which for the first time combines a graph neural network, a Bi-transport information mechanism and physics-based distance terms. Unlike other methods, GraphscoreDTA can not only effectively capture the protein-ligand pairs’ mutual information but also highlight the important atoms of the ligands and residues of the proteins. The results show that GraphscoreDTA significantly outperforms existing methods on multiple test sets. Furthermore, the tests of drug-target selectivity on the cyclin-dependent kinase and the homologous protein families demonstrate that GraphscoreDTA is a reliable tool for protein-ligand binding affinity prediction.
Availability: The source code is available at https://github.com/CSUBioGroup/GraphscoreDTA.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
KNeMAP: A Network Mapping Approach for Knowledge-driven Comparison of Transcriptomic Profiles
Abstract
Motivation: Transcriptomic data can be used to describe the mechanism of action (MOA) of a chemical compound. However, omics data tend to be complex and prone to noise, making the comparison of different datasets challenging. Often, transcriptomic profiles are compared at the level of individual gene expression values or sets of differentially expressed genes. Such approaches can suffer from underlying technical and biological variance, such as the biological system exposed, the machine or method used to measure gene expression, and technical errors, and they further neglect the relationships between the genes. We propose a network mapping approach for knowledge-driven comparison of transcriptomic profiles (KNeMAP), which combines genes into similarity groups based on multiple levels of prior information, hence adding a higher-level view to the individual gene view. Compared with fold change (expression) based and deregulated gene set based methods, KNeMAP grouped compounds with higher accuracy with respect to prior information and was less prone to noise-corrupted data.
Results: We applied KNeMAP to analyze the Connectivity Map dataset, where the gene expression changes of three cell lines were analyzed after treatment with 676 drugs, as well as the Fortino et al. dataset, where two cell lines with 31 nanomaterials were analyzed. Although the expression profiles across the biological systems are highly different, KNeMAP was able to identify sets of compounds that induce similar molecular responses when the same biological system is exposed to them.
Availability: Relevant data and the KNeMAP function are available at: https://github.com/fhaive/KNeMAP and 10.5281/zenodo.7334711.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
itol.toolkit accelerates working with iTOL (Interactive Tree Of Life) by an automated generation of annotation files
Abstract
Summary: iTOL is a powerful and comprehensive phylogenetic tree visualization engine. However, adjusting to new templates can be time-consuming, especially when many templates are available. We developed an R package, itol.toolkit, to help users generate all 23 types of annotation files in iTOL. This R package also provides an all-in-one data structure to store data and themes, accelerating the step from metadata to annotation files of iTOL visualizations through automatic workflows.
Availability: The manual and source code are available at https://github.com/TongZhou2017/itol.toolkit
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
scME: A Dual-Modality Factor Model for Single-Cell Multi-Omics Embedding
Abstract
Motivation: Single-cell multi-omics technologies are emerging to characterize different molecular features of cells. This gives rise to an issue of combining various kinds of molecular features to dissect cell heterogeneity. Most single-cell multi-omics integration methods focus on shared information among modalities while complementary information specific to each modality is often discarded.
Results: To disentangle and combine shared and complementary information across modalities, we develop a dual-modality factor model named scME by using deep factor modeling. Our results demonstrate that scME can generate a better joint representation of multiple modalities than those generated by other single-cell multi-omics integration algorithms, which gives a clear elucidation of nuanced differences among cells. We also demonstrate that the joint representation of multiple modalities yielded by scME can provide salient information to improve both single-cell clustering and cell-type classification. Overall, scME will be an efficient method for combining various kinds of molecular features to facilitate the dissection of cell heterogeneity.
Availability and implementation: The code is public for academic use and available on the GitHub site (https://github.com/bucky527/scME).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
A Co-adaptive Duality-aware Framework for Biomedical Relation Extraction
Abstract
Motivation: Biomedical relation extraction is a vital task for electronic health record mining and biomedical knowledge base construction. Previous work often adopts pipeline methods or joint methods to extract subject, relation, and object while ignoring the interaction between the subject-object entity pair and the relation within the triplet structure. However, we observe that the entity pair and relation within a triplet are highly related, which motivates us to build a framework to extract triplets that can capture the rich interactions among the elements in a triplet.
Results: We propose a novel co-adaptive biomedical relation extraction framework based on a duality-aware mechanism. This framework is designed as a bidirectional extraction structure that fully takes interdependence into account in the duality-aware extraction process of the subject-object entity pair and relation. Based on the framework, we design a co-adaptive training strategy and a co-adaptive tuning algorithm as collaborative optimization methods between modules to improve the performance of the mining framework. The experiments on two public datasets show that our method achieves the best F1 among all state-of-the-art baselines and provides strong performance gains on complex scenarios of various overlapping patterns, multiple triplets, and cross-sentence triplets.
Availability and implementation: Code is available at https://github.com/11101028/CADA-BioRE.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics
Abstract
Motivation: Developing new crop varieties with superior performance is highly important to ensure robust and sustainable global food security. The speed of variety development is limited by long field cycles and advanced generation selections in plant breeding programs. While methods to predict yield from genotype or phenotype data have been proposed, improved performance and integrated models are needed.
Results: We propose a machine learning model that leverages both genotype and phenotype measurements by fusing genetic variants with multiple data sources collected by unmanned aerial systems. We use a deep multiple instance learning framework with an attention mechanism that sheds light on the importance given to each input during prediction, enhancing interpretability. Our model reaches 0.754 ± 0.024 Pearson correlation coefficient when predicting yield in similar environmental conditions, a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). We further predict yield on new lines in an unseen environment using only genotypes, obtaining a prediction accuracy of 0.386 ± 0.010, a 13.5% improvement over the linear baseline. Our multi-modal deep learning architecture efficiently accounts for plant health and environment, distilling the genetic contribution and providing excellent predictions. Yield prediction algorithms leveraging phenotypic observations during training therefore promise to improve breeding programs, ultimately speeding up delivery of improved varieties.
Availability and Implementation: Available at https://github.com/BorgwardtLab/PheGeMIL (code) and https://doi.org/doi:10.5061/dryad.kprr4xh5p (data).
Categories: Bioinformatics Trends
gExcite — A start-to-end framework for single-cell gene expression, hashing, and antibody analysis
Abstract
Summary: Recently, CITE-seq emerged as a multimodal single-cell technology capturing gene expression and surface protein information from the same single cells, which allows unprecedented insights into disease mechanisms and heterogeneity, as well as immune cell profiling. Multiple single-cell profiling methods exist, but they are typically focussed on either gene expression or antibody analysis, not their combination. Moreover, existing software suites are not easily scalable to a multitude of samples. To this end, we designed gExcite, a start-to-end workflow that provides both gene and antibody expression analysis, as well as hashing deconvolution. Embedded in the Snakemake workflow manager, gExcite facilitates reproducible and scalable analyses. We showcase the output of gExcite on a study of different dissociation protocols on PBMC samples.
Availability: gExcite is open source and available on GitHub at https://github.com/ETH-NEXUS/gExcite_pipeline. The software is distributed under the GNU General Public License 3 (GPL3).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
HONMF: integration analysis of multi-omics microbiome data via matrix factorization and hypergraph
Abstract
Motivation: The accumulation of multi-omics microbiome data provides an unprecedented opportunity to understand the diversity of bacterial, fungal and viral components from different conditions. Changes in the composition of viral, bacterial and fungal communities have been associated with environments and critical illness. However, identifying and dissecting the heterogeneity of microbial samples and cross-kingdom interactions remains challenging.
Results: We propose HONMF for the integrative analysis of multi-modal microbiome data, including bacterial, fungal and viral composition profiles. HONMF enables identification of microbial samples and data visualization, and also facilitates downstream analysis, including feature selection and cross-kingdom association analysis between species. HONMF is an unsupervised method based on hypergraph-induced orthogonal nonnegative matrix factorization. It assumes that latent variables are specific to each composition profile and integrates the distinct sets of latent variables through a graph fusion strategy, which better handles the distinct characteristics of the bacterial, fungal and viral microbiome. We applied HONMF to several multi-omics microbiome datasets from different environments and tissues. The experimental results demonstrate the superior performance of HONMF in data visualization and clustering. HONMF also provides rich biological insights through discriminative microbial feature selection and bacterium-fungus-virus association analysis, which improves our understanding of ecological interactions and microbial pathogenesis.
Availability: The software and datasets are available at https://github.com/chonghua-1983/HONMF.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
MultiNEP: a Multi-omics Network Enhancement framework for Prioritizing disease genes and metabolites simultaneously
Abstract
Motivation: Many studies have successfully used network information to prioritize candidate omics profiles associated with diseases. The metabolome, as the link between genotypes and phenotypes, has attracted growing attention. Using a “multi-omics” network constructed from a gene-gene network, a metabolite-metabolite network, and a gene-metabolite network to simultaneously prioritize candidate disease-associated metabolites and gene expressions could further utilize gene-metabolite interactions that are not used when prioritizing them separately. However, the number of metabolites is usually 100 times fewer than that of genes. Without accounting for this imbalance, we cannot effectively use gene-metabolite interactions when simultaneously prioritizing disease-associated metabolites and genes.
Results: Here we developed a Multi-omics Network Enhancement Prioritization (MultiNEP) framework with a weighting scheme that reweights the contributions of different sub-networks in a multi-omics network to effectively prioritize candidate disease-associated metabolites and genes simultaneously. In simulation studies, MultiNEP outperforms competing methods that do not address network imbalances and identifies more true signal genes and metabolites simultaneously when we down-weight the relative contribution of the gene-gene network and up-weight that of the metabolite-metabolite network to the gene-metabolite network. Applications to two human cancer cohorts show that MultiNEP prioritizes more cancer-related genes by effectively using both within- and between-omics interactions after handling network imbalance.
Availability: The developed MultiNEP framework is implemented in an R package and available at: https://github.com/Karenxzr/MultiNep
Categories: Bioinformatics Trends
Deep learning-based multi-functional therapeutic peptides prediction with a multi-label focal dice loss function
Abstract
Motivation: With the great number of peptide sequences produced in the postgenomic era, it is highly desirable to identify the various functions of therapeutic peptides quickly. Furthermore, it is a great challenge to accurately predict multi-functional therapeutic peptides (MFTP) via sequence-based computational tools.
Results: Here we propose a novel multi-label-based method, named ETFC, to predict 21 categories of therapeutic peptides. The method utilizes a deep learning-based model architecture, which consists of four blocks: embedding, text convolutional neural network, feed-forward network, and classification blocks. This method also adopts an imbalanced learning strategy with a novel multi-label focal dice loss function (MLFDL). MLFDL is applied in the ETFC method to solve the inherent imbalance problem in the multi-label dataset and achieve competitive performance. The experimental results show that the ETFC method is significantly better than the existing methods for MFTP prediction. With the established framework, we use teacher-student-based knowledge distillation to obtain the attention weights from the self-attention mechanism in the MFTP prediction, and quantify their contributions towards each of the investigated activities.
Availability: The source code and dataset are available via: https://github.com/xialab-ahu/ETFC.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
RaggedExperiment: the missing link between genomic ranges and matrices in Bioconductor
Abstract
Motivation: Measurement of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produces “ragged” genomic ranges data: that is, data covering different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.
Availability: RaggedExperiment is publicly available under an Artistic 2.0 license at Bioconductor (https://dx.doi.org/doi:10.18129/B9.bioc.RaggedExperiment) with open development and issue tracking on GitHub (https://github.com/Bioconductor/RaggedExperiment).
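The notion of "ragged" data and its reshaping into a table can be illustrated with a small sketch. Each sample reports values on its own set of genomic ranges; a rectangular view takes the union of ranges as rows, samples as columns, and marks missing entries. This is a language-neutral illustration written in Python, not the RaggedExperiment R API; `Range`, `to_sparse_matrix` and the example data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (chromosome, start, end); an illustrative assumption, not a Bioconductor class.
Range = Tuple[str, int, int]


def to_sparse_matrix(
    samples: Dict[str, Dict[Range, float]]
) -> Tuple[List[Range], List[str], List[List[Optional[float]]]]:
    """Reshape per-sample range values into a range-by-sample table.

    Rows are the sorted union of all ranges, columns follow the sample order,
    and None marks ranges that a sample does not report.
    """
    rows = sorted({rng for calls in samples.values() for rng in calls})
    cols = list(samples)
    matrix = [[samples[name].get(rng) for name in cols] for rng in rows]
    return rows, cols, matrix


if __name__ == "__main__":
    # Invented copy-number-like values on sample-specific ranges.
    ragged = {
        "s1": {("chr1", 100, 200): 2.0, ("chr1", 300, 400): 1.0},
        "s2": {("chr1", 100, 200): 3.0, ("chr2", 50, 80): 4.0},
    }
    rows, cols, matrix = to_sparse_matrix(ragged)
    for rng, values in zip(rows, matrix):
        print(rng, values)
```

This simple view only aligns ranges that match exactly; handling partially overlapping ranges (for example by splitting them into disjoint pieces or summarizing over query regions) needs additional reshaping steps that the sketch leaves out.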
Categories: Bioinformatics Trends
GIL: A python package for designing custom indexing primers
Abstract
Summary: GIL (Generate Indexes for Libraries) is a software tool for generating primers to be used in the production of multiplexed sequencing libraries. GIL can be customized in numerous ways to meet user specifications, including length, sequencing modality, color balancing, and compatibility with existing primers, and produces ordering and demultiplexing-ready outputs.
Availability: GIL is written in Python and is freely available on GitHub under the MIT license: https://github.com/de-Boer-Lab/GIL and can be accessed as a web-application implemented in Streamlit at https://dbl-gil.streamlitapp.com.
Supplementary information: Supplementary data are available at https://doi.org/10.5281/zenodo.7922539.
Categories: Bioinformatics Trends
BRGenomics for analyzing high resolution genomics data in R
Abstract
Summary: I present here the R/Bioconductor package BRGenomics, which provides fast and flexible methods for post-alignment processing and analysis of high resolution genomics data within an interactive R environment. Utilizing GenomicRanges and other core Bioconductor packages, BRGenomics provides various methods for data importation and processing, read counting and aggregation, spike-in and batch normalization, re-sampling methods for robust “metagene” analyses, and various other functions for cleaning and modifying sequencing and annotation data. Simple yet flexible, the included methods are optimized for handling multiple datasets simultaneously, make extensive use of parallel processing, and support multiple strategies for efficiently storing and quantifying different kinds of data, including whole reads, quantitative single-base data, and run-length encoded coverage information. BRGenomics has been used to analyze ATAC-seq, ChIP-seq/ChIP-exo, PRO-seq/PRO-cap, and RNA-seq data; is built to be unobtrusive and maximally compatible with the Bioconductor ecosystem; is extensively tested; and includes complete documentation, examples, and tutorials.
Availability and Implementation: BRGenomics is an R package distributed through Bioconductor (https://bioconductor.org/packages/BRGenomics). Full documentation with examples and tutorials is available online (https://mdeber.github.io).
Categories: Bioinformatics Trends
Letter to the editor: Testing on External Independent Datasets is Necessary to Corroborate Machine Learning Model Improvement
Categories: Bioinformatics Trends
ODNA: Identification of Organellar DNA by Machine Learning
Abstract
Motivation: Identifying organellar DNA, such as mitochondrial or plastid sequences, inside a whole genome assembly remains challenging and requires biological background knowledge. To address this, we developed ODNA, which is based on genome annotation and machine learning.
Results: ODNA is software that classifies organellar DNA sequences within a genome assembly by machine learning based on a pre-defined genome annotation workflow. We trained our model with 829,769 DNA sequences from 405 genome assemblies and achieved high predictive performance (e.g., MCC of 0.61 for mitochondria and 0.73 for chloroplasts) on independent validation data, thus outperforming existing approaches significantly.
Availability: Our software ODNA is freely accessible as a web service at https://odna.mathematik.uni-marburg.de and can also be run in a docker container. The source code can be found at https://gitlab.com/mosga/odna and the processed data at Zenodo (DOI: 10.5281/zenodo.7506483).
Supplementary information: Supplementary data are available at Bioinformatics online.
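For readers unfamiliar with the MCC values quoted above, the Matthews correlation coefficient can be computed from a binary confusion matrix as sketched below. The function name `mcc` and the example counts are illustrative assumptions; they are not taken from the ODNA evaluation.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when the denominator is zero,
    a common convention for degenerate matrices.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(mcc(tp=50, tn=50, fp=0, fn=0))
    print(mcc(tp=25, tn=25, fp=25, fn=25))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it informative when classes are strongly imbalanced, as is typical when organellar sequences are a small fraction of an assembly.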
Categories: Bioinformatics Trends
