Subscribe to Bioinformatics Oxford Journals feed

A phylogenetic C interpreter for TNT

Sat, 28/03/2020 - 5:30am
Abstract
Motivation: TNT (a widely used program for phylogenetic analysis) includes an interpreter for a scripting language, but that implementation is non-standard and uses several conventions of its own. This paper describes the implementation and basic usage of a C interpreter (with all the ISO essentials) now included in TNT. A phylogenetic library includes functions that can be used for manipulating trees and data, as well as other phylogeny-specific tasks. This greatly extends the capabilities of TNT.
Availability: Versions of TNT including the C interpreter for scripts can be downloaded from http://www.lillo.org.ar/phylogeny/tnt/.
Categories: Bioinformatics Trends

QuartataWeb: integrated chemical-protein-pathway mapping for polypharmacology and chemogenomics

Sat, 28/03/2020 - 5:30am
Abstract
Summary: QuartataWeb is a user-friendly server developed for polypharmacological and chemogenomics analyses. Users can easily obtain information on experimentally verified (known) and computationally predicted (new) interactions between 5,494 drugs and 2,807 human proteins in DrugBank, and between 315,514 chemicals and 9,457 human proteins in the STITCH database. In addition, QuartataWeb links targets to KEGG pathways and GO annotations, completing the bridge from drugs/chemicals to function via protein targets and cellular pathways. It allows users to query a series of chemicals, drug combinations, or multiple targets, to enable multi-drug, multi-target, multi-pathway analyses, toward facilitating the design of polypharmacological treatments for complex diseases.
Availability and implementation: QuartataWeb is freely accessible at http://quartata.csb.pitt.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.

Gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes

Sat, 28/03/2020 - 5:30am
Abstract
Motivation: Comparing the organization of genes, gene clusters, and their flanking genomic contexts is of critical importance to the determination of gene function and the evolutionary basis of microbial traits. Currently, user-friendly and flexible tools that can visualize and compare genomic contexts for numerous genomes are still missing.
Results: We here present Gcluster, a stand-alone Perl tool that allows researchers to customize and create high-quality linear maps of the genomic region around the genes of interest across large numbers of completed and draft genomes. Importantly, Gcluster integrates homologous gene analysis, in the form of a built-in orthoMCL, and mapping genomes onto a given phylogeny to provide superior comparison of gene contexts.
Availability and implementation: Gcluster is written in Perl and released under GPLv3. The source code is freely available at https://github.com/Xiangyang1984/Gcluster. Gcluster can also be installed through conda: "conda install -c bioconda gcluster".
Supplementary information: Supplementary data are available at Bioinformatics online.

PolishEM: image enhancement in FIB-SEM

Sat, 28/03/2020 - 5:30am
Abstract
Summary: We have developed a software tool to improve the image quality in FIB-SEM stacks: PolishEM. Based on a Gaussian-blur model, it automatically estimates and compensates for the blur affecting each individual image. It also includes correction for artefacts commonly arising in FIB-SEM (e.g. curtaining). PolishEM has been optimized for efficient processing of huge FIB-SEM stacks on standard computers.
Availability and implementation: PolishEM has been developed in C. GPL source code and binaries for Linux, OSX and Windows are available at http://www.cnb.csic.es/%7ejjfernandez/polishem.
Supplementary information: Supplementary data are available at Bioinformatics online.

Dual-Dropout Graph Convolutional Network for Predicting Synthetic Lethality in Human Cancers

Sat, 28/03/2020 - 5:30am
Abstract
Motivation: Synthetic lethality (SL) is a promising form of gene interaction for cancer therapy, as it is able to identify specific genes to target in cancer cells without disrupting normal cells. As high-throughput wet-lab settings are often costly and face various challenges, computational approaches have become a practical complement. In particular, predicting SLs can be formulated as a link prediction task on a graph of interacting genes. Although matrix factorization techniques have been widely adopted in link prediction, they focus on mapping genes to latent representations in isolation, without aggregating information from neighboring genes. Graph convolutional networks (GCN) can capture such neighborhood dependency in a graph. However, it is still challenging to apply GCN for SL prediction as SL interactions are extremely sparse, which is more likely to cause overfitting.
Results: In this paper, we propose a novel Dual-Dropout GCN (DDGCN) for learning more robust gene representations for SL prediction. We employ both coarse-grained node dropout and fine-grained edge dropout to address the issue that standard dropout in vanilla GCN is often inadequate in reducing overfitting on sparse graphs. In particular, coarse-grained node dropout can efficiently and systematically enforce dropout at the node (gene) level, while fine-grained edge dropout can further fine-tune the dropout at the interaction (edge) level. We further present a theoretical framework to justify our model architecture. Finally, we conduct extensive experiments on human SL datasets and the results demonstrate the superior performance of our model in comparison with state-of-the-art methods.
Availability: DDGCN is implemented in Python 3.7, open-source and freely available at https://github.com/CXX1113/Dual-DropoutGCN
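Editorial illustration of the DDGCN abstract above: the idea of combining node-level and edge-level dropout can be sketched on a plain edge list. This is only a minimal sketch under stated assumptions, not the authors' implementation. The function name `dual_dropout`, its parameters, and the toy graph are hypothetical, and the real model applies dropout inside GCN training rather than to a bare list of edges.

```python
import random


def dual_dropout(edges, num_nodes, node_rate, edge_rate, seed=None):
    """Apply node-level, then edge-level, dropout to an undirected edge list.

    Hypothetical helper for illustration; not taken from the DDGCN code base.
    """
    rng = random.Random(seed)

    # Coarse-grained step: drop whole nodes (genes) with probability node_rate.
    # Every edge touching a dropped node disappears with it.
    kept_nodes = {n for n in range(num_nodes) if rng.random() >= node_rate}
    surviving = [(u, v) for (u, v) in edges if u in kept_nodes and v in kept_nodes]

    # Fine-grained step: drop each remaining edge (interaction) independently.
    return [edge for edge in surviving if rng.random() >= edge_rate]


# Toy graph (assumed data): four genes, four interactions.
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
sampled = dual_dropout(toy_edges, num_nodes=4, node_rate=0.25, edge_rate=0.25, seed=1)
print(sampled)
```

With both rates set to 0 the edge list is returned unchanged, and setting either rate to 1 removes every edge, which makes the two levels easy to check separately.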

debCAM: a Bioconductor R package for fully unsupervised deconvolution of complex tissues

Fri, 27/03/2020 - 5:30am
Abstract
Summary: We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes based on bulk expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM), that can automatically detect tissue/cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples, and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, proteomics, and imaging data. With enhanced data preprocessing and prior knowledge incorporation, the debCAM software tool will allow biologists to perform a more comprehensive and unbiased characterization of tissue remodeling in many biomedical contexts.
Availability and implementation: http://bioconductor.org/packages/debCAM
Supplementary information: Supplementary data are available at Bioinformatics online.

CNV-BAC: Copy Number Variation Detection in Bacterial Circular Genome

Fri, 27/03/2020 - 5:30am
Abstract
Motivation: Whole genome sequencing (WGS) is widely used for copy number variation (CNV) detection. However, for most bacteria, their circular genome structure and high replication rate make reads more enriched near the replication origin. CNV detection based on read depth could be seriously influenced by such replication bias.
Results: We show that the replication bias is widespread using ~200 bacterial WGS datasets. We develop CNV-BAC, which can properly normalize the replication bias as well as other known biases in bacterial WGS data and can accurately detect CNVs. Simulation and real data analysis show that CNV-BAC achieves the best performance in CNV detection compared with available algorithms.
Availability and implementation: CNV-BAC is available at https://github.com/XiDsLab/CNV-BAC.

CRiSP: Accurate Structure Prediction of Disulfide-Rich Peptides with Cystine-Specific Sequence Alignment and Machine Learning

Thu, 26/03/2020 - 5:30am
Abstract
Motivation: High-throughput sequencing discovers many naturally occurring disulfide-rich peptides or cystine-rich peptides (CRPs) with diversified bioactivities. However, their structure information, which is very important to peptide drug discovery, is still very limited.
Results: We have developed a CRP-specific structure prediction method called CRiSP, based on a customized template database with cystine-specific sequence alignment and three machine-learning predictors. The modeling accuracy is significantly better than several popular general-purpose structure modeling methods, and our CRiSP can provide useful model quality estimations.
Availability: The CRiSP server is freely available on the website at http://wulab.com.cn/CRISP.
Supplementary information: Supplementary data are available at Bioinformatics online.

HiChIP-Peaks: A HiChIP peak calling algorithm

Tue, 24/03/2020 - 5:30am
Abstract
Motivation: HiChIP is a powerful tool to interrogate 3D chromatin organization. Current tools to analyse chromatin looping mechanisms using HiChIP data require the identification of loop anchors to work properly. However, current approaches to discover these anchors from HiChIP data are not satisfactory, having either a very high false discovery rate or strong dependence on sequencing depth. Moreover, these tools do not allow quantitative comparison of peaks across different samples, failing to fully exploit the information available from HiChIP datasets.
Results: We develop a new tool based on a representation of HiChIP data centred on the re-ligation sites to identify peaks from HiChIP datasets, which can subsequently be used in other tools for loop discovery. This increases the reliability of these tools and improves recall rate as sequencing depth is reduced. We also provide a method to count reads mapping to peaks across samples, which can be used for differential peak analysis using HiChIP data.
Availability: HiChIP-Peaks is freely available at https://github.com/ChenfuShi/HiChIP_peaks
Supplementary information: Supplementary data are available at Bioinformatics online.

Cancer subtype classification and modeling by pathway attention and propagation

Tue, 24/03/2020 - 5:30am
Abstract
Motivation: Biological pathways are important curated knowledge of biological processes. Thus, cancer subtype classification based on pathways will be very useful to understand differences in biological mechanisms among cancer subtypes. However, pathways include only a fraction of the entire gene set (only 1/3 of human genes in KEGG), and pathways are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification.
Results: We present an explainable deep learning model with attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention based ensemble model combines several hundreds of pathways in an explainable manner. Lastly, network propagation on a pathway-gene network explains why gene expression profiles in subtypes are different. In experiments with five TCGA cancer data sets, our method achieved very good classification accuracies and, additionally, identified subtype-specific pathways and biological functions.
Supplementary information: Supplementary data are available at Bioinformatics online.

Brewery: Deep Learning and deeper profiles for the prediction of 1D protein structure annotations

Tue, 24/03/2020 - 5:30am
Abstract
Motivation: Protein Structural Annotations are essential abstractions to deal with the prediction of Protein Structures. Many increasingly sophisticated Protein Structural Annotations have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds.
Results: We propose Brewery, a suite of ab initio predictors of 1D Protein Structural Annotations. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of Secondary Structure, Structural Motifs, Relative Solvent Accessibility and Contact Density.
Availability: The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.

Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics

Tue, 24/03/2020 - 5:30am
Abstract
Motivation: The ability of score functions to discriminate correct from incorrect peptide-spectrum matches in database-searching-based spectrum identification is hindered by many superfluous peaks belonging to unexpected fragmentation ions or by missing peaks of anticipated fragmentation ions.
Results: Here, we present a new method, called BoltzMatch, to learn score functions using a particular type of stochastic neural network, called restricted Boltzmann machines, in order to enhance their discrimination ability. BoltzMatch learns chemically explainable patterns among peak pairs in the spectrum data, and it can augment peaks depending on their semantic context or even reconstruct missing peaks of expected ions during its internal scoring mechanism. As a result, BoltzMatch achieved 50% and 33% more annotations on high- and low-resolution MS2 data than XCorr at a 0.1% false discovery rate in our benchmark; conversely, XCorr yielded the same number of spectrum annotations as BoltzMatch, albeit with 4-6 times more errors. In addition, BoltzMatch alone yields 14% more annotations than Prosit (which runs with Percolator), and BoltzMatch with Percolator yields 32% more annotations than Prosit at the 0.1% FDR level in our benchmark.
Availability: BoltzMatch is freely available at: https://github.com/kfattila/BoltzMatch
Supporting information: Supplementary materials are available at Bioinformatics Online.

Automatic identification of relevant genes from low-dimensional embeddings of single cell RNAseq data

Tue, 24/03/2020 - 5:30am
Abstract
Dimensionality reduction is a key step in the analysis of single-cell RNA sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell’s position in the low-dimensional embedding, making it more difficult to characterize the underlying biological processes.
In this paper, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined subregion. We apply our method to single-cell RNAseq datasets from different experimental protocols and to different low dimensional embedding techniques. This shows our method’s versatility to identify key genes for a variety of biological processes.
To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor.

Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF

Tue, 24/03/2020 - 5:30am
Abstract
Motivation: The rapid proliferation of single-cell RNA-Sequencing (scRNA-Seq) technologies has spurred the development of diverse computational approaches to detect transcriptionally coherent populations. While the complexity of the algorithms for detecting heterogeneity has increased, most require significant user-tuning, are heavily reliant on dimension reduction techniques and are not scalable to ultra-large datasets. We previously described a multi-step algorithm, Iterative Clustering and Guide-gene selection (ICGS), which applies intra-gene correlation and hybrid clustering to uniquely resolve novel transcriptionally coherent cell populations from an intuitive graphical user interface.
Results: We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse-NMF, cluster “fitness”, SVM) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively down-samples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell-types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.
Availability and implementation: ICGS2 is implemented in Python. The source code and documentation are available at: http://altanalyze.org.
Supplementary information: Supplementary data are available at Bioinformatics online.

Fast and robust ancestry prediction using principal component analysis

Fri, 20/03/2020 - 5:30am
Abstract
Motivation: Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation-decomposition-transformation (ADP), such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented data set.
Results: We develop and propose two alternative approaches, bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16 times to 16,000 times faster than ADP. We applied our approaches to the UK Biobank data of 488,366 study samples with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1,628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches.
Availability: The OADP and AP methods, as well as SP and ADP, have been implemented in the open source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa.
Supplementary information: Supplementary data are available at Bioinformatics online.

Position-wise binding preference is important for miRNA target site prediction

Wed, 18/03/2020 - 5:30am
Abstract
Motivation: It is a fundamental task to identify microRNA (miRNA) targets and accurately locate their target sites. Genome-scale experiments for miRNA target site detection are still costly. The prediction accuracies of existing computational algorithms and tools often fall short of expectations due to a large number of false positives. One major obstacle to achieving higher accuracy is the lack of knowledge of the target binding features of miRNAs. The published high-throughput experimental data provide an opportunity to analyze position-wise preference of miRNAs in terms of target binding, which can be an important feature in miRNA target prediction algorithms.
Results: We developed a Markov model to characterize position-wise pairing patterns of miRNA-target interactions. We further integrated this model as a scoring method and developed a dynamic programming (DP) algorithm, MDPS (Markov model-scored Dynamic Programming algorithm for miRNA target site Selection), that can screen putative target sites of miRNA-target binding. The MDPS algorithm thus can take into account both the dependency of neighboring pairing positions and the global pairing information. Based on the trained Markov models from both miRNA-specific and general datasets, we discovered that the position-wise binding information specific to a given miRNA would benefit its target prediction. We also found that miRNAs maintain region-wise similarity in their target binding patterns. Combining MDPS with existing methods significantly improves their precision while only slightly reducing their recall. Therefore, position-wise pairing patterns have the promise to improve target prediction if incorporated into existing software tools.
Availability: The source code and tool to calculate the MDPS score are available at http://hulab.ucf.edu/research/projects/MDPS/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.

Estimation of dynamic SNP-heritability with Bayesian Gaussian process models

Wed, 18/03/2020 - 5:30am
Abstract
Motivation: Improved DNA technology has made it practical to estimate single nucleotide polymorphism (SNP)-heritability among distantly related individuals with unknown relationships. For growth and development related traits, it is meaningful to base SNP-heritability estimation on longitudinal data due to the time-dependency of the process. However, only a few statistical methods have been developed so far for estimating dynamic SNP-heritability and quantifying its full uncertainty.
Results: We introduce a completely tuning-free Bayesian Gaussian process (GP) based approach for estimating dynamic variance components and heritability as their function. For parameter estimation, we use a modern Markov Chain Monte Carlo (MCMC) method which allows full uncertainty quantification. Several data sets are analysed and our results clearly illustrate that the 95% credible intervals of the proposed joint estimation method (which "borrows strength" from adjacent time points) are significantly narrower than those of a two-stage baseline method that first estimates the variance components at each time point independently and then performs smoothing. We compare the method with a random regression model using the MTG2 and BLUPF90 software, and quantitative measures indicate superior performance of our method. Results are presented for simulated and real data with up to 1000 time points. Finally, we demonstrate scalability of the proposed method for simulated data with tens of thousands of individuals.
Availability: The C++ implementation dynBGP and simulated data are available on GitHub (https://github.com/aarjas/dynBGP). The programs can be run in R. Real datasets are available in the QTL archive (https://phenome.jax.org/centers/QTLA).
Supplementary information: Supplementary data are available at Bioinformatics online.

Redundancy-Weighting the PDB for Detailed Secondary Structure Prediction Using Deep-Learning Models

Wed, 18/03/2020 - 5:30am
Abstract
Motivation: The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use non-redundant subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting, down-weights redundant entries rather than discarding them. This approach may be particularly helpful for Machine Learning (ML) methods that use the PDB as their source for data.
Methods for Secondary Structure Prediction (SSP) have greatly improved over the years, with recent studies achieving above 70% accuracy for 8-class (DSSP) prediction. As these methods typically incorporate machine learning techniques, training on redundancy-weighted datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure alphabets.
Results: This article compares the SSP performances of Deep Learning (DL) models trained on either redundancy-weighted or non-redundant datasets. We show that training on redundancy-weighted sets consistently results in better prediction of 3-class (HCE), 8-class (DSSP) and 13-class (STR2) secondary structures.
Availability: Data and DL models are available at http://meshi1.cs.bgu.ac.il/rw.

powmic: an R package for power assessment in microbiome case-control studies

Wed, 18/03/2020 - 5:30am
Abstract
Summary: Power analysis is essential to decide the sample size of metagenomic sequencing experiments in a case-control study for identifying differentially abundant microbes. However, the complexity of microbial data characteristics such as excessive zeros, over-dispersion, compositionality, intrinsically microbial correlations and variable sequencing depths makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic.
Availability and implementation: The powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic
Supplementary information: Supplementary data are available at Bioinformatics online.
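Editorial illustration of the powmic abstract above: when no analytical power formula exists, power can be estimated by simulating many case-control datasets and counting how often a test rejects the null. The sketch below is a deliberately simplified stand-in, not the powmic method. It assumes a single feature with Gaussian values and a normal-approximation two-sample test, whereas powmic models count data with zero inflation, over-dispersion and compositionality. The function name `simulated_power` and all parameter values are hypothetical.

```python
import random
import statistics


def simulated_power(n_per_group, effect, n_sims=500, seed=0, z_crit=1.96):
    """Estimate power as the fraction of simulated studies that reject the null.

    Assumption: Gaussian data with unit variance and a two-sided z-type test
    at roughly the 5% level (z_crit = 1.96). Requires n_per_group >= 2.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cases = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]

        # Standard error of the difference in group means.
        std_err = (
            statistics.variance(controls) / n_per_group
            + statistics.variance(cases) / n_per_group
        ) ** 0.5
        z = (statistics.mean(cases) - statistics.mean(controls)) / std_err

        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# Compare the rejection rate under no effect with that under a large effect.
null_rate = simulated_power(n_per_group=50, effect=0.0)
power = simulated_power(n_per_group=50, effect=1.0)
print(null_rate, power)
```

Swapping the Gaussian generator for a count model and the z-type test for a differential abundance method keeps the same loop structure, which is why simulation remains practical when the data are too complex for closed-form power calculations.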



March 2020