SEQing: web-based visualization of iCLIP and RNA-seq data in an interactive python framework

BMC Bioinformatics - Wed, 18/03/2020 - 5:30am
RNA-binding proteins interact with their target RNAs at specific sites. These binding sites can be determined genome-wide through individual nucleotide resolution crosslinking immunoprecipitation (iCLIP). Subs...
Categories: Bioinformatics Trends

A deep learning-based framework for lung cancer survival analysis with biomarker interpretation

BMC Bioinformatics - Wed, 18/03/2020 - 5:30am
Lung cancer is the leading cause of cancer-related deaths in both men and women in the United States, and it has a much lower five-year survival rate than many other cancers. Accurate survival analysis is urge...
Categories: Bioinformatics Trends

RASflow: an RNA-Seq analysis workflow with Snakemake

BMC Bioinformatics - Wed, 18/03/2020 - 5:30am
With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq da...
Categories: Bioinformatics Trends

LCQS: an efficient lossless compression tool of quality scores with random access functionality

BMC Bioinformatics - Wed, 18/03/2020 - 5:30am
Advanced sequencing machines dramatically speed up the generation of genomic data, which makes the demand of efficient compression of sequencing data extremely urgent and significant. As the most difficult par...
Categories: Bioinformatics Trends

Position-wise binding preference is important for miRNA target site prediction

Bioinformatics Oxford Journals - Wed, 18/03/2020 - 5:30am
Motivation: It is a fundamental task to identify microRNA (miRNA) targets and accurately locate their target sites. Genome-scale experiments for miRNA target site detection are still costly. The prediction accuracies of existing computational algorithms and tools often fall short of expectations because of a large number of false positives. One major obstacle to achieving higher accuracy is the lack of knowledge of the target binding features of miRNAs. The published high-throughput experimental data provide an opportunity to analyze position-wise preference of miRNAs in terms of target binding, which can be an important feature in miRNA target prediction algorithms.
Results: We developed a Markov model to characterize position-wise pairing patterns of miRNA-target interactions. We further integrated this model as a scoring method and developed a dynamic programming (DP) algorithm, MDPS (Markov model-scored Dynamic Programming algorithm for miRNA target site Selection), that can screen putative target sites of miRNA-target binding. The MDPS algorithm can thus take into account both the dependency of neighboring pairing positions and the global pairing information. Based on the trained Markov models from both miRNA-specific and general datasets, we discovered that the position-wise binding information specific to a given miRNA would benefit its target prediction. We also found that miRNAs maintain region-wise similarity in their target binding patterns. Combining MDPS with existing methods significantly improves their precision while only slightly reducing their recall. Therefore, position-wise pairing patterns have the promise to improve target prediction if incorporated into existing software tools.
Availability: The source code and tool to calculate the MDPS score are available at http://hulab.ucf.edu/research/projects/MDPS/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
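The idea of scoring a candidate site with a position-wise Markov model can be illustrated with a small sketch. This is not the MDPS implementation: the two-letter state alphabet (M for a paired position, X for an unpaired one), the add-one smoothing, and the toy training patterns are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

STATES = "MX"  # hypothetical alphabet: M = paired position, X = unpaired position


def train(patterns, pseudocount=1.0):
    """Fit a first-order Markov model with position-specific transitions."""
    length = len(patterns[0])
    start = {s: pseudocount for s in STATES}
    trans = [{a: {b: pseudocount for b in STATES} for a in STATES}
             for _ in range(length - 1)]
    for pattern in patterns:
        start[pattern[0]] += 1
        for i in range(1, length):
            trans[i - 1][pattern[i - 1]][pattern[i]] += 1
    # Turn the smoothed counts into probabilities.
    total = sum(start.values())
    start = {s: c / total for s, c in start.items()}
    for step in trans:
        for a in STATES:
            row_total = sum(step[a].values())
            step[a] = {b: c / row_total for b, c in step[a].items()}
    return start, trans


def log_score(pattern, model):
    """Log-probability of a pairing pattern under the trained model."""
    start, trans = model
    score = math.log(start[pattern[0]])
    for i in range(1, len(pattern)):
        score += math.log(trans[i - 1][pattern[i - 1]][pattern[i]])
    return score


# Toy training set (made up): pairing concentrated at the first positions.
model = train(["MMMMXX", "MMMMXM", "MMMXXX", "MMMMXX"])
seed_like = log_score("MMMMXX", model)
reversed_like = log_score("XXMMMM", model)
```

Because each transition table is tied to a position, the same pairing pattern scores differently depending on where it occurs, which is the position-wise dependence the abstract describes.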
Categories: Bioinformatics Trends

Estimation of dynamic SNP-heritability with Bayesian Gaussian process models

Bioinformatics Oxford Journals - Wed, 18/03/2020 - 5:30am
Motivation: Improved DNA technology has made it practical to estimate single nucleotide polymorphism (SNP)-heritability among distantly related individuals with unknown relationships. For growth and development related traits, it is meaningful to base SNP-heritability estimation on longitudinal data due to the time-dependency of the process. However, only a few statistical methods have been developed so far for estimating dynamic SNP-heritability and quantifying its full uncertainty.
Results: We introduce a completely tuning-free Bayesian Gaussian process (GP) based approach for estimating dynamic variance components and heritability as their function. For parameter estimation, we use a modern Markov Chain Monte Carlo (MCMC) method which allows full uncertainty quantification. Several data sets are analysed and our results clearly illustrate that the 95% credible intervals of the proposed joint estimation method (which "borrows strength" from adjacent time points) are significantly narrower than those of a two-stage baseline method that first estimates the variance components at each time point independently and then performs smoothing. We compare the method with a random regression model using the MTG2 and BLUPF90 software, and quantitative measures indicate superior performance of our method. Results are presented for simulated and real data with up to 1000 time points. Finally, we demonstrate scalability of the proposed method for simulated data with tens of thousands of individuals.
Availability: The C++ implementation dynBGP and simulated data are available on GitHub (https://github.com/aarjas/dynBGP). The programs can be run in R. Real datasets are available in the QTL archive (https://phenome.jax.org/centers/QTLA).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Redundancy-Weighting the PDB for Detailed Secondary Structure Prediction Using Deep-Learning Models

Bioinformatics Oxford Journals - Wed, 18/03/2020 - 5:30am
Motivation: The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use non-redundant subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting, down-weights redundant entries rather than discarding them. This approach may be particularly helpful for Machine Learning (ML) methods that use the PDB as their source for data. Methods for Secondary Structure Prediction (SSP) have greatly improved over the years, with recent studies achieving above 70% accuracy for 8-class (DSSP) prediction. As these methods typically incorporate machine learning techniques, training on redundancy-weighted datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure alphabets.
Results: This article compares the SSP performances of Deep Learning (DL) models trained on either redundancy-weighted or non-redundant datasets. We show that training on redundancy-weighted sets consistently results in better prediction of 3-class (HCE), 8-class (DSSP) and 13-class (STR2) secondary structures.
Availability: Data and DL models are available at http://meshi1.cs.bgu.ac.il/rw.
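The contrast between discarding redundant entries and down-weighting them can be sketched in a few lines. The abstract does not give the weighting formula, so the inverse-cluster-size weights below are an assumption, and the cluster labels and per-entry results are invented toy data.

```python
from collections import Counter


def redundancy_weights(cluster_ids):
    """Weight each entry by 1 / (size of its cluster), so each cluster sums to 1.

    Assumption: inverse cluster size is one simple weighting choice; the
    article's actual scheme may differ.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_accuracy(correct, weights):
    """Fraction of correct predictions, with each entry counted by its weight."""
    return sum(w for ok, w in zip(correct, weights) if ok) / sum(weights)


# Hypothetical toy data: five chains falling into three sequence clusters.
clusters = ["A", "A", "A", "B", "C"]
correct = [True, True, True, False, False]

weights = redundancy_weights(clusters)
plain = sum(correct) / len(correct)             # every entry counts equally
weighted = weighted_accuracy(correct, weights)  # every cluster counts equally
```

In this toy case the over-represented cluster A dominates the unweighted score, while the weighted score treats the three clusters equally without throwing away any entry. The same weights could be passed as per-sample weights to a training loss.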
Categories: Bioinformatics Trends

powmic: an R package for power assessment in microbiome case-control studies

Bioinformatics Oxford Journals - Wed, 18/03/2020 - 5:30am
Summary: Power analysis is essential to decide the sample size of metagenomic sequencing experiments in a case-control study for identifying differentially abundant microbes. However, the complexity of microbial data characteristics such as excessive zeros, over-dispersion, compositionality, intrinsically microbial correlations and variable sequencing depths makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic.
Availability and implementation: The powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic.
Supplementary information: Supplementary data are available at Bioinformatics online.
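Simulation-based power assessment in general means simulating many datasets under an assumed effect and counting how often a test rejects the null. The sketch below illustrates that loop for a single taxon. It is not powmic and does not use its API: the negative binomial count model, the log1p transform, the Welch statistic with a normal approximation, and every parameter value are assumptions chosen for brevity. It also ignores zero inflation, compositionality and sequencing depth, which the package is said to consider.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the modest means used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def nb_count(rng, mean, dispersion):
    """Over-dispersed count drawn as a gamma-Poisson mixture (negative binomial)."""
    shape = 1.0 / dispersion
    return poisson(rng, rng.gammavariate(shape, mean / shape))


def welch_p(x, y):
    """Two-sided p-value for a Welch statistic, using a normal approximation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    se = math.sqrt(vx / len(x) + vy / len(y))
    if se == 0.0:
        return 1.0
    return math.erfc(abs(mx - my) / se / math.sqrt(2))


def estimate_power(n_per_group, fold_change, n_sims=300, alpha=0.05,
                   base_mean=10.0, dispersion=0.5, seed=0):
    """Fraction of simulated case-control datasets in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        controls = [math.log1p(nb_count(rng, base_mean, dispersion))
                    for _ in range(n_per_group)]
        cases = [math.log1p(nb_count(rng, base_mean * fold_change, dispersion))
                 for _ in range(n_per_group)]
        if welch_p(controls, cases) < alpha:
            rejections += 1
    return rejections / n_sims


# Power for a hypothetical two-fold change at two candidate sample sizes.
power_small = estimate_power(n_per_group=10, fold_change=2.0)
power_large = estimate_power(n_per_group=40, fold_change=2.0)
```

Running the estimator over a grid of sample sizes gives a power curve from which a sample size can be read off; setting `fold_change=1.0` gives a check on the false-positive rate of the chosen test.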
Categories: Bioinformatics Trends

CPVA: a web-based metabolomic tool for chromatographic peak visualization and annotation

Bioinformatics Oxford Journals - Wed, 18/03/2020 - 5:30am
Motivation: Liquid chromatography–mass spectrometry-based non-targeted metabolomics is routinely performed to qualitatively and quantitatively analyze a tremendous amount of metabolite signals in complex biological samples. However, false-positive peaks in the datasets are commonly detected as metabolite signals by many popular software tools, resulting in unreliable measurement.
Results: To reduce false-positive calling, we developed an interactive web tool, termed CPVA, for visualization and accurate annotation of the detected peaks in non-targeted metabolomics data. We used a chromatogram-centric strategy to unfold the characteristics of chromatographic peaks through visualization of peak morphology metrics, with additional functions to annotate adducts, isotopes and contaminants. CPVA is a free, user-friendly tool that helps users identify peak background noise and contaminants, resulting in a decrease in false-positive or redundant peak calling and thereby improving the data quality of non-targeted metabolomics studies.
Availability: CPVA is freely available at http://cpva.eastus.cloudapp.azure.com. Source code and installation instructions are available on GitHub: https://github.com/13479776/cpva.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Benchmarking of computational error-correction methods for next-generation sequencing data

Genome Biology - BiomedCentral - Tue, 17/03/2020 - 5:30am
Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors pres...
Categories: Bioinformatics Trends

A Blind and Independent Benchmark Study for Detecting Differentially Methylated Regions in Plants

Bioinformatics Oxford Journals - Tue, 17/03/2020 - 5:30am
Motivation: Bisulfite sequencing (BS-seq) is a state-of-the-art technique for investigating methylation of the DNA to gain insights into epigenetic regulation. Several algorithms have been published for identification of differentially methylated regions (DMRs). However, the performances of the individual methods remain unclear and it is difficult to optimally select an algorithm in application settings.
Results: We analyzed BS-seq data from four plants covering three taxonomic groups. We first characterized the data using multiple summary statistics describing methylation levels, coverage, noise, as well as frequencies, magnitudes and lengths of methylated regions. Then, simulated data sets with characteristics most similar to the real experimental data were created. Seven different algorithms (metilene, methylKit, MOABS, DMRcate, Defiant, BSmooth, MethylSig) for DMR identification were applied and their performances were assessed. A blind and independent study design was chosen to reduce bias and to derive practical method selection guidelines. Overall, metilene had superior performance in most settings. Data attributes such as coverage and spread of the DMR lengths were found to be useful for selecting the best method for DMR detection. A decision tree to select the optimal approach based on these data attributes is provided. The presented procedure might serve as a general strategy for deriving algorithm selection rules tailored to demands in specific application settings.
Availability: Scripts that were used for the analyses and that can be used for prediction of the optimal algorithm are provided at https://github.com/kreutz-lab/DMR-DecisionTree. Simulated and experimental data are available at https://doi.org/10.6084/m9.figshare.11619045.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends

LogoJS: a Javascript package for creating sequence logos and embedding them in web applications

Bioinformatics Oxford Journals - Tue, 17/03/2020 - 5:30am
Summary: Sequence logos were introduced nearly 30 years ago as a human-readable format for representing consensus sequences, and they remain widely used. As new experimental and computational techniques have developed, logos have been extended: extra symbols represent covalent modifications to nucleotides, logos with multiple letters at each position illustrate models with multi-nucleotide features, and symbols extending below the x-axis may represent a binding energy penalty for a residue or a negative weight output from a neural network. Web-based visualization tools for genomic data are increasingly taking advantage of modern web technology to offer dynamic, interactive figures to users, but support for sequence logos remains limited. Here we present LogoJS, a Javascript package for rendering customizable, interactive, vector-graphic sequence logos and embedding them in web applications. LogoJS supports all the aforementioned logo extensions and is bundled with a companion web application for creating and sharing logos.
Availability: LogoJS is implemented both in plain Javascript and ReactJS, a popular user-interface framework. The web application is hosted at logojs.wenglab.org. All major browsers and operating systems are supported. The package and application are open-source; code is available at GitHub.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

TCRBuilder: Multi-state T-cell receptor structure prediction

Bioinformatics Oxford Journals - Tue, 17/03/2020 - 5:30am
Motivation: T-cell receptors (TCRs) are immune proteins that primarily target peptide antigens presented by the major histocompatibility complex. They tend to have lower specificity and affinity than their antibody counterparts, and their binding sites have been shown to adopt multiple conformations, which is potentially an important factor for their polyspecificity. None of the current TCR modelling tools predict this variability, which limits our ability to accurately predict TCR binding.
Results: We present TCRBuilder, a multi-state TCR structure prediction tool. Given a paired αβ TCR sequence, TCRBuilder returns a model or an ensemble of models covering the potential conformations of the binding site. This enables the analysis of structurally-driven polyspecificity in TCRs, which is not possible with existing tools.
Availability: http://opig.stats.ox.ac.uk/resources
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty

Genome Biology - BiomedCentral - Mon, 16/03/2020 - 5:30am
Alternative splicing is a biological process during gene expression that allows a single gene to code for multiple proteins. However, splicing patterns can be altered in some conditions or diseases. Here, we p...
Categories: Bioinformatics Trends

M2IA: a Web Server for Microbiome and Metabolome Integrative Analysis

Bioinformatics Oxford Journals - Mon, 16/03/2020 - 5:30am
Motivation: Microbiome-metabolome association studies have experienced exponential growth for an in-depth understanding of the impact of microbiota on human health over the last decade. However, analyzing the resulting multi-omics data and their correlations remains a significant challenge due to the lack of a comprehensive computational tool that can facilitate data integration and interpretation. In this study, an automated microbiome and metabolome integrative analysis pipeline (M2IA) has been developed to meet the urgent need for tools that can effectively integrate microbiome and metabolome data to derive biological insights.
Results: M2IA streamlines the integrative data analysis between metabolome and microbiome, from data preprocessing, univariate and multivariate statistical analyses, and advanced functional analysis for biological interpretation, to a summary report. The functionality of M2IA was demonstrated using TwinsUK cohort datasets consisting of 1116 fecal metabolites and 16S rRNA microbiome data from 786 individuals. Moreover, two important metabolic pathways, i.e., benzoate degradation and phosphotransferase system, were identified to be closely associated with obesity.
Availability: M2IA is publicly available at http://m2ia.met-bioinformatics.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Projected t-SNE for batch correction

Bioinformatics Oxford Journals - Mon, 16/03/2020 - 5:30am
Motivation: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data.
Results: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumors.
Availability: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

DeCoDe: degenerate codon design for complete protein-coding DNA libraries

Bioinformatics Oxford Journals - Mon, 16/03/2020 - 5:30am
Motivation: High-throughput protein screening is a critical technique for dissecting and designing protein function. Libraries for these assays can be created through a number of means, including targeted or random mutagenesis of a template protein sequence or direct DNA synthesis. However, mutagenic library construction methods often yield vastly more non-functional than functional variants and, despite advances in large-scale DNA synthesis, individual synthesis of each desired DNA template is often prohibitively expensive. Consequently, many protein screening libraries rely on the use of degenerate codons (DCs), mixtures of DNA bases incorporated at specific positions during DNA synthesis, to generate highly diverse protein variant pools from only a few low-cost synthesis reactions. However, selecting DCs for sets of sequences that covary at multiple positions dramatically increases the difficulty of designing a DC library and leads to the creation of many undesired variants that can quickly outstrip screening capacity.
Results: We introduce a novel algorithm for total DC library optimization, DeCoDe, based on integer linear programming. DeCoDe significantly outperforms state-of-the-art DC optimization algorithms and scales well to more than a hundred proteins sharing complex patterns of covariation (e.g., the lab-derived avGFP lineage). Moreover, DeCoDe is, to our knowledge, the first DC design algorithm with the capability to encode mixed-length protein libraries. We anticipate DeCoDe to be broadly useful for a variety of library generation problems, ranging from protein engineering attempts that leverage mutual information to the reconstruction of ancestral protein states.
Availability: github.com/OrensteinLab/DeCoDe
Supplementary information: Supplementary data are available at Bioinformatics online.
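To make the notion of a degenerate codon concrete, the sketch below expands an IUPAC-coded codon into its concrete codons and the residues they encode under the standard genetic code. It only illustrates what a DC is. It is not DeCoDe's integer linear programming method, and the function names are hypothetical.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand(degenerate_codon):
    """List every concrete codon represented by a three-letter degenerate codon."""
    choices = (IUPAC[base] for base in degenerate_codon.upper())
    return ["".join(codon) for codon in product(*choices)]


def encoded_residues(degenerate_codon):
    """Set of amino acids (and "*" for stop) that the degenerate codon can encode."""
    return {CODON_TABLE[codon] for codon in expand(degenerate_codon)}


nnk_codons = expand("NNK")            # 4 * 4 * 2 = 32 codons
nnk_residues = encoded_residues("NNK")
```

The widely used NNK codon covers all 20 amino acids with 32 codons and a single stop codon (TAG), which shows both the appeal of DCs and the problem the abstract raises: one synthesis reaction yields every residue, including ones that may not be wanted at that position.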
Categories: Bioinformatics Trends

Exploring High-Dimensional Biological Data with Sparse Contrastive Principal Component Analysis

Bioinformatics Oxford Journals - Mon, 16/03/2020 - 5:30am
Motivation: Statistical analyses of high-throughput sequencing data have re-shaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances; however, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously.
Results: Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis, sparse contrastive principal component analysis, that extracts sparse, stable, interpretable, and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study as well as via analyses of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets.
Availability: A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in the paper is also available via GitHub.
Categories: Bioinformatics Trends
