Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

Bioinformatics Oxford Journals - Tue, 28/03/2023 - 5:30am
Abstract
Motivation: We have entered the multi-omics era and can measure cells from different aspects. Hence, we can get a more comprehensive view by integrating or matching data from different spaces corresponding to the same object. However, this is particularly challenging in the single-cell multi-omics scenario because such data are very sparse with extremely high dimensions. Though some techniques can be used to measure scATAC-seq and scRNA-seq simultaneously, the data are usually highly noisy due to the limitations of the experimental environment.
Results: To promote single-cell multi-omics research, we overcome the above challenges, proposing a novel framework, contrastive cycle adversarial autoencoders (Con-AAE), which can align and integrate single-cell RNA-seq data and single-cell ATAC-seq data. Con-AAE can efficiently map the above data with high sparsity and noise from different spaces to a coordinated subspace, where alignment and integration tasks can be easier. We demonstrate its advantages on several datasets.
Availability: Zenodo: https://zenodo.org/badge/latestdoi/368779433; GitHub: https://github.com/kakarotcq/Con-AAE
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

Bioinformatics Oxford Journals - Mon, 27/03/2023 - 5:30am
Abstract
Motivation: Sequence alignment is a memory-bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory architectures alleviate this bottleneck by providing the memory with computing capabilities. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using processing-in-memory, and evaluate it on UPMEM, the first publicly available general-purpose programmable processing-in-memory system.
Results: Our evaluation shows that a real processing-in-memory system can substantially outperform server-grade multi-threaded CPU systems running at full scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real processing-in-memory systems.
Availability: Our code is available at https://github.com/safaad/aim.
Categories: Bioinformatics Trends
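
The AIM abstract above evaluates alignment under different edit distance thresholds. As a rough illustration of what such a threshold buys, the sketch below computes a banded Levenshtein distance in plain Python: only cells within k of the diagonal are filled, so the work per read pair scales with k rather than with the full read length. This is a generic textbook technique, not code from the AIM repository, and the function name `bounded_edit_distance` is made up for this example.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Return the Levenshtein distance of a and b if it is <= k, else None."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already exceeds the threshold
    inf = k + 1  # any value above k is treated as "too far"
    # prev[j] is the distance between a[:i-1] and b[:j], capped at inf.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        if i <= k:
            cur[0] = i  # deleting the first i characters of a
        # Only columns within k of the diagonal can hold a value <= k.
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j - 1] + cost,  # match or substitution
                       prev[j] + 1,         # deletion from a
                       cur[j - 1] + 1)      # insertion into a
            cur[j] = min(best, inf)
        prev = cur
    return prev[m] if prev[m] <= k else None


within = bounded_edit_distance("kitten", "sitting", 3)
too_far = bounded_edit_distance("kitten", "sitting", 2)
```

A pair that exceeds the threshold returns `None`, which lets a caller discard candidate alignments early.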

Unmixing Biological Fluorescence Image Data with Sparse and Low-Rank Poisson Regression

Bioinformatics Oxford Journals - Sat, 25/03/2023 - 5:30am
Abstract
Motivation: Multispectral biological fluorescence microscopy has enabled the identification of multiple targets in complex samples. The accuracy of the unmixing result degrades (1) as the number of fluorophores used in any experiment increases and (2) as the signal-to-noise ratio in the recorded images decreases. Further, the availability of prior knowledge regarding the expected spatial distributions of fluorophores in images of labeled cells provides an opportunity to improve the accuracy of fluorophore identification and abundance.
Results: We propose a regularized sparse and low-rank Poisson unmixing approach (SL-PRU) to deconvolve spectral images labeled with highly overlapping fluorophores which are recorded in low signal-to-noise regimes. Firstly, SL-PRU implements multi-penalty terms when pursuing sparseness and spatial correlation of the resulting abundances in small neighborhoods simultaneously. Secondly, SL-PRU makes use of Poisson regression for unmixing instead of least squares regression to better estimate photon abundance. Thirdly, we propose a method to tune the SL-PRU parameters involved in the unmixing procedure in the absence of knowledge of the ground truth abundance information in a recorded image. By validating on simulated and real-world images, we show that our proposed method leads to improved accuracy in unmixing fluorophores with highly overlapping spectra.
Availability and implementation: The source code used for this paper was written in MATLAB and is available with the test data at https://github.com/WANGRUOGU/SL-PRU
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Integrative Analysis of Individual-Level Data and High-Dimensional Summary Statistics

Bioinformatics Oxford Journals - Sat, 25/03/2023 - 5:30am
Abstract
Motivation: Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers’ marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters.
Results: We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers.
Availability and implementation: R package is available at https://github.com/fushengstat/MetaGIM.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

De Novo Drug Design by Iterative Multi-Objective Deep Reinforcement Learning with Graph-based Molecular Quality Assessment

Bioinformatics Oxford Journals - Fri, 24/03/2023 - 5:30am
Abstract
Motivation: Generating molecules of high quality and drug-likeness in the vast chemical space is a big challenge in drug discovery. Most existing molecule generative methods focus on diversity and novelty of molecules, but ignore the drug potentials of the generated molecules during the generation process.
Results: In this study, we present a novel de novo multi-objective quality assessment-based drug design approach, QADD, which integrates an iterative refinement framework with a novel graph-based molecular quality assessment model on drug potentials. QADD designs a multi-objective deep reinforcement learning pipeline to generate molecules with multiple desired properties iteratively, where a graph neural network-based model for accurate molecular quality assessment on drug potentials is introduced to guide molecule generation. Experimental results show that QADD can jointly optimize multiple molecular properties with a promising performance and that the quality assessment module is capable of guiding the generation of molecules with high drug potentials. Furthermore, applying QADD to generate novel molecules binding to a biological target protein DRD2 also demonstrates the algorithm’s efficacy.
Availability: QADD is freely available online for academic use at https://github.com/yifang000/QADD or http://www.csbio.sjtu.edu.cn/bioinf/QADD.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HiCube: Interactive visualization of multiscale and multimodal Hi-C and 3D genome data

Bioinformatics Oxford Journals - Fri, 24/03/2023 - 5:30am
Abstract
Summary: HiCube is a lightweight web application for interactive visualization and exploration of diverse types of genomics data at multiscale resolutions. In particular, HiCube displays synchronized views of Hi-C contact maps and three-dimensional (3D) genome structures with user-friendly annotation and configuration tools, thereby facilitating the study of 3D genome organization and function.
Availability and implementation: HiCube is implemented in JavaScript and can be installed via NPM. The source code is freely available at GitHub (https://github.com/wmalab/HiCube).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs

Bioinformatics Oxford Journals - Fri, 24/03/2023 - 5:30am
Abstract
Motivation: Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work.
Results: We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1×, 1.7×, and 2.1× speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0×, 80.4×, 6.8×, 12.6×, and 5.9× speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6× less chip area and 2.1× less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge.
Availability and implementation: https://github.com/CMU-SAFARI/Scrooge
Categories: Bioinformatics Trends

nf-core/isoseq: Simple gene and isoform annotation with PacBio Iso-Seq long-read sequencing

Bioinformatics Oxford Journals - Fri, 24/03/2023 - 5:30am
Abstract
Motivation: Iso-Seq RNA long-read sequencing enables the identification of full-length transcripts and isoforms, removing the need for complex analysis such as transcriptome assembly. However, the raw sequencing data need to be processed in a series of steps before annotation is complete. Here, we present nf-core/isoseq, a pipeline for automatic read processing and genome annotation. Following nf-core guidelines, the pipeline has few dependencies and can be run on any platform.
Availability: The pipeline is freely available online on the nf-core website (https://nf-co.re/isoseq) and on GitHub (https://github.com/nf-core/isoseq) under the MIT License (DOI: 10.5281/zenodo.7116979).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Foldcomp: a library and format for compressing and indexing large protein structure sets

Bioinformatics Oxford Journals - Fri, 24/03/2023 - 5:30am
Abstract
Summary: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here we present Foldcomp, a novel lossy structure compression algorithm and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of 3 compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is 5 times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analyzing large collections of protein structures.
Availability: Foldcomp is free open-source software (GPLv3) and is available for Linux, macOS and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB) and ESMatlas HQ (114GB) databases ready for download.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
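
The Foldcomp abstract above mentions storing internal coordinates and rebuilding Cartesian ones with a NeRF-based strategy. Assuming NeRF here means the natural extension reference frame method, the sketch below shows its core step with NumPy: placing one atom from a bond length, a bond angle and a torsion angle relative to the three preceding atoms. The function name, the toy coordinates and the angle values are illustrative assumptions; this is not Foldcomp's implementation and it does not cover the bi-directional part.

```python
import numpy as np


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d so that |cd| = bond_length, angle(b, c, d) = bond_angle
    and the dihedral a-b-c-d has magnitude |torsion| (angles in radians)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    bc = (c - b) / np.linalg.norm(c - b)   # unit vector along the b -> c bond
    n = np.cross(b - a, bc)                # normal of the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)                    # completes the local frame (bc, m, n)
    # Coordinates of d in the local frame centred on c.
    local = np.array([
        -bond_length * np.cos(bond_angle),
        bond_length * np.sin(bond_angle) * np.cos(torsion),
        bond_length * np.sin(bond_angle) * np.sin(torsion),
    ])
    return c + local[0] * bc + local[1] * m + local[2] * n


# Toy backbone fragment (made-up coordinates, not taken from any structure).
a = np.array([0.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 0.0])
c = np.array([1.5, 0.0, 0.0])
d = place_atom(a, b, c, bond_length=1.33, bond_angle=2.0, torsion=np.pi)
```

Because each new atom is placed relative to already rebuilt atoms, small rounding errors in stored angles can accumulate along a chain, which is one reason a compressor might anchor the reconstruction from more than one direction.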

Flex Meta-Storms elucidates the microbiome local beta-diversity under specific phenotypes

Bioinformatics Oxford Journals - Wed, 22/03/2023 - 5:30am
Abstract
Motivation: Beta-diversity quantitatively measures the difference among microbial communities, thus shedding light on the association between microbiome composition and environment properties or host phenotypes. Beta-diversity analysis mainly relies on distances among microbiomes that are calculated from all microbial features. However, in some cases, only a small fraction of members in a community plays crucial roles. Such a tiny proportion is insufficient to alter the overall distance, so it is always missed by end-to-end comparison. On the other hand, the beta-diversity pattern can also be distorted by data sparsity when focusing only on non-abundant microbes.
Results: Here we develop the Flex Meta-Storms (FMS) distance algorithm, which implements the “local alignment” of microbiomes for the first time. Using a flexible extraction that considers the weighted phylogenetic and functional relations of microbes, FMS produces a normalized phylogenetic distance among members of interest for microbiome pairs. We demonstrated the advantage of FMS in detecting the subtle variations of microbiomes among different states using artificial and real datasets, which were neglected by regular distance metrics. Therefore, FMS effectively discriminates microbiomes with higher sensitivity and flexibility, thus contributing to in-depth comprehension of microbe-host interactions, as well as promoting the utilization of microbiome data in applications such as disease screening and prediction.
Availability: FMS is implemented in C++ and the source code is released at https://github.com/qdu-bioinfo/flex-meta-storms.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
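
The Flex Meta-Storms abstract above argues that a change confined to a few low-abundance members barely moves a whole-community distance. The toy example below makes that point with the Bray-Curtis dissimilarity, which is swapped in here purely for illustration: it is not the FMS distance, which the abstract describes as phylogeny-aware. The abundance values and the choice of "members of interest" are invented.

```python
import numpy as np


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()


# Two hypothetical samples: three dominant taxa are identical,
# two rare taxa (indices 3 and 4) swap their abundances.
sample_a = np.array([500, 300, 190, 8, 2])
sample_b = np.array([500, 300, 190, 2, 8])
members_of_interest = [3, 4]

global_distance = bray_curtis(sample_a, sample_b)
local_distance = bray_curtis(sample_a[members_of_interest],
                             sample_b[members_of_interest])
```

On these numbers the whole-community distance is 12/2000 while the distance restricted to the two rare taxa is 12/20, so the same shift is a hundred times more visible locally.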

A high-performance deep-learning-based pipeline for whole-brain vasculature segmentation at the capillary resolution

Bioinformatics Oxford Journals - Wed, 22/03/2023 - 5:30am
Abstract
Motivation: Reconstructing and analyzing all blood vessels throughout the brain is significant for understanding brain function, revealing the mechanisms of brain disease, and mapping the whole-brain vascular atlas. Vessel segmentation is a fundamental step in reconstruction and analysis. The whole-brain optical microscopic imaging method enables the acquisition of whole-brain vessel images at the capillary resolution. Due to the massive amount of data and the complex vascular features generated by high-resolution whole-brain imaging, achieving rapid and accurate segmentation of whole-brain vasculature becomes a challenge.
Results: We introduce HP-VSP, a high-performance vessel segmentation pipeline based on deep learning. The pipeline consists of three processes: data blocking, block prediction, and block fusion. We used parallel computing to parallelize this pipeline to improve the efficiency of whole-brain vessel segmentation. We also designed a lightweight deep neural network based on multi-resolution vessel feature extraction to segment vessels at different scales throughout the brain accurately. We validated our approach on whole-brain vascular data from three transgenic mice collected by HD-fMOST. The results show that our proposed segmentation network achieves the state-of-the-art level under various evaluation metrics. In contrast, the parameters of the network are only 1% of those of similar networks. The established segmentation pipeline could be used on various computing platforms and complete the whole-brain vessel segmentation in 3 hours. We also demonstrated that our pipeline could be applied to vascular analysis.
Availability: The dataset is available at http://atlas.brainsmatics.org/a/li2301. The source code is freely available at https://github.com/visionlyx/HP-VSP.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
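
The HP-VSP abstract above names three pipeline stages: data blocking, block prediction, and block fusion. The sketch below shows that control flow on a small NumPy volume. It makes strong simplifying assumptions: blocks do not overlap, processing is sequential, and the "prediction" is a plain intensity threshold standing in for the neural network. All function names and the block size are hypothetical and unrelated to the HP-VSP code.

```python
import numpy as np


def split_into_blocks(volume, block):
    """Yield (slices, sub-volume) pairs that tile a 3D array without overlap."""
    bz, by, bx = block
    for z in range(0, volume.shape[0], bz):
        for y in range(0, volume.shape[1], by):
            for x in range(0, volume.shape[2], bx):
                # Slices running past the array edge are clipped by NumPy.
                sl = (slice(z, z + bz), slice(y, y + by), slice(x, x + bx))
                yield sl, volume[sl]


def predict_block(sub_volume, threshold=0.5):
    """Stand-in for the segmentation network: a simple intensity threshold."""
    return sub_volume > threshold


def segment_volume(volume, block=(4, 4, 4)):
    """Block the volume, predict each block, and fuse results into one mask."""
    mask = np.zeros(volume.shape, dtype=bool)
    for sl, sub_volume in split_into_blocks(volume, block):
        mask[sl] = predict_block(sub_volume)
    return mask


rng = np.random.default_rng(0)
volume = rng.random((10, 9, 7))   # synthetic data, not a brain image
mask = segment_volume(volume)
```

Since each block is predicted independently, the loop body is the natural unit to hand to parallel workers; a real network would usually also need overlapping block borders so that vessels crossing a boundary are segmented consistently.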

OutSingle: A Novel Method of Detecting and Injecting Outliers in RNA-seq Count Data Using the Optimal Hard Threshold for Singular Values

Bioinformatics Oxford Journals - Wed, 22/03/2023 - 5:30am
Abstract
Motivation: Finding outliers in RNA Sequencing (RNA-Seq) gene expression (GE) can help in identifying genes that are aberrant and cause Mendelian disorders. Recently developed models for this task rely on modeling RNA-Seq GE data using the Negative Binomial distribution (NBD). However, some of those models rely on procedures for inferring NBD’s parameters in a non-biased way that are computationally demanding, and thus make confounder control challenging, while others rely on less computationally demanding but biased procedures and convoluted confounder control approaches that hinder interpretability.
Results: In this paper we present OutSingle (Outlier detection using Singular Value Decomposition), an almost instantaneous way of detecting outliers in RNA-Seq GE data. It uses a simple log-normal approach for count modeling. For confounder control it uses the recently discovered Optimal Hard Threshold (OHT) method for noise detection, which itself is based on Singular Value Decomposition (SVD). Due to its SVD/OHT utilization, OutSingle’s model is straightforward to understand and interpret. We then show that our novel method, when used on RNA-Seq GE data with real biological outliers masked by confounders, outcompetes the previous state-of-the-art model based on an ad hoc denoising autoencoder (AE). Additionally, OutSingle can be used to inject artificial outliers masked by confounders, which is difficult to achieve with previous approaches. We describe a way of using OutSingle for outlier injection and proceed to show how OutSingle outperforms its competition on 16 out of 18 datasets that were generated from 3 real datasets using OutSingle’s injection procedure with different outlier types and magnitudes. Our methods are applicable to other types of similar problems involving finding outliers in matrices under the presence of confounders.
Availability: The code for OutSingle is available at https://github.com/esalkovic/outsingle
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
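
The OutSingle abstract above relies on the Optimal Hard Threshold for singular values. The sketch below applies the commonly cited unknown-noise form of that rule: keep singular values above ω(β) times the median singular value, with β the aspect ratio of the matrix and ω(β) ≈ 0.56β³ − 0.95β² + 1.82β + 1.43. It runs on a synthetic low-rank-plus-noise matrix and only estimates how many components stand out from the noise; how OutSingle transforms counts and uses the retained components is not reproduced, and the function names are assumptions.

```python
import numpy as np


def omega(beta):
    """Polynomial approximation of the threshold coefficient for unknown noise."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43


def oht_rank(matrix):
    """Return (number of singular values above the hard threshold, threshold)."""
    rows, cols = matrix.shape
    beta = min(rows, cols) / max(rows, cols)          # aspect ratio in (0, 1]
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    tau = omega(beta) * np.median(singular_values)
    return int(np.sum(singular_values > tau)), tau


# Synthetic test matrix: rank-2 "confounder" signal plus unit-variance noise.
rng = np.random.default_rng(0)
rows, cols = 200, 50
left, _ = np.linalg.qr(rng.standard_normal((rows, 2)))    # orthonormal columns
right, _ = np.linalg.qr(rng.standard_normal((cols, 2)))
signal = left @ np.diag([200.0, 150.0]) @ right.T
noisy = signal + rng.standard_normal((rows, cols))

rank, tau = oht_rank(noisy)
```

The appeal of the rule is that it needs no tuning parameter: the median singular value serves as the noise scale, so the same function can be applied to matrices of different sizes and noise levels.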

Seqpac: A framework for sRNA-seq analysis in R using sequence-based counts

Bioinformatics Oxford Journals - Tue, 21/03/2023 - 5:30am
Abstract
Motivation: Feature-based counting is commonly used in RNA-sequencing (RNA-seq) analyses. Here, sequences must align to target features (like genes or non-coding RNAs) and related sequences with different compositions are counted into the same feature. Consequently, sequence integrity is lost, making results less traceable against raw data.
Implementation: Small RNA (sRNA) often maps to multiple features and shows an incredible diversity in form and function. Therefore, applying feature-based strategies may increase the risk of misinterpretation. We present a strategy for sRNA-seq analysis that preserves the integrity of the raw sequence, making the data lineage fully traceable. We have consolidated this strategy into Seqpac: an R package that makes a complete sRNA analysis available on multiple platforms. Using published biological data, we show that Seqpac reveals hidden bias and adds new insights to studies that were previously analyzed using feature-based counting.
Conclusions: We have identified limitations in the concurrent analysis of RNA-seq data. We call it the traceability dilemma in alignment-based sequencing strategies. By building a flexible framework that preserves the integrity of the read sequence throughout the analysis, we demonstrate better interpretability in sRNA-seq experiments, which are particularly vulnerable to this problem. Applying similar strategies to other transcriptomic workflows may aid in resolving the replication crisis experienced by many fields that depend on transcriptome analyses.
Availability: Seqpac is available on Bioconductor (https://bioconductor.org/packages/seqpac) and GitHub (https://github.com/danis102/seqpac).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Phenonaut; multiomics data integration for phenotypic space exploration

Bioinformatics Oxford Journals - Tue, 21/03/2023 - 5:30am
Abstract
Summary: Data integration workflows for multiomics data take many forms across academia and industry. Efforts with limited resources often encountered in academia can easily fall short of data integration best practices for processing and combining high content imaging, proteomics, metabolomics and other omics data. We present Phenonaut, a Python software package designed to address the data workflow needs of migration, control, integration, and auditability in the application of literature and proprietary techniques for data source and structure agnostic workflow creation.
Availability and implementation: Source code: https://github.com/CarragherLab/phenonaut, Documentation: https://carragherlab.github.io/phenonaut, PyPI package: https://pypi.org/project/phenonaut/
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Effective and Efficient Active Learning for Deep Learning Based Tissue Image Analysis

Bioinformatics Oxford Journals - Tue, 21/03/2023 - 5:30am
Abstract
Motivation: Deep learning has recently attained excellent results in Digital Pathology. A challenge with its use is that high-quality, representative training data sets are required to build robust models. Data annotation in the domain is labor intensive and demands substantial time commitment from expert pathologists. Active Learning (AL) is a strategy to minimize annotation. The goal is to select samples from the pool of unlabeled data for annotation that improve model accuracy. However, AL is a very compute-demanding approach. The benefits for model learning may vary according to the strategy used, and it may be hard for a domain specialist to fine-tune the solution without an integrated interface.
Results: We developed a framework that includes a friendly user interface along with run-time optimizations to reduce annotation and execution time in AL in digital pathology. Our solution implements several AL strategies along with our Diversity-Aware Data Acquisition (DADA) acquisition function, which enforces data diversity to improve the prediction performance of a model. In this work, we employed a model simplification strategy (Network Auto-Reduction (NAR)) that significantly improves AL execution time when coupled with DADA. NAR produces less compute-demanding models, which replace the target models during the AL process to reduce processing demands. An evaluation with a Tumor-Infiltrating Lymphocytes (TILs) classification application shows that: (i) DADA attains superior performance compared to state-of-the-art AL strategies for different Convolutional Neural Networks (CNNs), (ii) NAR improves the AL execution time by up to 4.3×, and (iii) target models trained with patches/data selected by the NAR-reduced versions achieve similar or superior classification quality to using target CNNs for data selection.
Availability: Source code: https://github.com/alsmeirelles/DADA
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

Bioinformatics Oxford Journals - Tue, 21/03/2023 - 5:30am
Abstract
Motivation: Accurate power and sample size estimation is crucial to the design and analysis of genetic association studies. When analyzing a binary trait via logistic regression, important covariates such as age and sex are typically included in the model. However, their effects are rarely properly considered in power or sample size computation during study planning. Unlike when analyzing a continuous trait, the power of association testing between a binary trait and a genetic variant depends, explicitly, on covariate effects, even under the assumption of gene-environment independence. Earlier work recognizes this hidden factor but the implemented methods are not flexible.
Method: We thus propose and implement a generalized method for estimating power and sample size for (discovery or replication) association studies of binary traits that (a) accommodates different types of non-genetic covariates E, (b) deals with different types of G-E relationships, and (c) is computationally efficient.
Results: Extensive simulation studies show that the proposed method is accurate and computationally efficient for both prospective and retrospective sampling designs with various covariate structures. A proof-of-principle application focused on the understudied African sample in the UK Biobank data. Results show that, in contrast to studying the continuous blood pressure trait, when analyzing the binary hypertension trait, ignoring covariate effects of age and sex leads to overestimated power and underestimated replication sample size.
Availability: The simulated datasets can be found on the online web page of this manuscript, and the UK Biobank application data can be accessed at https://www.ukbiobank.ac.uk. The R package SPCompute that implements the proposed method is available at CRAN. The genome-wide association studies are carried out using the software PLINK 2.0 (Purcell et al., 2007).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

IntestLine: a Shiny-based application to map the rolled intestinal tissue onto a line

Bioinformatics Oxford Journals - Tue, 21/03/2023 - 5:30am
Abstract
Summary: To allow the comprehensive histological analysis of the whole intestine, it is often rolled to a spiral before imaging. This Swiss-rolling technique facilitates robust experimental procedures, but it limits the possibilities to comprehend changes along the intestine. Here, we present IntestLine, a Shiny-based open-source application for processing imaging data of (rolled) intestinal tissues and subsequent mapping onto a line. The visualization of the mapped data facilitates the assessment of the whole intestine in both proximal-distal and serosa-luminal axis, and enables the observation of location-specific cell types and markers. Accordingly, IntestLine can serve as a tool to characterize intestine in multi-modal imaging studies.
Availability: Source code can be found at Zenodo (https://doi.org/10.5281/zenodo.7081864) and GitHub (https://github.com/SchlitzerLab/IntestLine).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

An expectation-maximization framework for comprehensive prediction of isoform-specific functions

Bioinformatics Oxford Journals - Fri, 17/03/2023 - 5:30am
Abstract
Motivation: Advances in RNA sequencing technologies have achieved an unprecedented accuracy in the quantification of mRNA isoforms, but our knowledge of isoform-specific functions has lagged behind. There is a need to understand the functional consequences of differential splicing, which could be supported by the generation of accurate and comprehensive isoform-specific Gene Ontology (GO) annotations.
Results: We present Isopret (Isoform Interpretation), a method that uses expectation-maximization to infer isoform-specific functions based on the relationship between sequence and functional isoform similarity. We predicted isoform-specific functional annotations for 85,617 isoforms of 17,900 protein-coding human genes spanning a range of 17,430 distinct GO terms. Comparison with a gold-standard corpus of manually annotated human isoform functions showed that isopret significantly outperforms state-of-the-art competing methods. We provide experimental evidence that functionally related isoforms predicted by isopret show a higher degree of domain sharing and expression correlation than functionally related genes. We also show that isoform sequence similarity correlates better with inferred isoform function than with gene level function.
Availability and implementation: Source code, documentation, and resource files are freely available under a GNU3 license at https://github.com/TheJacksonLaboratory/isopretEM and https://zenodo.org/record/7594321.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

scMCs: a framework for single cell multi-omics data integration and multiple clusterings

Bioinformatics Oxford Journals - Fri, 17/03/2023 - 5:30am
Abstract
Motivation: The integration of single-cell multi-omics data can uncover the underlying regulatory basis of diverse cell types and states. However, contemporary methods disregard the omics individuality, and the high noise, sparsity, and heterogeneity of single-cell data also impact the fusion effect. Furthermore, available single-cell clustering methods only focus on cell type clustering and cannot mine alternative clusterings to comprehensively analyze cells.
Results: We propose a single-cell data fusion based multiple clustering (scMCs) approach that can jointly model single-cell transcriptomics and epigenetic data, and explore multiple different clusterings. scMCs first mines the omics-specific and cross-omics consistent representations, then fuses them into a co-embedding representation, which can dissect cellular heterogeneity and impute data. To discover the potential alternative clustering embedded in multi-omics, scMCs projects the co-embedding representation into different salient subspaces. Meanwhile, it reduces the redundancy between subspaces to enhance the diversity of alternative clusterings and optimizes the cluster centers in each subspace to boost the quality of corresponding clustering. Unlike single clustering, these alternative clusterings provide additional perspectives for understanding complex genetic information such as cell types and states. Experimental results show that scMCs can effectively identify subcellular types, impute dropout events, and uncover diverse cell characteristics by giving different but meaningful clusterings.
Availability: The code is available at www.sdu-idea.cn/codes.php?name=scMCs.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

DeepOM: Single-molecule optical genome mapping via deep learning

Bioinformatics Oxford Journals - Fri, 17/03/2023 - 5:30am
Abstract
Motivation: Efficient tapping into genomic information from a single microscopic image of an intact DNA molecule is an outstanding challenge and its solution will open new frontiers in molecular diagnostics. Here, a new computational method for optical genome mapping utilizing Deep Learning is presented, termed DeepOM. Utilization of a Convolutional Neural Network (CNN), trained on simulated images of labeled DNA molecules, improves the success rate in alignment of DNA images to genomic references.
Results: The method is evaluated on acquired images of human DNA molecules stretched in nano-channels. The accuracy of the method is benchmarked against state-of-the-art commercial software Bionano Solve. The results show a significant advantage in alignment success rate for molecules shorter than 50 kb. DeepOM improves yield, sensitivity and throughput of optical genome mapping experiments in applications of human genomics and microbiology.
Availability and implementation: The source code for the presented method is publicly available at https://github.com/yevgenin/DeepOM.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends
