An improved rhythmicity analysis method using Gaussian Processes detects cell-density dependent circadian oscillations in stem cells
Abstract
Motivation: Detecting oscillations in time series remains a challenging problem even after decades of research. In chronobiology, rhythms (for instance in gene expression, eclosion, egg-laying and feeding) tend to be low amplitude, display large variations amongst replicates, and often exhibit varying peak-to-peak distances (non-stationarity). Most currently available rhythm detection methods are not specifically designed to handle such datasets, and are also limited by their use of p-values in detecting oscillations.
Results: We introduce a new method, ODeGP (Oscillation Detection using Gaussian Processes), which combines Gaussian Process (GP) regression and Bayesian inference to incorporate measurement errors, non-uniformly sampled data, and a recently developed non-stationary kernel to improve detection of oscillations. By using Bayes factors, ODeGP models both the null (non-rhythmic) and the alternative (rhythmic) hypotheses, thus providing an advantage over p-values. Using synthetic datasets, we first demonstrate that ODeGP almost always outperforms eight commonly used methods in detecting stationary as well as non-stationary symmetric oscillations. Next, by analyzing existing qPCR datasets, we demonstrate that our method is more sensitive than existing methods at detecting weak and noisy oscillations. Finally, we generate new qPCR data on mouse embryonic stem cells. Surprisingly, we discover using ODeGP that increasing cell density results in rapid generation of oscillations in the Bmal1 gene, thus highlighting our method’s ability to discover unexpected and new patterns. In its current implementation, ODeGP is meant only for analyzing single or a few time-trajectories, not genome-wide datasets.
Availability and implementation: ODeGP is available at https://github.com/Shaonlab/ODeGP
Supplementary information: Supplementary data are available at Journal Name online.
Categories: Bioinformatics Trends
Design of optimal labeling patterns for optical genome mapping via information theory
Abstract
Motivation: Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available bio-chemical methods, and is not necessarily optimized for the application.
Results: In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM bio-chemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.
Availability and implementation: https://github.com/yevgenin/PatternCode
Categories: Bioinformatics Trends
Balancing Biomass Reaction Stoichiometry and Measured Fluxes in Flux Balance Analysis
Abstract
Motivation: Flux Balance Analysis (FBA) is widely recognized as an important method for studying metabolic networks. When incorporating flux measurements of certain reactions into an FBA problem, the underlying linear program may become infeasible, for example, due to measurement or modeling inaccuracies. Furthermore, while the biomass reaction is of central importance in FBA models, its stoichiometry is often a rough estimate and a source of high uncertainty.
Results: In this work, we present a method that allows modifications to the biomass reaction stoichiometry as a means to (i) render the FBA problem feasible and (ii) improve the accuracy of the model by corrections in the biomass composition. Optionally, the adjustment of the biomass composition can be used in conjunction with a previously introduced approach for balancing inconsistent fluxes to obtain a feasible FBA system. We demonstrate the value of our approach by analyzing realistic flux measurements of E. coli. In particular, we find that the growth-associated maintenance (GAM) demand of ATP, which is typically integrated in the biomass reaction, is likely overestimated in recent genome-scale models, at least for certain growth conditions. In light of these findings, we discuss issues related to determination and inclusion of GAM values in constraint-based models. Overall, our method can uncover potential errors and suggest adjustments in the assumed biomass composition in FBA models based on inconsistencies between model and measured fluxes.
Availability: The developed method has been implemented in our software tool CNApy, available from github.com/cnapy-org/CNApy.
Supplementary information: Supplementary data can be found at https://github.com/cnapy-org.
Categories: Bioinformatics Trends
compleasm: a faster and more accurate reimplementation of BUSCO
Abstract
Motivation: Evaluating gene completeness is critical to measuring the quality of a genome assembly. An incomplete assembly can lead to errors in gene predictions, annotation, and other downstream analyses. BUSCO is a widely used tool for assessing the completeness of a genome assembly by testing the presence of a set of single-copy orthologs conserved across a wide range of taxa. However, BUSCO is slow, particularly for large genome assemblies, and it is cumbersome to apply BUSCO to a large number of assemblies.
Results: Here, we present compleasm, an efficient tool for assessing the completeness of genome assemblies. Compleasm utilizes the miniprot protein-to-genome aligner and the conserved orthologous genes from BUSCO. It is 14 times faster than BUSCO for human assemblies and reports a more accurate completeness of 99.6% than BUSCO’s 95.7%, which is in close agreement with the annotation completeness of 99.5% for T2T-CHM13.
Availability: https://github.com/huangnengCSU/compleasm
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Extending protein interaction networks using proteoforms and small molecules
Abstract
Motivation: Biological network analysis for high-throughput biomedical data interpretation relies heavily on topological characteristics. Networks are commonly composed of nodes representing genes or proteins that are connected by edges when interacting. In this study, we use the rich information available in the Reactome pathway database to build biological networks accounting for small molecules and proteoforms modeled using protein isoforms and post-translational modifications to study the topological changes induced by this refinement of the network representation.
Results: We find that improving the interactome modeling increases the number of nodes and interactions, but that isoform and post-translational modification annotation is still limited compared to what can be expected biologically. We also note that small molecule information can distort the topology of the network due to the high connectedness of these molecules, which does not necessarily represent the reality of biology. However, by restricting the connections of small molecules to the context of biochemical reactions, we find that these improve the overall connectedness of the network and reduce the prevalence of isolated components and nodes. Overall, changing the representation of the network alters the prevalence of articulation points and bridges globally but also within and across pathways. Hence, some molecules can gain or lose in biological importance depending on the level of detail of the representation of the biological system, which might in turn impact network-based studies of diseases or druggability.
Availability: Networks are constructed based on data publicly available in the Reactome Pathway knowledgebase: reactome.org
Supplementary information: The networks produced by this study are available at the public repository: github.com/PathwayAnalysisPlatform/Networks.
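Illustrative note: the abstract above reports that changing the network representation alters the prevalence of articulation points and bridges. The sketch below shows what those two quantities are on a toy graph, using a standard depth-first search (Tarjan-style low-link values). It is not the authors' code; the node names (`P1`..`P4`, `ATP`) and both toy networks are made up for illustration, and the recursive search is only suitable for small graphs.

```python
from collections import defaultdict


def articulation_points_and_bridges(edges):
    """Return (articulation_points, bridges) of an undirected simple graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low = {}, {}          # discovery time and low-link value per node
    points, bridges = set(), set()
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: node can reach an earlier-discovered vertex.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > disc[node]:
                    # Removing this edge disconnects nxt's subtree.
                    bridges.add(frozenset((node, nxt)))
                if parent is not None and low[nxt] >= disc[node]:
                    # Removing node disconnects nxt's subtree.
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)    # a DFS root is a cut vertex iff it has 2+ subtrees

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points, bridges


# Hypothetical protein-only chain: every inner node is a cut vertex.
protein_only = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
# Same chain plus a highly connected small molecule (hypothetical hub).
with_hub = protein_only + [("ATP", p) for p in ("P1", "P2", "P3", "P4")]

print(articulation_points_and_bridges(protein_only))
print(articulation_points_and_bridges(with_hub))
```

In this toy case the protein-only chain has two articulation points and three bridges, while adding a single hub molecule removes all of them, which is one way a highly connected small molecule can reshape the topology.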
Categories: Bioinformatics Trends
ScribbleDom: Using Scribble-Annotated Histology Images to Identify Domains in Spatial Transcriptomics Data
Abstract
Motivation: Spatial domain identification is a very important problem in the field of Spatial Transcriptomics (ST). The state-of-the-art solutions to this problem focus on unsupervised methods, as there is a lack of data for a supervised learning formulation. The results obtained from these methods highlight significant opportunities for improvement.
Results: In this paper, we propose a potential avenue for enhancement through the development of a semi-supervised convolutional neural network (CNN) based approach. Named ScribbleDom, our method leverages human experts’ input as a form of semi-supervision, thereby seamlessly combining the cognitive abilities of human experts with the computational power of machines. ScribbleDom incorporates a loss function that integrates two crucial components: similarity in gene expression profiles and adherence to the valuable input of a human annotator through scribbles on histology images, providing prior knowledge about spot labels. The spatial continuity of the tissue domains is taken into account by extracting information on the spot micro-environment through convolution filters of varying sizes, in the form of Inception blocks. By leveraging this semi-supervised approach, ScribbleDom significantly improves the quality of spatial domains, yielding superior results both quantitatively and qualitatively. Our experiments on several benchmark datasets demonstrate the clear edge of ScribbleDom over state-of-the-art methods: improvements of between 1.82% and 169.38% in Adjusted Rand Index (ARI) for 9 of the 12 Human DLPFC samples, and a 15.54% improvement in the Melanoma cancer dataset. Notably, when the expert input is absent, ScribbleDom can still operate in a fully unsupervised manner, like the state-of-the-art methods, and produces results that remain competitive.
Availability: Source code is available at GitHub (https://github.com/1alnoman/ScribbleDom) and Zenodo (https://zenodo.org/badge/latestdoi/681572669).
Supplementary information: Supplementary data are available at Bioinformatics online.
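Illustrative note: the improvements quoted above are expressed in Adjusted Rand Index (ARI), which compares two labelings of the same spots while correcting for chance agreement. A minimal sketch of the standard ARI formula follows, together with a relative-improvement helper. This is not ScribbleDom's evaluation code; the label vectors and the helper name `relative_improvement` are assumptions made for the example, and how the abstract's percentages were computed is not stated in the text.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    # Pairs placed together in both labelings, in the truth, and in the prediction.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0                                # degenerate case: trivial partitions
    return (sum_pairs - expected) / (max_index - expected)


def relative_improvement(new_score, baseline_score):
    """Percentage change of new_score relative to baseline_score (hypothetical helper)."""
    return 100.0 * (new_score - baseline_score) / baseline_score


# Made-up spot labels: ARI ignores label names, only the grouping matters.
truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))   # same grouping, renamed labels
print(adjusted_rand_index(truth, [0, 1, 0, 1]))   # grouping that disagrees with truth
```

Because ARI can be close to zero for a weak baseline, relative improvements expressed as percentages can exceed 100% even when the absolute ARI gain is moderate.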
Categories: Bioinformatics Trends
Simulating structurally variable Nuclear Pore Complexes for Microscopy
Abstract
Motivation: The Nuclear Pore Complex (NPC) is the only passageway for macromolecules between nucleus and cytoplasm, and an important reference standard in microscopy: it is massive and stereotypically arranged. The average architecture of NPC proteins has been resolved with pseudo-atomic precision; however, observed NPC heterogeneities evidence a high degree of divergence from this average. Single Molecule Localization Microscopy (SMLM) images NPCs at protein-level resolution, whereupon image analysis software studies NPC variability. However, the true picture of this variability is unknown. In quantitative image analysis experiments, it is thus difficult to distinguish intrinsically high SMLM noise from variability of the underlying structure.
Results: We introduce CIR4MICS (“ceramics”, Configurable, Irregular Rings FOR MICroscopy Simulations), a pipeline that synthesizes ground truth datasets of structurally variable NPCs based on architectural models of the true NPC. Users can select one or more N- or C-terminally tagged NPC proteins, and simulate a wide range of geometric variations. We also represent the NPC as a spring-model such that arbitrary deforming forces, of user-defined magnitudes, simulate irregularly shaped variations. Further, we provide annotated reference datasets of simulated human NPCs, which facilitate a side-by-side comparison with real data. To demonstrate, we synthetically replicate a geometric analysis of real NPC radii and reveal that a range of simulated variability parameters can lead to observed results. Our simulator is therefore valuable to test the capabilities of image analysis methods, as well as to inform experimentalists about the requirements of hypothesis-driven imaging studies.
Availability: Code: https://github.com/uhlmanngroup/cir4mics. Simulated data: BioStudies S-BSST1058.
Supplementary information: Supplementary data are available at
Categories: Bioinformatics Trends
A cell-level discriminative neural network model for diagnosis of blood cancers
Abstract
Motivation: Precise identification of cancer cells in patient samples is essential for accurate diagnosis and clinical monitoring but has been a significant challenge in machine learning approaches for cancer precision medicine. In most scenarios, training data are only available with disease annotation at the subject or sample level. Traditional approaches separate the classification process into multiple steps that are optimized independently. Recent methods either focus on predicting sample-level diagnosis without identifying individual pathologic cells or are less effective for identifying heterogeneous cancer cell phenotypes.
Results: We developed a generalized end-to-end differentiable model, the Cell Scoring Neural Network (CSNN), which takes sample-level training data and predicts the diagnosis of the testing samples and the identity of the diagnostic cells in the sample, simultaneously. The cell-level density differences between samples are linked to the sample diagnosis, which allows the probabilities of individual cells being diagnostic to be calculated using backpropagation. We applied CSNN to two independent clinical flow cytometry datasets for leukemia diagnosis. In both qualitative and quantitative assessments, CSNN outperformed preexisting neural network modeling approaches for both cancer diagnosis and cell-level classification. Post hoc decision trees and 2D dot plots were generated for interpretation of the identified cancer cells, showing that the identified cell phenotypes match the cancer endotypes observed clinically in patient cohorts. Independent data clustering analysis confirmed the identified cancer cell populations.
Availability: The source code of CSNN and datasets used in the experiments are publicly available on GitHub (http://github.com/erobl/csnn). Raw FCS files can be downloaded from FlowRepository (ID: FR-FCM-Z6YK).
Supplementary information: Supplementary data are available on GitHub and at Bioinformatics online.
Categories: Bioinformatics Trends
DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data
Abstract
Motivation: Cell-cell interactions (CCIs) play critical roles in many biological processes such as cellular differentiation, tissue homeostasis and immune response. With the rapid development of high-throughput single-cell RNA sequencing (scRNA-seq) technologies, it is of high importance to identify CCIs from the ever-increasing scRNA-seq data. However, limited by algorithmic constraints, current computational methods based on statistical strategies ignore some key latent information contained in scRNA-seq data with high sparsity and heterogeneity.
Results: Here, we developed a deep learning framework named DeepCCI to identify meaningful CCIs from scRNA-seq data. Applications of DeepCCI to a wide range of publicly available datasets from diverse technologies and platforms demonstrate its ability to predict significant CCIs accurately and effectively. Powered by flexible and easy-to-use software, DeepCCI can provide a one-stop solution to discover meaningful intercellular interactions and build CCI networks from scRNA-seq data.
Availability: The source code of DeepCCI is available online at https://github.com/JiangBioLab/DeepCCI.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
MuDCoD: Multi-Subject Community Detection in Personalized Dynamic Gene Networks from Single Cell RNA Sequencing
Abstract
Motivation: With the wide availability of single-cell RNA-seq (scRNA-seq) technology, population-scale scRNA-seq datasets across multiple individuals and time points are emerging. While the initial investigations of these datasets tend to focus on standard analysis of clustering and differential expression, leveraging the power of scRNA-seq data at the personalized dynamic gene co-expression network level has the potential to unlock subject and/or time-specific network-level variation, which is critical for understanding phenotypic differences. Community detection from co-expression networks of multiple time points or conditions has been well-studied; however, none of the existing settings included networks from multiple subjects and multiple time points simultaneously. To address this, we develop MuDCoD for multi-subject community detection in personalized dynamic gene networks from scRNA-seq. MuDCoD builds on the spectral clustering framework and promotes information sharing among the networks of the subjects as well as networks at different time points. It clusters genes in the personalized dynamic gene networks and reveals gene communities that are variable or shared not only across time but also among subjects.
Results: Evaluation and benchmarking of MuDCoD against existing approaches reveal that MuDCoD effectively leverages apparent shared signals among networks of the subjects at individual time points, and performs robustly when there is no or little information sharing among the networks. Applications to population-scale scRNA-seq datasets of human-induced pluripotent stem cells during dopaminergic neuron differentiation and CD4+ T cell activation indicate that MuDCoD enables robust inference for identifying time-varying personalized gene modules. Our results illustrate how personalized dynamic community detection can aid in the exploration of subject-specific biological processes that vary across time.
Availability: MuDCoD is publicly available at https://github.com/bo1929/MuDCoD as a Python package. Implementation includes simulation and real-data experiments together with extensive documentation.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
GOAT: Gene-level biomarker discovery from multi-Omics data using graph ATtention neural network for eosinophilic asthma subtype
Abstract
Motivation: Asthma is a heterogeneous disease for which various subtypes have been established, but molecular biomarkers of the subtypes are yet to be discovered. The recent availability of multi-omics data has paved the way to discovering molecular biomarkers for the subtypes. However, multi-omics biomarker discovery is challenging because of the complex interplay between different omics layers.
Results: We propose a deep attention model named Gene-level biomarker discovery from multi-Omics data using graph ATtention neural network (GOAT) for identifying molecular biomarkers for eosinophilic asthma (EA) subtypes with multi-omics data. GOAT identifies genes that discriminate subtypes using a graph neural network by modeling complex interactions among genes as the attention mechanism in the deep learning model. In experiments with multi-omics profiles of the COREA asthma cohort of 300 patients, GOAT outperforms existing models and suggests interpretable biological mechanisms underlying asthma subtypes. Importantly, GOAT identified genes that are distinct only in terms of relationship with other genes through attention. To better understand the role of biomarkers, we further investigated two transcription factors (TFs), CTNNB1 and JUN, captured by GOAT. Using network propagation and transcriptional network analysis, we showed the role in EA pathophysiology of these TFs, which were not distinct in terms of gene expression level differences.
Availability: https://github.com/DabinJeong/Multi-omics_biomarker.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
DeepUMQA3: a web server for accurate assessment of interface residue accuracy in protein complexes
Abstract
Motivation: Model quality assessment is a crucial part of protein structure prediction and a gateway to proper usage of models in biomedical applications. Many methods have been proposed for assessing the quality of structural models of protein monomers, but few for evaluating protein complex models. As protein complex structure prediction becomes a new challenge, there is an urgent need for model quality assessment methods that can accurately assess the accuracy of interface residues of complex structures.
Results: Here, we present DeepUMQA3, a web server for evaluating the accuracy of interface residues of protein complex structures using deep neural networks. For an input complex structure, features are extracted at three levels (overall complex, intra-monomer, and inter-monomer), and an improved deep residual neural network is used to predict per-residue lDDT and interface residue accuracy. DeepUMQA3 ranks first in the blind test of interface residue accuracy estimation in CASP15, with Pearson, Spearman and AUC of 0.564, 0.535 and 0.755 under the lDDT measurement, which are 17.6%, 23.6% and 10.9% higher than the second best method, respectively. DeepUMQA3 can also assess the accuracy of all residues in the entire complex and distinguish high- and low-precision residues.
Availability and implementation: The DeepUMQA3 web server is freely available at http://zhanglab-bioinf.com/DeepUMQA_server/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Unbiased Curriculum Learning Enhanced Global-Local Graph Neural Network for Protein Thermodynamic Stability Prediction
Abstract
Motivation: Proteins play crucial roles in biological processes, with their functions being closely tied to thermodynamic stability. However, measuring stability changes upon point mutations of amino acid residues using physical methods can be time-consuming. In recent years, several computational methods for protein thermodynamic stability prediction (PTSP) based on deep learning have emerged. Nevertheless, these approaches either overlook the natural topology of protein structures or neglect the inherent noisy samples resulting from theoretical calculation or experimental errors.
Results: We propose a novel Global-Local Graph Neural Network powered by Unbiased Curriculum Learning (GLGNN-UCL) for the PTSP task. Our method first builds a Siamese graph neural network to extract protein features before and after mutation. Since the graph’s topological changes stem from local node mutations, we design a local feature transformation module to make the model focus on the mutated site. To address model bias caused by noisy samples, which represent unavoidable errors from physical experiments, we introduce an unbiased curriculum learning method. This approach effectively identifies and re-weights noisy samples during the training process. Extensive experiments demonstrate that our proposed method outperforms advanced protein stability prediction methods, and surpasses state-of-the-art learning methods for regression prediction tasks.
Availability: Code is available at https://github.com/haifangong/UCL-GLGNN.
Categories: Bioinformatics Trends
phippery: a software suite for PhIP-Seq data analysis
Abstract
Summary: We present the phippery software suite for analyzing data from phage display methods that use immunoprecipitation and deep sequencing to capture antibody binding to peptides, often referred to as PhIP-Seq. It has three main components that can be used separately or in conjunction: (1) a Nextflow pipeline, phip-flow, which processes raw sequencing data into a compact, multidimensional dataset format and allows for end-to-end automation of reproducible workflows; (2) a Python API, phippery, which provides interfaces for tasks such as count normalization, enrichment calculation, multidimensional scaling, and more; (3) a Streamlit application, phip-viz, which serves as an interactive interface for visualizing the data as a heatmap in a flexible manner.
Availability and implementation: All software packages are publicly available under the MIT License.
The phip-flow pipeline: https://github.com/matsengrp/phip-flow
The phippery library: https://github.com/matsengrp/phippery
The phip-viz Streamlit application: https://github.com/matsengrp/phip-viz
Categories: Bioinformatics Trends
Using a novel structure/function approach to select diverse swine major histocompatibility complex 1 alleles to predict epitopes for vaccine development
Abstract
Motivation: Swine leukocyte antigens (SLAs; i.e. swine major histocompatibility complex (MHC) proteins) play a fundamental role in swine immunity. To generate a protective vaccine across an outbred species, such as pigs, it is critical that epitopes that bind to diverse SLA alleles are used in the vaccine development process. We introduced a new strategy for epitope prediction.
Results: We employed molecular dynamic simulation (MDS) to identify key amino acids for interactions (CAAI) with epitopes. We developed an algorithm wherein each SLA-1 is compared to a crystallized reference allele with unique weighting for non-conserved amino acids based on R group and position. We then performed homology modelling and electrostatic contact mapping to visualize how relatively small changes in sequences impacted the charge distribution in the binding site. We selected eight diverse SLA-1 alleles and performed homology modelling, followed by protein-peptide docking and binding affinity analyses, to identify porcine reproductive and respiratory syndrome virus (PRRSV) matrix protein (M) epitopes that bind with high affinity to these alleles. We also performed docking analysis on the epitopes identified as strong binders using NetMHCpan 4.1. Epitopes predicted to bind to our eight SLA-1 alleles had equivalent or higher energetic interactions than those predicted to bind to the NetMHCpan 4.1 allele repertoire. This approach of selecting diverse SLA-1 alleles, followed by homology modelling and docking simulations, can be used as a novel strategy for epitope prediction that complements other available tools and is especially useful when available tools do not offer a prediction for SLAs/MHC.
Availability: The data underlying this article are available in the online supplementary material.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Phables: from fragmented assemblies to high-quality bacteriophage genomes
Abstract
Motivation: Microbial communities have a profound impact on both human health and various environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of challenges in viral assembly, fragmentation of genomes can occur, and existing tools may recover incomplete genome fragments. Therefore, the identification and characterisation of novel phage genomes remains a challenge, leading to the need for improved approaches for phage genome recovery.
Results: We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results on viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make.
Availability and implementation: Phables is available on GitHub at https://github.com/Vini2/phables.
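Illustrative note: the abstract above describes modelling each phage-like component as a flow network and decomposing the flow into genomic paths. The sketch below shows the general idea of flow decomposition with a simple greedy strategy on a toy directed acyclic graph. It is not Phables' algorithm, and a greedy walk does not guarantee the smallest possible number of paths. The node names, coverage values and the function name `decompose_flow` are all hypothetical; the input is assumed to be an integer flow that is conserved at every inner node.

```python
def decompose_flow(flow, source, sink):
    """Greedily split a source-to-sink flow on a DAG into weighted paths.

    flow maps directed edges (u, v) to positive integer flow values and is
    assumed to satisfy flow conservation at every node except source and sink.
    """
    remaining = {edge: value for edge, value in flow.items() if value > 0}
    paths = []
    while True:
        # Walk from the source, always following the outgoing edge
        # that still carries the most flow.
        path, node = [source], source
        while node != sink:
            outgoing = [(value, v) for (u, v), value in remaining.items()
                        if u == node and value > 0]
            if not outgoing:
                break
            _, node = max(outgoing)
            path.append(node)
        if node != sink:
            break                       # no flow left to route
        steps = list(zip(path, path[1:]))
        bottleneck = min(remaining[step] for step in steps)
        for step in steps:
            remaining[step] -= bottleneck
        paths.append((path, bottleneck))
    return paths


# Toy component: two variant genomes share segment "a" and differ in "b" vs "c".
# Edge values play the role of read coverage (made-up numbers).
toy_flow = {
    ("s", "a"): 40,
    ("a", "b"): 30,
    ("a", "c"): 10,
    ("b", "t"): 30,
    ("c", "t"): 10,
}
for path, weight in decompose_flow(toy_flow, "s", "t"):
    print(" -> ".join(path), "with weight", weight)
```

In this toy case the two recovered paths correspond to two variants that share most of their sequence, with the path weights reflecting their relative abundance.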
Categories: Bioinformatics Trends
HQAlign: Aligning nanopore reads for SV detection using current-level modeling
Abstract
Motivation: Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers such as nanopore sequencers can address this problem by providing very long reads, but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this paper, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using basecalled nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into an existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments.
Results: We show that HQAlign captures about 4–6% complementary SVs across different datasets which are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore reads data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10–50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore reads alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore reads alignment to the GRCh37 human genome.
Availability: https://github.com/joshidhaivat/HQAlign.git
Categories: Bioinformatics Trends
pLM-BLAST—distant homology detection based on direct comparison of sequence representations from protein language models
Abstract
Motivation: The detection of homology through sequence comparison is a typical first step in the study of protein function and evolution. In this work, we explore the applicability of protein language models to this task.
Results: We introduce pLM-BLAST, a tool inspired by BLAST, that detects distant homology by comparing single-sequence representations (embeddings) derived from a protein language model, ProtT5. Our benchmarks reveal that pLM-BLAST maintains a level of accuracy on par with HHsearch for both highly similar sequences (with over 50% identity) and markedly divergent sequences (with less than 30% identity), while being significantly faster. Additionally, pLM-BLAST stands out among other embedding-based tools due to its ability to compute local alignments. We show that these local alignments, produced by pLM-BLAST, often connect highly divergent proteins, thereby highlighting its potential to uncover previously undiscovered homologous relationships and improve protein annotation.
Availability and implementation: pLM-BLAST is accessible via the MPI Bioinformatics Toolkit as a web server for searching precomputed databases (https://toolkit.tuebingen.mpg.de/tools/plmblast). It is also available as a standalone tool for building custom databases and performing batch searches (https://github.com/labstructbioinf/pLM-BLAST).
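Illustrative note: the abstract above describes computing local alignments by comparing per-residue embeddings rather than amino-acid letters. One generic way to do this is to run a Smith-Waterman style dynamic program over a matrix of cosine similarities, as sketched below. This is an assumption-laden illustration, not pLM-BLAST's actual procedure: the 2-D vectors stand in for real ProtT5 embeddings, and the `offset` and `gap` values, like the function names, are arbitrary choices made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Smith-Waterman over embedding similarities.

    The score for pairing residue i with residue j is cosine similarity minus
    `offset`, so weakly similar pairs are penalised. Returns the best local
    score and the list of aligned index pairs (0-based).
    """
    n, m = len(emb_a), len(emb_b)
    sim = [[cosine(a, b) - offset for b in emb_b] for a in emb_a]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0.0,
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair residues i and j
                score[i - 1][j] - gap,                    # gap in sequence b
                score[i][j - 1] - gap,                    # gap in sequence a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy per-residue "embeddings" (made-up 2-D vectors, one per residue).
seq_a = [(1, 0), (0, 1), (1, 1), (1, 0)]
seq_b = [(0, 1), (1, 1), (-1, 0)]
best_score, aligned = local_align(seq_a, seq_b)
print(best_score, aligned)
```

The local formulation matters because only the similar stretch is reported: residues outside the best-scoring region are left unaligned instead of being forced into a global alignment.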
Categories: Bioinformatics Trends
Tcbf: A novel user-friendly tool for pan-3D genome analysis of topologically associating domain in eukaryotic organisms
Abstract
Summary: TAD boundaries are essential for organizing the chromatin spatial structure and regulating gene expression in eukaryotes. However, for large-scale pan-3D genome research, identifying conserved and specific TAD boundaries across different species or individuals is computationally challenging. Here, we present Tcbf, a rapid and powerful Python/R tool that integrates gene synteny blocks and homologous sequences to automatically detect conserved and specific TAD boundaries among multiple species. It can efficiently analyze huge genome datasets, greatly reduce the computational burden and enable pan-3D genome research.
Availability and implementation: Tcbf is implemented in Python and R and is available at https://github.com/TcbfGroup/Tcbf under the MIT license.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
BondGraphs.jl: Composable energy-based modelling in systems biology
Abstract
Summary: BondGraphs.jl is a Julia implementation of bond graphs. Bond graphs provide a modelling framework that describes energy flow through a physical system and by construction enforce thermodynamic constraints. The framework is widely used in engineering and has recently been shown to be a powerful approach for modelling biology. Models are mutable, hierarchical, multi-scale, multi-physics, and BondGraphs.jl is compatible with the Julia modelling ecosystem.
Availability and implementation: BondGraphs.jl is freely available under the MIT license. Source code and documentation can be found at https://github.com/jedforrest/BondGraphs.jl.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends