fimpera: drastic improvement of Approximate Membership Query data-structures with counts
Abstract
Motivation: High-throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scale datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Query (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a significant false positive rate.
Results: We propose a novel algorithm, called fimpera, that improves the performance of any cAMQ. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time.
Availability: https://github.com/lrobidou/fimpera
Supplementary information: Supplementary data are available at Bioinformatics online.
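To make the cAMQ idea concrete, here is a minimal sketch of a plain counting Bloom filter that stores k-mer abundances. It is not fimpera's algorithm, only the baseline structure the abstract refers to. The class name, the filter size, the number of hash functions, the counter cap, and the use of BLAKE2b are all assumptions chosen for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter (hypothetical helper, not fimpera)."""

    def __init__(self, size, num_hashes=3, max_count=255):
        self.size = size
        self.num_hashes = num_hashes
        self.max_count = max_count      # counters saturate at this value
        self.counters = [0] * size

    def _positions(self, item):
        # Derive one counter position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def count(self, item):
        # The minimum counter is an upper bound on the true abundance
        # (as long as no counter has saturated); collisions can only inflate it.
        return min(self.counters[pos] for pos in self._positions(item))


def kmers(seq, k):
    """Yield every k-mer of seq in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


cbf = CountingBloomFilter(size=1024)
for kmer in kmers("ACGTACGTAC", 3):
    cbf.add(kmer)
```

Querying `cbf.count("ACG")` returns at least 2 here, because "ACG" occurs twice in the indexed sequence. A k-mer that was never inserted can still return a non-zero count when its positions collide with other k-mers, which is the false positive behaviour the abstract describes.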
Categories: Bioinformatics Trends
LENS: Landscape of Effective Neoantigens Software
Abstract
Motivation: Elimination of cancer cells by T cells is a critical mechanism of anti-tumor immunity and cancer immunotherapy response. T cells recognize cancer cells by engagement of T cell receptors with peptide epitopes presented by major histocompatibility complex (MHC) molecules on the cancer cell surface. Peptide epitopes can be derived from antigen proteins coded for by multiple genomic sources. Bioinformatics tools used to identify tumor-specific epitopes via analysis of DNA and RNA sequencing data have largely focused on epitopes derived from somatic variants, though a smaller number have evaluated potential antigens from other genomic sources.
Results: We report here an open-source workflow utilizing the Nextflow DSL2 workflow manager, Landscape of Effective Neoantigen Software (LENS), which predicts tumor-specific and tumor-associated antigens from single nucleotide variants, insertions and deletions, fusion events, splice variants, cancer testis antigens, overexpressed self-antigens, viruses, and endogenous retroviruses. The primary advantage of LENS is that it expands the breadth of genomic sources of discoverable tumor antigens using genomics data. Other advantages include modularity, extensibility, ease of use, and harmonization of relative expression level and immunogenicity prediction across multiple genomic sources. We present an analysis of 115 acute myeloid leukemia (AML) samples to demonstrate the utility of LENS. We expect LENS will be a valuable platform and resource for T cell epitope discovery bioinformatics, especially in cancers with few somatic variants where tumor-specific epitopes from alternative genomic sources are an elevated priority.
Availability: More information about LENS, including code, workflow documentation, and instructions, can be found at https://gitlab.com/landscape-of-effective-neoantigens-software.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
uORF4u: a tool for annotation of conserved upstream open reading frames
Abstract
Summary: Upstream open reading frames (uORFs, often encoding so-called leader peptides) can regulate translation and transcription of downstream main ORFs (mORFs) in prokaryotes and eukaryotes. However, annotation of novel functional uORFs is challenging due to their short size of usually less than 100 codons. While transcription- and translation-level next generation sequencing (NGS) methods can be used for genome-wide functional uORF identification, such data are not available for the vast majority of species with sequenced genomes. At the same time, the exponentially increasing number of genome assemblies gives us the opportunity to take advantage of evolutionary conservation in our predictions of functional ORFs.
Here we present a tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologues. It can also be used to find small conserved ORFs within a set of nucleotide sequences. The output includes publication-quality figures with multiple sequence alignments, sequence logos and locus annotation of the predicted conserved uORFs in graphical vector format.
Availability: uORF4u is written in Python3 and runs on Linux and MacOS. The command-line interface covers most practical use cases, while the provided Python API allows usage within a Python program and additional customisation. Source code is available from the GitHub page: github.com/GCA-VH-lab/uorf4u. Detailed documentation, including an example-driven guide, is available at the software home page: gca-vh-lab.github.io/uorf4u. A web version of uORF4u is available at server.atkinson-lab.com/uorf4u.
Categories: Bioinformatics Trends
Efficient short read mapping to a pangenome that is represented by a graph of ED strings
Abstract
Motivation: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research has focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, for example, reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time.
Results: We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearised form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearised form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient.
Availability: Sources are available at: https://github.com/thomas-buechler-ulm/gedmap
Supplementary information: Supplementary data are available at Bioinformatics online.
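The minimizer index mentioned in the abstract can be illustrated on an ordinary string. The sketch below selects, for every window of w consecutive k-mers, the lexicographically smallest k-mer. The function name, the lexicographic ordering, and the parameters are assumptions for illustration; GED-MAP's actual minimizer scheme over ED strings is not described in the abstract and may differ.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost lexicographically smallest k-mer in this window.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


sampled = minimizers("ACGTACGT", k=3, w=2)
# → [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Only the sampled k-mers need to be stored in the index, so four entries stand in for the six k-mers of this toy sequence. Two sequences that share a long enough substring are guaranteed to share a minimizer inside it.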
Categories: Bioinformatics Trends
AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning
Abstract
Motivation: Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g., gene or disease).
Results: We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO scheme. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g., the entire PubMed data).
Availability and implementation: The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
NanoPack2: Population scale evaluation of long-read sequencing data
Abstract
Summary: Increases in the cohort size in long-read sequencing projects necessitate more efficient software for quality assessment and processing of sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Here we describe novel tools for summarizing experiments, filtering datasets, and visualizing phased alignment results, as well as updates to the NanoPack software suite.
Availability and implementation: The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries built on musl are available for broad compatibility. NanoPlot and NanoComp are written in Python3. Links to the separate tools and their documentation can be found at https://github.com/wdecoster/nanopack. All tools are compatible with Linux, Mac OS, and the MS Windows Subsystem for Linux and are released under the MIT license. The repositories include test data, and the tools are continuously tested using GitHub Actions and can be installed with the conda dependency manager.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
copMEM2: Robust and scalable maximum exact match finding
Abstract
Summary: Finding Maximum Exact Matches, i.e., matches between two strings that cannot be further extended to the left or right, is a classic string problem with applications in genome-to-genome comparisons. The existing tools rarely explicitly address the problem of MEM finding for a pair of very similar genomes, which may be computationally challenging. We present copMEM2, a multi-threaded implementation of its predecessor. With a few optimizations, including a carefully built predecessor query data structure, sort procedure selection, and special handling of highly similar data, copMEM2 can compute all MEMs of minimum length 50 between the human and mouse genomes in 59 s, using 10.40 GB of RAM and 12 threads, being at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the results are 324 s and 16.57 GB, respectively.
Availability and implementation: copMEM2 is available at https://github.com/wbieniec/copmem2.
Supplementary information: Supplementary data are available at Bioinformatics online.
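The MEM definition above (a match that cannot be extended to the left or right) can be checked with a brute-force sketch. This quadratic scan is only meant to illustrate the definition on short strings; it is not how copMEM2 works, and the function name and tuple layout are assumptions.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_in_a, start_in_b, length) for every MEM of length >= min_len."""
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Not left-maximal if the preceding characters also match.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


found = maximal_exact_matches("xABCDy", "zABCDw", min_len=3)
# → [(1, 1, 4)]
```

The single reported match is "ABCD": it is flanked by mismatches on both sides in both strings, so no extension is possible.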
Categories: Bioinformatics Trends
ViralConsensus: A fast and memory-efficient tool for calling viral consensus genome sequences directly from read alignment data
Abstract
Motivation: In viral molecular epidemiology, reconstruction of consensus genomes from sequence data is critical for tracking mutations and variants of concern. However, as the number of samples that are sequenced grows rapidly, compute resources needed to reconstruct consensus genomes can become prohibitively large.
Results: ViralConsensus is a fast and memory-efficient tool for calling viral consensus genome sequences directly from read alignment data. ViralConsensus is orders of magnitude faster and more memory-efficient than existing methods. Further, unlike existing methods, ViralConsensus can pipe data directly from a read mapper via standard input and performs viral consensus calling on-the-fly, making it an ideal tool for viral sequencing pipelines.
Availability: ViralConsensus is freely available at https://github.com/niemasd/ViralConsensus as an open-source software project.
Supplementary information: Supplementary data are available at Bioinformatics online.
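As a rough illustration of consensus calling from alignments, the sketch below tallies bases per reference position as reads stream in and then emits the majority base at each position. It assumes ungapped alignments given as hypothetical (start, bases) pairs and uses made-up depth and frequency thresholds; it is not ViralConsensus's implementation and ignores indels, base qualities, and primer trimming.

```python
from collections import Counter


def consensus(ref_len, aligned_reads, min_depth=2, min_freq=0.5):
    """Call a consensus string from ungapped (start, bases) alignments."""
    counts = [Counter() for _ in range(ref_len)]
    # Single pass over the reads: only per-position tallies are kept in memory.
    for start, bases in aligned_reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < ref_len:
                counts[pos][base] += 1
    called = []
    for tally in counts:
        depth = sum(tally.values())
        if depth < min_depth:
            called.append("N")          # too little coverage to call
            continue
        base, n = tally.most_common(1)[0]
        called.append(base if n / depth > min_freq else "N")
    return "".join(called)


reads = [(0, "ACGT"), (1, "CGTA"), (0, "ACTT")]
seq = consensus(5, reads)
# → "ACGTN"
```

Because the tallies are updated one read at a time, the reads could equally come from an iterator over a mapper's output rather than a list.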
Categories: Bioinformatics Trends
High-quality, customizable heuristics for RNA 3D structure alignment
Abstract
Motivation: Tertiary structure alignment is one of the main challenges in the computer-aided comparative study of molecular structures. Its aim is to optimally overlay the three-dimensional shapes of two or more molecules in space to find the correspondence between their nucleotides. Alignment is the starting point for most algorithms that assess structural similarity or find common substructures. Thus, it has applications in solving a variety of bioinformatics problems, e.g., in the search for structural patterns, structure clustering, identifying structural redundancy, and evaluating the prediction accuracy of 3D models. To date, several tools have been developed to align 3D structures of RNA. However, most of them are not applicable to arbitrarily large structures and do not allow users to parameterize the optimization algorithm.
Results: We present two customizable heuristics for flexible alignment of 3D RNA structures, geometric search (GEOS), and genetic algorithm (GENS). They work in sequence-dependent/independent mode and find the suboptimal alignment of expected quality (below a predefined RMSD threshold). We compare their performance with those of state-of-the-art methods for aligning RNA structures. We show the results of quantitative and qualitative tests run for all of these algorithms on benchmark sets of RNA structures.
Availability: Source code for both heuristics is hosted at https://github.com/RNApolis/rnahugs
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
CscoreTool-M infers 3D sub-compartment probabilities within cell population
Abstract
Motivation: Computational inference of genome organization based on Hi-C sequencing has greatly aided the understanding of chromatin and nuclear organization in three dimensions (3D). However, existing computational methods fail to address cell population heterogeneity. Here we describe a probabilistic-modeling-based method called CscoreTool-M that infers multiple 3D genome sub-compartments from Hi-C data.
Results: The compartment scores inferred using CscoreTool-M represent the probability of a genomic region being located in a specific sub-compartment. Compared to published methods, CscoreTool-M is more accurate in inferring sub-compartments corresponding to both active and repressed chromatin. The compartment scores calculated by CscoreTool-M also help to quantify the levels of heterogeneity in sub-compartment localization within cell populations. By comparing proliferating cells and terminally differentiated non-proliferating cells, we show that the proliferating cells have higher genome organization heterogeneity, which is likely caused by cells at different cell-cycle stages. By analyzing 10 sub-compartments, we found a sub-compartment containing chromatin potentially related to the early-G1 chromatin regions proximal to the nuclear lamina in HCT116 cells, suggesting the method can deconvolve cell cycle stage-specific genome organization among asynchronously dividing cells. Finally, we show that CscoreTool-M can identify sub-compartments that contain genes enriched in housekeeping or cell-type-specific functions.
Availability: https://github.com/scoutzxb/CscoreTool-M
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis
Abstract
Motivation: Due to the rapid growth of genetic database sizes, genealogical search, a process of inferring familial relatedness by identifying DNA matches, has become a viable approach to help individuals find missing family members or law enforcement agencies locate suspects. A fast and accurate method is needed to search an out-of-database individual against millions of individuals. Most existing approaches only offer all-vs-all within-panel matching. Some prototype algorithms offer one-vs-all queries from an out-of-panel individual, but they do not tolerate errors.
Results: A new method, random projection-based identical-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes. By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites. A single query against all UK biobank autosomal chromosomes was completed within 2.76 seconds on average, with the minimum length 7 cM and 700 markers. RaPID-Query achieved a 0.016 false negative rate and a 0.012 false positive rate simultaneously on a chromosome 20 sequencing panel having 86,265 sites. This is comparable to the state-of-the-art IBD detection methods TPBWT (out-of-sample) and Hap-IBD. The high-quality IBD segments yielded by RaPID-Query were able to distinguish up to fourth-degree familial relatedness for a given individual pair, and the area under the receiver operating characteristic curve values are at least 97.28%.
Availability: The RaPID-Query program is available at https://github.com/ucfcbb/RaPID-Query.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
TRASH: Tandem Repeat Annotation and Structural Hierarchy
Abstract
Motivation: The advent of long-read DNA sequencing is allowing complete assembly of highly repetitive genomic regions for the first time, including the megabase-scale satellite repeat arrays found in many eukaryotic centromeres. The assembly of such repetitive regions creates a need for their de novo annotation, including patterns of higher order repetition. To annotate tandem repeats, methods are required that can be widely applied to diverse genome sequences, without prior knowledge of monomer sequences.
Results: TRASH (Tandem Repeat Annotation and Structural Hierarchy) is a tool that identifies and maps tandem repeats in nucleotide sequence, without prior knowledge of repeat composition. TRASH analyses a fasta assembly file, identifies regions occupied by repeats and then precisely maps them and their higher order structures. To demonstrate the applicability and scalability of TRASH for centromere research, we apply our method to the recently published Col-CEN genome of Arabidopsis thaliana and the complete human CHM13 genome.
Availability: TRASH is freely available at: https://github.com/vlothec/TRASH and supported on Linux.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Genome mining for anti-CRISPR operons using machine learning
Abstract
Motivation: Encoded by (pro-)viruses, anti-CRISPR (Acr) proteins inhibit the CRISPR-Cas immune system of their prokaryotic hosts. As a result, Acr proteins can be employed to develop more controllable CRISPR-Cas genome editing tools. Recent studies revealed that known acr genes often coexist with other acr genes and with phage structural genes within the same operon. For example, we found that 47 of 98 known acr genes (or their homologs) co-exist in the same operons. None of the current Acr prediction tools have considered this important genomic context feature. We have developed a new software tool, AOminer, to facilitate the improved discovery of new Acrs by fully exploiting the genomic context of known acr genes and their homologs.
Results: AOminer is the first machine learning based tool focused on the discovery of Acr operons (AOs). A two-state HMM (hidden Markov model) was trained to learn the conserved genomic context of operons that contain known acr genes or their homologs, and the learnt features could distinguish AOs and non-AOs. AOminer allows automated mining for potential AOs from query genomes or operons. AOminer outperformed all existing Acr prediction tools with an accuracy of 0.85. AOminer will facilitate the discovery of novel anti-CRISPR operons.
Availability: The webserver is available at: http://aca.unl.edu/AOminer/AOminer_APP/. The Python program is at: https://github.com/boweny920/AOminer.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
GTExVisualizer: a web platform for supporting ageing studies
Abstract
Motivation: Studying ageing effects on molecules is an important new topic for life science. Such studies need data, models, algorithms, and tools to elucidate molecular mechanisms. The GTEx (Genotype-Tissue Expression) portal is a web-based data source that allows retrieval of patient transcriptomics data annotated with tissue, gender and age information. It is among the more complete data sources for studies of ageing effects. Nevertheless, it lacks functionalities to query data at the sex/age level, as well as tools for protein interaction studies, thereby limiting ageing studies. As a result, users need to download query results to proceed to further analysis, such as retrieving the expression of a given gene in different age (or sex) classes in many tissues.
Results: We present the GTExVisualizer, a platform to query and analyse GTEx data. This tool contains a web interface able to: (i) graphically represent and study query results; (ii) analyse genes using sex/age expression patterns, also integrated with network-based modules; (iii) report results as plot-based representations as well as (gene) networks. Finally, it allows the user to obtain basic statistics which highlight differences in gene expression among sex/age groups.
Conclusion: The novelty of GTExVisualizer consists in providing a tool for studying ageing/sex-related effects on molecular processes.
Availability: GTExVisualizer is available at: http://gtexvisualizer.herokuapp.com. The source code is available at: https://github.com/UgoLomoio/gtex_visualizer
Categories: Bioinformatics Trends
STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data
Abstract
Motivation: As the resolution of metagenomic analysis increases, the evolution of microbial genomes in longitudinal metagenomic data has become a research focus. Some software has been developed for the simulation of complex microbial communities at the strain level. However, a tool for simulating within-strain evolutionary signals in longitudinal samples is still lacking.
Results: In this study, we introduce STEMSIM, a user-friendly command-line simulator of short-term evolutionary mutations for longitudinal metagenomic data. The input is simulated longitudinal raw sequencing reads of microbial communities or single species. The output is the modified reads with within-strain evolutionary mutations and the relevant information of these mutations. STEMSIM will be of great use for the evaluation of analytic tools that detect short-term evolutionary mutations in metagenomic data.
Availability: STEMSIM and its tutorial are freely available online at https://github.com/BoyanZhou/STEMSim.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph transformer
Abstract
Motivation: State-of-the-art protein structure prediction methods such as AlphaFold are being widely used to predict structures of uncharacterized proteins in biomedical research. There is a significant need to further improve the quality and nativeness of the predicted structures to enhance their usability. In this work, we develop ATOMRefine, a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses an SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph.
Results: The method is first trained and tested on the structural models in AlphaFoldDB whose experimental structures are known, and then blindly tested on 69 CASP14 regular targets and 7 CASP14 refinement targets. ATOMRefine improves the quality of both backbone atoms and all-atom conformation of the initial structural models generated by AlphaFold. It also performs better than two state-of-the-art refinement methods in multiple evaluation metrics, including an all-atom model quality score, the MolProbity score, which is based on the analysis of all-atom contacts, bond length, atom clashes, torsion angles, and side-chain rotamers. As ATOMRefine can refine a protein structure quickly, it provides a viable, fast solution for improving protein geometry and fixing structural errors of predicted structures through direct coordinate refinement.
Availability: The source code of ATOMRefine is available in the GitHub repository (https://github.com/BioinfoMachineLearning/ATOMRefine). All the required data for training and testing are available at https://doi.org/10.5281/zenodo.6944368.
Categories: Bioinformatics Trends
ROptimus: a parallel general-purpose adaptive optimisation engine
Abstract
Motivation: Various computational biology calculations require a probabilistic optimisation protocol to determine the parameters that capture the system at a desired state in the configurational space. Many existing methods excel at certain scenarios, but fail in others due, in part, to an inefficient exploration of the parameter space and easy trapping into local minima. Here, we developed a general-purpose optimisation engine in R that can be plugged into any modelling initiative, simple or complex, through a few lucid interfacing functions, to perform a seamless optimisation with rigorous parameter sampling.
Results: ROptimus features simulated annealing and replica exchange implementations equipped with adaptive thermoregulation to drive the Monte Carlo optimisation process in a flexible manner, through constrained acceptance frequency but unconstrained adaptive pseudo temperature regimens. We exemplify the applicability of our R optimiser to a diverse set of problems spanning data analyses and computational biology tasks.
Availability and Implementation: ROptimus is written and implemented in R, and is freely available from CRAN (http://cran.r-project.org/web/packages/ROptimus/index.html), and GitHub (http://github.com/SahakyanLab/ROptimus).
Supplementary information: Supplementary information with more details, tutorials, and developer instructions is available at Bioinformatics online.
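The idea of steering the temperature from the observed acceptance frequency can be sketched as follows. This is a generic simulated annealing loop written in Python rather than R, with a deliberately simple adjustment rule; the function name, the target acceptance ratio, the window length, and the 1.1/0.9 scaling factors are all assumptions and do not reproduce ROptimus's thermoregulation scheme.

```python
import math
import random


def anneal(energy, propose, state, steps=5000, temp=1.0,
           target_accept=0.3, window=100, seed=0):
    """Minimise energy(state) with Metropolis moves and an adaptive temperature."""
    rng = random.Random(seed)
    current_e = energy(state)
    best, best_e = state, current_e
    accepted = 0
    for step in range(1, steps + 1):
        candidate = propose(state, rng)
        cand_e = energy(candidate)
        delta = cand_e - current_e
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state, current_e = candidate, cand_e
            accepted += 1
            if current_e < best_e:
                best, best_e = state, current_e
        # Every `window` steps, nudge the temperature towards the target
        # acceptance ratio: heat up if too few moves pass, cool down otherwise.
        if step % window == 0:
            ratio = accepted / window
            temp *= 1.1 if ratio < target_accept else 0.9
            accepted = 0
    return best, best_e


def quadratic(x):
    return (x - 3.0) ** 2


def jitter(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best_x, best_val = anneal(quadratic, jitter, state=0.0)
```

The temperature itself is left free to rise or fall, while the acceptance frequency is what the loop tries to hold near a chosen value, which loosely mirrors the "constrained acceptance frequency but unconstrained adaptive pseudo temperature" phrasing in the abstract.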
Categories: Bioinformatics Trends
HAMPLE: deciphering TF-DNA binding mechanism in different cellular environments by characterizing higher-order nucleotide dependency
Abstract
Motivation: Transcription factors (TFs) bind to conserved DNA binding sites in different cellular environments and development stages by physical interaction with interdependent nucleotides. However, systematic computational characterization of the relationship between higher-order nucleotide dependency and TF-DNA binding mechanism in diverse cell types remains challenging.
Results: Here, we propose a novel multi-task learning framework, HAMPLE, to simultaneously predict TF binding sites (TFBS) in distinct cell types by characterizing higher-order nucleotide dependencies. Specifically, HAMPLE first represents a DNA sequence through three higher-order nucleotide dependencies, including k-mer encoding, DNA shape and histone modification. Then, HAMPLE employs the customized gate control and the channel attention convolutional architecture to further capture cell-type-specific and cell-type-shared DNA binding motifs and epigenomic languages. Finally, HAMPLE exploits the joint loss function to optimize the TFBS prediction for different cell types in an end-to-end manner. Extensive experimental results on seven datasets demonstrate that HAMPLE significantly outperforms the state-of-the-art approaches in terms of auROC. In addition, feature importance analysis illustrates that k-mer encoding, DNA shape and histone modification have predictive power for TF-DNA binding in different cellular environments and are complementary to each other. Furthermore, ablation study and interpretable analysis validate the effectiveness of the customized gate control and the channel attention convolutional architecture in characterizing higher-order nucleotide dependencies.
Availability: The source code is available at https://github.com/ZhangLab312/Hample.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
ppBAM: ProteinPaint BAM track for read alignment visualization and variant genotyping
Abstract
Summary: ProteinPaint BAM track (ppBAM) is designed to assist variant review for cancer research and clinical genomics. With performant server-side computing and rendering, ppBAM supports on-the-fly variant genotyping of thousands of reads using Smith-Waterman alignment. To better visualize support for complex variants, reads are realigned against the mutated reference sequence using ClustalO. ppBAM also supports the BAM slicing API of the NCI Genomic Data Commons (GDC) portal, letting researchers conveniently examine genomic details of vast amounts of cancer sequencing data and reinterpret variant calls.
Availability: BAM track examples, tutorial, and GDC file access links are available at https://proteinpaint.stjude.org/bam/. Source code is available at https://github.com/stjude/proteinpaint.
Categories: Bioinformatics Trends
metapaths: similarity search in heterogeneous knowledge graphs via meta paths
Abstract
Summary: Heterogeneous knowledge graphs (KGs) have enabled the modeling of complex systems, from genetic interaction graphs and protein-protein interaction networks to networks representing drugs, diseases, proteins, and side effects. Analytical methods for KGs rely on quantifying similarities between entities, such as nodes, in the graph. However, such methods must consider the diversity of node and edge types contained within the KG via, for example, defined sequences of entity types known as meta paths. We present metapaths, the first R software package to implement meta paths and perform meta-path-based similarity search in heterogeneous KGs. The metapaths package offers various built-in similarity metrics for node pair comparison by querying KGs represented as either edge or adjacency lists, as well as auxiliary aggregation methods to measure set-level relationships. Indeed, evaluation of these methods on an open-source biomedical KG recovered meaningful drug and disease-associated relationships, including those in Alzheimer’s disease. The metapaths framework facilitates the scalable and flexible modeling of network similarities in KGs with applications across KG learning.
Availability: The metapaths R package is available via GitHub at https://github.com/ayushnoori/metapaths and is released under MPL 2.0 (Zenodo DOI: 10.5281/zenodo.7047209). Package documentation and usage examples are available at https://www.ayushnoori.com/metapaths.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends
