Discriminative Pattern Discovery for the Characterization of Different Network Populations

Bioinformatics Oxford Journals - Thu, 06/04/2023 - 5:30am
Abstract
Motivation: An interesting problem is to study how gene co-expression varies between two different populations, associated with healthy and unhealthy individuals, respectively. To this aim, two important aspects should be taken into account: (1) in some cases, pairs or groups of genes show "collaborative attitudes" that emerge in the study of disorders and diseases; (2) information coming from each single individual may be crucial to capture specific details at the basis of complex cellular mechanisms; therefore, it is important not to miss potentially powerful information associated with the single samples.
Results: Here a novel approach is proposed in which two different input populations are considered and represented by two datasets of edge-labelled graphs. Each graph is associated with an individual, and the edge label is the co-expression value between the two genes associated with the nodes. Discriminative patterns among graphs belonging to different sample sets are searched for, based on a statistical notion of "relevance" able to take into account important local similarities as well as collaborative effects involving the co-expression among multiple genes. Four different gene expression datasets have been analyzed by the proposed approach, each associated with a different disease. An extensive set of experiments shows that the extracted patterns significantly characterize important differences between healthy and unhealthy samples, both in the cooperation and in the biological functionality of the involved genes/proteins. Moreover, the provided analysis confirms some results already presented in the literature on genes with a central role in the considered diseases, while still yielding novel and useful insights on this aspect.
Availability: The algorithm has been implemented using the Java programming language and the code is available at https://github.com/CriSe92/DiscriminativeSubgraphDiscovery.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MARSY: A multitask deep learning framework for prediction of drug combination synergy scores

Bioinformatics Oxford Journals - Thu, 06/04/2023 - 5:30am
Abstract
Motivation: Combination therapies have emerged as a treatment strategy for cancers to reduce the probability of drug resistance and to improve outcome. Large databases curating the results of many drug screening studies on preclinical cancer cell lines have been developed, capturing the synergistic and antagonistic effects of combinations of drugs in different cell lines. However, due to the high cost of drug screening experiments and the sheer number of possible drug combinations, these databases are quite sparse. This necessitates the development of transductive computational models to accurately impute these missing values.
Results: Here, we developed MARSY, a deep learning multi-task model that incorporates information on the gene expression profile of cancer cell lines, as well as the differential expression signature induced by each drug, to predict drug-pair synergy scores. By utilizing two encoders to capture the interplay between the drug-pairs, as well as between the drug-pairs and cell lines, and by adding auxiliary tasks in the predictor, MARSY learns latent embeddings that improve the prediction performance compared to state-of-the-art and traditional machine learning models. Using MARSY, we then predicted the synergy scores of 133,722 new drug-pair cell line combinations, which we have made available to the community as part of this study. Moreover, we validated various insights obtained from these novel predictions using independent studies, confirming the ability of MARSY to make accurate novel predictions.
Availability and implementation: An implementation of the algorithms in Python and cleaned input datasets are provided at https://github.com/Emad-COMBINE-lab/MARSY.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SpliceAI-10k calculator for the prediction of pseudoexonization, intron retention, and exon deletion

Bioinformatics Oxford Journals - Thu, 06/04/2023 - 5:30am
Abstract
Summary: SpliceAI is a widely used splicing prediction tool and its most common application relies on the maximum delta score to assign variant impact on splicing. We developed the SpliceAI-10k calculator (SAI-10k-calc) to extend use of this tool to predict: the splicing aberration type including pseudoexonization, intron retention, partial exon deletion, and (multi)exon skipping using a 10 kb analysis window; the size of inserted or deleted sequence; the effect on reading frame; and the altered amino acid sequence. SAI-10k-calc has 95% sensitivity and 96% specificity for predicting variants that impact splicing, computed from a control dataset of 1,212 single nucleotide variants (SNVs) with curated splicing assay results. Notably, it has high performance (≥84% accuracy) for predicting pseudoexon and partial intron retention. The automated amino acid sequence prediction allows for efficient identification of variants that are expected to result in mRNA nonsense-mediated decay or translation of truncated proteins.
Availability and implementation: SAI-10k-calc is implemented in R (https://github.com/adavi4/SAI-10k-calc) and also available as a Microsoft Excel spreadsheet. Users can adjust the default thresholds to suit their target performance values.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Identification of Protein-Protein Interaction Bridges for Multiple Sclerosis

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: Identifying and prioritizing disease-related proteins is an important scientific problem in developing proper treatments. Network science has become an important discipline for prioritizing such proteins. Multiple sclerosis (MS), an autoimmune disease for which there is still no cure, is characterized by a damaging process called demyelination. Demyelination is the destruction by immune cells of myelin, a structure facilitating fast transmission of neuron impulses, and of oligodendrocytes, the cells producing myelin. Identifying the proteins that have special features on the network formed by the proteins of oligodendrocyte and immune cells can reveal useful information about the disease.
Results: We investigated the most significant protein pairs, which we define as bridges, among the proteins providing the interaction between the two cells in demyelination, in the networks formed by the oligodendrocyte and each of two immune cell types (i.e., macrophage and T-cell), using network analysis techniques and integer programming. We investigated these specialized hubs because a problem affecting these proteins might cause greater damage to the system. We showed that 61% to 100% of the proteins our model detected, depending on parametrization, have already been associated with MS. We further observed that the mRNA expression levels of several proteins we prioritized were significantly decreased in human peripheral blood mononuclear cells (PBMCs) of MS patients. We therefore present a model, BriFin, which can be used for analyzing processes where interactions of two cell types play an important role.
Availability: BriFin is available at https://github.com/BilkentCompGen/brifin.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Multi-scale Adaptive Differential Abundance Analysis in Microbial Compositional Data

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: Differential abundance analysis is an essential and commonly used tool to characterize the difference between microbial communities. However, identifying differentially abundant microbes remains a challenging problem because the observed microbiome data are inherently compositional, excessively sparse, and distorted by experimental bias. Besides these major challenges, the results of differential abundance analysis also depend largely on the choice of analysis unit, adding another practical complexity to this already complicated problem.
Results: In this work, we introduce a new differential abundance test called the MsRDB test, which embeds the sequences into a metric space and integrates a multi-scale adaptive strategy for utilizing spatial structure to identify differentially abundant microbes. Compared with existing methods, the MsRDB test can detect differentially abundant microbes at the finest resolution offered by the data and provide adequate detection power while being robust to zero counts, compositional effect, and experimental bias in the microbial compositional data set. Applications to both simulated and real microbial compositional data sets demonstrate the usefulness of the MsRDB test.
Availability: All analyses can be found under https://github.com/lakerwsl/MsRDB-Manuscript-Code.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SmoothT – a server constructing low energy pathways from conformational ensembles for interactive visualization and enhanced sampling

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: The SmoothT software and webservice construct pathways from an ensemble of conformations. The user provides an archive of molecule conformations in PDB format, from which a starting and a final conformation need to be selected. The individual PDB files need to contain an energy value or score estimating the quality of the respective conformation. Additionally, the user has to provide an RMSD cutoff, below which conformations are considered neighboring. From this, SmoothT constructs a graph that connects similar conformations.
Results: SmoothT returns the energetically most favorable pathway within this graph. This pathway is directly displayed as an interactive animation using the NGL viewer. Simultaneously, the energy along the pathway is plotted, highlighting the conformation that is currently displayed in the 3D window.
Availability and implementation: SmoothT is available as a webservice at http://proteinformatics.org/smoothT. Examples, a tutorial and FAQs can be found there. Ensembles up to 2 GB (compressed) can be uploaded. Results will be stored for 5 days. The server is completely free and requires no registration. The C++ source code is available at https://github.com/starbeachlab/smoothT
Categories: Bioinformatics Trends
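The graph construction and pathway search described in the SmoothT entry above can be sketched in a few lines of Python. This is an illustrative sketch only, not SmoothT's implementation: it assumes a precomputed pairwise RMSD matrix, and it assumes that "energetically most favorable" means the path whose highest energy is as low as possible (a minimax criterion). The abstract does not state the actual objective, and all function and variable names here are hypothetical.

```python
import heapq


def build_graph(rmsd, cutoff):
    """Connect conformations i and j when their RMSD is below the cutoff."""
    n = len(rmsd)
    return {i: [j for j in range(n) if j != i and rmsd[i][j] < cutoff]
            for i in range(n)}


def lowest_barrier_path(graph, energy, start, goal):
    """Dijkstra variant that minimizes the highest energy visited on the path.

    The minimax objective is an assumption made for this sketch.
    Returns (barrier, path) or None if the goal is unreachable.
    """
    best = {start: energy[start]}
    parent = {start: None}
    heap = [(energy[start], start)]
    while heap:
        barrier, node = heapq.heappop(heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return barrier, path[::-1]
        if barrier > best[node]:
            continue  # stale heap entry
        for nxt in graph[node]:
            cand = max(barrier, energy[nxt])
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return None


# Toy ensemble of four conformations (made-up numbers).
rmsd = [
    [0.0, 1.0, 1.2, 3.0],
    [1.0, 0.0, 2.5, 1.1],
    [1.2, 2.5, 0.0, 1.3],
    [3.0, 1.1, 1.3, 0.0],
]
energy = [-10.0, -2.0, -8.0, -9.0]
graph = build_graph(rmsd, cutoff=1.5)
print(lowest_barrier_path(graph, energy, start=0, goal=3))
```

In the toy data, conformations 0 and 3 are not direct neighbors, so the path must pass through conformation 1 or 2; the search prefers conformation 2 because its energy is lower.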

eccDB: a comprehensive repository for eccDNA-mediated chromatin contacts in multi-species

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Summary: We developed the eccDB database to integrate available resources for eccDNA data. eccDB is a comprehensive repository for storing, browsing, searching, and analyzing extrachromosomal circular DNAs (eccDNAs) from multiple species. The database provides regulatory and epigenetic information on eccDNAs, with a focus on analyzing intrachromosomal and interchromosomal interactions to predict their transcriptional regulatory functions. Moreover, eccDB identifies eccDNAs from unknown DNA sequences and analyzes the functional and evolutionary relationships of eccDNAs among different species. Overall, eccDB offers web-based analytical tools and a comprehensive resource for biologists and clinicians to decipher the molecular regulatory mechanisms of eccDNAs.
Availability: eccDB is freely available at http://www.xiejjlab.bio/eccDB
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Finite Mixtures of Matrix Variate Poisson-Log Normal Distributions for Three-Way Count Data

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: Three-way data structures, characterized by three entities (the units, the variables and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks.
Results: In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo based approach, a variational Gaussian approximation based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.
Availability: The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

K-RET: Knowledgeable Biomedical Relation Extraction System

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: Relation Extraction (RE) is a crucial process for dealing with the amount of text published daily, for example, to find missing associations in a database. RE is a text mining task for which the state-of-the-art approaches use bidirectional encoders, namely, BERT. However, state-of-the-art performance may be limited by the lack of efficient external knowledge injection approaches, with a larger impact in the biomedical area given the widespread usage and high quality of biomedical ontologies. This knowledge can propel these systems forward by aiding them in predicting more explainable biomedical associations. With this in mind, we developed K-RET, a novel, knowledgeable biomedical relation extraction system that, for the first time, injects knowledge by handling different types of associations, multiple sources and where to apply it, and multi-token entities.
Results: We tested K-RET on three independent and open-access corpora (DDI, BC5CDR, and PGR) using four biomedical ontologies handling different entities. K-RET improved state-of-the-art results by 2.68% on average, with the DDI Corpus yielding the most significant boost in performance, from 79.30% to 87.19% in F-measure, representing a p-value of 2.91 × 10^-12.
Availability: https://github.com/lasigeBioTM/K-RET
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

WAVES (Web-based tool for Analysis and Visualization of Environmental Samples) – a web application for visualization of wastewater pathogen sequencing results

Bioinformatics Oxford Journals - Wed, 05/04/2023 - 5:30am
Abstract
Motivation: Environmental monitoring of pathogens provides an accurate and timely source of information for public health authorities and policymakers. In the last two years, wastewater sequencing proved to be an effective way of detecting and quantifying SARS-CoV-2 variants circulating in the population. Wastewater sequencing produces substantial amounts of geographical and genomic data. Proper visualization of spatial and temporal patterns in these data is crucial for the assessment of the epidemiological situation and for forecasting. Here, we present a web-based dashboard application for visualization and analysis of data obtained from sequencing of environmental samples. The dashboard provides multi-layered visualization of geographical and genomic data. It can display frequencies of detected pathogen variants as well as individual mutation frequencies. The features of WAVES for early tracking and detection of novel variants in wastewater are demonstrated using the example of the BA.1 variant and the signature Spike mutation S:E484A. The WAVES dashboard is easily customized through an editable configuration file and can be used for different types of pathogens and environmental samples.
Availability: WAVES source code is freely available at https://github.com/ptriska/WavesDash under the MIT license.
Supplementary information: A demo version of this application can be accessed at: https://wavesdashboard.azurewebsites.net/
Categories: Bioinformatics Trends

ChemGAPP: A tool for Chemical Genomics Analysis and Phenotypic Profiling

Bioinformatics Oxford Journals - Tue, 04/04/2023 - 5:30am
Abstract
Motivation: High-throughput chemical genomic screens produce informative datasets, providing valuable insights into unknown gene function on a genome-wide level. However, there is currently no comprehensive analytic package publicly available. We developed ChemGAPP to bridge this gap. ChemGAPP integrates various steps in a streamlined and user-friendly format, including rigorous quality control measures to curate screening data.
Results: ChemGAPP provides three sub-packages for different chemical-genomic screens: ChemGAPP Big for large-scale screens, ChemGAPP Small for small-scale screens, and ChemGAPP GI for genetic interaction screens. ChemGAPP Big, tested against the E. coli KEIO collection, revealed reliable fitness scores which displayed biologically relevant phenotypes. ChemGAPP Small demonstrated significant changes in phenotype in a small-scale screen. ChemGAPP GI was benchmarked against three sets of genes with known epistasis types and successfully reproduced each interaction type.
Availability: ChemGAPP is available at https://github.com/HannahMDoherty/ChemGAPP, as a standalone Python package as well as Streamlit applications.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets

Bioinformatics Oxford Journals - Mon, 03/04/2023 - 5:30am
Abstract
Motivation: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for semantic representation of words and phrases, and these methods offer new opportunities to map human trait names in the form of words and short phrases, both to ontologies and to each other. Here we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping.
Results: In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best at predicting these, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance only mapped 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity.
Availability: Our code is available at https://github.com/MRCIEU/vectology.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
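The Levenshtein baseline mentioned in the entry above maps a trait name to whichever ontology label needs the fewest single-character edits. A minimal sketch of that idea follows. The trait and label strings are made up for illustration, the lower-casing step is an assumption, and the sketch does not reproduce how the authors configured their baseline.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_term(trait, terms):
    """Return the label with the smallest edit distance to the trait name."""
    return min(terms, key=lambda t: levenshtein(trait.lower(), t.lower()))


# Hypothetical labels, not taken from EFO or UK Biobank.
labels = ["body mass index", "bone density", "blood pressure"]
print(closest_term("Body mass index (BMI)", labels))
```

Edit distance only sees surface spelling, so synonyms with different wording score poorly, which is consistent with the lower accuracy the abstract reports for this baseline compared with the embedding models.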

PreTP-2L: identification of therapeutic peptides and their types using two-layer ensemble learning framework

Bioinformatics Oxford Journals - Mon, 03/04/2023 - 5:30am
Abstract
Motivation: Therapeutic peptides play an important role in immune regulation. Recently, various therapeutic peptides have been used in medical research, and they have great potential in the design of therapeutic schedules. Therefore, it is essential to use computational methods to predict therapeutic peptides. However, therapeutic peptides cannot be accurately predicted by the existing predictors. Furthermore, chaotic datasets are also an important obstacle to the development of this field. Therefore, it is still challenging to develop a multi-classification model for identification of therapeutic peptides and their types.
Results: In this work, we constructed a general therapeutic peptide dataset. An ensemble learning method named PreTP-2L was developed for predicting various therapeutic peptide types. PreTP-2L consists of two layers. The first layer predicts whether a peptide sequence is a therapeutic peptide, and the second layer predicts whether a therapeutic peptide belongs to a particular species.
Availability: A user-friendly webserver for PreTP-2L can be accessed at http://bliulab.net/PreTP-2L.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Dipwmsearch: a python package for searching di-PWM motifs

Bioinformatics Oxford Journals - Mon, 03/04/2023 - 5:30am
Abstract
Motivation: Seeking probabilistic motifs in a sequence is a common task to annotate putative transcription factor binding sites (TFBS) or other RNA/DNA binding sites. Useful motif representations include Position Weight Matrices (PWMs), dinucleotide PWMs (di-PWMs), and Hidden Markov Models (HMMs). Dinucleotide PWMs retain the simplicity of PWMs (a matrix form and a cumulative scoring function) but also incorporate dependency between adjacent positions in the motif (unlike PWMs, which disregard any dependency). For instance, to represent binding sites, the HOCOMOCO database provides di-PWM motifs derived from experimental data. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences.
Results: We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task (it first enumerates matching words for the di-PWM, and then searches for all of them at once in the sequence, even if the latter contains IUPAC codes). The user benefits from an easy installation via PyPI or conda, comprehensive documentation, and executable scripts that facilitate the use of di-PWMs.
Availability: dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under the CeCILL license.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
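To make the cumulative scoring function of a di-PWM concrete, the sketch below scores each sequence window as the sum of the scores of its overlapping dinucleotides, one matrix column per pair of adjacent positions, as di-PWMs are commonly defined. It is a naive sliding-window scan written for illustration. It is not the enumeration-based algorithm of dipwmsearch, it does not use the dipwmsearch API, and it ignores IUPAC codes. The toy matrix, its integer scores, and all function names are hypothetical.

```python
from itertools import product


def make_column(favoured, high=2, low=-1):
    """Toy di-PWM column: one score for each of the 16 dinucleotides."""
    return {a + b: (high if a + b in favoured else low)
            for a, b in product("ACGT", repeat=2)}


def score_window(dipwm, window):
    """Sum the scores of the overlapping dinucleotides in the window."""
    return sum(col[window[i:i + 2]] for i, col in enumerate(dipwm))


def scan(dipwm, sequence, threshold):
    """Return (start, score) for each window scoring at least the threshold."""
    width = len(dipwm) + 1  # m positions give m - 1 dinucleotide columns
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(dipwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


# A made-up motif of length 4 that favours the word ACGT.
dipwm = [make_column({"AC"}), make_column({"CG"}), make_column({"GT"})]
print(scan(dipwm, "TTACGTAACGA", threshold=5))
```

Because every window is rescored from scratch, this scan does work proportional to sequence length times motif length; the abstract's approach of enumerating matching words first is aimed at avoiding that.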

DNAscan2: a versatile, scalable, and user-friendly analysis pipeline for human next-generation sequencing data

Bioinformatics Oxford Journals - Mon, 03/04/2023 - 5:30am
Abstract
Summary: The current widespread adoption of next-generation sequencing (NGS) in all branches of basic research and clinical genetics means that users with highly variable informatics skills, computing facilities and application purposes need to process, analyse, and interpret NGS data. In this landscape, versatility, scalability, and user-friendliness are key characteristics for NGS analysis software. We developed DNAscan2, a highly flexible, end-to-end pipeline for the analysis of NGS data, which (i) can be used for the detection of multiple variant types, including SNVs, small indels, transposable elements, short tandem repeats and other large structural variants; (ii) covers all standard steps of NGS analysis, from quality control of raw data and genome alignment to variant calling, annotation and generation of reports for the interpretation and prioritisation of results; (iii) is highly adaptable as it can be deployed and run via either a graphical user interface for non-bioinformaticians or a command line tool for personal computer usage; (iv) is scalable as it can be executed in parallel as a Snakemake workflow; and (v) is computationally efficient by minimising RAM and CPU time requirements.
Availability and implementation: DNAscan2 is implemented in Python3 and is available at https://github.com/KHP-Informatics/DNAscanv2.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Boolean Network Sketches: A Unifying Framework for Logical Model Inference

Bioinformatics Oxford Journals - Sun, 02/04/2023 - 5:30am
Abstract
Motivation: The problem of model inference is of fundamental importance to systems biology. Logical models (e.g., Boolean networks; BNs) represent a computationally attractive approach capable of handling large biological networks. The models are typically inferred from experimental data. However, even with a substantial amount of experimental data supported by some prior knowledge, existing inference methods often focus on a small sample of admissible candidate models only.
Results: We propose Boolean network sketches as a new formal instrument for the inference of Boolean networks. A sketch integrates (typically partial) knowledge about the network’s topology and the update logic (obtained through, e.g., a biological knowledge base or a literature search), as well as further assumptions about the properties of the network’s transitions (e.g., the form of its attractor landscape), and additional restrictions on the model dynamics given by the measured experimental data. Our new BN inference algorithm starts with an initial sketch, which is extended by adding restrictions representing experimental data to obtain a data-informed sketch, and subsequently computes all BNs consistent with the data-informed sketch. Our algorithm is based on a symbolic representation and coloured model-checking. Our approach is unique in its ability to cover a broad spectrum of knowledge and efficiently produce a compact representation of all inferred BNs. We evaluate the method on a non-trivial collection of real-world and simulated data.
Availability: All software and data are freely available as a reproducible artefact at https://doi.org/10.5281/zenodo.7688740.
Supplementary information: Supplementary data are available online through Bioinformatics.
Categories: Bioinformatics Trends
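The idea of computing all Boolean networks consistent with a sketch can be illustrated by brute force on a toy example. The sketch below assumes a hypothetical two-gene network in which the update rule of A is known (A copies B), the rule of B is completely unknown, and the data restriction is that certain observed states are fixed points under synchronous update. None of this comes from the paper: the authors use a symbolic representation and coloured model-checking rather than explicit enumeration, and the update scheme and all names here are assumptions.

```python
from itertools import product

# All joint states (A, B) of the two-gene toy network.
STATES = list(product((0, 1), repeat=2))


def candidate_functions():
    """Enumerate all 16 truth tables for the unknown update function of B."""
    for outputs in product((0, 1), repeat=len(STATES)):
        yield dict(zip(STATES, outputs))


def step(state, f_b):
    """Synchronous update: A copies B (known), B follows the candidate table."""
    a, b = state
    return (b, f_b[(a, b)])


def consistent_networks(fixed_points):
    """Keep every candidate for which each observed state is a fixed point."""
    return [f for f in candidate_functions()
            if all(step(s, f) == s for s in fixed_points)]


# Hypothetical observation: (0, 0) and (1, 1) are steady states.
models = consistent_networks([(0, 0), (1, 1)])
print(len(models), "of 16 candidate networks remain")
```

Explicit enumeration grows doubly exponentially with the number of regulators, which is why a compact symbolic encoding of the whole candidate set, as described in the abstract, matters for realistic networks.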

STGRNS: An interpretable Transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data

Bioinformatics Oxford Journals - Sun, 02/04/2023 - 5:30am
Abstract
Motivation: Single-cell RNA-sequencing (scRNA-seq) technologies provide an opportunity to infer cell-specific gene regulatory networks (GRNs), which is an important challenge in systems biology. Although numerous methods have been developed for inferring GRNs from scRNA-seq data, it is still a challenge to deal with cellular heterogeneity.
Results: To address this challenge, we developed an interpretable transformer-based method named STGRNS for inferring GRNs from scRNA-seq data. In this algorithm, the gene expression motif (GEM) technique was proposed to convert gene pairs into contiguous sub-vectors, which can be used as input for the transformer encoder. By avoiding missing phase-specific regulations in a network, GEM can improve the accuracy of GRN inference for different types of scRNA-seq data. To assess the performance of STGRNS, we ran comparative experiments with some popular methods on extensive benchmark datasets, including 21 static and 27 time-series scRNA-seq datasets. All the results show that STGRNS is superior to the other methods compared. In addition, STGRNS was also shown to be more interpretable than “black box” deep learning methods, which are well known for the difficulty of explaining their predictions clearly.
Availability: The source code and data are available at https://github.com/zhanglab-wbgcas/STGRNS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

AdaLiftOver: High-resolution identification of orthologous regulatory elements with adaptive liftOver

Bioinformatics Oxford Journals - Sun, 02/04/2023 - 5:30am
Abstract
Motivation: Elucidating functionally similar orthologous regulatory regions for human and model organism genomes is critical for exploiting model organism research and advancing our understanding of results from genome-wide association studies. Sequence conservation is the de facto approach for finding orthologous non-coding regions between human and model organism genomes. However, existing methods for mapping non-coding genomic regions across species are challenged by multi-mapping, low precision, and low mapping rates.
Results: We develop Adaptive liftOver (AdaLiftOver), a large-scale computational tool for identifying functionally similar orthologous non-coding regions across species. AdaLiftOver builds on the UCSC liftOver framework to extend the query regions and prioritizes the resulting candidate target regions based on the conservation of the epigenomic and the sequence grammar features. Evaluations of AdaLiftOver with multiple case studies, spanning both genomic intervals from epigenome datasets across a wide range of model organisms and GWAS SNPs, show AdaLiftOver to be a versatile method for deriving hard-to-obtain human epigenome datasets as well as reliably identifying orthologous loci for GWAS SNPs.
Availability: The R package AdaLiftOver is available from https://github.com/ThomasDCY/AdaLiftOver.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

Bioinformatics Oxford Journals - Sun, 02/04/2023 - 5:30am
Abstract
Motivation: This paper describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian and a smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL (Loukachevitch et al., 2021) by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect.
Results: NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO has the following specific features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results.
Availability: The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.
Categories: Bioinformatics Trends

mlf-core: a framework for deterministic machine learning

Bioinformatics Oxford Journals - Sun, 02/04/2023 - 5:30am
Abstract
Motivation: Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. Solely fixing all random seeds is not sufficient for deterministic machine learning, as major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations.
Results: Various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which helps machine learning projects meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
Availability: The complete data together with the implementations of the mlf-core ecosystem and use case models are available at https://github.com/mlf-core.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
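One reason fixed random seeds are not enough, as the entry above notes, is that atomic additions on parallel hardware can complete in a different order on each run, and floating-point addition is not associative. The short sketch below demonstrates the underlying arithmetic effect in plain Python by accumulating the same three numbers in two orders. It is an illustration of the principle only: it does not use mlf-core, TensorFlow, PyTorch, or XGBoost, and the function name is hypothetical.

```python
def accumulate(values):
    """Add values one by one, mimicking updates arriving in a given order."""
    total = 0.0
    for v in values:
        total += v
    return total


contributions = [0.1, 0.2, 0.3]
forward = accumulate(contributions)
backward = accumulate(reversed(contributions))

# Same numbers, different order, different floating-point result.
print(forward, backward, forward == backward)
```

An explicit loop is used instead of the built-in sum() because recent Python versions apply compensated summation to floats, which would hide the effect. In a training run, differences of this size in gradients can compound over many steps, so two runs with identical seeds may still diverge.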
