Expression data + cell localization.
Can the use of (predicted and experimental) data on cellular
localization help distinguish between true and false
positive when expression data is analyzed to find actors and
Implementing a MUMMER-like algorithm using Daniela’s suffix
tree (Java) library. This involves writing a hybrid
algorithm “k-bands dynamic programming algorithm + suffix
Genomes are evolving at several scales: from point mutations
to large rearrangements. It the late 80s, it became evident
that several closely related genomes had genes that were
extremely similar (say 99 pid),
one to another, but the order of genes along the chromosomes
was not preserved. Review and present the main algorithms to
compare entire genomes. Topics include: sorting by reversals
(Sankoff), break point graph, Hannenhalli and Pevzner algorithm.
What is an ontology? What tools
and knowledge representation formalisms (languages) are
available to support the development of ontologies. Give
examples of ontologies. Expose the problems associated with
ontologies. An ontology is a
controlled vocabulary (e.g. gene ontology). It allows
to resolve some of the problems
associated with data integration.
Because of physical limitations, only relatively short DNA
sequences can be read (some 500 nt).
For processing a complete genome, one approach, called
shut-gun, consists of sampling small reads (500
nt) at random location along the
chromosomes. The total number of reads is chosen so that the
likelihood that each nucleotide is included into more than
one read is high (typically each nt
is part of 3, 5 or 10 reads). Computers are then used to
stitch the reads together. One solution to this problem is
related to the shortest super-string problem.
Grammatical frameworks for RNA structure.
RNA secondary structure information can be represented using
context-free grammars. As with most biological data, the
information is better represented within a statistical
framework. A “Stochastic Context-Free Grammar” (SCFG) has
probabilities attached to its production rules. The two main
issues with SCFGs are the parsing and the induction of the
grammar. Review the literature on SCFGs (this includes COVE,
infernal and pfold), and build a prototype parser in Java.
Predicting Gene-Gene (Protein-Protein) interactions.
There exist a vast number of algorithms that allow
to predict if two genes will be
interacting. This includes: text-mining, co-location along
the chromosomes, phylogenetic footprinting, etc.
Predicting the three-dimensional structure of a protein is a
notoriously difficult problem. So much that alternative
problems have been devised to circumvent it: secondary
structure prediction, inverse folding problem, etc. Some
authors have also been studying simpler systems, such as 2D
and 3D lattices. Create your own implementation; this
includes an algorithm to efficiently search the folding
space and a scoring function. Run some simulations.
Structure comparison methods.
Review the literature on 3D structure comparison. Implement
at least one algorithm. Input: 2 three-dimensional
structures, output: a measure of distance (typically
root-mean-square deviation expressed in Ĺ), and a list of
Methods for detecting trans-membrane helices.
There is class of transmenbrane proteins whose secondary
structure can be reliably predicted. Those proteins are
mainly made of
helices, such that if the loop connecting the helices
+ i is exposed to the
inside of the cell, then the next one will be exposed to the
outside of the cell. Use a Hidden Markov Model or Neural
Network to reproduce this result.
Secondary Structure Prediction.
Implement a secondary structure prediction method and
compare its accuracy to known methods. Common choices for
your implementation include: Neural Networks, Hidden Markov
Models, and possibly decision trees.
Implement a algorithm to predict
the solvent accessibility. Common choices for your
implementation include: Neural Networks, Hidden Markov
Models, and possibly decision trees.
Applications of suffix trees.
Use Daniela Cernea’s suffix tree library and implement some
of the following algorithms: linear time algorithm for
finding the longest common substring of
k strings (interestingly,
Knuth had predicted that no linear time algorithm would be
found for solving this problem), finding all maximal
repetitive substrings in linear time, finding all maximum
Bioinformatics deals with biological and medical data,
according there are numerous related ethical issues: should
patenting genes be allowed? how
to handle patient data? how to
deal with genomic data, imagine that the analysis of a
dataset allows to draw conclusions about a population, a
religious group, people who live in a specific region, etc.
The consequences can be sever: it
could be that this group will be more likely to suffer from
certain diseases, such information could be used by
insurance companies, employers, etc. to screen candidate.
Genome motifs viewer.
Construct a flexible graphical using interface to visualize
shared motifs. Suggestions: make it 3D to ease viewing
multiple strings. Motifs would be extracted from a suffix
interactive linear time construction of a suffix tree,
showing the suffix links, interactive tools for software
Expectation-Maximization (EM) algorithm and some of its
applications in molecular biology.
EM is used for training certain Hidden Markov Models,
Covariance Models and building phylogenetic trees. What is
it? What are the main applications? Prototype
This technique forms the basis for several motif detection
tools. What is it? What are the main applications? Prototype
What are bayesian networks? What is interesting about them?
What are the bioinformatics applications of bayesian
networks? Carry out a small experiment. (S)
Predicting Phenotype from Patterns of Annotation,
One of the goals of bioinformatics research is to transform
molecular biology into a predictive science. For example,
given a certain pattern of gene expression, detected by
-arrays for example, what would be the best treatment
(personalized medicine)? Survey the literature on the use of
bioinformatics techniques to assist medical diagnosis,
prognosis and treatments. Where are we heading? When will
personalized medicine be true? How much data? Remaining
problems to be solved?
Statistics behind BLAST.
Good candidate for a multiple
work, where one team would focus on the statistics of word
matching while the other would focus on hashing. Produce a
Java implementation of hashing techniques for speeding up
the sequence alignment problem. The part on the statistical
analysis of hits requires a statistical background (S) but
not the algorithmic part.
Constructing phylogenetic trees.
Read an overview of the construction of phylogenetic trees
using a neighbour-joining approach. For this project, you
will produce a prototype implementation, in Java, of a
modern method such as: quartet method, maximum likelihood or
maximum parsimony. (s)
One of the main bioinformatics contributions to drug
discovery is the Quantitative Structure Activity
Relationship analysis (QSAR); the other is molecular
docking. QSAR analyses take as input a set of compounds and
their relative activity/efficacy. It then finds the
commonalities between those molecules. The commonalities are
then used to design new/better drugs.
of predicting how two molecules will interact. This can
either be two proteins or one protein and a small compound,
such as a new drug. The two main factors that are taken into
account are the shape and electrostatics of the two
is a large
collection of classes for solving bioinformatics problems.
A protein viewer was developed two years ago in the context
of a CSI 4900 project. Extensions of this project could be
Review the literature on tandem repeats detection and
implement a prototype application. Tandem repeats are
repeats of the form n, s.t. 2
5, in the
case of micro-satellites, and each unit,
, is degenerated (which implies that the algorithms
must allow for mismatches).
Simultaneous alignment and structure prediction for two RNA
Implement a simplified version of dynalign, where the
secondary structure prediction is calculated using the
Nissinov algorithm; i.e. finds the maximum number of base
3 way genome alignments.