Preliminary Discussion

Discussions on the new Rose

What type of program do we want?

  1. Rose 2.0 (+ Rearrangements + parameter estimation)
  2. Game
    • Educational
    • Entertaining
  Rearrangements + parameter estimation 

Closer look at Rose 2.0

  • Tree vs. DAG
  • Genome data structure including meta data
  • Evolution simulator
  • Individuals vs. species
  • Operations:
    • Indels
    • Substitutions
    • Rearrangements
    • Horizontal gene transfer
    • Duplications
  and/or: grammar or else ...

Considerations and feature requests for the new Rose version

  • Nice User Interface which helps setting up a configuration file
  • A "niceness" score for the quality of sequences, usable as a fitness function and for passing abilities on to the child generation.

What Dawg has and Rose lacks:

  • General time-reversible (GTR) model
  • Model for Recombination
  • Indel parameter estimation
  • Poisson process as model for indel formation
  • Consideration of Indels overlapping at sequence ends
  • Alignment algorithm
  • In Rose mean sequence length after many Indel operations always grows larger
  • Substitution model in Rose: minimum branch length can't get smaller than 1 PAM
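
To make the Poisson-process item concrete, here is a minimal Java sketch that draws indel event times along one branch from exponential waiting times. It is not Dawg's implementation; the class name `PoissonIndelEvents`, the method `eventTimes` and the single constant `rate` are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: event times of a homogeneous Poisson process on one branch. */
final class PoissonIndelEvents {

    private PoissonIndelEvents() { }

    /**
     * Returns the increasing event times in [0, branchLength).
     * Waiting times between events are exponential with mean 1/rate.
     */
    static List<Double> eventTimes(double rate, double branchLength, Random rng) {
        if (rate < 0.0 || branchLength < 0.0) {
            throw new IllegalArgumentException("rate and branch length must be non-negative");
        }
        List<Double> times = new ArrayList<>();
        if (rate == 0.0) {
            return times; // no events can occur
        }
        double t = 0.0;
        while (true) {
            // inverse-transform sampling of an exponential waiting time
            t += -Math.log(1.0 - rng.nextDouble()) / rate;
            if (t >= branchLength) {
                return times;
            }
            times.add(t);
        }
    }

    public static void main(String[] args) {
        // assumed example values: 0.5 events per unit of branch length, branch of length 10
        System.out.println(eventTimes(0.5, 10.0, new Random(2009)));
    }
}
```

With this formulation the number of events on a branch is Poisson distributed with mean `rate * branchLength`, so continuous edge lengths are handled without a minimum step size.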

What iSG has:

  • Generate root sequence out of a multiple alignment

What GSIMULATOR/SIMGRAM/SIMGENOME has

  • a model!
  • GSIMULATOR: transducer-based simulator supplying substitution, indels and transducer mutations along a phylogenetic tree
  • SIMGRAM: samples data using phylo-grammars. Uses XRATE for parameter estimation
  • SIMGENOME: combines SIMGRAM and GSIMULATOR, can model protein-coding genes, non-coding genes, pseudogenes, transposons, conserved elements, microsatellites

What covers the infinite sites model?

  • two breakpoint rearrangement
  • deletion/insertion as special cases of two breakpoint rearrangements
  • three breakpoint rearrangement
  • duplication
  • speciation

The infinite sites model treats chromosomes either as continuous intervals or as continuous circles, which are divided into sites. No breakpoints are reused. The model can be transferred to a finite sites model with some special characteristics.
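
As a small illustration of a two-breakpoint rearrangement, the Java sketch below performs an inversion on a linear chromosome stored as a signed gene order. The representation (genes numbered from 1, sign = orientation) and the names `Rearrangements` and `invert` are assumptions for this sketch, not something fixed in the discussion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; names and representation are assumptions for this sketch. */
final class Rearrangements {

    private Rearrangements() { }

    /**
     * Inversion as a two-breakpoint rearrangement on a linear chromosome.
     * The breakpoints lie before index i and before index j (0 <= i <= j <= size);
     * the segment [i, j) is reversed and every gene in it flips its orientation.
     * The input list is left unchanged.
     */
    static List<Integer> invert(List<Integer> genes, int i, int j) {
        if (i < 0 || j > genes.size() || i > j) {
            throw new IllegalArgumentException("invalid breakpoints: " + i + ", " + j);
        }
        List<Integer> result = new ArrayList<>(genes);
        List<Integer> segment = result.subList(i, j); // view onto result
        Collections.reverse(segment);
        segment.replaceAll(g -> -g);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> chromosome = List.of(1, 2, 3, 4, 5);
        System.out.println(invert(chromosome, 1, 4)); // prints [1, -4, -3, -2, 5]
    }
}
```

A finite-sites variant would additionally have to decide what happens when a breakpoint position is reused; the sketch does not address that.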

Requirements

  • fixed default parameters
  • parameter file (created by a wizard)
  • we can use an existing tree format
  • DAG nice to have
  • no wasteful data structures
  • modular & extendable
  • genome annotatable (Intron, Exon, Tetramer, …)
  • annotation by point-and-click possible
  • output:
    • per leaf: sequence
    • per block: multiple alignment
    • per edge: operations
    • order of the blocks
    • colorful point-and-click GUI
    • interactive explorer nice to have
  • meaning of edge lengths: not as in Rose 1 (edge length 1 = mutation probability of 1%)
  • use of a Markov chain

Input Parameters

ROSE

  1. Alphabet $\Sigma$, $|\Sigma| = \ell$
  2. root sequence $s$
    • OR average sequence length $n$ and character frequencies $f = (f_1, \ldots, f_\ell)$
  3. mutation guide tree T (edge length, standard: 1)
    • OR sequence distance $d_{AV}$ (generate a binary T from the average pairwise sequence distance)
  4. mutation matrix $M$, $\ell \times \ell$ (pairwise mutation frequencies for substitutions)
  5. insertion / deletion probability functions
    • $p_{ins}$ / $p_{del}$: probabilities
    • $\ell_{ins}$ / $\ell_{del}$: indel lengths
  6. mutation likelihood vector $\nu$, $|\nu| = n$ (specify sequence motifs)
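
When no root sequence is given, the inputs above are a sequence length n and the character frequencies f. The following Java sketch shows one way such a root could be drawn. It assumes a fixed length of exactly n and independent positions; how Rose itself turns an average length into a concrete one is not covered here, and the names `RootSequenceSampler` and `sample` are hypothetical.

```java
import java.util.Random;

/** Hypothetical helper: draws a root sequence from character frequencies. */
final class RootSequenceSampler {

    private RootSequenceSampler() { }

    /** Draws exactly n characters, each independently according to the frequencies f. */
    static String sample(char[] alphabet, double[] f, int n, Random rng) {
        if (alphabet.length == 0 || alphabet.length != f.length) {
            throw new IllegalArgumentException("alphabet and frequency vector must match");
        }
        StringBuilder sb = new StringBuilder(n);
        for (int pos = 0; pos < n; pos++) {
            double u = rng.nextDouble();
            double cumulative = 0.0;
            int k = 0;
            // walk the cumulative distribution; the last symbol absorbs rounding error
            while (k < f.length - 1 && u >= cumulative + f[k]) {
                cumulative += f[k];
                k++;
            }
            sb.append(alphabet[k]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] dna = {'A', 'C', 'G', 'T'};
        double[] f = {0.3, 0.2, 0.2, 0.3}; // assumed example frequencies
        System.out.println(sample(dna, f, 20, new Random(42)));
    }
}
```

Passing the `Random` instance in keeps runs reproducible, which helps when the same parameter file should always yield the same root.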

Group Meeting 19.05.2009

Groups as follows:



  • Input: Marvin, Rolf, Stefan
  • Tree/Evolution: Christoph, Eyla, Konstantin
  • Output: Marvin, Rolf, Stefan
  • Grammar: Daniel, Kai, Madis

Input/Output Group

  • Marvin: Tree parser (Roland/Pina)
  • Rolf: Wizard page (swing labs wizard), branches & threading possible
  • Out: AGCT, AGCU, RNA, gene order
  • convert sequences

Grammar Group

2 possibilities:

  1. Sequence → Annotation (Intron/Exon) parameter estimation
  2. Evolution (parameters given)

Problems:

  • Haskell → Java: no SCFGs (stochastic context-free grammars)
  • Haskell → Java2: still under construction (Georg)

Alternative: Markov Models/chains (HMM)

Datastructure/Evolution Group

  • Memory:
    • Sequence vs. Operations + root
    • bases, affiliation to regions (eg introns, exons, telomeres, CDS, open reading frame …)
  • Circular vs. linear
  • Edge length: discrete vs. continuous
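
The "Operations + root" alternative can be sketched as follows: only the root sequence is stored, each edge carries its list of operations, and a node's sequence is rebuilt by replaying them. All type names (`Operation`, `Substitution`, `Insertion`, `Deletion`, `OperationReplay`) are hypothetical, and plain strings stand in for whatever sequence type is eventually chosen.

```java
import java.util.List;

/** Hypothetical sketch: rebuild a sequence from the root plus the operations on a path. */
final class OperationReplay {

    private OperationReplay() { }

    /** Applies the operations in order, starting from the root sequence. */
    static String replay(String root, List<Operation> operations) {
        String current = root;
        for (Operation op : operations) {
            current = op.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Operation> path = List.of(
                new Substitution(1, 'T'),
                new Insertion(2, "GG"),
                new Deletion(0, 1));
        System.out.println(replay("ACGT", path)); // prints TGGGT
    }
}

/** One edit stored on an edge of the tree. */
interface Operation {
    String apply(String sequence);
}

final class Substitution implements Operation {
    private final int position;
    private final char replacement;

    Substitution(int position, char replacement) {
        this.position = position;
        this.replacement = replacement;
    }

    @Override
    public String apply(String s) {
        return s.substring(0, position) + replacement + s.substring(position + 1);
    }
}

final class Insertion implements Operation {
    private final int position;
    private final String inserted;

    Insertion(int position, String inserted) {
        this.position = position;
        this.inserted = inserted;
    }

    @Override
    public String apply(String s) {
        return s.substring(0, position) + inserted + s.substring(position);
    }
}

final class Deletion implements Operation {
    private final int position;
    private final int length;

    Deletion(int position, int length) {
        this.position = position;
        this.length = length;
    }

    @Override
    public String apply(String s) {
        return s.substring(0, position) + s.substring(position + length);
    }
}
```

The trade-off is the usual one: storing operations saves memory for long genomes with few changes per edge, while storing full sequences at every node avoids the replay cost on each access.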

Interfaces / Interactions between groups

Input | Tree | Grammar | Output | Input→Interior
genomes ? | | | |
root sequence OR length | X | X | root sequence ? |
annotation | X | X | frequencies, sequences | WATCH: copies!!
Newick tree OR #species | X | X | Newick tree ? | Roland-Tree
character frequencies | | Transitions (H)MM | | Matrix

Data structures

  • How: interfaces (pass everything as Java objects, e.g. “Genome”, “Sequence”, … with getter & setter methods)
    • Sequence: Genom.getChromosome.getSequence returns a String, of type either DNA or AA
      • Proteins consist of domains/AA. The container should hold a list of sequences and an array of annotation intervals for each sequence. The Sequence interface should be inherited by ProteinSequence and DNASequence, which go into the sequence container.
        • A sequence has: alphabet, String (the sequence), annotation or hash
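
The interface idea above could look roughly like this in Java. This is only a sketch under assumptions: the names `Sequence`, `AbstractSequence`, `DNASequence`, `ProteinSequence` and `AnnotationInterval` are illustrative, a `List` is used where the notes mention an array of annotation intervals, and the alphabets are hard-coded for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Small demo; every type name in this sketch is hypothetical. */
final class SequenceDemo {
    public static void main(String[] args) {
        DNASequence dna = new DNASequence("ATGGCCTAA");
        dna.annotate(0, 3, "Exon");
        System.out.println(dna.getSequence() + ": " + dna.getAnnotations().size() + " annotation(s)");
    }
}

/** Half-open interval [start, end) carrying a label such as "Intron" or "Exon". */
final class AnnotationInterval {
    final int start;
    final int end;
    final String label;

    AnnotationInterval(int start, int end, String label) {
        this.start = start;
        this.end = end;
        this.label = label;
    }
}

/** What every sequence exposes: its alphabet, the sequence string, its annotations. */
interface Sequence {
    String getAlphabet();
    String getSequence();
    List<AnnotationInterval> getAnnotations();
}

/** Shared implementation; rejects characters that are not in the alphabet. */
abstract class AbstractSequence implements Sequence {
    private final String alphabet;
    private final String residues;
    private final List<AnnotationInterval> annotations = new ArrayList<>();

    AbstractSequence(String alphabet, String residues) {
        for (char c : residues.toCharArray()) {
            if (alphabet.indexOf(c) < 0) {
                throw new IllegalArgumentException("character not in alphabet: " + c);
            }
        }
        this.alphabet = alphabet;
        this.residues = residues;
    }

    /** Adds an annotation for the non-empty interval [start, end). */
    void annotate(int start, int end, String label) {
        if (start < 0 || end > residues.length() || start >= end) {
            throw new IllegalArgumentException("invalid interval");
        }
        annotations.add(new AnnotationInterval(start, end, label));
    }

    @Override public String getAlphabet() { return alphabet; }
    @Override public String getSequence() { return residues; }
    @Override public List<AnnotationInterval> getAnnotations() {
        return Collections.unmodifiableList(annotations);
    }
}

final class DNASequence extends AbstractSequence {
    DNASequence(String residues) { super("ACGT", residues); }
}

final class ProteinSequence extends AbstractSequence {
    ProteinSequence(String residues) { super("ACDEFGHIKLMNPQRSTVWY", residues); }
}
```

Code that only needs the string and its annotations can depend on `Sequence` alone, so the Input, Grammar and Output groups would not need to know which concrete type they were handed.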

Newick tree: ask Roland about the format & hand it over already parsed as a “tree object” - Input - Output - Working environment:

Conventions

  • The first group sets the style
  • ENGLISH!!
  • Checkstyle
  • Eclipse
  • No commit before update and ALWAYS runnable
  • …. and more ….

Open Questions

Apart from sequences in the input: genomes (i.e. linear or circular chromosomes)?

Meeting 4.6.09

 Tafelbild_090604.jpg