Accurate Splice Site Detection (Supplementary Material)

Abstract

For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel, which turns out to be well suited for this task, as we illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods, including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready for incorporation into a gene finder.

The paper is available here.

The data splits, additional information on model selection, the whole-genome predictions, and the stand-alone prediction tool are available on request. For questions, please contact raetsch@cbio.mskcc.org.

Positional Oligomer Importance Matrices (POIMs)

  • Single Nucleotide POIMs for all organisms (html, pdf)
  • Di-Nucleotide POIMs for all organisms (html, pdf)

Discriminating k-mers

  • Top discriminating k-mers H.sapiens (pdf)
  • Top discriminating k-mers D.rerio (pdf)
  • Top discriminating k-mers D.melanogaster (pdf)
  • Top discriminating k-mers A.thaliana (pdf)
  • Top discriminating k-mers C.elegans (pdf)

Pilot Studies - Model Selection

Genome-wide Studies

Stand-alone Splice Site Predictor

  • We have developed a stand-alone splice site predictor
    • Download version 0.3 from here (version 0.2)
    • Alternatively, use the web server (see below)
  • Prerequisites:
  • Organism specific files
    • H. sapiens (human)
    • D. melanogaster (fly)
    • C. elegans (worm)
    • A. thaliana (cress)
    • D. rerio (zebra fish)

Web-Server

We offer a web server for predicting splice sites with pre-trained SVMs and for training your own splice site sensors. It is implemented in the Galaxy framework. To use it, follow these steps:

  • go to our Galaxy service, http://galaxy.raetschlab.org/
  • use "Upload file" or "Get data" (in the toolbar on the left) to provide a data set in FASTA format
  • on the left, open "mGene Tools"
  • use "GenomeTool" to preprocess your sequences
  • optionally, use "SignalTrain" to train a splice sensor for a new species
  • use "SignalPredict" to predict acceptor or donor splice sites
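
Before uploading, it can help to check that a FASTA file parses cleanly and to see how many candidate sites it contains. The following is a minimal sketch, assuming that candidates are taken at the canonical consensus dinucleotides (AG for acceptors, GT for donors). The helper names `read_fasta` and `candidate_sites` are hypothetical and are not part of the mGene tools or the Galaxy service.

```python
from typing import Dict, List, Optional


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record name: upper-case sequence}."""
    records: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def candidate_sites(seq: str, consensus: str) -> List[int]:
    """Return the 0-based start position of every occurrence of `consensus`."""
    k = len(consensus)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == consensus]


# Illustrative input only; not taken from the data sets described here.
example = ">seq1 demo\nccaggtaagt\nTTTCAGGT\n"
sequences = read_fasta(example)
acceptor_candidates = candidate_sites(sequences["seq1"], "AG")  # AG dinucleotides
donor_candidates = candidate_sites(sequences["seq1"], "GT")     # GT dinucleotides
```

The positions returned are 0-based indices of the first consensus nucleotide; whether the server reports sites relative to the same base is not stated here and should be checked against its output.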

Final Model Parameters

H.sapiens
window        C  order  shift  ppseudo  npseudo  type      method
199+[-60,80]  3  22     0      -        -        acceptor  WD-SVM
199+[-60,80]  3  22     0.3    -        -        acceptor  WDS-SVM
199+[-25,25]  -  3      -      10       1000     acceptor  MCs
199+[-80,60]  3  22     0      -        -        donor     WD-SVM
199+[-80,60]  3  22     0.3    -        -        donor     WDS-SVM
199+[-17,18]  -  3      -      0.01     1000     donor     MCs

A.thaliana
window        C  order  shift  ppseudo  npseudo  type      method
199+[-60,80]  3  22     0      -        -        acceptor  WD-SVM
199+[-60,80]  3  22     0.5    -        -        acceptor  WDS-SVM
199+[-80,80]  -  4      -      10       1        acceptor  MCs
199+[-80,60]  3  26     0      -        -        donor     WD-SVM
199+[-80,60]  3  22     0.5    -        -        donor     WDS-SVM
199+[-80,80]  -  4      -      10       10       donor     MCs

C.elegans
window        C  order  shift  ppseudo  npseudo  type      method
199+[-60,80]  3  22     0      -        -        acceptor  WD-SVM
199+[-60,80]  3  22     0.3    -        -        acceptor  WDS-SVM
199+[-25,25]  -  3      -      10       1000     acceptor  MCs
199+[-80,60]  3  22     0      -        -        donor     WD-SVM
199+[-80,60]  3  22     0.3    -        -        donor     WDS-SVM
199+[-17,18]  -  3      -      0.01     1000     donor     MCs

D.rerio
window        C  order  shift  ppseudo  npseudo  type      method
199+[-60,80]  3  22     0      -        -        acceptor  WD-SVM
199+[-60,80]  3  22     0.3    -        -        acceptor  WDS-SVM
199+[-60,60]  -  3      -      0        1000     acceptor  MCs
199+[-80,60]  3  22     0      -        -        donor     WD-SVM
199+[-80,60]  3  22     0.3    -        -        donor     WDS-SVM
199+[-60,60]  -  3      -      0        1000     donor     MCs
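
To make the role of the `order` parameter of the SVM models concrete, here is a minimal sketch of the weighted degree (WD) kernel in its standard form: it counts k-mers of length 1 up to `order` that match at the same position in two equal-length sequences, weighting length d by beta_d = 2(order - d + 1) / (order (order + 1)). This is an unoptimized illustration, not the implementation used to produce the results; the function name `wd_kernel` is hypothetical, and the shifted variant (WDS), which additionally allows matches at nearby positions, is omitted.

```python
def wd_kernel(x: str, y: str, order: int) -> float:
    """Weighted degree kernel between two sequences of equal length.

    For each k-mer length d = 1..order, count the positions at which the
    length-d substrings of x and y are identical, and sum these counts
    with weights beta_d = 2 * (order - d + 1) / (order * (order + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if order < 1:
        raise ValueError("order must be at least 1")

    length = len(x)
    total = 0.0
    for d in range(1, order + 1):
        beta = 2.0 * (order - d + 1) / (order * (order + 1))
        # Number of start positions where the d-mers of x and y coincide.
        matches = sum(
            1 for pos in range(length - d + 1) if x[pos:pos + d] == y[pos:pos + d]
        )
        total += beta * matches
    return total


# Toy sequences for illustration; real inputs would be the fixed-length
# windows around candidate splice sites.
k_same = wd_kernel("ACGT", "ACGT", order=2)
k_diff = wd_kernel("ACGT", "ACGA", order=2)
```

Because shorter k-mers receive larger weights, a single mismatch lowers the kernel value gradually rather than zeroing the contribution of the whole region around it.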

Genome-wide Data Sets for Worm, Fly, Cress, Fish, and Human

  • All donor and acceptor data sets for all organisms, as well as genome-wide predictions in custom track format, are available for download from the public FTP server

Genome-wide Predictions (Custom Tracks)

  • Genome-wide predictions in custom track format are available for download here
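
For readers who want to view their own per-position scores alongside such tracks, the sketch below writes predictions as a genome browser custom track. It assumes the bedGraph flavour (a `track` header line followed by tab-separated chromosome, 0-based start, exclusive end, and value); the exact format of the files offered here is not stated, and the function name `to_bedgraph` and the example values are hypothetical.

```python
from typing import Iterable, Tuple


def to_bedgraph(
    predictions: Iterable[Tuple[str, int, float]],
    name: str,
    description: str,
) -> str:
    """Format (chromosome, 0-based position, score) triples as a bedGraph track."""
    lines = [f'track type=bedGraph name="{name}" description="{description}"']
    # Sort by chromosome and position so that entries are in genomic order.
    for chrom, pos, score in sorted(predictions):
        # Each prediction covers a single base: [pos, pos + 1).
        lines.append(f"{chrom}\t{pos}\t{pos + 1}\t{score:.3f}")
    return "\n".join(lines) + "\n"


# Made-up scores for illustration only.
track_text = to_bedgraph(
    [("chrI", 1200, 0.91), ("chrI", 350, 0.12)],
    name="acceptor_scores",
    description="example acceptor site scores",
)
```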