

Abstracts of accepted contributions at MLCB 2007.

Georg Zeller, Gabriele Schweikert, Richard Clark, Stephan Ossowski, Paul Shin, Kelly Frazer, Joseph Ecker, Detlef Weigel, Bernhard Schölkopf and Gunnar Rätsch. Machine Learning Algorithms for Polymorphism Detection
Abstract: As extensive studies of natural variation require the identification of sequence differences among complete genomes, there exists a high demand for precise high-throughput sequencing techniques. While high-density oligo-nucleotide arrays are capable of rapid and comparatively cheap genomic scans, the resulting data is typically much noisier than dideoxy sequencing data. Therefore, the accurate identification of sequence polymorphisms from oligo-nucleotide array data remains an algorithmic challenge [Gresham et al., 2006]. We present machine-learning-based methods for identifying Single Nucleotide Polymorphisms (SNPs) as well as deletions and highly polymorphic regions. Here we describe polymorphism discovery in 20 wild strains of the model plant Arabidopsis thaliana, which has a genome of about 125 Mb. A huge set of array hybridization data comprising nearly 19.2 billion measurements has been collected at Perlegen Sciences Inc. (four 25 nt probes for each base on each genomic strand and strain) [Clark et al., 2007].
Laurent Jacob and Jean-Philippe Vert. Kernel methods for in silico chemogenomics
Abstract: Predicting interactions between small molecules and proteins is a crucial ingredient of the drug discovery process. In particular, accurate predictive models are increasingly used to preselect potential lead compounds from large molecule databases, or to screen for side-effects. While classical in silico approaches focus on predicting interactions with a given specific target, new chemogenomics approaches adopt cross-target views. Building on recent developments in the use of kernel methods in bio- and chemoinformatics, we present a systematic framework to screen the chemical space of small molecules for interaction with the biological space of proteins. We show that this framework allows information sharing across the targets, resulting in a dramatic improvement of ligand prediction accuracy for three important classes of drug targets: enzymes, GPCRs, and ion channels.
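One standard way to realize such cross-target information sharing (a generic construction from the kernel chemogenomics literature, not necessarily the authors' exact kernels) is a product kernel on ligand-target pairs, K((l,t),(l',t')) = K_lig(l,l') * K_tar(t,t'). A minimal sketch with toy placeholder similarity functions:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two ligand feature sets (e.g. fingerprint bits)."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def pair_kernel(lig_kernel, tar_kernel):
    """Tensor-product kernel on (ligand, target) pairs."""
    def k(pair1, pair2):
        (l1, t1), (l2, t2) = pair1, pair2
        return lig_kernel(l1, l2) * tar_kernel(t1, t2)
    return k

def target_kernel(t1, t2):
    """Toy target similarity: hypothetical target classes, with partial
    similarity between different classes so information can still flow."""
    return 1.0 if t1 == t2 else 0.5

k = pair_kernel(tanimoto, target_kernel)
same = k((["C", "N"], "GPCR"), (["C", "N"], "GPCR"))     # identical pair
cross = k((["C", "N"], "GPCR"), (["C", "N"], "enzyme"))  # same ligand, other target
```

Because the pair kernel factorizes, a ligand observed to bind one target contributes (down-weighted) evidence for related targets, which is the information-sharing effect the abstract describes.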
Farshid Moussavi, Fernando Amat, Mark Horowitz, Gal Elidan, Luis Comolli and Kenneth Downing. Markov Random Field Based Automatic Image Alignment for Electron Tomography
Abstract: We present a method for automatic full-precision alignment of the images in a tomographic tilt series. Full-precision automatic alignment of cryo electron microscopy images has remained a difficult challenge to date, due to the limited electron dose and low image contrast. These facts lead to a poor signal-to-noise ratio (SNR) in the images, which causes automatic feature trackers to generate errors, even with high contrast gold particles as fiducial features. To enable fully automatic alignment for full-precision reconstructions, we frame the problem probabilistically as finding the most likely particle tracks given a set of noisy images, using contextual information to make the solution more robust to the noise in each image. To solve this maximum likelihood problem, we use Markov Random Fields (MRF) to establish the correspondence of features in alignment. The resulting algorithm, called Robust Alignment and Projection Estimation for Tomographic Reconstruction, or RAPTOR, has not needed any manual intervention for the difficult datasets we have tried, and has provided sub-pixel alignment that is as good as the manual approach by an expert user. Our method has been applied to challenging cryo electron tomographic datasets with low SNR from intact bacterial cells, as well as several plastic section and X-ray datasets.
Moran Yassour, Tommy Kaplan, Ariel Jaimovich and Nir Friedman. Nucleosome Positioning from Tiling Microarray Data
Abstract: The packaging of DNA around nucleosomes in eukaryotic cells plays a crucial role in transcriptional regulation, e.g. by altering the accessibility of short transcriptional regulatory elements. To better understand transcription regulation, it is therefore important to identify the positions of nucleosomes at 5-10 bp resolution. Toward this end, several recent works measured nucleosomal positions in a high-throughput manner using dense tiling arrays. Here we present a fully automated algorithm to analyze such data. Using a probabilistic graphical model, we suggest to improve the resolution of the nucleosome calls beyond that of the microarray platform used. We show how such a model can be compiled into a simple HMM, allowing for fast inference of the nucleosome positions without any loss of accuracy. We applied our model to nucleosomal data from mid-log yeast cells reported by Yuan et al. (Science, 2005), and compared our predictions to those of the original paper, to a more recent method that uses five times denser tiling arrays (Lee et al., Nat. Genet. 2007), and to a curation of literature-based positions. Our results suggest that by applying our algorithm to the same data of Yuan et al., we were able to trace 13% more nucleosomes and increase the overall accuracy by about 20%. We believe that such an improvement opens the way to a better understanding of the regulatory mechanisms controlling gene expression, and of how they are encoded in the DNA.
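The compilation into an HMM can be illustrated with a minimal two-state model (linker vs. nucleosome) decoded by the Viterbi algorithm; all probabilities and the discretized probe signal below are illustrative placeholders, not values from the paper:

```python
import math

log = math.log

# Two hidden states: 0 = linker, 1 = nucleosomal DNA.
# Sticky transitions encode that nucleosomes span many consecutive probes.
trans = [[log(0.9), log(0.1)],
         [log(0.1), log(0.9)]]
start = [log(0.5), log(0.5)]

def emit(state, signal):
    """Toy emission model: nucleosomal probes tend to show high hybridization."""
    p_high = 0.8 if state == 1 else 0.2
    return log(p_high if signal == "high" else 1.0 - p_high)

def viterbi(signals):
    """Most likely state path for a sequence of discretized probe signals."""
    v = [start[s] + emit(s, signals[0]) for s in (0, 1)]
    back = []
    for x in signals[1:]:
        ptr, nv = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda p: v[p] + trans[p][s])
            ptr.append(best)
            nv.append(v[best] + trans[best][s] + emit(s, x))
        back.append(ptr)
        v = nv
    path = [max((0, 1), key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

calls = viterbi(["low", "low", "high", "high", "high", "low"])
```

Because decoding is linear in the number of probes, the same machinery scales to genome-wide tiling data, which is what makes the HMM compilation attractive.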
Juuso Parkkinen and Samuel Kaski. Searching for functional gene modules with interaction component models
Abstract: Genetic functional modules and protein complexes are being sought from combinations of gene expression and protein-protein interaction data with various clustering-type methods. As far as we know, up to now these methods have used the interaction data as constraints on the clustering of expression data, instead of modeling the noise in the interactions. We model the interaction links with a simple generative "topic model", which is augmented to also generate the expression data. The results outperform a representative set of earlier models in the task of finding modules having enriched functional classes. Moreover, it turns out that the generative model for the links alone works very well in this task.
Allister Bernard, David Orlando, Charles Lin, Edwin Iversen, Steven Haase and Alexander Hartemink. Deconvolution yields a high-resolution view of global gene expression during the cell cycle
Abstract: Gene expression levels measured in a synchronized cell population are convolutions of their expression levels throughout the cell cycle, because perfect cell synchrony is neither attainable at synchronization nor maintainable after release. We have developed a mathematical model called CLOCCS (Characterizing Loss of Cell Cycle Synchrony) to describe how cells in a cell population are distributed throughout the cell cycle as a result of synchrony loss. CLOCCS models synchrony loss from three sources: 1) initial asynchrony, 2) varying rates of cell cycle progression, and 3) asymmetric cell division. Using any measured cell cycle marker (e.g., bud or septum presence) as observed data, we can fit a CLOCCS model via Bayesian MCMC. The models we learn reliably fit data from different experimental conditions, synchronization protocols, labs, and species, and accurately predict DNA content distributions as measured independently via flow cytometry. The models enable us to align data from various experiments (even across species), allowing for more sensible comparison. As just one example, with CLOCCS, we can easily align expression data from S. pombe cells elutriated in one lab with expression data from S. cerevisiae cells arrested with alpha-factor in another lab. The models we learn also enable us to accurately deconvolve gene expression data. In doing so, we explicitly model three distinct expression regimes: expression during recovery from synchronization, expression during the cell cycle common to both mother and daughter cells, and expression during daughter-specific early G1. We use wavelet basis regularization to smooth our gene expression estimates, and we can learn jointly from multiple replicates of the same experiment. The objective function we optimize in our model is convex, so it has a unique global optimum. The resulting deconvolution represents a high-resolution view of global gene expression during the cell cycle, with mother cell, daughter cell, and recovery-specific expression all resolved separately.
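The underlying convolution can be sketched with a toy model: if p[t][s] is the fraction of the population at cell-cycle position s at measurement time t, the observed profile is y[t] = sum_s p[t][s] * x[s], and recovering x is a convex least-squares problem. The tiny distributions and the plain gradient-descent solver below are simplified placeholders for the paper's CLOCCS distributions and wavelet-regularized estimator:

```python
def convolve(p, x):
    """Observed expression: population-weighted mixture over cell-cycle positions."""
    return [sum(p_ts * x_s for p_ts, x_s in zip(row, x)) for row in p]

def deconvolve(p, y, steps=5000, lr=0.1):
    """Recover per-position expression x by gradient descent on the convex
    least-squares objective ||p x - y||^2 (unique optimum for full-rank p)."""
    n = len(p[0])
    x = [0.0] * n
    for _ in range(steps):
        r = [sum(p[t][s] * x[s] for s in range(n)) - y[t] for t in range(len(p))]
        for s in range(n):
            x[s] -= lr * 2 * sum(r[t] * p[t][s] for t in range(len(p)))
    return x

# Toy synchrony-loss distributions: the population is sharply peaked early
# and spreads out over cell-cycle positions at later time points.
p = [[0.8, 0.2, 0.0],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
true_x = [1.0, 3.0, 2.0]
y = convolve(p, true_x)
x_hat = deconvolve(p, y)
```

Even in this three-position toy, the deconvolved x_hat is sharper than the observed y, which is the qualitative effect the abstract describes at genome scale.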
Rajesh Narasimha, Hua Ouyang, Alexander Gray, Steven W. McLaughlin and Sriram Subramaniam. AUTOMATIC MINING OF WHOLE CELL TOMOGRAMS FOR CANCER DETECTION
Abstract: We present a machine learning tool for automatic texture-based segmentation of mitochondria in MNT-1 cells imaged using an ion-abrasion scanning electron microscope (IA-SEM). For cancer detection, a number of human melanoma whole cell tomograms (each 3D tomogram is about 2 GB) need to be analyzed. Hence, automatic tools requiring minimal user intervention need to be developed for high-throughput data mining and analysis. Challenges for such a tool in electron tomography arise from low contrast and signal-to-noise ratio (SNR), and from variation in appearance, geometry, and viewpoint. Our approach is based on block-wise classification of images into a trained list of regions. Given manually labeled images, our goal is to learn models that can localize novel instances of the regions on test datasets. In order to improve the SNR of the tomogram for automatic segmentation, we implement a 2D texture-preserving filter that incorporates a spatially varying fidelity term and thereby locally controls the denoising of image regions in proportion to their content. We investigate texture-based region features, and block-wise classification is performed by histogram matching using a nearest neighbor (NN) classifier, a kNN classifier, support vector machines, and adaptive boosting (AdaBoost). In addition, we study the computational complexity vs. segmentation accuracy tradeoff of these classifiers. Segmentation results demonstrate that our approach, using minimal training data, performs close to semi-automatic segmentation carried out with a variational level-set method and to manual segmentation by an experienced user. We then investigate quantitative measures such as the volume of cytoplasm occupied by mitochondria, the difference between the surface areas of the inner and outer membranes, and the mean mitochondrial width, quantities indicative of whether a cell is cancerous or normal.
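Block-wise classification by histogram matching with a nearest-neighbor classifier can be sketched as follows; the one-dimensional intensity blocks and region labels are toy stand-ins for the paper's texture features:

```python
def histogram(block, bins=4, lo=0, hi=256):
    """Normalized intensity histogram of an image block."""
    h = [0] * bins
    width = (hi - lo) / bins
    for v in block:
        h[min(int((v - lo) / width), bins - 1)] += 1
    n = len(block)
    return [c / n for c in h]

def l1(h1, h2):
    """L1 distance between two histograms (one simple matching criterion)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def nn_classify(block, training):
    """Assign the label of the training block whose histogram is closest."""
    h = histogram(block)
    return min(training, key=lambda ex: l1(h, histogram(ex[0])))[1]

train = [([10, 20, 30, 15], "background"),        # dark block
         ([200, 220, 240, 210], "mitochondrion")]  # bright block
label = nn_classify([190, 230, 250, 205], train)
```

Swapping `min` over L1 distances for a kNN vote, an SVM, or AdaBoost on the same histogram features gives the other classifiers compared in the abstract.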
Alexander Zien and Cheng Soon Ong. An Automated Combination of Kernels for Predicting Protein Subcellular Localization
Abstract: Protein subcellular localization is a crucial ingredient to many important inferences about cellular processes, including prediction of protein function and protein interactions. We propose a new class of protein sequence kernels which considers all motifs, including motifs with gaps. This class of kernels allows the inclusion of pairwise amino acid distances into their computation. We utilize an extension of the multiclass support vector machine (SVM) method which directly solves protein subcellular localization without resorting to the common approach of splitting the problem into several binary classification problems. To automatically search over families of possible amino acid motifs, we optimize over multiple kernels at the same time. We compare our automated approach to four other predictors on three different datasets, and show that we perform better than the current state of the art. Furthermore, our method provides some insights as to which features are most useful for determining subcellular localization, which are in agreement with biological reasoning.
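As a simplified, gap-free stand-in for the motif kernels described above, the k-mer spectrum kernel and a fixed convex combination over several k can be sketched as follows (the paper instead learns the combination weights jointly with the SVM, i.e. multiple kernel learning):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Counts of all length-k substrings of a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=2):
    """Inner product of k-mer count vectors (gap-free motif special case)."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1)

def combined_kernel(s1, s2, ks=(1, 2, 3), weights=None):
    """Fixed convex combination over motif lengths; multiple kernel learning
    would optimize these weights instead of fixing them uniformly."""
    weights = weights or [1.0 / len(ks)] * len(ks)
    return sum(w * spectrum_kernel(s1, s2, k) for w, k in zip(weights, ks))

v = spectrum_kernel("MKKL", "MKAL", 2)   # shares only the 2-mer "MK"
```

Learned combination weights then indicate which motif families drive the localization decision, which is the interpretability benefit the abstract mentions.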
Soeren Sonnenburg, Alexander Zien, Petra Philips and Gunnar Rätsch. Positional Oligomer Importance Matrices
Abstract: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences, above all of DNA and proteins. In many cases, the most accurate classifiers are obtained by training SVMs with complex sequence kernels, for instance for transcription starts or splice sites. However, an often criticized downside of SVMs with complex kernels is that it is very hard for humans to understand the learned decision rules and to derive biological insights from them. To close this gap, we introduce the concept of positional oligomer importance matrices (POIMs) and develop an efficient algorithm for their computation. We demonstrate how they overcome the limitations of sequence logos, and how they can be used to find relevant motifs for different biological phenomena in a straightforward way. Note that the concept of POIMs is not limited to interpreting SVMs, but is applicable to general k-mer based scoring systems.
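For a scoring system that is linear in positional k-mer indicators, a simplified POIM-like table can be computed by centering each positional k-mer weight at its positional mean. This uniform-background variant is only illustrative; the paper's POIMs condition on oligomer occurrence under a proper sequence model, and the weights below are hypothetical:

```python
import itertools

ALPHABET = "ACGT"

def poim(weights, k, length):
    """Positional oligomer importance table for a linear k-mer scorer:
    each (position, k-mer) weight minus the mean weight at that position
    (a simplified, uniform-background variant of the POIM definition)."""
    kmers = ["".join(p) for p in itertools.product(ALPHABET, repeat=k)]
    table = {}
    for j in range(length - k + 1):
        mean = sum(weights.get((j, z), 0.0) for z in kmers) / len(kmers)
        for z in kmers:
            table[(j, z)] = weights.get((j, z), 0.0) - mean
    return table

# Hypothetical learned weights: a strong "GT" (donor-site-like) signal at position 2.
w = {(2, "GT"): 4.0, (0, "AG"): 1.0}
imp = poim(w, k=2, length=4)
best = max(imp, key=imp.get)
```

Ranking the table by importance immediately surfaces the dominant positional motif, which is how POIMs improve on sequence logos for k > 1.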
Raf Van de Plas, Kristiaan Pelckmans, Bart De Moor and Etienne Waelkens. Spatial Querying of Imaging Mass Spectrometry Data: A Nonnegative Least Squares Approach
Abstract: This extended abstract reports on the development of an optimization-based query engine for mining spatial/biochemical data coming from imaging mass spectrometry experiments. It is shown how a high-dimensional linear query model and a non-negative least squares argument provide a practical approach for answering spatial queries. This work elaborates on the technical report (Van de Plas et al. 2007) where further biological motivation and case studies for this approach were reported.
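The non-negative least squares argument can be illustrated with a minimal projected-gradient solver; the two-column "basis pattern" example is a toy, and a production query engine would use a dedicated NNLS routine:

```python
def nnls(A, b, steps=20000, lr=0.01):
    """Minimize ||A x - b||^2 subject to x >= 0 by projected gradient descent."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        for j in range(n):
            g = 2 * sum(r[i] * A[i][j] for i in range(m))
            x[j] = max(0.0, x[j] - lr * g)  # gradient step, then project onto x >= 0
    return x

# Toy "spatial query": express a target spatial pattern b as a non-negative
# combination of two ion-image basis patterns (the columns of A).
A = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]
b = [2.0, 3.0, 1.0]
x = nnls(A, b)
```

The non-negativity constraint is what makes the fitted coefficients interpretable as contributions of individual ion images to the queried spatial pattern.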
Huzefa Rangwala, Christopher Kauffman and George Karypis. A Generalized Framework for Protein Sequence Annotation
Abstract: Over the last decade several data mining techniques have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. These protein residue annotation problems are often formulated as either classification or regression problems and solved using a common set of techniques. We develop a generalized protein sequence annotation toolkit (PROSAT) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. We have tested PROSAT on a diverse set of classification and regression problems: prediction of solvent accessibility, secondary structure, local structure alphabet, trans-membrane helices, DNA-protein interaction sites, contact order, and regions of disorder are all explored. Our methods show either comparable or superior results to several state-of-the-art application-tuned prediction methods for these problems. PROSAT provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems. The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem.
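The window-based encoding of a residue's local environment can be sketched as follows; the window size and padding symbol are arbitrary choices for illustration, not PROSAT's actual parameters:

```python
def window_features(sequence, pos, w=2, pad="-"):
    """Local window of residues around position pos, padded at the termini.
    Each window serves as the feature vector for per-residue
    classification or regression (e.g. solvent accessibility)."""
    padded = pad * w + sequence + pad * w
    return list(padded[pos:pos + 2 * w + 1])

feats = window_features("MKTAYIAK", 0, w=2)      # N-terminal window, left-padded
```

Feeding such windows into a kernel (e.g. one comparing windows position by position) is what lets a single SVM framework cover the many annotation tasks listed above.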
Nico Pfeifer, Andreas Leinenbach, Christian Huber and Oliver Kohlbacher. Statistical learning of peptide retention behavior in chromatographic separations: A new kernel-based approach for computational proteomics
Abstract: Background: High-throughput peptide and protein identification technologies have benefited tremendously from strategies based on tandem mass spectrometry (MS/MS) in combination with database searching algorithms. These identification algorithms can be complemented by retention time prediction algorithms to significantly reduce the false positive rate. Current prediction models are derived from a set of measured test analytes, but they usually require large amounts of training data. Results: We introduce a new kernel function which can be applied in combination with support vector machines to a wide range of computational proteomics problems. We show the performance of this new approach by applying it to predict chromatographic retention (both classification and regression). Furthermore, the predicted retention times are used to improve spectrum identifications by a p-value-based filtering approach. The approach was tested on a number of different datasets and shows excellent performance while requiring only very small training sets (about 40 peptides instead of thousands). Conclusions: The proposed kernel function is well-suited for the prediction of chromatographic separation in computational proteomics and requires only a limited amount of training data.
Sara Mostafavi, Debajyoti Ray, David Warde-Farley, Chris Grouios and Quaid Morris. Gene Function Prediction from Multiple Data Sources Using GO Priors
Abstract: We describe a new algorithm called GeneMANIA for combining multiple sources of genomic and proteomic data to predict protein function. Our algorithm uses linear regression to combine multiple datasets into a single composite, task-specific functional association network and uses a Gaussian field label propagation algorithm to infer protein function from this combined network. We show that our algorithm achieves high accuracy with low computation time. Furthermore, we present a method for incorporating information about protein annotation patterns into our prediction framework and demonstrate a significant gain in prediction performance.
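The label-propagation step can be illustrated with the classic Gaussian-field update, in which labeled genes are clamped and unlabeled genes converge to the weighted average of their neighbors. This is a sketch of the general technique, not GeneMANIA's exact formulation (which also handles label bias and learns the network combination weights):

```python
def label_propagation(W, labels, iters=200):
    """Gaussian-field label propagation on a weighted association network:
    labeled nodes are clamped; each unlabeled node iterates to the
    weighted mean of its neighbors' scores."""
    n = len(W)
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labels:
                continue
            total = sum(W[i][j] for j in range(n))
            if total > 0:
                f[i] = sum(W[i][j] * f[j] for j in range(n)) / total
    return f

# Toy functional association network: genes 0-1 tightly linked, genes 2-3
# tightly linked, with a weak cross-link between the two groups.
W = [[0.0, 1.0, 0.1, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.1, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
scores = label_propagation(W, {1: 1.0, 3: -1.0})  # gene 1 in function, gene 3 not
```

The unlabeled genes inherit the sign of their strongly connected labeled neighbor, which is exactly the "guilt by association" inference the composite network supports.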
Guy Yosiphon, Kimberly Gokoffski, Anne Calof, Arthur Lander and Eric Mjolsness. Stochastic Multiscale Modeling Methods for Stem Cell Niches
Abstract: The Dynamical Grammar modeling language has a foundation in operator algebra, from which heterogeneous stochastic/differential simulation algorithms can be derived using perturbation theory. The language and simulation algorithms have been implemented and are capable of expressing stem cell niche models that incorporate stochastic and deterministic process models at multiple spatial scales. Preliminary exploration of such multiscale models of the mouse olfactory epithelium indicate (1) the ability to reproduce appearances in the 2D spatial distribution of stem cells, depending on the detail of the biological hypothesis modeled, and (2) substantial differences between the model predictions and those of an approximating deterministic model that can be expressed solely in terms of differential equations. Preliminary results also indicate that Bayesian parameter inference on a simplified version of this system can be successful.
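The stochastic side of such models can be illustrated with a Gillespie simulation of a toy birth-death process for a stem cell pool. The model structure and rates are placeholder assumptions for illustration, not the olfactory epithelium model or the Dynamical Grammar machinery itself:

```python
import random

def gillespie(x0, birth, death, t_max, rng):
    """Exact stochastic simulation (Gillespie) of a birth-death process:
    a toy stand-in for the stochastic component of a stem cell niche model."""
    t, x, traj = 0.0, x0, [(0.0, x0)]
    while t < t_max and x > 0:
        rates = [birth * x, death * x]       # division vs. death/differentiation
        total = sum(rates)
        t += rng.expovariate(total)          # exponential waiting time
        if rng.random() * total < rates[0]:
            x += 1                           # a cell divides
        else:
            x -= 1                           # a cell is lost
        traj.append((t, x))
    return traj

rng = random.Random(0)                       # seeded for reproducibility
traj = gillespie(x0=50, birth=1.0, death=1.0, t_max=1.0, rng=rng)
```

Such trajectories fluctuate around the deterministic ODE solution; the abstract's point is precisely that for small stem cell pools these fluctuations change the predictions.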
Jonathan Carlson, Carl Kadie, Zabrina Brumme, Chanson Brumme, P. Richard Harrigan and David Heckerman. A phylogenetically-corrected dependency network for HIV immune-adaptation
Abstract: Viral plasticity and rapid rates of evolution have confounded HIV vaccine design. Although the specifics of a handful of escape pathways have been experimentally determined, the full extent and nature of these pathways have not been explored. Here we present the construction of a phylogenetically-corrected dependency network that relates each amino acid to (1) HLA types of the host, which mediate the immune response; and (2) other amino acids, whose codependency is apparently linked to maintaining viral fitness despite myriad sequence adaptations. The resulting dependency network is dense, suggesting a complex network of co-adaptations in response to host immune pressure. These networks offer both a mechanism for vaccine failure and the potential to block escape pathways in future vaccine design. In addition, the methods used to construct these networks are generally applicable to detecting correlations among sets of phylogenetically-related variables in a computationally scalable manner.
Edo Airoldi, Curtis Huttenhower, David Gresham, David Botstein and Olga Troyanskaya. Growth-specific programs of gene expression
Abstract: Growth is a fundamental process in cellular proliferation, and its disruption plays a role in a variety of disorders from viral infection to cancer. Cellular growth is so essential that our ability to probe and reveal the inner mechanisms of the cell crucially depends on our ability to control it. Consider, for instance, that most experiments are performed on cellular cultures growing in artificial environments. An important such example is the investigation of environmental stress response (ESR) in yeast. In this setting, each gene's transcriptional response may be considered to arise from a mixture of two alternative models: either a gene is expressed directly in response to stress, or it is expressed purely in response to the change in growth rate caused indirectly by stress. In practice, growth-related and stress-related effects are confounded in the magnitude of transcriptional responses. We need to separate these effects to gain a clear understanding of the ESR. More generally, our goal is to resolve transcriptional responses into direct effects of biological stimuli and indirect effects mediated by growth. The statistical analysis of a carefully designed experimental probe enables us to estimate (in both continuous and batch cultures) the "instantaneous growth rate" of new collections of expression data. The effective growth rate of a cellular culture is a novel biological concept, and it is useful in interpreting the system-level connections among growth rate, metabolism, environmental stress response, and the cell division cycle.
Sean O'Rourke and Eleazar Eskin. A finite state transducer approach to haplotype phasing
Abstract: Recent high-throughput genotyping technologies promise a wealth of new information, but the data they supply provide an ambiguous and incomplete view of the true genetic state. In particular, the problems of inferring haplotypes, single-strand deletions, and population-wide recombination patterns from genotype data are both challenging and well-studied. We describe a straightforward maximum likelihood model addressing all of these questions. While solving the model directly takes exponential time, we show that an equivalent weighted finite state transducer model can be solved efficiently for large samples over long genetic regions.
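The combinatorial core of phasing: a genotype with h heterozygous sites is consistent with 2^(h-1) unordered haplotype pairs. A brute-force enumeration (exactly the exponential blow-up that the transducer formulation avoids) can be sketched as:

```python
from itertools import product

def compatible_haplotypes(genotype):
    """All ordered haplotype pairs consistent with a genotype.
    Genotype coding per site: 0 = homozygous reference, 2 = homozygous
    alternate, 1 = heterozygous (allele assignment ambiguous)."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1 = [g // 2 for g in genotype]    # 0 for hom-ref sites, 1 for hom-alt
        h2 = list(h1)
        for site, c in zip(het, choice):
            h1[site], h2[site] = c, 1 - c  # split the two alleles between strands
        pairs.add((tuple(h1), tuple(h2)))
    return pairs

pairs = compatible_haplotypes([1, 0, 1])   # two heterozygous sites
```

An FST instead encodes this set implicitly, with states tracking only the local phase, so the likelihood can be computed without ever materializing the exponential enumeration.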
Hiroto Saigo, Masahiro Hattori and Koji Tsuda. Reaction graph kernels for discovering missing enzymes in the plant secondary metabolism
Abstract: Secondary metabolic pathways in plants are important sources of druggable candidate enzymes. However, there are many enzymes whose functions are still undiscovered, especially in organism-specific metabolic pathways. We propose reaction graph kernels for automatically assigning EC numbers to unknown enzymatic reactions in a metabolic network. Experiments are carried out on the KEGG/REACTION database, and our method successfully predicted the first three digits of the EC number with 83% accuracy. We also exhaustively predicted missing enzymatic functions in the plant secondary metabolism pathways, and evaluated the biochemical validity of our results.
