Tutorial on Deep Learning for Computational Biology
Tutor: Thrasyvoulos Karydis, MIT Media Lab
The field of computational biology has grown dramatically over the past few years, principally due to the advent of high-throughput experimental technologies that produce petabytes of data across biological scales. As part of this transition, traditional systems-based and theory-based approaches to understanding and engineering biology have given way to high-capacity, data-driven deep learning methods. The goal of this session is to stimulate discussion on how to build, train and interpret data-driven models for biological data, in light of current experimental work in the field. The session comprises two parts: a talk presenting current research problems in computational biology on which deep learning has had a significant impact, with a focus on the presenter’s work on protein biology; and a technical tutorial providing hands-on experience with standard deep learning tools and their application to molecular biology datasets.
High-throughput experimental technologies developed in the last decade now enable us to measure parts of biological systems at various resolutions: at the genome, epigenome, transcriptome, and proteome levels. These technologies are being used to tackle an increasingly diverse set of challenges, ranging from classical problems, such as predicting gene regulation between time points and cell phenotype, to models that explore complex mechanistic hypotheses bridging the gap between genetics and disease, as well as between protein sequence, structure, and function. Fully realizing the scientific and clinical potential of this massive amount of data requires developing novel supervised and unsupervised learning methods that are scalable, can accommodate heterogeneity, are robust to systematic noise and confounding factors, and provide mechanistic insights.
Part A: Talk “Deep representation learning for protein analysis, search and design”
The protein universe comprises a broad diversity of molecular machinery, forming the basis of the extraordinarily wide array of cellular processes found in biology, ranging from genome and cellular replication, to energy production and chemical synthesis, to adaptive immunity, to the developmental programs behind the architecture of cells, organs, the brain and the body. Key to elucidating the underlying mechanisms of these protein-mediated processes is a fundamental understanding of the structural makeup and functional organization of the full set of proteins that make up the proteomes of living organisms.
Unfortunately, the protein structure database currently contains the solved structures of only about 130,000 proteins, a very small fraction (~0.14%) of known proteins. In contrast, due to the exponential scaling of next-generation sequencing, the protein sequence database contains more than one trillion nucleotides coding for approximately 120 million unique protein sequences. Moreover, this number is expected to double roughly every 18 months. There is thus a gap of nearly three orders of magnitude between the number of proteins for which we have a structure and the number for which we have a sequence, and this shortfall is expected to widen exponentially over time.
To bridge the gap between known sequences and known structures, we developed a deep learning framework for proteomics and protein design, which we call CoMET (Convolutional Motif Embedding Tool).
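The size of this gap can be checked with a few lines of arithmetic. The sketch below uses only the round figures quoted above (about 130,000 structures, about 120 million sequences, an 18-month doubling time). Holding the number of solved structures fixed in the projection is a simplification made here for illustration, not a claim from the text, and the function names are hypothetical.

```python
import math

SOLVED_STRUCTURES = 130_000        # approximate count quoted in the text
KNOWN_SEQUENCES = 120_000_000      # approximate count quoted in the text
DOUBLING_TIME_MONTHS = 18          # sequence doubling time quoted in the text


def gap_orders_of_magnitude(sequences: float, structures: float) -> float:
    """Return log10 of the sequence-to-structure ratio."""
    return math.log10(sequences / structures)


def projected_sequences(months: float) -> float:
    """Project the sequence count forward, assuming steady doubling."""
    return KNOWN_SEQUENCES * 2 ** (months / DOUBLING_TIME_MONTHS)


if __name__ == "__main__":
    gap_now = gap_orders_of_magnitude(KNOWN_SEQUENCES, SOLVED_STRUCTURES)
    print(f"gap today: {gap_now:.2f} orders of magnitude")
    # Simplifying assumption (ours): the structure count stays constant.
    for years in (3, 6):
        seqs = projected_sequences(12 * years)
        gap = gap_orders_of_magnitude(seqs, SOLVED_STRUCTURES)
        print(f"after {years} years: {seqs:.2e} sequences, gap {gap:.2f}")
```

With these inputs the present-day ratio is roughly 920 to 1, which is where the "nearly three orders of magnitude" comes from.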
CoMET employs deep, interpretable architectures to examine protein sequences across species and distill patterns of similar amino acid composition that have distinct functions within the protein. We train models that extract motifs from large-scale protein sequence datasets, without requiring any prior knowledge about the nature of the motifs or their distribution. In prototype implementations, trained on up to 20 million protein sequences, we have demonstrated that the learned motif-embedding representation can be used to efficiently recapitulate known inter- and intrafamily relationships, as well as to identify previously unknown functional protein clusters.
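To make the idea of a convolutional motif detector concrete, here is a minimal sketch. It is not CoMET's actual architecture, which this text does not specify. It one-hot encodes a protein sequence over the 20 standard amino acids and slides a single filter along it, which is the computation performed by one channel of a 1-D convolutional layer. The names `one_hot`, `motif_filter` and `scan`, the toy sequence, and the motif `"GKS"` are all hypothetical; in a trained model the filter weights would be learned from data rather than written by hand.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def motif_filter(motif):
    """Build a hand-written filter that responds maximally to `motif`.

    Hypothetical stand-in for learned weights: one weight row per motif
    position, with a 1 at the expected amino acid.
    """
    return one_hot(motif)


def scan(sequence, weights):
    """Slide the filter along the sequence (valid 1-D cross-correlation).

    Returns one activation per window; each activation is the sum of
    element-wise products between the filter and the encoded window.
    """
    encoded = one_hot(sequence)
    width = len(weights)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            pairs = zip(weights[offset], encoded[start + offset])
            score += sum(w * x for w, x in pairs)
        activations.append(score)
    return activations


if __name__ == "__main__":
    toy_sequence = "MAGKSTLLA"      # made-up sequence for illustration
    weights = motif_filter("GKS")    # made-up motif for illustration
    activations = scan(toy_sequence, weights)
    best = max(range(len(activations)), key=activations.__getitem__)
    print(activations)
    print("best match starts at position", best)
```

Taking the maximum over the activations gives a position-independent score for whether the motif occurs anywhere in the sequence; a vector of such scores, one per filter, is one simple way to think about a motif embedding.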
Part B: “Hands-on Deep Learning Tutorial”
This tutorial is a two-hour, hands-on coding session on deep learning, with a focus on applications in molecular biology. We will begin with a short introduction to the key concepts of applied deep learning and continue with the (minimal) installation of the files needed for the tutorial. Next, we will work through two interactive demos on how to use Jupyter notebooks and Keras to rapidly prototype and evaluate deep learning architectures, as well as to build a translator from English to Greek. Finally, we will end with two applications of deep learning to problems in molecular biology.
No prior knowledge is required to follow this tutorial. Ideally, the audience will be a mix of computer scientists and biologists who can take part in the discussions and come up with interdisciplinary projects.
- Laptops are required to actively participate in the tutorial. Listeners are allowed, but priority will be given to active participants.
- For those with no computer science background, we will provide an online account on a pre-installed workshop environment.
- Technical Introduction to Deep Learning (20 mins)
- Installation and Setup (10 mins)
- Coding session (90 mins)
  - Introduction to Keras
  - Interactive Demo 1: Digits Classification
  - Interactive Demo 2: Language translation
  - Problem 1: Molecular properties prediction
  - Problem 2: Protein properties prediction
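As a flavour of the kind of model that Interactive Demo 1 in the agenda above might build, here is a minimal Keras sketch of a digit classifier. It is not the tutorial's actual notebook. The use of the MNIST dataset, the layer sizes, and the helper names `build_digit_classifier` and `train_on_mnist` are assumptions made for illustration, and the code assumes TensorFlow with its bundled Keras is installed.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_digit_classifier():
    """Small convolutional network for 28x28 grayscale digit images."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(16, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),  # one output per digit class
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # labels are integers 0-9
        metrics=["accuracy"],
    )
    return model


def train_on_mnist(model, epochs=1):
    """Train and evaluate on MNIST (assumed dataset; downloaded on first use)."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    # Scale pixels to [0, 1] and add the trailing channel axis.
    x_train = x_train[..., None].astype("float32") / 255.0
    x_test = x_test[..., None].astype("float32") / 255.0
    model.fit(x_train, y_train, epochs=epochs, batch_size=128,
              validation_split=0.1)
    return model.evaluate(x_test, y_test, verbose=0)  # [loss, accuracy]


if __name__ == "__main__":
    model = build_digit_classifier()
    model.summary()
    print(train_on_mnist(model))
```

In a Jupyter notebook, the two functions would typically live in separate cells so that the architecture can be edited and rebuilt without reloading the data.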
Participants are encouraged to bring their laptops in order to participate in the coding session.