The process of transcription is controlled by systems of transcription factors, which bind to specific patterns of binding sites in the transcriptional control regions of genes, called cis-regulatory modules (CRMs). We present an expressive and easily comprehensible CRM representation which is capable of capturing several aspects of a CRM's structure and distinguishing between DNA sequences which do or do not contain it. We also present a learning algorithm tailored for this domain, and a novel method to avoid overfitting by controlling the expressivity of the model.
We are able to find statistically significant CRMs more often then a current state-of-the-art approach on the same data sets. We also show experimentally that each aspect of our expressive CRM model space makes a positive contribution to the learned models on yeast and fly data.
Structural aspects are an important part of CRMs, both in terms of interpreting them biologically and learning them accurately. Source code for our algorithm is available at: http://www.cs.wisc.edu/~noto/crm webcite