Design and Analysis of Bar-seq Experiments
INVESTIGATION
David G. Robinson,* Wei Chen,† John D. Storey,*,1 and David Gresham‡,1
*Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, †Berlin Institute for
Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine, 13125 Berlin, Germany, and ‡Center for
Genomics and Systems Biology, Department of Biology, New York University, New York, New York 10003
ABSTRACT High-throughput quantitative DNA sequencing enables the parallel phenotyping of pools of
thousands of mutants. However, the appropriate analytical methods and experimental design that maximize
the efficiency of these methods while maintaining statistical power are currently unknown. Here, we have
used Bar-seq analysis of the Saccharomyces cerevisiae yeast deletion library to systematically test the effect
of experimental design parameters and sequence read depth on experimental results. We present computational methods that efficiently and accurately estimate effect sizes and their statistical significance by
adapting existing methods for RNA-seq analysis. Using simulated variation of experimental designs, we
found that biological replicates are critical for statistical analysis of Bar-seq data, whereas technical replicates are of less value. By subsampling sequence reads, we found that when using four-fold biological
replication, 6 million reads per condition achieved 96% power to detect a two-fold change (or more) at a 5%
false discovery rate. Our guidelines for experimental design and computational analysis enable the study
of the yeast deletion collection in up to 30 different conditions in a single sequencing lane. These findings
are relevant to a variety of pooled genetic screening methods that use high-throughput quantitative DNA
sequencing, including Tn-seq.
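The read budget implied by these figures can be worked out with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the 6 million reads per condition are a total shared evenly among the four biological replicates, and the function name `lane_budget` is hypothetical.

```python
# Figures taken from the abstract; the even split across replicates is an assumption.
READS_PER_CONDITION = 6_000_000
MAX_CONDITIONS_PER_LANE = 30
BIOLOGICAL_REPLICATES = 4


def lane_budget(reads_per_condition, n_conditions, n_replicates):
    """Return (reads needed for the whole lane, reads per replicate library)."""
    total_reads = reads_per_condition * n_conditions
    reads_per_replicate = reads_per_condition // n_replicates
    return total_reads, reads_per_replicate


total, per_rep = lane_budget(
    READS_PER_CONDITION, MAX_CONDITIONS_PER_LANE, BIOLOGICAL_REPLICATES
)
print(f"reads per lane: {total:,}; reads per replicate: {per_rep:,}")
```

Under these assumptions, a lane would need to yield roughly 180 million usable reads to cover 30 conditions at the stated depth.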
Uncovering the connection between genotype and phenotype remains
one of the central challenges of modern genetics. At the same time, the
rate at which new genomes are sequenced currently outpaces our
capacity to functionally annotate those genomes. Addressing these
challenges requires efficient means of quantifying phenotypes associated with defined genetic perturbations. Methods for uniquely identifying and quantifying phenotypic effects of mutant alleles in complex
mixtures enable the parallel analysis of hundreds to thousands of
genotypes. Pooled mutant analysis entails the use of either libraries
of defined mutants tagged with unique DNA sequences (molecular
barcodes) (Winzeler et al. 1999; Giaever et al. 2002) or complex
libraries of tens of thousands of unique mutants generated by random
insertional mutagenesis. Analogously, comprehensive libraries of short
hairpin RNAs (shRNAs) enable parallel analysis of perturbations of
mammalian genes in cell culture (Schlabach et al. 2008; Silva et al.
2008; Sims et al. 2011).
Copyright © 2014 Robinson et al.
doi: 10.1534/g3.113.008565
Manuscript received September 16, 2013; accepted for publication October 20,
2013; published Early Online November 5, 2013.
This is an open-access article distributed under the terms of the Creative
Commons Attribution Unported License (http://creativecommons.org/licenses/
by/3.0/), which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Supporting information is available online at http://www.g3journal.org/lookup/
suppl/doi:10.1534/g3.113.008565/-/DC1
1
Corresponding authors: Carl Icahn Labs, Princeton University, Princeton, NJ 08544.
E-mail: ; 12 Waverly Place, Room 203, New York University,
New York, NY 10003. E-mail:
KEYWORDS
yeast
Bar-seq
galactose
functional genomics
Saccharomyces cerevisiae
Recently, methods for estimating mutant abundances in complex
mixtures have been introduced that capitalize on advances in high-throughput quantitative DNA sequencing. Barcode analysis by
sequencing (Bar-seq) was first developed to analyze libraries of
thousands of Saccharomyces cerevisiae gene deletion mutants (Smith
et al. 2009) and has subsequently been used to analyze a library of
deletion mutants in Schizosaccharomyces pombe (Han et al. 2010).
The use of Bar-seq enables efficient, accurate, and comprehensive
genetic screens for addressing a variety of questions, such as defining
the genetic requirements for initiation and maintenance of cell quiescence in response to distinct starvation signals (Gresham et al. 2011).
In organisms for which barcoded mutant libraries are not available,
high-throughput DNA sequencing of pools of transposon insertion
mutants (Tn-seq) enables multiplexed mutant analysis. Tn-seq was
initially applied in studies of Streptococcus pneumoniae (van Opijnen
et al. 2009) and Haemophilus influenzae (Gawronski et al. 2009) and
has subsequently been adapted for use in diverse organisms (Brutinel
and Gralnick 2012; Gallagher et al. 2011). Similarly, PhiTSeq facilitates
simultaneous analysis of thousands of transposon-mutagenized haploid
human cells (Carette et al. 2011). The widespread adoption of pooled
mutant screens using high-throughput quantitative DNA sequencing
attests to the power of these methods for efficient genetic analysis.
Volume 4 | January 2014 | 11
In contrast to the rapid technological advances in pooled mutant
analysis, there has not yet been a statistical treatment of the experimental
design and analysis of data generated by high-throughput DNA
sequence analysis of these complex libraries. Thus, major methodological
and analytical questions remain unanswered. What is the appropriate
statistical framework for analyzing DNA sequence count data? What are
the sources of variation? What is the appropriate study design for
maximizing the power and accuracy to detect differences in mutant
abundances? What sequence read depth maximizes the precision of
these methods while minimizing the cost and resources required?
We undertook a study that aimed to address these questions with
the goal of providing guidance for the design and analysis of pooled
mutant screens using high-throughput DNA sequencing. Using
experimental analysis of the S. cerevisiae gene deletion collection in
two different conditions, we studied the contribution of treatment and
biological and technical variation to Bar-seq data (Figure 1). We
demonstrated that the negative binomial models used to analyze
RNA-seq data are also directly applicable to Bar-seq data. Using computational subsampling of our experimental data, we studied the effect
of different experimental designs on the results from Bar-seq analysis.
We found that biological replicates substantially improved statistical
power, whereas technical replicates provided only moderate additional
statistical power. We also found that increasing sequencing depth
beyond 6 million reads per condition provided limited improvement
in the experimental results, regardless of experimental design.
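Two ideas from this paragraph can be illustrated with a minimal sketch. The first is the mean-variance relationship of the negative binomial model commonly used for RNA-seq counts, in which the variance is the mean plus a dispersion term times the squared mean. The second is computational subsampling of reads to a lower depth. The sketch is not the authors' implementation: the function names, the toy strain labels and counts, and the choice to draw reads without replacement are all assumptions made for illustration.

```python
import random
from collections import Counter


def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean**2.

    A dispersion of 0 reduces to the Poisson case (variance == mean).
    """
    return mean + dispersion * mean ** 2


def subsample_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from a barcode count table."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("cannot subsample more reads than were sequenced")
    # Expand the table into one entry per read (fine for toy data only;
    # real libraries with millions of reads need a count-based sampler).
    reads = [barcode for barcode, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    sampled = Counter(rng.sample(reads, depth))
    # Keep barcodes that dropped to zero so tables stay aligned across depths.
    return {barcode: sampled.get(barcode, 0) for barcode in counts}


# Toy counts for three hypothetical deletion strains (not real data).
full = {"strainA": 600, "strainB": 300, "strainC": 100}
half = subsample_counts(full, depth=500)
print(half, sum(half.values()))
print(nb_variance(100, 0.0), nb_variance(100, 0.1))
```

Repeating the subsampling at several depths and rerunning the same statistical test at each one is one way to see where additional reads stop changing the results.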
Our results provide information directly relevant to designing
future hig (...truncated)