Detecting statistically significant common insertion sites in retroviral insertional mutagenesis screens

J de Ridder; A Uren; J Kool; MJT Reinders; LFA Wessels

doi:http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166

Detecting statistically significant common insertion sites in retroviral insertional mutagenesis screens

J de Ridder, A Uren, J Kool, MJT Reinders, LFA Wessels

Multimedia Computing

Research output: Contribution to journal › Article › Scientific › peer-review

105 Citations (Scopus)

Abstract

Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow substantially due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, as well as the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights in the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions, the Gaussian, triangular, and rectangular kernel function. We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.

Original language	Undefined/Unknown
Pages (from-to)	0001-0013
Number of pages	13
Journal	PLoS Computational Biology (Print)
Volume	2
Issue number	12
DOIs	https://doi.org/http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166
Publication status	Published - 2006

Keywords

Wiskunde en Informatica
Techniek
technische Wiskunde en Informatica
academic journal papers
Vakpubl., Overig wet. > 3 pag

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

Access to Document

http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166

Cite this

@article{dc026037543e44018231f5d9aa43fa95,

title = "Detecting statistically significant common insertion sites in retroviral insertional mutagenesis screens",

abstract = "Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow substantially due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, as well as the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights in the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions, the Gaussian, triangular, and rectangular kernel function. We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.",

keywords = "Wiskunde en Informatica, Techniek, technische Wiskunde en Informatica, academic journal papers, Vakpubl., Overig wet. > 3 pag",

author = "{de Ridder}, J and A Uren and J Kool and MJT Reinders and LFA Wessels",

year = "2006",

doi = "http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166",

language = "Undefined/Unknown",

volume = "2",

pages = "0001--0013",

journal = "PLoS Computational Biology (Print)",

issn = "1553-734X",

publisher = "Public Library of Science (PLOS)",

number = "12",

}

TY - JOUR

T1 - Detecting statistically significant common insertion sites in retroviral insertional mutagenesis screens

AU - de Ridder, J

AU - Uren, A

AU - Kool, J

AU - Reinders, MJT

AU - Wessels, LFA

PY - 2006

Y1 - 2006

N2 - Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow substantially due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, as well as the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights in the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions, the Gaussian, triangular, and rectangular kernel function. We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.

AB - Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow substantially due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, as well as the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights in the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions, the Gaussian, triangular, and rectangular kernel function. We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.

KW - Wiskunde en Informatica

KW - Techniek

KW - technische Wiskunde en Informatica

KW - academic journal papers

KW - Vakpubl., Overig wet. > 3 pag

U2 - http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166

DO - http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0020166

M3 - Article

SN - 1553-734X

VL - 2

SP - 1

EP - 13

JO - PLoS Computational Biology (Print)

JF - PLoS Computational Biology (Print)

IS - 12

ER -