Improving Estimates of Genome CNAs by Developing Probabilistic Masks for Microarray Data
Copy Number Alterations (CNAs), which are gains or losses in the number of copies of sections of Deoxyribonucleic Acid (DNA), are hallmarks of cancer. CNAs are now routinely measured by several techniques for diagnostic and prognostic purposes. Array-Comparative Genomic Hybridization (aCGH), Array-Single Nucleotide Polymorphism (aSNP), and Next Generation Sequencing (NGS) are examples of technologies that enable cost-efficient, high-resolution detection of CNAs. Intensive noise, together with the technical and biological biases inherent to modern CNA probing technologies, often causes inconsistency between the estimates provided by different methods. Efficient and accurate detection of breakpoint positions in heterogeneous cancer samples measured under such conditions is a challenging practical and methodological problem. Although accurate CNA estimates are necessary, little information is available about the estimation errors. Based on studies of the confidence limits for noisy stepwise signals, an efficient algorithm has been developed for computing upper and lower confidence boundary masks that, with a specified probability, guarantee the existence of genomic changes within certain regions. Combined with the estimates, this tool can give medical experts more information about the true CNA structure. The probabilistic confidence masks are initially designed on the basis of the skew Laplace distribution, which represents jitter in the CNA breakpoints. Experimental measurements show that the Laplace distribution is accurate when the segmental Signal-to-Noise Ratio (SNR) exceeds unity. In this work, the experimental jitter distribution is simulated over different ranges in order to find approximations to the actual distributions with minimal error. Following this procedure, three techniques are described for approximating the experimental jitter distribution: a heuristic approximation, a parametrization of the skew Laplace distribution, and the asymmetric exponential power distribution.
The confidence masks algorithm is designed and modified for each approximation. It is also tested on array data: High-Resolution Comparative Genomic Hybridization and Single Nucleotide Polymorphism data. Additionally, the confidence masks based on the exponential power distribution are tuned to the medical experts' annotations of a training set of breakpoints obtained by the standard circular binary segmentation algorithm. A comparison of the modified confidence masks with the experts' annotations of neuroblastoma CNA profiles demonstrates the efficiency of the designed masks in improving the CNA estimates.
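
To make the idea of breakpoint-jitter confidence masks more concrete, the following is a minimal illustrative sketch, not the algorithm developed in this work. It assumes a discrete skew Laplace jitter model with two hypothetical decay parameters, `q_left` and `q_right` (in the actual method these would depend on the segmental SNRs), and it accounts only for uncertainty in the breakpoint location, not in the segment levels. All function and parameter names are assumptions introduced for illustration.

```python
import numpy as np


def skew_laplace_pmf(k, q_left, q_right):
    """Discrete skew Laplace pmf over integer jitter k (assumed form).

    Mass decays as q_left**|k| to the left of the breakpoint (k < 0)
    and as q_right**k to the right (k >= 0), with 0 < q < 1.
    """
    k = np.asarray(k)
    norm = (1.0 - q_left) * (1.0 - q_right) / (1.0 - q_left * q_right)
    return norm * np.where(k < 0, q_left ** np.abs(k), q_right ** np.abs(k))


def jitter_bounds(q_left, q_right, prob, k_max=1000):
    """Return (a, b) with P(a <= K <= b) >= prob, using equal tails."""
    tail = (1.0 - prob) / 2.0
    ks = np.arange(-k_max, k_max + 1)
    cdf = np.cumsum(skew_laplace_pmf(ks, q_left, q_right))
    a = ks[np.searchsorted(cdf, tail, side="right")]       # P(K < a) <= tail
    b = ks[np.searchsorted(cdf, 1.0 - tail, side="left")]  # P(K > b) <= tail
    return int(a), int(b)


def confidence_masks(levels, breakpoints, bounds, n):
    """Build lower/upper masks around a piecewise-constant estimate.

    levels:      segment levels, one more than the number of breakpoints
    breakpoints: sorted indices at which a new segment starts
    bounds:      (a, b) jitter bounds for each breakpoint
    n:           number of probes
    """
    estimate = np.empty(n)
    edges = [0, *breakpoints, n]
    for level, start, stop in zip(levels, edges[:-1], edges[1:]):
        estimate[start:stop] = level

    lower, upper = estimate.copy(), estimate.copy()
    for i, (bp, (a, b)) in enumerate(zip(breakpoints, bounds)):
        # Probes in [bp + a, bp + b) may belong to either adjacent segment.
        lo_idx, hi_idx = max(bp + a, 0), min(bp + b, n)
        hi_level = max(levels[i], levels[i + 1])
        lo_level = min(levels[i], levels[i + 1])
        upper[lo_idx:hi_idx] = np.maximum(upper[lo_idx:hi_idx], hi_level)
        lower[lo_idx:hi_idx] = np.minimum(lower[lo_idx:hi_idx], lo_level)
    return estimate, lower, upper


if __name__ == "__main__":
    a, b = jitter_bounds(q_left=0.5, q_right=0.5, prob=0.9)
    est, lo, up = confidence_masks([0.0, 1.0], [10], [(a, b)], n=20)
    print("jitter bounds:", (a, b))
    print("upper mask:", up)
    print("lower mask:", lo)
```

In this sketch the region between the lower and upper masks widens around each breakpoint in proportion to the jitter spread, so a less sharply located breakpoint (decay parameters closer to one) produces a wider uncertain zone. Replacing `skew_laplace_pmf` with another jitter model, such as an asymmetric exponential power distribution, would leave the rest of the procedure unchanged.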
