Organization of Variation Based Personal Genetic Data with Relational Database

Onur Çakırgöz; Süleyman Sevinç

Research Article

Year 2018, Volume: 11 Issue: 3, 295 - 307, 31.07.2018

Abstract

Relational databases are currently used effectively in many hospitals and clinics to store patient records and assay results. With the rapid development of sequencing technologies, sequencing costs have declined considerably. In addition, the number of personalized medicine practices is growing steadily, and the volume of personal genetic data that must be stored and queried is growing with it. Although relational databases are well suited to storing patient records and assay results, additional designs and solutions are needed to store personal genetic data efficiently. In this study, a novel solution is proposed for integrating variation-based personal genetic data into a relational database. Within the scope of this solution, formats were developed for both non-structural and structural variation types, and compression algorithms were applied. The proposed method was tested with real data from 2504 individuals published by the 1000 Genomes Project. The analyses showed that the proposed method requires much less space than storing the raw sequence data.

Keywords

relational database, data format, genetic data, variation

Details

Primary Language	English
Subjects	Computer Software
Journal Section	Articles
Authors	Onur Çakırgöz (ORCID: 0000-0002-9347-1105), Süleyman Sevinç
Publication Date	July 31, 2018
Submission Date	April 20, 2018
Published in Issue	Year 2018 Volume: 11 Issue: 3

Cite

APA	Çakırgöz, O., & Sevinç, S. (2018). Organization of Variation Based Personal Genetic Data with Relational Database. Bilişim Teknolojileri Dergisi, 11(3), 295-307.
