
A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

Technological advances have led to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TA...

Bibliographic Details
Published in: PeerJ. Computer science 2020-11, Vol.6, p.e307-e307, Article e307
Main Authors: Rozenwald, Michal B, Galitsyna, Aleksandra A, Sapunov, Grigory V, Khrameeva, Ekaterina E, Gelfand, Mikhail S
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773
cites cdi_FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773
container_end_page e307
container_issue
container_start_page e307
container_title PeerJ. Computer science
container_volume 6
creator Rozenwald, Michal B
Galitsyna, Aleksandra A
Sapunov, Grigory V
Khrameeva, Ekaterina E
Gelfand, Mikhail S
description Technological advances have led to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. The distributions of the protein Chriz (Chromator) and the histone modification H3K4me3 were selected as the most informative features for the prediction of TAD characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.
doi_str_mv 10.7717/PEERJ-CS.307
format article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7924456</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A643261298</galeid><sourcerecordid>A643261298</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773</originalsourceid><addsrcrecordid>eNptkk1v1DAQhiMEolXpjTOyxAWkZkn8EccXpNWyQFElUAtny3HGiZfETu0Eyr_Hq5bSRdgHj2aeecceT5Y9L4sV5yV_82W7vfyUb65WpOCPsmNMeJUzIfDjB_ZRdhrjriiKkpVpiafZESF1WQlWH2fDGo1K99YBGkAFZ12HTFAj_PThOzI-oLkHNAVorZ6td8gbpPvgRzVbl-JDu89I5rvgo596Oyi0xL0PJtuBg9lqZEDNS4D4LHti1BDh9O48yb69337dfMwvPn8436wvck1xzXONORQVLQBD06RbU1VSahjBVKkUUTUIDgIUaVpRGNNqoRhuNG9qzKDinJxkb291p6UZodXg5qAGOQU7qvBLemXlYcTZXnb-h-QCU8qqJPDqTiD46wXiLEcbNQyDcuCXKDEr6lpUNREJffkPuvNLcOl5EtOKMV7WlP2lOjWAtM74VFfvReW6ogRXJRZ1olb_odJuYbTaOzA2-Q8SXh8kJGaGm7lTS4zy_OrykD27ZXX6qRjA3PejLOR-mOQEEHZSR5mGKeEvHvbwHv4zOuQ31vjFwQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2465571845</pqid></control><display><type>article</type><title>A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features</title><source>PubMed Central Free</source><source>ProQuest - Publicly Available Content Database</source><creator>Rozenwald, Michal B ; Galitsyna, Aleksandra A ; Sapunov, Grigory V ; Khrameeva, Ekaterina E ; Gelfand, Mikhail S</creator><creatorcontrib>Rozenwald, Michal B ; Galitsyna, Aleksandra A ; Sapunov, Grigory V ; Khrameeva, Ekaterina E ; Gelfand, Mikhail S</creatorcontrib><description>Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). 
TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.</description><identifier>ISSN: 2376-5992</identifier><identifier>EISSN: 2376-5992</identifier><identifier>DOI: 10.7717/PEERJ-CS.307</identifier><identifier>PMID: 33816958</identifier><language>eng</language><publisher>United States: PeerJ. Ltd</publisher><subject>Analysis ; Artificial neural networks ; Binding sites ; Bioinformatics ; Biotechnology ; Chromatin ; Chromosomes ; Computational Biology ; Data Mining and Machine Learning ; Data Science ; Datasets ; Deoxyribonucleic acid ; DNA ; DNA binding proteins ; Domains ; Drosophila ; Dynamic programming ; Epigenetic inheritance ; Epigenetics ; Folding ; Fruit flies ; Gene expression ; Generalized linear models ; Genes ; Genomes ; Innovations ; Insects ; Machine learning ; Mammals ; Molecular Biology ; Neural networks ; Protein binding ; Proteins ; Recurrent neural networks ; Regression models ; Regularization ; RNA polymerase</subject><ispartof>PeerJ. 
Computer science, 2020-11, Vol.6, p.e307-e307, Article e307</ispartof><rights>2020 Rozenwald et al.</rights><rights>COPYRIGHT 2020 PeerJ. Ltd.</rights><rights>2020 Rozenwald et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2020 Rozenwald et al. 2020 Rozenwald et al.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773</citedby><cites>FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2465571845/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2465571845?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,25753,27924,27925,37012,37013,44590,53791,53793,75126</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33816958$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Rozenwald, Michal B</creatorcontrib><creatorcontrib>Galitsyna, Aleksandra A</creatorcontrib><creatorcontrib>Sapunov, Grigory 
V</creatorcontrib><creatorcontrib>Khrameeva, Ekaterina E</creatorcontrib><creatorcontrib>Gelfand, Mikhail S</creatorcontrib><title>A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features</title><title>PeerJ. Computer science</title><addtitle>PeerJ Comput Sci</addtitle><description>Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. 
The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.</description><subject>Analysis</subject><subject>Artificial neural networks</subject><subject>Binding sites</subject><subject>Bioinformatics</subject><subject>Biotechnology</subject><subject>Chromatin</subject><subject>Chromosomes</subject><subject>Computational Biology</subject><subject>Data Mining and Machine Learning</subject><subject>Data Science</subject><subject>Datasets</subject><subject>Deoxyribonucleic acid</subject><subject>DNA</subject><subject>DNA binding proteins</subject><subject>Domains</subject><subject>Drosophila</subject><subject>Dynamic programming</subject><subject>Epigenetic inheritance</subject><subject>Epigenetics</subject><subject>Folding</subject><subject>Fruit flies</subject><subject>Gene expression</subject><subject>Generalized linear models</subject><subject>Genes</subject><subject>Genomes</subject><subject>Innovations</subject><subject>Insects</subject><subject>Machine learning</subject><subject>Mammals</subject><subject>Molecular Biology</subject><subject>Neural networks</subject><subject>Protein binding</subject><subject>Proteins</subject><subject>Recurrent neural networks</subject><subject>Regression models</subject><subject>Regularization</subject><subject>RNA 
polymerase</subject><issn>2376-5992</issn><issn>2376-5992</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNptkk1v1DAQhiMEolXpjTOyxAWkZkn8EccXpNWyQFElUAtny3HGiZfETu0Eyr_Hq5bSRdgHj2aeecceT5Y9L4sV5yV_82W7vfyUb65WpOCPsmNMeJUzIfDjB_ZRdhrjriiKkpVpiafZESF1WQlWH2fDGo1K99YBGkAFZ12HTFAj_PThOzI-oLkHNAVorZ6td8gbpPvgRzVbl-JDu89I5rvgo596Oyi0xL0PJtuBg9lqZEDNS4D4LHti1BDh9O48yb69337dfMwvPn8436wvck1xzXONORQVLQBD06RbU1VSahjBVKkUUTUIDgIUaVpRGNNqoRhuNG9qzKDinJxkb291p6UZodXg5qAGOQU7qvBLemXlYcTZXnb-h-QCU8qqJPDqTiD46wXiLEcbNQyDcuCXKDEr6lpUNREJffkPuvNLcOl5EtOKMV7WlP2lOjWAtM74VFfvReW6ogRXJRZ1olb_odJuYbTaOzA2-Q8SXh8kJGaGm7lTS4zy_OrykD27ZXX6qRjA3PejLOR-mOQEEHZSR5mGKeEvHvbwHv4zOuQ31vjFwQ</recordid><startdate>20201130</startdate><enddate>20201130</enddate><creator>Rozenwald, Michal B</creator><creator>Galitsyna, Aleksandra A</creator><creator>Sapunov, Grigory V</creator><creator>Khrameeva, Ekaterina E</creator><creator>Gelfand, Mikhail S</creator><general>PeerJ. 
Ltd</general><general>PeerJ, Inc</general><general>PeerJ Inc</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>3V.</scope><scope>7XB</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>M0N</scope><scope>P5Z</scope><scope>P62</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20201130</creationdate><title>A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features</title><author>Rozenwald, Michal B ; Galitsyna, Aleksandra A ; Sapunov, Grigory V ; Khrameeva, Ekaterina E ; Gelfand, Mikhail S</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Analysis</topic><topic>Artificial neural networks</topic><topic>Binding sites</topic><topic>Bioinformatics</topic><topic>Biotechnology</topic><topic>Chromatin</topic><topic>Chromosomes</topic><topic>Computational Biology</topic><topic>Data Mining and Machine Learning</topic><topic>Data Science</topic><topic>Datasets</topic><topic>Deoxyribonucleic acid</topic><topic>DNA</topic><topic>DNA binding proteins</topic><topic>Domains</topic><topic>Drosophila</topic><topic>Dynamic programming</topic><topic>Epigenetic inheritance</topic><topic>Epigenetics</topic><topic>Folding</topic><topic>Fruit flies</topic><topic>Gene expression</topic><topic>Generalized linear 
models</topic><topic>Genes</topic><topic>Genomes</topic><topic>Innovations</topic><topic>Insects</topic><topic>Machine learning</topic><topic>Mammals</topic><topic>Molecular Biology</topic><topic>Neural networks</topic><topic>Protein binding</topic><topic>Proteins</topic><topic>Recurrent neural networks</topic><topic>Regression models</topic><topic>Regularization</topic><topic>RNA polymerase</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Rozenwald, Michal B</creatorcontrib><creatorcontrib>Galitsyna, Aleksandra A</creatorcontrib><creatorcontrib>Sapunov, Grigory V</creatorcontrib><creatorcontrib>Khrameeva, Ekaterina E</creatorcontrib><creatorcontrib>Gelfand, Mikhail S</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Computing Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest 
Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest - Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>PeerJ. Computer science</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Rozenwald, Michal B</au><au>Galitsyna, Aleksandra A</au><au>Sapunov, Grigory V</au><au>Khrameeva, Ekaterina E</au><au>Gelfand, Mikhail S</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features</atitle><jtitle>PeerJ. Computer science</jtitle><addtitle>PeerJ Comput Sci</addtitle><date>2020-11-30</date><risdate>2020</risdate><volume>6</volume><spage>e307</spage><epage>e307</epage><pages>e307-e307</pages><artnum>e307</artnum><issn>2376-5992</issn><eissn>2376-5992</eissn><abstract>Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in based on chromatin marks across three cell lines. 
We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.</abstract><cop>United States</cop><pub>PeerJ. Ltd</pub><pmid>33816958</pmid><doi>10.7717/PEERJ-CS.307</doi><tpages>e307</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2376-5992
ispartof PeerJ. Computer science, 2020-11, Vol.6, p.e307-e307, Article e307
issn 2376-5992
2376-5992
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7924456
source PubMed Central Free; ProQuest - Publicly Available Content Database
subjects Analysis
Artificial neural networks
Binding sites
Bioinformatics
Biotechnology
Chromatin
Chromosomes
Computational Biology
Data Mining and Machine Learning
Data Science
Datasets
Deoxyribonucleic acid
DNA
DNA binding proteins
Domains
Drosophila
Dynamic programming
Epigenetic inheritance
Epigenetics
Folding
Fruit flies
Gene expression
Generalized linear models
Genes
Genomes
Innovations
Insects
Machine learning
Mammals
Molecular Biology
Neural networks
Protein binding
Proteins
Recurrent neural networks
Regression models
Regularization
RNA polymerase
title A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T13%3A44%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20machine%20learning%20framework%20for%20the%20prediction%20of%20chromatin%20folding%20in%20Drosophila%20using%20epigenetic%20features&rft.jtitle=PeerJ.%20Computer%20science&rft.au=Rozenwald,%20Michal%20B&rft.date=2020-11-30&rft.volume=6&rft.spage=e307&rft.epage=e307&rft.pages=e307-e307&rft.artnum=e307&rft.issn=2376-5992&rft.eissn=2376-5992&rft_id=info:doi/10.7717/PEERJ-CS.307&rft_dat=%3Cgale_pubme%3EA643261298%3C/gale_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4287-c27e0640e2ebb0154a144f5324aa7e0a8e97e9ea3bd90ffdc9a52bc7b825e6773%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2465571845&rft_id=info:pmid/33816958&rft_galeid=A643261298&rfr_iscdi=true