Repeating sequences of genomes

How can we estimate what portion of a genome is repeating sequences without sequencing it?

Thank you

Yes, it is possible to estimate the amount of repetitive DNA in a genome without sequencing. This was an important technique in the 1960s, referred to as C₀t analysis and pioneered by Roy Britten.

The approach taken is to study the kinetics of renaturation of heat-denatured genomic DNA (in a fragmented form). In the idealised figure in the Wikipedia article on the technique, some components renature rapidly whilst other components renature more slowly.

The interpretation of this is that repeated sequences renature rapidly because they are at a higher concentration than are single-copy sequences.

As stated at the linked Wikipedia article:

… it was through C₀t analysis that the redundant (repetitive) nature of eukaryotic genomes was first discovered.

And finally here is a quotation from Britten and Kohne's 1968 paper in Science:

The rate of reassociation of the complementary strands of DNA of viral and bacterial origin is inversely proportional to the (haploid) DNA content per cell. However, a large fraction of the DNA of higher organisms reassociates much more rapidly than would be predicted from the DNA content of each cell. Another fraction appears to reassociate at the expected rate. It is concluded that certain segments of the DNA are repeated hundreds of thousands of times. A survey of a number of species indicates that repeated sequences occur widely and probably universally in the DNA of higher organisms.
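The kinetic argument can be made concrete with a small sketch. It assumes ideal second-order reassociation, in which the fraction of DNA still single-stranded is 1 / (1 + k·C₀t), and treats the genome as a mixture of components with different rate constants. The function names, the rate constants and the 30/70 split are illustrative assumptions, not values from Britten and Kohne.

```python
def fraction_single_stranded(cot, components):
    """Fraction of DNA still single-stranded at a given C0t value.

    components: list of (genome_fraction, rate_constant) pairs whose
    fractions sum to 1. Each component is assumed to follow ideal
    second-order kinetics: f / (1 + k * C0t).
    """
    return sum(frac / (1.0 + k * cot) for frac, k in components)


def cot_half(rate_constant):
    """C0t value at which one component is half reassociated (1 / k)."""
    return 1.0 / rate_constant


# Hypothetical genome: 30% highly repeated DNA (fast), 70% single-copy (slow).
genome = [(0.3, 100.0), (0.7, 0.001)]

for cot in (1e-3, 1e-1, 1e1, 1e3, 1e5):
    remaining = fraction_single_stranded(cot, genome)
    print(f"C0t = {cot:g}: {remaining:.3f} single-stranded")
```

At intermediate C₀t values the curve sits on a plateau near 0.7: the fast component has fully reassociated while the single-copy DNA has barely started, so the height of the first step gives an estimate of the repeated fraction.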

In agreement with @Nicolai, I don't think you can!

Of course, if you assume a priori knowledge about the fraction of repeated elements in related organisms, you could try to estimate it. Typically, if you consider an angiosperm, you know you're going to have many more repeated elements than if you consider a bacterium. Also, the size of cells and the size of the nucleus are correlated with the genome size (Gregory 2001), and the genome size is itself correlated with the number of repeating sequences (Flavell et al. 1974, Bennetzen et al. 2005). However, those correlations are pretty loose, and it will not be possible to be accurate with these methods.

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Currently over 56% of human genomic sequence is identified and masked by the program. Sequence comparisons in RepeatMasker are performed by one of several popular search engines including nhmmer, cross_match, ABBlast/WUBlast, RMBlast and Decypher. RepeatMasker makes use of curated libraries of repeats and currently supports Dfam ( profile HMM library derived from Repbase sequences ) and Repbase, a service of the Genetic Information Research Institute.
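Once a sequence has been masked this way, the masked fraction can be read straight off the output. The following is a minimal sketch that counts N characters in FASTA-formatted text. It assumes the default N-masking described above, and it will also count any Ns that were already in the input (for example assembly gaps). The function name and the example record are hypothetical.

```python
def masked_fraction(fasta_text):
    """Return the fraction of bases that are 'N' in FASTA-formatted text.

    Header lines (starting with '>') and blank lines are ignored.
    """
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += line.upper().count("N")
    return masked / total if total else 0.0


# Hypothetical masked record: 10 of 24 bases are N.
example = ">seq1\nACGTNNNNNNACGT\nNNNNACGTAC\n"
print(f"{masked_fraction(example):.1%} of bases are masked")
```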

If you would like to keep up with news and announcements relating to RepeatMasker, you can follow us on Twitter: @RepeatMasker

RepeatModeler 2.0.2 Released
Monday, May 3, 2021
A new version of RepeatModeler is available. This release includes a set of manual curation tools for use with de-novo generated TE libraries, in addition to miscellaneous bugfixes and improvements.
RepeatMasker 4.1.2-p1 Released
Thursday, April 1, 2021
A new patch release of RepeatMasker is available for download. This release fixes a bug in 4.1.1/4.1.2 with the processing of Alu sequences in primates. In these prior releases Alu sequences were being correctly masked, however they were not being automatically compared to the larger Alu subfamily library and did not receive detailed subfamily annotation. See the RepeatMasker page for installation details.
RepeatMasker 4.1.2 Released
Friday, March 19, 2021
A new release of RepeatMasker is available for download. This release fixes some minor issues with RepeatMasker and its auxiliary tools. More importantly, this release remedies a problem with its use by RepeatModeler that can cause poor classification performance in RepeatModeler's de novo libraries. See the RepeatMasker page for installation details.
RMBlast 2.11.0
Thursday, March 11, 2021
RMBlast has been updated to the latest version of the NCBI BLAST+ tools (2.11.0), including binaries for 64-bit Mac and Linux. This version introduced opt-out usage reporting, which we have modified in our RMBlast distributions. See the RMBlast page for more details.
RepeatMasker 4.1.1 Released
Thursday, September 3, 2020
A new release of RepeatMasker is available for download. In this version we have added support for Dfam 3.2 and for FamDB library files. FamDB is an HDF5-based format which stores family models (HMM and consensus sequences), family metadata, and a subset of the NCBI taxonomy database relevant to the families stored within. In addition, RepeatMasker includes a utility tool which supports a wide range of querying and exporting capabilities on the data stored in this format. See the RepeatMasker page for installation details.
RMBlast 2.10.0
Wednesday, January 8, 2020
RMBlast has been updated to the latest version of the NCBI Blast+ tools (2.10.0), including binaries for 64-bit Mac and Linux. See the RMBlast page for installation details.
RepeatModeler 2.0 Released
Wednesday, November 27, 2019
A new version of RepeatModeler is available with support for structure-based LTR discovery using LtrHarvest and Ltr_retriever. The new workflow developed in collaboration with Jullien Flynn, Andrew Clark, and Cedric Feschotte, vastly improves the quality of the LTR families produced by RepeatModeler. In addition to bugfixes we improved the speed of the masking phase, refactored the configuration system to be more flexible for package managers, and generated both Docker and Singularity containers for simplified installation. A preliminary manuscript has been submitted to bioRxiv [856591].
RepeatMasker 4.1.0 Released
Wednesday, October 30, 2019
A new release of RepeatMasker is available for download. In this version the configuration system has been refactored to make it easier to distribute RepeatMasker via package managers and/or bundle it into Docker/Singularity containers. In addition, we have included a useful Python tool, developed by David Ray's lab, for manipulating/filtering RM annotation files (*.out) and saving output to the BED file format. See the RepeatMasker page for installation details.
RMBlast 2.9.0-p1 bugfix
Wednesday, August 7, 2019
We have identified a bug in NCBI BLAST+ that can occasionally cause a crash or garbled alignments when running rmblastn. We have issued a new patch and reported our findings to NCBI so it can be fixed upstream. See the RMBlast page for installation details.
RepeatMasker Masking Service Changes
Monday, May 20, 2019
As of May 20, 2019, GIRI has rescinded our working agreement allowing the website to offer a repeatmasking service utilizing the RepBase RepeatMasker Edition library. At this time we can only offer masking using the open database Dfam, which starting in 3.0 includes consensus sequences in addition to profile hidden Markov models for many transposable element families. Users requiring RepBase will need to purchase a commercial or academic license from GIRI and run RepeatMasker locally. We are working to expand the Dfam database and invite you to visit Dfam for more information.
RepeatMasker 4.0.9
Tuesday, April 9, 2019
A new release of the RepeatMasker package is now available. RepeatMasker will now work with the new combined consensus/HMM Dfam database (Dfam 3.0) and/or user-provided custom libraries out-of-the-box. Dfam is an open database of transposable element (TE) profile HMM models and consensus sequences. The current release (Dfam 3.0) contains 6,235 TE families spanning five organisms (human, mouse, zebrafish, fruit fly and nematode) and a growing number of new species. See the RepeatMasker page for installation details.
RMBlast 2.9.0
Friday, April 5, 2019
RMBlast has been updated to the latest version of the NCBI Blast+ tools (2.9.0). This version is being released as both a patch to the NCBI Blast+ source and as compiled binaries for 64-bit Mac and Linux. Thanks again to NCBI for their help with these efforts. See the RMBlast page for installation details.
Introducing Dfam 3.0
Wednesday, March 6, 2019
The Dfam consortium is excited to announce the release of Dfam 3.0. This release represents a major transition for Dfam from a proof-of-concept database into a funded open community resource. Central to this transition is a major infrastructure and technology update, enabling Dfam to handle the increasing pace of genome sequencing and TE library generation. Equally important, we merged Dfam_consensus with Dfam to produce a single resource for transposable element family modeling and annotation. In doing so, Dfam serves the needs of a broader research community while maintaining a high standard for family characterization (based upon seed alignments), and TE annotation sensitivity. Finally, and most importantly, we are working on making Dfam a community driven resource through the development of online curation tools and direct user engagement. [ read more ].
To access the database head over to
RepeatMasker 4.0.8 And Libraries Released
Wednesday, November 21, 2018
A new RepeatMasker package, Repeat Protein Database, and RepBase RepeatMasker-edition have been released. The Repeat Protein Database grew by over 7400 entries and includes 16.1 million amino acids covering 133 subclasses of transposable elements. For more information on this library see the documentation that accompanies the library. In addition we have updated the RepeatMasker libraries for RepBase ( Repbase RepeatMasker-edition version 20180826, RepBase version 23.08 ). The update includes over 4500 new families from: rice (1652), the western painted turtle (472), the african clawed frog (215), wood tobacco (210), and the sweet potato whitefly (182) among others. The new RepeatMasker package may be downloaded from here. The new RepBase RepeatMasker-edition is available for download at:
  • Dfam_consensus - Today we released a new version of the database containing several new families for the African Golden Mole and a library for the Collared Flycatcher provided by Alexander Suh.
  • RepeatMasker, RepeatModeler, and Coseg software development repositories are now available on GitHub. Help requests may now be submitted through the GitHub site in addition to the website.
  • RepeatModeler - We have been working hard on eliminating several bugs, and improving the Dfam_consensus import tool based on feedback we have received. The latest version is 10.0.11 and may be downloaded from: GitHub or
- RepeatMasker uses the Dfam database of repeat profile hidden Markov models and consensus sequences to conduct searches.
- RepeatMasker can also make use of Repbase, which is a service of the Genetic Information Research Institute. Repbase is a database of repetitive element consensus sequences.
- Data and computational resources for the Pre-Masked Genomes page are provided courtesy of the UCSC Genome Bioinformatics group.

Human Genome Sequencing: Approaches and Applications

A list of different methods used for mapping of human genomes is given below. These techniques are also useful for the detection of normal and disease genes in humans.

1. DNA sequencing : Physical map of DNA can be identified with highest resolution.

2. Use of probes : To identify RFLPs, STS and SNPs.

3. Radiation hybrid mapping: Fragment genome into large pieces and locate markers and genes. Requires somatic cell hybrids.

4. Fluorescence in situ hybridization (FISH) : To localize a gene on chromosome.

5. Sequence tagged site (STS) mapping : Applicable to any part of DNA sequence if some sequence information is available.

6. Expressed sequence tag (EST) mapping : A variant of STS mapping in which expressed genes are mapped and located.

7. Pulsed-field gel electrophoresis (PFGE) : For the separation and isolation of large DNA fragments.

8. Cloning in vectors (plasmids, phages, cosmids, YACs, BACs) : To isolate DNA fragments of variable length.

9. Polymerase chain reaction (PCR) : To amplify gene fragments.

10. Chromosome walking : Useful for cloning of overlapping DNA fragments (restricted to about 200 kb).

11. Chromosome jumping : DNA can be cut into large fragments and circularized for use in chromosome walking.

12. Detection of cytogenetic abnormalities : Certain genetic diseases can be identified by cloning the affected genes e.g. Duchenne muscular dystrophy.

13. Databases : Existing databases facilitate gene identification by comparison of DNA and protein sequences.

To elucidate the human genome, the two HGP groups used different approaches. The IHGSC predominantly employed a map-first, sequence-later approach. The principal method was hierarchical shotgun sequencing. This technique involves fragmenting the genome into pieces of 100-200 kb, inserting them into vectors (mostly bacterial artificial chromosomes, BACs) and cloning them. The cloned fragments could then be sequenced.

Celera Genomics used the whole-genome shotgun approach. This bypasses the mapping step and saves time. Further, the Celera group had high-throughput sequencers and powerful computer programs that helped with the early completion of the human genome sequence.

Whose Genome was Sequenced?

One of the intriguing questions of the human genome project is whose genome is being sequenced, and how it will relate to the world's population of 6 billion or so, with all its variation. There is no simple answer to this question.

However, looking from the positive side, it does not matter whose genome is sequenced, since the phenotypic differences between individuals are due to variations in just 0.1% of the total genome sequences. Therefore many individual genomes can be used as source material for sequencing.

Much of the human genome work was performed on the material supplied by the Centre for Human Polymorphism in Paris, France. This institute had collected cell lines from sixty different French families, each spanning three generations. The material supplied from Paris was used for human genome sequencing.

Human Genome Sequence -Results Summarised:

The information on the human genome projects is too vast, and only some highlights can be given below. Some of them are briefly described.

Major Highlights of human Genome:

1. The draft represents about 90% of the entire human genome. It is believed that most of the important parts have been identified.

2. The remaining 10% of the genome sequences are at the very ends of chromosomes (i.e. telomeres) and around the centromeres.

3. Human genome is composed of 3200 Mb (or 3.2 Gb) i.e. 3.2 billion base pairs (3,200,000,000).

4. Approximately 1.1 to 1.5% of the genome codes for proteins.

5. Approximately 24% of the total genome is composed of introns that split the coding regions (exons), and appear as repeating sequences with no specific functions.

6. The number of protein coding genes is in the range of 30,000-40,000.

7. An average gene consists of 3000 bases; the sizes, however, vary greatly. The dystrophin gene is the largest known human gene, with 2.4 million bases.

8. Chromosome 1 (the largest human chromosome) contains the highest number of genes (2968), while the Y chromosome has the lowest. Chromosomes also differ in their GC content and number of transposable elements.

9. Genes and DNA sequences associated with many diseases such as breast cancer, muscle diseases, deafness and blindness have been identified.

10. About 100 coding regions appear to have been copied and moved by RNA-based transposition (retrotransposons).

11. Repeated sequences constitute about 50% of the human genome.

12. A vast majority of the genome (about 97%) has no known functions.

13. Between the humans, the DNA differs only by 0.2% or one in 500 bases.

14. More than 3 million single nucleotide polymorphisms (SNPs) have been identified.

15. Human DNA is about 98% identical to that of chimpanzees.

16. About 200 genes are close to that found in bacteria.

Most of the Genome Sequence is Identified:

About 90% of the human genome has been sequenced. It is composed of 3.2 billion base pairs (3200 Mb or 3.2 Gb). If written in the format of a telephone book, the base sequence of human genome would fill about 200 telephone books of 1000 pages each. Some other interesting analogs/ sidelights of genome are given in Table 12.3.

Individual differences in genomes:

It has to be remembered that every individual, except identical twins, has their own version of the genome sequence. The differences between individuals are largely due to single nucleotide polymorphisms (SNPs). SNPs represent positions in the genome where some individuals have one nucleotide (e.g. an A), and others have a different nucleotide (e.g. a G). The frequency of occurrence of SNPs is estimated to be one per 1000 base pairs. About 3 million SNPs are believed to be present and at least half of them have been identified.
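The SNP count follows from simple arithmetic on the figures given in the text. The sketch below only restates those numbers (a 3.2 Gb genome and one SNP per 1000 base pairs); the variable names are illustrative.

```python
GENOME_SIZE_BP = 3_200_000_000   # 3.2 Gb, as stated in the text
SNP_SPACING_BP = 1_000           # one SNP per 1000 base pairs

# Expected number of SNP positions across the genome.
expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP

# The same spacing expressed as a per-base difference between individuals.
difference_percent = 100 / SNP_SPACING_BP

print(f"Expected SNPs: {expected_snps:,}")
print(f"Difference between individuals: {difference_percent}%")
```

One SNP per 1000 base pairs gives 3.2 million positions, in line with the "about 3 million" quoted above, and corresponds to a difference of 0.1% between individuals.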

Benefits/Applications of Human Genome Sequencing:

It is expected that the sequencing of human genome and the genomes of other organisms will dramatically change our understanding and perceptions of biology and medicine. Some of the benefits of human genome project are given.

Identification of human genes and their functions:

Analysis of genomes has helped to identify the genes, and the functions of some of them. The functions of other genes and the interactions between the gene products need to be further elucidated.

Understanding of polygenic disorders:

The biochemistry and genetics of many single-gene disorders have been elucidated e.g. sickle-cell anemia, cystic fibrosis, and retinoblastoma. A majority of the common diseases in humans, however, are polygenic in nature e.g. cancer, hypertension, diabetes. At present, we have very little knowledge about the causes of these diseases. The information on the genome sequence will certainly help to unravel the mysteries surrounding polygenic diseases.

Improvements in gene therapy:

At present, human gene therapy is in its infancy for various reasons. Genome sequence knowledge will certainly help for more effective treatment of genetic diseases by gene therapy.

Improved diagnosis of diseases:

In the near future, probes for many genetic diseases will be available for specific identification and appropriate treatment.

Development of pharmacogenomics:

The drugs may be tailored to treat the individual patients. This will become possible considering the variations in enzymes and other proteins involved in drug action, and the metabolism of the individuals.

Genetic basis of psychiatric disorders:

By studying the genes involved in behavioural patterns, the causation of psychiatric diseases can be understood. This will help for the better treatment of these disorders.

Understanding of complex social traits:

With the genome sequence now in hand, the complex social traits can be better understood. For instance, recently genes controlling speech have been identified.

Knowledge on mutations:

Many events leading to the mutations can be uncovered with the knowledge of genome.

Better understanding of developmental biology:

By determining the biology of the human genome and its regulatory control, it will be possible to understand how humans develop from a fertilized egg to adults.

Comparative genomics:

Genomes from many organisms have been sequenced, and the number will increase in the coming years. The information on the genomes of different species will throw light on the major stages in evolution.

Development of biotechnology:

The data on the human genome sequence will spur the development of biotechnology in various spheres.

Materials and methods

Plant material

Five representatives of the genus Linum were selected for genome sequencing. Three of them were obtained from the genebank of Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) (Gatersleben, Germany): L. hirsutum subsp. hirsutum L. (accession number LIN 1649, (hir)), L. narbonense L. (accession number LIN 2002, (nar1)) and L. usitatissimum L., var. Stormont cirrus (accession number LIN 2016, (usi)). The sample of L. perenne L. (per1) was collected from the natural population (village Nesvetai neighbourhood, Rostov region, Russian Federation) by Dr. A.A. Svetlova, Komarov Botanical Institute RAS (St. Petersburg, Russian Federation). The sample of L. stelleroides Planchon (ste) was kindly provided by Dr. L.N. Mironova, Botanic Garden Institute of the Far-Eastern Branch of the Russian Academy of Sciences (Vladivostok, Russian Federation).

For genome size estimation, in addition to the above-listed samples, five species from the IPK genebank were used: L. leonii F.W.Schultz (accession number LIN 1672 (leo)), L. lewisii Pursh (accession number LIN 1648, (lew)), L. grandiflorum Desf. (accession number LIN 974 (gra)), L. decumbens Desf. (accession number LIN 1754 (dec)) and L. angustifolium Huds. (accession number LIN 1642 (ang)).

Genome size estimation

The nuclear DNA content of the flax species was determined by comparative Feulgen photometry according to Boyko et al. [55]. Briefly, the modification of this method includes synchronization of root meristem cells by cold treatment at 0–4 °C overnight, fixation in ethanol:acetic acid fixative (3:1) at 2–4 °C, hydrolysis at 50 °C in 1 N HCl for 40 min, maceration with Cellulysin (Calbiochem, USA) and squashing of root tips on microscope slides. About 50 prophase nuclei were measured for each flax sample with an Opton scanning microphotometer. Measurements were made relative to diploid rat hepatocytes containing 7.8 pg DNA per 2C nucleus [56] and Hordeum vulgare var. Odesski 31 containing 22.6 pg DNA per 4C nucleus [57].
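The comparison against a standard can be sketched as a simple ratio. This is a minimal illustration, not the authors' procedure: it assumes that stain intensity is proportional to DNA content, that the sample and standard readings refer to nuclei at the same C-level (otherwise the result must be rescaled), and it uses the commonly quoted conversion of 1 pg ≈ 978 Mbp, which is not stated in the text. The function names and readings are hypothetical.

```python
from statistics import mean

# Commonly used conversion factor (an assumption, not from the source text).
MBP_PER_PG = 978.0


def dna_content_pg(sample_readings, standard_readings, standard_pg):
    """Estimate DNA content (pg) of a sample from photometric readings.

    Assumes readings scale linearly with DNA content and that sample and
    standard nuclei are measured at the same C-level.
    """
    return mean(sample_readings) / mean(standard_readings) * standard_pg


def pg_to_mbp(picograms):
    """Convert a DNA mass in picograms to megabase pairs."""
    return picograms * MBP_PER_PG


# Hypothetical readings in arbitrary absorbance units.
sample = [10.0, 11.0, 9.0]
standard = [40.0, 38.0, 42.0]   # e.g. rat hepatocytes at 7.8 pg per nucleus

estimate_pg = dna_content_pg(sample, standard, standard_pg=7.8)
print(f"{estimate_pg:.2f} pg, about {pg_to_mbp(estimate_pg):.0f} Mbp")
```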

DNA extraction, library construction, and sequencing

Flax seedlings were used for DNA extraction as described earlier [50]. High-quality DNA was used for DNA library preparation with TruSeq DNA Sample Prep Kit (Illumina, USA): 1000 ng of each sample was fragmented by nebulization, and then end repair, 3′-end adenylation, and adapter ligation were performed following the manufacturer’s protocol. DNA fragments about 500–700 bp were excised from agarose gel and purified with MinElute Gel Extraction Kit (Qiagen, USA). Enrichment of DNA fragments was performed using PCR Master Mix and Primer Cocktail (Illumina, USA). Quality and concentration of the obtained libraries were evaluated using Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and Qubit 2.0 fluorometer (Life Technologies, USA). The libraries were sequenced on MiSeq sequencer (Illumina, USA), and paired-end reads (300 + 300 nucleotides) were obtained.

Genomic repetitive fraction identification and analysis

For the analysis of the repeatomes, we used raw paired-end reads obtained by WGS of the five flax species sequenced in this study, together with raw paired-end WGS reads of wild flax species available in the European Nucleotide Archive (ENA) [58]: L. leonii (accession number SRR1592650, (leo)), L. lewisii (accession number SRR1592654, (lew)), L. perenne (accession number SRR1592548, (per2)), L. narbonense (accession number SRR1592545, (nar2)), L. grandiflorum (accession number SRR1592647, (gra)), L. decumbens (accession number SRR1592610, (dec)) and L. angustifolium (accession number SRR1592607, (ang)).

To identify and classify repetitive sequences in flax genomes, raw paired-end WGS reads were analyzed using the RepeatExplorer toolkit [17]. For each studied genome, the WGS reads were filtered by quality, randomly sampled to a final genome coverage of about 0.1X, trimmed to 100 bp length and clustered using the graph-based clustering algorithm. A read similarity cut-off of 90% was used for clustering. The reads belonging to the same clusters were assembled into contigs. A minimum sequence overlap of 55% was used for assembly. The obtained sequence clusters were identified based on a similarity search against a repeat database implemented in RepeatExplorer. Identification and characterization of tandem repeat sequences were conducted with the TAREAN tool of RepeatExplorer. Clusters containing satellite repeats were identified based on a globular or ring-like shape of the cluster graphs. Monomer reconstructions of satellite repeats were generated using k-mer analysis. We took into consideration only putative satellite sequences having a probability of being satellite DNA of at least 0.1 and constituting not less than 0.01% of the genome. The obtained putative satellite repeats were compared with known sequences from NCBI by BLASTN.
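The sampling and filtering arithmetic in this paragraph can be sketched as follows. This is not RepeatExplorer code: the helper names are hypothetical, and the only values taken from the text are the 0.1X coverage, the 100 bp read length, and the satellite thresholds (probability at least 0.1, genome proportion at least 0.01%).

```python
import random


def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed to reach a given genome coverage."""
    return round(genome_size_bp * coverage / read_length_bp)


def sample_reads(reads, n, seed=0):
    """Randomly sample up to n reads without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


def genome_proportion(cluster_read_count, total_sampled_reads):
    """Share of the genome (in percent) estimated from a cluster's read count."""
    return 100.0 * cluster_read_count / total_sampled_reads


def passes_satellite_filter(satellite_probability, proportion_percent):
    """Apply the thresholds quoted in the text: probability >= 0.1, proportion >= 0.01%."""
    return satellite_probability >= 0.1 and proportion_percent >= 0.01


# Hypothetical 1 Mbp genome sampled to 0.1X with 100 bp reads.
n_reads = reads_for_coverage(1_000_000, 0.1, 100)
subset = sample_reads([f"read{i}" for i in range(5000)], n_reads)
print(n_reads, len(subset), genome_proportion(50, n_reads))
```

Because reads are sampled at random and at low coverage, the share of reads falling into a cluster approximates the share of the genome occupied by that repeat family.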

Primer design and PCR amplification of putative satellite DNAs

For all flax samples except L. usitatissimum and L. bienne, the tandem organization of the putative satDNA families found was additionally examined by PCR amplification. For this purpose, primers were designed in opposite orientation to the most conserved regions of the monomer consensus sequences (Table 1).

The characteristic ladder pattern of tandem repeats was checked after electrophoresis in a 2% agarose gel. The reliability of the PCR product was confirmed by Sanger sequencing. For L. usitatissimum and L. angustifolium, the reliability of putative satDNA families was verified by BLASTN comparison with a BioNano genome (BNG) optical map of L. usitatissimum cv. CDC Bethune [59] available in GenBank (NCBI) under GenomeProject ID #68161 (accession numbers CP027619 - CP027633).

Phylogenetic analysis and statistical evaluation

For phylogenetic tree construction based on repeatome composition, a previously published approach was used [60]. Clusters corresponding to plastid and mitochondrial DNA sequences were filtered out prior to phylogenetic inference. Each abundance was divided by a correcting factor (largest abundance/65) to make all numbers in the matrices ≤ 65, as required for TNT tree searches [61, 62]. A data matrix containing the relative proportions of the 200 most abundant clusters was converted to the TNT format (modified Hennig86). Resampling was performed using 100 replicates. The tree was visualized with the iTOL tool [63].
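The rescaling step can be illustrated with a short sketch. It assumes the largest abundance is taken over the whole matrix (rows as species, columns as clusters) and keeps the scaled values as floats; whether the values are rounded afterwards is not stated in the text. The function name and the example matrix are hypothetical.

```python
def scale_for_tnt(matrix, max_state=65.0):
    """Divide every abundance by (largest abundance / max_state).

    After scaling, the largest value equals max_state and all others
    are proportionally smaller.
    """
    largest = max(max(row) for row in matrix)
    if largest <= 0:
        raise ValueError("matrix must contain at least one positive abundance")
    factor = largest / max_state
    return [[value / factor for value in row] for row in matrix]


# Hypothetical abundances for two species and two clusters.
abundances = [[130.0, 13.0],
              [65.0, 0.0]]

scaled = scale_for_tnt(abundances)
print(scaled)
```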

Why repetitive DNA is essential to genome function

There are clear theoretical reasons and many well-documented examples which show that repetitive DNA is essential for genome function. Generic repeated signals in the DNA are necessary to format expression of unique coding sequence files and to organise additional functions essential for genome replication and accurate transmission to progeny cells. Repetitive DNA sequence elements are also fundamental to the cooperative molecular interactions forming nucleoprotein complexes. Here, we review the surprising abundance of repetitive DNA in many genomes, describe its structural diversity, and discuss dozens of cases where the functional importance of repetitive elements has been studied in molecular detail. In particular, the fact that repeat elements serve either as initiators or boundaries for heterochromatin domains and provide a significant fraction of scaffolding/matrix attachment regions (S/MARs) suggests that the repetitive component of the genome plays a major architectonic role in higher order physical structuring. Employing an information science model, the 'functionalist' perspective on repetitive DNA leads to new ways of thinking about the systemic organisation of cellular genomes and provides several novel possibilities involving repeat elements in evolutionarily significant genome reorganisation. These ideas may facilitate the interpretation of comparisons between sequenced genomes, where the repetitive DNA component is often greater than the coding sequence component.


Xer tyrosine recombinases are conserved and ancient

In the present study, we devise a classification system for bacterial YRs that are related to Xer recombinases. Based on phylogenetic reconstruction, we divided all YR proteins into two groups and twenty subgroups. Of these, the most abundant and widely distributed is the Xer subgroup of simple YRs, which includes close homologues of the chromosome dimer resolution proteins XerC/D from 16 bacterial phyla (Fig 1 and Appendix Fig S1). Notably, recent reports described an identical taxonomic distribution of conserved dif DNA sequences (Kono et al, 2011), which serve as Xer recombination sites on bacterial chromosomes, thus implying a widely conserved functional role for these proteins. The phylogeny of XerC/D recombinases in proteobacteria was further found to be correlated with their host taxonomy, suggesting that their evolution follows a strictly vertical trajectory (Carnoy & Roten, 2009). This wide distribution and vertical inheritance of Xer-dif systems indicate that Xer proteins are the most ancient type of bacterial YRs, which may have served as the evolutionary source for other more complex YRs.

Simple tyrosine recombinases can drive the movement of mobile genetic elements

Besides the chromosomal Xer proteins, we identified several simple YRs encoded on MGEs, such as phages or ICEs. Most of these YRs belonged to two subgroups, the IntBrujita subgroup found in mycophages and the IntKX subgroup from gammaproteobacterial ICEs; a few isolated examples fell into the Xer and RitB subgroups (Datasets EV1 and EV3). Notably, several of these proteins have been shown to be required for mobilization of their respective elements (Qiu et al, 2006; Fischer et al, 2010; Flannery et al, 2011; Lunt & Hatfull, 2016). For example, the simple YR of the Brujita phage actively drives its excision and integration (Lunt & Hatfull, 2016), and XerT from Helicobacter pathogenicity islands is needed for their horizontal transfer (Fischer et al, 2010). The close relationship of these MGE-borne YRs with the Xer subgroup and their significance for MGE function suggest that phages and ICEs have repeatedly sequestered Xer genes from their host genomes and functionally repurposed these to drive their transfer. Alternatively, MGE integrases may have been domesticated by the host for dimer resolution functions, as proposed recently (Koonin et al, 2020). However, this contradicts the observed vertical evolution and wide taxonomical distribution of Xer genes.

Acquisition of Arm-binding domain drove the evolution of mobile genetic element integrases

While some phages and ICEs carry simple YRs, most of them possess more complex AB domain-containing YRs. In fact, our analysis detected 27 phages with simple YRs versus 410 with AB domain-carrying YRs (Dataset EV1). Similarly, we found 12 ICEs that carry only simple YRs, whereas 211 had AB domain-containing YRs (Dataset EV3). These complex YRs cluster into a monophyletic group with six distinct subgroups, where all subgroups contain an N-terminal AB domain with similar secondary structure (Fig 3), suggesting a common evolutionary origin. Although the AB domain is less conserved than the CAT domain, sequence similarity can be detected between different YR subgroups. Pfam AB domains from all subgroups belong to the same clan (CL0081), and sequence logos show significant conservation within and across subgroups (Fig 3). In agreement, previous structural studies also revealed a similar AB fold in the lambda phage integrase and in IntTn916 and IntSXT family YRs (Clubb et al, 1999 Wojciak et al, 2002 Szwagierczak et al, 2009 ).

The functional role of the AB domain may be inferred from its function in the lambda phage integrase, one of the best-characterized members of this YR group. Here, the AB domain binds to internal arm DNA sequences within the phage genome and plays an essential role in guiding DNA recombination toward excision or integration in different stages of the phage lifecycle (Tirumalai et al, 1996; Biswas et al, 2005; Radman-Livaja et al, 2006). The presence of an AB domain in the vast majority of ICE and phage integrases indicates that these elements generally benefit from the function of this domain in regulating their integration and excision. MGEs that carry simple YRs may represent an earlier step in evolution with less intricate regulatory features (Lunt & Hatfull, 2016). Interestingly, large serine recombinases that perform integration and excision of discrete phages also require a separate DNA-binding domain for precise regulation (Rutherford & Van Duyne, 2014), indicating that acquisition of accessory DNA-binding domains may be a common strategy in MGE evolution.

Tyrosine recombinase classification aids the annotation of mobile genetic elements in bacterial genomes

The recent increase in bacterial genome sequence data highlighted the impact of MGEs and motivated the development of automated sequence-based MGE mining tools. For insertion sequences, the simplest MGEs that typically contain an RNaseH-like DDE transposase flanked by short inverted repeats, existing pipelines provide confident annotation through homology-based prediction (Xie & Tang, 2017). The recently communicated TIGER pipeline also maps various integrative genomic elements by using Pfam-based annotations (Mageeney et al, 2020). However, for YR-carrying elements the close relationship of YR family transposases/integrases to essential bacterial genes has greatly hampered functional annotation. In particular, the TIGER software tackles this by discarding Xer and Integron-related sequences, assuming that all other YRs are MGE integrases, which results in false-positive hits (Mageeney et al, 2020). Through comprehensive YR classification, we found that the presence of an AB domain in a YR is a strong predictor of its function in phage or ICE mobility. In fact, acquisition of this AB domain by YRs and their cooperation with Xis proteins may have driven evolutionary adaptation of YRs to a “mobile lifestyle”, helping to promote and regulate MGE movement in and between bacterial cells. Sequence features in all YR domains further support functional annotation of MGE-borne proteins and can even enable identification of simple YRs with a specific mobility function. These direct sequence-to-function relationships open up new opportunities for functional annotation and provide clear rules for predicting the mobile nature of YRs and the genomic regions they reside in.
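The rule of thumb described above (an AB domain in a YR predicts a role in phage or ICE mobility, while simple YRs need further evidence) can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the `YRProtein` class, the domain labels `arm_binding` and `catalytic`, and the returned category strings are all hypothetical names, and real annotations would come from a domain search such as Pfam hits.

```python
from dataclasses import dataclass, field

# Hypothetical domain labels for illustration only; a real workflow would
# use the identifiers reported by its domain-annotation tool.
AB_DOMAIN = "arm_binding"
CAT_DOMAIN = "catalytic"


@dataclass
class YRProtein:
    """A candidate tyrosine recombinase with its annotated domains."""
    name: str
    domains: set = field(default_factory=set)


def classify_yr(protein):
    """Apply the AB-domain rule of thumb to one annotated protein."""
    if CAT_DOMAIN not in protein.domains:
        return "not a YR"
    if AB_DOMAIN in protein.domains:
        # AB domain present: strong predictor of phage/ICE mobility.
        return "likely MGE integrase"
    # Simple YRs may be chromosomal (e.g. Xer) or MGE-borne, so the
    # domain architecture alone does not decide their function.
    return "simple YR (needs subgroup-level evidence)"
```

For example, `classify_yr(YRProtein("candidate", {"arm_binding", "catalytic"}))` falls into the MGE integrase category, whereas a protein carrying only the catalytic domain is left for finer subgroup-level classification.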

By using our classification system, we identified new ICEs in diverse genomes with mobile YRs as markers. We found 59 previously unannotated ICEs, substantially expanding the repertoire of known elements and illustrating the power of our approach for automated MGE detection. Going forward, application of our pipeline will help explore the abundance and diversity of MGEs in bacterial genomes in diverse environments. Comprehensive studies of MGE content, distribution, and dynamics are necessary to track MGE transfer and to assess their impact on genetic exchange and bacterial adaptation. These studies are important to better understand the dynamics of microbial communities and to follow the spread of genetic traits, such as antibiotic resistance. Our work prepares the stage for analyzing the full repertoire of MGEs and their cargo in bacterial genomes, thus offering new opportunities to characterize gene flow in bacterial communities.


A transposition in which the transposed material is copied to the transposition site, rather than excised from the original site.

Long strands of repetitive DNA can be found at each end of an LTR retrotransposon. These are termed long terminal repeats (LTRs) and are each a few hundred base pairs long; hence retrotransposons with LTRs are named long terminal repeat (LTR) retrotransposons. LTR retrotransposons are over 5 kilobases long. Between the long terminal repeats there are genes that can be transcribed, equivalent to the retrovirus genes gag and pol. These genes overlap so they encode a protease that processes the resulting transcript into functional gene products. Gag gene products associate with other retrotransposon transcripts to form virus-like particles. Pol gene products include the enzymes reverse transcriptase and integrase, and ribonuclease H domains. Reverse transcriptase carries out reverse transcription of retrotransposon RNA into DNA. Integrase 'integrates' retrotransposon DNA into eukaryotic genome DNA. Ribonuclease cleaves phosphodiester bonds between RNA nucleotides.

LTR retrotransposons encode transcripts with tRNA binding sites so that they can undergo reverse transcription. The tRNA-bound RNA transcript binds to a genomic RNA sequence. The template strand of retrotransposon DNA can hence be synthesised. Ribonuclease H domains degrade eukaryotic genomic RNA to give adenine- and guanine-rich DNA sequences that flag where the complementary noncoding strand has to be synthesised. Integrase then 'integrates' the retrotransposon into eukaryotic DNA using the hydroxyl group at the start of the retrotransposon DNA. This results in a retrotransposon flanked by long terminal repeats at its ends. Because the retrotransposon contains eukaryotic genome information, it can insert copies of itself into other genomic locations within a eukaryotic cell.

An endogenous retrovirus is a retrovirus without pathogenic effects that has been integrated into the host genome by inserting its inheritable genetic information into cells that can be passed on to the next generation, like a retrotransposon. [8] Because of this, endogenous retroviruses share features with retroviruses and retrotransposons. When retroviral DNA is integrated into the host genome, it evolves into an endogenous retrovirus that influences the eukaryotic genome. So many endogenous retroviruses have inserted themselves into eukaryotic genomes that they allow insight into the biology of viral-host interactions and the role of retrotransposons in evolution and disease. Many retrotransposons share features with endogenous retroviruses, notably the property of recognising and fusing with the host genome. However, there is a key difference between retroviruses and retrotransposons, which is indicated by the env gene. Although similar to the gene carrying out the same function in retroviruses, the env gene is used to determine whether the gene is retroviral or retrotransposon. If the gene is retroviral it can evolve from a retrotransposon into a retrovirus. They differ by the order of sequences in pol genes. Env genes are found in LTR retrotransposon types Ty1-copia (Pseudoviridae), Ty3-gypsy (Metaviridae) and BEL/Pao. [9] [8] They encode glycoproteins on the retrovirus envelope needed for entry into the host cell. Retroviruses can move between cells whereas LTR retrotransposons can only move themselves into the genome of the same cell. [10] Many vertebrate genes were formed from retroviruses and LTR retrotransposons. One endogenous retrovirus or LTR retrotransposon has the same function and genomic locations in different species, suggesting their role in evolution. [11]

Like LTR retrotransposons, non-LTR retrotransposons contain genes for reverse transcriptase, RNA-binding protein, nuclease, and sometimes a ribonuclease H domain, [12] but they lack the long terminal repeats. RNA-binding proteins bind the RNA transposition intermediate, and nucleases are enzymes that break phosphodiester bonds between nucleotides in nucleic acids. Instead of LTRs, non-LTR retrotransposons have short repeats that can have an inverted order of bases next to each other, in addition to the direct repeats found in LTR retrotransposons, which are just one sequence of bases repeating itself.

Although they are retrotransposons, they cannot carry out reverse transcription using an RNA transposition intermediate in the same way as LTR retrotransposons. Those two key components of the retrotransposon are still necessary but the way they are incorporated into the chemical reactions is different. This is because unlike LTR retrotransposons, non-LTR retrotransposons do not contain sequences that bind tRNA.

They mostly fall into two types – LINEs and SINEs. SVA elements are the exception between the two as they share similarities with both LINEs and SINEs, containing Alu elements and different numbers of the same repeat. SVAs are shorter than LINEs but longer than SINEs.

While historically viewed as "junk DNA", research suggests that in some cases both LINEs and SINEs were incorporated into novel genes to form new functions. [13]

LINEs

When a LINE is transcribed, the transcript contains an RNA polymerase II promoter that ensures the LINE can be copied from whichever location it inserts itself into. RNA polymerase II is the enzyme that transcribes genes into mRNA transcripts. The ends of LINE transcripts are rich in multiple adenines, [14] the bases that are added at the end of transcription so that LINE transcripts are not degraded. This transcript is the RNA transposition intermediate.

The RNA transposition intermediate moves from the nucleus into the cytoplasm for translation. This yields the products of the two coding regions of a LINE, which in turn bind back to the RNA they were translated from. The LINE RNA then moves back into the nucleus to insert into the eukaryotic genome.

LINEs insert themselves into regions of the eukaryotic genome that are rich in AT bases. At AT regions, the LINE uses its nuclease to cut one strand of the eukaryotic double-stranded DNA. The adenine-rich sequence in the LINE transcript base pairs with the cut strand to flag where the LINE will be inserted with hydroxyl groups. Reverse transcriptase recognises these hydroxyl groups to synthesise the LINE retrotransposon where the DNA is cut. As with LTR retrotransposons, this newly inserted LINE contains eukaryotic genome information so it can be copied and pasted into other genomic regions easily. The information sequences are longer and more variable than those in LTR retrotransposons.

Most LINE copies have variable length at the start because reverse transcription usually stops before DNA synthesis is complete. In some cases this causes the RNA polymerase II promoter to be lost, so these LINEs cannot transpose further. [15]

Human L1

LINE-1 (L1) retrotransposons make up a significant portion of the human genome, with an estimated 500,000 copies per genome. Genes encoding human LINE1 usually have their transcription inhibited by methylation of their DNA, carried out by PIWI proteins and DNA methyltransferase enzymes. L1 retrotransposition can disrupt the nature of transcribed genes by pasting copies inside or near them, which could in turn lead to human disease. LINE1s can only retrotranspose in some cases to form different chromosome structures, contributing to differences in genetics between individuals. [17] There are an estimated 80–100 active L1s in the reference genome of the Human Genome Project, and an even smaller number of those active L1s retrotranspose often. L1 insertions have been associated with tumorigenesis by activating cancer-related genes (oncogenes) and diminishing tumor suppressor genes.

Each human LINE1 contains two regions from which gene products can be encoded. The first coding region contains a leucine zipper protein involved in protein-protein interactions and a protein that binds to the terminus of nucleic acids. The second coding region has a purine/pyrimidine nuclease, a reverse transcriptase and a protein rich in the amino acids cysteine and histidine. The end of the human LINE1, as with other retrotransposons, is adenine-rich. [18] [19] [20]

SINEs

SINEs are much shorter (300 bp) than LINEs. [21] They share similarity with genes transcribed by RNA polymerase II, the enzyme that transcribes genes into mRNA transcripts, and the initiation sequence of RNA polymerase III, the enzyme that transcribes genes into ribosomal RNA, tRNA and other small RNA molecules. [22] SINEs such as mammalian MIR elements have a tRNA gene at the start and an adenine-rich region at the end, as in LINEs.

SINEs do not encode a functional reverse transcriptase protein and rely on other mobile transposons, especially LINEs. [23] SINEs exploit LINE transposition components even though LINE-binding proteins prefer binding to LINE RNA. SINEs cannot transpose by themselves because they cannot encode SINE transcripts. They usually consist of parts derived from tRNA and LINEs. The tRNA portion contains an RNA polymerase III promoter, which is the same kind of enzyme as RNA polymerase II. This makes sure the LINE copies are transcribed into RNA for further transposition. The LINE component remains so that LINE-binding proteins can recognise the LINE part of the SINE.

Alu elements

Alus are the most common SINE in primates. They are approximately 350 base pairs long, do not encode proteins and can be recognized by the restriction enzyme AluI (hence the name). Their distribution may be important in some genetic diseases and cancers. Copying and pasting Alu RNA requires the Alu's adenine-rich end and the rest of the sequence bound to a signal. The signal-bound Alu can then associate with ribosomes. LINE RNA associates with the same ribosomes as the Alu. Binding to the same ribosome allows Alus to interact with LINEs. This simultaneous translation of the Alu element and the LINE allows SINE copying and pasting.

SVA elements are present at lower levels than SINEs and LINEs in humans. The starts of SVA and Alu elements are similar, followed by repeats and an end similar to an endogenous retrovirus. LINEs bind to sites flanking SVA elements to transpose them. SVAs are among the youngest transposons in great ape genomes and among the most active and polymorphic in the human population.

A recent study developed a network method that reveals SVA retroelement (RE) proliferation dynamics in hominid genomes. [24] The method makes it possible to track the course of SVA proliferation, identify as yet unknown active communities, and detect tentative "master REs" that play key roles in SVA propagation, thus providing support for the fundamental "master gene" model of RE proliferation.

Retrotransposons ensure they are not lost by chance by transposing only in cells whose genetic material can be passed on from one generation to the next through parent gametes. However, LINEs can transpose into the human embryo cells that eventually develop into the nervous system, raising the question of whether this LINE retrotransposition affects brain function. LINE retrotransposition is also a feature of several cancers, but it is unclear whether retrotransposition itself causes cancer or is merely a symptom. Uncontrolled retrotransposition is bad for both the host organism and the retrotransposons themselves, so it has to be regulated. Retrotransposons are regulated by RNA interference, which is carried out by a group of short non-coding RNAs. The short non-coding RNA interacts with the protein Argonaute to degrade retrotransposon transcripts and alter the histone structure of their DNA to reduce their transcription.

LTR retrotransposons came about later than non-LTR retrotransposons, possibly from an ancestral non-LTR retrotransposon acquiring an integrase from a DNA transposon. Retroviruses gained additional properties to their virus envelopes by taking the relevant genes from other viruses using the power of LTR retrotransposon.

Due to their retrotransposition mechanism, retrotransposons amplify in number quickly, composing 40% of the human genome. The insertion rates for LINE1, Alu and SVA elements are 1/200 – 1/20, 1/20 and 1/900 respectively. The LINE1 insertion rates have varied a lot over the past 35 million years, so they indicate points in genome evolution.

Notably, a large number of 100-kilobase stretches in the maize genome show variety due to the presence or absence of retrotransposons. However, since maize is genetically unusual compared to other plants, it cannot be used to predict retrotransposition in other plants.

Extended Data Fig. 1 Distribution of genome sizes (GS).

Distribution of genome sizes (GS) in (a) 10,770 angiosperms (out of c. 350,000 known species) and (b) 506 gymnosperms (out of c. 1,000 known species).

Extended Data Fig. 2 Genome proportion of the different classes of repeats.

Genome proportion of the different classes of repeats based on copy number fitted against ln-transformed genome size. The grey regression lines show estimated trends for all 101 species fitted with a beta regression (see also Supplementary Table 4). Orange lines show the estimated slope from phylogenetic least squares (PGLS) using a phylogeny with proportional branch lengths (phy P) fitted with an Ornstein–Uhlenbeck process. We also tested a phylogeny with branch lengths transformed to a cladogram (phy C), and results were similar to this PGLS (not shown).

Extended Data Fig. 3 Transposable element analysis of 77 seed plant species.

Analysis of 77 seed plant species (69 angiosperms (1 early-diverging angiosperm, 53 eudicots, 15 monocots) and eight gymnosperms) showing how the proportion of the genome occupied by transposable element-related protein coding domains varies with ln-transformed genome size. The regression lines show the slopes estimated from a beta regression, and from a PGLS with an Ornstein–Uhlenbeck process. The regression line of the graph is similar to that seen for the whole repetitive fraction (see Extended Data Fig. 2a).

Extended Data Fig. 4 Genome proportion of repeats in eudicots, monocots and gymnosperms.

Genome proportion of repeats in four categories (sequences with ≤ 20 copies, and low (21–500), middle (501–10,000) and high (>10,000) copy sequences) fitted against ln-transformed genome size, separately for eudicots, monocots and gymnosperms. See also Supplementary Table 4, which shows significant relationships in these datasets.

Physical Maps

A physical map provides detail of the actual physical distance between genetic markers, as well as the number of nucleotides. There are three methods used to create a physical map: cytogenetic mapping, radiation hybrid mapping, and sequence mapping. Cytogenetic mapping uses information obtained by microscopic analysis of stained sections of the chromosome (Figure 2). It is possible to determine the approximate distance between genetic markers using cytogenetic mapping, but not the exact distance (number of base pairs). Radiation hybrid mapping uses radiation, such as X-rays, to break the DNA into fragments. The amount of radiation can be adjusted to create smaller or larger fragments. This technique overcomes the limitation of genetic mapping and is not affected by increased or decreased recombination frequency. Sequence mapping resulted from DNA sequencing technology that allowed for the creation of detailed physical maps with distances measured in terms of the number of base pairs. The creation of genomic libraries and complementary DNA (cDNA) libraries (collections of cloned sequences or all DNA from a genome) has sped up the process of physical mapping. A genetic site used to generate a physical map with sequencing technology (a sequence-tagged site, or STS) is a unique sequence in the genome with a known exact chromosomal location. An expressed sequence tag (EST) and a simple sequence length polymorphism (SSLP) are common STSs. An EST is a short STS that is identified with cDNA libraries, while SSLPs are obtained from known genetic markers and provide a link between genetic maps and physical maps.

Figure 1: This is a physical map of the human X chromosome. (credit: modification of work by NCBI, NIH)

Analysis of Simple Sequence Repeats in Genomes of Rhizobia

Simple sequence repeats (SSRs), or microsatellites, are ubiquitous in the genomes of various organisms and are widely used as genetic markers. The analysis of SSRs in rhizobial genomes provides useful information for a variety of applications in the population genetics of rhizobia. We analyzed the occurrence, relative abundance, and relative density of the most common SSRs in the Bradyrhizobium japonicum, Mesorhizobium loti, and Sinorhizobium meliloti genomes held in the microorganisms tandem repeats database, and compared the SSRs of the three genomes with each other. The results showed that there were 1,410, 859, and 638 SSRs in the B. japonicum, M. loti, and S. meliloti genomes, respectively. In all three genomes, tetranucleotide, pentanucleotide, and hexanucleotide repeats were more abundant, indicating higher mutation rates in these species. Mononucleotide repeats were the least abundant. The SSR types and distributions were similar among these species.
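As a rough illustration of the three quantities named in this abstract, the sketch below scans a sequence for perfect SSRs with motifs of 1 to 6 bp and summarises them. It is not the method used in the study: the minimum repeat counts in `MIN_REPEATS` are assumed thresholds, the function names are made up for this example, and relative abundance and relative density are taken here to mean SSRs per kb and SSR base pairs per kb, which is a common convention but an assumption with respect to this abstract.

```python
import re

# Minimum repeat counts per motif length (1-6 bp). These thresholds are
# assumptions for illustration, not the ones used in the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}


def is_periodic(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in min_repeats.items():
        # One k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_periodic(motif):
                continue  # belongs to a shorter motif class
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def ssr_summary(seq, hits):
    """Relative abundance (SSRs per kb) and relative density (SSR bp per kb)."""
    kb = len(seq) / 1000
    ssr_bp = sum(len(motif) * count for _, motif, count in hits)
    return len(hits) / kb, ssr_bp / kb
```

Calling `find_ssrs` on a genome string and passing the result to `ssr_summary` gives per-genome figures that can be compared across species. The regular-expression scan is non-overlapping within each motif length, so an SSR that begins inside a run already matched for the same motif length can be missed; a production tool would handle such edge cases and imperfect repeats.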