Deletion of codons at the start of a sequence in preparation for heterologous expression. Why?

I am reading a patent where they isolate a gene from cDNA constructed from RNA extracted from plant matter.

The subsequent step (in preparation for heterologous expression in E. coli) puzzles me:

The codon optimization I can understand, but this part confuses me: "… remove the first 23 codons from the DNA sequence and replace [the start?] by the ATGGCT sequence."

Why would one remove a set of codons at the start of a sequence? How did they decide how many to remove? Is there an obvious reason I am missing? Also, what is special about the ATGGCT sequence?

Any ideas why they would do this?

The start of the new sequence (1650 bp) looks like this:

atggctaccg ataatgacag ctctgaaaac cgtcgtatgg gtaattacaa gccgtccatc 60
tggaactacg acttcctgca gtccctggct acccgccaca atatcatgga agagcgccac 120

whereas the original sequence (1710 bp) was this:

atggattctt ccaccgccac cgccatgaga gctccattca ttgatcatac tgatcatgtg 60
aatctcagaa ctgataacga ttcctcagag aatcgaagga tggggaatta taaacccagt 120

Another point that confuses me: 1710 - 23 × 3 + 6 = 1647, but the new sequence is 1650 bp. What gives?
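The deletion and the length arithmetic can be checked directly against the two fragments quoted above. The following is a minimal sketch, not anything taken from the patent: it translates both 120-nt fragments with the standard genetic code, skips the two residues encoded by the added ATGGCT (Met-Ala), and looks up where the rest of the new protein starts in the original one. The names `ORIGINAL`, `MODIFIED`, `translate`, and `removed_codons` are made up for this illustration.

```python
# First 120 nt of each sequence, copied from the question (spaces removed below).
ORIGINAL = ("atggattctt ccaccgccac cgccatgaga gctccattca ttgatcatac tgatcatgtg "
            "aatctcagaa ctgataacga ttcctcagag aatcgaagga tggggaatta taaacccagt"
            ).replace(" ", "")
MODIFIED = ("atggctaccg ataatgacag ctctgaaaac cgtcgtatgg gtaattacaa gccgtccatc "
            "tggaactacg acttcctgca gtccctggct acccgccaca atatcatgga agagcgccac"
            ).replace(" ", "")

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


old_protein = translate(ORIGINAL)
new_protein = translate(MODIFIED)

# Residues 3-12 of the new protein follow the added Met-Ala; their 0-based
# position in the original protein equals the number of codons removed.
removed_codons = old_protein.find(new_protein[2:12])

# Length expected from the patent's description: delete the codons, add ATGGCT.
expected_length = 1710 - 3 * removed_codons + 6
unexplained_nt = 1650 - expected_length

print(old_protein)
print(new_protein)
print(removed_codons, expected_length, unexplained_nt)  # 23 1647 3
```

On these fragments the new protein is Met-Ala followed by the original protein from residue 24 onward, so the "first 23 codons" statement holds at the amino acid level. The remaining 3 nt correspond to exactly one codon, but the quoted fragments cover only the 5′ ends, so they cannot show where that extra codon sits.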

The easiest answer would be to ask the authors of the patent. Several other possibilities (see comments) were excluded, such as:

  • restriction sites which interfere with cloning
  • signal peptide
  • proteolytic cleavage

Although it's just a theory, removing part of the N-terminus might improve protein stability and half-life, because the first 30-40 amino acids lack an ordered structure (as judged by submitting the sequence to Phyre). Such disordered regions are known to promote aggregation and to influence protein stability.


The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins.

Sequence composition of disordered regions fine-tunes protein half-life

Production of prone-to-aggregate proteins

CRISPR-induced double-strand breaks trigger recombination between homologous chromosome arms

CRISPR–Cas9–based genome editing has transformed the life sciences, enabling virtually unlimited genetic manipulation of genomes: The RNA-guided Cas9 endonuclease cuts DNA at a specific target sequence and the resulting double-strand breaks are mended by one of the intrinsic cellular repair pathways. Imprecise double-strand repair will introduce random mutations such as indels or point mutations, whereas precise editing will restore or specifically edit the locus as mandated by an endogenous or exogenously provided template. Recent studies indicate that CRISPR-induced DNA cuts may also result in the exchange of genetic information between homologous chromosome arms. However, conclusive data of such recombination events in higher eukaryotes are lacking. Here, we show that in Drosophila, the detected Cas9-mediated editing events frequently resulted in germline-transmitted exchange of chromosome arms—often without indels. These findings demonstrate the feasibility of using the system for generating recombinants and also highlight an unforeseen risk of using CRISPR-Cas9 for therapeutic intervention.

Pathobiology of the Human Erythrocyte and Its Hemoglobins

Martin H. Steinberg, … Benjamin L. Ebert, in Hematology (Seventh Edition), 2018

Posttranscriptional, Translational, and Posttranslational Mechanisms

Processed globin mRNA is exported from the nucleus to the cytoplasm by a mechanism that is not clearly defined. mRNA translation occurs in the cytoplasm (see Fig. 33.7). The triplet codons of mRNA are recognized by the anticodons of specific tRNAs that bring activated amino acid residues to the nascent polypeptide chains. The process of translation, in which an mRNA template directs the synthesis of protein, is typically divided into three phases: initiation, elongation, and termination (see Chapters 1 and 4). Each phase is regulated by a variety of protein factors.

The globin mRNA molecule becomes associated with four to six ribosomes, forming the polyribosome. At least 11 eukaryotic translation initiation factors interact with the polyribosome. They mediate stabilization of a preinitiation complex, binding of the initiator methionine tRNA to ribosomal subunits, binding of mRNA to the preinitiation complex, stabilization of mRNA binding, recognition of the cap site at the 5′ end of mRNA, and release of initiation factors from the preinitiation complex. Several elongation and termination factors have also been defined. Initiation or an early step in the elongation process is the rate-limiting factor.

The first posttranslational step in tetramer formation is the combination of α-globin and non–α-globin chains to form dimers, an event that appears to depend on the relative charge of each globin subunit. The dimers then form tetrameric Hb. Because of charge differences among non–α-globin chains, there is a hierarchy of affinity of these chains for α-globin chains. The combination of α- and β-globin chains is most favored followed by a combination of α-, γ-, and δ-globin chains. Certain mutant Hbs that have gained or lost a charge may alter this hierarchic arrangement. This may influence the proportion of variant Hb present, especially when the patient also inherits an α-thalassemia syndrome, in which the synthesis of α-globin chains is reduced. The supply of available α-globin chains is then limited, and non–α-globin chains compete with one another to form tetramers with the limiting α-globin chain pool.

Globin chain biosynthesis and heme synthesis are mutually important. Heme plays a role in the regulation of the initiation complex. A deficiency of heme (e.g., in iron deficiency) is associated with the accumulation of a repressor of translation initiation factors. Translation of β-globin mRNA appears to be initiated more efficiently than α-globin mRNA, conferring on the associated anemia some of the features of mild α-thalassemia. This phenomenon occurs because heme deficiency depresses the availability of initiating factors for which the less efficient α-mRNA must compete with the more efficient β-mRNA.

The identification of genetic mutations that cause congenital and acquired forms of anemia provides further insight into the pathways required for the coordinated production of globin and heme. More than half of patients with Diamond-Blackfan anemia, a disorder characterized by a severe macrocytic anemia and a paucity of erythroid progenitor cells, have heterozygous germline mutations in the RPS19 gene or other genes encoding ribosomal proteins. Similarly, the macrocytic anemia in patients with myelodysplastic syndrome and a deletion of chromosome 5q is caused by heterozygous deletion of another ribosomal protein gene, RPS14. Haploinsufficiency for these ribosomal protein genes activates the p53 pathway, leading to cell cycle arrest and apoptosis selectively in the erythroid progenitor cells.

Refractory anemia with ring sideroblasts (RARS) is a subtype of myelodysplastic syndrome characterized by iron-loaded mitochondria evident on Prussian blue staining. In the majority of RARS cases, somatic mutations are present in the SF3B1 gene, encoding a core member of the RNA splicing machinery. The precise targets of SF3B1 have not been identified.


To construct this modified genome, we used coselection multiplex automatable genome engineering (CoS-MAGE) (32, 33) to create an E. coli strain (C123) with all 123 AGR codons removed from its essential genes (see Fig. 1A and Dataset S1 for a complete list of AGR codons in essential genes). CoS-MAGE leverages Lambda Red-mediated recombination (34, 35) and exploits the linkage between a mutation in a selectable allele (e.g., tolC) to nearby edits of interest (e.g., AGR conversions), thereby enriching for cells with those edits (Fig. S1). To streamline C123 construction, we chose to start with E. coli strain EcM2.1, which was previously optimized for efficient Lambda Red-mediated genome engineering (33, 36). Using CoS-MAGE on EcM2.1 improves allele replacement frequency by 10-fold over MAGE in nonoptimized strains but performs optimally when all edits are on the same replichore and within 500 kb of the selectable allele (33). To accommodate this requirement, we divided the genome into 12 segments containing all 123 AGR codons in essential genes. A tolC cassette was moved around the genome to enable CoS-MAGE in each segment, allowing us to prototype each set of AGR→CGU mutations rapidly across large cell populations in vivo. (Please see General Replacement Strategy and Troubleshooting Strategy in Materials and Methods for a more detailed discussion). Of the 123 AGR codons in essential genes, 110 could be changed to CGU by this process (Fig. 1), revealing considerable flexibility of codon use for most essential genes. The frequency of allele replacement (in this case, AGR→CGU codon substitution) varied widely across these 110 permissive codons, with no clear correlation between the frequency of allele replacement and the normalized position of the AGR codon in a gene (Fig. 2A).
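The design step described above (locating every in-frame AGR codon and swapping it for the synonymous CGU) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the toy ORF are hypothetical, sequences are written as DNA (so CGU appears as CGT), and the normalized position is taken as codon number divided by the number of codons in the ORF. It does not model MAGE oligo design or any of the overlap and RBS effects discussed later.

```python
AGR_CODONS = ("AGA", "AGG")


def split_codons(orf):
    """Split an ORF into in-frame codons, dropping any trailing partial codon."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def find_agr_codons(orf):
    """Return (codon_number, normalized_position) for each in-frame AGA/AGG.

    Codon numbers are 1-based; normalized_position = codon_number / n_codons.
    """
    codons = split_codons(orf)
    return [(i, i / len(codons))
            for i, codon in enumerate(codons, start=1)
            if codon in AGR_CODONS]


def recode_agr_to_cgt(orf):
    """Replace every in-frame AGA/AGG with CGT (CGU in the mRNA)."""
    return "".join("CGT" if codon in AGR_CODONS else codon
                   for codon in split_codons(orf))


# Hypothetical toy ORF (not an E. coli gene): ATG AGA GCT AGG AAA TAA
toy_orf = "ATGAGAGCTAGGAAATAA"
agr_hits = find_agr_codons(toy_orf)
recoded = recode_agr_to_cgt(toy_orf)

print(agr_hits)
print(recoded)  # ATGCGTGCTCGTAAATAA
```

Working codon by codon rather than with a plain string replace matters here, because an out-of-frame "AGA" or "AGG" spanning two codons must be left alone.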

Construction of strain C123. (Inner) Workflow used to create and analyze strain C123. The DESIGN phase involved identification of 123 AGR codons in the essential genes of E. coli. MAGE oligos were designed to replace all instances of these AGR codons with the synonymous CGU codon. The BUILD phase used CoS-MAGE to convert 110 AGR codons to CGU and to identify 13 AGR codons that required additional troubleshooting. The in vivo TROUBLESHOOTING phase resolved the 13 codons that could not be readily converted to CGU and identified mechanisms potentially explaining why AGR→CGU was not successful. In the STUDY phase, next-generation sequencing, evolution, and phenotyping were performed on strain C123. (Outer) Schematic of the C123 genome (nucleotide 0 is oriented up; numbering is according to strain MG1655). Exterior labels indicate the set groupings of AGR codons. Successful AGR→CGU conversions (110 instances) are indicated by radial green lines, and recalcitrant AGR codons (13 instances) are indicated by radial red lines.

Analysis of attempted AGR→CGU replacements. (A) AGR recombination frequency (mascPCR, n = 96 clones per cell population) was plotted versus the normalized ORF position (residue number of the AGR codon divided by the total length of the ORF). Failed AGR→CGU conversions are indicated by vertical red lines below the x axis. (B) Doubling time of strains in the C123 lineage in LB L medium at 34 °C was determined in triplicate on a 96-well plate reader. Colored bars indicate the set of codons under construction when a doubling time was determined (coloring based on Fig. 1). Each data point represents a different stage of strain construction. Alternative codons were identified for 13 recalcitrant AGR codons in our troubleshooting pipeline, the optimized replacement sequences were incorporated into the final strain (gray section at right, labeled with an asterisk), and the resulting doubling times were measured. Error bars represent SEM in doubling time from at least three replicates of each strain.

Strategy for replacing each set of AGR codons in all of the essential genes of E. coli (EcM2.1). The AGR codons are marked with open triangles (various colors). To start, a dual-selectable tolC cassette (double green line) is recombined into the genome using Lambda Red in a multiplexed recombination along with several oligos targeting nearby (<500 kb) downstream AGR loci (various colored lines). Upon selection for tolC insertion clones, correctly chosen AGR codons (filled triangles) are also observed at a higher frequency because of strong linkage between recombination events at tolC and other nearby (<500 kb) downstream AGR loci. Next, a second recombination is carried out using the same AGR conversion oligo pool but now paired with another oligo to disrupt the tolC ORF with a premature stop; the tolC counterselection is then applied, again enriching the population for AGR conversions. A third multiplexed recombination then fixes the tolC ORF, again targeting AGR loci. After the tolC selection is applied, clones are assayed by mascPCR. If most conversions in a given set had been made, the selectable marker is removed using a repair oligo in a singleplexed or multiplexed recombination (depending on need). The tolC counterselection then is leveraged both to leave a scarless chromosome and to free up the tolC cassette for use elsewhere in the genome.

The remaining 13 AGR→CGU mutations were not observed, suggesting codon substitution frequency below our detection limit of 1% of the bacterial population (Materials and Methods and Dataset S2). These “recalcitrant codons” were assumed to be deleterious or nonrecombinogenic and were triaged into a troubleshooting pipeline for further analysis (Fig. 1). Interestingly, all except 1 of the 13 recalcitrant codons were colocalized near the termini of their respective genes, suggesting the importance of codon choice at these positions: seven were at most 30 nt downstream of the start codon, and five were at most 30 nt upstream of the stop codon (Fig. 2A, Lower and Dataset S3). Because of our unbiased design strategy, we anticipated that several AGR→CGU mutations would present obvious design flaws, such as introducing nonsynonymous mutations (two instances) or RBS disruptions (four instances) in overlapping genes. For example, ftsI_AGA1759 overlaps the second and third codons of murE, an essential gene, introducing a missense mutation (murE D3V) that may impair fitness. Replacing ftsI_AGA with CGA successfully replaced the forbidden AGA codon while conserving the primary amino acid sequence of MurE with a minimal impact on fitness (Fig. 3A and Dataset S2). Similarly, holB_AGA4 overlaps the upstream essential gene tmk, and replacing AGA with CGU converts the tmk stop codon to Cys, adding 14 amino acids to the C terminus of tmk. Although some C-terminal extensions are well tolerated in E. coli (37), extending tmk appears to be deleterious. We successfully replaced holB_AGA with CGC by inserting three nucleotides comprising a stop codon before the holB start codon. This insertion reduced the tmk/holB overlap and preserved the coding sequences of both genes (Fig. S2A).
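The overlapping-gene failure mode described above can be illustrated by translating the same stretch of DNA in two reading frames before and after a codon swap. The sketch below uses a hypothetical 15-nt toy sequence loosely modeled on the ftsI/murE situation (it is not the actual E. coli sequence), and the helper names are invented for this example. It shows how AGA→CGT can be synonymous in the upstream gene yet change Asp to Val in the overlapping frame, whereas AGA→CGA preserves both proteins.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    [a + b + c for a in BASES for b in BASES for c in BASES], AMINO))


def translate(dna, offset=0):
    """Translate DNA starting at `offset`, ignoring a trailing partial codon."""
    dna = dna.upper()[offset:]
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def substitute_codon(seq, start, new_codon):
    """Return `seq` with the 3 nt beginning at index `start` replaced."""
    return seq[:start] + new_codon + seq[start + 3:]


def frames_preserved(wild_type, mutant, frames):
    """Map each named frame to True if its translation is unchanged."""
    return {name: translate(wild_type, off) == translate(mutant, off)
            for name, off in frames.items()}


# Hypothetical toy overlap. Frame 0 (upstream gene): AAT GGC AGA TCG TAA
#                           Frame 1 (downstream gene): ATG GCA GAT CGT
WILD_TYPE = "AATGGCAGATCGTAA"
FRAMES = {"upstream": 0, "downstream": 1}
AGA_START = 6  # index of the in-frame AGA codon in the upstream gene

cgt_mutant = substitute_codon(WILD_TYPE, AGA_START, "CGT")
cga_mutant = substitute_codon(WILD_TYPE, AGA_START, "CGA")

print(translate(WILD_TYPE, 1), translate(cgt_mutant, 1), translate(cga_mutant, 1))
print(frames_preserved(WILD_TYPE, cgt_mutant, FRAMES))
print(frames_preserved(WILD_TYPE, cga_mutant, FRAMES))
```

A check of this kind only catches coding-sequence conflicts; RBS disruption in an overlapping gene would need a separate score.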

Examples of failure mechanisms for four recalcitrant AGR replacements. Wild-type AGR codons are indicated by bold black letters, design flaws are indicated by red letters, and optimized replacement genotypes are indicated by green letters. (A) The genes ftsI and murE overlap with each other. An AGA→CGU mutation in ftsI would introduce a nonconservative Asp3Val mutation in murE. The amino acid sequence of murE was preserved by using an AGA→CGA mutation. (B) Gene secE overlaps with the RBS for the downstream essential gene nusG. An AGG→CGU mutation is predicted to diminish the RBS strength by 97% (53). RBS strength is preserved by using a nonsynonymous AGG→GAG mutation. (C) Gene ssb has an internal RBS-like motif shortly after its start codon. An AGG→CGU mutation would diminish the RBS strength by 94%. RBS strength is preserved by using an AGA→CGA mutation combined with additional wobble mutations indicated by green letters. (D) Gene rnpA has a defined mRNA structure that would be changed by an AGG→CGU mutation. The original RNA structure is preserved by using an AGG→CGG mutation. The RBS (green), start codon (blue), and AGR codon (red) are annotated with like-colored boxes on the predicted RNA secondary structures.

Schematic of three different cases of failure in recalcitrant AGR→CGU mutations. In each case, the top row is the initial sequence, the middle row is the AGR→CGU mutation, and the third row of the primary DNA sequence is the optimized solution converged on in troubleshooting. Green boxes below the DNA sequence indicate amino acid sequence in the same order (the top box is the initial sequence, the middle box gives results from AGR→CGU, and the bottom box shows results from troubleshooting solution). (A) Cases of C-terminal overlap of AGRs at the ends of essential genes with downstream ORFs. (i) The genes ftsI and murE overlap with each other. An AGA→CGU mutation in ftsI would introduce a nonconservative Asp3Val mutation in murE. The amino acid sequence of murE was preserved by using an AGA→CGA mutation. (ii) The genes holB and tmk overlap with each other. An AGA→CGU mutation in holB would introduce a nonconservative Stop214Cys mutation in tmk. The amino acid sequence of tmk was preserved by using an AGA→CGC mutation and adding three nucleotides. (B) Cases of C-terminal overlap of AGRs at the ends of essential genes with the RBS of a downstream gene. (i) Gene secE overlaps with the RBS for the downstream essential gene nusG. An AGG→CGU mutation would diminish the RBS strength by 97% (53). RBS strength is preserved by using an AGG→GAG mutation. (ii) Gene dnaT overlaps with the RBS for the downstream essential gene dnaC. An AGG→CGU mutation would diminish the RBS strength by 77% (53). RBS strength is preserved by using an AGG→CGA mutation. (iii) Gene folC overlaps with the RBS for the downstream gene dedD, shown to be essential in our strain. An AGGAGA→CGUCGU mutation would diminish the RBS strength by 99% (53). RBS strength is preserved by using an AGG→CGGCGA mutation. (C) N-terminal RBS-like motifs causing recalcitrant AGR conversions at the beginning of essential genes. (i) Gene dnaT has an internal RBS-like motif. An AGG→CGU mutation would increase the RBS strength 26 times (53). RBS strength is better preserved by using an AGA→CGU mutation combined with additional wobble mutations. (ii) Gene prfB has an internal RBS-like motif. This RBS-like motif is involved in a downstream planned frameshift in prfB (39). AGG→CGU mutation was possible only by removing the frameshift (leaving a poor RBS-like site). To maintain the frameshift, AGG→CGG mutation and additional wobble were required. In that case, local RBS strength was maintained (fourth row). (iii) Gene ssb has an internal RBS-like motif. An AGG→CGU mutation would diminish the RBS strength by 94%. RBS strength is preserved by using an AGA→CGA mutation combined with additional wobble mutations.

Additionally, the four remaining C-terminal failures included AGR→CGU mutations that disrupt RBS motifs belonging to downstream genes (secE_AGG376 for nusG, dnaT_AGA532 for dnaC, and folC_AGAAGG1249,1252 for dedD, the last constituting two codons). Both nusG and dnaC are essential, suggesting that replacing AGR with CGU in secE and dnaT lethally disrupts translation initiation and thus the expression of the overlapping nusG and dnaC (Fig. 3B and Fig. S2B). Although dedD is annotated as nonessential (31), we hypothesized that replacing the AGR with CGU in folC disrupted a portion of dedD that is essential to the survival of EcM2.1 (E. coli K-12). In support of this hypothesis, we were unable to delete the 29 nucleotides of dedD that were not deleted by Baba et al. (31) and did not overlap with folC, suggesting that this sequence is essential in our strain. The unexpected failure of this conversion highlights the challenge of predicting design flaws even in well-annotated organisms. Consistent with our observation that disrupting these RBS motifs underlies the failed AGR→CGU conversions, we overcame all four design flaws by selecting codons that conserved RBS strength, including a nonsynonymous (Arg→Gly) conversion for secE.

These lessons, together with previous observations that ribosomes pause during translation when they encounter RBS motifs in coding DNA sequences (20), provided key insights into the N-terminal AGR→CGU failures. Three of the N-terminal failures (ssb_AGA10, dnaT_AGA10, and prfB_AGG64) had RBS-like motifs that were either disrupted or created by CGU replacement. Although prfB_AGG64 is part of the RBS motif that triggers an essential frameshift mutation in prfB (21, 38, 39), pausing motif-mediated regulation of ssb and dnaT expression has not been reported. Nevertheless, ribosomal pausing data (20) showed that ribosomal occupancy peaks are present directly downstream of the AGR codons for ssb and are absent for dnaT (Fig. S3); meanwhile, unsuccessful CGU mutations were predicted to weaken the RBS-like motif for prfB and ssb and to strengthen the RBS-like motif for dnaT (Fig. 3C and Fig. S2C), suggesting a functional relationship between RBS occupancy and cell fitness. Consistent with this hypothesis, successful codon replacements from the troubleshooting pipeline conserve predicted RBS strength compared with the large predicted deviation caused by unsuccessful AGR→CGU mutations (Fig. 4, y axis and comparison of orange asterisks and green dots). Interestingly, attempts to replace dnaT_AGA10 with either CGN or NNN failed; only by manipulating the wobble position of surrounding codons and conserving the Arg amino acid could dnaT_AGA10 be replaced (Fig. S2C). These wobble variants appear to compensate for the increased RBS strength caused by the AGA→CGU mutation: RBS motif strength with wobble variants deviated eightfold from the unmodified sequence, whereas RBS motif strength for AGA→CGU alone deviated 27-fold.

RBS strength and mRNA structure predict synonymous mutation success. Scatter plot showing predicted RBS strength [y axis, calculated with the Salis RBS calculator (53)] versus deviations in mRNA folding [x axis, calculated at 37 °C by the UNAFold calculator (40)]. Small gray dots represent nonessential genes in E. coli MG1655 that have an AGR codon within the first 10 or last 10 codons. Large gray dots represent successful AGR→CGU conversions in the first 10 or last 10 codons of essential genes. Orange asterisks represent unsuccessful AGR→CGU mutations (recalcitrant codons) in essential genes. Green dots represent optimized solutions for these recalcitrant codons. The SRZ (blue-shaded region) is an empirically defined range of mRNA folding and RBS strength deviations, based on the successful AGR→CGU replacement mutations observed in this study. Most unsuccessful AGR→CGU mutations (orange asterisks) cause large deviations in RBS strength or mRNA structure that are outside the SRZ. The genes holB and ftsI are two notable exceptions because their initial CGU mutations caused amino acid changes in overlapping essential genes. Gene folC corresponds to two AGRs. Arrows for four examples of optimized replacement codons (ftsA, folC, rnpA, and rpsJ) show that deviations in RBS strength and/or mRNA structure are reduced. Arrows are omitted for the remaining eight optimized replacement codons to increase readability.

Ribosomal-pausing data drawn from previous work (20) for genes ssb, dnaT, and prfB. The green line represents ribosome-profiling data for each gene. The orange line is the average for all genes with an AGR codon within the first 30 nt of the annotated start codon. The region between the two vertical red lines indicates zones of interest (centered 12 bp after the AGR codon). Interestingly, prfB and ssb show a peak after the AGR codon, but no peak in that location is observed for dnaT. Based on predictions from the Salis calculator (53), replacing AGR with CGU in those three cases is believed to disrupt ribosomal pausing (prfB and ssb) or to introduce ribosomal pausing (dnaT).

To understand better the several remaining cases of N-terminal failure that did not exhibit considerable deviations in RBS strength (rnpA_AGG22, ftsA_AGA19, frr_AGA16, and rpsJ_AGA298), we examined other potential nucleic acid determinants of protein expression. Based on the observation that the mRNA secondary structure near the 5′ end of ORFs strongly impacts protein expression (12), we found that these four remaining AGR→CGU mutations changed the predicted folding energy and structure of the mRNA near the start codon of target genes (Fig. 3D and Fig. S4). Successful codon replacements obtained from degenerate MAGE oligos reduced the disruption of the mRNA secondary structure compared with CGU (Fig. 4, green dots). For example, rnpA has a predicted mRNA loop near its RBS and start codon that relies on base pairing between both guanines of the AGG codon to nearby cytosines (Fig. 3D and Fig. S5A). Importantly, only AGG22CGG was observed out of all attempted rnpA AGG22CGN mutations, and the fact that only CGG preserves this mRNA structure suggests that it is physiologically important (Fig. 3D and Fig. S5 B and C). In support of this notion, we successfully introduced an rnpA AGG22CUG mutation (Arg→Leu) only when we changed the complementary nucleotides in the stem from CC (base pairs with AGG) to CA (base pairs with CUG), thus preserving the natural RNA structure (Fig. S5D) while changing both RBS motif strength and amino acid identity. Our analysis of all four optimized gene sequences showed reduced deviation in computational mRNA folding energy [computed with UNAFold (40)] compared with the unsuccessful CGU mutations (Fig. 4, x axis, orange asterisks, and green dots). Similarly, the predicted mRNA structure [computed with different mRNA folding software, NUPACK (41)] for these genes was strongly changed by CGU mutations and was corrected in our empirically optimized solutions (Fig. S4).

mRNA folding predictions for the four recalcitrant AGR→CGU mutations explained by mRNA folding variations. mRNA folding prediction of 100 nt upstream and 30 nt downstream of the start codon using UNAfold (40). Both the shape of the mRNA folding and the folding energy value must be taken into account to understand failure of the AGR→CGU conversion. AGR depicts the predicted wild-type mRNA, CGU is the mRNA folding prediction with an AGR→CGU mutation (generally not observed), and “Optimized” corresponds to the mRNA-folding prediction of the AGR replacement solution found after in vivo troubleshooting. The predicted free energy of folding of the visualized structure expressed in kilocalories per mole is listed under each structure.

mRNA folding predictions for the gene rnpA. For folding predictions, we used 30 nt upstream and 100 nt downstream of the rnpA start site using UNAfold (40). (A) The wild-type rnpA sequence, with AGG in the blue box. (B) The wild-type rnpA sequence with AGG→CGU in the blue box (not observed). (C) The wild-type rnpA sequence with AGG→CGG in the blue box (observed with no growth-rate defect). (D) The wild-type rnpA sequence with AGG→CTG in the blue box and one complementary mutation CCC→CCA to maintain the mRNA loop (in the blue box) (also observed with no growth-rate defect).

Troubleshooting these 13 recalcitrant codons revealed that mutations causing large deviations from natural mRNA folding energy or RBS strength are associated with failed codon substitutions. By calculating these two metrics for all attempted AGR→CGU mutations, we empirically defined a safe replacement zone (SRZ) within which most CGU mutations were tolerated (Fig. 4, shaded area). The SRZ is defined as the largest multidimensional space that contains none of the AGR→CGU failures associated with mRNA folding energy or RBS strength (Fig. 4, red asterisks). It comprises deviations in mRNA folding energy of less than 10% with respect to the natural codon and deviations in RBS-like motif scores of less than a half log with respect to the natural codon, providing a quantitative guideline for codon substitution. Notably, the optimized solution used to replace the 13 recalcitrant codons always exhibited reduced deviation for at least one of these two parameters compared with the deviation seen with a CGU mutation. Furthermore, solutions to the 13 recalcitrant codons overlapped almost entirely with the empirically defined SRZ. These results suggest that computational predictions of mRNA folding energy and RBS strength can be used as a first approximation to predict whether a designed mutation is likely to be viable. Developing in silico heuristics to predict problematic alleles streamlines the use of in vivo genome engineering methods such as MAGE to identify viable replacement codons empirically. Therefore, these heuristics reduce the search space required to redesign viable genomes, raising the prospect of creating radically altered genomes exhibiting expanded biological functions.
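The two SRZ thresholds stated above translate into a simple screening function. The sketch below is an illustration under assumptions rather than the authors' code: the function name and the example numbers are hypothetical, "a half log" is interpreted as 0.5 in log10 units of the ratio of mutant to wild-type RBS score, and the folding energies and RBS scores are assumed to come from external tools (such as the UNAFold and Salis calculators named in the text), which are not called here.

```python
import math


def in_safe_replacement_zone(dg_wt, dg_mut, rbs_wt, rbs_mut,
                             max_dg_fraction=0.10, max_rbs_log10=0.5):
    """Return True if a mutation stays within both SRZ thresholds.

    dg_wt, dg_mut   : predicted mRNA folding energies (kcal/mol), nonzero wild type
    rbs_wt, rbs_mut : predicted RBS-like motif scores, both positive
    Thresholds: <10% relative change in folding energy and <0.5 log10 units
    change in RBS score (the log base is an assumption of this sketch).
    """
    if dg_wt == 0 or rbs_wt <= 0 or rbs_mut <= 0:
        raise ValueError("need nonzero wild-type energy and positive RBS scores")
    dg_deviation = abs(dg_mut - dg_wt) / abs(dg_wt)
    rbs_deviation = abs(math.log10(rbs_mut / rbs_wt))
    return dg_deviation < max_dg_fraction and rbs_deviation < max_rbs_log10


# Hypothetical values, for illustration only.
small_change = in_safe_replacement_zone(-20.0, -21.0, 100.0, 200.0)   # 5%, 0.30 log
folding_shift = in_safe_replacement_zone(-20.0, -25.0, 100.0, 200.0)  # 25% energy change
rbs_shift = in_safe_replacement_zone(-20.0, -21.0, 100.0, 2700.0)     # 27-fold RBS change

print(small_change, folding_shift, rbs_shift)  # True False False
```

Such a filter is only a first-pass heuristic, in line with the text's description of the SRZ as an empirical guideline; it does not replace in vivo testing.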

Once we had identified viable replacement sequences for all 13 recalcitrant codons, we combined the successful 110 CGU conversions with the 13 optimized codon substitutions to produce strain C123, in which all 123 AGR codons have been removed from all of its annotated essential genes. C123 then was sequenced to confirm AGR removal and analyzed using Millstone, a publicly available genome resequencing analysis pipeline. Two spontaneous AAG (Lys) to AGG (Arg) mutations were observed in the essential genes pssA and cca. Although attempts to revert these mutations to AAG were unsuccessful (perhaps suggesting functional compensation), we were able to replace them with CCG (Pro) in pssA and CAG (Gln) in cca using degenerate MAGE oligos. The resulting strain, C123a, is the first strain completely devoid of AGR codons in its annotated essential genes (Dataset S4). Although some AGR codons in nonessential genes could prove unexpectedly difficult to change, our success in replacing all 123 instances of AGR codons in essential genes provides strong evidence that the remaining 4,105 AGR codons can be completely removed from the E. coli genome, permitting the unambiguous reassignment of AGR translation function (23).

Kinetic growth analysis showed that the doubling time increased from 52.4 (±2.6) min in EcM2.1 (no AGR codons changed) to 67 (±1.5) min in C123a (123 AGR codons changed in essential genes) in lysogeny broth (LB) at 34 °C in a 96-well plate reader (Materials and Methods). Notably, fitness varied significantly during construction of the C123 strain (Fig. 2B). This variation may be attributed to codon deoptimization (AGR→CGU) and compensatory spontaneous mutations to alleviate fitness defects in a mismatch repair-deficient (mutS-) background. Overall the reduced fitness of C123a may be caused by on-target (AGR→CGU) or off-target (spontaneous) mutations that occurred during strain construction. In this way, mutS inactivation is simultaneously a useful evolutionary tool and a liability. Final genome sequence analysis revealed that, along with the 123 desired AGR conversions, C123a had 419 spontaneous nonsynonymous mutations not found in the EcM2.1 parental strain (Fig. S6). Of particular interest was the mutation argU_G15A, located in the D arm of tRNA Arg (argU), which arose during CoS-MAGE with AGR set 4. We hypothesized that argU_G15A compensates for increased CGU demand and decreased AGR demand, but we observed no direct fitness cost associated with reverting this mutation in C123, and argU_G15A does not impact aminoacylation efficiency in vitro or aminoacyl-tRNA pools in vivo (Fig. S7 and Dataset S5). Consistent with the findings of Mukai et al. (25) and Baba et al. (31), argW (tRNA Arg CCU decodes AGG only) was dispensable in C123a because it can be complemented by argU (tRNA Arg UCU decodes both AGG and AGA). However, argU is the only E. coli tRNA that can decode AGA and remains essential in C123a, probably because it is required to translate the AGR codons for the rest of the proteome (23).

Representational graph of the fully recoded genome relative to MG1655. The outer ring contains the set grouping in which each AGR codon (vertical line) is located. Each line contains information on troubleshooting (red if troubleshooting was required, green if it was not). Relative recombination frequency is represented by the position of the dot. Each internal ring represents the mutations that accumulated during strain construction. The target set of AGR codons for each ring is highlighted. The internal rings with black radial lines represent the mutations that accumulated while the 13 recalcitrant codons were mutated to their optimized codon replacements.

G15A ArgU does not affect expression and aminoacylation levels in wild-type and recoded E. coli strains. Northern blot acid-urea PAGE was performed on wild-type and G15A argU tRNA in wild-type E. coli (WT-WT and WT-G15A) and in the final strains C123a and b (501 and 503) in several growth conditions. Aminoacylation levels are comparable to those in wild type for all conditions and combinations, suggesting no effect on charging levels despite the mutation sweeping into the population (Dataset S5).

To evaluate the genetic stability of C123a after removal of all AGR codons from all the known essential genes, we passaged C123a for 78 d (640 generations) to test whether AGR codons would recur and/or whether spontaneous mutations would improve fitness. After 78 d, no additional AGR codons were detected in a sequenced population, and the doubling time of isolated clones ranged from 22% faster to 22% slower than C123a (n = 60).

To gain more insight into how local RBS strength and mRNA folding impact codon choice, we performed an evolution experiment to examine the competitive fitness of all 64 possible codon substitutions at each of the AGR codons (Dataset S6). Although MAGE is a powerful method for exploring viable genomic modifications in vivo, we were interested in mapping the fitness cost associated with less-optimal codon choices, which requires codon randomization depleted of the parental genotype, which we hypothesized to be at or near the global fitness maximum. To do so, we developed a method called “CRAM” (CRISPR-assisted MAGE). First, we designed oligos that not only changed the target AGR codon to NNN but also made several synonymous changes at least 50 nt downstream that would disrupt a 20-bp CRISPR target locus. MAGE was used to replace each AGR with NNN in parallel, and CRISPR/Cas9 was used to deplete the population of cells with the parental genotype. This approach allowed exhaustive exploration of the codon space, including the original codon, but without the preponderance of the parental genotype. Following CRAM, the population was passaged 1:100 every 24 h for 6 d and was sampled before each passage using Illumina sequencing (Fig. 5 and Dataset S6).
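The codon space explored by an NNN randomization can be enumerated directly. The sketch below is illustrative only and is not part of the authors' pipeline: it assumes the standard genetic code, and the function name `classify_substitutions` is hypothetical. For a wild-type AGA codon it separates the 63 alternatives into synonymous Arg codons, stop codons, and other nonsynonymous codons.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(wild_type):
    """Group all 64 codons relative to a wild-type codon (hypothetical helper)."""
    wt_aa = CODON_TABLE[wild_type]
    groups = {"wild_type": [], "synonymous": [], "stop": [], "nonsynonymous": []}
    for codon in sorted(CODON_TABLE):  # alphabetical order, AAA ... TTT
        aa = CODON_TABLE[codon]
        if codon == wild_type:
            groups["wild_type"].append(codon)
        elif aa == wt_aa:
            groups["synonymous"].append(codon)
        elif aa == "*":
            groups["stop"].append(codon)
        else:
            groups["nonsynonymous"].append(codon)
    return groups


groups = classify_substitutions("AGA")
for name, codons in groups.items():
    print(name, len(codons), codons)
```

Under these assumptions, the stop and other nonsynonymous groups together account for every codon that does not encode Arg.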

Codon preference of 14 N-terminal AGR codons. CRAM was used to explore codon preference for several AGR codons located within the first 10 codons of their CDS. Briefly, MAGE was used to diversify a population by randomizing the AGR of interest then CRISPR/Cas9 was used to deplete the parental (unmodified) population, allowing exhaustive exploration of all 64 codons at a position of interest. Thereafter codon abundance was monitored over time by serially passaging the population of cells and sequencing using an Illumina MiSeq. The left y axis (codon frequency) indicates relative abundance of a particular codon (stacked area plot). The right y axis indicates the combined deviations in mRNA folding structure (red line) and internal RBS strength (blue line) in arbitrary units (AU) normalized to 0.5 at the initial time point. Zero indicates no deviation from wild type. The horizontal axis indicates the experimental time point (in hours) at which a particular reading of the population diversity was obtained. The genes bcsB and chpS are nonessential in our strains and thus serve as controls for AGR codons that are not under essential gene pressure.

Sequencing 24 h after CRAM showed that all codons were present, including stop codons (Fig. S8), validating the method as a technique to generate massive diversity in a population. All sequences for further analysis were amplified by PCR with allele-specific primers containing the changed downstream sequence. Subsequent passaging of these populations revealed many gene-specific trends (Fig. 5 and Figs. S8 and S9). Notably, all codons that required troubleshooting (dnaT_AGA10, ftsA_AGA19, frr_AGA16, and rnpA_AGG22) converged to their wild-type AGR codon, suggesting that the original codon was globally optimized. For all cases in which an alternate codon replaced the original AGR, we computed the predicted deviation in mRNA folding energy and local RBS strength (as a proxy for ribosome pausing) for these alternative codons and compared these metrics with the evolution of codon distribution at this position over time. We also computed the fraction of sequences that fall within the SRZ inferred from Fig. 4 (Materials and Methods). CRAM initially introduced a large diversity of mRNA folding energies and RBS strengths, but these genotypes rapidly converged toward parameters that are similar to the parental AGR values in many cases (overlays in Fig. 5). Codons that strongly disrupted predicted mRNA folding and internal RBS strength near the start of genes were disfavored after several days of growth, suggesting that these metrics can be used to predict optimal codon substitutions in silico. In contrast, nonessential control genes bcsB and chpS did not converge toward codons that conserved RNA structure or RBS strength, supporting the conclusion that the observed conservation in RNA secondary structure and RBS strength is biologically relevant for essential genes. Interestingly, tilS_AGA19 was less sensitive to this effect, suggesting that codon choice at that particular position is not under selection. 
Additionally, the average internal RBS strength for the ispG populations converged toward the parental AGR values, but mRNA folding energy averages did not, suggesting that this position in the gene may be more sensitive to RBS disruption than to mRNA folding. Gene lptF followed the opposite trend.

The number of reads for each codon and for each gene in the CRAM experiment at the 24-h time point. CRAM was used to explore codon preference for several N-terminal AGR codons. The left y axis (number of reads) indicates the abundance of a particular codon. The x axis indicates the 64 possible codons ranked from AAA to TTT in alphabetical order. Experimental time point 24 h is presented. Diversity was assayed by Illumina sequencing. The genes bcsB and chpS are nonessential and thus serve as controls for AGR codons that are not under essential gene pressure.

The number of reads for each codon and for each gene in the CRAM experiment at the 144-h time point. CRAM was used to explore codon preference for several N-terminal AGR codons. The left y axis (the number of reads) indicates the abundance of a particular codon. The x axis indicates the 64 possible codons ranked from AAA to TTT in alphabetical order. Experimental time point 144 h is presented. Diversity was assayed by Illumina sequencing. The genes bcsB and chpS are nonessential and thus serve as controls for AGR codons that are not under essential gene pressure.

Interestingly, several genes (lptF, ispG, tilS, gyrA, and rimN) preferred codons that changed the amino acid identity from Arg to Pro, Lys, or Glu, suggesting that noncoding functions trump amino acid identity at these positions. Importantly, all successful codon substitutions in essential genes fell within the SRZ (Fig. 6), validating our heuristics based on an unbiased test of all 64 codons. Meanwhile nonessential control gene chpS exhibited less dependence on the SRZ.

RBS strength and mRNA structure predict codon preference of 14 N-terminal codon substitutions. Scatter plots show the results of the CRAM experiment (Fig. 5). Each panel represents a different gene. The y axis represents RBS strength deviation [calculated with the Salis RBS calculator (53)], and the x axis shows deviations in mRNA folding energy [calculated at 37 °C by the UNAFold calculator (40)]. Codon abundance at the intermediate time point (t = 72 h, chosen to show maximal diversity after selection) is represented by the dot size. Green dots represent the wild-type codon. Blue dots represent synonymous AGR codons. Orange dots represent the remaining 58 nonsynonymous codons, which may introduce nonviable amino acid substitutions. Black squares represent unsuccessful AGR→CGU conversions observed in the genome-wide recoding effort (Fig. 1 and Table 1). The SRZ (blue-shaded region) is the empirically defined range of mRNA folding and RBS strength deviations, based on the successful AGR→CGU replacement mutations observed in this study (Fig. 3). The genes bcsB and chpS are nonessential in our strains and thus serve as controls for AGR codons that are not under essential gene pressure.



First, we applied cRegions and synplot2 on all 24 non-structural polyproteins of alphaviruses (see also ‘alphavirus dataset’ example on the cRegions homepage). We detected a total of six significant signals with cRegions (Fig. 1A) and three significant signals with the synplot2 (Figs. S4A and S5A). The first signal from the 5′ end was recognised by both programs (Fig. 1A) and spanned from positions 138 to 174 in the codon alignment (Table 1). It is a conserved sequence element (CSE) called ‘51 nt CSE’, which acts as an enhancer for the RNA synthesis, affecting viral replication. This CSE forms two stem-loops and is located at positions 155–205 in the Sindbis virus (SINV) genome (Niesters & Strauss, 1990). Thus, the detected signal lies exactly in the region (Table 1).

Figure 1: cRegions analysis of non-structural polyproteins of alphaviruses using Chi-square goodness of fit test.
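As a reminder of what a chi-square goodness-of-fit test computes at a single alignment column, here is a minimal sketch. How cRegions derives its expected counts is not described in this excerpt, so the uniform expectation and the observed counts below are placeholders chosen purely for illustration; `chi_square_gof` is a hypothetical helper name.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square statistic for observed versus expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical column from 24 aligned sequences: counts of A, C, G, T.
observed = [22, 1, 1, 0]
expected = [6, 6, 6, 6]  # placeholder expectation, NOT the cRegions model

statistic = chi_square_gof(observed, expected)
# With 4 categories there are 3 degrees of freedom; the 5% critical value is 7.815.
print(statistic, statistic > 7.815)
```

A column whose statistic exceeds the chosen critical value would be flagged as deviating from expectation; any multiple-testing correction across columns is outside the scope of this sketch.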

| Signal | Description | Dataset | Position on the codon alignment | SFV* | SINV* |
|---|---|---|---|---|---|
| Non-structural polyprotein | | | | | |
| 1 | 51 nt CSE | All (Fig. 1A) | 138–174 | 184–220 | 161–197 |
| 2 | Signal adjacent to capsid binding region | All (Fig. 1A) | 1,149 | NA | 1,148 |
| | | New world (Fig. 1B) | 1,086 and 1,092 | NA | 1,142 and 1,148 |
| 3 | Packaging signal of SFV Complex alphaviruses | All (Fig. 1A) | 2,835 | 2,812 | 2,804 |
| | | SFV Complex (Fig. 1C) | 2,730 | 2,812 | 2,804 |
| 4 | Signal inside the b region | All (Fig. 1A) | 2,967 | 2,944 | 2,936 |
| 5 | Signal adjacent to leaky stop codon | All (Fig. 1A) | 6,834 | 5,536 | 5,768 |
| | | New world (Fig. 1B) | 6,159 | NA | 5,888 |
| 6 | Subgenomic promoter of alphaviruses | All (Fig. 1A) | 8,658 and 8,664 | 7,354 and 7,360 | 7,583 and 7,589 |
| Structural polyprotein | | | | | |
| 1 | UUUUUUA motif | All (Fig. 2) | 2,673–2,679 | 9,825–9,831 | 10,022–10,028 |

The second significant hit, a single nucleotide at position 1,149, was detected only by the cRegions algorithm. However, two adjacent positions, 1,143 and 1,146, were just below the threshold. In the New World alphavirus dataset, in addition to position 1,149, position 1,143 was also significant. The signal is adjacent to the packaging signal of SINV and New World alphaviruses (see also ‘New World alphavirus dataset’ example on the cRegions homepage). It has been shown that a 570 nt fragment (positions 684–1,253) from SINV binds to the viral capsid protein and is required for packaging of SINV. The detected signal lies in this region (Weiss et al., 1989). However, when we analysed VEEVs separately we were able to detect the positions of phylogenetically conserved predicted stem-loops (Fig. S7). The results are similar to the work done by Kim et al. (2011).

The third and fourth signals are also single nucleotides, at positions 2,835 and 2,967, respectively (Fig. 1A). Both signals are located inside a conserved region of nsp2 called region b. This 266-nucleotide region is located from 2,726 to 2,991 in the SFV genome (White, Thomson & Dimmock, 1998). Previous deletion mutation analysis has shown that nucleotides from 2,767 to 2,824 in the b region are required for efficient packaging of the SFV genome (White, Thomson & Dimmock, 1998). The first signal is located in that region. Additionally, analysis of ‘SFV Complex’ viruses separately led to increased significance of the first signal (Fig. 1C), and the same signal became visible with synplot2 (Figs. S4B and S5B). As expected, both signals disappeared in the New World dataset, as the packaging signal is in a different location in these viruses (Fig. 1B). Therefore, dividing datasets into different subsets may help to detect signals that are characteristic only of smaller subgroups.

The fifth significant hit is a single nucleotide at position 6,834. It is downstream of the ‘leaky’ stop codon (stop codon is at 6,814–6,816 on the codon alignment and in the SINV genome at nt 5,748–5,750). Synplot2 was able to detect a much larger region compared to cRegions in the same area (Figs. S4A and S5A). The detected signal is a 3′ stem-loop RNA secondary structure immediately adjacent to the stop codon (+13 nt downstream of the stop codon in SINV). For many alphaviruses, including VEEV and SINV, it has been reported to influence read-through. In the SINV genome, the double helix part (the stem) of the stem-loop is predicted to form between the two regions: 5,763–5,772 and 5,928–5,939 (Firth et al., 2011). Therefore, the detected signal at position 6,834 (5,768 in SINV) is inside the first region. However, when we analysed VEEVs separately, we were able to detect multiple significant signals inside this stem-loop region (Fig. S7).
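Converting between codon-alignment columns and genome coordinates, as done for the positions quoted above, amounts to counting non-gap characters. The following is a minimal sketch with a toy aligned sequence; the helper name `column_to_sequence_position` and its `seq_start` parameter are assumptions for illustration, not part of cRegions.

```python
def column_to_sequence_position(aligned_seq, column, seq_start=1):
    """Map a 1-based alignment column to a 1-based position in the ungapped
    sequence. `seq_start` is the coordinate of the first aligned residue in
    the full genome. Returns None when the column is a gap in this sequence."""
    if aligned_seq[column - 1] == "-":
        return None
    residues_before = sum(1 for ch in aligned_seq[: column - 1] if ch != "-")
    return seq_start + residues_before


toy = "ATG---GCTA"  # toy aligned sequence containing one three-column gap
print(column_to_sequence_position(toy, 7))  # column 7 is the 4th residue
print(column_to_sequence_position(toy, 5))  # a gap column maps to None

# Offset check using the numbers quoted in the text: the stop codon starts at
# alignment column 6,814 and at SINV nt 5,748. If SINV has no gaps between the
# stop codon and column 6,834, that column maps to 5,748 + (6,834 - 6,814).
print(5748 + (6834 - 6814))
```

The last line reproduces the SINV coordinate 5,768 given for the fifth signal, under the stated no-gap assumption.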

The sixth signal consists of two positions, 8,658 and 8,664, on the codon alignment (Fig. 1A; Figs. S4A and S5A). The signal is located within the subgenomic promoter of alphaviruses (Raju & Huang, 1991; Rupp et al., 2015).

cRegions and synplot2 were also applied to the structural polyproteins of alphaviruses (see also ‘alphavirus structural dataset’ example on the cRegions homepage). Sliding window size 2 was used with cRegions. A strong signal was detected in positions 2,643–2,649 on the codon alignment, which corresponds to a UUUUUUA motif (Fig. 2). The motif is responsible for a frameshift in a structural protein (Firth et al., 2008; Chung, Firth & Atkins, 2010). The same signal was detected with synplot2 (Fig. S6).

Figure 2: cRegions analysis of structural polyproteins of alphaviruses.

Requirements on sequences

The method used in cRegions has some limitations and prerequisites (Puustusmaa & Abroi, 2016). First, the sequences under study must have diverged. Second, the embedded functional element must have been under selection. To help users evaluate their sequences in these respects, we added an interactive version of Fig. 3 to the web tool. The plot compares the sequences under study, in terms of divergence and selection, with randomly mutated sequences and with sequences thoroughly analysed in the previous or current study. To evaluate divergence and selection we used the relationship between average pairwise nucleotide identity and average pairwise amino acid identity (Fig. 3). As shown in Fig. 3, randomly mutated simulated sequences form a clear and narrow cluster on the plot. Randomly mutated sequences with a defined extent (N mutations per bp) were used to model neutral evolution and/or non-diverged sequences (more details in ‘Materials and Methods’). The naturally occurring sequences used in the previous and in the current study lie clearly apart from the simulated sequences.
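The two quantities plotted against each other, average pairwise nucleotide identity and average pairwise amino acid identity, can be sketched as follows. The treatment of gaps is not specified in this excerpt, so skipping any column where either sequence has a gap is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical characters over columns where neither aligned
    sequence has a gap (an assumed convention)."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def average_pairwise_identity(alignment):
    """Mean identity over all unordered pairs of aligned sequences."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy codon alignment: the second codon varies only at its third position
# (GCT, GCA, GCC all encode Ala), so the proteins are identical.
codon_alignment = ["ATGGCT", "ATGGCA", "ATGGCC"]
protein_alignment = ["MA", "MA", "MA"]

print(average_pairwise_identity(codon_alignment))    # nucleotide identity
print(average_pairwise_identity(protein_alignment))  # amino acid identity
```

In this toy case the nucleotide identity falls below the amino acid identity, which is the pattern expected when synonymous sites have diverged while the protein is conserved.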

Figure 3: The average pairwise identity of nucleotide sequences from codon alignment plotted against the average pairwise identity of protein sequences in respective MSA.

Sequences having low divergence or/and having close to neutral evolution

As cRegions was designed to work on diverged sequences, the method may give false positive signals in low-divergence sequences or in sequences lying close to neutrally evolving sequences (Fig. 4; Fig. S8). To avoid this, we recommend enabling threshold correction on the web tool. When this option is enabled, expected values are corrected with observed values and an adjusted threshold is calculated (see ‘Materials and Methods’). This removes most of the false positive signals from sequences that are close to randomly mutating sequences (Fig. 4; Fig. S8). We would like to note that the correction is needed only for sequences that are close to the neutrally/randomly evolving sequences (Fig. 3). Another option is to use synplot2, which uses neutral evolution as its null hypothesis (Firth, 2014).

Figure 4: The number of signals in randomly mutated sequences.


cDNA and protein sequences of human, mouse and Drosophila MMS19 orthologs

Public databases were screened with the published amino acid sequence of the yMMS19 gene. Two ESTs derived from human adult brain (R89623) and from mouse thymus (AA939567) were identified with significant homology to the C-terminus of the yMMS19 ORF. To obtain full-length MMS19 cDNAs we screened human and mouse cDNA libraries with RT–PCR probes. The largest clone obtained from a human HeLa cell cDNA library was 3772 bp in length. This cDNA contains an ORF of 2616 bp which can encode a protein of 872 amino acids with a predicted molecular mass of 96 kDa. Alignment with the yeast Mms19 protein indicated amino acid sequence homology extending at least 100 amino acids upstream of the methionine start codon in this human transcript (data not shown), suggesting that this cDNA corresponds to an alternatively spliced form of the primary transcript. We therefore performed 5′-RACE from human testis total RNA using primers that map immediately 5′ to the first methionine codon in the cDNA. This confirmed the presence of alternatively spliced forms of the MMS19 gene (see below) and provided a full-length transcript. To verify these results we amplified the full-length ORF of the hMMS19 gene from human testis total RNA. A cDNA identical to the contiguous sequences previously aligned was obtained. The complete cDNA identified for the hMMS19 gene (GenBank accession no. AF319947) contains an ORF of 3090 nt which can encode a protein of 1030 amino acids (Fig. 1) with a predicted molecular mass of 113 kDa.
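The ORF lengths, residue counts and molecular masses quoted above can be sanity-checked with simple arithmetic. This sketch assumes the quoted ORF lengths exclude the stop codon (they divide exactly into the stated residue counts) and uses the common rule of thumb of about 110 Da per residue, which is an approximation rather than the method used by the authors; the helper names are hypothetical.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average; an approximation


def orf_to_protein_length(orf_nt):
    """Number of residues encoded by an ORF whose length excludes the stop codon."""
    if orf_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    return orf_nt // 3


def approximate_mass_kda(n_residues):
    """Rough molecular mass estimate in kDa from the residue count."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


for orf_nt in (2616, 3090):  # the two human ORF lengths quoted in the text
    residues = orf_to_protein_length(orf_nt)
    print(orf_nt, residues, round(approximate_mass_kda(residues), 1))
```

Both estimates land close to the predicted masses stated in the text; a sequence-based calculation would be needed for exact values.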

Amino acid alignment of the Homo sapiens, Mus musculus, D.melanogaster, A.thaliana, C.elegans, S.pombe and S.cerevisiae Mms19 proteins. The alignment was generated using the CLUSTAL-X program. The highly conserved region of � amino acids at the N-terminus is boxed. Exon 8 (that can be alternatively spliced) corresponds to the first 43 amino acids of this region. The four HEAT repeats are also boxed and labeled I–IV. Only repeat IV of C.elegans is included in this box. The methionine start codon used with alternative splicing of exon 2b or 3 is indicated (*). The 39 amino acids that differ from the previously published MMS19 human protein (9) are overlined. Specifically they are the following: SCCRQLQVHLPVTLSPAMYCLYCWNSSTSTVAASGG. The conserved KRX[LV] sequence in HEAT repeat IV is underlined.

A mouse testis cDNA library yielded a single 977 bp clone in the Mms19 coding region. A full-length cDNA was obtained by multiple rounds of 5′-RACE from mouse testis total RNA. The longest mouse Mms19 cDNA identified (designated mMms19) is 3491 bp in length (GenBank accession no. AF319949) and predicts a polypeptide of 1031 amino acids (Fig. 1) with a molecular mass of 113 kDa. The dMMS19 gene was cloned from whole fly total RNA by RT–PCR. The gene comprises 2880 bp (GenBank accession no. AF319948) and can encode a protein of 958 amino acids (Fig. 1) with a calculated molecular mass of 107 kDa.

yMMS19 homologs from other eukaryotes, including S.pombe (CAB59878), Caenorhabditis elegans (AF067936) and A.thaliana (AB023039), were translated from genomic and partial cDNA clones obtained from public databases. Protein alignments including all the identified MMS19 orthologs confirmed extensive conservation of amino acid sequences (Fig. 1). The predicted amino acid sequences of the human and mouse Mms19 proteins are 90% identical and share 93% similarity. Both the human and mouse polypeptides are essentially the same size as that encoded by the yeast gene (1030, 1031 and 1032 amino acids, respectively). They also share a similar frequency of acidic (10.3% for human and mouse and 12.2% for yeast) and basic (11.5% for human and mouse and 12.1% for yeast) amino acids and a similar estimated pI of 5.92 (5.72 for yeast Mms19 protein). Overall the human and mouse polypeptides share 25% amino acid identity and 50% similarity with the orthologous yeast Mms19 protein. The predicted amino acid sequence for the human protein differs from that previously reported (9) over a stretch of 39 amino acids located between amino acid residues 373 and 412 (Fig. 1).

The Mms19 proteins are confidently predicted to possess predominantly α-helical structure (16), with a small α/β-domain potentially present in the middle part of the protein. Sequence analysis using the SEG program (15) suggests that the Mms19 sequences can be partitioned into two predicted globular regions. The longer N-terminal globular region spans amino acid positions 1�, which includes a highly conserved region of � amino acids located between residues 167 and 285 of the yeast Mms19 protein. This region shares 42% amino acid sequence identity and 62% similarity with the mouse and human proteins (Fig. 1). A shorter C-terminal globular region spans amino acid positions ��, whereas the intervening region is predicted to be non-globular. When the sequence of the C-terminal globular region was independently compared to the protein sequence database using the iterative PSI-BLAST program (12), moderate but statistically significant similarity was detected to a variety of proteins containing so-called HEAT repeats (19,20). Using an expect value of 0.05 as the cut-off for including sequences in the search profile, the sequences of � HEAT repeat proteins were retrieved from the database in 10 search iterations without any obvious false positives. A more detailed analysis performed by searching the Mms19 sequences with a generic HEAT repeat profile (Fig. 2) showed the presence of four repeats (Figs 1 and 2). These comprise a tightly spaced cluster, an arrangement characteristic of HEAT repeat proteins (20). The four repeats are conserved in all available sequences of eukaryotic Mms19 orthologs, with the exception of the C.elegans sequence in which the HEAT repeat region is highly diverged and only the distal repeat is clearly conserved (Figs 1 and 2). Of particular note is the motif KRx[IV]R (alternative amino acids in brackets), which is conserved in the most distal repeat of the Mms19 orthologs (Fig. 1). This motif may function as a specific binding determinant.

Multiple alignment of the HEAT repeats in the Mms19 orthologs, Tor1p and β-importin. The alignment was constructed by parsing PSI-BLAST results, with the repeat boundaries derived from the crystal structure of β-importin (PDB code 1QGKA). The secondary structure underneath the alignment is from the 1QGKA structure: h indicates α-helix. The numbers show the positions of the first and last aligned residues in the corresponding protein sequences. The 80% consensus shows the following classes of amino acid residues: h, hydrophobic (ILVMFYWAC); l, aliphatic (ILV); p, polar (STDENQKRH); s, small (GASTVNAC); +, positively charged (KRH). The positions with the highest information content in the overall HEAT repeat alignment are highlighted in reverse shading.

Alternative splicing of the hMMS19 gene

During cloning of the human and mouse cDNAs we identified different clones, presumably reflecting multiple transcripts. To determine whether these correspond to alternatively spliced products we established the exon/intron boundaries of the hMMS19 gene (Fig. 3A). The gene comprises at least 32 exons spanning >40 kb. The two highly conserved domains described above are encoded by exons 8� and exons 27�, respectively. The most abundant hMMS19 cDNA (3530 bp) corresponds to �% of all transcripts identified and includes all coding exons (2b-32) plus exon 1 in the 5′-UTR (Fig. 3B, transcript A). This cDNA can potentially encode a polypeptide of 1030 amino acids and utilizes the first polyadenylation signal in the 3′-UTR. Transcript B differs only in the 5′-UTR by the presence of exon 2a instead of exon 1 (Fig. 3B). The remaining cDNAs identified can encode polypeptides that are 43, 158 or 201 amino acids shorter than that encoded by transcripts A or B (1030 amino acids) (Fig. 3). Alternative splicing of exon 8 (Fig. 3B, transcripts E and F) is expected to delete part of the N-terminal conserved region in the human Mms19 protein. At least one of these transcripts (transcript D, Fig. 3B) uses the second polyadenylation site (pA2, Fig. 3A). Studies are in progress to evaluate the biological significance of these alternative transcripts. All detected transcripts include the region coding for the C-terminal HEAT repeats.

(A) Schematic representation of the hMMS19 genomic structure. Horizontal lines represent introns. Rectangles/vertical lines correspond to exons. For orientation some exon numbers are displayed. The positions of the putative start codons ATG as well as the stop codon TGA are indicated. Coding sequences are represented as filled boxes. Untranslated regions are represented as open boxes. The first polyadenylation signal (pA1), at nucleotide position 3514, is the most frequently used. A downstream polyadenylation signal (pA2) is used in some transcripts. The complete length of exon 32 could not be determined due to the presence of Alu sequences. (B) Schematic representation of parts of the hMMS19 transcripts identified in normal tissues. Exons 10� were found in all transcripts and are not represented for simplification. Black boxes correspond to coding regions. Open boxes represent untranslated exons. Exon 1 or 2a (non-coding) is exclusively used at the 5′-UTR of the MMS19 gene. Transcripts A and B are shown as examples. Transcripts D and E were also identified with exon 2a instead of exon 1. Alternative splicing of exon 8 can apparently occur alone, as shown in transcript C, or in combination with alternative splicing of exon 2b or 3 (transcripts G and E). Both situations delete 43 amino acids in one of the two highly conserved domains of the Mms19 protein (see Fig. 1). Exons 2b and 3 can also be alternatively spliced, resulting in potentially even shorter polypeptides (transcripts D–G).

Human and mouse MMS19 gene expression patterns

Northern blot analysis of human and mouse adult and fetal tissues revealed the presence of moderate to high steady-state levels of MMS19 mRNA in all tissues analyzed (Fig. 4A and B and data not shown). The most abundant transcript has an estimated size of 3.9 kb in human tissues (Fig. 4A) and 3.8 kb in mouse tissues (Fig. 4B). These presumably correspond to the most abundant cDNAs described above (3530 and 3491 bp for human and mouse, respectively). Both the human and mouse transcripts have consensus AAUAAA polyadenylation signals at positions 3514 and 3441, respectively, followed immediately by a poly(A) tail. Additional bands with estimated sizes of 4.8 and 5.8 kb in human and mouse tissues, respectively, were detected by northern blot analysis in most tissues examined (Fig. 4). Significant differences in the ratio of the two transcripts were observed in different tissues. For example, the 4.8 kb transcript is relatively abundant in human pancreas, but is almost undetectable in skeletal muscle (Fig. 4A). The 4.8 kb band presumably reflects transcripts with a 3′-UTR extending beyond the first polyadenylation signal, as represented in Figure 3A. This region of the gene is very rich in Alu repeat sequences, hence it was not possible to design a probe to confirm this presumption by northern blot analysis.

Northern blot of mRNAs from various adult human (A) and mouse (B) tissues. Cloned and sequence-verified 3′-UTR human and mouse MMS19 products were used as probes (top). β-Actin is shown as an internal control (bottom). The marker track is shown on the left.

In an effort to correlate the transcripts detected by northern blot analysis with the putative alternative splice products at exons 3 and 8 we hybridized the human northern blots with probes derived from both exons. In all cases the results were indistinguishable from those obtained with a 3′-end probe (data not shown). We therefore conclude that both bands detected by northern blotting comprise a heterogeneous collection of alternatively spliced forms.

Chromosomal mapping of the hMMS19 gene

FISH with a BAC clone containing the entire hMMS19 ORF yielded a single site of hybridization at chromosomal band 10q24. This localization is consistent with hybrid mapping results for two STSs (A002G19 and Cda19h12) in the NCBI public database, which align with the 3′-UTR of the cloned hMMS19 gene.

Complementation of the yeast mms19Δ mutant by hMMS19 and dMMS19 cDNA

In view of the amino acid conservation between yeast Mms19 protein and that from higher eukaryotes, especially in the C- and N-terminal regions, we asked whether the MMS19 genes from higher organisms can functionally complement mutant phenotypes of the yeast mms19Δ strain, deleted of the entire MMS19 gene. We overexpressed both hMMS19 and dMMS19 cDNAs under control of the yeast GAL1 promoter in the yeast mms19 deletion mutant. As shown in Figure 5, both cDNAs corrected thermosensitivity for growth of the mms19 deletion mutant. Furthermore, whereas the doubling time of the yeast mutant is � h in liquid culture, at 37°C this doubling time was reduced to ~3.5 h when the mutant strain was transformed with the human, Drosophila or yeast wild-type MMS19 genes. The full-length hMMS19 and dMMS19 genes also complemented the UV radiation sensitivity of the mms19 mutant (Fig. 6). However, these genes failed to complement the methionine auxotrophy of the yeast mms19 mutant (data not shown). As expected, the yMMS19 gene fully complemented the latter phenotype (data not shown).

Rescue of thermosensitive growth of the mms19 deletion mutant. Strains are as follows: mms19Δ (0), mms19Δ+pESC-TRP (1), mms19Δ+yMMS19 (2 and 3), mms19Δ+dMMS19 (4 and 5), mms19Δ+hMMS19 (6 and 7). As positive controls strains were also grown at 30°C. In the absence of pESC-TRP the mms19Δ (0) strain is unable to grow.

Functional complementation of the UV radiation sensitivity of the mms19Δ strain by overexpression of human (mms19Δ+hMMS19) or Drosophila (mms19Δ+dMMS19) MMS19 cDNAs. For comparison the survival curves are shown for mms19Δ transformed with pESC-TRP (empty expression vector) and with the yMMS19 ORF (expression vector with the wild-type yMMS19 gene). W303-1B is the isogenic wild-type strain.

Gene Cloning and Expression and Secretion of Listeria monocytogenes Bacteriophage-Lytic Enzymes in Lactococcus lactis

Fig. 1. Schematic illustration of the vectors used for construction (A and B), intracellular production (C), and secretion (D) of endolysin enzymes. Only the relevant coordinates and some important properties are shown; details are described in the text. Abbreviations: Ampʳ and Erʳ, genes specifying resistance to ampicillin and erythromycin, respectively; P32, lactococcal promoter; SPslpA, signal sequence of L. brevis S-layer protein A; ply511 and ply118, endolysin genes from Listeria bacteriophages A511 and A118, respectively (22).

Production of Ply118 and Ply511 enzymes in L. lactis.

Fig. 2. Decrease of the OD of a suspension of L. monocytogenes WSLC 1001 cells following the addition of cell extracts of L. lactis MG1363 carrying either pLC-PL118-P32, pLC-PL511, or the control vector pTRKH2 (see Materials and Methods).

Fig. 3. Detection of recombinant Ply118 (30.8 kDa) and Ply511 (36.5 kDa), respectively, in the cytoplasmic fractions of overnight cultures of recombinant L. lactis MG1363 (indicated by arrows). Proteins from the cell extracts were separated by SDS-PAGE and detected by Western blotting with anti-Ply antibodies. Lane 1, MG1363(pLC-PL118-P32); lane 2, negative control MG1363(pTRKH2); lane 3, MG1363(pLC-PL511). The positions of molecular mass markers (in kilodaltons) are indicated on the left.

Staphylococcal nuclease as a reporter for SPSlpA-mediated secretion.

SPSlpA enables membrane translocation of active Ply511.

Fig. 4. Schematic representation of the genetic fusion of the SPslpA signal sequence and ply511. The corresponding nucleotide sequence and amino acid sequence of the region joining both fragments in SPslpA–ply511 are shown enlarged. The arrow indicates the proposed signal peptide cleavage site of SlpA (43). The ply gene region is shown in boldface, and the restriction site used for genetic fusion (AatII) is also indicated.

Fig. 5. Colonies of recombinant L. lactis grown on GM17 agar medium containing suspended L. monocytogenes cells. The control strain MG1363(pLC-PL511) shows no lytic effect (A), whereas strain L. lactis MG1363(pSL-PL511ΔC), secreting the C-terminally truncated Ply511 enzyme, shows clear zones of lysis around the individual colonies (B).

Fig. 6. Detection of Ply in cell extracts and supernatants of recombinant L. lactis cultured for 12 h by immunoblotting with Ply-specific antibodies. Lanes 1 and 2, cell extract and supernatant fraction (respectively) of L. lactis MG1363(pLC-PL511); lanes 3 and 4, cell extract and supernatant fraction (respectively) of MG1363(pSL-PL511); lane 5, supernatant of MG1363(pSL-PL511ΔC). The positions of molecular mass markers (in kilodaltons) are indicated on the left.

Gene Synthesis & Cloning/Mutagenesis FAQs

What are the differences between GENEWIZ's gene synthesis service options?

Service | Service Offering | Estimated Turnaround Time | Price
FragmentGENE | Synthesis of double-stranded DNA fragments | 2-4 business days | $
PriorityGENE | Standard gene synthesis with cloning into a plasmid | 8-10 business days | $
TurboGENE 5 | Expedited gene synthesis with cloning into a plasmid and our fastest available turnaround | 5 business days | $$
TurboGENE 7 | Expedited gene synthesis with cloning into a plasmid | 7 business days | $$
Antibody DNA Synthesis | Synthesis of heavy and light chain sequences with cloning into a plasmid | 6-8 business days | $
AAV Plasmid Synthesis | Synthesis of AAV sequences and cloning into a plasmid with AAV-ITR sequence verification using proprietary technologies | 11-15 business days | $$
CRISPR Construct Synthesis | Synthesis of gRNA and cloning into a custom plasmid | 8-10 business days | $
ssDNA Synthesis | Synthesis of single-stranded DNA fragments | 10-15 business days | $

Where can I get general pricing before submitting a quote request?

How do I request a quote for a gene synthesis service?

PriorityGENE, TurboGENE, ValueGENE, antibody DNA synthesis, and AAV plasmid synthesis services require you to request a quote before ordering. FragmentGENE, CRISPR construct synthesis, and ssDNA synthesis services are available for direct order, so you do not need to request a quote for these services.

Log in to your online GENEWIZ account and select “Gene Synthesis”, then click on the name of the specific service you are interested in. You will be directed to a quote request form for that selected service. Once you complete the form, select the “Review Inquiry” button at the bottom of the screen. After you review the entered details, select the “Submit Quote Request” button to request your quote. Upon submission of the inquiry, you can expect to receive a quotation within one business day. You can view and download a PDF copy of the order information, including price, using the “Print Price” button.

What vector does GENEWIZ clone into?

The assembled full-length gene is cloned into the pUC-GW-Kan/Amp vector via the EcoRV site. The final construct is verified with both Sanger DNA sequencing (on at least one strand) and restriction digestion. The sequences of pUC-GW-Kan/Amp can be found here.

How do I submit the starting material required for cloning?

Please submit all required starting material to GENEWIZ. For UK projects (40-) samples should be submitted to our UK subsidiary, and for continental Europe projects (50-/90-) samples should be submitted to our German subsidiary (addresses below) as soon as possible after initiating your order (if required). A detailed list of the required starting material is included in section III "How to start your order" of your gene synthesis quotation.

Kindly submit a 5 µg aliquot (minimum concentration of 20 ng/µL) of the undigested, purified, circular plasmid. Please ensure that the name of the plasmid on the tube matches the name submitted for the order. Include the first page of the order receipt or quote with the submitted plasmid sample to ensure it is accurately identified.

All vectors submitted to our facility will be held in storage for two years and available within this time for future project use.

Direct Shipping (FedEx, UPS & DHL):
If a GENEWIZ Collection Spot is not available, or if your starting material requires dry ice, please use an overnight delivery service such as FedEx, UPS or DHL.

Attn: Project Management
Hope End, Takeley
Essex CM22 6TA
United Kingdom
Tel: +44 01279 873837

Attn: Project Management
Bahnhofstrasse 86
04158 Leipzig
Germany
Phone: +49-341 520 122-41

Review of terminology and DNA sequence variation

DNA is a long double-stranded polymer composed of 4 nucleotides which form complementary base pairs (bp) with each other: adenine (A) with thymine (T), and guanine (G) with cytosine (C). Connected 5′ end to 3′ end (referring to the fifth and third carbons of the sugar), these 4 nucleotides are the building blocks of DNA.

DNA is organized into huge, linear, highly structured molecules which form the chromosomes. Chromatin, the physical organization of DNA and associated proteins, participates in regulating DNA function. Genes are the regions of DNA which encode for proteins. Protein coding regions are defined by the presence of exons, read 5′ to 3′, which are made up of codons, triplets of nucleotides which specify amino acids or signal translation stop. The stretches of nonprotein coding DNA between exons are introns. Splice sites mark the exon-intron boundaries and direct the excision of introns from the RNA message.
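As a rough illustration of how codons are read 5′ to 3′ in triplets, the sketch below splits a coding sequence into codons and translates it with the standard genetic code until the first stop codon. The function names and the example sequence are made up for illustration; it assumes an intron-free coding sequence that is already in frame.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into complete triplets, read 5' to 3'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds: str) -> str:
    """Translate codon by codon, stopping at the first stop codon ('*')."""
    protein = []
    for codon in split_codons(cds):
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequence: ATG GCT ACC GAT AAT TAA
print(translate("atggctaccgataattaa"))
```

The table only covers the four unambiguous bases, so a sequence containing N or other ambiguity codes would need extra handling.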

Control of gene expression

There are numerous functional noncoding DNA elements which participate in gene expression. Promoters, located immediately upstream of genes, are required for gene transcription. DNA regulatory elements which enhance or repress gene expression are often located near (or within introns of) structural genes, but can also lie at great distance. Some elements can control large genomic regions which contain many genes, such as the globin locus control region [4]. Additionally, there are numerous DNA regions which transcribe noncoding functional RNAs, for example, transfer RNAs, ribosomal RNAs, and microRNAs. DNA nucleotides can also be reversibly chemically modified, such as by methylation, to affect elements which influence developmental or tissue-specific gene expression, such as occurs during imprinting or cell-lineage differentiation [5,6].

DNA sequence variation

DNA is a living molecule in that it is constantly changing. DNA replicates during every mitosis, and recombines and segregates with every meiosis. Although DNA-replicative processes operate at extremely high fidelity, they are not (and cannot be) perfect [7]. Thus, DNA variation is rarely but inevitably introduced during the copying of DNA template or ligation of free ends. DNA errors also arise from misrepair of DNA damaged as a result of routine exposure to cellular and environmental sources or by excess ionizing radiation, UV, or chemical insults. DNA-damage repair processes generally exhibit lower fidelity than DNA replication. This error permissiveness is thought necessary to facilitate restoration of a functional genome from corrupted DNA template without stalling DNA repair entirely, and can result in damage-specific patterns of acquired DNA variation [8,9].

DNA accumulates variation as time progresses longitudinally over generations (germline variation) and within a single individual over many cell divisions (somatic variation). The vast majority of DNA variants cause no observable phenotype. However, a small fraction of variants are functional and can alter phenotypes.

Any difference in the DNA sequence as compared with a common reference sequence is considered a DNA variant (Figure 1). The simplest type of DNA variant is a change in a single-nucleotide base, known as a single-nucleotide variant (SNV). An SNV which is common in human populations (>1%) can also be known as a single-nucleotide polymorphism (SNP). Another type of DNA variation results from insertion or deletion (known as an indel) of a stretch of nucleotides. Structural variants (typically affecting >1000 bp) are DNA variants which include large indels as well as more complex DNA sequence rearrangements such as inversions (a block of DNA which has flipped “backwards”) and translocations (joining of distant genomic regions). Copy number variants (CNVs) are a type of structural variation resulting from gain or loss of a copy of an entire DNA region by deletion or duplication.

Types of DNA sequence variation. (A) SNVs result from the substitution of 1 base, while insertion or deletion (indel) affects a string of nucleotides. (B) Structural variants (typically affecting >1000 bp) include large indels, inversions, duplications, and CNVs.

All types of DNA variation hold the potential to alter the expression or function of genes. SNVs can work directly by misspelling a codon’s amino acid translation (missense), creating a STOP codon (nonsense), or altering splice sites. SNVs can affect gene function by varying the sequence of promoters, regulatory elements, or noncoding RNAs. Indels can also create frameshift variants which shift codon registers to create new amino acid sequences downstream. Large indels can similarly disrupt genes as well as impact entire genomic regions or alter chromatin structure. Inversions and translocations not only disrupt their genomic sites of origin, but can also bring together new combinations of genes and/or regulatory elements. Additionally, CNVs which result in gain or loss of whole copies of functional DNA can affect phenotype via a differential gene dose effect. Thus, any type of DNA variant can affect function, and all categories of DNA variation have been implicated in disease.
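The size-based categories above can be sketched in a few lines. This is a simplified illustration only: the function names are hypothetical, variants are given as plain reference/alternate allele strings, the 1000 bp cutoff is the "typically" threshold quoted in the text rather than a strict rule, and rearrangements such as inversions or translocations cannot be recognized from allele lengths alone.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant by comparing reference and alternate alleles."""
    if ref == alt:
        return "no variant"
    if max(len(ref), len(alt)) > 1000:
        return "structural variant"  # large indel; text's "typically >1000 bp"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "multi-nucleotide substitution"


def is_frameshift(ref: str, alt: str) -> bool:
    """An indel inside a coding region shifts the codon register
    when the net length change is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


print(classify_variant("T", "G"))        # single-base substitution
print(classify_variant("A", "ATTC"))     # 3 bases inserted
print(is_frameshift("A", "ATTC"))        # in-frame insertion
print(is_frameshift("ACG", "A"))         # 2 bases deleted
```

Whether an SNV is missense, nonsense, or silent depends on the codon it falls in, so that step would additionally need the reading frame and a codon table.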

eLife digest

Eukaryotes such as animals, plants and fungi store their DNA within the nucleus of each of their cells. Genes within this DNA contain the instructions needed to make molecules of RNA, some of which can leave the nucleus and be decoded to build proteins. However, not all of the DNA that is copied into RNA actually codes for proteins. Instead, some RNA molecules are important parts of the cell's protein-making machinery in their own right, and others help to regulate the expression of genes as RNAs or proteins.

Nevertheless, many non-coding RNAs don't have such clear roles. Often these RNAs—which are called ‘pervasive transcripts’—are quickly destroyed within the nucleus, but it is likely that some molecules will escape this quality-control mechanism. If the cell's protein-making machinery decodes these RNAs, it could lead to the production of faulty or harmful proteins. Recent research suggested that another quality-control mechanism, which typically eradicates incorrectly processed protein-coding RNAs, could also destroy unneeded or harmful pervasive transcripts. But it was not clear how common it was for this process—called ‘nonsense-mediated decay’—to be used for this purpose.

Now Malabat, Feuerbach et al. have engineered yeast cells that lacked either the genes required to carry out nonsense-mediated decay or the ability to destroy RNA molecules in the nucleus. Experiments with these yeast cells revealed that about half of all pervasive transcripts can be destroyed via nonsense-mediated decay; this suggests that this mechanism serves as a fail-safe to prevent the build-up of these potentially harmful molecules.

Malabat, Feuerbach et al. also revealed that the enzyme complex that copies gene sequences to make RNA molecules will often also copy some extra DNA sequence from before the start of the gene. On the other hand, it is also common for this enzyme complex to miss the start of the gene and produce an RNA molecule that lacks some of the instructions needed to build the correct protein. Further experiments showed that in yeast these two kinds of incorrectly made protein-coding RNAs could both be identified and destroyed by nonsense-mediated decay as well. The next challenge will be to see to what extent these phenomena are conserved in other eukaryotes.


Identification of EGFR exon 18 and exon 21 point mutations by SMAP. Before testing clinical samples with our mutant primer sets and SMAP assay, controlled tests on known genomic templates were done. The SMAP reaction was done in the presence of an intercalating dye (SYBR Green I) and was monitored with the Mx3000P System. Genomic DNA isolated from the NCI-H1975 cell line, known to carry the L858R substitution mutation (exon 21), was used as a template to test the reliability of the SMAP mutation detection assay. SMAP amplification primer sets with a FP specific for detecting the L858R point mutation (2573T > G) as described in Fig. 2 were shown to rapidly amplify (in 20 min) the NCI-H1975 cell line DNA, whereas the same primer set was incapable of amplifying wild-type genomic control DNA even after 60 min (Fig. 4A). A “no-template” control reaction was also negative after 60 min. The graphs in Fig. 4A are composite graphs, each line representing a different SMAP assay using a wild-type primer set (left) and a mutant primer set (right) and a different genomic template. Each assay was done in duplicate. Note that NCI-H1975 shows amplification with both primer sets, confirming that the cell line is heterozygous for the L858R mutation.

Analysis of allele-specific SMAP reaction for EGFR mutation. A, amplification curves of L858R mutation detection. Left, wild-type–specific primer amplification on •, 20 ng of NCI-H1975 cell line DNA as a template; ×, no template. Right, L858R mutant-specific primer amplification on •, 20 ng of NCI-H1975 cell line DNA; ▴, 20 ng of wild-type human genomic DNA as a control; and ×, no template. B, SMAP-amplified NCI-H1975 DNA cut by restriction endonuclease MscI. MscI is capable of digesting wild-type amplified DNA at a single site; amplified L858R mutant DNA will not be cut. SMAP reaction products were run on a 3% NuSieve GTG agarose gel and stained with ethidium bromide. Lane M, 20-bp ladder used as size marker (TAKARA). Lane 1, uncut DNA of NCI-H1975 amplified by SMAP wild-type–specific primers. Lane 2, MscI digestion of DNA used in lane 1. Lane 3, uncut DNA of NCI-H1975 amplified by SMAP L858R mutant-specific primers. Lane 4, MscI digestion of DNA used in lane 3. A single band in lane 2 is the only visible digestion product, consisting of unit lengths of the EGFR amplicon cut once by MscI. C, verification of deletion primer specificity on cloned deletion targets. Both graphs are composite amplification profiles when using all deletion primer sets to amplify 20 ng of wild-type human genomic DNA (left) and 3,000 copies of DE-A plasmid (right). Assays were done in duplicate with primer set for wild-type (•), primer set for DE-A (▪), primer set for DE-B (▴), primer set for DE-C (⧫), primer set for DE-D (*), primer set for DE-E (○), primer set for DE-F (□), primer set for DE-G (▾). Only the primer set matching the template DNA displayed amplification in <60 min. D, sensitivity of SMAP. Left, SMAP reaction using L858R point mutation-specific primer set with serial dilutions of H1975 cell line genomic DNA. Right, SMAP reaction using DE-A deletion mutation-specific primer set with serial dilutions of H1650 cell line genomic DNA. E, specificity of mutant DNA detection in the presence of wild-type DNA. Left, composite graph of SMAP reactions using L858R primer set with a mixture of wild-type genomic DNA and 10%, 5%, 1%, 0.5%, and 0% of target H1975 cell line DNA (in duplicate). Right, composite graph of SMAP reactions using DE-A primer set with a mixture of wild-type genomic DNA and 10%, 1%, 0.1%, and 0% of H1650 cell line DNA (in duplicate).

Electrophoretic analysis of the amplified DNA showed ladder patterns expected and typical of SMAP amplification (Fig. 4B). The DNA bands of ∼50 bp in size (indicated by asterisk) are likely to correspond to the monomeric self-primed amplification product bounded by the 5′ ends of the FP and TP primers. Larger sized bands are alternative forms (derived from intermediate forms, IM1 and IM2), self-hybridized species, and multiple-unit length amplicons of each species. To confirm the accuracy of genotype-specific products by SMAP, restriction analysis followed by electrophoresis was done. The MscI enzyme recognition sequence exists in the wild-type sequence of EGFR exon 21, but not the L858R mutant. As expected, the amplification products derived from the H1975 cell line were digested by MscI. In similar experiments on genomic DNA templates, we could also detect the G719S mutation by SMAP within 30 min (data not shown). The G719S mutation assay also showed similar high specificity, allowing reliable discrimination of mutants from wild-type specimens.

Identification of EGFR exon 19 deletions by SMAP. Again, before testing clinical samples with our mutant primer sets and SMAP assay, controlled tests on known templates were done. For deletion detection, engineered plasmid templates were used because no cell lines were available that have these specific EGFR mutations. The SMAP reaction was done in the presence of an intercalating dye (SYBR Green I) and was monitored with the Mx3000P real-time PCR instrument. Seven different allele-specific primer sets for deletion detection were developed and examined for specificity in the SMAP assay. The primer sets described previously and illustrated in Fig. 3 were each used in SMAP assays with exact-match plasmid templates and against all other deletion mutants. In all cases, only the template completely identical with each allele-specific primer set was amplified within 30 min. No amplification was observed from any mismatched combination of primers and templates even when monitored for 60 min (Fig. 4C, Supplementary Fig. S2). Each experiment was done in duplicate; hence, each graph shows two positive amplification profiles for each of the deletion primer sets.

Sensitivity of SMAP-based mutation detection in mixed-cell populations. Tumor samples frequently consist of numerous subpopulations of cancer cells. A useful diagnostic for mutation detection must be able to detect mutations in heterogeneous genomic DNA samples. To test SMAP for this capability, we conducted serial dilution experiments to examine detection sensitivity and genomic DNA mixing experiments to examine the sensitivity of detecting mutants as a subpopulation in a background of wild-type DNA. In the serial dilution experiments, using the NCI-H1975 and NCI-H1650 cell line genomic DNAs, the allele-specific primers for amplification of the exon 21 L858R point mutation and the exon 19 E746-A750 deletion could each detect 30 copies in SMAP amplification (25 μL reaction) in 30 min when using the respective full-match cell line DNA (Fig. 4D).

To determine the minimal detection limits for mutant sequences in a background of wild-type DNA, mutant cell line DNA was mixed with increasing amounts of wild-type DNA, and SMAP assays were done with full-match mutant primer sets. Using the NCI-H1975 cell line genomic DNA and the allele-specific primers for amplification of the exon 21 L858R point mutation, the mutant sequences could be detected within 60 min even when present at only 0.5%. Likewise, using the NCI-H1650 cell line genomic DNA and the allele-specific primers for amplification of the exon 19 E746-A750 deletion, the mutant sequences could be detected within 60 min even when present at only 0.1% (Fig. 4E).

Genotyping of clinical samples. We purposefully set out to design a mutation detection system that was both accurate and fast. We therefore minimized the genomic DNA sample preparation to a simple lysis in NaOH as described in Materials and Methods and usually within several minutes had material ready for the SMAP assays. With this crude extraction procedure and SMAP analysis, we were able to diagnose specific mutations from clinical samples in about 30 min.

The genomic status of the EGFR gene was evaluated in a series of 45 primary NSCLC specimens by both SMAP and conventional sequencing. Nine mutations were found by sequencing, of which four cases were substitutions in exon 21 and five cases were deletions in exon 19. All of these mutations were believed to be heterozygous, having also one wild-type allele. The nine cases that were proven to have mutations by sequencing were also verified independently by SMAP (Fig. 5, Table 1). The four clinical samples that had exon 21 substitutions all displayed amplification profiles similar to case 16 (Fig. 5A). Both wild-type and mutant amplification curves were evident in all four cases, indicating that the samples were most likely heterozygous for both wild-type and mutant alleles. Based on these data, we cannot dismiss the possibility that a homozygous mutant subpopulation existed in a nearly equivalent ratio to a normal (homozygous wild-type) subpopulation and accounted for the seemingly heterozygous results. However, this scenario is not likely because it would require an equivalent ratio of mutant and wild-type alleles in all four independent cases, which is a statistically improbable event. In one case, case 3, the mutation was not detected by direct sequencing but was identified as the L858R mutation by SMAP. This mutation was not detected by sequencing because the abundance of the mutant-containing cells within the tumor sample was very low. Although the amplification curve for detecting this mutant displays exponential kinetics and is easily detectable by SMAP, its long delay in appearance (relative to the wild-type kinetic profile) also suggests that the population of mutant-containing cells is very low in the sample. Visual inspection of the sequencing data shows hardly identifiable mutation peaks of CT for AG at nucleotide positions 2572 to 2573 (Fig. 5B). In the initial calls, these low sequencing peaks were dismissed as background and not considered indicative of a mutation. To prove the validity of the SMAP assay results that indicate a mutation in case 3, we did PCR-RFLP analysis as a confirmatory test. In PCR-RFLP analysis, homozygous wild-type samples would yield only two bands when examined by MscI digestion and agarose gel electrophoresis, whereas samples that contain mutant-type DNA will show amplicon products that are resistant to cleavage by MscI; hence, three bands would be evident on the gel under conditions favoring complete digestion. The results of case 3 show three bands, indicating the existence of a mutation and confirming the previous results of the SMAP assay. In total, the PCR-RFLP results found the L858R point mutation in five samples, corresponding completely with the findings of SMAP.
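The band-count logic of the PCR-RFLP check can be sketched as follows. This is a toy model, not the authors' analysis: the two amplicons are short hypothetical sequences built around a single MscI site (MscI recognizes TGGCCA and cuts in the middle, leaving blunt ends), the mutant allele differs by one T-to-G change that destroys the site, and the function names are invented for illustration.

```python
MSCI_SITE = "TGGCCA"   # MscI recognition sequence, cut as TGG^CCA
CUT_OFFSET = 3         # cut position within the site


def digest(amplicon: str) -> list[int]:
    """Return fragment lengths after complete MscI digestion."""
    lengths = []
    start = 0
    pos = amplicon.find(MSCI_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        lengths.append(cut - start)
        start = cut
        pos = amplicon.find(MSCI_SITE, pos + 1)
    lengths.append(len(amplicon) - start)
    return lengths


def expected_bands(alleles: list[str]) -> list[int]:
    """Distinct band sizes on a gel for a sample containing these alleles."""
    bands = set()
    for allele in alleles:
        bands.update(digest(allele))
    return sorted(bands, reverse=True)


# Hypothetical 65-bp amplicons; only the central region carries the site.
wild_type = "A" * 20 + "GGGCTGGCCAAACTG" + "C" * 30
mutant = "A" * 20 + "GGGCGGGCCAAACTG" + "C" * 30   # T>G removes the MscI site

print(expected_bands([wild_type]))            # wild-type only: two bands
print(expected_bands([wild_type, mutant]))    # mixed sample: three bands
```

A mixed sample shows the uncut mutant amplicon in addition to the two wild-type fragments, which is the three-band pattern described for case 3.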

Genotyping lung cancer tissue samples by allele-specific SMAP and fidelity comparison to alternative technologies. A, L858R mutation detection SMAP assays using wild-type and mutant primer sets. Case 1 is an example of homozygous wild-type. Cases 3 and 16 show detection of the L858R mutation in the tumor sample. Abbreviations: Wt, wild-type; Mt, mutant type. B, direct sequencing of exon 21: Case 1 shows homozygous wild-type. Case 3 shows a barely identifiable mutation peak representing CT for AG at nucleotides 2572 to 2573 (within the ellipse). Case 16 shows the presence of a mutation of 2573 T for G. C, PCR-RFLP: lane M, 100-bp ladder marker (TAKARA); lanes 1, 3, and 16 correspond to the case number. Left, PCR products. Right lane, PCR products digested by MscI restriction endonuclease. The arrow near case 3 indicates PCR product resistant to digestion by MscI, confirming the presence of the L858R mutation in a subpopulation of DNA from the tumor of case 3. D, detection of wild-type and deletions of the EGFR gene in genomic DNA extracted from lung tissue. Primer set for wild-type (•), primer set for DE-A (▪), primer set for DE-B (▴), primer set for DE-C (⧫), primer set for DE-D (*), primer set for DE-E (○), primer set for DE-F (□), primer set for DE-G (▾). Wild-type DNA was detected in each tumor in addition to different deletions as indicated. Only one deletion-specific primer amplifies and thereby identifies the mutation in the tumor sample.
