
What determines the complexity of sequencing these? The number of pairs?


Today there exists a multitude of different genome sequencing techniques, each with its own complexities. However, the number of base pairs is today seldom the main source of complexity.

Sanger sequencing was one of the first methods of sequencing, and employs linear sequencing: the synthesis of strands of increasing length. With the advent of the Human Genome Project, Celera instead came up with the idea of fragmenting the genome, amplifying the fragments, sequencing the fragments, and matching them together using bioinformatics. The complexity here lies in the fact that much of the DNA is repeated (such as microsatellites), which makes it hard to 'phase' the genome. As such, a short 20-nucleotide sequence may be present in many parts of the genome, which makes it hard to generate a 100% complete, connected genome.
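The placement ambiguity described above can be sketched in a few lines of Python: a short read drawn from a repeated region matches more than one genome position, so there is no unique way to place it (the toy genome and read below are illustrative, not real sequence data).

```python
# Sketch: why short repeats make shotgun assembly ambiguous.
# A short read that matches several genome positions cannot be
# placed uniquely. (Toy genome and read, for illustration only.)

genome = ("ACGTACGTAC"              # unique flank
          + "GATTACAGATTACAGATTAC"  # repeated region, copy 1
          + "TTGCA"                 # unique spacer
          + "GATTACAGATTACAGATTAC"  # repeated region, copy 2
          + "CCGGA")                # unique flank
read = "GATTACAGATTACA"  # a short read drawn from the repeat

# Find every position where the read matches the genome exactly.
positions = []
start = genome.find(read)
while start != -1:
    positions.append(start)
    start = genome.find(read, start + 1)

print(positions)  # → [10, 35]: two hits, so the read's origin is ambiguous
```

With a unique sequence the list would have exactly one entry; every extra hit is a branch point the assembler cannot resolve from short reads alone.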

Today, Illumina sequencing is the major sequencing platform (~85% of market share). It relies on the fragmentation of DNA into ~300 bp fragments. By synthesising the complementary strand of each fragment with fluorescent nucleotides, we may employ lasers to detect (sequence) the nucleotides of the fragments. Here we have the same problem as with shotgun sequencing: many repeats in the DNA sequence.

To remedy this, error-prone sequencing methods with long read lengths, such as IonTorrent or PacBio, may be employed. These long reads may then act as a map for stitching together the more precise short reads. This is called 'hybrid' sequencing.

Other sequencing methods, such as pyrosequencing, have the inherent problem of not being able to discern long runs (roughly 5 or more) of the same nucleotide in a row. Other methods are primer-based (i.e. they need a short subsequence of the DNA to be known beforehand). This is problematic if we want to perform a de novo whole-genome sequence. Note: Illumina does not rely on known primers, and may be deployed directly on unknown sequences, unlike pyrosequencing/Sanger sequencing.
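The homopolymer limitation can be illustrated with a toy signal model: in pyrosequencing the light intensity scales roughly with run length, so the relative gap between runs of n and n+1 identical bases shrinks as 1/n and eventually falls below the measurement noise (the 10% noise figure below is an assumption for illustration, not a measured instrument spec).

```python
# Toy model: distinguishing a homopolymer run of n bases from n+1.
# The signal gap between n and n+1 shrinks relative to n, while the
# (assumed) relative noise stays constant, so long runs blur together.

noise_fraction = 0.10  # assumed relative measurement noise (illustrative)

for n in (1, 2, 5, 8):
    relative_gap = ((n + 1) - n) / n          # relative signal gap
    distinguishable = relative_gap > 2 * noise_fraction
    print(n, round(relative_gap, 3), distinguishable)
# n=1 and n=2 are comfortably distinguishable; by n=5 they are not,
# matching the ~5-base limit mentioned above.
```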


The other big complexity is the self-similarity of the genome. To sequence, the genome is duplicated, then physically sheared into many overlapping tiles around an average length, each tile starting at a different position. Each tile is sequenced in parallel, and then the tiles have to be reassembled computationally (strictly, this is a post-sequencing process, but it is critical to being able to call the result a "sequence" rather than a bag of reads).

If the genome in question contains a lot of regions that are similar to each other, the algorithms that do the assembly will get confused.
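One way to see this concretely is a tiny greedy overlap assembler (a toy sketch, not any real assembler): on a genome with unique sequence it reconstructs the original exactly, but on a self-similar genome the repeats collapse and the result comes out shorter than the truth.

```python
# Toy greedy overlap assembler, to show how self-similar regions
# confuse assembly. (Illustrative sketch only.)

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left to merge
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]

unique = "ACGTTGCAAGGT"
reads_u = [unique[i:i+6] for i in range(0, 9, 3)]  # overlapping tiles
print(greedy_assemble(reads_u))        # → ACGTTGCAAGGT (exact)

repeat = "ATATATATATAT"
reads_r = [repeat[i:i+6] for i in range(0, 9, 3)]
print(len(greedy_assemble(reads_r)))   # shorter than 12: repeat collapsed
```

Real assemblers are far more sophisticated, but they face the same fundamental ambiguity: identical-looking tiles admit more than one consistent merge order.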


There are regions with sparse coverage due to high GC content, which can make alignment results less reliable. Plasmids may interfere with alignment or need to be separated out before or during library prep. And you want to detect and discard any members of the population with any evidence of cancer, genetic disease, etc.

And you need good population coverage. What's a normal variant? Newer methods propose a graph alignment instead of just trying to build a single sequence reference genome.
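The graph-alignment idea can be sketched with a toy adjacency list: where a linear reference must pick one allele at a variant site, a graph reference holds both as branches (the encoding below is purely illustrative, not any real pangenome format).

```python
# Sketch: a graph reference holds both alleles of a variant as a branch,
# unlike a single linear reference. (Toy adjacency-list encoding.)

# "ACG", then either "T" (reference allele) or "C" (alternate allele), then "GA"
graph = {
    "ACG": ["T", "C"],   # the variant site: two outgoing branches
    "T":   ["GA"],
    "C":   ["GA"],
    "GA":  [],
}

def paths(node, prefix=""):
    """Enumerate all sequences spelled by paths from node to a sink."""
    prefix += node
    if not graph[node]:
        return [prefix]
    result = []
    for nxt in graph[node]:
        result.extend(paths(nxt, prefix))
    return result

print(paths("ACG"))  # → ['ACGTGA', 'ACGCGA']: both variants representable
```

Reads carrying either allele align cleanly to such a graph, whereas against a single linear reference one of the two populations always shows a mismatch.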


Genome size and the number of repeats in the DNA.

Basically, the entire genome is fragmented and then reassembled like a big puzzle. It works like this:

genome => |1234567890abcdefghijklmnopqrstuvwxyz|

set A => |123| |456| |ab| |90| |l| |m| |tuvw| ...

set B => |1| |45| |23| |0| |abcdefg| ....

....

Assembly sequence:

1x |1 23| |4 56||7| |8| |9 0|...

2x |1||23 4||56||7 8| |9| |0|...

where the fragments overlap enough that you can match them up against a different fragment set.

The goal is to get 8x coverage for each nucleotide. Once you achieve that, you have a finished sequence.
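Coverage depth is easy to compute once fragments are placed: count how many fragments span each base. A minimal sketch with made-up fragment positions (this toy example only reaches 2.4x mean depth, well short of the 8x goal mentioned above):

```python
# Sketch: per-base coverage depth from aligned fragment positions.
# "8x coverage" means each base is spanned by ~8 fragments.
# (Toy positions; real pipelines get these from alignment tools.)

genome_length = 20
# (start, end) of each aligned fragment, end-exclusive
fragments = [(0, 8), (2, 10), (5, 13), (7, 15), (10, 18), (12, 20)]

coverage = [0] * genome_length
for start, end in fragments:
    for pos in range(start, end):
        coverage[pos] += 1

mean_depth = sum(coverage) / genome_length
print(coverage)     # per-base depth
print(mean_depth)   # → 2.4, far below the 8x target
```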

You can see the problem here. The larger the genome, the longer it takes to find a match. The other problem is that highly repeated sequences are difficult to assemble. Imagine assembling this sequence from fragments:

atatatatatatatatatatatatatatatat

Did you get the length right? It's easy if everything is unique.
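This length ambiguity is easy to demonstrate: a 16 bp "at" repeat and a 32 bp one yield the exact same set of fragments, so the fragments alone cannot tell you the true length (toy tiling below, for illustration).

```python
# Sketch: fragments from a pure "at" repeat are consistent with many
# different genome lengths, so the true length cannot be recovered.

def tiles(genome, size, step):
    """The set of distinct fragments from a regular tiling of genome."""
    return {genome[i:i + size] for i in range(0, len(genome) - size + 1, step)}

short = "at" * 8    # 16 bp repeat
long_ = "at" * 16   # 32 bp repeat
print(tiles(short, 6, 2) == tiles(long_, 6, 2))  # → True: same fragment set
```

Both genomes shatter into the single distinct fragment "atatat", which is why the assembler cannot "get the length right."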

Most bacterial genomes are very small and easily sequenced. The same is true of plasmids and viruses. Plants and animals are a different ballgame. Organisms that you think are very simple are in fact MUCH more complex to sequence because of the amount of genetic material. After sequencing the human genome, we realized that there are about 30K genes and 3 billion base pairs (bacterial genomes are almost all in the ballpark of 5 million base pairs by comparison). The onion you had on your salad at lunch? 16 billion base pairs. The lowly pine tree? 60K genes and 22 billion base pairs. A salamander? 70 billion base pairs. The Paris japonica flower? 149 billion base pairs.

This flower: https://en.wikipedia.org/wiki/Paris_japonica is 50x more complex genetically than you are.


Yes, mostly the number of base pairs. Large repeats can also cause issues when finishing a genome.



