cookbook/02-alignments.md
Lines changed: 20 additions & 5 deletions
@@ -10,9 +10,6 @@ and look for regions of similarity.
 
 Pairwise alignment differs from multiple sequence alignment (MSA) because
 it only aligns two sequences, while MSAs align three or more.
-In a pairwise alignment, there is one reference sequence and one query sequence,
-though this may not always be specified by the user.
-
 
 ### Running the Alignment
 There are two main parameters for determining how we want to perform our alignment:
@@ -27,14 +24,32 @@ Currently, four types of alignments are supported:
   - Aligns sequences end-to-end
   - Best for sequences that are already very similar
   - All of query is aligned to all of reference
+  - Example use case:
+    - Comparing a particular gene from two closely related bacteria
+    - Comparing alleles of a gene between two individuals
+  - Not ideal when only one conserved region is shared
+
 - `SemiGlobalAlignment`: local-to-global alignment
   - A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
+  - Example use case:
+    - Aligning a contig to a chromosome to see where that contig belongs
+    - Aligning a 150 bp Illumina read to a longer reference gene or chromosome segment
+  - A simple way to think about it: “the query should align completely, but the reference may have unaligned flanks.”
 - `LocalAlignment`: local-to-local alignment
   - Identifies high-similarity, conserved sub-regions within divergent sequences
   - Can occur anywhere in the alignment matrix
   - Maps the query sequence to the most similar region on the reference
+  - Example use case:
+    - Finding a conserved protein domain inside two otherwise divergent proteins
+    - Aligning a short resistance-gene fragment to a genome to see whether that region is present
+  - This is the right choice when you care about “where is the best shared region?” rather than “do these two full sequences match end-to-end?”
 - `OverlapAlignment`: end-free alignment
   - A modification of global alignment where gaps at the beginning or end of sequences are permitted
+  - Best when the biologically meaningful match is an end-to-end overlap between the two sequences, and terminal overhangs should not be penalized
+  - Example use case:
+    - Merging paired-end reads when the forward and reverse reads overlap
+    - Stitching amplicons or long reads that share an overlapping end region
+  - The key distinction from semi-global is that overlap alignment is especially for suffix/prefix-style overlaps between sequence ends, which is why it is so useful in assembly workflows.
 
 The alignment type should be selected based on what is already known about the sequences the user is comparing:
 - Are the two sequences very similar and we're looking for a couple of small differences?
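To see how the choice of alignment type changes the result, the same pair of sequences can be run through all four types. The following is a minimal sketch, assuming `BioSequences.jl` and `BioAlignments.jl` are installed; the toy sequences and the gap penalties (`gap_open=-5`, `gap_extend=-1`) are illustrative assumptions, not recommended defaults.

```julia
using BioSequences, BioAlignments

# Hypothetical toy data: the query occurs exactly once inside a longer reference.
query     = dna"ACGTTGCA"
reference = dna"GGGGACGTTGCATTTT"

# Illustrative scoring scheme (assumed values, chosen only for this sketch).
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1)

# Run each supported alignment type and record its score.
scores = Dict{Symbol,Int}()
for aligntype in (GlobalAlignment(), SemiGlobalAlignment(),
                  LocalAlignment(), OverlapAlignment())
    result = pairalign(aligntype, query, reference, scoremodel)
    scores[nameof(typeof(aligntype))] = score(result)
end

for (name, s) in scores
    println(name, ": score = ", s)
end
```

Because the query is much shorter than the reference, the global alignment has to pay for the unaligned flanks, while the semi-global and local alignments do not. Calling `alignment(result)` on any of the results shows the aligned positions.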
@@ -49,8 +64,8 @@ and then finds the alignment that minimizes the total penalty.
 `AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`.
 It imposes an affine gap penalty for insertions and deletions,
 which means that it penalizes opening a gap more heavily than extending one.
-This aligns (pun intended!!) with the biological principle that creating a gap is a rare event,
-while extending an already existing gap is less so.
+Deletions are rare mutations, but when a deletion does occur, its length is variable.
+Longer deletions are less likely than short ones only because they change the structure of the encoded protein more.
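The effect of an affine gap penalty can be checked with a little arithmetic. This is a minimal sketch in plain Julia, assuming the convention that a gap of length `k` costs `gap_open + k * gap_extend`; the function name `affine_gap_penalty` and the values 5 and 1 are hypothetical, and some tools fold the first extension into the opening cost instead.

```julia
# Affine gap penalty under the assumed convention:
# one opening charge plus one extension charge per gap position.
function affine_gap_penalty(k::Integer; gap_open::Integer=5, gap_extend::Integer=1)
    k == 0 && return 0            # no gap, no penalty
    return gap_open + k * gap_extend
end

# One contiguous gap of length 3: the opening cost is paid once.
one_long_gap = affine_gap_penalty(3)

# Three separate gaps of length 1: the opening cost is paid three times.
three_short_gaps = 3 * affine_gap_penalty(1)

println("one gap of length 3:    ", one_long_gap)
println("three gaps of length 1: ", three_short_gaps)
```

With these values, the single long gap is penalized less than the three scattered gaps, so the aligner prefers to explain missing bases with one deletion event rather than several.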
 
 A user can also define their own `CostModel` instead of using `AffineGapScoreModel`.
 This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.
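A custom `CostModel` might look like the following. This is a minimal sketch, assuming `BioAlignments.jl` is installed; the unit costs are illustrative, and plain strings stand in for real sequences to keep the example small.

```julia
using BioAlignments

# Illustrative unit costs (assumed values): every mismatch, insertion,
# and deletion costs 1, and matches are free.
costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1)

# With a cost model, the aligner minimizes total cost rather than maximizing a score.
result = pairalign(EditDistance(), "ACGT", "AGGT", costmodel)

println("distance = ", distance(result))
println(alignment(result))
```

Raising `insertion` and `deletion` relative to `mismatch` would push the aligner toward substitutions over gaps, which is one way to encode prior knowledge about the sequences being compared.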