Commit d3892a3

rebase onto main and make edits according to kevin's review
1 parent 33dfbd4 commit d3892a3

2 files changed: +20 -5 lines changed

File renamed without changes.

cookbook/02-alignments.md

Lines changed: 20 additions & 5 deletions
@@ -10,9 +10,6 @@ and look for regions of similarity.
 
 Pairwise alignment differs from multiple sequence alignment (MSA) because
 it only aligns two sequences, while MSAs align three or more.
-In a pairwise alignment, there is one reference sequence and one query sequence,
-though this may not always be specified by the user.
-
 
 ### Running the Alignment
 There are two main parameters for determining how we want to perform our alignment:
@@ -27,14 +24,32 @@ Currently, four types of alignments are supported:
   - Aligns sequences end-to-end
   - Best for sequences that are already very similar
   - All of query is aligned to all of reference
+  - Example use case:
+    - Comparing a particular gene from two closely related bacteria
+    - Comparing alleles of a gene between two individuals
+  - Not ideal when only one conserved region is shared
+
 - `SemiGlobalAlignment`: local-to-global alignment
   - A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
+  - Example use case:
+    - Aligning a contig to a chromosome to see where that contig belongs
+    - Aligning a 150 bp Illumina read to a longer reference gene or chromosome segment
+  - A simple way to think about it: “the query should align completely, but the reference may have unaligned flanks.”
 - `LocalAlignment`: local-to-local alignment
   - Identifies high-similarity, conserved sub-regions within divergent sequences
   - Can occur anywhere in the alignment matrix
   - Maps the query sequence to the most similar region on the reference
+  - Example use case:
+    - Finding a conserved protein domain inside two otherwise divergent proteins
+    - Aligning a short resistance-gene fragment to a genome to see whether that region is present
+  - This is the right choice when you care about “where is the best shared region?” rather than “do these two full sequences match end-to-end?”
 - `OverlapAlignment`: end-free alignment
   - A modification of global alignment where gaps at the beginning or end of sequences are permitted
+  - Best when the biologically meaningful match is an end-to-end overlap between the two sequences, and terminal overhangs should not be penalized
+  - Example use case:
+    - Merging paired-end reads when the forward and reverse reads overlap
+    - Stitching amplicons or long reads that share an overlapping end region
+  - The key distinction from semi-global is that overlap alignment is especially for suffix/prefix-style overlaps between sequence ends, which is why it is so useful in assembly workflows.
 
 The alignment type should be selected based on what is already known about the sequences the user is comparing:
 - Are the two sequences very similar and we're looking for a couple of small differences?
@@ -49,8 +64,8 @@ and then finds the alignment that minimizes the total penalty.
 `AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`.
 It imposes an affine gap penalty for insertions and deletions,
 which means that it penalizes the opening of a gap more than a gap extending.
-This aligns (pun intended!!) with the biological principle that creating a gap is a rare event,
-while extending an already existing gap is less so.
+Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable.
+Longer deletions are less likely than short ones only because they change the structure of the encoded protein more.
 
 A user can also define their own `CostModel` instead of using `AffineGapScoreModel`.
 This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.
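
Reader's note on the affine gap penalty described in the last hunk: the idea that opening a gap costs more than extending one can be checked with a little arithmetic. The sketch below is plain Python, not `BioAlignments.jl` code. The function name `affine_gap_penalty`, the values 5 and 1, and the convention that a gap of length k costs `gap_open + k * gap_extend` are all assumptions chosen for illustration; other tools charge `gap_open + (k - 1) * gap_extend` instead.

```python
def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Penalty for one contiguous gap spanning `length` positions.

    Assumed convention: a gap of length k costs gap_open + k * gap_extend.
    The parameter values are illustrative, not library defaults.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


# One deletion event that removes three positions...
one_long_gap = affine_gap_penalty(3)

# ...versus three independent single-position deletion events.
three_short_gaps = 3 * affine_gap_penalty(1)

print(one_long_gap, three_short_gaps)
```

Under these assumed values the single three-position gap is penalized less than three separate one-position gaps, which matches the reasoning in the added lines: the rare event is the deletion itself, and its length is a secondary cost.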

0 commit comments