| 1 | ++++ |
| 2 | +title = "Pairwise alignment" |
| 3 | +rss_descr = "Align a gene against a reference genome using BioAlignments.jl" |
| 4 | ++++ |
| 5 | + |
| 6 | +# Pairwise Alignment |
| 7 | + |
| 8 | +At the most basic level, aligners take two sequences and try to "line them up"
| 9 | +in order to find regions of similarity.
| 10 | +
| 11 | +Pairwise alignment differs from multiple sequence alignment (MSA) because
| 12 | +it aligns only two sequences, while an MSA aligns three or more.
| 13 | +In a pairwise alignment, there is one reference sequence and one query sequence,
| 14 | +though the user may not always specify which is which.
| 15 | + |
| 16 | + |
| 17 | +### Running the Alignment |
| 18 | +There are two main parameters for determining how we wish to perform our alignment: |
| 19 | +the alignment type and score/cost model. |
| 20 | + |
| 21 | +The alignment type specifies the alignment range (is the alignment local or global?),
| 22 | +and the score/cost model defines how matches, mismatches, insertions, and deletions are scored.
| 23 | + |
| 24 | +#### Alignment Types |
| 25 | +Currently, four types of alignments are supported: |
| 26 | +- GlobalAlignment: global-to-global alignment |
| 27 | + - Aligns sequences end-to-end |
| 28 | + - Best for sequences that are already very similar |
| 29 | +- SemiGlobalAlignment: local-to-global alignment |
| 30 | + - A modification of global alignment in which gaps at the beginning and/or the end of one of the sequences are penalty-free (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
| 31 | +- LocalAlignment: local-to-local alignment |
| 32 | + - Identifies high-similarity, conserved sub-regions within divergent sequences |
| 33 | + - Can occur anywhere in the alignment matrix |
| 34 | +- OverlapAlignment: end-free alignment |
| 35 | + - A modification of global alignment in which gaps at the beginning or end of the sequences are permitted
| 36 | + |
| 37 | +The alignment type can also be a distance between two sequences:
| 38 | +- EditDistance |
| 39 | +- LevenshteinDistance |
| 40 | +- HammingDistance |
| 41 | + |
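As a minimal sketch of the distance types listed above, the snippet below computes three distances between two short, made-up DNA sequences that differ by a single substitution. The sequences and the unit costs in `CostModel` are illustrative assumptions, not values taken from this tutorial.

```julia
using BioSequences, BioAlignments

# Two made-up sequences of equal length, one substitution apart (position 5)
a = dna"ACGTACGT"
b = dna"ACGTTCGT"

# EditDistance takes an explicit cost model; the unit costs here are an assumption
costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1)
edit = pairalign(EditDistance(), a, b, costmodel)

# LevenshteinDistance and HammingDistance use fixed unit costs, so no model is passed
lev = pairalign(LevenshteinDistance(), a, b)
ham = pairalign(HammingDistance(), a, b)

# distance() extracts the numeric distance from each result
println(distance(edit), " ", distance(lev), " ", distance(ham))
```

Hamming distance only counts substitutions, so it is meaningful only for sequences of equal length; the other two also allow insertions and deletions.
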
| 42 | +The alignment type should be selected based on what is already known about the sequences being compared
| 43 | +(Are they very similar, with only a couple of small differences expected?
| 44 | +Is the query expected to be a nearly exact match within the reference?)
| 45 | +and on what is being optimized for
| 46 | +(Speed for a quick and dirty analysis?
| 47 | +Or a fine-grained comparison that uses more resources?).
| 48 | + |
| 49 | +Now that we understand the alignment type and the score/cost model, we can pass both to `pairalign`, together with the two sequences `s1` and `s2`:
| 50 | +
| 51 | +```julia |
| 52 | +res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment |
| 53 | + |
| 54 | +``` |
| 55 | + |
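The call above leaves `s1`, `s2`, and `scoremodel` undefined. A self-contained sketch might look like the following; the two sequences and the gap penalties are arbitrary illustrative choices, not recommendations for real data.

```julia
using BioSequences, BioAlignments

# Short, made-up DNA sequences (illustrative only)
s1 = dna"CCTAGGAGGG"
s2 = dna"ACCTGGTATGATAGCG"

# Affine gap score model: EDNAFULL substitution matrix plus
# gap-open and gap-extend penalties (values chosen arbitrarily)
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1)

# Run the global pairwise alignment
res = pairalign(GlobalAlignment(), s1, s2, scoremodel)

println(score(res))          # total alignment score under the model
aln = alignment(res)         # the aligned query/reference pair
println(count_matches(aln))  # number of matching positions
println(aln)                 # human-readable view of the alignment
```

Swapping `GlobalAlignment()` for `LocalAlignment()`, `SemiGlobalAlignment()`, or `OverlapAlignment()` changes the alignment range while reusing the same score model.
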
| 56 | + |
| 57 | +### Understanding how alignments are represented |
| 58 | +An alignment is represented as a series of `AlignmentAnchor` objects.
| 59 | +Together, these anchors record where the alignment starts,
| 60 | +which sections are matched, and where there are deletions or insertions.
| 61 | + |
| 62 | +Below is an example Alignment: |
| 63 | +```julia |
| 64 | +julia> Alignment([
| 65 | +           AlignmentAnchor(0,  4, 0, OP_START),
| 66 | +           AlignmentAnchor(4,  8, 4, OP_MATCH),
| 67 | +           AlignmentAnchor(4, 12, 8, OP_DELETE)
| 68 | +       ])
| 69 | +``` |
| 70 | +In this example, the alignment starts at position 0 of the query sequence and position 4 of the reference sequence.
| 71 | +The next 4 nucleotides (query positions 1 to 4, reference positions 5 to 8) are aligned as matches.
| 72 | +The last 4 reference nucleotides (positions 9 to 12) are missing/deleted in the query sequence.
| 73 | + |
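To check a reading of the anchors, the example alignment can be built directly and queried. This is a sketch that reuses the anchors from the example above and assumes a BioAlignments.jl version with four-field anchors, as used there.

```julia
using BioAlignments

# Same anchors as in the example above:
# (query position, reference position, alignment position, operation)
aln = Alignment([
    AlignmentAnchor(0,  4, 0, OP_START),
    AlignmentAnchor(4,  8, 4, OP_MATCH),
    AlignmentAnchor(4, 12, 8, OP_DELETE),
])

# Summarize the operations as a CIGAR string
println(cigar(aln))

# Map the first query position onto the reference
println(seq2ref(aln, 1))
```

Each anchor stores the last position covered by its operation, so the length of an operation is the difference between consecutive anchors.
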
| 74 | +More information about how BioAlignments.jl represents alignments
| 75 | +can be found [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/).