Commit 13a1e62

initial commit
1 parent 9ad11e7 commit 13a1e62

1 file changed

Lines changed: 115 additions & 0 deletions

docs/src/rosalind/10-cons.md

@@ -0,0 +1,115 @@
# Consensus and Profile

🤔 [Problem link](https://rosalind.info/problems/cons/)

!!! warning "The Problem"
A matrix is a rectangular table of values divided into rows and columns.
An m×n matrix has m rows and n columns.
Given a matrix A, we write $A_{i,j}$
to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings,
all having the same length n.
Their profile matrix is a 4×n matrix P in which $P_{1,j}$
represents the number of times that 'A' occurs in the jth position of one of the strings,
$P_{2,j}$ represents the number of times that 'C' occurs in the jth position,
and so on (see below).

A consensus string c is a string of length n
formed from our collection by taking the most common symbol at each position;
the jth symbol of c therefore corresponds to the symbol having the maximum value
in the jth column of the profile matrix.
Of course, there may be more than one most common symbol,
leading to multiple possible consensus strings.

### DNA Strings
A T C C A G C T
G G G C A A C T
A T G G A T C T
A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T

### Profile

A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6

Consensus A T G C A A C T

Given:
A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return:
A consensus string and profile matrix for the collection.
(If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

The first thing we will need to do is read in the input FASTA.
In this case, we will not be reading in a FASTA file,
but a set of strings in FASTA format.
Once it is read in, we can iterate over the strings and store them in a data matrix,
with one row per string and one column per position.
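
The reading step can be sketched as follows. This is a minimal sketch, not the tutorial's own implementation: the helper names `parse_fasta_strings` and `to_data_matrix` are hypothetical, and the parser assumes that every record starts with a `>` header line and that sequence lines may wrap.

```julia
# Hypothetical helper (not from the original): parse FASTA-formatted text
# into a vector of sequence strings, one per record.
function parse_fasta_strings(fasta::AbstractString)
    seqs = String[]
    for raw in split(fasta, '\n')
        line = strip(raw)
        isempty(line) && continue
        if startswith(line, ">")
            push!(seqs, "")          # a header line starts a new record
        elseif !isempty(seqs)
            seqs[end] *= line        # sequence lines may wrap; join them
        end
    end
    return seqs
end

# Hypothetical helper: stack equal-length strings into a Char matrix,
# one row per string and one column per position.
function to_data_matrix(seqs)
    n = length(first(seqs))
    all(s -> length(s) == n, seqs) || error("all strings must have the same length")
    return [s[j] for s in seqs, j in 1:n]
end

fasta = """
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
"""

data_matrix = to_data_matrix(parse_fasta_strings(fasta))
size(data_matrix)  # (2, 8)
```

Storing the strings as a `Matrix{Char}` makes each position a column, which is convenient for the per-column counting that follows.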

From there, we can generate the profile matrix.
We'll need to count the number of times each nucleotide appears in each column of the data matrix.
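
One way to do the counting is sketched below. The function name `profile_matrix` is hypothetical, and the example assumes the data matrix is a `Matrix{Char}` with one row per string. The small input is the first three positions of the first three sample strings.

```julia
# Hypothetical helper: count A, C, G, T in every column of a Char matrix.
# The result has one row per nucleotide (in A, C, G, T order) and one
# column per position.
function profile_matrix(data_matrix::AbstractMatrix{Char})
    nucleotides = ['A', 'C', 'G', 'T']
    return [count(==(nt), data_matrix[:, j]) for nt in nucleotides, j in axes(data_matrix, 2)]
end

data_matrix = ['A' 'T' 'C';
               'G' 'G' 'G';
               'A' 'T' 'G']

profile_matrix(data_matrix)
# rows A, C, G, T:
#  2  0  0
#  0  0  1
#  1  1  2
#  0  2  0
```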

Then, we can identify the most common nucleotide in each column of the data matrix.
After we have done this for all columns of the data matrix,
we can join the results to generate the consensus string.
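
Picking the most common nucleotide per column can be sketched with `argmax` over each profile column. The function name `consensus_string` is hypothetical, and the sketch assumes the profile rows are ordered A, C, G, T as in the problem statement.

```julia
# Hypothetical helper: pick the most common nucleotide in each profile column.
function consensus_string(profile::AbstractMatrix{<:Integer})
    nucleotides = ['A', 'C', 'G', 'T']
    # argmax returns the first maximal index, so ties resolve in A, C, G, T order
    return join(nucleotides[argmax(profile[:, j])] for j in axes(profile, 2))
end

# Profile from the sample dataset (rows A, C, G, T)
profile = [5 1 0 0 5 5 0 0;
           0 0 1 4 2 0 6 1;
           1 1 6 3 0 1 0 0;
           1 5 0 0 0 1 1 6]

consensus_string(profile)  # "ATGCAACT"
```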

It is possible that there can be multiple consensus strings,
as two or more nucleotides may tie for the highest count
in a given column of the data matrix.
If this is the case, the problem accepts any one of them, though we could also return all of them.
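
If every tied consensus string is wanted, one option is to collect the tied nucleotides in each column and take their Cartesian product. This is a sketch with a hypothetical function name, `all_consensus_strings`. The number of results grows multiplicatively with each tied column, so it is only practical when ties are few.

```julia
# Hypothetical helper: list every consensus string when columns have ties.
function all_consensus_strings(profile::AbstractMatrix{<:Integer})
    nucleotides = ['A', 'C', 'G', 'T']
    # for each column, keep every nucleotide whose count equals the column maximum
    choices = [nucleotides[findall(==(maximum(col)), col)] for col in eachcol(profile)]
    # the Cartesian product of the per-column choices yields all consensus strings
    return vec([join(combo) for combo in Iterators.product(choices...)])
end

# Profile of the two strings "AC" and "AG":
# column 1 is unanimous, column 2 is a C/G tie
profile = [2 0;
           0 1;
           0 1;
           0 0]

all_consensus_strings(profile)  # ["AC", "AG"]
```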
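
```julia

function consensus(fasta)
    # read in strings in fasta format
    # (sketch: a '>' header starts a record; wrapped sequence lines are joined)
    seqs = String[]
    for line in eachline(IOBuffer(fasta))
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            push!(seqs, "")
        elseif !isempty(seqs)
            seqs[end] *= line
        end
    end

    # iterate over strings and store in matrix (rows = strings, columns = positions)
    data_matrix = [s[j] for s in seqs, j in 1:length(first(seqs))]

    # make profile matrix (rows = A, C, G, T; columns = positions)
    nucleotides = ['A', 'C', 'G', 'T']
    profile = [count(==(nt), data_matrix[:, j]) for nt in nucleotides, j in axes(data_matrix, 2)]

    # make consensus string
    # (argmax returns the first maximum, so ties resolve in A, C, G, T order)
    consensus_string = join(nucleotides[argmax(profile[:, j])] for j in axes(profile, 2))

    return consensus_string, profile
end
```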
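
The sample output prints the consensus string followed by one labelled row per nucleotide. A minimal sketch of that formatting step is shown here. The helper name `format_answer` is hypothetical, and it assumes a profile matrix whose rows are ordered A, C, G, T.

```julia
# Hypothetical helper: render the answer in the layout of the sample output.
function format_answer(consensus::AbstractString, profile::AbstractMatrix{<:Integer})
    nucleotides = ['A', 'C', 'G', 'T']
    lines = String[consensus]
    for (i, nt) in enumerate(nucleotides)
        # one row per nucleotide: label, colon, then space-separated counts
        push!(lines, string(nt, ": ", join(profile[i, :], " ")))
    end
    return join(lines, "\n")
end

# Profile from the sample dataset (rows A, C, G, T)
profile = [5 1 0 0 5 5 0 0;
           0 0 1 4 2 0 6 1;
           1 1 6 3 0 1 0 0;
           1 5 0 0 0 1 1 6]

println(format_answer("ATGCAACT", profile))
```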
