All neighbour-joining algorithms rely on matrices of distances between clusters (of one taxon or multiple taxa). Most calculate adjusted distances, taking into account how far individual clusters are, on average, from other clusters, and some use auxiliary matrices (such as the estimated variance matrix of BIONJ), but all use matrices of distances.<p>
2
+
All neighbour-joining algorithms rely on matrices of distances between clusters (of one taxon or of multiple taxa). Most calculate adjusted distances, taking into account how far individual clusters are, on average, from other clusters, and some use auxiliary matrices (such as the estimated variance matrix of BIONJ), but all use matrices of distances.<p>
3
3
4
-
The decenttree algorithms (except for the ONJ algorithms, which use triangular matrices), make use of square matrices (using triangular matrices would reduce memory consumption by a factor of 2, but would considerably increase the cost of accessing entries in the matrix, and matrix access would also have been much more difficult to vectorize efficiently).
4
+
The algorithms that decenttree makes available (except for the ONJ algorithms, which use triangular matrices) make use of square matrices. Using triangular matrices would reduce memory consumption by a factor of 2, but would considerably increase the cost of accessing entries in the matrix, and matrix access would also have been much more difficult to vectorize efficiently.
5
5
6
-
The bulk of the memory required by the decenttree distance matrix tree inference algorithms, is that used to track the ( n * n entry) distance matrices themselves. There are also some vectors, the track clusters to be considered, or the structure of the subtrees for the clsuters yet to be joined, but these are much smaller (of size proportional to n, rather than n * n).
6
+
(in the following, <i>n</i> is shorthand for the number of taxa)
7
+
8
+
The bulk of the memory required by the simpler decenttree distance matrix tree inference algorithms (UPGMA and NJ in particular) is that used to track the (n * n entry) distance matrices themselves. There are also some vectors, which track the clusters to be considered, or the structure of the subtrees for the clusters yet to be joined, but these are much smaller (all are of size proportional to n, rather than n * n).
7
9
8
10
<h2>Memory Requirements</h2>
9
11
@@ -38,13 +40,16 @@ appear to be slightly slower) than the NJ-R and NJ-R-V implementations.
38
40
39
41
<h2>Other common features</h2>
40
42
All of the distance-based algorithms implemented in decentTree make use of distance matrices.
41
-
Distance-*matrix* algorithms take, as their principle input (apart from a list of names of the N Taxa),
42
-
an N row, N column matrix of distances; the distance between taxa *a* and *b* can be read by
43
-
looking at the *b*th entry in the *a*th row (or the *a*th entry in the *b*th row,
43
+
Distance-matrix phylogenetic tree inference algorithms take, as their principal input (apart from a list of names of the N taxa),
44
+
an N row, N column matrix of distances; the distance between taxa <i>a</i> and <i>b</i> can be read by
45
+
looking at the <i>b</i>th entry in the <i>a</i>th row (or the <i>a</i>th entry in the <i>b</i>th row,
44
46
if distances are symmetric). In practice, distances *are* symmetric and
45
-
the distance measured from a to b is the same as the distance measured from b to a.
47
+
the distance measured from <i>a</i> to <i>b</i> is the same as the distance measured from <i>b</i> to <i>a</i>.
46
48
The distance between any sequence and itself is assumed to be zero.
47
49
<br><br>
50
+
(The consequences of violating the "every sequence or cluster is distance zero from itself" assumption have not
51
+
been tested but are probably dire!)
52
+
<br><br>
48
53
Uncorrected distances are typically calculated by counting the number of characters that must differ,
49
54
between two sequences in a sequence alignment, and dividing the count by the total number
50
55
of sites that are informative in both sequences (sites that are entirely unknown in either sequence are not counted).
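
The uncorrected distance calculation described above can be sketched as follows. This is only an illustration, not decenttree's own code: the function name `uncorrected_distance` and the set of characters treated as "entirely unknown" (`?`, `-` and `N`, which suits nucleotide data only) are assumptions, and decenttree's treatment of ambiguity codes may differ.

```python
def uncorrected_distance(seq_a, seq_b, unknown="?-N"):
    """Fraction of differing sites among sites that are known in both sequences.

    `unknown` lists the characters treated as entirely unknown; this choice
    is an assumption made for the example (nucleotide data).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0   # sites that are informative in both sequences
    differing = 0  # informative sites where the characters differ
    for a, b in zip(seq_a, seq_b):
        if a in unknown or b in unknown:
            continue  # site is unknown in at least one sequence: not counted
        compared += 1
        if a != b:
            differing += 1
    # With no comparable sites the distance is undefined; 0.0 is an arbitrary choice.
    return differing / compared if compared else 0.0
```

For example, `uncorrected_distance("AC-T", "ACGA")` compares three sites (the gap is skipped), finds one difference, and returns one third.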
@@ -57,9 +62,9 @@ decentTree can be supplied a sequence alignment (rather than a distance matrix).
57
62
distances; if uncorrected distances are desired, they can be requested with the <i>-uncorrected</i> command line parameter)
58
63
<br><br>
59
64
Neighbour-joining algorithms (except for AUCTION and STICHUP algorithms, which use raw distances) tend to look for neighbours by searching for pairs of clusters (or individual taxa) with a minimal adjusted
60
-
difference. (the literature tends to talk about a Q matrix, where Qij is the adjusted difference between
61
-
the *i*th and *j*th cluster). The details vary from algorithm to algorithm, but typically the adjusted
62
-
distances are calculated by subtracting "compensatory" terms to adjust for how distant each of the two clusters is, on average, from all other clusters. In the literature entries in the Q matrix are calculated as
65
+
difference. (the literature tends to talk about a <b>Q</b> matrix, where <b>Q</b><i>i</i><i>j</i> is the adjusted difference between
66
+
the <i>i</i>th and <i>j</i>th cluster). The details vary from algorithm to algorithm, but typically the adjusted
67
+
distances are calculated by subtracting "compensatory" terms to adjust for how distant each of the two clusters is, on average, from all other clusters. In the literature entries in the <b>Q</b> matrix are calculated as
63
68
<br><br>
64
69
(N-2)*D(x,y) - sum(D row for x) - sum(D row for y)
65
70
<br><br>
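
As a minimal sketch of the textbook formula above (not of how any particular decenttree algorithm stores or searches its matrices), the adjusted distances can be computed from a square distance matrix held as a list of lists. The names `q_matrix` and `best_pair` are hypothetical and exist only for this example.

```python
def q_matrix(d):
    """Return Q, where Q[x][y] = (N-2)*D[x][y] - rowsum(x) - rowsum(y).

    Diagonal entries are set to 0.0 and are never used to pick a pair.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    return [
        [(n - 2) * d[x][y] - row_sums[x] - row_sums[y] if x != y else 0.0
         for y in range(n)]
        for x in range(n)
    ]


def best_pair(d):
    """Return the pair (x, y), x < y, with the smallest adjusted distance."""
    q = q_matrix(d)
    n = len(d)
    pairs = ((x, y) for x in range(n) for y in range(x + 1, n))
    return min(pairs, key=lambda p: q[p[0]][p[1]])


# A small symmetric example with five taxa.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
```

Here the smallest raw distance is between taxa 3 and 4 (distance 3), but the smallest adjusted distance is between taxa 0 and 1, so those are the two that would be joined first.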
@@ -101,24 +106,32 @@ column for cluster a, reduces the amount of memory in use (though, not the amoun
101
106
of memory allocated!).
102
107
103
108
Since the sum of the first N squares is N(N+1)(2N+1)/6 (approximately one third of the cube of N), the effect of physically (rather than virtually) deleting rows and columns,
104
-
over the course of the inference of a phylogenetic tree, is, for large enough N, to reduce the expected number of cache fetches (or cache misses) resulting from access to the distance matrix, by a factor of three. <i>If all of the distances in the distance matrix are actually examined, one every iteration, as they are in the NJ and BIONJ algorithms, but <b>not</b> in the NJ-R, BIONJ-R, and the other "-R" algorithms.</i>
109
+
over the course of the inference of a phylogenetic tree, is, for large enough N, to reduce the expected number of cache fetches (or cache misses) resulting from access to the distance matrix, by a factor of three. <i>This holds only if all of the distances in the distance matrix are actually examined, once every iteration, as they are in the NJ and BIONJ algorithms, but as they are <b>not</b> in the NJ-R, BIONJ-R, and the other "-R" algorithms.</i>
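
The factor of three can be checked with a rough count. The sketch below uses a deliberately simplified model, assumed for this example only: it counts matrix entries swept per iteration rather than real cache misses, it assumes that a virtually deleted matrix is still swept at its full N by N extent, and it assumes joining stops when three clusters remain. The function name `entries_swept` is hypothetical.

```python
def entries_swept(n, physically_delete):
    """Count matrix entries swept over a whole run, under a simplified model.

    One sweep per iteration, with k clusters remaining (k = n down to 3).
    With physical deletion a sweep covers k*k entries; with virtual deletion
    it is assumed to still cover the full n*n allocated extent.
    """
    total = 0
    for k in range(n, 2, -1):
        total += k * k if physically_delete else n * n
    return total


n = 1000
ratio = entries_swept(n, False) / entries_swept(n, True)
```

For `n = 1000` the ratio is a little under 3, and it approaches 3 as `n` grows, in line with the sum-of-squares argument.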
105
110
106
111
Maintaining the entire matrix (and not just the upper or lower triangle) makes it
107
112
possible to do the memory accesses almost entirely sequentially (except for the
108
113
column rewriting and moving when clusters are moved).
109
114
110
-
In algorithms that have Variance Estimate matrices (BIONJ and its variants), operations (row and column overwrites, row and column deletes) are "mirrored" on the Variance Estimate matrices.
115
+
In algorithms (BIONJ and its variants) that have <b>V</b> matrices (Variance Estimate matrices), operations (row and column overwrites, row and column deletes) are "mirrored" on the Variance Estimate matrices.
111
116
112
-
In algorithms that maintain them (NJ-R, and BIONJ-R, and their variants), row (but not column!) operations are mirrored on the "sorted distance" (S) and "cluster index" I matrices (references to clusters that have already been joined into larger clusters, in I, are treated as having been "virtually deleted"; when a lower adjusted difference to a cluster is found, it is necessary to check if that cluster is "still in play", or if it has already been joined).
117
+
In algorithms that maintain them (NJ-R, and BIONJ-R, and their variants), row (but not column!) operations are mirrored on the "sorted distance" (<b>S</b>) and "cluster index" <b>I</b> matrices (references to clusters that have already been joined into larger clusters, in <b>I</b>, are treated as having been "virtually deleted"; when a lower adjusted difference to a cluster is found, it is necessary to check if that cluster is "still in play", or if it has already been joined).
113
118
114
-
Columns cannot easily be deleted out of existing rows of the S and I matrices (if the algorithm has them), because in each row of those arrays, then entries are sorted by ascending distance (so to find out which entry
119
+
Columns cannot easily be deleted out of existing rows of the <b>S</b> and <b>I</b> matrices (if the algorithm has them), because in each row of those arrays, the entries are sorted by ascending distance (so to find out which entry
115
120
is for a column that is to be removed, a search would be necessary, and to
116
121
write an entry for the column for a newly joined cluster, an insert into
117
-
a sorted array would be necessary). The I and S matrices contain entries
118
-
for clusters which have a cluster number less than that of the cluster
119
-
mapped to the row they are in. As neighbour joining continues, some of these will be for clusters that are no longer under consideration, because
120
-
they have already been joined into another, newer, cluster. Distances to
121
-
these clusters are "skipped" over.
122
+
a sorted array would be necessary).
123
+
124
+
The <b>S</b> and <b>I</b> matrices contain entries for clusters which have a cluster number less than that of the cluster mapped to the row they are in.
125
+
As neighbour joining continues, some of these will be for clusters that
126
+
are no longer under consideration, because they have already been joined
127
+
into another, newer, cluster. Distances to these clusters are "skipped" over.
128
+
129
+
(in practice they are usually skipped without the need for a check that
130
+
the cluster is still "in play", because such clusters are treated as having
131
+
an average distance, to other clusters, that is negative and very large, which
132
+
is enough to rule them out of consideration without the need for checks to see
133
+
whether they are actually "in play", but there are "last resort" checks
134
+
that possible cluster joins reference only clusters that are still "in play")
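
The idea of skipping "virtually deleted" entries can be illustrated with a short sketch. It is not the decenttree implementation: it shows only the explicit "last resort" check, on one row of sorted distances with its parallel row of cluster indices, and the names `first_live_entry` and `in_play` are hypothetical.

```python
def first_live_entry(sorted_distances, cluster_indices, in_play):
    """Return (cluster, distance) for the nearest cluster still in play.

    `sorted_distances` is one row of S (ascending), `cluster_indices` is the
    matching row of I, and `in_play` is the set of clusters not yet joined.
    Entries for clusters that have already been joined are skipped.
    """
    for distance, cluster in zip(sorted_distances, cluster_indices):
        if cluster in in_play:
            return cluster, distance
    return None  # every cluster referenced by this row has been joined


# Row for some cluster: clusters 2 and 0 have already been joined elsewhere.
row_s = [0.10, 0.25, 0.40, 0.70]
row_i = [2, 0, 3, 1]
still_in_play = {1, 3}
```

With these values, `first_live_entry(row_s, row_i, still_in_play)` skips the entries for clusters 2 and 0 and returns cluster 3 at distance 0.40.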
122
135
123
136
<h2>Row search order</h2>
124
137
(this is one of the few areas of meaningful difference between the NJ-R implementation in decenttree and the reference RapidNJ implementation). In NJ-R and BIONJ-R the order in
@@ -128,6 +141,7 @@ for the rows (the idea is to search rows that are "likelier", when they are sear
128
141
to yield better lower bounds, first, since: the better the lower bound already found early, the fewer cluster pairs will have to be considered later).
129
142
<br><br>
130
143
(RapidNJ doesn't do this)
144
+
<br><br>
131
145
<h2>Working matrix reallocation</h2>
132
146
During the course of the execution of a distance matrix algorithm, as the number of rows and columns in use falls, less and less of the memory allocated to the matrix remains in use. Periodically, the items still in use in the matrix are moved, so that a smaller block of sequential memory contains all the distances in the matrix (in row major order), with no "unused" memory between rows.
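
A minimal sketch of this kind of compaction follows, under stated assumptions: the matrix lives in one flat, row-major buffer, and the entries still in use occupy the first `n_live` columns of the first `n_live` rows. The function name `compact_matrix` and its parameters are hypothetical, not decenttree's.

```python
def compact_matrix(buffer, stride, n_live):
    """Repack the live n_live x n_live block so rows are contiguous.

    Before: row r starts at offset r * stride (stride >= n_live).
    After:  row r starts at offset r * n_live, with no unused gap between rows.
    Copying in ascending order is safe in place, because each destination
    offset is never greater than its source offset.
    """
    if n_live > stride:
        raise ValueError("n_live cannot exceed the current row stride")
    for r in range(n_live):
        for c in range(n_live):
            buffer[r * n_live + c] = buffer[r * stride + c]
    del buffer[n_live * n_live:]  # drop the now-unused tail of the buffer
    return n_live  # the new row stride


# A 4 x 4 allocation in which only the top-left 2 x 2 block is still in use.
matrix = [float(i) for i in range(16)]
new_stride = compact_matrix(matrix, stride=4, n_live=2)
```

After the call, `matrix` holds just the four live distances, in row-major order, and later row sweeps touch a smaller, contiguous block of memory.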