The mixdchlet program infers a maximum likelihood mixture Dirichlet
distribution to fit a dataset of many observed count vectors
[Sjolander et al., 1996]. Mixture Dirichlet priors are used in
parameter estimation for HMMs, SCFGs, and other discrete
probabilistic models [Durbin et al., 1998]. HMMER and Infernal use
mixture Dirichlet priors on residue emission probabilities and
transition probabilities.
The mixdchlet code uses our Easel code
library, so start by cloning both repos, with easel in the mixdchlet
directory:
% git clone https://github.com/cryptogenomicon/mixdchlet
% cd mixdchlet
% git clone https://github.com/EddyRivasLab/easel
Build Easel separately. It has its own configuration and tests.
% cd easel
% autoconf
% ./configure
% make
% make check
% cd ..
Then compile the mixdchlet program, which uses only a simple
Makefile. You can edit some things at the top of the Makefile if you
want (like CC for your compiler, or CFLAGS for your compiler flags),
but the defaults are almost certainly fine.
% make
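If you do want to tweak the build, the variables at the top of a
simple Makefile like this typically look something like the
following. These are illustrative values, not necessarily the exact
contents of the repo's Makefile:

CC      = gcc
CFLAGS  = -O3 -g

Change CC to your preferred compiler (clang, for example) and CFLAGS
to whatever optimization or debugging flags you like, then re-run
make.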
mixdchlet has four subcommands. mixdchlet -h or just mixdchlet
by itself with no options or arguments will print a short help page
that lists them:
mixdchlet fit:     estimate a new mixture Dirichlet prior
mixdchlet score:   score count data with a mixture Dirichlet
mixdchlet gen:     generate synthetic count data from a mixture Dirichlet
mixdchlet sample:  sample a random mixture Dirichlet
For any of these subcommands, you can get a brief help and usage
reminder with the -h option, e.g.:
% mixdchlet fit -h
The most important subcommand is mixdchlet fit. You can use
mixdchlet gen and mixdchlet sample to generate synthetic data and
distributions for testing, and mixdchlet score lets you evaluate the
log likelihood of a mixture Dirichlet on a count dataset other than
the training data (for example, a held-out test dataset).
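For example, once you have a fitted prior file, scoring a held-out
count file might look something like this. Both filenames here are
hypothetical, and I'm assuming the prior file and the count file are
the two arguments; run mixdchlet score -h to confirm the exact usage:
% mixdchlet score my.pri heldout.ct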
The file example.pri is an example of a mixture Dirichlet prior
file. (It's the 9-component prior for amino acid sequence data from
Sjolander et al., 1996, and the default amino acid residue prior for
HMMER3.) The first two fields in this file are <K> and <Q>: the
alphabet size (20, for amino acid data), and the number of mixture
components (9, here). Then there are the Q mixture components, each
specified by its mixture coefficient followed by its K Dirichlet
alpha parameters.
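As a schematic of the layout (with made-up numbers, and a small K=4
alphabet rather than the K=20 of example.pri), a prior file with Q=2
components looks roughly like this, assuming the usual
whitespace-delimited convention of one component per line with the
mixture coefficient first and its K alpha parameters after it; see
example.pri itself for the real thing:

4 2
0.60   1.0  2.0  2.0  1.0
0.40   0.5  0.5  0.5  0.5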
You can use this example prior to generate a synthetic count dataset:
% mixdchlet gen -M 100 -N 10000 example.pri > example-10k.ct
The -M 100 option says to sample 100 counts per count vector, and
the -N 10000 option says to sample 10K count vectors. The
example-10k.ct file contains 10K lines, one per vector; each line
has K whitespace-separated fields: the counts for each of the K
residues (so 20 fields per line, here).
Although these are counts, the numbers don't have to be
integers; you may have used real-valued sequence weights in collecting
estimated counts, for example. Anything after a # on a line is
ignored as a comment; blank lines are ignored too.
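For illustration, a few lines of a count file might look like this
(made-up numbers, again using a small K=4 alphabet to keep the lines
short):

# synthetic count vectors, one per line
12    3    0   85
 7   41    2   50
 0.5  9.2  3.1 87.2    # real-valued weighted counts are fine too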
To train a new mixture Dirichlet:
% mixdchlet fit 9 20 example-10k.ct test.pri
This will estimate a mixture Dirichlet with Q=9 components for an
alphabet size of K=20 on the count data in the example-10k.ct file,
and save the mixture Dirichlet to the file test.pri. (These suffixes
aren't required; you can name the files anything you want. .ct and
.pri are just my own personal conventions.)
It will also output (to stdout) an arcane data table containing details from the optimization; you can ignore this table. The one relevant number is on the last line, where it will say something like "nll = 367413". This is the negative log likelihood of the fitted mixture Dirichlet. A lower nll is better (i.e. a higher likelihood fit). You can use this objective function value to compare different models fit to the same dataset.
This inference is computationally intensive. Expect this particular
command to take about 90 sec. Compute time scales roughly linearly
with the number of count vectors (N), the number of mixture
components (Q), and the alphabet size (K).
The maximum likelihood fitting algorithm uses conjugate gradient descent. (The original Sjolander paper used expectation-maximization; the log likelihood is differentiable, so gradient descent is a tad more efficient.) The log likelihood is not convex and the optimization is local, not global, so different runs will give you different results. It's advisable to do multiple runs and take the one with the best (lowest) negative log likelihood, to avoid spurious local optima.
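For example, one simple way to do several independent fits and compare their final nll values is a shell loop; the nll line is the last line of the stdout table, so tail -1 pulls it out. (The test1.pri ... test5.pri output names are just my example names.)
% for i in 1 2 3 4 5; do mixdchlet fit 9 20 example-10k.ct test$i.pri | tail -1; done
Then keep whichever test$i.pri gave the lowest nll.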