START

Type START at the MATLAB prompt.
  1. First, you will be asked for a data set. You have the choice between the Neisseria data (type -1), and a synthetic data set (type an integer >=0). The latter was created with the program make_data.m and can be found in Data.dat. If you don't modify this file, you have the choice between 16 synthetic DNA sequence alignments, so type in a number between 1 and 16 to pick out the sequence you want. If you want to read in your own data, choose 0. Your file needs to be called mydata.dat and must adhere strictly to the given format specifications.

  2. The program will then ask you if you want to change the initialization. If you don't, the following initialization will be chosen:

    Data Synthetic Neisseria
    Transition-transversion ratio 2.0 2.3
    Vector of initial state probabilities [1/3 1/3 1/3] [1/3 1/3 1/3]
    Initial branch lengths [0.25 0.25 0.25 0.25 0.25] [0.1 0.1 0.1 0.1 0.1]
    Initial recombination probability 0.8 0.9

  3. Your next choice is the training scheme. You can select one of the following three alternatives:

    0 Constrained maximum likelihood (CML), fixed recombination parameter.
    1 Joint maximum likelihood with the EM algorithm.
    2 Bayesian variational free energy minimization.

  4. It is now time to choose the training parameters. First, you have to set the number of EM-cylces. In each EM-step, the site-dependent weights Q(s_t) are updated in the E-step. Then, the parameters of the model are updated in the consecutive M-step. These are the recombination probability NU and the vector of branch lengths of all possible trees, w. Ideally, you would like to take the M-step for granted without having to bother about any further parameter settings. Unfortunately, this is not the case since the branch-length optimization is a non-linear iterative process. In the current implementation this is accomplished with gradient ascent, which requires you to set a learning rate and a number of gradient ascent steps. Do not be overzealous in trying to ensure convergence here. For a given, limited amount of computer time, a partial convergence in this step will allow you to increase the number of EM cycles, which usually results in a faster convergence of the overall training process. The recombination parameter is adapted with the forward-backward algorithm. The program allows you to repeat this a specified number of times (rather than just once). This is equivalent to a constrained M-step, where only NU is optimized while the branch-lengths are temporarily kept fixed. Play around with this option to test if this increases the convergence speed. The default values gave reasonable results in my simulation experiments, so I suggest you start from the as a basis for your own explorations. Note that if you have chosen constrained maximum likelihood (CML) as your optimization scheme, then your site-dependent weights are set to the constant value of 1, and the recombination parameter is fixed. This is equivalent to carrying out only a sinlge EM-step with a fixed value for NU, so you only need to specify the number of gradient ascent steps and the learning rate . Obviously, this increases the training speed of the algorithm, but usually at the price of sub-optimal results.

  5. Have you chosen the Bayesian approach? If yes, the program will prompt you for the hyperparameters of the prior on NU: alpha and beta. This prior is a beta distribution; if you want to plot it for a given choice of your hyperparameters, call the function PlotPriorPostNu.m.

The program will finish by showing you a complete list of the parameter settings.


Last modified: Mon May 22 14:03:05 BST