                                 fseqbootall



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Bootstrapped sequences algorithm

Description

   Reads in a data set, and produces multiple data sets from it by
   bootstrap resampling. Since most programs in the current version of the
   package allow processing of multiple data sets, this can be used
   together with the consensus tree program CONSENSE to do bootstrap (or
   delete-half-jackknife) analyses with most of the methods in this
   package. This program also allows the Archie/Faith technique of
   permutation of species within characters. It can also rewrite a data
   set to convert it from between the PHYLIP Interleaved and Sequential
   forms, and into a preliminary version of a new XML sequence alignment
   format which is under development

Algorithm

   SEQBOOT is a general bootstrapping and data set translation tool. It is
   intended to allow you to generate multiple data sets that are resampled
   versions of the input data set. Since almost all programs in the
   package can analyze these multiple data sets, this allows almost
   anything in this package to be bootstrapped, jackknifed, or permuted.
   SEQBOOT can handle molecular sequences, binary characters, restriction
   sites, or gene frequencies. It can also convert data sets between
   Sequential and Interleaved format, and into the NEXUS format or into a
   new XML sequence alignment format.

   To carry out a bootstrap (or jackknife, or permutation test) with some
   method in the package, you may need to use three programs. First, you
   need to run SEQBOOT to take the original data set and produce a large
   number of bootstrapped or jackknifed data sets (somewhere between 100
   and 1000 is usually adequate). Then you need to find the phylogeny
   estimate for each of these, using the particular method of interest.
   For example, if you were using DNAPARS you would first run SEQBOOT and
   make a file with 100 bootstrapped data sets. Then you would give this
   file the proper name to have it be the input file for DNAPARS. Running
   DNAPARS with the M (Multiple Data Sets) menu choice and informing it to
   expect 100 data sets, you would generate a big output file as well as a
   treefile with the trees from the 100 data sets. This treefile could be
   renamed so that it would serve as the input for CONSENSE. When CONSENSE
   is run the majority rule consensus tree will result, showing the
   outcome of the analysis.

   This may sound tedious, but the run of CONSENSE is fast, and that of
   SEQBOOT is fairly fast, so that it will not actually take any longer
   than a run of a single bootstrap program with the same original data
   and the same number of replicates. This is not very hard and allows
   bootstrapping or jackknifing on many of the methods in this package.
   The same steps are necessary with all of them. Doing things this way
   some of the intermediate files (the tree file from the DNAPARS run, for
   example) can be used to summarize the results of the bootstrap in other
   ways than the majority rule consensus method does.

   If you are using the Distance Matrix programs, you will have to add one
   extra step to this, calculating distance matrices from each of the
   replicate data sets, using DNADIST or GENDIST. So (for example) you
   would run SEQBOOT, then run DNADIST using the output of SEQBOOT as its
   input, then run (say) NEIGHBOR using the output of DNADIST as its
   input, and then run CONSENSE using the tree file from NEIGHBOR as its
   input.

   The resampling methods available are:
     * The bootstrap. Bootstrapping was invented by Bradley Efron in 1979,
       and its use in phylogeny estimation was introduced by me
       (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves
       creating a new data set by sampling N characters randomly with
       replacement, so that the resulting data set has the same size as
       the original, but some characters have been left out and others are
       duplicated. The random variation of the results from analyzing
       these bootstrapped data sets can be shown statistically to be
       typical of the variation that you would get from collecting new
       data sets. The method assumes that the characters evolve
       independently, an assumption that may not be realistic for many
       kinds of data.
     * The partial bootstrap.. This is the bootstrap where fewer than the
       full number of characters are sampled. The user is asked for the
       fraction of characters to be sampled. It is primarily useful in
       carrying out Zharkikh and Li's (1995) Complete And Partial
       Bootstrap method, and Shimodaira's (2002) AU method, both of which
       correct the bias of bootstrap P values.
     * Block-bootstrapping. One pattern of departure from indeopendence of
       character evolution is correlation of evolution in adjacent
       characters. When this is thought to have occurred, we can correct
       for it by samopling, not individual characters, but blocks of
       adjacent characters. This is called a block bootstrap and was
       introduced by Kuensch (1989). If the correlations are believed to
       extend over some number of characters, you choose a block size, B,
       that is larger than this, and choose N/B blocks of size B. In its
       implementation here the block bootstrap "wraps around" at the end
       of the characters (so that if a block starts in the last B-1
       characters, it continues by wrapping around to the first character
       after it reaches the last character). Note also that if you have a
       DNA sequence data set of an exon of a coding region, you can ensure
       that equal numbers of first, second, and third coding positions are
       sampled by using the block bootstrap with B = 3.
     * Partial block-bootstrapping. Similar to partial bootstrapping
       except sampling blocks rather than single characters.
     * Delete-half-jackknifing.. This alternative to the bootstrap
       involves sampling a random half of the characters, and including
       them in the data but dropping the others. The resulting data sets
       are half the size of the original, and no characters are
       duplicated. The random variation from doing this should be very
       similar to that obtained from the bootstrap. The method is
       advocated by Wu (1986). It was mentioned by me in my bootstrapping
       paper (Felsenstein, 1985b), and has been available for many years
       in this program as an option. Note that, for the present,
       block-jackknifing is not available, because I cannot figure out how
       to do it straightforwardly when the block size is not a divisor of
       the number of characters.
     * Delete-fraction jackknifing. Jackknifing is advocated by Farris et.
       al. (1996) but as deleting a fraction 1/e (1/2.71828). This retains
       too many characters and will lead to overconfidence in the
       resulting groups when there are conflicting characters. However it
       is made available here as an option, with the user asked to supply
       the fraction of characters that are to be retained.
     * Permuting species within characters. This method of resampling
       (well, OK, it may not be best to call it resampling) was introduced
       by Archie (1989) and Faith (1990; see also Faith and Cranston,
       1991). It involves permuting the columns of the data matrix
       separately. This produces data matrices that have the same number
       and kinds of characters but no taxonomic structure. It is used for
       different purposes than the bootstrap, as it tests not the
       variation around an estimated tree but the hypothesis that there is
       no taxonomic structure in the data: if a statistic such as number
       of steps is significantly smaller in the actual data than it is in
       replicates that are permuted, then we can argue that there is some
       taxonomic structure in the data (though perhaps it might be just
       the presence of aa pair of sibling species).
     * Permuting characters. This simply permutes the order of the
       characters, the same reordering being applied to all species. For
       many methods of tree inference this will make no difference to the
       outcome (unless one has rates of evolution correlated among
       adjacent sites). It is included as a possible step in carrying out
       a permutation test of homogeneity of characters (such as the
       Incongruence Length Difference test).
     * Permuting characters separately for each species. This is a method
       introduced by Steel, Lockhart, and Penny (1993) to permute data so
       as to destroy all phylogenetic structure, while keeping the base
       composition of each species the same as before. It shuffles the
       character order separately for each species.
     * Rewriting. This is not a resampling or permutation method: it
       simply rewrites the data set into a different format. That format
       can be the PHYLIP format. For molecular sequences and discrete
       morphological character it can also be the NEXUS format. For
       molecular sequences one other format is available, a new (and
       nonstandard) XML format of our own devising. When the PHYLIP format
       is chosen the data set is coverted between Interleaved and
       Sequential format. If it was read in as Interleaved sequences, it
       will be written out as Sequential format, and vice versa. The NEXUS
       format for molecular sequences is always written as interleaved
       sequences. The XML format is different from (though similar to) a
       number of other XML sequence alignment formats. An example will be
       found below. Here is a table to links to those other XML alignment
       formats:

   Andrew Rambaut's BEAST XML format
   http://evolve.zoo.ox.ac.uk/beast/introXML.html and
   http://evolve.zoo.ox.ac.uk/beast/referenindex.html A format for
   alignments. There is also a format for phylogenies described there.
   MSAML M http://xml.coverpages.org/msaml-desc-dec.html Defined by Paul
   Gordon of University of Calgary. See his big list of molecular biology
   XML projects.
   BSML http://www.bsml.org/resources/default.asp Bioinformatic Sequence
   Markup Language includes a multiple sequence alignment XML format

Usage

   Here is a sample session with fseqbootall


% fseqbootall -seed 3
Bootstrapped sequences algorithm
Input (aligned) sequence set: seqboot.dat
Phylip seqboot program output file [seqboot.fseqbootall]:


 bootstrap: true
jackknife: false
 permute: false
 lockhart: false
 ild: false
 justwts: false

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to file "seqboot.fseqbootall"

Done.



   Go to the input files for this example
   Go to the output files for this example

Command line arguments

Bootstrapped sequences algorithm
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-infilesequences]   seqset     (Aligned) sequence set filename and optional
                                  format, or reference (input USA)
  [-outfile]           outfile    [*.fseqbootall] Phylip seqboot program
                                  output file

   Additional (Optional) qualifiers (* if not always prompted):
   -categories         properties File of input categories
   -mixfile            properties File of mixtures
   -ancfile            properties File of ancestors
   -weights            properties Weights file
   -factorfile         properties Factors file
   -datatype           menu       [s] Choose the datatype (Values: s
                                  (Molecular sequences); m (Discrete
                                  Morphology); r (Restriction Sites); g (Gene
                                  Frequencies))
   -test               menu       [b] Choose test (Values: b (Bootstrap); j
                                  (Jackknife); c (Permute species for each
                                  character); o (Permute character order); s
                                  (Permute within species); r (Rewrite data))
*  -regular            toggle     [N] Altered sampling fraction
*  -fracsample         float      [100.0] Samples as percentage of sites
                                  (Number from 0.100 to 100.000)
*  -rewriteformat      menu       [p] Output format (Values: p (PHYLIP); n
                                  (NEXUS); x (XML))
*  -seqtype            menu       [d] Output format (Values: d (dna); p
                                  (protein); r (rna))
*  -morphseqtype       menu       [p] Output format (Values: p (PHYLIP); n
                                  (NEXUS))
*  -blocksize          integer    [1] Block size for bootstraping (Integer 1
                                  or more)
*  -reps               integer    [100] How many replicates (Integer 1 or
                                  more)
*  -justweights        menu       [d] Write out datasets or just weights
                                  (Values: d (Datasets); w (Weights))
*  -enzymes            boolean    [N] Is the number of enzymes present in
                                  input file
*  -all                boolean    [N] All alleles present at each locus
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
   -printdata          boolean    [N] Print out the data at start of run
*  -[no]dotdiff        boolean    [Y] Use dot-differencing
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-infilesequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -squick1            boolean    Read id and sequence only
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   fseqbootall data files read by SEQBOOT are the standard ones for the
   various kinds of data. For molecular sequences the sequences may be
   either interleaved or sequential, and similarly for restriction sites.
   Restriction sites data may either have or not have the third argument,
   the number of restriction enzymes used. Discrete morphological
   characters are always assumed to be in sequential format. Gene
   frequencies data start with the number of species and the number of
   loci, and then follow that by a line with the number of alleles at each
   locus. The data for each locus may either have one entry for each
   allele, or omit one allele at each locus. The details of the formats
   are given in the main documentation file, and in the documentation
   files for the groups of programsreads any normal sequence USAs.

  Input files for usage example

  File: seqboot.dat

    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC

Output file format

   fseqbootall output will contain the data sets generated by the
   resampling process. Note that, when Gene Frequencies data is used or
   when Discrete Morphological characters with the Factors option are
   used, the number of characters in each data set may vary. It may also
   vary if there are an odd number of characters or sites and the
   Delete-Half-Jackknife resampling method is used, for then there will be
   a 50% chance of choosing (n+1)/2 characters and a 50% chance of
   choosing (n-1)/2 characters.

   The Factors option causes the characters to be resampled together. If
   (say) three adjacent characters all have the same factors characters,
   so that they all are understood to be recoding one multistate
   character, they will be resampled together as a group.

   The order of species in the data sets in the output file will vary
   randomly. This is a precaution to help the programs that analyze these
   data avoid any result which is sensitive to the input order of species
   from showing up repeatedly and thus appearing to have evidence in its
   favor.

   The numerical options 1 and 2 in the menu also affect the output file.
   If 1 is chosen (it is off by default) the program will print the
   original input data set on the output file before the resampled data
   sets. I cannot actually see why anyone would want to do this. Option 2
   toggles the feature (on by default) that prints out up to 20 times
   during the resampling process a notification that the program has
   completed a certain number of data sets. Thus if 100 resampled data
   sets are being produced, every 5 data sets a line is printed saying
   which data set has just been completed. This option should be turned
   off if the program is running in background and silence is desirable.
   At the end of execution the program will always (whatever the setting
   of option 2) print a couple of lines saying that output has been
   written to the output file.

  Output files for usage example

  File: seqboot.fseqbootall

    5     6
Alpha      AAACCA
Beta       AAACCC
Gamma      ACCCCA
Delta      CCCAAC
Epsilon    CCCAAA
    5     6
Alpha      AAACAA
Beta       AAACCC
Gamma      ACCCAA
Delta      CCCACC
Epsilon    CCCAAA
    5     6
Alpha      AAAAAC
Beta       AAACCC
Gamma      AACAAC
Delta      CCCCCA
Epsilon    CCCAAC
    5     6
Alpha      CCCCCA
Beta       CCCCCC
Gamma      CCCCCA
Delta      AAAAAC
Epsilon    AAAAAA
    5     6
Alpha      AAAACC
Beta       AAACCC
Gamma      AACACC
Delta      CCCCAA
Epsilon    CCCACC
    5     6
Alpha      AAAACC
Beta       ACCCCC
Gamma      AAAACC
Delta      CCCCAA
Epsilon    CAAACC
    5     6
Alpha      AACCAA
Beta       AACCCC
Gamma      ACCCAA
Delta      CCAACC
Epsilon    CCAAAA
    5     6
Alpha      AAAACC
Beta       ACCCCC
Gamma      AAAACC
Delta      CCCCAA
Epsilon    CAAACC
    5     6
Alpha      AACACC


  [Part of this file has been deleted for brevity]

Gamma      ACAAAA
Delta      CCCCCC
Epsilon    CCAAAA
    5     6
Alpha      AACAAC
Beta       AACCCC
Gamma      AACAAC
Delta      CCACCA
Epsilon    CCAAAC
    5     6
Alpha      AACAAA
Beta       AACCCC
Gamma      CCCAAA
Delta      CCACCC
Epsilon    CCAAAA
    5     6
Alpha      ACAAAA
Beta       ACCCCC
Gamma      CCAAAA
Delta      CACCCC
Epsilon    CAAAAA
    5     6
Alpha      CAAAAA
Beta       CCCCCC
Gamma      CAAAAA
Delta      ACCCCC
Epsilon    AAAAAA
    5     6
Alpha      CAACCC
Beta       CCCCCC
Gamma      CAACCC
Delta      ACCAAA
Epsilon    AAACCC
    5     6
Alpha      ACAACC
Beta       ACCCCC
Gamma      ACAACC
Delta      CACCAA
Epsilon    CAAACC
    5     6
Alpha      AAAAAA
Beta       AAAAAC
Gamma      ACCCCA
Delta      CCCCCC
Epsilon    CCCCCA
    5     6
Alpha      AACAAC
Beta       AACCCC
Gamma      CCCAAC
Delta      CCACCA
Epsilon    CCAAAC

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                          Description
                    distmat      Create a distance matrix from a multiple sequence alignment
                    ednacomp     DNA compatibility algorithm
                    ednadist     Nucleic acid sequence distance matrix program
                    ednainvar    Nucleic acid sequence invariants method
                    ednaml       Phylogenies from nucleic acid maximum likelihood
                    ednamlk      Phylogenies from nucleic acid maximum likelihood with clock
                    ednapars     DNA parsimony algorithm
                    ednapenny    Penny algorithm for DNA
                    eprotdist    Protein distance algorithm
                    eprotpars    Protein parsimony algorithm
                    erestml      Restriction site maximum likelihood method
                    eseqboot     Bootstrapped sequences algorithm
                    fdiscboot    Bootstrapped discrete sites algorithm
                    fdnacomp     DNA compatibility algorithm
                    fdnadist     Nucleic acid sequence distance matrix program
                    fdnainvar    Nucleic acid sequence invariants method
                    fdnaml       Estimate nucleotide phylogeny by maximum likelihood
                    fdnamlk      Estimates nucleotide phylogeny by maximum likelihood
                    fdnamove     Interactive DNA parsimony
                    fdnapars     DNA parsimony algorithm
                    fdnapenny    Penny algorithm for DNA
                    fdolmove     Interactive Dollo or polymorphism parsimony
                    ffreqboot    Bootstrapped genetic frequencies algorithm
                    fproml       Protein phylogeny by maximum likelihood
                    fpromlk      Protein phylogeny by maximum likelihood
                    fprotdist    Protein distance algorithm
                    fprotpars    Protein parsimony algorithm
                    frestboot    Bootstrapped restriction sites algorithm
                    frestdist    Calculate distance matrix from restriction sites or fragments
                    frestml      Restriction site maximum likelihood method
                    fseqboot     Bootstrapped sequences algorithm

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Please report all bugs to the EMBOSS bug team
                    (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.

Comments

                    None
