Revised 2006-09-03 DMB

Return to the Index


Sample Programs from Analyze Mode

This document gives some example analyze programs and explains how they function.

 

Testing a genome sequence

The following program will load in a genome sequence, run it through a test CPU, and output the information about it in a couple of formats.

  VERBOSE
  LOAD_SEQUENCE rmzavcgmciqqptqpqcpctletncogcbeamqdtqcptipqfpgqxutycuastttva
  RECALCULATE
  DETAIL detail_test.dat fitness merit gest_time length viable sequence
  TRACE
  PRINT

This program starts off with the VERBOSE command so that Avida will print to the screen all of the details about what is going on as it runs the analyze script; I recommend you begin all of your programs this way for debugging purposes. The program then uses the LOAD_SEQUENCE command to allow the user to enter a specific genome sequence in its compressed format. This will translate the genome into the proper genotype as long as you are using the correct instruction set file, since that file determines the mappings of letters to instructions).

The RECALCULATE command places the genome sequence into a test CPU, and determines its fitness, merit, gestation time, etc. so that the DETAIL command that follows it can have access to all of this information as it prints it to the file "detail_test.dat" (its first argument). The TRACE and PRINT commands will then print individual files about this genome, the first tracing its execution line-by-line, and the second summarizing all sorts of statistics about it and displaying the genome. Since no directory was specified for these commands, archive/ is assumed, and the filenames are org-S1.trace and org-S1.gen. If a genotype has a name when it is loaded, that name will be kept, but if it doesn't, it will be assigned a name starting at org-S1, then org-S2, and so on counting higher. The TRACE and PRINT commands add their own suffixes to the genome's name to determine the filename they will be printed as.

 

Using Variables

Often, you will want to run the same section of analyze code with multiple different inputs each time through, or else you might simply want a single value to be easy to change throughout the code. To facilitate such programming practices, variables are available in analyze mode that can be altered for each repitition through the code.

There are actually several types of variables, all of which are a single letter of number. For a command that requires a variable name as an input, you simply put that variable where it is requested. For example, if you were going to set the variable i to be equal to the number 12, you would type:

  SET i 12

But later on in the code, how does Avida know when you type an i if you really want the letter 'i' there, or if you prefer the number 12 to be there? To distinguish these cases, you must put a dollar sign '$' before a variable wherever you want it to be translated to its value instead of just using the variable name itself.

There are a few different commands that allow you to manipulate a variable's value, and sometimes execute a section of code multiple times based off of each of the possible values. Here is one example:

  FORRANGE i 100 199
    SET d /home/charles/dev/avida/runs/evo-neut/evo_neut_$i
    PURGE_BATCH
    LOAD_DETAIL_DUMP $d/detail_pop.100000
    RECALCULATE
    DETAIL $d/detail.dat update length fitness sequence
  END

The FORRANGE command runs the contents of the loop once for each possible value in the range, setting the variable i to each of these values in turn. Thus the first time through the loop, 'i' will be equal to the value '100', then '101', '102', all the way up to '199'. In this particular case, we have 100 runs (numbered 100 through 199) that we want to work with.

The first thing we do once we're inside the loop is set the value of the variable 'd' to be the name of the directory we're going to be working with. Since this is a long directory name, we don't want to have to type it over every time we need it. If we set it to the variable d, then all we need to do is type '$d' in the future, and it will be translated to the full name. Note that in this case we are setting a variable to a string instead of a number; that's just fine and Avida will figure out how to handle it properly. This directory we are working with will change each time through the loop, and that it is no problem to use one variable as part of setting another.

After we know what directory we are using, we run a PURGE_BATCH to get rid of all of the genotypes from the last time through the loop (lest we just keep building up more and more genotypes in the current batch) and then we refill the batch by using LOAD_DETAIL_DUMP to load in all of the genotypes saved in the file detail-100000.pop within our chosen directory. The RECALCULATE command runs all of the genotypes through a test CPU so we have all the statistics we need, and finally DETAIL will print out the stats we want to the file detail.dat, again placing it in the proper directory. The END command signifies the end of the FORRANGE loop.

 

Finding Lineages

Quite often, the portion of an Avida run that we will be most interested in is the lineage from the final dominant genotype back to the original ancestor. As such, there are tools in Avida to get at this information.

  FORRANGE i 100 199
    SET d /home/charles/dev/avida/runs/evo-neut/evo_neut_$i
    PURGE_BATCH
    LOAD_DETAIL_DUMP $d/detail_pop.100000
    LOAD_DETAIL_DUMP $d/historic_dump.100000
    FIND_LINEAGE num_cpus
    RECALCULATE
    DETAIL lineage.$i.html depth parent_dist length fitness html.sequence
  END

This program looks very similar to the last one. The first four lines are actually identical, but after loading the detail dump at update 100,000, we also want to load the historic dump from the same time point. A detail file contains all of the genotypes that were currently alive in the population at the time it was printed, while the historic files contain all of the genotypes that are direct ancestors of those that were still alive. The combination of these two files gives us the lineages of the entire population back to the original ancestor. Since we are only interested in a single lineage, the next thing we do is run the FIND_LINEAGE command to pick out a single genotype, and discard everything else except for its lineage. In this case, we pick the genotype with the highest abundance (the most virtual CPUs associated with it) at the time of printing.

As before, the RECALCULATE command gets us any additional information we may need about the genotypes, and then we print that information to a file using the DETAIL command. The filenames that we are using this time have the format lineage.$i.html, so they are all being written to the current directory with filenames that incorporate the run number right in them. Also, because the filename ends in the suffix '.html', Avida knows to print the file in a proper html format. Note that the specific values that we choose to print take advantage of the fact that we have a lineage (and hence measured things like the genetic distance to the parent) and are in html mode (and thus can print the sequence using colors to specify where exactly mutations occurred).

 

Working with Batches

In analyze mode, we can load genotypes into multiple batches and we then operate on a single batch at a time. So, for example, if we wanted to only consider the dominant genotypes at time points 100 updates apart, but all we had to work with were the detail files (containing all genotypes at each time point) we might write a program like:

  SET d /home/charles/avida/runs/mydir/here-it-is
  SET_BATCH 0
  FORRANGE u 100 100000 100            # Cycle through updates
    PURGE_BATCH                        # Purge current batch (0)
    LOAD_DETAIL_DUMP $d/detail_pop.$u  # Load in the population at this update
    FIND_GENOTYPE num_cpus             # Remove all but most abundant genotype
    DUPLICATE 0 1                      # Duplicate batch 0 into batch 1
  END
  SET_BATCH 1                          # Switch to batch 1
  RECALCULATE                          # Recalculate statistics...
  DETAIL dom.dat fitness sequence      # Print info for all dominants!

This program is slightly more complicated than the others, so I added in comments directly inside it. Basically, what we do here is use batch 0 as our staging area where we load the full detail dumps into, strip them down to only the single most abundant genotype, and then copy that genotype over into batch one. By the time we're done, we have all of the dominant genotypes inside of batch one, so we can print anything we need right from there.

 

Building your own Commands

One really useful feature that I have added to the analyze mode is the ability for the user to construct a variety of their own commands without modifying the source code. This is done with the FUNCTION command. For example, if you know you will always need a file called lineage.html with very specific information in it, you might write a helper command for yourself as follows:

  FUNCTION MY_HTML_LINEAGE  # arg1=run_directory
    PURGE_BATCH
    LOAD_DETAIL_DUMP $1/detail_pop.100000
    LOAD_DETAIL_DUMP $1/historic_dump.100000
    FIND_LINEAGE num_cpus
    RECALCULATE
    DETAIL $1/lineage.html depth parent_dist length fitness html.sequence
  END

This works identically to how we found lineages and printed their data in the section above. Only this time, it has created the new command called MY_HTML_LINEAGE that you can use anytime thereafter. Arguments to functions work similar to variables, but they are numbers instead of letters. Thus $1 translates to the first arguments, $2 becomes the second, and so on. You are limited to 9 arguments at this point, but that should be enough for most tasks. $0 is the name of the function you are running, in case you ever need to use that.

You may be interested in also using functions in conjunction with the SYSTEM command. Anything you type as arguments to this command gets run on the command line, so you can make functions to do anything that could otherwise be done were you at the shell prompt. For example, imagine that you were going to use a lot of compressed files in your analysis that you would first need to uncompress. You might right a function like:

  FUNCTION UNZIP   # Arg1=filename
    SYSTEM gunzip $1
  END

This is a shorter example than you might typically want to write a function for, but it does get the point across. This would allow you to just type UNZIP <filename> whenever you needed to uncompress something.

Functions are particularly useful in conjunction with the INCLUDE command. You can create a file called something like my_functions.cfg in your Avida work directory, define a bunch of functions there, and then start all of your analyze.cfg files with the line:

  INCLUDE my_functions.cfg

and you will have access to all of your functions thereafter. Ideally, as this language becomes more flexible, so will your ability to create functions within the language, so you will be able to develop flexible and useful libraries for yourself.

 

Try it Out...

Here are a couple of example problems you can try to see how well you can use analyze mode. These should get you used to working with it for future projects.

Problem 1. A detail file in Avida contains one line associated with each genotype, in order from the most abundant to the least. Currently, the LOAD_DETAIL_DUMP command will load the entire file's worth of genotypes into the current batch, but what if you only wanted the top few? You should write a function called LOAD_DETAIL_TOP that takes two arguments. The first ($1) is the name file that needs to be loaded in (just as in the original command), and the second is the number of lines you want to load.

The easiest way to go about doing this is by using the SYSTEM command along with the Unix command head which will output the very top of a file. If you typed the line:

  head -42 detail_pop.1000 > my_temp_file

The file my_temp_file would be created, and its contents would be the first 42 lines of detail-1000.pop. So, what you need this function to do is create a temporary file with proper number of lines from the detail file in it, load that temp file into the current batch, and then delete the file (using the rm command). Warning: be very careful with the automated deletions -- you don't want to accidentally remove something that you really need! I recommend that you use the command rm -i until you finish debugging. This problem may end up being a little tricky for you, but you should be able to work your way through it.

Problem 2. Now that you have a working LOAD_DETAIL_TOP command, you can run LOAD_DETAIL_TOP <filename> 1 in order to only load the most dominant genotype from the detail file. Rewrite the example program from the section "Working with Batches" above such that you now only need to work within a single batch.


Return to the Index