Post processing

In some settings, each the execution of each work item results in an individual file, but as a post-processing step, these files should be aggregated into a single one. Since this scenario is fairly common, worker supports it through two command, wcat and wredcue.

Typically, the names of the files are based on one or more variables that are present in the data file. By way of example, we will assume three variables temperature, pressure, and volume in a CSV file data.csv. The PBS script fragment below illustrates how the files are created.

simulate -t $temperature -p $pressure -v $volume \
    > output-$temperature-$pressure-$volume.txt

For the running example, this would result in a set of files such as:

output-293.0-1.0e6-1.0.txt
output-294.0-1.0e6-1.0.txt
output-295.0-1.0e6-1.0.txt
output-296.0-1.0e6-1.0.txt

The wcat command can now be used to conveniently concatenate these files, based on the data file that was used to define the work items, and the pattern that describes the file names.

$ wcat  -pattern 'output-[%temperature%]-[%pressure%]-[%volume%].txt' \
        -data data.csv  -output output.txt

This command will concatenate the individual files into output.txt. wcat has several options such as -skep_fiter <n> that will skip the first n lines of each file so that, .e.g., headers are not repeated each time. By default, blank lines at the end of files are skipped, though this can be avoided by using the -keep_blank option.

To support scenarios where the reduction of output files is complex, or the output is not text, wreduce offers assistance. It works similar to wcat but additionally a reductor script has to be provided. The latter takes two parameters, the name of the file that will contain the aggregated output, and the other the file name that contains data to be added to the former. The following reductor script, reductor.sh mimics the behaviour of wcat

#!/bin/bash
cat $2 >> $1

The reduction would be done using wreduce as follows:

$ wreduce  -pattern 'output-[%temperature%]-[%pressure%]-[%volume%].txt' \
           -data data.csv  -reductor reductor.sh  -output output.txt

This command can be used to deal with R data files, provided the reductor script contains a call to an R script that adds the individual data to the global data structure.