Monday, December 20, 2010

Stabs in the Dark

Gail suggests going to 64 pes

Suspicion attends to "MAX_TASKS_PER_NODE" value="4" ; I changed it to 16. It should probably be either 16 or 1.

Also, the telltale failure to run "module" comes from a "module load netcdf/3.6.2" line.

I did this manually.

This adds three variables. So if it works there are three things to unwind.

Bah - tried configure -cleanmach, but it remembered the old CASEROOT. Now I've clobbered both runs!
Sure enough, you can't set the shell in a script called by the qsub script; you have to specify it in the qsubscript.

I now have sort of got it running:

- CCSM input data directory, DIN_LOC_ROOT_CSMDATA, is /work/00671/tobis/inputdata
- Case input data directory, DIN_LOC_ROOT, is /work/00671/tobis/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
rm: No match.
Fri Dec 17 17:41:16 CST 2010 -- CSM EXECUTION BEGINS HERE
Fri Dec 17 17:41:20 CST 2010 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting
TACC: Cleaning up after job: 1731370
TACC: Done.

To be clear, I have now loaded the executable, which promptly died without leaving a clue as to why anywhere that is obvious. Of course, who knows where it thinks it ought to leave the clue. I have set "find" the task of finding files created over the weekend. It is amazingly slow, though.
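The hunt for whatever the job wrote can be narrowed to one directory tree and a time cutoff. A minimal sketch, assuming only that the clue (if any) is a regular file whose modification time falls after the job was submitted; the function name and the three-day window are illustrative, not part of CCSM.

```python
import os
import time


def files_modified_since(root, cutoff_epoch):
    """Yield paths under root whose modification time is at or after cutoff_epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime >= cutoff_epoch:
                    yield path
            except OSError:
                # The file vanished or cannot be read; skip it.
                continue


if __name__ == "__main__":
    # Illustrative window: anything touched in the last three days.
    three_days_ago = time.time() - 3 * 24 * 3600
    for path in files_modified_since(".", three_days_ago):
        print(path)
```

Pointing it at the case and run directories rather than the whole filesystem is what keeps it from being as slow as an unrestricted find.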

This still amounts to progress: after a week I have actually got the thing to lurch to life and die.

Life in the fast lane.


it says, 32 times,

MPI_Group_range_incl(170).........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x48edec0, new_group=0x7fffc8173f2c) failed
MPIR_Group_check_valid_ranges(302): The 0th element of a range array ends at 31 but must be nonnegative and less than 1
MPI process terminated unexpectedly

OK, before I go whining around, I will try to redo everything.
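One possible reading of that message: the group being subdivided held a single process ("less than 1") while the model asked for ranks up to 31. That would fit 32 processes each starting up as its own one-process MPI job, but this is a guess from the error text alone. The check MPI is applying can be sketched as follows; the function is a hypothetical illustration, not MPICH code.

```python
def check_ranges(group_size, ranges):
    """Return error strings for (first, last, stride) triplets that
    reference ranks outside a group of the given size."""
    errors = []
    for i, (first, last, stride) in enumerate(ranges):
        if stride == 0:
            errors.append(f"range {i}: stride must be nonzero")
            continue
        for label, rank in (("starts", first), ("ends", last)):
            if not 0 <= rank < group_size:
                errors.append(
                    f"range {i} {label} at {rank} but must be nonnegative "
                    f"and less than {group_size}"
                )
    return errors


if __name__ == "__main__":
    # A one-process group asked for ranks 0..31 fails the check.
    print(check_ranges(1, [(0, 31, 1)]))
    # The same request against a 32-process group passes.
    print(check_ranges(32, [(0, 31, 1)]))
```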

Friday, December 17, 2010

Pretty weird

You didn't expect it to actually run, did you?

But the failure is damned peculiar

#!/bin/csh -f

foreach i (env_case.xml env_run.xml env_conf.xml env_build.xml env_mach_pes.xml)

fails with

TACC: Done.
./Tools/ccsm_getenv: line 9: syntax error near unexpected token `('
./Tools/ccsm_getenv: line 9: `foreach i (env_case.xml env_run.xml env_conf.xml
env_build.xml env_mach_pes.xml)'
TACC: Cleaning up after job: 1729536
TACC: Done.

The thing is, it's perfectly valid csh; the error message is the one bash would issue!
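The "#!" line only takes effect when the operating system executes the file directly; if some other program hands the file to a shell, that line is just a comment, which would explain a bash-style complaint about valid csh. A small sketch for checking which interpreter a script asks for, so it can be compared with the shell the batch system actually uses (the helper name is made up for illustration):

```python
def shebang_interpreter(script_text):
    """Return the interpreter path named on a script's first line, or None."""
    first_line = script_text.splitlines()[0] if script_text else ""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


if __name__ == "__main__":
    sample = "#!/bin/csh -f\n\nforeach i (env_case.xml env_run.xml)\nend\n"
    print(shebang_interpreter(sample))
```

The Monday entry reports the eventual fix: the shell has to be named in the qsub script itself.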

Thursday, December 16, 2010

Build successful: how to run?

This is new:

Thu Dec 16 13:27:36 CST 2010 /work/00671/tobis/CAM_3/run/ccsm.bldlog.101216-130725
- Locking file env_build.xml
- Locking file Macros.prototype_ranger

and has to be considered good news.

Now the "quick start" seems to have me in the scripts directory issuing

qsub $CASE.$

but how could that work? CCSM doesn't know my account number. All of the hash commands to the runtime environment are missing in $CASE.$ . I will try just splicing them in manually.

How many PE's?

env_mach_pes.xml says (open angle brackets elided):

!-- -->
!-- These variables CANNOT be modified once configure -case has been -->
!-- invoked without first invoking configure -cleanmach. -->
!-- -->
!-- See README/readme_env and README/readme_general for details -->


entry id="TOTALPES" value="32" />
entry id="PES_LEVEL" value="1r" />
entry id="MAX_TASKS_PER_NODE" value="4" />
entry id="PES_PER_NODE" value="$MAX_TASKS_PER_NODE" />

but my prior qsubscript says

#$ -pe 16way 64

Second attempt, then, leave the -pe out; see if it compensates somehow.


#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=00:30:00
#$ -q normal
#$ -N spinup-CCSM
#$ -o ./$JOB_NAME.out

Not sure about the -cwd either...

------------> Rejecting job <------------
Please specify a parallel environment.
Syntax: -pe
Example: #$ -pe 16way 48
To see a list of defined pes: qconf -spl

should I go for 4way 32 or 16way 32 ?

I thought they had gotten somewhere on ranger.

Trying 4way 32 which will ask for 8 nodes when 2 would do, I think.
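The arithmetic behind "8 nodes when 2 would do" can be written out, taking the second number in -pe as the total task count, which is how this entry reads it. Whether the scheduler interprets it that way is not confirmed here, so treat this as a sketch of the reasoning rather than of the batch system.

```python
import math


def nodes_needed(total_tasks, tasks_per_node):
    """Nodes required to place total_tasks at tasks_per_node tasks each."""
    if total_tasks < 1 or tasks_per_node < 1:
        raise ValueError("task counts must be positive")
    return math.ceil(total_tasks / tasks_per_node)


if __name__ == "__main__":
    for way in (4, 16):
        print(f"{way}way 32 -> {nodes_needed(32, way)} nodes")
```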

OK, it is in the queue now

find Juli's script example:

#$ -V
# {inherit submission environment}
#$ -cwd
# {use submission directory}
#$ -N myCCSM
# {jobname (myCCSM)}
#$ -j y
# {join stderr and stdout}
#$ -o $JOB_NAME.o$JOB_ID
# {output name jobname.ojobid}
#$ -pe 16way 1024
# {use 16 cores/node, 1024 cores total}
#$ -q normal
# {queue name}
#$ -l h_rt=05:30:00
# {request 5.5 hours}
#$ -M
# {UNCOMMENT & insert Email address}
#$ -m be
# {UNCOMMENT email at Begin/End of job}
set echo #{echo cmds, use "set echo" in csh}
# {account number}
#$ -A TG-CCR090010

# ----------------------------------------
# total number of tasks = 1024
# maximum threads per task = 1
# cpl ntasks=128 nthreads=1 rootpe=0
# cam ntasks=1024 nthreads=1 rootpe=0
# clm ntasks=128 nthreads=1 rootpe=0
# cice ntasks=160 nthreads=1 rootpe=0
# pop2 ntasks=32 nthreads=1 rootpe=0
# total number of hw pes = 1024
# cpl hw pe range ~ from 0 to 127
# cam hw pe range ~ from 0 to 1023
# clm hw pe range ~ from 0 to 127
# cice hw pe range ~ from 0 to 159
# pop2 hw pe range ~ from 0 to 31
# ----------------------------------------
# Determine necessary environment variables

her env_mach_pes:

setenv NTASKS_ATM 1024; setenv NTHRDS_ATM 1; setenv ROOTPE_ATM 0;
setenv NTASKS_LND 128; setenv NTHRDS_LND 1; setenv ROOTPE_LND 0;
setenv NTASKS_ICE 160; setenv NTHRDS_ICE 1; setenv ROOTPE_ICE 0;
setenv NTASKS_OCN 32; setenv NTHRDS_OCN 1; setenv ROOTPE_OCN 0;
setenv NTASKS_CPL 128; setenv NTHRDS_CPL 1; setenv ROOTPE_CPL 0;

alas, a different file format.

OK, looking in the wrong place.

!-- -->
!-- The following values should not be set by the user since they'll be -->
!-- overwritten by scripts. -->
!-- TOTALPES -->
!-- CCSM_PCOST -->
!-- PES_LEVEL -->
!-- PES_PER_NODE -->
!-- CCSM_TCOST -->

Looks like we should be going after

entry id="NTASKS_ATM" value="32" />
entry id="NTHRDS_ATM" value="1" />
entry id="ROOTPE_ATM" value="0" />

entry id="NTASKS_LND" value="32" />
entry id="NTHRDS_LND" value="1" />
entry id="ROOTPE_LND" value="0" />

entry id="NTASKS_ICE" value="32" />
entry id="NTHRDS_ICE" value="1" />
entry id="ROOTPE_ICE" value="0" />

entry id="NTASKS_OCN" value="32" />
entry id="NTHRDS_OCN" value="1" />
entry id="ROOTPE_OCN" value="0" />

entry id="NTASKS_CPL" value="32" />
entry id="NTHRDS_CPL" value="1" />
entry id="ROOTPE_CPL" value="0" />

and the NTASKS is really the variable we control. Unlike older CAM, we need to set these at build time, apparently.
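The header of Juli's script suggests how the per-component settings turn into hardware PE ranges: each component seems to occupy ROOTPE through ROOTPE + NTASKS*NTHRDS - 1, and the total is one past the highest PE used. That rule is inferred from her numbers, not taken from the CCSM scripts, so the sketch below is an assumption that happens to reproduce them.

```python
def pe_layout(components):
    """components maps name -> (ntasks, nthreads, rootpe).

    Returns ({name: (first_pe, last_pe)}, total_pes).
    """
    ranges = {}
    for name, (ntasks, nthreads, rootpe) in components.items():
        ranges[name] = (rootpe, rootpe + ntasks * nthreads - 1)
    total = max(last for _first, last in ranges.values()) + 1
    return ranges, total


if __name__ == "__main__":
    juli = {
        "cpl": (128, 1, 0),
        "cam": (1024, 1, 0),
        "clm": (128, 1, 0),
        "cice": (160, 1, 0),
        "pop2": (32, 1, 0),
    }
    ranges, total = pe_layout(juli)
    for name, (first, last) in ranges.items():
        print(f"{name} hw pe range ~ from {first} to {last}")
    print("total number of hw pes =", total)
```

With every component set to 32 tasks, 1 thread and root PE 0, as in the file above, the same rule gives 32 PEs in total, matching TOTALPES.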

I think I'll submit a 16way 32 as well as try3

priority is very low right now so won't find out for a while.

More tomorrow I guess.

Wednesday, December 15, 2010

Two changes

Two changes in Macros.prototype_ranger will probably correspond to leaping the latest hurdle. Whether that yields a useful result in the end remains to be seen.

< INCLDIR := -I./usr/include
> INCLDIR := -I. /usr/include
< FFLAGS := $(CPPDEFS) -i4 -gopt -Mlist -time -Mextend -byteswapio
> FFLAGS := $(CPPDEFS) -i4 -target=linux -gopt -Mlist -time -Mextend -byteswapio

Isn't this intellectually satisfying work? Far better than being at AGU.



- Build Libraries: mct pio csm_share
Wed Dec 15 17:03:34 CST 2010 /work/00671/tobis/CAM_A2/mct/mct.bldlog.101215-170331
Wed Dec 15 17:05:36 CST 2010 /work/00671/tobis/CAM_A2/pio/pio.bldlog.101215-170331
Wed Dec 15 17:06:51 CST 2010 /work/00671/tobis/CAM_A2/csm_share/csm_share.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/cpl.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
ERROR: cam.buildexe.csh failed, see /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
ERROR: cat /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
login4% cat /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
cat: Srcfiles: No such file or directory
/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/Tools/mkSrcfiles > /work/00671/tobis/CAM_A2/atm/obj/Srcfiles
cp -f /work/00671/tobis/CAM_A2/atm/obj/Filepath /work/00671/tobis/CAM_A2/atm/obj/Deppath
/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/Tools/mkDepends Deppath Srcfiles > /work/00671/tobis/CAM_A2/atm/obj/Depends
mpif90 -c -I. /usr/include -I/opt/apps/pgi7_1/netcdf/3.6.2/include -I/opt/apps/pgi7_1/netcdf/3.6.2/include -I/opt/apps/pgi7_1/mvapich2/1.0/include -I. -I/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/SourceMods/ -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/chemistry/bulk_aero -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/chemistry/utils -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/physics/cam -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/dynamics/eul -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/cpl_mct -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/control -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/utils -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/advection/slt -I/work/00671/tobis/CAM_A2/lib/include -DCO2A -DMAXPATCH_PFT=numpft+1 -DLSMLAT=1 -DLSMLON=1 -DPLON=128 -DPLAT=64 -DPLEV=26 -DPCNST=3 -DPCOLS=16 -DPTRM=42 -DPTRN=42 -DPTRK=42 -DSPMD -DMCT_INTERFACE -DHAVE_MPI -DCO2A -DLINUX -DSEQ_ -DFORTRANUNDERSCORE -DNO_SHR_VMATH -DNO_R16 -i4 -target=linux -gopt -Mlist -time -Mextend -byteswapio -O2 -Mvect=nosse -Kieee -O2 -Mvect=nosse -Kieee -Mfree /work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/control/cam_logfile.F90
pgf90-Error-Unknown switch: -target=linux
gmake: *** [cam_logfile.o] Error 1

Taking out the "-target=linux" and removing the space in "-I. /usr/include" does seem to create a .o file with no objections.

How this got to be in the distribution I don't know.

Now, apparently have to hack the Makefile...

But NCAR does this in some bizarre way too... Suppose I should look for FORTRANUNDERSCORE


setenv DIN_LOC_ROOT_CSMDATA $WORK/inputdata # put it where it wants it
setenv DIN_LOC_ROOT $WORK/inputdata # have it both ways
setenv CCSMROOT `pwd`
setenv MACH prototype_ranger
setenv CASEROOT `pwd`/CAM_Alone
setenv CASE CAM_Alone # not mentioned in instructions
setenv RES T42_T42
setenv COMPSET F_2000
cd ccsm4_0/scripts
create_newcase -case $CASEROOT -mach $MACH -compset $COMPSET -res $RES
cd $CASEROOT # not mentioned in instructions
./configure -case
$CASE.$ # you may need to prepend a dot and a slash

OK, I have all the files I guess but the build still fails on the convenient auto-download.

Oops, looks like I just missed one for some reason.

Haha, building at last. MCT done, PIO in progress.

Presumably ESMF will kill me, right?


Hell is other people's code.

Tuesday, December 14, 2010

After much moaning

OK, the slab model seems to be running. I'll give a complete play-by-play.

Now to take on CCSM, a product which may be easier to use given that I have an account on an official target platform.

First, I need to find the NAME of the machine. I saw it once. It was prototype_ranger or something. Grep may take forever.

Yep; I guess I still have some brain cells left.

> find . -name "*ranger*"


cd ccsm4_0
setenv CCSMROOT `pwd`
setenv MACH prototype_ranger

# mkdir CAM_Alone = Do NOT do this !!! => Caseroot directory /work/00671/tobis/CESM_SRC/ccsm4_0/CAM_Alone already exists

setenv CASEROOT `pwd`/CAM_Alone
setenv RES T42_T42
setenv COMPSET F_2000
create_newcase -case $CASEROOT -mach $MACH -compset $COMPSET -res $RES

Now snagged on auto-download of initial conditions files. Authentication needed. As I recall it was wide open, but I don't remember what it was.

Found it in email. It appears to be the same for every user; but I'm not going to be the one to post it on a web page.

transcript has the following ugly appearance:

export /work/00671/tobis/inputdata/atm/cam/physprops/ ..... svn: REPORT request failed on '/!svn/vcc/default'
Cannot replace a directory from within

export /work/00671/tobis/inputdata/atm/cam/physprops/ ..... svn: REPORT request failed on '/!svn/vcc/default'
Cannot replace a directory from within

etc. etc. many times over. Is this fetching? Who knows?

OK, no. In fact, unbelievably bad. It assumed (despite my name choice) that I wanted it in $WORK/inputdata. I do not know how this happened!

I have NO IDEA where it got $WORK/inputdata . I told it NOTHING about $WORK or inputdata !

There seems to be some confusion in the docs with $DIN_LOC_ROOT in the files and $DIN_LOC_ROOT_CSMDATA in the docs.

"For supported machines this variable is preset". Does that include "prototype_ranger"?

Anyway, I try the alternative, using check_input_data (which really should be called checkin_input_data; it is not checking anything!). Same error.

Googling for the error message yields something about merges. So I try wget on one of the files.


ERROR: certificate common name `localhost.localdomain' doesn't match requested host name `'.
To connect to insecurely, use `--no-check-certificate'.

Somebody tell me I am dealing with grownups here!

Eventually I succeed with

wget --no-check-certificate --http-user=EASY_TO_GUESS --http-password=ALMOST_AS_EASY

To my surprise nobody squawks.

So the next thing to do is to build a script to download all the stuff that check_input_data was supposed to get:

At least it handily reports:

File is missing: /work/00671/tobis/inputdata/atm/cam/chem/trop_mozart_aero/aero/
File is missing: /work/00671/tobis/inputdata/atm/cam/inic/gaus/
File is missing: /work/00671/tobis/inputdata/atm/cam/topo/
File is missing: /work/00671/tobis/inputdata/atm/cam/ozone/
File is missing: /work/00671/tobis/inputdata/atm/cam/chem/trop_mozart/ub/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/pftdata/pft-physiology.c100226
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/surfdata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/griddata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/griddata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/rtmdata/rdirc.05.061026
File is missing: /work/00671/tobis/inputdata/ice/cice/
File is missing: /work/00671/tobis/inputdata/ice/cice/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
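A download script can be generated from that report. The sketch below is hedged in several ways: the server address is not recorded in this entry, so base_url is a placeholder parameter; the credentials are passed in rather than written down; and entries that end in "/" have lost their file names in the transcript, so they are only collected for manual follow-up. The wget flags are the ones used above plus -O for the output path.

```python
import posixpath
import shlex

PREFIX = "File is missing: "


def missing_paths(report_text):
    """Split the report into complete file paths and truncated directory entries."""
    files, truncated = [], []
    for line in report_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        path = line[len(PREFIX):].strip()
        bucket = truncated if path.endswith("/") else files
        if path and path not in bucket:
            bucket.append(path)  # keep first occurrence, drop duplicates
    return files, truncated


def wget_command(path, local_root, base_url, user, password):
    """Build one wget command; base_url is a hypothetical server root."""
    rel = posixpath.relpath(path, local_root)
    url = base_url.rstrip("/") + "/" + rel
    return " ".join([
        "wget", "--no-check-certificate",
        "--http-user=" + shlex.quote(user),
        "--http-password=" + shlex.quote(password),
        "-O", shlex.quote(path),
        shlex.quote(url),
    ])


if __name__ == "__main__":
    report = (
        "File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/\n"
        "File is missing: /work/00671/tobis/inputdata/lnd/clm2/rtmdata/rdirc.05.061026\n"
    )
    files, truncated = missing_paths(report)
    for f in files:
        print(wget_command(f, "/work/00671/tobis/inputdata",
                           "https://example.invalid/inputdata", "USER", "PASSWORD"))
    print("needs a file name:", truncated)
```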

Tuesday, December 7, 2010

prawn build

Am I gaining tolerance for this garbage?

Well, yesterday I couldn't face it at all. I just sort of cowered and avoided work.

Today, however, I managed the infamous prawn build on a new platform in only eight or nine tries.

First, find the files. Then type make. Fails per expectations. Set up missing environment variables for netcdf.

Fails cryptically. Discover that while pgf90 is obviously portland fortran, cc is not pgcc. Hack the makefile.

Fails, unable to include . Mysterious, as the include path is correctly set from the first step. Find and copy it to working directory


Your tax dollars at work. FML.

Thursday, December 2, 2010


It is also necessary to edit cam1/models/atm/cam/src/control

to replace
       read (5,camexp,iostat=ierr)
with
       read(16, camexp)
because we can't read from stdin on ranger nodes (before doing the make, of course).

And similarly for the land model.

Also, restart files turn out NOT to be portable.

Now moving onto the slab

Hey, that worked!

I'm actually running!

Things worth checking out:
- where is the land model getting its initialization, since I didn't change that in the namelist.
- if I do change the namelist, does the result change?

- what is the proper way to configure the makefile, as opposed to going into line 190

- do I have restarts under control

- job names and all that

OK, now need to go back and change to the slab run. This is the old prawn fiasco:

here and here

Recapitulating CAM3.1 on Ranger

Not sure why this works, or whether getting to the bottom of it is useful.



use the default Portland Group settings for MPI

cd to the root of the CAM tree, then issue the following

unsetenv USER_FC
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
setenv INC_MPI /opt/apps/pgi7_2/mvapich/1.0.1/include
setenv LIB_MPI /opt/apps/pgi7_2/mvapich/1.0.1/lib
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd

then edit the Makefile, line 190, replacing $(FC) with mpif90 .
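Until the proper way to configure this turns up, the hand edit can at least be made repeatable. A minimal sketch, assuming the generated Makefile really does have $(FC) on line 190 as it did here; the helper refuses to touch the file if that line does not contain the expected text.

```python
def replace_on_line(text, line_number, old, new):
    """Replace old with new on one 1-indexed line, failing loudly on a mismatch."""
    lines = text.splitlines(keepends=True)
    index = line_number - 1
    if not 0 <= index < len(lines):
        raise IndexError(f"no line {line_number} in a {len(lines)}-line file")
    if old not in lines[index]:
        raise ValueError(f"line {line_number} does not contain {old!r}")
    lines[index] = lines[index].replace(old, new)
    return "".join(lines)


if __name__ == "__main__":
    # Run from the build directory, after configure has written the Makefile.
    with open("Makefile") as handle:
        original = handle.read()
    with open("Makefile", "w") as handle:
        handle.write(replace_on_line(original, 190, "$(FC)", "mpif90"))
```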

then type


NOTE: Build takes about 7 minutes.



You'd think they'd provide an example.

Before running CAM, I try to establish how to run something. I got a random MPI source off the net; stupidly, it has interactive I/O so my first run was inconclusive.

Anyway, after several bashes at it, I got this script

#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=00:10:00
#$ -q development
#$ -N test
#$ -o ./$JOB_NAME.out
#$ -pe 16way 16
ibrun ./a.out

which is passed to the queue submission command "qsub".
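Since the same header gets retyped for every experiment with only a few values changing, it can be generated. The sketch below reproduces exactly the directives shown above and nothing more; the function and parameter names are made up for illustration.

```python
def sge_script(job_name, account, queue, wallclock, tasks_per_node, slots, command):
    """Render a job script with the directives used in this notebook."""
    lines = [
        "#$ -V",
        "#$ -cwd",
        "#$ -j y",
        f"#$ -A {account}",
        f"#$ -l h_rt={wallclock}",
        f"#$ -q {queue}",
        f"#$ -N {job_name}",
        "#$ -o ./$JOB_NAME.out",
        f"#$ -pe {tasks_per_node}way {slots}",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sge_script("test", "A-ig2", "development", "00:10:00", 16, 16,
                     "ibrun ./a.out"), end="")
```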

As far as I can figure

#$ -pe 16way 16

is the smallest possible allocation on ranger. And I'm only asking for ten minutes on the dev queue (also tried the normal queue). Yet it takes forever to get loaded.

qstat shows a job number but the queue is marked as empty. Should I worry about this?


this namelist works for an initial run, based on a single CPU experiment:

absems_data = '/work/00671/tobis/inputdata/atm/cam/rad/'
aeroptics = '/work/00671/tobis/inputdata/atm/cam/rad/'
bnd_topo = '/work/00671/tobis/inputdata/atm/cam/topo/'
bndtvaer = '/work/00671/tobis/inputdata/atm/cam/rad/'
bndtvo = '/work/00671/tobis/inputdata/atm/cam/ozone/'
bndtvs = '/work/00671/tobis/inputdata/atm/cam/sst/'
caseid = 'camrun.bsi'
iyear_ad = 1950
mss_irt = 0
nrevsn = '/work/00671/tobis/camrun/restart/camrun.bsi.cam2.r.0021-01-01-00000'
rest_pfile = './cam2.camrun.bsi.rpointer'
ncdata = '../inputdata/init/'
nestep = 586943
nsrest = 0
nrevsn = '/work/00671/tobis/camrun/restart/camrun.bsi.clm2.r.0021-01-01-00000'
rpntpath = './lnd.camrun.bsi.rpointer'
fpftcon = '/work/00671/tobis/inputdata/lnd/clm2/pftdata/pft-physiology'
fsurdat = '/work/00671/tobis/inputdata/lnd/clm2/srfdata/cam/'

and using qsubscript

#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=03:10:00
#$ -q normal
#$ -N testCAM3
#$ -o ./$JOB_NAME.out
#$ -pe 16way 64
ibrun ./cam


qsub < qsubscript

obviously the input data set is in $WORK/inputdata

Wednesday, December 1, 2010

Many distractions today

But the single cpu version did actually run.

trick is to find the .nc files in inputdata, and set ncdata to point there. Then set the restart mode to zero.
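Finding the candidate files is easy to script. A minimal sketch that lists every .nc file under a directory; which one belongs in ncdata still has to be decided by hand, and the default path is only an example.

```python
from pathlib import Path


def find_netcdf(root):
    """Return sorted paths of all .nc files below root."""
    return sorted(str(p) for p in Path(root).rglob("*.nc") if p.is_file())


if __name__ == "__main__":
    for path in find_netcdf("inputdata"):
        print(path)
```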

Will now need to save some restart files, and try to run in parallel.

For some reason the land model component didn't need the parallel fix to namelist. ???

To try: update the namelist for land model initialization; see if it makes any difference.