Monday, December 20, 2010

Stabs in the Dark

Gail suggests going to 64 pes

Suspicion falls on "MAX_TASKS_PER_NODE" value="4"; I changed it to 16. It should probably be either 16 or 1.

Also, the telltale failure to run "module" comes from a module load netcdf/3.6.2 line.

I did this manually.

This adds three variables. So if it works there are three things to unwind.

Bah - tried configure -cleanmach, but it remembered the old CASEROOT. Now I've clobbered both runs!
Sure enough, you can't set the shell in a script called by the qsub script; you have to specify it in the qsub script.

I now have sort of got it running:

- CCSM input data directory, DIN_LOC_ROOT_CSMDATA, is /work/00671/tobis/inputdata
- Case input data directory, DIN_LOC_ROOT, is /work/00671/tobis/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
rm: No match.
Fri Dec 17 17:41:16 CST 2010 -- CSM EXECUTION BEGINS HERE
Fri Dec 17 17:41:20 CST 2010 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting
TACC: Cleaning up after job: 1731370
TACC: Done.

To be clear, I have now loaded the executable, which promptly died without leaving a clue as to why anywhere that is obvious. Of course, who knows where it thinks it ought to leave the clue. I have set "find" the task of finding files created over the weekend. It is amazingly slow, though.
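
For the record, the same search can be sketched in Python. This is a minimal sketch, not what I actually ran: the helper name recent_files and the three-day window are my own hypothetical choices, and the root directory is whatever the run is likely to write into.

```python
import os
import time


def recent_files(root, max_age_days=3.0, now=None):
    """Return paths under root modified within the last max_age_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # "." is a placeholder; substitute the case or run directory.
    for path in recent_files("."):
        print(path)
```

This will probably be no faster than find on a big filesystem; the gain, if any, comes from pointing it at the run directory rather than all of $WORK.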

This still amounts to progress: after a week I have actually got the thing to lurch to life and die.

Life in the fast lane.


it says, 32 times,

MPI_Group_range_incl(170).........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x48edec0, new_group=0x7fffc8173f2c) failed
MPIR_Group_check_valid_ranges(302): The 0th element of a range array ends at 31 but must be nonnegative and less than 1
MPI process terminated unexpectedly

OK, before I go whining around, I will try to redo everything.

Friday, December 17, 2010

Pretty weird

You didn't expect it to actually run, did you?

But the failure is damned peculiar

#!/bin/csh -f

foreach i (env_case.xml env_run.xml env_conf.xml env_build.xml env_mach_pes.xml)

fails with

TACC: Done.
./Tools/ccsm_getenv: line 9: syntax error near unexpected token `('
./Tools/ccsm_getenv: line 9: `foreach i (env_case.xml env_run.xml env_conf.xml env_build.xml env_mach_pes.xml)'
TACC: Cleaning up after job: 1729536
TACC: Done.

The thing is, it's perfectly valid csh; the error message is the one bash would issue!

Thursday, December 16, 2010

Build successful: how to run?

This is new:

Thu Dec 16 13:27:36 CST 2010 /work/00671/tobis/CAM_3/run/ccsm.bldlog.101216-130725
- Locking file env_build.xml
- Locking file Macros.prototype_ranger

and has to be considered good news.

Now the "quick start" seems to have me in the scripts directory issuing

qsub $CASE.$

but how could that work? CCSM doesn't know my account number. All of the hash commands to the runtime environment are missing in $CASE.$ . I will try just splicing them in manually.

How many PE's?

env_mach_pes.xml says (open angle brackets elided):

!-- -->
!-- These variables CANNOT be modified once configure -case has been -->
!-- invoked without first invoking configure -cleanmach. -->
!-- -->
!-- See README/readme_env and README/readme_general for details -->


entry id="TOTALPES" value="32" />
entry id="PES_LEVEL" value="1r" />
entry id="MAX_TASKS_PER_NODE" value="4" />
entry id="PES_PER_NODE" value="$MAX_TASKS_PER_NODE" />

but my prior qsubscript says

#$ -pe 16way 64

Second attempt, then, leave the -pe out; see if it compensates somehow.


#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=00:30:00
#$ -q normal
#$ -N spinup-CCSM
#$ -o ./$JOB_NAME.out

Not sure about the -cwd either...

------------> Rejecting job <------------
Please specify a parallel environment.
Syntax: -pe
Example: #$ -pe 16way 48
To see a list of defined pes: qconf -spl

should I go for 4way 32 or 16way 32 ?

I thought they had gotten somewhere on ranger.

Trying 4way 32 which will ask for 8 nodes when 2 would do, I think.
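
A sanity check on that arithmetic, as a minimal sketch. It assumes the reading given in the comments of Juli's script further down (in "-pe Nway M", N is tasks per node and M is the total number of cores reserved) and 16 cores per ranger node. I have not verified either assumption against the TACC docs, and pe_request is a made-up helper name. Under that reading, 4way 32 would reserve 2 nodes and start only 8 MPI tasks, rather than 8 nodes.

```python
CORES_PER_NODE = 16  # assumption: each ranger node has 16 cores


def pe_request(tasks_per_node, total_cores, cores_per_node=CORES_PER_NODE):
    """Interpret '-pe <tasks_per_node>way <total_cores>' under the assumed convention.

    Returns (nodes, mpi_tasks).
    """
    if total_cores % cores_per_node != 0:
        raise ValueError("total cores must be a multiple of cores per node")
    if not 1 <= tasks_per_node <= cores_per_node:
        raise ValueError("tasks per node must be between 1 and cores per node")
    nodes = total_cores // cores_per_node
    return nodes, nodes * tasks_per_node


for way, cores in [(4, 32), (16, 32), (16, 64)]:
    nodes, tasks = pe_request(way, cores)
    print(f"{way}way {cores}: {nodes} nodes, {tasks} MPI tasks")
```

If this reading is right, 16way 32 is the request that actually yields 32 tasks.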

OK, it is in the queue now

find Juli's script example:

#$ -V
# {inherit submission environment}
#$ -cwd
# {use submission directory}
#$ -N myCCSM
# {jobname (myCCSM)}
#$ -j y
# {join stderr and stdout}
#$ -o $JOB_NAME.o$JOB_ID
# {output name jobname.ojobid}
#$ -pe 16way 1024
# {use 16 cores/node, 1024 cores total}
#$ -q normal
# {queue name}
#$ -l h_rt=05:30:00
# {request 5.5 hours}
#$ -M
# {UNCOMMENT & insert Email address}
#$ -m be
# {UNCOMMENT email at Begin/End of job}
set echo #{echo cmds, use "set echo" in csh}
# {account number}
#$ -A TG-CCR090010

# ----------------------------------------
# total number of tasks = 1024
# maximum threads per task = 1
# cpl ntasks=128 nthreads=1 rootpe=0
# cam ntasks=1024 nthreads=1 rootpe=0
# clm ntasks=128 nthreads=1 rootpe=0
# cice ntasks=160 nthreads=1 rootpe=0
# pop2 ntasks=32 nthreads=1 rootpe=0
# total number of hw pes = 1024
# cpl hw pe range ~ from 0 to 127
# cam hw pe range ~ from 0 to 1023
# clm hw pe range ~ from 0 to 127
# cice hw pe range ~ from 0 to 159
# pop2 hw pe range ~ from 0 to 31
# ----------------------------------------
# Determine necessary environment variables

her env_mach_pes:

setenv NTASKS_ATM 1024; setenv NTHRDS_ATM 1; setenv ROOTPE_ATM 0;
setenv NTASKS_LND 128; setenv NTHRDS_LND 1; setenv ROOTPE_LND 0;
setenv NTASKS_ICE 160; setenv NTHRDS_ICE 1; setenv ROOTPE_ICE 0;
setenv NTASKS_OCN 32; setenv NTHRDS_OCN 1; setenv ROOTPE_OCN 0;
setenv NTASKS_CPL 128; setenv NTHRDS_CPL 1; setenv ROOTPE_CPL 0;

alas, a different file format.

OK, looking in the wrong place.

!-- -->
!-- The following values should not be set by the user since they'll be -->
!-- overwritten by scripts. -->
!-- TOTALPES -->
!-- CCSM_PCOST -->
!-- PES_LEVEL -->
!-- PES_PER_NODE -->
!-- CCSM_TCOST -->

Looks like we should be going after

entry id="NTASKS_ATM" value="32" />
entry id="NTHRDS_ATM" value="1" />
entry id="ROOTPE_ATM" value="0" />

entry id="NTASKS_LND" value="32" />
entry id="NTHRDS_LND" value="1" />
entry id="ROOTPE_LND" value="0" />

entry id="NTASKS_ICE" value="32" />
entry id="NTHRDS_ICE" value="1" />
entry id="ROOTPE_ICE" value="0" />

entry id="NTASKS_OCN" value="32" />
entry id="NTHRDS_OCN" value="1" />
entry id="ROOTPE_OCN" value="0" />

entry id="NTASKS_CPL" value="32" />
entry id="NTHRDS_CPL" value="1" />
entry id="ROOTPE_CPL" value="0" />

and the NTASKS is really the variable we control. Unlike older CAM, we need to set these at build time, apparently.
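
To keep the qsub request consistent with these entries, something like the following could read them back out. This is a minimal sketch: read_entries and pes_needed are hypothetical helper names, the regex works on the entry lines whether or not the angle brackets are present, and the rule that each component occupies ROOTPE through ROOTPE + NTASKS*NTHRDS - 1 is my assumption, inferred from the "hw pe range" comments in Juli's script.

```python
import re

ENTRY_RE = re.compile(r'entry\s+id="(\w+)"\s+value="([^"]*)"')


def read_entries(text):
    """Map entry id -> value string for every entry line found in text."""
    return dict(ENTRY_RE.findall(text))


def pes_needed(entries, components=("ATM", "LND", "ICE", "OCN", "CPL")):
    """Highest hardware PE index used, plus one (layout rule is an assumption)."""
    top = 0
    for comp in components:
        ntasks = int(entries[f"NTASKS_{comp}"])
        nthrds = int(entries[f"NTHRDS_{comp}"])
        rootpe = int(entries[f"ROOTPE_{comp}"])
        top = max(top, rootpe + ntasks * nthrds)
    return top


# Rebuild the fifteen entries shown above as sample input.
sample = "\n".join(
    f'<entry id="{key}_{comp}" value="{val}" />'
    for comp in ("ATM", "LND", "ICE", "OCN", "CPL")
    for key, val in (("NTASKS", 32), ("NTHRDS", 1), ("ROOTPE", 0))
)
print(pes_needed(read_entries(sample)))
```

With every component at 32 tasks, one thread and root PE 0, the components simply share the same 32 PEs.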

I think I'll submit a 16way 32 as well as try3

priority is very low right now so won't find out for a while.

More tomorrow I guess.

Wednesday, December 15, 2010

Two changes

Two changes in Macros.prototype_ranger will probably correspond to leaping the latest hurdle. Whether that yields a useful result in the end remains to be seen.

< INCLDIR := -I./usr/include
> INCLDIR := -I. /usr/include
< FFLAGS := $(CPPDEFS) -i4 -gopt -Mlist -time -Mextend -byteswapio
> FFLAGS := $(CPPDEFS) -i4 -target=linux -gopt -Mlist -time -Mextend -byteswapio

Isn't this intellectually satisfying work? Far better than being at AGU.



- Build Libraries: mct pio csm_share
Wed Dec 15 17:03:34 CST 2010 /work/00671/tobis/CAM_A2/mct/mct.bldlog.101215-170331
Wed Dec 15 17:05:36 CST 2010 /work/00671/tobis/CAM_A2/pio/pio.bldlog.101215-170331
Wed Dec 15 17:06:51 CST 2010 /work/00671/tobis/CAM_A2/csm_share/csm_share.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/cpl.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
ERROR: cam.buildexe.csh failed, see /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
ERROR: cat /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
login4% cat /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
Wed Dec 15 17:07:52 CST 2010 /work/00671/tobis/CAM_A2/run/atm.bldlog.101215-170331
cat: Srcfiles: No such file or directory
/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/Tools/mkSrcfiles > /work/00671/tobis/CAM_A2/atm/obj/Srcfiles
cp -f /work/00671/tobis/CAM_A2/atm/obj/Filepath /work/00671/tobis/CAM_A2/atm/obj/Deppath
/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/Tools/mkDepends Deppath Srcfiles > /work/00671/tobis/CAM_A2/atm/obj/Depends
mpif90 -c -I. /usr/include -I/opt/apps/pgi7_1/netcdf/3.6.2/include -I/opt/apps/pgi7_1/netcdf/3.6.2/include -I/opt/apps/pgi7_1/mvapich2/1.0/include -I. -I/work/00671/tobis/CESM_SRC/ccsm4_0/scripts/CAM_A2/SourceMods/ -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/chemistry/bulk_aero -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/chemistry/utils -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/physics/cam -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/dynamics/eul -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/cpl_mct -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/control -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/utils -I/work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/advection/slt -I/work/00671/tobis/CAM_A2/lib/include -DCO2A -DMAXPATCH_PFT=numpft+1 -DLSMLAT=1 -DLSMLON=1 -DPLON=128 -DPLAT=64 -DPLEV=26 -DPCNST=3 -DPCOLS=16 -DPTRM=42 -DPTRN=42 -DPTRK=42 -DSPMD -DMCT_INTERFACE -DHAVE_MPI -DCO2A -DLINUX -DSEQ_ -DFORTRANUNDERSCORE -DNO_SHR_VMATH -DNO_R16 -i4 -target=linux -gopt -Mlist -time -Mextend -byteswapio -O2 -Mvect=nosse -Kieee -O2 -Mvect=nosse -Kieee -Mfree /work/00671/tobis/CESM_SRC/ccsm4_0/models/atm/cam/src/control/cam_logfile.F90
pgf90-Error-Unknown switch: -target=linux
gmake: *** [cam_logfile.o] Error 1

Taking out the "-target=linux" and removing the space in "-I. /usr/include" does seem to create a .o file with no objections.

How this got to be in the distribution I don't know.

Now, apparently have to hack the Makefile...

But NCAR does this in some bizarre way too... Suppose I should look for FORTRANUNDERSCORE


setenv DIN_LOC_ROOT_CSMDATA $WORK/inputdata # put it where it wants it
setenv DIN_LOC_ROOT $WORK/inputdata # have it both ways
setenv CCSMROOT `pwd`
setenv MACH prototype_ranger
setenv CASEROOT `pwd`/CAM_Alone
setenv CASE CAM_Alone # not mentioned in instructions
setenv RES T42_T42
setenv COMPSET F_2000
cd ccsm4_0/scripts
create_newcase -case $CASEROOT -mach $MACH -compset $COMPSET -res $RES
cd $CASEROOT # not mentioned in instructions
./configure -case
$CASE.$ # you may need to prepend a dot and a slash

OK, I have all the files I guess but the build still fails on the convenient auto-download.

Oops, looks like I just missed one for some reason.

Haha, building at last. MCT done, PIO in progress.

Presumably ESMF will kill me, right?


Hell is other people's code.

Tuesday, December 14, 2010

After much moaning

OK, the slab model seems to be running. I'll give a complete play-by-play.

Now to take on CCSM, a product which may be easier to use given that I have an account on an official target platform.

First, I need to find the NAME of the machine. I saw it once. It was prototype_ranger or something. Grep may take forever.

Yep; I guess I still have some brain cells left.

> find . -name "*ranger*"


cd ccsm4_0
setenv CCSMROOT `pwd`
setenv MACH prototype_ranger

# mkdir CAM_Alone = Do NOT do this !!! => Caseroot directory /work/00671/tobis/CESM_SRC/ccsm4_0/CAM_Alone already exists

setenv CASEROOT `pwd`/CAM_Alone
setenv RES T42_T42
setenv COMPSET F_2000
create_newcase -case $CASEROOT -mach $MACH -compset $COMPSET -res $RES

Now snagged on auto-download of initial conditions files. Authentication needed. As I recall it was wide open, but I don't remember what it was.

Found it in email. It appears to be the same for every user; but I'm not going to be the one to post it on a web page.

transcript has the following ugly appearance:

export /work/00671/tobis/inputdata/atm/cam/physprops/ ..... svn: REPORT request failed on '/!svn/vcc/default'
Cannot replace a directory from within

export /work/00671/tobis/inputdata/atm/cam/physprops/ ..... svn: REPORT request failed on '/!svn/vcc/default'
Cannot replace a directory from within

etc. etc. many times over. Is this fetching? Who knows?

OK, no. In fact, unbelievably bad. It assumed (despite my name choice) that I wanted it in $WORK/inputdata. I do not know how this happened!

I have NO IDEA where it got $WORK/inputdata . I told it NOTHING about $WORK or inputdata !

There seems to be some confusion in the docs, with $DIN_LOC_ROOT in the files and $DIN_LOC_ROOT_CSMDATA in the docs.

"For supported machines this variable is preset". Does that include "prototype_ranger"?

Anyway, I try the alternative, using check_input_data (which really should be called checkin_input_data; it is not checking anything!). Same error.

Googling for the error message yields something about merges. So I try wget on one of the files.


ERROR: certificate common name `localhost.localdomain' doesn't match requested host name `'.
To connect to insecurely, use `--no-check-certificate'.

Somebody tell me I am dealing with grownups here!

Eventually I succeed with

wget --no-check-certificate --http-user=EASY_TO_GUESS --http-password=ALMOST_AS_EASY

To my surprise nobody squawks.

So the next thing to do is to build a script to download all the stuff that check_input_data was supposed to get:

At least it handily reports:

File is missing: /work/00671/tobis/inputdata/atm/cam/chem/trop_mozart_aero/aero/
File is missing: /work/00671/tobis/inputdata/atm/cam/inic/gaus/
File is missing: /work/00671/tobis/inputdata/atm/cam/topo/
File is missing: /work/00671/tobis/inputdata/atm/cam/ozone/
File is missing: /work/00671/tobis/inputdata/atm/cam/chem/trop_mozart/ub/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/pftdata/pft-physiology.c100226
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/surfdata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/griddata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/snicardata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/griddata/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/rtmdata/rdirc.05.061026
File is missing: /work/00671/tobis/inputdata/ice/cice/
File is missing: /work/00671/tobis/inputdata/ice/cice/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/atm/cam/ocnfrac/
File is missing: /work/00671/tobis/inputdata/ocn/docn7/SSTDATA/
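
A first step toward that script is to boil the report down to a unique list of paths relative to the inputdata root. A minimal sketch, assuming the report format shown above; missing_paths is a hypothetical helper name, and the server's base URL is deliberately left out, so the final wget step is not shown.

```python
def missing_paths(report, root="/work/00671/tobis/inputdata"):
    """Unique paths from 'File is missing:' lines, relative to root, in first-seen order."""
    prefix = "File is missing:"
    seen = []
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # ignore anything that is not a missing-file report
        path = line[len(prefix):].strip()
        if path.startswith(root):
            path = path[len(root):].lstrip("/")
        if path and path not in seen:
            seen.append(path)
    return seen


report = """\
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/atm/cam/physprops/
File is missing: /work/00671/tobis/inputdata/lnd/clm2/rtmdata/rdirc.05.061026
"""
for rel in missing_paths(report):
    print(rel)
```

Each relative path would then be appended to the server's base URL and handed to wget with the options above. In the listing as pasted here most file names seem to have been cut off, so the real report has to be the input.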

Tuesday, December 7, 2010

prawn build

Am I gaining tolerance for this garbage?

Well, yesterday I couldn't face it at all. I just sort of cowered and avoided work.

Today, however, I managed the infamous prawn build on a new platform in only eight or nine tries.

First, find the files. Then type make. Fails per expectations. Set up missing environment variables for netcdf.

Fails cryptically. Discover that while pgf90 is obviously portland fortran, cc is not pgcc. Hack the makefile.

Fails, unable to include . Mysterious, as the include path is correctly set from the first step. Find and copy it to the working directory.


Your tax dollars at work. FML.

Thursday, December 2, 2010


It is also necessary to edit cam1/models/atm/cam/src/control

to replace
       read (5,camexp,iostat=ierr)
with
       read(16, camexp)
because we can't read from stdin on ranger nodes (before doing the make, of course).

And similarly for the land model.

Also, restart files turn out NOT to be portable.

Now moving onto the slab

Hey, that worked!

I'm actually running!

Things worth checking out:
- where is the land model getting its initialization, since I didn't change that in the namelist?
- if I do change the namelist, does the result change?

- what is the proper way to configure the makefile, as opposed to going into line 190

- do I have restarts under control

- job names and all that

OK, now need to go back and change to the slab run. This is the old prawn fiasco:

here and here

Recapitulating CAM3.1 on Ranger

Not sure why this works, or whether getting to the bottom of it is useful.



use the default Portland Group settings for MPI

cd to the root of the CAM tree, then issue the following

unsetenv USER_FC
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
setenv INC_MPI /opt/apps/pgi7_2/mvapich/1.0.1/include
setenv LIB_MPI /opt/apps/pgi7_2/mvapich/1.0.1/lib
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd

then edit the Makefile, line 190, replacing $(FC) with mpif90 .

then type


NOTE: Build takes about 7 minutes.



You'd think they'd provide an example.

Before running CAM, I try to establish how to run something. I got a random MPI source off the net; stupidly, it has interactive I/O, so my first run was inconclusive.

Anyway, after several bashes at it, I got this script

#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=00:10:00
#$ -q development
#$ -N test
#$ -o ./$JOB_NAME.out
#$ -pe 16way 16
ibrun ./a.out

which is passed to the queue submission command "qsub".

As far as I can figure

#$ -pe 16way 16

is the smallest possible allocation on ranger. And I'm only asking for ten minutes on the dev queue (also tried the normal queue). Yet it takes forever to get loaded.

qstat shows a job number but the queue is marked as empty. Should I worry about this?


this namelist works for an initial run, based on a single CPU experiment:

absems_data = '/work/00671/tobis/inputdata/atm/cam/rad/'
aeroptics = '/work/00671/tobis/inputdata/atm/cam/rad/'
bnd_topo = '/work/00671/tobis/inputdata/atm/cam/topo/'
bndtvaer = '/work/00671/tobis/inputdata/atm/cam/rad/'
bndtvo = '/work/00671/tobis/inputdata/atm/cam/ozone/'
bndtvs = '/work/00671/tobis/inputdata/atm/cam/sst/'
caseid = 'camrun.bsi'
iyear_ad = 1950
mss_irt = 0
nrevsn = '/work/00671/tobis/camrun/restart/camrun.bsi.cam2.r.0021-01-01-00000'
rest_pfile = './cam2.camrun.bsi.rpointer'
ncdata = '../inputdata/init/'
nestep = 586943
nsrest = 0
nrevsn = '/work/00671/tobis/camrun/restart/camrun.bsi.clm2.r.0021-01-01-00000'
rpntpath = './lnd.camrun.bsi.rpointer'
fpftcon = '/work/00671/tobis/inputdata/lnd/clm2/pftdata/pft-physiology'
fsurdat = '/work/00671/tobis/inputdata/lnd/clm2/srfdata/cam/'

and using qsubscript

#$ -V
#$ -cwd
#$ -j y
#$ -A A-ig2
#$ -l h_rt=03:10:00
#$ -q normal
#$ -N testCAM3
#$ -o ./$JOB_NAME.out
#$ -pe 16way 64
ibrun ./cam


qsub < qsubscript

obviously the input data set is in $WORK/inputdata

Wednesday, December 1, 2010

Many distractions today

But the single cpu version did actually run.

trick is to find the .nc files in inputdata, and set ncdata to point there. Then set the restart mode to zero.

Will now need to save some restart files, and try to run in parallel.

For some reason the land model component didn't need the parallel fix to namelist. ???

To try: update the namelist for land model initialization; see if it makes any difference.

Tuesday, November 30, 2010

CESM Download

Download to home directory failed: file permissions

Download to $WORK ok.

Guessing about $COMPSET, using /work/00671/tobis/CESM_SRC/ccsm4_0/

machinefile looks dubious:
prototype_ranger (TACC Linux Cluster, Linux (pgi), 1 pes/node, batch system is SGE)

I thought it was 16 pes/node.


Meanwhile serial CAM3 reads namelist OK, fails on restart read. Need to change to initialization run.

Monday, November 22, 2010

Building CAM again

Trying to build CAM w/o reference to notes.

Haven't looked at this for two years.

unzipped and untarred

Here is the directory structure out of the box:


Configuration: here

Instructions say to run configure, but it's not immediately obvious where that is. There are four files called configure, but you have to guess that


is right.

So, of the options to configure, the first one to cause any difficulty would be

-cc name
name specifies the C compiler. This allows the user to override the default setting in the Makefile (Linux only). The C compiler can also be specified by setting the environment variable USER_CC. Default: pgcc if using pgf90, otherwise cc.


-fc name
name specifies the Fortran compiler. This allows the user to override the default setting in the Makefile. The Fortran compiler can also be specified by setting the environment variable USER_FC. Default: OS dependent.

OK, let's see what f90 we have available. Hmm.

login4% man f90
No manual entry for f90
login4% which f90
f90: Command not found.
login4% which fortran
fortran: Command not found.
login4% which f77
login4% man f77
No manual entry for f77

umm? Here

we get ifort with icc, or pgf95 with pgcc or sunf90 or sunf95 with sun_cc

I haven't heard of NCAR components running under sun, so let's try the other two.
So we can set USER_CC and USER_FC

The next problem is the MPI version, always a head-scratcher.

-mpi_inc dir
dir is the directory that contains the MPI library include files. Only SPMD versions of CAM require MPI. The MPI include directory can also be specified by setting the environment variable INC_MPI. Default: /usr/local/include except on IBM systems. The IBM Fortran compilers mpxlf90 and mpxlf90_r have the MPI include file location built in.
module avail yields
--------------------------------------- /opt/apps/pgi7_2/modulefiles ---------------------------------------
acml/4.1.0 gotoblas2/1.05 (default) mvapich2-debug/1.2
autodock/4.0.1 hdf5/1.6.5 mvapich2-new/1.2
fftw3/3.1.2 hecura-debug/1.5rc2 mvapich2/1.2
glpk/4.40 hecura/1.5.1 nco/3.9.5
gotoblas/1.26 (default) metis/4.0 netcdf/3.6.2
gotoblas/1.30 mvapich-old/1.0.1 openmpi/1.3
gotoblas2/1.00 mvapich/1.0.1

not helpful. But see "Compiling Parallel Programs with MPI" here. This also recommends intel or pgi, but seems inconsistent about which is installed.

login4% which mpif90
login4% which mpicc

So that we get pgi by default. OK, good enough for me, though we've been running intel10 on lonestar. "The compiler and MVAPICH library are selected according to the modules that have been loaded." But no modules loaded. Trying module load intel gives useful info:

Error: You can only have one compiler module loaded at time.
You already have pgi loaded.
To correct the situation, please enter the following command:

module swap pgi intel

This is tedious but going well so far. Still don't know if the MPI modules will be found; time will tell, I guess, but it looks likely that $MPICH_HOME will need to be specified for this stuff.

Similarly with netcdf

Saturday, November 20, 2010

The struggle recapitulated


Note: I am talking to myself. Will move these to another place soon.

setenv USER_CC pgcc
setenv USER_FC mpif90
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
cd cam1/models/atm/cam/bld

Trying WITHOUT setting LIB_MPI or INC_MPI ; for reference they are in


Basis for not setting them:

Directory containing the MPI include files. This is only required when CAM is built with SPMD enabled.

Directory containing the MPI library. This is only required when CAM is built with SPMD enabled.

-spmd enables an SPMD configuration of CAM (via MPI). -nospmd disables an SPMD configuration of CAM. SPMD is enabled by default only on IBM systems.


** Invalid build directory: /share/home/00671/tobis/CAM3Bld/cam1/models/atm/cam/bld
** The specified build directory is the same as the configuration script
** directory. This is not allowed because the Makefile produced by configure
** would overwrite the standard Makefile. Use a different build directory.

go back to CAM root

cd ~/CAM3Bld
mkdir build
cd build/

No complaints from configure. This makes 6 files and an empty esmf directory. One of the files is a Makefile. So



cd /share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf; \
echo "Build the ESMF library."; \
echo "ESMF is NOT supported by the CCSM project, but by the ESMF core team in NCAR/SCD"; \
echo "See"; \
gmake -j 1 BOPT=O ESMF_BUILD=/share/home/00671/tobis/CAM3Bld/build/esmf ESMF_DIR=/share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf ESMF_ARCH=;
Build the ESMF library.
ESMF is NOT supported by the CCSM project, but by the ESMF core team in NCAR/SCD
gmake[1]: Entering directory `/share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf'

makefile:15: /share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf/build//base: No such file or directory
make[1]: *** No rule to make target `/share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf/build//base'. Stop.

OK, so a couple of clues here. ESMF_ARCH=; and share/home/00671/tobis/CAM3Bld/cam1/models/utils/esmf/build//base

So THE LIST OF CAM ENVIRONMENT VARIABLES EXCLUDES ESMF! (You have to guess that. Maybe "ESMF is NOT supported by the CCSM project, but by the ESMF core team in NCAR/SCD" is supposed to be helpful.) Next, look in cam1/models/utils/esmf/build.

Here we see

alpha common_g config Darwin_xlf linux_altix linux_pathscale rs6000_sp
base_variables.defs common_O cray_x1 ES linux_gnupgf90 linux_pgi solaris
common common_variables cray_x1_ssp IRIX linux_intel README solaris_hpc
common_ conf.defs Darwin_absoft IRIX64 linux_lf95 rs6000_64 SX6

README is singularly unhelpful:
# $Id: README,v 2002/04/27 15:38:57 erik Exp $
The build directory contains all the base makefiles that are
included in your actual makefile. See the users manual for a
description of all the flags and rules in the makefiles.

It doesn't even give me a clue where to find "the user's manual". A person not up on NCAR politics might assume this was the CAM or CCSM manual.

So candidates in my case are linux_pgi and linux_gnupgf90. Looking inside the latter, it calls gnucc, which I think I don't want. so

setenv ESMF_ARCH linux_pgi
cd ~/CAM3Bld/
mv build buildx01
mkdir build
cd build

The only diff at this point is two new files in the old directory left over from the failed build, so the re-run of configure was not needed.

Make gets a lot further.

Now... gets past all the ESMF stuff (lots of it, which as we know has little purpose) and fails at

mpif90 -c -DHIDE_MPI /share/home/00671/tobis/CAM3Bld/cam1/models/atm/cam/src/control/string_utils.F90
mpif90 -c -DHIDE_MPI /share/home/00671/tobis/CAM3Bld/cam1/models/csm_share/shr/shr_kind_mod.F90
mpif90 -c -DHIDE_MPI /share/home/00671/tobis/CAM3Bld/cam1/models/csm_share/shr/shr_mpi_mod.F90
mpif90 -c -DHIDE_MPI /share/home/00671/tobis/CAM3Bld/cam1/models/csm_share/shr/shr_sys_mod.F90
mpif90 -c -DHIDE_MPI /share/home/00671/tobis/CAM3Bld/cam1/models/atm/cam/src/control/mpishorthand.F
PGF90-F-0226-Can't find include file misc.h (/share/home/00671/tobis/CAM3Bld/cam1/models/atm/cam/src/control/mpishorthand.F: 1)
PGF90/x86-64 Linux 7.2-5: compilation aborted

but, but, but misc.h is right there in the build directory. Why is PGF90 not looking in the current directory for includes?

Charles suggests using straight pgf90 rather than mpif90. This did nothing; then I thought to rerun configure. Haha! It built. First day out!

This is a serial CAM though. Not too much use. But interesting. If it can build under gnu in serial, it could run on a Mac, couldn't it?

However, must rebuild for parallel run.


setenv USER_CC pgcc
setenv USER_FC mpif90
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
setenv INC_MPI
setenv LIB_MPI
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd

This fails because INC_MPI is being ignored, as well as the current working directory

The user guide says exactly nothing about command line invocation. Cannot find a PGI document that does. The man page says

Add directory to the compiler's search path for include files. For include files surrounded by < >, each -I
directory is searched followed by the standard area. For include files surrounded by " ", the directory containing
the file containing the #include directive is searched, followed by the -I directories, followed by the standard

So is it

-I dir1 -I dir2
-I dir1 dir2
or what?
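
For what it is worth, the usual convention on Unix compiler drivers is one -I per directory, which is consistent with the man page text above; in a form like "-I dir1 dir2" the second path would just be a stray argument. I have not confirmed how pgf90 treats a space after -I, so attaching the directory directly looks like the safe choice. A minimal sketch of building the flags, where include_flags is a made-up helper:

```python
def include_flags(dirs):
    """Return one -I flag per directory, attached without a space."""
    return ["-I" + d for d in dirs]


flags = include_flags([".", "/opt/apps/pgi7_2/mvapich/1.0.1/include"])
print(" ".join(flags))
```

That puts the current directory and the MPI include directory on the search path as two separate flags.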

I already set INC_MPI per instructions. So why is that ignored??

The makefile has

ifeq ($(SPMD),TRUE)
LDFLAGS += -L$(LIB_MPI) -lmpi


By the time the routine in question comes around, the FFLAGS are lost.

I hate Makefiles anyway.


# setenv USER_CC pgcc
# setenv USER_FC mpif90
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
#setenv INC_MPI
#setenv LIB_MPI
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd

** Cannot find mpif.h in specified directory: /usr/local/include

setenv USER_CC pgcc
setenv USER_FC mpif90
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
#setenv INC_MPI
#setenv LIB_MPI
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd


#setenv USER_CC pgcc
#setenv USER_FC mpif90
unsetenv USER_CC
unsetenv USER_FC
module load netcdf
setenv INC_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/include
setenv LIB_NETCDF /opt/apps/pgi7_2/netcdf/3.6.2/lib/
setenv INC_MPI /opt/apps/pgi7_2/mvapich/1.0.1/include
setenv LIB_MPI /opt/apps/pgi7_2/mvapich/1.0.1/lib
#unsetenv INC_MPI
#unsetenv LIB_MPI
mkdir buildpar
cd buildpar
../cam1/models/atm/cam/bld/configure -spmd

Builds but fails to link:

pgf90 -o /share/home/00671/tobis/CAM3Bld/buildpar/cam BalanceCheckMod.o BareGroundFluxesMod.o Biogeophysics1Mod.o Biogeophysics2Mod.o BiogeophysicsLakeMod.o CanopyFluxesMod.o DGVMAllocationMod.o DGVMEcosystemDynMod.o DGVMEstablishmentMod.o DGVMFireMod.o DGVMKillMod.o DGVMLightMod.o DGVMMod.o DGVMMortalityMod.o DGVMReproductionMod.o DGVMRestMod.o DGVMTurnoverMod.o DriverInitMod.o FracWetMod.o FrictionVelocityMod.o Hydrology1Mod.o Hydrology2Mod.o HydrologyLakeMod.o QSatMod.o RtmMod.o RunoffMod.o STATICEcosysDynMod.o SnowHydrologyMod.o SoilHydrologyMod.o SoilTemperatureMod.o SurfaceAlbedoMod.o SurfaceRadiationMod.o TridiagonalMod.o VOCEmissionMod.o abortutils.o acbnd.o accFldsMod.o accumulMod.o advnce.o aer_optics.o aerosol_intr.o albice.o albocean.o areaMod.o atm_lndMod.o atmdrvMod.o bandij.o basdy.o basdz.o basiy.o bilin.o binary_io.o bnddyi.o bndexch.o buffer.o caer.o caerbnd.o cam.o camice.o camoce.o carbon_intr.o carbonscales.o ccsm_msg.o check_energy.o chem_surfvals.o chemistry.o cldconst.o cldinti.o cldsav.o cldwat.o clm_csmMod.o clm_varcon.o clm_varctl.o clm_varpar.o clm_varsur.o clmtype.o clmtypeInitMod.o cloud_fraction.o cloudsimulator.o cmparray_mod.o comhd.o commap.o comspe.o comsrf.o comsrfdiag.o constituents.o controlMod.o convect_deep.o convect_shallow.o courlim.o cpslec.o cubxdr.o cubydr.o cubzdr.o dadadj.o datetime.o decompMod.o decompinit.o diag_dynvar_ic.o diagnostics.o difcor.o diffusion_solver.o dmsbnd.o do_close_dispose.o do_restwrite.o dp_coupling.o driver.o drydep_mod.o dust.o dust_intr.o dust_sediment_mod.o dycore.o dyn.o dyn_grid.o dynconst.o dyndrv.o dynpkg.o engy_tdif.o engy_te.o error_messages.o esinti.o extx.o extys.o extyv.o f_wrappers.o fft99.o filenames.o fileutils.o filterMod.o flxint.o flxoce.o gauaw_mod.o geopotential.o get_memusage.o getdatetime.o gffgch.o ghg_defaults.o gptl.o gptl_papi.o gptlutil.o grcalc.o grdxy.o grmult.o gw_drag.o hb_diff.o hdinti.o herxin.o heryin.o herzin.o histFileMod.o histFldsMod.o history.o hk_conv.o 
hordif.o hordif1.o hrintp.o hycoef.o icarus_scops.o ice_constants.o ice_data.o ice_dh.o ice_diagnostics.o ice_globalcalcs.o ice_kinds_mod.o ice_ocn_flux.o ice_sfc_flux.o ice_srf.o ice_tstm.o infnan.o iniTimeConst.o iniTimeVar.o inicFileMod.o inidat.o initGridCellsMod.o inital.o initcom.o initext.o initializeMod.o initindx.o inti.o intp_util.o ioFileMod.o iobinary.o iop.o kdpfnd.o lagyin.o lcbas.o lcdbas.o limdx.o limdy.o limdz.o linebuf_stdout.o linemsdyn.o lininterp.o lnd2atmMod.o lp_coupling.o marsaglia.o massfix.o mkglacier.o mkgridMod.o mklai.o mklanwat.o mkpft.o mkrank.o mksoicol.o mksoitex.o mksrfdatMod.o mkurban.o molec_diff.o mpiinc.o mpishorthand.o nanMod.o ncdio.o ncdio_atm.o omcalc.o ozone_data.o param_cldoptics.o pdelb0.o pft2colMod.o pftvarcon.o phcs.o phys_adiabatic.o phys_buffer.o phys_gmean.o phys_grid.o phys_idealized.o physconst.o physics_types.o physpkg.o pkg_cld_sediment.o pkg_cldoptics.o plevs0.o pmgrid.o ppgrid.o prescribed_aerosols.o print_coverage.o print_memusage.o prognostics.o program_csm.o program_off.o pspect.o qmassa.o qmassd.o qneg3.o qneg4.o quad.o quicksort.o rad_constituents.o radae.o radheat.o radiation.o radlw.o radsw.o ramp_scon.o readinitial.o realloc4.o realloc7.o reordp.o restFileMod.o restart.o restart_dynamics.o restart_physics.o rgrid.o rstwr.o rtcrate.o runtime_opts.o scan2.o scandyn.o scanslt.o scm0.o scyc.o seasalt_intr.o settau.o sgexx.o shr_alarm_mod.o shr_cal_mod.o shr_const_mod.o shr_date_mod.o shr_file_mod.o shr_kind_mod.o shr_mpi_mod.o shr_msg_mod.o shr_orb_mod.o shr_sys_mod.o shr_timer_mod.o shr_vmath_fwrap.o shr_vmath_mod.o snowdp2lev.o soxbnd.o spegrd.o spetru.o sphdep.o spmdGathScatMod.o spmdMod.o spmd_dyn.o spmd_phys.o spmd_utils.o spmdinit.o srchutil.o srfoce.o srfxfer.o sst_data.o stats.o stepon.o stratiform.o string_utils.o subgridAveMod.o sulbnd.o sulchem.o sulemis.o sulfur_intr.o surfFileMod.o swap_comm.o system_messages.o tfilt_massfix.o threadutil.o time_manager.o timeinterp.o tphysac.o tphysbc.o 
tphysidl.o tracers.o tracers_suite.o trb_mtn_stress.o trjmps.o trunc.o tsinti.o tstep.o units.o upper_bc.o vertical_diffusion.o vertinterp.o virtem.o volcanicmass.o volcemission.o volcrad.o vrtmap.o wetdep.o wrap_mpi.o wrap_nf.o wv_saturation.o xqmass.o zenith.o zm_conv.o -L/opt/apps/pgi7_2/netcdf/3.6.2/lib/ -lnetcdf -L/share/home/00671/tobis/CAM3Bld/buildpar/esmf/lib/libO/linux_pgi -lesmf -L/opt/apps/pgi7_2/mvapich/1.0.1/lib -lmpich

/opt/apps/pgi7_2/mvapich/1.0.1/lib/libmpich.a(dreg.o): In function `flush_dereg_mrs_external':
dreg.c:(.text+0x3b8): undefined reference to `ibv_dereg_mr'
/opt/apps/pgi7_2/mvapich/1.0.1/lib/libmpich.a(dreg.o): In function `dreg_new_entry':
dreg.c:(.text+0xa23): undefined reference to `ibv_reg_mr'
dreg.c:(.text+0xa4a): undefined reference to `ibv_reg_mr'

and many more, all related to links defined in mvapich

Now, is the linker actually looking in LIB_MPI ?

Have to cut and paste that mess into a file and grep for it. Sure is. Right at the end there.
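
Rather than grepping by eye, the link line can be split into its -L and -l parts. A minimal sketch: link_flags is a hypothetical helper, and the command below is abbreviated to two object files plus the library flags from the real line.

```python
import shlex


def link_flags(command):
    """Split a link command into (library search dirs, library names)."""
    dirs, libs = [], []
    for token in shlex.split(command):
        if token.startswith("-L"):
            dirs.append(token[2:])
        elif token.startswith("-l"):
            libs.append(token[2:])
    return dirs, libs


# Abbreviated copy of the failing link command (object list cut down).
cmd = (
    "pgf90 -o /share/home/00671/tobis/CAM3Bld/buildpar/cam cam.o zm_conv.o "
    "-L/opt/apps/pgi7_2/netcdf/3.6.2/lib/ -lnetcdf "
    "-L/share/home/00671/tobis/CAM3Bld/buildpar/esmf/lib/libO/linux_pgi -lesmf "
    "-L/opt/apps/pgi7_2/mvapich/1.0.1/lib -lmpich"
)
dirs, libs = link_flags(cmd)
print(dirs)
print(libs)
```

The ibv_* symbols normally live in libibverbs, which is not among the -l flags here. My guess, untested, is that linking with plain pgf90 drops whatever extra libraries the mpif90 wrapper would have added.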