Saturday, December 10, 2016

fix for running CAM3 on TACC supercomputer facility

CAM3 won't run at TACC out of the box. It fails to read the namelist file, because the compute nodes can't read from stdin. This may apply to other HPC setups as well.

Here's my fix.

Copy the following text to a file called patchCAM


diff -ru cam1/models/atm/cam/src/control/runtime_opts.F90 ../CAM3/cam1/models/atm/cam/src/control/runtime_opts.F90
--- cam1/models/atm/cam/src/control/runtime_opts.F90 2005-03-08 11:06:36.000000000 -0600
+++ ../CAM3/cam1/models/atm/cam/src/control/runtime_opts.F90 2012-02-15 17:57:05.000000000 -0600
@@ -1065,7 +1065,15 @@
!
! Read in the camexp namelist from standard input
!
- read (5,camexp,iostat=ierr)
+! read (5,camexp,iostat=ierr)
+
+
+ open(16,file='namelist',iostat = ierr)
+ write(6,*) 'READING NAMELIST'
+ read(16, camexp)
+ write(6,*) 'FINISHED READING NAMELIST'
+
+
if (ierr /= 0) then
write(6,*)'READ_NAMELIST: Namelist read returns ',ierr
call endrun
diff -ru cam1/models/lnd/clm2/src/main/controlMod.F90 ../CAM3/cam1/models/lnd/clm2/src/main/controlMod.F90
--- cam1/models/lnd/clm2/src/main/controlMod.F90 2005-04-08 13:27:10.000000000 -0500
+++ ../CAM3/cam1/models/lnd/clm2/src/main/controlMod.F90 2012-02-16 15:26:58.000000000 -0600
@@ -369,7 +369,14 @@
fsurdat=lsmsurffile
finidat=lsminifile
#else
- read(5, clmexp, iostat=ierr)
+ ! read (5,camexp,iostat=ierr)
+
+
+ ! open(16,file='namelist',iostat = ierr)
+ write(6,*) 'READING NAMELIST 2'
+ read(16, clmexp, iostat = ierr)
+ write(6,*) 'FINISHED READING NAMELIST 2'
+
if (ierr /= 0) then
if (masterproc) then
write(6,*)'error: namelist input resulted in error code ',ierr


Go to the directory above cam1 where you unpacked the source and issue


patch -p0 < patchCAM
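
Note that after patching, the model no longer reads the camexp namelist from stdin; it opens a file literally named 'namelist' in the run directory, and the land model reads its clmexp group from the same file. So the launch changes roughly as in this sketch; the directory and executable names are placeholders, adapt them to your own run script:


cd $WORK/camrun             # run directory (placeholder)
cp my_namelist namelist     # the patched code opens a file called 'namelist'; it should contain both the camexp and clmexp groups
./cam                       # instead of the old ./cam < my_namelist
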

Thursday, August 25, 2011

CDMS inconsistency

Not sure this counts as a bug, but it is a weird inconsistency.


>>> import cdms
>>> s1 = cdms.open("ch.17_Lister_0001_all_02.nc")
>>> s2 = cdms.open("ch.18_Lister_0001_all_02.nc")
>>> s3 = cdms.open("ch.12_Lister_0001_all_02.nc")
>>> s4 = cdms.open("ch.13_Lister_0001_all_02.nc")

>>> d1 = s1['FLNSOI']
>>> d2 = s2['FLNSOI']
>>> d3 = s3['FLNSOI']
>>> d4 = s4['FLNSOI']

>>> dd1 = d1 - d3
>>> dd2 = d2 - d4

>>> d1.getAxis(0).id
'time'
>>> d2.getAxis(0).id
'time'
>>> d3.getAxis(0).id
'time'
>>> d4.getAxis(0).id
'time'


>>> dd1.getAxis(0).id
'time'
>>> dd2.getAxis(0).id
'axis_10'


The second behavior occurs reproducibly on three out of 180 file pairs; it was hard to track down and prevented the NCO ncrcat command from working. Fortunately the workaround is easy.


>>> dd2.setAxisList(d2.getAxisList())


But it's baffling how this even happens. All of the _all_NN.nc files were created by the NCO ncra command. Probably this is technically correct behavior, and there is no guarantee that an axis list will be inherited after a calculation. But why does it USUALLY happen, then? These files are all produced by the same processes!
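
For what it's worth, it is easy to make the workaround systematic. Here is a minimal sketch (the helper name is mine; the file names are from the example above) that restores the axis list whenever a difference comes back with a replacement axis:


import cdms

def safe_diff(file_a, file_b, varname):
    # Subtract varname in file_b from varname in file_a, restoring the
    # axis list if the subtraction loses it (the 'axis_10' behavior above).
    a = cdms.open(file_a)
    b = cdms.open(file_b)
    da = a[varname]
    db = b[varname]
    dd = da - db
    if dd.getAxis(0).id != da.getAxis(0).id:
        dd.setAxisList(da.getAxisList())
    return dd

dd2 = safe_diff("ch.18_Lister_0001_all_02.nc", "ch.13_Lister_0001_all_02.nc", "FLNSOI")
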

I'd prefer to have my mind read every time or never, not 59 times out of 60.

This wasted three hours of my life.

Tuesday, January 4, 2011

New Slab Run How-To

Take an extant slab run namelist.

In the namelist, edit:
ensure the input files come from $WORK/caminput
change nsrest
change case_id
set rest_pfile
(a sketch of the edited fields follows these notes)

Create the file pointed to by rest_pfile, based on the lnd version, or edit it as needed.

In the qsubscript:
change the job name
ensure the correct executable

Ensure you have the right executable!

If everything is under control, the executables should be properly named in CAM31Executables.
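
A minimal sketch of the namelist fields that change for a new slab run, using the variable names as written in these notes (the case name and paths are placeholders; check them against a working namelist):


 &camexp
 case_id    = 'camrun.newslab'                   ! new case name (placeholder)
 nsrest     = 0                                  ! 0 = initial run (1 would be a restart)
 rest_pfile = './cam2.camrun.newslab.rpointer'   ! restart pointer file for this case
 ! input file paths should point into $WORK/caminput
 /
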

Ah, python, python, when will I see you again? http://www.pauahtun.org/Misc/4nobletruthsofpython.html

Restart howto

> nrevsn = '/work/00671/tobis/camrun/restart/camrun.bsi.cam2.r.0021-01-01-00000'
> rest_pfile = './cam2.camrun.bsi.rpointer'
12,13c14,15
< nelapse = -14966
< nsrest = 0
---
> nestep = -14966
> nsrest = 1
16a19
> rpntpath = './lnd.camrun.bsi.rpointer'

What's with nrevsn vs. rest_pfile? nrevsn seems to be for branch runs.

So what I think should be the case is that only nsrest changes; rest_pfile and rpntpath should both be set. As always, the cam2.*.rpointer file must be manually edited to remove the hard-wired NCAR path. The land model does not have this bug.
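
Putting that together, the restart edits amount to something like this sketch (values copied from the diff above; nrevsn is omitted since it appears to be only for branch runs):


 nsrest     = 1                               ! 1 = restart (0 was the initial run)
 nestep     = -14966                          ! replaces nelapse from the initial run
 rest_pfile = './cam2.camrun.bsi.rpointer'    ! CAM restart pointer; the pointed-to file must have the hard-wired NCAR path removed
 rpntpath   = './lnd.camrun.bsi.rpointer'     ! land model restart pointer
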

Monday, December 20, 2010

Stabs in the Dark

Gail suggests going to 64 pes

Suspicion attaches to "MAX_TASKS_PER_NODE" value="4"; I changed it to 16. It should probably be either 16 or 1.
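
For reference, this is a one-line edit, presumably in env_mach_pes.xml (the entry/value layout shown here is an assumption based on how the scripts quote it):


 <entry id="MAX_TASKS_PER_NODE" value="16" />
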

Also, there is the telltale failure to run "module"; what's needed is a module load netcdf/3.6.2.

I did this manually.

This adds three variables. So if it works there are three things to unwind.

Bah - tried configure -cleanmach, but it remembered the old CASEROOT. Now I've clobbered both runs!
Sure enough, you can't set the shell in a script called by the qsub script; you have to specify it in the qsubscript.
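
So the qsubscript itself has to carry the shell and the environment setup. A rough sketch of its top (the batch directives and the run script name are site- and case-specific, so they are only placeholders here):


#!/bin/csh -f
# (site-specific batch directives go here)
module load netcdf/3.6.2      # the module load that was otherwise missing
cd $WORK/CAM_3/run            # run directory
./ccsm.run                    # placeholder name for the case run script
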

I now have sort of got it running:


CCSM PRESTAGE SCRIPT STARTING
- CCSM input data directory, DIN_LOC_ROOT_CSMDATA, is /work/00671/tobis/inputdata
- Case input data directory, DIN_LOC_ROOT, is /work/00671/tobis/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
CCSM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
rm: No match.
Fri Dec 17 17:41:16 CST 2010 -- CSM EXECUTION BEGINS HERE
Fri Dec 17 17:41:20 CST 2010 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting
TACC: Cleaning up after job: 1731370
TACC: Done.


To be clear, I have now loaded the executable, which promptly died without leaving a clue as to why anywhere obvious. Of course, who knows where it thinks it ought to leave the clue. I have set "find" the task of finding files created over the weekend. It is amazingly slow, though.
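
The search itself is nothing clever; something along these lines (a sketch, with the age cutoff picked to cover the weekend):


find $WORK -type f -mtime -3      # files modified in the last three days under $WORK
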

This still amounts to progress: after a week I have actually got the thing to lurch to life and die.

Life in the fast lane.
...
AHA!

$WORK/CAM_3/run/ccsm.log.101217-174114

it says, 32 times,


MPI_Group_range_incl(170).........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x48edec0, new_group=0x7fffc8173f2c) failed
MPIR_Group_check_valid_ranges(302): The 0th element of a range array ends at 31 but must be nonnegative and less than 1
MPI process terminated unexpectedly


OK, before I go whining around, I will try to redo everything.

Friday, December 17, 2010

Pretty weird

You didn't expect it to actually run, did you?

But the failure is damned peculiar.

#!/bin/csh -f

...
foreach i (env_case.xml env_run.xml env_conf.xml env_build.xml env_mach_pes.xml)
...

fails with

TACC: Done.
./Tools/ccsm_getenv: line 9: syntax error near unexpected token `('
./Tools/ccsm_getenv: line 9: `foreach i (env_case.xml env_run.xml env_conf.xml
env_build.xml env_mach_pes.xml)'
TACC: Cleaning up after job: 1729536
TACC: Done.

The thing is, it's perfectly valid csh; the error message is the one bash would issue!
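
A quick way to convince yourself (an illustrative test; the file name is made up): the same foreach line parses fine under csh but reproduces exactly this complaint under bash, which suggests ccsm_getenv is being interpreted by the wrong shell.


cat > t.csh << 'EOF'
foreach i (a b c)
echo $i
end
EOF
csh t.csh       # prints a, b, c
bash t.csh      # bash: t.csh: line 1: syntax error near unexpected token `('
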