Last edited by 晓晓 on 2016-1-28 10:57
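Has anyone run into mpirun failing when CESM runs in parallel across multiple nodes? The build completes without any errors, and the same case runs fine on a single node. This problem has been bothering me for a long time. Below are the basic settings and the relevant log files; if anyone has seen this before or has any suggestions, I would be very grateful!
This is the beginning of the submitted .run file (number of nodes and CPUs per node):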
#!/bin/csh -f
#===============================================================================
# USERDEFINED
# This is where the batch submission is set. The above code computes
# the total number of tasks, nodes, and other things that can be useful
# here. Use PBS, BSUB, or whatever the local environment supports.
#===============================================================================
#PBS -N test48
#PBS -q normal
#PBS -l nodes=3:ppn=16
#PBS -l walltime=1000:00:00
#PBS -r y
###PBS -j oe
###PBS -S /bin/csh -V
##BSUB -l nodes=3:ppn=16:walltime=1000:00:00
##BSUB -q normal
###BSUB -k eo
###BSUB -J teh32
###BSUB -W 1000:00:00
limit coredumpsize 1000000
limit stacksize unlimited
# ----------------------------------------
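As a sanity check on the header above, here is a minimal sketch that reads the active `#PBS -l nodes=N:ppn=M` line and prints N*M, the number of cores the batch request provides. The function name `pbs_total_tasks` is hypothetical and not part of CESM or PBS; it only parses text.

```shell
#!/bin/sh
# Hypothetical helper (not part of CESM or PBS): print nodes * ppn taken from
# the first active "#PBS -l nodes=N:ppn=M" line of a run script.
# Commented-out variants such as "###PBS" or "##BSUB" are ignored because the
# pattern is anchored at the start of the line.
pbs_total_tasks() {
    # $1: path to the .run script
    sed -n 's/^#PBS -l nodes=\([0-9][0-9]*\):ppn=\([0-9][0-9]*\).*/\1 \2/p' "$1" |
        head -n 1 |
        awk '{ print $1 * $2 }'
}
```

For the header shown here this should give 3 * 16 = 48, which can then be compared by hand with the NTASKS_* values in the case's env_mach_pes.xml.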
These are the last lines of the error.log file under $CASE:
-------------------------------------------------------------------------
-------------------------------------------------------------------------
CESM PRESTAGE SCRIPT STARTING
- Case input data directory, DIN_LOC_ROOT, is /public/data/CESM/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
The following files were not found, this is informational only
Input Data List Files Found:
$CASE/Buildconf/pop2.input_data_list
$CASE/Buildconf/clm.input_data_list
$CASE/Buildconf/cpl.input_data_list
$CASE/Buildconf/cam.input_data_list
$CASE/Buildconf/rtm.input_data_list
$CASE/Buildconf/cice.input_data_list
File status unknown: same_as_TS
CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
星期三 11:13:20 CST -- CSM EXECUTION BEGINS HERE
星期三 11:13:31 CST -- CSM EXECUTION HAS FINISHED
Model did not complete - see $CASERUN/run/cesm.log.151227-111247
-----------------------------------------------------------------------------------------------
These are the last lines of cesm.log.151227-111247:
cesm.exe 000000000062C45D startup_initialco 54 startup_initialconds.F90
cesm.exe 0000000000537C39 inital_mp_cam_ini 51 inital.F90
cesm.exe 00000000004C528F cam_comp_mp_cam_i 164 cam_comp.F90
cesm.exe 00000000004C1092 atm_comp_mct_mp_a 276 atm_comp_mct.F90
cesm.exe 000000000042B512 ccsm_comp_mod_mp_ 1055 ccsm_comp_mod.F90
cesm.exe 000000000042DBA3 MAIN__ 90 ccsm_driver.F90
cesm.exe 000000000040B0CE Unknown Unknown Unknown
libc.so.6 00002ABD1932AD5D Unknown Unknown Unknown
cesm.exe 000000000040AFD9 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25793,1],1]
Exit code: 174
--------------------------------------------------------------------------
The above is the most direct error reported by the model.
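Since only the tail of the log is shown, a rough sketch like the following could help locate the first failing process and any runtime error line that precedes the traceback. Both function names are hypothetical. The `forrtl` pattern assumes the Intel Fortran runtime, which prefixes its messages that way; as far as I know its code 174 is the SIGSEGV message, but that line does not appear in the excerpt above, so treat it as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print "<process name> <exit code>" for the first
# failure that mpirun reported in the given log file.
first_failed_rank() {
    awk '
        /Process name:/ { name = $3 }
        /Exit code:/    { print name, $3; exit }
    ' "$1"
}

# Hypothetical helper: list numbered lines that look like runtime error
# messages. "forrtl" is the prefix used by the Intel Fortran runtime
# (assumption: the executable was built with the Intel compiler).
runtime_error_lines() {
    grep -n -i -E 'forrtl|sigsegv|abort' "$1"
}
```

Running `first_failed_rank` on the full cesm.log and then reading the lines that `runtime_error_lines` reports for that process is usually quicker than scrolling through the traceback of every task.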
#------------------------------------------------------------------------------------
Below are the compiler and library settings:
config_machines.xml:
<machine MACH="admin1">
<DESC>IAPmach</DESC> <!-- can be anything -->
<OS>LINUX</OS> <!-- LINUX,Darwin,CNL,AIX,BGL,BGP -->
<COMPILERS>intel,gnu</COMPILERS> <!-- intel,ibm,pgi,pathscale,gnu,cray,lahey -->
<MPILIBS>openmpi,mpich</MPILIBS> <!-- openmpi, mpich, ibm, mpi-serial -->
.... ....
config_compilers.xml:
<compiler MACH="admin1">
<NETCDF_PATH>/public/software/intel/netcdf4</NETCDF_PATH>
<PNETCDF_PATH></PNETCDF_PATH>
<ADD_SLIBS>-L/public/software/intel/netcdf4/lib -lnetcdf -lnetcdff</ADD_SLIBS>
<ADD_CPPDEFS></ADD_CPPDEFS>
<CONFIG_ARGS></CONFIG_ARGS>
<ESMF_LIBDIR></ESMF_LIBDIR>
<MPI_LIB_NAME></MPI_LIB_NAME>
<MPI_PATH>/public/software/mpi/openmpi/1.8.4/intel</MPI_PATH>
</compiler>
#-------------------------------------------------------------------
The server has both Open MPI and Intel MPI (impi) installed. In my bash environment I only set the variables for netcdf, the Intel compiler, and Open MPI.
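With two MPI installations on the same machine, one thing worth ruling out is that the `mpirun` found on PATH does not belong to the Open MPI that the model was built against. Here is a minimal sketch of that check; the function name `mpirun_matches_prefix` is hypothetical. Note that the run script is csh, so variables set only in the bash startup files may not be visible inside the batch job; that is a guess on my side and not something I have confirmed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first mpirun on PATH lives under
# the given MPI installation prefix (for example the MPI_PATH value from
# config_compilers.xml).
mpirun_matches_prefix() {
    prefix=$1
    found=$(command -v mpirun) || return 1   # no mpirun on PATH at all
    case "$found" in
        "$prefix"/*) return 0 ;;
        *)
            echo "mpirun on PATH is $found, expected under $prefix" >&2
            return 1
            ;;
    esac
}
```

On this system the call would be `mpirun_matches_prefix /public/software/mpi/openmpi/1.8.4/intel`, run on every compute node of the job and not only on the login node. Running `ldd` on cesm.exe and looking at the MPI libraries it resolves gives the complementary view from the executable's side.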
After hitting this mpirun problem I tried many changes to the settings, for example changing config_machines.xml to:
<COMPILERS>intel</COMPILERS> <!-- intel,ibm,pgi,pathscale,gnu,cray,lahey -->
<MPILIBS>openmpi</MPILIBS> <!-- openmpi, mpich, ibm, mpi-serial -->
The error message stays the same every time, though, so I would like to ask everyone to take a look.
Thanks in advance.