[Help request] Error when running WRF in parallel on multiple cores

lhaikun@163.com · Posted 2015-6-30 16:19:07

Last edited by lhaikun@163.com on 2015-8-17 17:11

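Single-domain run, resolution 81 km × 81 km, grid 83 × 59, simulation area 80-140°E, 5-57°N
Physics parameterization schemes:
Radiation: CAM
Land surface: CLM
Microphysics: WSM6
Boundary layer: YSU
Cumulus: KF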

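The error below appeared during a long run on 64 cores. I had previously tested a short run (9 days) on 64 cores and it went fine. The long run now gets stuck at 12:00 on 27 January 2011. I searched for the error messages below and found that the same problem had come up on this forum before, but the replies there were: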

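“

I pasted your error message into a web search and found this:
http://mailman.cse.ohio-state.ed ... 11-July/003466.html
It looks like a problem of using too many nodes.
Run clean -a, then reduce the number of nodes and resubmit?

”
“

green_tea789 posted on 2013-12-11 16:12

I pasted your error message into a web search and found this:
http://mailman.cse.ohio-state.ed ... pich-discuss/2011-J ...

I asked around; it may be that something on the cluster side was not set up properly, so for now I just went with three nested domains and already have results. Thanks anyway~~”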
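Judging from those replies, the problem was never really solved. My 9-day test case ran on 64 cores without any error. The official WRF guidance does say that each process's share of the grid should not be smaller than 15 × 15 grid points, because beyond that extra cores do not improve efficiency, but this failure is still rather baffling. Is there anyone experienced who could explain it?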
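To put the 15 × 15 guideline in numbers for this grid, here is a rough sketch. It assumes the tasks are laid out in a near-square arrangement with the larger factor along the longer grid side; WRF's actual decomposition may differ, and the function names are made up for illustration.

```python
import math

def near_square_factors(n_tasks):
    """Return (small, large) with small * large == n_tasks, as close to square as possible."""
    small = math.isqrt(n_tasks)
    while n_tasks % small != 0:
        small -= 1
    return small, n_tasks // small

def patch_size(e_we, e_sn, n_tasks):
    """Approximate grid points per task in x and y (assumed layout, not WRF's own code)."""
    small, large = near_square_factors(n_tasks)
    # Give the larger factor to the longer grid dimension.
    tasks_x, tasks_y = (large, small) if e_we >= e_sn else (small, large)
    return tasks_x, tasks_y, e_we // tasks_x, e_sn // tasks_y

for n in (16, 32, 64):
    tx, ty, nx, ny = patch_size(83, 59, n)
    print(f"{n} tasks: {tx} x {ty} layout, about {nx} x {ny} points per task")
```

Under these assumptions, 64 tasks on the 83 × 59 grid leave roughly 10 × 7 points per task, which is below 15 × 15, while 16 tasks leave about 20 × 14. The guideline is about efficiency, so this alone does not explain a crash.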

The error messages are below.
1. Contents of file real.e67877:
[proxy:0:0@c02b13] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:0@c02b13] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@c02b13] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[proxy:0:1@c03b13] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:1@c03b13] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@c03b13] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[proxy:0:3@c03b08] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:3@c03b08] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@c03b08] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@c02b13] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@c02b13] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@c02b13] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@c02b13] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
vncserver: The HOME environment variable is not set.
No such service
No such service
2. Contents of file real.o67877:
mpdboot_c02b13 (handle_mpd_output 420): from mpd on c02b13, invalid port info:
no_port

=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 256
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_lkaikun); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
3. WRF output (rsl.out.0000 and rsl.error.0000):
d01 2011-01-16_06:00:00 Input data processed for aux input 4 for domain 1
d01 2011-01-16_12:00:00 Input data processed for aux input 4 for domain 1
d01 2011-01-16_18:00:00 Input data processed for aux input 4 for domain 1
d01 2011-01-17_00:00:00 Input data processed for aux input 4 for domain 1

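I am checking the data for the 27th. Judging from ncl ../util/plotfmt.ncl 'filename="fnl:2011-01-27_12"', the file generated by ungrib looks fine. I will add further verification results later. Discussion is welcome; I hope this can be solved, since I am apparently not the only one who has run into it.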

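Possible causes:
(1) I just ran a one-day simulation for 27 January 2011 on a single core on the machine. When the model reached about 11:00, the following error appeared: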
Timing for main: time 2011-01-27_10:40:00 on domain 1: 2.59792 elapsed seconds
Timing for main: time 2011-01-27_10:48:00 on domain 1: 3.07335 elapsed seconds
Timing for main: time 2011-01-27_10:56:00 on domain 1: 2.58318 elapsed seconds
Timing for main: time 2011-01-27_11:04:00 on domain 1: 2.57906 elapsed seconds
BalanceCheck: solar radiation balance error nstep =       84 point =    3 imbalance = -0.102527 W/m2
fsa       = 0.000000000000000E+000
fsr       = 102.424089480400
forc_solad(1)= 29.4312877655029
forc_solad(2)= 54.7932929992676
forc_solai(1)= 14.9033489227295
forc_solai(2)= 3.39868640899658
forc_tot    = 102.526616096497
clm model is stopping
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    141
-------------------------------------------
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    141
-------------------------------------------
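Does this mean a solar radiation balance error? Is it caused by a bad download of the original data, or by an error in the ungrib step? The data came from the cluster's shared data set, which is frustrating. I re-downloaded the data for that period and reran; the error was identical to the one above, so the data is ruled out.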
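The numbers printed in the BalanceCheck message can be re-added by hand. The sketch below assumes the relation imbalance = fsa + fsr - forc_tot, with forc_tot the sum of the four forc_sola* terms; that relation is inferred from the logged values (it reproduces them), not taken from the CLM source, and the function name is only illustrative.

```python
def solar_imbalance(fsa, fsr, forc_solad, forc_solai):
    """Return (imbalance, forc_tot) under the assumed relation
    imbalance = fsa + fsr - (sum of direct + sum of diffuse incoming solar)."""
    forc_tot = sum(forc_solad) + sum(forc_solai)
    return fsa + fsr - forc_tot, forc_tot

# Values copied from the single-core log at nstep = 84, point = 3.
imbalance, forc_tot = solar_imbalance(
    fsa=0.0,
    fsr=102.424089480400,
    forc_solad=(29.4312877655029, 54.7932929992676),
    forc_solai=(14.9033489227295, 3.39868640899658),
)
print(f"forc_tot  = {forc_tot:.6f} W/m2")
print(f"imbalance = {imbalance:.6f} W/m2")
```

So the message is internally consistent: at that point fsa is zero while fsr is about 0.1 W/m2 less than the incoming total, and that gap is what gets reported as the imbalance.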
(2) A problem in the namelist. Comparing my settings against README.namelist, I found:
radt (max_dom)                   = 30,    ; minutes between radiation physics calls
                                       recommend 1 min per km of dx (e.g. 10 for 10 km)
ra_sw_physics (max_dom)          shortwave radiation option
                                 = 3, cam scheme also must set levsiz, paerlev, cam_abs_dim1/2 (see below)
cam_abs_freq_s             = 21600 default CAM clearsky longwave absorption calculation frequency
                                          (recommended minimum value to speed scheme up)
levsiz                            = 59 for CAM radiation input ozone levels, set automatically
paerlev                            = 29 for CAM radiation input aerosol levels, set automatically
cam_abs_dim1                      = 4 for CAM absorption save array, set automatically
cam_abs_dim2                      = value of e_vert for CAM 2nd absorption save array, set automatically
After checking this, I changed radt from the original 30 to 81 (for 81 km), added the parameters listed above, and reran. It failed with:
Timing for main: time 2011-01-27_07:20:00 on domain 1: 2.86261 elapsed seconds
Timing for main: time 2011-01-27_07:28:00 on domain 1: 2.91984 elapsed seconds
Timing for main: time 2011-01-27_07:36:00 on domain 1: 2.85847 elapsed seconds
Timing for main: time 2011-01-27_07:44:00 on domain 1: 2.75176 elapsed seconds
BalanceCheck: solar radiation balance error nstep =       59 point =    3 imbalance = -0.119937 W/m2
fsa       = 0.000000000000000E+000
fsr       = 119.816595488548
forc_solad(1)= 32.8153114318848
forc_solad(2)= 59.7555809020996
forc_solai(1)= 22.1203804016113
forc_solai(2)= 5.24525928497314
forc_tot    = 119.936532020569
clm model is stopping
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    141
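Analysis: the crash now happens earlier, so it seems tied to these namelist settings and is probably related to CAM and CLM.
Then I found this thread: https://forum.cgd.ucar.edu/camoml-e-compset-run-time-error-pacemaker-experiment-soil-balance-error
It describes a CLM run-time error, but on closer reading it turns out to be a qfluxes problem in the ocean module. I then noticed that the namelist has: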
For non-zero mp_physics options, to keep Qv .GE. 0, and to set the other moisture
fields .LT. a critical value to zero
mp_zero_out                = 0,    ; no action taken, no adjustment to any moist field
                                 = 1,    ; except for Qv, all other moist arrays are set to zero
                                             ; if they fall below a critical value
                                 = 2,    ; Qv is .GE. 0, all other moist arrays are set to zero
                                             ; if they fall below a critical value
mp_zero_out_thresh                = 1.e-8 ; critical value for moist array threshold, below which
                                             ; moist arrays (except for Qv) are set to zero (kg/kg)
Tried again... no luck. Could someone experienced please take a look? My namelist.input is below.

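With the same test setup, the run succeeds once the land surface scheme is switched to SSiB, so the problem does seem to be related to CLM. Has anyone run successfully with the CLM scheme and could share a namelist for reference?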

&time_control
run_days                         = 00,
run_hours                         = 18,
run_minutes                      = 0,
run_seconds                      = 0,
start_year                         = 2011, 2011, 2000,
start_month                      = 01, 01, 01,
start_day                         = 27, 01, 24,
start_hour                         = 00, 00, 12,
start_minute                      = 00, 00, 00,
start_second                      = 00, 00, 00,
end_year                         = 2011, 2011, 2000,
end_month                         = 01, 08, 01,
end_day                            = 28, 31, 25,
end_hour                         = 00, 18, 12,
end_minute                         = 00, 00, 00,
end_second                         = 00, 00, 00,
interval_seconds                   = 21600
input_from_file                   = .true.,.true.,.true.,
history_interval                   = 180,  60, 60,
history_outname                   ='./wrfout/wrfout_d<domain>_<date>',
frames_per_outfile                = 240, 1000, 1000,
restart                            = .false.,
restart_interval                   = 21600,
auxinput4_inname                   = "wrflowinp_d<domain>"
auxinput4_interval                = 360,360,360,
io_form_auxinput4                = 2
io_form_history                   = 2
io_form_restart                   = 2
io_form_input                      = 2
io_form_boundary                   = 2
debug_level                      = 0
/
&domains
time_step                         = 480,
time_step_fract_num                = 0,
time_step_fract_den                = 1,
max_dom                            = 1,
e_we                               = 83, 286, 94,
e_sn                               = 59, 145, 91,
e_vert                            = 30, 30, 30,
p_top_requested                   = 5000,
num_metgrid_levels                = 27,
num_metgrid_soil_levels          = 4,
dx                               = 81000, 27000,  3333.33,
dy                               = 81000, 27000,  3333.33,
grid_id                            = 1,    2,    3,
parent_id                         = 1,    1,    2,
i_parent_start                   = 1,    73, 30,
j_parent_start                   = 1,    30, 30,
parent_grid_ratio                = 1,    3,    3,
parent_time_step_ratio             = 1,    3,    3,
feedback                         = 0,
smooth_option                      = 0
/
&physics
mp_physics                         = 6,    6,    3,
ra_lw_physics                      = 3,    3,    1,
ra_sw_physics                      = 3,    3,    1,
radt                               = 80, 30, 30,
sf_sfclay_physics                = 1,    1,    1,
sf_surface_physics                = 5,    5,    2,
bl_pbl_physics                   = 1,    1,    1,
bldt                               = 0,    0,    0,
cu_physics                         = 1,    1,    0,
cudt                               = 5,    5,    5,
isfflx                            = 1,
ifsnow                            = 1,
icloud                            = 1,
surface_input_source             = 1,
num_soil_layers                   = 10,
sf_urban_physics                   = 0,    0,    0,
mp_zero_out                      = 2,
mp_zero_out_thresh                = 1.e-8,
sst_update                         = 1,
cam_abs_freq_s                   = 21600,
levsiz                            = 59,
paerlev                            = 29,
cam_abs_dim1                      = 4,
cam_abs_dim2                      = 30
/
&fdda
/
&dynamics
w_damping                         = 0,
diff_opt                         = 1,    1,    1,
km_opt                            = 4,    4,    4,
diff_6th_opt                      = 0,    0,    0,
diff_6th_factor                   = 0.12, 0.12, 0.12,
base_temp                         = 290.
damp_opt                         = 0,
zdamp                            = 5000.,  5000.,  5000.,
dampcoef                         = 0.2, 0.2, 0.2
khdif                            = 0,    0,    0,
kvdif                            = 0,    0,    0,
non_hydrostatic                   = .true., .true., .true.,
moist_adv_opt                      = 1,    1,    1,
scalar_adv_opt                   = 1,    1,    1,
/
&bdy_control
spec_bdy_width                   = 5,
spec_zone                         = 1,
relax_zone                         = 4,
specified                         = .true., .false.,.false.,
nested                            = .false., .true., .true.,
/
&grib2
/
&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
/
-------------------------------------------------------------------- Divider --------------------------------------------------------------------
Email exchange with wrfhelp:

Would it help if you make the time step a bit shorter, say from 480 sec to 400 sec?

You can also restart the model, say, from hour 10, and write model output very frequently and examine the output to see if you can find any clue.

wrfhelp

On Wed, Jul 1, 2015 at 12:45 AM, lhaikun <lhaikun@163.com> wrote:

Hi,
I am having a problem running WRF 3.6.1 with the CLM land model: the integration stops at about 11:00 after starting from 00:00 UTC. Initially I did a long run (more than 1 month, with dx = 81 km) of WRF 3.6.1 coupled with CLM, and it crashed on the 27th day with the following error:
[proxy:0:0@c02b13] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:0@c02b13] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@c02b13] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[proxy:0:1@c03b13] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:1@c03b13] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@c03b13] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[proxy:0:3@c03b08] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
[proxy:0:3@c03b08] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@c03b08] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@c02b13] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@c02b13] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@c02b13] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@c02b13] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
vncserver: The HOME environment variable is not set.
No such service
No such service
I don't know the reason. Someone said the number of cores matters; however, I ran the first 9 days of my simulation with 16/32/64 cores, which shows that the core count is not the cause. Then I simulated the day on which I ran into trouble and got the following error:
Timing for main: time 2011-01-27_10:40:00 on domain 1: 2.59792 elapsed seconds
Timing for main: time 2011-01-27_10:48:00 on domain 1: 3.07335 elapsed seconds
Timing for main: time 2011-01-27_10:56:00 on domain 1: 2.58318 elapsed seconds
Timing for main: time 2011-01-27_11:04:00 on domain 1: 2.57906 elapsed seconds
BalanceCheck: solar radiation balance error nstep = 84 point = 3 imbalance = -0.102527 W/m2
fsa = 0.000000000000000E+000
fsr = 102.424089480400
forc_solad(1)= 29.4312877655029
forc_solad(2)= 54.7932929992676
forc_solai(1)= 14.9033489227295
forc_solai(2)= 3.39868640899658
forc_tot = 102.526616096497
clm model is stopping

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 141
-------------------------------------------
I noticed that when I change 'radt = 80, 30, 30,' the time at which the error occurs changes. The problem does not occur when I use SSIB. I have tried everything I could think of, but this is beyond my knowledge. I am attaching my namelist.input for your convenience. I am in desperate need of advice; any help would be greatly appreciated!
Data used: FNL (grib1), RTG_SST
Thanks

lhaikun@163.com · Posted 2015-6-30 20:35:02

Bumping my own thread so it doesn't sink.

say苹果有虫 · Posted 2015-6-30 20:50:06

Are you from OUC, by any chance?

lhaikun@163.com · Posted 2015-6-30 20:53:07

say苹果有虫 posted on 2015-6-30 20:50
Are you from OUC, by any chance?

Yes. Do we know each other?

andrewsoong · Posted 2015-6-30 21:28:26

time_step = 480: try making this a bit smaller.

lhaikun@163.com · Posted 2015-6-30 21:33:35

andrewsoong posted on 2015-6-30 21:28
time_step = 480: try making this a bit smaller.

Isn't the recommended value 6 × dx? Never mind, I'll try it first.
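For reference, the 6 × dx rule of thumb mentioned here (time_step in seconds, dx in km) works out as follows. This is only a quick arithmetic sketch; the function name is made up, and the factor of 6 is a commonly quoted upper bound rather than a guarantee of stability.

```python
def max_time_step_seconds(dx_m, factor=6.0):
    """Rule-of-thumb upper bound: time_step [s] <= factor * dx [km]."""
    return factor * dx_m / 1000.0

bound = max_time_step_seconds(81000)  # dx from the namelist, in metres
print(f"upper bound for dx = 81 km: {bound} s")

# Values discussed in this thread: the original 480 s, wrfhelp's 400 s, and 240 s.
for ts in (480, 400, 240):
    print(ts, "within bound" if ts <= bound else "exceeds bound")
```

All three values sit within the bound for dx = 81 km, so by this rule alone 480 s is not obviously too large.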

lhaikun@163.com · Posted 2015-6-30 21:42:01

andrewsoong posted on 2015-6-30 21:28
time_step = 480: try making this a bit smaller.

Changed it to 240; the error is the same.

andrewsoong · Posted 2015-6-30 22:00:04

lhaikun@163.com posted on 2015-6-30 21:42
Changed it to 240; the error is the same.

81000, 27000, 3333.33: shouldn't this be 1:3:3? Yours looks like 1:3:9.

lhaikun@163.com · Posted 2015-6-30 22:08:46

andrewsoong posted on 2015-6-30 22:00
81000, 27000, 3333.33: shouldn't this be 1:3:3? Yours looks like 1:3:9.

I'm only using one domain, no nesting...
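The point about the dx columns can be checked mechanically. The sketch below compares each nest's dx with its parent's dx divided by parent_grid_ratio, using the values from the namelist above. The helper name and the tolerance are made up for illustration, and it relies on the assumption that only the first max_dom columns are used by the model.

```python
def check_nest_dx(dx, parent_id, parent_grid_ratio, max_dom=None, rel_tol=1e-3):
    """Return a list of (domain, dx, expected_dx) for nests whose dx does not
    equal parent dx / parent_grid_ratio. Only the first max_dom domains are checked."""
    n = len(dx) if max_dom is None else max_dom
    mismatches = []
    for i in range(1, n):  # index 0 is domain 1, which has no parent
        parent = parent_id[i] - 1
        expected = dx[parent] / parent_grid_ratio[i]
        if abs(dx[i] - expected) > rel_tol * expected:
            mismatches.append((i + 1, dx[i], expected))
    return mismatches

dx = [81000.0, 27000.0, 3333.33]
parent_id = [1, 1, 2]
parent_grid_ratio = [1, 3, 3]

print(check_nest_dx(dx, parent_id, parent_grid_ratio))             # all three columns
print(check_nest_dx(dx, parent_id, parent_grid_ratio, max_dom=1))  # as in this run
```

With all three columns, domain 3 is flagged (3333.33 m against an expected 9000 m; 27000 / 3333.33 is about 8.1). With max_dom = 1 nothing is checked, which matches the reply that only one domain is in use.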

andrewsoong · Posted 2015-6-30 22:19:08

lhaikun@163.com posted on 2015-6-30 22:08
I'm only using one domain, no nesting...

Then I don't see the problem yet.
