Thanks very much for the post; let's hope we can figure it out. I think the A/R dropped because of the warning (W), not the
Just for reference, the probe results are at https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?host=bdii.core.wits.ac.za and the current result of the BDII Validation probe shows that there are a few info messages, but the one which caused the problem was warning
The history of the probe shows that the error was fixed around 10:20 on 20/02 :
So, this is what was causing the drop in A/R, not the current result of the bdii probe.
Let's take a look at those :
This may be a historical setting - we should ask the ARGO team why it has been set to this level and whether it makes sense for them to change it.
Yes, I confirm that this is what is expected to be correct. If ARGO is throwing an error (even if it's not critical) perhaps we need to get in touch with the ARGO team about that.
I guess this is the "crux of the biscuit", to paraphrase the bard. This is clearly a Torque configuration issue. Torque information is passed to the BDII via the dynamic information provider on the CE :
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce - poking into that might bring some enlightenment, but doesn't directly answer the question of "what is meant by the maximum total jobs?".
The naive answer is that the maximum total jobs is the sum of what qstat -a gives you - running + waiting jobs in all queues. You mentioned before that
I have a cron job that runs at 6pm that increases max_running to 100 for a while.
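To make the naive definition concrete (running plus waiting jobs across all queues), here is a rough Python sketch that tallies job states from qstat -a style output. The sample text is made up, and the parser assumes the usual Torque layout where the job state is the second-to-last column; check that against what your qstat actually prints before trusting the numbers.

```python
from collections import Counter

# Hypothetical sample in the usual `qstat -a` column layout (not real site data).
SAMPLE = """\
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time      S Time
-------------------- -------- -------- ---------- ------ --- --- ------ --------- - ---------
101.ce.example       alice    short    sim1        4242    1   1    --  01:00:00  R  00:10:00
102.ce.example       bob      short    sim2          --    1   1    --  01:00:00  Q       --
103.ce.example       carol    long     fit1        4311    1   1    --  24:00:00  R  02:15:00
104.ce.example       dave     long     fit2          --    1   1    --  24:00:00  H       --
"""


def count_states(qstat_text):
    """Count jobs per state letter (R, Q, H, ...) in qstat -a style text."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip headers, separators and blank lines: job lines are assumed to
        # have 11 columns and a job id that starts with a digit.
        if len(fields) < 11 or not fields[0][0].isdigit():
            continue
        # Assumption: the state column is second to last.
        counts[fields[-2]] += 1
    return counts


def naive_total(counts):
    """Naive 'total jobs': running (R) plus waiting/queued (Q)."""
    return counts["R"] + counts["Q"]


if __name__ == "__main__":
    states = count_states(SAMPLE)
    print(dict(states), "naive total:", naive_total(states))
```

Held jobs (H) are left out of the total here; whether they should count is exactly the kind of detail the plugin source would settle.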
Then, Bouchra mentioned :
ldapsearch -x -h bdii.core.wits.ac.za -p 2170 -b o=glue |grep GLUE2ComputingShareMax | grep Jobs
I think it's referring to the first 3 queues
Which I would agree with, if we could figure out where the error is coming from. Perhaps try looking at the dynamic information plugin source code.
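While digging, it may help to see the published limits grouped per share rather than as a flat grep. The sketch below is only an illustration: it assumes you save the output of the ldapsearch query above without the greps (so the dn lines are kept) and that long lines are not wrapped. The sample LDIF, its DNs and its values are invented. The MaxTotalJobs versus MaxRunningJobs comparison is just a guess at something worth eyeballing, not a statement of what the validation probe checks.

```python
import re
from collections import defaultdict

# Hypothetical LDIF fragment; DNs and values are invented for illustration.
SAMPLE_LDIF = """\
dn: GLUE2ShareID=short_share,GLUE2ServiceID=ce.example,o=glue
GLUE2ComputingShareMaxTotalJobs: 100
GLUE2ComputingShareMaxRunningJobs: 50
GLUE2ComputingShareMaxWaitingJobs: 50

dn: GLUE2ShareID=long_share,GLUE2ServiceID=ce.example,o=glue
GLUE2ComputingShareMaxTotalJobs: 20
GLUE2ComputingShareMaxRunningJobs: 40
"""

# Matches e.g. "GLUE2ComputingShareMaxTotalJobs: 100" and captures "TotalJobs".
ATTR = re.compile(r"^GLUE2ComputingShareMax(\w*Jobs):\s*(\d+)\s*$")


def limits_by_share(ldif_text):
    """Return {share dn: {limit name: value}} for the Max*Jobs attributes."""
    shares = defaultdict(dict)
    current = None
    for line in ldif_text.splitlines():
        if line.startswith("dn:"):
            current = line[3:].strip()
            continue
        match = ATTR.match(line)
        if match and current is not None:
            shares[current][match.group(1)] = int(match.group(2))
    return dict(shares)


def total_below_running(limits):
    """Shares whose MaxTotalJobs is smaller than MaxRunningJobs (a guess at a useful check)."""
    return [
        dn
        for dn, values in limits.items()
        if "TotalJobs" in values
        and "RunningJobs" in values
        and values["TotalJobs"] < values["RunningJobs"]
    ]


if __name__ == "__main__":
    limits = limits_by_share(SAMPLE_LDIF)
    for dn, values in limits.items():
        print(dn, values)
    print("worth a second look:", total_below_running(limits))
```

If the cron job raises max_running in Torque while the total limit stays where it was, a comparison like this would show it straight away.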