From: Philip Chase <philipbchase@gmail.com>
Subject: condor_qsub sentinel jobs intermittently release dependencies early

I am seeing a transient failure in condor jobs dependencies.  The sentinel jobs
created by condor_qsub are releasing dependencies early in some cases.  I
adding logging to the sentinel job scripts and can see the count of "hold_jids"
returned by the following line drop dramatically to zero

  condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l

The sentinel job then releases the held job and exits.  As the count of
hold_jids is in reality non-zero, those jobs proceed to normal completion, but
out of sequence.

Repeated tests of condor_q within to condor_qsub sentinel jobs allows detection
of the transient and continuation of the holds.  This patch provides some
relief.

A test of 6 simultaneous FSL bedpostx was run on a condor pool composed of 64
slots spread over 8 nodes.  These 6 runs generated 514 jobs.  Of these, 12 were
sentinel jobs.  The sentinel jobs tested condor_q every 5 seconds recording
13070 tests.  In 18 of these tests condor_q failed to respond correctly.   At
most, two consecutive failures were found.  The patch above was able to correct
the problem in each case.

--- a/src/condor_scripts/condor_qsub
+++ b/src/condor_scripts/condor_qsub
@@ -267,7 +267,14 @@
 }
 
 # as long as there are relevant job in the queue wait and try again
-while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 ]; do
+counter=3
+while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 -o \$counter -ge 1 ]; do
+	job_count=\$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l)
+	if [ \$job_count -eq 0 ]; then
+		counter=\$((counter-1))
+	else
+		counter=3
+	fi
 	sleep 5
 done
 
