TO: All GCN Customers
RE: Delayed and lost Notices and Circulars
DT: 14-Jan-09

The GCN computer experienced a heavy load factor problem starting at 18:48 UT (13-Jan-09) and lasting for a few hours. The load factor was caused by the confluence of six factors:

1) Both Fermi-GBM and -LAT triggered on GRB 090113, which resulted in their full complements of TDRSS messages to GCN (and the resulting generation and distribution of those notices).

2) Swift-BAT also triggered on this burst, but it happened during a Malindi telemetry downlink pass. As such, the real-time TDRSS messages for this burst were delayed in an on-board holding buffer until the end of the telemetry pass. Once the pass ended, all the messages from BAT, XRT, and UVOT for that burst came down to GCN in a highly time-compressed sequence. There was not the usual few to 10's of seconds between messages in which to generate and distribute each notice before the next message arrived.

3) The Swift and ROTSE teams submitted their usual rapid-response Circulars to GCN. The Circulars are processed on the same computer as the Notices.

4) Nine minutes after the TDRSS messages from the GRB trigger arrived at GCN, Swift-BAT triggered on an outburst from the HMXB source 1A 1118-61, which generated a second series of TDRSS messages to GCN.

5) Eleven minutes after that second trigger, BAT triggered again on the 1A 1118-61 source, generating a third sequence of TDRSS messages to GCN. These second and third sequences had to be processed while the first sequence was still being processed.

6) And throughout the arrival of these messages from the 5 instruments on the 2 spacecraft for the 3 events, the Swift, AGILE, and INTEGRAL missions were producing a somewhat higher than normal rate of Pointing Direction messages.

These multiple converging streams of messages drove the load factor on the GCN computer into the 90's. The normal load factor during a Swift burst is 3-5 for about 10 minutes.
Once the load factor goes above 8, the sendmail daemon temporarily suspends distribution of the emails being generated for all these Notices. Normally I would have quickly noticed the high load factor and taken steps to mitigate the delays in Notice distribution, but I was in a meeting for about two hours, so the high load factor and email suspension lasted a long time. Once I noticed the problem, I started mitigation procedures (mostly just manually suspending most of the 500+ processes spawned to generate and distribute the Notices from these 5 instruments and 3 triggers, so that the remaining active processes would have CPU cycles to complete their processing). After about an hour, I got the load factor down to the 2-5 range. Sendmail was just resuming distribution of the email Notices when I accidentally sent a wrong load-control command that killed the parent process. All 500+ child processes also died, resulting in the loss to GCN customers of an unknown fraction of the Notices generated for these 3 events. My apologies to everyone who did not get their copies of some of the Notices and Circulars.

I have already implemented an improvement to the dynamic load-control program running within GCN that will help reduce this "perfect storm" scenario. It will reduce the duration of such an episode from hours to 20-30 minutes, thus reducing the probability that multiple triggers will occur within an overlapping time window. While there have already been several simultaneous Fermi-and-Swift burst detections without any load factor problems, it was the time-bunching of the Swift messages from the first BAT trigger plus the extra messages from the second and third BAT triggers that pushed GCN over the edge. I am working on ways to handle that special scenario as well.

Again, my apologies for the lost notices and the lost opportunities.
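For readers curious how a dynamic load-control scheme like the one described above can work, here is a minimal, hypothetical sketch in Python. It is not GCN's actual code; the thresholds, the `MAX_ACTIVE` limit, and the function names are illustrative assumptions. The idea is the one in the text: when the load average climbs past the point where sendmail stops delivering, pause (SIGSTOP, not kill) all but a few worker processes so the active ones can finish, then resume (SIGCONT) the paused ones as the load drops.

```python
import os
import signal

# Illustrative thresholds, not GCN's actual values.
SUSPEND_LOAD = 8.0   # sendmail suspends email delivery above this load
RESUME_LOAD = 5.0    # resume paused workers once load falls back here
MAX_ACTIVE = 4       # workers allowed to keep running while throttled

def plan_throttle(load, running, suspended):
    """Decide one control step.

    load      -- current 1-minute load average (e.g. os.getloadavg()[0])
    running   -- PIDs of currently running worker processes
    suspended -- PIDs of workers already paused with SIGSTOP
    Returns (to_suspend, to_resume) lists of PIDs.
    """
    if load > SUSPEND_LOAD and len(running) > MAX_ACTIVE:
        # Keep the first MAX_ACTIVE workers running; pause the rest.
        return running[MAX_ACTIVE:], []
    if load < RESUME_LOAD and suspended:
        # Wake one paused worker at a time to avoid a fresh load spike.
        return [], suspended[:1]
    return [], []

def apply_plan(to_suspend, to_resume):
    for pid in to_suspend:
        os.kill(pid, signal.SIGSTOP)   # pause the process; do NOT terminate it
    for pid in to_resume:
        os.kill(pid, signal.SIGCONT)   # let a paused process continue
```

The key design point, and the lesson of the accident described above, is that the control action is SIGSTOP/SIGCONT on child processes: pausing is reversible, whereas killing the parent takes all 500+ children (and their undelivered notices) with it.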