GCN/TAN System Status
When something happens to the system that causes a loss of service or odd behavior,
then status information will be posted here.
17 Mar 2017 ONE OF THE VOEVENT SERVERS WAS DOWN
One of the Voevent servers (184.108.40.206, port 8092 (for the Ver 1.1 private-phase LVC stream))
was down from 09:58 to 15:40 UT (5.7 hrs) due to a failure
in the restart crontab entry when the service provider rebooted all their machines
for a software upgrade action. (The other two servers on that same IP_num restarted
within a couple minutes.)
I will take this opportunity to remind people that there are a total of 9 voevent servers available within GCN.
There are multiple servers of each of the 2 versions (1.1 & 2.0) of the voevents,
and there are multiple servers for the regular GCN Notices and for the private-phase LVC notice types.
Please go to https://gcn.gsfc.nasa.gov/voevent.html - there is a table near the top of that page.
01 Feb 2017 SYSTEM DOWN DUE TO UPS REPLACEMENT
The entire GCN system (Notice & Circulars, computers, router, UPS) was taken down
to replace the UPS (that did not work during the 26jan17 outage). The system
was down from 17:33 to 17:40 UT (7 minutes).
26 Jan 2017 SYSTEM DOWN DUE TO 2 POWER GLITCHES
A power glitch at GSFC caused GCN and the networks to go down for 15 minutes;
from 19:42 to 19:57 UT. Everything is back to normal.
And then for 4 minutes from 20:25 to 20:29 UT; all is back to normal (again).
17 Jan 2017 FIXED PROBLEM WITH SNEWS IMPORTING
The importation of the SNEWS messages has been broken since sometime between 08 and 15 Nov 2016.
There are many tests to validate incoming messages from the various mission/instrument suppliers.
One of these tests is to see if the size of the file is within the proper size range. The min max limits
were set up back in mid 2015. But sometime between 8-15 Nov 2016, the size almost doubled
and hence the files were being rejected by this size test. I have checked the files and they still look good,
so I have opened up the size range test to allow the new size to be accepted.
So now the weekly "test" version of the SNEWS notices should go out every Tue at noon (US Eastern Time Zone);
and the "real" versions as well (at whatever time they occur).
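A size-range test of the kind described above can be sketched as follows. This is only an illustration, not the actual GCN validation code: the function name and all byte limits are hypothetical, chosen to show how a file that "almost doubled" fails the old limits and passes once the range is opened up.

```python
def size_ok(size_bytes: int, min_bytes: int, max_bytes: int) -> bool:
    """Return True if a file size falls inside the accepted range (inclusive)."""
    return min_bytes <= size_bytes <= max_bytes


# Hypothetical numbers, for illustration only.
OLD_LIMITS = (1000, 3000)   # limits as originally configured
NEW_LIMITS = (1000, 6000)   # limits after opening up the range
old_size = 2000             # typical message size before the change
new_size = 3900             # size after it almost doubled

print(size_ok(old_size, *OLD_LIMITS))  # accepted under the old limits
print(size_ok(new_size, *OLD_LIMITS))  # rejected under the old limits
print(size_ok(new_size, *NEW_LIMITS))  # accepted under the widened limits
```

A check like this rejects silently unless the rejection is also reported somewhere, which is presumably why the breakage went unnoticed for a while.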
08-Jan-2017 21:41 UT PROBLEM WITH 1 OF THE 9 VOEVENT SERVERS
The problem with IPnum=220.127.116.11, port 8096 LVC-enabled (VOEvents ver=2.0) not distributing VOEvents has been fixed.
There was a problem with IPnum=18.104.22.168, port 8096 LVC-enabled (VOEvents ver=2.0). It was NOT distributing VOEvents.
This was a problem ONLY with the LVC-enabled/Private-Phase voevent server; the other two Ver2.0/LVC-enabled servers are fine.
(and the public VOevent/2.0, 22.214.171.124:8099 servers and the Private VOEvent/1.1, 126.96.36.199:8092 servers are all operating fine).
(If you are receiving GCN Notices by email or 160-byte-binary packet, there are no problems. This was only with this
one voevent server, and only the Ver2.0 voevents/LVC-Private phase.)
04 Jan 2017 PROBLEMS WITH THE LVC DISTRIBUTIONS TODAY
There were problems with the distribution of the LVC Notices today -- mostly with the VOEvent format,
but to a lesser extent with the email and 160-byte-binary socket methods.
For the VOEvents, there was a syntax problem with one of the Param fields (an extra '/'), and then when new executables
were started, the disappearance of 3 fields for those LVC types
caused some customers to have problems with their ingest/parsing of the voevents. LVC has removed these 3 fields
from what is sent to GCN, and as such, they have been removed from what is distributed to the clients.
As a result of these problems and software upgrades, an LVC_Initial notice was distributed 3 times
(the first two had the two problems and then the 3rd distribution was correct, albeit sometimes incompatible
with the receiving end software). I apologise for the repetition and the mistakes and the confusion this caused.
17 Dec 2016 INTERMITTENT OUTAGES DUE TO GSFC NETWORK UPGRADES
During the period 18:41 - 19:14 UT there were connection outages while the GSFC IT people
upgraded the internet routers. There were a total of 16 outages each lasting about 17-20 sec.
12 Oct 2016 ACCIDENTAL DISTRIBUTION OF AN OLD SWIFT BURST
During a malformed test session while porting GCN from the current/old machine to a new/fast machine,
a domain in a top-level email command caused the emails from a playback test (of a very old burst) on the new machine
to use the procmail lists for yesterday's burst on the old machine. My apologies for the spam.
Only email-based recipients were affected; the socket and voevent recipients did not receive this playback spam.
10 OCT 2016 INTERMITTENT EMAIL DELAYS
Intermittent problems with the GSFC network are introducing delays for emailed Notices/Circulars in the 5 min - 3 hrs range (03:00 UT 10oct16).
In between the intermittent problems the delivery times are in the 0-15 sec range. Socket connections are not affected.
The problem stopped about 8 hrs later.
24 SEP 16 ALL OF GCN OFFLINE FOR 2.82 HOURS
As announced, GCN was offline from 16:05 to 18:54 UT (2.82 hrs duration) 24 Sep 2016
due to planned upgrades to the Goddard network.
16 SEP 16 THE ATLANTIC-BASED VOEVENT SERVERS OFFLINE FOR 11.3 HOURS
The 3 GCN VOEvent servers hosted on Atlantic.com (East-coast) were off-line from 04:34 to 15:51 UT.
This includes the public ver2.0 server, the public ver1.1 server, and the LVC_private ver2.0 server.
(The West-coast Atlantic public ver2.0 server did not have an outage,
nor did the 2 Linode-based servers and the 3 eApps-based servers.)
08 Jun 16 GCN NOTICES WAS OFFLINE FOR 26 MINUTES
GCN Notices were offline from 23:27 to 23:53 UT
due to a newly introduced s/w bug. A Fermi-GBM trigger was lost.
The bug has been fixed.
The Circulars portion of GCN was not affected.
06 May 16 GCN WAS OFFLINE FOR ~3 HOURS
GCN Notices and Circulars were offline from 23:NN to 02:35 UT.
A problem with file production filled up the disk partition that email uses,
so the sendmail daemon stopped until the freespace was restored.
(The producer of the offending diskspace (the Goddard IT automatic security updates)
was restructured to prevent future occurrences.)
There were no bursts or transients during this distribution blockage.
20-22 Feb 16 GCN WILL BE INTERMITTENT THIS WEEKEND
There may be interruptions in the internet connectivity with GCN this weekend (Fri evening to Sunday day, local time).
(Feb 20, 01:00 UTC to February 22 08:00 UTC)
The GSFC internet is going through a series of Center-wide upgrades.
The plan calls for several/many 1-5 minute outages for inside-Goddard routers and 10-15 min outages for Goddard-to-outside routers.
Socket connections will drop for each interruption, and then will reconnect 1-4 min after the end of the interruption.
Email-based distributions will experience delayed delivery until internet access is re-established.
26 Apr 15 GCN WAS OFFLINE INTERMITTENTLY
Sunday (26 Apr 15) from 18 to 06 UT (Monday, 27 Apr 15) GCN internet access was down
for a few intervals while the major routers were upgraded NASA-wide (all 13 Centers).
There was a 1-2 hour outage interval near the beginning of the 12-hr total window,
and there were a few 3-5 minute outages spread across the rest of the total window.
27 Feb 15 GCN WAS OFFLINE FOR 0.3 HOURS
GCN Notices and Circulars were offline from 23:05 to 23:25 UT.
The main firewall machines at the Goddard fence were being repaired
after suffering some sort of configuration upset somewhat earlier in the day.
They functioned essentially normally during the upset, but did need re-initialization
to clear the upset.
And then the telemetry feed connection for the Swift mission did not reconnect properly,
so I had to bring the GCN system of programs down, and restart them.
28 May 14 GCN WAS OFFLINE FOR 2.7 HOURS
GCN Notices and Circulars were offline from 27may14 22:50 to 28may14 01:33 UT.
A severe thunderstorm took the GSFC gateways to the outside world offline
until these connections were restored. (GCN continued to run; it was just the connections
at the Goddard "fence" that stopped working for a while.)
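The durations quoted in these entries (such as the 2.7 hours above, which spans a UT midnight) can be checked with a short sketch. The helper name is hypothetical, and it assumes the compact 'ddmmmyy hh:mm' stamp style used in this log and English month abbreviations in the running locale.

```python
from datetime import datetime


def outage_hours(start: str, end: str) -> float:
    """Hours between two UT stamps written like '27may14 22:50'."""
    fmt = "%d%b%y %H:%M"  # %b relies on English month names (an assumption)
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    return (t1 - t0).total_seconds() / 3600.0


# The 28 May 14 entry: 22:50 UT on 27may14 to 01:33 UT on 28may14.
print(round(outage_hours("27may14 22:50", "28may14 01:33"), 1))
```

Carrying the date along with the time is what makes the midnight crossing come out right; subtracting clock times alone would give a negative number.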
18 Feb 14 GCN NOTICES WAS OFFLINE FOR 0.9 HOURS
GCN Notices were offline from 13:04 to 13:57 UT. (Circulars continued to operate, but with delays.)
A Goddard-internal network problem broke some of the GCN socket connections to the outside world.
02 Jan 14 VOEvent SERVERS BRIEFLY DOWN FOR SOFTWARE UPGRADES
Each of the 3 GCN/TAN VOEvent servers was briefly taken off-line to install new software.
(Atlantic=188.8.131.52 for ~8 min, Linode=184.108.40.206 for ~5 min, and
eApps=220.127.116.11 for ~15 min (eApps required a full reboot).
All between 15:34 UT and 16:10 UT on 01 Jan 2014.)
The improvements were (a) better information in the logfiles, and (b) adding the preliminary
hooks for the LIGO event handling.
14 Nov 13 SWIFT-BAT SUBSUB_THRESHOLD OUTAGE
The Swift-BAT SubSubThreshold notices were unavailable from 17 Feb to 14 Nov 2013.
A mistake in the auto-processing of the raw files from the Swift Team was introduced
and not discovered until today. I apologize for the mistake.
15 Aug 13 GCN NOTICES WAS OFFLINE FOR 6.5 HOURS
GCN Notices were offline from 05:22 to 11:52 UT. (Circulars continued to operate.)
A Goddard-internal network problem broke some of the inter-GCN socket connections, and
the auto-detect/restart daemon had problems recovering from this network problem.
13 Aug 13 GCN NOTICES WAS OFFLINE FOR 2.3 HOURS
GCN Notices were offline from 18:24 to 20:46 UT. (Circulars continued to operate.)
A network problem broke some of the inter-GCN socket connections, and
the auto-detect/restart daemon had problems recovering from this network problem.
25 Jul 13 Swift-to-GCN NOTICES WAS OFFLINE FOR ~20 HOURS
An operating system upgrade in the computers that feed the real-time Swift TDRSS telemetry
to GCN corrupted the telemetry (loss of Frame-lock). This has now been fixed, and GCN is receiving again.
02 Apr 13 GCN NOTICES WAS OFFLINE FOR 22 MINUTES
GCN Notices were offline from 14:31 to 14:53 UT.
Some sort of (probably internal Goddard) network problem.
03 Mar 13 GCN NOTICES WAS OFFLINE FOR ~3.7 HOURS
GCN Notices were offline from 09:04 to 12:49 UT. (Circulars continued to operate.)
Initial investigations have not revealed the cause -- I am continuing to investigate.
The automatic restart daemon was not able to overcome the problem; manual intervention was required.
12 Nov 12 GCN WAS OFFLINE FOR ~3.3 HOURS
GCN (Notices and Circulars) was offline from ~13:55 to ~17:10 UT.
The Goddard Network people replaced/upgraded the main routers between buildings and to/from outside world.
There was some intermittent operation (partial connections) around the end and resume operations times (+-5 min).
09 Nov 12 GCN WAS OFFLINE FOR 8 MINUTES and 7 MINUTES
GCN (Notices and Circulars) were offline from 12:11 to 12:19 UT
while the Goddard network people tracked down an errant computer
that was generating millions of bogus ethernet packets.
There was a second episode from 16:55 to 17:02 UT.
08 Oct 12 VOTAN ATLANTIC GCN WAS OFFLINE FOR 3.7 DAYS
The Atlantic GCN Votan server was down due to problems at the cloud server from 04oct12 20:08 UT to 08oct12 13:10 UT.
(This is only one of the 3 redundant votan servers GCN has -- the other two (different cloud companies) continued operations just fine.)
Here is the message on the Atlantic status page for my server:
"This email is being sent to you because you may have a Cloud Server or Servers
affected by an outage. At approximately 10PM EDT on October 5th, 2012, one of
our Cloud Storage Clusters encountered an undocumented bug that introduced file
This issue may have caused your Cloud Servers to become unavailable. Because the
bug exists in the storage software itself, failing over to a redundant storage
node exhibited the same behavior. In order to expedite restoring service to our
affected customers, we have brought another Cloud Storage node online.
Initially, we manually restored the affected Cloud Servers from our nightly
backups until our engineers finished automating this specialized recovery
process. As a result, affected Cloud Servers started to come online and continue to do so.
Your cloud server(s) has been restored.
This is the first time we have experienced a problem of this nature and are
working diligently with our vendor to make sure this does not happen again. We
sincerely apologize for the interruption in service you may have experienced and
want to assure you that we are working around the clock to restore service as
quickly as possible. Please be assured that we will follow-up with a conclusive
report as soon as full review has been completed."
This is very uncharacteristic of the Atlantic service. In the ~14 months I have been dealing with Atlantic,
they have had outages at a rate of 1 every 2-3 months and their durations are in the 2-10 hours range. A 3+ day outage
is very anomalous. [The other two GCN votan servers have operated and continue to operate normally.]
08 Sep 12 GCN WAS OFFLINE FOR 3.1 HOURS
Due to severe T-storms in the area, an electrical power glitch caused the GSFC-wide internet routers to go down.
As such, all incoming GRB/Transient information was blocked, and all outgoing socket connections and emails were blocked.
The GCN system (computer and programs) continued to run, but without I/O it is moot.
20:04 UT Routers and Internet lost.
23:13 UT Routers and Internet restored (GCN resumes input and output connections).
02 Jul 12 GCN WAS OFFLINE FOR 24.9 HOURS
Due to severe T-storms in the area, electrical power was lost and GCN went down.
~23:00 UT Fri 29 Jun: T-storms caused power outages in the Maryland, DC, Virginia area (~600,000 homes).
17:28 UT Sat 30 Jun: The Goddard emergency generators stopped, the Bld34-level UPS ran out, and the GSFC internet stopped -- GCN went off-line.
18:07 UT Sun 01 Jul: Power was restored to Goddard.
18:20 UT Sun 01 Jul: GCN was fully up, running, and on-line.
~13:00 - 14:30 UT Mon 02 Jul: GCN was off/on/off/on-line several times
while various secondary services were being brought back up/online in the Bld34 & Goddard
computing and network services.
20 Jun 12 DOUBLE VOEvent IMALIVE's FIXED
A bug in the GCN/TAN VOEvent servers that was causing both versions
(Ver 1.1 and Ver 2.0) of the imalive messages to be sent to all connected VOEvent clients
on all 3 VOEvent servers has been fixed. Now the 18.104.22.168 and 22.214.171.124 servers
send only the Ver1.1 imalive, and the 126.96.36.199 server sends only the Ver2.0 imalive.
(I had thought I fixed this on 28may12, but I was wrong.)
17 Jun 12 (02:15 UT) GCN WAS OFFLINE FOR 52 minutes
At 01:18 UT, GCN stopped communicating with the world.
Preliminary inspections showed nothing obviously wrong,
but since it was not responding to ping, nor could I log in (and current logins were frozen),
I rebooted the system. Fsck took about 30 min since it had not been done for a while.
The suite of GCN programs was restarted at 02:10 UT. GCN is back to full normal operations.
28 May 12 1 of the 3 VOEvent SERVERS OFFLINE
The 188.8.131.52 VOEvent server was not communicating to/from the GCN/TAN central machine
from 08:48 27 May to 14:50 28 May (30.0 hours). The watchdog daemon did not correct the outage.
(This did not affect the other two redundant VOEvent servers: 184.108.40.206 and 220.127.116.11 .)
17 Mar 12 GODDARD-WIDE DNS PROBLEM
All 3 of Goddard's DNS servers became inaccessible for a few minutes,
because Bld 34's firewall machine went off-line, thus denying access to the DNS.
This caused a problem with GCN's ability to maintain socket connections to customers.
The auto restart watchdog kicked in, and got the GCN programs running again.
Total outage time was 10 min (14:30 to 14:40 UT).
19 Jan 12 INTERMITTENT GODDARD-WIDE NETWORK PROBLEM
During a planned maintenance network equipment upgrade, the GCN Notices system was
(intermittently) off-line (from the world) from 00:11 to 00:43 UT -- 0.5 hours.
About half the socket connections maintained connection during this interval; half had interruptions.
Out-going emailed notices were held in the outgoing queue (or went straight out).
01 Dec 11 GCN OFFLINE, GODDARD NETWORK PROBLEM
The GCN Notices system was (effectively) off-line from 07:09 to 13:41 UT -- 6.5 hours;
Service has been restored. The root cause is being investigated.
These kinds of outages (one every few months for a few hours each,
which keeps the uptime number at 99.7% -- without these network outages, the uptime would be 99.9%)
have motivated me to develop a GCN that runs on a machine outside of Goddard.
Currently, I have a prototype running on a machine on the Atlantic.net (cloud) service.
The connectivity has been 100% for 4 months, so it seems like the way to go.
It will go public in 1-2 months.
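To see how a few hours of outage every few months translates into an uptime figure near 99.7%, here is a rough worked example. The inputs are assumptions for illustration (one 6.5-hour outage, like the one above, in a 90-day window); they are not the actual bookkeeping behind the quoted percentages.

```python
def uptime_percent(outage_hours: float, period_days: float) -> float:
    """Percentage of a period during which the service was up."""
    total_hours = period_days * 24.0
    return 100.0 * (1.0 - outage_hours / total_hours)


# Assumed example: one 6.5-hour outage in a 90-day (2160-hour) window.
print(round(uptime_percent(6.5, 90), 1))
```

On the same 90-day window, an uptime of 99.9% would correspond to only about 2 hours of total downtime.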
27 Nov 11 LOSS OF IMALIVE PACKETS TO SOME SOCKET SITES
Due to a programming mistake (which erroneously passed the validation suite)
about half the socket sites did not receive the Type=3 Imalive packets
from 18:27 to 19:23 UT 27 Nov (56 min).
(If you had the 60-sec deadbeat feature turned on at your end, then you would have disconnected.)
(I need to tweak the validation suite for this new GCN feature.)
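For socket sites wondering how a 60-sec deadbeat check behaves on the receiving end, here is a minimal sketch. It is not the GCN client code: the class and method names are hypothetical, and the only detail taken from the text above is the 60-second limit with no imalive packet arriving.

```python
import time


class DeadbeatTimer:
    """Tracks the time since the last packet and flags a dead connection."""

    def __init__(self, timeout_s: float = 60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()

    def packet_received(self) -> None:
        """Call this whenever any packet (e.g. an imalive) arrives."""
        self.last_seen = self.clock()

    def expired(self) -> bool:
        """True once more than timeout_s seconds have passed with no packet."""
        return self.clock() - self.last_seen > self.timeout_s


timer = DeadbeatTimer(timeout_s=60.0)
timer.packet_received()
print(timer.expired())  # a packet just arrived, so the link is considered alive
```

A client would typically call expired() in its receive loop and, when it returns True, close the socket and reconnect.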
17 Oct 11 GCN NETWORK SLOW-DOWN
Delivery of GCN email Notices was being delayed from ~22:00 to ~13:20 UT 19 Oct -- 15.3 hours.
This affected only email delivery; socket delivery had no delay.
Delays on the emails ranged from minutes to hours.
The problem has been fixed and will not occur again.
15 Oct 11 GCN OFFLINE, NETWORK PROBLEM
The GCN Notices system was (effectively) off-line from 15:05 to 15:38 UT -- 0.5 hours;
Service has been restored. The root cause is being investigated.
28 Sep 11 GCN OFFLINE
The GCN Notices system was off-line from 18:19 to 19:02 UT -- 0.7 hours;
due to a (log) file size reaching 2.2GB (due to a run-away error message).
The fundamental cause of the run-away error message was fixed, and the GCN programs restarted.
12 Aug 11 INTEGRAL'S IBAS SERVER OFFLINE FOR 6 DAYS
The IBAS message server for the INTEGRAL mission was off-line starting on the 12th
and ending on the 18th. During those 6 days, no INTEGRAL messages were sent to GCN,
so no INTEGRAL Notices could be distributed by GCN (Position, SPI-ACS, PointingDirection).
22 Jul 11 DUPLICATE FERMI NOTICES
Today there were two sets of back-to-back Fermi-GBM Alert Notices distributed.
This is because the Fermi telemetry processing computers (both Primary and the Secondary)
sent the same telemetry packet to GCN (and GCN pushed both out to the world).
The dual "primary" computer problem has been fixed at the Fermi Ops Center.
06 May 11 INTEGRAL SPIACS NOTICE DISTRIBUTION RESTORED
The spurious INTEGRAL SPI-ACS notices resulted from the SPI being in Annealing mode.
The SPI-ACS messages are now being withheld at ISDC. GCN has removed its blocking,
and so when SPI returns to normal operations mode and the ISDC removes their blocking,
distribution by GCN will be automatic.
06 May 11 BLOCKING INTEGRAL SPIACS NOTICES
As of 03:24 UT I am blocking the distribution of the INTEGRAL SPI-ACS Notices.
They are being produced by the spacecraft or by the IBAS analysis system
at a rate of about one every 3 or 4 minutes for the last ~10 hours. These can't be astrophysical,
so I am blocking them so that they do not complicate sites' operations.
08 Apr 11 NO GOVERNMENT SHUTDOWN -- GCN CONTINUES NORMAL OPERATIONS
A Continuing Resolution was passed, so there will be no Government Shutdown
so GCN will continue with normal operations with no interruptions.
08 Apr 11 GCN DURING A GOVERNMENT SHUTDOWN
Information during a Government Shutdown
19 Feb 11 GCN OFFLINE
The GCN Notices system was off-line from 02:34 to 03:40 UT -- 1.1 hours;
due to a power system failure. (I had replaced the failed UPS that caused
the prior failure, but I missed moving the router over to the new UPS.
Now the router has been moved.)
17 Feb 11 GCN OFFLINE
The GCN Notices system was off-line from 10:29 to 14:14 UT -- 3.8 hours;
due to a power system failure.
And then there was a period of ~70 minutes after the restoration
where the UT time was off by exactly 1 hour (ahead). It took an hour
to figure out that the problem was due to there having been no reboot
since the DST-->EST transition last fall.
09 Jan 11 GCN OFFLINE
The GCN Notices system was off-line from 23:10 (08jan11) to 01:35 UT (09jan11) -- 2.4 hours.
The Goddard network was down due to a power outage in the building that has
the main Goddard gateway to the outside. The post-incident report says the power outage was
not expected to affect the gateway, so no heads-up announcement of the outage was made.
But things did not go as planned, and it did affect the gateway.
22 Aug 10 GCN OFFLINE
The GCN Notices system was off-line from ~11:02 to 02:16 UT (on 22aug10) -- 12.25 hours.
There was a power failure at Goddard which disrupted routers and DNS machines.
This caused several of the GCN programs (including the key distribution program)
to go off-line. The internet disruption was extensive enough to cause
the daemon/watchdog programs I have for GCN ops problems to fail to be able
to notify me. I discovered the outage manually, and restored GCN operations.
07 Aug 10 GCN OFFLINE
The GCN Notices system was offline from ~10:33 to 13:40 UT -- 3.1 hours.
Problems with the center-wide network started ~10:33. By 11:14 UT
it appears that all communications between GCN and the outside world
(both inbound and outbound) were disrupted. Full network operations were restored
by 13:40 UT. So the GCN Notices operations were offline for 2.4-3.1 hours,
and the Circulars operations were offline for about 1-2 hours.
03 Aug 10 GCN NOTICES OFFLINE
The GCN Notices system was offline from 19:00 to 20:10 UT (ie 70 min)
and from 21:44 to 22:11 UT (ie 27 min) due to some ill-formed changes
the SysAdmin people made to our computer network today. It took a while
for me to realize what they had done and come up with a work-around.
11 May 10 CIRCULARS OFFLINE
The Circulars system was offline from 11:50 to 19:00 UT.
Incoming email (ie circular submissions) was being blocked by the Goddard-level firewall.
There were 3 Circulars that were delayed in being distributed during this time.
(Notices are unaffected, because that runs off the new capella2 computer.)
30 Apr 10 GCN OFFLINE (for 46 minutes):
Starting at about 15:20 UT (on 30apr10), the GCN system (Notices only)
was offline until 16:06 UT due to a software upgrade and subsequent computer fsck problem.
21 Apr 10 SWITCHED TO THE NEW CAPELLA2 MACHINE
As of 14:22, the Notices part of GCN is running on the capella2.gsfc.nasa.gov machine.
(Please note that the Circulars part is still on the old capella machine.)
10 Apr 10 ACCIDENTALLY SENT OUT OLD NOTICES DURING TESTING
Some of you (email-based customers) may have received copies of old Notices
for trigger 412217 (a burst back on 13 Feb 2010). I was doing some testing
of the new GCN computer (capella2) and forgot to tell it to use the local procmail lists
instead of the default (old) capella copies of the lists. As such, the people
who got the latest trigger (419015, earlier today) also got copies
of the playback burst (412217, later today).
My apologies for this confusion.
13 Mar 10 GCN OFFLINE DUE TO GSFC FIREWALL UPGRADE (for 67 minutes):
Starting at about 13:27 UT (on 13mar10), the GCN system (Notices and Circulars)
was offline while the GSFC main firewall machine was upgraded (to a larger capacity machine).
Normal operations resumed at 15:34 UT (ie a total loss of 67 minutes).
08 Mar 10 RESUMPTION OF SWIFT POINTING_DIRECTION NOTICES:
The GCN/SWIFT_POINT_DIR Notices have been offline since 25 Feb 2010.
This outage was due to an incompatibility with the new Swift MOC protocols
(the MOC upgraded their computers and network). The automated scripts
that grab the Swift Pre-Planned Observing Timeline files from the MOC failed,
and this was not noticed until today. They have been modified to accept
the new protocol, and the SWIFT_POINT_DIR Notices have resumed.
23 Feb 10 ACCIDENTAL DISTRIBUTION OF NEW NOTICE TYPE:
Last night (while doing some final "live" testing on a new notice type),
I accidentally distributed 4 instances of the new type to ~460 (out of ~560) GCN sites.
My apologies for this mistake and the inconvenience and confusion it may have caused.
07 Dec 09 18:19-20:05 GCN OFFLINE:
GCN was down for 1.8 hours while work was done to permanently fix yesterday's problem.
06 Dec 09 05:16-14:21 GCN OFFLINE:
GCN was down for ~9.1 hours because the computer died (over-temperature).
This was repaired, and the system and the programs were brought back up.
22 Jun 09 VOEvent CONNECTION TO eSTAR OFFLINE:
The VOEvent feed from GCN to eSTAR was down from ~10 Jun to 22 Jun.
One of the transfer programs had crashed. It has been restarted, and
VOEvent flow (the XML packets) has been proven again.
This affected ONLY the flow of VOEvents to the eSTAR backbone site.
It did NOT affect the flow to the Caltech or NOAO backbone sites.
It did NOT affect the flow of any of the regular GCN Notice distribution methods.
09 Jun 09 07:36-13:10 GCN EFFECTIVELY OFFLINE:
GCN was effectively off-line for ~4.5 hours because of a connection problem
with TDRSS/Whitesands (the telemetry feed for the Swift mission). This is a rare occurrence
(the last one happened 2 years ago). The problem has been cleared and all the services
are back on-line.
02 May 09 21:31-23:05 INTEGRAL SPI-ACS TYPE BLOCKED-to-WORLD:
At 13:02 UT, the INTEGRAL IBAS system started sending SPI-ACS Notices to GCN
at an average rate of one every 3 or 4 minutes. I communicated with INTEGRAL,
but have not yet received a reply. Clearly these are not astrophysical,
so I have activated the block-to-world filter on the INTEGRAL SPI-ACS Notice type
as of 12:27 UT 03may09. When the problem is resolved at INTEGRAL Operations,
the block-to-world filter will be removed.
27 Apr 09 21:31-23:05 GCN NETWORK PROBLEM:
From 21:31-23:05 UT (1.5 hrs total), the network inside Building 2 was down
due to a failed router. The router was replaced and GCN restarted.
13 Feb 09 01:52-12:41 GCN NETWORK CONNECTIVITY PROBLEM:
From 01:52-12:41 UT (10.9 hrs total), the network inside Goddard was down.
12 Feb 09 19:05-19:20 GCN NETWORK CONNECTIVITY PROBLEM:
From 19:05 to 19:20 UT (15 min total), there was a building 2 network outage due to a power glitch
clobbering one or more of the routers in the building where GCN is housed.
Resetting the routers restored all the connections.
08 Feb 09 GCN/GODDARD NETWORK CONNECTIVITY PROBLEM:
Here is what I have been able to piece together so far (as of 23:00 UT):
1) The GCN connection to "Whitesands" (the Swift TDRSS tlm stream)
was out between 06:20 and 17:04 UT.
No telemetry was received during that 10.6 hr interval.
Scanning the full dataset (the Malindi connection) shows no bursts during this interval.
2) The GCN connection to the Fermi-BAP machine was out 17:22-17:24 UT (2 min).
I am not sure why that connection was better.
(One burst was received from Fermi during the 06:20-17:04 UT interval.)
3) The INTEGRAL connection is much harder to determine (they do UDP and there
are no "imalive" packet exchanges). Some PointingDirection packets were
received during the 06:20-17:04 interval, but since they are sometimes
few and far between, it is hard to know what the real percentage uptime was.
Given all that:
(a) Customer socket connections that were already established
were maintained all during the 06:20-17:04 UT interval. I can't say about
connections that were broken from their end or new attempts to connect
as to how they fared connecting or not.
(b) I do not know what email traffic looked like (ie were there any delays,
say, with the inside-Goddard forwarders, or not).
13 Jan 09 DELAYED AND LOST NOTICES AND CIRCULARS:
Today a confluence of three back-to-back Swift-BAT triggers, a time compression of the arrival
of the Swift TDRSS messages from the first BAT trigger, plus the Fermi-GBM and -LAT messages
from the burst (same as 1st BAT trigger) caused the load factor on the GCN computer
to become so high that outgoing emailed Notices and Circulars were delayed and lost.
Get the full details here.
13 Jan 09 INTERNET PROBLEMS:
The Goddard internet resumed at ~02:15 UT. Socket connections to customers are now being made.
The telemetry feeds from Fermi and INTEGRAL are working. However, the feed from Swift
is not connected. The Whitesands end has to initiate the connection.
13 Jan 09 INTERNET PROBLEMS:
Starting about 00:10 UT 13 Jan 09, the Goddard internet would no longer support new connections.
This clearly eliminates the socket connections, but it also prevents outgoing email
(the email functionality I am not able to really say since I can not get to an outside location
to check if email is getting through).
12 Dec 08 EMAIL PROBLEMS:
Starting about 20:00 UT 11 Dec 08, the Goddard Network Office implemented a "block"
on all outgoing email traffic that was not going through one of their designated relays.
GCN uses its own email delivery relay FOR WHICH IT HAS A WAIVER TO DO SO FROM THE Network Office!
They unilaterally revoked that waiver (and all of the other self-relay waivers at Goddard).
There was NO heads-up communication that they were going to do this.
Nor did they communicate that they did this even after the fact!
A formal complaint has been filed with Goddard Management about this very unprofessional behavior
and methods by the Goddard Network Office.
As of ~15:20 UT 12dec08, a work-around has been implemented within GCN to get email service restored.
(Some of you probably received a "pulse" of emails that had been queued up for the last 20 hours.)
I apologize for the degradation in service.
This is another in a long series of unprofessional methods used by the Goddard Network Office.
I have had numerous discussions/communications/waivers/etc with them to improve the situation.
But there is only so much I can do. I make the arguments that GCN is a world-class operation
distributing real-time data to researchers around the world, but it does not make
any head-way with the security people. The on-going science activities and the potential
for lost scientific results has zero weight with respect to security "directives".
If you feel otherwise, you can communicate your thoughts to the Goddard Management and
to the Goddard Network Office.
08 Aug 08 3 BRIEF SWIFT GCN OUTAGES (9min, 56min, & 45min):
There were 3 outages of Swift_TDRSS-to-GCN connection: 02:31-02:40,
09:47-10:43, and 10:56-11:40 UT. There were no Swift bursts during these intervals.
None of the other mission connections or services of GCN were affected.
08 Jun 08 GCN OUTAGE (14.1 hours):
At 01:3? UT (08jun08) there was a problem with the power that caused all the computers
in the Low Energy Gamma-ray Group to shutdown (this includes the GCN computer: capella).
This is a repeat of yesterday's problem!
Between 01:3? and 15:48 UT (08jun08) GCN was off-line.
Currently, the power is OK and the GCN programs have been restarted.
Since there was some suspicion that the UPSs were the cause of the outage,
and since it repeated within 24 hours, I replaced the UPSs with new units
(bought as replacements and scheduled for replacement in the near future anyway).
Time will tell if this fixes the problem. (We have been having T-storms in the
last few days, and there have been numerous outages lasting 10's to 1000's of millisec.)
07 Jun 08 GCN OUTAGE (1.1 hours):
At 10:54 UT (07jun08) there was a problem with the power that caused all the computers
in the Low Energy Gamma-ray Group to shutdown (this includes the GCN computer: capella).
(The root cause of this power glitch is as yet unknown.)
Between 10:54 and 12:04 UT (07jun08) GCN was off-line.
Currently, the power is OK and the GCN programs have been restarted.
14 May 08 SWIFT TDRSS OUTAGE (1.1 hours):
Between 19:43 and 20:52 UT (14may08) there was a problem in the socket connection
between the TDRSS Ground Station and GCN. This resulted in a loss of the telemetry stream
from the Swift spacecraft to GCN. All the rest of GCN functionality and connectivity
to all the other missions was OK. There were no Swift bursts during this interval.
05 Apr 08 GCN OUTAGE (3.6 + 2.3 hours):
The Goddard-wide main gateway/router and the Building 2 gateway were being upgraded today.
All GRB functionality/services were off-line (Notices, Circulars, Reports;
incoming and outgoing). The outage started at 10:32 UT and ended at 14:11 UT.
Then the network came back for 1.0 hours, then stopped from 15:10 to 17:25 UT (2.3 hrs).
19 Mar 08 GCN SLOWDOWN (2.0 hours):
Due to the massive amount of traffic from the back-to-back bursts tonight
the GCN computer is suffering from a high load factor. This has resulted
in delayed email delivery (both Notices and Circulars). Please note
that this has NOT affected the socket distribution -- they went in milliseconds
for both bursts. But the email delivery of the Notices and Circulars has been delayed,
especially for the Circulars (several hours for some customers, and more so
for the later Circulars submitted in tonight's series of many follow-up observations).
I have taken steps to clear the email backlog. The distribution rate is increasing,
but there is still a backlog.
My apologies for the inconvenience and confusion these late emails have caused.
12 Mar 08 GCN OUTAGE (2.0 hours):
From 16:50 to 18:50, the GCN was off-line (Goddard network switch-over problems).
All of GCN was out: the socket connections, the Notices, the Circulars.
The Goddard network people said the upgrade switch-over would take 5 seconds,
but after doing the upgrade they discovered that some routers downstream of the one being upgraded
were no longer compatible with the new upgraded unit.
23 Jan 08 NOTICES OUTAGE (9 hours):
From 04:30 to 13:25, the Notices portion of GCN was off-line (due to a software problem).
The problem started a few minutes after the last notice was distributed for GRB 080123
so there was no loss on the burst to the community. (And of course the Circulars portion
continued to function normally.)
13 Jan 08 BRIEF LOSS OF SOME OF THE WEB PAGES:
From around noon 12 Jan to 11am 13 Jan 08, about 20% of the top-level web pages
in the GCN web site were deleted (due to a stupid mistake on my part).
Everything should be back in place now. If you notice anything missing/old
please tell me (as is always a standing request on anything/anytime
you see wrong or could use improvement).
01 Aug 07 ACCIDENTAL RE-DISTRIBUTION OF SWIFT-BAT_POSITION NOTICE:
While testing out some new code to use the Swift_MOC SERS messages as a backup
to the real-time TDRSS (when there is a TDRSS outage; like last week), I accidentally
distributed a BAT_POSITION Notice for GRB 070729. I thought I had the block2world
active for this test, but no.
26 Jul 07 SOLUTION TO IN-LIMBO SOCKET PROBLEM:
The problem that caused the loss of notification of the two Swift bursts 5 days ago
has been solved. Normally, there are demons and watchdogs in place that monitor
for the loss of any of the socket connections between the various programs
that make up the GCN system. But last Saturday a new problem occurred that left
the socket connections in place, but they were not actually able to pass data.
A new demon/watchdog is in place (and tested) that detects this "in limbo" problem,
and it alerts me within 2 minutes of this occurrence.
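A minimal sketch of how such an "in limbo" check could work, assuming each connection records the time it last passed any data. The class name, the connection names, and the 120-second threshold are hypothetical illustrations, not the actual GCN watchdog:

```python
import time


class LimboWatchdog:
    """Flags connections that are open but have passed no data recently."""

    def __init__(self, max_silence_sec=120.0):
        # Hypothetical threshold: how long a link may stay silent.
        self.max_silence_sec = max_silence_sec
        self.last_data = {}  # connection name -> time of last received data

    def record_data(self, name, now=None):
        """Call whenever any bytes (data or test message) arrive on a link."""
        self.last_data[name] = time.time() if now is None else now

    def stale(self, now=None):
        """Return the sorted names of connections silent for too long."""
        now = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_data.items()
            if now - seen > self.max_silence_sec
        )
```

The point of the design is that liveness is judged by data flow rather than by socket state, so it only works if every link carries some regular traffic (for example a periodic test message).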
21 Jul 07 TWO BURSTS LOST DUE:
Two Swift bursts were not distributed to the world because of a problem
with the communications between two programs within the suite of programs that make up GCN.
See full announcement.
09 Mar 07 22:00 UT CIRCULARS and NOTICES WEB PAGE UPDATE DELAYS:
The archive pages for the Circulars and Notices were delayed in being updated
because the computer sys-administrators here in building 2 at Goddard were upgrading
all the machines with new op-systems that will handle the new Daylight Savings Time change correctly.
All is fixed.
This all happened for about 3-5 hours this afternoon.
06 Feb 07 20:00 UT NETWORK PROBLEMS AT GODDARD THIS WEEKEND:
The Goddard Center Network people were working on part of the network
and broke the connection of the TDRSS Swift telemetry stream for 4.4 hrs (15:19-19:43 UT).
Towards the end of that window Swift-BAT triggered on what turned out to be
a cosmic ray shower in the spacecraft and BAT instrument.
08 Jan 07 19:00 UT NETWORK PROBLEMS AT GODDARD THIS WEEKEND: REALLY FIXED:
The final problem was solved, and now the email traffic is flowing with no delays.
08 Jan 07 14:00 UT NETWORK PROBLEMS AT GODDARD THIS WEEKEND: STILL RESIDUAL PROBLEMS:
Well, I/they spoke too soon. Most of the functions came back, but there are still
some residual delays in some email deliveries. From my limited testing
the delays seem to now be in the 0.5 - 2 min range.
07 Jan 07 23:50 UT NETWORK PROBLEMS AT GODDARD THIS WEEKEND: FIXED:
The Goddard Network people were doing some modifications to the network
this weekend and never bothered to announce it -- not to worry; they are going
to catch it from Goddard management about this snafu. Things seem almost
back to normal (as of midnight). There appears to still be a few emails trickling
out of the backlogged queues, but for the most part, things are flowing again.
07 Jan 07 NETWORK PROBLEMS AT GODDARD THIS WEEKEND:
There are on-going network problems inside GSFC that are causing delays
in the distribution of email-based notifications (the socket connections are fine;
only the emails are slow). The delays are time variable and range from 1-5 min.
Requests have been submitted to the IT Service branch.
30 Nov 06 GCN OUTAGE (3.0 hrs):
There were network problems inside GSFC that caused GCN to be effectively off-line
for up to 3.0 hours (10:39 to 13:42 UT). I say "up to" because some socket sites
were still connected up to 11:16 UT and some communications were restored
before 13:42 UT. Normal operations have been restored.
28 Nov 06 SWIFT TDRSS DATA CONNECTION TO GCN OUTAGE:
From 10:11 to 14:10 UT, the telemetry connection for the Swift TDRSS data feed to GCN was out.
The outage is over and data is flowing again (total lost 239 minutes).
(From the full data sets, we know that there were no Swift bursts during that time.)
31 Oct 06 PROBLEMS DURING ON-ORBIT XRT POINTING TEST (PART 2):
I have fixed the problem that allowed some of the XRT Notices
to be distributed during this afternoon's XRT on-orbit pointing
alignment testing. The problem had to do with the lack of all
the other messages that come down the TDRSS link when a normal
trigger happens. During this test _all_ the other messages
(all the BAT-related, all the FOM-related, and all the UVOT-related
messages) were missing. The state-machine with the GCN programs
got screwed up, causing the swift_receiver front-end program to crash.
And when I restarted the GCN programs there was a brief window
(less than 60 sec) when XRT Notices could come down and be distributed
before I could get the block-to-world-distribution command executed.
And since there were many XRT messages coming down TDRSS during the test,
some of them slipped through this brief interval and were distributed.
Both the missing-messages-statemachine problem and the brief-window problem
have been fixed.
This new software has been tested to prove that the state-machine problem is fixed.
And regular burst data has been processed to prove that the normal
mode of operations has not been affected.
At all times during the last 3-4 hours, the normal burst processing capability
was never compromised.
My apologies for the inconvenience.
31 Oct 06 PROBLEMS DURING ON-ORBIT XRT POINTING TEST (PART 1):
This generated a bunch of messages -- all of which were supposed to be blocked-to-world
but some of which did get distributed. Please ignore all XRT messages between 15:00 and 15:45 UT today.
13 Sep 06 INTEGRAL-->GCN-->WORLD BACK TO NORMAL:
The Goddard-level IT people have fixed the firewall rule, and I have switched
back to using the normal connection between IBAS and GCN (stopped the "bridge" program).
05 Sep 06 INTEGRAL-->GCN-->WORLD UPDATE:
Yes, the problem was at the GSFC Firewall end. The Goddard-level IT people
were consolidating their waiver rules and in that process copied one of the GCN rules wrong.
This is being fixed. In the meantime, the "bridge" program is working fine.
03 Sep 06 INTEGRAL-->GCN-->WORLD PATCHED:
The "bridge" program running on the U Chicago machine has been running fine
now for about a day. Things seem stable -- messages from IBAS are getting to GCN,
so if another burst happens, all should work fine.
Now there is time to wait until people are back to work (Tuesday for the US)
to see if this is a GSFC firewall problem (or elsewhere).
02 Sep 06 ON-GOING: INTEGRAL-->GCN-->WORLD PROBLEM:
There are on-going problems with the connectivity between GCN and
the INTEGRAL burst information server (aka IBAS). This problem appears
to have started ~23 Aug 06. It was not noticed until shortly after
the INTEGRAL burst GRB 060901 (I received a Circular about the burst
but not a Notice -- a couple other people noticed this lack as well).
Although the connection between GCN and IBAS appeared OK, no POINT_DIR
or TEST messages were being received. Killing and restarting the GCN
program did not establish a good link. Since this problem happened
once before (and it turned out to be a Goddard firewall issue), I used
a machine outside of Goddard to set up a "bridge" between GCN and IBAS.
This bridge changed the socket connection protocol from UDP to TCP/IP
(which was key to the firewall issue); it then relayed the INTEGRAL
messages to GCN (changing the protocol and thus avoiding the firewall problem).
[Carlo Graziani (U Chicago) kindly provides this machine outside of GSFC
that allows this bridge and other outside-of-Goddard testing activities.]
This worked for about 24 hours and then the POINT_DIR messages (and presumably
anything else that might have been generated) stopped again. This indicates
it is not a Goddard firewall issue this time.
I do not know what the problem is, but I am working to discover and solve it.
(Things are complicated because it is the weekend and system support people
here and at IBAS are not available.) I will keep you posted. (It is always
useful to check the GCN System Status web page to see about this and any
future problem: http://gcn.gsfc.nasa.gov/sys_status.html; and on a broader
scale, check the "what's new" page: http://gcn.gsfc.nasa.gov/whats_new.html .)
This problem affects only INTEGRAL-based notices. The Swift, HETE, XTE, MILAGRO,
and IPN notices are all working fine (ie telemetry or messages are still being received
from these sources and are being distributed).
02 Sep 06 ON-GOING: INTEGRAL-->GCN-->WORLD PROBLEM:
The fixed connection (from yesterday) between INTEGRAL and GCN ran fine for almost 24 hours;
all the POINT_DIR and TEST messages were received as they were sent by IBAS.
But then something happened around 14:00 UT today to stop the flow of messages again.
I am working this on-going problem. (The problem is either with the Goddard firewall
or with IBAS (or GCN's account within IBAS).) But given that we are in the weekend,
the lack of support personnel at both ends makes this effort difficult.
01 Sep 06 FIXED: INTEGRAL-->GCN-->WORLD PROBLEM:
The connection between INTEGRAL and GCN has been restored.
01 Sep 06 INTEGRAL-->GCN-->WORLD PROBLEM:
I am looking into why GCN did not distribute the INTEGRAL Notice on the 060901 burst.
So far, I know it involves only the INTEGRAL->GCN connection. The other mission-based
connections (eg Swift, HETE, XTE) are all ok.
16 Aug 06 NEW GCN MACHINE SWITCH-OVER OUTAGE WAS 18 min:
The old GCN computer was replaced with a new faster machine today.
The GCN system (both Notices and Circulars) was off-line from 14:02 to 14:20 UT.
22 Jul 06 ACTUAL GCN OUTAGE WAS 14 min:
The actual outage was from 15:42 until 15:56 UT (14 min). (This is small compared to the amount of time
they allocated in the original ITN announcement of the outage (see below).)
No bursts were missed.
21 Jul 06 GCN OUTAGE TOMORROW:
The Goddard IT people are replacing the main gateway machine tomorrow (22jul06)
from as soon as 15:00 to as late as 22:00 UT. This will take GCN Notices and Circulars off-line
for up to as long as that time interval. See the announcement
for the details.
15 Jul 06 GCN OUTAGE FOR 0.5 HOURS:
While trying to update the active sites.cfg list, the socket connection
to the Swift TDRSS data stream became wedged (this is related to this firewall blocking).
I had to reboot the system to clear it. The outage of the Notices service was 14:11 to 14:44 UT.
14 Jul 06 GCN SOCKET OUTAGE FOR 0.3 HOURS:
The Notices-portion of GCN was off-line from 22:37 until 23:00 UT 14jul06.
The Network Administrators were implementing a wide-ranging set of new blocking rules
in the gateway between the GCN machine and the world. Their first attempt
at setting up these rules with holes for all the GCN socket connections
was not quite right. It took us 23 min to get the rules fixed.
Since some of the socket connections were actually broken (not just suspended),
there may be longer than 23-min outages based on how long it takes GCN and/or your end
to go through each end's reconnection/initialization cycle.
15 Jun 06 GCN NOTICES OUTAGE FOR 9.9 HOURS:
The Notices-portion of GCN was off-line from 16:39 15jun06 until 02:32 UT 16jun06.
The program crashed. (This did NOT affect the Circulars portion of GCN.)
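The outage lengths quoted in these entries are simple differences of UT clock times, some of which run past midnight. A small sketch of that arithmetic follows; the function name is hypothetical, and it assumes the outage is shorter than 24 hours:

```python
from datetime import datetime, timedelta


def outage_hours(start_hhmm, end_hhmm):
    """Length in hours of an outage given UT start/end times as 'HH:MM'.

    Assumes the outage lasts less than 24 hours; an end time earlier than
    the start time is taken to fall on the following day.
    """
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    if end < start:
        end += timedelta(days=1)
    return (end - start).total_seconds() / 3600.0
```

For the entry above, outage_hours("16:39", "02:32") comes to about 9.88, which matches the quoted 9.9 hours.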
15 May 06 SWIFT OUTAGE FOR 30 MIN:
The Swift-portion of GCN was off-line from 23:28 until 23:58 UT 15 May 06.
This was about an hour after the burst, and resulted in the loss of a few
of the later UVOT data products.
13 May 06 GCN NOTICES OUTAGE:
The GCN Notices system was off-line from 03:08 until 05:12 UT
due to a program crash. This affected only the Notices part of GCN;
the Circulars part continued to work.
The cause of the program crash is being investigated.
26 Apr 06 GCN OPSYS CHANGE OK:
The planned outage to switch to a new operating system on capella
lasted 40 minutes. Both the Notices and Circulars are back on-line.
25 Apr 06 GCN OUTAGE TOMORROW (26apr06):
The GCN System (both Notices and Circulars) will be off-line tomorrow
Wednesday 26 Apr 2006 from 14:00 to 15:30 UT.
NASA has issued a new set of computer security requirements.
The version of RedHat LINUX that GCN is currently running under
is no longer on the approved list; so I have to upgrade.
GCN Notices & Circulars have already been ported and tested on the new OpSys,
so the transition should go smoothly. 90 min has been allocated
for the switch-over, but it should likely take less time.
(If there are problems, this switch-over is being done in a way
which will allow us to go back to the old OpSys.)
This will NOT involve any change apparent to you. The capella name, domain, and
IP number will NOT change (so no firewall changes are needed at your end of things).
I apologize for the somewhat short notice, but the outage of services
should be small (and I want to get this in with sufficient time before
24 Apr 06 TWO 4-MIN OUTAGES WHILE DOING TEST:
The GCN system was offline for about 3-4 min starting at 18:58 and 19:10 UT
while the system was switched over to do a second test of GCN under Scientific LINUX.
The test was successful, and the real switch will likely be later this week or next week.
19 Apr 06 TWO 4-MIN OUTAGES WHILE DOING TEST:
The GCN system was offline for about 3-4 min starting at 18:58 and 19:22 UT
while the system was switched over to do a test of GCN under Scientific LINUX.
This OpSys change is being mandated by the IT Security people at GSFC; RedHat
is no longer allowed. The test was successful.
In between the two times listed, GCN was actually running to the world
under the new Sci-LINUX. Had there been a burst during that (brief) window,
it would have been distributed to the world per normal.
10 Jan 06 SCREWY DATES AND TIMES STILL IN ATTACHMENTS:
Please note that while the screwy dates and times in the Notices were fixed 5 days ago,
they still remain in the titles of the lightcurve plots and images being sent as attachments
(and appearing on the GCN web table page). This will be fixed in a day or two.
05 Jan 06 SCREWY DATES AND TIMES IN SWIFT NOTICES FIXED:
The problems with the Date and Times in the Swift Notices have been fixed.
There were two problems: (1) the GCN software was unpacking the now negative UTCF incorrectly, and
(2) a BAT FSW mistake in filling the UTCF data fields.
The UT CorrFactor has been re-instated in all Swift Notices.
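One common way a value gets mis-unpacked once it turns negative is that a signed field is read as unsigned. The sketch below illustrates that general failure mode with Python's struct module. The 32-bit big-endian layout and the sample value are assumptions for illustration only; they are not the actual Swift telemetry format, and this is not necessarily the bug that was fixed here:

```python
import struct

# Hypothetical example: a correction of -1 (in some integer unit) packed
# as a 32-bit big-endian two's-complement field.
raw = struct.pack(">i", -1)

as_signed = struct.unpack(">i", raw)[0]    # reads the sign bit correctly
as_unsigned = struct.unpack(">I", raw)[0]  # wrong for negative values


def to_signed32(value):
    """Convert a value read as unsigned 32-bit into its signed meaning."""
    return value - (1 << 32) if value >= (1 << 31) else value
```

A field that was always positive reads identically either way, which is why this kind of mistake can stay hidden until the first negative value arrives.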
02 Jan 06 SCREWY DATES AND TIMES IN SWIFT NOTICES:
There is a problem(s) in the dates and times in the Swift Notices (only the Swift-based notices).
It seems to be related to (a) the year change, (b) the UT Correction Factor going negative
due to the LeapSecond adjustment, and (c) possibly a FSW problem.
The problem is being investigated.
In the mean time, the UT CorrFactor has been removed from all Swift Notices.
Since this is always in the range -1.0 to +1.0, it is a small effect on dates and times.
04 Nov 05 LOSS OF SERVICE:
From 07:36 to 12:49 UT (delta_t = 5.2 hr), GCN was off-line due to a program crash.
The cause was due to a limitation in the file system.
Given the formatting of the disk and file system, the number of inodes allocated
can not support a directory that has more than 185,494 files.
GCN has a directory into which it writes a copy of every message that comes through the system.
This archive directory grew without much inspection, and then recently the addition
of the Startracker-loss-of-lock messages pushed this directory over the top (because these
StarTracker status (good and bad) messages come from Swift every 10 sec). So in roughly
30 days, the 185K limit was reached. The GCN program was changed to not write
these Startracker messages. This particular message does not need to be archived.
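A hedged sketch of a check that could catch this kind of growth before it reaches the limit. It uses os.scandir and os.statvfs from the Python standard library (statvfs is POSIX only). The function names and the warning thresholds are hypothetical, chosen only to sit below the 185,494-file limit mentioned above:

```python
import os


def directory_entry_count(path):
    """Count the entries directly inside a directory."""
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)


def inode_fraction_used(path):
    """Fraction of the file system's inodes in use (POSIX only)."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some file systems do not report inode counts
        return 0.0
    return 1.0 - st.f_ffree / st.f_files


def needs_attention(entry_count, inode_fraction, max_entries=150_000,
                    max_inode_fraction=0.9):
    """True when either measure crosses its (hypothetical) warning level."""
    return entry_count >= max_entries or inode_fraction >= max_inode_fraction
```

Run periodically against the archive directory, a check like this would give a warning well before writes start to fail.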
26 Oct 05 LOSS OF SERVICE:
From 12:24 to 13:47 UT (delta_t = 50 min), GCN was off-line due to an unplanned system crash
while some cabling work was being done on the cluster of machines.
06 Oct 05 BRIEF LOSS OF SERVICE:
From 18:31 to 18:47 UT (delta_t = 16 min), GCN was off-line while new system s/w was installed.
03 Sep 05 LOTS OF SWIFT GCN NOTICES:
At 21:13 UT, Swift-BAT triggered and issued the standard set of GCN Notices (and so did the NFIs).
About 9 more sets of Notices came out over the next half hour.
The spacecraft Star Tracker lost lock (as it does every couple months) and so with sources
drifting in the BAT FOV, triggers were generated. See GCN Circ #3909.
At 21:46 UT the block-to-world filter was activated for all BAT-based, XRT-based, and UVOT-based Notices.
However, you are still likely to receive some Notices after that time. They should all be generated
before that time however. The delay is due to the way the sendmail demon works.
During the time when Swift was generating a lot of Notices in a short amount of time,
the load-factor on the GCN computer went up to over 14 (typical values of 2-3 for a regular burst series).
When the load-factor goes over 8, sendmail will suspend outgoing email activities.
And then it picks them up when the load-factor drops below 8 AND when the next retry-to-send interval expires.
This interval is currently 15 minutes. I, personally, have been receiving Notices almost an hour
after the "blocking" time, so there is also something else at work in this email processing,
but most of the delay was due to the high load_factor-sendmail interaction.
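The load-factor/retry interaction described above can be sketched as a toy model. This is a simplification under stated assumptions, not sendmail's actual scheduler: the function name is hypothetical, the load is sampled once per minute, and retries are assumed to fall on fixed multiples of the retry interval:

```python
def delivery_time_min(submit_min, load_by_minute, load_limit=8.0,
                      retry_interval_min=15):
    """Earliest minute at which a message submitted at `submit_min` goes out.

    `load_by_minute[i]` is the load factor during minute i. If the load is
    at or below the limit at submission, the message is sent at once.
    Otherwise it waits in the queue and is retried only at multiples of
    `retry_interval_min`, going out at the first retry where the load is
    at or below the limit. Returns None if that never happens in the data.
    """
    if load_by_minute[submit_min] <= load_limit:
        return submit_min
    first_retry = (submit_min // retry_interval_min + 1) * retry_interval_min
    for t in range(first_retry, len(load_by_minute), retry_interval_min):
        if load_by_minute[t] <= load_limit:
            return t
    return None
```

With a load of 14 for the first 20 minutes and 3 afterwards, a message submitted at minute 5 goes out at minute 30 in this model: the load has recovered by minute 20, but the next retry does not come until minute 30.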
15 Aug 05 17:24:
The email deliveries during the GRB050815 burst were very slow -- minutes to 107 min.
The exact cause is not well understood at the moment, but it is believed to be caused by
a very high loadfactor on the capella machine. What caused this high loadfactor is not known yet.
06 Jul 05 00:00:
GCN was effectively off-line from 00:00 to 01:00 UT (total outage 60 min),
and the Swift_TDRSS_receiver portion for an additional time until 02:40 UT.
The Goddard Center Network people were conducting an emergency power test
in preparation for the Shuttle Return-to-Flight.
The main GCN program came back on-line at 01:00 when power was restored to the routers,
but the Swift TDRSS connection to White Sands needed manual help (at 02:40 UT, Swift total outage 160 min).
I would have announced the outage prior to it had I known it was going to affect GCN,
but the Center Network announcement of the test said that it was not going to affect the part of Goddard
where GCN is located, nor was it supposed to affect the Goddard connection to the outside Internet.
03 Jul 05 14:20:
The Swift-to-GCN connection was out from 16:44 to 17:31 UT (total outage 47 min).
Manually restarting the swift_tdrss_receiver program cleared the block between White Sands and here.
13 May 05 14:20:
The Swift-to-GCN connection has been RESTORED (as of 14:20 UT).
Total lost time: 16.3 hours (Swift only; all the other Notice types as well as the Circulars suffered no loss).
13 May 05 03:00:
The Swift-to-GCN connection is down (as of 21:57UT 12may05).
The problem was "worked" for several hours, but no solution was found.
Work will resume Friday morning (13may05).
30 Apr 05 15:30 UT:
There is a problem with the intranet (and/or the mail server machine)
here in Building 2 at GSFC (external to GCN).
This has caused the delay in distribution of the email-based GCN Notices.
It appears to have been occurring for at least 4 hours,
and is still somewhat intermittent at the moment. People are working the problem.
17:00 UT: the problem has been fixed -- email is flowing promptly once again.
12 Feb 05 23:03 to 13 Feb 05 00:20 UT:
GCN (both Notices and Circulars) was down due to planned outage of the Goddard connection to the Internet.
The Goddard network people performed an upgrade to the gateway to the Internet.
The total outage was 1.1 hours.
10 Feb 05 UT:
The GCN Notices system was off-line for 12.3 hours (01:22 to 13:40 UT).
The disk freespace went to zero (due to poor management on my part).
There was no loss on the Circulars system.
12 Dec 04 UT:
The GCN Notices system was off-line for 60 minutes. All functionality was restored at 08:26 UT.
There was no loss on the Circulars system.
08 Nov 04 UT:
GCN (both Notices and Circulars) was down due to a power failure
in the (half of the) building that the GCN computer is located.
The total outage was 11.2 hours.
08 Oct 04 UT:
One of the disk partitions on the GCN computer (capella) became full
some time around 21:00 UT yesterday. It was not discovered and fixed until 15:00 UT today.
This problem caused two Circulars to be mis-numbered,
and it caused some Notices to be delayed in distribution.
Since it is possible that a submitted Circular was lost, you should resubmit your Circular again
(if you did not see it in the outgoing list).
This did NOT affect the socket-site portion of GCN -- that part kept working
right through the disk-full incident.
26 Sep 04 UT:
The interface program between GCN and INTEGRAL exited (for unknown reasons).
A total of 16.8 hrs of connectivity to INTEGRAL was lost (23:55 25sep04 to 16:40 26sep04 UT).
(The rest of GCN continued to operate normally, ie HETE, XTE, IPN, etc).
25 Aug 04 UT:
The system-clock on the GCN computer was found to be off by 3min21sec (ahead).
This has been fixed.
Any use of email "NOTICE_TIME"s or socket_packet times will appear to have caused
a distribution delay of 3min21+sec. This is not the case in actual fact -- only in appearance.
The delays (monitored by other parts of the GCN system) are still short: 0.1-1.0 sec for
socket sites and 1-3 sec for email sites (the part of the distribution time that is within GCN;
I can neither account for nor control the part of the distribution time for email once it gets
outside of Goddard Space Flight Center). The GRB Times are completely unaffected and accurate
with respect to this problem.
22 Aug 04 13:31-14:19 UT:
The Goddard Center Network people took the Goddard internet down for upgrades.
This resulted in a loss of connectivity (both incoming and outgoing) of GCN to the outside world.
A loss of 48 min.
29 Jul 04 15:07-15:21 UT:
GCN was taken down to get an even better electrical_power/UPS/internet/router configuration.
A loss of 14 min.
28 Jul 04 00:54-07:39 UT:
GCN was down due to power failure due to T-storm. A loss of 6.7 hrs.
30 Jun 04 00:39-15:15 UT:
The connection between INTEGRAL and GCN was down, so there was no INTEGRAL service
within GCN for those 14.5 hrs (all the rest of GCN HETE, RXTE, IPN, Circulars, etc was
connected and working fine). The INTEGRAL outage connection problem was probably
at the GCN end (still under investigation).
25-27 Jun 04:
There were 3 outages this weekend. They were due to an upgrade in the electrical distribution
within the building that houses the Notices and Circulars portions of the GCN system.
For the Notices portion, the first was on Friday evening when the power
was taken down (for about 1.5 hrs) to start the upgrade. During the upgrade, arrangements were
made to have the GCN computer and router put on generator power. The second outage started
on Sunday around noon (EDT) when the generator failed (~1 hr). A second generator was brought on-line.
Then several hours later the system was brought down to switch back to the normal building power (~1 hr).
During this weekend the response times of GCN were slowed slightly due to the primary Domain Name Server
being down. A 5-sec delay was introduced for about half the socket packets
while GCN timed-out while waiting for the primary DNS.
This was most notable in the round-trip travel times reported in the "Daily Socket Connection Reports"
(sent to those socket sites requesting these reports). You will notice peaks in the round-trip times
histograms at 5-sec and smaller peaks at 10-sec and 15-sec. These increments in the round-trip
times are in the 'return' portion of the round-trip -- not in the 'to you' portion.
The Circulars portion of GCN was off-line for the whole weekend.
12 Jun 04 08:23-21:51 UT:
The GCN Notices was off-line between 08:23 to 21:51 (13.5 hrs) -- the program crashed (cause as yet unknown).
(If anybody knows of a pager -- or other comm system -- that can get through building walls, please let me know.
I'm tired of being out of touch with my watchdog systems.)
04 May 04:
The problem that disabled GCN Circulars has been fixed.
You are now able to send your circular submission to gcncirc@@lheawww.gsfc.nasa.gov
and it will be scanned, accepted, and distributed automatically (just like before).
The account was temporarily disabled (for 3.5 days) as a result of a reconfiguration of the machine
by the computer administration people here at Goddard. This affected only the Circulars portion of GCN;
the Notices portion was never affected.
29 Mar 04 02:40 UT:
The GCN system was offline for 6.0 hours (20:03 28apr04 until 02:08 29apr04 UT).
The main GCN processing demon crashed. It took 6 hours to restore the system
because I was inside a metal building and so the automated monitoring system
was not able to get through to my pager.
15 Mar 04 15:10-19:22 UT:
The main gateway router for Building 2 at Goddard died, which resulted
in GCN being completely disabled (no incoming messages from the various missions
and nothing outgoing -- not even Test Notices).
The router was replaced and services restored at 19:22 UT; a 4.2 hour loss.
(Given the earlier INTEGRAL-only outage, it never rains but it pours.)
15 Mar 04 14:18 UT:
The connection to the INTEGRAL IBAS GRB_message server was re-established. I have set up
a portnumber translator program on a machine operated by Carlo Graziani (U. Chicago).
This translator is a work-around to the recent blanket port blockage by Goddard Network management.
(In the mean time I have submitted a request to get the specific port number re-opened
for GCN<-->INTEGRAL use.)
Thankfully, the universe co-operated, and there were no INTEGRAL-detected bursts
during this 4.5 day outage.
Many thanks to Carlo for the use of his machine for this work-around.
12 Mar 04 UT:
GCN's ability to receive (and therefore distribute) INTEGRAL Notices
has been blocked by the GSFC Network Security people instituting
a firewall blockage over a range of port numbers (that includes the IBAS-to-GCN port).
This happened late Wednesday (23:00 UT 10Mar04); was not discovered until Thursday;
and the root cause not identified until late Friday.
I will submit a waiver request to get the IBAS port number opened back up,
but that will not be possible until Monday morning (15Mar04). A backup pathway is being developed
(using a different method) which will prevent future losses of information should there be another outage
of the INTEGRAL_IBAS socket-connection pathway.
I apologize for the 4-day loss of service.
01 Mar 04 14:57 UT:
The recent RXTE_ASM GRB Notice was delayed in distribution by 7.3 hrs within the GCN system,
because of a processing error within the GCN system. As part of the transition from the old Building 23 SunOS system
to the new Building 2 LINUX system (a year ago), the entry in the "import" table was not updated properly for this Notice type.
Insufficient testing was performed, and it was not until today's Notice that there was any real use of this Notice type.
I apologize for the mistake and the delay in the distribution of this GRB Notice.
01 Mar 04 12:15 UT:
The Internet connection to the outside world was lost at 12:15 UT. It was re-established at 13:36 UT;
for a loss of 1.26 hrs. At this time (14:11 UT) I do not know the cause of the outage or why it resumed.
(The GCN system proper continued to run throughout this interval. Socket sites are now being reconnected
via the automated reconnect process. Email/Pagers/cells/etc distribution has also resumed.)
29 Feb 04:
A mistake was made in the correction for this year's Leap year.
This affected only the INTEGRAL Notices. The set of INTEGRAL Test Notices
distributed at 01:00 29Feb04 had bad day-of-year, month, and day-of-month fields.
I believe this has been corrected, but I am waiting for the next set
of INTEGRAL Test Notices to know for sure.
The next set of INTEGRAL Test Notices have been received, processed, and distributed.
Part of the fix was to adjust the time of the "event" from being in 2003/mm/dd (ie in the past)
to 2004/mm/dd (in the future; the current mm/dd being used by INTEGRAL is May 23).
I will look into the possibility of making GCN function properly for dates in the past.
(Currently, anything previous to January 01 of each year is too far into the past to have all
the TJD, DOY, YY/MM/DD work properly. The GCN routines were conceived like GCN was conceived -- everything is real-time.
Having something 2 years into the past is/was out-of-scope.)
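For reference, a small sketch of computing the day-of-year and TJD for an arbitrary calendar date, past or future, with the leap-year handling left to Python's datetime library. It assumes the convention TJD = MJD - 40000 (TJD 0 on 24 May 1968); the function names are hypothetical and this is not the GCN routine itself:

```python
from datetime import date

MJD_EPOCH = date(1858, 11, 17)   # Modified Julian Date 0
TJD_OFFSET = 40000               # TJD = MJD - 40000 (assumed convention)


def day_of_year(d):
    """Day of year, 1-based; the calendar library handles leap years."""
    return d.timetuple().tm_yday


def truncated_julian_day(d):
    """Truncated Julian Day for a calendar date."""
    return (d - MJD_EPOCH).days - TJD_OFFSET
```

Under these assumptions, 29 Feb 2004 is day-of-year 60 and TJD 13064, and 01 Mar is day 61 in 2004 but day 60 in 2003.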
11 Feb 04:
The GCN Notices system crashed at 20:36 UT. It was not noticed for a while.
It was restarted and sites were connected by 21:29 -- a loss of 53 minutes.
(The GCN Circulars system was NOT affected.)
30 Dec 03:
While rebooting the computer to install some new security patches in the kernel,
I restarted the connection to the INTEGRAL server with the wrong IP Number.
This mistake was not noticed until 20 hours later. The GCN connection to the INTEGRAL server
was immediately restarted and the connection was re-made. This affected only the
INTEGRAL messages (if there were any) -- the rest of the GCN system was/is operating fine.
06 Dec 03:
An infinite-loop interaction between the Circulars demon
and a spammer's demon caused the disk partition for the outgoing email
to be filled to capacity. I can not tell which of these Circulars
were actually distributed (some were distributed once the offending
messages were dequeued), so I distributed them again. My apologies
for the delay in distribution (for those that never got these) and
my apologies for those that are receiving them twice.
The Circulars demon program has been modified to prevent this new form
of infinite loop in the future.
16 May 03:
A T-storm power outage caused the system to go offline at 07:50 UT this morning.
The outage was longer than the UPS battery capacity.
Power was restored at 09:07 UT and the system was rebooted.
Some socket sites were able to reconnect automatically starting at 09:07,
however a manual restarting of the program was needed to clear out problems
preventing the rest of the socket sites from connecting. This was done at 13:41 UT.
There was a 1.2-hour loss for some sites and a 3.9-hour loss for the other sites.
21 Apr 03:
The system went offline at 07:00 UT this morning. The cause is unknown.
The system was rebooted. There was a 5-hour loss.
16:30 UT 17 Apr 03:
The recent cluster of identical Circulars was due to the submitter
sending 8 separate copies of the message over the span of an hour.
His account has been disabled.
I am in the process of cleaning up the mess.
I have reset the Circular serial number back to the point after his first submission.
14 Feb 03
The GCN system was off-line for 23 hours.
The details are given here.
13 Dec 02
The GCN system was off-line for 16 hours (00:00 to 15:58 UT Friday, 13 Dec 02).
The reasons for this outage are as yet undetermined.
19 Oct 02
GCN was off-line (both Notices & Circulars) due to a GSFC-wide network system upgrade.
It was off-line from 14:11 to 15:07 UT, and sluggish from 15:07 to 15:40 UT.
The GCN/TAN contact is: Scott Barthelmy,
This file was last modified on 17-Mar-17.