Like many other companies around the globe, we also had some issues with the last leap second. We couldn’t figure out why our Hadoop cluster was acting strangely and using almost all of its CPU. After a while of browsing we found out that the real cause was the combination of Java and the leap second. As you may know, a leap second was added to UTC at the end of June 2012, which caused unpredictable behaviour in some computer systems. You might think that it is just one second and ask what harm it can do, but it turned out not to be so harmless. For some reason Java cannot handle this kind of time behaviour and starts using all the resources it can get, which caused our cluster of Hadoop servers to stop working or crash miserably. The problem lies in the combination of Java and ntpd on the same virtual machine. If you run ntpd on the host machine and the Java client in a guest, there should be no problem; at least we didn’t run into any.
Here are just some of the bigger companies that also weren’t really prepared for this leap second:
One of the suggested fixes was to simply reset the time with one of the following commands:
date `date +"%m%d%H%M%C%y.%S"`
date -s "$(date)"
After a nice talk with Steve Kostecke from the NTP Project, I decided to post a few of his quotes regarding this matter:
Steve Kostecke: There is a way to check the leap second flags with the command:
ntpq -c "rv 0 leap"
Before the last minute of the day you want to check the Leap Indicator (LI) to see if it is “01” or “10”. If it is, then you wait until after the leap second. Stop ntpd; run the date command; start ntpd.
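Steve's procedure could be scripted roughly like this. It is only a sketch under my own assumptions: the helper names extract_leap and leap_pending are made up for illustration, I assume the ntpq output contains a field of the form leap=NN, and the service name used to stop and start ntpd differs between distributions.

```shell
#!/bin/sh
# Hypothetical helper: pull the two-bit leap value out of ntpq output.
# Assumes the output contains a field such as "leap=01".
extract_leap() {
    printf '%s\n' "$1" | sed -n 's/.*leap=\([01][01]\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds when a leap second is flagged (01 or 10).
leap_pending() {
    case "$1" in
        01|10) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example use, as root (the service name "ntp" is an assumption):
#   li=$(extract_leap "$(ntpq -c 'rv 0 leap')")
#   if leap_pending "$li"; then echo "leap second flagged, wait until it has passed"; fi
#   ...after the leap second:
#   service ntp stop; date -s "$(date)"; service ntp start
```

The parsing is kept separate from the ntpq call so that it can be tried against saved output without touching a live server.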
Here is the table of leap flags that Steve was talking about:
Leap Indicator (LI): This is a two-bit code warning of an impending
leap second to be inserted/deleted in the last minute of the current
day, with bit 0 and bit 1, respectively, coded as follows:
LI    Value    Meaning
00    0        no warning
01    1        last minute has 61 seconds
10    2        last minute has 59 seconds
11    3        alarm condition (clock not synchronized)
Taken from http://tools.ietf.org/html/rfc1361
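If you want a script to print these meanings instead of looking them up by hand, the table translates directly into a case statement. This is a minimal sketch; leap_meaning is a name I made up, and the strings are copied from the RFC table above.

```shell
#!/bin/sh
# Hypothetical helper: translate a two-bit LI code into the RFC 1361 wording.
leap_meaning() {
    case "$1" in
        00) echo "no warning" ;;
        01) echo "last minute has 61 seconds" ;;
        10) echo "last minute has 59 seconds" ;;
        11) echo "alarm condition (clock not synchronized)" ;;
        *)  echo "unknown leap indicator: $1"; return 1 ;;
    esac
}

leap_meaning 01
```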
There is also another way of handling the leap second gracefully. It is achieved by increasing the step threshold in NTP so that the 60 second period is slewed (adjusted gradually) rather than stepped. As Steve Kostecke pointed out, the side effect of such a modification is that it would take ~2000 seconds to slew the 1 second error. It is up to you to decide whether this is a better way of handling the leap second.
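The ~2000 seconds figure can be checked with a bit of shell arithmetic. The sketch below assumes a maximum slew rate of 500 ppm (500 microseconds corrected per second), which is the limit ntpd normally works with but is not something Steve stated in our conversation; slew_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: how long it takes to slew away a clock error.
# $1 = error in microseconds, $2 = slew rate in ppm (microseconds per second)
slew_seconds() {
    echo $(( $1 / $2 ))
}

# 1 second error = 1000000 microseconds, assumed slew rate 500 ppm
slew_seconds 1000000 500   # prints 2000
```

So at that rate the clock stays slightly off for a bit over half an hour, which is the trade-off against stepping it at once.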
I have to say a big thanks to Steve Kostecke for explaining to me how NTP works in general and for pointing me in the right direction.