Leap second, Java and NTP lead to disaster: how to set up ntpd to avoid that

Like many other companies around the globe, we also had some issues with the last leap second. We couldn’t figure out why our Hadoop cluster was acting strangely and using almost all of its CPU. After a while of browsing we found out that the real cause was the combination of Java and the leap second. As you may know, at the end of June 2012 one second was added to UTC, which caused unpredictable behaviour in some computer systems. You might think that this is just one second, so what harm can it do, but it wasn’t that simple. For some reason Java could not handle this time behaviour and started using all the resources it could get, which caused our cluster of Hadoop servers to stop working or crash miserably. The problem lies in the combination of Java and ntpd on the same virtual machine. If you have ntpd on the host machine and run the Java client in the guest, there should be no problem; at least we didn’t face any. Some bigger companies also weren’t really prepared for this leap second.

One of the suggested fixes was to simply reset the time with one of the following commands:
date `date +"%m%d%H%M%C%y.%S"`
Or:
/etc/init.d/ntpd stop
date -s "$(date)"

After a nice talk with Steve Kostecke from the NTP Project, I decided to post a few of his quotes regarding this matter:

Steve Kostecke: There is a way to check flags for leap second with command: ntpq -c"rv 0 leap" . Before the last minute of the day you want to check the Leap Indicator (LI) to see if it is “01” or “10”. If it is then you wait until after the leap second. Stop ntpd; run the date command; start ntpd.
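The check and reset Steve describes could be sketched as a few shell functions. This is only a rough sketch: the names leap_bits, leap_pending and reset_clock are made up for illustration, and it assumes that the ntpq output contains a field of the form leap=NN and that ntpd is managed through /etc/init.d/ntpd as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: extract the two-bit leap indicator from ntpq output.
# Assumption: the output contains a field of the form "leap=NN".
leap_bits() {
    printf '%s\n' "$1" | sed -n 's/.*leap=\([01][01]\).*/\1/p' | head -n 1
}

# Succeeds (exit status 0) when a leap second is announced (LI is 01 or 10).
leap_pending() {
    case "$(leap_bits "$1")" in
        01|10) return 0 ;;
        *)     return 1 ;;
    esac
}

# The reset itself: stop ntpd, run the date command, start ntpd.
# Needs root, and is meant to be run only after the leap second has passed.
reset_clock() {
    /etc/init.d/ntpd stop
    date -s "$(date)"
    /etc/init.d/ntpd start
}

# Example usage (nothing here runs automatically):
#   if leap_pending "$(ntpq -c 'rv 0 leap')"; then
#       echo "leap second announced, call reset_clock once it has passed"
#   fi
```

Keeping the parsing separate from the reset means the check can be tried out on sample text without touching the clock.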

Here is the table of leap flags that Steve was talking about:

    Leap Indicator (LI): This is a two-bit code warning of an impending
       leap second to be inserted/deleted in the last minute of the current
       day, with bit 0 and bit 1, respectively, coded as follows:
     
          LI       Value     Meaning
          -------------------------------------------------------
          00       0         no warning
          01       1         last minute has 61 seconds
          10       2         last minute has 59 seconds
          11       3         alarm condition (clock not synchronized)
     
    Taken from http://tools.ietf.org/html/rfc1361
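If you want to print the meaning of a leap flag in a script, the table above can be turned into a small lookup. This is a minimal sketch; the function name leap_meaning is hypothetical and the strings are simply copied from the table.

```shell
#!/bin/sh
# Hypothetical helper: map the two-bit LI code to the wording in the table.
leap_meaning() {
    case "$1" in
        00) echo "no warning" ;;
        01) echo "last minute has 61 seconds" ;;
        10) echo "last minute has 59 seconds" ;;
        11) echo "alarm condition (clock not synchronized)" ;;
        *)  echo "unknown leap indicator: $1" >&2; return 1 ;;
    esac
}

# Example usage:
#   leap_meaning 01
```

An unknown code is reported on stderr with a non-zero exit status, so a calling script can tell a bad value apart from a real answer.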

There is also another way of handling this leap second gracefully. It is achieved by increasing the step threshold in NTP, so that a 60 second period is slewed rather than stepped. As Steve Kostecke pointed out, the side effect of such a modification is that it would take about 2000 seconds to slew the 1 second error. It is up to you to decide whether this is a better way of handling the leap second.
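The figure of roughly 2000 seconds follows from the rate at which ntpd slews the clock, which is at most 500 ppm (500 microseconds per second). A quick sketch of the arithmetic, with a hypothetical helper named slew_seconds:

```shell
#!/bin/sh
# slew_seconds OFFSET_US RATE_PPM
# Prints how many seconds it takes to slew OFFSET_US microseconds at RATE_PPM.
# offset_s / (ppm * 1e-6) is the same as offset_us / ppm, so integer
# arithmetic is enough here.
slew_seconds() {
    echo $(( $1 / $2 ))
}

slew_seconds 1000000 500   # 1 second error at 500 ppm, prints 2000
```

So a 1 second error needs about 33 minutes of slewing, and during that time the clock is still slightly off.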

I have to say a big thanks to Steve Kostecke for explaining to me how NTP works in general and for pointing me in the right direction.
