28 October 2018

The tale of Algolia's Scala client and Playframework

Yesterday I spent 5 hours debugging a strange issue with Algolia’s Scala client and Playframework during local development, that is, in the mode where Playframework auto-reloads whenever the code changes.

For a long time (~9 months), we had been seeing what looked like a “flaky” DNS error; it eventually turned out not to be flaky at all. But more on that later.

The error shows up as an injection error: we can’t initialize AlgoliaClient because we can’t initialize DnsNameResolver.

Caused by: java.lang.NullPointerException
  at io.netty.resolver.dns.DnsNameResolver.<init>(DnsNameResolver.java:303)
  at io.netty.resolver.dns.DnsNameResolverBuilder.build(DnsNameResolverBuilder.java:379)
  at algolia.AlgoliaHttpClient.<init>(AlgoliaHttpClient.scala:56)
  at algolia.AlgoliaClient.<init>(AlgoliaClient.scala:64)
...

Note that with newer versions of Netty, the error is instead about FailedChannel not being castable to Channel, but it is still raised at the same place.

Well, OK, it looked like a network thing, and network things can be flaky, so I ignored it for several months.

Yesterday, I found out that this injection error appeared almost exactly after Playframework had reloaded code around 23 times. Testing this hypothesis took a lot of time because changing code 23 times by hand was tedious.

I wrote a script that instantiated AlgoliaClient repeatedly, and the exception was always raised at the 26th AlgoliaClient. The exception wasn’t exactly helpful, though. I then spent another hour trying random things without making any progress.
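A rough sketch of such a script, written as a generic probe: it keeps creating instances without ever closing them and reports the 1-based index of the first failure. `create` stands in for constructing an AlgoliaClient (its constructor arguments aren’t given here, so they are left out), and `LeakProbe`/`firstFailure` are hypothetical names, not part of any library.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.{Failure, Success, Try}

object LeakProbe {
  /** Calls `create` up to `attempts` times, deliberately never closing what it
    * creates, and returns the 1-based index and exception of the first call
    * that throws, or None if every call succeeded. */
  def firstFailure[A](attempts: Int)(create: () => A): Option[(Int, Throwable)] = {
    // Hold on to every instance so none is garbage-collected mid-probe.
    val leaked = ListBuffer.empty[A]
    (1 to attempts).iterator
      .map { i =>
        Try(create()) match {
          case Success(a) =>
            leaked += a
            None
          case Failure(e) =>
            Some((i, e))
        }
      }
      // The iterator is lazy, so probing stops at the first failure.
      .collectFirst { case Some(hit) => hit }
  }
}
```

Fed a fake factory that starts throwing once 25 instances are alive, the probe reports index 26, the same pattern the real script showed.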

What finally got me unstuck was instantiating a 27th AlgoliaClient, and there it was. The actual exception showed itself:

Caused by: java.net.SocketException: maximum number of DatagramSockets reached
  at sun.net.ResourceManager.beforeUdpCreate(ResourceManager.java:72)
  at java.net.AbstractPlainDatagramSocketImpl.create(AbstractPlainDatagramSocketImpl.java:69)
  at java.net.TwoStacksPlainDatagramSocketImpl.create(TwoStacksPlainDatagramSocketImpl.java:70)

After googling for ResourceManager, I found out that the number of datagram sockets was limited to 25 by default!

I tested this hypothesis by running sbt -Dsun.net.maxDatagramSockets=4 'runMain TheScript', and the script failed right after the 4th AlgoliaClient.
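For reference, a tiny helper that reports which cap would apply, assuming (as the experiment above suggests) that the cap comes from the sun.net.maxDatagramSockets system property and falls back to 25 when it is unset. Treating a non-numeric value as unset is an assumption of this sketch, and `DatagramSocketLimit` is a hypothetical name.

```scala
import scala.util.Try

object DatagramSocketLimit {
  // The default cap observed via sun.net.ResourceManager.
  val Default = 25

  /** The cap implied by the given properties (the JVM's system properties by
    * default): -Dsun.net.maxDatagramSockets if it parses as an Int, else 25. */
  def effective(props: collection.Map[String, String] = sys.props): Int =
    props
      .get("sun.net.maxDatagramSockets")
      .flatMap(s => Try(s.toInt).toOption)
      .getOrElse(Default)
}
```

System properties are per JVM, so the -D flag only matters for the JVM that actually creates the sockets; with the command above that is sbt’s own JVM, since runMain is not forked by default.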

Now I knew that the instantiated AlgoliaClients weren’t being cleaned up properly, so I simply needed to close them.

Unfortunately, algoliaClient.close() doesn’t close its DNS name resolver. Fortunately, the DNS name resolver is public, so I can close it myself with algoliaClient.httpClient.dnsNameResolver.close().
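A sketch of a helper that performs both steps. With the real library the body would call client.close() and client.httpClient.dnsNameResolver.close() as described above; here `FakeClient`, `FakeHttpClient`, and `FakeResolver` are hypothetical stand-ins with the same shape, so the snippet runs without the Algolia dependency.

```scala
import java.io.Closeable
import scala.util.{Failure, Try}

// Hypothetical stand-ins that only mirror the shape of the real objects.
final class FakeResolver extends Closeable {
  var closed = false
  override def close(): Unit = closed = true
}
final class FakeHttpClient { val dnsNameResolver = new FakeResolver }
final class FakeClient extends Closeable {
  val httpClient = new FakeHttpClient
  var closed = false
  // Mimics the behavior described above: close() leaves the resolver open.
  override def close(): Unit = closed = true
}

object AlgoliaCleanup {
  /** Closes the client, then its DNS resolver explicitly. Both steps are
    * attempted even if the first throws; the first error is rethrown. */
  def closeFully(client: FakeClient): Unit = {
    val results = Seq(
      Try(client.close()),
      Try(client.httpClient.dnsNameResolver.close())
    )
    results.collectFirst { case Failure(e) => e }.foreach(e => throw e)
  }
}
```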

So I needed to clean up AlgoliaClient before Playframework reloads code. Fortunately, Playframework offers a stop hook that is invoked before code is reloaded.
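A minimal sketch of that wiring. Play’s Scala API exposes the hook as ApplicationLifecycle.addStopHook(hook: () => Future[_]); to keep the snippet runnable without Play, `Lifecycle` is a hypothetical stand-in reduced to that one method, `RecordingLifecycle` is a test double, and `StubClient`/`SearchClientProvider` are hypothetical names for the client and the class that owns it.

```scala
import java.io.Closeable
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

// Stand-in for Play's ApplicationLifecycle, reduced to the method used here.
trait Lifecycle {
  def addStopHook(hook: () => Future[_]): Unit
}

// Test double: records hooks and runs them one after another on stop().
final class RecordingLifecycle(implicit ec: ExecutionContext) extends Lifecycle {
  private val hooks = ListBuffer.empty[() => Future[_]]
  def addStopHook(hook: () => Future[_]): Unit = hooks += hook
  def stop(): Future[Unit] =
    hooks.foldLeft(Future.unit)((acc, hook) => acc.flatMap(_ => hook().map(_ => ())))
}

// Stand-in for AlgoliaClient: the client and its resolver are closed separately.
final class StubClient extends Closeable {
  @volatile var clientClosed = false
  @volatile var resolverClosed = false
  def close(): Unit = clientClosed = true
  def closeResolver(): Unit = resolverClosed = true
}

// Creates the client once and registers cleanup, so the next reload does not
// leak the resolver's datagram socket.
final class SearchClientProvider(lifecycle: Lifecycle) {
  val client = new StubClient
  lifecycle.addStopHook { () =>
    Future.fromTry(Try {
      client.close()
      client.closeResolver() // stands in for client.httpClient.dnsNameResolver.close()
    })
  }
}
```

In a real Play app, `Lifecycle` would be the injected ApplicationLifecycle and `StubClient` the actual AlgoliaClient; Play calls the registered hooks when the application stops, which in dev mode happens before each reload.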

Here are some lessons for me: