Ansible + ControlMaster + Bastion host + high number of forks = flakiness

27 Mar 2017

I have been experimenting with Ansible on our 120 remote machines. I want to update them simultaneously.

Forks = 1

With forks = 1, the Ansible playbook run is super reliable. The connection to the bastion host is reused properly.

Forks = 10

With forks = 10, it fails once or twice, and sometimes it asks me to touch my security key. Being asked to touch the security key means the connection to the bastion host isn't being reused.

I think the problem occurs when 10 processes try to create the same ControlPath file for the bastion host at the same location.

A way to get around this is to use serial. The first batch size should be 1, so that a single process can create the ControlPath file for the bastion host without race conditions. The later batch sizes can be 10.
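
The serial workaround can be sketched as a playbook like the one below. This is only an illustration: the host pattern and the ping task are placeholders, not what I actually run. It also assumes the bastion's master connection stays open between batches (for example via ControlPersist).

```yaml
# Hypothetical playbook; the host pattern and the task are placeholders.
- hosts: all
  # First batch: 1 host, so only one process creates the bastion's ControlPath file.
  # Later batches: 10 hosts each (the last value repeats until all hosts are done).
  serial:
    - 1
    - 10
  tasks:
    - name: Check that the host is reachable
      ping:
```

With forks = 10, the first batch runs on a single host and every later batch runs on up to 10 hosts in parallel.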

Forks = 20

With forks = 20, it fails predictably; it immediately errors out for ~10 hosts. From more experiments, it seems that exactly 10 hosts always succeed and the rest always fail. So, the limit on concurrent SSH connections seems to be 10. The error message is: SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh. The serial technique from forks = 10 doesn't help.

It turns out that our bastion host's MaxSessions is 10. Because ControlMaster multiplexes every session over a single connection to the bastion host, we can't have more than 10 sessions on that connection.

I guess we can find a way to open multiple connections instead.