TLS connection between nodes failing after node reboot

asked 2019-07-01 11:10:30 -0500

VictorCM

Hi, I have a Kolla multi-node deployment with 2 nodes: one is used as controller and compute, and the other just for compute.

Everything was working fine, but yesterday the compute node had a problem and was restarted, and now it is not able to connect to the OpenStack environment again. While troubleshooting, I found that the problem is that the controller node cannot connect to the libvirt daemon on the failed compute node.

From Kibana logs:

log_level:ERROR Payload:Error starting thread.: HypervisorUnavailable: Connection to the hypervisor is broken on host: server3 2019-07-01 17:23:16.220 6 ERROR oslo_service.
log_level:ERROR Payload:Connection to libvirt failed: unable to connect to server at '10.10.150.30:16509': Connection refused: libvirtError: unable to connect to server at '10.10.150.30:16509': Connection refused 2019-07-01 17:23:16.194 6

On the controller I ran 'openstack hypervisor list':

+----+---------------------+-----------------+--------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP      | State |
+----+---------------------+-----------------+--------------+-------+
|  1 | server3.ii          | QEMU            | 10.10.150.30 | down  |
|  2 | controller.ii       | QEMU            | 10.10.150.29 | up    |
+----+---------------------+-----------------+--------------+-------+

And on server3 I ran 'journalctl -u libvirt-bin':

Jul 01 17:06:13 server3 libvirtd[32751]: 32751: error : virNetTLSContextCheckCertFile:120 : Cannot read CA certificate '/etc/pki/CA/cacert.pem': No such file or dir
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Main process exited, code=exited, status=6/NOTCONFIGURED
Jul 01 17:06:13 server3 systemd[1]: Failed to start Virtualization daemon.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Unit entered failed state.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Failed with result 'exit-code'.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Service hold-off time over, scheduling restart.
Jul 01 17:06:13 server3 systemd[1]: Stopped Virtualization daemon.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Start request repeated too quickly.
Jul 01 17:06:13 server3 systemd[1]: Failed to start Virtualization daemon.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Unit entered failed state.
Jul 01 17:06:13 server3 systemd[1]: libvirt-bin.service: Failed with result 'start-limit-hit'.
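The journal shows libvirtd refusing to start because it cannot read '/etc/pki/CA/cacert.pem'. A small check along these lines could confirm which of the TLS files are actually missing on server3. It is only a sketch: it assumes libvirtd is using libvirt's default certificate locations (not overridden in libvirtd.conf), and `missing_tls_files` and its `root` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Default locations libvirtd reads when TLS is enabled and libvirtd.conf
# does not override them (assumption: this deployment uses the defaults).
DEFAULT_TLS_FILES = (
    "/etc/pki/CA/cacert.pem",
    "/etc/pki/libvirt/servercert.pem",
    "/etc/pki/libvirt/private/serverkey.pem",
)


def missing_tls_files(paths=DEFAULT_TLS_FILES, root="/"):
    """Return the entries of `paths` that are not regular files under `root`."""
    base = Path(root)
    return [p for p in paths if not (base / p.lstrip("/")).is_file()]


if __name__ == "__main__":
    # Run this in the same filesystem libvirtd sees (host or container).
    for path in missing_tls_files():
        print("missing:", path)
```

If the CA certificate is the only file reported, that would match the single error in the journal; if all three are reported, the whole certificate directory may not have been restored after the reboot.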

From this I have gathered the following, and I have some questions:

  1. Why could the compute node have become misconfigured? My guess is that the libvirt daemon crashed and my controller was trying to connect to one with a different PID.
  2. I am using a TLS connection, but the controller is connecting to server3:16509, where 16509 is the port for TCP connections, instead of 16514, which is the port for TLS. (I have TCP disabled in the /etc/libvirt/libvirtd.conf file.) Could this be the problem too?
  3. Although the controller is using the default TCP port for the libvirt daemon, the daemon is reporting an error about the CA certificate, which is used for TLS connections. I see this as the main problem, but how can I solve it? I am not very familiar with this, and it is generated automatically on first deployment, so I don't know what ...
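Regarding point 2, one way to see which libvirt port (if any) is accepting connections on server3 is a plain TCP probe from the controller. The following is a minimal sketch, not a diagnosis: it only tests whether the port can be opened and does not perform a TLS handshake. The port numbers are libvirt's defaults; `port_accepts_connections` is a hypothetical helper name.

```python
import socket

LIBVIRT_TCP_PORT = 16509  # libvirt default for plain TCP
LIBVIRT_TLS_PORT = 16514  # libvirt default for TLS


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False
```

Calling `port_accepts_connections("10.10.150.30", LIBVIRT_TCP_PORT)` and the same with `LIBVIRT_TLS_PORT` from the controller would show whether either port is open. While libvirtd is failing to start, both would be expected to return False, which is consistent with the "Connection refused" in the Kibana log.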