trove guest instance starts normally, but trove-taskmanager still reports Error state

Good day, community! I have a multi-node OpenStack Mitaka deployment with a secured management network and an external network for instances, as in the OpenStack reference install guide.

The controller has access to both networks. trove-taskmanager uses RabbitMQ with rabbit_host set to CONTROLLER_MANAGEMENT_HOSTNAME, while the guest agent uses rabbit_host set to CONTROLLER_EXTERNAL_IP.

I have manually built an Ubuntu 14.04 guest image for Trove with MySQL 5.6. My trove-guestagent.conf looks like this (the rabbit settings are in both [oslo_messaging_rabbit] and [DEFAULT], just in case):

[DEFAULT]
verbose = True
debug = True
rabbit_host = CONTROLLER_EXTERNAL_IP
rabbit_userid = openstack
rabbit_password = PASS
nova_proxy_admin_user = admin
nova_proxy_admin_pass = ADMIN_PASS
nova_proxy_admin_tenant_name = service
trove_auth_url = http://CONTROLLER_MANAGEMENT_HOSTNAME:35357/v3
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-guestagent.log
datastore_manager = mysql
datastore_registry_ext = mysql:trove.guestagent.datastore.mysql.manager.Manager

[oslo_messaging_rabbit]
rabbit_host = CONTROLLER_EXTERNAL_IP
rabbit_userid = openstack
rabbit_password = PASS

trove_auth_url points at the management host (even though the instance has no access to that network), but according to the logs the guest agent never even tries to connect to that URL, so I don't think it is related to the issue.

When I start the instance, the guest agent starts normally, connects to RabbitMQ, receives all the tasks it needs to do, and finishes them without any error. In the end I have a fully functional MySQL database, accessible with the user/password I entered during creation. After that, the guest agent keeps repeating the following messages in its log:

2016-08-23 12:53:44.685 1620 DEBUG oslo_service.periodic_task [-] Running periodic task Manager.update_status run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
2016-08-23 12:53:44.686 1620 DEBUG trove.guestagent.datastore.manager [-] Update status called. update_status /usr/lib/python2.7/dist-packages/trove/guestagent/datastore/manager.py:246
2016-08-23 12:53:44.687 1620 DEBUG trove.guestagent.datastore.service [-] Determining status of DB server. update /usr/lib/python2.7/dist-packages/trove/guestagent/datastore/service.py:178
2016-08-23 12:53:44.688 1620 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): /usr/bin/mysqladmin ping execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:344
2016-08-23 12:53:44.737 1620 DEBUG oslo_concurrency.processutils [-] CMD "/usr/bin/mysqladmin ping" returned: 0 in 0.049s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:374
2016-08-23 12:53:44.741 1620 INFO trove.guestagent.datastore.mysql_common.service [-] MySQL Service Status is RUNNING.
2016-08-23 12:53:44.742 1620 DEBUG trove.guestagent.datastore.service [-] Casting set_status message to conductor (status is 'running'). set_status /usr/lib/python2.7/dist-packages/trove/guestagent/datastore/service.py:162
2016-08-23 12:53:44.745 1620 DEBUG trove.conductor.api [-] Making async call to cast heartbeat for instance: 7a54fad1-b87f-484b-8ed0-a413f68cedad heartbeat /usr/lib/python2.7/dist-packages/trove/conductor/api.py:49
2016-08-23 12:53:44.747 1620 DEBUG oslo_messaging._drivers.amqpdriver [-] CAST unique_id: b496930502f5448f8b9864b892d3162c exchange 'openstack' topic 'trove-conductor' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
2016-08-23 12:53:44.752 1620 DEBUG trove.guestagent.datastore.service [-] Successfully cast set_status. set_status /usr/lib/python2.7/dist-packages/trove/guestagent/datastore/service.py:168

But on the controller node the instance is still in BUILD state. After some time (the usage timeout), trove-taskmanager on the controller reports:

2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'trove.common.utils.poll_and_check' failed
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall Traceback (most recent call last):
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/dist-packages/trove/common/utils.py", line 200, in poll_and_check
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall     raise exception.PollTimeOut
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall PollTimeOut: Polling request timed out.
2016-08-23 16:02:02.748 3195 ERROR oslo.service.loopingcall 
2016-08-23 16:02:02.751 3195 ERROR trove.taskmanager.models [-] Failed to create instance 7a54fad1-b87f-484b-8ed0-a413f68cedad. Timeout waiting for instance to become active. No usage create-event was sent.
2016-08-23 16:02:02.777 3195 ERROR trove.taskmanager.models [-] Service status: ERROR
Service error description: guestagent error
2016-08-23 16:02:02.816 3195 DEBUG trove.db.models [-] Saving DBInstance: {u'cluster_id': None, u'shard_id': None, u'deleted_at': None, u'id': u'7a54fad1-b87f-484b-8ed0-a413f68cedad', u'datastore_version_id': u'292863b8-ffc7-4810-8921-11e452a2b3c2', 'errors': {}, u'hostname': None, u'server_status': None, u'task_description': 'Build error: guestagent timeout.', u'volume_size': 5, u'type': None, u'updated': datetime.datetime(2016, 8, 23, 13, 2, 2, 816686), '_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x7f1acc6c71d0>, u'deleted': 0, u'configuration_id': None, u'volume_id': u'f85d1530-7eca-40b5-98a9-8af791ec3dc7', u'slave_of_id': None, u'task_start_time': None, u'name': u'mysql_instance_1', u'task_id': 84, u'created': datetime.datetime(2016, 8, 23, 12, 45, 12), u'tenant_id': u'e9dc5711894649c7ac90a7300312257e', u'compute_instance_id': u'b256ef6c-f477-472d-9445-ffc1b90f946c', u'flavor_id': u'7'} save /usr/lib/python2.7/dist-packages/trove/db/models.py:62
2016-08-23 16:02:02.837 3195 ERROR trove.taskmanager.models [-] Trove instance status: ERROR
Trove instance status description: Build error: guestagent timeout.
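For context, the PollTimeOut above comes from the generic polling helper in trove.common.utils: the taskmanager repeatedly checks the instance's service status and gives up after usage_timeout seconds. A minimal sketch of that pattern (my simplification; the real implementation drives the loop with oslo.service's FixedIntervalLoopingCall, as the traceback shows):

```python
import time

class PollTimeOut(Exception):
    """Raised when the polled condition is not met within time_out seconds."""

def poll_until(retriever, condition, sleep_time=1, time_out=None):
    # Simplified stand-in for trove.common.utils.poll_until: fetch a value,
    # test the condition on it, and loop until it holds or the deadline passes.
    start = time.time()
    while True:
        if condition(retriever()):
            return
        if time_out is not None and time.time() - start > time_out:
            raise PollTimeOut("Polling request timed out.")
        time.sleep(sleep_time)
```

With usage_timeout = 600 in my trove-taskmanager.conf (below), the build should be declared failed if the reported guest status never reaches ACTIVE within about ten minutes.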

My trove-taskmanager.conf is as follows:

[DEFAULT]
verbose = True
use_syslog = False
debug = True
usage_timeout = 600
update_status_on_fail = True
rpc_backend=rabbit
control_exchange = trove
db_api_implementation = trove.db.sqlalchemy.api
trove_auth_url = http://CONTROLLER_MANAGEMENT_HOSTNAME:5000/v2.0
nova_compute_url = http://CONTROLLER_MANAGEMENT_HOSTNAME:8774/v2
cinder_url = http://CONTROLLER_MANAGEMENT_HOSTNAME:8776/v1
swift_url = http://CONTROLLER_MANAGEMENT_HOSTNAME:8080/v1/AUTH_
trove_volume_support = True
block_device_mapping = vdb
device_path = /dev/vdb
mount_point = /var/lib/mysql
volume_time_out=30
server_delete_time_out=480
use_nova_server_config_drive = True
nova_proxy_admin_user = admin
nova_proxy_admin_pass = PASS
nova_proxy_admin_tenant_name = service
taskmanager_manager=trove.taskmanager.manager.Manager
trove_dns_support = False
dns_account_id = 123456
dns_auth_url = http://127.0.0.1:5000/v2.0
dns_username = user
dns_passkey = password
dns_ttl = 3600
dns_domain_name = 'trove.com.'
dns_domain_id = 11111111-1111-1111-1111-111111111111
dns_driver = trove.dns.designate.driver.DesignateDriver
dns_instance_entry_factory = trove.dns.designate.driver.DesignateInstanceEntryFactory
dns_endpoint_url = http://127.0.0.1/v1/
dns_service_type = dns
network_driver = trove.network.nova.NovaNetwork
default_neutron_networks =
trove_security_groups_support = True
trove_security_group_rule_cidr = 0.0.0.0/0
agent_heartbeat_time = 10
agent_call_low_timeout = 5
agent_call_high_timeout = 150
agent_replication_snapshot_timeout = 36000
use_nova_server_volume = False
network_label_regex = ^private$
template_path = /etc/trove/templates/
log_dir = /var/log/trove
notifier_queue_hostname = CONTROLLER_MANAGEMENT_HOSTNAME
pydev_debug = disabled
[database]
connection = mysql+pymysql://trove:PASS@CONTROLLER_MANAGEMENT_HOSTNAME/trove
idle_timeout = 3600
[profiler]
[oslo_messaging_rabbit]
rabbit_host=CONTROLLER_MANAGEMENT_HOSTNAME
rabbit_userid=openstack
rabbit_password=pass
...
[other_values_from_example_config]
...

Has anyone encountered a similar problem? The only thing I can think of is time synchronization between the guest and the controller (as you can see, the timestamps differ between the guest and controller logs); I will update the post once I set up synchronization. On the other hand, the guest fetches the initial message normally even with a different time set. But I would like to know whether there are other possible reasons for this kind of error.
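To put a rough number on the suspected skew, here is the difference between the two log timestamps quoted above (only date arithmetic, using GNU date as on the Ubuntu guest; this assumes the two excerpts were written around the same real moment):

```shell
# Apparent clock skew between guest and controller, taken from the log
# lines above (guest: 12:53:44, controller: 16:02:02, same day).
guest=$(date -ud '2016-08-23 12:53:44' +%s)
controller=$(date -ud '2016-08-23 16:02:02' +%s)
echo "$(( controller - guest )) seconds of apparent skew"   # 11298 s, ~3h08m
```

Since usage_timeout is 600, the error would normally appear about ten minutes after the build starts, so most of this three-hour gap would indeed be clock skew rather than elapsed time.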