
neutron: Error, "AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=compute1.example.com could not be found", caused by 'l2population'

asked 2018-01-03 18:47:23 -0500

dreamer.zzp

updated 2018-01-04 02:34:41 -0500

Can anybody help with this issue?

After I deployed my HA OpenStack cluster (non-production), shutting down a VM produced the following error messages. Unless I restart or stop neutron-linuxbridge-agent.service on the compute node, these logs never stop printing.

  • /var/log/neutron/server.log on the controller node (the full error logs are in the last part).

    2017-12-28 16:01:26.964 16265 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'd2ab84b4-8339-491b-888b-ffaede27d795', u'name': u'network-vif-unplugged', u'server_uuid': u'e6dac399-7743-46ed-a384-1cecca3ac3f4', u'code': 200}
    2017-12-28 16:01:27.646 16265 ERROR oslo_messaging.rpc.server [req-edcb230d-6314-4b87-b13e-51691254391d - - - - -] Exception during message handling: AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=compute1.example.com could not be found
    
  • /var/log/neutron/linuxbridge-agent.log on the compute node (the full error logs are in the last part).

    2017-12-28 16:01:32.881 1510 ERROR neutron.plugins.ml2.drivers.agent._common_agent [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming\n    res = self.dispatcher.dispatch(message)\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch\n    result = func(ctxt, **new_args)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py", line 234, in update_device_down\n    n_const.PORT_STATUS_DOWN, host)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py", line 331, in notify_l2pop_port_wiring\n    l2pop_driver.obj.update_port_down(port_context)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 253, in update_port_down\n    admin_context, agent_host, [port[\'device_id\']]):\n', u'  File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 303, in list_router_ids_on_host\n    context, constants.AGENT_TYPE_L3, host)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/db/agents_db.py", line 291, in _get_agent_by_type_and_host\n    host=host)\n', u'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=compute1.example.com could not be found\n'].
    

I used Pike to deploy my HA OpenStack cluster; the OS is CentOS 7.x. There are four nodes in this cluster: three controller nodes and one compute node. All four nodes are VMs on a single physical host, each with 4 CPU cores and 8 GB of RAM. Controller and cluster services (such as Pacemaker, HAProxy, Memcached, RabbitMQ, MariaDB, Keystone, and so on) are all deployed on the controller nodes. Host names are resolved through a DNS server, and the clocks of all nodes are synchronized through an NTP server.

Everything seemed to work well after deploying the HA cluster, until I shut down a VM and the error messages began to print.

This issue has confused me for several days. During that time I checked the neutron config files over and over, redeployed the neutron service many times, and tried many ways to locate what I did wrong and fix it, but ... (more)
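For anyone hitting the same trace, a quick diagnostic sketch (the hostname and file paths below are the ones from this thread and the standard install-guide layout; adjust them to your deployment):

    # Which agents does neutron have registered for the compute host?
    # The error means no "L3 agent" row exists for compute1.example.com.
    openstack network agent list --host compute1.example.com

    # Is l2population enabled? The failing code path in the traceback
    # only runs when the l2population mechanism driver is loaded.
    grep -E 'mechanism_drivers|l2_population' \
        /etc/neutron/plugins/ml2/ml2_conf.ini \
        /etc/neutron/plugins/ml2/linuxbridge_agent.ini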


Comments

Good, I can see the problem. The truth is that I'm looking for help because I have a similar problem; if I discover something, I'll gladly share it here.

https://ask.openstack.org/en/question...

gsic-emic ( 2018-01-08 10:18:27 -0500 )

You can try to disable the l2population mechanism; this may help you. I have posted more details in your own question. A sketch of the change is shown below this comment.

dreamer.zzp ( 2018-01-12 19:57:54 -0500 )
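For reference, a minimal sketch of what disabling l2population looks like on a Pike linuxbridge deployment, assuming the file layout from the official install guide (restart neutron-server and the linuxbridge agents afterwards); treat this as an illustration, not a verified fix:

    # /etc/neutron/plugins/ml2/ml2_conf.ini  (controller nodes)
    [ml2]
    # drop l2population from the mechanism driver list
    mechanism_drivers = linuxbridge

    # /etc/neutron/plugins/ml2/linuxbridge_agent.ini  (every node running the agent)
    [vxlan]
    enable_vxlan = true
    # disable the l2population extension in the agent as well
    l2_population = false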

It seems my cluster has the same issue. I'm curious whether, in the few months since, you have discovered another solution, or whether you have filed a bug report that I can track.

woltjert ( 2018-03-07 07:17:34 -0500 )

3 answers


answered 2018-01-11 08:41:22 -0500

Moss

updated 2018-03-23 08:52:35 -0500

I feel your pain. I don't have an answer for you, but I have a similar, even bigger problem:

I deployed the Pike release using a DVR+VRRP configuration based on this guide.

That setup requires Open vSwitch (linuxbridge doesn't support DVR), and Open vSwitch's DVR mode requires l2population to be enabled :D

I saw warnings that l2population causes problems, but it is required by Open vSwitch's DVR mode; after disabling l2population, the ovs-agent won't start :/ A sketch of the relevant settings follows.
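For context, here is roughly what the DVR-related settings look like in such a deployment. This is a sketch assuming the option names from the standard Pike networking guide, not my actual configuration files:

    # /etc/neutron/neutron.conf  (controller nodes)
    [DEFAULT]
    router_distributed = true

    # /etc/neutron/plugins/ml2/ml2_conf.ini  (controller nodes)
    [ml2]
    # DVR with OVS requires the l2population driver to be loaded
    mechanism_drivers = openvswitch,l2population

    # /etc/neutron/plugins/ml2/openvswitch_agent.ini  (compute nodes)
    [agent]
    enable_distributed_routing = true
    l2_population = true

    # /etc/neutron/l3_agent.ini  (compute nodes)
    [DEFAULT]
    agent_mode = dvr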

I also have problems with the metadata agent, which drops connections during VM provisioning because of the routers' HA.

I'm trying to get some support on IRC; I'll update you if I learn anything. Cheers!

UPDATE 2018-03-23: Dongcan Ye (hellochosen@gmail.com) solved my problem with the metadata agent. There is a bug in the metadata proxy if you enable dvr_snat and HA; check bug report #1606741. I applied his patch on all compute nodes and it works as expected.


Comments

Thank you for your attention.

I have considered using Open vSwitch, but right now I don't have much time to learn and test it.

I have already added Open vSwitch to my study plan; if I find anything that can help you, I'll share it.

Thank you again.

dreamer.zzp ( 2018-01-12 19:24:15 -0500 )

answered 2018-03-20 04:58:19 -0500

ktibi

Hi,

I have the same issue with Pike on CentOS 7. I deployed with Kolla.

Failed to update device 176307e6-1379-479e-b7c8-1198412b4a51 down: AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=compute02 could not be found

openstack network agent list --agent-type l3
+--------------------------------------+------------+--------+-------------------+-------+-------+------------------+
| ID                                   | Agent Type | Host   | Availability Zone | Alive | State | Binary           |
+--------------------------------------+------------+--------+-------------------+-------+-------+------------------+
| 8f3e7781-15f5-4596-b496-125d9188cd34 | L3 agent   | ctrl01 | nova              | :-)   | UP    | neutron-l3-agent |
| a3bd673b-84a8-48c5-a5d5-ed7411dd664f | L3 agent   | ctrl02 | nova              | :-)   | UP    | neutron-l3-agent |
| a4c00ca3-54d5-4e07-adf5-3f866a0d2a88 | L3 agent   | ctrl03 | nova              | :-)   | UP    | neutron-l3-agent |
+--------------------------------------+------------+--------+-------------------+-------+-------+------------------+

But I use OVS, so I think the issue is in the neutron/ML2 code rather than in the linuxbridge or OVS agent code: the traceback shows the l2pop driver's update_port_down() unconditionally calling list_router_ids_on_host() for the compute host, which raises AgentNotFoundByTypeHost when no L3 agent is registered on that host.


answered 2018-01-18 14:46:34 -0500

ybob

I met the same problem. It seems the code tries to look up a record with agent_type=L3 agent and host=XX in the database (XX being the node where the instance is located). However, the installation guide only installs the L3 agent on the controller node.

root@controller:~/openstack/pike/controller/script# openstack network agent list --agent-type l3
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+
| ID                                   | Agent Type | Host       | Availability Zone | Alive | State | Binary           |
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+
| db00614d-5172-4057-b688-87b137fe3033 | L3 agent   | controller | nova              | :-)   | UP    | neutron-l3-agent |
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+

Then I simply installed neutron-l3-agent on all of the compute nodes (rough steps sketched below).
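Roughly, the steps on CentOS 7 with Pike look like this (package and service names assumed from the RDO install guide; my exact commands aren't shown here):

    # on each compute node: the L3 agent ships in the main neutron package
    yum install openstack-neutron

    # point the agent at the right interface driver in /etc/neutron/l3_agent.ini:
    #   [DEFAULT]
    #   interface_driver = linuxbridge

    systemctl enable neutron-l3-agent.service
    systemctl start neutron-l3-agent.service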

root@cactl:~/openstack/pike/controller/script# openstack network agent list --agent-type l3
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+
| ID                                   | Agent Type | Host       | Availability Zone | Alive | State | Binary           |
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+
| 64471b43-fc9f-4991-9fe4-d314551f803a | L3 agent   | cac01      | nova              | :-)   | UP    | neutron-l3-agent |
| 67eec814-418b-41f2-80b3-cf134f5c1983 | L3 agent   | cac02      | nova              | :-)   | UP    | neutron-l3-agent |
| 5b537e1b-d35d-4370-93a7-5a2b529acd02 | L3 agent   | cac03      | nova              | :-)   | UP    | neutron-l3-agent |
| db00614d-5172-4057-b688-87b137fe3033 | L3 agent   | controller | nova              | :-)   | UP    | neutron-l3-agent |
+--------------------------------------+------------+------------+-------------------+-------+-------+------------------+

And the problem seems to have disappeared.

It should be a bug in the code: the host in the lookup ought to be "controller" instead of the name of one of the compute nodes.

Anyway, the method I described above seems to be a workaround, but I'm not sure about the side effects.


Comments

A side effect was found. When I used a self-service network, the instance reported a warning message:

url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: bad status code [500]

We definitely need a better solution.

ybob ( 2018-01-21 17:54:31 -0500 )
