Unable to launch GPU VMs: Filter PciPassthroughFilter returned 0 hosts

asked 2020-07-07 08:39:28 -0500

Hello,

I am using OpenStack (Train) and I cannot create VMs with GPU passthrough.


Failed action

Whenever I try to create a server with a flavor carrying the GPU pci_passthrough property, I get an error:

$ openstack server create --volume xxxx --flavor 1.gpu  --wait test-gpu-server
Error creating server: test-gpu-server
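(The same failure also shows up on the instance record; a command along these lines pulls out the fault, assuming the server name used above:)

$ openstack server show test-gpu-server -c status -c fault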

The GPU flavor was created with these commands:

$ openstack flavor create --ram 4096 --disk 10 --vcpus 1 1.gpu
$ openstack flavor set 1.gpu --property "pci_passthrough:alias"="a1:1"
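(For reference, the extra spec set above can be double-checked with something like:)

$ openstack flavor show 1.gpu -c properties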

The nova-scheduler/nova-conductor trace is:

Jul 07 12:39:21 nova-api-container- nova-scheduler[17057]: INFO nova.filters [XXXX - default default] Filtering removed all hosts for the request with instance ID 'XXXX'. Filter results: ['AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', 'AggregateInstanceExtraSpecsFilter: (start: 2, end: 2)', 'AggregateCoreFilter: (start: 2, end: 2)', 'AggregateNumInstancesFilter: (start: 2, end: 2)', 'AggregateIoOpsFilter: (start: 2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'ServerGroupAntiAffinityFilter: (start: 2, end: 2)', 'ServerGroupAffinityFilter: (start: 2, end: 2)', 'NUMATopologyFilter: (start: 2, end: 2)', 'PciPassthroughFilter: (start: 2, end: 0)']
Jul 07 12:39:21 nova-api-container- nova-conductor[17035]: ERROR nova.conductor.manager [XXXX- default default] Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Traceback (most recent call last):
  File "/openstack/venvs/nova-20.1.2/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
    return func(*args, **kwargs)
  File "/openstack/venvs/nova-20.1.2/lib/python3.6/site-packages/nova/scheduler/manager.py", line 214, in select_destinations
    allocation_request_version, return_alternates)
  File "/openstack/venvs/nova-20.1.2/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 96, in select_destinations
    allocation_request_version, return_alternates)
  File "/openstack/venvs/nova-20.1.2/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 265, in _schedule
    claimed_instance_uuids)
  File "/openstack/venvs/nova-20.1.2/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 302, in _ensure_sufficient_hosts
    raise exception.NoValidHost(reason=reason)
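(From the filter list in that log line, the scheduler's filter configuration presumably looks roughly like the following; this is reconstructed from the log, not copied from nova.conf:)

[filter_scheduler]
enabled_filters = AvailabilityZoneFilter,ComputeFilter,AggregateInstanceExtraSpecsFilter,AggregateCoreFilter,AggregateNumInstancesFilter,AggregateIoOpsFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter,PciPassthroughFilter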

My configuration

Hosts: the nova-api containers (and the compute hosts without GPUs)

[pci] section of the nova.conf file:

alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"a1" }

Host: the GPU compute host

[pci] section of the nova.conf file:

pci_passthrough_whitelist = [{ "vendor_id": "10de", "product_id": "1db4" }]
alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"a1" }
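(For clarity on where these lines sit, the full stanza on the GPU host, including the section header described above, is:)

[pci]
pci_passthrough_whitelist = [{ "vendor_id": "10de", "product_id": "1db4" }]
alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"a1" }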

The Nova daemons were restarted after modifying the config files.

The vfio-pci driver is bound to the device:

root@gpu-host:~# lspci -nnk -d 10de:1db4
86:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100 [Tesla V100 PCIe] [10de:1214]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau
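(As an extra check, the IOMMU grouping for the card can be confirmed through sysfs; the path below assumes the standard layout and resolves to an entry under /sys/kernel/iommu_groups/ when the IOMMU is active:)

root@gpu-host:~# readlink /sys/bus/pci/devices/0000:86:00.0/iommu_group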

The control plane

Galera shows this:

MariaDB [nova]> SELECT * FROM pci_devices;
+---------------------+---------------------+---------------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+-------------+--------------------------------------+
| created_at          | updated_at          | deleted_at          | deleted | id | compute_node_id | address      | product_id | vendor_id | dev_type | dev_id           | label           | status    | extra_info | instance_uuid | request_id | numa_node | parent_addr | uuid                                 |
+---------------------+---------------------+---------------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+-------------+--------------------------------------+
| 2020-07-01 13:47:49 | 2020-07-01 16:56:40 | 2020-07-02 07:31:02 |       1 |  1 |               5 | 0000:86:00.0 | 1db4       | 10de      | type-PCI | pci_0000_86_00_0 | label_10de_1db4 | available | {}         | NULL          | NULL       |         1 | NULL        | c9eb94ad-0738-47c2-a98a-025cb0fe2923 |
| 2020-07-02 07:31:53 | NULL                | NULL                |       0 |  3 |               5 | 0000:86:00.0 | 1db4       | 10de      | type-PCI | pci_0000_86_00_0 ...
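(A narrower query that filters out the deleted row shows just the live entry; the columns are the same ones as in the full dump above:)

MariaDB [nova]> SELECT address, product_id, vendor_id, status, deleted FROM pci_devices WHERE deleted = 0;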