Small HA installation: ceph nodes not considered?

Hello, sorry for the long post; I've tried to include as many details as possible below. I'm setting up a small test lab for an HA OpenStack environment based on Queens and CentOS 7 (to stay as close as possible to OSP 13, which is my target), with the OpenStack nodes running as oVirt VMs. The director is also a VM in oVirt.

The idea is to have 2 compute, 3 controller and 3 Ceph storage nodes (for image, block, object and Manila). Each node has one 60 GB root disk; the Ceph nodes have 2 more disks (100 GB for the journal and 150 GB for the OSD).

I have installed the undercloud and tried several variants of the instackenv.json file for introspection; all nodes are correctly introspected, with the VMs powering on and off as expected. I have 4 questions:

  • which values should I use for this small storage cluster to override the default Ceph parameters (pg_num, mon_max_pg_per_osd, etc.) without getting errors during the deploy?

  • what is the correct property to set in instackenv.json, or through the "openstack baremetal node set --property ..." command, so that the Ceph OSD role is mapped to the 3 designated hosts?

  • at which stage of the overcloud deploy are the Ceph nodes expected to be powered on and installed?

  • is it correct that in this layout the mon, mgr and mds services are deployed on the controller nodes as Docker containers, while only the OSDs run on the dedicated storage nodes?

Thanks, Gianluca

Details: for the Ceph OSD nodes I have tried to set these capabilities in the instackenv.json file:

"name": "ostack-ceph2",
"capabilities": "profile:ceph-storage,node:ceph-2,boot_option:local"

I then combine this with a scheduler_hints_env.yaml file of this type:

parameter_defaults:
  ControllerSchedulerHints:
    'capabilities:node': 'controller-%index%'
  ComputeSchedulerHints:
    'capabilities:node': 'compute-%index%'
  CephStorageSchedulerHints:
    'capabilities:node': 'ceph-%index%'
  HostnameMap:
    overcloud-controller-0: ostack-ctrl0
    overcloud-controller-1: ostack-ctrl1
    overcloud-controller-2: ostack-ctrl2
    overcloud-novacompute-0: ostack-compute0
    overcloud-novacompute-1: ostack-compute1
    overcloud-ceph-storage-0: ostack-ceph0
    overcloud-ceph-storage-1: ostack-ceph1
    overcloud-ceph-storage-2: ostack-ceph2
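
If I understand the profile matching correctly, the corresponding undercloud flavors also need the matching profile capability; these are the flavor properties I believe are needed (assuming the default flavor names created by the undercloud install):

openstack flavor set --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="control" control
openstack flavor set --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="compute" compute
openstack flavor set --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="ceph-storage" ceph-storage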

But while the compute and controller nodes are deployed fine and their hostnames are correctly mapped, the Ceph nodes remain untouched, not even powered on. I don't know whether this depends on the expected workflow, i.e. whether they are only set up at a final stage that is never reached.

For the Ceph part, I'm passing these environment files to the overcloud deploy:

-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-mds.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml \
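
For completeness, the full deploy command looks roughly like this (the file names under ~/templates are just my own; scheduler_hints_env.yaml is the file shown above, node-counts.yaml and ceph-overrides.yaml are the two snippets below):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-mds.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml \
  -e ~/templates/scheduler_hints_env.yaml \
  -e ~/templates/node-counts.yaml \
  -e ~/templates/ceph-overrides.yaml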

I'm also using this env file for the node counts:

parameter_defaults:
  ControllerCount: 3
  ComputeCount: 2
  CephCount: 3
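
One doubt I have here (just my assumption from looking at the roles in tripleo-heat-templates): the count parameters seem to be derived from the role name, so for the default CephStorage role the parameter would be CephStorageCount rather than CephCount, i.e. something like:

parameter_defaults:
  ControllerCount: 3
  ComputeCount: 2
  # my assumption: the default role is named CephStorage, so the count
  # parameter would be CephStorageCount rather than CephCount
  CephStorageCount: 3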

BTW: at the very beginning of the deploy I do indeed get a message that the CephCount parameter is ignored (?).

Initially I also received errors during the Ceph setup due to the low default PG numbers:

"stderr": "Error ERANGE:  pg_num 128 size 3 would mean 768 total pgs, which exceeds max 750 (mon_max_pg_per_osd 250 * num_in_osds 3)"

So I'm trying to change the defaults with this env file:

parameter_defaults:
  CephPoolDefaultSize: 3
  CephPoolDefaultPgNum: 64
  CephConfigOverrides:
    mon_max_pg_per_osd: 400
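
My understanding (again an assumption on my side) is that whatever goes under CephConfigOverrides is written by ceph-ansible into the [global] section of ceph.conf on the overcloud nodes, e.g.:

[global]
mon_max_pg_per_osd = 400

For the OSD/journal disk layout (question 2 above) I'm guessing the mapping is not a node property at all but the CephAnsibleDisksConfig parameter, something like this (the /dev/vdX device names are only my assumption of how the extra disks show up inside the VMs):

parameter_defaults:
  CephAnsibleDisksConfig:
    osd_scenario: non-collocated
    devices:
      - /dev/vdc          # 150 GB data disk for the OSD (assumed device name)
    dedicated_devices:
      - /dev/vdb          # 100 GB journal disk (assumed device name)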

Right now the deploy seems stuck after this step:

2020-04-28 14:04:27Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.2]: CREATE_COMPLETE  state changed

and on the controller nodes I have:

[root@ostack-ctrl0 ~]# ceph -s
  cluster:
    id:     5d194678-8950-11ea-b8c5-566f3d480013
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 256 pgs inactive

  services:
    mon: 3 daemons, quorum ostack-ctrl2,ostack-ctrl0,ostack-ctrl1
    mgr: ostack-ctrl1(active), standbys: ostack-ctrl2, ostack-ctrl0
    mds: cephfs-1/1/1 up  {0=ostack-ctrl0=up:creating}, 2 up:standby
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0B
    usage:   0B used, 0B / 0B avail
    pgs:     100.000% pgs unknown
             256 unknown

[root@ostack-ctrl0 ~]# 


[root@ostack-ctrl0 log]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 256 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsostack-ctrl0(mds.0): 31 slow metadata IOs are blocked > 30 secs, oldest blocked for 7665 secs
PG_AVAILABILITY Reduced data availability: 256 pgs inactive
    pg 1.46 is stuck inactive for 7689.858680, current state unknown, last acting []
    pg 1.47 is stuck inactive for 7689.858680, current state unknown, last acting []
    pg 1.48 is stuck inactive for 7689.858680, current state unknown, last acting []
...
    pg 2.5e is stuck inactive for 7688.281643, current state unknown, last acting []
    pg 2.5f is stuck inactive for 7688.281643, current state unknown, last acting []
[root@ostack-ctrl0 log]# 
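
From the undercloud, these are the checks I run to see whether the Ceph nodes were ever provisioned at all:

source ~/stackrc
openstack server list               # overcloud instances created so far
openstack baremetal node list       # power/provision state of the registered nodes
openstack overcloud profiles list   # profile matched to each registered node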

Thanks for any hint, insight and help in reaching the desired configuration.

Gianluca