
ceph node recover block

asked 2019-02-04 05:13:19 -0500

lelunicu

Hi, when I write an image into a Ceph cluster, will this image be copied to every node of the cluster? If a block of this image goes bad over time, will that block be replaced with a good one? Or do we need btrfs mirroring on every cluster node to achieve this? Thanks.


1 answer


answered 2019-02-04 06:05:34 -0500

eblock

That's not how Ceph works. A Ceph cluster consists of multiple nodes, each with one or more OSDs. Your data (an image) is striped into many objects, typically 4 MB each, and those objects are mapped to Placement Groups (PGs). A pool consists of these PGs, and for resiliency each PG is stored on different OSDs. In the case of a replicated pool with size 3, each PG is stored on three different OSDs. A client (e.g. Glance) only has to read from the primary OSD, while a write is only complete once all three copies are written, so reads are faster than writes. Just to give you an example (there's only one image in this pool):

# glance image
control:~ # openstack image list
+--------------------------------------+--------+--------+
| ID                                   | Name   | Status |
+--------------------------------------+--------+--------+
| 4578b1bd-fe9a-4547-9bc5-97372f0a5721 | Cirros | active |
+--------------------------------------+--------+--------+

# list pool content
ceph-2:~ # rbd -p glance8 ls
4578b1bd-fe9a-4547-9bc5-97372f0a5721

ceph-2:~ # rados df | grep glance8
POOL_NAME               USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS      RD WR_OPS      WR
glance8               79 MiB      16      0     48                  0       0        0 294616 468 MiB   5264 162 MiB

# show PG placement
ceph-2:~ # ceph pg ls-by-pool glance8 | tr -s ' ' | cut -d " " --fields=1,2,7,13 | column -t
PG    OBJECTS  LOG  ACTING
19.0  1        54   [7,3,0]p7
19.1  1        2    [6,1,7]p6
19.2  0        0    [3,0,7]p3
19.3  1        55   [2,0,3]p2
19.4  2        83   [6,7,1]p6
19.5  2        123  [7,6,1]p7
19.6  2        98   [1,7,6]p1
19.7  2        53   [6,1,7]p6
19.8  0        84   [6,2,1]p6
19.9  1        98   [3,6,7]p3
19.a  2        4    [2,6,3]p2
19.b  2        51   [1,7,0]p1
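
To confirm the 4 MB object size mentioned above and trace a single object to its PG and OSDs, something like this works (a minimal sketch; the image name is taken from the listing above, and the object name is a placeholder you would take from "rados -p glance8 ls"):

# show the image's object size (rbd images default to 4 MiB objects, order 22)
ceph-2:~ # rbd -p glance8 info 4578b1bd-fe9a-4547-9bc5-97372f0a5721

# map one object (placeholder name) to its PG and acting OSD set
ceph-2:~ # ceph osd map glance8 <object-name>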

So as you can see, the single image is divided into 16 objects, placed in different PGs and on different OSDs. If one object/PG/OSD gets corrupted somehow, Ceph will try to recover from the remaining healthy copies. One piece of advice regarding the replication size: avoid using only 2 copies (except for tests), it will get you into trouble sooner or later.
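
To check (and, if needed, raise) the replication size of a pool and watch the cluster recover, the following is a rough sketch, assuming the pool name glance8 from above:

# show how many copies the pool keeps
ceph-2:~ # ceph osd pool get glance8 size

# raise it to 3 copies if you are still running with 2
ceph-2:~ # ceph osd pool set glance8 size 3

# watch cluster health and any recovery/backfill activity
ceph-2:~ # ceph -s
ceph-2:~ # ceph health detail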


Comments

Is the Ceph file system on top of btrfs, XFS and so on? Is it compulsory to have the Ceph file system on a Ceph node?

lelunicu ( 2019-02-05 02:53:06 -0500 )

I think you should familiarize yourself with Ceph, otherwise this will get out of hand. Ceph OSDs (where the data is stored) don't sit on a filesystem anymore; they used to be on XFS, but BlueStore is the way to go now, although it's still possible to use filestore OSDs.

eblock ( 2019-02-06 02:51:06 -0500 )
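
If you want to verify which backend your OSDs actually use, a quick check (a sketch, assuming OSD id 0):

# print the object store backend of OSD 0 (bluestore or filestore)
ceph-2:~ # ceph osd metadata 0 | grep osd_objectstore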

I'm not sure what you mean by "ceph file system". There is CephFS (the Ceph Filesystem), which you can mount to provide POSIX-compliant shared storage (like NFS) to clients, but that's optional and probably not what you're asking. Ceph is software-defined storage, so on top of the OS you run Ceph.

eblock ( 2019-02-06 02:53:31 -0500 )
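
If you do want to see whether a CephFS exists and mount it, a rough sketch (the monitor address and secret are placeholders):

# list CephFS filesystems, if any are deployed
ceph-2:~ # ceph fs ls

# mount a CephFS with the kernel client; <mon-ip> and <key> are placeholders
client:~ # mount -t ceph <mon-ip>:6789:/ /mnt/cephfs -o name=admin,secret=<key>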
