/r/ceph
Do I just mount the CephFS at /mnt/maildir and set the mail location to /mnt/maildir, or are there additional configurations?
mount -t ceph name@.fs_name=/ /mnt/maildir -o mon_addr=1.2.3.4
mail_location = maildir:/mnt/maildir
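To make the mount survive reboots, a hedged sketch of a matching fstab entry (the monitor address and the secret file path are placeholders taken from the example above, not verified values):
# /etc/fstab -- persistent CephFS mount for the maildir (adjust name, fsid, mon_addr and secretfile)
name@.fs_name=/    /mnt/maildir    ceph    mon_addr=1.2.3.4,secretfile=/etc/ceph/maildir.secret,_netdev    0 0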
Is there a way to change the standard command to this value:
smartctl -d cciss,0 -x --json=o /dev/sdg
Thank you in advance
Hi All,
I am trying to understand the output of 'ceph df'.
All of these pools, with the exception of "cephfs_data", are 3x replicated pools, but I don't understand why the 'STORED' and 'USED' values for the pools are exactly the same. We have another cluster where USED does show around 3x the STORED value, which is correct, but I'm not sure why this cluster shows identical values.
Secondly, I am confused why USED in the "RAW STORAGE" section shows 24 TiB, while the USED values in the pools section only sum to roughly ~1.5 TiB.
Can someone please explain or mention if I am doing something wrong?
Thanks!
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 894 TiB 873 TiB 21 TiB 21 TiB 2.35
ssd 265 TiB 262 TiB 3.3 TiB 3.3 TiB 1.26
TOTAL 1.1 PiB 1.1 PiB 24 TiB 24 TiB 2.10
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 263 MiB 148 263 MiB 0 83 TiB
vms 2 2048 902 GiB 163.61k 902 GiB 0.35 83 TiB
images 3 128 315 GiB 47.57k 315 GiB 0.12 83 TiB
backups 4 128 0 B 0 0 B 0 83 TiB
testbench 5 1024 0 B 0 0 B 0 83 TiB
cephfs_data 6 32 0 B 0 0 B 0 83 TiB
cephfs_metadata 7 32 5.4 KiB 22 5.4 KiB 0 83 TiB
To confirm, I can see for one pool that this is actually a 3x replicated pool
~# ceph osd pool get vms all
size: 3
min_size: 2
pg_num: 2048
pgp_num: 2048
crush_rule: SSD
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
fast_read: 0
pg_autoscale_mode: off
~# ceph osd crush rule dump SSD
{
    "rule_id": 1,
    "rule_name": "SSD",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -2,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
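For completeness, a small loop (a sketch using standard ceph CLI calls) confirms the replication factor of every pool at once rather than one at a time:
# print the size (replication factor) for each pool in the cluster
for pool in $(ceph osd pool ls); do
  echo -n "$pool: "
  ceph osd pool get "$pool" size
done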
Hello,
I'm facing poor write performance (IOPS and TPS) on a Linux VM running MongoDB.
Cluster:
Nodes: 3
Hardware: HP Gen11
Disks: 4x PM1733 enterprise NVMe ## with the latest firmware and driver.
Network: Mellanox-connectx-6 25 gig
PVE Version: 8.2.4 , 6.8.8-2-pve
Ceph:
Version: 18.2.2 Reef.
4 OSD's per node.
PG: 512
Replica 2/1
Additional ceph config:
bluestore_min_alloc_size_ssd = 4096 ## tried also 8K
osd_memory_target = 8G
osd_op_num_threads_per_shard_ssd = 8
OSD disks cache configured as "write through" ## Ceph recommendation for better latency.
Apply/commit latency is below 1 ms.
Network:
MTU: 9000
TX \ RX Ring: 2046
VM:
Rocky 9 (tried also ubuntu 22):
boot: order=scsi0
cores: 32
cpu: host
memory: 4096
name: test-fio-2
net0: virtio=BC:24:11:F9:51:1A,bridge=vmbr2
numa: 0
ostype: l26
scsi0: Data-Pool-1:vm-102-disk-0,size=50G ## OS
scsihw: virtio-scsi-pci
smbios1: uuid=5cbef167-8339-4e76-b412-4fea905e87cd
sockets: 2
tags: templatae
virtio0: sa:vm-103-disk-0,backup=0,cache=writeback,discard=on,iothread=1,size=33G ### Local disk - same NVME
virtio2: db-pool:vm-103-disk-0,backup=0,cache=writeback,discard=on,iothread=1,size=34G ### Ceph - same NVME
virtio23 db-pool:vm-104-disk-0,backup=0,cache=unsafe,discard=on,iothread=1,size=35G ### Ceph - same NVME
Disk1: Local nvme with iothread
Disk2: Ceph disk with Write Cache with iothread
Disk3: Ceph disk with Write Cache Unsafe with iothread
I ran an FIO test in one SSH session and iostat in a second session:
fio --filename=/dev/vda --sync=1 --rw=write --bs=64k --numjobs=1 --iodepth=1 --runtime=15 --time_based --name=fioa
Results:
Disk1 - Local nvme:
WRITE: bw=74.4MiB/s (78.0MB/s), 74.4MiB/s-74.4MiB/s (78.0MB/s-78.0MB/s), io=1116MiB (1170MB), run=15001-15001msec
TPS: 2500
Disk2 - Ceph disk with Write Cache:
WRITE: bw=18.6MiB/s (19.5MB/s), 18.6MiB/s-18.6MiB/s (19.5MB/s-19.5MB/s), io=279MiB (292MB), run=15002-15002msec
TPS: 550-600
Disk3 - Ceph disk with Write Cache Unsafe:
WRITE: bw=177MiB/s (186MB/s), 177MiB/s-177MiB/s (186MB/s-186MB/s), io=2658MiB (2788MB), run=15001-15001msec
TPS: 5000-8000
The VM disk cache is configured as "Write Cache".
The queue scheduler is configured as "none" (on the Ceph OSD disks as well).
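For comparison, a hedged sketch of a complementary fio run against the same test disk (parameters are illustrative, not a recommendation): the run above is a single-threaded sync write, which mostly measures per-operation latency, while a deeper queue shows how much aggregate throughput the RBD device can actually reach.
fio --filename=/dev/vda --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=15 --time_based --group_reporting --name=fio-qd32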
I'm also sharing rados bench results:
rados bench -p testpool 30 write --no-cleanup
Total time run: 30.0137
Total writes made: 28006
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3732.42
Stddev Bandwidth: 166.574
Max bandwidth (MB/sec): 3892
Min bandwidth (MB/sec): 2900
Average IOPS: 933
Stddev IOPS: 41.6434
Max IOPS: 973
Min IOPS: 725
Average Latency(s): 0.0171387
Stddev Latency(s): 0.00626496
Max latency(s): 0.133125
Min latency(s): 0.00645552
I've also removed one of the OSDs and ran an FIO test directly against the raw NVMe:
fio --filename=/dev/nvme4n1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=20 --time_based --name=fioaa
WRITE: bw=297MiB/s (312MB/s), 297MiB/s-297MiB/s (312MB/s-312MB/s), io=5948MiB (6237MB), run=20001-20001msec
Very good results.
Any suggestions on how to improve the write speed within the VM?
How can I find the bottleneck?
Many Thanks.
Looks like I still haven't given up on using Ceph ...
I'd be running it at home, on a home budget, and with home constraints on space, noise, power, etc. I also work from home, and it would be for my work data first and foremost.
I want something that is bulletproof and has as few SPoF as is reasonably achievable. And I'd like to get away from the concept of a single, large, complex, and expensive server. I have nightmares about that server failing while I'm on a project. Yes, I have backups, but not the time to build a new server and restore them.
Like, ZFS made the failure of a disk an absolute non-issue for me over a decade ago; to the point that I use recertified disks as one half of each mirror now—now I kind of want that for the entire hardware stack.
AFAICS, Ceph's kind of the only game in town.
The base will be Debian/Proxmox, not so much because running VMs is the primary use case, but because it supports Ceph OOTB. There will be the usual home/homelab VMs, but the main event is file server duty, really. Not sure about cephfs vs an RBD-backed conventional file server VM running Samba or ksmbd or whatever; opinions welcome.
3 nodes to start with, 5 may be achievable, but not short term:
Medium term, I want two 10G switches for the redundancy, but for now 10G+1G will have to do. If the switch fails, it'll be slow, but it won't burn—or will it?
This far I can't change much, because I can get this stuff for cheap. If it isn't possible to build something sane with this, then I'll just scrap the plan and stick to ZFS with replication.
The board has 6x SATA and a measly PCIe 3.0 x1 M.2 slot. Could add an HBA in the x16, move the NIC to the x4, but then I'd need a bigger case to hold them as well, so it'd drive up the non-drive costs/node quite a bit—may still be worth it, you tell me.
I'd like to have an HDD pool as well, maybe 1–2 disks/node; this is just for archival. And because I get the impression that having the DB/WAL on something faster doesn't do that much on Bluestore, especially if it's just store (large) file / retrieve (large) file, I thought I'd try doing without that, keep it "simple".
The rest would go to an SSD pool of Samsung PM883 960GB. Best GB/€ right now, well, after the 8 TB models, but going with those would mean 1 OSD/node, and from what I've read, that's not a good idea. ^^
Usually I'd mirror the boot device, but I don't see how I can spare two ports just to boot. Does it make sense to have a third pool on the single possible NVMe? I'm not sure about segmenting the storage so much, it's small as it is, but booting from an NVMe feels like a waste.
How many SSDs can I get away with, to start with? I.e., is there any sweet spot in a 3-node cluster? Capacity-wise 6 SSD OSDs total would probably be enough, and the fewer 1 TB SSDs I buy now, the sooner I can buy a set of larger ones.
But no clue about performance. As long as the Samba shares and the VMs don't feel slow, I'll be happy.
Much as I like to play around with cool tech—and Ceph is that—if I'm going to do this, I need it to be viable, not just cool. Is this viable? (How) can it be made viable?
P.S. Sorry, not sorry for another 3-node thread. I figure if there's enough visible demand, maybe, just maybe, such tiny deployments will get some love as well.
Hello,
I have a 5-node Ceph cluster that has NVMe and "spinner" drives. I have the cluster connected to a 3-node Proxmox cluster. The Proxmox cluster connects via RBD.
I also have a 3-node VMware cluster that connects to a Dell SAN via iSCSI. The Dell SAN also has SSD and "spinner" drives.
I ran Crystalmark test on a Windows VM that is on the Proxmox cluster on the "HDD" (or "spinner") pool and then also ran it on a Windows VM that is on the non-SSD storage on the VMware cluster. Here are the results:
Proxmox + Ceph (Writeback enabled):
though it did take a *long* while for it to complete (I didn't time it, but it was at least 20 minutes, quite possibly more).
VMware + iSCSI:
This one finished fairly quickly (within 5 minutes or so)
I was expecting the Proxmox + Ceph result to be "worse" than the VMware + iSCSI, if for no other reason than the Ceph cluster needs to write three copies of any changes to the VM files, but despite taking longer to complete the tests, the results are *much* better than the VMware + iSCSI.
How trustworthy are these results?
I’m working on a POC to use Ceph as a storage provider and have no issues setting it up either standalone or within a k8s cluster using Rook. My challenge is creating a straightforward, local “development environment” for each developer.
Ideally, I’d like a setup where developers can quickly test object storage (e.g., uploading and retrieving an image) with minimal overhead—ideally just one command to start, without requiring multiple nodes, VMs, or large storage capacities. Is it possible to achieve this with a real Ceph instance (not a limited or emulated version)?
Any tips or recommendations would be appreciated!
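One hedged sketch of a minimal single-host "real Ceph" setup for developers (standard cephadm/ceph orch calls; the IP, the RGW service name and the user ID are placeholders):
# bootstrap a one-node cluster with relaxed single-host defaults
cephadm bootstrap --mon-ip 192.168.0.50 --single-host-defaults
# turn any free disks on the host into OSDs
ceph orch apply osd --all-available-devices
# deploy a RADOS Gateway so developers get an S3 endpoint
ceph orch apply rgw dev
# create a test S3 user and grab its keys
radosgw-admin user create --uid=dev --display-name="dev user"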
Hi, I'm trying to create a new Ceph cluster for testing purposes in VMware, but Ceph cannot find any disks.
sdb is the disk I'm trying to add as an OSD.
Any help would be appreciated. This is driving me insane.
System:
Ubuntu 24.04
Ceph Squid 19.2.0
root@ceph:~$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 50G 0 disk
├─sda1 8:1 0 1G 0 part /boot/efi
├─sda2 8:2 0 2G 0 part /boot
└─sda3 8:3 0 21.9G 0 part
└─ubuntu--vg-ubuntu--lv 252:0 0 21.9G 0 lvm /
sdb 8:16 0 120G 0 disk
root@ceph:~$ sudo ceph orch apply osd --all-available-devices
Scheduled osd.all-available-devices update...
root@ceph:~$ sudo cephadm ceph-volume inventory
Inferring fsid e894abeb-9541-34gh-67fg-005056baa4cf
Using ceph image with id '37996728e013' and tag 'v19' created on 2024-09-27 18:08:21 -0400 EDT
quay.io/ceph/ceph@sha256:200087c35811bf28e8a8073b15fa86c07cce85c575f1ccd62d1d6ddbfdc6770a
Device Path Size Device nodes rotates available Model name
/dev/sdb 120.00 GB sdb True True Virtual disk
/dev/sda 50.00 GB sda True False Virtual disk
root@ceph:~$ sudo ceph device ls
DEVICE HOST:DEV DAEMONS WEAR LIFE EXPECTANCY
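A hedged set of checks that often helps here (the hostname "ceph" is taken from the prompt above; the zap is destructive, so only run it against the disk you intend to wipe): the orchestrator may simply have stale inventory, or the disk may carry leftover partitions or LVM signatures.
# force the orchestrator to rescan devices
ceph orch device ls --refresh
# check whether an OSD service spec is actually scheduled
ceph orch ls --service-type osd
# wipe any leftover signatures on the intended OSD disk (destructive!)
ceph orch device zap ceph /dev/sdb --force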
Hi all,
I just created my first 3-node cluster, primarily for Ceph.
Nodes: Asus S14NA-U12 MBO, AMD EPYC 8534P CPU, 256GB DDR5 RAM
CEPH NICs: 2 x 100G Mellanox ConnectX-4 - one for public and one for backend connection
I planned to use 4 NVMe drives but ended up with 2 per node for testing. I tested drive speed in a Windows VM, as I need to run 2 Windows VMs (RDS server and SQL server) in production.
ZFS, writeback cache DISABLED, mirror of 2 x Micron 7400 PRO 1.92TB (declared 4400/2000MB/s)
CEPH, writeback cache DISABLED, 6 x Micron 7400 PRO 1.92TB (declared 4400/2000MB/s) OSD total on 3 nodes (2 per node), 100G public network, 100G backend, KRBD disabled – after config tweaks
CEPH, writeback cache ENABLED, 6 x Micron 7400 PRO 1.92TB (declared 4400/2000MB/s) OSD total on 3 nodes (2 per node), 100G public network, 100G backend, KRBD disabled – after config tweaks
As you see, without the writeback cache the same drives are 3 times slower on Ceph than on ZFS. I don't know if this is normal, but I expected much faster speeds.
I tried a lot of tweaks, reinstalled 3 times, set MTU 9000, disabled auth and debug... but the Ceph disk speed always lands around the same result.
What do you think might be the problem?
I have a 3-node homelab cluster that is, and has been, great; however, I need more capacity. I have 6x 4 TB HDDs per node. Experience has shown that matching drive sizes (and specs) is very important to maintain performance, especially with a small cluster like mine.
I'm considering purchasing 3x 12 TB HDDs of a similar class to the drives I currently have. Adding them to the cluster would 'work', however I'd end up with an unbalanced number of PGs per OSD and limit performance significantly. This would be especially painful because HDDs + Ceph is pain ☺️. Unless the new drives are significantly more performant vs. the existing ones, they would get a disproportionate share of IO.
What I'm considering: create a RAID0 mdraid array out of 3 of my 4 TB drives, giving me a 12 TB volume, then use that for an OSD. From a Ceph perspective this device would be 'matched'. I'd actually have 2x 12 TB arrays per node (using all 6 drives). The 12 TB HDD would still be the bottleneck, but mostly because the array would be more performant rather than because of PG imbalance.
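A hedged sketch of that layering (device names are hypothetical; ceph-volume is the generic path, adjust to your deployment tooling):
# stripe three 4 TB disks into one 12 TB md device
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
# hand the md device to Ceph as a single OSD
ceph-volume lvm create --data /dev/md0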
The big downsides I see are undue complexity (layering storage) and the 3x increase in risk of failure of the RAIDed OSD. With 3x replication I can safely lose a node, so I'm less concerned about the latter (and I have backups).
At some future date I could purchase additional 12 TB drives and decommission the arrays as I do, resulting in having only 12 TB HDDs (and more bays available).
Some limitations: I can't reasonably do more than 3 nodes or purchase more than a couple of 12 TB drives at a time, for cost, networking and space reasons. So I can't scale out or mass-replace my existing drives. I do have time, so letting CRUSH do its thing as I orchestrate this operation isn't a problem.
Hello, I am interested in a Ceph cluster on bare metal. Probably, in the 1 PB+ range in a server cabinet. I haven't found a clear explanation of where and when Ceph cluster bottlenecks occur.
I suppose the bottlenecks are as follows:
Are there some simple guidelines to follow in finding the bottlenecks to clarify which components to select for the bare metal cluster build?
Edit: CEPH -> Ceph
We recently made a testing cluster for Ceph with 3 nodes. Do you all have any experiments you were curious about that I could look at?
Spec for 3 nodes
100g networking
2.85->3.1ghz cpu
512g ram
24tb ssd
900tb hdd
Have been trying to design a new ~20PB cluster for a few months and things are well underway. Planning to just use Ceph Radosgw (S3), then fill it to the brim straight after it's released. We are therefore wondering:
What design choices regarding ceph/cephadm/osds/mons/mgrs/pools/crush-profiles/ec-profiles etc. are either irreversible, or just a real pain in the a#$ to undo, after your cluster is loaded with data?
I imagine EC profiles, Crush rules and Pool settings should be optimised, but what else might a noob forget?
I've been looking and I'm thinking an OptiPlex 5040 SFF plus 10GbE and a SATA card might make a decent node for homelab use. i5-6500, 16GB, and x16/x4 PCIe v3.0, just right for the cards. Setup 3 or 5 of these for the cluster along with one of the cheaper 10GbE switches. Not sure how many drives each, maybe three to start?
There are several things I want to use it for, but I think the most demanding would be to use it as the underlying storage to boot and run a diskless Windows machine and have it feel roughly like it's running from an attached SATA III drive.
Is this arrangement likely to behave how I hope?
immich-web-68bb98dc66-k8988:/cephfs# getfattr -n ceph.file.layout.pool check-cephfs-file
# file: check-cephfs-file
ceph.file.layout.pool="ceph-filesystem-unspecified"
immich-web-68bb98dc66-k8988:/cephfs# getfattr -n ceph.file.layout.pool_id check-cephfs-file
# file: check-cephfs-file
ceph.file.layout.pool_id="20"
immich-web-68bb98dc66-k8988:/cephfs# getfattr -n ceph.file.layout.pool_name check-cephfs-file
# file: check-cephfs-file
ceph.file.layout.pool_name="ceph-filesystem-unspecified"
immich-web-68bb98dc66-k8988:/cephfs# setfattr -n ceph.file.layout.pool_name -v ceph-filesystem-unspecified check-cephfs-file
setfattr: check-cephfs-file: Permission denied
immich-web-68bb98dc66-k8988:/cephfs# du -hd1 check-ceph
0 check-ceph
immich-web-68bb98dc66-k8988:/cephfs# du -hd1 check-ceph
immich-web-68bb98dc66-k8988:/cephfs# ceph-filesystem-unspecified
immich-web-68bb98dc66-k8988:/cephfs# setfattr -n ceph.dir.layout.pool -v ceph-filesystem-unspecified check-ceph
setfattr: check-ceph: Permission denied
immich-web-68bb98dc66-k8988:/cephfs#
I keep getting 'Permission denied'. I already checked:
- The pool exists (I even set the pool name to the one they already reside on)
- The dir and file are empty
- Ceph cannot be mounted with user_xattr
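Setting the ceph.* layout xattrs requires the client's MDS caps to include the 'p' flag, so a hedged check/fix (the client name and filesystem name below are placeholders; with Rook the relevant identity is the CephFS client used by the CSI mount):
# inspect the caps of the client identity used for this mount
ceph auth get client.immich
# authorize with the 'p' flag so layouts/quotas can be modified
ceph fs authorize ceph-filesystem client.immich / rwp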
Installed ceph via proxmox. Everything is the latest version as of a couple weeks ago.
At first, I thought everything was 10Gb, but performance was slow and I discovered the 10Gb NICs were actually showing as 1Gb. With a couple of commands the 10Gb NICs were performing as expected, and I verified the speed between all three of my hosts on the storage network with iperf3, which showed close to 10Gb as expected.
However, cephfs seems to be performing about the same.
I originally detected an issue when certain pods in kubernetes clusters restarted at the same time as cephfs did a write at about 50MiB. Seems the bandwidth was all used up slowing vms, causing the pods to crash.
After increasing the nics to 10gb I thought the issue would resolve but it hasn't and cephfs performance seems to be about the same. I'm not doing any special testing of the speed, I was just expecting the pods not to restart in response to a cephfs write.
Reads and writes seem to max out at about 100 MiB between them (same as when it was all 1Gb before).
It seems like everything is the same as when the nics were running at 1gb. Is it possible ceph still thinks they are at 1gb for some reason? Is there something ceph related that needs to be restarted / updated?
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.1.21/24
fsid = 1c3f0c6f-696c-4cc0-b29c-e68f73ba9e4b
mon_allow_pool_delete = true
mon_host = 10.0.0.21 10.0.0.23 10.0.0.22
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.0.0.21/24
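Since OSDs bind their public/cluster addresses at daemon startup, a hedged way to confirm which addresses (and therefore which NICs) they are actually using (osd.0 is just an example ID):
# show the public (front) and cluster (back) addresses osd.0 is bound to
ceph osd metadata 0 | grep -E '"front_addr"|"back_addr"'
# show the effective network settings of the running daemon
ceph config show osd.0 public_network
ceph config show osd.0 cluster_network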
I googled how to run a speed test (does this look like 10Gb NIC with NVMe performance?):
root@pve-a:~# ceph osd pool create testbench 100 100
pool 'testbench' created
root@pve-a:~# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve-a_423981
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 35 19 75.996 76 0.894092 0.348493
2 16 51 35 69.9949 64 0.828046 0.621269
3 16 58 42 55.9958 28 1.33079 0.738915
4 16 70 54 53.9958 48 0.267747 0.900903
5 16 77 61 48.7961 28 0.0664892 0.956315
6 16 82 66 43.9964 20 0.265331 0.930378
7 16 87 71 40.5681 20 0.331954 1.07364
8 16 100 84 41.9965 52 3.22042 1.38887
9 16 105 89 39.5522 20 1.93728 1.44535
10 16 106 90 35.9969 4 1.93704 1.45081
11 14 106 92 33.4517 8 1.52379 1.46059
Total time run: 11.5055
Total writes made: 106
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 36.8518
Stddev Bandwidth: 23.2739
Max bandwidth (MB/sec): 76
Min bandwidth (MB/sec): 4
Average IOPS: 9
Stddev IOPS: 5.81847
Max IOPS: 19
Min IOPS: 1
Average Latency(s): 1.70953
Stddev Latency(s): 1.33644
Max latency(s): 4.55411
Min latency(s): 0.026747
Hello, I need help with my Ceph cluster setup. I have a 6-node cluster with a replica size of 6 and a minimum size of 3. The problem started when 3 nodes, including Node 0, were placed in the basement, which flooded and damaged all 3.
I replaced the damaged nodes with new hardware and expected that giving them the same names would allow Proxmox to automatically reintegrate them into the cluster. However, this didn’t work. The cluster’s quorum is broken, and my efforts to restore it haven't been successful.
While I managed to re-establish quorum and add a new node, I still can't restore the original quorum with the replaced nodes. I find this process with Proxmox and Ceph to be more complicated than expected. I understand the need for a halt when quorum is broken, but I assumed that replacing nodes with similar hardware and the same server name would allow for seamless reintegration.
Where am I going wrong, and what steps can I take to fix this issue?
Hi All,
About to build my first 3 node Proxmox cluster and will be looking to use Ceph on the storage front. Each node will have a Mellanox ConnectX-4 10Gigabit Ethernet Card direct connected to each other in a mesh. Each node will have an LSI 9200-8E controller in IT mode.
For storage, each node will have 2 x Intel 1.6TB DC S3510 Series SATA SSDs connected via motherboard SATA ports and 8 x 1TB 7200RPM 2.5 inch drives. I also have some Micron 512GB SSDs which I had thought I might be able to use as a R/W cache for the spinning disks, however I'm not sure if that is possible.
My requirements won't be extreme, so was wondering about setting the 1.6TB SSDs as a mirror for my performance VMs/Containers and then have the 8 x 1TB drives for lesser I/O intensive tasks like archiving, email, etc.
What would be my best approach for configuration of this storage? Are there other questions I should be asking myself first?
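On the SSD-cache idea: a generic read/write cache tier isn't really the usual route anymore; the common way to use those Micron SSDs with the spinners is as dedicated DB/WAL devices. A hedged sketch of the Proxmox-side command (device names are placeholders; the GUI exposes the same option):
# create an HDD OSD with its RocksDB/WAL placed on a faster SSD
pveceph osd create /dev/sdX -db_dev /dev/sdY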
How do I deal with this without a) rebooting the client b) restarting the MDS daemon?
HEALTH_WARN 1 clients failing to respond to cache pressure
[WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
mds.cxxxvolume.cxxx-m18-33.lwbjtt(mds.4): Client ip113.xxxx failing to respond to cache pressure client_id: 413354
I know if I reboot the host, this error message will go away, but I can't really reboot it.
There are 15 users currently on this machine connecting to it via some RDP software.
unmounting the ceph cluster and remounting didn't help
Restarting the MDS daemon has bitten me in the ass a lot. One of the biggest problems I have is that the MDS daemon will restart, then another MDS daemon picks up as primary; all good so far. But the MDS that took over goes into a weird runaway memory cache mode and crashes the daemon, OOMs the host, and OUTs all of the OSDs in that host. This is a nightmare, because once the MDS host goes offline another MDS host picks up, and rinse and repeat.
The hosts have 256 gigs of ram, 24 CPU threads, 21 OSDS, 10 gig nics for public and cluster network.
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
Cephfs kernel driver
What I've tried so far is to unmount and remount, clear cache "echo 3 >/proc/sys/vm/drop_caches", blocked the IP (from the client) of the MDS host, hoping to timeout and clear the cache (no joy).
How do I prevent future warning messages like this? I want to make sure that I'm not experiencing some sort of networking or HBA (IT mode, 12Gb/s SAS) issue.
Thoughts?
Hi,
I've done some research but unfortunately without success.
I'm asking you if it's possible to have a 4-node cluster that can continue to provide storage service even if only one node remains active.
I did a quick test with microceph, on four machines, but as soon as I turned off two of them, the cluster was no longer available.
Would it theoretically be possible to configure a system like this?
Thanks
Hey all. I'm trying to get ceph running on three ubuntu servers, and am following along with the guide here.
I start by installing cephadm
apt install cephadm -y
It installs successfully. I then bootstrap a monitor and manager daemon on the same host:
cephadm bootstrap --mon-ip [host IP]
I copy the /etc/ceph/ceph.pub key to the OSD host, and am able to add the OSD host (ceph-osd01) to the cluster:
ceph orch host add ceph-osd01 192.168.0.10
But I cannot seem to deploy an osd daemon to the host.
Running "ceph orch daemon add osd ceph-osd01:/dev/sdb" results in the following:
root@ceph-mon01:/home/thing# ceph orch daemon add osd ceph-osd01:/dev/sdb
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1862, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 184, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 499, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 120, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs) # noqa: E731
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 109, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 1374, in _daemon_add_osd
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 241, in raise_if_exception
raise e
RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/mon.ceph-osd01/config
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 5579, in <module>
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 5567, in main
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 409, in _infer_config
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 324, in _infer_fsid
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 437, in _infer_image
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 311, in _validate_fsid
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 3288, in command_ceph_volume
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 918, in get_container_mounts_for_type
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/cephadmlib/daemons/ceph.py", line 422, in get_ceph_mounts_for_type
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/cephadmlib/host_facts.py", line 760, in selinux_enabled
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/cephadmlib/host_facts.py", line 743, in kernel_security
File "/var/lib/ceph/e6c69d42-8d67-11ef-bbe0-005056aa68a2/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/cephadmlib/host_facts.py", line 722, in _fetch_apparmor
ValueError: too many values to unpack (expected 2)
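The traceback dies in cephadmlib/host_facts.py's _fetch_apparmor, so a hedged first check (assuming that code parses the kernel's AppArmor profile list and expects two space-separated fields per line) is to look for oddly formatted entries:
# list loaded AppArmor profiles; lines that don't look like "name (mode)" are suspects
cat /sys/kernel/security/apparmor/profiles
awk 'NF != 2' /sys/kernel/security/apparmor/profiles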
I am able to see host lists:
root@ceph-mon01:/home/thing# ceph orch host ls
HOST ADDR LABELS STATUS
ceph-mon01 192.168.0.1 _admin
ceph-osd01 192.168.0.10 mon,mgr,osd
ceph-osd02 192.168.0.11 mon,mgr,osd
3 hosts in cluster
but not device lists:
root@ceph-mon01:/# ceph orch device ls
root@ceph-mon01:/#
wtf is going on here? :(
Hello community,
After 4 years of using Ceph, this is the first serious problem with data consistency. After some deep-scrubbing, one PG had an inconsistent status. I tried repairing it and deep-scrubbing it many times, but it always failed. I noticed that the primary OSD for this PG (4.e4) is osd.21. Restarting this OSD does not help. I checked dmesg and noticed a lot of write errors. My next idea was to change its crush weight to 0. After that, at the end of recovery/backfilling, all 3 OSDs holding this placement group (12, 25, 21) restarted, and the process started again. Below I attach some logs that I hope describe the problem.
osd.21 :
-3> 2024-10-18T11:25:25.005+0200 7fbb2e797700 10 osd.21 pg_epoch: 304540 pg[4.e4( v 304540'35368884 (304019'35365884,304540'35368884] local-lis/les=304539/304540 n=13702 ec=12199/43 lis/c=304539/303049 les/c/f=304540/303064/140355 sis=304539) [0,25]/[21,25] backfill=[0] r=0 lpr=304539 pi=[303049,304539)/7 crt=304540'35368884 lcod 304540'35368883 mlcod 304540'35368883 active+undersized+degraded+remapped+backfilling rops=1 mbc={}] get_object_context: 0x55f7729bdb80 4:274f1d06:::rbd_data.04e53058991b67.00000000000006da:151 rwstate(read n=1 w=0) oi: 4:274f1d06:::rbd_data.04e53058991b67.00000000000006da:151(6030'16132894 osd.6.0:93038977 dirty|data_digest|omap_digest s 4194304 uv 14398660 dd 14217e41 od ffffffff alloc_hint [0 0 0]) exists: 1 ssc: 0x55f75f7011e0 snapset: 4796=[]:{4796=[4796,4789,477b,476f,475b,4741,4733,3b5d,3b51,3b3e,3375,336b,3bc,13c]}
-2> 2024-10-18T11:25:25.005+0200 7fbb2e797700 10 osd.21 pg_epoch: 304540 pg[4.e4( v 304540'35368884 (304019'35365884,304540'35368884] local-lis/les=304539/304540 n=13702 ec=12199/43 lis/c=304539/303049 les/c/f=304540/303064/140355 sis=304539) [0,25]/[21,25] backfill=[0] r=0 lpr=304539 pi=[303049,304539)/7 crt=304540'35368884 lcod 304540'35368883 mlcod 304540'35368883 active+undersized+degraded+remapped+backfilling rops=1 mbc={}] add_object_context_to_pg_stat 4:274f1d06:::rbd_data.04e53058991b67.00000000000006da:151
-1> 2024-10-18T11:25:25.021+0200 7fbb2e797700 -1 ./src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fbb2e797700 time 2024-10-18T11:25:25.008828+0200
./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55f74c7c4fe8]
2: /usr/bin/ceph-osd(+0xc25186) [0x55f74c7c5186]
3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x55f74cb08bc3]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x55f74c9b2d6e]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x55f74ca1d963]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x55f74ca2384a]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x55f74c8914f5]
8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55f74cb4ce79]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x55f74c8b2a80]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55f74cf99f3a]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f74cf9c510]
12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fbb61ed3ea7]
13: clone()
osd.12
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7fe0cab71140]
2: signal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x5615e5acd77a]
5: /usr/bin/ceph-osd(+0xc278be) [0x5615e5acd8be]
6: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x5615e5e19113]
7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x5615e5cbce2e]
8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f4) [0x5615e5d27af4]
9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x5615e5d2da3a]
10: (OSD::do_recovery(PG*, unsigned int, unsigned long, int, ThreadPool::TPHandle&)+0x2a5) [0x5615e5b9b445]
11: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0xcb) [0x5615e5e5d6db]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaa8) [0x5615e5bba138]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x5615e62aac1a]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5615e62ad1f0]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fe0cab65ea7]
16: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I googled this problem and found this (translated from German) and this
I am afraid because, as far as I know, using ceph-objectstore-tool may cause damage and data loss. Has anyone had the same problem and resolved it, or can anyone confirm that the information in one of the above articles is correct? Is there any way to avoid losing data? Maybe back up the PG from the 3 OSDs that hold 4.e4?
As per the title, while trying to recover from a degraded cluster I marked one OSD as lost because I lost its WAL and DB. Since then, no writes have been made to the cluster, just backfills and recovery. My question is: if I manage to recover the WAL/DB device, is there a chance to get that data back into the cluster?
Hello everyone
I found out about Ceph 2 months ago via Proxmox and everything was amazing, especially with the live migrate function.
So I decided to empty server by server, add each to the Ceph cluster, and keep going until all the disks were set up as OSDs instead of RAID-10 local storage with VMs.
Now that I'm done, here's the current result:
I use P4510 / P4610 & other enterprise disks only with PLP.
I read having a lot of ram and fast CPU was good. I put 1-2 TB ram per server and used the EPYC Milan CPU just to be sure. Should be 32 cores free at all times per server.
I didn't have enough servers to begin with to start with EC 4 + 2. As I read it requires a minimum of 6 servers, 7 really because you want to have one spare in case of failure. Sooo when migrating the VM from local storage to Ceph, I just put them on the standard 3x REP.
However now we're there. I have 7 servers, finally!
There are around 600 VM running in the cluster on the 3x replication. It's just small VPN servers so as you see they don't use that much storage on 3x, and not a lot of IOPS either. Should be perfect for EC?
Here are the performance stats:
Does everything look good you think? I tried to follow as much as possible of what was "recommended" such as trying to keep storage balanced between nodes, using enterprise only disks, have + 1-2 extra spares, LACP bond to two switches, 25G network for latency (I really don't need the 100G throughput unless there's a rebuild).
Anything I should think about when going from 3x REP to 4 + 2 EC for my VM?
Is 7 servers enough or do I need to add an 8th server before going to 4 + 2?
What is my next step?
I'm thinking about relying on the RBD writeback cache for any bursts if needed. All servers have A/B power and UPS.
I don't mind keeping the current VMs on 3x replication if they're hard to migrate, but at least being able to deploy new VMs on the EC setup would be great so I don't blow through all of this NVMe.
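A hedged sketch of the Ceph-side steps for the new EC data pool (profile name, pool names and sizes are placeholders; RBD still needs a replicated pool for metadata, with the EC pool used as the data pool, and Proxmox lets you set that data pool on the storage definition):
# define the 4+2 profile with host as the failure domain
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
# create the EC data pool and allow RBD to do partial overwrites on it
ceph osd pool create vm-ec-data erasure ec-4-2
ceph osd pool set vm-ec-data allow_ec_overwrites true
ceph osd pool application enable vm-ec-data rbd
# new images: metadata in the existing replicated pool, data in the EC pool
rbd create vms/vm-200-disk-0 --size 100G --data-pool vm-ec-data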
Thanks!
The built-in Grafana dashboards that Cephadm comes with are excellent.
I am wondering though, if I want to put these onto another Grafana instance. Where would be a good place to download them from? (Ideally for my specific Cephadm version too!)
I've located a bunch of copies on the host that the containers are installed on, but copying them out of there just feels like a messy way to do this:
root@storage-13-09002:~# find / -name "*.json" | xargs grep -l "grafana"
/var/lib/docker/overlay2/d43fe8e11f978ce76013c7354fa545e8fbd87f27f3a03463b2c57f10f6540d90/merged/etc/grafana/dashboards/ceph-dashboard/osds-overview.json
/var/lib/docker/overlay2/d43fe8e11f978ce76013c7354fa545e8fbd87f27f3a03463b2c57f10f6540d90/merged/etc/grafana/dashboards/ceph-dashboard/cephfs-overview.json
/var/lib/docker/overlay2/d43fe8e11f978ce76013c7354fa545e8fbd87f27f3a03463b2c57f10f6540d90/merged/etc/grafana/dashboards/ceph-dashboard/radosgw-detail.json
/var/lib/docker/overlay2/d43fe8e11f978ce76013c7354fa545e8fbd87f27f3a03463b2c57f10f6540d90/merged/etc/grafana/dashboards/ceph-dashboard/pool-detail.json
...
Answer: This seems like a good place to fetch them from: https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin/dashboards_out
Dear All,
Any recommended Supermicro chassis for a brand-new Ceph setup? I would like to use NVMe U.2 for cost efficiency and 2x 100G ports for all the bandwidth needs, with a single CPU and 128 GB RAM.
We have many issues with our Ceph cluster, but what I'm struggling with the most is finding the useful data in the logs. We're running a stock setup logging-wise, yet I'm finding numerous logs that Ceph marks as [DBG], which sure look like debug logs to me (billions and billions of them), being sent to the journal at priority 3 (ERROR) or 5 (NOTICE) level.
The logging pages at docs.ceph.com only talk about increasing the log level, and I've confirmed that debug logs are disabled for every daemon. Can anyone point me at better docs, or share how they have tamed Ceph logging so that debug logs are not reported at high levels?
ETA: Specifically concerned with logs submitted to journald. I really need to be able to tune these down to appropriate priorities.
Examples:
{ "PRIORITY":"3", "MESSAGE":"system:0\n", "_CMDLINE":"/usr/bin/conmon --api-version 1 [...]", ...}
Really. You're telling me system:0 at priority level WARNING? Not useful.
{ "PRIORITY":"4", "MESSAGE":"log_channel(cluster) log [DBG] : fsmap [...]" }
These fsmap messages come by the thousands, and they don't say anything of use. They are even marked as DEBUG messages. So why are they logged at NOTICE level?
My goal is to have the primary on a specific host (since read replicas are not an option for non-RBD), and the replicas on any host (including the host already chosen), just not on the primary OSD.
My current CRUSH rule is
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class ssd
device 5 osd.5 class nvme
device 6 osd.6 class ssd
device 7 osd.7 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host nanopc-cm3588-nas {
id -3
# do not change unnecessarily
id -4 class nvme
# do not change unnecessarily
id -5 class ssd
# do not change unnecessarily
id -26 class hdd
# do not change unnecessarily
# weight 3.06104
alg straw2
hash 0
# rjenkins1
item osd.0 weight 0.23288
item osd.2 weight 0.23288
item osd.5 weight 1.81940
item osd.7 weight 0.77588
}
host mbpcp {
id -7
# do not change unnecessarily
id -8 class nvme
# do not change unnecessarily
id -9 class ssd
# do not change unnecessarily
id -22 class hdd
# do not change unnecessarily
# weight 0.37560
alg straw2
hash 0
# rjenkins1
item osd.3 weight 0.37560
}
host mba {
id -10
# do not change unnecessarily
id -11 class nvme
# do not change unnecessarily
id -12 class ssd
# do not change unnecessarily
id -23 class hdd
# do not change unnecessarily
# weight 0.20340
alg straw2
hash 0
# rjenkins1
item osd.4 weight 0.20340
}
host mbpsp {
id -13
# do not change unnecessarily
id -14 class nvme
# do not change unnecessarily
id -15 class ssd
# do not change unnecessarily
id -24 class hdd
# do not change unnecessarily
# weight 0.37155
alg straw2
hash 0
# rjenkins1
item osd.1 weight 0.18578
item osd.6 weight 0.18578
}
root default {
id -1
# do not change unnecessarily
id -2 class nvme
# do not change unnecessarily
id -6 class ssd
# do not change unnecessarily
id -28 class hdd
# do not change unnecessarily
# weight 4.01160
alg straw2
hash 0
# rjenkins1
item nanopc-cm3588-nas weight 3.06104
item mbpcp weight 0.37560
item mba weight 0.20340
item mbpsp weight 0.37157
}
chassis chassis-nanopc {
id -16
# do not change unnecessarily
id -20 class nvme
# do not change unnecessarily
id -21 class ssd
# do not change unnecessarily
id -27 class hdd
# do not change unnecessarily
# weight 3.06104
alg straw2
hash 0
# rjenkins1
item nanopc-cm3588-nas weight 3.06104
}
chassis chassis-others {
id -17
# do not change unnecessarily
id -18 class nvme
# do not change unnecessarily
id -19 class ssd
# do not change unnecessarily
id -25 class hdd
# do not change unnecessarily
# weight 0.95056
alg straw2
hash 0
# rjenkins1
item mbpcp weight 0.37560
item mba weight 0.20340
item mbpsp weight 0.37157
}
# rules
rule replicated_rule {
id 0
type replicated
step take chassis-nanopc
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 0 type osd
step emit
}
However, it resulted in pg dump like this:
version 14099
stamp 2024-10-13T11:46:25.490783+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
6.3f 3385 0 0 3385 0 8216139409 0 0 1732 3000 1732 active+clean+remapped 2024-10-13T02:21:07.580486+0000 5024'13409 5027:39551 [5,5] 5 [5,4] 5 4373'10387 2024-10-12T09:46:54.412039+0000 1599'106 2024-10-09T15:41:52.360255+0000 0 2 periodic scrub scheduled @ 2024-10-13T17:41:52.579122+0000 2245 0
6.3e 3217 0 0 3217 0 7806374402 0 0 1819 1345 1819 active+clean+remapped 2024-10-13T03:36:53.629380+0000 5025'13549 5027:36882 [7,7] 7 [7,4] 7 4373'10667 2024-10-12T09:46:51.075549+0000 0'0 2024-10-08T07:13:08.545820+0000 0 2 periodic scrub scheduled @ 2024-10-13T13:27:11.454963+0000 2132 0
6.3d 3256 0 0 3256 0 7780755159 0 0 1733 3000 1733 active+clean+remapped 2024-10-13T02:21:46.947129+0000 5024'13609 5027:28986 [5,5] 5 [5,4] 5 4371'11218 2024-10-12T09:39:44.502516+0000 0'0 2024-10-08T07:13:08.545820+0000 0 2 periodic scrub scheduled @ 2024-10-13T14:12:17.856811+0000 2202 0
See the [5,5]. Thus my cluster remains in a remapped state. Is there any way I can achieve the goal stated above?
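For what it's worth, a hedged variant of the rule (not verified against this map) that keeps the primary on the nanopc chassis but draws the remaining replicas from the other chassis, so the same OSD can never be picked twice; the trade-off is that the extra replicas can no longer land on the nanopc host itself:
rule primary-nanopc {
id 1
type replicated
step take chassis-nanopc
step chooseleaf firstn 1 type host
step emit
step take chassis-others
step chooseleaf firstn -1 type osd
step emit
}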
When I try to open block devices in the GUI, the mgr service keeps restarting (18.2.4). I cannot access the crash logs after the container restarts.
13:46:38 hostname bash[27409]: debug -17> 2024-10-11T11:46:38.401+0000 7f4e2f823640 5 librbd::io::Dispatcher: 0x55ff0ea65000 register_dispatch: dispatch_layer=6
Oct 11 13:46:38 hostname bash[27409]: debug -16> 2024-10-11T11:46:38.401+0000 7f4e1a47a640 5 asok(0x55ff051a8000) register_command rbd cache flush Images/10f3af9e-1766-4dbf-9cdb-416436027b23 hook 0x55ff13050f00
Oct 11 13:46:38 hostname bash[27409]: debug -15> 2024-10-11T11:46:38.401+0000 7f4e1a47a640 5 asok(0x55ff051a8000) register_command rbd cache invalidate Images/10f3af9e-1766-4dbf-9cdb-416436027b23 hook 0x55ff13050f00
Oct 11 13:46:38 hostname bash[27409]: debug -14> 2024-10-11T11:46:38.401+0000 7f4e1a47a640 5 librbd::ImageCtx: 0x55ff11696000: disabling zero-copy writes
Oct 11 13:46:38 hostname bash[27409]: debug -12> 2024-10-11T11:46:38.401+0000 7f4e1a47a640 5 librbd::cache::WriteAroundObjectDispatch: 0x55ff1253a900 init:
Oct 11 13:46:38 hostname bash[27409]: debug -11> 2024-10-11T11:46:38.401+0000 7f4e1a47a640 5 librbd::io::Dispatcher: 0x55ff0ea65000 register_dispatch: dispatch_layer=1
Oct 11 13:46:38 hostname bash[27409]: debug -10> 2024-10-11T11:46:38.405+0000 7f4e1ac7b640 5 librbd::io::SimpleSchedulerObjectDispatch: 0x55ff1304c6c0 SimpleSchedulerObjectDispatch: ictx=0x55ff11696000
Oct 11 13:46:38 hostname bash[27409]: debug -9> 2024-10-11T11:46:38.405+0000 7f4e1ac7b640 5 librbd::io::SimpleSchedulerObjectDispatch: 0x55ff1304c6c0 init:
Oct 11 13:46:38 hostname bash[27409]: debug -8> 2024-10-11T11:46:38.405+0000 7f4e1ac7b640 5 librbd::io::Dispatcher: 0x55ff0ea65000 register_dispatch: dispatch_layer=5
Oct 11 13:46:38 hostname bash[27409]: debug -6> 2024-10-11T11:46:38.405+0000 7f4e1ac7b640 5 librbd::io::Dispatcher: 0x55ff13076090 shut_down_dispatch: dispatch_layer=3
Oct 11 13:46:38 hostname bash[27409]: debug -5> 2024-10-11T11:46:38.405+0000 7f4e1a47a640 5 librbd::io::WriteBlockImageDispatch: 0x55ff0e6540a0 unblock_writes: 0x55ff11696000, num=0
Oct 11 13:46:38 hostname bash[27409]: debug -3> 2024-10-11T11:46:38.409+0000 7f4e1a47a640 5 librbd::io::WriteBlockImageDispatch: 0x55ff0e6540a0 unblock_writes: 0x55ff11696000, num=0
Oct 11 13:46:38 hostname bash[27409]: debug -2> 2024-10-11T11:46:38.409+0000 7f4e2f823640 5 librbd::DiffIterate: fast diff enabled
Oct 11 13:46:38 hostname bash[27409]: debug -1> 2024-10-11T11:46:38.409+0000 7f4e2f823640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/librbd/api/DiffIterate.cc: In function 'int librbd::api::DiffIterate<ImageCtxT>::execute() [with ImageCtxT = librbd::ImageCtx]' thread 7f4e2f823640 time 2024-10-11T11:46:38.414077+0000
Oct 11 13:46:38 hostname bash[27409]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/librbd/api/DiffIterate.cc: 341: FAILED ceph_assert(object_diff_state.size() == end_object_no - start_object_no)
Oct 11 13:46:38 hostname bash[27409]: ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
Oct 11 13:46:38 hostname bash[27409]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f51910e504d]
Oct 11 13:46:38 hostname bash[27409]: 4: /lib64/librbd.so.1(+0x51ada7) [0x7f5181bf1da7]
Oct 11 13:46:38 hostname bash[27409]: 6: /lib64/python3.9/site-packages/rbd.cpython-39-x86_64-linux-gnu.so(+0x630bc) [0x7f5181e7c0bc]
Oct 11 13:46:38 hostname bash[27409]: 8: PyVectorcall_Call()
Oct 11 13:46:38 hostname bash[27409]: 9: /lib64/python3.9/site-packages/rbd.cpython-39-x86_64-linux-gnu.so(+0x44d50) [0x7f5181e5dd50]
Oct 11 13:46:38 hostname bash[27409]: 10: _PyObject_MakeTpCall()
Oct 11 13:46:38 hostname bash[27409]: 11: /lib64/libpython3.9.so.1.0(+0x125133) [0x7f5191c0a133]
Oct 11 13:46:38 hostname bash[27409]: 12: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 14: _PyFunction_Vectorcall()
Oct 11 13:46:38 hostname bash[27409]: 17: /lib64/libpython3.9.so.1.0(+0x11cb73) [0x7f5191c01b73]
Oct 11 13:46:38 hostname bash[27409]: 18: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 19: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 20: /lib64/libpython3.9.so.1.0(+0x11cb73) [0x7f5191c01b73]
Oct 11 13:46:38 hostname bash[27409]: 21: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 22: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 23: /lib64/libpython3.9.so.1.0(+0x10ec35) [0x7f5191bf3c35]
Oct 11 13:46:38 hostname bash[27409]: 25: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 26: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 27: /lib64/libpython3.9.so.1.0(+0x10ec35) [0x7f5191bf3c35]
Oct 11 13:46:38 hostname bash[27409]: 29: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 30: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 31: /lib64/libpython3.9.so.1.0(+0x10ec35) [0x7f5191bf3c35]
Oct 11 13:46:38 hostname bash[27409]: debug 0> 2024-10-11T11:46:38.413+0000 7f4e2f823640 -1 *** Caught signal (Aborted) **
Oct 11 13:46:38 hostname bash[27409]: 1: /lib64/libc.so.6(+0x3e6f0) [0x7f5190a8e6f0]
Oct 11 13:46:38 hostname bash[27409]: 2: /lib64/libc.so.6(+0x8b94c) [0x7f5190adb94c]
Oct 11 13:46:38 hostname bash[27409]: 3: raise()
Oct 11 13:46:38 hostname bash[27409]: 4: abort()
Oct 11 13:46:38 hostname bash[27409]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f51910e50a7]
Oct 11 13:46:38 hostname bash[27409]: 6: /usr/lib64/ceph/libceph-common.so.2(+0x16b20b) [0x7f51910e520b]
Oct 11 13:46:38 hostname bash[27409]: 7: /lib64/librbd.so.1(+0x193403) [0x7f518186a403]
Oct 11 13:46:38 hostname bash[27409]: 9: rbd_diff_iterate2()
Oct 11 13:46:38 hostname bash[27409]: 11: /lib64/libpython3.9.so.1.0(+0x11d7a1) [0x7f5191c027a1]
Oct 11 13:46:38 hostname bash[27409]: 13: /lib64/python3.9/site-packages/rbd.cpython-39-x86_64-linux-gnu.so(+0x44d50) [0x7f5181e5dd50]
Oct 11 13:46:38 hostname bash[27409]: 14: _PyObject_MakeTpCall()
Oct 11 13:46:38 hostname bash[27409]: 16: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 17: /lib64/libpython3.9.so.1.0(+0x10ec35) [0x7f5191bf3c35]
Oct 11 13:46:38 hostname bash[27409]: 18: _PyFunction_Vectorcall()
Oct 11 13:46:38 hostname bash[27409]: 19: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 20: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 21: /lib64/libpython3.9.so.1.0(+0x11cb73) [0x7f5191c01b73]
Oct 11 13:46:38 hostname bash[27409]: 22: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 24: /lib64/libpython3.9.so.1.0(+0x11cb73) [0x7f5191c01b73]
Oct 11 13:46:38 hostname bash[27409]: 26: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 27: /lib64/libpython3.9.so.1.0(+0x10ec35) [0x7f5191bf3c35]
Oct 11 13:46:38 hostname bash[27409]: 29: /lib64/libpython3.9.so.1.0(+0x125031) [0x7f5191c0a031]
Oct 11 13:46:38 hostname bash[27409]: 30: _PyEval_EvalFrameDefault()
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 none
Oct 11 13:46:38 hostname bash[27409]: 0/ 1 context
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 mds_balancer
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 mds_log
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 mds_log_expire
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 mds_migrator
Oct 11 13:46:38 hostname bash[27409]: 0/ 1 buffer
Oct 11 13:46:38 hostname bash[27409]: 0/ 1 timer
Oct 11 13:46:38 hostname bash[27409]: 0/ 1 objecter
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 rados
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 rbd_mirror
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 rbd_replay
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 rbd_pwl
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 journaler
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 immutable_obj_cache
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 osd
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 objclass
Oct 11 13:46:38 hostname bash[27409]: 0/ 0 ms
Oct 11 13:46:38 hostname bash[27409]: 0/10 monc
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 paxos
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 tp
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 crypto
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 heartbeatmap
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 rgw_sync
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 rgw_datacache
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 rgw_flight
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 asok
Oct 11 13:46:38 hostname bash[27409]: 1/ 1 throttle
Oct 11 13:46:38 hostname bash[27409]: 0/ 0 refs
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 compressor
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 bluestore
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 kstore
Oct 11 13:46:38 hostname bash[27409]: 4/ 5 rocksdb
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 fuse
Oct 11 13:46:38 hostname bash[27409]: 2/ 5 mgr
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 test
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_onode
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_odata
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_t
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_cleaner
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_epm
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_lba
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_fixedkv_tree
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_cache
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 seastore_device
Oct 11 13:46:38 hostname bash[27409]: 0/ 5 cyanstore
Oct 11 13:46:38 hostname bash[27409]: 1/ 5 ceph_exporter
Oct 11 13:46:38 hostname bash[27409]: -2/-2 (syslog threshold)
Oct 11 13:46:38 hostname bash[27409]: 99/99 (stderr threshold)
Oct 11 13:46:38 hostname bash[27409]: 7f4e1a47a640 / io_context_pool
Oct 11 13:46:38 hostname bash[27409]: 7f4e1b47c640 / safe_timer
Oct 11 13:46:38 hostname bash[27409]: 7f4e1e000640 / ms_dispatch
Oct 11 13:46:38 hostname bash[27409]: 7f4e1f803640 / io_context_pool
Oct 11 13:46:38 hostname bash[27409]: 7f4e2b01a640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e2c01c640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e2c81d640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e2e020640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e2e821640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e2f022640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e2f823640 / dashboard
Oct 11 13:46:38 hostname bash[27409]: 7f4e31827640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e32829640 / prometheus
Oct 11 13:46:38 hostname bash[27409]: 7f4e3402c640 / prometheus
Oct 11 13:46:38 hostname bash[27409]: 7f4e36030640 / prometheus
Oct 11 13:46:38 hostname bash[27409]: 7f4e36831640 / prometheus
Oct 11 13:46:38 hostname bash[27409]: 7f4e38034640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e3a9b8640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e3e1bf640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e42b08640 / safe_timer
Oct 11 13:46:38 hostname bash[27409]: 7f4e43b0a640 / ms_dispatch
Oct 11 13:46:38 hostname bash[27409]: 7f4e453cd640 / io_context_pool
Oct 11 13:46:38 hostname bash[27409]: 7f4e45c0e640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e47c12640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e4a417640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e4ac18640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e4c41b640 / safe_timer
Oct 11 13:46:38 hostname bash[27409]: 7f4e4d41d640 / ms_dispatch
Oct 11 13:46:38 hostname bash[27409]: 7f4e4ec20640 / io_context_pool
Oct 11 13:46:38 hostname bash[27409]: 7f4e51465640 / prometheus
Oct 11 13:46:38 hostname bash[27409]: 7f4e5552d640 / pg_autoscaler
Oct 11 13:46:38 hostname bash[27409]: 7f4e5652f640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e58d34640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e5a537640 /
Oct 11 13:46:38 hostname bash[27409]: 7f4e5bd7a640 / devicehealth
Oct 11 13:46:38 hostname bash[27409]: 7f4e5fd82640 / crash
Oct 11 13:46:38 hostname bash[27409]: 7f4e60d84640 / cephadm
Oct 11 13:46:38 hostname bash[27409]: 7f4e62587640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e64e0c640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e65e0e640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e66e10640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e67611640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e68613640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e68e14640 / mgr-fin
Oct 11 13:46:38 hostname bash[27409]: 7f4e69e96640 / balancer
Oct 11 13:46:38 hostname bash[27409]: 7f4e6dede640 / cmdfin
Oct 11 13:46:38 hostname bash[27409]: 7f4e6f6e1640 / ms_dispatch
Oct 11 13:46:38 hostname bash[27409]: 7f51864f4640 / safe_timer
Oct 11 13:46:38 hostname bash[27409]: 7f518acfd640 / ms_dispatch
Oct 11 13:46:38 hostname bash[27409]: 7f518e504640 / msgr-worker-1
Oct 11 13:46:38 hostname bash[27409]: 7f518ed05640 / msgr-worker-0
Oct 11 13:46:38 hostname bash[27409]: max_recent 10000
Oct 11 13:46:38 hostname bash[27409]: max_new 1000
Oct 11 13:46:38 hostname bash[27409]: log_file /var/lib/ceph/crash/2024-10-11T11:46:38.415833Z_b3978f24-6697-44f5-80dc-4915b5ec144d/log
Oct 11 13:46:38 hostname bash[27409]: --- end dump of recent events ---
Very rudimentary question: is an erasure coding scheme with 2 data chunks (k) and 4 coding chunks (m) able to withstand the loss of any 4 chunks, irrespective of whether both data chunks are lost, i.e. with just two coding chunks remaining? If yes, what kind of data is stored in the coding chunks such that the original data chunks can be reconstructed?
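Yes: with a systematic MDS code such as Reed-Solomon (what the default jerasure technique implements), any k of the k+m chunks suffice. A hedged worked sketch for k=2, m=4: the data chunks are $d_1, d_2$ and each coding chunk is an independent linear combination over a finite field,
$$c_j = a_j d_1 + b_j d_2, \qquad j = 1, \dots, 4,$$
with the coefficient pairs $(a_j, b_j)$ chosen (Vandermonde-style) so that any two rows of the resulting $6 \times 2$ generator matrix are linearly independent. If only $c_1$ and $c_2$ survive, the decoder solves the $2 \times 2$ system
$$\begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \end{pmatrix} \begin{pmatrix} d_1 \\ d_2 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}$$
for $d_1, d_2$. So the coding chunks hold nothing "special", just these field-arithmetic combinations of the data chunks, and that is enough to rebuild them.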