OLVM Gluster Data Domain Healing (Addressing Split-Brain)
Modern applications started generating huge volumes of data, be it from mobile devices or from the web. As more and more such applications are built, they need to deliver content to the user quickly, regardless of whether the user is on a mobile, tablet, laptop, desktop, or any other device. Along with this, handling a larger volume of files became a challenge: a lot of metadata related to each file needs to be stored and accessed on demand. Data storage, which once looked very easy, has become a big challenge.
Storage technologies have changed rapidly over the last three decades, and the current trend is towards software-driven data center technologies. Today we have software-driven clustered file systems such as Gluster, which give you more elasticity and scalability.
In a clustered environment there is a possibility that you will face a split-brain scenario. In simple terms, split-brain occurs when two nodes of a cluster are disconnected and each node thinks the other one is not working. Let's understand what split-brain is.
What is Split-Brain?
According to the official documentation on Managing Split-Brain provided by Red Hat, split-brain is a state of data or availability inconsistencies originating from the maintenance of two separate data sets with an overlap in scope, either because of servers in a network design or because of a failure condition where servers stop communicating and synchronizing their data with each other. The term applies to replicated configurations.
Note the phrase "servers not communicating and synchronizing their data to each other": this can happen for any reason, and it does not necessarily mean your nodes lost their network connection. A peer may still be in the cluster and connected. Typical signs of split-brain include the following (see the example after the list):
- File data/metadata differ across the bricks of a replica.
- Gluster cannot identify which brick holds the good copy, even when all bricks are available.
- Each brick accuses the other of needing healing.
- All modification FOPs fail with an input/output error (EIO).
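For example, even reading such a file through the client mount fails with EIO (an illustrative sketch; the FUSE mount point /mnt/gvol0 and the file path are hypothetical):
[root@KVM01 ~]# cat /mnt/gvol0/images/affected-file
cat: /mnt/gvol0/images/affected-file: Input/output error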
Split-Brain Types
There are three different types of split-brain; in our case it appeared to be an entry split-brain. To explain the three types (the example after this list shows how to inspect the changelog xattrs that reveal them):
- Data split-brain: the contents of the file under split-brain differ across replica pairs, and automatic healing is not possible.
- Metadata split-brain: the metadata of the file (for example, user-defined extended attributes) differs across replica pairs, and automatic healing is not possible.
- Entry split-brain: the file has a different GFID on each replica pair.
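To see which type you are dealing with, you can dump the AFR changelog extended attributes of the file on each brick (a sketch; the brick path follows this environment's layout, the file path is illustrative, and the exact trusted.afr.* names depend on your volume):
# Run on each Gluster node against the brick path, not the client mount
getfattr -d -m . -e hex /nodirectwritedata/glusterfs/brick1/gvol0/path/to/file
When the trusted.afr.gvol0-client-* counters are non-zero on both bricks, each blaming the other, the file is in split-brain; the data, metadata, and entry segments of the counter indicate which type.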
What is GFID?
A GFID is a unique 128-bit identifier assigned to each file across the entire cluster. This is analogous to the inode number in a normal filesystem. The GFID of a file is stored in its xattr named trusted.gfid. To find the path from a GFID, I highly recommend you read this official article provided by GlusterFS.
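You can read the GFID of a file directly with getfattr (note that trusted.gfid is visible only on the brick path, not through the client mount; the file path here is illustrative):
# Run on the Gluster node, against the file's location inside the brick
getfattr -n trusted.gfid -e hex /nodirectwritedata/glusterfs/brick1/gvol0/path/to/file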
How did the GlusterFS data domain enter a split-brain condition?
The split-brain occurred due to unexpected network latency on the KVM management network. At the time, we were reading data from an NFS share and writing it to the GFS storage. RMAN restores are storage intensive, and GFS is configured as a replicated volume, transferring data from one brick to another. Due to the network latency, replication suddenly stopped and the GFS file system went into a split-brain condition.
How to identify the files in a split-brain condition
To check whether the GFS files are in split-brain, execute gluster volume heal gvol0 info split-brain. As shown in the output below, the command displays the files that are in a split-brain condition. During this period, both storage domains went offline, because the storage master node was affected by this network latency.
[root@KVM01 dom_md]# gluster volume heal gvol0 info split-brain
Brick KVM01:/nodirectwritedata/glusterfs/brick1/gvol0
/22a3d534-86b2-4f63-aa44-9ac555404692/images/6639bd7e-33b1-42a5-89b0-0eee2b3a7262/e9981ace-c0c0-4bc6-9e6d-e03a805f083a
/22a3d534-86b2-4f63-aa44-9ac555404692/dom_md/ids
/22a3d534-86b2-4f63-aa44-9ac555404692/images/bdea3934-edce-494b-91cf-06fb536a9f9c/c04307ed-e8dc-459d-8bd5-02446b2b9175
/22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
/22a3d534-86b2-4f63-aa44-9ac555404692/images/d28ab741-0d69-4ad6-97e3-4449b42b782f/10bc2496-ff19-4087-bc55-aab201b39936
/22a3d534-86b2-4f63-aa44-9ac555404692/images/7c25b5da-aabd-49ad-bf4a-f458f382e525/a44a71d7-98d4-47cd-aeae-f8fe5ac4bf1e
/22a3d534-86b2-4f63-aa44-9ac555404692/images/d7882784-cf18-4c8c-af22-f46fe3a96c8e/4fa5c17b-2739-46a7-8c20-3e943cc764b5
/22a3d534-86b2-4f63-aa44-9ac555404692/images/05caeb56-9287-484b-aef0-8f389d27f1bf/d370a0d8-889d-488d-bcaa-4ac652f7c5fe
/22a3d534-86b2-4f63-aa44-9ac555404692/dom_md/leases
/22a3d534-86b2-4f63-aa44-9ac555404692/dom_md/outbox
Status: Connected
Number of entries in split-brain: 10
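Besides the split-brain list, it is worth checking the overall heal backlog and brick health with the standard Gluster commands:
# All entries pending heal, not only split-brain ones
gluster volume heal gvol0 info
# Brick and self-heal daemon status for the volume
gluster volume status gvol0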
Recovery Process
There are a few ways to perform GlusterFS split-brain recovery. All the recovery scenarios are covered in the GlusterFS documentation: https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/.
This was a rare situation in which we needed to pick the most recently modified copy as the recovery source. The modification time of each copy can be validated with the stat command.
As per the output below, brick 2 has the latest modification timestamp, so this brick can be used as the source to recover the files and get out of the split-brain condition.
[root@KVM01 dom_md]# stat /nodirectwritedata/glusterfs/brick1/gvol0/22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
File: /nodirectwritedata/glusterfs/brick1/gvol0/22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
Size: 268435456000 Blocks: 524301176 IO Block: 4096 regular file
Device: fc07h/64519d Inode: 3224096484 Links: 2
Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ kvm)
Context: system_u:object_r:glusterd_brick_t:s0
Access: 2022-10-13 10:05:08.396792385 -0400
Modify: 2022-10-13 10:05:15.705792522 -0400
Change: 2022-10-14 09:59:16.348788467 -0400
Birth: 2022-09-22 11:18:48.043805922 -0400
[root@KVM02 log]# stat /nodirectwritedata/glusterfs/brick2/gvol0/22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
File: /nodirectwritedata/glusterfs/brick2/gvol0/22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
Size: 268435456000 Blocks: 524300992 IO Block: 4096 regular file
Device: fc07h/64519d Inode: 4109 Links: 2
Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ kvm)
Context: system_u:object_r:glusterd_brick_t:s0
Access: 2022-10-05 15:15:46.368629123 -0400
Modify: 2022-10-13 10:08:08.513029740 -0400
Change: 2022-10-14 10:01:50.679076234 -0400
Birth: 2022-09-22 11:21:29.195597907 -0400
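As a side note, the GlusterFS CLI also provides a built-in policy that automates this stat comparison by choosing the copy with the latest modification time (described on the resolving-splitbrain page linked above; verify it is available in your Gluster version before relying on it):
gluster volume heal gvol0 split-brain latest-mtime /22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3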
Healing
While performing the healing, you have to make sure your session stays alive without any interruption. You can use tmux to create a persistent session; I hope this cheat sheet will be useful for understanding tmux: https://tmuxcheatsheet.com/.
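A minimal tmux workflow for this looks as follows (standard tmux commands; the session name is arbitrary):
tmux new -s gluster-heal     # start a named session and run the heal command inside it
# press Ctrl-b, then d, to detach; the heal keeps running in the background
tmux attach -t gluster-heal  # reattach later to check progress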
Healing time will vary with the file size; a 500 GB VM disk took about 4 hours to complete healing.
gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>
Sample output of the heal
[root@KVM02 ~]# gluster volume heal gvol0 split-brain source-brick KVM02.local.com:/nodirectwritedata/glusterfs/brick2/gvol0 /22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3
Healed /22a3d534-86b2-4f63-aa44-9ac555404692/images/a9b5b747-2fae-4b32-b839-2ea03dfcf35e/cb1b3014-1fbd-44d6-854c-fe55dc22f4a3.
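After healing each file, you can re-run the earlier check to confirm the entry has dropped off the list; once everything is healed, it should report 0 entries:
gluster volume heal gvol0 info split-brain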
Conclusion
There can be situations where a GlusterFS replicated volume moves into an inconsistent state due to management network traffic. This can be avoided by having three bricks in the GlusterFS volume, or by enabling fencing via the OLVM engine. In the next blog I will elaborate on how you can increase the network threshold to 100%, which gives you breathing space to avoid split-brain conditions.
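For reference, a three-brick volume can be created as a full replica 3, or with a lighter arbiter brick that stores only metadata; a minimal sketch, assuming a hypothetical third node KVM03 and brick paths following this environment's layout:
# Three-way replica; use "replica 3 arbiter 1" instead to make the third brick metadata-only
gluster volume create gvol0 replica 3 \
  KVM01:/nodirectwritedata/glusterfs/brick1/gvol0 \
  KVM02:/nodirectwritedata/glusterfs/brick2/gvol0 \
  KVM03:/nodirectwritedata/glusterfs/brick3/gvol0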