* [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-20 10:47 UTC
To: gentoo-user
Hello, Gentoo.
After having got the syslinux boot manager working well, I lost the root
partition on my newer machine. I spent the entire evening yesterday
trying to get it back again, with various expedients for recovering ext4
partitions from backup superblocks, and so on.
It wasn't until the middle of the night that it dawned on me what had
happened, and I immediately got up and had it fixed within twenty
minutes.
The cause was me booting up the machine with a rescue disk. This
assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
also wrote those wrong identifiers, 126 and 127, into the "preferred
minor" field of the partitions' super blocks. In essence, they got
swapped.
Hence, when I tried to boot into my normal system, /dev/md126, which
should have been the root partition, was an unformatted empty space on
the SSD.
I don't blame the rescue disk for this occurrence. For some reason,
when the kernel assembles /dev/md devices, it only seems to pay
attention to the "preferred minor" fields when they are wrong. :-(
mdadm appears to write the "preferred minor" fields at random when
assembling the RAID arrays. I don't think it should, unless explicitly
asked. There is an argument to mdadm which specifies the writing of
these fields. In fact I used this to effect a repair, ironically
enough, from the rescue disk booted with the option to suppress the
automatic assembly of the arrays.
Just for the record, all my RAID arrays have metadata version 0.90, the
(old fashioned) one that allows auto-assembly by the kernel without the
need of an initramfs.
The moral of the story: if your system uses software RAID, be careful
indeed before you boot up with a rescue disk.
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: karl @ 2024-12-20 14:50 UTC
To: gentoo-user
Alan Mackenzie:
...
> The cause was me booting up the machine with a rescue disk. This
> assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
> also wrote those wrong identifiers, 126 and 127, into the "preferred
> minor" field of the partitions' super blocks. In essence, they got
> swapped.
...
> Just for the record, all my RAID arrays have metadata version 0.90, the
> (old fashioned) one that allows auto-assembly by the kernel without the
> need of an initramfs.
>
> The moral of the story: if your system uses software RAID, be careful
> indeed before you boot up with a rescue disk.
So, why don't you simply add "root=902 md=2,/dev/sda2,/dev/sdb2" or similar to
your boot loader kernel command line?
///
And... what is the need for dynamic minors now that dev_t is 32 bits:
$ grep dev_t /Net/git/linux-stable/include/linux/types.h
typedef u32 __kernel_dev_t;
typedef __kernel_dev_t dev_t;
$
and we have 20-bit minors:
$ grep -A1 MINORBITS /Net/git/linux-stable/include/linux/kdev_t.h
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))
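For anyone wanting to check the arithmetic, here is a minimal user-space sketch (not kernel code) of how a hexadecimal "root=902" would come out as major 9, minor 2, and how that pair packs into a 32-bit dev_t with 20-bit minors. The macros are copied from kdev_t.h as shown above. decode_root_hex is a name made up for this illustration, and it assumes that a bare hex root= value is unpacked with the older 8-bit-minor layout, which is what I understand new_decode_dev() in the same header to do.

```c
#include <stdint.h>
#include <stdlib.h>

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))

/* Hypothetical helper: decode a hex "root=" value such as "902".
 * Assumed layout of the raw number: bits 8..19 hold the major, bits
 * 0..7 the low part of the minor, bits 20..31 the high part. */
uint32_t decode_root_hex(const char *s)
{
    uint32_t raw = (uint32_t) strtoul(s, NULL, 16);
    unsigned int major = (raw & 0xfff00) >> 8;
    unsigned int minor = (raw & 0xff) | ((raw >> 12) & 0xfff00);

    /* Repack into the in-kernel form with 20-bit minors. */
    return MKDEV(major, minor);
}
```

Under those assumptions, MAJOR(decode_root_hex("902")) is 9 (the md major) and MINOR(...) is 2, i.e. /dev/md2.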
Regards,
/Karl Hammar
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-20 15:28 UTC
To: gentoo-user
Hello, Karl.
On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
> Alan Mackenzie:
> ...
> > The cause was me booting up the machine with a rescue disk. This
> > assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
> > also wrote those wrong identifiers, 126 and 127, into the "preferred
> > minor" field of the partitions' super blocks. In essence, they got
> > swapped.
> ...
> > Just for the record, all my RAID arrays have metadata version 0.90, the
> > (old fashioned) one that allows auto-assembly by the kernel without the
> > need of an initramfs.
> > The moral of the story: if your system uses software RAID, be careful
> > indeed before you boot up with a rescue disk.
> So, why don't you simply add "root=902 md=2,/dev/sda2,/dev/sdb2" or similar to
> your boot loader kernel command line?
Because I didn't know about it. I found out about it this morning, and
immediately tested it by setting up an
"md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using
the rescue disk to make the "preferred minor"s wrong, and testing it.
It worked!
If I understand things correctly, with this mechanism one can have the
kernel assemble the RAID arrays at boot up time with a modern metadata,
but still without needing the initramfs. My arrays are still at
metadata 0.90.
> ///
> And... what is the need for dynamic minors now when dev_t is 32bits:
Dynamic minors? I don't think I follow you, here.
> $ grep dev_t /Net/git/linux-stable/include/linux/types.h
> typedef u32 __kernel_dev_t;
> typedef __kernel_dev_t dev_t;
> $
> and we have 20 bits minors:
> $ grep -A1 MINORBITS /Net/git/linux-stable/include/linux/kdev_t.h
> #define MINORBITS 20
> #define MINORMASK ((1U << MINORBITS) - 1)
> #define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
> #define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
> #define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))
> Regards,
> /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: karl @ 2024-12-20 17:44 UTC
To: gentoo-user
Alan Mackenzie:
> On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
...
> Because I didn't know about it. I found out about it this morning, and
> immediately tested it by setting up an
> "md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using
> the rescue disk to make the "preferred minor"s wrong, and testing it.
> It worked!
>
> If I understand things correctly, with this mechanism one can have the
> kernel assemble the RAID arrays at boot up time with a modern metadata,
> but still without needing the initramfs. My arrays are still at
> metadata 0.90.
Please tell if you make booting with metadata 1.2 work.
I haven't tested that.
///
...
> > And... what is the need for dynamic minors now when dev_t is 32bits:
> Dynamic minors? I don't think I follow you, here.
If you partition the md device, the partitions will get a device with a
dynamic minor.
# mdadm -C /dev/md11 -n 1 -l 1 --force /dev/sdc2
# mdadm -C /dev/md10 -n 1 -l 1 -e 0 --force /dev/sdc1
... create partitions
# fdisk -l /dev/md10
...
Device      Boot Start    End Sectors  Size Id Type
/dev/md10p1       2048  22527   20480   10M 83 Linux
/dev/md10p2      22528 192383  169856 82.9M 83 Linux
# fdisk -l /dev/md11
...
Device      Boot  Start     End Sectors  Size Id Type
/dev/md11p1        2048  206847  204800  100M 83 Linux
/dev/md11p2      206848 1757183 1550336  757M 83 Linux
# cat /sys/block/md10/md10p1/dev
259:0
# cat /sys/block/md10/md10p2/dev
259:1
# cat /sys/block/md11/md11p1/dev
259:2
# cat /sys/block/md11/md11p2/dev
259:3
$ grep -A2 '259 block' /Net/git/linux-stable/Documentation/admin-guide/devices.txt
259 block Block Extended Major
Used dynamically to hold additional partition minor
numbers and allow large numbers of partitions per device
So booting from an md device partition (as /) might be hit and miss
unless you use some initramfs magic.
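As a rough illustration of how one might check for this programmatically, the sketch below takes the text read from a sysfs "dev" file (such as "259:3") and reports whether the device sits on major 259, the Block Extended Major quoted from devices.txt. Both function names are invented for this example, and reading the file itself is left out.

```c
#include <stdio.h>

/* Hypothetical helper: split a sysfs "dev" string such as "259:3\n"
 * into major and minor. Returns 0 on success, -1 on malformed input. */
int parse_sysfs_dev(const char *text, unsigned int *major, unsigned int *minor)
{
    if (sscanf(text, "%u:%u", major, minor) != 2)
        return -1;
    return 0;
}

/* Returns 1 if the device uses major 259 (minor assigned dynamically),
 * 0 if it uses some other major, -1 if the string cannot be parsed. */
int has_dynamic_minor(const char *text)
{
    unsigned int major, minor;

    if (parse_sysfs_dev(text, &major, &minor) != 0)
        return -1;
    return major == 259;
}
```

A whole md device such as 9:126 would give 0 here, while the md10p1 and md11p2 partitions shown above would give 1.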
Regards,
/Karl Hammar
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-20 20:19 UTC
To: gentoo-user
Hello, Karl.
On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
> Alan Mackenzie:
> > On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
> ...
> > Because I didn't know about it. I found out about it this morning, and
> > immediately tested it by setting up an
> > "md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using
> > the rescue disk to make the "preferred minor"s wrong, and testing it.
> > It worked!
> > If I understand things correctly, with this mechanism one can have the
> > kernel assemble the RAID arrays at boot up time with a modern metadata,
> > but still without needing the initramfs. My arrays are still at
> > metadata 0.90.
> Please tell if you make booting with metadata 1.2 work.
> I haven't tested that.
I've just tried it, with metadata 1.2, and it doesn't work. I got error
messages at boot up to the effect that the component partitions were
lacking valid version 0.0 super blocks.
People without initramfs appear not to be in the sights of the
maintainers of this software. They could so easily have made the
assembly of metadata 1.2 components on the kernel command line work.
:-(
By the way, do you know an easy way for copying an entire filesystem,
such as the root system, but without copying other systems mounted in
it? I tried for some while with rsync and various combinations of
find's and xargs's, and in the end booted up into the rescue disc to do
it. I shouldn't have to do that.
> ///
> ...
> > > And... what is the need for dynamic minors now when dev_t is 32bits:
> > Dynamic minors? I don't think I follow you, here.
> If you partition the md device, the partitions will get a device with a
> dynamic minor.
> # mdadm -C /dev/md11 -n 1 -l 1 --force /dev/sdc2
> # mdadm -C /dev/md10 -n 1 -l 1 -e 0 --force /dev/sdc1
> ... create partitions
> # fdisk -l /dev/md10
> ...
> Device Boot Start End Sectors Size Id Type
> /dev/md10p1 2048 22527 20480 10M 83 Linux
> /dev/md10p2 22528 192383 169856 82.9M 83 Linux
> # fdisk -l /dev/md11
> ...
> Device Boot Start End Sectors Size Id Type
> /dev/md11p1 2048 206847 204800 100M 83 Linux
> /dev/md11p2 206848 1757183 1550336 757M 83 Linux
> # cat /sys/block/md10/md10p1/dev
> 259:0
> # cat /sys/block/md10/md10p2/dev
> 259:1
> # cat /sys/block/md11/md11p1/dev
> 259:2
> # cat /sys/block/md11/md11p2/dev
> 259:3
> $ grep -A2 '259 block' /Net/git/linux-stable/Documentation/admin-guide/devices.txt
> 259 block Block Extended Major
> Used dynamically to hold additional partition minor
> numbers and allow large numbers of partitions per device
> So, to boot to a md device partition (as /) might be a hit and miss
> unless you use some initramfs magic.
OK, thanks for the explanation. My root partition is an entire device,
/dev/md126. I've only had problems with it when accidents have
happened, like yesterday evening.
> Regards,
> /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Hoël Bézier @ 2024-12-20 20:38 UTC
To: gentoo-user
On Fri, Dec 20, 2024 at 08:19:55 +0000, Alan Mackenzie wrote:
>By the way, do you know an easy way for copying an entire filesystem,
>such as the root system, but without copying other systems mounted in
>it? I tried for some while with rsync and various combinations of
>find's and xargs's, and in the end booted up into the rescue disc to do
>it. I shouldn't have to do that.
rsync -x / /some-other-place
From man rsync:
--one-file-system, -x don’t cross filesystem boundaries
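For the curious, the test behind "don't cross filesystem boundaries" is, as far as I know, a comparison of device IDs: two paths are on the same filesystem when stat() reports the same st_dev for both. A minimal sketch of that check (same_filesystem is just a name chosen for this example, and this is not rsync's actual code):

```c
#include <sys/stat.h>

/* Returns 1 if both paths live on the same filesystem, 0 if not,
 * and -1 if either path cannot be examined. */
int same_filesystem(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;

    /* st_dev identifies the device (filesystem) containing the file. */
    return sa.st_dev == sb.st_dev;
}
```

On a typical system same_filesystem("/", "/proc") would be 0, which is why a copy restricted this way leaves /proc, /sys and other mounts alone.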
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-20 20:53 UTC
To: gentoo-user
Hello, Hoël.
On Fri, Dec 20, 2024 at 21:38:42 +0100, Hoël Bézier wrote:
> On Fri, Dec 20, 2024 at 08:19:55 +0000, Alan Mackenzie wrote:
> >By the way, do you know an easy way for copying an entire filesystem,
> >such as the root system, but without copying other systems mounted in
> >it? I tried for some while with rsync and various combinations of
> >find's and xargs's, and in the end booted up into the rescue disc to do
> >it. I shouldn't have to do that.
> rsync -x / /some-other-place
> From man rsync:
> --one-file-system, -x don’t cross filesystem boundaries
Thanks! I'll remember that. For some reason I didn't find it when
searching the rsync man page.
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: karl @ 2024-12-20 22:02 UTC
To: gentoo-user
Alan Mackenzie:
...
> By the way, do you know an easy way for copying an entire filesystem,
> such as the root system, but without copying other systems mounted in
> it? I tried for some while with rsync and various combinations of
> find's and xargs's, and in the end booted up into the rescue disc to do
> it. I shouldn't have to do that.
rsync as other people have suggested.
There is also
cp -x
dump/restore
find -xdev
etc.
You can also do it by accessing the /dev/-file like
dd if=source of=dest (cp works here also but dd is more the norm).
///
When something is mounted on a mount point, the files below the
mount point are hidden and the mounted filesystem will be available
instead. Do you want to copy those hidden files also?
Regards,
/Karl Hammar
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: karl @ 2024-12-20 22:02 UTC
To: gentoo-user
Alan Mackenzie:
> On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
...
> > Please tell if you make booting with metadata 1.2 work.
> > I haven't tested that.
>
> I've just tried it, with metadata 1.2, and it doesn't work. I got error
> messages at boot up to the effect that the component partitions were
> lacking valid version 0.0 super blocks.
>
> People without initramfs appear not to be in the sights of the
> maintainers of this software. They could so easily have made the
> assembly of metadata 1.2 components on the kernel command line work.
> :-(
...
The cmd line handling and auto mounting seem to be handled in files
like (depending on the kernel version, I guess):
drivers/md/md-autodetect.c
init/do_mounts_md.c
you can find the correct file with
find <kernel top dir> -type f -name \*.c | xargs grep MD_AUTODETECT
The problem might be that in format 1.2 the superblock is at 4K from the
start; could format 1.1 (where the superblock is at the start) work?
Regards,
/Karl Hammar
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-21 12:43 UTC
To: gentoo-user
Hello, Karl.
On Fri, Dec 20, 2024 at 23:02:58 +0100, karl@aspodata.se wrote:
> Alan Mackenzie:
> > On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
> ...
> > > Please tell if you make booting with metadata 1.2 work.
> > > I haven't tested that.
> > I've just tried it, with metadata 1.2, and it doesn't work. I got error
> > messages at boot up to the effect that the component partitions were
> > lacking valid version 0.0 super blocks.
> > People without initramfs appear not to be in the sights of the
> > maintainers of this software. They could so easily have made the
> > assembly of metadata 1.2 components on the kernel command line work.
> > :-(
> ...
> The cmd line handling and auto mounting seem to be handled in files
> like (depending on the kernel version, I guess):
> drivers/md/md-autodetect.c
> init/do_mounts_md.c
> you can find the correct file with
> find <kernel top dir> -type f -name \*.c | xargs grep MD_AUTODETECT
The pertinent functions are mainly in drivers/md/md-autodetect.c and
md.c (same directory).
It seems that nowhere does this code try the different metadata formats
in turn, using the first valid one that it finds. Instead, it expects
the metadata format to be passed in as an argument to whatever needs it.
For the md kernel parameter to be able to load metadata versions
1.[012], the parameter definition would have to be enhanced, somehow.
Something like:
md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
^^^^
, where the extra bit is optional. This enhancement would not be
difficult. The trouble is more political. I think this code is
maintained by RedHat. RedHat's customers all use initramfs, so they
probably think everybody else should, too, hence would be unwilling to
enhance it for a small group of Gentooers.
> The problem might be that in format 1.2 the superblock is at 4K from the
> start; could format 1.1 (where the superblock is at the start) work?
This doesn't seem to be the problem. The 0.90 superblock is right at
the end of the partition, for example. There are two functions in md.c,
super_90_load and super_1_load which read and verify the super block of
the given metadata type.
Despite the 0.90 format being "deprecated", it doesn't appear to be in
any danger. It was in a deprecated state in 2010, when I started using
RAID, and I think the maintainers realise that to phase 0.90 out would
cause a lot of pain and protest. The main limitation with 0.90 that I
can see is its restriction to 2^32 512-byte blocks per component device.
This is the 2 terabyte limitation, which isn't a problem for me at the
moment, but might be for other people with enormous drives.
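Taking the figure above at face value, the arithmetic works out as follows. This is only a sketch of the calculation; the function name is made up, and the limit itself is as stated in the previous paragraph rather than something checked against the md sources.

```c
#include <stdint.h>

/* Largest component size addressable with a 32-bit count of 512-byte
 * blocks, i.e. the limit described above for metadata 0.90. */
uint64_t max_component_bytes_0_90(void)
{
    const uint64_t blocks = UINT64_C(1) << 32;   /* 2^32 blocks */
    const uint64_t block_size = 512;             /* bytes per block */

    return blocks * block_size;                  /* 2^41 bytes = 2 TiB */
}
```

The result is 2^41 bytes, which is where the "2 terabyte" figure comes from.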
Nevertheless, I might make the above enhancement, just because.
> Regards,
> /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-21 16:36 UTC
To: gentoo-user
Hello again, Karl.
On Sat, Dec 21, 2024 at 12:43:50 +0000, Alan Mackenzie wrote:
> On Fri, Dec 20, 2024 at 23:02:58 +0100, karl@aspodata.se wrote:
> > Alan Mackenzie:
> > > On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
> > ...
> > > > Please tell if you make booting with metadata 1.2 work.
> > > > I haven't tested that.
> > > I've just tried it, with metadata 1.2, and it doesn't work. I got error
> > > messages at boot up to the effect that the component partitions were
> > > lacking valid version 0.0 super blocks.
> > > People without initramfs appear not to be in the sights of the
> > > maintainers of this software. They could so easily have made the
> > > assembly of metadata 1.2 components on the kernel command line work.
> > > :-(
> > ...
I've now got working code which assembles a metadata 1.2 RAID array at
boot time. The syntax needed on the command line is, again,
md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
it with anything but 1.2 as yet.
> The pertinent functions are mainly in drivers/md/md-autodetect.c and
> md.c (same directory).
Actually, just in md-autodetect.c.
[ .... ]
> Nevertheless, I might make the above enhancement, just because.
Done.
> > Regards,
> > /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: karl @ 2024-12-21 16:45 UTC
To: gentoo-user
Alan Mackenzie:
...
> I've now got working code which assembles a metadata 1.2 RAID array at
> boot time. The syntax needed on the command line is, again,
>
> md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
>
> .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
> it with anything but 1.2 as yet.
...
Fun! Which kernel, can you send a patch ?
Regards,
/Karl Hammar
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-21 16:58 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 627 bytes --]
Hello, Karl.
On Sat, Dec 21, 2024 at 17:45:13 +0100, karl@aspodata.se wrote:
> Alan Mackenzie:
> ...
> > I've now got working code which assembles a metadata 1.2 RAID array at
> > boot time. The syntax needed on the command line is, again,
> > md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
> > .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
> > it with anything but 1.2 as yet.
> ...
> Fun! Which kernel, can you send a patch ?
6.6.62. Patch enclosed. It should apply cleanly from the directory
..../drivers/md.
Have fun!
> Regards,
> /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
[-- Attachment #2: diff.20241221b.diff --]
[-- Type: text/plain, Size: 1469 bytes --]
diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
index b2a00f213c2c..2cd347108284 100644
--- a/drivers/md/md-autodetect.c
+++ b/drivers/md/md-autodetect.c
@@ -124,6 +124,17 @@ static void __init md_setup_drive(struct md_setup_args *args)
 	struct mddev *mddev;
 	int err = 0, i;
 	char name[16];
+	int major_version = 0, minor_version = 90;
+	char *pp;
+	static struct {
+		char *metadata;
+		int major_version;
+		int minor_version;
+	} metadata_table[] =
+		{{"0.90", 0, 90},
+		 {"1.0", 1, 0},
+		 {"1.1", 1, 1},
+		 {"1.2", 1, 2}};

 	if (args->partitioned) {
 		mdev = MKDEV(mdp_major, args->minor << MdpMinorShift);
@@ -133,6 +144,21 @@ static void __init md_setup_drive(struct md_setup_args *args)
 		sprintf(name, "md%d", args->minor);
 	}

+	pp = strchr(devname, ',');
+	if (pp)
+	{
+		*pp = 0;
+		for (i = 1; i < ARRAY_SIZE(metadata_table); i++)
+			if (!strcmp(devname, metadata_table[i].metadata))
+			{
+				major_version = metadata_table[i].major_version;
+				minor_version = metadata_table[i].minor_version;
+				devname = pp + 1;
+				break;
+			}
+		*pp = ',';
+	}
+
 	for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
 		struct kstat stat;
 		char *p;
@@ -183,6 +209,8 @@ static void __init md_setup_drive(struct md_setup_args *args)
 		goto out_unlock;
 	}

+	ainfo.major_version = major_version;
+	ainfo.minor_version = minor_version;
 	if (args->level != LEVEL_NONE) {
 		/* non-persistent */
 		ainfo.level = args->level;
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Wols Lists @ 2024-12-22 12:02 UTC
To: gentoo-user
On 20/12/2024 17:44, karl@aspodata.se wrote:
>> If I understand things correctly, with this mechanism one can have the
>> kernel assemble the RAID arrays at boot up time with a modern metadata,
>> but still without needing the initramfs. My arrays are still at
>> metadata 0.90.
> Please tell if you make booting with metadata 1.2 work.
> I haven't tested that.
It is NOT supported. The kernel has no code to do so, you need an
initramfs. That said, nowadays I believe you can actually load the
initramfs into the kernel so it's one monolithic blob ...
By the way, as to the other point of putting /dev/sda etc on the kernel
command line, it's the kernel that's messing up and scrambling which
physical disk is which logical sda sdb et al device, so explicitly
specifying that will have exactly NO effect when your hardware/software
combo changes again. I guess it was the fact your rescue disk booted
from CDROM or whatever made THAT sda, and pushed the other disks out of
the way.
sda, sdb, sdc et al are allocated AT RANDOM by the kernel. It just so
happens that the "seed" rarely changes, so in normal use the same values
happen to get chosen every time - until something DOES change, and then
you wonder why everything falls over. The same is also true of md127,
md126 et al. If your raid counts up from md1, md2 etc then those I
believe are stable, but I haven't seen them for pretty much the entire
time I've been involved in mdraid (maybe a decade or so?)
You need to use those UUID/GUID things. I know it's a hassle finding out
whether it's a guid or a uuid, and what it is, and all that crud, but
trust me they don't change, you can shuffle your disks, stick in another
SATA card, move it from SATA to USB (BAD move - don't even think of it
!!!), and the system will still find the correct disk.
Cheers,
Wol
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Wols Lists @ 2024-12-22 12:08 UTC
To: gentoo-user
On 20/12/2024 20:19, Alan Mackenzie wrote:
> I've just tried it, with metadata 1.2, and it doesn't work. I got error
> messages at boot up to the effect that the component partitions were
> lacking valid version 0.0 super blocks.
>
> People without initramfs appear not to be in the sights of the
> maintainers of this software. They could so easily have made the
> assembly of metadata 1.2 components on the kernel command line work.
> :-(
No they couldn't. Not if they wanted (at the time) a kernel small enough
to boot successfully ...
Making the same write go identically to two disks (your basic 0.9
mirror) is pretty simple, and also extremely error prone. Making mdraid
robust with all the other features of an enterprise "protect your data"
system is a lot more work.
mdraid has probably just protected my data - dunno what triggered it,
but I lost a disk and it just got rebuilt in the background without me
doing a thing ...
>
> By the way, do you know an easy way for copying an entire filesystem,
> such as the root system, but without copying other systems mounted in
> it? I tried for some while with rsync and various combinations of
> find's and xargs's, and in the end booted up into the rescue disc to do
> it. I shouldn't have to do that.
Provided it's read-only (so yes if it's the root I might well use a
rescue disk) I'd use dd. That's assuming a fairly small root that's
fairly full, it's rather wasteful if it's not ...
Cheers,
Wol
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Wols Lists @ 2024-12-22 12:16 UTC
To: gentoo-user
On 21/12/2024 12:43, Alan Mackenzie wrote:
> , where the extra bit is optional. This enhancement would not be
> difficult. The trouble is more political. I think this code is
> maintained by RedHat. RedHat's customers all use initramfs, so they
> probably think everybody else should, too, hence would be unwilling to
> enhance it for a small group of Gentooers.
Let's blame RedHat again ... I think you're wrong.
There's a fair few SUSE people in there. The person who did nearly all
of the heavy lifting before he stepped down was SuSE. A lot of the
"senior" team (just a couple of people, as per normal) are Far Eastern,
I'm not sure of their company affiliation.
About the only person I'm confident IS RedHat is the guy maintaining
mdadm, which is not mdraid (it's the management tool, not the "do the
work" tool).
Cheers,
Wol
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-22 13:08 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 1168 bytes --]
Hello again!
On Sat, Dec 21, 2024 at 16:58:59 +0000, Alan Mackenzie wrote:
> Hello, Karl.
> On Sat, Dec 21, 2024 at 17:45:13 +0100, karl@aspodata.se wrote:
> > Alan Mackenzie:
> > ...
> > > I've now got working code which assembles a metadata 1.2 RAID array at
> > > boot time. The syntax needed on the command line is, again,
> > > md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
> > > .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
> > > it with anything but 1.2 as yet.
> > ...
> > Fun! Which kernel, can you send a patch ?
> 6.6.62. Patch enclosed. It should apply cleanly from the directory
> ..../drivers/md.
There was an error in yesterday's patch. For some reason I can't
fathom, I'd started a loop with
for (i = 1; ....)
in place of the correct
for (i = 0; ....)
The consequence was that the driver would not recognise "0.90" when
given explicitly in the kernel command line, for example as
md=126,0.90,/dev/nvme0n1p4,/dev/nvme1n1p4
Please use the enclosed patch in place of that patch from yesterday.
Thanks!
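To see the effect of the loop bound in isolation, here is a small user-space sketch, not the kernel code itself. The table mirrors the one in the patch; lookup_metadata and its "first" parameter are names made up for this illustration, so that the same search can be run from index 1 (the bug) and from index 0 (the fix).

```c
#include <stddef.h>
#include <string.h>

struct metadata_entry {
    const char *metadata;
    int major_version;
    int minor_version;
};

/* Same entries as the table in the patch. */
static const struct metadata_entry metadata_table[] = {
    {"0.90", 0, 90},
    {"1.0", 1, 0},
    {"1.1", 1, 1},
    {"1.2", 1, 2},
};

/* Hypothetical helper: search the table starting at index `first`.
 * Returns the matching index, or -1 if the token is not found. */
int lookup_metadata(const char *token, size_t first)
{
    size_t n = sizeof(metadata_table) / sizeof(metadata_table[0]);

    for (size_t i = first; i < n; i++)
        if (strcmp(token, metadata_table[i].metadata) == 0)
            return (int) i;
    return -1;
}
```

Starting at 1, lookup_metadata("0.90", 1) finds nothing, while lookup_metadata("0.90", 0) returns index 0. As far as I can tell from the patch, the missed match means the "0.90" token is not stripped and ends up being treated as the first device name.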
> Have fun!
> > Regards,
> > /Karl Hammar
--
Alan Mackenzie (Nuremberg, Germany).
[-- Attachment #2: diff.20241222.diff --]
[-- Type: text/plain, Size: 1473 bytes --]
diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
index b2a00f213c2c..6bd6e9177969 100644
--- a/drivers/md/md-autodetect.c
+++ b/drivers/md/md-autodetect.c
@@ -124,6 +124,17 @@ static void __init md_setup_drive(struct md_setup_args *args)
 	struct mddev *mddev;
 	int err = 0, i;
 	char name[16];
+	int major_version = 0, minor_version = 90;
+	char *pp;
+	static struct {
+		char *metadata;
+		int major_version;
+		int minor_version;
+	} metadata_table[] =
+		{{"0.90", 0, 90},
+		 {"1.0", 1, 0},
+		 {"1.1", 1, 1},
+		 {"1.2", 1, 2}};

 	if (args->partitioned) {
 		mdev = MKDEV(mdp_major, args->minor << MdpMinorShift);
@@ -133,6 +144,21 @@ static void __init md_setup_drive(struct md_setup_args *args)
 		sprintf(name, "md%d", args->minor);
 	}

+	pp = strchr(devname, ',');
+	if (pp)
+	{
+		*pp = 0;
+		for (i = 0; i < ARRAY_SIZE(metadata_table); i++)
+			if (!strcmp(devname, metadata_table[i].metadata))
+			{
+				major_version = metadata_table[i].major_version;
+				minor_version = metadata_table[i].minor_version;
+				devname = pp + 1;
+				break;
+			}
+		*pp = ',';
+	}
+
 	for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
 		struct kstat stat;
 		char *p;
@@ -183,6 +209,8 @@ static void __init md_setup_drive(struct md_setup_args *args)
 		goto out_unlock;
 	}

+	ainfo.major_version = major_version;
+	ainfo.minor_version = minor_version;
 	if (args->level != LEVEL_NONE) {
 		/* non-persistent */
 		ainfo.level = args->level;
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-22 13:43 UTC
To: gentoo-user
Hello, Wol.
On Sun, Dec 22, 2024 at 12:02:49 +0000, Wols Lists wrote:
> On 20/12/2024 17:44, karl@aspodata.se wrote:
> >> If I understand things correctly, with this mechanism one can have the
> >> kernel assemble the RAID arrays at boot up time with a modern metadata,
> >> but still without needing the initramfs. My arrays are still at
> >> metadata 0.90.
> > Please tell me if you get booting with metadata 1.2 to work.
> > I haven't tested that.
> It is NOT supported. The kernel has no code to do so, you need an
> initramfs. That said, nowadays I believe you can actually load the
> initramfs into the kernel so it's one monolithic blob ...
With my patch from yesterday (corrected today), you can indeed instruct
the kernel to assemble RAID devices with metadata 1.2. It wasn't a
difficult patch by any means. One wonders why the md kernel team hadn't
done it a long time ago.
Initramfses are ugly, ungainly things, often many times larger than the
kernel itself, and appear not to have been well thought out. They are
surely a source of complication and error, and are best avoided where
possible. I've never actually built one myself, and will go to some
lengths, such as hacking the kernel, to avoid doing so.
> By the way, as to the other point of putting /dev/sda etc on the kernel
> command line, it's the kernel that's messing up and scrambling which
> physical disk is which logical sda sdb et al device, so explicitly
> specifying that will have exactly NO effect when your hardware/software
> combo changes again.
/dev/sda (or, in my case, /dev/nvme0n1), etc. don't, in my experience,
get scrambled by the kernel. They're plugged into the same sockets on
the motherboard from day to day, so unless you're physically inserting
or removing them, you won't have trouble.
> I guess it was the fact your rescue disk booted from CDROM or whatever
> made THAT sda, and pushed the other disks out of the way.
No, you've misunderstood my situation. What got scrambled by the rescue
disc was the assignment of /dev/md127 and /dev/md126. This has been
solved by explicitly specifying the assignment with md parameters in the
kernel command line. So now my system boots just fine, even after the
assignment of the devices (the "preferred-minor" field in the MD
superblock) has been scrambled by the rescue disk.
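For anyone wanting to see what a rescue disk has done to that field, something like the following should show it. This is only a sketch: it assumes that mdadm --examine prints a "Preferred Minor : N" line for 0.90 superblocks, and preferred_minor is a hypothetical helper name.

```shell
# Extract the "Preferred Minor" value from `mdadm --examine` output on stdin.
# Assumption: 0.90 superblocks are reported with a "Preferred Minor : N" line.
preferred_minor() {
    sed -n 's/^[[:space:]]*Preferred Minor[[:space:]]*:[[:space:]]*//p'
}

# Typical use (needs root and a real component device; names are examples):
#   mdadm --examine /dev/nvme0n1p4 | preferred_minor
#
# The field can be rewritten while assembling an array under the wanted name:
#   mdadm --assemble /dev/md126 --update=super-minor /dev/nvme0n1p4 /dev/nvme1n1p4
```

With explicit md= parameters on the kernel command line the stored value no longer decides which array becomes /dev/md126, so this is mainly useful for diagnosis.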
> sda, sdb, sdc et al are allocated AT RANDOM by the kernel.
Only in the sense that it may be difficult on a new machine to predict
in advance which physical HDD becomes which sdx. As I said, the
assignment of physical drives to logical devices is repeatable, and
doesn't change from day to day.
> It just so happens that the "seed" rarely changes, so in normal use
> the same values happen to get chosen every time - until something DOES
> change, and then you wonder why everything falls over. The same is
> also true of md127, md126 et al. If your raid counts up from md1, md2
> etc then those I believe are stable, but I haven't seen them for
> pretty much the entire time I've been involved in mdraid (maybe a
> decade or so?)
> You need to use those UUID/GUID things. I know it's a hassle finding out
> whether it's a guid or a uuid, and what it is, and all that crud, but
> trust me they don't change, you can shuffle your disks, stick in another
> SATA card, move it from SATA to USB (BAD move - don't even think of it
> !!!), and the system will still find the correct disk.
The trouble is that a kernel command line, or /etc/fstab, using lots
of these is not human readable, and hence is at the edge of
unmaintainability. This maintenance difficulty surely outweighs the
rare situation where the physical->logical assignment changes due to a
broken drive. That's what we've got rescue disks for.
> Cheers,
> Wol
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Peter Humphrey @ 2024-12-22 15:29 UTC
To: gentoo-user
On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:
> The trouble [is] that a kernel command line, or /etc/fstab, using lots
> of these is not human readable, and hence is at the edge of
> unmaintainability. This maintenance difficulty surely outweighs the
> rare situation where the physical->logical assignment changes due to a
> broken drive. That's what we've got rescue disks for.
Hear, hear! I never could understand why everyone seems to want to jump onto
that band-wagon.
--
Regards,
Peter.
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Wols Lists @ 2024-12-22 16:53 UTC
To: gentoo-user
On 22/12/2024 15:29, Peter Humphrey wrote:
> On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:
>
>> The trouble [is] that a kernel command line, or /etc/fstab, using lots
>> of these is not human readable, and hence is at the edge of
>> unmaintainability. This maintenance difficulty surely outweighs the
>> rare situation where the physical->logical assignment changes due to a
>> broken drive. That's what we've got rescue disks for.
>
> Hear, hear! I never could understand why everyone seems to want to jump onto
> that band-wagon.
>
I have no problem with you saying all this long guid crap makes stuff
unreadable (and yes, I agree, unreadable and unmaintainable aren't that
far different) BUT
> surely outweighs the rare situation where the physical->logical
> assignment changes
THAT DEPENDS ON YOUR HARDWARE!
For normal consumer grade hardware, I agree. I've never known it change
unless I've been mucking about with add-in SATA, PATA, whatever cards.
BUT. Especially on big server-grade hardware, where there's lots of trip
switches so stuff doesn't all power up in one huge spike (and I've
worked with such), different parts of the system come up in a completely
random order, and drives re-order themselves pretty much every single boot!
So yes, with our consumer hardware I'd agree with you. But the people
paying big bills for reliable top-range hardware would wonder what
you're smoking!
Cheers,
Wol
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Alan Mackenzie @ 2024-12-22 20:05 UTC
To: gentoo-user
Hello, Wol.
On Sun, Dec 22, 2024 at 16:53:17 +0000, Wols Lists wrote:
> On 22/12/2024 15:29, Peter Humphrey wrote:
> > On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:
> >> The trouble [is] that a kernel command line, or /etc/fstab, using lots
> >> of these is not human readable, and hence is at the edge of
> >> unmaintainability. This maintenance difficulty surely outweighs the
> >> rare situation where the physical->logical assignment changes due to a
> >> broken drive. That's what we've got rescue disks for.
> > Hear, hear! I never could understand why everyone seems to want to jump onto
> > that band-wagon.
> I have no problem with you saying all this long guid crap makes stuff
> unreadable (and yes, I agree, unreadable and unmaintainable aren't that
> far different) BUT
> > surely outweighs the rare situation where the physical->logical
> > assignment changes
> THAT DEPENDS ON YOUR HARDWARE!
> For normal consumer grade hardware, I agree. I've never known it change
> unless I've been mucking about with add-in SATA, PATA, whatever cards.
This is the desirable state of affairs.
> BUT. Especially on big server-grade hardware, where there's lots of trip
> switches so stuff doesn't all power up in one huge spike (and I've
> worked with such), different parts of the system come up in a completely
> random order, and drives re-order themselves pretty much every single boot!
So all this 32 hex digit UUID stuff is a workaround for the
unpredictability of server hardware. What seems to be missing is a way
of associating a given disk socket on the motherboard with /dev/sda.
Instead we have to put up with "content addressing".
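Something close to socket-based naming may already be available: on systems running udev, /dev/disk/by-path/ usually holds symlinks named after the bus path of each disk. The sketch below lists them with their current targets. list_by_path is a hypothetical helper, and whether the path names really stay fixed across boots depends on the hardware and firmware.

```shell
# List the by-path symlinks and the device nodes they currently point to.
# The directory is an assumption about the running system (udev-managed /dev);
# an alternative directory can be passed as the first argument.
list_by_path() {
    dir=${1:-/dev/disk/by-path}
    if [ ! -d "$dir" ]; then
        echo "no $dir on this system" >&2
        return 1
    fi
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Typical use:
#   list_by_path
```

A name from that directory can then be used in /etc/fstab in place of /dev/sdX, which ties the entry to the connector rather than to the detection order.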
> So yes, with our consumer hardware I'd agree with you. But the people
> paying big bills for reliable top-range hardware would wonder what
> you're smoking!
I think any system admins reading this would long for the predictability
of "consumer hardware", having too often been confronted with
indistinguishable 32 hex digit identifiers. I would imagine it quite
likely that the said admins have written scripts to make this more
manageable.
> Cheers,
> Wol
--
Alan Mackenzie (Nuremberg, Germany).
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Steven Lembark @ 2024-12-25 21:16 UTC
To: gentoo-user; +Cc: lembark
> I think any system admins reading this would long for the
> predictability of "consumer hardware", having too often been
> confronted with indistinguishable 32 hex digit identifiers. I would
> imagine it quite likely that the said admins have written scripts to
> make this more manageable.
Simple fix: use LVM and let it deal with the UUIDs. At that point
the PVs get UUIDs, the VGs get UUIDs, the LVs get UUIDs,
and you never have to type, see or use them.
Snippet from my /etc/fstab:
/dev/vg00/root / xfs ...
/dev/vg00/var /var xfs ...
/dev/vg00/var-tmp /var/tmp xfs ...
This is basically the same fstab on my server and notebook; it hasn't
changed in the transitions from ATA to SATA to SCSI to SAS to
NVMe.
If you want mirroring then either create a mirror with mdadm
and use it as a PV -- the kernel will auto-start the mirror, vgscan
will find it, and voilà, it's up -- or use -m2 and mirror/stripe/
RAID5/whatever using lvcreate to spread the data across whatever
you like.
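The first route (mdadm mirror used as a PV) might be sketched as below. It is a dry run by default and only prints the commands. The device names, the 20G size and the helper names run and setup_mirrored_vg are assumptions made for illustration, not taken from my machines.

```shell
# Dry-run wrapper: prints each command instead of running it.
# Set RUN=1 to execute for real (needs root and destroys data on the devices).
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Hypothetical helper: mirror two partitions with mdadm, then use the
# resulting array as the single PV of a new volume group.
setup_mirrored_vg() {
    md=$1
    vg=$2
    dev_a=$3
    dev_b=$4
    run mdadm --create "$md" --level=1 --raid-devices=2 "$dev_a" "$dev_b"
    run pvcreate "$md"
    run vgcreate "$vg" "$md"
    run lvcreate -L 20G -n root "$vg"
}

setup_mirrored_vg /dev/md0 vg00 /dev/nvme0n1p2 /dev/nvme1n1p2
```

For the LVM-only route, the usual form is something like lvcreate --type raid1 -m 1 -L 20G -n root vg00. As far as I know, -m counts the copies in addition to the original, so -m 1 gives a two-way mirror.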
Here I have two NVMe drives (used to be SCSI, then SAS) which are mirrored
for vg00 with the root, var and home filesystems, and another that's striped
for /var/tmp and other scratch spaces.
This gives an overview:
https://speakerdeck.com/lembark/its-only-logical-lvm-for-linux
--
Steven Lembark
Workhorse Computing
lembark@wrkhors.com
+1 888 359 3508
* Re: [gentoo-user] Fun with mdadm (Software RAID)
From: Frank Steinmetzger @ 2024-12-30 4:08 UTC
To: gentoo-user
On Fri, Dec 20, 2024 at 11:02:55PM +0100, karl@aspodata.se wrote:
> Alan Mackenzie:
> ...
> > By the way, do you know an easy way for copying an entire filesystem,
> > such as the root system, but without copying other systems mounted in
> > it? I tried for some while with rsync and various combinations of
> > find's and xargs's, and in the end booted up into the rescue disc to do
> > it. I shouldn't have to do that.
>
> rsync as other people have suggested.
> There is also
> cp -x
> dump/restore
> find -xdev
> etc.
>
> You can also do it by accessing the /dev/-file like
> dd if=source of=dest (cp works here also but dd is more the norm).
>
> ///
>
> When something is mounted on a mount point, the files below the
> mount point are hidden and the mounted filesystem is available
> instead. Do you want to copy those hidden files as well?
To get around this, I usually bind-mount the filesystem to another
directory first. I usually only do this when I’m dealing with /, as my FS
structure is not complex:
    mount --bind / /mnt/bind
    rsync -axAHX /mnt/bind/ /path/to/destination/
(-x is not needed then, but it’s part of muscle memory)
--
Grüße | Greetings | Salut | Qapla’
Please do not share anything from, with or about me on any social network.
Keyboard not connected, press F1 to continue.
Thread overview (23+ messages):
2024-12-20 10:47 [gentoo-user] Fun with mdadm (Software RAID) Alan Mackenzie
2024-12-20 14:50 ` karl
2024-12-20 15:28 ` Alan Mackenzie
2024-12-20 17:44 ` karl
2024-12-20 20:19 ` Alan Mackenzie
2024-12-20 20:38 ` Hoël Bézier
2024-12-20 20:53 ` Alan Mackenzie
2024-12-20 22:02 ` karl
2024-12-30 4:08 ` Frank Steinmetzger
2024-12-20 22:02 ` karl
2024-12-21 12:43 ` Alan Mackenzie
2024-12-21 16:36 ` Alan Mackenzie
2024-12-21 16:45 ` karl
2024-12-21 16:58 ` Alan Mackenzie
2024-12-22 13:08 ` Alan Mackenzie
2024-12-22 12:16 ` Wols Lists
2024-12-22 12:08 ` Wols Lists
2024-12-22 12:02 ` Wols Lists
2024-12-22 13:43 ` Alan Mackenzie
2024-12-22 15:29 ` Peter Humphrey
2024-12-22 16:53 ` Wols Lists
2024-12-22 20:05 ` Alan Mackenzie
2024-12-25 21:16 ` Steven Lembark