Posted on 2024-10-23
Today my day started as many days have started before. I logged in to my PC, checked archlinux.org for any important news, and ran a system update:
$ pacman -Syu
I ususally manage my own AUR dependencies, having not dabbled into the world of yay
yet. Today I was greeted with an incompatibility that I couldn’t easily resolve - I’ve been busy and it has been over 3 weeks since my last update! Rolling releases don’t like it when the rolling stops. Motivated to just keep moving, and not particularly fussed over my install of aseprite
, I removed the package causing the problem and carried on.
$ pacman -R aseprite
$ pacman -Syu
This was a large update, due to the time that had passed. Several GB (or Gb? I am never fully sure) of data flowed through the internet into my machine. I realized that I hadn’t merged any .pacnew
files for a while. But what is a .pacnew
file? When some default configuration files change (especially important ones like /etc/sshd/sshd_config
), pacman helpfully puts the changes in a .pacnew
file, and merging these changes into your config is left as an exercise for the system administrator.
Some people just groaned reading that, but I love this philosophy of handling updates. I love Arch Linux because there is no magic. If important configuration options are added to a core utility, I want to know about them and make an informed decision about how to merge them into my config.
Today, there were a few really important .pacnew
files I had left unmerged for too long. One was for /etc/shells
, a file listing valid login shells. /etc/passwd
is a critical system file that is used in Unix systems that is used as part of user login. The actual passwords don’t live here in any sensible system, but it is an important and very old part of user authentication, with contents including a user’s login shell. If a user’s login shell is not in /etc/shells
, they cannot login. Emboldened by what looked to be trivial changes, I performed the merges by hand with confidence, but I didn’t consider investigating any secondary files such as /etc/password
which could also be impacted.
I was able to update everything in-place without rebooting and continue using my computer. Later when attempting to use a USB controller, it wasn’t being detected - I needed to reboot to reload the USB drivers in the new kernel. Upon rebooting I was greeted with my favourite login prompt, the plaintext console:
Arch Linux 6.11.4-zen2-1-zen (tty1)
lassie login: _
I type my username and password… Login incorrect
. It must have been a typo, so I try again. And again. Have I forgotten my password? Then I considered the .pacnew
changes and realize I must have broken user authentication! I don’t have any other users configured that I can try logging in as, so I can’t tell if this is local to my user or system-wide. It’s possible that me being the only user makes this a system-wide error by default!
I have at least 2 Arch Linux boot USBs floating around in drawes and containers. The Arch Linux installer is actually just Arch Linux, so as a nice side-effect of this you are left with an extremely useful recovery device when you install from a USB. The first step was finding a boot USB. This took longer than I expected.
The first USB refused to boot. I looked around its contents on my laptop, noting that it had a kernel and all the other things needed to run an OS. I can’t tell why it can’t boot, but I spend at least 20mins trying, and looking at BIOS settings on my PC.
I spend even longer hunting for the second USB. I pull 3 USBs out of a box, and none of them have been used in over a year, but I suspect one of them boots Arch Linux. My first guess is correct, and I’m back in business on my main PC.
I have broken something in /etc
. Maybe I broke /etc/passwd
, but I also changed /etc/shells
. I need to mount the partition of my OS that contains /etc
, examine my system files, then correct any errors. To do this, I’m going to assume root
control of my installed OS from the boot USB.
The first step is to find /etc
. I have a few drives installed in my PC, from sda
to sde
, 13 paritions all up. I was excited about partitions when I installed this OS, I’m not sure I would use so many paritions next time. I notice that /dev/sdb
has the most partitions, and I’m pretty sure this drive has most of my Linux filesystem. It’s time to start mounting stuff!
root@archiso ~ # mount /dev/sdb1 /mnt
root@archiso ~ # ls /mnt
EFI screenshot_001.bmp
This is clearly my EFI partition. Why is there a screenshot here? 🤔
root@archiso ~ # umount /mnt
root@archiso ~ # mount /dev/sdb2 /mnt
root@archiso ~ # ls /mnt
bin boot data dev etc home lib lib64 lost+found media mnt opt proc root run sbin shared src sys tmp usr var
root@archiso ~ # ls /mnt/etc
\snipped for brevity
Success! And I can verify that the contents of /etc
are really here! I try to open the file and realize that while we have vim
, vi
is not symlinked to vim
. I thought that was fairly standard, and it catches me off guard every time I use the Arch boot USB. On my system vi
is symlinked to nvim
. So I investigate this file using vim
:
root@archiso ~ # vim /mnt/etc/passwd
Finding my passwd entry in this file:
tom:x:1000:1000::/home/tom:/bin/zsh
I remember one of the changes in the .pacnew
files was moving from /bin/bash
to /usr/bin/bash
, because we all agreed that we’re doing things in /usr/bin
now (except for when we don’t). I noticed that my /etc/shells
entry for ZSH was also using /bin/zsh
, so I helpfully changed that to /usr/bin/zsh
. On no, now my passwd
file does not contain a valid shell!
One important step to minimize slow USB rebooting is to try logging in from this USB environment. We can use the currently loaded kernel, but rebase ourselves into a different filesystem using the chroot
command. This would mean we are acting as if we were the root
user of my real PC Linux install, but running on whatever version of the kernel was on this USB. I’m sure there will be no problems with this despite the age of the USB.
Before I can chroot
I need to mount my remaining filesystem into /mnt
root@archiso ~ # mount /dev/sdb3 /mnt/var
root@archiso ~ # mount /dev/nvme0n1p1 /mnt/home
root@archiso ~ # swapon /dev/sdb5
I’m not sure if I need to manually enable the same swap partition, but it’s present in what will become /etc/fstab
so it’s probably wise. I have a few other partitions for misc data (eg /home/tom/workspace
) but these aren’t critical for this recovery mission. Let’s see if we can chroot
and su
into my tom
user!
root@archiso ~ # chroot /mnt
archiso# su tom
tom@archiso>
It looks like su
bypasses whatever login mechanism checks /etc/shells
, and while that isn’t useful to me now it’s worth taking note of - If I had root login with a password enabled then I’d be able to fix this without the rescue USB. I’m not sure disabling that has meaningfully more security when I store a boot USB capable of building this chroot
environment within 2m of my PC.
One cool thing is that we can now use vi
and it picks up my symlink to my nvim
- we’re using my /usr/bin
now!
At the very least I can make sure that /usr/bin/zsh
is a real shell once I am logged in:
tom@archiso> echo $SHELL
/bin/zsh
tom@archiso> ^D
archiso# vi /etc/passwd
archiso# su tom
tom@archiso> echo $SHELL
/usr/bin/zsh
After double checking this shell is in /etc/shells
, I rebooted and…
As I’m typing this final line on my PC, it’s safe to say that we’re back 😎