August 12, 2019

Diskless Boot for a Raspberry Pi over PXE and iSCSI

WORDS BY   Sašo Stanovnik

POSTED IN   raspberry pi | pxe | iscsi


In this episode, we look in excruciating detail at removing the SD card from the boot process of the Raspberry Pi. Booting from a network drive both significantly increases performance and ensures the SD card will not wear out over time, which is a big issue with heavily-used Raspberry Pis.

This post serves both as a guide for those unfamiliar with the tools used, going through all the necessary steps from scratch, and as a reference that lets advanced users adapt an existing setup to the peculiarities that network-booting a Raspberry Pi involves. Although booting machines from the network is definitely a solved problem, quite a number of challenges arise from trying to boot a device that has no (traditional) BIOS, has only slight support for PXE and frequently runs a purpose-built operating system.

Preparation

We start with a list of required hardware to set up an isolated environment in which to work:

  • any router capable of being a DHCP server
  • a Raspberry Pi 3B v1.2
  • a storage server machine
  • a dev machine (may also be the storage server for development purposes)
    • running Debian 10 (Buster), or at least dnsmasq>=2.77
  • enough network cables to connect machines
  • an SD card, 4 GB or more
  • an SD card reader
  • an HDMI cable
  • a power supply for the Pi
  • optionally, a separate monitor for convenience

Other versions of the Raspberry Pi, especially later ones, will probably work, but the internet reports that the Raspberry Pi 2B still needs an SD card to host the second-stage bootloader, which the 3B can fetch over the network. The reason for using a separate router is to isolate traffic from the upstream network, because we will be messing around with a proxy DHCP server, which, if misconfigured, might cause issues in the network. Other than that, there is no need to use a separate network, as long as PXE is not already set up on it; otherwise our non-intrusive solution will be overridden.

Customising Raspbian

Download a version of Raspbian, the official operating system for the Raspberry Pi. We used the version from 2019-04-08, because there is currently a problem with newer kernel versions which prevents the Pi from booting from a USB device, which is what the network card is on the 3B.

You can also build Raspbian from scratch, but this will not work, because you get the latest available kernel, which (at the time this post was written) has issues. The Raspbian package archives do not host old versions of the kernel, so there is no straightforward way to lock the kernel to a specific version. Another issue at the time of writing is that building Raspbian on x64 machines is silently broken, requiring you to use the Docker build process and modify the base image to explicitly be its i386 incarnation.

After obtaining a Raspbian image, we need to modify it slightly to enable SSH access over the network, which will simplify further steps by not requiring a keyboard or even a display attached to the Raspberry Pi. The commands below are suitable for direct execution after changing the parameters at the top; this is true for all code blocks in this post.

unzip "2019-04-08-raspbian-stretch-lite.zip"
ssh-keygen -t ed25519 -C raspi-mgmt -N "" -f raspi-mgmt

ROOT_PUBKEY="$(cat raspi-mgmt.pub)"
IMAGE_FILE="2019-04-08-raspbian-stretch-lite.img"

apt install kpartx
creation_output="$(kpartx -asv "$IMAGE_FILE")"
loop_device="$(echo "$creation_output" \
    | head -n 1 \
    | sed -E 's/add map (loop[0-9]+)p.*/\1/')"
temp_dir="$(mktemp -d)"
mount "/dev/mapper/${loop_device}p2" "$temp_dir"
mount "/dev/mapper/${loop_device}p1" "$temp_dir/boot/"

touch "$temp_dir/boot/ssh"
mkdir -p "$temp_dir/root/.ssh/"
echo "$ROOT_PUBKEY" >>"$temp_dir/root/.ssh/authorized_keys"
sed -r -i \
    's/#?.*?PermitRootLogin.*?$/PermitRootLogin without-password/g' \
    "$temp_dir/etc/ssh/sshd_config"

umount "$temp_dir/boot/"
umount "$temp_dir"
rmdir "$temp_dir"
kpartx -dv "$IMAGE_FILE"

Let's look at what's happening in this somewhat complicated script.

  • We install kpartx and set up a loopback device for the image file.
  • Then, we mount both partitions of the image into a temporary folder.
  • Touching /boot/ssh enables SSH on boot for Raspbian.
  • Inserting the public key and enabling pubkey root login completes the configuration.
  • To clean up, we unmount the image and remove the temporary directory and loop device.

The base Raspbian image is now modified and ready to be used for both network and SD card boot. Take note that SSH access is now available immediately after booting and the default pi user, which has passwordless superuser permissions, still has the default password of raspberry. Make sure the environment you are using this image in is secure, and/or make sure you change this password or completely disable password-based authentication for SSH.
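
If you prefer to disable password-based SSH logins in the image itself, a minimal sketch could look like the following. The function name disable_password_auth is our own invention for illustration; PasswordAuthentication is the standard sshd_config directive. It would have to be called while the image is still mounted, that is, before the umount lines in the customisation script above.

```shell
# Sketch: force "PasswordAuthentication no" in an sshd_config file.
# The function name is hypothetical; the argument would typically be
# "$temp_dir/etc/ssh/sshd_config" while the image is still mounted.
# Uses GNU sed (-i), as the rest of this post does.
disable_password_auth() {
    config_file="$1"
    if grep -Eq '^#?[[:space:]]*PasswordAuthentication[[:space:]]' "$config_file"; then
        # Rewrite the existing (possibly commented-out) directive in place.
        sed -E -i \
            's/^#?[[:space:]]*PasswordAuthentication[[:space:]].*$/PasswordAuthentication no/' \
            "$config_file"
    else
        # No directive present at all: append one.
        echo "PasswordAuthentication no" >>"$config_file"
    fi
}

# Usage, inside the customisation script before unmounting:
#   disable_password_auth "$temp_dir/etc/ssh/sshd_config"
```

With this in place only key-based logins work, so make sure the public key was written to authorized_keys before booting the image.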

Preparing the Raspberry Pi

The Raspberry Pi can boot over the network, but this has to be explicitly enabled in its OTP (one time programmable) memory. To do this, we need to boot via the SD card, just this one time, and enable the feature. Begin by flashing the customised Raspbian image onto an SD card.

IMAGE_FILE="2019-04-08-raspbian-stretch-lite.img"
SD_CARD_DEVICE="/dev/sdX"

pv "$IMAGE_FILE" | dd of="$SD_CARD_DEVICE" bs=4M conv=noerror,notrunc

Insert the SD card into the Raspberry Pi, connect it to the network, and boot. Then connect via SSH and enable the feature, which requires another reboot. Okay, I lied about only needing to boot from the SD card once. We could insert program_usb_mode in the customisation process in the previous section, but that would enable the feature for every device booting the image, which may not be what you want. While we are booted into the Pi, grab its serial number to enable differentiation between machines in the following sections.

echo "program_usb_mode=1" >> /boot/config.txt
reboot
# ...
vcgencmd otp_dump | grep -q "17:3020000a" \
    && echo "USB boot enabled" || echo "USB boot disabled"

grep '^Serial' /proc/cpuinfo \
    | cut -d ':' -f 2 \
    | sed -E 's/ +0+(.*)/\1/'
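
The two checks above can also be wrapped in small functions that read from standard input, which makes them easy to try against captured output before running them on a real device. This is only a sketch; the function names usb_boot_enabled and pi_serial are ours, and the logic is the same grep, cut and sed as in the listing above.

```shell
# Reads `vcgencmd otp_dump` output on stdin and succeeds if OTP row 17
# holds the value this post associates with USB/network boot.
usb_boot_enabled() {
    grep -q "17:3020000a"
}

# Reads /proc/cpuinfo-style text on stdin and prints the serial number
# with its leading zeros removed.
pi_serial() {
    grep '^Serial' \
        | cut -d ':' -f 2 \
        | sed -E 's/ +0+(.*)/\1/'
}

# Usage on the Pi:
#   vcgencmd otp_dump | usb_boot_enabled && echo "USB boot enabled"
#   pi_serial </proc/cpuinfo
```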

Modifying the Raspbian initrd for iSCSI

The default initial ramdisk for Raspbian does not support iSCSI, and we need to enable it to be able to use a remote root disk. Installing open-iscsi, creating the /etc/iscsi/iscsi.initramfs marker file that enables iSCSI support in the ramdisk, and finally generating a new initramfs produces new files in the boot partition. We need the files in the boot partition for booting over the network, so we also copy them onto our storage machine.

STORAGE_MACHINE="192.0.2.2"

apt install open-iscsi initramfs-tools
touch /etc/iscsi/iscsi.initramfs
update-initramfs -v -k "$(uname -r)" -c

ssh "$STORAGE_MACHINE" mkdir -p /tmp/bootpart/
scp -r /boot/ "$STORAGE_MACHINE:/tmp/bootpart/$(uname -r)/"

Note that this is tied to the kernel version, a problem we will solve later.
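
Before relying on the copied files, it can be worth confirming that the new ramdisk really contains iSCSI support. The sketch below assumes that the open-iscsi initramfs hook ships the iscsistart binary, which is our understanding of the Debian packaging rather than something verified in this post. It reads a file listing such as the one printed by lsinitramfs (part of initramfs-tools); the function name is hypothetical.

```shell
# Reads an initramfs file listing on stdin (one path per line) and
# succeeds if it contains the iscsistart binary.
# Assumption: the open-iscsi hook copies iscsistart into the ramdisk.
initrd_has_iscsi() {
    grep -q 'iscsistart$'
}

# Usage on the Pi, after update-initramfs has run:
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | initrd_has_iscsi \
#       && echo "iSCSI present" || echo "iSCSI missing"
```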

Provisioning

Now that we have gotten through one-time device setup steps, we can continue with provisioning the storage server and boot environments for each Pi.

Creating the iSCSI target, PXE DHCP proxy and TFTP server

This step is largely standard, with few differences from any other iSCSI PXE network boot configuration. There are, however, Raspberry Pi specifics that make or break the entire setup, and we will point them out.

An important prerequisite for this part is running Debian Buster on the development machine. This is because Debian Stretch, the previous edition, bundles an old version of dnsmasq that does not support the dhcp-reply-delay parameter, important to be able to boot the Raspberry Pi. Alternatively, make sure dnsmasq is at least version 2.77.
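
If you are not on Debian Buster, a quick way to check the dnsmasq requirement is to compare version strings. The following is a sketch with hypothetical function names; it assumes the first line of dnsmasq --version has the form "Dnsmasq version 2.80 ..." and relies on sort -V from GNU coreutils.

```shell
# Prints the version number from the first line of `dnsmasq --version`
# output read on stdin (assumed format: "Dnsmasq version 2.80 ...").
dnsmasq_version() {
    awk 'NR == 1 { print $3 }'
}

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   version_ge "$(dnsmasq --version | dnsmasq_version)" 2.77 \
#       || echo "dnsmasq is too old for dhcp-reply-delay"
```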

These steps are executed on the storage target machine.

TARGET_IQN="iqn.2019-08.si.xlab.blog:rpis"
NETWORK_SUBNET="192.0.2.255"

apt install targetcli-fb dnsmasq
targetcli /iscsi create "$TARGET_IQN"
targetcli saveconfig
mkdir -p /tftpboot/

cat >/etc/dnsmasq.d/proxydhcp.conf <<EOF
port=0
dhcp-range=$NETWORK_SUBNET,proxy
log-dhcp
log-queries
enable-tftp
tftp-root=/tftpboot
pxe-service=0,"Raspberry Pi Boot   "
pxe-prompt="Boot Raspberry Pi", 1
dhcp-no-override
dhcp-reply-delay=1
EOF

systemctl enable dnsmasq
systemctl restart dnsmasq

There are several things going on in the fragment above. Let us look at the important bits.

  • We create an iSCSI target using targetcli.
  • dnsmasq is set up as a proxy DHCP server for the network's subnet.
  • TFTP is enabled with its root at /tftpboot/
  • Important: the pxe-service option deliberately includes a string that ends with three spaces. The Raspberry Pi does not accept DHCP PXE responses whose string lacks these three trailing spaces.
  • Important: dhcp-reply-delay is set to one second. Because of a firmware bug in the primary Raspberry Pi bootloader, the Pi is not capable of processing two simultaneous DHCP responses. Delaying the secondary, proxy response containing the PXE data by one second solves the issue.
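
Because the two Raspberry Pi specifics are easy to lose when editing the file (trailing spaces in particular tend to get stripped by editors), a small check can help. This is a sketch, and the function name check_proxydhcp_conf is made up for illustration; it only looks at the two options discussed above.

```shell
# Checks a dnsmasq proxy DHCP config for the two Pi-specific settings.
# Prints a message to stderr and returns non-zero for each missing one.
check_proxydhcp_conf() {
    conf_file="$1"
    status=0
    # The quoted pxe-service description must end in three spaces.
    if ! grep -Eq '^pxe-service=0,"[^"]*[^ "]   "$' "$conf_file"; then
        echo "pxe-service string does not end in three spaces" >&2
        status=1
    fi
    # The proxy reply must be delayed so the Pi sees it separately.
    if ! grep -Eq '^dhcp-reply-delay=[0-9]+$' "$conf_file"; then
        echo "dhcp-reply-delay is not set" >&2
        status=1
    fi
    return "$status"
}

# Usage on the storage machine:
#   check_proxydhcp_conf /etc/dnsmasq.d/proxydhcp.conf
```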

Making the Pi boot from the network

All that is left is to configure the server-side boot environment for the Raspberry Pi. This involves configuring the iSCSI daemon to serve a root image for the device, configuring the TFTP server to serve the appropriate boot files, and linking everything together with the proper kernel parameters.

This step is the most complicated one. Again, remember that these code blocks are suitable for copy-paste execution once you modify the parameters at the top. Most lines are completely standard steps for setting up an iSCSI root disk PXE boot network.

RPI_SERIAL="xxxxxxxx"
RPI_KERNEL_VERSION="4.14.98-v7+"
TARGET_IQN="iqn.2019-08.si.xlab.blog:rpis"
INITIATOR_IQN="iqn.2019-08.si.xlab.blog.initiator:rpi-blog"
BACKSTORE_SIZE="16G"
IMAGE_FILE="2019-04-08-raspbian-stretch-lite.img"
STORAGE_MACHINE_IP="192.0.2.2"

apt install kpartx

mkdir -p "/tftpboot/$RPI_SERIAL/"
# The boot files were copied to /tmp/bootpart/<kernel version>/ earlier.
cp -r "/tmp/bootpart/$RPI_KERNEL_VERSION/"* "/tftpboot/$RPI_SERIAL/"
cp "/tftpboot/$RPI_SERIAL/bootcode.bin" /tftpboot/bootcode.bin
echo "initramfs initrd.img-$RPI_KERNEL_VERSION followkernel" \
    >> "/tftpboot/$RPI_SERIAL/config.txt"

targetcli /backstores/fileio create \
    "backstore-$RPI_SERIAL" \
    "/srv/backing-file-$RPI_SERIAL" \
    "$BACKSTORE_SIZE" \
    true \
    true
targetcli "/iscsi/$TARGET_IQN/tpg1/luns" create \
    "/backstores/fileio/backstore-$RPI_SERIAL"
targetcli "/iscsi/$TARGET_IQN/tpg1/acls" create \
    "$INITIATOR_IQN" \
    false
targetcli "/iscsi/$TARGET_IQN/tpg1/acls/$INITIATOR_IQN" create \
    0 \
    "/backstores/fileio/backstore-$RPI_SERIAL"
targetcli saveconfig

pv "$IMAGE_FILE" \
    | dd of="/srv/backing-file-$RPI_SERIAL" bs=4M conv=noerror,notrunc

Ignoring standard steps to be more succinct, there are a few notable points to the above setup.

  • The Raspberry Pi is apparently not a real PXE client; it just happens to work that way. A notable thing missing is TFTP request namespacing based on the machine's MAC address. Instead, a Raspberry Pi looks for bootcode.bin only in the TFTP root and then for all other files under a folder named after its serial number.
  • In the boot configuration file config.txt, the iSCSI-enabled initrd is specified.
  • The backstore is created as a sparse file to save on space.
  • Multiple clients can be provisioned and connected through the same procedure, as both the boot files and the iSCSI configuration are namespaced using each device's serial number.
  • We have not set up authentication for iSCSI, which means that anyone knowing both the target and initiator IQNs can connect. There is also no transport security in iSCSI. Make sure this is okay with you.
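
The TFTP layout described above can be verified with a short check before powering on the Pi. This is a sketch under the assumptions of this post (bootcode.bin in the TFTP root, everything else under the serial number, and an initramfs line in config.txt); the function name check_tftp_layout is hypothetical.

```shell
# Verifies the per-device TFTP layout for one Raspberry Pi.
# $1 is the TFTP root, $2 the serial number of the Pi.
check_tftp_layout() {
    tftp_root="$1"
    serial="$2"
    if [ ! -f "$tftp_root/bootcode.bin" ]; then
        echo "missing bootcode.bin in the TFTP root" >&2
        return 1
    fi
    if [ ! -f "$tftp_root/$serial/config.txt" ]; then
        echo "missing config.txt for $serial" >&2
        return 1
    fi
    # File name referenced by the last "initramfs" line in config.txt.
    initrd="$(sed -n -E 's/^initramfs +([^ ]+).*/\1/p' \
        "$tftp_root/$serial/config.txt" | tail -n 1)"
    if [ -z "$initrd" ] || [ ! -f "$tftp_root/$serial/$initrd" ]; then
        echo "config.txt does not point at an existing initrd" >&2
        return 1
    fi
}

# Usage on the storage machine:
#   check_tftp_layout /tftpboot "$RPI_SERIAL"
```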

An optional step now is to resize the root filesystem on the provisioned image. Although Raspbian already includes logic for expanding the root filesystem to fill the SD card, this does not seem to work when using iSCSI.

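RPI_SERIAL="xxxxxxxx"
# Operate on the provisioned backing file, not the pristine Raspbian image.
IMAGE_FILE="/srv/backing-file-$RPI_SERIAL"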

apt install cloud-guest-utils

creation_output="$(kpartx -asv "$IMAGE_FILE")"
loop_device="$(echo "$creation_output" \
    | head -n 1 \
    | sed -E 's/add map (loop[0-9]+)p.*/\1/')"
growpart "/dev/$loop_device" 2
partprobe "/dev/$loop_device"

boot_part_offset="$(($(partx --nr 1 --output start --noheadings \
    "/srv/backing-file-$RPI_SERIAL") * 512))"
boot_part_uuid="$(blkid --probe --offset "$boot_part_offset" \
    --output value --match-tag UUID "/srv/backing-file-$RPI_SERIAL")"
root_part_offset="$(($(partx --nr 2 --output start --noheadings \
    "/srv/backing-file-$RPI_SERIAL") * 512))"
root_part_uuid="$(blkid --probe --offset "$root_part_offset" \
    --output value --match-tag UUID "/srv/backing-file-$RPI_SERIAL")"

e2fsck -f -v -p "/dev/disk/by-uuid/$root_part_uuid"
resize2fs "/dev/disk/by-uuid/$root_part_uuid"
temp_dir="$(mktemp -d)"
mount "/dev/disk/by-uuid/$root_part_uuid" "$temp_dir"
sed -E -i \
    "s|.*/boot.*|UUID=$boot_part_uuid /boot vfat defaults 0 2|" \
    "$temp_dir/etc/fstab"
sed -E -i \
    "s|.*/ +.*|UUID=$root_part_uuid / ext4 defaults,noatime 0 1|" \
    "$temp_dir/etc/fstab"

umount "$temp_dir"
rmdir "$temp_dir"
kpartx -dv "$IMAGE_FILE"

Notice that we also modify /etc/fstab inside the image. This is because growpart changes partition identifiers, so the PARTUUID identifiers in the file are no longer valid. We replace them with the current filesystem UUIDs reported by blkid.
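
The sed expressions above match on whole lines. As an alternative sketch, the same rewrite can be keyed on the mount point field using awk, which leaves comments and unrelated entries untouched. This is not what we ran for this post; the function name rewrite_fstab and the UUID values in the usage line are placeholders.

```shell
# Rewrites the /boot and / entries of an fstab file to use UUIDs.
# $1 is the fstab path, $2 the boot UUID, $3 the root UUID.
rewrite_fstab() {
    fstab_file="$1"
    boot_uuid="$2"
    root_uuid="$3"
    tmp_file="$(mktemp)"
    awk -v boot="$boot_uuid" -v root="$root_uuid" '
        /^[[:space:]]*#/ { print; next }   # keep comment lines as they are
        $2 == "/boot" { print "UUID=" boot " /boot vfat defaults 0 2"; next }
        $2 == "/"     { print "UUID=" root " / ext4 defaults,noatime 0 1"; next }
        { print }                          # everything else is unchanged
    ' "$fstab_file" >"$tmp_file" && cat "$tmp_file" >"$fstab_file"
    rm -f "$tmp_file"
}

# Usage, with the root filesystem mounted at "$temp_dir":
#   rewrite_fstab "$temp_dir/etc/fstab" "$boot_part_uuid" "$root_part_uuid"
```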

The final step is modifying kernel boot parameters to point at the iSCSI target.

echo \
    dwc_otg.lpm_enable=0 \
    console=tty1 \
    rootfstype=ext4 \
    elevator=deadline \
    fsck.repair=yes \
    rootwait \
    ip=::::rpi-blog:eth0:dhcp \
    root=UUID=$root_part_uuid \
    ISCSI_INITIATOR=$INITIATOR_IQN \
    "ISCSI_TARGET_NAME=$TARGET_IQN" \
    "ISCSI_TARGET_IP=$STORAGE_MACHINE_IP" \
    ISCSI_TARGET_PORT=3260 \
    rw \
>"/tftpboot/$RPI_SERIAL/cmdline.txt"

Booting up and troubleshooting

Turning on the Raspberry Pi without an inserted SD card should now successfully boot through PXE, connect to the iSCSI target and boot into the system. However, because there are a lot of moving parts, many things can go wrong. Here are some common troubleshooting steps to use in various situations:

  • Streams to monitor for useful information are
    • the display output of the Raspberry Pi,
    • journalctl -efu dnsmasq and
    • tcpdump -vv -i <eth0> port 67 or port 68 or port 69 on the storage server.
  • For suspected iSCSI connection failures, attempt connecting manually from a development machine.
  • If the boot process does not find a disk with a specific UUID, connect the iSCSI target manually and verify that partition UUIDs in the boot configuration, fstab and the block device match using blkid.
    • This may be due to the partition changing identifiers when growing with growpart.
  • The boot process encountering a kernel panic, or a hang characterised by an attached USB keyboard's {Num,Caps,Scroll} Lock lights not responding to toggles means your kernel version is too new and you have encountered a kernel issue described in sections above.
    • Verify the running kernel version by editing cmdline.txt and appending init=/bin/bash to avoid systemd's init.
    • Use the specific version of Raspbian linked above or otherwise ensure that your running kernel version does not exhibit the bug.
  • Strange issues, such as the machine not appearing to boot at all, may appear because of an outdated version of bootcode.bin in the TFTP root directory. This executable is distributed alongside the Raspbian kernel. Using a newer version might help.
  • When encountering issues loading kernel modules, e.g. when installing Docker, check that the booted kernel version (via uname -r) matches the version on the root partition (in /lib/modules/).
  • If curl fails with could not open CA file, you have likely built Raspbian yourself and are encountering the issue of building on an x64 machine. Use the Docker build process as mentioned in the section on customising Raspbian.
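
Two of the checks from the list above are easy to script. The sketch below uses function names of our own choosing: check_cmdline verifies that a cmdline.txt is a single line with non-empty root and iSCSI parameters, and kernel_modules_present verifies that a modules directory exists for a given kernel version. Neither replaces looking at the actual boot output.

```shell
# Succeeds if the given cmdline.txt is one line and contains non-empty
# values for the root and iSCSI parameters used in this post.
check_cmdline() {
    cmdline_file="$1"
    if [ "$(grep -c '' "$cmdline_file")" -ne 1 ]; then
        echo "cmdline.txt should contain exactly one line" >&2
        return 1
    fi
    for pattern in 'root=UUID=' 'ISCSI_INITIATOR=' \
        'ISCSI_TARGET_NAME=' 'ISCSI_TARGET_IP='; do
        # Each parameter must be followed by at least one character.
        if ! grep -Eq "(^| )${pattern}[^ ]+" "$cmdline_file"; then
            echo "missing or empty parameter: $pattern" >&2
            return 1
        fi
    done
}

# Succeeds if a modules directory exists for kernel version $1.
# $2 is the modules root and defaults to /lib/modules.
kernel_modules_present() {
    kernel_version="$1"
    modules_root="${2:-/lib/modules}"
    [ -d "$modules_root/$kernel_version" ]
}

# Usage:
#   check_cmdline "/tftpboot/$RPI_SERIAL/cmdline.txt"      # storage machine
#   kernel_modules_present "$(uname -r)" || echo "mismatch" # on the Pi
```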

Benchmark

Now that our Raspberry Pi boots without issue, we would, of course, like to see how much faster it is around the track. And by track, I mean synthetic benchmarks showing base performance characteristics of the storage subsystem commonly found in storage comparisons.

This is the relevant set of hardware used for measurements:

  • Raspberry Pi 3B v1.2
  • Raspbian Buster, fio 3.12
  • Kingston MicroSDHC 16 GB Class 4 (slightly used, fresh install)
  • iSCSI target on Intel NUC6CAYH, Samsung SSD 850 EVO (500 GB, SATA)

The system was booted and idled for about 1 minute prior to starting the test. Tests were run one right after the other; the hardware was not left to cool down. No other processes, aside from the base system, were running on the NUC at the time of the test. Two simultaneous workers, each operating on a 1 GB file, were used. The Raspberry Pi was connected to an official Raspberry Pi power supply (rated at 3 A), a 10 meter CAT5e cable to a Cisco SG200-08 switch with a Sentinel Router upstream, a keyboard and an HDMI cable to a monitor. The NUC was connected to the switch via a 20 cm CAT6 patch cable.

SD card vs iSCSI SSD benchmark results

Transfers are much quicker in all cases but sequential reads, as this is where SD cards excel. The Raspberry Pi 3B v1.2, used for the benchmark, has only a 100 Megabit Ethernet adapter, and even that wired through the USB hub internally. We assume this is the cause for the artificial-seeming limit on reads and writes, which are both capped at about the same speed.
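
For a rough sense of where such a cap would sit, the raw ceiling of a link can be worked out from its nominal speed. This is back-of-the-envelope arithmetic, not a measurement, and it ignores Ethernet, TCP and iSCSI overhead, all of which lower the usable figure. The function name is hypothetical.

```shell
# Raw ceiling of a link in kB/s, ignoring all protocol overhead.
# $1 is the nominal link speed in Mbit/s (1 Mbit = 1,000,000 bits).
link_ceiling_kb_per_s() {
    echo $(( $1 * 1000 * 1000 / 8 / 1000 ))
}

# Usage:
#   link_ceiling_kb_per_s 100    # prints 12500, i.e. 12.5 MB/s
```

A 100 Megabit link therefore cannot move more than 12.5 MB/s in either direction, which is consistent with reads and writes hitting the same limit.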

In real-world tests, switching to iSCSI-backed storage yielded at least a doubling in C code compilation performance, but we do not have any measurements for you on this. Frequent stalls of the machine, experienced as processes hanging for a few seconds, are also gone, as storage is now much more consistent and not as prone to any write buffers filling up. Life expectancy of the SD card is also greatly increased, because there isn't any.

Using Ansible

Finally, let us take a look at how we could use Ansible to automate some of this. The process is fairly complicated and lengthy, and not all of its steps are suitable for automation, or at least, some reap many more benefits from being scripted than others.

First off, customising the Raspbian image, enabling USB/network boot on the Pi and generating a new initrd are not directly suited to Ansible automation. You can of course automate them with Ansible, but simple shell scripts can work even better, because these tasks only need to be executed once per machine, kernel or Raspbian version and so rarely run in parallel on multiple machines.

Setting up the target, however, is very suitable for Ansible. We think a two-step provisioning process, just like the one used in this blog post, is most suitable: first, set up the storage target by enabling the iSCSI target, DHCP/PXE and TFTP servers, which only needs to be done once; then, execute the steps that must be done to provision each Raspberry Pi.

Cleverly using differentiating inventory variables for each Pi, such as serial numbers, kernel versions and backstore sizes can allow you to provision "infinitely" many devices at once. A peculiarity of this setup would be that, although the inventory upon which tasks are executed would contain Raspberry Pis, all tasks would still have to be delegated to the iSCSI target machine to provision new disk images, iSCSI ACLs and so on; but this would greatly simplify parallelisation.

In closing

There are many tutorials and guides on the internet on how to boot a Raspberry Pi from the network, including the official one, but none deals completely with the issues we have troubleshot here.

We hope this has helped you remove a piece of plastic from your pies, and maybe also learn something more about how booting said pie works. Your next task is to integrate all of this into a more comprehensive PXE solution!

Do you have anything to add? Talk to us on Twitter and/or Reddit!