Debugging complicated shell scripts

Tracking down differences between OSH and other shells

The setup

We’re testing if osh (an alternative shell) can build all Alpine Linux packages by replacing /bin/sh and /bin/bash with symlinks to osh. We build the packages in our “custom” osh environment, and in a “regular” environment (referred to as baseline). There is a build server that builds all packages and reports if there is a disagreement between baseline and osh, i.e. a package can be built in the baseline environment but when we use osh in place of ash and bash the build fails. This gives us a nice overview of what packages we should investigate, and it also provides us with a mechanism to track the causes of the build failures. If we notice a package failed to build we can use the same set up on our local machines to debug the individual package further. The end goal is to end up with as little disagreements as possible.

Debugging a simple failure

Knowing a package fails to build is not that useful to know on its own. Of course the real thing we are interested in is why it works with baseline and not with osh. In the report the build logs of each package is stored. Sometimes this can make it very easy to figure out what’s going on. For example, for the zeitgeist packages, there was this error in the build log:

python/Makefile.am:5: installing 'build-aux/py-compile'
autoreconf: Leaving directory '.'
+ set '+x'
    if [ "$1" = "--help" ]; then exit 0 else
                                        ^~~~
./autogen.sh:31: Unexpected argument to 'exit'

"Unexpected argument to 'exit'" sounds like a syntax error right off the bat, and it is! It’s just a syntax error caused by a typo (missing ; after exit 0) that ash didn’t catch. It’s always good to verify this (you can just do this in a normal shell, no need to enter a special environment):

$ ash -c "if [ true ]; then exit 0 else echo 'hi'; fi"
$ bin/osh -c "if [ true ]; then exit 0 else echo 'hi'; fi"
  if [ true ]; then exit 0 else echo 'hi'; fi
                           ^~~~
[ -c flag ]:1: Unexpected argument to 'exit'

Okay, so in this case it’s pretty clear what the disagreement is. It is just that osh is stricter on this syntax error than ash (which is a good thing). Ideally we also fix the disagreement in osh, but sometimes there are cases like this where it may not be really desirable to match the behavior of the other shell. So this was a very quick investigation, but not all build failures are this easy to analyze.

Debugging a complex failure

Understanding the error

The most common reason that packages fail to build are failing test cases. These are usually a bit more tricky to debug, especially if they don’t contain obvious syntax errors like the last example. For example, let’s take a look at the logs of the failed package mdev-conv. We see this:

Executing command [ sh /home/udu/aports/main/mdev-conf/src/mdev-conf-4.8/tests/../persistent-storage ]
Executing command [ readlink disk/by-label/EFI\x20System ]
Fail: incorrect exit status: 1, expected: 0
stdout:

stderr:

Files left in work directory after failure: blkid, block, disk
persistent-storage_test:persistent_storage_by_label  ->  failed: atf-check failed; see the output of the test for details

It even prints what commands it is executing (which is already kind of a luxury), but this still doesn’t really tell us what is going on. But the commands do make it easy to check what code is being executed, and what it is expecting. We can just open the persistent_storage file in a text editor and look for a readlink command. In this case, the test seems to check this:

mdev=sda1 action=add atf_check \ # assigning some variables
  sh $mdevscript                 # executing `$mdevscript` in a new shell
atf_check -o match:"sda1" \
  readlink disk/by-label/"$fs_encoded_label"  # some test command that seems to be checking for `sda1` in some output

Reproducing the issue

Now it is still not very clear what is going wrong, so it’s time to get our hands dirty in the chroot osh environment. We can enter the environment by setting INTERACTIVE to 1 and running the build locally:

INTERACTIVE=1 regtest/aports-run.sh build-package-overlayfs osh-as-sh mdev-conf

This puts us in a shell in the build environment, and we can rebuild the package to see if the issue is still there:

abuild -f -r -K -C ~/aports/main/mdev-conf    # `abuild` is Alpine's package building tool'

Adding the -K flag prevents abuild from cleaning everything up, which allows us to poke around a bit.

`echo` debugging

Since we can now rebuild and reproduce the error locally, we can start with the most basic form of debugging: print debugging (or I guess echo debugging in this case). The test code looks like it’s searching for the string sda1 in the output of the readlink command, but it seems to fail.

atf_check -o match:"sda1" \
  readlink disk/by-label/"$fs_encoded_label"

Let’s add some echos to see if that file is even available at all:

mdev=sda1 action=add atf_check \
  sh $mdevscript
echo $(ls -l disk/by-label/$fs_encoded_label) # added this `echo`
atf_check -o match:"sda1" \
  readlink disk/by-label/"$fs_encoded_label"

It outputs No such file or directory so that’s a pretty clear answer: some process was supposed to create a file at disk/by-label/$fs_encoded_label. The file is not created, but the error logs don’t indicate any attempt to create that file that failed. You can see before the atf_check line though, there is this call to $mdevscript. Maybe that was supposed to create it? We could look for that and add more echo statements there. But when our issue seems to spread multiple (sub)shells, I tend to reach for strace first.

Using `strace`

strace is a really nice tool that can give insight into what a program is doing. strace shows every system call a program makes - things like opening files, running programs (execve), reading/writing data. Each line of its output shows the syscall name, its arguments, and return value. We can use strace by copying the command listed in the build logs in the beginning, but adding strace before it:

strace -o output.log sh /home/udu/aports/main/mdev-conf/src/mdev-conf-4.8/tests/../persistent-storage

strace can output a lot of lines, so I recommend writing everything to a log (with -o) so you can search through all the logs after the run is done. In these logs we can actually see our test attempting to read the file:

[pid 14538] execve("/usr/bin/readlink", ["readlink", "disk/by-label/EFI\\x20System"], 0x7ffe83dd5088 /* 16 vars */)

It is showing the readlink command being executed, and we can see now $fs_encoded_label has been expanded to efi\\x20system. Those escape codes in the name are already quite suspicious. A few lines above it (so this happened before the readlink command), we can see a file (or actually a symlink) being created, but not exactly the one that readlink is attempting to read:

[pid 14533] execve("/bin/ln", ["ln", "-sf", "../../sda1", "disk/by-label/efix20system"], 0x7f0416e3f068 /* 18 vars */) = 0

There is a clear difference here, the file that gets created is called efix20system, but the script is trying to read efi\\x20system afterwards. This definitely looks like an escaping issue of some kind. I also came across this suspicious looking sed command in the strace log:

[pid 14508] execve("/bin/sed", ["sed", "-e", "s| |\\\\x20|g"],

Relating `strace` statements back to the shell code

This is some useful information. Let’s take a step back and look at the full test. I’ve annotated the most important parts with code comments:

persistent_storage_by_label_body() {
        local fs_label="efi system"  # The original string from the file
        local fs_encoded_label="$(echo $fs_label | sed -e 's| |\\x20|g')" # This is where the \\x20 escaping gets added

        cat <<-eof > blkid # This is writing a mock script
        #!/bin/sh          # It is executed by `$mdevscript`
        # blkid mock       # All this script does is output a string
        case \$1 in        # But only if its first argument is `/dev/sda1`
                /dev/sda1)
                echo "/dev/sda1: uuid=\"00283cf5-01\" label=\"$fs_encoded_label\" type=\"ext4\""
                ;; # ^~~~ this line creates the script output 
        esac
        eof
        chmod +x blkid
        export path="$pwd:$path"
        mdev=sda1 action=add atf_check \
                sh $mdevscript
        atf_check -o match:"sda1" \ # this is the check that causes our test to fail
                readlink disk/by-label/"$fs_encoded_label"
}

It’s still not very clear what is going wrong, but all the lines that contain $fs_encoded_label are definitely suspicious. Let’s see if we can extract the relevant lines out of the test so we can test this outside of the test scripts. This seems to work in ash:

$ fs_label="efi system"
$ fs_encoded_label="$(echo $fs_label | sed -e 's| |\\x20|g')"
$ echo $fs_encoded_label
efi\x20system
$ cat <<-EOF > /tmp/test-blkid   # here we are writing the mock file
#!/bin/sh
case \$1 in
    /dev/sda1)
    echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"$fs_encoded_label\" TYPE=\"ext4\""
    ;;
esac
EOF
$ ash /tmp/test-blkid /dev/sda1  # Here the output seems correct:
/dev/sda1: UUID="00283cf5-01" LABEL="efi\x20system" TYPE="ext4"

Seems like things are working up until this point. Let’s try it out in osh:

osh-0.36$ fs_label="efi system"
osh-0.36$ fs_encoded_label="$(echo $fs_label | sed -e 's| |\\x20|g')"
osh-0.36$ echo $fs_encoded_label
efi\x20system                    # still the same as in `ash` here
osh-0.36$ cat <<-EOF > /tmp/test-blkid
#!/bin/sh
case \$1 in
    /dev/sda1)
    echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"$fs_encoded_label\" TYPE=\"ext4\""
;;
esac
EOF
osh-0.36$ /tmp/test-blkid /dev/sda1
/dev/sda1: UUID=00283cf5-01 LABEL=efix20system TYPE=ext4

Well there’s a clear difference, osh somehow consumed the \ before x20, changing it from an escape code to just a random string. That’s definitely an issue! The output of this mock script must be used to create the file with the ln -s command that we saw earlier in the strace log:

[pid 14533] execve("/bin/ln", ["ln", "-sf", "../../sda1", "disk/by-label/efix20system"], 0x7f0416e3f068 /* 18 vars */) = 0

That is creating the symlink as efix20system instead of efi system. But the readlink command later is trying to correctly read the link called efi system.

Narrowing down the cause

Now we need to figure out why this goes wrong. Both commands use the $fs_encoded_label variable, so that can’t be the issue. Otherwise the readlink command would not attempt to read the right file. There is definitely something going wrong with the mock script. But where exactly? Is the problem when we write the heredoc (creating the mock script), or when we run the mock script? One thing we could try is creating the mock script with bash (or ash) and running it with osh. That could give us a hint if writing the script is the issue or running it.

osh-0.36$ osh /tmp/test-blkid /dev/sda1    # /tmp/test-blkid was created with bash
/dev/sda1: UUID="00283cf5-01" LABEL="efi\x20system" TYPE="ext4"

That’s the correct output, so something must go wrong when we create the script. Let’s compare the scripts that get written:

$ cat <<-EOF > /tmp/test-bash          # create the script with bash
#!/bin/sh
case \$1 in
    /dev/sda1)
    echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"$fs_encoded_label\" TYPE=\"ext4\""
    ;;
esac
EOF
osh-0.36$ cat <<-EOF > /tmp/test-osh   # create the script with osh
#!/bin/sh
case \$1 in
    /dev/sda1)
    echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"$fs_encoded_label\" TYPE=\"ext4\""
    ;;
esac
EOF
# reading the 2 created files
osh-0.36$ cat /tmp/test-osh
#!/bin/sh
case $1 in
    /dev/sda1)
    echo "/dev/sda1: UUID="00283cf5-01" LABEL="efi\x20system" TYPE="ext4""
    ;;
esac
osh-0.36$ cat /tmp/test-bash
#!/bin/sh
case $1 in
    /dev/sda1)
    echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"efi\x20system\" TYPE=\"ext4\""
    ;;
esac

There’s a clear difference again! The test-bash file has escaped quotes ("), but in test-osh those backslashes have disappeared. This is what’s causing the filename issues. We can verify this by manually running the echo commands that got written to each file:

# echo command generated by osh
$ echo "/dev/sda1: UUID="00283cf5-01" LABEL="efi\x20system" TYPE="ext4""
# these are not escaped with \    ~~~~~~~~~~^~~~~~~~~~~~~~^
# output:
/dev/sda1: UUID=00283cf5-01 LABEL=efix20system TYPE=ext4        
# missing \ before x20            ~~~^

# echo command generated by bash
$ echo "/dev/sda1: UUID=\"00283cf5-01\" LABEL=\"efi\x20system\" TYPE=\"ext4\""
# these *are* escaped with \      ~~~~~~~~~~~~^~~~~~~~~~~~~~~^
# output:
/dev/sda1: UUID="00283cf5-01" LABEL="efi\x20system" TYPE="ext4"
# correct, will show "efi system" ~~~~~~^

So the issue is that osh “consumes” the \ characters before the double quotes, which is basically malforming the echo command and causes the wrong string to be sent out.

Conclusion

If we really want to make the life of whoever is going to solve this issue easier though, we should make the example a bit easier to run (and understand). This is a very minimal example which still shows the disagreeing behaviors:

$ cat bug.sh
cat <<EOF
a \"quote\"
EOF

$ osh bug.sh
a "quote"

$ bash bug.sh
a \"quote\"

Here we can clearly see what is going wrong: osh removes the \ characters before the " character. So any escape characters you put into an osh heredoc magically disappear. That’s obviously not good. It is also interesting to see that osh is not the only shell that has this issue. brush and sush, which are 2 bash alternatives written in Rust, also have this exact bug!

With this minimal snippet to reproduce the issue, we definitely know enough to file an issue on GitHub. Our investigation is done! Now all that is left to do is for someone to fix this issue, and it will be possible to build the mdev-conf package using osh as sh! As I’m writing this, the issue is still open. If this interests you and you have some free time, you could be the person to fix this bug! (hint: this bug is caused by an issue in the lexer of osh - you can read more about here and here)

If you’d like to read more about this kind of stuff, Aiden who is working on the same project wrote a similiar post you can check out: reproducing dtc failure

The setup#

Debugging a simple failure#

Debugging a complex failure#

Understanding the error#

Reproducing the issue#

echo debugging#

Using strace#

Relating strace statements back to the shell code#

Narrowing down the cause#

Conclusion#