nixos-riscv/JOURNAL.md

215 KiB
Raw Permalink Blame History

2024-07-02

Aleix provided some notes on how to trace the kernel using tracepoints that are printed to the console:

BOOT TIME TRACING
  - compile with CONFIG_BOOTTIME_TRACING=y
  - add to kernel parameters:
      trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M ftrace=function ftrace_filter="vfs*"
      trace_options=sym-addr trace_event=sched:* tp_printk trace_buf_size=1M
  - tp_printk sends tracepoint (TRACE_EVENT) to printk buffer
  - trace_event=[subsytem:event]; accepts
     - *:<event_name>
     - <subsystem>:*
     - *:* all
  - many more options at https://www.kernel.org/doc/html/latest/trace/boottime-trace.html

Start with:

trace_options=sym-addr trace_event=sched:* tp_printk trace_buf_size=1M loglevel=7

Can it be caused by the D extension? If I set the ISA to:

riscv,isa = "rv64imaf";

It locks the stage1 script without providing any output. Let see trying again. Locked again.

I can disable the FPU in the kernel, and then it will act as a detector of any floating point instruction.

Then I just need to rebuild busybox without support for double instructions.

Let see if I can build busybox with double instruction

Nope, the u-boot is reporting the d extension is in the isa:

riscv,isa = "rv64imafd";

2024-07-03

I cannot switch to gcc.arch = rv64ima because rust fails to build.

Assumption: The extensions F and D work well and don't cause a hang in the CPU.

Let's go back and try to get the initrd shell, so we can systematically hang it in the switch_root

OBSERVATION: The riscv-timer seems to be causing interrupts with IRQ 5:

[   62.439060] irq_handler_entry: irq=5 name=riscv-timer
[   62.444980] irq_handler_exit: irq=5 ret=handled

OBSERVATION: Rohan reports the serial startup routine being running after

the init begins.

OBSERVATION: Only interrupts in timer, others are zero.

With:

commit 4c656bd8ddd2e41ccaa976ff8d6bd9209175a632 (HEAD -> lagarto-ox)
Author: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Date:   Wed Jul 3 13:21:04 2024 +0200

    Add busybox patch to see debug lines

    The switch_root command seems to hang in the execv() syscall.

I can see this:

~ # cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
           CPU0
 10:      42926  RISC-V INTC   5 Edge      riscv-timer
IPI0:         0  Rescheduling interrupts
IPI1:         0  Function call interrupts
IPI2:         0  CPU stop interrupts
IPI3:         0  CPU stop (for crash dump) interrupts
IPI4:         0  IRQ work interrupts
IPI5:         0  Timer broadcast interrupts
           CPU0
 10:      46023  RISC-V INTC   5 Edge      riscv-timer
IPI0:         0  Rescheduling interrupts
IPI1:         0  Function call interrupts
IPI2:         0  CPU stop interrupts
IPI3:         0  CPU stop (for crash dump) interrupts
IPI4:         0  IRQ work interrupts
IPI5:         0  Timer broadcast interrupts

OBSERVATION: There is a timer configured in 0x40170000 but in the device

tree we only have one at timer@40002000.

#define OX_ALVEO_TIMER_BASE		0x40170000
#define ADDR_TIME_L 			0x0u // 32 lower bits of the time register
#define ADDR_TIME_H 			0x1u // 32 higher bits of the time register
#define ADDR_TIMECMP_L 			0x2u // 32 lower bits of the time comparator
#define ADDR_TIMECMP_H 			0x3u // 32 higher bits of the time comparator

https://gitlab.bsc.es/hwdesign/bsc-linux/-/blob/d6d194bd30d9a8fe49c2a278ffb3c3ae7852e75d/bsc_tree/patches/ox_alveo/opensbi/0001-opensbi-ox_alveo-platform.patch#L63

OBSERVATION: When the serial console starts, the speed of the serial port

changes to 9600:

[    6.845400] io scheduler mq-deadline registered
[    6.851500] io scheduler kyber registered
[   17.644460] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[   18.141160] printk: console [ttyS0] disabled
[   18.229480] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 11, base_baud = 3125000) is a 16550

*** baud: 230400 ***

*** baud: 460800 ***

*** baud: 500000 ***

*** baud: 576000 ***

*** baud: 500000 ***

*** baud: 460800 ***

*** baud: 230400 ***

*** baud: 115200 ***

*** baud: 57600 ***

*** baud: 38400 ***

*** baud: 19200 ***
faaa0?xx
        <ffa
            fx3fÆf
                  ff
                    ~3xf03faaa0?~3<3<`<~<f<`<憆3~fÆf̆x3fff

*** baud: 9600 ***

*** baud: 19200 ***

*** baud: 38400 ***
xxxxxx<xx<xx<xx<xxxxxxxx
*** baud: 57600 ***

*** baud: 38400 ***
xxxxxxx<x<xx<xx<xx<xx<x<xxxxxxx<xxx<x<x<xxx<xxxxxx<xx<xxxxx<x<xxxxxxxxxxxxxx<xxxxxxxx<xxxxx<x
*** baud: 19200 ***
faaa??~3fx3x3憆`?3f
                   ff3f<ffa<xÞ?<?f<x<x<`<`<f<x??`?`<怘

*** baud: 9600 ***
[   77.929980] stage-1-init: [Thu Jan  1 00:01:17 UTC 1970] + ln -sfn /proc/self/fd/1 /dev/stdout
[   79.258380] stage-1-init: [Thu Jan  1 00:01:18 UTC 1970] + ln -sfn /proc/self/fd/2 /dev/stderr
[   80.212360] stage-1-init: [Thu Jan  1 00:01:19 UTC 1970] + mkdir -p /etc/systemd
[   81.224900] stage-1-init: [Thu Jan  1 00:01:20 UTC 1970] + ln -sfn /nix/store/r4ycv0ymw9wrnalh0sdgkp39nwy3szqz-link-units /etc/systemd/network
[   82.270000] stage-1-init: [Thu Jan  1 00:01:21 UTC 1970] + mkdir -p /etc/udev
[   83.249640] stage-1-init: [Thu Jan  1 00:01:22 UTC 1970] + ln -sfn /nix/store/lx1q7j61dldj8bv0p33v42q0nldb6sd7-udev-rules /etc/udev/rules.d
[   84.304940] stage-1-init: [Thu Jan  1 00:01:23 UTC 1970] + mkdir -p /dev/.mdadm
[   85.221060] stage-1-init: [Thu Jan  1 00:01:24 UTC 1970] + systemd-udevd --daemon
[   86.157320] stage-1-init: [Thu Jan  1 00:01:25 UTC 1970] Starting systemd-udevd version 254.3
[   87.130060] stage-1-init: [Thu Jan  1 00:01:26 UTC 1970] + udevadm trigger '--action=add'

*** break sent ***
[  158.281900] stage-1-init: [Thu Jan  1 00:02:37 UTC 1970] + udevadm settle

*** break sent ***

*** break sent ***
[  214.524500] stage-1-init: [Thu Jan  1 00:03:33 UTC 1970] + kbd_mode -u -C /dev/console
[  215.392960] stage-1-init: [Thu Jan  1 00:03:34 UTC 1970] kbd_mode: KDSKBMODE: Inappropriate ioctl for device

That was my mistake as I need to put the baud speed in the ttyS0, like this:

console=ttyS0,115200n8

OBSERVATION: Trying to read from the serial console /dev/ttyS0 causes no

more messages in the console (or a hang).

QUESTION: Can we make a heartbeat for the kernel?

The idea is to keep a counter in some memory of the kernel so we can see it from the host being moved.

QUESTION: Can we disable the serial driver 8250 from loading?

initcall_blacklist=<driver_init>

I need to know the 8250 init function name:

drivers/tty/serial/8250/8250_core.c:static int __init serial8250_init(void)

So...

initcall_blacklist=serial8250_init

Yes, but that doesn't seem to do anything. It is hanging:

+ modprobe dm_mod
[  627.473580] stage-1-init: [Thu Jan  1 00:10:26 UTC 1970] + echo init /nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[  628.249440] stage-1-init: [Thu Jan  1 00:10:27 UTC 1970] + set -- init /nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[  629.004840] stage-1-init: [Thu Jan  1 00:10:28 UTC 1970] + stage2Init=/nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[  629.733920] stage-1-init: [Thu Jan  1 00:10:29 UTC 1970] + echo /nix/store/snvvqpxmryw1szlllk0bxpm37p8vj8sw-extra-utils/bin/modprobe

QUESTION: What happens if we remap the interruptions?

  • Move the serial from 0 to 1
  • Move the plic from 3 to 2 and remove 7

Now we have one context only:

[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] plic: plic@40800000: mapped 3 interrupts with 0 handlers for 1 contexts.
[    0.000000] riscv: providing IPIs using SBI IPI extension

Rather than two:

[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] plic: plic@40800000: mapped 3 interrupts with 0 handlers for 2 contexts.
[    0.000000] riscv: providing IPIs using SBI IPI extension

QUESTION: What happens if we block the sbi_ipi driver?

initcall_blacklist=sbi_ipi_init

Nothing, it cannot be disabled it seems. I will remove SMP support so it won't be compiled in.

OBSERVATION: Searching for 'riscv,plic0' only matches irq-sifive-plic driver.

hut% rg 'riscv,plic0'
Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml
72:          - const: riscv,plic0

drivers/irqchip/irq-sifive-plic.c
572:IRQCHIP_DECLARE(riscv_plic0, "riscv,plic0", plic_init); /* for legacy systems */

So it looks that the only driver that setups the plic is the one used by SiFive. Here is the doc: https://static.dev.sifive.com/U54-MC-RVCoreIP.pdf

OBSERVATION: The number of handlers is 0, so there are no interruptions.

It seems the number next to the phandle of the interrupts-extended attribute in the plic follows a different convention of values. Using 9 and 11:

plic: plic@40800000: mapped 3 interrupts with 1 handlers for 2 contexts.

Remark: The key combination to run Magic SysRq using the HVC console is Ctrl-O and then the SysRq key. It only works it the console is being actively polled, otherwise it hangs.

2024-07-04

OBSERVATION: I saw they changed this option in Cinco Ranch DTS for the

serial:

reg-shift = <0>; // regs are spaced on 8 bit boundary (modified from Xilinx UART16550 to be ns16550 compatible)

Tested booting with debug1 and the ttyS0 console, and it goes extremely slow (but still outputs at 115200) and then continues to fail to read keyboard input.

QUESTION: Let's try setting the console in poll mode.

setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=uart,io,0x40001000,115200n8 boot.trace console=uart,io,0x40001000,115200n8 debug1 init=/nix/store/wavmnv6wjj8y10ha07wxd5f0sqacivj8-nixos-system-nixos-riscv-23.11pre-git/init"

[    5.909360] io scheduler mq-deadline registered
[    5.914900] io scheduler kyber registered
[   14.405980] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[   14.856040] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 2, base_baud = 3125000) is a 16550
[   15.059680] nd_pmem namespace0.0: unable to guarantee persistence of writes
...
[   40.250100] clk: Disabling unused clocks
[   40.256220] Warning: unable to open an initial console. <---- SEE THIS
[   40.618300] Freeing unused kernel image (initmem) memory: 5592K
[   40.728300] Checked W+X mappings: passed, no W+X pages found
[   40.735540] Run /init as init process
[   40.740080]   with arguments:
[   40.743920]     /init
[   40.746660]   with environment:
[   40.750740]     HOME=/
[   40.754020]     TERM=linux
[   46.569960] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[   46.578100] CPU: 0 PID: 1 Comm: init Not tainted 6.6.1 #1-NixOS
[   46.584600] Hardware name: Barcelona Supercomputing Center - Lagarto Ox (NixOS) (DT)
[   46.592740] Call Trace:
[   46.595640] [<ffffffff80006688>] dump_backtrace+0x38/0x48
[   46.601760] [<ffffffff80a96a04>] show_stack+0x50/0x68
[   46.607540] [<ffffffff80aa3f74>] dump_stack_lvl+0x3c/0x5c
[   46.613660] [<ffffffff80aa3fb4>] dump_stack+0x20/0x30
[   46.619400] [<ffffffff80a970dc>] panic+0x158/0x374
[   46.624900] [<ffffffff8001c18c>] do_exit+0x9e8/0x9f0
[   46.630580] [<ffffffff8001c3b0>] do_group_exit+0x44/0xb0
[   46.636600] [<ffffffff8001c454>] __wake_up_parent+0x0/0x40
[   46.642780] [<ffffffff80aa4eac>] do_trap_ecall_u+0x14c/0x168
[   46.649140] [<ffffffff800035ec>] ret_from_exception+0x0/0xac
[   46.655500] Kernel Offset: 0x0 from 0xffffffff80000000
[   46.661160] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Also found: no_console_suspend

OBSERVATION: There are messages of address space being assigned to

registers:

Slave segment '/MEEP_uart_0/S_AXI/Reg' is being assigned into address space '/m_axi_uart0' at <0x0000_0000 [ 4K ]>.
Slave segment '/MEEP_uart_1/S_AXI/Reg' is being assigned into address space '/m_axi_uart1' at <0x0000_0000 [ 4K ]>.

QUESTION: What happens if I enable CONFIG_CONSOLE_POLL?

With console=ttyS0,115200n8 debug1 I cannot type.

OBSERVATION: I can dump iomem memory with the tool devmem:

But it seems I cannot dump the registers of the serial io mapped region:

~ # cat /proc/iomem
40001000-400010ff : serial
60000000-7fffffff : Reserved
80000000-ffefffff : System RAM
  80201000-81fa0b87 : Kernel image
    80201000-80cb177f : Kernel code
    81400000-819fffff : Kernel rodata
    81c00000-81f18747 : Kernel data
    81f19000-81fa0b87 : Kernel bss
100000000-1bfffffff : namespace0.0

~ # devmem 0x40001000
devmem: mmap: Operation not permitted

It looks like the following options may be required to be disabled to allow user-space tools read those regions.

~ # zcat /proc/config.gz | grep CONFIG_STRICT_DEVMEM
CONFIG_STRICT_DEVMEM=y
~ # zcat /proc/config.gz | grep CONFIG_IO_STRICT_DEVMEM
CONFIG_IO_STRICT_DEVMEM=y

Let's try disabling CONFIG_STRICT_DEVMEM and see if we can read the serial registers.

It works!

~ # cat /proc/iomem
40001000-400010ff : serial
60000000-7fffffff : Reserved
80000000-ffefffff : System RAM
  80201000-81fa0b87 : Kernel image
    80201000-80cb159f : Kernel code
    81400000-819fffff : Kernel rodata
    81c00000-81f18707 : Kernel data
    81f19000-81fa0b87 : Kernel bss
100000000-1bfffffff : namespace0.0
~ # devmem 0x40001000
0x0000006E
~ # devmem 0x40001000
0x0000006C
~ # devmem 0x40001000
0x00000072
~ # devmem 0x40001000
0x0000000D

OBSERVATION: The interrupt register of the serial console is 0x0:

Assuming the console registers follow AXI UART 16550, here is the IER:

0x1004 IER R/W Interrupt Enable Register

Which is zero:

~ # devmem 0x40001004 0x00000000

The line control register is 0x3:

~ # devmem 0x4000100C 0x00000003

QUESTION: Can I write to some memory address and see the result from the host?

For that I would need to find some address that is mapped to the DMA or to the pmem. Xavi recommended 0x6000_0000 as it is uncached.

It seems to have some content already:

~ # devmem 0x60000000
0x00000093
~ # devmem 0x60000004
0x00000113
~ # devmem 0x60000008
0x00000193
~ # devmem 0x60000010
0x00000293
~ # devmem 0x60000014
0x00000313
~ # devmem 0x60000018
0x00000393

Writing test seems to work:

~ # devmem 0x60000000 32 0x11223344
~ # devmem 0x60000000
0x11223344

So, technically we should be using the /dev/qdma34000-MM-0 device, as we already use the other one to map the memory.

[bsc015557@fpgan02 nixos]$ for i in {0..16}; do addr=$(($i * 0x10000000)); \
    printf "addr 0x%09x: " $addr; dd if=/dev/qdma34000-MM-0 count=16 bs=1 skip=$addr 2>/dev/null | xxd; done
addr 0x000000000: 00000000: 4444 4444 4444 4444 4444 4444 4444 4444  DDDDDDDDDDDDDDDD
addr 0x010000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x020000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333  3333333333333333
addr 0x030000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc  ................
addr 0x040000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc  ................
addr 0x050000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x060000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x070000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x080000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333  3333333333333333
addr 0x090000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x0a0000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x0b0000000: 00000000: 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f  ////////////////
addr 0x0c0000000: 00000000: 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e  nnnnnnnnnnnnnnnn
addr 0x0d0000000: 00000000: 2020 2020 2020 2020 2020 2020 2020 2020
addr 0x0e0000000: 00000000: 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c  llllllllllllllll
addr 0x0f0000000: 00000000: 6767 6767 6767 6767 6767 6767 6767 6767  gggggggggggggggg
addr 0x100000000: 00000000: 2424 2424 2424 2424 2424 2424 2424 2424  $$$$$$$$$$$$$$$$

[bsc015557@fpgan02 nixos]$ for i in {0..16}; do addr=$(($i * 0x10000000)); \
    printf "addr 0x%09x: " $addr; dd if=/dev/qdma34000-MM-1 count=16 bs=1 skip=$addr 2>/dev/null | xxd; done
addr 0x000000000: 00000000: 4444 4444 4444 4444 4444 4444 4444 4444  DDDDDDDDDDDDDDDD
addr 0x010000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x020000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333  3333333333333333
addr 0x030000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc  ................
addr 0x040000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc  ................
addr 0x050000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x060000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x070000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x080000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333  3333333333333333
addr 0x090000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x0a0000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
addr 0x0b0000000: 00000000: 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f  ////////////////
addr 0x0c0000000: 00000000: 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e  nnnnnnnnnnnnnnnn
addr 0x0d0000000: 00000000: 2020 2020 2020 2020 2020 2020 2020 2020
addr 0x0e0000000: 00000000: 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c  llllllllllllllll
addr 0x0f0000000: 00000000: 6767 6767 6767 6767 6767 6767 6767 6767  gggggggggggggggg
addr 0x100000000: 00000000: 2424 2424 2424 2424 2424 2424 2424 2424  $$$$$$$$$$$$$$$$

None of the two queues seem to have the value 0x11223344 at any multiple of 0x1000_0000.

Let's verify first that this method works. The kernel is loaded here:

[bsc015557@fpgan02 nixos]$ printf '0x%x\n' $FPGACTL_KERNEL_ADDR
0x24000000

So we should see the same values as the kernel file:

[bsc015557@fpgan02 nixos]$ dd if=kernel.bin count=16 bs=1 2>/dev/null| xxd
00000000: 6f00 400d 0000 0000 0000 2000 0000 0000  o.@....... .....

But we don't see the same:

[bsc015557@fpgan02 nixos]$ dd if=/dev/qdma34000-MM-1 count=16 bs=1 skip=$FPGACTL_KERNEL_ADDR 2>/dev/null | xxd
00000000: 9797 9797 9797 9797 9797 9797 9797 9797  ................
[bsc015557@fpgan02 nixos]$ dd if=/dev/qdma34000-MM-0 count=16 bs=1 skip=$FPGACTL_KERNEL_ADDR 2>/dev/null | xxd
00000000: 9797 9797 9797 9797 9797 9797 9797 9797  ................

QUESTION: Missing forward M to S via Mideleg?

Can it be happening that he MEDELEG is not forwarding the interruptions to the Supervisor (kernel)?

Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109

QUESTION: Can we add a timer to the PLIC to test the interrupts?

OBSERVATION: Here is the PLIC register dump:

~ # for i in `seq 0 16`; do addr=$((0x40600000 + $i)); printf '%08x: ' $addr; devmem $addr; done
40600000: 0x00010002
40600001: 0x09000000
40600002: 0x00090000
40600003: 0x00000900
40600004: 0x00010009
40600005: 0x00000000
40600006: 0x00000000
40600007: 0x00000000
40600008: 0x00000000
40600009: 0x00000000
4060000a: 0x00000000
4060000b: 0x00000000
4060000c: 0x00000000
4060000d: 0x00000000
4060000e: 0x00000000
4060000f: 0x00000000
40600010: 0x00000000

QUESTION: Can we boot with the new bitstream that includes the second UART?

The interruptions are enabled for the UART 1, not the default UART 0.

OBSERVATION: I'm using 0x100 not 0x1000 in the serial range:

reg = <0x0 0x40003000 0x0 0x100>;
reg = <0x0 0x40003000 0x0 0x1000>;

Can this produce any problem?

It doesn't seem to change anything, still unable to send any bytes.

QUESTION: Can we use virtio to mount a FS in the DMA shared memory?

2024-07-05

OBSERVATION: The kernel continues working when the console hangs.

Switching to 0x100000000 as 0x60000000 shows:

~ # devmem 0x6000000
0xBADCAB1E

With the following loop:

~ # i=0; while [ 1 ]; do let i=$i+1; devmem 0x100000000 32 $i; done &
~ # cat /dev/ttyS0
(hangs)

Shows the kernel works:

[bsc015557@fpgan02 nixos]$ while [ 1 ]; do xxd -s $((0x100000000 - 0x60000000)) -l 4 /dev/qdma34000-MM-1; sleep 0.2; done
...
a0000000: 6400 0000                                d...
a0000000: 6500 0000                                e...
a0000000: 6600 0000                                f...
a0000000: 6700 0000                                g...
a0000000: 6800 0000                                h...
a0000000: 6900 0000                                i...

QUESTION: Can we reproduce it with switch_root?

For that I would have to ensure the process continues to operate, even if we exit the console. Maybe I can make a double fork?

I cannot use 0x1000_0000 as that is where the pmem will be. But I can try to use an address in the end, as we are not filling the whole space.

[0x1_0000_0000, 0x1_c000_0000) -> PMEM (3072 MiB)

Maybe 0x1_bfff_0000? Let's try first from the initrd shell.

i=0; while [ 1 ]; do let i=$i+1; devmem 0x1bfff0000 32 $i; done &

Then

while [ 1 ]; do xxd -s $((0x1bfff0000 - 0x60000000)) -l 4 /dev/qdma34000-MM-1; sleep 0.2; done

Yes, it seems to be working. Let's load the rootfs too.

I added a loop in the stage1 script.

QUESTION: Can we see any clock in memory?

This will allow us to check if the AXI still works.

OBSERVATION: The kernel stops updating the counter in the mount phase.

Managed to reach the mount and hang there:

[  337.504740] stage-1-init: [Thu Jan  1 00:05:36 UTC 1970] + '[' -d
/dev/disk/by-label/NIXOS_SD ]
[  338.284560] stage-1-init: [Thu Jan  1 00:05:37 UTC 1970] + mkdir -m 0755 -p
/mnt-root/
[  339.017420] stage-1-init: [Thu Jan  1 00:05:38 UTC 1970] + local 'n=0'
[  339.752560] stage-1-init: [Thu Jan  1 00:05:39 UTC 1970] + true
[  340.488960] stage-1-init: [Thu Jan  1 00:05:39 UTC 1970] + mount /mnt-root/

After almost 6 minutes, with 571 beats:

$ xxd -s 5905514496 -l 4 /dev/qdma34000-MM-1
5fff0000: 3b02 0000                                ;...

It looks like the kernel is the one getting stuck or at least is unable to propagate the heartbeat changes to the host. It would be nice to monitor a hardware clock from the DMA region too, so we can discard problems in the AXI.

OBSERVATION: There is an ioctl failed for /dev/console

[  177.009540] stage-1-init: [Thu Jan  1 00:02:56 UTC 1970] + udevadm settle
+ kbd_mode -u -C /dev/console
kbd_mode: KDSKBMODE: Inappropriate ioctl for device
+ printf '\033%%G'
+ loadkmap
[  266.301040] stage-1-init: [Thu Jan  1 00:04:25 UTC 1970] + kbd_mode -u -C /dev/console

ASSUMPTION: The kernel hangs.

If the kernel hangs, there must be an instruction or sequence of instructions that causes it. First I need to determine what is being executed by the kernel. For that I could use ftrace to see which program is running at the time it hangs.

trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M

(prev_comm != 2 && next_comm != 2)

So, we can just enable the tp_printk but not the tracer. Then in the initrd script, I enable the function tracer and the filter.

OBSERVATION: It takes a long time to init the pty:

Interesting timing:

[   12.612620] initcall_start: func=pty_init+0x0/0x3f4
[   20.962640] initcall_finish: func=pty_init+0x0/0x3f4 ret=0

OBSERVATION: The kcompactd0 daemon is using the CPU:

[  290.394920] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=init next_pid=69 next_prio=120
[  290.408160] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[  290.420720] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[  290.433960] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[  290.447100] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120
[  290.460180] sched_switch: prev_comm=kcompactd0 prev_pid=22 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[  290.473400] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kworker/u2:2 next_pid=19 next_prio=120
[  290.486960] sched_switch: prev_comm=kworker/u2:2 prev_pid=19 prev_prio=120 prev_state=R ==> next_comm=khungtaskd next_pid=18 next_prio=120
[  290.500800] sched_switch: prev_comm=khungtaskd prev_pid=18 prev_prio=120 prev_state=R ==> next_comm=kworker/0:1 next_pid=13 next_prio=120
[  290.514560] sched_switch: prev_comm=kworker/0:1 prev_pid=13 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[  290.527720] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=init next_pid=69 next_prio=120
[  290.540360] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=devmem next_pid=129 next_prio=120
[  290.553280] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[  290.566780] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120
[  290.580500] sched_switch: prev_comm=kcompactd0 prev_pid=22 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[  290.593740] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[  290.606340] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kworker/u2:2 next_pid=19 next_prio=120
[  290.619780] sched_switch: prev_comm=kworker/u2:2 prev_pid=19 prev_prio=120 prev_state=R ==> next_comm=khungtaskd next_pid=18 next_prio=120
[  290.633620] sched_switch: prev_comm=khungtaskd prev_pid=18 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[  290.646700] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=init next_pid=69 next_prio=120
[  290.659320] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=kworker/0:1 next_pid=13 next_prio=120
[  290.672560] sched_switch: prev_comm=kworker/0:1 prev_pid=13 prev_prio=120 prev_state=R ==> next_comm=devmem next_pid=129 next_prio=120
[  290.686080] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[  290.699720] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[  290.712880] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[  290.725500] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120

QUESTION: Can we reproduce this hang with 6.9.7?

Disabling clang as it is failing to build:

hut% nix develop '.#lagarto-ox'
error: builder for '/nix/store/x1nfa792pv28px70kvfakm3aalcfbdyw-clang-epi-479518d.drv' failed with exit code 2;
       last 10 log lines:
       >       |           ^~~~~~~~~~~~~~~
       > 2 errors generated.
       > make[2]: *** [lib/Support/CMakeFiles/LLVMSupport.dir/build.make:1868: lib/Support/CMakeFiles/LLVMSupport.dir/Signals.cpp.o] Error 1
       > make[2]: *** Waiting for unfinished jobs....
       > [  9%] Built target obj.clang-tblgen
       > 1 warning generated.
       > make[1]: *** [CMakeFiles/Makefile2:9468: lib/Support/CMakeFiles/LLVMSupport.dir/all] Error 2
       > make[1]: *** Waiting for unfinished jobs....
       > [  9%] Built target obj.llvm-tblgen
       > make: *** [Makefile:156: all] Error 2
       For full logs, run 'nix log /nix/store/x1nfa792pv28px70kvfakm3aalcfbdyw-clang-epi-479518d.drv'.
error: 1 dependencies of derivation '/nix/store/m54sxxyi3cg062djrcddcawp10z7r49l-riscv64-unknown-linux-gnu-clang-epi-wrapper-479518d.drv' failed to build
error: 1 dependencies of derivation '/nix/store/yl26dbqqj0snl807c0wjabg4dpbq5gvp-stdenv-linux.drv' failed to build
error: 1 dependencies of derivation '/nix/store/h6180fcl30kqy3apaqjsbkkik2p1spmr-rvb-riscv64-unknown-linux-gnu-da202d6.drv' failed to build
error: 1 dependencies of derivation '/nix/store/b13shgqj7128rdsdzzp4qicqbzl0wnfw-system-path.drv' failed to build
error: 1 dependencies of derivation '/nix/store/6qghlihqcyg6155309ldj5xm9m0v835i-nixos-system-nixos-riscv-24.11pre-git.drv' failed to build
error: 1 dependencies of derivation '/nix/store/l2x18cih29r1kn6vi8imwhkyk98yhw4i-nix-shell-riscv64-unknown-linux-gnu-env.drv' failed to build

QUESTION: Missing cache information may affect?

Other CPUs report the cache details in the DT. For example this one https://github.com/torvalds/linux/blob/master/arch/riscv/boot/dts/sifive/fu540-c000.dtsi#L45

cpu1: cpu@1 {
	compatible = "sifive,u54-mc", "sifive,rocket0", "riscv";
	d-cache-block-size = <64>;
	d-cache-sets = <64>;
	d-cache-size = <32768>;
	d-tlb-sets = <1>;
	d-tlb-size = <32>;
	device_type = "cpu";
	i-cache-block-size = <64>;
	i-cache-sets = <64>;
	i-cache-size = <32768>;
	i-tlb-sets = <1>;
	i-tlb-size = <32>;
	mmu-type = "riscv,sv39";
	reg = <1>;
	riscv,isa = "rv64imafdc";
	riscv,isa-base = "rv64i";
	riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
			       "zifencei", "zihpm";
	tlb-split;
	next-level-cache = <&l2cache>;
	cpu1_intc: interrupt-controller {
		#interrupt-cells = <1>;
		compatible = "riscv,cpu-intc";
		interrupt-controller;
	};
};

We may want to add it to our DT to be sure that it has no effect.

OBSERVATION: Arrived to stage 2!

+ kill -9 74
+ readlink /proc/75/exe
[  374.961120] stage-1-init: [Thu Jan  1 00:06:14 UTC 1970] + test 0 -ge 8
+ '[' 75 -eq 1 ]
+ kill -9 75
+ readlink /proc/102/exe
+ '[' 102 -eq 1 ]
+ kill -9 102
+ readlink /proc/137/exe
+ continue
+ readlink /proc/674/exe
+ continue
+ readlink /proc/675/exe
+ continue
+ test -n
+ echo /sbin/modprobe
+ '[' '!' -e /mnt-root//nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git/init ]
+ mkdir -m 0755 -p /mnt-root/proc /mnt-root/sys /mnt-root/dev /mnt-root/run
+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/988axh0bq3wqp90gms4b4a0hkfwvkd3i-extra-utils/bin/switch_root /mnt-root /nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git/init

<<< NixOS Stage 2 >>>

[  384.203680] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[  384.287600] booting system configuration /nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[  388.163860] stage-2-init: running activation script...
[  391.643500] random: perl: uninitialized urandom read (4 bytes read)
[  391.884800] random: perl: uninitialized urandom read (4 bytes read)
[  425.302000] random: perl: uninitialized urandom read (4 bytes read)

But then it hangs.

2024-07-08

QUESTION: Who sets the plic interrupts?

Shouldn't OpenSBI read the DT and do some configuration in the PLIC while in machine mode?

OBSERVATION: Semi-stack trace from CincoRanch

hvc_remove?
console_unlock <-- only called from hvc_remove()
prb_read_valid
desc_read_finalized_seq
__memcpy (multiple times)
get_data
do_trap_break
report_bug
is_valid_bugaddr
copy_from_kernel_nofault
copy_from_kernel_nofault_allowed
find_bug
_printk
vprintk
vprintk_default
vprintk_emit
vprintk_store
sched_clock
vsnprintf
format_decode
__memcpy
printk_parse_prefix
prb_reserve
do_page_fault
fixup_exception
search_exception_tables
search_extable
cmp_ex_search (multiple times)
search_module_extables
__module_address
no_context.part.0
die_kernel_fault <-- last frame(?)

QUESTION: Can we place a trace point in hvc_remove?

If we are getting stuck in the same place, we should be able to see the backtrace (assuming the console still works) just before we try to remove the console device.

Placed, but still unable to see anything in any hang. Here is a hang in the Stage 2:

<<< NixOS Stage 2 >>>

[  404.158340] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[  404.242500] booting system configuration /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[  408.148380] stage-2-init: running activation script...
[  411.612240] random: perl: uninitialized urandom read (4 bytes read)
[  411.866440] random: perl: uninitialized urandom read (4 bytes read)
[  447.588880] random: perl: uninitialized urandom read (4 bytes read)

Still, it may be hang in a similar way, causing a loop of page faults just while trying to printk to the console, which would explain why we don't see anything and why the heartbeat stops.

Fran has created another bitstream with two consoles enabled (ox_u55c_a234c132.bit), let see if I manage to boot with it.

First I will need to enable the consoles on the DTS.

It doesn't seem to produce any output in the UART. I cannot see OpenSBI while loading it on each baud rate:

Type [C-a] [C-h] to see available commands
Terminal ready

*** baud: 57600 ***

*** baud: 38400 ***

*** baud: 19200 ***

*** baud: 9600 ***

[...]

*** baud: 230400 ***

*** baud: 460800 ***

Let's keep the bitstream files in a repository, so I can carefully track them with git too.

OBSERVATION: The new bitstream requires a bootrom to start

I added it to the bitstream repository, as it is a binary blob too. Now I need to update the load addresses:

6a63bcea6d/boot_riscv/boot_sa.sh (L36-40)

Continues to hang just after those perl messages:

+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/988axh0bq3wqp90gms4b4a0hkfwvkd3i-extra-utils/bin/switch_root /mnt-root /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git/init

<<< NixOS Stage 2 >>>

[  541.559320] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[  541.641280] booting system configuration /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[  545.569700] stage-2-init: running activation script...
[  549.019380] random: perl: uninitialized urandom read (4 bytes read)
[  549.274940] random: perl: uninitialized urandom read (4 bytes read)

QUESTION: What happens if we disable CONFIG_BUG

May be a long shot, but if we are experiencing the same page fault loop as in cincoranch we may as well try.

2024-07-09

QUESTION: Maybe we can try without out-of-order?

I made a small tool in C to view and change the CSR register that controls the in-order/out-of-order. Maybe we can try with the "in-order" setting.

We arrive to execute systemd:

Starting interactive shell...
+ setsid /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash -c 'exec /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash < /dev/hvc0 >/dev/hvc0 2>/dev/hvc0'
[   90.077300] stage-1-init: [Thu Jan  1 00:01:27 UTC 1970] + '[' -n 1 -a i '=' f ]
[   90.639760] stage-1-init: [Thu Jan  1 00:01:28 UTC 1970] + '[' -n 1 -a i '=' i ]
~ # [   90.967260] stage-1-init: [Thu Jan  1 00:01:28 UTC 1970] + echo 'Starting interactive shell...'
[   91.234980] stage-1-init: [Thu Jan  1 00:01:28 UTC 1970] Starting interactive shell...
[   91.569580] stage-1-init: [Thu Jan  1 00:01:29 UTC 1970] + setsid /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash -c 'exec /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extr
a-utils/bin/ash < /dev/hvc0 >/dev/hvc0 2>/dev/hvc0'
which csrtool
/nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/csrtool
~ # csrtool
CSR 0x801 = 0u
~ # csrtool o
unknown 'o', use: mem-in-order, all-in-order or all-out-of-order
~ # csrtool all-in-order
CSR 0x801 = 7u
~ # csrtool
CSR 0x801 = 7u
~ #
+ IFS='='
+ echo init /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ set -- init /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ stage2Init=/nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ echo /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/modprobe
+ basename dm_mod
[...]
+ echo /sbin/modprobe
+ '[' '!' -e /mnt-root//nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init ]
+ mkdir -m 0755 -p /mnt-root/proc /mnt-root/sys /mnt-root/dev /mnt-root/run
+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/switch_root /mnt-root /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init

<<< NixOS Stage 2 >>>

[  967.703320] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[  967.928020] booting system configuration /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[  977.608980] stage-2-init: running activation script...
bbbbbbbbbbbbbbbbbbbbsetting up /etc...
[ 1084.376420] stage-2-init: setting up /etc...
starting systemd...

Not sure if it is a good reproducer, as it is taking around 15 minutes to hang in a very large piece of software, while when we have the out-of-order enabled, we can hang in half of the time in some script.

Either way, we need to see a backtrace of where it is hanging to understand why it does. We may also enable a stage-2 heartbeat to be sure that it is hanging the kernel and not only the console.

Another idea is to arrive at a proper bash shell, where we can have debugging tools, which may allow us to go slowly until we catch the bug.

OBSERVATION: Setting memory in-order only causes a hang

Tested with:

$ csrtool mem-in-order

And it hangs just after exiting the tool.

QUESTION: Can we see the printk buffer from the host?

If the problem that we are observing is somehow related to the recursive segfault of the kernel in Cincoranch, we may be able to see the printk ring buffer by directly poking at the memory from the host.

QUESTION: Can we crash the CPU by exercising the memory?

I did a small tool memtool that performs allocations and deallocations.

Good news, the tool has hang the console (potentially the kernel too).

~ # which memtool
/nix/store/amj11aclwx62d4mnvkhdgj19kq5gjb9y-extra-utils/bin/memtool
~ # memtool
iter 0, nblocks 1, nbytes 0.1M (A)
iter 1, nblocks 2, nbytes 3.5M (A)
iter 2, nblocks 3, nbytes 3.8M (A)
iter 3, nblocks 4, nbytes 5.7M (A)
iter 4, nblocks 5, nbytes 6.8M (A)
iter 5, nblocks 6, nbytes 10.7M (A)
iter 6, nblocks 5, nbytes 10.6M (D)
iter 7, nblocks 6, nbytes 13.9M (A)
iter 8, nblocks 7, nbytes 16.4M (A)
iter 9, nblocks 8, nbytes 19.8M (A)
iter 10, nblocks 9, nbytes 21.2M (A)
iter 11, nblocks 10, nbytes 24.3M (A)
iter 12, nblocks 11, nbytes 27.4M (A)
iter 13, nblocks 12, nbytes 28.3M (A)
iter 14, nblocks 13, nbytes 31.6M (A)
iter 15, nblocks 12, nbytes 28.2M (D)
iter 16, nblocks 13, nbytes 29.2M (A)
iter 17, nblocks 14, nbytes 30.8M (A)
iter 18, nblocks 15, nbytes 32.6M (A)
iter 19, nblocks 16, nbytes 32.8M (A)
iter 20, nblocks 17, nbytes 36.8M (A)
iter 21, nblocks 18, nbytes 39.6M (A)
iter 22, nblocks 19, nbytes 41.1M (A)
iter 23, nblocks 20, nbytes 44.1M (A)
iter 24, nblocks 21, nbytes 46.9M (A)
iter 25, nblocks 20, nbytes 46.5M (D)
iter 26, nblocks 21, nbytes 50.2M (A)
iter 27, nblocks 22, nbytes 53.8M (A)

Let's see if we can reproduce it again in the same position.

~ # memtool
iter 0, nblocks 1, nbytes 0.1M (A)
iter 1, nblocks 2, nbytes 3.5M (A)
iter 2, nblocks 3, nbytes 3.8M (A)
iter 3, nblocks 4, nbytes 5.7M (A)
iter 4, nblocks 5, nbytes 6.8M (A)
iter 5, nblocks 6, nbytes 10.7M (A)
iter 6, nblocks 5, nbytes 10.6M (D)
iter 7, nblocks 6, nbytes 13.9M (A)
iter 8, nblocks 7, nbytes 16.4M (A)
iter 9, nblocks 8, nbytes 19.8M (A)
iter 10, nblocks 9, nbytes 21.2M (A)
iter 11, nblocks 10, nbytes 24.3M (A)
iter 12, nblocks 11, nbytes 27.4M (A)
iter 13, nblocks 12, nbytes 28.3M (A)
iter 14, nblocks 13, nbytes 31.6M (A)
iter 15, nblocks 12, nbytes 28.2M (D)
iter 16, nblocks 13, nbytes 29.2M (A)
iter 17, nblocks 14, nbytes 30.8M (A)
iter 18, nblocks 15, nbytes 32.6M (A)
iter 19, nblocks 16, nbytes 32.8M (A)
iter 20, nblocks 17, nbytes 36.8M (A)
iter 21, nblocks 18, nbytes 39.6M (A)
iter 22, nblocks 19, nbytes 41.1M (A)
iter 23, nblocks 20, nbytes 44.1M (A)
iter 24, nblocks 21, nbytes 46.9M (A)
iter 25, nblocks 20, nbytes 46.5M (D)
iter 26, nblocks 21, nbytes 50.2M (A)
iter 27, nblocks 22, nbytes 53.8M (A)

Let's make it automatic, so we only need to boot and confirm that it hangs. Just in case we can make it not hang by a miracle.

With blocks of up to 64K we hang in the ~50 M region.

iter=2042 nblocks=1577 allocated=50458K (A)
iter=2043 nblocks=1578 allocated=50489K (A)
iter=2044 nblocks=1579 allocated=50550K (A)
iter=2045 nblocks=1580 allocated=50605K (A)

With blocks of maxsize=512K it also hangs around ~57M.

memtool v1.0.0 maxsize=512K
iter=0 nblocks=1 allocated=88K (A)
iter=1 nblocks=2 allocated=464K (A)
...
iter=275 nblocks=218 allocated=56674K (A)
iter=276 nblocks=219 allocated=56787K (A)
iter=277 nblocks=220 allocated=57252K (A)
iter=278 nblocks=221 allocated=57493K (A)
iter=279 nblocks=222 allocated=57581K (A)
iter=280 nblocks=221 allocated=57416K (D)
iter=281 nblocks=222 allocated=57521K (A)

Maybe there is a problem in the memory segment? Can we reduce it to 1 GiB only and see if it has any effect?

iter=289 nblocks=228 allocated=58636K (A)
iter=290 nblocks=227 allocated=58412K (D)
iter=291 nblocks=228 allocated=58480K (A)
iter=292 nblocks=229 allocated=58599K (A)

Has changed, but not much.

aaaiter=291 nblocks=228 allocated=58480K (A)
allocating...
filling...
aaaaaaaaiter=292 nblocks=229 allocated=58599K (A)
allocating...
aafilling...

It seems to be getting stuck in the filling phase. Can we trace it down with ftrace? It should be generating page faults.

2024-07-10

So, if we manage to crash in the filling phase, we can further pinpoint the issue and remove any effect of malloc(). It would be only related to a page fault and the MMU at this point.

Let's make a much simpler program that only allocates once a buffer of N bytes and then begins filling it, printing the progress in the output.

OBSERVATION: Writing to a vector also stops around 58 MiB

+ memtool fill 536870912
memtool v0.0.1 - Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
mode fill: nbytes=512M, n=134217728
written=0K, addr=0x3f9b800020 OK
written=4096K, addr=0x3f9bc00020 OK
written=8192K, addr=0x3f9c000020 OK
written=12288K, addr=0x3f9c400020 OK
written=16384K, addr=0x3f9c800020 OK
written=20480K, addr=0x3f9cc00020 OK
written=24576K, addr=0x3f9d000020 OK
written=28672K, addr=0x3f9d400020 OK
written=32768K, addr=0x3f9d800020 OK
written=36864K, addr=0x3f9dc00020 OK
written=40960K, addr=0x3f9e000020 OK
written=45056K, addr=0x3f9e400020 OK
written=49152K, addr=0x3f9e800020 OK
written=53248K, addr=0x3f9ec00020 OK
written=57344K, addr=0x3f9f000020 OK

Let's see if we can trace the page fault.

In today's meeting, Jonnatan suggests test the memtool program with all-in-order configuration. I should also try to reproduce this hang with a "production" bitstream (from master).

Also, Xavi reports that the memtool chain test continued to run until the memory was exhausted using an old bistream.

Let's do the quick CSR test first, and then we go back to the ftrace testing, which will take more time.

Here are the commands I was testing, but nothing comes out of the console, even after booting with the tp_printk trace_buf_size=1M boot options:

mkdir -p /sys/kernel/debug/
mount -t debugfs none /sys/kernel/debug/
td=/sys/kernel/debug/tracing
echo nop > $td/current_tracer
echo 100 > $td/max_graph_depth
echo do_page_fault > $td/set_graph_function
echo function_graph > $td/current_tracer

OBSERVATION: Using all-in-order causes the hang in the same place

After setting the CSR 0x801 register to 0x7, the all-in-order configuration, the memtool fill tests continues to hang in the same position:

+ csrtool all-in-order
CSR 0x801 = 7u
+ memtool fill 536870912
memtool v0.0.1 - Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
mode fill: nbytes=512M, n=134217728
written=0K, addr=0x3f8d600020 OK
written=4096K, addr=0x3f8da00020 OK
written=8192K, addr=0x3f8de00020 OK
written=12288K, addr=0x3f8e200020 OK
written=16384K, addr=0x3f8e600020 OK
written=20480K, addr=0x3f8ea00020 OK
written=24576K, addr=0x3f8ee00020 OK
written=28672K, addr=0x3f8f200020 OK
written=32768K, addr=0x3f8f600020 OK
written=36864K, addr=0x3f8fa00020 OK
written=40960K, addr=0x3f8fe00020 OK
written=45056K, addr=0x3f90200020 OK
written=49152K, addr=0x3f90600020 OK
written=53248K, addr=0x3f90a00020 OK
written=57344K, addr=0x3f90e00020 OK

It doesn't seem to have any observable effect with this test, other than going more slow.

Interesting article on how to write a simple bootrom that outputs some ASCII text into the console.

Regarding the do_page_fault filter, it seems that is no available. I need to pick one from available_filter_functions instead. Let's try handle_page_fault.

It still doesn't seem to appear in the console. It is visible with a SysRq request Ctrl+O then 'z'.

Maybe I can try disabling the huge pages, just to discard that it may be related to it: CONFIG_HUGETLBFS.

Same hang with huge pages disabled, but a bit further ~70 MB.

OBSERVATION: The Linux memtest fails in the first round

[    0.000000] Linux version 6.9.7 (nixbld@localhost) (riscv64-unknown-linux-gnu-gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.41) #1-NixOS Thu Jun 27 11:52:32 UTC 2024
[    0.000000] Machine model: Barcelona Supercomputing Center - Lagarto Ox (NixOS)
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10004
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI DBCN extension detected
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: legacy bootconsole [sbi0] enabled
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[    0.000000] cma: Reserved 16 MiB at 0x00000000bf000000 on node -1
[    0.000000] early_memtest: # of tests: 3
[    0.000000]   0x0000000080000000 - 0x0000000080013000 pattern 5555555555555555

This is not suggesting that the problem is not in the virtual memory, but with the actual physical memory.

I will try a similar test with uboot with the mtest command, but requires enabling it first.

OBSERVATION: Memory in the 0xb0000000..0xc0000000 range is bad

Reproduced from U-Boot:

=> mtest 0x80000000 0x90000000 0 2
Testing 80000000 ... 90000000:
Pattern FFFFFFFFFFFFFFFF  Writing...  Reading...Iteration:      2
Tested 2 iteration(s) with 0 errors.
=> mtest 0x90000000 0xa0000000 0 2
Testing 90000000 ... a0000000:
Pattern FFFFFFFFFFFFFFFF  Writing...  Reading...Iteration:      2
Tested 2 iteration(s) with 0 errors.
=> mtest 0xa0000000 0xb0000000 0 2
Testing a0000000 ... b0000000:
Pattern FFFFFFFFFFFFFFFF  Writing...  Reading...Iteration:      2
Tested 2 iteration(s) with 0 errors.
=> mtest 0xb0000000 0xc0000000 0 2
Testing b0000000 ... c0000000:
Pattern 0000000000000000  Writing...

Let's see if we can fix the boot hang by reducing the memory enough to avoid this bad region.

2024-07-11

OBSERVATION: U-Boot mtest hangs in the last 256 MiB

After reducing the size of the RAM segment, I run again the mtest, but this time it fails in the last 256 MiB block.

I assume that U-Boot moves itself to the last part of the memory, and them mtest overwrites the U-Boot code, causing a hang.

So, I simply changed the FDT from U-Boot to skip the first 2M:

fdt set /memory@80000000  reg <0x0 0x80200000 0x0 0x40000000>

And then I enabled the memtest in the kernel boot parameters:

=> fdt set /memory@80000000  reg <0x0 0x80200000 0x0 0x30000000>
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 boot.trace boot.tracedebug init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-riscv-24.11pre-git/init"
=> setenv ramdisk_size 12611657
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 boot.trace boot.tracedebug memtest=3 init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-risc=> 4.11pre-git/init"
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
Moving Image from 0x84000000 to 0x80200000, end=8303c4d0
## Flattened Device Tree blob at 80013000
   Booting using the fdt blob at 0x80013000
Working FDT set to 80013000
   Using Device Tree in place at 0000000080013000, end 000000008001696f
Working FDT set to 80013000

Starting kernel ...

[    0.000000] Linux version 6.9.7 (nixbld@localhost) (riscv64-unknown-linux-gnu-gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.41) #1-NixOS Thu Jun 27 11:52:32 UTC 2024
[    0.000000] Machine model: Barcelona Supercomputing Center - Lagarto Ox (NixOS)
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10004
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI DBCN extension detected
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: legacy bootconsole [sbi0] enabled
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[    0.000000] cma: Reserved 16 MiB at 0x00000000af000000 on node -1
[    0.000000] early_memtest: # of tests: 3
[    0.000000]   0x0000000083200000 - 0x000000008c300000 pattern 5555555555555555
[    0.000000]   0x000000009e912000 - 0x00000000aeff9308 pattern 5555555555555555
[    0.000000]   0x00000000aeff9337 - 0x00000000aeff9338 pattern 5555555555555555
[    0.000000]   0x00000000aeff9367 - 0x00000000aeff9368 pattern 5555555555555555
[    0.000000]   0x00000000aeffcffc - 0x00000000aeffd000 pattern 5555555555555555
[    0.000000]   0x0000000083200000 - 0x000000008c300000 pattern ffffffffffffffff
[    0.000000]   0x000000009e912000 - 0x00000000aeff9308 pattern ffffffffffffffff
[    0.000000]   0x00000000aeff9337 - 0x00000000aeff9338 pattern ffffffffffffffff
[    0.000000]   0x00000000aeff9367 - 0x00000000aeff9368 pattern ffffffffffffffff
[    0.000000]   0x00000000aeffcffc - 0x00000000aeffd000 pattern ffffffffffffffff
[    0.000000]   0x0000000083200000 - 0x000000008c300000 pattern 0000000000000000
[    0.000000]   0x000000009e912000 - 0x00000000aeff9308 pattern 0000000000000000
[    0.000000]   0x00000000aeff9337 - 0x00000000aeff9338 pattern 0000000000000000
[    0.000000]   0x00000000aeff9367 - 0x00000000aeff9368 pattern 0000000000000000
[    0.000000]   0x00000000aeffcffc - 0x00000000aeffd000 pattern 0000000000000000
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080200000-0x00000000b01fffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080200000-0x00000000b01fffff]

That seems to pass the memtest fine, however the boot process hangs in different stages.

OBSERVATION: Cannot open /dev/ttyS0

~ # setserial -g /dev/ttyS1 -a
/dev/ttyS1, Line 1, UART: 16550, Port: 0x0000, IRQ: 1
        Baud_base: 3125000, close_delay: 50, divisor: 0
        closing_wait: 3000
        Flags: spd_normal

~ # setserial -g /dev/ttyS0 -a
(hangs)

This page seems to have good resources on the serial console:

https://tldp.org/HOWTO/Serial-HOWTO-16.html

It seems that there are some differences in the way the serial port is handled regarding 16550 and 16550A.

I can write to the UART console from U-Boot by directly writing in the 0x40001000 address (A = 0x41):

=> help mw
mw - memory write (fill)

Usage:
mw [.b, .w, .l, .q] address value [count]
=> mw 0x40001000 0x41
A=> mw 0x40001000 0x42
B=> mw 0x40001000 0x43
C=>

OBSERVATION: I can type with the ttyS0 8250 driver

Tried to boot too, but hangs:

=> fdt set /memory@80000000  reg <0x0 0x80200000 0x0 0x30000000>
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=ttyS0,115200n8 boot.trace boot.tracedebug init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-
system-nixos-riscv-24.11pre-git/init"
=> setenv ramdisk_size 12611657
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
...
[   30.740740] riscv-plic 40800000.plic: mapped 3 interrupts with 1 handlers for 2 contexts.
[   40.048300] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[   40.420000] of_serial 40001000.serial: error -ENXIO: IRQ index 0 not found
[   40.496720] printk: legacy console [ttyS0] disabled
[   40.558860] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 0, base_baud = 3125000) is a 16550
[   40.583940] printk: legacy console [ttyS0] enabled
[   40.583940] printk: legacy console [ttyS0] enabled
[   40.595760] printk: legacy bootconsole [sbi0] disabled
[   40.595760] printk: legacy bootconsole [sbi0] disabled
[   40.820380] 40003000.serial: ttyS1 at MMIO 0x40003000 (irq = 1, base_baud = 3125000) is a 16550

...

<<< NixOS Stage 2 >>>

[  394.678980] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[  394.764620] booting system configuration /nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-riscv-24.11pre-git
runnin[  398.543300] stage-2-init: running activation script...
g activation script... 

So, if we observe a hang when writing to a bad memory segment, can there be a problem in the place we are placing the pmem? Maybe we can test it with u-boot first.

Another note, the serial device 16550 doesn't seem to use a FIFO, but the 16550A does.

We may want to switch to the A variant, as it seems to be supported by U-boot and the kernel:

https://github.com/u-boot/u-boot/blob/master/drivers/serial/ns16550.c#L607-L619
https://github.com/torvalds/linux/blob/v6.9/drivers/tty/serial/8250/8250_of.c#L285

And defines a FIFO size of 16 bytes:

https://github.com/torvalds/linux/blob/v6.9/drivers/tty/serial/8250/8250_port.c#L74-L81

Still, we would have to wait for a bitstream that can forward the interrupts from the host to the serial console to test it.

OBSERVATION: The memory for the pmem seems to be ok

=> mtest 0x100000000 0x1c0000000 0 2
Testing 100000000 ... 1c0000000:
Pattern FFFFFFFFFFFFFFFF  Writing...  Reading...Iteration:      2
Tested 2 iteration(s) with 0 errors.

It may be worth enabling huge pages again, as there didn't seem to have any effect.

OBSERVATION: There are extra regions mapped by OpenSBI

In the OpenSBI 1.2 test from buildroot:

Domain0 Region00          : 0x0000000080000000-0x000000008003ffff ()
Domain0 Region01          : 0x0000000000000000-0xffffffffffffffff (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080010000

But with OpenSBI 1.4:

Domain0 Region00          : 0x0000000040000000-0x0000000040000fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080020000-0x000000008002ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008001ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080013000

2024-07-12

I find it strange that if we are writing to the 0x80000000 region from the kernel and OpenSBI has mapped a segment for Machine mode only, there is no error message about it in the console.

Let's update to OpenSBI 1.5 for now, and see if we get any more information.

It seems to be booting fine.

So, I observed the udevadm settle to take a lot of time, maybe I can get some clues by enabling the debug mode.

If a trap is being issued, we may be able to modify OpenSBI to print some information to the serial console so we can understand what is going on.

In the meanwhile, let's try a simple test. We can enable debug output of systemd with the following boot options:

systemd.log_level=debug systemd.log_target=console

We will also need to run csrtool all-in-order to arrive to systemd.

OBSERVATION: Hangs in switch_root again.

Technically, we cannot discard the hypothesis than only the console has crashed, as when we switch to the stage 2 we don't have the heartbeat counter. We may as well run it again before we arrive to systemd just to verify that userland crashed.

On the other hand, I don't understand why we hang in such a way when we try to write to the 0x8000_0000 area from the kernel memtest. I've been reading the OpenSBI source code and they seem to have a trap handler that can emit verbose information to the console when a problem with the trap is detected. I would expect to see some error being dumped to the console in that case.

From the OpenSBI information, this line:

Domain0 Region02          : 0x0000000080000000-0x000000008001ffff M: (R,X) S/U: ()

Suggests that it registers a region with no write permission at 0x8000_0000, so it should fail right away from the kernel side. However, this is not reported anywhere in the console.

As we have an easy way to trigger this situation, maybe we can use it as a test to modify OpenSBI to report that problem to the console and verify that it is working. With that information, we could rule out that a similar problem is happening when we try to run systemd. Maybe we could also try to debug other traps.

Another observation is that the memtest lines we see on the console are printed before the actual test begins:

pr_info("  %pa - %pa pattern %016llx\n",
        &this_start, &this_end, cpu_to_be64(pattern));
memtest(pattern, this_start, this_end - this_start);

So when this line is shown:

[    0.000000] early_memtest: # of tests: 3
[    0.000000]   0x0000000080000000 - 0x0000000080013000 pattern 5555555555555555

We an infer that the problem is located in that region, which agrees with the hypothesis that is related with the OpenSBI regions.

This is the output I get with OpenSBI 1.5:

OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name             : ox (Rodrigo NixOS version)
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : ---
Platform Timer Device     : axi_timer @ 50000000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : ---
Platform Shutdown Device  : ---
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 310 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 54 KB
Firmware Heap Offset      : 0x45000
Firmware Heap Size        : 34 KB (total), 2 KB (reserved), 11 KB (used), 20 KB (free)
Firmware Scratch Size     : 4096 B (total), 368 B (used), 3728 B (free)
Runtime SBI Version       : 2.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000040000000-0x0000000040000fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008004ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080017000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.10
Boot HART Base ISA        : rv64imafdc
Boot HART ISA Extensions  : zicntr,zihpm,sdtrig
Boot HART PMP Count       : 0
Boot HART PMP Granularity : 0 bits
Boot HART PMP Address Bits: 0
Boot HART MHPM Info       : 29 (0xfffffff8)
Boot HART Debug Triggers  : 0 triggers
Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109

<debug_uart>
Core:  12 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to 80017000
Hit any key to stop autoboot:  0

Device 0: unknown device

Device 1: unknown device
scanning bus for devices...

Device 0: unknown device
starting USB...
No working controllers found
No ethernet found.
No ethernet found.
=>

Where now the regions are slightly off:

Domain0 Region00          : 0x0000000040000000-0x0000000040000fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008004ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)

I would assume that the region 1 is where OpenSBI places its own data, and region 2 is where it places its own code. Then, in region 0 there is the serial area.

Interestingly, I can read and write to the 0x80000000 - 0x81000000 from u-boot without problems:

=> mtest 0x80000000 0x81000000 0 4
Testing 80000000 ... 81000000:
Pattern FFFFFFFFFFFFFFFF  Writing...  Reading...Iteration:      4
Tested 4 iteration(s) with 0 errors.

So I suspect that it disables those regions before jumping into U-Boot.

What I don't understand is why the MMIO region 0 is starting at 0x40000000 while the UART port should be mapped in 0x40001000 as per the device tree. Maybe we could try with the generic configuration of OpenSBI and see if it can load the plic and the serial ports properly directly from the device tree.

Nice, with OpenSBI 1.5 I can see the console errors when trying the generic configuration:

OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

init_coldboot: timer init failed (error -3)

Seems to be failing in sbi_timer_init() with cold_boot = true.

And the -3 error seems to be:

#define SBI_ERR_INVALID_PARAM                   -3

I assume it is calling fdt_timer_init().

Let's try removing the reg-names property, as it seems to cause it to enter a different branch, but "control" is never matched there.

Still failing, the problem must be somewhere else.

Let's try with openpiton configuration instead.

Doesn't even start the UART:

GGGGGGGGG

These G's must be coming from the bootrom.

So let's go back to the generic platform and place some printf() calls to determine where it is failing.

OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_timer_init: begins
sbi_timer_init: got Zicntr extension
fdt_timer_cold_init: pos = 0
fdt_timer_cold_init: got match, name = riscv,clint0
fdt_timer_cold_init: enabled
fdt_timer_cold_init: drc->cold_init = -3
fdt_timer_init: fdt_timer_cold_init failed (-3)
sbi_platform_timer: sbi_platform_timer_init failed (-3)
init_coldboot: timer init failed (error -3)

Okay, now we can see where it failed. I wonder why aren't these messages enabled by default. I'll guess this is the timer_mtimer_cold_init() function, so let's add some more instrumentation there.

It seems that SiFive timer has a very long weird offset:

if (is_clint) { /* SiFive CLINT */
        /* Set CLINT addresses */
        mt->mtimecmp_addr = addr[0] + ACLINT_DEFAULT_MTIMECMP_OFFSET;
        mt->mtimecmp_size = ACLINT_DEFAULT_MTIMECMP_SIZE;
        if (!quirks->clint_without_mtime) {
                mt->mtime_addr = addr[0] + ACLINT_DEFAULT_MTIME_OFFSET;
                mt->mtime_size = size[0] - mt->mtimecmp_size;
                /* Adjust MTIMER address and size for CLINT device */
                mt->mtime_addr += quirks->clint_mtime_offset;
                mt->mtime_size -= quirks->clint_mtime_offset;
        } else {
                mt->mtime_addr = mt->mtime_size = 0;
        }
        mt->mtimecmp_addr += quirks->clint_mtime_offset;
} else { /* RISC-V ACLINT MTIMER */
        /* Set ACLINT MTIMER addresses */
        mt->mtime_addr = addr[0];
        mt->mtime_size = size[0];
        mt->mtimecmp_addr = addr[1];
        mt->mtimecmp_size = size[1];
}

We may want to use the ACLINT timer instead. Let's first see where the addresses lay in memory, and then use that to verify we change it to 0x0 and 0x8.

GGGGGGGGG
OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_timer_init: begins
sbi_timer_init: got Zicntr extension
fdt_timer_cold_init: pos = 0
fdt_timer_cold_init: got match, name = riscv,clint0
fdt_timer_cold_init: enabled
timer_mtimer_cold_init: begins, is_clint = 1
timer_mtimer_cold_init: mtime_addr = 0x4000dff8
timer_mtimer_cold_init: mtime_size = 0x000b4008
timer_mtimer_cold_init: mtime_addr = 0x40006000
timer_mtimer_cold_init: mtime_addr = 0x00007ff8
fdt_timer_cold_init: drc->cold_init = -3
fdt_timer_init: fdt_timer_cold_init failed (-3)
sbi_platform_timer: sbi_platform_timer_init failed (-3)
init_coldboot: timer init failed (error -3)

Yeah, those addresses are not what we want. Based on the device tree, the clint must be at 0x40002000, so they should be 40002000 and 40002008. Also I made some typos in the printf command, this is the patch:

+	sbi_printf("timer_mtimer_cold_init: mtime_addr = 0x%08lx\n", mt->mtime_addr);
+	sbi_printf("timer_mtimer_cold_init: mtime_size = 0x%08lx\n", mt->mtime_size);
+	sbi_printf("timer_mtimer_cold_init: mtime_addr = 0x%08lx\n", mt->mtimecmp_addr);
+	sbi_printf("timer_mtimer_cold_init: mtime_addr = 0x%08lx\n", mt->mtimecmp_size);

So, let's fix the patch and switch to the "riscv,aclint-mtimer" timer, which doesn't set any quirk or weird offset.

GGGGGGGGG
OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_timer_init: begins
sbi_timer_init: got Zicntr extension
fdt_timer_cold_init: pos = 0
fdt_timer_cold_init: got match, name = riscv,aclint-mtimer
fdt_timer_cold_init: enabled
timer_mtimer_cold_init: begins, is_clint = 0
timer_mtimer_cold_init: mtime_addr = 0x40002000
timer_mtimer_cold_init: mtime_size = 0x000c0000
timer_mtimer_cold_init: mtimecmp_addr = 0x00000000
timer_mtimer_cold_init: mtimecmp_size = 0x00000000
fdt_timer_cold_init: drc->cold_init = -3
fdt_timer_init: fdt_timer_cold_init failed (-3)
sbi_platform_timer: sbi_platform_timer_init failed (-3)
init_coldboot: timer init failed (error -3)

So, now we have better addresses, but this is not what I would expect. The regs property is being used to set the mtime address and size, while I would expect it to set the mtimecmp_addr too, but it seems it is expecting two pairs of address and size in the regs:

reg = <0x0 0x40002000 0x0 0x000c0000>;

So, first let's add the mtimecmp after the mtime registers and see if that clears the -3 error.

I'll leave it at 0x40002000, but I suspect the address might be 0x40170000:

#define OX_ALVEO_TIMER_BASE		0x40170000

After booting again, now it seems to work. But I'm not sure if that may cause more problems down the line.

GGGGGGGGG
OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_timer_init: begins
sbi_timer_init: got Zicntr extension
fdt_timer_cold_init: pos = 0
fdt_timer_cold_init: got match, name = riscv,aclint-mtimer
fdt_timer_cold_init: enabled
timer_mtimer_cold_init: begins, is_clint = 0
timer_mtimer_cold_init: mtime_addr = 0x40002000
timer_mtimer_cold_init: mtime_size = 0x00000008
timer_mtimer_cold_init: mtimecmp_addr = 0x40002008
timer_mtimer_cold_init: mtimecmp_size = 0x00000008
fdt_timer_cold_init: drc->cold_init = 0
fdt_timer_cold_init: pos = 1
fdt_timer_cold_init: returns 0
Platform Name             : Barcelona Supercomputing Center - Lagarto Ox (NixOS)
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : ---
Platform Timer Device     : aclint-mtimer @ 50000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : ---
Platform Shutdown Device  : ---
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 327 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 71 KB
Firmware Heap Offset      : 0x49000
Firmware Heap Size        : 35 KB (total), 2 KB (reserved), 11 KB (used), 21 KB (free)
Firmware Scratch Size     : 4096 B (total), 408 B (used), 3688 B (free)
Runtime SBI Version       : 2.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000040002000-0x000000004000200f M: (I,R,W) S/U: ()
Domain0 Region01          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region02          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region03          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region04          : 0x0000000040800000-0x0000000040bfffff M: (I,R,W) S/U: (R,W)
Domain0 Region05          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000082200000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.10
Boot HART Base ISA        : rv64imafdc
Boot HART ISA Extensions  : zicntr,zihpm,sdtrig
Boot HART PMP Count       : 0
Boot HART PMP Granularity : 0 bits
Boot HART PMP Address Bits: 0
Boot HART MHPM Info       : 29 (0xfffffff8)
Boot HART Debug Triggers  : 0 triggers
Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109

<debug_uart>
Core:  12 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to 82200000
Hit any key to stop autoboot:  0

Device 0: unknown device

Device 1: unknown device
scanning bus for devices...

Device 0: unknown device
starting USB...
No working controllers found
No ethernet found.
No ethernet found.

However, now I cannot boot the kernel, as it is overwriting the FDT:

=> printenv fdtcontroladdr
fdtcontroladdr=82200000
=> fdt addr ${fdtcontroladdr }
Working fdt: 82200000
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0"
=> setenv ramdisk_size 12614846
=> #booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
=> printenv kernel_addr_r
kernel_addr_r=0x84000000
=> printenv ramdisk_addr_r
ramdisk_addr_r=0x8c300000
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
Moving Image from 0x84000000 to 0x80200000, end=83044650
ERROR: Did not find a cmdline Flattened Device Tree
Could not find a valid device tree

Let's move the FDT to 0x80100000.

Now I can load the kernel, but it gets stuck in the middle of the boot:

GGGGGGGGG
OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_timer_init: begins
sbi_timer_init: got Zicntr extension
fdt_timer_cold_init: pos = 0
fdt_timer_cold_init: got match, name = riscv,aclint-mtimer
fdt_timer_cold_init: enabled
timer_mtimer_cold_init: begins, is_clint = 0
timer_mtimer_cold_init: mtime_addr = 0x40002000
timer_mtimer_cold_init: mtime_size = 0x00000008
timer_mtimer_cold_init: mtimecmp_addr = 0x40002008
timer_mtimer_cold_init: mtimecmp_size = 0x00000008
fdt_timer_cold_init: drc->cold_init = 0
fdt_timer_cold_init: pos = 1
fdt_timer_cold_init: returns 0
Platform Name             : Barcelona Supercomputing Center - Lagarto Ox (NixOS)
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : ---
Platform Timer Device     : aclint-mtimer @ 50000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : ---
Platform Shutdown Device  : ---
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 327 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 71 KB
Firmware Heap Offset      : 0x49000
Firmware Heap Size        : 35 KB (total), 2 KB (reserved), 11 KB (used), 21 KB (free)
Firmware Scratch Size     : 4096 B (total), 408 B (used), 3688 B (free)
Runtime SBI Version       : 2.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000040002000-0x000000004000200f M: (I,R,W) S/U: ()
Domain0 Region01          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region02          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region03          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region04          : 0x0000000040800000-0x0000000040bfffff M: (I,R,W) S/U: (R,W)
Domain0 Region05          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080100000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.10
Boot HART Base ISA        : rv64imafdc
Boot HART ISA Extensions  : zicntr,zihpm,sdtrig
Boot HART PMP Count       : 0
Boot HART PMP Granularity : 0 bits
Boot HART PMP Address Bits: 0
Boot HART MHPM Info       : 29 (0xfffffff8)
Boot HART Debug Triggers  : 0 triggers
Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109

<debug_uart>
Core:  12 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to 80100000
Hit any key to stop autoboot:  0

Device 0: unknown device

Device 1: unknown device
scanning bus for devices...

Device 0: unknown device
starting USB...
No working controllers found
No ethernet found.
No ethernet found.
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0"
=> setenv ramdisk_size 12614846
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
Moving Image from 0x84000000 to 0x80200000, end=83044650
## Flattened Device Tree blob at 80100000
   Booting using the fdt blob at 0x80100000
Working FDT set to 80100000
ERROR: reserving fdt memory region failed (addr=80000000 size=4000000 flags=4)
   Using Device Tree in place at 0000000080100000, end 0000000080103dd0
Working FDT set to 80100000

Starting kernel ...

[    0.000000] Linux version 6.9.7 (nixbld@localhost) (riscv64-unknown-linux-gnu-gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.41) #1-NixOS Thu Jun 27 11:52:32 UTC 2024
[    0.000000] Machine model: Barcelona Supercomputing Center - Lagarto Ox (NixOS)
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10005
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI DBCN extension detected
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: legacy bootconsole [sbi0] enabled
[    0.000000] OF: reserved mem: Reserved memory: failed to reserve memory for node 'reserved@80000000': base 0x0000000080000000, size 64 MiB
[    0.000000] OF: reserved mem: OVERLAP DETECTED!
[    0.000000] mmode_resv1@80000000 (0x0000000080000000--0x0000000080040000) overlaps with reserved@80000000 (0x0000000080000000--0x0000000084000000)
[    0.000000] OF: reserved mem: OVERLAP DETECTED!
[    0.000000] reserved@80000000 (0x0000000080000000--0x0000000084000000) overlaps with mmode_resv0@80040000 (0x0000000080040000--0x0000000080060000)
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) nomap non-reusable mmode_resv1@80000000
[    0.000000] OF: reserved mem: 0x0000000080000000..0x0000000083ffffff (65536 KiB) nomap non-reusable reserved@80000000
[    0.000000] OF: reserved mem: 0x0000000080040000..0x000000008005ffff (128 KiB) nomap non-reusable mmode_resv0@80040000
[    0.000000] cma: Reserved 16 MiB at 0x00000000af000000 on node -1
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x000000008005ffff]
[    0.000000]   node   0: [mem 0x0000000080060000-0x00000000afffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000] Falling back to deprecated "riscv,isa"
[    0.000000] riscv: base ISA extensions adfim
[    0.000000] riscv: ELF capabilities adfim
[    0.000000] pcpu-alloc: s0 r0 d131072 u131072 alloc=1*131072
[    0.000000] pcpu-alloc: [0] 0
[    0.000000] Kernel command line: root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0
[    0.000000] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.000000] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 193536
[    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[    0.000000] Virtual kernel memory layout:
[    0.000000]       fixmap : 0xffffffc6fea00000 - 0xffffffc6ff000000   (6144 kB)
[    0.000000]       pci io : 0xffffffc6ff000000 - 0xffffffc700000000   (  16 MB)
[    0.000000]      vmemmap : 0xffffffc700000000 - 0xffffffc800000000   (4096 MB)
[    0.000000]      vmalloc : 0xffffffc800000000 - 0xffffffd800000000   (  64 GB)
[    0.000000]      modules : 0xffffffff02e45000 - 0xffffffff80000000   (2001 MB)
[    0.000000]       lowmem : 0xffffffd800000000 - 0xffffffd830000000   ( 768 MB)
[    0.000000]       kernel : 0xffffffff80000000 - 0xffffffffffffffff   (2047 MB)
[    0.000000] Memory: 386632K/786432K available (17075K kernel code, 9047K rwdata, 10240K rodata, 8737K init, 917K bss, 383416K reserved, 16384K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] ftrace: allocating 46961 entries in 184 pages
[    0.000000] ftrace: allocated 184 pages with 4 groups
[    0.000000] trace event string verifier disabled
[    0.000000] RCU Tasks Rude: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[    0.000000] RCU Tasks Trace: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x179dd7f66, max_idle_ns: 56421785867800 ns
[    0.000020] sched_clock: 64 bits at 50kHz, resolution 20000ns, wraps every 70368744170000ns
[    0.015020] kfence: initialized - using 2097152 bytes for 255 objects at 0x(____ptrval____)-0x(____ptrval____)
[    0.035900] Console: colour dummy device 80x25
[    0.041720] Calibrating delay loop (skipped), value calculated using timer frequency.. 0.10 BogoMIPS (lpj=200)
[    0.052480] pid_max: default: 32768 minimum: 301
[    0.187380] LSM: initializing lsm=capability,landlock,yama
[    0.479640] landlock: Up and running.
[    0.483700] Yama: becoming mindful.
[    0.521600] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.529780] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.784220] riscv: ELF compat mode unsupported
[    0.784600] ASID allocator disabled (0 bits)
[    0.867620] devtmpfs: initialized
[    0.986340] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.996900] futex hash table entries: 256 (order: 1, 12288 bytes, linear)
[    1.084560] pinctrl core: initialized pinctrl subsystem
[    1.195460] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    1.244960] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
[    1.255960] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    1.266500] audit: initializing netlink subsys (disabled)
[    1.317920] thermal_sys: Registered thermal governor 'step_wise'
[    1.319800] cpuidle: using governor ladder

This is very interesting, because all these lockups show the same symptoms and they may be related with the kernel trying to access an area of memory that it shouldn't.

First, let's remove the reserved region, as now OpenSBI properly forwards the regions to the kernel.

It continues to hang, but at least now we don't have overlap of memory regions:

[    0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) nomap non-reusable mmode_resv1@80000000
[    0.000000] OF: reserved mem: 0x0000000080040000..0x000000008005ffff (128 KiB) nomap non-reusable mmode_resv0@80040000
[    0.000000] cma: Reserved 16 MiB at 0x00000000af000000 on node -1
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x000000008005ffff]
[    0.000000]   node   0: [mem 0x0000000080060000-0x00000000afffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000afffffff]

Let's remove the clint from the device tree for now, just to see if it is affecting.

Disabling the clint doesn't seem to cause any effect. In fact, the kernel is still able to get a clock as shown in the kernel messages.

OBSERVATION: The kernel hangs after the unaligned check begins

I enabled the ftrace for initcall and I can see that is getting stuck in the unaligned check:

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000040800000-0x0000000040bfffff M: (I,R,W) S/U: (R,W)
Domain0 Region04          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080100000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes
...
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 trace_event=initcall:* trace_options=sym-addr tp_printk trace_buf_size=1M"
=> setenv ramdisk_size 12614846
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
...
[    0.000000] printk: legacy bootconsole [sbi0] enabled
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[    0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[    0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) nomap non-reusable mmode_resv1@80000000
[    0.000000] OF: reserved mem: 0x0000000080040000..0x000000008005ffff (128 KiB) nomap non-reusable mmode_resv0@80040000
[    0.000000] cma: Reserved 16 MiB at 0x00000000af000000 on node -1
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x000000008005ffff]
[    0.000000]   node   0: [mem 0x0000000080060000-0x00000000afffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000afffffff]
...
[    2.736300] initcall_finish: func=init_ladder+0x0/0x40 ret=0
[    2.742640] initcall_start: func=init_menu+0x0/0x38
[    2.748200] initcall_finish: func=init_menu+0x0/0x38 ret=0
[    2.754400] initcall_start: func=rpmsg_init+0x0/0xc0
[    2.763400] initcall_finish: func=rpmsg_init+0x0/0xc0 ret=0
[    2.769840] initcall_start: func=rpmsg_chrdev_init+0x0/0xc0
[    2.777660] initcall_finish: func=rpmsg_chrdev_init+0x0/0xc0 ret=0
[    2.784540] initcall_start: func=rpmsg_ctrldev_init+0x0/0xbc
[    2.792460] initcall_finish: func=rpmsg_ctrldev_init+0x0/0xbc ret=0
[    2.799420] initcall_start: func=rpmsg_ns_init+0x0/0x6c
[    2.807460] initcall_finish: func=rpmsg_ns_init+0x0/0x6c ret=0
[    2.814120] initcall_start: func=kobject_uevent_init+0x0/0x30
[    2.821120] initcall_finish: func=kobject_uevent_init+0x0/0x30 ret=0
[    2.830980] initcall_level: level=arch
[    2.835460] initcall_start: func=riscv_cpuinfo_init+0x0/0x78
[    2.841960] initcall_finish: func=riscv_cpuinfo_init+0x0/0x78 ret=0
[    2.848920] initcall_start: func=vdso_init+0x0/0x4c
[    2.855560] initcall_finish: func=vdso_init+0x0/0x4c ret=0
[    2.861900] initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4

This makes me wonder if what has just happened is that somehow we have overwritten the OpenSBI trap for unaligned access and we are now executing some garbage code in machine mode.

Based on the OpenSBI regions, I would assume here is where the traps should be:

Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()

It should be feasible to read the memory from the host where those traps are placed.

Here are the FDT reserved areas that U-Boot sees before switching to the kernel:

=> fdt print /reserved-memory
reserved-memory {
        #address-cells = <0x00000002>;
        #size-cells = <0x00000002>;
        ranges;
        mmode_resv1@80000000 {
                reg = <0x00000000 0x80000000 0x00000000 0x00040000>;
                no-map;
        };
        mmode_resv0@80040000 {
                reg = <0x00000000 0x80040000 0x00000000 0x00020000>;
                no-map;
        };
        dma_pool@60000000 {
                reg = <0x00000000 0x60000000 0x00000000 0x10000000>;
                compatible = "shared-dma-pool";
        };
        dma_pool@70000000 {
                reg = <0x00000000 0x70000000 0x00000000 0x10000000>;
                compatible = "shared-dma-pool";
        };
};

Which match with the ones the kernel parse:

[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) nomap non-reusable mmode_resv1@80000000
[    0.000000] OF: reserved mem: 0x0000000080040000..0x000000008005ffff (128 KiB) nomap non-reusable mmode_resv0@80040000

And with the OpenSBI regions (in reverse order):

Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()

Let's add some instrumentation in the code that handles the traps in OpenSBI.

Memory regions:

0x8000_0000 to 0x8003_ffff : OpenSBI code
0x8004_0000 to 0x8005_ffff : OpenSBI data
0x8010_0000 to 0x801._.... : FDT
0x8020_0000 to 0x8020_.... : U-Boot (later kernel)

0x8020_1000 to 0x81fa_0b87 : Kernel image
    80201000-80cb177f : Kernel code
    81400000-819fffff : Kernel rodata
    81c00000-81f18747 : Kernel data
    81f19000-81fa0b87 : Kernel bss

0x8400_0000 to 0x84.._.... M: (R,W) S/U: ()  Linux kernel

Okay, so we enter the unaligned access check:

![    2.947680] initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4
$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
...
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned load$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$
!OpenSBI: misaligned store$

But we never arrive to anywhere else.

Here are the current options that match ALIGN:

hut% grep ALIGN /nix/store/c9jr35xnh2ffzjvkq8nvzj9i2siz1n4s-linux-config-riscv64-unknown-linux-gnu-6.9.7
CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW=y
CONFIG_RISCV_MISALIGNED=y
CONFIG_RISCV_PROBE_UNALIGNED_ACCESS=y
# CONFIG_RISCV_EMULATED_UNALIGNED_ACCESS is not set
# CONFIG_RISCV_SLOW_UNALIGNED_ACCESS is not set
# CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS is not set
CONFIG_HAVE_64BIT_ALIGNED_ACCESS=y
CONFIG_FUNCTION_ALIGNMENT=0
CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC=y
CONFIG_CMA_ALIGNMENT=8
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set

We may want to set these two:

# CONFIG_RISCV_EMULATED_UNALIGNED_ACCESS is not set
# CONFIG_RISCV_SLOW_UNALIGNED_ACCESS is not set

And disable

CONFIG_RISCV_PROBE_UNALIGNED_ACCESS=y

So we don't perform the probing. However, this may bite later, so probably is a better idea to debug it now. I could bound where it is failing as it doesn't seem to be outside the speed check function.

Those 8 pairs of load and store calls seem to match this assembly function in arch/riscv/kernel/copy-unaligned.S:

/* void __riscv_copy_bytes_unaligned(void *, const void *, size_t) */
/* Performs a memcpy without aligning buffers, using only byte accesses. */
/* Note: The size is truncated to a multiple of 8 */
SYM_FUNC_START(__riscv_copy_bytes_unaligned)
        andi a4, a2, ~(8-1)
        beqz a4, 2f
        add  a3, a1, a4
1:
        lb   a4, 0(a1)
        lb   a5, 1(a1)
        lb   a6, 2(a1)
        lb   a7, 3(a1)
        lb   t0, 4(a1)
        lb   t1, 5(a1)
        lb   t2, 6(a1)
        lb   t3, 7(a1)
        sb   a4, 0(a0)
        sb   a5, 1(a0)
        sb   a6, 2(a0)
        sb   a7, 3(a0)
        sb   t0, 4(a0)
        sb   t1, 5(a0)
        sb   t2, 6(a0)
        sb   t3, 7(a0)
        addi a0, a0, 8
        addi a1, a1, 8
        bltu a1, a3, 1b

2:
        ret
SYM_FUNC_END(__riscv_copy_bytes_unaligned)

Booted again with simpler output:

><E><E><E><E[    2.902540] initcall_finish: func=riscv_cpuinfo_init+0x0/0x78 ret=0
><E[    2.909720] initcall_start: func=vdso_init+0x0/0x4c
><E[    2.917380] initcall_finish: func=vdso_init+0x0/0x4c ret=0
><E[    2.923940] initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4
><L><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S><L><L><L><L><L><L><L><L><S><S><S><S><S><S><S><S>

So here is the simplified output using the format <X> where X is one character that differentiates the type of trap.

Based on the code that performs the check:

/*
 * For a fixed amount of time, repeatedly try the function, and take
 * the best time in cycles as the measurement.
 */
while (time_before(jiffies, now + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
        start_cycles = get_cycles64();
        /* Ensure the CSR read can't reorder WRT to the copy. */
        mb();
        __riscv_copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
        /* Ensure the copy ends before the end time is snapped. */
        mb();
        end_cycles = get_cycles64();
        if ((end_cycles - start_cycles) < word_cycles)
                word_cycles = end_cycles - start_cycles;
}

byte_cycles = -1ULL;
__riscv_copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
start_jiffies = jiffies;
while ((now = jiffies) == start_jiffies)
        cpu_relax();

while (time_before(jiffies, now + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
        start_cycles = get_cycles64();
        mb();
        __riscv_copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
        mb();
        end_cycles = get_cycles64();
        if ((end_cycles - start_cycles) < byte_cycles)
                byte_cycles = end_cycles - start_cycles;
}

preempt_enable();

/* Don't divide by zero. */
if (!word_cycles || !byte_cycles) {
        pr_warn("cpu%d: rdtime lacks granularity needed to measure unaligned access speed\n",
                cpu);

        return 0;
}

if (word_cycles < byte_cycles)
        speed = RISCV_HWPROBE_MISALIGNED_FAST;

ratio = div_u64((byte_cycles * 100), word_cycles);
pr_info("cpu%d: Ratio of byte access time to unaligned word access is %d.%02d, unaligned accesses are %s\n",
        cpu,
        ratio / 100,
        ratio % 100,
        (speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");

The fact that the print line with the "Ratio of byte access..." doesn't appear, puts the hang place at some point in between the check and the pr_info() call.

If this is a problem on the OpenSBI side, we can bisect the code to find out where the problem was introduced. But first, I would have to try OpenSBI 1.4 and ensure we can reproduce it.

Okay so with OpenSBI 1.4 we have a hang in the same place.

Lets compare the domain regions:

With OpenSBI 1.4 fpga/alveo_ox:

Domain0 Region00          : 0x0000000040000000-0x0000000040000fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008004ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080017000

With OpenSBI 1.4 generic:

Domain0 Region00          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000040800000-0x0000000040bfffff M: (I,R,W) S/U: (R,W)
Domain0 Region04          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080100000

With OpenSBI 1.5 generic:

Domain0 Region00          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000040800000-0x0000000040bfffff M: (I,R,W) S/U: (R,W)
Domain0 Region04          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080100000

So we have several changes.

First, the PLIC has a new memory map. Let's comment it out in the device tree, and see what happens.

Hangs in the same place, but now we don't have the 0x40800000 region with OpenSBI 1.4 generic:

Domain0 Region00          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080100000

Let's try to move the serial region to 0x40000000, although that seems to be wrong.

Yep, that causes OpenSBI to not emit any message in the console, so let's put it back in 0x40001000.

Next, we may want to place the FDT in the previous location, at 0x80017000.

=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 trace_event=initcall:* trace_options=sym-addr tp_printk trace_buf_size=1M"
=> setenv ramdisk_size 12614846
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}

Hmm, continues to hang in the same point:

[    2.852200] initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4

With this regions:

OpenSBI v1.4
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name             : Barcelona Supercomputing Center - Lagarto Ox (NixOS)
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : ---
Platform Timer Device     : --- @ 0Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : ---
Platform Shutdown Device  : ---
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 323 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 67 KB
Firmware Heap Offset      : 0x48000
Firmware Heap Size        : 35 KB (total), 2 KB (reserved), 8 KB (used), 24 KB (free)
Firmware Scratch Size     : 4096 B (total), 288 B (used), 3808 B (free)
Runtime SBI Version       : 2.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000040001000-0x0000000040001fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000080017000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

I will also try 0x80013000 as address as I saw it being used before, but I don't think it is the problem. It must be related with how the initialization is now different.

OpenSBI v1.4
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

sbi_trap_error: hart0: illegal instruction handler failed (error -2)
sbi_trap_error: hart0: mcause=0x0000000000000002 mtval=0x0000000000000000
sbi_trap_error: hart0: mepc=0x00000000800131a4 mstatus=0x8000000a00007800
sbi_trap_error: hart0: ra=0x0000000080013bb0 sp=0x0000000080046e70
sbi_trap_error: hart0: gp=0x0000000000000000 tp=0x0000000080047000
sbi_trap_error: hart0: s0=0x0000000080046eb0 s1=0x0000000000000118
sbi_trap_error: hart0: a0=0x0000000080013000 a1=0x0000000080046e74
sbi_trap_error: hart0: a2=0x00000000800133c0 a3=0x0000000000000600
sbi_trap_error: hart0: a4=0x000000000000062c a5=0x0000000000000284
sbi_trap_error: hart0: a6=0x0000000000000000 a7=0x000000000000002c
sbi_trap_error: hart0: s2=0x0000000080013000 s3=0x0000000000000000
sbi_trap_error: hart0: s4=0x0000000080047000 s5=0x0000000000000000
sbi_trap_error: hart0: s6=0x0000000000000000 s7=0x0000000000000001
sbi_trap_error: hart0: s8=0x0000000000002000 s9=0x00000000800436f0
sbi_trap_error: hart0: s10=0x0000000000000000 s11=0x0000000000000000
sbi_trap_error: hart0: t0=0x0000000080009796 t1=0x0000000000000268
sbi_trap_error: hart0: t2=0x0000000000001000 t3=0x00000000000000d7
sbi_trap_error: hart0: t4=0x000000007fffffff t5=0x0000000000000001
sbi_trap_error: hart0: t6=0x0000000000000004

With 0x80013000 we seem to enter OpenSBI code region.

Let's try with openpiton again, without the FDT address. We should disable the secondary console from the DT too.

No output, lets enable the PLIC again in the DT. Same.

Probably we have a better chance to fix it in the generic platform.

Let's use a very far away address for the FDT just in case: 0x0_c000_0000

This one doesn't work:

=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 ftrace=function ftrace_filter=* tp_printk"

This one yes:

=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 trace_event=*:* tp_printk trace_buf_size=1M"
=> setenv ramdisk_size 12614846
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}

[  164.849640] initcall_finish: func=vdso_init+0x0/0x4c ret=0
[  164.849960] console: initcall_finish: func=vdso_init+0x0/0x4c ret=0
[  164.862820] initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4
[  164.863140] console: initcall_start: func=check_unaligned_access_all_cpus+0x0/0x1d4
[  164.878860] kmalloc: call_site=check_unaligned_access_all_cpus+0xa8/0x1d4 ptr=(____ptrval____) bytes_req=8 bytes_alloc=8 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
[  164.879200] console: kmalloc: call_site=check_unaligned_access_all_cpus+0xa8/0x1d4 ptr=(____ptrval____) bytes_req=8 bytes_alloc=8 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
[  164.912380] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d28 order=2 migratetype=0 percpu_refill=1
[  164.912680] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d28 order=2 migratetype=0 percpu_refill=1
[  164.934340] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d2c order=2 migratetype=0 percpu_refill=1
[  164.934660] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d2c order=2 migratetype=0 percpu_refill=1
[  164.956300] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d30 order=2 migratetype=0 percpu_refill=1
[  164.956620] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d30 order=2 migratetype=0 percpu_refill=1
[  164.978280] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d34 order=2 migratetype=0 percpu_refill=1
[  164.978600] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d34 order=2 migratetype=0 percpu_refill=1
[  165.000260] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d38 order=2 migratetype=0 percpu_refill=1
[  165.000560] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d38 order=2 migratetype=0 percpu_refill=1
[  165.022200] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d3c order=2 migratetype=0 percpu_refill=1
[  165.022540] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d3c order=2 migratetype=0 percpu_refill=1
[  165.044200] mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d40 order=2 migratetype=0 percpu_refill=1
[  165.044500] console: mm_page_alloc_zone_locked: page=(____ptrval____) pfn=0x83d40 order=2 migratetype=0 percpu_refill=1
[  165.066220] mm_page_alloc: page=(____ptrval____) pfn=0x83d28 order=2 migratetype=0 gfp_flags=GFP_KERNEL
[  165.066540] console: mm_page_alloc: page=(____ptrval____) pfn=0x83d28 order=2 migratetype=0 gfp_flags=GFP_KERNEL

2024-08-01

Now that we have a new bitstream with a CLINT connected to a PLIC input, we may be able to generate an interrupt.

Here is the comment where I gather the pieces:

---8<---{{{

From https://gitlab.bsc.es/hwdesign/rtl/core-tile/sa-fpga/ I can see that the auxiliary timer is in fact another CLINT.

I don't have access to the hlib repository (@jmendoza can I get access to it?) to see the CLINT definition, but based on this CLINT and this one I can estimate some of the previous information:

  • The information on which port number of the PLIC the timer is connected to.

https://gitlab.bsc.es/hwdesign/rtl/core-tile/sa-fpga/-/blob/main/fpga_core_bridge/rtl/fpga_core_bridge.sv#L1114

        plic #(
            .PARAMETER_BITWIDTH (7),
            .NUM_TARGETS        (1),
            .NUM_SOURCES        (4)
        ) plic_inst (
            .clk_i         (clk_i),
            .rstn_i        (reset),
            .irq_sources_i ({plic_timer_eirq,eth_irq,uart1_irq}), 
            .eip_targets_o (irq),

If I read it from right to left starting at 1, it should be at 4, as the eth_irq has two "wires".

  • The memory address of the timer and the mapped registers, so I can see it increasing its value. I think the aux_timer you had in the past would be fine.

https://gitlab.bsc.es/hwdesign/rtl/core-tile/sa-fpga/-/blob/main/fpga_core_bridge/rtl/local_includes/defines.svh#L33-36

//Size: 64KB
`define AUX_TIMER_XBAR_ID 2
`define AUX_TIMER_BASE_ADDR 64'h0000_0000_4001_0000 // Need to be this space because we use a clint as aux timer
`define AUX_TIMER_END_ADDR  64'h0000_0000_4001_FFFF
  • The specific operations I need to do in machine mode to configure the timer to fire at 1 Hz (probably setting two registers).

Based on the source of the CLINT, only one interrupt will be generated after setting the mtimecmp register to something larger than the mtime register. Then I suspect I would have to make the interrupt run some code to rearm it again by modifying the mtimecmp register to some value in the future:

// -----------------------------
// IRQ Generation
// -----------------------------
// The mtime register has a 64-bit precision on all RV32, RV64, and RV128 systems. Platforms provide a 64-bit
// memory-mapped machine-mode timer compare register (mtimecmp), which causes a timer interrupt to be posted when the
// mtime register contains a value greater than or equal (mtime >= mtimecmp) to the value in the mtimecmp register.
// The interrupt remains posted until it is cleared by writing the mtimecmp register. The interrupt will only be taken
// if interrupts are enabled and the MTIE bit is set in the mie register.
always_comb begin : irq_gen
    // check that the mtime cmp register is set to a meaningful value
    for (int unsigned i = 0; i < NR_CORES; i++) begin
        if (mtime_q >= mtimecmp_q[i]) begin
            timer_irq_o[i] = 1'b1;
        end else begin
            timer_irq_o[i] = 1'b0;
        end
    end
end

I could ensure that an interrupt has been fired by reading the mtime and mtimecmp values, and checking that mtime > mtimecmp.

Now I only need to find a bitstream that has been generated with 10ba8b2a11.

This may work:

https://gitlab.bsc.es/hwdesign/fpga/integration-lab/fpga-shell/-/jobs/968583/raw

Submodule path 'sa-fpga': checked out '12b77cb50cf1c416f107d4c7ab1c52d7b5e59056'

Which is based on fpga-shell 01265d197f

Here is the bitstream job: https://gitlab.bsc.es/hwdesign/fpga/integration-lab/fpga-shell/-/jobs/968585

And the bitstream: artifacts.zip

And full log: job.log

The memory map would need a bit of adjustment in the device tree, but to play with the timer in machine mode not much is needed.

I think I have all the pieces now.

---8<---}}}

I will try with the last bitstream that I already had compiled, as I will have to rebuild the required packages in nix.

To compute the memory position of the registers:

`define AUX_TIMER_XBAR_ID 2
`define AUX_TIMER_BASE_ADDR 64'h0000_0000_4001_0000 // Need to be this space because we use a clint as aux timer
`define AUX_TIMER_END_ADDR  64'h0000_0000_4001_FFFF

localparam logic [15:0] MSIP_BASE     = 16'h0;
localparam logic [15:0] MTIMECMP_BASE = 16'h4000;
localparam logic [15:0] MTIME_BASE    = 16'hbff8;

So, the base address 0x40010000 and the first MTIME at 0xbff8 would give us a timer at 0x4001bff8.

Here it is:

=> md 0x4001bff8 1
4001bff8: 006e65b8                             .en.
=> md 0x4001bff8 1
4001bff8: 006e9a26                             &.n.
=> md 0x4001bff8 1
4001bff8: 006ebae1                             ..n.
=> md 0x4001bff8 1
4001bff8: 006eda45                             E.n.
=> md 0x4001bff8 1
4001bff8: 006ef9d4                             ..n.
=> md 0x4001bff8 1
4001bff8: 006f1abb                             ..o.

Now, the MTIMECMP should be at 0x40014000, which should be 0.

=> md 0x40014000 1
40014000: 00000000                             ....

Good.

Now, I suspect the MSIP is not used, so it should be 0 at 0x40010000 too:

=> md 0x40010000 1
40010000: 00000000                             ....

Nice.

Just for testing, let's see if I can make the timer cause any change in the MSIP register by setting the MTIMECMP to a value:

=> mw 0x40014000 0x01700000 # Write the MTIMECMP
=> md 0x40014000 1
40014000: 01700000                             ..p.
=> md 0x4001bff8 1
4001bff8: 016da81a                             ..m.
=> md 0x40010000 1
40010000: 00000000                             ....
=> md 0x4001bff8 1
4001bff8: 016f947c                             |.o.
=> md 0x4001bff8 1
4001bff8: 016fff96                             ..o.
=> md 0x4001bff8 1
4001bff8: 01704367                             gCp. # Now we passed it
=> md 0x40010000 1
40010000: 00000000                             .... # But MSIP is still 0

As expected, nothing happens. We cannot monitor the interrupt line from the timer itself.

Now, let see if we can inspect the state of the PLIC.

From the plic_interface I can see where are the memory addresses of the registers exposed.

The PLIC is mapped here:

//Size: 4MB
`define PLIC_XBAR_ID 5
`define PLIC_BASE_ADDR 64'h0000_0000_4080_0000
`define PLIC_END_ADDR  64'h0000_0000_40BF_FFFF

There are several ways in which the interrupts are not forwarded to the destination, and several destinations. The PLIC specification is a good resource to understand it:

https://github.com/riscv/riscv-plic-spec

This is important:

The interrupt gateways are responsible for converting global interrupt signals into a common interrupt request format, and for controlling the flow of interrupt requests to the PLIC core. At most one interrupt request per interrupt source can be pending in the PLIC core at any time, indicated by setting the sources IP bit. The gateway only forwards a new interrupt request to the PLIC core after receiving notification that the interrupt handler servicing the previous interrupt request from the same source has completed.

So, there cannot be any pending interrupt, otherwise no more interrupts will be sent to the core.

Assuming the PLIC uses the standard memory layout, we should find:

base + 0x000000: Reserved (interrupt source 0 does not exist)
base + 0x000004: Interrupt source 1 priority
base + 0x000008: Interrupt source 2 priority

Which they should begin at 0x40800000.

=> md 0x40800000 8
40800000: 00000000 00000000 00000000 00000000  ................
40800010: 00000000 00000000 00000000 00000000  ................

All the priorities are set to 0.

Let's see the pending interrupts:

base + 0x000FFC: Interrupt source 1023 priority
base + 0x001000: Interrupt Pending bit 0-31
base + 0x00107C: Interrupt Pending bit 992-1023

They should be at 0x40801000:

=> md 0x40801000 8
40801000: 00000010 00000000 00000000 00000000  ................
40801010: 00000000 00000000 00000000 00000000  ................

Whoa, look at that.

             4321
0x00000010 = 10000
             |   |
             |   int 0 (reserved)
             int 4 = timer

We got the interrupt 4 pending in context 0!

Other context don't seem to see anything:

=> md 0x40801080 1
40801080: 00000000                             ....
=> md 0x40801100 1
40801100: 00000000                             ....
=> md 0x40801180 1
40801180: 00000000                             ....
=> md 0x40801200 1
40801200: 00000000                             ....
=> md 0x40801280 1
40801280: 00000000                             ....
=> md 0x40801300 1
40801300: 00000000                             ....
=> md 0x40801380 1
40801380: 00000000                             ....

So, as the priority is 0, this means it is ignored:

If PLIC supports Interrupt Priorities, then each PLIC interrupt source can be assigned a priority by writing to its 32-bit memory-mapped priority register. A priority value of 0 is reserved to mean "never interrupt" and effectively disables the interrupt. Priority 1 is the lowest active priority while the maximum level of priority depends on PLIC implementation. Ties between global interrupts of the same priority are broken by the Interrupt ID; interrupts with the lowest ID have the highest effective priority.

Let's claim the interrupt, by just performing a read from 0x40a00004:

=> md 0x40801000 1
40801000: 00000010                             ....
=> md 0x40a00004 1
40a00004: 00000000                             ....
=> md 0x40801000 1
40801000: 00000010                             ....

So, it continues to be pending.

We have to write the completed interrupt, by writing the number 4 to the same register:

=> mw 0x40a00004 4
=> md 0x40801000 1
40801000: 00000010                             ....

Still not cleared.

Let's try making the MTIMECMP value much higher than MTIME:

=> md 0x40014000 1
40014000: 01700000                             ..p.
=> md 0x4001bff8 1
4001bff8: 03a4584b                             KX..
=> mw 0x40014000 0xaaaaaaaa
=> md 0x40014000 1
40014000: aaaaaaaa                             ....
=> md 0x4001bff8 1
4001bff8: 03abc84d                             M...

So... the ID that must be written to the completion register is not the interrupt number, but the value read from the claim register, which is 0.

=> mw 0x40a00004 0
=> md 0x40801000 1
40801000: 00000010                             ....

Still, nothing.

All interrupts are disabled:

=> md 0x40802000 4
40802000: 00000000 00000000 00000000 00000000  ................

Let's try enabling the interrupt 4, by writting:

=> mw 0x40802000 0x10
=> md 0x40802000 1
40802000: 00000010                             ....
=> md 0x40801000 1
40801000: 00000010                             ....

Now, let's set the priority to something else than 0.

First, lets make sure that the context 0 threshold priority is set to 0, so we allow all interrupts:

0x200000: Priority threshold for context 0

=> md 0x40a00000 1
40a00000: 00000007                             ....

Oh, so we are only receiving interrupts with priority 7 or higher. But our interrupt has priority 0!

=> md 0x40800004 1
40800004: 00000000                             ....

Let's make the threshold 0 and our interrupt have priority 1.

=> mw 0x40a00000 0
=> mw 0x40800004 1
=> md 0x40800004 1
40800004: 00000001                             ....
=> md 0x40a00000
40a00000: 00000000                             ....

Not let's see again the interrupt state:

=> md 0x40801000 1
40801000: 00000010                             ....

Still on.

Let's read the claim register again.

=> md 0x40a00004
40a00004: 00000000                             ....

Still 0, let's try to complete it:

=> mw 0x40a00004 0
=> md 0x40801000 1
40801000: 00000010                             ....

Nope, still pending.

What, what the hell. The threshold value has changed to 1:

=> md 0x40800004 1
40800004: 00000001                             ....
=> md 0x40a00000 1
40a00000: 00000001                             .... <-- this was 0

Let's configure the interruption priority to something bigger than 1.

Wait, I put the priority in the wrong source:

0x000000: Reserved (interrupt source 0 does not exist)
0x000004: Interrupt source 1 priority
0x000008: Interrupt source 2 priority

Our timer should be the source 4, so 12 or 0xc:

=> md 0x4080000c 1
4080000c: 00000000                             ....

(This is wrong, should be 0x40800010, see below)

Let's make it have priority 0xd:

=> mw 0x4080000c 0xd
=> md 0x4080000c 1
4080000c: 0000000d                             ....

Something weird is going on with the priority register?

=> md 0x40a00000 1
40a00000: 00000000                             ....
=> md 0x40a00000 1
40a00000: 0000000d                             ....
=> md 0x40a00000 1
40a00000: 0000000d                             ....
=> md 0x40a00000 1
40a00000: 0000000d                             ....
=> md 0x40a00000 1
40a00000: 0000000d                             ....

Let's see the claim register, which should be in the next word:

=> md 0x40a00004 1
40a00004: 00000004                             ....

Yes! Now I can see the claim register with a proper ID. Let's complete this interrupt by writing the 4 back to that register:

=> mw 0x40a00004 4
=> md 0x40801000 1
40801000: 00000000                             ....

Perfect! It properly caused the pending interrupt to disappear.

Let's try now setting the MTIMECMP to something smaller than the MTIME, so it causes an interrupt. With a value 0 should always work, but lets choose a non zero value:

=> md 0x40014000
40014000: aaaaaaaa                             ....
=> mw 0x40014000 00aaaaaa
=> md 0x40014000
40014000: 00aaaaaa                             ....
=> md 0x4001bff8
4001bff8: 06211a0c                             ..!.
=> md 0x40801000 1
40801000: 00000010                             ....

Perfect! It causes the interrupt to appear as pending.

So, using the context 0, we can properly see the interrupt pending, claim it and complete it. But the context 0 is not used in OpenSBI, only the 9 and 11:

From include/sbi/riscv_encoding.h:

#define IRQ_S_SOFT			1
#define IRQ_VS_SOFT			2
#define IRQ_M_SOFT			3
#define IRQ_S_TIMER			5
#define IRQ_VS_TIMER		6
#define IRQ_M_TIMER			7
#define IRQ_S_EXT			9
#define IRQ_VS_EXT			10
#define IRQ_M_EXT			11
#define IRQ_S_GEXT			12
#define IRQ_PMU_OVF			13

And from lib/utils/irqchip/fdt_irqchip_plic.c:

static int irqchip_plic_update_hartid_table(void *fdt, int nodeoff,
                        struct plic_data *pd)
{
    const fdt32_t *val;
    u32 phandle, hwirq, hartid;
    struct sbi_scratch *scratch;
    int i, err, count, cpu_offset, cpu_intc_offset;

    val = fdt_getprop(fdt, nodeoff, "interrupts-extended", &count);
    if (!val || count < sizeof(fdt32_t))
        return SBI_EINVAL;
    count = count / sizeof(fdt32_t);

    for (i = 0; i < count; i += 2) {
        phandle = fdt32_to_cpu(val[i]);
        hwirq = fdt32_to_cpu(val[i + 1]);

        cpu_intc_offset = fdt_node_offset_by_phandle(fdt, phandle);
        if (cpu_intc_offset < 0)
            continue;

        cpu_offset = fdt_parent_offset(fdt, cpu_intc_offset);
        if (cpu_offset < 0)
            continue;

        err = fdt_parse_hart_id(fdt, cpu_offset, &hartid);
        if (err)
            continue;

        scratch = sbi_hartid_to_scratch(hartid);
        if (!scratch)
            continue;

        plic_set_hart_data_ptr(scratch, pd);
        switch (hwirq) {
        case IRQ_M_EXT:
            plic_set_hart_mcontext(scratch, i / 2);
            break;
        case IRQ_S_EXT:
            plic_set_hart_scontext(scratch, i / 2);
            break;
        }
    }

    return 0;
}

So, lets try to do the same, but with the context 11 for machine mode IRQ_M_EXT.

Let's compute the address of the input source for context 11:

base + 0x002000: Enable bits for sources 0-31 on context 0
base + 0x002004: Enable bits for sources 32-63 on context 0
...
base + 0x00207C: Enable bits for sources 992-1023 on context 0
base + 0x002080: Enable bits for sources 0-31 on context 1
base + 0x002084: Enable bits for sources 32-63 on context 1
...
base + 0x0020FC: Enable bits for sources 992-1023 on context 1
base + 0x002100: Enable bits for sources 0-31 on context 2
base + 0x002104: Enable bits for sources 32-63 on context 2
...
base + 0x00217C: Enable bits for sources 992-1023 on context 2
...
base + 0x1F1F80: Enable bits for sources 0-31 on context 15871
base + 0x1F1F84: Enable bits for sources 32-63 on context 15871
base + 0x1F1FFC: Enable bits for sources 992-1023 on context 15871
...

It should be:

>>> hex(0x40800000 + 0x2000 + (11 * 0x80))
'0x40802580'

They are all disabled:

=> md 0x40802580
40802580: 00000000                             ....

So, let's enable the source 4 by writing 0x10

=> mw 0x40802580 0x10
=> md 0x40801000 1
40801000: 00000010                             ....

Now, let's check the context 11 priority threshold:

0x200000: Priority threshold for context 0
0x201000: Priority threshold for context 1
0x202000: Priority threshold for context 2
0x203000: Priority threshold for context 3

The priority threshold for context 11 should be at:

>>> hex(0x40800000 + 0x200000 + (11 * 0x1000))
'0x40a0b000'

=> md 0x40a0b000
40a0b000: 00000000                             ....

It has value 0, so all interrupts with non-zero priority should pass:

For example, a threshold value of zero permits all interrupts with non-zero priority.

Let's see the priority of source 4 in context 11:

0x000000: Reserved (interrupt source 0 does not exist)
0x000004: Interrupt source 1 priority
0x000008: Interrupt source 2 priority
...
0x000FFC: Interrupt source 1023 priority

The address should be at:

>>> hex(0x40800000 + (4 * 0x4))

=> md 0x40800010
40800010: 00000000                             ....

It has priority 0, so it would never work. Let's make it priority 1:

=> mw 0x40800010 1
=> md 0x40800010 1
40800010: 00000001

Let's check the pending interrupts:

=> md 0x40801000 1
40801000: 00000010                             ....

It is still pending, so let's clear it my setting the MTIMECMP to a large value.

=> md 0x40014000
40014000: 00aaaaaa                             ....
=> mw 0x40014000 0xaaaaaaaa
=> md 0x40014000
40014000: aaaaaaaa                             ....
=> md 0x4001bff8
4001bff8: 0e8e6066                             f`..
=> md 0x4001bff8
4001bff8: 0e8ea4c9                             ....
=> md 0x4001bff8
4001bff8: 0e8ece24                             $...

Now, let's claim and complete it for the context 0 which was already enabled from the test before.

=> md 0x40a00004 1
40a00004: 00000004                             ....
=> mw 0x40a00004 4
=> md 0x40801000 1
40801000: 00000000                             ....

Perfect, now it is not pending anymore.

Now, the context 0 is still enabled, so the interruptions may be sent there instead of context 11. So let's disable the context 0 first.

=> mw 0x40802000 0
=> md 0x40802000 1
40802000: 00000000                             ....

Now let's fire the MTIMECMP and see if OpenSBI sees a machine trap.

=> md 0x40014000 1
40014000: aaaaaaaa                             ....
=> mw 0x40014000 00aaaaaa
=> md 0x40014000 1
40014000: 00aaaaaa                             ....

Nothing happened.

The interrupt is pending:

=> md 0x40801000 1
40801000: 00000010                             ....

The claim on context 0 returns 0, so not interrupt there which is expected:

=> md 0x40a00004 1
40a00004: 00000000                             ....

Let's compute the claim register on context 11:

0x200004: Interrupt Claim Process for context 0
0x201004: Interrupt Claim Process for context 1
0x202004: Interrupt Claim Process for context 2
0x203004: Interrupt Claim Process for context 3
...

>>> hex(0x40800000 + 0x200004 + (11 * 0x1000))
'0x40a0b004'

=> md 0x40a0b004 1
40a0b004: 00000000                             ....

Hmm, there is no claim ID.

So, I checked again, and I cannot enable the interrupt on context 11:

=> md 0x40802580 1
40802580: 00000000                             ....
=> mw 0x40802580 0x10
=> md 0x40802580 1
40802580: 00000000                             ....

Note, the first value is 0 and must be claimed:

=> md 0x40801000 1
40801000: 00000010                             ....
=> md 0x40802000 1
40802000: 00000010
=> md 0x40a00004 1
40a00004: 00000000                             ....
=> mw 0x40a00004 1
=> mw 0x40a00004 4
=> md 0x40a00004 1
40a00004: 00000004                             ....
=> mw 0x40a00004 4
=> md 0x40a00004 1
40a00004: 00000004                             ....
=> mw 0x40a00004 4
=> md 0x40a00004 1
40a00004: 00000004                             ....

2024-08-02

I see that the MIE sets the machine mode external interrupt enable in this way:

int sbi_irqchip_init(struct sbi_scratch *scratch, bool cold_boot)
{
    int rc;
    const struct sbi_platform *plat = sbi_platform_ptr(scratch);

    rc = sbi_platform_irqchip_init(plat, cold_boot);
    if (rc)
        return rc;

    if (ext_irqfn != default_irqfn)
        csr_set(CSR_MIE, MIP_MEIP);

    return 0;
}

Only if the external interrupt function is not the default one. But for the PLIC, it looks like the default one is being used. So let's enable the machine mode interrupts unconditionally.

Let's try to cause an interruption. I would need to list all the steps.

mw 0x40014000 0xffffffff # Disable clock interrupt
md 0x40801000 1 # Show pending interrupts (should be 0x10)
mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
mw 0x40800010 0xff # Make source 4 priority large
md 0x40a00004 1 # Claim interrupt (should read 4)
mw 0x40a00004 4 # Complete 4
md 0x40801000 1 # Show pending interrupts (should be 0x00)
mw 0x40014000 0x00000000 # Enable clock interrupt (should cause one)
md 0x40801000 1 # Show pending interrupts (should be 0x10)

Nice, I can see the trap:

Boot HART MIDELEG         : 0x0000000000000022
Boot HART MEDELEG         : 0x000000000000b109
...
=> mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
=> mw 0x40800010 0xff # Make source 4 priority large
=> md 0x40a00004 1 # Show which value should be claimed
40a00004: 00000000                             ....
=> mw 0x40a00004 0 # Claim 0 (weird)
<i
sbi_trap_error: hart0: trap0: unhandled local interrupt (error -1000)

sbi_trap_error: hart0: trap0: mcause=0x800000000000000b mtval=0x0000000000000000
sbi_trap_error: hart0: trap0: mepc=0x00000000af71ebbc mstatus=0x8000000a00006800
sbi_trap_error: hart0: trap0: ra=0x00000000af71eba0 sp=0x00000000aeed3b00
sbi_trap_error: hart0: trap0: gp=0x00000000aeed3dd0 tp=0x0000000000000000
sbi_trap_error: hart0: trap0: s0=0x00000000af7cd170 s1=0x0000000000000000
sbi_trap_error: hart0: trap0: a0=0x0000000000000000 a1=0x0000000000000002
sbi_trap_error: hart0: trap0: a2=0x0000000000000008 a3=0x0000000000000004
sbi_trap_error: hart0: trap0: a4=0x0000000000000001 a5=0x0000000000000001
sbi_trap_error: hart0: trap0: a6=0x0000000000000008 a7=0x00000000af795778
sbi_trap_error: hart0: trap0: s2=0x0000000000000000 s3=0x00000000aeed5b90
sbi_trap_error: hart0: trap0: s4=0x0000000000000003 s5=0x00000000af7f7a4c
sbi_trap_error: hart0: trap0: s6=0x0000000000000000 s7=0x0000000000000000
sbi_trap_error: hart0: trap0: s8=0x0000000000000000 s9=0x0000000000000000
sbi_trap_error: hart0: trap0: s10=0x00000000aeed5bc0 s11=0x0000000000000000
sbi_trap_error: hart0: trap0: t0=0x00000000aeed3ac0 t1=0x0000000000000039
sbi_trap_error: hart0: trap0: t2=0x3b3d74696e695f64 t3=0x0000000000000010
sbi_trap_error: hart0: trap0: t4=0x0000000000000000 t5=0x61745f746f6f627b
sbi_trap_error: hart0: trap0: t6=0x00000000aeed3aa0

Now let's try delegating it to u-boot, and see if I can print some information.

I don't see any message:

Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109
...
=> mw 0x40014000 0xffffffff # Disable clock interrupt
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010                             ....
=> mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
=> mw 0x40800010 0xff # Make source 4 priority large
=> md 0x40a00004 1 # Claim interrupt (should read 4)
40a00004: 00000004                             ....
=> mw 0x40a00004 4 # Complete 4
=> md 0x40801000 1 # Show pending interrupts (should be 0x00)
40801000: 00000000                             ....
=> mw 0x40014000 0x00000000 # Enable clock interrupt (should cause one)
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010

I made a small subcommand U-Boot "exception sregs" to be able to dump the supervisor registers, to check they have the proper values.

Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109
><E
><E<><Ed><Ee><Eb><Eu><Eg><E_><Eu><Ea><Er><Et><E>><E
><E><E>Core:  11 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to c0000000
Hit any key to stop autoboot:  0
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000000
sip     : 0x0000000000000000
sstatus : 0x8000000200006000
=> exception enable
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000222
sip     : 0x0000000000000000
sstatus : 0x8000000200006002
=> mw 0x40014000 0xffffffff # Disable clock interrupt
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010                             ....
=> mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
=> mw 0x40800010 0xff # Make source 4 priority large
=> md 0x40a00004 1 # Claim interrupt (should read 4)
40a00004: 00000004                             ....
=> mw 0x40a00004 4 # Complete 4
=> md 0x40801000 1 # Show pending interrupts (should be 0x00)
40801000: 00000000                             ....
=> mw 0x40014000 0x00000000 # Enable clock interrupt (should cause one)
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010                             ....
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000222
sip     : 0x0000000000000200
sstatus : 0x8000000200006002

Now I can see the external interrupt in supervisor arriving to the SIP.

Let's try to cause an interrupt with the normal CLINT:

`define CLINT_XBAR_ID 3
`define CLINT_BASE_ADDR 64'h0000_0000_4010_0000
`define CLINT_END_ADDR  64'h0000_0000_4010_FFFF

`define AUX_TIMER_XBAR_ID 2
`define AUX_TIMER_BASE_ADDR 64'h0000_0000_4001_0000 // Need to be this space because we use a clint as aux timer
`define AUX_TIMER_END_ADDR  64'h0000_0000_4001_FFFF

=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000000
sip     : 0x0000000000000000
sstatus : 0x8000000200006000
=> exception enable
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000222
sip     : 0x0000000000000000
sstatus : 0x8000000200006002
=> md 0x4010bff8 # Show normal CLINT mtime value
4010bff8: 00c8159b                             ....
=> md 0x4010bff8 # Show normal CLINT mtime value
4010bff8: 00c84453                             SD..
=> md 0x4010bff8 # Show normal CLINT mtime value
4010bff8: 00c865b5                             .e..
=> md 0x40104000 # Show normal CLINT mtimecmp value
40104000: 00000000                             ....
=> mw 0x40104000 aaaaaaaa # Disable interrupt for CLINT
=> md 0x40104000 # Show normal CLINT mtimecmp value
40104000: aaaaaaaa                             ....
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000222
sip     : 0x0000000000000000
sstatus : 0x8000000200006002
=> mw 0x40104000 0 # Enable interrupt for CLINT
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000222
sip     : 0x0000000000000000 <-- nothing here
sstatus : 0x8000000200006002
=> md 0x40104000 # Show normal CLINT mtimecmp value
40104000: 00000000

No interrupts seem to arrive at the SIP register.

Let's set the stvec to zero, so it causes a machine exception.

2024-08-21

I tried with the new bitstream (ox_u55c_46619ef4.bit) setting the stvec to zero and I can see the OpenSBI handler stopping, probably due to the jump to zero address:

Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109
><E
><E<><Ed><Ee><Eb><Eu><Eg><E_><Eu><Ea><Er><Et><E>><E
><E><E>Core:  11 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to c0000000
Hit any key to stop autoboot:  0

Device 0: unknown device

Device 1: unknown device
scanning bus for devices...

Device 0: unknown device
starting USB...
No working controllers found
No ethernet found.
No ethernet found.
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000000
sip     : 0x0000000000000000
sstatus : 0x8000000200006000
=> exception enable
=> exception sregs
stvec   : 0x0000000000000000
sie     : 0x0000000000000222
sip     : 0x0000000000000000
sstatus : 0x8000000200006002
=> mw 0x40014000 0xffffffff # Disable clock interrupt
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010                             ....
=> mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
=> mw 0x40800010 0xff # Make source 4 priority large
<i
sbi_trap_error: hart0: trap0: unhandled local interrupt (error -1000)

sbi_trap_error: hart0: trap0: mcause=0x800000000000000b mtval=0x0000000000000000
sbi_trap_error: hart0: trap0: mepc=0x00000000af71df28 mstatus=0x8000000a00006802
sbi_trap_error: hart0: trap0: ra=0x00000000af71df28 sp=0x00000000aeed2b00
sbi_trap_error: hart0: trap0: gp=0x00000000aeed2dd0 tp=0x0000000000000000
sbi_trap_error: hart0: trap0: s0=0x00000000af7cca38 s1=0x0000000000000000
sbi_trap_error: hart0: trap0: a0=0x0000000000000000 a1=0x0000000000000002
sbi_trap_error: hart0: trap0: a2=0x0000000000000008 a3=0x0000000000000004
sbi_trap_error: hart0: trap0: a4=0xffffffffffffffff a5=0x0000000040800014
sbi_trap_error: hart0: trap0: a6=0x0000000000000008 a7=0x00000000af794b08
sbi_trap_error: hart0: trap0: s2=0x0000000000000000 s3=0x00000000aeed4b90
sbi_trap_error: hart0: trap0: s4=0x0000000000000003 s5=0x00000000af7f7754
sbi_trap_error: hart0: trap0: s6=0x0000000000000000 s7=0x0000000000000000
sbi_trap_error: hart0: trap0: s8=0x0000000000000000 s9=0x0000000000000000
sbi_trap_error: hart0: trap0: s10=0x00000000aeed4bc0 s11=0x0000000000000000
sbi_trap_error: hart0: trap0: t0=0x00000000aeed2ac0 t1=0x0000000000000039
sbi_trap_error: hart0: trap0: t2=0x3b3d74696e695f64 t3=0x0000000000000010
sbi_trap_error: hart0: trap0: t4=0x0000000000000000 t5=0x61745f746f6f627b
sbi_trap_error: hart0: trap0: t6=0x00000000aeed2aa0

Let's revert the stvec to its original value and try again and see if we can see the delegation properly being handed to the supervisor handler.

It continues to do the same trap in machine mode, rather than handing the interrupt over to supervisor.

Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109
><E
><E<><Ed><Ee><Eb><Eu><Eg><E_><Eu><Ea><Er><Et><E>><E
><E><E>Core:  11 devices, 8 uclasses, devicetree: board
Loading Environment from nowhere... OK
In:    serial,usbkbd
Out:   serial,vidconsole
Err:   serial,vidconsole
No working controllers found
Net:   No ethernet found.
Working FDT set to c0000000
Hit any key to stop autoboot:  0
=> exception sregs
stvec   : 0x00000000af6f4400
sie     : 0x0000000000000000
sip     : 0x0000000000000000
sstatus : 0x8000000200006000
=> exception enable
=> exception sregs
stvec   : 0x00000000af6f4400 <-------- now stvec is left as-is
sie     : 0x0000000000000222
sip     : 0x0000000000000000
sstatus : 0x8000000200006002
=> mw 0x40014000 0xffffffff # Disable clock interrupt
=> md 0x40801000 1 # Show pending interrupts (should be 0x10)
40801000: 00000010                             ....
=> mw 0x40802000 0x10 # Enable interrupt for source 4 (timer)
=> mw 0x40800010 0xff # Make source 4 priority large
<i
sbi_trap_error: hart0: trap0: unhandled local interrupt (error -1000)

sbi_trap_error: hart0: trap0: mcause=0x800000000000000b mtval=0x0000000000000000
sbi_trap_error: hart0: trap0: mepc=0x00000000af71df40 mstatus=0x8000000a00006802
sbi_trap_error: hart0: trap0: ra=0x00000000af71df24 sp=0x00000000aeed2b00
sbi_trap_error: hart0: trap0: gp=0x00000000aeed2dd0 tp=0x0000000000000000
sbi_trap_error: hart0: trap0: s0=0x00000000af7cca38 s1=0x0000000000000000
sbi_trap_error: hart0: trap0: a0=0x0000000000000000 a1=0x0000000000000002
sbi_trap_error: hart0: trap0: a2=0x0000000000000008 a3=0x0000000000000004
sbi_trap_error: hart0: trap0: a4=0x0000000000000001 a5=0x0000000000000001
sbi_trap_error: hart0: trap0: a6=0x0000000000000008 a7=0x00000000af794b08
sbi_trap_error: hart0: trap0: s2=0x0000000000000000 s3=0x00000000aeed4b90
sbi_trap_error: hart0: trap0: s4=0x0000000000000003 s5=0x00000000af7f7754
sbi_trap_error: hart0: trap0: s6=0x0000000000000000 s7=0x0000000000000000
sbi_trap_error: hart0: trap0: s8=0x0000000000000000 s9=0x0000000000000000
sbi_trap_error: hart0: trap0: s10=0x00000000aeed4bc0 s11=0x0000000000000000
sbi_trap_error: hart0: trap0: t0=0x00000000aeed2ac0 t1=0x0000000000000039
sbi_trap_error: hart0: trap0: t2=0x3b3d74696e695f64 t3=0x0000000000000010
sbi_trap_error: hart0: trap0: t4=0x0000000000000000 t5=0x61745f746f6f627b
sbi_trap_error: hart0: trap0: t6=0x00000000aeed2aa0

Let's dump the machine registers too.

Let's also disable the MEIE bit in MIE register.

2024-08-22

I'm unable to delegate the external interrupt to supervisor. It is always intercepted by OpenSBI in machine mode or nothing happens.

I may still be missing some bit in some register that prevents the delegation from working. So I tried to run the bare metal tests, specifically plic_supervisor from here:

https://gitlab.bsc.es/hwdesign/rtl/core-tile/sa-fpga/-/blob/6547e63f3aed312a4c36f093b5e6421c7f1704f6/fpga_core_bridge/simulator/tests/c_tests/plic_supervisor/plic_supervisor_test.c

But I'm not able to make them run. I added them to Nix and created a bootable image with objcopy, but nothing is displayed in the UART. I also tried to patch the .text.init address to 0x100 without luck.

I think the next experiment I will do is try to port the baremetal tests to something I can load at 0x8000_0000

2024-08-26

I implemented a new bootrom as the original large bootrom is not initializing the UART. This causes baremetal tests to fail until someone properly initializes the UART (namely, OpenSBI). The new bootrom, named RBOOTROM, initializes the UART and prints an initial message. It also sets a machine mode trap to catch potential errors:

$ picocom -qb 115200 $FPGACTL_UART

RBOOTROM v1.0 :^)
Jumping to 0x8000_0000...

While the bitstream with the potential fixes for the supervisor interrupts is not available, I'll try to automate the testing by creating a pipeline that automatically runs the tests in an FPGA.

2024-08-29

The plic_supervisor baremetal test seems to be working with bitstream ox_u55c_450d0ff0_fix_delegation_v2.bit. Here is the output with registers dumped:

Set timer interval.
Registers:
    MIE: 0000000000000200
    MIP: 0000000000000280
MSTATUS: 8000000a00006002
MIDELEG: 0000000000000200
    SIE: 0000000000000200
    SIP: 0000000000000200
SSTATUS: 8000000200006002

Jumping to supervisor mode...

Supervisor Trap Entry Reached
Cause: 8000000000000009 EPC: 000000008000111a
Hi from PLIC Interrupt Service Routine!
Claim interrupt: 0000000000000004
Set timer interval.


Supervisor Trap Entry Reached
Cause: 8000000000000009 EPC: 0000000080001e88
Hi from PLIC Interrupt Service Routine!
Claim interrupt: 0000000000000004
Set timer interval.

In our OpenSBI test however:

OpenSBI v1.5
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

--- TESTING PLIC ---
Timer interrupt disabled
Enabled supervisor delegation:
Registers:
	MIE	: 0x0000000000000200
	MIP	: 0x0000000000000080
	MSTATUS	: 0x8000000a00006082
	MIDELEG	: 0x0000000000000222
	SIE	: 0x0000000000000200
	SIP	: 0x0000000000000000
	SSTATUS	: 0x8000000200006002
Enabling timer in PLIC
Pending: 16
Claim: 4
Pending: 0
Clearing MIP
Switching to supervisor
Registers:
	MIE	: 0x0000000000000200
	MIP	: 0x0000000000000080
	MSTATUS	: 0x8000000a00006082
	MIDELEG	: 0x0000000000000222
	SIE	: 0x0000000000000200
	SIP	: 0x0000000000000000
	SSTATUS	: 0x8000000200006002
Hello from supervisor
Registers:
	SIE	: 0x0000000000000200
	SIP	: 0x0000000000000000
	SSTATUS	: 0x8000000200006002
Timer alarm programmed
Waiting for interrupt...

Here are closer together:

Baremetal:
    MIE: 0000000000000200
    MIP: 0000000000000280
MSTATUS: 8000000a00006002
MIDELEG: 0000000000000200
    SIE: 0000000000000200
    SIP: 0000000000000200
SSTATUS: 8000000200006002
---------------------------------
OpenSBI:
    MIE: 0000000000000200
    MIP: 0000000000000080 *
MSTATUS: 8000000a00006082 *
MIDELEG: 0000000000000222 *
    SIE: 0000000000000200
    SIP: 0000000000000000 *
SSTATUS: 8000000200006002

In the baremetal test, the interrupts are already pending before jumping to supervisor.

Let's try following all the steps exactly the same in OpenSBI.

The supervisor trap address hold the proper function.

The problem seems to be that the offsets I was using for the auxiliar timer were not correct:

-#define PLIC_ENABLE_OFFSET      0x2000UL
-#define PLIC_THRESHOLD_OFFSET   0x200000UL
-#define PLIC_CLAIM_OFFSET       0x200004UL
+#define PLIC_ENABLE_OFFSET      0x2080UL
+#define PLIC_THRESHOLD_OFFSET   0x201000UL
+#define PLIC_CLAIM_OFFSET       0x201004UL

This seems to be required now as we have two contexts. The test is working now.

2024-08-30

We finally have delegation to supervisor working, so now I can go back and focus on booting NixOS. We will probably require JTAG or some other means to debug all problems with the boot.

Now that we have a CI pipeline, let's try making the whole boot process automatic, so I don't have to type anything. This basically requires setting the environment in U-Boot.

<<< NixOS Stage 1 >>>


An error occurred in stage 1 of the boot process, which must mount the
root filesystem on `/mnt-root' and then start stage 2.  Press one
of the following keys:

  i) to launch an interactive shell
  f) to start an interactive shell having pid 1 (needed if you want to
     start stage 2's init manually)
[   22.365260] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] An error occurred in stage 1 of the boot process, which must mount the
  r) to reboot immediately
  *) to ignore the error and continue
[   22.526780] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] root filesystem on `/mnt-root' and then start stage 2.  Press one
[   22.611640] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] of the following keys:
[   22.697460] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] i) to launch an interactive shell
[   22.788100] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] f) to start an interactive shell having pid 1 (needed if you want to
[   22.874060] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] start stage 2's init manually)
[   22.957940] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] r) to reboot immediately
[   23.042520] stage-1-init: [Thu Jan  1 00:00:22 UTC 1970] *) to ignore the error and continue
iStarting interactive shell...
[   32.314080] stage-1-init: [Thu Jan  1 00:00:32 UTC 1970] Starting interactive shell...
~ # cat /proc/sys/kernel/random
random/             randomize_va_space
~ # cat /proc/sys/kernel/random/entropy_avail
0

Let's see what happens with strace.

2024-09-02

Interestingly, I managed to reach the login console and run some commands after fully booting NixOS. I just needed to disable nscd and enable a daemon to fill the entropy pull, which is depleted on boot.

I'm not looking at the nscd binary, which works ok when running nscd -V but hangs the CPU when running nscd --invalidate group. I tried executing it under GDB and then ^C, but it doesn't respond.

As always, we don't have JTAG support ready, so this is going to be an absolute pain to debug. I could bisect the code and try to guess at which point it must be failing. But each attempt will take around 30 minutes, so it is extremely expensive.

It would be nice to reproduce this in the initrd shell, which loads much faster.

# not strictly required, but you'll likely want the log anyway
(gdb) set logging on

# ask gdb to not stop every screen-full
(gdb) set height 0

(gdb) while 1
 > x/i $pc
 > stepi
 > end

Interestingly, I can step up to the end of the program, but it seems to be failing:

(gdb) r
Starting program: /nix/store/nh1f85icnvlqs3dc8lv2ya0ylmljsfax-system-path/bin/nscd --invalidate group
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/47a03qi49pwlk3hxpfwx2vq671mlqn57-glibc-riscv64-unknown-linux-gnu-2.39-52/lib/libthread_db.so.1".

Breakpoint 1, 0x0000002aaaaaf738 in main ()
(gdb) while 1
 >x/i $pc
 >nexti
 >end
=> 0x2aaaaaf738 <main+56>:	jal	0x2aaaaaf160 <setlocale@plt>
0x0000002aaaaaf73c in main ()
=> 0x2aaaaaf73c <main+60>:	auipc	a0,0x19
0x0000002aaaaaf740 in main ()
=> 0x2aaaaaf740 <main+64>:	ld	a0,1580(a0)
0x0000002aaaaaf744 in main ()
=> 0x2aaaaaf744 <main+68>:	jal	0x2aaaaaf690 <textdomain@plt>
0x0000002aaaaaf748 in main ()
=> 0x2aaaaaf748 <main+72>:	li	a5,0
0x0000002aaaaaf74c in main ()
=> 0x2aaaaaf74c <main+76>:	addi	a4,sp,8
0x0000002aaaaaf750 in main ()
=> 0x2aaaaaf750 <main+80>:	li	a3,0
0x0000002aaaaaf754 in main ()
=> 0x2aaaaaf754 <main+84>:	mv	a2,s1
0x0000002aaaaaf758 in main ()
=> 0x2aaaaaf758 <main+88>:	mv	a1,s0
0x0000002aaaaaf75c in main ()
=> 0x2aaaaaf75c <main+92>:	auipc	a0,0x19
0x0000002aaaaaf760 in main ()
=> 0x2aaaaaf760 <main+96>:	addi	a0,a0,-1852
0x0000002aaaaaf764 in main ()
=> 0x2aaaaaf764 <main+100>:	jal	0x2aaaaaef30 <argp_parse@plt>
[Inferior 1 (process 1962) exited with code 01]
No registers.

Using passwd instead hangs the CPU before reaching the main:

[root@nixos-riscv:~]# gdb --args $(which nscd) --invalidate passwd
GNU gdb (GDB) 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "riscv64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /run/current-system/sw/bin/nscd...
(No debugging symbols found in /run/current-system/sw/bin/nscd)
(gdb) b main
Breakpoint 1 at 0x5738
(gdb) b argp_parse
Breakpoint 2 at 0x4f38
(gdb) c
The program is not being run.
(gdb) r
Starting program: /nix/store/nh1f85icnvlqs3dc8lv2ya0ylmljsfax-system-path/bin/nscd --invalidate passwd

It's probably not deterministic.

I have also disabled kaslr by adding nokaslr to the bootargs.

Anyway, doesn't make much sense to debug this on userland and pay 30 minutes for each test. I'll wait until we have proper JTAG support or a similar way to dump the registers on a crash.

2024-09-03

Wrote a small tool plictool to dump the state of the PLIC:

~ # plictool -c 2
plic=0x40800000 nsources=1024 ncontexts=2
src=1 pend=0 prio=1 ctx=1 thre=1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0 ctx=1 thre=1

Interestingly, the auxiliar UART interrupts don't seem to be working very well. Also, is there another source 33 enabled?

I have also noticed that the clock is running with the wrong frequency. Everything is about twice as slow. For example, sleep 1 takes 2 seconds. So I assume the kernel thinks the clock goes faster than it actually does.

The timer is at 50 KHz:

timebase-frequency = <50000>; /* 50 kHz */

So, the frequency of this "clock" is done by using the "general" clock which goes at 50 MHz, and then using a simple counter that counts up to 1525, so the frequency is:

32786.88524590164 Hz

Assuming that it actually works well. Let's try that and see if we can have a more realistic reading for the CLINT interrupt frequency.

Something is going on with the plictool:

~ # plictool
plic=0x40800000 nsources=1024 ncontexts=15872
src=1 pend=1 prio=1 ctx=1 thre=0 <--- zero
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0 ctx=1 thre=2
~ # plictool
plic=0x40800000 nsources=1024 ncontexts=15872
src=1 pend=1 prio=1 ctx=1 thre=2 <--- now changed to 2???
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0 ctx=1 thre=2

Either the register is changing its value on its own, or the tool is causing it.

Let's revert the threshold to 0, but writing to the threshold address, should be at 0x40a01000.

~ # devmem 0x40a01000
0x00000002
~ # devmem 0x40a01000
0x00000002
~ # devmem 0x40a01000
0x00000002
~ # devmem 0x40a01000 8 0
~ # devmem 0x40a01000
0x00000000
~ # devmem 0x40a01000
0x00000000
~ # devmem 0x40a01000
0x00000000
~ # plictool -c 2 -s 5 <-- limiting the sources and contexts
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=1 prio=1 ctx=1 thre=0 <-- good
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # plictool -c 2 -s 5
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=1 prio=1 ctx=1 thre=0
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # plictool -c 2 -s 5
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=1 prio=1 ctx=1 thre=0 <--- stable
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # cat /proc/interrupts
            CPU0
   1:          1  SiFive PLIC   1 Edge      ttyS1
   5:     195413  RISC-V INTC   5 Edge      riscv-timer
~ # cat /proc/int^Crupts

~ # echo a > /dev/ttyS1
~ # cat /proc/interrupts
            CPU0
   1:          1  SiFive PLIC   1 Edge      ttyS1
   5:     197102  RISC-V INTC   5 Edge      riscv-timer
~ # plictool -c 2 -s 5
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=1 <--- wtf prio=1 ctx=1 thre=0
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # devmem 0x40a01004
0x00000000 <--- the claim register continues to give 0 instead of 1
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004 8 5 <-- try writing 5
~ # devmem 0x40a01004
0x00000000 <--- no change
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004 8 1 <-- try with 1
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # devmem 0x40a01004
0x00000000
~ # plictool -c 2 -s 5
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=0 <--- now it is gone prio=1 ctx=1 thre=0
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # cat /proc/interrupts
            CPU0
   1:          3 <-- and we have >1  SiFive PLIC   1 Edge      ttyS1
   5:     213775  RISC-V INTC   5 Edge      riscv-timer
~ # echo a > /dev/ttyS1
~ # cat /proc/interrupts
            CPU0
   1:          3 <--- stuck again SiFive PLIC   1 Edge      ttyS1
   5:     214551  RISC-V INTC   5 Edge      riscv-timer
~ # plictool -c 2 -s 5
plic=0x40800000 nsources=5 ncontexts=2
src=1 pend=1 prio=1 ctx=1 thre=1 <--- again set to 1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1

Clearly something is not working well in the PLIC.

Let's try to dump the pending and priority registers of the PLIC with devmem, just to check that the plictool is not doing something wrong.

devmem 0x40801000 # Dump pending bits of sources 0-31
devmem 0x40801004 # Dump pending bits of sources 32-63
devmem 0x40a01000 # Dump priority threshold of context 1
devmem 0x40a00000 # Dump priority threshold of context 0

Interesting output:

~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=0 prio=1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=0 prio=1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=0 prio=1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=0 prio=1
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
~ # dd if=/dev/ttyS1 bs=1 count=1 of=/dev/null &
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=1 prio=1
    ctx=1 thre=0 unmasked
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0
    ctx=1 thre=1 masked
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=1 prio=1
    ctx=1 thre=1 masked
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0
    ctx=1 thre=1 masked
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=1 prio=1
    ctx=1 thre=1 masked
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0
    ctx=1 thre=1 masked
~ # plictool -c2
plictool v0.0.3 addr=0x40800000 nsrc=1024 nctx=2
src=1 pend=1 prio=1
    ctx=1 thre=1 masked
src=2 pend=0 prio=1
src=3 pend=0 prio=1
src=4 pend=1 prio=1
src=33 pend=0 prio=0
    ctx=1 thre=1 masked

2024-09-05

devmem 0x4080000C 32 127 # Write 127 to source 3 priority
devmem 0x40a01000 32 0 # Write context 1 threshold to 0

# devmem 0x40a01004 # Read claim register context 1
# devmem 0x40a01004 # Read claim register context 1
#          0x201004

devmem 0x40800008 # Print source 3 priority
devmem 0x40a01000 # Print context 1 threshold

ifconfig eth0 hw ether 02:05:00:01:00:02
  • Report specific PLIC issues
  • Enable aux timer
  • Broken RTC frequency.

Thu, 26 Sep 2024 10:26:26 +0200

2024-09-26

The seedrng tool can add some entropy in the kernel from a saved seed in the FS. The source is available here:

https://git.zx2c4.com/seedrng/tree/seedrng.c

But it seems to be included with busybox, so it should be already included in our initrd:

~ # which seedrng
/nix/store/8a4i33qxnpwn0q7hs1vx3q5h4y2cym7n-extra-utils/bin/seedrng
~ # cat /proc/sys/kernel/random/entropy_avail
0
~ # seedrng
seedrng: can't create directory '/var/lib/seedrng': No such file or directory
~ # mkdir -p /var/lib/seedrng
~ # seedrng
Saving 256 bits of non-creditable seed for next boot
~ # cat /proc/sys/kernel/random/entropy_avail
0
~ # ls /var/lib/seedrng
seed.no-credit
~ # hexdump /var/lib/seedrng/seed.no-credit
0000000 caba 9c5c c19a 5b1a 97fc 0a2e a805 9608
0000010 d860 b3b9 4ade b781 ce23 2fce 19a5 d1b1
0000020

Okay, so now I can magically convert it into creditable seed:

~ # mv /var/lib/seedrng/seed.no-credit /var/lib/seedrng/seed.credit
~ # cat /proc/sys/kernel/random/entropy_avail
0
~ # seedrng
Seeding 256 bits and crediting
[  135.171201] random: crng init done
Saving 256 bits of creditable seed for next boot
~ # cat /proc/sys/kernel/random/entropy_avail
256

Nice.

So, I can remove the jitter daemon and use this instead, assuming it won't go down. It doesn't seem to go down:

~ # cat /proc/sys/kernel/random/entropy_avail
256
~ # hexdump -n $((1024*1024)) /dev/random > /dev/null
~ # cat /proc/sys/kernel/random/entropy_avail
256

Another option may be to use a fake HW RNG by pointing to the address of some timer register:

https://github.com/torvalds/linux/blob/master/drivers/char/hw_random/timeriomem-rng.c
https://www.kernel.org/doc/Documentation/devicetree/bindings/rng/timeriomem_rng.txt

Let's try the seedrng approach first.

~ # seedrng -h
seedrng: invalid option -- 'h'
BusyBox v1.36.1 () multi-call binary.

Usage: seedrng [-d DIR] [-n]

Seed the kernel RNG from seed files

    -d DIR	Use seed files in DIR (default: /var/lib/seedrng)
    -n	Do not credit randomness, even if creditable

I can work with this.

It seems to be working.

After correcting the RTC, now I get a more or less good boot time:

[root@nixos-riscv:~]# systemd-analyze
Startup finished in 4min 49.469s (kernel) + 10min 48.260s (userspace) = 15min 37.729s
multi-user.target reached after 10min 48.024s in userspace.

Still, udevd is killing the CPU:

[root@nixos-riscv:~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 585072  16580  59200    0    0     0    12  164  142 49 51  0  0  0
 4  0      0 585072  16580  59200    0    0     0     0  159  190 46 54  0  0  0
 1  0      0 585072  16580  59200    0    0     0     0  157   66 49 51  0  0  0
 5  0      0 585072  16580  59200    0    0     0     0  156   60 75 25  0  0  0
 2  0      0 585072  16580  59200    0    0     0     0  165  200 30 70  0  0  0
 2  0      0 585072  16580  59216    0    0     0     4  157  200 31 69  0  0  0
 2  0      0 585072  16580  59216    0    0     0     0  157  194 39 61  0  0  0
 2  0      0 585072  16580  59216    0    0     0     0  158  201 53 47  0  0  0
 2  0      0 585072  16580  59216    0    0     0     0  156  173 62 38  0  0  0
 3  0      0 585072  16580  59216    0    0     0     0  162  192 64 36  0  0  0
 4  0      0 585072  16580  59232    0    0     0     0  163  205 35 65  0  0  0
 1  0      0 585072  16580  59232    0    0     0     0  165  182 36 64  0  0  0
 1  0      0 585072  16580  59232    0    0     0     0  155   44 76 24  0  0  0
 3  0      0 585072  16580  59232    0    0     0     0  157  149 46 54  0  0  0
 2  0      0 585072  16580  59232    0    0     0     0  158  197 34 66  0  0  0
 2  0      0 585072  16580  59232    0    0     0     0  158  198 30 70  0  0  0
 2  0      0 585072  16580  59232    0    0     0     0  156  196 48 52  0  0  0
 2  0      0 585072  16580  59232    0    0     0     0  158  179 58 42  0  0  0
 3  0      0 585072  16580  59232    0    0     0     0  158  176 67 33  0  0  0
 2  0      0 585072  16580  59488    0    0   256     0  159  195 36 64  0  0  0
 1  0      0 585072  16580  59620    0    0   128     0  156  134 55 45  0  0  0

Not sure which device is malfunctioning.

[root@nixos-riscv:~]# udevadm monitor
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

UDEV  [1738.570531] add      /devices/virtual/tty/ptyaa (tty)
UDEV  [1741.983817] add      /devices/virtual/tty/ptyab (tty)
UDEV  [1745.434070] add      /devices/virtual/tty/ptyac (tty)
UDEV  [1748.846075] add      /devices/virtual/tty/ptyad (tty)
UDEV  [1752.238682] add      /devices/virtual/tty/ptyae (tty)
UDEV  [1755.665632] add      /devices/virtual/tty/ptyaf (tty)
UDEV  [1759.047259] add      /devices/virtual/tty/ptyb0 (tty)

2024-09-27

Here is the FPGA startup services by duration:

[root@nixos-riscv:~]# systemd-analyze blame
3min 59.980s systemd-udev-trigger.service
 2min 6.780s suid-sgid-wrappers.service
 2min 2.677s mount-pstore.service
1min 46.770s user@0.service
1min 10.554s systemd-journald.service
 1min 5.538s resolvconf.service
     48.660s systemd-logind.service
     44.747s systemd-sysctl.service
     43.971s sys-fs-fuse-connections.mount
     43.358s dev-mqueue.mount
     43.197s dev-hugepages.mount
     43.043s sys-kernel-debug.mount
     42.959s kmod-static-nodes.service
     42.568s network-setup.service
     42.546s modprobe@configfs.service
     42.285s systemd-tmpfiles-setup-dev-early.service
     40.980s modprobe@efi_pstore.service
     40.510s sys-kernel-config.mount
     39.891s modprobe@fuse.service
     36.103s systemd-random-seed.service
     30.686s systemd-udevd.service
     30.059s systemd-modules-load.service
     29.682s systemd-journal-catalog-update.service
     29.091s systemd-remount-fs.service
     26.090s systemd-tmpfiles-setup.service
     23.570s systemd-journal-flush.service
     21.042s systemd-tmpfiles-setup-dev.service
     19.736s audit.service
     17.794s systemd-update-utmp.service
     15.329s dbus.service
     10.177s modprobe@drm.service
      9.866s systemd-update-done.service
      8.302s user-runtime-dir@0.service
      6.615s systemd-user-sessions.service
      5.384s network-local-commands.service
      4.933s getty@tty1.service

It doesn't make much sense to have udev on the FPGA, as we are not going to hotplug anything. Maybe I can just disable the service, and save 4 minutes (at least).

Disabling udev by just setting services.udev.enable = false doesn't work. It still gets activated by the kernel socket.

2024-09-30

Managed to launch an interactive shell from stage2, before we run systemd. It boots in less than 5 minutes.

Now I can run some benchmarks there.

Let's see if I can enable flow control on the serial console. That would be helpful.

For that I need to first switch to the 8250/16550 driver. Let's try switch to the ttyS0 without enabling flow control yet. We probably need to change the stage1 and stage2 scripts to follow the proper console device.

It works, but very slowly:

[    0.000000] Kernel command line: root=/dev/ram0 loglevel=7 rw earlycon=sbi console=ttyS0,115200n8 debug2 ...
...
[   42.069358] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[   42.473860] of_serial 40001000.serial: error -ENXIO: IRQ index 0 not found
[   42.559263] printk: legacy console [ttyS0] disabled
[   42.614225] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 0, base_baud = 3125000) is a 16550
[   42.645519] printk: legacy console [ttyS0] enabled
[   42.645519] printk: legacy console [ttyS0] enabled
[   42.656865] printk: legacy bootconsole [sbi0] disabled
[   42.656865] printk: legacy bootconsole [sbi0] disabled
[   42.896358] 40003000.serial: ttyS1 at MMIO 0x40003000 (irq = 1, base_baud = 3125000) is a 16550
[   42.999450] SuperH (H)SCI(F) driver initialized
...
~ # stty -a
speed 115200 baud;stty: standard input
 line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd -cmspar cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon -ixoff
-iuclc -ixany -imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke -flusho -extproc

Let's add the ability to claim a context in the plictool, as I suspect the claim may be broken.

2024-10-02

Interesting output with ftrace:

<<< NixOS Stage 1 >>>


An error occurred in stage 1 of the boot process, which must mount the
root filesystem on `/mnt-root' and then start stage 2.  Press one
of the following keys:

  i) to launch an interactive shell
  f) to start an interactive shell having pid 1 (needed if you want to
     start stage 2's init manually)
  r) to reboot immediately
  *) to ignore the error and continue
iStarting interactive shell...
~ # mount -t tracefs nodev /sys/kernel/tracing/
~ # cd /sys/kernel/tracing/
/sys/kernel/tracing # cat current_tracer
function
/sys/kernel/tracing # cat enabled_functions
plic_irq_unmask (1)
plic_irq_mask (1)
plic_dying_cpu (1)
plic_starting_cpu (1)
plic_irq_domain_translate (1)
plic_irq_domain_alloc (1)
plic_parse_context_parent (1)
plic_irq_set_type (1)
plic_toggle.constprop.0 (1)
plic_irq_eoi (1)
plic_probe (1)
plic_irq_suspend (1)
plic_handle_irq (1)
plic_irq_enable (1)
plic_irq_disable (1)
plic_irq_resume (1)
/sys/kernel/tracing # cat tracing_on
1
/sys/kernel/tracing # cat trac
trace                trace_options        tracing_max_latency
trace_clock          trace_pipe           tracing_on
trace_marker         trace_stat/          tracing_thresh
trace_marker_raw     tracing_cpumask
/sys/kernel/tracing # cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 1586/1586   #P:1
...
/sys/kernel/tracing # cat options/function-trace
1
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	-	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # cat tracing_on
1
/sys/kernel/tracing # echo 0>tracing_on

/sys/kernel/tracing # cat tracing_on
1
/sys/kernel/tracing # echo 0 > tracing_on
/sys/kernel/tracing # cat tracing_on
0
/sys/kernel/tracing # echo plic_handle_irq > set_ftrace_filter
/sys/kernel/tracing # cat enabled_functions
plic_handle_irq (1)
/sys/kernel/tracing # echo plic_handle_irq,plic_irq_eoi,plic_irq_enable,plic_irq
_disable > set_ftrace_filter ^C

/sys/kernel/tracing # echo plic_irq_eoi >> set_ftrace_filter
/sys/kernel/tracing # cat enabled_functions
plic_irq_eoi (1)
plic_handle_irq (1)
/sys/kernel/tracing # echo plic_irq_enable >> set_ftrace_filter
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	-	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # ^Ct enabled_functions

/sys/kernel/tracing # echo hi > /dev/ttyS1
^Z^C
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # echo plic_irq_enable >> set_f^Cace_filter

/sys/kernel/tracing # cat tracing_on
0
/sys/kernel/tracing # echo 1 > tracing_on
/sys/kernel/tracing # cat tracing_on
1
/sys/kernel/tracing # cat trace_pipe &
/sys/kernel/tracing # fg
cat trace_pipe
^Z[1]+  Stopped                    cat trace_pipe
/sys/kernel/tracing # bg
[1] cat trace_pipe
/sys/kernel/tracing # echo hi > /dev/ttyS1
             ash-91      [000] d..2.   669.966205: plic_irq_enable <-irq_enable
             ash-91      [000] d..2.   669.966815: plic_irq_enable <-irq_enable

*** local echo: yes ***

*** local echo: no ***
^C
/sys/kernel/tracing # echo 0 > tracing_on
/sys/kernel/tracing # cat tracing_on
0
/sys/kernel/tracing # echo 'plic_*' >> set_ftrace_filter
/sys/kernel/tracing # cat enabled_functions
plic_irq_unmask (1)
plic_irq_mask (1)
plic_dying_cpu (1)
plic_starting_cpu (1)
plic_irq_domain_translate (1)
plic_irq_domain_alloc (1)
plic_parse_context_parent (1)
plic_irq_set_type (1)
plic_toggle.constprop.0 (1)
plic_irq_eoi (1)
plic_probe (1)
plic_irq_suspend (1)
plic_handle_irq (1)
plic_irq_enable (1)
plic_irq_disable (1)
plic_irq_resume (1)
/sys/kernel/tracing # echo 1 > tracing_on
/sys/kernel/tracing # cat trace_pipe &
/sys/kernel/tracing # cat: can't open 'trace_pipe': Device or resource busy

[2]+  Done(1)                    cat trace_pipe
/sys/kernel/tracing # ps
PID   USER     TIME  COMMAND
    1 0         0:33 {init} /nix/store/y477q6jlbg9b53knnclgib65fbzpazkj-extra-u
    2 0         0:00 [kthreadd]
    3 0         0:00 [pool_workqueue_]
    4 0         0:00 [kworker/R-slub_]
    5 0         0:00 [kworker/R-netns]
    6 0         0:00 [kworker/0:0-eve]
    7 0         0:00 [kworker/0:0H]
    8 0         0:03 [kworker/u4:0-ev]
    9 0         0:00 [kworker/R-mm_pe]
   10 0         0:41 [kworker/u4:1-ev]
   11 0         0:00 [rcu_tasks_rude_]
   12 0         0:00 [rcu_tasks_trace]
   13 0         0:00 [ksoftirqd/0]
   14 0         0:00 [kworker/0:1-eve]
   15 0         0:01 [kdevtmpfs]
   16 0         0:00 [kworker/R-inet_]
   17 0         0:00 [kauditd]
   18 0         0:00 [khungtaskd]
   20 0         0:00 [oom_reaper]
   21 0         0:00 [kworker/R-write]
   22 0         0:01 [kcompactd0]
   23 0         0:00 [ksmd]
   24 0         0:00 [kworker/R-kinte]
   25 0         0:00 [kworker/R-kbloc]
   26 0         0:00 [kworker/R-blkcg]
   27 0         0:00 [kworker/R-ata_s]
   28 0         0:00 [kworker/R-devfr]
   29 0         0:00 [watchdogd]
   30 0         0:01 [kworker/u4:3-ev]
   31 0         0:00 [kworker/R-rpcio]
   32 0         0:00 [kworker/R-xprti]
   33 0         0:00 [kswapd0]
   34 0         0:00 [kworker/R-nfsio]
   35 0         0:00 [kworker/R-kthro]
   36 0         0:09 [khvcd]
   37 0         0:00 [kworker/R-uas]
   38 0         0:00 [kworker/R-mld]
   39 0         0:00 [kworker/R-ipv6_]
   46 0         0:00 [kworker/R-kstrp]
   47 0         0:00 [kworker/u5:0]
   73 0         0:00 tee -i /proc/self/fd/8
   74 0         0:00 {init} /nix/store/y477q6jlbg9b53knnclgib65fbzpazkj-extra-u
   91 0         0:46 /nix/store/y477q6jlbg9b53knnclgib65fbzpazkj-extra-utils/bi
  113 0         0:00 cat trace_pipe
  117 0         0:00 ps
/sys/kernel/tracing # fg
cat trace_pipe
^C
/sys/kernel/tracing # cat trace_pipe &
/sys/kernel/tracing # echo 1 > tracing_on
/sys/kernel/tracing # echo hi > /dev/ttyS1
             ash-91      [000] d..2.   748.552248: plic_irq_enable <-irq_enable
             ash-91      [000] d..2.   748.552370: plic_toggle.constprop.0 <-plic_irq_enable
             ash-91      [000] d..2.   748.552797: plic_irq_disable <-irq_disable
             ash-91      [000] d..2.   748.552858: plic_toggle.constprop.0 <-plic_irq_disable
             ash-91      [000] d..2.   748.553102: plic_irq_enable <-irq_enable
             ash-91      [000] d..2.   748.553163: plic_toggle.constprop.0 <-plic_irq_enable
^C
/sys/kernel/tracing #              ash-91      [000] d..2.   762.013481: plic_irq_disable <-__irq_disable
             ash-91      [000] d..2.   762.013603: plic_toggle.constprop.0 <-plic_irq_disable

/sys/kernel/tracing #
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # plictool -C 1
0
/sys/kernel/tracing # plictool -C 1 -w 4
/sys/kernel/tracing # plictool -C 1
0
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # devmem 0x40014000 32 0xffffffff
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # plictool -C 1 -w 4
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # plictool -E
plictool: option requires an argument: E
plictool v0.0.5 -- Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Usage:
  plictool [-a addr] [-L] [-n nsrc] [-x nctx]  # List (default)
  plictool [-a addr] -C ctx [-w value]         # Claim
  plictool [-a addr] -T ctx [-w value]         # Threshold
  plictool [-a addr] -I src [-w value]         # Priority
  plictool [-a addr] -P src [-w value]         # Pending
  plictool [-a addr] -E src -c ctx [-w value]  # Enabled
  plictool -v                                  # Version
/sys/kernel/tracing # plictool -E 4 -c 0 -w 1
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	masked	-
36	-	0	masked	-
/sys/kernel/tracing # plictool -C 0
4
/sys/kernel/tracing # plictool -C 0 -w 4
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	-	1	masked	-
36	-	0	masked	-
/sys/kernel/tracing # plictool -E 4 -c 0 -w 0
/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	-	1	-	-
/sys/kernel/tracing # cat /dev/ttyS1 &
/sys/kernel/tracing #              cat-136     [000] d..2.   887.106509: plic_irq_enable <-irq_enable
             cat-136     [000] d..2.   887.106631: plic_toggle.constprop.0 <-plic_irq_enable
             cat-136     [000] d.h2.   887.106814: plic_handle_irq <-generic_handle_domain_irq
             cat-136     [000] d.h3.   887.106997: plic_irq_eoi <-handle_fasteoi_irq
             cat-136     [000] d..2.   887.107424: plic_irq_disable <-irq_disable
             cat-136     [000] d..2.   887.107546: plic_toggle.constprop.0 <-plic_irq_disable
             cat-136     [000] d..2.   887.107729: plic_irq_enable <-irq_enable
             cat-136     [000] d..2.   887.107790: plic_toggle.constprop.0 <-plic_irq_enable

/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	firing
2	-	1	-	-
3	-	1	-	-
4	-	1	-	-
33	-	0	-	masked

So, then an interrupt is fired, we disable the interrupts and then we enable them again. Can this happen because the next interrupt fired when interrupts were disabled?

There are also other cases in which we already have an interrupt pending and we enable the context 1. For example, by setting the enable bit, changing the prio, or the threshold for that context or by removing the context 0 enable bit.

I think I can do some tests with those. I'm not sure why the serial port is firing an interrupt, but if this is the case, I can try to write a character in the transmit register and wait for the pending bit, then try to claim the interrupt for source 1.

2024-10-03

/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
/sys/kernel/tracing # cat tracing_on
1
/sys/kernel/tracing # echo 0 > tracing_on
/sys/kernel/tracing # echo 1 > tracing_on
/sys/kernel/tracing # cat trace_pipe &
/sys/kernel/tracing # cat /dev/ttyS1 &
/sys/kernel/tracing #              cat-122     [000] ...1.  2978.202587: serial8250_pm <-uart_port_startup
             cat-122     [000] ...1.  2978.202648: serial8250_set_sleep <-serial8250_pm
             cat-122     [000] ...1.  2978.202892: serial8250_startup <-uart_port_startup
             cat-122     [000] ...1.  2978.203014: serial8250_do_startup <-serial8250_startup
             cat-122     [000] d..2.  2978.203319: plic_irq_enable <-irq_enable
             cat-122     [000] d..2.  2978.203380: plic_toggle.constprop.0 <-plic_irq_enable
             cat-122     [000] d.h2.  2978.203563: plic_handle_irq <-generic_handle_domain_irq
             cat-122     [000] d.h2.  2978.203685: serial8250_interrupt <-__handle_irq_event_percpu
             cat-122     [000] d.h3.  2978.203746: serial8250_default_handle_irq <-serial8250_interrupt
             cat-122     [000] d.h3.  2978.203807: serial8250_handle_irq <-serial8250_default_handle_irq
             cat-122     [000] d.h3.  2978.203990: plic_irq_eoi <-handle_fasteoi_irq
             cat-122     [000] d..2.  2978.204783: plic_irq_disable <-irq_disable
             cat-122     [000] d..2.  2978.204844: plic_toggle.constprop.0 <-plic_irq_disable
             cat-122     [000] d..2.  2978.205088: plic_irq_enable <-irq_enable
             cat-122     [000] d..2.  2978.205149: plic_toggle.constprop.0 <-plic_irq_enable
             cat-122     [000] d..2.  2978.205271: serial8250_do_set_mctrl <-serial8250_do_startup
             cat-122     [000] ...1.  2978.205393: serial8250_set_termios <-uart_change_line_settings
             cat-122     [000] ...1.  2978.206796: serial8250_do_set_termios <-serial8250_set_termios
             cat-122     [000] .n.1.  2978.206918: serial8250_get_divisor <-serial8250_do_set_termios
             cat-122     [000] dn.2.  2978.207040: serial8250_do_set_divisor <-serial8250_do_set_termios
             cat-122     [000] dn.2.  2978.207101: serial8250_do_set_mctrl <-serial8250_do_set_termios
             cat-122     [000] dn.2.  2978.207345: serial8250_set_mctrl <-uart_update_mctrl
             cat-122     [000] dn.2.  2978.207406: serial8250_do_set_mctrl <-serial8250_set_mctrl

/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	firing
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
33	-	0	-	masked
/sys/kernel/tracing # plictool -C 1
0

Another test tracing the UART serial too.

It would be nice to see where those calls are coming from.

Let's collect the commands I need to use here:

mount -t tracefs nodev /sys/kernel/tracing/
cd /sys/kernel/tracing/
echo 'plic_*' >> set_ftrace_filter
echo 'serial8250_*' >> set_ftrace_filter
cat current_tracer

Here is a trace with the stack trace printed:

~ # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	-	1	-	-
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-

~ #
[...]
/sys/kernel/tracing # cat /dev/ttyS1 &

             cat-122     [000] .....   591.693894: serial8250_pm <-uart_port_startup
             cat-122     [000] .....   591.694077: <stack trace>
 => ftrace_call
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .....   591.694138: serial8250_set_sleep <-serial8250_pm
             cat-122     [000] .....   591.694260: <stack trace>
 => ftrace_call
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .....   591.694504: serial8250_startup <-uart_port_startup
             cat-122     [000] .....   591.694687: <stack trace>
 => ftrace_call
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .....   591.694748: serial8250_do_startup <-serial8250_startup
             cat-122     [000] .....   591.694870: <stack trace>
 => ftrace_call
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.695236: plic_irq_enable <-irq_enable
             cat-122     [000] d..1.   591.695358: <stack trace>
 => ftrace_call
 => irq_startup
 => __setup_irq
 => request_threaded_irq
 => univ8250_setup_irq
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.695480: plic_toggle.constprop.0 <-plic_irq_enable
             cat-122     [000] d..1.   591.695602: <stack trace>
 => ftrace_call
 => irq_enable
 => irq_startup
 => __setup_irq
 => request_threaded_irq
 => univ8250_setup_irq
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.696456: plic_irq_disable <-irq_disable
             cat-122     [000] d..1.   591.696578: <stack trace>
 => ftrace_call
 => __disable_irq_nosync
 => disable_irq_nosync
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.696639: plic_toggle.constprop.0 <-plic_irq_disable
             cat-122     [000] d..1.   591.696822: <stack trace>
 => ftrace_call
 => irq_disable
 => __disable_irq_nosync
 => disable_irq_nosync
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.697005: plic_irq_enable <-irq_enable
             cat-122     [000] d..1.   591.697188: <stack trace>
 => ftrace_call
 => irq_startup
 => __enable_irq
 => enable_irq
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d..1.   591.697249: plic_toggle.constprop.0 <-plic_irq_enable
             cat-122     [000] d..1.   591.697432: <stack trace>
 => ftrace_call
 => irq_enable
 => irq_startup
 => __enable_irq
 => enable_irq
 => serial8250_do_startup
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] d.h1.   591.697554: plic_handle_irq <-generic_handle_domain_irq
             cat-122     [000] d.h1.   591.697676: <stack trace>
 => ftrace_call
 => riscv_intc_irq
 => handle_riscv_irq
 => call_on_irq_stack
 => 0x1c003b9260
             cat-122     [000] d.h1.   591.697798: serial8250_interrupt <-__handle_irq_event_percpu
             cat-122     [000] d.h1.   591.697920: <stack trace>
 => ftrace_call
 => handle_irq_event
 => handle_fasteoi_irq
 => generic_handle_domain_irq
 => plic_handle_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => call_on_irq_stack
 => 0x1c001dc940
             cat-122     [000] d.h2.   591.697981: serial8250_default_handle_irq <-serial8250_interrupt
             cat-122     [000] d.h2.   591.698103: <stack trace>
 => ftrace_call
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_fasteoi_irq
 => generic_handle_domain_irq
 => plic_handle_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => call_on_irq_stack
 => 0x1c001dc940
             cat-122     [000] d.h2.   591.698164: serial8250_handle_irq <-serial8250_default_handle_irq
             cat-122     [000] d.h2.   591.698286: <stack trace>
 => ftrace_call
 => serial8250_interrupt
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_fasteoi_irq
 => generic_handle_domain_irq
 => plic_handle_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => call_on_irq_stack
 => 0x1c00595ba0
             cat-122     [000] d.h2.   591.698469: plic_irq_eoi <-handle_fasteoi_irq
             cat-122     [000] d.h2.   591.698530: <stack trace>
 => ftrace_call
 => generic_handle_domain_irq
 => plic_handle_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => call_on_irq_stack
 => 0x1c00595ba0
             cat-122     [000] d..1.   591.698713: serial8250_do_set_mctrl <-serial8250_do_startup
             cat-122     [000] d..1.   591.698835: <stack trace>
 => ftrace_call
 => serial8250_startup
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .n...   591.700177: serial8250_set_termios <-uart_change_line_settings
             cat-122     [000] .n...   591.700299: <stack trace>
 => ftrace_call
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .n...   591.700421: serial8250_do_set_termios <-serial8250_set_termios
             cat-122     [000] .n...   591.700543: <stack trace>
 => ftrace_call
 => uart_change_line_settings
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] .n...   591.700604: serial8250_get_divisor <-serial8250_do_set_termios
             cat-122     [000] .n...   591.700787: <stack trace>
 => ftrace_call
 => serial8250_set_termios
 => uart_change_line_settings
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] dn.1.   591.700909: serial8250_do_set_divisor <-serial8250_do_set_termios
             cat-122     [000] dn.1.   591.701031: <stack trace>
 => ftrace_call
 => serial8250_set_termios
 => uart_change_line_settings
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] dn.1.   591.701092: serial8250_do_set_mctrl <-serial8250_do_set_termios
             cat-122     [000] dn.1.   591.701214: <stack trace>
 => ftrace_call
 => serial8250_set_termios
 => uart_change_line_settings
 => uart_port_startup
 => uart_port_activate
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] dn.1.   591.701458: serial8250_set_mctrl <-uart_update_mctrl
             cat-122     [000] dn.1.   591.701580: <stack trace>
 => ftrace_call
 => uart_dtr_rts
 => tty_port_block_til_ready
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception
             cat-122     [000] dn.1.   591.701702: serial8250_do_set_mctrl <-serial8250_set_mctrl
             cat-122     [000] dn.1.   591.701885: <stack trace>
 => ftrace_call
 => uart_update_mctrl
 => uart_dtr_rts
 => tty_port_block_til_ready
 => tty_port_open
 => uart_open
 => tty_open
 => chrdev_open
 => do_dentry_open
 => vfs_open
 => path_openat
 => do_filp_open
 => do_sys_openat2
 => __riscv_sys_openat
 => do_trap_ecall_u
 => ret_from_exception

/sys/kernel/tracing # plictool
Source	Pend	Prio	C0(7)	C1(0)
1	yes	1	-	firing
2	-	1	-	-
3	-	1	-	-
4	yes	1	-	-
33	-	0	-	masked

/sys/kernel/tracing # plictool -C 1
0

2024-10-04

To trace which accesses are done to the PLIC, I can use the mmiotracer, which will hopefully record how we configure the PLIC and lead to a reproducer.

CONFIG_MMIOTRACE=y

2024-10-07

Let's move on to the SPEC CPU benchmarks. I compiled them again with the current nixpkgs, but they are too large. I made a "mini" version with only the "speed" and "integer" variants and removing a couple of large benchmarks.

Still, the closure is gigantic, as they are collecting the environment during the build phase and that makes the result depend on the build packages.

2024-10-09

One of the problems with the speccmds.cmd file is that is assumes that it can write the output of the benchmarks in the same place that the binaries are located.

hut% cat benchspec/CPU/602.gcc_s/run/run_base_test_nix-m64.0000/speccmds.cmd
-r
-N C
-C /build/out/benchspec/CPU/602.gcc_s/run/run_base_test_nix-m64.0000
-o t1.opts-O3_-finline-limit_50000.out -e t1.opts-O3_-finline-limit_50000.err ../run_base_test_nix-m64.0000/sgcc_base.nix-m64 t1.c -O3 -finline-limit=50000 -o t1.opts-O3_-finline-limit_50000.s > t1.opts-O3_-finline-limit_50000.out 2>> t1.opts-O3_-finline-limit_50000.err

We can address this problem by modifying the -C ... command and just use -C 602.gcc_s (not sure if it creates it directly). Then we need to modify the ../run... part to use the full path of the binary.

hut% cat speccmds.cmd | sed '/^-C/d'
-r
-N C
-o t1.opts-O3_-finline-limit_50000.out -e t1.opts-O3_-finline-limit_50000.err ../run_base_test_nix-m64.0000/sgcc_base.nix-m64 t1.c -O3 -finline-limit=50000 -o t1.opts-O3_-finline-limit_50000.s > t1.opts-O3_-finline-limit_50000.out 2>> t1.opts-O3_-finline-limit_50000.err

I can create a symlink to the benchmark directory, so it finds it at ../run_base_test_nix-m64.0000.

2024-10-10

Managed to run some:

+ for srcbench in $SPEC/benchspec/CPU/*
++ basename /nix/store/h2by3qxqpzy5b1zszz7wviphv6vy1pjs-spec-cpu-mini-riscv64-unknown-linux-gnu-1.1.7/benchspec/CPU/620.omnetpp_s
+ name=620.omnetpp_s
+ bench=/tmp/spec/620.omnetpp_s
+ rm -rf /tmp/spec/620.omnetpp_s
+ cp -r /nix/store/h2by3qxqpzy5b1zszz7wviphv6vy1pjs-spec-cpu-mini-riscv64-unknown-linux-gnu-1.1.7/benchspec/CPU/620.omnetpp_s /tmp/spec/620.omnetpp_s
+ chmod +w -R /tmp/spec/620.omnetpp_s
+ rundir=/tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000
+ sed -i '/^-C/d' /tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000/speccmds.cmd
+ echo '--- Running 620.omnetpp_s for 1 iterations ---'
--- Running 620.omnetpp_s for 1 iterations ---
+ cd /tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000
+ specinvoke -i 1 -E speccmds.cmd
+ awk '/^run [0-9]* elapsed time/{printf \
    "%s\t%s\t%s\t%s\t%s\n", \
    "620.omnetpp_s","test","base",$2,$7}' /tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000/speccmds.out
+ cat /tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000/time.csv
620.omnetpp_s   test    base    1       1080.495394000
+ cat /tmp/spec/620.omnetpp_s/run/run_base_test_nix-m64.0000/time.csv
+ for srcbench in $SPEC/benchspec/CPU/*
++ basename /nix/store/h2by3qxqpzy5b1zszz7wviphv6vy1pjs-spec-cpu-mini-riscv64-unknown-linux-gnu-1.1.7/benchspec/CPU/631.deepsjeng_s
+ name=631.deepsjeng_s
+ bench=/tmp/spec/631.deepsjeng_s
+ rm -rf /tmp/spec/631.deepsjeng_s
+ cp -r /nix/store/h2by3qxqpzy5b1zszz7wviphv6vy1pjs-spec-cpu-mini-riscv64-unknown-linux-gnu-1.1.7/benchspec/CPU/631.deepsjeng_s /tmp/spec/631.deepsjeng_s
+ chmod +w -R /tmp/spec/631.deepsjeng_s
+ rundir=/tmp/spec/631.deepsjeng_s/run/run_base_test_nix-m64.0000
+ sed -i '/^-C/d' /tmp/spec/631.deepsjeng_s/run/run_base_test_nix-m64.0000/speccmds.cmd
+ echo '--- Running 631.deepsjeng_s for 1 iterations ---'
--- Running 631.deepsjeng_s for 1 iterations ---
+ cd /tmp/spec/631.deepsjeng_s/run/run_base_test_nix-m64.0000
+ specinvoke -i 1 -E speccmds.cmd
[12274.985482] __vm_enough_memory: pid: 661, comm: deepsjeng_s_bas, bytes: 7200002048 not enough memory for the allocation
[12274.998109] __vm_enough_memory: pid: 661, comm: deepsjeng_s_bas, bytes: 7200071680 not enough memory for the allocation
[12275.010615] __vm_enough_memory: pid: 661, comm: deepsjeng_s_bas, bytes: 7200133120 not enough memory for the allocation
bash-5.2# cat /tmp/spec/time.csv
benchmark       size    tune    iter    time_s
600.perlbench_s test    base    1       5326.027877000
602.gcc_s       test    base    1       2.515220000
605.mcf_s       test    base    1       1923.514976000
620.omnetpp_s   test    base    1       1080.495394000

After commenting 631.deepsjeng_s:

[12736.162509] 196608 pages RAM
[12736.165986] 0 pages HighMem/MovableOnly
[12736.171049] 18239 pages reserved
[12736.174892] 4096 pages cma reserved
[12736.179650] Tasks state (memory values in pages):
[12736.184896] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[12736.197219] [    221]     0   221     1090      126      125        1         0    28672        0             0 bash
[12736.209480] [    236]     0   236     1057       84       83        1         0    28672        0             0 bash
[12736.221558] [    642]     0   642     1057       83       83        0         0    28672        0             0 bash
[12736.233637] [    643]     0   643      572       25       24        1         0    28672        0             0 specinvoke
[12736.246203] [    644]     0   644     1057       70       69        1         0    32768        0             0 sh
[12736.258037] [    645]     0   645   131277   105312   105279        1        32   872448        0             0 xz_s_base.nix-m
[12736.271031] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=xz_s_base.nix-m,pid=645,uid=0
[12736.284939] Out of memory: Killed process 645 (xz_s_base.nix-m) total-vm:525108kB, anon-rss:421116kB, file-rss:4kB, shmem-rss:128kB, UID:0 pgtables:852kB oom_score_adj:0
[12749.985238] oom_reaper: reaped process 645 (xz_s_base.nix-m), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
bash-5.2# cat /tmp/spec/time.csv
benchmark       size    tune    iter    time_s
600.perlbench_s test    base    1       5338.291831000
602.gcc_s       test    base    1       2.520710000
605.mcf_s       test    base    1       1923.825657000
620.omnetpp_s   test    base    1       1091.905020000
641.leela_s     test    base    1       767.870615000
648.exchange2_s test    base    1       2815.577807000

Running out of memory on 657.xz_s:

bash-5.2# free -h
               total        used        free      shared  buff/cache   available
Mem:           696Mi        19Mi       405Mi       255Mi       271Mi       411Mi
Swap:             0B          0B          0B

bash-5.2# df -h
Filesystem                   Size  Used Avail Use% Mounted on
devtmpfs                      35M     0   35M   0% /dev
tmpfs                        349M     0  349M   0% /dev/shm
tmpfs                        175M   56K  175M   1% /run
tmpfs                        349M     0  349M   0% /run/wrappers
/dev/disk/by-label/NIXOS_SD  2.3G  1.9G  275M  88% /
nodev                        349M  256M   93M  74% /tmp

bash-5.2# du -sh /tmp/spec
256M    /tmp/spec

Let's comment it for now.

We may want to start running the tests in the CI, so I can read the logs there. There are some operations we need to do on the FS before running the tests:

bash-5.2# mkdir /tmp
bash-5.2# mount -t tmpfs nodev /tmp
bash-5.2# mkdir /bin
bash-5.2# mkdir /root
bash-5.2# ln -s $(which sh) /bin/sh
bash-5.2# export  TMPDIR=/tmp
bash-5.2# speclaunch

So, let's prepare a script that runs the SPEC mini.

The first benchmark to run is 600.perlbench_s which seems to take 5338 seconds (1.5 h) to run. I configured the pipeline to stop as soon as we have 2 h of silence, but after 150 minutes (2.5 h) of execution time it has not finished yet. Not sure if something is wrong now. Maybe I can run vmstat a few times and see the mount points to check everything is correct.

I may want to also increase the RAM available, so we can potentially run other benchmarks too.

At some point we may want to be able to specify the bootcmd from fpgactl directly.

2024-10-11

Another successful execution of SPEC mini:

https://pm.bsc.es/gitlab/rarias/nixos-riscv/-/jobs/34518

benchmark        size  tune  iter  time_s
600.perlbench_s  test  base  1     5380.726590000
602.gcc_s        test  base  1     2.525468000
605.mcf_s        test  base  1     1927.921307000
620.omnetpp_s    test  base  1     1110.200756000
641.leela_s      test  base  1     779.333069000
648.exchange2_s  test  base  1     2916.464893000