60 KiB
2024-07-02
Aleix provided some notes on how to trace the kernel using tracepoints that are printed to the console:
BOOT TIME TRACING
- compile with CONFIG_BOOTTIME_TRACING=y
- add to kernel parameters:
trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M ftrace=function ftrace_filter="vfs*"
trace_options=sym-addr trace_event=sched:* tp_printk trace_buf_size=1M
- tp_printk sends tracepoint (TRACE_EVENT) to printk buffer
- trace_event=[subsytem:event]; accepts
- *:<event_name>
- <subsystem>:*
- *:* all
- many more options at https://www.kernel.org/doc/html/latest/trace/boottime-trace.html
Start with:
trace_options=sym-addr trace_event=sched:* tp_printk trace_buf_size=1M loglevel=7
Can it be caused by the D extension? If I set the ISA to:
riscv,isa = "rv64imaf";
It locks the stage1 script without providing any output. Let see trying again. Locked again.
I can disable the FPU in the kernel, and then it will act as a detector of any floating point instruction.
Then I just need to rebuild busybox without support for double instructions.
Let see if I can build busybox with double instruction
Nope, the u-boot is reporting the d extension is in the isa:
riscv,isa = "rv64imafd";
2024-07-03
I cannot switch to gcc.arch = rv64ima
because rust fails to build.
Assumption: The extensions F and D work well and don't cause a hang in the CPU.
Let's go back and try to get the initrd shell, so we can systematically hang it
in the switch_root
OBSERVATION: The riscv-timer seems to be causing interrupts with IRQ 5:
[ 62.439060] irq_handler_entry: irq=5 name=riscv-timer
[ 62.444980] irq_handler_exit: irq=5 ret=handled
OBSERVATION: Rohan reports the serial startup routine being running after
the init begins.
OBSERVATION: Only interrupts in timer, others are zero.
With:
commit 4c656bd8ddd2e41ccaa976ff8d6bd9209175a632 (HEAD -> lagarto-ox)
Author: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Date: Wed Jul 3 13:21:04 2024 +0200
Add busybox patch to see debug lines
The switch_root command seems to hang in the execv() syscall.
I can see this:
~ # cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
CPU0
10: 42926 RISC-V INTC 5 Edge riscv-timer
IPI0: 0 Rescheduling interrupts
IPI1: 0 Function call interrupts
IPI2: 0 CPU stop interrupts
IPI3: 0 CPU stop (for crash dump) interrupts
IPI4: 0 IRQ work interrupts
IPI5: 0 Timer broadcast interrupts
CPU0
10: 46023 RISC-V INTC 5 Edge riscv-timer
IPI0: 0 Rescheduling interrupts
IPI1: 0 Function call interrupts
IPI2: 0 CPU stop interrupts
IPI3: 0 CPU stop (for crash dump) interrupts
IPI4: 0 IRQ work interrupts
IPI5: 0 Timer broadcast interrupts
OBSERVATION: There is a timer configured in 0x40170000 but in the device
tree we only have one at timer@40002000
.
#define OX_ALVEO_TIMER_BASE 0x40170000
#define ADDR_TIME_L 0x0u // 32 lower bits of the time register
#define ADDR_TIME_H 0x1u // 32 higher bits of the time register
#define ADDR_TIMECMP_L 0x2u // 32 lower bits of the time comparator
#define ADDR_TIMECMP_H 0x3u // 32 higher bits of the time comparator
https://gitlab.bsc.es/hwdesign/bsc-linux/-/blob/d6d194bd30d9a8fe49c2a278ffb3c3ae7852e75d/bsc_tree/patches/ox_alveo/opensbi/0001-opensbi-ox_alveo-platform.patch#L63
OBSERVATION: When the serial console starts, the speed of the serial port
changes to 9600:
[ 6.845400] io scheduler mq-deadline registered
[ 6.851500] io scheduler kyber registered
[ 17.644460] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[ 18.141160] printk: console [ttyS0] disabled
[ 18.229480] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 11, base_baud = 3125000) is a 16550
*** baud: 230400 ***
*** baud: 460800 ***
*** baud: 500000 ***
*** baud: 576000 ***
*** baud: 500000 ***
*** baud: 460800 ***
*** baud: 230400 ***
*** baud: 115200 ***
*** baud: 57600 ***
*** baud: 38400 ***
*** baud: 19200 ***
faaa0?xx
<ffa
fx3fÆf
ff
~3xf03faaa0?~3<3<`<~<f<`<憆3~fÆf̆x3fff
*** baud: 9600 ***
*** baud: 19200 ***
*** baud: 38400 ***
xxxxxx<xx<xx<xx<xxxxxxxx
*** baud: 57600 ***
*** baud: 38400 ***
xxxxxxx<x<xx<xx<xx<xx<x<xxxxxxx<xxx<x<x<xxx<xxxxxx<xx<xxxxx<x<xxxxxxxxxxxxxx<xxxxxxxx<xxxxx<x
*** baud: 19200 ***
faaa??~3fx3x3憆`?3f
ff3f<ffa<xÞ?<?f<x<x<`<`<f<x??`?`<怘
*** baud: 9600 ***
[ 77.929980] stage-1-init: [Thu Jan 1 00:01:17 UTC 1970] + ln -sfn /proc/self/fd/1 /dev/stdout
[ 79.258380] stage-1-init: [Thu Jan 1 00:01:18 UTC 1970] + ln -sfn /proc/self/fd/2 /dev/stderr
[ 80.212360] stage-1-init: [Thu Jan 1 00:01:19 UTC 1970] + mkdir -p /etc/systemd
[ 81.224900] stage-1-init: [Thu Jan 1 00:01:20 UTC 1970] + ln -sfn /nix/store/r4ycv0ymw9wrnalh0sdgkp39nwy3szqz-link-units /etc/systemd/network
[ 82.270000] stage-1-init: [Thu Jan 1 00:01:21 UTC 1970] + mkdir -p /etc/udev
[ 83.249640] stage-1-init: [Thu Jan 1 00:01:22 UTC 1970] + ln -sfn /nix/store/lx1q7j61dldj8bv0p33v42q0nldb6sd7-udev-rules /etc/udev/rules.d
[ 84.304940] stage-1-init: [Thu Jan 1 00:01:23 UTC 1970] + mkdir -p /dev/.mdadm
[ 85.221060] stage-1-init: [Thu Jan 1 00:01:24 UTC 1970] + systemd-udevd --daemon
[ 86.157320] stage-1-init: [Thu Jan 1 00:01:25 UTC 1970] Starting systemd-udevd version 254.3
[ 87.130060] stage-1-init: [Thu Jan 1 00:01:26 UTC 1970] + udevadm trigger '--action=add'
*** break sent ***
[ 158.281900] stage-1-init: [Thu Jan 1 00:02:37 UTC 1970] + udevadm settle
*** break sent ***
*** break sent ***
[ 214.524500] stage-1-init: [Thu Jan 1 00:03:33 UTC 1970] + kbd_mode -u -C /dev/console
[ 215.392960] stage-1-init: [Thu Jan 1 00:03:34 UTC 1970] kbd_mode: KDSKBMODE: Inappropriate ioctl for device
That was my mistake as I need to put the baud speed in the ttyS0, like this:
console=ttyS0,115200n8
OBSERVATION: Trying to read from the serial console /dev/ttyS0 causes no
more messages in the console (or a hang).
QUESTION: Can we make a heartbeat for the kernel?
The idea is to keep a counter in some memory of the kernel so we can see it from the host being moved.
QUESTION: Can we disable the serial driver 8250 from loading?
initcall_blacklist=<driver_init>
I need to know the 8250 init function name:
drivers/tty/serial/8250/8250_core.c:static int __init serial8250_init(void)
So...
initcall_blacklist=serial8250_init
Yes, but that doesn't seem to do anything. It is hanging:
+ modprobe dm_mod
[ 627.473580] stage-1-init: [Thu Jan 1 00:10:26 UTC 1970] + echo init /nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[ 628.249440] stage-1-init: [Thu Jan 1 00:10:27 UTC 1970] + set -- init /nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[ 629.004840] stage-1-init: [Thu Jan 1 00:10:28 UTC 1970] + stage2Init=/nix/store/v6pi2mqfgshxdsbyxlvpm9nvawxrpijv-nixos-system-nixos-riscv-23.11pre-git/init
[ 629.733920] stage-1-init: [Thu Jan 1 00:10:29 UTC 1970] + echo /nix/store/snvvqpxmryw1szlllk0bxpm37p8vj8sw-extra-utils/bin/modprobe
QUESTION: What happens if we remap the interruptions?
- Move the serial from 0 to 1
- Move the plic from 3 to 2 and remove 7
Now we have one context only:
[ 0.000000] riscv-intc: 64 local interrupts mapped
[ 0.000000] plic: plic@40800000: mapped 3 interrupts with 0 handlers for 1 contexts.
[ 0.000000] riscv: providing IPIs using SBI IPI extension
Rather than two:
[ 0.000000] riscv-intc: 64 local interrupts mapped
[ 0.000000] plic: plic@40800000: mapped 3 interrupts with 0 handlers for 2 contexts.
[ 0.000000] riscv: providing IPIs using SBI IPI extension
QUESTION: What happens if we block the sbi_ipi
driver?
initcall_blacklist=sbi_ipi_init
Nothing, it cannot be disabled it seems. I will remove SMP support so it won't be compiled in.
OBSERVATION: Searching for 'riscv,plic0' only matches irq-sifive-plic driver.
hut% rg 'riscv,plic0'
Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml
72: - const: riscv,plic0
drivers/irqchip/irq-sifive-plic.c
572:IRQCHIP_DECLARE(riscv_plic0, "riscv,plic0", plic_init); /* for legacy systems */
So it looks that the only driver that setups the plic is the one used by SiFive. Here is the doc: https://static.dev.sifive.com/U54-MC-RVCoreIP.pdf
OBSERVATION: The number of handlers is 0, so there are no interruptions.
It seems the number next to the phandle of the interrupts-extended attribute in the plic follows a different convention of values. Using 9 and 11:
plic: plic@40800000: mapped 3 interrupts with 1 handlers for 2 contexts.
Remark: The key combination to run Magic SysRq using the HVC console is Ctrl-O and then the SysRq key. It only works it the console is being actively polled, otherwise it hangs.
2024-07-04
OBSERVATION: I saw they changed this option in Cinco Ranch DTS for the
serial:
reg-shift = <0>; // regs are spaced on 8 bit boundary (modified from Xilinx UART16550 to be ns16550 compatible)
Tested booting with debug1 and the ttyS0 console, and it goes extremely slow (but still outputs at 115200) and then continues to fail to read keyboard input.
QUESTION: Let's try setting the console in poll mode.
setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=uart,io,0x40001000,115200n8 boot.trace console=uart,io,0x40001000,115200n8 debug1 init=/nix/store/wavmnv6wjj8y10ha07wxd5f0sqacivj8-nixos-system-nixos-riscv-23.11pre-git/init"
[ 5.909360] io scheduler mq-deadline registered
[ 5.914900] io scheduler kyber registered
[ 14.405980] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[ 14.856040] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 2, base_baud = 3125000) is a 16550
[ 15.059680] nd_pmem namespace0.0: unable to guarantee persistence of writes
...
[ 40.250100] clk: Disabling unused clocks
[ 40.256220] Warning: unable to open an initial console. <---- SEE THIS
[ 40.618300] Freeing unused kernel image (initmem) memory: 5592K
[ 40.728300] Checked W+X mappings: passed, no W+X pages found
[ 40.735540] Run /init as init process
[ 40.740080] with arguments:
[ 40.743920] /init
[ 40.746660] with environment:
[ 40.750740] HOME=/
[ 40.754020] TERM=linux
[ 46.569960] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[ 46.578100] CPU: 0 PID: 1 Comm: init Not tainted 6.6.1 #1-NixOS
[ 46.584600] Hardware name: Barcelona Supercomputing Center - Lagarto Ox (NixOS) (DT)
[ 46.592740] Call Trace:
[ 46.595640] [<ffffffff80006688>] dump_backtrace+0x38/0x48
[ 46.601760] [<ffffffff80a96a04>] show_stack+0x50/0x68
[ 46.607540] [<ffffffff80aa3f74>] dump_stack_lvl+0x3c/0x5c
[ 46.613660] [<ffffffff80aa3fb4>] dump_stack+0x20/0x30
[ 46.619400] [<ffffffff80a970dc>] panic+0x158/0x374
[ 46.624900] [<ffffffff8001c18c>] do_exit+0x9e8/0x9f0
[ 46.630580] [<ffffffff8001c3b0>] do_group_exit+0x44/0xb0
[ 46.636600] [<ffffffff8001c454>] __wake_up_parent+0x0/0x40
[ 46.642780] [<ffffffff80aa4eac>] do_trap_ecall_u+0x14c/0x168
[ 46.649140] [<ffffffff800035ec>] ret_from_exception+0x0/0xac
[ 46.655500] Kernel Offset: 0x0 from 0xffffffff80000000
[ 46.661160] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---
Also found: no_console_suspend
OBSERVATION: There are messages of address space being assigned to
registers:
Slave segment '/MEEP_uart_0/S_AXI/Reg' is being assigned into address space '/m_axi_uart0' at <0x0000_0000 [ 4K ]>.
Slave segment '/MEEP_uart_1/S_AXI/Reg' is being assigned into address space '/m_axi_uart1' at <0x0000_0000 [ 4K ]>.
QUESTION: What happens if I enable CONFIG_CONSOLE_POLL
?
With console=ttyS0,115200n8 debug1
I cannot type.
OBSERVATION: I can dump iomem memory with the tool devmem:
But it seems I cannot dump the registers of the serial io mapped region:
~ # cat /proc/iomem
40001000-400010ff : serial
60000000-7fffffff : Reserved
80000000-ffefffff : System RAM
80201000-81fa0b87 : Kernel image
80201000-80cb177f : Kernel code
81400000-819fffff : Kernel rodata
81c00000-81f18747 : Kernel data
81f19000-81fa0b87 : Kernel bss
100000000-1bfffffff : namespace0.0
~ # devmem 0x40001000
devmem: mmap: Operation not permitted
It looks like the following options may be required to be disabled to allow user-space tools read those regions.
~ # zcat /proc/config.gz | grep CONFIG_STRICT_DEVMEM
CONFIG_STRICT_DEVMEM=y
~ # zcat /proc/config.gz | grep CONFIG_IO_STRICT_DEVMEM
CONFIG_IO_STRICT_DEVMEM=y
Let's try disabling CONFIG_STRICT_DEVMEM
and see if we can read the serial
registers.
It works!
~ # cat /proc/iomem
40001000-400010ff : serial
60000000-7fffffff : Reserved
80000000-ffefffff : System RAM
80201000-81fa0b87 : Kernel image
80201000-80cb159f : Kernel code
81400000-819fffff : Kernel rodata
81c00000-81f18707 : Kernel data
81f19000-81fa0b87 : Kernel bss
100000000-1bfffffff : namespace0.0
~ # devmem 0x40001000
0x0000006E
~ # devmem 0x40001000
0x0000006C
~ # devmem 0x40001000
0x00000072
~ # devmem 0x40001000
0x0000000D
OBSERVATION: The interrupt register of the serial console is 0x0:
Assuming the console registers follow AXI UART 16550, here is the IER:
0x1004 IER R/W Interrupt Enable Register
Which is zero:
~ # devmem 0x40001004 0x00000000
The line control register is 0x3:
~ # devmem 0x4000100C 0x00000003
QUESTION: Can I write to some memory address and see the result from the host?
For that I would need to find some address that is mapped to the DMA or to the
pmem. Xavi recommended 0x6000_0000
as it is uncached.
It seems to have some content already:
~ # devmem 0x60000000
0x00000093
~ # devmem 0x60000004
0x00000113
~ # devmem 0x60000008
0x00000193
~ # devmem 0x60000010
0x00000293
~ # devmem 0x60000014
0x00000313
~ # devmem 0x60000018
0x00000393
Writing test seems to work:
~ # devmem 0x60000000 32 0x11223344
~ # devmem 0x60000000
0x11223344
So, technically we should be using the /dev/qdma34000-MM-0
device, as we
already use the other one to map the memory.
[bsc015557@fpgan02 nixos]$ for i in {0..16}; do addr=$(($i * 0x10000000)); \
printf "addr 0x%09x: " $addr; dd if=/dev/qdma34000-MM-0 count=16 bs=1 skip=$addr 2>/dev/null | xxd; done
addr 0x000000000: 00000000: 4444 4444 4444 4444 4444 4444 4444 4444 DDDDDDDDDDDDDDDD
addr 0x010000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x020000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333 3333333333333333
addr 0x030000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc ................
addr 0x040000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc ................
addr 0x050000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x060000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x070000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x080000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333 3333333333333333
addr 0x090000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x0a0000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x0b0000000: 00000000: 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f ////////////////
addr 0x0c0000000: 00000000: 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e nnnnnnnnnnnnnnnn
addr 0x0d0000000: 00000000: 2020 2020 2020 2020 2020 2020 2020 2020
addr 0x0e0000000: 00000000: 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c llllllllllllllll
addr 0x0f0000000: 00000000: 6767 6767 6767 6767 6767 6767 6767 6767 gggggggggggggggg
addr 0x100000000: 00000000: 2424 2424 2424 2424 2424 2424 2424 2424 $$$$$$$$$$$$$$$$
[bsc015557@fpgan02 nixos]$ for i in {0..16}; do addr=$(($i * 0x10000000)); \
printf "addr 0x%09x: " $addr; dd if=/dev/qdma34000-MM-1 count=16 bs=1 skip=$addr 2>/dev/null | xxd; done
addr 0x000000000: 00000000: 4444 4444 4444 4444 4444 4444 4444 4444 DDDDDDDDDDDDDDDD
addr 0x010000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x020000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333 3333333333333333
addr 0x030000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc ................
addr 0x040000000: 00000000: cccc cccc cccc cccc cccc cccc cccc cccc ................
addr 0x050000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x060000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x070000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x080000000: 00000000: 3333 3333 3333 3333 3333 3333 3333 3333 3333333333333333
addr 0x090000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x0a0000000: 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
addr 0x0b0000000: 00000000: 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f 2f2f ////////////////
addr 0x0c0000000: 00000000: 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e 6e6e nnnnnnnnnnnnnnnn
addr 0x0d0000000: 00000000: 2020 2020 2020 2020 2020 2020 2020 2020
addr 0x0e0000000: 00000000: 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c 6c6c llllllllllllllll
addr 0x0f0000000: 00000000: 6767 6767 6767 6767 6767 6767 6767 6767 gggggggggggggggg
addr 0x100000000: 00000000: 2424 2424 2424 2424 2424 2424 2424 2424 $$$$$$$$$$$$$$$$
None of the two queues seem to have the value 0x11223344 at any multiple of 0x1000_0000
.
Let's verify first that this method works. The kernel is loaded here:
[bsc015557@fpgan02 nixos]$ printf '0x%x\n' $FPGACTL_KERNEL_ADDR
0x24000000
So we should see the same values as the kernel file:
[bsc015557@fpgan02 nixos]$ dd if=kernel.bin count=16 bs=1 2>/dev/null| xxd
00000000: 6f00 400d 0000 0000 0000 2000 0000 0000 o.@....... .....
But we don't see the same:
[bsc015557@fpgan02 nixos]$ dd if=/dev/qdma34000-MM-1 count=16 bs=1 skip=$FPGACTL_KERNEL_ADDR 2>/dev/null | xxd
00000000: 9797 9797 9797 9797 9797 9797 9797 9797 ................
[bsc015557@fpgan02 nixos]$ dd if=/dev/qdma34000-MM-0 count=16 bs=1 skip=$FPGACTL_KERNEL_ADDR 2>/dev/null | xxd
00000000: 9797 9797 9797 9797 9797 9797 9797 9797 ................
QUESTION: Missing forward M to S via Mideleg?
Can it be happening that he MEDELEG is not forwarding the interruptions to the Supervisor (kernel)?
Boot HART MIDELEG : 0x0000000000000222
Boot HART MEDELEG : 0x000000000000b109
QUESTION: Can we add a timer to the PLIC to test the interrupts?
OBSERVATION: Here is the PLIC register dump:
~ # for i in `seq 0 16`; do addr=$((0x40600000 + $i)); printf '%08x: ' $addr; devmem $addr; done
40600000: 0x00010002
40600001: 0x09000000
40600002: 0x00090000
40600003: 0x00000900
40600004: 0x00010009
40600005: 0x00000000
40600006: 0x00000000
40600007: 0x00000000
40600008: 0x00000000
40600009: 0x00000000
4060000a: 0x00000000
4060000b: 0x00000000
4060000c: 0x00000000
4060000d: 0x00000000
4060000e: 0x00000000
4060000f: 0x00000000
40600010: 0x00000000
QUESTION: Can we boot with the new bitstream that includes the second UART?
The interruptions are enabled for the UART 1, not the default UART 0.
OBSERVATION: I'm using 0x100 not 0x1000 in the serial range:
reg = <0x0 0x40003000 0x0 0x100>;
reg = <0x0 0x40003000 0x0 0x1000>;
Can this produce any problem?
It doesn't seem to change anything, still unable to send any bytes.
QUESTION: Can we use virtio to mount a FS in the DMA shared memory?
2024-07-05
OBSERVATION: The kernel continues working when the console hangs.
Switching to 0x100000000 as 0x60000000 shows:
~ # devmem 0x6000000
0xBADCAB1E
With the following loop:
~ # i=0; while [ 1 ]; do let i=$i+1; devmem 0x100000000 32 $i; done &
~ # cat /dev/ttyS0
(hangs)
Shows the kernel works:
[bsc015557@fpgan02 nixos]$ while [ 1 ]; do xxd -s $((0x100000000 - 0x60000000)) -l 4 /dev/qdma34000-MM-1; sleep 0.2; done
...
a0000000: 6400 0000 d...
a0000000: 6500 0000 e...
a0000000: 6600 0000 f...
a0000000: 6700 0000 g...
a0000000: 6800 0000 h...
a0000000: 6900 0000 i...
QUESTION: Can we reproduce it with switch_root
?
For that I would have to ensure the process continues to operate, even if we exit the console. Maybe I can make a double fork?
I cannot use 0x1000_0000
as that is where the pmem will be. But I can try to
use an address in the end, as we are not filling the whole space.
[0x1_0000_0000, 0x1_c000_0000) -> PMEM (3072 MiB)
Maybe 0x1_bfff_0000
? Let's try first from the initrd shell.
i=0; while [ 1 ]; do let i=$i+1; devmem 0x1bfff0000 32 $i; done &
Then
while [ 1 ]; do xxd -s $((0x1bfff0000 - 0x60000000)) -l 4 /dev/qdma34000-MM-1; sleep 0.2; done
Yes, it seems to be working. Let's load the rootfs too.
I added a loop in the stage1 script.
QUESTION: Can we see any clock in memory?
This will allow us to check if the AXI still works.
OBSERVATION: The kernel stops updating the counter in the mount phase.
Managed to reach the mount and hang there:
[ 337.504740] stage-1-init: [Thu Jan 1 00:05:36 UTC 1970] + '[' -d
/dev/disk/by-label/NIXOS_SD ]
[ 338.284560] stage-1-init: [Thu Jan 1 00:05:37 UTC 1970] + mkdir -m 0755 -p
/mnt-root/
[ 339.017420] stage-1-init: [Thu Jan 1 00:05:38 UTC 1970] + local 'n=0'
[ 339.752560] stage-1-init: [Thu Jan 1 00:05:39 UTC 1970] + true
[ 340.488960] stage-1-init: [Thu Jan 1 00:05:39 UTC 1970] + mount /mnt-root/
After almost 6 minutes, with 571 beats:
$ xxd -s 5905514496 -l 4 /dev/qdma34000-MM-1
5fff0000: 3b02 0000 ;...
It looks like the kernel is the one getting stuck or at least is unable to propagate the heartbeat changes to the host. It would be nice to monitor a hardware clock from the DMA region too, so we can discard problems in the AXI.
OBSERVATION: There is an ioctl failed for /dev/console
[ 177.009540] stage-1-init: [Thu Jan 1 00:02:56 UTC 1970] + udevadm settle
+ kbd_mode -u -C /dev/console
kbd_mode: KDSKBMODE: Inappropriate ioctl for device
+ printf '\033%%G'
+ loadkmap
[ 266.301040] stage-1-init: [Thu Jan 1 00:04:25 UTC 1970] + kbd_mode -u -C /dev/console
ASSUMPTION: The kernel hangs.
If the kernel hangs, there must be an instruction or sequence of instructions
that causes it. First I need to determine what is being executed by the kernel.
For that I could use ftrace
to see which program is running at the time it
hangs.
trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M
(prev_comm != 2 && next_comm != 2)
So, we can just enable the tp_printk
but not the tracer. Then in the initrd
script, I enable the function tracer and the filter.
OBSERVATION: It takes a long time to init the pty:
Interesting timing:
[ 12.612620] initcall_start: func=pty_init+0x0/0x3f4
[ 20.962640] initcall_finish: func=pty_init+0x0/0x3f4 ret=0
OBSERVATION: The kcompactd0 daemon is using the CPU:
[ 290.394920] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=init next_pid=69 next_prio=120
[ 290.408160] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[ 290.420720] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[ 290.433960] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[ 290.447100] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120
[ 290.460180] sched_switch: prev_comm=kcompactd0 prev_pid=22 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[ 290.473400] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kworker/u2:2 next_pid=19 next_prio=120
[ 290.486960] sched_switch: prev_comm=kworker/u2:2 prev_pid=19 prev_prio=120 prev_state=R ==> next_comm=khungtaskd next_pid=18 next_prio=120
[ 290.500800] sched_switch: prev_comm=khungtaskd prev_pid=18 prev_prio=120 prev_state=R ==> next_comm=kworker/0:1 next_pid=13 next_prio=120
[ 290.514560] sched_switch: prev_comm=kworker/0:1 prev_pid=13 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[ 290.527720] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=init next_pid=69 next_prio=120
[ 290.540360] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=devmem next_pid=129 next_prio=120
[ 290.553280] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[ 290.566780] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120
[ 290.580500] sched_switch: prev_comm=kcompactd0 prev_pid=22 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[ 290.593740] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[ 290.606340] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kworker/u2:2 next_pid=19 next_prio=120
[ 290.619780] sched_switch: prev_comm=kworker/u2:2 prev_pid=19 prev_prio=120 prev_state=R ==> next_comm=khungtaskd next_pid=18 next_prio=120
[ 290.633620] sched_switch: prev_comm=khungtaskd prev_pid=18 prev_prio=120 prev_state=R ==> next_comm=tee next_pid=68 next_prio=120
[ 290.646700] sched_switch: prev_comm=tee prev_pid=68 prev_prio=120 prev_state=R+ ==> next_comm=init next_pid=69 next_prio=120
[ 290.659320] sched_switch: prev_comm=init prev_pid=69 prev_prio=120 prev_state=R ==> next_comm=kworker/0:1 next_pid=13 next_prio=120
[ 290.672560] sched_switch: prev_comm=kworker/0:1 prev_pid=13 prev_prio=120 prev_state=R ==> next_comm=devmem next_pid=129 next_prio=120
[ 290.686080] sched_switch: prev_comm=devmem prev_pid=129 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=12 next_prio=120
[ 290.699720] sched_switch: prev_comm=ksoftirqd/0 prev_pid=12 prev_prio=120 prev_state=R ==> next_comm=init next_pid=1 next_prio=120
[ 290.712880] sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=R ==> next_comm=khvcd next_pid=31 next_prio=120
[ 290.725500] sched_switch: prev_comm=khvcd prev_pid=31 prev_prio=120 prev_state=R ==> next_comm=kcompactd0 next_pid=22 next_prio=120
QUESTION: Can we reproduce this hang with 6.9.7?
Disabling clang as it is failing to build:
hut% nix develop '.#lagarto-ox'
error: builder for '/nix/store/x1nfa792pv28px70kvfakm3aalcfbdyw-clang-epi-479518d.drv' failed with exit code 2;
last 10 log lines:
> | ^~~~~~~~~~~~~~~
> 2 errors generated.
> make[2]: *** [lib/Support/CMakeFiles/LLVMSupport.dir/build.make:1868: lib/Support/CMakeFiles/LLVMSupport.dir/Signals.cpp.o] Error 1
> make[2]: *** Waiting for unfinished jobs....
> [ 9%] Built target obj.clang-tblgen
> 1 warning generated.
> make[1]: *** [CMakeFiles/Makefile2:9468: lib/Support/CMakeFiles/LLVMSupport.dir/all] Error 2
> make[1]: *** Waiting for unfinished jobs....
> [ 9%] Built target obj.llvm-tblgen
> make: *** [Makefile:156: all] Error 2
For full logs, run 'nix log /nix/store/x1nfa792pv28px70kvfakm3aalcfbdyw-clang-epi-479518d.drv'.
error: 1 dependencies of derivation '/nix/store/m54sxxyi3cg062djrcddcawp10z7r49l-riscv64-unknown-linux-gnu-clang-epi-wrapper-479518d.drv' failed to build
error: 1 dependencies of derivation '/nix/store/yl26dbqqj0snl807c0wjabg4dpbq5gvp-stdenv-linux.drv' failed to build
error: 1 dependencies of derivation '/nix/store/h6180fcl30kqy3apaqjsbkkik2p1spmr-rvb-riscv64-unknown-linux-gnu-da202d6.drv' failed to build
error: 1 dependencies of derivation '/nix/store/b13shgqj7128rdsdzzp4qicqbzl0wnfw-system-path.drv' failed to build
error: 1 dependencies of derivation '/nix/store/6qghlihqcyg6155309ldj5xm9m0v835i-nixos-system-nixos-riscv-24.11pre-git.drv' failed to build
error: 1 dependencies of derivation '/nix/store/l2x18cih29r1kn6vi8imwhkyk98yhw4i-nix-shell-riscv64-unknown-linux-gnu-env.drv' failed to build
QUESTION: Missing cache information may affect?
Other CPUs report the cache details in the DT. For example this one https://github.com/torvalds/linux/blob/master/arch/riscv/boot/dts/sifive/fu540-c000.dtsi#L45
cpu1: cpu@1 {
compatible = "sifive,u54-mc", "sifive,rocket0", "riscv";
d-cache-block-size = <64>;
d-cache-sets = <64>;
d-cache-size = <32768>;
d-tlb-sets = <1>;
d-tlb-size = <32>;
device_type = "cpu";
i-cache-block-size = <64>;
i-cache-sets = <64>;
i-cache-size = <32768>;
i-tlb-sets = <1>;
i-tlb-size = <32>;
mmu-type = "riscv,sv39";
reg = <1>;
riscv,isa = "rv64imafdc";
riscv,isa-base = "rv64i";
riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
"zifencei", "zihpm";
tlb-split;
next-level-cache = <&l2cache>;
cpu1_intc: interrupt-controller {
#interrupt-cells = <1>;
compatible = "riscv,cpu-intc";
interrupt-controller;
};
};
We may want to add it to our DT to be sure that it has no effect.
OBSERVATION: Arrived to stage 2!
+ kill -9 74
+ readlink /proc/75/exe
[ 374.961120] stage-1-init: [Thu Jan 1 00:06:14 UTC 1970] + test 0 -ge 8
+ '[' 75 -eq 1 ]
+ kill -9 75
+ readlink /proc/102/exe
+ '[' 102 -eq 1 ]
+ kill -9 102
+ readlink /proc/137/exe
+ continue
+ readlink /proc/674/exe
+ continue
+ readlink /proc/675/exe
+ continue
+ test -n
+ echo /sbin/modprobe
+ '[' '!' -e /mnt-root//nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git/init ]
+ mkdir -m 0755 -p /mnt-root/proc /mnt-root/sys /mnt-root/dev /mnt-root/run
+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/988axh0bq3wqp90gms4b4a0hkfwvkd3i-extra-utils/bin/switch_root /mnt-root /nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git/init
<<< NixOS Stage 2 >>>
[ 384.203680] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[ 384.287600] booting system configuration /nix/store/xwqaqpc66ijvay9wxnm5nqmi30f2lp1i-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[ 388.163860] stage-2-init: running activation script...
[ 391.643500] random: perl: uninitialized urandom read (4 bytes read)
[ 391.884800] random: perl: uninitialized urandom read (4 bytes read)
[ 425.302000] random: perl: uninitialized urandom read (4 bytes read)
But then it hangs.
2024-07-08
QUESTION: Who sets the plic interrupts?
Shouldn't OpenSBI read the DT and do some configuration in the PLIC while in machine mode?
OBSERVATION: Semi-stack trace from CincoRanch
hvc_remove?
console_unlock <-- only called from hvc_remove()
prb_read_valid
desc_read_finalized_seq
__memcpy (multiple times)
get_data
do_trap_break
report_bug
is_valid_bugaddr
copy_from_kernel_nofault
copy_from_kernel_nofault_allowed
find_bug
_printk
vprintk
vprintk_default
vprintk_emit
vprintk_store
sched_clock
vsnprintf
format_decode
__memcpy
printk_parse_prefix
prb_reserve
do_page_fault
fixup_exception
search_exception_tables
search_extable
cmp_ex_search (multiple times)
search_module_extables
__module_address
no_context.part.0
die_kernel_fault <-- last frame(?)
QUESTION: Can we place a trace point in hvc_remove
?
If we are getting stuck in the same place, we should be able to see the backtrace (assuming the console still works) just before we try to remove the console device.
Placed, but still unable to see anything in any hang. Here is a hang in the Stage 2:
<<< NixOS Stage 2 >>>
[ 404.158340] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[ 404.242500] booting system configuration /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[ 408.148380] stage-2-init: running activation script...
[ 411.612240] random: perl: uninitialized urandom read (4 bytes read)
[ 411.866440] random: perl: uninitialized urandom read (4 bytes read)
[ 447.588880] random: perl: uninitialized urandom read (4 bytes read)
Still, it may be hang in a similar way, causing a loop of page faults just while trying to printk to the console, which would explain why we don't see anything and why the heartbeat stops.
Fran has created another bitstream with two consoles enabled
(ox_u55c_a234c132.bit
), let see if I manage to boot with it.
First I will need to enable the consoles on the DTS.
It doesn't seem to produce any output in the UART. I cannot see OpenSBI while loading it on each baud rate:
Type [C-a] [C-h] to see available commands
Terminal ready
*** baud: 57600 ***
*** baud: 38400 ***
*** baud: 19200 ***
*** baud: 9600 ***
[...]
*** baud: 230400 ***
*** baud: 460800 ***
Let's keep the bitstream files in a repository, so I can carefully track them with git too.
OBSERVATION: The new bitstream requires a bootrom to start
I added it to the bitstream repository, as it is a binary blob too. Now I need to update the load addresses:
6a63bcea6d/boot_riscv/boot_sa.sh (L36-40)
Continues to hang just after those perl messages:
+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/988axh0bq3wqp90gms4b4a0hkfwvkd3i-extra-utils/bin/switch_root /mnt-root /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git/init
<<< NixOS Stage 2 >>>
[ 541.559320] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[ 541.641280] booting system configuration /nix/store/0za1vqh5alk7mxqs59qxx8izmwmf21w6-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[ 545.569700] stage-2-init: running activation script...
[ 549.019380] random: perl: uninitialized urandom read (4 bytes read)
[ 549.274940] random: perl: uninitialized urandom read (4 bytes read)
QUESTION: What happens if we disable CONFIG_BUG
May be a long shot, but if we are experiencing the same page fault loop as in cincoranch we may as well try.
2024-07-09
QUESTION: Maybe we can try without out-of-order?
I made a small tool in C to view and change the CSR register that controls the in-order/out-of-order. Maybe we can try with the "in-order" setting.
We arrive to execute systemd
:
Starting interactive shell...
+ setsid /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash -c 'exec /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash < /dev/hvc0 >/dev/hvc0 2>/dev/hvc0'
[ 90.077300] stage-1-init: [Thu Jan 1 00:01:27 UTC 1970] + '[' -n 1 -a i '=' f ]
[ 90.639760] stage-1-init: [Thu Jan 1 00:01:28 UTC 1970] + '[' -n 1 -a i '=' i ]
~ # [ 90.967260] stage-1-init: [Thu Jan 1 00:01:28 UTC 1970] + echo 'Starting interactive shell...'
[ 91.234980] stage-1-init: [Thu Jan 1 00:01:28 UTC 1970] Starting interactive shell...
[ 91.569580] stage-1-init: [Thu Jan 1 00:01:29 UTC 1970] + setsid /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/ash -c 'exec /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extr
a-utils/bin/ash < /dev/hvc0 >/dev/hvc0 2>/dev/hvc0'
which csrtool
/nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/csrtool
~ # csrtool
CSR 0x801 = 0u
~ # csrtool o
unknown 'o', use: mem-in-order, all-in-order or all-out-of-order
~ # csrtool all-in-order
CSR 0x801 = 7u
~ # csrtool
CSR 0x801 = 7u
~ #
+ IFS='='
+ echo init /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ set -- init /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ stage2Init=/nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
+ echo /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/modprobe
+ basename dm_mod
[...]
+ echo /sbin/modprobe
+ '[' '!' -e /mnt-root//nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init ]
+ mkdir -m 0755 -p /mnt-root/proc /mnt-root/sys /mnt-root/dev /mnt-root/run
+ mount --move /proc /mnt-root/proc
+ mount --move /sys /mnt-root/sys
+ mount --move /dev /mnt-root/dev
+ mount --move /run /mnt-root/run
+ type -P switch_root
+ exec env -i /nix/store/xm3mpj9aldz5r4s5yb7p08jdjv98hj4w-extra-utils/bin/switch_root /mnt-root /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git/init
<<< NixOS Stage 2 >>>
[ 967.703320] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[ 967.928020] booting system configuration /nix/store/xmagm60y90pfh3yvqanvmaswa0m3cb0a-nixos-system-nixos-riscv-24.11pre-git
running activation script...
[ 977.608980] stage-2-init: running activation script...
bbbbbbbbbbbbbbbbbbbbsetting up /etc...
[ 1084.376420] stage-2-init: setting up /etc...
starting systemd...
Not sure if it is a good reproducer, as it is taking around 15 minutes to hang in a very large piece of software, while when we have the out-of-order enabled, we can hang in half of the time in some script.
Either way, we need to see a backtrace of where it is hanging to understand why it does. We may also enable a stage-2 heartbeat to be sure that it is hanging the kernel and not only the console.
Another idea is to arrive at a proper bash shell, where we can have debugging tools, which may allow us to go slowly until we catch the bug.
OBSERVATION: Setting memory in-order only causes a hang
Tested with:
$ csrtool mem-in-order
And it hangs just after exiting the tool.
QUESTION: Can we see the printk buffer from the host?
If the problem that we are observing is somehow related to the recursive segfault of the kernel in Cincoranch, we may be able to see the printk ring buffer by directly poking at the memory from the host.
QUESTION: Can we crash the CPU by exercising the memory?
I did a small tool memtool
that performs allocations and
deallocations.
Good news, the tool has hang the console (potentially the kernel too).
~ # which memtool
/nix/store/amj11aclwx62d4mnvkhdgj19kq5gjb9y-extra-utils/bin/memtool
~ # memtool
iter 0, nblocks 1, nbytes 0.1M (A)
iter 1, nblocks 2, nbytes 3.5M (A)
iter 2, nblocks 3, nbytes 3.8M (A)
iter 3, nblocks 4, nbytes 5.7M (A)
iter 4, nblocks 5, nbytes 6.8M (A)
iter 5, nblocks 6, nbytes 10.7M (A)
iter 6, nblocks 5, nbytes 10.6M (D)
iter 7, nblocks 6, nbytes 13.9M (A)
iter 8, nblocks 7, nbytes 16.4M (A)
iter 9, nblocks 8, nbytes 19.8M (A)
iter 10, nblocks 9, nbytes 21.2M (A)
iter 11, nblocks 10, nbytes 24.3M (A)
iter 12, nblocks 11, nbytes 27.4M (A)
iter 13, nblocks 12, nbytes 28.3M (A)
iter 14, nblocks 13, nbytes 31.6M (A)
iter 15, nblocks 12, nbytes 28.2M (D)
iter 16, nblocks 13, nbytes 29.2M (A)
iter 17, nblocks 14, nbytes 30.8M (A)
iter 18, nblocks 15, nbytes 32.6M (A)
iter 19, nblocks 16, nbytes 32.8M (A)
iter 20, nblocks 17, nbytes 36.8M (A)
iter 21, nblocks 18, nbytes 39.6M (A)
iter 22, nblocks 19, nbytes 41.1M (A)
iter 23, nblocks 20, nbytes 44.1M (A)
iter 24, nblocks 21, nbytes 46.9M (A)
iter 25, nblocks 20, nbytes 46.5M (D)
iter 26, nblocks 21, nbytes 50.2M (A)
iter 27, nblocks 22, nbytes 53.8M (A)
Let's see if we can reproduce it again in the same position.
~ # memtool
iter 0, nblocks 1, nbytes 0.1M (A)
iter 1, nblocks 2, nbytes 3.5M (A)
iter 2, nblocks 3, nbytes 3.8M (A)
iter 3, nblocks 4, nbytes 5.7M (A)
iter 4, nblocks 5, nbytes 6.8M (A)
iter 5, nblocks 6, nbytes 10.7M (A)
iter 6, nblocks 5, nbytes 10.6M (D)
iter 7, nblocks 6, nbytes 13.9M (A)
iter 8, nblocks 7, nbytes 16.4M (A)
iter 9, nblocks 8, nbytes 19.8M (A)
iter 10, nblocks 9, nbytes 21.2M (A)
iter 11, nblocks 10, nbytes 24.3M (A)
iter 12, nblocks 11, nbytes 27.4M (A)
iter 13, nblocks 12, nbytes 28.3M (A)
iter 14, nblocks 13, nbytes 31.6M (A)
iter 15, nblocks 12, nbytes 28.2M (D)
iter 16, nblocks 13, nbytes 29.2M (A)
iter 17, nblocks 14, nbytes 30.8M (A)
iter 18, nblocks 15, nbytes 32.6M (A)
iter 19, nblocks 16, nbytes 32.8M (A)
iter 20, nblocks 17, nbytes 36.8M (A)
iter 21, nblocks 18, nbytes 39.6M (A)
iter 22, nblocks 19, nbytes 41.1M (A)
iter 23, nblocks 20, nbytes 44.1M (A)
iter 24, nblocks 21, nbytes 46.9M (A)
iter 25, nblocks 20, nbytes 46.5M (D)
iter 26, nblocks 21, nbytes 50.2M (A)
iter 27, nblocks 22, nbytes 53.8M (A)
Let's make it automatic, so we only need to boot and confirm that it hangs. Just in case we can make it not hang by a miracle.
With blocks of up to 64K we hang in the ~50 M region.
iter=2042 nblocks=1577 allocated=50458K (A)
iter=2043 nblocks=1578 allocated=50489K (A)
iter=2044 nblocks=1579 allocated=50550K (A)
iter=2045 nblocks=1580 allocated=50605K (A)
With blocks of maxsize=512K it also hangs around ~57M.
memtool v1.0.0 maxsize=512K
iter=0 nblocks=1 allocated=88K (A)
iter=1 nblocks=2 allocated=464K (A)
...
iter=275 nblocks=218 allocated=56674K (A)
iter=276 nblocks=219 allocated=56787K (A)
iter=277 nblocks=220 allocated=57252K (A)
iter=278 nblocks=221 allocated=57493K (A)
iter=279 nblocks=222 allocated=57581K (A)
iter=280 nblocks=221 allocated=57416K (D)
iter=281 nblocks=222 allocated=57521K (A)
Maybe there is a problem in the memory segment? Can we reduce it to 1 GiB only and see if it has any effect?
iter=289 nblocks=228 allocated=58636K (A)
iter=290 nblocks=227 allocated=58412K (D)
iter=291 nblocks=228 allocated=58480K (A)
iter=292 nblocks=229 allocated=58599K (A)
Has changed, but not much.
aaaiter=291 nblocks=228 allocated=58480K (A)
allocating...
filling...
aaaaaaaaiter=292 nblocks=229 allocated=58599K (A)
allocating...
aafilling...
It seems to be getting stuck in the filling phase. Can we trace it down with ftrace? It should be generating page faults.
2024-07-10
So, if we manage to crash in the filling phase, we can further pinpoint the
issue and remove any effect of malloc()
. It would be only related to a page
fault and the MMU at this point.
Let's make a much simpler program that only allocates once a buffer of N bytes and then begins filling it, printing the progress in the output.
OBSERVATION: Writing to a vector also stops around 58 MiB
+ memtool fill 536870912
memtool v0.0.1 - Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
mode fill: nbytes=512M, n=134217728
written=0K, addr=0x3f9b800020 OK
written=4096K, addr=0x3f9bc00020 OK
written=8192K, addr=0x3f9c000020 OK
written=12288K, addr=0x3f9c400020 OK
written=16384K, addr=0x3f9c800020 OK
written=20480K, addr=0x3f9cc00020 OK
written=24576K, addr=0x3f9d000020 OK
written=28672K, addr=0x3f9d400020 OK
written=32768K, addr=0x3f9d800020 OK
written=36864K, addr=0x3f9dc00020 OK
written=40960K, addr=0x3f9e000020 OK
written=45056K, addr=0x3f9e400020 OK
written=49152K, addr=0x3f9e800020 OK
written=53248K, addr=0x3f9ec00020 OK
written=57344K, addr=0x3f9f000020 OK
Let's see if we can trace the page fault.
In today's meeting, Jonnatan suggests test the memtool program with all-in-order configuration. I should also try to reproduce this hang with a "production" bitstream (from master).
Also, Xavi reports that the memtool chain test continued to run until the memory was exhausted using an old bistream.
Let's do the quick CSR test first, and then we go back to the ftrace testing, which will take more time.
Here are the commands I was testing, but nothing comes out of the console, even
after booting with the tp_printk trace_buf_size=1M
boot options:
mkdir -p /sys/kernel/debug/
mount -t debugfs none /sys/kernel/debug/
td=/sys/kernel/debug/tracing
echo nop > $td/current_tracer
echo 100 > $td/max_graph_depth
echo do_page_fault > $td/set_graph_function
echo function_graph > $td/current_tracer
OBSERVATION: Using all-in-order causes the hang in the same place
After setting the CSR 0x801 register to 0x7, the all-in-order configuration, the memtool fill tests continues to hang in the same position:
+ csrtool all-in-order
CSR 0x801 = 7u
+ memtool fill 536870912
memtool v0.0.1 - Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
mode fill: nbytes=512M, n=134217728
written=0K, addr=0x3f8d600020 OK
written=4096K, addr=0x3f8da00020 OK
written=8192K, addr=0x3f8de00020 OK
written=12288K, addr=0x3f8e200020 OK
written=16384K, addr=0x3f8e600020 OK
written=20480K, addr=0x3f8ea00020 OK
written=24576K, addr=0x3f8ee00020 OK
written=28672K, addr=0x3f8f200020 OK
written=32768K, addr=0x3f8f600020 OK
written=36864K, addr=0x3f8fa00020 OK
written=40960K, addr=0x3f8fe00020 OK
written=45056K, addr=0x3f90200020 OK
written=49152K, addr=0x3f90600020 OK
written=53248K, addr=0x3f90a00020 OK
written=57344K, addr=0x3f90e00020 OK
It doesn't seem to have any observable effect with this test, other than going more slow.
Interesting article on how to write a simple bootrom that outputs some ASCII text into the console.
Regarding the do_page_fault
filter, it seems that is no available. I need to
pick one from available_filter_functions
instead. Let's try handle_page_fault
.
It still doesn't seem to appear in the console. It is visible with a SysRq request Ctrl+O then 'z'.
Maybe I can try disabling the huge pages, just to discard that it may be
related to it: CONFIG_HUGETLBFS
.
Same hang with huge pages disabled, but a bit further ~70 MB.
OBSERVATION: The Linux memtest fails in the first round
[ 0.000000] Linux version 6.9.7 (nixbld@localhost) (riscv64-unknown-linux-gnu-gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.41) #1-NixOS Thu Jun 27 11:52:32 UTC 2024
[ 0.000000] Machine model: Barcelona Supercomputing Center - Lagarto Ox (NixOS)
[ 0.000000] SBI specification v2.0 detected
[ 0.000000] SBI implementation ID=0x1 Version=0x10004
[ 0.000000] SBI TIME extension detected
[ 0.000000] SBI IPI extension detected
[ 0.000000] SBI RFENCE extension detected
[ 0.000000] SBI DBCN extension detected
[ 0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[ 0.000000] printk: legacy bootconsole [sbi0] enabled
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[ 0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[ 0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[ 0.000000] cma: Reserved 16 MiB at 0x00000000bf000000 on node -1
[ 0.000000] early_memtest: # of tests: 3
[ 0.000000] 0x0000000080000000 - 0x0000000080013000 pattern 5555555555555555
This is not suggesting that the problem is not in the virtual memory, but with the actual physical memory.
I will try a similar test with uboot with the mtest
command, but requires
enabling it first.
OBSERVATION: Memory in the 0xb0000000..0xc0000000 range is bad
Reproduced from U-Boot:
=> mtest 0x80000000 0x90000000 0 2
Testing 80000000 ... 90000000:
Pattern FFFFFFFFFFFFFFFF Writing... Reading...Iteration: 2
Tested 2 iteration(s) with 0 errors.
=> mtest 0x90000000 0xa0000000 0 2
Testing 90000000 ... a0000000:
Pattern FFFFFFFFFFFFFFFF Writing... Reading...Iteration: 2
Tested 2 iteration(s) with 0 errors.
=> mtest 0xa0000000 0xb0000000 0 2
Testing a0000000 ... b0000000:
Pattern FFFFFFFFFFFFFFFF Writing... Reading...Iteration: 2
Tested 2 iteration(s) with 0 errors.
=> mtest 0xb0000000 0xc0000000 0 2
Testing b0000000 ... c0000000:
Pattern 0000000000000000 Writing...
Let's see if we can fix the boot hang by reducing the memory enough to avoid this bad region.
2024-07-11
OBSERVATION: U-Boot mtest hangs in the last 256 MiB
After reducing the size of the RAM segment, I run again the mtest, but this time it fails in the last 256 MiB block.
I assume that U-Boot moves itself to the last part of the memory, and them mtest overwrites the U-Boot code, causing a hang.
So, I simply changed the FDT from U-Boot to skip the first 2M:
fdt set /memory@80000000 reg <0x0 0x80200000 0x0 0x40000000>
And then I enabled the memtest in the kernel boot parameters:
=> fdt set /memory@80000000 reg <0x0 0x80200000 0x0 0x30000000>
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 boot.trace boot.tracedebug init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-riscv-24.11pre-git/init"
=> setenv ramdisk_size 12611657
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=hvc0 boot.trace boot.tracedebug memtest=3 init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-risc=> 4.11pre-git/init"
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
Moving Image from 0x84000000 to 0x80200000, end=8303c4d0
## Flattened Device Tree blob at 80013000
Booting using the fdt blob at 0x80013000
Working FDT set to 80013000
Using Device Tree in place at 0000000080013000, end 000000008001696f
Working FDT set to 80013000
Starting kernel ...
[ 0.000000] Linux version 6.9.7 (nixbld@localhost) (riscv64-unknown-linux-gnu-gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.41) #1-NixOS Thu Jun 27 11:52:32 UTC 2024
[ 0.000000] Machine model: Barcelona Supercomputing Center - Lagarto Ox (NixOS)
[ 0.000000] SBI specification v2.0 detected
[ 0.000000] SBI implementation ID=0x1 Version=0x10004
[ 0.000000] SBI TIME extension detected
[ 0.000000] SBI IPI extension detected
[ 0.000000] SBI RFENCE extension detected
[ 0.000000] SBI DBCN extension detected
[ 0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[ 0.000000] printk: legacy bootconsole [sbi0] enabled
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000060000000, size 256 MiB
[ 0.000000] OF: reserved mem: initialized node dma_pool@60000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000060000000..0x000000006fffffff (262144 KiB) map non-reusable dma_pool@60000000
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000070000000, size 256 MiB
[ 0.000000] OF: reserved mem: initialized node dma_pool@70000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000070000000..0x000000007fffffff (262144 KiB) map non-reusable dma_pool@70000000
[ 0.000000] cma: Reserved 16 MiB at 0x00000000af000000 on node -1
[ 0.000000] early_memtest: # of tests: 3
[ 0.000000] 0x0000000083200000 - 0x000000008c300000 pattern 5555555555555555
[ 0.000000] 0x000000009e912000 - 0x00000000aeff9308 pattern 5555555555555555
[ 0.000000] 0x00000000aeff9337 - 0x00000000aeff9338 pattern 5555555555555555
[ 0.000000] 0x00000000aeff9367 - 0x00000000aeff9368 pattern 5555555555555555
[ 0.000000] 0x00000000aeffcffc - 0x00000000aeffd000 pattern 5555555555555555
[ 0.000000] 0x0000000083200000 - 0x000000008c300000 pattern ffffffffffffffff
[ 0.000000] 0x000000009e912000 - 0x00000000aeff9308 pattern ffffffffffffffff
[ 0.000000] 0x00000000aeff9337 - 0x00000000aeff9338 pattern ffffffffffffffff
[ 0.000000] 0x00000000aeff9367 - 0x00000000aeff9368 pattern ffffffffffffffff
[ 0.000000] 0x00000000aeffcffc - 0x00000000aeffd000 pattern ffffffffffffffff
[ 0.000000] 0x0000000083200000 - 0x000000008c300000 pattern 0000000000000000
[ 0.000000] 0x000000009e912000 - 0x00000000aeff9308 pattern 0000000000000000
[ 0.000000] 0x00000000aeff9337 - 0x00000000aeff9338 pattern 0000000000000000
[ 0.000000] 0x00000000aeff9367 - 0x00000000aeff9368 pattern 0000000000000000
[ 0.000000] 0x00000000aeffcffc - 0x00000000aeffd000 pattern 0000000000000000
[ 0.000000] Zone ranges:
[ 0.000000] DMA32 [mem 0x0000000080200000-0x00000000b01fffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000080200000-0x00000000b01fffff]
That seems to pass the memtest fine, however the boot process hangs in different stages.
OBSERVATION: Cannot open /dev/ttyS0
~ # setserial -g /dev/ttyS1 -a
/dev/ttyS1, Line 1, UART: 16550, Port: 0x0000, IRQ: 1
Baud_base: 3125000, close_delay: 50, divisor: 0
closing_wait: 3000
Flags: spd_normal
~ # setserial -g /dev/ttyS0 -a
(hangs)
This page seems to have good resources on the serial console:
https://tldp.org/HOWTO/Serial-HOWTO-16.html
It seems that there are some differences in the way the serial port is handled regarding 16550 and 16550A.
I can write to the UART console from U-Boot by directly writing in the 0x40001000 address (A = 0x41):
=> help mw
mw - memory write (fill)
Usage:
mw [.b, .w, .l, .q] address value [count]
=> mw 0x40001000 0x41
A=> mw 0x40001000 0x42
B=> mw 0x40001000 0x43
C=>
OBSERVATION: I can type with the ttyS0 8250 driver
Tried to boot too, but hangs:
=> fdt set /memory@80000000 reg <0x0 0x80200000 0x0 0x30000000>
=> setenv bootargs "root=/dev/ram0 loglevel=7 debug rw earlycon=sbi console=ttyS0,115200n8 boot.trace boot.tracedebug init=/nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-
system-nixos-riscv-24.11pre-git/init"
=> setenv ramdisk_size 12611657
=> booti ${kernel_addr_r} ${ramdisk_addr_r}:${ramdisk_size} ${fdtcontroladdr}
...
[ 30.740740] riscv-plic 40800000.plic: mapped 3 interrupts with 1 handlers for 2 contexts.
[ 40.048300] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[ 40.420000] of_serial 40001000.serial: error -ENXIO: IRQ index 0 not found
[ 40.496720] printk: legacy console [ttyS0] disabled
[ 40.558860] 40001000.serial: ttyS0 at MMIO 0x40001000 (irq = 0, base_baud = 3125000) is a 16550
[ 40.583940] printk: legacy console [ttyS0] enabled
[ 40.583940] printk: legacy console [ttyS0] enabled
[ 40.595760] printk: legacy bootconsole [sbi0] disabled
[ 40.595760] printk: legacy bootconsole [sbi0] disabled
[ 40.820380] 40003000.serial: ttyS1 at MMIO 0x40003000 (irq = 1, base_baud = 3125000) is a 16550
...
<<< NixOS Stage 2 >>>
[ 394.678980] EXT4-fs (pmem0p2): re-mounted 44444444-4444-4444-8888-888888888888 r/w. Quota mode: none.
[ 394.764620] booting system configuration /nix/store/zxbq93zfg8ijkyq5cq5sb4742rczqfck-nixos-system-nixos-riscv-24.11pre-git
runnin[ 398.543300] stage-2-init: running activation script...
g activation script...
So, if we observe a hang when writing to a bad memory segment, can there be a problem in the place we are placing the pmem? Maybe we can test it with u-boot first.
Another note, the serial device 16550 doesn't seem to use a FIFO, but the 16550A does.
We may want to switch to the A variant, as it seems to be supported by U-boot and the kernel:
https://github.com/u-boot/u-boot/blob/master/drivers/serial/ns16550.c#L607-L619
https://github.com/torvalds/linux/blob/v6.9/drivers/tty/serial/8250/8250_of.c#L285
And defines a FIFO size of 16 bytes:
https://github.com/torvalds/linux/blob/v6.9/drivers/tty/serial/8250/8250_port.c#L74-L81
Still, we would have to wait for a bitstream that can forward the interrupts from the host to the serial console to test it.
OBSERVATION: The memory for the pmem seems to be ok
=> mtest 0x100000000 0x1c0000000 0 2
Testing 100000000 ... 1c0000000:
Pattern FFFFFFFFFFFFFFFF Writing... Reading...Iteration: 2
Tested 2 iteration(s) with 0 errors.
It may be worth enabling huge pages again, as there didn't seem to have any effect.
OBSERVATION: There are extra regions mapped by OpenSBI
In the OpenSBI 1.2 test from buildroot:
Domain0 Region00 : 0x0000000080000000-0x000000008003ffff ()
Domain0 Region01 : 0x0000000000000000-0xffffffffffffffff (R,W,X)
Domain0 Next Address : 0x0000000080200000
Domain0 Next Arg1 : 0x0000000080010000
But with OpenSBI 1.4:
Domain0 Region00 : 0x0000000040000000-0x0000000040000fff M: (I,R,W) S/U: (R,W)
Domain0 Region01 : 0x0000000080020000-0x000000008002ffff M: (R,W) S/U: ()
Domain0 Region02 : 0x0000000080000000-0x000000008001ffff M: (R,X) S/U: ()
Domain0 Region03 : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address : 0x0000000080200000
Domain0 Next Arg1 : 0x0000000080013000
2024-07-12
I find it strange that if we are writing to the 0x80000000 region from the kernel and OpenSBI has mapped a segment for Machine mode only, there is no error message about it in the console.
Let's update to OpenSBI 1.5 for now, and see if we get any more information.