RCU CPU stall warning in a multi-core system simulation #51

Open
chiangkd opened this issue Aug 4, 2024 · 12 comments
@chiangkd
Collaborator

chiangkd commented Aug 4, 2024

Execute semu with multi-core system simulation

make check SMP=4

The RCU CPU stall warning, as discussed in #49, is accompanied by an increase in timer interrupts and clock_gettime system calls used to produce a real-time timer. As a result, a CPU stays in the same grace period longer than the RCU stall timeout, which defaults to 21 seconds.

[   33.044071] rcu: INFO: rcu_sched self-detected stall on CPU
[   33.048071] rcu: 	2-....: (5247 ticks this GP) idle=da1c/1/0x40000002 softirq=689/689 fqs=1847
[   33.056071] 	(t=5253 jiffies g=-887 q=11 ncpus=4)
[   33.064071] CPU: 2 PID: 8 Comm: kworker/u8:0 Not tainted 6.1.94 #12
[   33.068071] Hardware name: semu (DT)
[   33.072071] Workqueue: events_unbound async_run_entry_fn
[   33.080071] epc : eat+0x34/0x54
[   33.088071]  ra : do_reset+0x38/0x6c
[   33.096071] epc : c03811fc ra : c038137c sp : c0871e40
[   33.104071]  gp : c04de858 tp : c0852f40 t0 : c0aabc70
[   33.112071]  t1 : 00000008 t2 : 67e0cb47 s0 : c0871e50
[   33.120071]  s1 : c03a48b8 a0 : 00000001 a1 : 00000000
[   33.128071]  a2 : 00000000 a3 : 0053be47 a4 : 00000000
[   33.136071]  a5 : c03a48b8 a6 : 00000000 a7 : 0000000c
[   33.144071]  s2 : 004725a7 s3 : 0038da58 s4 : 00000000
[   33.152071]  s5 : 00000030 s6 : 00000007 s7 : c03a496c
[   33.156071]  s8 : c03a48b8 s9 : c038112c s10: c03a4968
[   33.160071]  s11: c0381538 t3 : c09e4140 t4 : 00000003
[   33.168071]  t5 : ec37d331 t6 : c0871e08
[   33.172071] status: 00000120 badaddr: c03a48c0 cause: 80000005
[   33.180071] [<c03811fc>] eat+0x34/0x54
[   33.184071] [<c038137c>] do_reset+0x38/0x6c
[   33.192071] [<c0381514>] write_buffer+0x40/0x64
[   33.200071] [<c0381d20>] unpack_to_rootfs+0x1c4/0x300
[   33.212072] [<c03824ec>] do_populate_rootfs+0x70/0xd8
[   33.220072] [<c002bd58>] async_run_entry_fn+0x30/0xd0
[   33.232072] [<c002195c>] process_one_work+0x198/0x224
[   33.240072] [<c0021ee4>] worker_thread+0x238/0x294
[   33.248072] [<c002898c>] kthread+0xd4/0xd8
[   33.252072] [<c0002508>] ret_from_exception+0x0/0x1c

Implementing multi-thread support for semu, as discussed in PR #49, might improve the performance of the booting process.

@Mes0903
Collaborator

Mes0903 commented Dec 25, 2024

Although multi-threaded system emulation could significantly alleviate this issue, I don't think it guarantees that the problem will disappear once it is implemented. Perhaps we can start by optimizing the timer.

Currently, the function semu_timer_get relies on clock_gettime, and semu_timer_get is primarily called during reads and writes to mtime. In other words, every time mtime is accessed, clock_gettime is invoked. This leads to an excessive number of timer interrupts, causing performance issues in the emulator.

In #49, it was suggested that lowering the frequency set in semu_timer_init could mitigate this problem. However, this would mean the Linux kernel perceives the clock as no longer being "real-time" but rather a "slowed down" or "imprecise" clock, which goes against the intent of that pull request.

To strike a balance between these two extremes, I think we can maintain a dedicated emulator timer and update it at the start of each emulation cycle. This way, clock_gettime would only be called once per emulation cycle, while tick updates would be maintained separately. The logic for reading mtime would also be simplified.

I made some modifications to test this approach, and it did result in a slight performance improvement for the emulator. On my machine, with SMP=6, the RCU CPU Stall warning no longer appears. However, with SMP=8, the warning still occurs.

@chiangkd
Collaborator Author

I completely agree that your implementation can help avoid RCU CPU stall warnings. However, as the number of simulated cores increases, the warning is likely to occur again unless the emulation period per cycle is also increased proportionally to the number of cores. Is my understanding correct?

I rebuilt the Linux kernel with CONFIG_PREEMPT=y to enable kernel preemption. This configuration suppresses the warning when simulating with SMP=32. The system boots successfully, as shown below:

[  351.875391] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[  356.664664] virtio_net virtio0: Assigned random MAC address 6e:8d:9a:9d:3f:86
[  360.199865] clk: Disabling unused clocks
[ 1133.368994] Freeing initrd memory: 8188K
[ 1135.085089] Freeing unused kernel image (initmem) memory: 180K
[ 1135.149092] Kernel memory protection not selected by kernel config.
[ 1135.201095] Run /init as init process
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

But preemption may increase the frequency of context switches, causing overall CPU usage to increase.
For high-throughput tasks, overall performance may be reduced.

@Mes0903
Collaborator

Mes0903 commented Dec 26, 2024

However, as the number of simulated cores increases, the warning is likely to occur again unless the emulation period per cycle is also increased proportionally to the number of cores. Is my understanding correct?

Yes, the improvement is limited. As the number of cores grows, the warning will occur again.

@Mes0903
Collaborator

Mes0903 commented Dec 26, 2024

The accuracy of the timer primarily affects user programs. I think that during the boot process, it is unnecessary to use such a high-precision timer. Instead, a less precise timer, or even one that simply increments in a straightforward manner, could be used. After the boot process is complete, the system can switch back to a more precise timer.

Here is a simple example. I added a global flag boot_complete to detect whether the boot process has finished. This can be determined by the first transition to U mode.

static void op_sret(hart_t *vm)
{
    /* Restore from stack */
    vm->pc = vm->sepc;
    mmu_invalidate(vm);
    vm->s_mode = vm->sstatus_spp;
    vm->sstatus_sie = vm->sstatus_spie;

    /* After the booting process is complete, initrd will be loaded. At this
     * point, the system will switch to U mode for the first time. Therefore,
     * by checking whether the switch to U mode has already occurred, we can
     * determine if the boot process has been completed.
     */
    if (!boot_complete && !vm->s_mode)
        boot_complete = true;

    /* Reset stack */
    vm->sstatus_spp = false;
    vm->sstatus_spie = true;
}

Before the boot process is complete, I don't use clock_gettime. Instead, I simply increment tv_nsec and combine it with the lower frequency mentioned in #49 to eliminate the CPU stall warning.

bool boot_complete = false;
static struct timespec host_time;

// ...

void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
    timer->freq = freq;
    clock_gettime(CLOCKID, &host_time);
    semu_timer_rebase(timer, 0);
}

static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
    if (boot_complete) {
        clock_gettime(CLOCKID, &host_time);
        return host_time.tv_sec * freq +
               mult_frac(host_time.tv_nsec, freq, 1e9);
    } else {
        return host_time.tv_sec * freq +
               mult_frac(host_time.tv_nsec++, freq / 1000, 1e9);
    }
    
// ...
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    /* Rebase the timer to the current time after the boot process. */
    static bool first = true;
    if (first && boot_complete) {
        first = false;
        timer->begin = semu_timer_clocksource(timer->freq);
    }

    return semu_timer_clocksource(timer->freq) - timer->begin;
}

// ...

This is the sample output:

mes@DESKTOP-HLQ9F6A:~/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.000007] Console: colour dummy device 80x25
[    0.000008] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.000009] pid_max: default: 32768 minimum: 301
[    0.000014] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000015] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000038] ASID allocator disabled (0 bits)
[    0.000040] rcu: Hierarchical SRCU implementation.
[    0.000040] rcu:     Max phase no-delay instances is 1000.
[    0.000057] smp: Bringing up secondary CPUs ...
[    0.000295] smp: Brought up 1 node, 32 CPUs
[    0.000320] devtmpfs: initialized
[    0.000381] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.000382] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.000406] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.000486] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.000662] clocksource: Switched to clocksource riscv_clocksource
[    0.000806] NET: Registered PF_INET protocol family
[    0.000811] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.000858] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.000859] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000860] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.000862] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.000865] TCP: Hash tables configured (established 4096 bind 4096)
[    0.000868] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.000869] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.000874] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.000883] Unpacking initramfs...
[    0.005874] Freeing initrd memory: 8188K
[    0.005907] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.006895] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.006917] printk: console [ttyS0] disabled
[    0.006918] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.006919] printk: console [ttyS0] enabled
[    0.006919] printk: console [ttyS0] enabled
[    0.006920] printk: bootconsole [ns16550] disabled
[    0.006920] printk: bootconsole [ns16550] disabled
[    0.006952] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.006977] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.006993] virtio_net virtio0: Assigned random MAC address 9e:21:77:c8:0b:a8
[    0.007019] clk: Disabling unused clocks
[    0.007029] Freeing unused kernel image (initmem) memory: 180K
[    0.007030] Kernel memory protection not selected by kernel config.
[    0.007031] Run /init as init process
[   17.029125] hrtimer: interrupt took 79999969 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: root
# cat /proc/cpuinfo
processor       : 0
hart            : 0
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 1
hart            : 1
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 2
hart            : 2
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 3
hart            : 3
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 4
hart            : 4
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 5
hart            : 5
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 6
hart            : 6
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 7
hart            : 7
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 8
hart            : 8
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 9
hart            : 9
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 10
hart            : 10
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 11
hart            : 11
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 12
hart            : 12
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 13
hart            : 13
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 14
hart            : 14
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 15
hart            : 15
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 16
hart            : 16
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 17
hart            : 17
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 18
hart            : 18
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 19
hart            : 19
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 20
hart            : 20
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 21
hart            : 21
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 22
hart            : 22
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 23
hart            : 23
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 24
hart            : 24
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 25
hart            : 25
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 26
hart            : 26
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 27
hart            : 27
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 28
hart            : 28
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 29
hart            : 29
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 30
hart            : 30
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 31
hart            : 31
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

@chiangkd
Collaborator Author

I agree that your changes can quickly and easily resolve the RCU stall warning issue! However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

I tried to reproduce your work, and the RCU stall warning is resolved when simulating with SMP=32:

[    0.007006] Run /init as init process
[    0.026121] hrtimer: interrupt took 12000723 ns
Starting syslogd: OK

However, another message, hrtimer: interrupt took XXXX ns, is shown (it also appears in your log).

A quick search shows that this warning is produced by hrtimer_interrupt() (high-resolution timer) because our kernel is built with CONFIG_HIGH_RES_TIMERS=y.

I'm not sure whether this has any side effects.

@Mes0903
Collaborator

Mes0903 commented Dec 26, 2024

However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

Yes, the primary reason for incrementing tv_nsec is to allow the boot process time to progress, ensuring the boot process can proceed smoothly. Therefore, during the boot process, the timer does not serve as an actual timing mechanism.

Once the system switches to U mode for the first time, the boot_complete flag is set, and the time reference is reset. As a result, the [17.029125] here represents 17.029125 seconds since the first switch to U mode.

As for the HRT warning, I think it was triggered by a sudden jump in the system clock right after the boot process completed. I'm also not sure whether it has any side effects.

@jserv
Collaborator

jserv commented Dec 26, 2024

A quick search shows that this warning is produced by hrtimer_interrupt() (high-resolution timer) because our kernel is built with CONFIG_HIGH_RES_TIMERS=y.

This option was enabled for benchmarking purposes. RT-Tests relies on HRT features.

@jserv
Collaborator

jserv commented Dec 26, 2024

However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

The timer increments should align with the frequency defined in the device tree. We could use an approach similar to BogoMIPS to make the necessary adjustments.

@Mes0903
Collaborator

Mes0903 commented Dec 29, 2024

After multiple attempts, I realize that independently maintaining nsec is not a good idea. This results in time intervals between reads being uniform, whereas in reality, we call clock_gettime at varying time intervals.

In contrast, I believe maintaining a frequency scaling factor is a better solution. As mentioned in #49, this achieves the purpose of slowing down time during the boot process. Since it’s merely a scaling factor, we can still derive real-time values from it.

As for the hrtimer: interrupt took XXXX ns issue, after a few days of testing, I found that this doesn’t seem to be caused by time drift. In fact, during these tests, I rebased the timer after the boot process using the recorded time to ensure no drift, but the hrtimer message still appeared. The timing of the message is not consistent: it sometimes appears after the "Welcome to Buildroot" message, and sometimes right after switching to U-mode. However, it always appears after the switch to U-mode.

Interestingly, on my machine, the warning doesn't appear at all with SMP=16, regardless of how the frequency is adjusted. Moreover, the current implementation only affects the timer during the boot process; after switching to U-mode, the timer behaves exactly as before. Therefore, I believe this warning is due to the current sequential emulation approach. Once multi-threaded emulation is implemented, I expect the situation to be largely mitigated.

Here’s a diagram of the overall flow:

[image: diagram of the overall flow]

Below is a simple example:

static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
    struct timespec t;
    clock_gettime(CLOCKID, &t);

    if (boot_complete)
        return t.tv_sec * freq + mult_frac(t.tv_nsec, freq, 1e9);
    else
        return t.tv_sec * (freq / 100) + mult_frac(t.tv_nsec, (freq / 100), 1e9); 
#elif defined(HAVE_MACH_TIMER)
    static mach_timebase_info_data_t t;
    if (t.denom == 0)
        (void) mach_timebase_info(&t);
    return mult_frac(mult_frac(mach_absolute_time(), t.numer, t.denom), freq,
                     1e9);
#else
    return time(0) * freq;
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    static bool first = true;
    if (first && boot_complete) {
        first = false;
        semu_timer_rebase(timer, 0);
        printf("\033[1;33m[SEMU LOG]: Switch to real time\033[0m\n");
    }

    return semu_timer_clocksource(timer->freq) - timer->begin;
}

I think this approach is better than maintaining two separate timers during the boot process. Dividing the frequency by 100 means the boot process operates at one one-hundredth of real-time, allowing us to easily derive the actual boot time. This scaling factor can also be configured in the Makefile. I attempted to dynamically measure the cost of clock_gettime to set this value, but realized that using clock_gettime as a benchmark was not accurate. Therefore, in this example, a scale of 100 was used for simplicity.

Here is the output of the test:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.000694] Console: colour dummy device 80x25
[    0.000893] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.000992] pid_max: default: 32768 minimum: 301
[    0.001489] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.001588] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.004070] ASID allocator disabled (0 bits)
[    0.004269] rcu: Hierarchical SRCU implementation.
[    0.004368] rcu:     Max phase no-delay instances is 1000.
[    0.006056] smp: Bringing up secondary CPUs ...
[    0.049211] smp: Brought up 1 node, 32 CPUs
[    0.055530] devtmpfs: initialized
[    0.071056] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.071360] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.077127] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.097399] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.141221] clocksource: Switched to clocksource riscv_clocksource
[    0.176651] NET: Registered PF_INET protocol family
[    0.177847] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.189801] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.190000] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.190199] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.190697] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.191394] TCP: Hash tables configured (established 4096 bind 4096)
[    0.192190] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.192390] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.193684] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.196073] Unpacking initramfs...
[    0.217861] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.463628] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.468923] printk: console [ttyS0] disabled
[    0.469123] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.469422] printk: console [ttyS0] enabled
[    0.469422] printk: console [ttyS0] enabled
[    0.469622] printk: bootconsole [ns16550] disabled
[    0.469622] printk: bootconsole [ns16550] disabled
[    0.477610] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.483698] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.487487] virtio_net virtio0: Assigned random MAC address 1e:07:10:f3:00:37
[    0.494367] clk: Disabling unused clocks
[    1.486134] Freeing initrd memory: 8188K
[    1.488423] Freeing unused kernel image (initmem) memory: 180K
[    1.488622] Kernel memory protection not selected by kernel config.
[    1.488821] Run /init as init process
[SEMU LOG]: Switch to real time
[    7.946974] hrtimer: interrupt took 69481784 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

Another output for the factor set to 50:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.001390] Console: colour dummy device 80x25
[    0.001589] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.001788] pid_max: default: 32768 minimum: 301
[    0.002781] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.002980] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.007549] ASID allocator disabled (0 bits)
[    0.007947] rcu: Hierarchical SRCU implementation.
[    0.007947] rcu:     Max phase no-delay instances is 1000.
[    0.011523] smp: Bringing up secondary CPUs ...
[    0.095725] smp: Brought up 1 node, 32 CPUs
[    0.108034] devtmpfs: initialized
[    0.138606] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.139202] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.150715] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.190212] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.279523] clocksource: Switched to clocksource riscv_clocksource
[    0.350564] NET: Registered PF_INET protocol family
[    0.352978] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.375270] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.375872] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.376273] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.377275] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.378478] TCP: Hash tables configured (established 4096 bind 4096)
[    0.380083] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.380484] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.382890] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.387903] Unpacking initramfs...
[    0.415743] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.910292] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.921461] printk: console [ttyS0] disabled
[    0.921867] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.922476] printk: console [ttyS0] enabled
[    0.922476] printk: console [ttyS0] enabled
[    0.922881] printk: bootconsole [ns16550] disabled
[    0.922881] printk: bootconsole [ns16550] disabled
[    0.939299] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.951841] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.959919] virtio_net virtio0: Assigned random MAC address 8e:5e:2e:78:c2:e5
[    0.974247] clk: Disabling unused clocks
[    2.984914] Freeing initrd memory: 8188K
[    2.989718] Freeing unused kernel image (initmem) memory: 180K
[    2.989918] Kernel memory protection not selected by kernel config.
[    2.990318] Run /init as init process
[SEMU LOG]: Switch to real time
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: [   88.934001] hrtimer: interrupt took 60891661 ns

Another output, with the factor set to 10:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.006954] Console: colour dummy device 80x25
[    0.008940] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.009934] pid_max: default: 32768 minimum: 301
[    0.014901] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.015894] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.040726] ASID allocator disabled (0 bits)
[    0.042712] rcu: Hierarchical SRCU implementation.
[    0.042712] rcu:     Max phase no-delay instances is 1000.
[    0.059595] smp: Bringing up secondary CPUs ...
[    0.510227] smp: Brought up 1 node, 32 CPUs
[    0.577710] devtmpfs: initialized
[    0.738451] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.740435] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.801942] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    1.015202] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    1.502151] clocksource: Switched to clocksource riscv_clocksource
[    1.887699] NET: Registered PF_INET protocol family
[    1.900829] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    2.030783] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    2.032790] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    2.034797] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    2.040818] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    2.047842] TCP: Hash tables configured (established 4096 bind 4096)
[    2.055870] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    2.057877] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    2.070922] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    2.095005] Unpacking initramfs...
[    2.187237] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    4.908858] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    4.971514] printk: console [ttyS0] disabled
[    4.972524] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    4.975553] printk: console [ttyS0] enabled
[    4.975553] printk: console [ttyS0] enabled
[    4.978582] printk: bootconsole [ns16550] disabled
[    4.978582] printk: bootconsole [ns16550] disabled
[    5.065338] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    5.132830] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    5.178077] virtio_net virtio0: Assigned random MAC address 6e:6c:05:99:21:82
[    5.251437] clk: Disabling unused clocks
[   15.876369] Freeing initrd memory: 8188K
[   15.901243] Freeing unused kernel image (initmem) memory: 180K
[   15.903233] Kernel memory protection not selected by kernel config.
[   15.905222] Run /init as init process
[SEMU LOG]: Switch to real time
Starting syslogd: [   16.343392] hrtimer: interrupt took 61000570 ns
OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: 

In my environment, even with varying scale factors, the hrtimer warning consistently appeared at approximately 60000000 ns. I think this observation supports my hypothesis.

@Mes0903
Collaborator

Mes0903 commented Jan 3, 2025

Here is a summary of the two approaches we currently have for mitigating RCU CPU stalls under the current sequential-emulation scenario.

Methods

  1. Scale Frequency
  2. Manually Increment nsec

1. Scale Frequency

This method involves calling clock_gettime during the boot process to obtain the time difference and then scaling the frequency to slow down the system. An example implementation was provided in earlier discussions of this issue.

Pros

  • The boot process can use logs to approximate real-world time very easily.
  • By scaling, RCU CPU stall warnings can be significantly reduced. As long as the scaling factor is large enough, no matter how many harts are involved, there will be no RCU CPU stall warnings during the boot process.

Cons

  • Requires calling clock_gettime, resulting in a longer execution time.
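
The scaling idea can be sketched as follows. This is a minimal model in Python rather than semu's C code, and it is not the implementation posted earlier in this issue. It assumes a 65 MHz timer (the value printed by sched_clock in the logs above); `ScaledTimer`, `scale`, and the injectable `now_ns` callable are hypothetical names, and `time.monotonic_ns` stands in for `clock_gettime`.

```python
import time

TIMER_FREQ_HZ = 65_000_000  # matches "sched_clock: 64 bits at 65MHz" in the boot log


class ScaledTimer:
    """Convert real elapsed time into emulated timer ticks, slowed by `scale`."""

    def __init__(self, scale, now_ns=time.monotonic_ns):
        self.scale = scale          # hypothetical divisor, e.g. 10 or 50
        self.now_ns = now_ns        # host clock, injectable for testing
        self.begin_ns = now_ns()    # host time when the timer was created

    def ticks(self):
        elapsed_ns = self.now_ns() - self.begin_ns
        # At full speed the tick count would be elapsed_ns * freq / 1e9;
        # dividing by `scale` makes emulated time run `scale` times slower.
        return elapsed_ns * TIMER_FREQ_HZ // (1_000_000_000 * self.scale)
```

With `scale=10`, one real second yields 6,500,000 ticks instead of 65,000,000, so the guest sees a tenth of the real elapsed time.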

2. Manually Increment nsec

This method uses statistical estimation to obtain an empirical value for the length of the boot process. The per-call increment of nsec is then derived from this value. A rough estimate of the relationship is provided below:

image

The orange segments in the diagram represent calls to semu_timer_clocksource. Each call adds a predefined value (blue arrows) regardless of the interval between calls.

Pros

  • Does not require clock_gettime, so the boot process executes faster.

Cons

  • During the boot process, semu_timer_clocksource no longer functions as a timer period calculator but acts like a counter. Therefore, the logs cannot be used to approximate real-world time.
  • Since nsec increments are based on a designed value, RCU CPU stalls may still occur under different environments.
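
A minimal sketch of the counter-style behaviour described above, again in Python for brevity. `CounterTimer` and `increment_ns` are hypothetical names used for illustration; the point is only that each call advances time by the same designed amount, independent of the host clock.

```python
class CounterTimer:
    """Advance emulated time by a fixed step per call, ignoring the host clock."""

    def __init__(self, increment_ns):
        self.increment_ns = increment_ns  # designed value, e.g. from profiling
        self.nsec = 0                     # emulated nanoseconds since boot

    def clocksource(self):
        # Every call moves time forward by the same amount, no matter how
        # much real time passed since the previous call.
        self.nsec += self.increment_ns
        return self.nsec
```

Because the result depends only on the number of calls, two hosts of very different speed report the same emulated time after the same number of calls.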

Synchronization and Time Rebase

Regardless of the chosen method, the emulator's internal time and real-world execution time will not be synchronized during the boot process. Thus, when switching back to the real-time timer, a large time difference would appear. However, this discrepancy can be resolved by rebasing real time onto emulation time when switching to U mode.

image
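
The rebase can be sketched as a fixed offset captured at the moment of the switch. This is an illustrative Python model under that assumption; `make_rebased_clock` and its parameters are hypothetical names, not semu functions.

```python
def make_rebased_clock(emulated_ns_at_switch, real_ns_at_switch):
    """Return a function mapping host time to emulated time after the switch."""
    # The offset is fixed once, at the moment of switching to the real-time timer.
    offset_ns = real_ns_at_switch - emulated_ns_at_switch

    def emulated_now(real_ns):
        # Emulated time continues from where the boot-time timer stopped
        # and then advances at the host's real rate.
        return real_ns - offset_ns

    return emulated_now
```

For example, if the boot-time timer stopped at 16 s while 170 s of real time had passed, the rebased clock reports 16 s at the switch and 17 s one real second later, so the guest never sees the 154 s jump.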

Additionally, hrtimer warnings during the boot process are unrelated to the timers being used. This issue can be revisited during the development of multi-threaded emulation to check if it can be mitigated.

Analysis of Method 2: Empirical Estimation

Below is the data from tests conducted on my computer with freq/10, showing consistency in results across multiple trials:

| SMP | Calls to semu_timer_clocksource | Boot process time (sec) | hrtimer warning |
|---|---|---|---|
| 1 | 223,992,364 | 3.40001 | |
| 2 | 382,486,686 | 8.01002 | |
| 3 | 577,491,593 | 13.44003 | |
| 4 | 774,125,110 | 17.85185 | |
| 5 | 973,274,729 | 22.94007 | |
| 6 | 1,174,038,398 | 27.11009 | |
| 7 | 1,377,244,622 | 31.80010 | |
| 8 | 1,605,001,986 | 37.52011 | |
| 9 | 1,793,136,295 | 41.41014 | |
| 10 | 2,005,988,752 | 45.53015 | |
| 11 | 2,220,126,569 | 51.66018 | |
| 12 | 2,440,897,255 | 56.13018 | |
| 13 | 2,651,860,790 | 60.71019 | |
| 14 | 2,882,701,067 | 65.92020 | |
| 15 | 3,103,978,838 | 70.30022 | |
| 16 | 3,343,030,072 | 76.31025 | |
| 17 | 3,566,365,881 | 80.24026 | |
| 18 | 3,800,214,669 | 86.59028 | |
| 19 | 4,031,961,176 | 92.00030 | |
| 20 | 4,280,331,336 | 94.47030 | |
| 21 | 4,516,731,902 | 101.68033 | |
| 22 | 4,883,959,327 | 104.95035 | |
| 23 | 5,143,022,258 | 110.69036 | |
| 24 | 5,260,058,753 | 118.59098 | |
| 25 | 5,526,277,854 | 125.30041 | |
| 26 | 5,790,681,086 | 132.98045 | 50000184 ns |
| 27 | 6,044,658,240 | 140.04046 | 80000307 ns |
| 28 | 6,328,119,424 | 146.18047 | 60000231 ns |
| 29 | 6,598,156,499 | 154.15050 | 80000261 ns |
| 30 | 6,868,480,625 | 159.83052 | 90000308 ns |
| 31 | 7,129,979,196 | 163.82054 | 50000169 ns |
| 32 | 7,410,129,712 | 170.80054 | 80000508 ns |

Tests were also conducted on my workstation:

| SMP | Calls to semu_timer_clocksource | Boot process time (sec) | hrtimer warning |
|---|---|---|---|
| 1 | 223,450,834 | 15.21302 | |
| 2 | 388,551,174 | 31.45406 | |
| 3 | 586,279,749 | 48.33009 | |
| 4 | 791,644,232 | 68.00714 | |
| 5 | 1,003,639,012 | 83.64418 | |
| 6 | 1,216,761,778 | 99.95122 | 12000031 ns |
| 7 | 1,438,276,507 | 120.21144 | 14000047 ns |
| 8 | 1,704,344,789 | 122.50440 | 11000030 ns |
| 9 | 1,900,605,464 | 156.91848 | 10000031 ns |
| 10 | 2,140,147,966 | 176.43249 | 11000031 ns |
| 11 | 2,451,031,756 | 179.20599 | 12000062 ns |
| 12 | 2,633,717,918 | 217.70393 | 14000046 ns |
| 13 | 2,993,790,985 | 216.13076 | 15000046 ns |
| 14 | 3,165,383,012 | 262.75081 | 14000046 ns |
| 15 | 3,437,855,090 | 286.43180 | 15000015 ns |

Because the workstation is slow and each run took a long time, I only collected statistics up to SMP=15.

Target Time Configuration

To use the second method, a target time needs to be determined. If a target boot time of 10 seconds is set, the nsec increment can be calculated from the SMP parameter.

image

For example, with SMP=4 and a target time of 10 seconds ($10^{10}$ ns), each call to semu_timer_clocksource adds approximately:

$$ \frac{10^{10}}{774125110} \approx 13 \text{ ns} $$

to nsec.
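
The same arithmetic can be reproduced for other SMP values. The snippet below is a small Python sketch that uses call counts copied from the personal-computer table above; `increment_ns` and `measured_calls` are illustrative names.

```python
def increment_ns(target_seconds, calls):
    """Nanoseconds to add per semu_timer_clocksource call to hit the target."""
    return target_seconds * 1_000_000_000 / calls


# Call counts for SMP=1, 4 and 32 from the personal-computer table above.
measured_calls = {1: 223_992_364, 4: 774_125_110, 32: 7_410_129_712}
increments = {smp: increment_ns(10, calls) for smp, calls in measured_calls.items()}
```

For a 10-second target this gives roughly 44.6 ns per call for SMP=1, 12.9 ns for SMP=4, and 1.35 ns for SMP=32.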

However, this method may introduce timing discrepancies across different environments. For instance, with SMP=1, the boot process takes approximately 3 seconds on my personal computer but 18 seconds on my workstation, resulting in a sixfold difference.

This leads to an implicit problem. Suppose each additional core increases the number of semu_timer_clocksource calls by approximately $2 \times 10^8$ and the target time is set to 10 seconds. Then each call to semu_timer_clocksource will increment nsec by approximately:

$$ \frac{10^{10}}{SMPs \times 2 \times 10^8} = \frac{50}{SMPs} \text{ ns} $$

where SMPs is the value passed to the SMP parameter.

Under this method, the emulator's time during the boot process is calculated as:

$$ \text{Number of Calls} \times \frac{10^{10}}{SMPs \times 2 \times 10^8} $$

If the assumption of $2 \times 10^8$ calls per core is incorrect, the emulation time will deviate from real-world time. For instance, if the actual number of semu_timer_clocksource calls per core exceeds $2 \times 10^8$, the emulated boot process will take much longer than the target time, potentially triggering RCU CPU stall warnings.
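
To see how far the emulated boot time drifts from the 10-second target, the formula above can be evaluated with measured call counts. This is a Python sketch under the stated $2 \times 10^8$ assumption; `emulated_boot_seconds` is a hypothetical helper name.

```python
CALLS_PER_CORE = 200_000_000      # the 2 x 10^8 assumption
TARGET_NS = 10_000_000_000        # 10-second target


def emulated_boot_seconds(actual_calls, smp):
    """Emulated boot time when each call adds TARGET_NS / (smp * CALLS_PER_CORE)."""
    increment_ns = TARGET_NS / (smp * CALLS_PER_CORE)   # equals 50 / smp
    return actual_calls * increment_ns / 1e9
```

With the call counts from the personal-computer table, this gives about 11.2 s for SMP=1, 9.68 s for SMP=4, and 11.58 s for SMP=32, so the deviation from the target stays within roughly 16% for those rows.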

The value $2 \times 10^8$ was derived from tests on my personal computer and workstation. Despite the sixfold difference in execution times, the number of semu_timer_clocksource calls was remarkably consistent, which led to this assumption. The corresponding numbers can be checked in the tables above: each time the SMP parameter increases by one, the number of calls to semu_timer_clocksource increases by roughly $2 \times 10^8$.
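
A quick check of that per-core increase, using the first eight rows of the personal-computer table (a Python sketch; the variable names are illustrative):

```python
# Call counts from the personal-computer table for SMP=1..8.
calls = [223_992_364, 382_486_686, 577_491_593, 774_125_110,
         973_274_729, 1_174_038_398, 1_377_244_622, 1_605_001_986]

# Extra calls observed each time one more hart is added.
deltas = [b - a for a, b in zip(calls, calls[1:])]
mean_delta = sum(deltas) / len(deltas)
```

For these rows the mean increase is about $1.97 \times 10^8$ calls per added hart. Over the full table up to SMP=32 the average rises to about $2.3 \times 10^8$, so the assumption is a rough fit rather than an exact constant.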

If we still want a coarse-grained timer during the boot process to roughly approximate real-world time, clock_gettime could be called at specific intervals (e.g., every $10^n$ calls). A pseudo-random number generator (PRNG) or the earlier mentioned relationship could then provide time displacement values, allowing approximate synchronization with real-world time during the boot process.

However, if the actual number of semu_timer_clocksource calls per core exceeds $2 \times 10^8$, emulation time runs ahead, and recalibrating with clock_gettime may make time go backwards whenever real-world time is behind emulation time.

image

In contrast, if the number of semu_timer_clocksource calls is too low, time will continue to increment, leading only to a deviation that can be corrected via rebase.
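
One way to combine the fixed increment with occasional host-clock reads is sketched below in Python. Clamping with `max` so that time never goes backwards is an assumption of this sketch, not something proposed in this thread; `RecalibratingTimer`, `interval`, and `now_ns` are hypothetical names.

```python
class RecalibratingTimer:
    """Counter-style timer that checks the host clock every `interval` calls."""

    def __init__(self, increment_ns, interval, now_ns):
        self.increment_ns = increment_ns
        self.interval = interval      # e.g. 10**n calls between host-clock reads
        self.now_ns = now_ns          # callable returning host nanoseconds since start
        self.calls = 0
        self.nsec = 0

    def clocksource(self):
        self.calls += 1
        self.nsec += self.increment_ns
        if self.calls % self.interval == 0:
            # Only jump forward; if real time lags emulated time, keep the
            # emulated value so the guest never sees time go backwards.
            self.nsec = max(self.nsec, self.now_ns())
        return self.nsec
```

When the host clock is ahead, the timer jumps forward at each checkpoint; when it is behind, the timer simply keeps counting, which leaves a deviation to be corrected by the rebase.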

Although manually incrementing nsec avoids calling clock_gettime, differences in execution time across environments reduce the stability of RCU CPU stall warning mitigation. This also eliminates the ability to correlate boot process logs with real-world time.

Nonetheless, boot process timing may not be critical, and the execution time of the boot process clearly grows as the number of harts increases. For these reasons, I think avoiding the calls to clock_gettime is still attractive.

In my opinion,

  • If boot time accuracy matters: Use scaled frequency and continue relying on clock_gettime to update nsec. There is a simple code example in the previous discussion of this issue.
  • If boot time accuracy does not matter: Manually update nsec without scaling frequency. Increment nsec using the method described above.
    • If the boot process timing is completely irrelevant, I think even updating nsec by a very small PRNG value such as 1~3 is okay.

Maybe we can discuss here which method to adopt, or any better modifications. Once that is decided, I can start preparing a PR.

@jserv
Collaborator

jserv commented Jan 3, 2025

@chiangkd and @RinHizakura, please comment on the above.

@chiangkd
Collaborator Author

chiangkd commented Jan 3, 2025

I tend to prefer using "scaled frequency", based on your analysis.

Pros

  • The boot process can use logs to approximate real-world time very easily.
  • By scaling, RCU CPU stall warnings can be significantly reduced. As long as the scaling factor is large enough, no matter how many harts are involved, there will be no RCU CPU stall warnings during the boot process.

Cons

  • Requires calling clock_gettime, resulting in a longer execution time.

In my opinion, using real-world time offers valuable benefits for developers. It aids in analyzing and identifying potential improvements to accelerate the booting process (e.g., multi-threaded simulations).
