Oops in hfi when using multiple ceph writers #32

Open
opened 2023-08-30 13:28:19 +02:00 by rarias · 2 comments
rarias commented 2023-08-30 13:28:19 +02:00 (Migrated from pm.bsc.es)

In lake2, using Ceph over IPoIB with multiple writers causes a kernel oops after a few seconds:

[ 2116.528509] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 2116.536343] #PF: supervisor read access in kernel mode
[ 2116.542106] #PF: error_code(0x0000) - not-present page
[ 2116.547853] PGD 0 P4D 0
[ 2116.550699] Oops: 0000 [#1] PREEMPT SMP PTI
[ 2116.555380] CPU: 4 PID: 42 Comm: ksoftirqd/4 Not tainted 6.4.11 #1-NixOS
[ 2116.562889] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
[ 2116.574768] RIP: 0010:napi_schedule_prep+0x9/0x50
[ 2116.580050] Code: 68 54 0c 94 e8 58 3e cf ff 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 <48> 8b 4f 10 f6 c1 04 75 29 48 89 ca 48 89 c8 83 e2 01 48 01 d2 48
[ 2116.601069] RSP: 0018:ffffabe5c65f0eb8 EFLAGS: 00010046
[ 2116.606923] RAX: ffffffffc14f1ab0 RBX: 0000000000000000 RCX: 0000000000000001
[ 2116.614916] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2116.622905] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 2116.630897] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000617
[ 2116.638887] R13: ffff9164955396b0 R14: 0000000000000016 R15: ffff916498d09a00
[ 2116.646878] FS:  0000000000000000(0000) GS:ffff9173bfb00000(0000) knlGS:0000000000000000
[ 2116.655940] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2116.662375] CR2: 0000000000000010 CR3: 0000000a8ee20002 CR4: 00000000003706e0
[ 2116.670366] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2116.678356] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2116.686346] Call Trace:
[ 2116.689089]  <IRQ>
[ 2116.691350]  ? __die+0x23/0x70
[ 2116.694782]  ? page_fault_oops+0x17d/0x4b0
[ 2116.700050]  ? ip_protocol_deliver_rcu+0x32/0x170
[ 2116.705968]  ? exc_page_fault+0x6d/0x150
[ 2116.711007]  ? asm_exc_page_fault+0x26/0x30
[ 2116.716336]  ? __pfx_hfi1_ipoib_sdma_complete+0x10/0x10 [hfi1]
[ 2116.723646]  ? napi_schedule_prep+0x9/0x50
[ 2116.728875]  hfi1_ipoib_sdma_complete+0x38/0x90 [hfi1]
[ 2116.735353]  sdma_make_progress+0x178/0x460 [hfi1]
[ 2116.741459]  ? __pfx_hfi1_ipoib_sdma_complete+0x10/0x10 [hfi1]
[ 2116.748712]  sdma_engine_interrupt+0x72/0x100 [hfi1]
[ 2116.755030]  sdma_interrupt+0x36/0x110 [hfi1]
[ 2116.760632]  __handle_irq_event_percpu+0x4d/0x1a0
[ 2116.766538]  handle_irq_event+0x3e/0x80
[ 2116.771462]  handle_edge_irq+0x9d/0x280
[ 2116.776380]  __common_interrupt+0x46/0xc0
[ 2116.781495]  common_interrupt+0x81/0xa0
[ 2116.786418]  </IRQ>
[ 2116.789403]  <TASK>
[ 2116.792382]  asm_common_interrupt+0x26/0x40
[ 2116.797708] RIP: 0010:skb_segment+0x86b/0xf00
[ 2116.803222] Code: 24 44 8b 74 24 60 49 89 cc 48 8b 4c 24 28 e9 8b 00 00 00 48 8b 11 48 8b 79 08 49 89 14 24 48 89 d0 49 89 7c 24 08 48 8b 50 08 <f6> c2 01 0f 85 c9 03 00 00 0f 1f 44 00 00 f0 ff 40 34 41 8b 44 24
[ 2116.825561] RSP: 0018:ffffabe5c65dbb90 EFLAGS: 00000213
[ 2116.832097] RAX: ffffd6a144ae8c00 RBX: ffff9164af715c00 RCX: ffff9164db525400
[ 2116.840773] RDX: 0000000000000000 RSI: ffff91648734f0e8 RDI: 0000000000008000
[ 2116.849444] RBP: ffffabe5c65dbc60 R08: 0000000000005dac R09: 0000000000006574
[ 2116.858127] R10: 25dd4e99d6e1ffe7 R11: 0000000000000003 R12: ffff916487cb7980
[ 2116.866801] R13: 0000000000005df8 R14: 0000000000000001 R15: 0000000000000000
[ 2116.875493]  ? __pfx_csum_partial_ext+0x10/0x10
[ 2116.881263]  ? __pfx_csum_block_add_ext+0x10/0x10
[ 2116.887289]  tcp_gso_segment+0xec/0x4e0
[ 2116.892247]  ? __pfx_tcp_wfree+0x10/0x10
[ 2116.897283]  inet_gso_segment+0x159/0x3d0
[ 2116.902393]  ? hfi1_ipoib_send+0x246/0x560 [hfi1]
[ 2116.908364]  skb_mac_gso_segment+0xa4/0x110
[ 2116.914180]  __skb_gso_segment+0xb7/0x170
[ 2116.919271]  ? netif_skb_features+0x151/0x2e0
[ 2116.924746]  validate_xmit_skb+0x16c/0x340
[ 2116.929930]  validate_xmit_skb_list+0x4e/0x70
[ 2116.935392]  sch_direct_xmit+0x18a/0x380
[ 2116.940372]  __qdisc_run+0x149/0x5a0
[ 2116.944952]  net_tx_action+0x1df/0x2a0
[ 2116.949714]  __do_softirq+0xca/0x2ae
[ 2116.954278]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 2116.960005]  run_ksoftirqd+0x2c/0x40
[ 2116.964575]  smpboot_thread_fn+0xdc/0x1d0
[ 2116.969622]  kthread+0xe8/0x120
[ 2116.973702]  ? __pfx_kthread+0x10/0x10
[ 2116.978465]  ret_from_fork+0x2c/0x50
[ 2116.983033]  </TASK>
[ 2116.986029] Modules linked in: netconsole ipmi_si nfsv3 nfs_acl nfs lockd grace netfs fscache msr sb_edac edac_core intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common hfi1 x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 sha512_generic aesni_intel mgag200 libaes drm_shmem_helper crypto_simd cryptd igb drm_kms_helper rdmavt rapl iTCO_wdt mei_me intel_cstate intel_pmc_bxt ptp syscopyarea ib_uverbs pps_core watchdog sysfillrect mxm_wmi sunrpc intel_uncore sysimgblt mei i2c_i801 i2c_algo_bit ioatdma i2c_smbus lpc_ich evdev dca input_leds joydev led_class mousedev mac_hid wmi tiny_power_button acpi_power_meter acpi_pad button xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat sch_fq_codel nf_tables libcrc32c nfnetlink atkbd libps2 serio vivaldi_fmap loop cpufreq_powersave tun tap macvlan bridge stp llc kvm irqbypass ib_ipoib ib_cm
[ 2116.986177]  ib_umad ib_core ipmi_watchdog ipmi_devintf ipmi_msghandler fuse drm efi_pstore backlight configfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid sd_mod ahci xhci_pci xhci_pci_renesas libahci firmware_class ehci_pci xhci_hcd libata ehci_hcd nvme nvme_core usbcore scsi_mod t10_pi crc32c_intel crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod dax [last unloaded: ipmi_si]
[ 2117.145385] CR2: 0000000000000010
[ 2117.149915] ---[ end trace 0000000000000000 ]---
[ 2117.215956] RIP: 0010:napi_schedule_prep+0x9/0x50
[ 2117.222128] Code: 68 54 0c 94 e8 58 3e cf ff 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 <48> 8b 4f 10 f6 c1 04 75 29 48 89 ca 48 89 c8 83 e2 01 48 01 d2 48
[ 2117.244851] RSP: 0018:ffffabe5c65f0eb8 EFLAGS: 00010046
[ 2117.251528] RAX: ffffffffc14f1ab0 RBX: 0000000000000000 RCX: 0000000000000001
[ 2117.260351] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2117.269151] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 2117.277962] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000617
[ 2117.286754] R13: ffff9164955396b0 R14: 0000000000000016 R15: ffff916498d09a00
[ 2117.295538] FS:  0000000000000000(0000) GS:ffff9173bfb00000(0000) knlGS:0000000000000000
[ 2117.305396] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2117.312654] CR2: 0000000000000010 CR3: 0000000a8ee20002 CR4: 00000000003706e0
[ 2117.321457] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2117.330257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2117.339079] Kernel panic - not syncing: Fatal exception in interrupt
[ 2117.347081] Kernel Offset: 0x12200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2117.420699] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

I didn't see any hfi changes in 6.4.12, but it may be worth a try.

Reported to the kernel maintainers.
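
For reference, the faulting instruction can be decoded by hand from the `Code:` line of the oops. The sketch below is an illustration written for this note, not part of the original report. It decodes only the four bytes starting at the `<48>` marker (`48 8b 4f 10`) and combines the result with `RDI = 0` from the register dump. The helper name `decode_mov_load` is hypothetical, and it handles only this one instruction encoding.

```python
# Decode the faulting instruction bytes from the oops "Code:" line.
# Only one encoding is supported: REX.W + 8B /r with a [base + disp8] operand.

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load(code: bytes):
    """Return (dest_reg, base_reg, displacement) for `mov disp8(%base),%dest`."""
    rex, opcode, modrm, disp8 = code
    assert rex == 0x48, "expected a REX.W prefix without register extensions"
    assert opcode == 0x8B, "expected MOV r64, r/m64"
    mod = modrm >> 6          # addressing mode
    reg = (modrm >> 3) & 7    # destination register
    rm = modrm & 7            # base register
    assert mod == 0b01 and rm != 0b100, "expected [base + disp8] without SIB"
    disp = disp8 - 256 if disp8 >= 0x80 else disp8  # disp8 is signed
    return REG64[reg], REG64[rm], disp


# Bytes at the <48> marker in the oops, and RDI taken from the register dump.
dest, base, disp = decode_mov_load(bytes.fromhex("488b4f10"))
registers = {"rdi": 0x0}
fault_addr = registers[base] + disp

print(f"mov {disp:#x}(%{base}),%{dest} reads address {fault_addr:#018x}")
```

The computed address equals the `CR2` value in the trace (`0000000000000010`). Since `RDI` carries the first argument in the x86-64 calling convention, this would be consistent with `napi_schedule_prep()` being called with a NULL pointer and reading a field at offset 0x10. Which field that is was not checked against the 6.4.11 struct layout here.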

rarias commented 2023-08-30 16:31:09 +02:00 (Migrated from pm.bsc.es)

This can be reproduced with 06c75eb3d960ce46d1207e9336d487516141ce35 in bay, lake2 and hut. To change the monitor IP, follow https://docs.ceph.com/en/reef/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method.

Mounting the ceph FS with fuse (setting client_die_on_failed_dentry_invalidate=false):

# ceph-fuse -f -s -n client.user -m 10.0.42.40 /ceph2

And running:

$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1M --numjobs=2 --size=32g --iodepth=1 --runtime=300 --time_based --directory=/ceph2/rarias
rarias commented 2023-10-19 09:15:54 +02:00 (Migrated from pm.bsc.es)

Cornelis Networks provided a potential patch:

diff --git a/drivers/infiniband/hw/hfi1/sdma.c b/drivers/infiniband/hw/hfi1/sdma.c
index bb2552dd29c1..146b2f55b652 100644
--- a/drivers/infiniband/hw/hfi1/sdma.c
+++ b/drivers/infiniband/hw/hfi1/sdma.c
@@ -3145,7 +3145,7 @@ int _pad_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
 {
        int rval = 0;

-       if ((unlikely(tx->num_desc + 1 == tx->desc_limit))) {
+       if ((unlikely(tx->num_desc == tx->desc_limit))) {
                rval = _extend_sdma_tx_descs(dd, tx);
                if (rval) {
                        __sdma_txclean(dd, tx);

Essentially, what was happening is that the descriptor array
was being overflowed and it corrupted the ipoib structure that contained
it, which resulted in corruption that was detected when the completion
for the send was called.
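
To make the effect of the one-line change concrete, here is a toy model in Python. It is not the driver code. It only compares the two conditions from the patch, under the unverified assumption that `num_desc` can already equal `desc_limit` when the padding check runs. The function names and the sizes (6 inline descriptors, 64 after extension) are hypothetical values chosen for illustration.

```python
# Toy model of the two conditions in the patch; NOT the hfi1 driver logic.
# Assumption: the padding descriptor is written at index num_desc, and the
# array is only extended when the check below returns True.

def needs_extend_old(num_desc: int, desc_limit: int) -> bool:
    """Condition before the patch."""
    return num_desc + 1 == desc_limit


def needs_extend_new(num_desc: int, desc_limit: int) -> bool:
    """Condition after the patch."""
    return num_desc == desc_limit


def pad_slot_in_bounds(num_desc, desc_limit, needs_extend, extended_limit=64):
    """Return True if index num_desc is inside the (possibly extended) array."""
    limit = extended_limit if needs_extend(num_desc, desc_limit) else desc_limit
    return num_desc < limit


DESC_LIMIT = 6  # hypothetical size of the inline descriptor array

for num_desc in (5, 6):
    old_ok = pad_slot_in_bounds(num_desc, DESC_LIMIT, needs_extend_old)
    new_ok = pad_slot_in_bounds(num_desc, DESC_LIMIT, needs_extend_new)
    print(f"num_desc={num_desc}: old check in bounds={old_ok}, "
          f"new check in bounds={new_ok}")
```

In this model the old condition never fires once the array is exactly full, so the next descriptor would land one slot past the end. That matches the kind of overflow described above, though the real call flow in `sdma.c` may differ.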

Reference: rarias/jungle#32