Intermittent EtherCAT network crashing

@edco That’s a good data point. Yes, there’s no need to attempt to get Network log messages if the state is not SHUTDOWN.

It sounds like you’ve got NIC driver/PCIe/performance issues. Rather than trying to work this out on a boat, I’ll see if we can come up with some simpler code to test the NIC latency, outside of RMP and EtherCAT.

This NVIDIA forum post has some similarities:

Hey there! Checking in again. We’re still trying to resolve this failure mode. We’ve tried a few things since we last checked in and solved a few issues. I think we’ve gotten to the point where we’re not quite sure what to try next.

2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:27.967   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1090 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:30.169   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1125 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:32.343   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1089 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:33.133   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1090 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:34.388   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1077 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:35.265   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1095 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:36.303   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1096 us
2025-08-06 13:39:34.048 ( 303.868s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:36.863   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1086 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:38.394   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1097 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:38.933   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1080 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:39.554   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1095 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:40.752   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1110 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:42.006   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1091 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:43.931   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1080 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:44.857   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1091 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:46.731   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1086 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:47.345   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1096 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:48.518   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1108 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:49.049   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1097 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:49.689   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1116 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:51.286   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1091 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:51.866   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1092 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:52.406   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1081 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:53.245   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1101 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:53.771   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1108 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:55.000   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1109 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:55.903   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1108 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.425   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1092 us
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (X) 00:04:56.704   EtherCAT    RMPNetworkStarter.cpp:168  RMP is dead
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.704   EtherCAT           EcDcMaster.cpp:914  1 working counter failure. WC = 10, expected 12. cmd=Logical Read/Write (LRW)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.704   EtherCAT           EcDcMaster.cpp:914  2 working counter failure. WC = 4, expected 12. cmd=Logical Read/Write (LRW)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.705   EtherCAT           EcDcMaster.cpp:914  3 working counter failure. WC = 4, expected 12. cmd=Logical Read/Write (LRW)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (X) 00:04:56.705   EtherCAT           EcDcMaster.cpp:937  Abnormal response of slaves to cyclic commands. Please, check number and state of slaves.
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Drive 0 (Kollmorgen AKD2G SIL2): AL Status (0x14), Code (0x34)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Drive 1 (Kollmorgen AKD2G SIL2): AL Status (0x14), Code (0x34)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Drive 2 (Kollmorgen AKD2G SIL2): AL Status (0x14), Code (0x34)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Drive 3 (Kollmorgen AKD2G SIL2): AL Status (0x14), Code (0x34)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 4 (Beckhoff - EK1100): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 5 (Beckhoff - EL2788): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 6 (Beckhoff - EK1310): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 7 (Beckhoff - EK1300): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 8 (Beckhoff - EL2409): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.706   EtherCAT              EcSlave.cpp:545  Term 9 (Beckhoff - EL1409): AL Status (0x8), Code (0x0)
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.707   EtherCAT              EcSlave.cpp:2138 'Drive 1 (Kollmorgen AKD2G SIL2)' (1002): CoE - Emergency (Hex: 878e, 01, '01 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.707   EtherCAT              EcSlave.cpp:2138 'Drive 2 (Kollmorgen AKD2G SIL2)' (1003): CoE - Emergency (Hex: 878e, 01, '01 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.707   EtherCAT              EcSlave.cpp:2138 'Drive 3 (Kollmorgen AKD2G SIL2)' (1004): CoE - Emergency (Hex: 878e, 01, '01 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.709   EtherCAT              EcSlave.cpp:2138 'Drive 0 (Kollmorgen AKD2G SIL2)' (1001): CoE - Emergency (Hex: 878e, 01, '01 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.710   EtherCAT              EcSlave.cpp:2138 'Drive 1 (Kollmorgen AKD2G SIL2)' (1002): CoE - Emergency (Hex: 878e, 01, '02 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.869s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.711   EtherCAT              EcSlave.cpp:2138 'Drive 2 (Kollmorgen AKD2G SIL2)' (1003): CoE - Emergency (Hex: 878e, 01, '02 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.712   EtherCAT              EcSlave.cpp:2138 'Drive 3 (Kollmorgen AKD2G SIL2)' (1004): CoE - Emergency (Hex: 878e, 01, '02 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.714   EtherCAT              EcSlave.cpp:2138 'Drive 0 (Kollmorgen AKD2G SIL2)' (1001): CoE - Emergency (Hex: 878e, 01, '02 00 00 00 00').
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (X) 00:04:56.815   EtherCAT RMPNetworkFirmwareLinux.cpp:68   Failed to wait on a semaphore (with a timeout of 100 milliseconds).  |  errno: [110] "Connection timed out"
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (i) 00:04:56.815   EtherCAT   RMPNetworkFirmware.cpp:1833 Exiting ServiceChannel Thread
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.815        IDE             EcMaster.cpp:1047 EtherCAT missed                                                         12326260 receive frame during the last 2000 cycles (EtherCAT missed                                                         
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:56.910   EtherCAT             EtherCAT.cpp:769  State changed from Running to StoppingOnError
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:58.159   EtherCAT             EtherCAT.cpp:769  State changed from StoppingOnError to Error
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: (X) 00:04:58.159   EtherCAT          LinuxDevice.cpp:178  No frames received after 1000 milliseconds! Stopping reception of packets.
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:90     ERR| Network log message: /!\ 00:04:59.704   EtherCAT             EtherCAT.cpp:3980 --> Close driver
2025-08-06 13:39:34.049 ( 303.870s) [main thread     ]         psdn_namazu.cc:94     ERR| EtherCAT network is not operational, exiting

Also, in the majority of crashes we usually don’t get any network log messages from namazu.
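
For anyone reading the AL Status lines in the log above, here is a minimal sketch of how the two hex values can be decoded. The state values and the 0x10 error flag follow the standard EtherCAT state machine encoding. The text for code 0x0034 is quoted from memory of the ETG AL Status Code table and should be checked against the drive manual. The names `decode_al`, `AL_STATES`, and `AL_STATUS_CODES` are made up for this example.

```python
import re

# EtherCAT AL state values (low nibble of the AL Status register);
# bit 4 (0x10) is the error indication flag.
AL_STATES = {0x1: "INIT", 0x2: "PRE-OP", 0x3: "BOOTSTRAP", 0x4: "SAFE-OP", 0x8: "OP"}

# Only the codes that appear in the log above. The 0x0034 text is an
# assumption quoted from memory; verify it against the drive documentation.
AL_STATUS_CODES = {0x0000: "No error", 0x0034: "DC Sync Timeout Error"}

AL_LINE = re.compile(r"AL Status \((0x[0-9a-fA-F]+)\), Code \((0x[0-9a-fA-F]+)\)")


def decode_al(line):
    """Return (state, error_flag, code_text) for one 'AL Status' log line, or None."""
    m = AL_LINE.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    code = int(m.group(2), 16)
    state = AL_STATES.get(status & 0x0F, "UNKNOWN")
    error = bool(status & 0x10)
    return state, error, AL_STATUS_CODES.get(code, f"code 0x{code:04x}")


print(decode_al("Drive 0 (Kollmorgen AKD2G SIL2): AL Status (0x14), Code (0x34)"))
print(decode_al("Term 4 (Beckhoff - EK1100): AL Status (0x8), Code (0x0)"))
```

Read this way, the four drives dropped to SAFE-OP with the error flag set while the Beckhoff terminals stayed in OP.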

What new things have you tried and what issues have you resolved? Are you still having issues when connecting to the NTP time server? If so, to help us address the issue, are you able to send us any additional logs from journald or other Linux utilities like ethtool, ss, or nstat?

Also, some searching directed me to the Docker webpage for resource constraints, and I saw that there is a setting for cpu-rt-runtime that may need to be configured per container or on the Docker daemon itself. I don’t think I saw it in the Docker Compose file sent earlier in this thread. Have you tried configuring this setting?

We’ve since correctly installed a PREEMPT_RT kernel (previously it was only PREEMPT). We replaced the default time-syncing daemon with chronyd and configured it to slew between the current and remote time in very small increments. (Without this change, time syncing seemed to consistently cause the network to crash.)

We’ve also modified our compose.yaml to include the following, which should correspond to the cpu-rt-runtime flag (though we haven’t set up the daemon to be configured with that):

    ulimits:
      rttime: -1
      memlock: -1
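
As far as I know, the Compose `ulimits` keys set per-process resource limits (RLIMIT_RTTIME, RLIMIT_MEMLOCK), while Docker's cpu-rt-runtime is a separate cgroup CPU setting, and the allowed real-time priority is governed by yet another limit (RLIMIT_RTPRIO). A minimal sketch, assuming Linux and Python 3 inside the container, for checking what the process actually ended up with. The helper names `rt_limits` and `can_use_fifo` are made up for this example.

```python
import os
import resource


def fmt(limit):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def rt_limits():
    """Return {name: (soft, hard)} for the limits relevant to real-time threads."""
    out = {}
    for name in ("RLIMIT_RTPRIO", "RLIMIT_RTTIME", "RLIMIT_MEMLOCK"):
        res = getattr(resource, name, None)  # some names are Linux-only
        if res is not None:
            soft, hard = resource.getrlimit(res)
            out[name] = (fmt(soft), fmt(hard))
    return out


def can_use_fifo(priority=1):
    """Try to switch this process to SCHED_FIFO; return True if it was allowed."""
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    # Put the process back on the default policy before returning.
    os.sched_setscheduler(0, os.SCHED_OTHER, os.sched_param(0))
    return True


for name, (soft, hard) in rt_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
print("SCHED_FIFO allowed:", can_use_fifo())
```

If SCHED_FIFO is refused inside the container but works on the host, that points at the container's capabilities or limits rather than the kernel.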

We’ll have to look into gathering logs from those tools.

Some more debug data for reference:

2025-08-06 15:42:54.324 ( 125.152s) [main thread     ]         psdn_namazu.cc:103   INFO| Network timing: min=836 ms, max=1157 ms, processor usage=85.79%
2025-08-06 15:42:55.324 ( 126.152s) [main thread     ]         psdn_namazu.cc:103   INFO| Network timing: min=839 ms, max=1157 ms, processor usage=92.31%
2025-08-06 15:42:56.324 ( 127.152s) [main thread     ]         psdn_namazu.cc:103   INFO| Network timing: min=836 ms, max=1161 ms, processor usage=82.31%
2025-08-06 15:42:57.325 ( 128.153s) [main thread     ]         psdn_namazu.cc:103   INFO| Network timing: min=842 ms, max=1159 ms, processor usage=101.02%
2025-08-06 15:42:58.325 ( 129.153s) [main thread     ]         psdn_namazu.cc:103   INFO| Network timing: min=833 ms, max=1170 ms, processor usage=68.23%

These logs output the motion controller’s self-reported min and max network timing deltas and processor usage (ignore the “ms”: the unit is actually us, and the target is 1000 us).
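
To make lines like these easier to compare across runs, here is a small sketch that reduces them to the worst early and late deviation from the 1000 us target, plus the peak processor usage. The regular expression matches the line format shown above. The function name `worst_deviation` is made up, and the sample uses three of the lines from the post.

```python
import re

TARGET_US = 1000  # cycle period stated in the post above

TIMING = re.compile(
    r"Network timing: min=(\d+) ms, max=(\d+) ms, processor usage=([\d.]+)%"
)


def worst_deviation(lines, target_us=TARGET_US):
    """Return (max early deviation, max late deviation, peak usage %) over all lines.

    The 'ms' in the log text is really microseconds, per the note above.
    """
    early = late = 0
    peak = 0.0
    for line in lines:
        m = TIMING.search(line)
        if m is None:
            continue  # not a timing line
        lo, hi, usage = int(m.group(1)), int(m.group(2)), float(m.group(3))
        early = max(early, target_us - lo)
        late = max(late, hi - target_us)
        peak = max(peak, usage)
    return early, late, peak


LOG = """\
INFO| Network timing: min=836 ms, max=1157 ms, processor usage=85.79%
INFO| Network timing: min=842 ms, max=1159 ms, processor usage=101.02%
INFO| Network timing: min=833 ms, max=1170 ms, processor usage=68.23%
"""

print(worst_deviation(LOG.splitlines()))  # (167, 170, 101.02)
```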

I’ve also noticed that when running cyclictest at a priority higher than our motion controller’s RT priority, cyclictest is usually pretty stable, but when run at a priority lower than our motion controller, the max recorded latencies spike along with the motion controller (as expected?) to 1000s of microseconds. This confirms that some constraint (memory, IRQ, NIC) is blocking the process, rather than some other process consuming time on the core.

Referring back to the Ubuntu real-time kernel tuning blog Scott mentioned earlier, it seems you can add additional options to your grub boot parameters to specify a set of CPUs for housekeeping tasks to run on: kthread_cpus=0,1 irqaffinity=0,1, preventing them from running on the cores where the network and other crucial tasks are running. You can verify that no interrupts are running on certain cores using watch -n1 -d "cat /proc/interrupts". The guide also specifies some other boot parameters you could try and see if they help, such as isolcpus=domain,managed_irq,5.
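
As an alternative to eyeballing the `watch` output, here is a sketch that diffs two snapshots of /proc/interrupts and lists which IRQs fired on a given core in between. The snapshot text below is invented sample data (the counts and IRQ names are not from this system), and the function names are made up for the example.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip("\n").splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:n_cpus]
        # Summary rows such as "Err:" have fewer columns than CPUs; skip them.
        if len(fields) == n_cpus and all(f.isdigit() for f in fields):
            table[label.strip()] = [int(f) for f in fields]
    return table


def busy_irqs(before, after, cpu):
    """Return {irq_label: delta} for IRQs whose count on `cpu` rose between snapshots."""
    deltas = {}
    for label, counts in after.items():
        previous = before.get(label, [0] * len(counts))
        delta = counts[cpu] - previous[cpu]
        if delta > 0:
            deltas[label] = delta
    return deltas


# Hypothetical snapshots, for illustration only.
SNAP_A = """\
           CPU0       CPU1       CPU2
 26:       1000        200          0   GICv3  eth0
 27:         50          0          0   GICv3  arch_timer
Err:          0
"""
SNAP_B = """\
           CPU0       CPU1       CPU2
 26:       1500        200          7   GICv3  eth0
 27:         60          0          0   GICv3  arch_timer
Err:          0
"""

print(busy_irqs(parse_interrupts(SNAP_A), parse_interrupts(SNAP_B), cpu=2))  # {'26': 7}
```

On the real system, the two snapshots would come from reading /proc/interrupts twice, a few seconds apart, and `cpu` would be the isolated core.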

Also, is the debug data you sent (with the stats on the network’s timing deltas and processor usage) collected with the network running in parallel with other processes like the perception task or the NTP time server synchronization via Starlink, or is it running in isolation (just the RMP MotionController running in its container)? If not, I’d be curious to see what the numbers are with the MotionController running by itself.

With regards to the cyclictest latency spikes when run at priorities lower than the MotionController, we expect the latency to increase since the MotionController’s threads will take precedence over the cyclictest threads. However, I don’t expect the latency spikes to exceed a millisecond due to the MotionController’s execution. If you run cyclictest at the lowered priorities but without the MotionController running, do you still see the same spikes?

@shota we can try a few of those boot parameters. The difficult thing about this failure mode is that it can take up to 44 hours to occur, or as little as a few minutes. Jitter looks really good most of the time over those 44 hours. When taking data points every 20 ms, the standard deviation of FIRMWARE_TIMING_DELTA is 2.19. We do see some spikes that don’t seem to be causing failures. Interestingly, NetworkTimingMaxGet() isn’t catching these spikes that we are recording with RSIControllerAddressTypeFIRMWARE_TIMING_DELTA.
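
One thing to keep in mind when reading values polled every 20 ms: if the delta value only holds the most recent cycle (an assumption here, not something confirmed for the RMP), a 20 ms poller at a 1 kHz sample rate reads one cycle in twenty, so the spikes it records are a subset of those that occurred. A toy simulation of that effect, with made-up numbers:

```python
CYCLE_US = 1000   # nominal period at 1 kHz
POLL_EVERY = 20   # one reading every 20 cycles, i.e. every 20 ms


def simulate(n_cycles, spike_at, spike_us):
    """Build per-cycle deltas with one spike; compare the polled view with the true max."""
    deltas = [CYCLE_US] * n_cycles
    deltas[spike_at] = spike_us
    polled = deltas[::POLL_EVERY]     # what a 20 ms poller would read
    return max(polled), max(deltas)   # (max seen by the poller, true max)


print(simulate(1000, spike_at=137, spike_us=1400))  # (1000, 1400): spike between polls
print(simulate(1000, spike_at=140, spike_us=1400))  # (1400, 1400): spike lands on a poll
```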

I was hoping that reducing the sample rate from 1000 Hz to 500 Hz would resolve the issue, but we are still seeing network crashes.

Another data point I want to add: I am seeing evidence that the RMP “Processor Usage” percentage can be affected by IO usage. Video linked below. We are normally sitting at ~35% usage, but when I start a test with fio, the usage spikes up to ~58%. I am wondering whether we should be seeing a distinct error if that usage goes over 100%, or whether we would just see IsNetworkOperational() return false, as we are currently seeing?

We added those boot parameters, but the issue is not resolved. These are the parameters we are currently using:

isolcpus=domain,managed_irq,4,5 rcu_nocbs=4,5 rcu_nocb_poll kthread_cpus=1,2,3 irqaffinity=1,2,3

RMP should be running on core 5, and the rest of our Docker containers are limited to cores 1-3.
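
A quick way to sanity-check a parameter line like the one above is to expand the CPU lists and confirm the isolated cores do not overlap the housekeeping cores. This is a minimal sketch. The function names are made up, and on the target machine the string would come from /proc/cmdline rather than being pasted in.

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as '1-3,5' into a set of ints.

    Non-numeric flags (for example 'domain' or 'managed_irq') are skipped.
    """
    cpus = set()
    for part in spec.split(","):
        bounds = part.split("-", 1)
        if len(bounds) == 2 and all(b.isdigit() for b in bounds):
            cpus.update(range(int(bounds[0]), int(bounds[1]) + 1))
        elif part.isdigit():
            cpus.add(int(part))
    return cpus


def cpu_params(cmdline, keys=("isolcpus", "rcu_nocbs", "kthread_cpus", "irqaffinity")):
    """Return {parameter: set_of_cpus} for the CPU-list parameters found in cmdline."""
    out = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in keys:
            out[key] = parse_cpu_list(value)
    return out


CMDLINE = ("isolcpus=domain,managed_irq,4,5 rcu_nocbs=4,5 rcu_nocb_poll "
           "kthread_cpus=1,2,3 irqaffinity=1,2,3")

params = cpu_params(CMDLINE)
housekeeping = params["kthread_cpus"] | params["irqaffinity"]
print("isolated:", sorted(params["isolcpus"]))                 # [4, 5]
print("housekeeping:", sorted(housekeeping))                   # [1, 2, 3]
print("overlap:", sorted(params["isolcpus"] & housekeeping))   # []
```

With these values, cores 4 and 5 are isolated and do not overlap the housekeeping set. CPU 0 appears in none of the lists, which may or may not be intentional.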

In addition, we made some tweaks to our logging and I finally got some network logs from RMP. The lines that stand out to me are:

IDE EcMaster.cpp:1047 EtherCAT missed 11445620 receive frame during the last 2000 cycles (EtherCAT missed
and
EtherCAT RMPNetworkStarter.cpp:168 RMP is dead

The full logs from the crash are too long, so I’m pruning most of the lines that just say “Last cyclic frame was NNNN us”.

namazu  | 2025-08-13 11:23:00.291 (3140.191s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1002 ms, processor usage=31.49%
namazu  | 2025-08-13 11:23:01.291 (3141.192s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=885 ms, max=1113 ms, processor usage=48.84%
namazu  | 2025-08-13 11:23:02.292 (3142.193s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.58%
namazu  | 2025-08-13 11:23:03.293 (3143.193s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.22%
namazu  | 2025-08-13 11:23:04.293 (3144.194s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=997 ms, max=1003 ms, processor usage=31.30%
namazu  | 2025-08-13 11:23:05.294 (3145.194s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.27%
namazu  | 2025-08-13 11:23:06.296 (3146.197s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=993 ms, max=1007 ms, processor usage=77.99%
namazu  | 2025-08-13 11:23:07.296 (3147.197s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=898 ms, max=1102 ms, processor usage=52.20%
namazu  | 2025-08-13 11:23:08.297 (3148.198s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=907 ms, max=1094 ms, processor usage=64.61%
namazu  | 2025-08-13 11:23:09.297 (3149.198s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1002 ms, processor usage=31.47%
namazu  | 2025-08-13 11:23:10.297 (3150.198s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=997 ms, max=1002 ms, processor usage=85.62%
namazu  | 2025-08-13 11:23:11.297 (3151.198s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=900 ms, max=1099 ms, processor usage=52.88%
namazu  | 2025-08-13 11:23:12.298 (3152.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1002 ms, processor usage=31.30%
namazu  | 2025-08-13 11:23:13.298 (3153.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=990 ms, max=1008 ms, processor usage=84.17%
namazu  | 2025-08-13 11:23:14.298 (3154.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=997 ms, max=1003 ms, processor usage=30.91%
namazu  | 2025-08-13 11:23:15.298 (3155.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1002 ms, processor usage=30.88%
namazu  | 2025-08-13 11:23:16.299 (3156.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1004 ms, processor usage=30.92%
namazu  | 2025-08-13 11:23:17.299 (3157.199s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=997 ms, max=1003 ms, processor usage=31.22%
namazu  | 2025-08-13 11:23:18.299 (3158.200s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=995 ms, max=1004 ms, processor usage=31.81%
namazu  | 2025-08-13 11:23:19.299 (3159.200s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1004 ms, processor usage=32.16%
namazu  | 2025-08-13 11:23:20.299 (3160.200s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1006 ms, processor usage=32.48%
namazu  | 2025-08-13 11:23:21.300 (3161.201s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=988 ms, max=1015 ms, processor usage=36.22%
namazu  | 2025-08-13 11:23:22.300 (3162.201s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.94%
namazu  | 2025-08-13 11:23:23.301 (3163.201s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.56%
namazu  | 2025-08-13 11:23:24.301 (3164.201s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=32.33%
namazu  | 2025-08-13 11:23:25.301 (3165.202s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=995 ms, max=1004 ms, processor usage=84.37%
namazu  | 2025-08-13 11:23:26.301 (3166.202s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=937 ms, max=1062 ms, processor usage=78.45%
namazu  | 2025-08-13 11:23:27.307 (3167.208s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=992 ms, max=1006 ms, processor usage=31.80%
namazu  | 2025-08-13 11:23:28.307 (3168.208s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1004 ms, processor usage=31.98%
namazu  | 2025-08-13 11:23:29.307 (3169.208s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=996 ms, max=1003 ms, processor usage=31.76%
namazu  | 2025-08-13 11:23:30.308 (3170.208s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=997 ms, max=1002 ms, processor usage=33.65%
namazu  | 2025-08-13 11:23:31.308 (3171.208s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=0 ms, max=1002 ms, processor usage=43.10%
namazu  | 2025-08-13 11:23:32.308 (3172.209s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=1000 ms, max=1000 ms, processor usage=31.69%
namazu  | 2025-08-13 11:23:33.308 (3173.209s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=1000 ms, max=1000 ms, processor usage=23.43%
namazu  | 2025-08-13 11:23:34.308 (3174.209s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=1000 ms, max=1000 ms, processor usage=22.30%
namazu  | 2025-08-13 11:23:35.308 (3175.209s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=1000 ms, max=1000 ms, processor usage=33.63%
namazu  | 2025-08-13 11:23:36.308 (3176.209s) [main thread     ]         psdn_namazu.cc:119   INFO| Network timing: min=1000 ms, max=1000 ms, processor usage=35.36%
namazu  | 2025-08-13 11:23:37.309 (3177.209s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:27:43.612   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1072 us
...
namazu  | 2025-08-13 11:23:37.311 (3177.211s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (i) 00:43:45.294   EtherCAT   RMPNetworkFirmware.cpp:1597 SDO Write Node (3), Index (0x6872), Sub (0x0), ByteCount (2), Value (0x38e)
namazu  | 2025-08-13 11:23:37.311 (3177.211s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:43:51.442   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1074 us
...
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:51:56.664        IDE             EcMaster.cpp:1047 EtherCAT missed                                                         11445620 receive frame during the last 2000 cycles (EtherCAT missed                                                         
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:01.482   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1076 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:15.019   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1077 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:15.878   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1073 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:18.683   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1088 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:24.577   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1080 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:26.202   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1073 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:28.543   EtherCAT             EtherCAT.cpp:3137 Last cyclic frame was 1078 us
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (X) 00:52:49.317   EtherCAT    RMPNetworkStarter.cpp:168  RMP is dead
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (X) 00:52:49.420   EtherCAT RMPNetworkFirmwareLinux.cpp:68   Failed to wait on a semaphore (with a timeout of 100 milliseconds).  |  errno: [110] "Connection timed out"
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (i) 00:52:49.420   EtherCAT   RMPNetworkFirmware.cpp:1833 Exiting ServiceChannel Thread
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:50.380   EtherCAT              EcSlave.cpp:2138 'Drive 3 (Kollmorgen AKD2G SIL2)' (1004): CoE - Emergency (Hex: 8385, 01, '02 00 00 00 00').
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:50.385   EtherCAT              EcSlave.cpp:2138 'Drive 3 (Kollmorgen AKD2G SIL2)' (1004): CoE - Emergency (Hex: 0000, 00, '00 00 00 00 00').
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:50.410   EtherCAT              EcSlave.cpp:2138 'Drive 3 (Kollmorgen AKD2G SIL2)' (1004): CoE - Emergency (Hex: 0000, 00, '00 00 00 00 00').
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (X) 00:52:50.628   EtherCAT             EtherCAT.cpp:704  EtherCAT communication broken during system operation
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:50.628   EtherCAT             EtherCAT.cpp:769  State changed from Running to StoppingOnError
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (X) 00:52:51.413   EtherCAT          LinuxDevice.cpp:178  No frames received after 1000 milliseconds! Stopping reception of packets.
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:51.954   EtherCAT             EtherCAT.cpp:769  State changed from StoppingOnError to Error
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: (X) 00:52:53.943   EtherCAT             EtherCAT.cpp:1204 Timeout waiting for motion engines to stop
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:91     ERR| Network log message: /!\ 00:52:54.943   EtherCAT             EtherCAT.cpp:3980 --> Close driver
namazu  | 2025-08-13 11:23:37.312 (3177.213s) [main thread     ]         psdn_namazu.cc:95     ERR| EtherCAT network is not operational, exiting
namazu  | 2025-08-13 11:23:37.313 (3177.213s) [main thread     ]         psdn_namazu.cc:96     ERR| Network state: SHUTDOWN - EtherCAT was shutdown or stopped, must restart

The 14 millisecond NETWORK_TIMING_DELTA shown in the screenshot in your earlier post is pretty concerning. After some discussion, we decided to write a small network utility to get some timing metrics on the underlying system calls that RMPNetwork uses. We’ll get you a source file and some instructions on how to compile tomorrow.

The RMPNetwork receive thread is actually the highest priority thread we create, higher than the RMP firmware’s threads, so if it is being starved then it may cause the lower priority threads in the RMP to also miss their timings, which is likely the source of the “RMP is dead” message. We think that message is merely symptomatic of the latencies RMPNetwork is experiencing.
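
To see how the thread priorities and CPU affinities actually ended up on a running system, something like the following sketch can list them per thread. It assumes Linux with /proc mounted and uses only standard `os` calls. The function name `thread_sched_info` is made up, and the example runs against its own process just to show the output shape.

```python
import os

POLICIES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def thread_sched_info(pid):
    """Return one dict per thread of `pid`: tid, name, policy, RT priority, allowed CPUs."""
    rows = []
    for tid in sorted(int(t) for t in os.listdir(f"/proc/{pid}/task")):
        try:
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                name = f.read().strip()
            policy = os.sched_getscheduler(tid)
            rows.append({
                "tid": tid,
                "name": name,
                "policy": POLICIES.get(policy, str(policy)),
                "rt_priority": os.sched_getparam(tid).sched_priority,
                "cpus": sorted(os.sched_getaffinity(tid)),
            })
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited while we were iterating
    return rows


for row in thread_sched_info(os.getpid()):
    print(row)
```

Passing the PID of the process that hosts RMPNetwork instead of `os.getpid()` would show whether the receive thread really has the highest SCHED_FIFO priority and is pinned to the intended core.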

We haven’t observed latencies this large on our in-house systems (both x86 and ARM systems) or on other Linux customers’ systems before, so we’re still searching for additional configuration options that your specific hardware might need.

I’ve sent both @edco and @rickys an email with a source file for the network timing utility along with some instructions on compilation and usage. We plan to eventually put it on GitHub in a public repo, but for now we’ve emailed the current version of it with the subject line “Linux Socket Performance Utility”. I’ve pasted the contents of the email below.

The command to compile the utility is:

g++ -o nicperformance --std=c++20 main.cpp

It sends some EtherCAT discovery broadcast packets on the specified NIC, so you will have to run the utility with at least one EtherCAT node plugged in. Also, the code uses raw sockets and sets things like CPU affinity and thread priorities, so it will be easiest to run the program with sudo unless you explicitly give the executable the required capabilities (which might be doable with setcap cap_net_raw,cap_net_admin,cap_sys_nice=eip if I remember all the capabilities correctly). The program outputs the mean and the maximum latencies on sending and receiving the discovery packets in microseconds.

You’ll have to specify which NIC you want the utility to use with the argument --nic=<nic_name>. You can also specify some other configuration settings if you’d like:
Options:
--nic, -n Network interface card name
--iterations, -i Number of iterations
--send-sleep, -s Send sleep duration in microseconds
--send-priority, -sp Send thread priority
--receive-priority, -rp Receive thread priority
--send-cpu, -sc CPU core to use for the sender thread
--receive-cpu, -rc CPU core to use for the receiver thread
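
To illustrate the kind of mean/max figures such a tool reports, here is a rough sketch that measures how far a periodic sleep overshoots its requested duration. This is not the emailed utility and does not touch sockets or EtherCAT. It only shows the measurement pattern, and Python's own overhead means the absolute numbers are not comparable to a C++ tool. The function name `wakeup_latency_us` is made up.

```python
import time


def wakeup_latency_us(iterations=200, sleep_us=1000):
    """Sleep `sleep_us` repeatedly; return (mean, max) overshoot in microseconds."""
    overshoots = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        time.sleep(sleep_us / 1_000_000)
        elapsed_us = (time.perf_counter_ns() - start) / 1000
        overshoots.append(elapsed_us - sleep_us)
    return sum(overshoots) / len(overshoots), max(overshoots)


mean_us, max_us = wakeup_latency_us()
print(f"mean={mean_us:.1f} us, max={max_us:.1f} us")
```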