Cannot Enable After the EtherCAT Master Stopped (without Shutting Down Network)

Leadshine CS3E Drive Fails to Enable

Regarding the behavior described below,

  • What kind of troubleshooting can I do regarding failing to enable?
  • What other kind of data/details could I gather to help diagnose this problem?

The Scenario

[RMP 10.3.8]

I have two devices, a Leadshine CS3E stepper drive and a Yaskawa GA500 VFD, exhibiting some unexpected (mis)behavior.

Starting with a vanilla RMP 10.3.8 distribution, I added the ESI files for these devices (CS3E_V1.20.xml, ESI_SI-ES3_2_02_Rev_02.xml) and added entries in NodeInfo.xml. (For details, see below.)

The Problem

The problem that I’ve observed is that the Leadshine drive will frequently not enable. Enabling fails without any sort of fault code from the drive or explicitly stated reason from RMP. It “seems” like it takes less than a second to fail, but it’s difficult to time it using RapidSetup. I’ve tried enabling via RapidCode, using
int32_t AmpEnableSet(bool enable, int32_t ampActiveTimeoutMilliseconds)
with a 5000 ms timeout, and it still fails to enable.

What I’ve observed is that if I do not shut down the network before stopping the RMP RTA (e.g. by stopping/restarting the INtime node), then I will usually see this problem. If I do shut down the network before stopping the RMP RTA, then the problem behavior doesn’t happen. I can clear faults and enable the Leadshine drive.

When I do not shut down the network, the Leadshine will have a fault code of 0x821b, which their documentation describes as “Watchdog Time-Out of Synchronization Manager 2” with a suggested resolution of “Check the network cable.” I interpret this to mean that it lost contact with the EtherCAT master. That, at least, would make sense based on my observations.

The curious observation is that this behavior does not happen at all if the GA500 is not present on the network. I can have the drive alone on the network or with some Beckhoff I/O slices and Yaskawa servo drives, and the behavior never happens, so long as the GA500 isn’t present.

Observations

  • Both the Leadshine CS3E and GA500 must be on the network, with or without other nodes.
  • The Leadshine drive must report 0x821b as a fault (after killing the EtherCAT master) in order for the bad behavior to happen.
    • I can’t seem to make the problem happen by merely unplugging the Ethernet cable from one of the nodes on the network.
  • The Leadshine drive does not appear to be doing a “wake and shake” that is thwarting RMP enable. (I see no change in reported positions.)
  • The Leadshine sends fault codes via PDO, and the value of that object matches what RapidSetup reports on the axis view.

Problem Reproduction

  1. Start RapidSetup.
  2. Start the EtherCAT network.
  3. Close RapidSetup.
  4. Stop the EtherCAT master (restart INtime node).
  5. Start RapidSetup.
  6. Start the EtherCAT network.
    1. The Leadshine drive will have fault code 0x821b.
  7. Clear the fault.
  8. Attempt to enable the Leadshine.
    1. The Leadshine will fail to enable without any fault code returned.

Workaround

When the problem behavior is exhibited…

  1. Shut down the EtherCAT network.
  2. Start the EtherCAT network.
  3. Clear any faults on the Leadshine drive. (This is not essential, but I usually do it.)
  4. Enable the Leadshine drive.
    1. The drive will enable and stay enabled.

Verbose Details

Trace During AmpEnableSet()

I enabled tracing (RSITraceALL) for the RSI::RapidCode::Axis object when invoking AmpEnableSet(true,5000).
The trace log is 153,218 lines and 8.6 MiB. I don’t see obvious fires in the log (noting errors or failures).
The return values are generally 0x0, which I interpret to mean success. The calls that don’t return 0 are few.

mpiElementValidate(0xc1c338) returns 0x3

The others return what look like firmware memory addresses.

The lack of meaningful error/failure info means that I don’t have much data from that log to post here.
I can provide that info, if it would be useful.

NodeInfo.xml

Leadshine CS3E

<Vendor Id="0x4321">
  <VendorName>Leadshine Technology Co.,Ltd.</VendorName>
  <Product Code="0x1200">
    <ProductName>CS3E-D1008</ProductName>
    <ShortName>Leadshine Stepper</ShortName>
    <AxisCount>1</AxisCount>
    <ItemSubType>Drive</ItemSubType>
    <StatusWord>Transmit PDO 1.Status Word</StatusWord>
    <PositionActual>Transmit PDO 1.Position Actual Value</PositionActual>
    <ControlWord>Receive PDO 1.Control Word</ControlWord>
    <PositionDemand>Receive PDO 1.Profile Target Position</PositionDemand>
    <IO>
      <DigitalInputItems>
        <DigitalInput SigBits="0x00000600" Size="32" Home="2" PosLimit="1" NegLimit="0">Transmit PDO 1.Digital Inputs</DigitalInput>
      </DigitalInputItems>
    </IO>
  </Product>
</Vendor>

Yaskawa GA500

<VendorName>Yaskawa</VendorName>
  <Product Code="0x47413530">
    <ProductName>Yaskawa GA500</ProductName>
    <ShortName>Yaskawa VFD</ShortName>
    <ItemSubType>Box</ItemSubType>
    <PDOs>
      <PDOAssignment Index="0x1600" IsOutput="True" Include="False"/>
      <PDOAssignment Index="0x1623" IsOutput="True" Include="True">
        <AddEntry Name="Selectable 1" Index="0x2080" SubIndex="1" BitLen="32" DataType="UDINT"/>
      </PDOAssignment>
      <PDOAssignment Index="0x1624" IsOutput="True" Include="True">
        <AddEntry Name="Digital outputs" Index="0x20f0" SubIndex="1" BitLen="16" DataType="UINT"/>
      </PDOAssignment>
      <PDOAssignment Index="0x1627" IsOutput="True" Include="True"/>
      <PDOAssignment Index="0x1628" IsOutput="True" Include="True"/>
      <PDOAssignment Index="0x1a00" IsOutput="False" Include="False"/>
      <PDOAssignment Index="0x1a23" IsOutput="False" Include="True"/>
      <PDOAssignment Index="0x1a24" IsOutput="False" Include="True">
        <AddEntry Name="Output current" Index="0x2120" SubIndex="1" BitLen="16" DataType="UINT"/>
      </PDOAssignment>
      <PDOAssignment Index="0x1a26" IsOutput="False" Include="True">
        <AddEntry Name="Digital inputs" Index="0x2180" SubIndex="1" BitLen="16" DataType="UINT"/>
      </PDOAssignment>
      <PDOAssignment Index="0x1a27" IsOutput="False" Include="True">
        <AddEntry Name="MEMOBUS write response" Index="0x2150" SubIndex="1" BitLen="32" DataType="UDINT"/>
      </PDOAssignment>
    </PDOs>
    <IO>
      <DigitalInputItems>
        <DigitalInput SigBits="0xffff" Size="16">Inputs.Drive status</DigitalInput>
        <DigitalInput SigBits="0xffff" Size="16">Inputs.Digital inputs</DigitalInput>
      </DigitalInputItems>
      <DigitalOutputItems>
        <DigitalOutput SigBits="0xffff" Size="16">Outputs.Digital outputs</DigitalOutput>
      </DigitalOutputItems>
    </IO>
  </Product>
  ...

Thank you for the detailed description!

I’ve copied a section from the AmpEnableSet docs:

"Attention
For DS402 axes, the RMP firmware will automatically generate an Amp Fault action (default is to Abort) if the AMP_ACTIVE signal does not go high within one second of AmpEnableSet(true). If it fails to enable, the Axis will end in an ERROR state. This value is configurable using RSIAxisAddressType::RSIAxisAddressTypeAMP_ENABLE_AMP_FAULT_TIMEOUT with MotionController::MemoryDoubleGet() and MotionController::MemoryDoubleSet(). The default value is Axis::AmpEnableAmpFaultTimeoutSecondsDefault."

The default value is 1.0 seconds, so this is likely the behavior you witness from RapidSetup when it fails to enable.

You might want to set the value at RSIAxisAddressTypeAMP_ENABLE_AMP_FAULT_TIMEOUT to five or more seconds and retry?

This value was made programmable (and exposed) in 10.4.1 (per the changelog).

Are there any known SDOs you could send to the Leadshine to reset its internal states?

Keep us posted.

What does the ampActiveTimeoutMilliseconds parameter of the AmpEnableSet() function do differently from this value that belongs to the Axis object?

Also, what kind of interplay could exist between nodes that would alter the behavior of the Leadshine when the GA500 is [not] present?

I can’t find any references in their documentation. )-8

AMP_ENABLE_AMP_FAULT_TIMEOUT is a firmware value for DS402. The firmware measures the time since an enable was requested and looks for the OPERATION_ENABLED bit in the DS402 status word. If the bit does not turn on within that TIMEOUT value, an AMP_FAULT status bit is set (which results in an Abort by default).

The RapidCode AmpEnableSet ampActiveTimeoutMilliseconds value is for your software to be notified if the amp does not enable within your timeout.

Most drives enable within a few milliseconds but some take up to 500ms. In your case it’s worth testing to see if waiting longer than one second helps.

Regarding interplay between nodes, I’m not sure. We can discuss more with Jacob tomorrow.

Considering your workaround of restarting the network, I wonder if additional SDO init commands might help? Those would be specified in the ESI file, but you have some capability to add them in CustomNodeInfo.xml.

So, I put 10.4.3 on this machine so that I could try enabling with a modified DS402 timeout value.

Curiously, the bad behavior does not happen for 10.4.3, but happens with 10.3.8.

Hi @todd_mm

In 10.3.10, we added the following change:

  • [Change] RMPNetwork discovery requests nodes transition to Init state.

This was due to another node taking way too long to process its internal state transitions when in error. By requesting a clean “init” state during discovery, we found that the drive wouldn’t be in an unready state by the time we called NetworkStart. It looks like you found another “efficiency” drive that isn’t wasting processing potential by being ready before we are.


Nodes can affect one another in the network. If Node 1 takes X time to transition through its state machine and Node 2 has timeouts on state machine transitions, the slow Node 1 can cause the faster Node 2 to fault. That is what we observed in the above case. It is pretty rare, however. I can’t think of any other example. Normally a failing node takes the whole network down.
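
The slow-node/fast-node interaction can be pictured with a toy model. This is only a sketch under a simplifying assumption: every node is asked to change state at the same moment, the network advances only once the slowest node is done, and a node faults if it has to wait longer than its own timeout. The struct, the function name, and all numbers are hypothetical; they are not measurements from either device and not RMP behavior.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative timing description of one node (all values are assumptions).
struct NodeTiming {
    std::string name;
    double transitionSeconds;  // time the node needs to finish its own state transition
    double timeoutSeconds;     // how long it tolerates waiting afterwards; <= 0 means no timeout
};

// Returns the names of nodes that would time out while waiting for the slowest node.
std::vector<std::string> NodesLikelyToFault(const std::vector<NodeTiming>& nodes) {
    double slowest = 0.0;
    for (const auto& node : nodes) {
        slowest = std::max(slowest, node.transitionSeconds);
    }

    std::vector<std::string> faulted;
    for (const auto& node : nodes) {
        // A node that finishes early sits idle until the slowest node catches up.
        const double waiting = slowest - node.transitionSeconds;
        if (node.timeoutSeconds > 0.0 && waiting > node.timeoutSeconds) {
            faulted.push_back(node.name);
        }
    }
    return faulted;
}
```

With a hypothetical "NodeA" needing 3.0 s and no timeout, and a "NodeB" needing 0.2 s with a 1.0 s timeout, the model flags NodeB: the faster node is the one that faults, even though the slower node is what triggers it.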

So, for the system we have in the lab, there have been reports of the network going down after it’s been started and enabled and everything. I’m looking into that now.

So, if we wanted to make an educated guess as to which node is “at fault,” is the slower node the bad guy? I need to make a recommendation to our engineers as to which scenarios to avoid (at least until we solve this problem). Is the faster node too high strung to get along?

What kind of options do we have regarding fixing this behavior? Scott suggested modifying start/init commands in NodeInfo.xml. Would that prevent the network from going down after it’s been running a while? If I were going to talk to the firmware manufacturers, what do I need to tell them to do in order to Behave Correctly™? I’m not likely to have a lot of traction telling them that their devices don’t play well with some other manufacturer’s devices. They’ll each just point the finger at the other and say, “Tell them to fix their stuff.”

Also, on this system, I have seen an unusual number of RTA crashes.

I don’t know if I can discern any patterns yet. Each time I’ve seen it, I’ve been trying to monitor network stability.

  1. Restarted the INtime node
  2. Started RapidSetup
  3. Started the network
  4. (Enabled our E-Stop circuit to allow us to enable the drives)
  5. Enabled each axis
  6. Waited for trouble…

All the ones I’ve seen (at least 5 since last week) have been in RMPNetwork.rta.

  • some have been after I get everything enabled, then I open the Network Packet Errors window and wait 10-30 seconds
  • some have happened after getting everything enabled, and just waiting
  • others have happened after starting things in our application (a more complex scenario, but not one in which we usually see crashes—because we don’t usually see crashes, period)

Hi Todd,

In my experience, it is normally the node with the higher standards that causes the network shutdown, with Yaskawa or Mitsubishi being common examples.

I would get error histories on every node. (Node specific. Check the manual for each.) From that see if you can determine which node is failing with which error. It is often the case that we can change the limit settings for any given error. Once we’ve identified what to change, then we’d add in new limits into the init cmds. Alternatively you could just save the settings to the drive but that isn’t as robust.

Hi Todd,

Disturbing. General RMPNetwork instability isn’t something I’ve seen for many years.

Can we set up a remote session when you can reproduce the error? I’ll log into the system to see if I can collect more information about the page fault. There is a fairly low chance that I can trace the page fault to the failing code.