User Limits and Axes' Commanded Positions: Runaway

todd_mm · February 21, 2020, 7:51pm

[RMP 8.1.8]
We’re implementing probing.

The Scenario

To accomplish this, we create one user limit per axis. The user limit for axis #0 has RSIActionTRIGGERED_MODIFY as the action. The other user limits don’t trigger any actions. Depending on how probing is configured, the user limit may also capture the axis position. (Some axes may use high-speed inputs on the drive to capture the probe position, others rely on RMP to capture the position.)

The Problem

Now, 2532 times out of 2533, this works as I would expect. That one time, something odd happens.
Somewhere around 3 ms after the (active low) input goes low, the RMP commanded position for axis #0 changes to the exact RMP commanded position for axis #2. About 427 ms later, the commanded position jumps back to where it should have been after I send it the next batch of points (I think).
In this scenario, axis 0 is using the high-speed inputs and axis 1 is using RMP’s user limit to capture the position. (I don’t have a reason to think that’s making a difference.)

Details

The curious thing here is that I cannot find any indication that I’m sending axis #0 to the position that MotionScope reports as the commanded position. None of my position data looks like this.

Here’s a plot of the scope data.

The green line is axis 0 commanded position.

For the sake of troubleshooting timing, I have recently added writing data to user buffers so that I can correlate events in my code with what the scope is showing.
The “Status” section of the plot visualizes this data (basically as a boolean).

Here’s the same data, zoomed in to the point in time where the jump happens.

The input we’re using for the user limits is labeled “T7X12_Probe0”. The other bits that I’m setting are in my code, some of which are handlers that get called when a user limit interrupt/event comes in.

What I see here is that shortly after the input changes, the commanded position changes. But the commanded position changes before my handler indicates that it’s started (Motor_0_ProbeUserLimitHit). From my perspective, it almost seems to happen before I could have done anything (directly) to cause it.

Conclusions

This is a complex scenario that I cannot easily make simple, so I won’t suggest that I have assurance of the source of the problem.

What I do know is that at no time did I ever send motion points (via MovePT()) that include any other position for axis #0:2018214 and axis #2:-444223. There is no reference in my code to Axis::CommandPositionSet(). The user limits are not configured to change positions or anything else. One of them captures the axis position (via MotionController::UserLimitInterruptUserDataAddressSet()). I don’t know how the commanded position for axis 0 was changed. It is conspicuous that it was set to exactly the commanded position for axis 2, but I don’t know that it proves anything.

I don’t know how this is happening. It’s conceivable that my application is at fault, but I can’t think of a way to show that I’m causing the jump.

It’s conceivable (though unlikely?) that my MemorySet() calls could do something destructive like this, but the addresses get set once just after the program starts and are used without being modified thereafter.

I didn’t attempt to capture user limit config and log it, so I can’t say that the user limits definitely weren’t misconfigured. I’m using the same function every time to configure them, though.

Suggestions?

Can you suggest some kind of data I could gather or test I could run that might make it easier to find the cause of this problem?

scott · February 21, 2020, 11:41pm

Hi Todd,

Thank you for the detailed report and the phone call with additional details. Can you confirm if the Axis 0 position jump occurs at about the same relative time, next time you reproduce the issue? I looks like it is occuring 2~3 samples after the UserLimit for Axis 1 triggers. Is that consistent?

You mentioned that you did not experience this issue with ESTOP_MODIFY as the selected action. Would you be able to test that action with your test code, to give us another data point? I think your motion thread (using MovePT) would behave differently if it was ESTOP_MODIFY because MovePT() will throw an error if the MultiAxis is in an ERROR or STOPPING_ERROR state. It would be an interesting clue.

Keep us posted.

todd_mm · February 26, 2020, 6:35pm

I added some data to the MotionScope capture, hoping to learn something useful about the RMP state during this anomaly.

The layout is similar to the previous ones. Axis 0 command position jumps immediately to the Axis 2 command position.

Based on what I see, it looks like 2ms after the probe input triggers the MultiAxis starts a TriggeredModify (which is what the user limit should do), though the next millisecond, the command position changes to something really bad (rather than do what a normal modify would do).

As a basis for comparison, here’s the same data for a normal probe strike a few seconds earlier.

I couldn’t show both Axis 0 and Axis 1 positions at the same time because the motor positions differ by a great deal (on purpose, to illustrate jumping). Here’s the same time, but showing Axis 0 positions.

(Note the scale for Axis 0 is += 10 PUU)
No change in Axis 0 command position.

todd_mm · February 26, 2020, 6:36pm

I have implemented tracing, but I did not use it in this last run. It creates some real performance problems.

todd_mm · February 26, 2020, 6:50pm

Any ideas as to what might cause this situation only some of the time? I performed 2889 probe moves, with 672 probe strikes during this time (about 109 minutes).

The same code configures the user limit(s) each time. I have no reason to suspect that it’s doing it differently the time that this anomaly happens.

One thing that I notice is different between “normal” and “bad” is that the multiaxis status bit for TriggeredModify_EStop is set in “bad” but it’s never set for “normal.”

scott · February 26, 2020, 8:00pm

Yeah I also noticed the TriggeredModify_EStop bit was on in the “bad” case. Do you have any actions configured for TriggeredModify_EStop on any of the configurable limits?

todd_mm · February 26, 2020, 8:08pm

I don’t think so. There isn’t actually an action labelled “TriggeredModify_EStop” in rsi.h.
We use RSIActionTRIGGERED_MODIFY for the user limit here, and not anywhere else.
The axis position error action is normally set to RSIActionE_STOP_MODIFY. Does this correspond to TriggeredModify_EStop?

scott · February 26, 2020, 8:10pm

Yes, RSIActionE_STOP_MODIFY is what I meant by TriggeredModify_EStop.

So likely the position error is triggering that bit.

todd_mm · February 26, 2020, 8:11pm

@scott
Which means that the firmware bit is a symptom of the problem and not the cause, correct? Position error surely would have been triggered as soon as the commanded position changed by 4M PUUs.

scott · February 26, 2020, 8:33pm

Agreed. Were you able to check the state of the TriggeredModify bit in relation to your calls to MovePT()?

Does the problem ever occur on any other axes? Or is it just Axis 0?

todd_mm · February 26, 2020, 8:33pm

Something else that I’ve noticed since I’ve been tracking MultiAxis status bits is that while the bad position is being commanded, both TriggeredModify and TriggeredModify_EStop pulse off for a millisecond every ~42 milliseconds. IDK what I ought to expect, but it is a little curious.

scott · February 26, 2020, 8:36pm

I’ve never heard of anything happening every ~42ms, are there any things happening around that frequency in your app?

todd_mm · February 26, 2020, 8:36pm

I’m preparing to test checking MovePT for the status bits.

I’ve tried the same basic scenario using different axes. I’m streaming points most of the time, so when I say “moving,” I mean that I was commanding the motors to move (rather than hold still).

If the commanded motion (for the probe) is on Axis 0, I never see the problem.
If the commanded motion (for the probe) is on Axis 1 or Axis 2, Axis 0 gets commanded to the commanded position of the other axis that wasn’t moving.

So, I’ve only ever observed Axis 0 to be commanded to teleport.

scott · February 26, 2020, 8:55pm

Ok, I’m going to setup some tests here in an attempt to reproduce what you’re seeing. I’ll start simple and increase complexity if I can’t get it to fail.

scott · February 26, 2020, 10:06pm

Ok first tests with ~30000 “probe strikes” and no failures, so I’ll try adding more complexity.

What exactly do you do with MovePT position commands if the MultiAxis is executing a triggered modify?

scott · February 26, 2020, 10:15pm

also per your original post,

“About 427 ms later, the commanded position jumps back to where it should have been after I send it the next batch of points (I think).”

that’s 10x your ~42ms observation. any connections?

todd_mm · February 26, 2020, 10:26pm

So far, we’ve been catch-ing exceptions and handling them. So, we may try to re-send or just ignore the error. If we really think something’s wrong, we raise an alarm and stop everything.
We don’t save the motion sent (except for the one attempt to re-send on error).

scott · February 26, 2020, 10:27pm

I’m finding that the TriggeredModify will get the MultiAxis to IDLE state, and if I keep calling MovePT (continually streaming points), it throws an exception telling me the state is IDLE.

todd_mm · February 26, 2020, 10:52pm

I just finished a run where I check the status bits of the MultiAxis before calling MovePT. This test didn’t change any behavior, it just emitted a log message if the condition was met.

Over the course of 40 minutes, I saw 30 occurrences of the MultiAxis F/W status having any of these bits turned on:

		r::RSIFirmwareStatusSTOP | 
		r::RSIFirmwareStatusESTOP |
		r::RSIFirmwareStatusESTOP_ABORT |
		r::RSIFirmwareStatusESTOP_CMD_EQ_ACT |
		r::RSIFirmwareStatusTRIGGERED_MODIFY_ESTOP |
		r::RSIFirmwareStatusTRIGGERED_MODIFY |
		r::RSIFirmwareStatusABORT

I saw these two words:

0x80080120
    MOTION_START(0x00000020)
    MOTOR_LIMIT(0x80000000)
    MS_FAULT(0x00000100)
    TRIGGERED_MODIFY(0x00080000)

(25 times)

0x80080020
    MOTION_START(0x00000020)
    MOTOR_LIMIT(0x80000000)
    TRIGGERED_MODIFY(0x00080000)

(5 times)

They happened every couple of minutes. (Frequently in pairs, within 50 ms of each other—probably the retry-send.) I’m probing about once per second, or 5 times every 6 seconds.
The last occurrence was ~25 seconds before (20 probe attempts) before the anomaly was encountered and we disabled.

scott · February 26, 2020, 10:59pm

It seems to me that commanding more motion while it is in a TRIGGERED_MODIFY state could be a bad idea. I would think the desired behavior would be to wait for the TRIGGERED_MODIFY to complete, then start new PT motion. What errors are you getting such that you need to re-send motion?