Discussion:
[j-nsp] 40G QSFP problems on QFX5100 after 16.1R6
Sebastian Wiesinger
2018-04-24 07:16:33 UTC
Hello,

we've noticed problems with third-party vendors' QSFP 40G optics after
upgrading JunOS on our QFX5100s. The problems manifest as general
instability on the QSFP links, with symptoms like:

* Links take minutes to come up
* Links go down randomly
* Links show CRC/Align errors and packets get dropped
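
One way to check whether a link is still accumulating errors is to compare the CRC/Align counter between two captures of the interface statistics. The following is a minimal Python sketch, assuming the counter sits on a line of the form `CRC/Align errors <receive> <transmit>`; the function names and the sample excerpts are made up for illustration, and the real layout may differ between releases.

```python
import re

# Assumed line shape: "CRC/Align errors   <receive>   <transmit>".
CRC_LINE = re.compile(r"^\s*CRC/Align errors\s+(\d+)\s+(\d+)\s*$", re.MULTILINE)


def crc_align_rx(output: str) -> int:
    """Return the receive-side CRC/Align counter from one CLI capture."""
    match = CRC_LINE.search(output)
    if match is None:
        raise ValueError("no 'CRC/Align errors' line found")
    return int(match.group(1))


def crc_delta(before: str, after: str) -> int:
    """Return how many new receive CRC/Align errors appeared between captures."""
    return crc_align_rx(after) - crc_align_rx(before)


# Hypothetical excerpts for illustration, not real device output.
SAMPLE_BEFORE = """
  MAC statistics:                      Receive         Transmit
    CRC/Align errors                        12                0
"""
SAMPLE_AFTER = """
  MAC statistics:                      Receive         Transmit
    CRC/Align errors                       310                0
"""

if __name__ == "__main__":
    print(crc_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

A delta that keeps growing between captures points at an ongoing problem rather than leftovers from an earlier flap.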

Further testing revealed that this is happening with 16.1R6 and higher
trains (tested 17.4 and 18.1).

After downgrading to 16.1R5-S2 the problems are gone IF you also
downgrade the host OS (force-host option). Only downgrading the JunOS
VM does not work.

Looking through release notes I noticed that PR1296011 got fixed in
16.1R6 and subsequent release trains. It reads:

On QFX5100 platform, 40G interface might not come up if specific
vendor direct attach copper (DAC) cable is used (eg: Molex cable 3m).

I wonder if some change for this PR caused these problems? Does anyone
have the same problems and maybe already successfully communicated
this to Juniper? I have a case open right now but I'm not sure I'm
getting somewhere.

So in conclusion be wary about upgrading beyond 16.1R5-S2 when using
third party optics...

Regards

Sebastian
--
GPG Key: 0x93A0B9CE (F4F6 B1A3 866B 26E9 450A 9D82 58A2 D94A 93A0 B9CE)
'Are you Death?' ... IT'S THE SCYTHE, ISN'T IT? PEOPLE ALWAYS NOTICE THE SCYTHE.
-- Terry Pratchett, The Fifth Elephant
_______________________________________________
juniper-nsp mailing list juniper-***@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Chris via juniper-nsp
2018-04-24 07:56:27 UTC
Hi,
Post by Sebastian Wiesinger
Hello,
we've noticed problems with third party vendors QSFP 40G optics after
upgrading our JunOS on QFX5100. The problems manifest as a general
* Links take minutes to come up
* Links go down randomly
* Links show CRC/Align errors and packets get dropped
Yes, I have 10 QFX5100-48S and I have been experiencing the same issues.

All 10 devices have third party QSFP+ optics/DAC's (fs.com coded for
Juniper). So far I have had similar issues to you but not in all cases:

* 4 of the QFX devices are on 16.1R3. These 4 devices each have 1 x
QSFP+-40G-LR4 and 2 x QSFP+-40G-CU3M. I have not had any issues at all
with these.

* 4 of the QFX devices are on 17.3R1.
- 2 devices with QSFP+-40G-LR4: These are what we have been mainly
experiencing issues with. Initially these had QSFP+40GE-IR4 optics and I
blamed the issues on the length of the fibre run being a problem (it was
quite close to the link budget). One of the links was flapping, and
currently we are having a problem where we see CRC/Align errors. The
optics are verified to be good testing in other equipment with the same
patch cables.
- 2 of the devices with QSFP+-40G-CU1M: No issues.
- 2 of the devices with QSFP+40GE-IR4: No issues.
- 2 of the devices with QSFP+-40G-CU5M: No issues.

* 2 of the QFX devices were on 17.3R1 but I upgraded them to 18.1R1
yesterday. Before the upgrade I had some problems where certain traffic
wasn't working when it crossed the virtual chassis (the virtual chassis
connection is over a pair of QSFP+-40G-CU5M links). Simply disabling the
virtual chassis ports and then enabling them again one by one fixed the
problem. The problem recurred after a while, so I elected to try
upgrading to 18.1R1 to see if it made any difference; that specific
problem has not occurred since. I opened a JTAC case for the specific
problem with certain traffic not working and didn't get anywhere - I was
told to reboot the device, which I said is not acceptable.
- Both devices have QSFP+-40G-CU5M (virtual chassis): This had the
issue noted above.
- Both devices have QSFP+-40G-LR4: These have the same issue with
traffic not working in some cases, or traffic will be super slow.

I can't keep switching firmware around to try and resolve this/isolate
to a specific revision, but it is interesting that you also note you
have not experienced any issues with 16.1, the same as us. If you get a
proper answer to what this issue is I would really like to know, but it
looks like I will probably have to downgrade to 16.1 due to these issues
as they are impacting services.

I have just ordered some EX4600's for a new office fitout along with
some 40G DAC's and 40G QSFP+ interfaces from fs.com as well. I am
curious to see if I have the same issues with those, I suspect that
would be a yes.

Thanks
Sebastian Wiesinger
2018-04-24 08:19:06 UTC
Post by Chris via juniper-nsp
I can't keep switching firmware around to try and resolve this/isolate to a
specific revision, but it is interesting that you also note you have not
experienced any issues with 16.1, the same as us. If you get a proper answer
to what this issue is I would really like to know, but it looks like I will
probably have to downgrade to 16.1 due to these issues as they are impacting
services.
Interesting stuff, I'm fishing around but I'm suspecting perhaps this
is some sort of timing issue. Just be aware that 16.1R6 is also a
"bad" version from our point of view. 16.1R5-S2 is fine.

Sebastian
Luca Salvatore via juniper-nsp
2018-04-26 18:09:12 UTC
Also experienced these issues with most versions above 14.1X53.
FWIW I've had no problems with 17.3R2-S1.2, but we did have issues with
17.4R1.

We're working with our account team to try and sort this stuff out.
Jared Mauch
2018-04-27 23:50:48 UTC
I’ve seen issues on QFX-5200 depending on the optics and cabling type. We’ve had to set
FEC on the ports to none. This may also be a problem with the QSFP/QSFP+ type optics.
The threshold was in going from 15.x -> 17.x, so it’s possible it showed up in 16.x as well.

- Jared
Sebastian Wiesinger
2018-08-22 08:52:47 UTC
Post by Jared Mauch
I’ve seen issues on QFX-5200 depending on the optics and cabling type. We’ve had to set
FEC on the ports to none. This may also be a problem with the QSFP/QSFP+ type optics.
The threshold was in going from 15.x -> 17.x, so it’s possible it showed up in 16.x as well.
Hello,

apparently there is now a PR for this: PR1309613

The PR mentioned that the problem should be fixed in:
14.1X53-D47 15.1R7 16.1R7 17.1R3 17.2R3 17.3R2 17.3R3 17.4R2 18.1R1

So if you're hit by this give one of these releases a try. We'll test as
soon as 17.4R2 is actually released.
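
For anyone checking a fleet against that list, here is a rough Python sketch that compares a version string with the per-train minimums from the PR. It is only an illustration: the function name and the parsing rules are assumptions, service releases (the `-S` suffix) are ignored, and trains that are not in the list return `None` rather than a guess. It reflects what the PR text claims, not confirmed test results.

```python
import re

# Minimum fixed release per train, copied from the PR1309613 list quoted above.
# 17.3 appears twice in that list (17.3R2 and 17.3R3); the lower one is used here.
FIXED_IN = {
    "14.1X53": 47,  # 14.1X53-D47
    "15.1": 7,
    "16.1": 7,
    "17.1": 3,
    "17.2": 3,
    "17.3": 2,
    "17.4": 2,
    "18.1": 1,
}

# Matches e.g. "16.1R5-S2", "17.3R2-S1.2", "14.1X53-D47".
VERSION = re.compile(r"^(\d+\.\d+)(?:R(\d+)|(X\d+)-D(\d+))")


def has_fix(version: str):
    """Return True/False for listed trains, None if the train is not listed."""
    match = VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    base, r_num, x_part, d_num = match.groups()
    train = base + (x_part or "")
    number = int(r_num if r_num is not None else d_num)
    minimum = FIXED_IN.get(train)
    if minimum is None:
        return None  # train not in the PR list; no claim either way
    return number >= minimum


if __name__ == "__main__":
    for v in ("16.1R5-S2", "17.3R2-S1.2", "17.4R1", "17.4R2", "14.1X53-D47"):
        print(v, has_fix(v))
```

Note that a `False` here only means the release is older than the listed fix for its train; as reported earlier in the thread, 16.1R5-S2 predates the problem and works fine.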

Regards

Sebastian
Jason Healy
2018-08-22 10:21:13 UTC
Post by Sebastian Wiesinger
apparently there is now a PR for this: PR1309613
I realize you may not have the answers, but if you do...

1) Does this affect platforms other than the QFX?

2) Were you seeing the CRC count increase in all cases of traffic loss?

3) Was there any pattern to the traffic loss (e.g., every X seconds, certain types of traffic, certain percentage of all traffic)?

I ask because we have a setup right now with an EX4600 VC going to a QFX5100 via 40Gb DWDM fiber. Client devices are seeing "pauses" in traffic with ICMP loss or severe delay.

However, there are no CRC errors reported on either switch as mentioned in the PR. Also, we've only observed serious loss over WiFi clients so there are still plenty of possible culprits.

This is our only EX4600 stack, and our first 40Gb optics, so we don't have a similar working setup elsewhere that's known to be good.

Thanks,

Jason
Sebastian Wiesinger
2018-08-22 10:25:19 UTC
Post by Jason Healy
Post by Sebastian Wiesinger
apparently there is now a PR for this: PR1309613
I realize you may not have the answers, but if you do...
1) Does this affect platforms other than the QFX?
I'm not aware of other platforms affected. We only saw this on QFX.
Post by Jason Healy
2) Were you seeing the CRC count increase in all cases of traffic loss?
Yes.
Post by Jason Healy
3) Was there any pattern to the traffic loss (e.g., every X seconds,
certain types of traffic, certain percentage of all traffic)?
No, there was no apparent pattern. Sometimes we could make the problem
go away OR reoccur by flapping interfaces.
Post by Jason Healy
I ask because we have a setup right now with a EX4600 VC going to a
QFX5100 via 40Gb DWDM fiber. Client devices are seeing "pauses" in
traffic with ICMP loss or severe delay.
However, there are no CRC errors reported on either switch as
mentioned in the PR. Also, we've only observed serious loss over
WiFi clients so there are still plenty of possible culprits.
This is our only EX4600 stack, and our first 40Gb optics, so we
don't have a similar working setup elsewhere that's known to be
good.
I really can't say. If you open a case perhaps let the TAC engineer
crosscheck with this PR?

Regards

Sebastian
Sebastian Wiesinger
2018-09-03 14:37:52 UTC
Post by Sebastian Wiesinger
Hello,
we've noticed problems with third party vendors QSFP 40G optics after
upgrading our JunOS on QFX5100. The problems manifest as a general
So,

we've got the information that this might be fixed in 17.4R2. Initial
testing seems to confirm this, but we're still waiting a bit to see
whether the problem recurs. If this is the fix then PR1309613 was the
culprit and the problem should be solved in:

14.1X53-D47 15.1R7 16.1R7 17.1R3 17.2R3 17.3R2 17.3R3 17.4R2 18.1R1

Regards

Sebastian