This repository has been archived by the owner on Dec 3, 2021. It is now read-only.

Final adjustments and promotion to prod for JET and OC lessons #218

Open
wants to merge 5 commits into
base: master
from


Mierdin
Member

@Mierdin Mierdin commented Apr 17, 2019

Looks like the majority of the image issues have been sorted. This PR will take care of some last-minute cleanups in the JET and OpenConfig lessons, including the addition of a second image in the JET lesson so that the ping tests in stage 4 will work.

In addition, the two lessons will be promoted to production so they'll show up in the main site in the next release (currently targeted for later this week)

/cc @valjeanchan @jnpr-raylam

Mierdin added 2 commits April 17, 2019 00:55
Signed-off-by: Matt Oswalt <[email protected]>
Signed-off-by: Matt Oswalt <[email protected]>
@jnpr-raylam
Contributor

I just tried the JET lab on the PTR site. It is using the antidotelabs/vqfx-full:18.1R1.9 image, and it takes about 5 minutes for the PFE to come up after the lesson page is returned. Is that expected?

The reason I ask is that when I went through the lesson, there was no output in stage 2 after the interface was added. After further debugging, the root cause is that the PFE isn't up yet at that point.

Should we have some mechanism to check the PFE status before returning the page to the user?

@Mierdin
Member Author

Mierdin commented Apr 17, 2019

Ah, I knew I forgot about something. We probably can't/shouldn't adjust things on the Syringe side, but we can play around with the image. Either boot the PFE first and add a big honkin' sleep before booting the rest, or maybe add a PFE check inside the image and block SSH until it's up (which would delay Syringe effectively)

I'll play with it and let you know

@Mierdin
Member Author

Mierdin commented Apr 17, 2019

@mwiget What's the best way from within the container image to verify that the PFE is up and running? I can telnet to port 3000 immediately after the lesson starts, but Junos doesn't quite see the PFE yet. So I'm hoping there's some other way I can detect PFE health within the container image.

@Mierdin
Member Author

Mierdin commented Apr 17, 2019

Looks like the PFE is detected, but goes through a testing phase? Wonder if it would even be useful then, to delay the vcp since it looks like it detects it right away, but has to do a bunch of testing stuff before it makes those interfaces available...

    antidote@vqfx> show chassis fpc
                         Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
    Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
      0  Online           Testing  88        20        0      0      0    1920        0         39

@Mierdin
Member Author

Mierdin commented Apr 17, 2019

Disregard my last. If the entry shows up in show chassis fpc, the xe interfaces are present, even if the module is undergoing testing.

So my question is, how can we get visibility into the cosim boot process? I'm poking around at the logs in the container, but they're significantly lacking in useful data. And as I mentioned before, I can telnet to ports 3000 and 3001 right away, so that's not useful as a valid health check.

@mwiget
Contributor

mwiget commented Apr 18, 2019

@Mierdin I just ran some tests on GCP with nested kvm active, and I'm surprised how long it takes to boot. I see a total of 15 minutes (6 minutes for Junos VCP alone). Connectivity between VCP and cosim is possible (checked via telnet 169.254.0.1 port 3000 from vcp) long before the pfe gets detected.

To your question on what to check for the PFE to be ready. I use something like this:

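    # Ask Junos for the FPC 0 status; field 9 of the "Online" row is DRAM (MB).
    fpcmem=$(ssh -o StrictHostKeyChecking=no -o ConnectTimeout=1 "$ip" $CLI show chassis fpc 0 2>/dev/null | grep Online | awk '{print $9}')
    # Fall back to 0 when SSH fails or the FPC is not online yet.
    fpcmem="${fpcmem:-0}"
    # Treat the PFE as ready once FPC 0 reports more than 1023 MB of DRAM.
    if [ "$fpcmem" -gt "1023" ]; then
      success=$((success + 1))
      echo -e "$descr ready"
    else
      echo -e "$descr ..."
    fi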

Basically, it logs into Junos and checks the memory reported for FPC 0.
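The check above could be wrapped in a polling loop so that something inside the image blocks until the PFE is ready. The following is only a sketch under assumptions: the function names `fpc_dram_mb`, `pfe_ready`, and `wait_for_pfe`, the 900-second default timeout, and the 10-second interval are hypothetical and not part of the vqfx image; only the "Online row, DRAM above 1023 MB" criterion comes from the snippet above.

```shell
#!/usr/bin/env bash

# Read `show chassis fpc 0` output on stdin and print the DRAM (MB) column
# of the first row whose state is "Online"; print 0 if there is no such row.
fpc_dram_mb() {
  awk '$2 == "Online" { print $9; found = 1; exit } END { if (!found) print 0 }'
}

# Succeed when the DRAM value read from stdin exceeds the threshold
# (default 1023 MB, the value used in the check above).
pfe_ready() {
  local dram
  dram=$(fpc_dram_mb)
  [ "${dram:-0}" -gt "${1:-1023}" ] 2>/dev/null
}

# wait_for_pfe FETCH_CMD [TIMEOUT_S] [INTERVAL_S]
# FETCH_CMD is any command or function that prints the FPC status.
# The timeout and interval defaults are assumptions, not tested values.
wait_for_pfe() {
  local fetch=$1 timeout=${2:-900} interval=${3:-10} waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if "$fetch" 2>/dev/null | pfe_ready; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```

A possible use, reusing `$ip` and `$CLI` from the snippet above: define `fetch_fpc() { ssh -o StrictHostKeyChecking=no -o ConnectTimeout=1 "$ip" $CLI show chassis fpc 0; }` and then call `wait_for_pfe fetch_fpc 900 10`. Passing the fetch command as an argument keeps the parsing testable without a running Junos instance.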

On non-nested KVM, the PFEs come up within seconds of being able to log into Junos.

I'm not hopeful about finding a workaround for the delay. There must be some code that pulls KVM into emulation mode within the Junos VM.

@mwiget
Contributor

mwiget commented Apr 18, 2019

Looking at the Junos messages, I see many of these warnings when running nested, which I never see on baremetal:

mwiget@instance-1:~/container-vqfx$ ssh 172.25.0.2 show log messages|grep JTASK_SCHED_SLIP_KEVENT|wc -l
186

mwiget@instance-1:~/container-vqfx$ ssh 172.25.0.2 show log messages|grep JTASK_SCHED_SLIP_KEVENT|tail -5
Apr 18 11:36:43  container-vqfx_vqfx1_1 sflowd[1967]: JTASK_SCHED_SLIP_KEVENT: 4 sec 898387 usec kevent block
Apr 18 11:36:45  container-vqfx_vqfx1_1 vccpd[1757]: JTASK_SCHED_SLIP_KEVENT: 7 sec 30583 usec kevent block
Apr 18 11:36:46  container-vqfx_vqfx1_1 overlayd[1963]: JTASK_SCHED_SLIP_KEVENT: 7 sec 331353 usec kevent block
Apr 18 11:36:57  container-vqfx_vqfx1_1 overlayd[1963]: JTASK_SCHED_SLIP_KEVENT: 4 sec 702050 usec kevent block
Apr 18 11:36:58  container-vqfx_vqfx1_1 vccpd[1757]: JTASK_SCHED_SLIP_KEVENT: 4 sec 762980 usec kevent block

With 186 messages, each reporting a block of at least 4 seconds, that adds up to at least 12 minutes (186 × 4 s = 744 s), which explains the overall delay.
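Rather than estimating 4 seconds per message, the reported durations can be summed directly from the log. This is a sketch only: `slip_total` is a hypothetical helper name, and it assumes every matching line follows the `N sec M usec kevent block` layout shown in the output above. The blocks are logged by different daemons and may overlap in wall-clock time, so the sum is a rough indicator rather than an exact delay.

```shell
#!/usr/bin/env bash

# Read Junos log lines on stdin and print "<count> <total seconds>" for all
# JTASK_SCHED_SLIP_KEVENT entries. Durations are accumulated in microseconds
# so the total stays exact.
slip_total() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "JTASK_SCHED_SLIP_KEVENT:") {
          # Layout assumed from the log above: "<sec> sec <usec> usec kevent block"
          total_usec += $(i + 1) * 1000000 + $(i + 3)
          count++
          break
        }
      }
    }
    END {
      printf "%d %d.%06d\n", count, int(total_usec / 1000000), total_usec % 1000000
    }
  '
}
```

It could be fed the same way as the commands above, for example `ssh 172.25.0.2 show log messages | slip_total`. For the five lines shown by `tail -5`, it reports 5 entries totalling 28.725353 seconds.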

@Mierdin
Member Author

Mierdin commented Apr 18, 2019

I think it's time to start looking at baremetal hosting. I really want this content published ASAP (these two lessons are really good) but I feel like it won't have nearly the right impact if we don't give it the performance it needs. So I'm sorry it's taking so long to get this content published but I think I'm going to push this until the next release, likely 0.4.0. That will give me time to mull over options and come up with a better game plan for the infra ops side of things. @valjeanchan @jnpr-raylam you good with this? I just want to make sure this content is shown in the best light possible, and it's starting to look like nested virt is just not going to cut it.

@jnpr-raylam
Contributor

We're fine with this, and it's good to know we're going to investigate baremetal hosting. Without nested virt, we can try to add the vMX and develop some courses on telemetry, and it would also be possible to develop some Contrail content.


3 participants