Sometime over this past weekend we began to experience strange internet/network-related issues that have us (my boss, myself, and our ISP) somewhat stumped. I’ve personally never dealt with anything like this. The boss has been running a capture with Wireshark, I’ve been investigating potential malware/virus infections, and we’ve even considered a rogue device being connected to our network somewhere. The current theory is a bad core switch (an HP 5406), despite two of its VLANs never failing.

Internet (and the ability to reach or ping our gateway) has come and gone 2 dozen times over the last couple of days. Just when we think we might have an “AH HAH!” moment, it happens again 5 minutes later, on a different machine.

I’m wondering if anyone has any good, but general, suggestions as to where I can even begin to tackle this? This really falls on the boss’s plate, but I’m fascinated and would like to help if possible.

I’ve been running XArp to try to detect ARP cache poisoning, and it does say it has detected instances of it… assuming it’s accurate and I’m using it correctly (press start, essentially)… but once it finds a potential “poisoning”… then what?

I’ve cleared my ARP cache a dozen times this morning… sometimes my internet is up, and sometimes it isn’t… that’s both a symptom and an explanation for why I may not even be able to get to any replies…

Internally we aren’t aware of anything being wrong, meaning that all internal resources seem to be working and responding (mail servers, file servers, network shares, printers, etc.)…

Some may suggest that we install Wireshark, and as I said my boss is currently running it, but my question in response would be “What exactly are we looking for?”

Anyway thanks for any direction… this is the oddest situation…!



What are you using for a gateway/firewall? Can it ping its next hop/gateway? Can you ping anything from the gateway?

I would run a tracert and see what the hops are. Make sure it is hitting everything inside your network as it should. Once the connection goes down, see what happens with tracert. This can give you an idea of where the issue is.
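If it helps to catch it in the act, here’s a rough sketch (Python, assuming Windows-style ping/tracert flags and a placeholder gateway of 192.168.1.1, so adjust for your setup) that pings the gateway every few seconds and dumps a tracert to a log file the moment the gateway stops answering:

```python
import subprocess
import time
from datetime import datetime

GATEWAY = "192.168.1.1"   # assumption: replace with your actual gateway IP
TARGET = "8.8.8.8"        # external host to trace once the gateway stops answering
LOGFILE = "outage_log.txt"

def ping(host):
    """Single ping with a 1-second timeout (Windows flags: -n 1 -w 1000)."""
    return subprocess.run(["ping", "-n", "1", "-w", "1000", host],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def tracert(host):
    """Capture tracert output so you can see exactly which hop dies."""
    return subprocess.run(["tracert", "-d", "-w", "1000", host],
                          capture_output=True, text=True).stdout

while True:
    if not ping(GATEWAY):
        with open(LOGFILE, "a") as log:
            log.write(f"\n=== {datetime.now()} gateway unreachable ===\n")
            log.write(tracert(TARGET))
    time.sleep(5)
```

Run it on one of the affected PCs and you get a timestamped record of every outage plus where the path stopped, instead of trying to react by hand.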


I would bypass everything: plug a PC you don’t care about directly into where your ISP comes in. Bypass your whole network, give that PC whatever IP address your ISP has for you, and sit it on the internet. If you still have connection issues, tell your ISP to get it fixed. If you don’t have connection issues, you can start looking internally. There are too many variables to look at everything at once, so start at one end and work your way back.

Others will have a bunch of tools you can use, I just don’t have any of them jumping out at the moment.

Good luck finding the issue.


Our gateway, as I understand it, is a Cisco switch… Outward from the gateway seems to be normal. For five minutes it seems physical (like it’s our 5406), until something else happens to make us think otherwise.

We’ve placed another switch in front of the 5406, patched that switch directly to the Cisco, and plugged a PC into that switch. Basically, as a test, we took the 5406 out of the path for a single test case, and it exhibited identical behaviour… eliminating the 5406 itself.

Have you considered software (such as antivirus/firewall/content filtering) interference? Sometimes this kind of random thing can happen with endpoint protection suites. If you actively use a particular piece of security software, uninstall it and see if anything is different.

Also I hope you’ve tried rebooting your gateway.

Who was on the network over the weekend? Anything in the event logs?

We had an HP printer that went crazy. People were complaining that VPN links to the data centre were slow, and speed tests to the internet were slow.

We found the HP was attempting to send packets to the internet as quickly as possible… it had a 1Gbit network card, the router had a 1Gbit LAN side and a primary 80Mbit VDSL link to the internet… you can guess the result.

End result… an overloaded router.

On the router, RAM usage went to max, the processor went to max, NAT sessions were high, and data throughput was high. Looking at the logs showed all the NAT sessions and data coming from one device.

The switch in between had a backplane that could handle that volume of traffic, so the router was where the main symptom showed up.

Glad we found that one.

This is what I described doing in a subsequent post… we did it to eliminate our core switch, and it seems to have confirmed it is not our switch (the problem still happened with the device we used to side-step our switch).

A tracert is a good idea; however, we already know it dies at our gateway… so we seem to get conflicting info from one test to another, adding to the confusion. We are unique in that our ISP actually has their equipment/server rack in our server room… so we have an ongoing, long-term relationship with our provider and house their equipment as well… From their perspective everything seems normal.

Well, we are 24/7 in regard to employees; there is a security control room that never closes and monitors IP cams and security software. They were the first to alert my boss on Saturday. He was here from 8:00 pm until 1:30 pm when he left, and he thought it was working… Then came Monday and it’s happening again.

Well, if you bypassed your network, went from a PC directly out through your ISP, and it still doesn’t work, your ISP needs to get working on it.


That’s a wrinkle. It’s likely an issue on their equipment. I’d look there.

This is where good network documentation makes it easier to isolate. I would look for common denominators among the systems having issues, such as whether they all happen to be plugged into the same switch. Also, determine if the issue is only with outbound traffic. If that is the case, I would do what was brought up earlier: plug a PC directly into your ISP and check for issues there. If you don’t have issues, then you need to look at the firewall/router. If it is a Cisco, they have the best troubleshooting tools built into those things (personally, the reason I buy them).

Nothing is common among the systems at any given point… well… that’s not exactly true, it just doesn’t help. I have not been able to respond for the last hour because it’s been down.

  • The core switch CAN ping the gateway when all (or many) PCs cannot ping the gateway. At the same time, those PCs can ping the multiple IPs on the switch.

  • While it seems like that must indicate the core switch, at the same time all of our phones, DMZ, and other VLANs on the same switch work just fine…

  • The boss, our ISP, and I have been banging our heads for most of yesterday and all of today so far. At some points the internet comes back and responds normally, and at other times it’s down, without us making changes in either case.

I strongly believe this is some form of ARP cache poisoning, but I can’t seem to prove that definitively, and even if I could, this being my first encounter with it (assuming that is the case), I still have no great ideas on how to combat it.

Every subsequent test or piece of information we deduce or find contradicts what the last piece of information led us to believe…

Were both the switches you tried managed? Do the clients lose connectivity to other things on the network, or only the gateway? In the switch logs did you notice it shutting down any ports? Do you have any laptops that can connect to the wireless and wired network, or are all wireless devices on the second VLAN?

Looking at the XArp log files, it appears almost as though the MAC addresses for our broadcast IP for that subnet, as well as our gateway for that subnet, are being changed roughly every 30 seconds… (take this with a grain of salt… I’m new to XArp and could be reading it wrong).

Should be easy to tell by looking at the ARP tables. Just look for an entry for your gateway IP that is not the MAC address of the gateway. Or, compare the gateway IP entry between when it is working and when it is not.
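If you want to watch that over time instead of eyeballing `arp -a` by hand, a minimal sketch like this (Python, Windows `arp -a` output assumed, 192.168.1.1 as a placeholder for the gateway) will log every time the gateway’s MAC entry changes:

```python
import re
import subprocess
import time
from datetime import datetime

GATEWAY_IP = "192.168.1.1"   # placeholder: your gateway's IP

def gateway_mac():
    """Pull the gateway's MAC from the local ARP table ('arp -a' output)."""
    output = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout
    for line in output.splitlines():
        if line.strip().startswith(GATEWAY_IP + " "):
            match = re.search(r"([0-9a-f]{2}[-:]){5}[0-9a-f]{2}", line, re.I)
            if match:
                return match.group(0).lower()
    return None   # no entry at all (e.g. the cache was just cleared)

last = None
while True:
    current = gateway_mac()
    if current != last:
        print(f"{datetime.now()}  gateway {GATEWAY_IP} is now {current}")
        last = current
    time.sleep(5)
```

If the printed MAC flips back and forth between two values, and one of them isn’t the gateway’s real address (your ISP or the switch can confirm the real one), that is about as definitive as ARP poisoning evidence gets.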

It does sound like a proxy ARP problem to me as well. Since you have two switches, I would put half your clients on one, and half on the other. When you have a client that can’t hit the gateway, see if it can ping a client on the same switch. If ping works locally on one switch and not the other, you can keep narrowing it down from there.
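When the outage hits, something like this quick check (Python again, Windows ping flags, placeholder IPs for the gateway and a couple of neighbours on the same switch) run on the affected client makes that comparison fast:

```python
import subprocess

# Placeholders: the gateway plus a few clients known to be on the SAME switch.
HOSTS = ["192.168.1.1", "192.168.1.25", "192.168.1.26"]

def reachable(host):
    """One ping, 1-second timeout (Windows flags: -n 1 -w 1000)."""
    return subprocess.run(["ping", "-n", "1", "-w", "1000", host],
                          stdout=subprocess.DEVNULL).returncode == 0

for host in HOSTS:
    print(f"{host}: {'up' if reachable(host) else 'NO REPLY'}")
```

If the neighbours answer but the gateway doesn’t, the problem is specific to the gateway’s entry (ARP or otherwise) rather than the client’s whole connection.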

Yes, both switches are/were managed… clients seem to only lose connectivity to the gateway and beyond; internal resources seem to work fine. I did not hear of any ports being shut down via the switch logs, but I did not check those personally (the boss did); however, during this process we have tried other ports with the same results.

We have notebooks that are being used to test currently… they can connect to wireless #1 (open and using a different ISP), wireless #2 (secured and subject to the same network outages as the wired network), and the wired network…

I think something is changing the MAC addresses of the gateway and/or the broadcast address…

I have a 100-foot cable… I’m waiting for my connection to die, then I’m going to swap cables over to the other switch (the one in place to bypass the core) and see if that resolves it… although if my ARP table is corrupt/wrong, I’m not sure it will…
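If a poisoned/corrupt gateway entry really is the culprit, one stopgap you could try on a single test PC (not a fix, and an assumption on my part that you’re on Windows; the IP and MAC below are placeholders, so get the gateway’s real MAC from the switch or your ISP) is pinning a static ARP entry so a spoofed reply can’t overwrite it:

```python
import subprocess

GATEWAY_IP = "192.168.1.1"          # placeholder: your gateway's IP
GATEWAY_MAC = "00-11-22-33-44-55"   # placeholder: the gateway's REAL MAC address

# Pin a static ARP entry ('arp -s'). Needs an elevated/administrator prompt,
# and it does not persist across reboots.
subprocess.run(["arp", "-s", GATEWAY_IP, GATEWAY_MAC], check=True)
```

If connectivity on that one PC stays up while everyone else keeps dropping, you’ve effectively proven the ARP theory.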