Basic Network Troubleshooting

In the daily grind of a network administrator’s job, troubleshooting can eat up as much as 90% of your time. You need good troubleshooting skills to respond quickly and effectively to issues as they come up. This article covers the areas you can check first to isolate network-related problems quickly.

Always start troubleshooting with these basic questions:

  1. What changed?
  2. Has this issue occurred before? If yes, when?
  3. Can you replicate the problem?
  4. Did the user do anything differently? If yes, what?
  5. Are other users experiencing the same issue?

With each succeeding question, try to isolate the problem by a process of elimination. For example, if a workstation cannot connect to the network, determine whether it is a network-wide problem or a workstation-specific problem. If it is only the workstation, you have eliminated a large share of the variables and moved closer to isolating the problem. Even if you cannot find a solution yourself, eliminating extraneous factors saves time when you seek outside help.

You will find that network-related challenges usually take either of these two forms:

  • Slow response times from the remote server, which can be caused by
    • network congestion
    • overloaded server at the remote end of the connection
    • poor routing
    • misconfigured DNS
    • misconfigured NIC duplex and speed
    • bad cabling
  • Lost connectivity/disconnection from network, which can be caused by
    • power failures
    • shut down of the remote server or an application on the remote server
    • hardware and software failures (e.g., a kernel panic or an out-of-memory (OOM) condition)

Note that slowness can escalate to the point where connectivity is lost, so the checks used for slow response times also apply when diagnosing lost connectivity.

When tracing the cause of a network problem, work up or down through the layers of the network stack, without skipping any, to eliminate variables. The layers are:

  1. Application Layer (e.g., Secure Shell (SSH), Telnet, HTTPd)
  2. Transport Layer (e.g., flow control)
  3. Network Layer (e.g., addressing, routing)
  4. Link Layer (e.g., hardware or device drivers)
  5. Physical Layer (e.g., actual cables and other physical media)

When troubleshooting remote servers, we highly recommend using an Intelligent Platform Management Interface (IPMI) with iKVM. Because it runs through a dedicated network port and has a separate IP address, it remains reachable even when the server’s primary network connection is down.

Testing link via cables

When the link light on the network interface controller (NIC; also referred to as the network interface card or network adapter) is on, the connection between your server and the switch or router is functioning at the physical level. If the light is off and there is a link failure, start troubleshooting with these basic checks:

  • Are the cables in good condition?
  • Are the cables plugged securely and properly?
  • Is the switch or the router where the server is connected turned on?


Testing link status via command-line interface

The ethtool command reports the link status, speed, and duplex settings for supported NICs. In the example below, the NIC is operating at 100 Mb/s with full duplex, and the link is functioning properly.

# ethtool eth0
Settings for eth0:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: Unknown
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000007 (7)
                           drv probe link
    Link detected: yes
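
When you check many servers, it can help to reduce the ethtool report to the three values that matter most for a first look. The following is a minimal sketch, not a definitive tool: link_summary is a hypothetical helper name, and it assumes the "Speed:", "Duplex:", and "Link detected:" labels appear as in the report above.

```shell
# link_summary (hypothetical helper): read `ethtool <iface>` output on stdin
# and print the speed, duplex, and link state on a single line.
link_summary() {
    awk -F': *' '
        { sub(/^[ \t]+/, "") }                 # drop leading indentation
        $1 == "Speed"         { speed = $2 }
        $1 == "Duplex"        { duplex = $2 }
        $1 == "Link detected" { link = $2 }
        END { printf "speed=%s duplex=%s link=%s\n", speed, duplex, link }
    '
}
```

A typical call would be "ethtool eth0 | link_summary". Fields that the driver does not report are left empty in the output.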

Checking NIC status via command-line interface

The ifconfig command shows you all the activated NICs in your system, even those that have no link. An interface will not appear if it is turned off.

# ifconfig

The ifconfig -a command shows all NICs, whether they are functional or not. Interfaces that are shut down or non-functional do not show an IP address line, and the word “UP” does not appear in their flags line, as the examples below show:

Shut-down Interface

eth1      Link encap:Ethernet  HWaddr 0C:C4:7C:06:45:0F
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fb900000-fb920000

Active Interface

eth0      Link encap:Ethernet  HWaddr 0C:C4:7C:06:78:4E
          inet addr:X.X.X.X  Bcast:Y.Y.Y.X  Mask:255.255.255.248
          inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:348130719 (332.0 MiB)  TX bytes:452866425 (431.8 MiB)
          Memory:fb920000-fb940000
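
On a server with many interfaces, the missing “UP” flag is easy to overlook. The sketch below lists the interfaces that lack it. It is only an illustration: down_interfaces is a hypothetical helper name, and it assumes the older net-tools output layout shown above, in which each interface block starts with a "Link encap:" line.

```shell
# down_interfaces (hypothetical helper): read `ifconfig -a` output on stdin
# and print the name of every interface whose block has no UP flag.
down_interfaces() {
    awk '
        /Link encap:/ {                        # first line of an interface block
            if (name != "" && !up) print name
            name = $1; up = 0
            next
        }
        { for (i = 1; i <= NF; i++) if ($i == "UP") up = 1 }
        END { if (name != "" && !up) print name }
    '
}
```

A typical call would be "ifconfig -a | down_interfaces". With the two example interfaces above as input, only eth1 is listed.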

Viewing NIC errors via ifconfig

Slow connectivity can often be traced to errors that creep in due to poor configuration or excessive bandwidth utilization. Correct these errors whenever possible, as error rates in excess of 0.5% result in noticeably sluggish performance.

In addition to interface status, the ifconfig command shows the number of errors, dropped packets, overruns, frame errors, and carrier errors for each interface:

eth0      Link encap:Ethernet  HWaddr 0C:C4:7C:06:78:4E
          inet addr:X.X.X.X  Bcast:Y.Y.Y.X  Mask:255.255.255.248
          inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:348130719 (332.0 MiB)  TX bytes:452866425 (431.8 MiB)
          Memory:fb920000-fb940000
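
To compare the counters against the 0.5% threshold mentioned above, you can turn them into percentages. The sketch below is one possible way to do so. It assumes that the error rate is the number of errors divided by the number of packets in the same direction, and that the RX and TX lines use the "packets:N errors:N" layout shown above; error_rates is a hypothetical helper name.

```shell
# error_rates (hypothetical helper): read `ifconfig <iface>` output on stdin
# and print the RX and TX error percentages, marking any above 0.5% as HIGH.
error_rates() {
    awk -v limit=0.5 '
        ($1 == "RX" || $1 == "TX") && $2 ~ /^packets:/ {
            split($2, p, ":")                  # p[2] = packet count
            split($3, e, ":")                  # e[2] = error count
            rate = (p[2] > 0) ? 100 * e[2] / p[2] : 0
            printf "%s error rate: %.3f%% %s\n", $1, rate, (rate > limit ? "HIGH" : "ok")
        }
    '
}
```

A typical call would be "ifconfig eth0 | error_rates". The counters are cumulative since the interface came up, so a long-running server can hide a recent burst of errors behind a low overall percentage.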

Viewing NIC errors via ethtool

The ethtool command can provide a much more detailed report and show errors when used with the -S switch, as shown in the example below:

# ethtool -S eth0

NIC statistics:

rx_packets: 2148660
tx_packets: 2652312
rx_bytes: 356733445
tx_bytes: 469741533
rx_broadcast: 197923
tx_broadcast: 6877
rx_multicast: 0
tx_multicast: 6
multicast: 0
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 356733445

rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
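
Because the -S report is long and most counters are normally zero, filtering it down to the non-zero error counters makes problems easier to spot. The following is a rough sketch under stated assumptions: nonzero_error_counters is a hypothetical helper name, and the name patterns it matches (err, drop, coll, missed, no_buffer) are a guess based on the counters listed above. Counter names vary between NIC drivers, so adjust the pattern for your hardware.

```shell
# nonzero_error_counters (hypothetical helper): read `ethtool -S <iface>`
# output on stdin and print error-related counters with a value above zero.
nonzero_error_counters() {
    awk -F': *' '
        { sub(/^[ \t]+/, "") }                 # drop leading indentation
        $1 ~ /err|drop|coll|missed|no_buffer/ && $2 + 0 > 0 { print $1 "=" $2 }
    '
}
```

A typical call would be "ethtool -S eth0 | nonzero_error_counters". With the example report above as input, it prints nothing, since every error counter is zero.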

Viewing NIC errors via netstat

The netstat command is useful for systems where ethtool is not available. It provides a more limited report when used with the -i switch. See the example below:

# netstat -i

Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0  2148672      0      0      0  2652323      0      0      0 BMRU
lo        16436   0        0      0      0      0        0      0      0      0 LRU
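
The same kind of filter can be applied to the netstat table. The sketch below sums the error, drop, and overrun columns per interface and prints only the interfaces where the sum is above zero. It is an illustration only: ifaces_with_errors is a hypothetical helper name, and it assumes the column headers shown above. It looks the columns up by header name because some netstat versions omit the Met column.

```shell
# ifaces_with_errors (hypothetical helper): read `netstat -i` output on stdin
# and print "<iface> <total>" for interfaces with any errors, drops, or overruns.
ifaces_with_errors() {
    awk '
        BEGIN { n = split("RX-ERR RX-DRP RX-OVR TX-ERR TX-DRP TX-OVR", want, " ") }
        $1 == "Iface" {                        # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("RX-ERR" in col) {                    # data rows follow the header
            bad = 0
            for (k = 1; k <= n; k++)
                if (want[k] in col) bad += $(col[want[k]])
            if (bad > 0) print $1, bad
        }
    '
}
```

A typical call would be "netstat -i | ifaces_with_errors".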
 

Learning the possible causes of Ethernet errors

The list below summarizes the common types of Ethernet errors and their usual causes:

  • Collisions happen when the NIC detects itself and another server on the LAN attempting to transmit data at the same time. They can be expected as a normal part of operation and are typically below 0.1% of all frames sent. Higher rates may point to a faulty NIC or poor cabling. There are two kinds of collisions:
    • Single collisions occur when the Ethernet frame goes through only one collision before it is sent successfully.
    • Multiple collisions occur when several collisions force the NIC to attempt sending a frame multiple times before it succeeds.
  • Cyclic redundancy check (CRC) errors happen when frames are sent but are corrupted in transit. CRC errors that occur without many collisions usually indicate electrical noise. Check that you are using the correct type of cable, that the cabling is undamaged, and that the connectors are plugged in securely.
  • Frame errors happen when a frame arrives with an incorrect CRC and a non-integer number of bytes. This is usually the result of collisions or a bad Ethernet device.
  • FIFO and overrun errors happen when the NIC cannot hand off data to its memory buffers fast enough because the incoming data exceeds the data-rate capabilities of the hardware. This kind of error is usually a sign of excessive traffic.
  • Length errors happen when the received frame is shorter or longer than the Ethernet standard allows, usually because of incompatible duplex settings.
  • Carrier errors happen when the NIC loses its link to the hub or switch. If this occurs, check for faulty cabling or faulty interfaces on the NIC and networking equipment.

Checking ARP values to see MAC addresses

When you lose connectivity with another server that is directly connected to your local network, look at the address resolution protocol (ARP) table of the server you are troubleshooting to determine whether the remote server’s NIC is responding to any type of traffic. Lack of communication at this level may point to any of the following issues:

  • Either server might be disconnected from the network
  • Bad network cabling
  • A NIC might be disabled or the remote server might be turned off
  • The remote server might be running a firewall like iptables or the Windows built-in firewall
    (Note: In this case, you can typically still see the MAC address, and the server is running the correct software, yet it does not communicate with the client on the same network.)

The ifconfig -a command shows you both the NIC’s MAC address and the associated IP addresses of the server you are currently logged into. In the example below, you can see that the eth0 interface has one IP address X.X.X.X tied to the NIC hardware MAC address of 0C:C4:7C:06:78:4E:

# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 0C:C4:7C:06:78:4E
          inet addr:X.X.X.X  Bcast:Y.Y.Y.X  Mask:255.255.255.248
          inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:348130719 (332.0 MiB)  TX bytes:452866425 (431.8 MiB)
          Memory:fb920000-fb940000

The arp -a command shows you the MAC addresses in your server’s ARP table, i.e. all other nodes on the directly connected network that have been sending Ethernet frames during the last few minutes. In the example below, we see some form of connectivity with the router at address Z.Z.Z.Z:

# arp -a
test.mydomain.com (Z.Z.Z.Z) at 90:e2:be:39:bb:49 [ether] on eth0

Note: Make sure the IP addresses listed in the ARP table match those of the servers you expect in your network. If they don’t, your server might be plugged into the wrong switch or router port. Remember to also check the ARP table of the remote server to see whether it is populated with acceptable values.
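
When the ARP table is long, a small lookup helper makes it easier to check one expected neighbor at a time. The sketch below is a minimal example, assuming the "name (IP) at MAC [ether] on iface" layout shown above and that unresolved entries appear with "<incomplete>" in place of the MAC address. The name arp_mac is hypothetical.

```shell
# arp_mac (hypothetical helper): read `arp -a` output on stdin and print the
# MAC address recorded for the IP address given as the first argument.
# Returns non-zero when the IP is missing or its entry is incomplete.
arp_mac() {
    awk -v ip="($1)" '
        $2 == ip && $3 == "at" && $4 != "<incomplete>" { print $4; found = 1 }
        END { exit !found }
    '
}
```

A typical call would be "arp -a | arp_mac Z.Z.Z.Z", with Z.Z.Z.Z replaced by the address of the router or server you expect to see. A non-zero exit status suggests that the neighbor has not answered ARP recently.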

Testing network connectivity via arping

An Ethernet network needs ARP to function properly, so ARP requests are not usually blocked by a firewall. If they were blocked, no host could find another host on the network and connect to it, which would essentially unplug the system from the network.

To test connectivity using arping, you need to be on the same subnet as the host you are trying to reach. Your default gateway is usually a good target for this kind of testing. By sending an ARP request rather than an ICMP echo, you are virtually guaranteed to get a reply as long as the other host is actually reachable on the same subnet.

The arping utility makes testing hosts easy. It performs an action similar to the ping command, but on the Ethernet layer. You give it an IP address to ping, and arping sends the proper ARP request. It then listens for ARP replies and prints any it receives, including the round-trip time, as shown in the example below:

# arping 192.168.1.100
ARPING 192.168.1.100
60 bytes from 00:40:05:01:fc:1e (192.168.1.100): index=0 time=190.973 usec

You can also use arping to detect whether more than one host is configured to use the same IP address. In the example below, two machines are replying to queries for the same IP address:

# arping -I eth0 -c 2 192.168.1.100
ARPING 192.168.1.100 from 192.168.1.1 eth0
Unicast reply from 192.168.1.100 [0a:00:3e:d1:bf:49]  0.743ms
Unicast reply from 192.168.1.100 [00:02:b3:99:2c:f8]  0.768ms
Sent 2 probes (1 broadcast(s))

ARP pinging is a good ICMP ping replacement on Ethernet networks. It enables you to confidently take firewalls out of the equation and know that a failed ARP ping indicates a real problem that should be looked into.
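
The duplicate-address check described above can be automated by collecting the distinct MAC addresses that answer. The sketch below is an illustration rather than a finished tool: arping_macs is a hypothetical helper name, and it assumes the reply format shown in the second example, in which the MAC address appears in square brackets.

```shell
# arping_macs (hypothetical helper): read arping output on stdin and print
# each distinct replying MAC address once, in order of first appearance.
arping_macs() {
    awk -F'[][]' '
        /reply from/ {
            mac = tolower($2)                  # text between the square brackets
            if (!(mac in seen)) { seen[mac] = 1; print mac }
        }
    '
}
```

A typical call would be "arping -I eth0 -c 2 192.168.1.100 | arping_macs". More than one line of output suggests that several hosts are answering for the same IP address.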

NOTE: There are two popular arping implementations:

  • Linux iputils suite – cannot resolve MAC addresses to IP addresses
  • Arping implementation written by Thomas Habets – can ping hosts by MAC address and IP address

Configuring Linux iptables firewall

The Linux iptables firewall is a common source of connectivity issues, especially on brand-new servers. It is installed by default under most popular Linux distributions and usually allows only a limited range of traffic. To prevent this issue, see our article on configuring iptables.

Resolving basic IP issues

An ICMP response of “Destination Host Unreachable” often points to a default gateway misconfiguration on the initiating host. A ping response of “Request timed out,” on the other hand, could suggest a misconfigured default gateway, lower-layer (L1/L2) issues, or both.

To verify that your IP address, network mask (netmask), and default gateway are correct, check the output of the following commands:

# ifconfig ethX

# route -n
OR
# netstat -rn
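
The default gateway is the row whose destination is 0.0.0.0 and whose flags include G. The sketch below extracts it from either command. It assumes the usual Linux routing table layout, with the gateway in the second column, the flags in the fourth, and the interface in the last. The name default_gateway is hypothetical.

```shell
# default_gateway (hypothetical helper): read `route -n` or `netstat -rn`
# output on stdin and print "<gateway> <interface>" for each default route.
default_gateway() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2, $NF }'
}
```

A typical call would be "route -n | default_gateway". No output means that no default route is configured, and more than one line means that several default routes exist.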

The “Destination Host Unreachable” error message is your router or server telling you that the target IP address is part of a valid network but that it is getting no response from the target server. This lack of response may be due to any of the following reasons:

A host on a directly connected network:

  • The client or server might be down or disconnected from the network.
  • You may be using an incorrect type of cable. (Note: There are two basic types: straight-through and crossover.)

A host on a remote network:

  • The network device does not have a route in its routing table to the destination network and sends an ICMP type 3 (Destination Unreachable) message, which triggers the error message.

Testing connectivity via ping

It is always good practice to force a response from your server to check its connectivity with your local network. The ping command is the most common method to do this across multiple networks. It sends ICMP echo packets that request a corresponding ICMP echo-reply response from the device at the target address. Most servers respond to a ping query and a lack of response should alert you to potential problems that could be caused by any of the following situations:

  • A server with that IP address does not exist.
  • The server has been configured not to respond to pings.
  • A firewall or router along the network path is blocking ICMP traffic.
  • You have incorrect routing. In this case, check the routes and subnet masks on both the local and remote servers and all routers in between. A classic symptom of bad routes on a server is the ability to ping servers only on your local network and nowhere else. Use traceroute to ensure you are on the correct path.
  • Either the source or the destination device has an incorrect IP address or subnet mask.

Note: There are a variety of ICMP response codes that can help in further troubleshooting.

By default, the Linux ping command sends one ping per second until you stop it with Ctrl-C. The -c option limits the number of pings sent. The example below shows five successful pings to a Google server:

# ping -c 5 google.us
PING google.us (74.125.196.99) 56(84) bytes of data.
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=1 ttl=43 time=7.26 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=2 ttl=43 time=7.37 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=3 ttl=43 time=7.25 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=4 ttl=43 time=7.43 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=5 ttl=43 time=7.30 ms

--- google.us ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4013ms
rtt min/avg/max/mdev = 7.257/7.327/7.432/0.127 ms
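
For scripted checks, the packet loss figure in the summary line is usually the most useful value. The sketch below pulls it out of the ping output. It assumes the Linux summary format shown above, with a line containing "packet loss", and ping_loss is a hypothetical helper name.

```shell
# ping_loss (hypothetical helper): read ping output on stdin and print the
# packet loss percentage from the summary line, without the percent sign.
ping_loss() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%$/, "", $i); print $i; exit }
    }'
}
```

A typical call would be "ping -c 5 google.us | ping_loss". A value of 0 matches the healthy example above, while 100 means that no replies came back at all.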

See also Advanced Network Troubleshooting: Using traceroute.
See also Advanced Network Troubleshooting: Using My Traceroute (MTR).