Network Issues at AGLT2

This page captures network-related issues at AGLT2.

Network Issues after UltraLight Router at Starlight (R04CHI) was Retired

On August 28, 2012, around noon Eastern, the UltraLight router at Starlight was retired. As a result, AGLT2 lost around 1100 routes it had been receiving, and only about 250 additional routes came in via the USLHCnet E600 connection. The primary impacts of this change:
  1. We lost routing to some subnets at BNL, one of which ( hosted the pilot factory AGLT2 needed access to
    • This should be working, now that ESnet is accepting AGLT2 routes from USLHCnet.
  2. Three ultralight.org subnets lost routing
    • VLAN 57 hosting 192.84.86.126/28 (Terapaths/VNODs/ESCPS/StorNet nodes)
    • VLAN 2002 hosting 198.32.43.32/28 (REDDnet nodes)
    • VLAN 2001 hosting 198.32.43.48/28 (REDDnet nodes)
  3. Various R&E subnets relevant to LHC started to have problems contacting our SRM headnode head01.aglt2.org
    • The sites below were found by searching http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py. All of them have been seen failing to contact head01.aglt2.org.
    • LIP-LISBON_DATADISK srm01.lip.pt 193.136.90.58
      • Have route for this
      • Traceroute is good, so most likely an end site issue.
    • NCG-INGRID-PT_DATADISK srm01.ncg.ingrid.pt 193.136.75.141
      • Have route for this
      • Traceroute is good, so most likely an end site issue.
    • LIP-COIMBRA_DATADISK storm.gridc.lip.pt 193.137.227.11
      • Have route for this
      • Traceroute is good, so most likely an end site issue.
    • UAM-LCG2_DATADISK grid002.ft.uam.es 150.244.244.41
    • IFIC-LCG2_DATADISK srmv2.ific.uv.es 147.156.116.232
      • Have route for this
      • Traceroute is good, so most likely an end site issue.
    • IFAE_DATADISK srmifae.pic.es 193.109.172.158
      • Do not have a route for this network; this one still needs to be figured out.
    • SARA-MATRIX_DATADISK srm.grid.sara.nl 145.100.32.248
      • Have route for this
      • Traceroute is good, so most likely an end site issue.
The first main bullet was addressed by getting ESnet to remove the filter on the AGLT2 subnets that USLHCnet was sending them. (Fixed around noon Eastern on August 29, 2012)

The second main bullet (ultralight.org subnets) was also fixed by ESnet accepting the routes (as above). However, this will not work longer term, since ESnet (and others) do not want to accept /27 or /28 announcements. Roy is working on getting a full /24 allocated (192.41.238.0/24) onto which we could migrate the two /28s and the /27. In progress as of September 10, 2012.
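To make the migration idea concrete, here is a minimal sketch of how one /27 and two /28s could be carved out of the /24 named above. The helper name `carve` and the resulting layout are illustrative assumptions, not the allocation plan actually used.

```python
import ipaddress

def carve(block, prefix_lengths):
    """Allocate one subnet per requested prefix length from `block`.

    Hypothetical helper: subnets are handed out largest first, so each
    one starts on a boundary that is valid for its own size.
    """
    block = ipaddress.ip_network(block)
    cursor = int(block.network_address)
    end = int(block.broadcast_address) + 1
    allocated = []
    for plen in sorted(prefix_lengths):  # smaller prefix length = larger subnet
        size = 2 ** (block.max_prefixlen - plen)
        if cursor + size > end:
            raise ValueError("block too small for requested subnets")
        allocated.append(ipaddress.IPv4Network((cursor, plen)))
        cursor += size
    return allocated

if __name__ == "__main__":
    # The /24 is from the text; the subnet sizes follow the "two /28s and
    # the /27" wording. The placement within the /24 is an assumption.
    for net in carve("192.41.238.0/24", [28, 27, 28]):
        print(net)
```

Announcing only the covering /24 would avoid the problem of upstream networks filtering the longer /27 and /28 prefixes.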

The third main bullet is still a problem. We need to figure out why those remote hosts can't reach head01.aglt2.org. Are we missing routes to them? As of September 10, 2012, many (possibly all) of these seem to be OK. It may simply have taken a while for the new routes to propagate and stabilize.
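The "are we missing routes?" question can be checked mechanically against a dump of the received prefixes. The sketch below does a longest-prefix match for a few of the SRM hosts listed above. The function name `best_route` and the entries in `ILLUSTRATIVE_ROUTES` are placeholders, not taken from the AGLT2 routing table.

```python
import ipaddress

def best_route(ip, prefixes):
    """Return the longest prefix in `prefixes` that covers `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    covering = [
        net for net in (ipaddress.ip_network(p) for p in prefixes)
        if addr in net
    ]
    return max(covering, key=lambda net: net.prefixlen, default=None)

# Hypothetical prefixes for illustration only; substitute the real table.
ILLUSTRATIVE_ROUTES = ["193.136.0.0/15", "145.100.0.0/15"]

# Host addresses copied from the list of affected sites above.
SRM_HOSTS = {
    "srm01.lip.pt": "193.136.90.58",
    "srmifae.pic.es": "193.109.172.158",
    "srm.grid.sara.nl": "145.100.32.248",
}

if __name__ == "__main__":
    for host, ip in SRM_HOSTS.items():
        route = best_route(ip, ILLUSTRATIVE_ROUTES)
        print(f"{host} ({ip}): {route if route else 'NO ROUTE'}")
```

A host with a covering route and a clean traceroute points to an end-site issue, as noted in the list; a host with no covering prefix points to a missing announcement.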

Planning and Known Issues

Longer term, improvements could be made in the following areas:

  1. There is still a problem with how traffic is getting to the MSU site. In many cases it comes via UM. We need to make sure MSU can directly receive relevant traffic, rather than having it route via UM.
  2. MSU is unable to "stack" its Juniper EX4500 with its current stack of two EX4200s. The EX4200s support 15K routes, while the EX4500 supports only 10K. The current number of R&E routes coming to AGLT2 is around 13K. If we could find another system to handle most of the routes for MSU, we could stack the Junipers and set up a default route to the "new" router. We need to investigate options. In the interim, MSU will not add the EX4500 to the stack and will use it only as a Layer-2 switch. A big disadvantage of this configuration is that it uses four 10GE ports that would otherwise be free if the stacking cables could be used.
  3. The C6506 (AMAZON) in 2268 Randall is too big for its purpose, and we want to retire it. Roy has a 1U replacement we could swap in, but he needs to clear and reconfigure it before we can use it. As of September 10, 2012, this is still in progress.
  4. We need to integrate the new Dell 8024F into the room 2268 network. This 10GE device should host the 10GE link to Nile (replacing SW5-unit 3). We need to retire Amazon (see above) to do this. We need the additional 10GE ports in the room to connect up the "test" rack of older storage systems.
  5. I think it would be beneficial to have AGLT2 and BNL exchange a more complete set of prefixes (certainly including any subnets that host ATLAS or LHC services). NOTE: BNL is reviewing those ATLAS-related systems that are not on routes shared with Tier-2s via the virtual circuits or LHCONE and will be migrating them to new addresses to resolve this (underway as of September 5, 2012).
Other items? Add them here or send email to Shawn McKee (smckee@umich.edu).
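The route-table arithmetic behind item 2 can be written out as a short check. This sketch assumes, as item 2 implies, that a mixed stack is limited by the member with the smallest route table. The function name is hypothetical and the figures are the approximate ones quoted above.

```python
def stack_route_headroom(member_limits, current_routes):
    """Return spare route capacity for a stack limited by its smallest member.

    A negative result means the routes do not fit.
    """
    return min(member_limits.values()) - current_routes

# Approximate figures from item 2 (15K, 10K, and about 13K R&E routes).
CURRENT_ROUTES = 13_000
ex4200_only = {"EX4200-a": 15_000, "EX4200-b": 15_000}
mixed_stack = {**ex4200_only, "EX4500": 10_000}

if __name__ == "__main__":
    print("EX4200-only stack headroom:", stack_route_headroom(ex4200_only, CURRENT_ROUTES))
    print("Mixed stack headroom:", stack_route_headroom(mixed_stack, CURRENT_ROUTES))
```

Under that assumption the mixed stack is about 3K routes short, which is why handing most routes to another router and keeping only a default route on the stack would make stacking feasible.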

Shawn -- ShawnMcKee - 10 Sep 2012
Topic revision: r5 - 10 Sep 2012 - 13:23:58 - ShawnMcKee
 
