A Xeon E3-1220 has an 8MB L3 cache and costs about $210. If you deaggregate the entire IPv4 routing table to /24s to support fast lookups and store one nibble of destination data per route (probably just a destination interface number, if you can live with only 15 or 16 possible destinations, but you probably don't have that many upstreams), doesn't that mean that the entire routing table can fit in the L3 cache on the CPU (with a bit of the L3 not used by class E space, which will leave a bit of space for code, assuming you aren't doing anything but routing on this CPU)?
Now, it's quite possible that the Linux kernel's routing isn't dimensioned to provide this sort of performance, but if you want to avoid CAM in the IPv4 world, I don't see why you'd really need CAM...
(Also, I haven't looked carefully at how the L3 cache actually works on the E3s to make sure that it's shared between code and data, etc etc etc. Also, if you're actually implementing this, you presumably do want to include class E in the table, and then hope/assume that when class E doesn't get used, the CPU will be smart enough to cache the code instead.)
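(For concreteness, here's a toy Python sketch of the flat table I'm describing — one 4-bit next-hop index per /24, packed two per byte, so 2^24 entries come to exactly 8 MiB, the size of the E3's L3. The function names and the meaning of hop 0 are my own invention, obviously, not anything a real router uses:)

```python
# Flat /24 lookup table: one nibble (next-hop index 0..15) per /24,
# packed two nibbles per byte. 2^24 entries / 2 = 8 MiB total.
TABLE_BYTES = 2**24 // 2
table = bytearray(TABLE_BYTES)     # all zeros = hypothetical default hop

def set_route(prefix24: int, hop: int) -> None:
    """prefix24 is the top 24 bits of the destination; hop is 0..15."""
    byte, odd = divmod(prefix24, 2)
    if odd:
        table[byte] = (table[byte] & 0x0F) | (hop << 4)
    else:
        table[byte] = (table[byte] & 0xF0) | hop

def lookup(dst_ip: int) -> int:
    """One shift, one array read, one mask -- no trie walk, no TCAM."""
    prefix24 = dst_ip >> 8
    byte, odd = divmod(prefix24, 2)
    return (table[byte] >> 4) & 0x0F if odd else table[byte] & 0x0F
```

The point is that the lookup is a single random access into an array that fits in L3, rather than a longest-prefix-match walk.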
>A Xeon E3-1220 has an 8MB L3 cache and costs about $210. If you deaggregate the entire IPv4 routing table to /24s to support fast lookups and store one nibble of destination data per route (probably just a destination interface number, if you can live with only 15 or 16 possible destinations, but you probably don't have that many upstreams), doesn't that mean that the entire routing table can fit in the L3 cache on the CPU (with a bit of the L3 not used by class E space, which will leave a bit of space for code, assuming you aren't doing anything but routing on this CPU)?
Hm, interesting. So, uh, for simplicity, we have 256^3 routes, right? Each one is, uh, what, 33 bits of data? I mean, you need 3 bytes for the network and then what, uh, 5 bits for the mask? Then okay, 4 bits for the dest, so 33 bits per route, no? So (33*256^3)/8 is 69206016 bytes, or what, 66 megabytes? sweet jesus, you are right. I mean, you are off by an order of magnitude, but at this scale, who gives a shit about an order of magnitude.
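(For the record, the two sizes being compared work out like this — the 33-bit figure stores network + mask + hop explicitly per entry, while the parent's scheme stores only the 4-bit hop, because the /24 itself is implicit in the array position. That's where the order of magnitude goes:)

```python
ROUTES = 256**3                      # every possible /24: 16,777,216 entries

# Storing (network, mask, hop) explicitly: 24 + 5 + 4 = 33 bits per route.
explicit_bytes = ROUTES * 33 // 8    # 69,206,016 bytes, roughly 66 MiB

# Parent's scheme: the array index *is* the /24, so only the 4-bit
# next-hop nibble is stored per entry.
implicit_bytes = ROUTES * 4 // 8     # 8,388,608 bytes = exactly 8 MiB

print(explicit_bytes, implicit_bytes)
```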
Interesting. 'cause none of the commercial routers do this. which is fucking weird. With this optimization (only store the first 3 octets, as you aren't routing anything smaller, and have a lookup table for dest addresses) you could take full tables in the puny amounts of CAM that come on, say, L3 switches. I will ask around as to why this isn't done.
(of course, with IPv6, this does not come close to solving the problem. /32 is the 'standard' handout for ISPs, and usually the smallest prefix you accept is a /48. There are a lot of fucking /32s.)
Also, you'd have some complications, as /32s are commonly used to blackhole DDoS targets, but you only have a handful of those, so that would add complexity but it wouldn't kill the idea.
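(A toy sketch of how I'd imagine handling that complication — keep the handful of blackholed /32s in an exact-match set consulted before the flat /24 table. The names and the DROP sentinel are mine, and the dict here is just a stand-in for the real 8 MiB array:)

```python
DROP = -1
blackholes = set()                 # a handful of DDoS-target /32s
flat_table = {}                    # stand-in for the flat /24 array

def route(dst_ip: int) -> int:
    if dst_ip in blackholes:       # exact /32 match beats the /24 entry
        return DROP
    return flat_table.get(dst_ip >> 8, 0)   # 0 = default hop
```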
I bet I'm missing something; I mean, every two-bit ISP would be really happy to give you ten grand for a 10G full-tables BGP router, and you can get L3 switches with enough CAM for that and a few 10G ports for little more than half that.
So, I /really/ made myself look like an idiot here:
>you can get l3 switches with enough cam for that and a few 10G ports for little more than half that.
Because I was confusing TCAM with DRAM; rather different sorts of things. (And that I'd classify as a mistake of inattention. I made other mistakes in that message due to lack of knowledge, but, uh, yeah; I'm not usually /that/ dumb.)
Anyhow, uh, yeah. Looks like I also misunderstood what the parent comment was trying to do (/increase/ the size of the lookup table by breaking larger blocks into /24s, in order to reduce the number of cycles the CPU spends on the lookup.)
So yeah, in full? I'd delete my comment that I'm responding to here if I could. As I can't, I'd like to acknowledge my ignorance, for the record.
If I could rewrite that, I'd say something to the effect that huge amounts of effort have gone into making TCAM obsolete, and so far? well, progress is being made, but the progress isn't moving any faster than line-rates are going up, as far as I can tell. There are very smart people working on the problem, and they say it's a big problem. (my understanding and experience, as a sysadmin that also deals with networks, is that your linux-based PC routers can route something between 1G and 10G of line-rate small packets. Above that, PC routers can't cope with the PPS.)
Note, uh, I do have two spare E3 Xeons laying about the office, and 10G interface cards for both, as well as (at least for a short while longer) two nice Arista-brand 10G switches, so if you are in the Sunnyvale area, and you do want to test an idea or bench the current state of the art of software routers, well, it's something I've gotta do anyhow. I still don't have anything even attempting to route my new 10GbE Cogent port. (I've split off a 1GbE connection using a switch, and plugged that into my existing quagga router.)
Given your constraints, I wonder if what you actually want is one switch for each upstream, and a gigabit link from each of your physical Xen servers to each switch that goes to an upstream. If your switches have to route at all, they can each have a default route to the one respective upstream connected to that switch. Then, have a VM on each physical Xen server with full routing tables to control how the outbound packets from all the VMs on that physical host get routed. And hopefully your switches can deal with a /32 route for each VM or something, if they just have a route per VM plus a default route to the one ISP they're connected to.
There's also the variant where you have a /24 block (or something) which has one IP address which is your ISP's router, and then one IP address per physical Xen server, and if your ISP is willing to take individual /32 announcements to your servers (I think this is technically feasible, but may require more mental flexibility than some ISPs have; on the other hand, I think you're talking Cogent, and mentioned that Cogent is willing to peer directly with your customers, which may imply that you're dealing with a flexible ISP), you can have the ISP do the layer 3 routing to the right physical host for traffic going to each VM. (The idea is that the ISP would not propagate the /32s to their peers; you'd separately send them a broader announcement that they could pass along to their peers.)
One other issue in this scheme is that if a single gigabit link between one physical Xen host and the switch going to one ISP breaks, inbound traffic that happens to come in via the ISP attached to the switch with the broken link ought to have some mechanism to find its way over to the other switch for the one physical Xen host that has the broken link. I think there are ways that OSPF and/or iBGP with a non-default next-hop-self setting can be made to do this.