Playing with AWS Lambda

I started playing with AWS Lambda tonight. Eventually, I’d like to use Lambda and the API Gateway to provide a sort of DDNS (dynamic DNS); a script should run on my home server, touch the API Gateway, and Lambda should reprogram an address in Route 53 to match whatever was used for the origin IP. That involves passing a few parameters around, so I figured a good first step was to write a Lambda to collect any arguments and email them to me. There’s even a quick example on sending email in the Python smtplib docs. Should be easy, right?

The joys of Amazon email-handling

It turns out Amazon’s Lambda environment doesn’t allow connections to just any SMTP server; you need to use one of the servers that provide AWS’ SES (Simple Email Service). Amazon provides several servers, one per region; use what’s closest. Connections to all other mail servers will fail with a generic “Connection closed” message (presumably Amazon is simply resetting these connections as they’re opened).

Once I was able to open a server connection, I started getting failures due to a lack of authentication. Amazon charges by the email, so I needed to create an IAM user to handle my mail sending (and add Python code to turn on STARTTLS and actually log in). I used the SES credential creation wizard, but any IAM user with the AmazonSesSendingAccess inline policy will work as well. In a custom policy, ensure you’ve allowed the ses:SendRawEmail action.
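
For reference, a minimal custom policy granting just that action might look something like this (a sketch, not the SES wizard’s exact output):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ses:SendRawEmail",
      "Resource": "*"
    }
  ]
}
```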

After that, I started getting errors about my sending and receiving email addresses not being “verified”. It turns out Amazon won’t let you send email unless you’ve proven you own the addresses or domains involved. In my case, I verified my domain with Amazon SES, and (since these were largely testing emails) stuck to sending emails to myself.

By the way, the SES verification directions indicate that the verification is region-specific. If you use multiple SES endpoints, you’ll need to verify your email addresses or domains with each one. For domains hosted by Route 53, this process is easy - there’s even a button to propagate records to Route 53 right from the SES console. There’s also support for DKIM, a system for identifying the validity of emails. Must remember to look into that further someday…

Lambda and API Gateway

Creating an API Gateway interface to a Lambda function is pretty easy, once the Lambda already exists. Since I wanted to inspect the HTTP headers coming into the gateway, it was important to turn on the Lambda Proxy Integration checkbox. With that, AWS expects a dictionary (of headers, body, and statusCode) in return. Much of the API Gateway documentation indicates that this should be a JSON dictionary, but if the Lambda is written in Python, the Gateway expects a native Python dictionary back.

The API Gateway will pass useful things to your function in the event and context variables. event contains all the HTTP headers, browser info, etc., while context includes any additional information (including meta-parameters, like the permissible runtime). In Python, the context variable is actually an object of type LambdaContext; useful API client data is probably in context.client_context (though that will be None if nothing is passed).

For my purposes, I’m most interested in event['requestContext']['identity']['sourceIp'] - a string containing the client IP address. I’ll turn that into the basis of a dynamic DNS API in the near future. For now, here’s the code I’m using for my test lambda function:

[Test Lambda function]
import smtplib
import pprint
from email.mime.text import MIMEText

sender = ""
recipient = ""
username = ""
server = ""
password = "At8aj2lvnASuweAvKu3v49siaselinv492nn1jlHFadjJsjsjwl"
port = 587

def lambda_handler(event, context):
    pp = pprint.PrettyPrinter(indent=4)
    rdict = {}
    rdict['body'] = ("Hello from Lambda: <br><pre>" + pp.pformat(event) +
                     "</pre><br><pre>" + pp.pformat(context.client_context) + "</pre>")
    rdict['headers'] = { "Content-Type": "text/html" }
    rdict['statusCode'] = "200"

    # Mail a copy of the event via SES (STARTTLS and SMTP auth required)
    msg = MIMEText(pp.pformat(event))
    msg['Subject'] = "Test from lambda"
    msg['From'] = sender
    msg['To'] = recipient
    s = smtplib.SMTP(host=server, port=port)
    s.starttls()
    s.login(username, password)
    s.sendmail(sender, [recipient], msg.as_string())
    s.quit()

    return rdict

Replace the sender, recipient, username, server, and password variables with your own values; server should be your region’s SES SMTP endpoint.
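
Looking ahead to the DDNS goal, here’s a sketch (function and zone names are hypothetical; the boto3 call is left commented out) of how that sourceIp could drive a Route 53 update:

```python
def upsert_a_record(fqdn, ip, ttl=60):
    """Build the ChangeBatch for a Route 53 UPSERT of a single A record."""
    return {
        "Comment": "DDNS update from Lambda",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": fqdn,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }],
    }

# Inside the Lambda handler, something like:
#   import boto3
#   ip = event["requestContext"]["identity"]["sourceIp"]
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="YOUR_ZONE_ID",
#       ChangeBatch=upsert_a_record("home.example.com.", ip))
```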

When visiting the API Gateway with a web browser, I get output similar to the following:

[Lambda sample output]
Hello from Lambda:
{ u'body': None,
u'headers': { u'Accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
u'Accept-Encoding': u'gzip, deflate',
u'Accept-Language': u'en-us',
u'Cache-Control': u'max-age=0',
u'CloudFront-Forwarded-Proto': u'https',
u'CloudFront-Is-Desktop-Viewer': u'true',
u'CloudFront-Is-Mobile-Viewer': u'false',
u'CloudFront-Is-SmartTV-Viewer': u'false',
u'CloudFront-Is-Tablet-Viewer': u'false',
u'CloudFront-Viewer-Country': u'US',
u'Cookie': u'regStatus=pre-register; s_dslv=1482545852452; s_fid=023C0FA3C5B564D7-149E1840C1D08425; s_nr=1482545852462-New; s_vn=1514081675457%26vn%3D1',
u'DNT': u'1',
u'Host': u'',
u'Referer': u'',
u'User-Agent': u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12',
u'Via': u'1.1 (CloudFront)',
u'X-Amz-Cf-Id': u'KNiwjv9wnadjwJFHbWJCjbdbdyyxx==',
u'X-Forwarded-For': u',',
u'X-Forwarded-Port': u'443',
u'X-Forwarded-Proto': u'https'},
u'httpMethod': u'GET',
u'isBase64Encoded': False,
u'path': u'/',
u'pathParameters': None,
u'queryStringParameters': None,
u'requestContext': { u'accountId': u'175919371',
u'apiId': u'81i44Fkwn',
u'httpMethod': u'GET',
u'identity': { u'accessKey': None,
u'accountId': None,
u'apiKey': None,
u'caller': None,
u'cognitoAuthenticationProvider': None,
u'cognitoAuthenticationType': None,
u'cognitoIdentityId': None,
u'cognitoIdentityPoolId': None,
u'sourceIp': u'',
u'user': None,
u'userAgent': u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12',
u'userArn': None},
u'requestId': u'713e07f6-db86-11e6-9fb8-3b48cbecf41e',
u'resourceId': u'lasdu138cjc',
u'resourcePath': u'/',
u'stage': u'Production'},
u'resource': u'/',
u'stageVariables': None}

Took a bit of digging, but now I have a nice little URL I can visit that calls a Lambda, prints off the various arguments, and even emails me the event data. Nifty!

Moving to Route 53

I’ve run my own nameservers for years. A decade or two ago, setting up a DNS was a fun way to learn how bits of the internet worked; I even hosted backup name services for a couple friends. While nothing here is really broken, I’ve begun using Amazon’s CloudFront and need to make a change. CloudFront (Amazon’s web caching system) uses a randomly-generated hostname and dynamic set of IP addresses, so if you want a static name in your own domain you’ll either need to use a CNAME record or host the domain’s nameservers with Route 53. Unfortunately, since I’d like to have the toplevel of my domain itself point to a CloudFront address, the CNAME option is right out. CNAME records aren’t allowed at the apex of a domain, and can cause all sorts of practical problems.

Setting up a Route 53 DNS host was extremely easy - just go to the Route 53 dashboard, hit Create Hosted Zone, and fill out the (short) form. I just imported my zone config file verbatim (click the new hosted domain name in the list, then Import Zone File, and paste your existing content in the dialog box that appears). AWS automatically changed my NS and SOA records to match their servers and imported everything else (even the AAAA records). Once the hosted domains were in Route 53, all I had to do was navigate to my registrar (Dotster, for now) and enter Amazon’s provided nameservers instead of my own. After the TTL expired, my names began serving from Amazon’s infrastructure.

With all the plumbing re-routed, adding the CloudFront linkage was pretty simple. The only less-than-obvious piece was the Alias radio button in the Create Record Set interface; changing this from “No” to “Yes” changes the form, and provides a list of alias targets (CloudFront distributions, S3 buckets, and Elastic Load Balancers/IPs) to choose from. Save Record Set, and now you’ve got a geographically-distributed, highly redundant infrastructure for serving a low-traffic blog. Nice!

A Hexo Blog Part 3: Serve It with CloudFront

I’m still setting up a Hexo blog in Amazon’s AWS, and the next step on the game plan is to front the AWS S3 bucket with CloudFront. On the plus side, this seems to be incredibly easy. Amazon even has some very thorough documentation on how to set up CloudFront to be a basic web cache.

The first time I made a CloudFront distribution I forgot to include a default root object. It’s an easy fix; make sure the root object is set to index.html, and things should load up fine. It’s also worth noting that I picked my site’s S3 bucket for an origin (rather than the S3 website URL, as indicated in the CloudFront docs). Not sure why this isn’t recommended, but it seems to work fine.

2017-01-07 update: Using your S3 bucket directly from CloudFront (rather than the S3 website URL) doesn’t work fine. Sure, the toplevel page displays, but pages in subdirectories don’t show up. Oops! I see why Amazon says you should use the URL - as soon as I switched it over, everything behaved nicely.

Multiple Origins

In addition to this blog, I also use a webserver for a few other dynamically-created things. These don’t need to be public, or scale as broadly as a blog, but they do need to appear from my domain’s main web server. Thankfully, CloudFront lets you do that by creating multiple origins.

Once you have a distribution made, go back to the CloudFront dashboard and click its ID. Select the Origins tab, then Create Origin. I entered the domain name of my origin web server (not the CloudFront distribution’s address - that would be the CloudFront service) for the Origin Domain Name, and left Origin Path blank. Then go to the Behaviors tab, and click Create Behavior. I used example/* for a Path Pattern; be warned, this must agree with the web server’s configuration! The web server must be set to respond to GET requests to /example. If your server is set to serve its content on /, then the Origin Path for this origin would need to be set to /example. I mistakenly set both the Origin Path and the Path Pattern when setting this up, and got my web server’s error screen back from CloudFront. The server received a GET /example/example - the Origin Path and Path Pattern were combined. Best to pick one or the other, not both.
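
That pitfall can be modeled with a tiny (hypothetical) helper: CloudFront prepends the Origin Path to the path of every forwarded request, including the part that already matched the Path Pattern.

```python
def origin_request_path(origin_path, request_path):
    """Approximate the path CloudFront sends to the origin:
    the Origin Path is prepended to the full request path."""
    return origin_path.rstrip("/") + request_path

# Origin Path "/example" plus a request matching Path Pattern "example/*":
origin_request_path("/example", "/example/page.html")  # "/example/example/page.html" - oops
# Leaving Origin Path blank avoids the duplication:
origin_request_path("", "/example/page.html")          # "/example/page.html"
```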

By the way, if you do have dynamic content (like I do), it’s probably guided by a query string or set of cookies. There are options for the behaviors on your distribution to forward some or all of these on to the origin. You can even set a regex-style whitelist, if you only want a few query string elements or cookies to be sent along. In my case, I forwarded all query strings over (and ignored cookies) and things seemed to work perfectly. I say “seemed”, because a day later I realized my content was no longer updating: CloudFront’s default TTL is a bit too long for me. At some point I’ll need to modify that application to return HTTP Cache-Control tags.

Deployment Times

AWS docs indicate a deploy should take about 15 minutes (and editing origins and behaviors count as a re-deploy). In my experience, this can take considerably longer - often up to an hour or so. It’ll synchronize eventually, so give it a lot of time.

A Hexo Blog: Part 2 (Hexo into AWS S3)

After setting up a basic Hexo blog, the next logical step is to start publishing the blog to AWS S3. In the past, I’ve used a dedicated VM for this sort of thing, but that means I’ve still got a machine to patch, update, and care about. If I use S3, Amazon takes care of all that - and the costs are lower, to boot.

For the most part, I’m cribbing from Sven Flickinger. However, after following his directions I got some AWS permission errors, so I’m documenting my steps here as well. Your mileage may vary…

First, we need to add the deployer:

npm install --save hexo-deployer-s3

This requires a new config stanza for _config.yml:

deploy:
  type: s3
  bucket: <bucket>
  aws_key: <key>
  aws_secret: <secret>
  region: <region>

At some point, you’ll need to log into AWS and start making an S3 bucket. Buckets need a name, and a region; for my blog, I used the domain name for the bucket name and stuck it in us-east-1. Be sure to enable website hosting, and list index.html as the Index Document.

Once an empty bucket is created, we’ll also need an IAM user with appropriate permissions to upload the blog pieces. Creating an IAM user is simple; be sure to create an access key when you do (or go back into the IAM display, hit the Security Credentials tab, and click Create access key). The access key ID and secret key need to be plugged into _config.yml, or in environment variables AWS_KEY and AWS_SECRET (and removed from _config.yml).

The new IAM user is going to need permissions to manipulate the S3 bucket. Sven gave a short policy doc, but I found that to be incomplete - at least for the first deploy. Go into IAM, Create Policy, and use the Policy Generator. You can edit the policy document; I’m using this:

S3 Access Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3blogFullAccess",
      "Effect": "Allow",
      "Action": [ ... ],
      "Resource": [ ... ]
    }
  ]
}
The Sid field is an arbitrary string (no whitespace) used to name the statement. The Resource field lists all things this policy can act upon; it’s important to list both the contents of the bucket (the bucket ARN with /* appended) as well as the top-level of the bucket itself (the bare bucket ARN). Without both, hexo deploy won’t be able to function.
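
Filled in for a hypothetical bucket named example-bucket (the action list here is an illustrative guess at a workable minimum, not necessarily the exact set I used), the whole policy would look something like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3blogFullAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```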

Once the policy is made, select it (from IAM’s Policies sidebar) and click the Attached Entities tab. Hit Attach, then pick your user from the list of IAM users. That should be it; hexo deploy should function now.

After the first deploy, your blog should be accessible via <bucket>.s3-website-<region> - a not-too-friendly domain name. At some point I’ll get around to integrating this with AWS CloudFront, to put it under a more human-readable name (and will probably type up another blog post as well).

A Hexo Blog: Part 1

For years, I haven’t done much of anything with a personal web site. Around the turn of the century, I played with all the popular PHP/MySQL-powered blog and photo album systems, but these have largely been left by the wayside. Up until recently, my personal web site was merely a few static HTML pages’ worth of notes, running on a hosted virtual machine.

Recently, I decided to start delving into serverless computing (mainly around AWS). Overhauling my personal site, and moving it into AWS, seemed like a nice way to do so. I like having a static web site, but more modern tools would be really handy. A friend recommended I try Hexo, so here we are.

Hexo is a Node.js engine for generating and deploying a set of static web pages for a blog site. Source can be comfortably checked into git, and the hexo CLI tool will generate web pages and deploy them as needed.

Hexo+AWS Game Plan

So here’s my initial plan:

  1. Set up Hexo in AWS’s Code Commit git tree.
  2. Get Hexo to deploy to AWS S3
  3. Front the web server with AWS CloudFront (for theoretically infinite scaling)
  4. Use AWS Lambda to automatically rebuild the blog on every git commit

For now, I’m just covering the first item. The rest are topics for future days…

Getting Started with Hexo

There’s lots of “getting started with Hexo” sorts of postings out there, plus a fairly fleshed-out bit of documentation on the Hexo web site, so I’m not going to go into tons of detail here. Similarly, there’s plenty of documentation on AWS CodeCommit, and how to set up an initial repository, so I’m only going to cover a few oddities here.

Hexo requires a slew of Node.js dependencies. npm is your friend, and will put things in node_modules by default. For future reference, I added these:

npm dependencies
npm install --save hexo-cli hexo hexo-renderer-jade hexo-renderer-pug hexo-renderer-stylus hexo-generator-archive hexo-generator-tag hexo-generator-feed hexo-generator-sitemap hexo-browsersync
git add package.json
git commit -a

That bit about package.json wasn’t obvious to me (a Node.js neophyte) initially. Apparently running npm install will parse your package.json file, and auto-install anything listed therein. Much easier than mucking about with system dependencies, or checking piles of Node scripts into your blog’s git tree. Really gotta learn more about Node one of these days…

Anyway, being a long-time UNIX fan I whipped up a quick Makefile to build everything:

all: node_modules public

node_modules: package.json
	npm install --save

public: source node_modules/*
	hexo generate

clean:
	rm -rf public

distclean: clean
	rm -rf node_modules

For now, my fledgling blog just lives in an AWS code tree. Eventually, though, I should get around to the other points listed above (though that will be several subsequent posts).

OS X and Dynamic DNS Updates

A while ago I found a couple notes on Dynamic DNS, using TSIG and dynamic
updates, and put together a dynamic subdomain for my domain. Apple gear (or
at least OS X) seems to require a couple particulars to function, and since
I keep forgetting what’s necessary I put together these notes.


Apple calls TSIG-signed dynamic updates “dynamic global hostname”. On OS X,
this can be turned on in System Preferences; hit the “Sharing” button, assign
a hostname (short name, not fully-qualified), and click the “Edit…” button.
Check the “Use dynamic global hostname” box.

Hostname should be the fully-qualified domain name you want to update. Oddly
enough, the User should also be the FQDN you wish to update. The
Password is the TSIG key.

DNS Config

For this to work, you’ll need a few special records in DNS; Apple calls this
“Bonjour”. Really, it’s a bunch of SRV records. Add the following:

; DDNS update glue
_dns-update._udp        IN      SRV     0 0 53
b._dns-sd._udp          IN      PTR     @    ; browse domain
db._dns-sd._udp         IN      PTR     @    ;
dr._dns-sd._udp         IN      PTR     @    ;
lb._dns-sd._udp         IN      PTR     @    ; legacy browse
r._dns-sd._udp          IN      PTR     @    ; registration domain

Explanations of all these SRV records can be found in the references, below.

In addition, you’ll need to configure your domain to support DDNS, and set up
a TSIG key for your machine. When you set up the TSIG key, you’ll need to
make a 256-bit HMAC-MD5 key:

dnssec-keygen -a HMAC-MD5 -b 256 -r /dev/urandom -n HOST

Don’t forget the trailing period on the hostname when using
dnssec-keygen. It’s not necessary for OS X, but bind
really likes it.
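
For context, the server side of this (zone and key names hypothetical) might look something like the following in named.conf; the grant line restricts the key to updating only its own name:

```
key "myhost.dyn.example.com." {
    algorithm hmac-md5;
    secret "<base64 secret from the dnssec-keygen .key file>";
};

zone "dyn.example.com" {
    type master;
    file "dyn.example.com.zone";
    update-policy {
        grant myhost.dyn.example.com. name myhost.dyn.example.com. A AAAA;
    };
};
```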

One last little oddity: it looks like Apple devices will only update a single
nameserver with their changes. If you have multiple DNSes listed as
authoritative for your dynamic zone, you’ll want to make the first one listed
(and the one listed in the update glue record) able to receive updates, then
fashion some method of replication to your other nameservers. It appears
that if Apple gets a successful submission from one server, it never bothers
to attempt injecting the update into other machines (but it will fail over
to other nameservers, and update them, if the first one fails to respond).


Bandwidth Delay Product Tuning & You

Wide-area networks abound, and fast networks abound (where fast is > 200 Mbps),
but your average consumer will never deal with both at the same time. As of
this writing (late 2013), typical US broadband connections are 40 Mbps or less.
Generally less. Most operating systems seem to be tuned to work acceptably
well across the Internet at these speeds by default. Unfortunately, users of
faster links (like gigabit ethernet, or 10 or 40 Gbps ethernet) are often left
at a loss to explain why their network connections seem amazingly fast on a
local connection (intra-building, across a small academic campus, etc.) but
fall to rather paltry speeds when covering any sort of distance. In my
experience, users generally chalk this up to “the network is slow”, and live
with it. If some network support engineers (ISP, corporate network group,
whatever) is engaged, you usually get some sort of finger-pointing; all sides
have plenty of evidence that both the client, server, and network are operating
just fine, thank you, and that something else must be broken.

In many cases, TCP itself is the limiting factor. TCP must present a lossless
byte stream, even in the face of packet loss and corruption. To support
that, a TCP implementation (read: your operating system kernel) must save
every byte of data it transmits until the recipient has explicitly acknowledged
it. If a packet is lost, the recipient will fail to acknowledge (ACK) it (or
will repeatedly ACK the last byte it did receive); the sender can use its
stored copy to re-transmit the missing data. So how big does this buffer
need to be, anyway? Yeah, that would be the bandwidth delay product.

Bits do not propagate instantly - the speed of light is finite. That means
a sender must buffer enough data for its network adapter to run at full speed
while waiting for the full round-trip delay to the recipient. The round-trip
delay can be measured via the UNIX ping command; typical values are
in tens of milliseconds. Multiply the bandwidth and the time (in seconds) for
a round trip, and you’ve got the amount of buffer space needed to keep a
connection busy at that distance. For example, for a 1 Gbps network connection
with a 54 ms ping latency (say, from the midwest to the west coast), we
require 1 Gb/s * 0.054 s = 54 Mb = 6.75 MB of buffer space. Obviously, a
10 Gbps ethernet connection (and appropriate routers) would require 67.5 MB
of buffer to fill the available bandwidth.
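
The arithmetic above, spelled out as a one-liner (helper name hypothetical):

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: buffer needed to keep the pipe full, in bytes."""
    return bandwidth_bps * rtt_seconds / 8

# The 1 Gbps / 54 ms example from above - roughly 6.75 MB:
bdp_bytes(1e9, 0.054)
# And at 10 Gbps - roughly 67.5 MB:
bdp_bytes(10e9, 0.054)
```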

The remainder of this document outlines how to tune TCP stacks in a couple OSes
for high bandwidth delay product communication. There’s a wide array of
OS-specific TCP and IP tuning parameters; here, I’m only focusing on the ones
related to long-haul TCP sessions. For more info, check out the links
referenced below.

Linux
Linux’s TCP stack includes tunables for overall maximum socket memory, as well
as a three-part value for send and receive, listing minimum, initial, and
maximum memory use. There are many other tunables, but as of RedHat Enterprise
6 (kernel 2.6.32 or so) most of these default to usable values for a 10 Gbps
WAN connection. The socket memory settings, however, default to a maximum of
4 MB of buffer space - probably far too small for modern WAN things.

TCP tunables are controlled via sysctl (read the man page). Add
the following to /etc/sysctl.conf:

net.core.rmem_max = 524288000
net.ipv4.tcp_rmem = 8192 262144 131072000
net.core.wmem_max = 524288000
net.ipv4.tcp_wmem = 8192 262144 131072000

The rmem_max line allows up to 0.5 GB of memory to be used for a
socket. Technically, this is way overkill, as the next line (for
tcp_rmem) will limit this to 128 MB max (and 8 kB minimum, with
a default of 256 kB). If 128 MB proves insufficient, simply raise this third
value. Both are repeated for wmem (memory for a sending socket).

Apple OS X

Apple’s TCP stack is BSD-derived. It also uses sysctl for tuning,
but has different tunables from Linux. Total socket memory is limited by
the maxsockbuf parameter; unfortunately, as of OS 10.9, this is
limited to a mere 6 MB - and that must be split (statically!) between send
and receive memory. Honestly, that’s just not enough for long-distance
transfers, but we’ll make the most of it that we can.

Currently, I’m recommending tuning these parameters in /etc/sysctl.conf:


  • kern.ipc.maxsockbuf: This is the maximum amount of memory to
    use for a socket, including both read and write memory. Again, in 10.9, this is limited
    to 6 MB (and defaults to 6 MB) - rather disappointing, Apple. Note that this
    probably also affects SYSV IPC sockets (though, that’s unlikely to make a
    major difference for anyone).

  • net.inet.tcp.sendspace: Allow for up to 3 MB of memory for a send buffer.
    This, plus net.inet.tcp.recvspace, must be less than maxsockbuf.

  • net.inet.tcp.recvspace: Allow for up to 3 MB of memory for a receive buffer.
    This, plus net.inet.tcp.sendspace, must be less than maxsockbuf.

  • net.inet.tcp.doautorcvbuf,doautosndbuf: MacOS has a mechanism
    for auto-tuning buffer sizes. By default, this is limited to 512 kB for each
    of send and receive. Setting these to 0 will disable the buffer
    auto-tuning entirely.

  • net.inet.tcp.autorcvbufmax,autosndbufmax: If you’d rather
    keep the auto-tuning buffer logic enabled (see above), you’ll want to raise
    this maximum. The default (at least in 10.9) is 512 KB; a value of 3 MB
    (3145728) is more appropriate, and will allow your machine to hit higher
    transfer speeds. I suggest caution if your machine handles a lot of TCP
    connections: most users probably won’t care, but at up to 6 MB per TCP
    connection, you could burn through memory quickly if you’ve got hundreds of
    connections in progress.

  • net.inet.tcp.autorcvbufinc,autosndbufinc: Based on the name,
    I suspect this determines how aggressively buffer auto-tuning ramps up to its
    full buffer size. It defaults to 8 KB; if you do use buffer auto-tuning, and
    if you see poor performance on short-lived connections (but better performance
    on TCP transfers that take at least a couple minutes to complete), you might
    try increasing this value by a factor of 10-20.

  • net.inet.tcp.mssdflt: Yeah, this should be higher. MacOS
    defaults to 512 bytes for its maximum segment size (the largest packet it will
    attempt to send). “Normal” ethernet frames are up to 1500 bytes (and there
    are specs for yet larger packets). 512 bytes is appropriate for modems, but
    not for anything faster (and that includes cable modems). If you’re using
    ethernet, I’d recommend 1460 (that’s a 1500-byte ethernet frame, minus 40 bytes
    of TCP/IP headers). If your ethernet goes through a PPPoE relay (e.g., DSL,
    and maybe some cable modems) you probably want 1440 (to account for 20 bytes of
    PPPoE framing data). Note that this doesn’t really make your connection
    faster - you just use fewer packets (and therefore fewer network resources)
    to get the job done.

  • net.inet.tcp.win_scale_factor: Most TCP implementations
    automatically calculate the window scale factor. In case MacOS doesn’t, I
    set this to 8 - though I’m not certain this is required. Try it, try omitting
    it, see if there’s any difference. If you’re wondering what a window scale
    factor is, I suggest reading the wikipedia page. Essentially, it controls how large a buffer
    your machine can advertise to the other side of the TCP connection.

  • net.inet.tcp.delayed_ack: Delayed ACKs are generally a good
    idea - wait until a few packets have arrived, and acknowledge them all at once.
    Fewer reply packets, less network traffic, etc. This can result in slightly
    higher latency (since the receiver waits slightly for multiple packets to
    arrive, even if only one is on the wire). Worse still, in some not-so-rare
    circumstances, this can interact very badly with Nagle’s algorithm
    (a similar sender-side optimization) - so much so that you can get several
    orders of magnitude worse performance, with no obvious reason why. If you
    suspect this is a problem, turn it off; for more information, look up the
    interaction between delayed ACKs and Nagle’s algorithm.
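
Putting the parameters above together, an /etc/sysctl.conf stanza along these lines (values taken from the discussion above; treat it as a starting sketch, and adjust to taste):

```
kern.ipc.maxsockbuf=6291456
net.inet.tcp.sendspace=3145728
net.inet.tcp.recvspace=3145728
net.inet.tcp.autorcvbufmax=3145728
net.inet.tcp.autosndbufmax=3145728
net.inet.tcp.mssdflt=1460
net.inet.tcp.win_scale_factor=8
# Only if you hit the delayed-ACK/Nagle interaction described above:
# net.inet.tcp.delayed_ack=0
```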

References, Next Steps

There are a plethora of TCP tuning guides out there. If you’re tuning to a
specific application, or with certain high-end hardware (in particular,
Mellanox 10 and 40 Gbps adapters), I’d recommend looking at ethtool
settings as well.

Making an iPad print to a CUPS queue

Apple’s iPads (really, any recent iOS device - phones, iPods, etc.) use their
“Airprint™” system for discovering and using printers. If you have
something plugged into a Mac (or AirPort), and shared appropriately, this
is all automatic. My printers are all handled by a CUPS server, though, which
isn’t detected by default. Thankfully, this is amazingly easy to set up if
you install Avahi.

Set up Avahi to advertise your printer’s CUPS queue. Basically, you just need
to write an XML description of the service advertisement; mine looks like this:


<?xml version="1.0" standalone='no'?>
<!DOCTYPE service-group SYSTEM "avahi-service.dtd">
<service-group>
    <name replace-wildcards="yes">HP Color LaserJet CP2025dn on server</name>
    <service>
        <type>_ipp._tcp</type>
        <subtype>_universal._sub._ipp._tcp</subtype>
        <port>631</port>
        <txt-record>rp=printers/YourQueueName</txt-record>
        <txt-record>MDL=Color LaserJet CP2025dn</txt-record>
        <txt-record>product=(HP Color LaserJet CP2025dn)</txt-record>
        <txt-record>pdl=application/postscript,application/pdf,image/urf</txt-record>
        <txt-record>URF=none</txt-record>
        <txt-record>Duplex=T</txt-record>
    </service>
</service-group>


A couple important points:

  • You must advertise IPP (since that’s CUPS’ native language, this should be obvious).

  • The _universal subtype is required for AirPrint (though OSX will find the printer without this).

  • Printer capabilities are indicated by <txt-record> tags.

  • The pdl record lists all data types the printer (or in this case, CUPS) can natively handle.

  • The URF record is required. It can be equal to “none”, but if this record is missing AirPrint won’t recognize the printer.

  • The “Duplex” option will make a Duplex option appear in iOS when you try and print to this printer. Other useful boolean options may include:

    • Duplex

    • Copies

    • Transparent

    • Binary

This example is cobbled together from other sources. Ryan Finnie’s blog was
quite useful, as was tjfontaine’s airprint-generate script.

RoCE - RDMA over Converged Ethernet


Currently, I work for a mid-sized high-performance computing (HPC) shop.
For many of the scientific codes we run, communication performance matters -
both in terms of inter-machine (a.k.a., inter-node) bandwidth and latency.
Like most HPC shops, we have some experience with Infiniband, but in recent
years we’ve been using 10 Gbps Ethernet (10gigE) for a cluster interconnect.
Given ethernet’s prevalence, and general dominance in datacenter networking,
10gigE seems on the surface to be a general win, and a decent choice for
a cluster interconnect (particularly for a user base that historically
prefers gigabit ethernet for cost reasons).

I’ve designed three 10gigE clusters, two of which are on the current
(November, 2011) Top 500 list. I do
not recommend this. 10gigE has its place, but currently economics favor
Infiniband for high-performance computing. If your code uses MPI, and you
need more cores than you can fit in one compute node (and your code isn’t
embarassingly parallel - I’ve seen some that could operate nicely over
10 Mbps ethernet), you should be looking at Infiniband.

Rather than delving into why I’ve been building 10gigE clusters, this page
discusses modern technology that can help you get the most performance from
a high-speed ethernet fabric. Be warned, the content from here on out gets
technical quickly. I’ve likely spent more time than is healthy examining
this space, and doing so requires a fair amount of expertise in TCP, IP,
ethernet, Infiniband (as well as general RDMA theory, and its multiple
incarnations), operating systems, MPI libraries, and several vendors’ product
lines.
To quote the xterm source code: “There be dragons here.”

Defining “slow”, and Why Plain TCP/IP is Bad

TCP/IP is great, for most things - but the API pretty much requires kernel
intervention. Your app calls socket() and write(),
some library fires off a syscall, and the kernel starts formatting data to
go over the wire. Under Linux, a null syscall has an overhead of around 1000
instructions (if you’ll pardon the blind assertion), so that means you can
do around 2.5 million syscalls per second on a 2.5 GHz CPU (using some vague
hand-waving to avoid calculating effects of load-store queuing and superscalar
processors). On paper, that means a hard max of around 30 Gbps of throughput -
more, with frame sizes over 1500 bytes.
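The arithmetic above can be sketched in a few lines of awk; the 2.5 GHz clock, 1000-instruction syscall cost, and 1500-byte frames are the assumed inputs from this paragraph, not measurements:

```shell
# Rough throughput ceiling if every 1500-byte frame costs one null syscall.
# All three inputs are back-of-the-envelope assumptions.
awk -v hz=2500000000 -v insns=1000 -v frame=1500 'BEGIN {
    calls = hz / insns               # syscalls per second on one core
    gbps  = calls * frame * 8 / 1e9  # bits on the wire per second
    printf "%.1f million syscalls/s, ~%.0f Gbps ceiling\n", calls / 1e6, gbps
}'
```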

Unfortunately, that’s not reality. First off, a processor will need to do
some data formatting and copying beyond the time to enter the syscall. Second,
data arriving will also trigger syscalls. Some of this can be ameliorated
(e.g., jumbo frames, interrupt coalescing, etc.) but at a cost of tying up a
processor to handle the kernel’s side of the communication. If your application
requires frequent data exchange (like most HPC simulations), the added latency
and processor overhead can greatly degrade performance - even without fully
utilizing the available bandwidth.


TOE NICs

TOE (TCP Offload Engine) NICs may help, to a limited degree. A TOE will
reduce the CPU’s workload, but won’t significantly reduce overall message
latency - unless the TOE vendor ships a wrapper library to replace the
sockets API (Solarflare does this, for example).


iWARP

If you need to do RDMA over Ethernet, iWARP is the easiest way to do it. It’s
not quite Infiniband, but many of the various IB-related commands in OFED
will work. Many RDMA apps will work with this, and as iWARP is encapsulated
by TCP/IP it can transit a router. Latency will be higher than RoCE (at least
with both Chelsio and Intel/NetEffect implementations), but still well under
10 μs. iWARP is reasonably stable with recent versions of the
OpenFabrics stack - in-kernel drivers
may not be as stable (including those baked into Red Hat Enterprise Linux 5 and 6).
Caveat emptor.
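A few quick sanity checks for an iWARP setup, assuming the OFED userland is installed; the `-C` count, verbosity, and the hostname `server` are placeholder choices:

```shell
# The iWARP NIC should show up as an RDMA device:
ibv_devinfo

# OFED's RDMA ping-pong test: run the server side on one node...
rping -s -v -C 10

# ...and connect from a client (replace "server" with the server's name/IP):
rping -c -v -C 10 -a server
```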


RoCE

RoCE is RDMA over Converged Ethernet - but Infiniband over Ethernet would be
a more apt description. Strip the GUIDs out of the IB header, replace them
with Ethernet MAC addresses, and send it over the wire. As of this writing,
only Mellanox makes
RoCE-capable equipment (their CX2 and CX3 line of products).

Infiniband is a lossless physical-layer protocol, so RoCE requires lossless
Ethernet. Also, since it’s Ethernet, RoCE cannot transit a router. It’s
strictly a layer-2 protocol, and it needs a complicated layer-2 configuration.

Lossless Ethernet: a Quick Review

Ethernet becomes lossless by re-using 802.3x PAUSE frames for explicit flow
control. This is timing-sensitive; a receiver must send a PAUSE soon enough
that it is received and processed before the receive buffer can fill.
Obviously, there are issues stretching this over some distance. Switches
must be internally lossless, and must be able to send PAUSE frames as well
as receive them. Such switches are usually marketed with acronyms like “DCB”
(Data Center Bridging) or “CEE” (Converged Enhanced Ethernet).
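On the NIC side, plain (non-PFC) PAUSE flow control can typically be toggled with ethtool; the interface name here is an assumption, and the switch port must honor and generate PAUSE as well:

```shell
# Enable link-level PAUSE in both directions on the NIC (eth2 assumed):
ethtool -A eth2 rx on tx on

# Verify the current pause settings:
ethtool -a eth2
```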

Obviously, this coarse-grained approach will pause all traffic over the link -
including any IP or FCoE traffic. As this can have a negative impact on
non-RoCE performance, Cisco has proposed Priority Flow Control (PFC, now
covered under IEEE 802.1Qbb). This
is a PAUSE frame with a special payload, indicating which Ethernet QoS class
should be paused. This is accompanied by other protocols (such as DCBX) to
negotiate QoS values on either end of a link (i.e., between NIC and switch).

Finally, all types of traffic on the link will have different Ethernet frame
types (Ethertypes); IPv4, IPv6, FCoE, and RoCE all have different ID values.


RoCE Caveats

While RoCE is supported by OFED, as of OFED 1.5.3 it isn’t completely
stable. You’ll want to use Mellanox’s OFED - version 1.5.3 or higher. Stock
OFED will work fine for small tests, but large applications will have a
tendency to crash.

PFC is a pain. The tools to auto-negotiate it may not exist for RoCE - the
only documentation I’ve found was limited to FCoE. Avoid it if at all possible.

Somehow, you’ll need to classify RoCE traffic as lossless. Here are some
suggestions, in my order of preference:

  1. Discriminate RoCE traffic by Ethertype - RoCE packets would be
    treated losslessly, and non-RoCE traffic could be dropped (during congestion).

  2. Classify ALL traffic as lossless (and deal with the performance impact, if
    any, on non-RoCE traffic).

  3. Assign a QoS class for lossless traffic. Unfortunately, Mellanox adapters will
    only emit a QoS when they emit a VLAN tag, so you’ll need to do the following:

    • Set a default IB Service Level to match your QoS using options rdma_cm def_prec2sl=4 in /etc/modprobe.d (Obviously, I’m using the value 4)

    • Configure your Ethernet switch to treat that traffic as lossless

    • Create a tagged VLAN device on your RoCE NIC on all connected systems

    • Assign those VLAN devices a private IP address

    • Stick that IP address in /etc/mv2.conf, so MVAPICH2 will know what IP address to try for RoCE connections

    • Configure all other RDMA-aware applications to use a non-default GID (since VLAN interfaces will appear as additional GID indexes on the Infiniband HCA side of the RoCE adapter)
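The third option can be sketched as a handful of commands. Every name and number here - the service level, NIC name, VLAN ID, and subnet - is an assumed example; substitute your own:

```shell
# Map IB service level 4 onto traffic leaving the RoCE NIC
# (file name under /etc/modprobe.d is arbitrary).
cat > /etc/modprobe.d/rdma_cm.conf <<'EOF'
options rdma_cm def_prec2sl=4
EOF

# Tagged VLAN on the RoCE NIC (eth2 and VLAN 100 assumed), mapping
# priority 4 onto outgoing frames, with a private address.
ip link add link eth2 name eth2.100 type vlan id 100 egress-qos-map 0:4
ip addr add 10.100.0.1/24 dev eth2.100
ip link set eth2.100 up

# Point MVAPICH2 at that address for RoCE connections.
echo 10.100.0.1 > /etc/mv2.conf
```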

    So you have Cisco Nexus switches…

    If you can, stop reading and go buy some Infiniband adapters. You’ll save a
    considerable amount of staff time.

    Fine. Keep reading. But don’t say I didn’t warn you.

    The Nexus 5000-series and the Nexus 7000-series switches are completely
    different products. The interface to building lossless queues is different,
    the command syntax is different, and different values can be used for lossless
    traffic classes on each series of switches. If you have environments with
    both, you’ll be picking different QoS values.

    The Nexus 7000 platform only supports lossless queuing on the newest “F”
    boards - the fabric boards that have no routing abilities. You’ll want to
    buy those, if you plan on having stable RoCE.

    Finally, be wary of ANY firmware updates. We’ve had a functional RoCE
    configuration on a Nexus 7000 switch, using firmware 5.1(3), using the
    third method above. That broke, however, when we upgraded to 5.1(5).
    Something changed in the default queuing config, and since you can only build
    on the default lossless queue config (rather than nuke it and define your
    own), you are subject to changes in the default. In our case, RoCE performance
    dropped to 30 Mbps (down from 9.91 Gbps). All wasn’t lost, though - after
    the upgrade, all traffic was lossless (except what we’d previously tagged
    via QoS, of course). We just stopped using QoS, and now have reliable
    Ethernet. Absolutely bizarre.

    Making this all work for practical apps

    Making this work depends on how RoCE traffic was classified. If RoCE
    Ethertypes are lossless, or if all traffic is lossless (options #1 or #2,
    above) any RDMA application should just work - the RoCE adapter presents as an
    Infiniband HCA.

    If you picked option #3, you’ll need to jump through some extra hoops. First,
    set the def_prec2sl module parameter and /etc/mv2.conf
    as described above. At this point, MVAPICH2 applications should work. For
    OpenMPI, you’ll need to use OpenMPI 1.4.4 or 1.5.4 or newer. They need
    additional command-line options to set the IB service level and the IP address
    to use: -mca btl_openib_ib_service_level <number> and
    -mca btl_openib_ipaddr_include <ipaddr>, respectively.
    These can be baked into a config file (like openmpi-mca-params.conf
    in your OpenMPI’s share directory). Note that
    btl_openib_ipaddr_include can take CIDR notation for a subnet to
    match, so you can use the same config file for all nodes in a cluster.
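For example, a cluster-wide parameter file might look like the following; the service level and subnet are assumed values, and the file’s exact path depends on where your OpenMPI is installed:

```shell
# Append the RoCE settings to openmpi-mca-params.conf so every job
# picks them up without extra mpirun flags (path is a placeholder).
cat >> /path/to/openmpi/share/openmpi-mca-params.conf <<'EOF'
btl_openib_ib_service_level = 4
btl_openib_ipaddr_include = 10.100.0.0/24
EOF
```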

    In theory, it may be possible to use RoCE for non-MPI applications - including
    kernel-level things like Lustre. I’d only attempt this if options #1 or #2
    are in use, though - setting extra VLANs, non-default GIDs, and custom IB
service levels (mapped to Ethernet QoSes) is likely to be hard to integrate
into anything other than OpenMPI and MVAPICH2.

    Additional Resources

    There isn’t a lot of documentation (practically zero, outside of Mellanox)
    on RoCE. Any useful links I can find will be added here.

    Time Machine, Meet Netatalk. But in Lion.


    This morning, I upgraded the first Mac around the house to MacOS 10.7 (aka,
    “Lion”). Went smoothly, and it’s re-indexing Spotlight now. Insert comments
    about how wonderful it is to have to get used to new trackpad finger gestures
    (gestures are nice, but it’ll be a few days before I’m used to the workflow).

    Naturally, Time Machine is now horribly broken. Originally, I was using AFP
    and netatalk, as described here, but then I
    switched to SMB and Samba (since Netatalk 2.1.x wasn’t as stable). Lion no
    longer supports either of these methods; it only works with AFP 3.3. That’s
    only supported by Netatalk 2.2, which (as of this writing) was committed to
    git yesterday.

    This page serves to document my odyssey in setting up netatalk on a FreeBSD
    jail in the basement, from the latest source in git. Here are a couple of useful links:

    Throughout all this, I’m assuming a similar earlier setup
    of Time Machine has been done, and the previous netatalk packages
    have been removed. Right now, I’m mainly concerned with differences.

    Source Setup

    As shown in the links above, get git, grab the source, and start building:

    pkg_add -r git
    git clone git://
    cd netatalk
    git checkout netatalk-2-2-0
    ./configure --without-acls --without-pam --disable-ddp --disable-cups

    I didn’t have appropriate zeroconf headers on my FreeBSD jail, so I didn’t
    configure with --enable-zeroconf. I’ll use Avahi for that setup, if needed.
    My config ended up looking like this (printout from ./configure):

    Using libraries:
        LIBS =  -L$(top_srcdir)/libatalk
        CFLAGS = -I$(top_srcdir)/include -D_U_="__attribute__((unused))" -g -O2 -I$(top_srcdir)/sys
            LIBS   =  -lcrypto
            CFLAGS =  -I/usr/include/openssl
            LIBS   = -L/usr/local/lib -lgcrypt -lgpg-error
            CFLAGS = -I/usr/local/include
            LIBS   =  -L/usr/local/lib -ldb-4.6
            CFLAGS =  -I/usr/local/include/db46
    Configure summary:
        Install style:
             AFP 3.x calls activated: 
             Extended Attributes: ad | sys
             backends:  dbd last tdb
             DHX     ()
             DHX2    ()
             RANDNUM ()
             passwd  ()
             DDP (AppleTalk) support: no
             CUPS support:            no
             SLP support:             no
             Zeroconf support:        no
             tcp wrapper support:     yes
             quota support:           no
             admin group support:     yes
             valid shell check:       yes
             cracklib support:        no
             dropbox kludge:          no
             force volume uid/gid:    no
             Apple 2 boot support:    no
             ACL support:             no

    The lack of CUPS and ACLs should be tolerable, since this is just going to
    be used for Time Machine (I use Samba for everything else). Note that
    initially I did leave ACL support to autodetect; it was enabled, but that led
    to compilation errors.

    Before you make, if you’re using FreeBSD like me you’ll need to
    fix some compilation errors. I’m sure the ports folks will fix this in due
    time, but as I’d rather not wait…

    First, at.h:

    --- sys/netatalk/at.h.orig 2011-07-24 12:28:55.823029116 -0400
    +++ sys/netatalk/at.h 2011-07-24 12:29:40.522913740 -0400
    @@ -24,6 +24,14 @@
    #include /* so that we can deal with sun’s s_net #define */

    +typedef unsigned char u_char;
    +typedef unsigned short u_short;
    +typedef unsigned int u_int;
    +typedef unsigned long u_long;
    #ifdef MACOSX_SERVER
    #endif /* MACOSX_SERVER */

    Then cnid_metad.c:

    --- etc/cnid_dbd/cnid_metad.c.orig 2011-07-24 12:48:52.140103389 -0400
    +++ etc/cnid_dbd/cnid_metad.c 2011-07-24 12:49:21.195654454 -0400
    @@ -45,6 +45,7 @@
    +#define _XPG4_2 1

    make, make install, and move on. Be warned: since
    this install comes from source, there likely won’t be an init.d
    or rc.d script to start up daemons. A usable FreeBSD template is
    below (based off the most current port, as of this writing).

    # $FreeBSD: ports/net/netatalk/files/,v 1.3 2010/03/27 00:13:49 dougb Exp $
    # PROVIDE: atalkd papd cnid_metad timelord afpd
    # KEYWORD: shutdown
    # AppleTalk daemons. Make sure not to start atalkd in the background:
    # its data structures must have time to stabilize before running the
    # other processes.
    # Set defaults. Please override these in /usr/local/etc/netatalk.conf
    ATALK_NAME="`/bin/hostname -s`"
    # Load user config
    if [ -f /usr/local/etc/netatalk/netatalk.conf ]; then . /usr/local/etc/netatalk/netatalk.conf; fi
    . /etc/rc.subr
    hostname=`hostname -s`
    netatalk_start() {
        checkyesno atalkd_enable && /usr/local/sbin/atalkd
        checkyesno atalkd_enable && \
            /usr/local/bin/nbprgstr -p 4 "${ATALK_NAME}:Workstation${ATALK_ZONE}" &
        checkyesno atalkd_enable && \
            /usr/local/bin/nbprgstr -p 4 "${ATALK_NAME}:netatalk${ATALK_ZONE}" &
        checkyesno papd_enable && /usr/local/sbin/papd
        checkyesno cnid_metad_enable && /usr/local/sbin/cnid_metad
        checkyesno timelord_enable && /usr/local/sbin/timelord
        checkyesno afpd_enable && \
            /usr/local/sbin/afpd -n "${ATALK_NAME}${ATALK_ZONE}" \
                    -s /usr/local/etc/netatalk/AppleVolumes.system \
                    -f /usr/local/etc/netatalk/AppleVolumes.default \
                    -g ${AFPD_GUEST} \
                    -c ${AFPD_MAX_CLIENTS}
    }
    netatalk_stop() {
        checkyesno timelord_enable && killall timelord
        checkyesno afpd_enable && killall afpd
        checkyesno cnid_metad_enable && killall cnid_metad
        checkyesno papd_enable && killall papd
        checkyesno atalkd_enable && killall atalkd
    }
    name="netatalk"
    start_cmd=netatalk_start
    stop_cmd=netatalk_stop
    load_rc_config ${name}
    run_rc_command "$1"
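With the script installed under /usr/local/etc/rc.d, the daemons are switched on per-variable. A minimal sketch, assuming the variable names from the template above and that only Time Machine service is wanted:

```shell
# Enable just the daemons Time Machine needs; leave the rest off.
cat >> /usr/local/etc/netatalk/netatalk.conf <<'EOF'
atalkd_enable="NO"
papd_enable="NO"
timelord_enable="NO"
cnid_metad_enable="YES"
afpd_enable="YES"
EOF

# Start everything that's enabled (script name assumed):
/usr/local/etc/rc.d/netatalk start
```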


    A few extra options are needed, both for each mount and for the server itself.

    Here are the relevant (non-comment) bits at the end of AppleVolumes.default. Use your own paths and logins as appropriate.

    # The line below sets some DEFAULT, starting with Netatalk 2.1.
    :DEFAULT: options:upriv,usedots
    # The "~" below indicates that Home directories are visible by default.
    # If you do not wish to have people accessing their Home directories,
    # please put a pound sign in front of the tilde or delete it.
    /tm/laptop "Laptop Backup" allow:laptop_login cnidscheme:dbd options:usedots,upriv,tm
    /tm/desktop "Desktop Backup" allow:desktop_login cnidscheme:dbd options:usedots,upriv,tm
    # End of File

    And here’s the relevant pieces from afpd.conf. Obviously, use
    your own server name and IP.

    # default:
    # - -tcp -noddp -uamlist, -nosavepassword
    SERVER -tcp -ipaddr -noddp -uamlist,, -nosavepassword


    Avahi is relatively unchanged. If you were using Avahi before Lion, it should work the same. I think.

    File System Bits

    Oddly enough, it looks like the file is no longer required.

    Client Configuration

    I'm still using the preference for an unsupported time machine volume. Run the following on the client:

    defaults write TMShowUnsupportedNetworkVolumes 1

    If you aren’t dealing with a recently-upgraded client and pre-existing backups,
    you may want to read the original notes on setting up sparsebundles on the
    client here.


    None so far, but then, I’m still in the middle of my first Time Machine backup
    under Lion. Things largely seem to work, though. Expect to spend some
    non-trivial time on the first backup, to re-index any pre-existing dumps, but
    then Time Machine appears to just do its thing normally.