Update 03/08/2015: There is now an official Weave guide to using Mesos and Marathon. Many things have changed with the Weave 1.0 release, and using the Docker bridge is now discouraged in favour of the proxy and the IP allocator.
Disclaimer: The following article is based on my own experiments with Weave and Docker. I am not associated with Weaveworks and this is not an official guide. If you get stuck, please contact the super nice and helpful weave guys at firstname.lastname@example.org or ask in the #weavenetwork channel on Freenode.
Update 16/07/15: since writing this article Weave has evolved a lot. The setup described below is not recommended officially by Weaveworks. Please read this blog post by Weaveworks for more information.
Update 02/03/15: avoid weave expose and assign IP directly to avoid a loop in the routing of packets (see what changed).
Update 25/03/15: use 10.2.0.0/16 addresses to match Weave's documentation and to avoid IP clashes in many environments.
In the Docker world one always feels like being at the bleeding edge of development. Now that orchestration is slowly settling with tools like Mesos, fleet or Kubernetes, the next big topic is networking.
Networking – the next big thing in the Docker world
Docker started out with a simple one-host networking solution. Unfortunately, this does not really scale to multiple hosts. A number of projects attack this problem, to name a few: socketplane, weave, pipework, flannel.
What they have in common is that they try to get rid of the port management which one usually has to do when running multiple apps in a Docker environment: if you have two webservers, you cannot reuse port 80 on the same host. Hence, in many former setups (like with Mesos Marathon) one had to choose (or let the orchestrator choose) different host ports which were mapped to port 80 of the webserver inside of Docker containers.
Now, with those new networking solutions each container gets its own IP address. Consequently, every webserver inside of a container can listen on port 80, and other services can reach each of those webservers via port 80 and the correct IP.
Towards container IPs with Mesos Marathon
In the following I will describe the way to use Weave to assign each Mesos Marathon container its own IP address. All Mesos tasks will live in a Weave overlay network 10.2.0.0/16 which is shared among all hosts running the Mesos slaves.
With Weave it's possible to define different networks for different apps. For the sake of simplicity we will not implement any network separation here. Instead we choose one overlay network for all tasks. By doing this we get IP management for free from the Docker daemon. You will read in a minute how.
With Weave one usually runs containers using the weave run wrapper script which runs docker run and then attaches the weave bridge to the container. We will go another route and use the weave bridge directly as Docker bridge, i.e. as replacement of the docker0 bridge. This way we will get rid of the problems around this wrapper hack (i.e. that the container process will start before the weave bridge is attached). In the future, with the integration of network plugins into Docker, this wrapper will go away. But in the current state using the wrapper is not an option.
0. Setting up weave
Using Weave is pretty simple:
- download the weave shell script:
$ sudo wget -O /usr/local/bin/weave \
    https://github.com/zettio/weave/releases/download/latest_release/weave
$ sudo chmod a+x /usr/local/bin/weave
- start the router container:
$ weave launch
- put the host into the 10.2.0.0/16 network:
$ weave expose 10.2.0.1/16
$ ifconfig weave
weave     Link encap:Ethernet  HWaddr 7a:d2:8d:b1:26:7b
          inet addr:10.2.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:65535  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
$ brctl show
bridge name     bridge id               STP enabled     interfaces
weave           8000.7ad28db1267b       no
- start a container:
$ weave run 10.2.0.2/16 -itd sttts/python-ubuntu python -m SimpleHTTPServer 80
$ curl 10.2.0.2
1. Patching the weave shell script
We want to use the weave bridge directly as Docker bridge (using the -b weave parameter for the Docker daemon). The weave script creates the weave bridge on launch. But in order to run weave launch Docker must be running already. Here, the snake tries to bite its tail.
It is easy to patch the weave script such that the weave bridge can be created without a running Docker. A pull request exists to add this to weave: https://github.com/zettio/weave/pull/357. Until this is merged a patched version is available here: https://raw.githubusercontent.com/sttts/weave/888edd69f3344006dac7865a0afe567ebfaa5967/weave
With this in place in /usr/local/bin/weave we can create the bridge. Recent versions of weave already include a create-bridge command which does exactly what we want:
$ weave create-bridge
Then we assign an IP to the weave bridge for the local host. This can be done with the weave expose command. However, weave expose requires a running Docker daemon. A pull request exists to remove this requirement from weave: https://github.com/zettio/weave/pull/377. Until this is merged a patched version is available here: https://github.com/sttts/weave/blob/d77e3a4c226ab7803a947927a61da084de5162ca/weave.
With this in place in /usr/local/bin/weave we assign the IP:
$ ip addr add dev weave 10.2.0.1/16
Update: In a previous version of this article, weave expose was used to assign the IP. As it turned out in this issue in weave's GitHub project, this can lead to loops. This is because weave expose adds additional iptables masquerading rules which can cause the tunneled data packets to be routed again via the weave interface. Hence, in the command above the IP address is assigned directly.
2. Configuring Docker
Then we change the Docker daemon parameters accordingly. On Ubuntu this is done in
We have also set the fixed-cidr value, which defines where the containers of this host will live. The given network is a subnet of 10.2.0.0/16 which must be exclusive to this host. In a bigger Mesos cluster you have to configure disjoint subnets, of course.
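As a minimal sketch of how disjoint per-host subnets could be chosen, the helper below derives a /24 inside 10.2.0.0/16 from a host index and prints matching daemon options. The index scheme, the /24 size and both function names are assumptions made for illustration; only the -b weave parameter and the fixed-cidr setting come from the text above.

```shell
# Sketch only: map a host index (1..255) to a host-exclusive /24 subnet.
# Assumption: 10.2.0.x stays reserved for the hosts' own weave bridge IPs,
# so container subnets start at 10.2.1.0/24.
fixed_cidr_for_host() (
    index=$1
    case $index in
        ''|*[!0-9]*) echo "host index must be a number" >&2; exit 1 ;;
    esac
    if [ "$index" -lt 1 ] || [ "$index" -gt 255 ]; then
        echo "host index must be between 1 and 255" >&2
        exit 1
    fi
    echo "10.2.${index}.0/24"
)

# Print the Docker daemon options for the given host index.
docker_bridge_opts() (
    cidr=$(fixed_cidr_for_host "$1") || exit 1
    echo "-b weave --fixed-cidr=${cidr}"
)

# Example: options for the first host
docker_bridge_opts 1
```

Deriving the subnet from a single number keeps the ranges disjoint as long as every host gets a different index, for example from an inventory file.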
Now start the Docker daemon:
$ start docker
New containers will get a new IP address in the fixed-cidr address range. Notice that we offload the IP management for every host to Docker. Because the subnets above are selected properly we get cluster-wide unique IP addresses. Fantastic!
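Whether a given container address really lies inside a host's range can be checked with a little integer arithmetic. The following is a rough sketch, not part of the original setup: ip_to_int and ip_in_cidr are hypothetical helper names, and the example range 10.2.1.0/24 is an assumed fixed-cidr value for one host.

```shell
# Convert a dotted-quad IPv4 address into a single integer.
ip_to_int() (
    IFS=.
    set -- $1          # split the address into its four octets
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if the address ($1) lies inside the CIDR range ($2).
ip_in_cidr() (
    net=${2%/*}        # network part, e.g. 10.2.1.0
    bits=${2#*/}       # prefix length, e.g. 24
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$net")
    [ $(( $a & $mask )) -eq $(( $b & $mask )) ]
)

# Example: is the container address inside this host's (assumed) range?
if ip_in_cidr 10.2.1.1 10.2.1.0/24; then
    echo "10.2.1.1 is inside 10.2.1.0/24"
fi
```

The same function can confirm that a host range such as 10.2.1.0/24 is itself part of the overlay network by testing its network address against 10.2.0.0/16.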
$ docker run -it ubuntu /bin/bash
root@3ce56d73fc18:/# ping 10.2.0.1
PING 10.2.0.1 (10.2.0.1) 56(84) bytes of data.
64 bytes from 10.2.0.1: icmp_seq=1 ttl=64 time=0.152 ms
3. Automatic startup of the weave bridge
Everything done above can easily be scripted using Debian/Ubuntu's interfaces definitions in
auto weave
iface weave inet manual
    pre-up /usr/local/bin/weave create-bridge
    post-up ip addr add dev weave 10.2.0.1/16
    pre-down ifconfig weave down
    post-down brctl delbr weave
Update: compare the update remark in section 1.
We can use ifup and ifdown to bring the bridge up and down:
$ ifup weave
$ ifdown weave
As the interface is defined as auto it will come up by itself after a reboot. Because /etc/init/docker.conf defines a dependency on the network interfaces, the Docker daemon will properly wait for the weave bridge as well.
4. Launching the router
To connect multiple hosts in one giant 10.2.0.0/16 network we have to start the weave router container:
$ weave launch
Then you can connect multiple hosts via e.g.
$ weave connect 192.168.0.42
Note the following:
- All containers and hosts can "see" each other via the weave network.
- All ports that container processes listen to on the weave network interface will be accessible by all containers.
- Those ports are not exposed outside of the weave network. This can be done by exposing them by means of Docker, e.g.
  docker run -itd -p 12345:80 sttts/python-ubuntu python -m SimpleHTTPServer 80
  The webserver will then be accessible via the host address, e.g. 192.168.0.41:12345 if the latter is the host interface IP. Via 10.2.1.1:80 it can even be accessed on every node in the cluster because we assigned a host IP 10.2.0.1 to the weave interface.
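With more than two hosts, the weave connect call from this section has to be repeated for every peer. A small, purely illustrative helper could generate those commands; the function name and the third address 192.168.0.43 are made up for this sketch, and the helper only prints the commands rather than running them.

```shell
# Hypothetical helper: print one "weave connect" command per peer,
# skipping this host's own address (first argument).
weave_connect_commands() (
    self=$1
    shift
    for peer in "$@"; do
        if [ "$peer" != "$self" ]; then
            echo "weave connect $peer"
        fi
    done
)

# Example on host 192.168.0.41 with an assumed three-host cluster
weave_connect_commands 192.168.0.41 192.168.0.41 192.168.0.42 192.168.0.43
```

Printing instead of executing makes it easy to review the list first; piping the output to sh would run it.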
5. Service discovery with Consul and progrium/registrator
Both are pretty simple to get up and running. You start Consul as usual, e.g. in a one-host test setup like
$ consul agent -server -node=goedel -bootstrap -data-dir ./data -client=0.0.0.0
and then you can point the registrator container to Consul's address, e.g.
$ docker run -it -v /var/run/docker.sock:/tmp/docker.sock sttts/registrator consul://192.168.0.41:8500
Now, Consul's DNS server can be queried to find out about the running containers, e.g. the SimpleHTTPServer container from above:
$ dig @192.168.0.41 -p 8600 python-ubuntu.service.dc1.consul. ANY
; <<>> DiG 9.8.3-P1 <<>> @192.168.0.41 -p 8600 python-ubuntu.service.dc1.consul. ANY
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45546
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;python-ubuntu.service.dc1.consul. IN ANY
;; ANSWER SECTION:
python-ubuntu.service.dc1.consul. 0 IN A 192.168.0.41
;; Query time: 3 msec
;; SERVER: 192.168.0.41#8600(127.0.0.1)
;; WHEN: Thu Jan 22 17:30:09 2015
;; MSG SIZE rcvd: 146
As you can see, the problem is that Consul returns the public node IP, more precisely the IP advertised by the Consul agent. This is not what we want.
Luckily, in Consul's git master branch from 2 weeks ago a new feature of custom service addresses was merged: https://github.com/hashicorp/consul/pull/570. This is now available in the Consul 0.5.0rc1 pre-release.
But the very first sentence of this post strikes again. The feature was merged, but the golang API in the same repository (used by registrator) does not support the Address extension of the service definition yet.
But, as Consul is open source it is easy to make a pull request for this: https://github.com/hashicorp/consul/pull/627.
With the new address feature of Consul we can extend registrator with a few lines of code to support the -internal flag for Consul as well (it turned out that registrator already knows about internal container addresses, but due to the lack of support in Consul, this was only implemented for etcd). So, here is another pull request, this time against registrator: https://github.com/progrium/registrator/pull/91.
With this in place we can use the extended registrator and get proper, cluster-wide routable weave network addresses:
$ docker run -it -v /var/run/docker.sock:/tmp/docker.sock sttts/registrator -internal consul://192.168.0.41:8500
and then the DNS response is much better:
$ dig @192.168.0.41 -p 8600 python-ubuntu.service.dc1.consul. ANY
; <<>> DiG 9.8.3-P1 <<>> @192.168.0.41 -p 8600 python-ubuntu.service.dc1.consul. ANY
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41271
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;python-ubuntu.service.dc1.consul. IN ANY
;; ANSWER SECTION:
python-ubuntu.service.dc1.consul. 0 IN A 10.2.1.1
;; Query time: 0 msec
;; SERVER: 192.168.0.41#8600(127.0.0.1)
;; WHEN: Thu Jan 22 17:32:40 2015
;; MSG SIZE rcvd: 146
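For scripts that consume such answers, the addresses can be pulled out of the dig output with a short awk filter. This is only a sketch: a_records is a hypothetical function name, and it assumes the standard dig layout of name, TTL, class, type and address in the answer lines.

```shell
# Print the address of every A record found in dig-style output on stdin.
# Comment lines (starting with ";") such as the question section are skipped.
a_records() {
    awk '!/^;/ && $3 == "IN" && $4 == "A" { print $5 }'
}

# Example with a canned answer section instead of a live DNS query
printf '%s\n' \
    ';; ANSWER SECTION:' \
    'python-ubuntu.service.dc1.consul. 0 IN A 10.2.1.1' \
    ';; Query time: 0 msec' | a_records
```

Against a live Consul agent the same filter would be fed from the dig command shown above.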
6. Mesos and Marathon
Actually, there is hardly anything to be done for Mesos and Marathon support.
A mesos-slave with the docker containerizer calls docker run without any additional magic. Due to our work above, Docker automatically assigns a cluster-wide, unique IP to each container. The daemon inside the container listens on that IP. For communication between containers via the weave IP, ports do not even have to be exposed.
Mesos and Marathon health checks use the mesos-slave IP, because Mesos doesn't know about the internal container IPs (yet). This is certainly worth a pull request, but I haven't done that yet.
Instead, for now it works to expose the health check ports as usual via Marathon's app definition. In other words: we use internal IPs for inter-container communication via Consul service discovery and the port mapping for everything else.
Another nice thing would be to transform Marathon's app id (e.g. /web/customer1/site1) into a Consul service name. Then the app would be automatically available via site1.customer1.web.service.dc1.consul. This would either require setting the SERVICE_TAGS variables, which are then read by registrator, or progrium/registrator should understand the MARATHON_APP_ID variable which Marathon exports to the container environment. Either way would be nice. But for now we can work with explicit SERVICE_TAGS definitions within the Marathon app definitions.
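The name transformation itself is simple: the path components of the app id are reversed and joined with dots. A minimal sketch follows; the function name and the default datacenter dc1 are assumptions, and how such a dotted name would have to be registered in Consul (service name versus tags) is not covered here.

```shell
# Hypothetical helper: turn a Marathon app id such as /web/customer1/site1
# into the DNS name site1.customer1.web.service.<dc>.consul.
marathon_app_to_consul_name() (
    set -f                 # no filename globbing while splitting
    app_id=${1#/}          # strip the leading slash
    dc=${2:-dc1}           # datacenter, defaulting to dc1 as in the examples
    name=
    IFS=/
    for part in $app_id; do
        # prepend each component so the order ends up reversed
        if [ -z "$name" ]; then
            name=$part
        else
            name="$part.$name"
        fi
    done
    echo "$name.service.$dc.consul"
)

marathon_app_to_consul_name /web/customer1/site1
```

Inside a container the argument could come from the MARATHON_APP_ID environment variable mentioned above.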
We have a Mesos cluster running with cluster-wide unique IPs for each container. All this is completely transparent for Mesos and Marathon. Service discovery works with Consul and progrium/registrator (with one open pull request: https://github.com/hashicorp/consul/pull/627).
All this is pretty feasible with the current state of the Docker art and all its components. I am pretty sure it will not take long until container IPs are just the way to use Docker. The use of IP + standard-port makes the system engineering life so much easier because normal DNS lookups suddenly work again as they should (and always did before containers came up). No haproxys, no ambassadors, no SRV record logic in the apps.
Anybody who would like to see all this in a production-like setup should take a look at my Ansible code base for a Mesos cluster: https://github.com/sttts/compute-platform. It includes all the weave bits and pieces in a