View Issue Details

IDProjectCategoryView StatusLast Update
0006082Talerotherpublic2020-03-31 16:04
ReporterFlorian Dold Assigned ToChristian Grothoff  
PrioritynormalSeverityminorReproducibilityhave not tried
Status closedResolutionfixed 
Product Versiongit (master) 
Target Version0.7.0Fixed in Version0.7.0 
Summary0006082: run (integration)tests inside a container
DescriptionCurrently we run into problems when testing components, as tests get skipped when the right database isn't available, or even a fixed port not open. This is especially bad for automated testing on taler.net, as it means we pretty much can't reliably run test cases for the exchange at the same time even on two different accounts.

To solve this problem, we should build and test the components in a (docker) container.
TagsNo tags attached.

Relationships

related to 0004439 closedChristian Grothoff add a buildbot build for taler on a fresh debian container 

Activities

Christian Grothoff

2020-02-08 18:01

manager   ~0015320

Last edited: 2020-02-10 00:33

I disagree partially. I think we should not use Docker for this. What we should use is Linux network namespaces:

# NAME=test # (or whatever we like to call it)
# ip netns add $NAME
# ip netns exec $NAME ip link set lo up
# ip netns exec $NAME sudo -u $USER $COMMAND

and voila, port conflicts are GONE -- while running using the same hard disk / OS / installation and without the need for any containers!
This minimal setup doesn't add an interface for Internet access. So we likely want to bridge the network namespace to the host, like this:

# ip link add $A type veth peer name $B
# ip link set $B netns $NAME
# ip link set $A up
# ip addr add 10.42.42.1/24 dev $A
# ip netns exec $NAME ip addr add 10.42.42.2/24 dev $B
# ip netns exec $NAME ip route add default via 10.42.42.1

Only bit missing now is NAT-style forwarding in the underlying host to route the 10.42.42.* traffic to the Internet, but that's "well-known".

Now, this overall approach still needs 'root', but I suspect we can simply do the above as systemd when we launch the respective buildslave!

Details at https://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/

Christian Grothoff

2020-02-10 00:39

manager   ~0015328

One more note: we might want to use a virtual bridge to simplify the setup when we run 'many' virtual networks.

In that case, the setup changes slightly (see also: https://www.opencloudblog.com/?p=66):

# ip netns add ns1
# ip netns add ns2
# BRIDGE=br-test
# brctl addbr $BRIDGE
# brctl stp $BRIDGE off
# ip link set dev $BRIDGE up

Connect bridge to host:
# ip link add tap0 type veth peer name br-tap0
# brctl addif br-test br-tap0
# ip link set dev tap0 up
# ip link set dev br-tap0 up
(now configure IP address ending with *.*.*.1/24 on tap0)

Connect bridge to each of the namespaces:
# ip link add tap1 type veth peer name br-tap1
# brctl addif br-test br-tap1
# ip link set tap1 netns ns1
# ip netns exec ns1 ip link set dev tap1 up
# ip link set dev br-tap1 up
(configure *.*.*.2/24 on tap1 inside of ns1)

# ip link add tap2 type veth peer name br-tap2
# brctl addif br-test br-tap2
# ip link set tap2 netns ns2
# ip netns exec ns2 ip link set dev tap2 up
# ip link set dev br-tap2 up
(configure *.*.*.2/24 on tap2 inside of ns2)

Florian Dold

2020-02-10 01:33

manager   ~0015329

Last edited: 2020-02-10 01:34

You are suggesting to run each buildbot worker process in a separate long-term light-weight container that only isolates the network.

I'm worried about a few things with this approach:
 * What about lingering processes that a failed test run leaves behind (think uwsgi)? Systemd can use (process-)namespaces to take care of them, but the whole BB worker running in a long-lived "container" in your approach wouldn't deal nicely with those.
 * The former problem actually makes it possible to re-introduce port clashes. Version A of the exchange horribly goes wrong, leaving a port open, now version B will skip some tests, that's exactly what we wanted to avoid in the first place
 * What about the database? According to "man 7 network_namespaces", unix domain sockets are also isolated, so we can't use socket auth anymore. IMHO the integration test must also run its own database anyway, this would improve the situation, but that might require some extra effort in the test cases.

Overall, at least with what you're currently suggesting, I'm not sure if the complexity justifies what we actually get out of it (as port clashes still can and will happen).

Christian Grothoff

2020-02-10 13:58

manager   ~0015332

* Lingering processes: kill -1 and similar methods should work fine. The BB worker can just create a process group for the job.
* Database: I am indeed thinking of just running a separate postgres instance per worker.

Complexity: I think this is a massive complexity reduction compared to setting up containers/Dockers where we'd have to maintain an entire additional OS (updating, security updates, build dependencies, etc.).

Florian Dold

2020-02-10 14:29

manager   ~0015333

Regarding complexity: I'm not saying this is the wrong approach, I actually like the idea of light-weight containers.

Regarding the database: I don't think we should have a separate database per worker, but per *test run*. Creating a new DB is sufficiently fast, especially compared to compilation time.

I still don't like that we have one container that runs the BB worker. Main reason: It's hard to run the whole "test suite" manually without the BB worker.

I have an alternate approach based on your idea: The worker itself should "containerize" the integration test on each run. There are two approaches that get around the root limitation:
 * Either: enable unprivileged user namespaces (via kernel.unprivileged_userns_clone). I think by now these are considered "safe".
 * Or: If we don't trust user namespaces yet, then we can have a (hopefully rather simple!) setuid script that runs a process in a network-isolated container. This script needs to be well-audited to ensure that permissions are dropped correctly before we execute the user-defined process to be executed.

This makes testing/debugging inside the container trivial, we we can do "taler-testing-isolate /bin/bash".

How taler-testing-isolate is implemented depends on which of the above choices we take, either via userns or via a setuid binary.

Christian Grothoff

2020-02-10 14:46

manager   ~0015334

Oh, of course I meant creating the DB per-worker, just the Postgres "instance" would be run under the same UID as the worker, hence 'per worker'. And I disagree that it would be hard to run the test suite without a worker -- you just then have to watch for port conflicts. That's OK, as when I manually run test suites on a dev system, it's easy to make sure I don't run conflicting processes.

That said, unpriviledged_userns_clone might be reasonable. However, it doesn't work on my system: even with the option, "ip netns exec $NAME bash" fails. So I'm not sure it is a real option here. Also, even if we used clone() directly to make this work with this option, I suspect we would still have to do the basic network setup with root rights.

Doing this via a SUID script would be fine, except that Linux ignores SUID on scripts (https://unix.stackexchange.com/questions/364/allow-setuid-on-shell-scripts) and patching the kernel to enable SUID on scripts is not really an option either. That's why I was thinking of doing the setup inside of the systemd scripting (but having had bad experiences with systemd error reporting, I would indeed prefer a separate wrapper). Now, maybe we should write some simple C program that system()-calls the right sequence of shell instructions and then exec's into the container, waitpid()s and then does the clean up? Should be doable in 0000014:0000020 lines of C, and that would avoid both the SUID restriction on shell scripts *and* the need launch via systemd...

Florian Dold

2020-02-10 15:37

reporter   ~0015335

After enabling user namespaces, did you actually enter a new user namespace? You *first* need to become uid=0 within a user namespace to create/enter a network namespace, AFAIK. (So you'll be root inside the user namespace, but of course not outside of it).

So try "unshare -fUr /bin/bash" first. Then I'm able to also enter other types namespaces without ever being the "real" root.

An alternative to the setuid wrapper would also be to allow the script to be called via sudo. It's already installed, no need to write custom C code.

Christian Grothoff

2020-02-10 16:31

manager   ~0015336

On killing processes via process groups:
https://www.webhostinghero.com/how-to-create-a-process-group-in-linux/

Aside from that, Florian and I agreed that we should just have the buildslave do sudo on a shell script (which must be in sudoers) and have that shell script create the network namespace and then start a process group from the shell script and then inside that process group run the logic (i.e. test suite).

Christian Grothoff

2020-02-16 15:30

manager   ~0015357

I've now written two shell scripts (/usr/local/bin/netjail*.sh). netjail-init.sh sets up the virtual bridge and NAT. netjail.sh can be used to start a network jail. sudoers on tripwire has been adjusted with /etc/sudoers.d/buildslave-netjail to allow certain users (list to be extended there) to
$ sudo /usr/local/bin/netjail.sh $NUM $CMD

Example:
========

grothoff@gv:~$ sudo /usr/local/bin/netjail.sh 2 /bin/bash
#$ sudo /usr/local/bin/netjail.sh 2 /bin/bash
grothoff@gv:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
13: tap2@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:bc:03:ae:b1:da brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.42.2/24 scope global tap2
       valid_lft forever preferred_lft forever
    inet6 fe80::90bc:3ff:feae:b1da/64 scope link
       valid_lft forever preferred_lft forever
grothoff@gv:~$

So this works. Next stop: using setsid to clean up.

Christian Grothoff

2020-02-16 16:44

manager   ~0015358

$ /usr/local/bin/buildjob.sh $TIMEOUT $NUM $CMD

now works (if you are in the right sudoer file) to run $CMD for $TIMEOUT seconds in the virtual network $NUM.

Note that $CMD must not be an interactive shell (so no passing /bin/bash) as intractive bash doesn't like being detached from the tty which is a side-effect of the setsid() that is being used here.

Examples:

# Check network setup inside of build job:

$ /usr/local/bin/buildjob.sh 30 4 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
45: tap4@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:d6:c6:0f:43:9e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.42.4/24 scope global tap4
       valid_lft forever preferred_lft forever
    inet6 fe80::88d6:c6ff:fe0f:439e/64 scope link tentative
       valid_lft forever preferred_lft forever

# Probe network inside of jail, and use timeout of 2s to die:
$ /usr/local/bin/buildjob.sh 2 4 ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=53 time=3.99 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=53 time=3.83 ms
Killed
grothoff@gv:~$ Killed

Florian Dold

2020-02-16 16:49

reporter   ~0015359

Why don't we instead us a PID namespace instead of this setsid hackery that breaks shells? This guarantees that all processes spawned directly and indirectly by netjail.sh will be reliably killed when netjail.sh is killed.

Thus it's more reliable *and* we can still use it to start a shell in the jail.

Christian Grothoff

2020-02-16 17:02

manager   ~0015360

I honestly don't care too much either way. If you want to change it to use a process namespace, go ahead. This works for now, and as long as the interface (buildjob.sh) doesn't change, I don't really care.

Christian Grothoff

2020-02-16 17:03

manager   ~0015361

Oh, but I should say this: process namespaces seem more dangerous: setsid requires NO extra permissions, no sudo, and no disabling Debian security options in the kernel. So that's my main reason to go with them.

Florian Dold

2020-02-16 17:15

reporter   ~0015362

Are we versioning this program anywhere? I guess it would be a good idea to have a copy of it in deployment.git.

Also, the script currently has two critical vulnerabilities:

(posting this publicly since currently the sudoers are still pretty restricted.)

1. you are passing user input to "sudo" without the "--" that terminates the argument list. since sudo is executed as root, I can pass things that I shouldn't be allowed to. As a practical example, I could become a member of a group I'm not allowed to become.

2. Due to missing quoting of $NAME in "ip netns exec $NAME sudo [...]", I could currently execute arbitrary programs.

Florian Dold

2020-02-16 17:21

reporter   ~0015363

And no, process namespaces do not require disabling security options in the kernel. The only thing that requires disabling a Debian security option in the kernel is creating a UID namespace as an unprivileged user, which we are not doing.

(Coincidentally, had we relied on unprivileged user namespaces, the two vulnerabilities I found above would not have had such an impact. We would only have a problem when somebody discovers a critical kernel bug.)

Florian Dold

2020-02-16 17:43

reporter   ~0015364

Oh this is fun, there is a third problem: Using logname. It allows privilege escalation in certain situations:

I log in as root, then do "su - buildbot". But my logname will *not* change in this situation, it will still be "root". So the netjail.sh will happily execute what I pass to it as root, even though my current effective and real user ID is *not* root, but buildbot.

So instead of calling logname, we should use $SUDO_USER, which is set by sudo correctly, i.e., to the real (instead of the effective or something else) user ID after sudo is called as a setuid binary.

Christian Grothoff

2020-02-16 18:55

manager   ~0015365

sudo: agree. $NAME not quoted is before sudo, so hardly actually problematic. logname: well, if you already are root ... But sure, we should change it to $SUDO_USER. Can you make these changes?

Florian Dold

2020-02-16 20:06

reporter   ~0015367

Done, I've also committed a copy to deployment.git, together with some other changes:

 * We now run the target process in a separate PID namespace, so clean exit is always guaranteed
 * Cleanup is done *before* the jailed program is started, as otherwise we could end up in a situation where the netjail is not destroyed properly and can't be created again without manual intervention via root. Also allows us to directly exec the target process instead of forking.

BTW, $NAME not quoted was definitely a problem, because the whole script is running as root already via the "outer" sudo.

Same for the $SUDO_USER thing. Doing "su - $FOO" as root should never give the user $FOO the possibility to run code as root. Otherwise I can just put something into the .bashrc of user $FOO and completely take over the machine the next time root does "su - $FOO".

Christian Grothoff

2020-02-16 21:34

manager   ~0015368

I deliberately didn't put the scripts into deployment.git, because at least with 'sudo' this means that someone without root could change deployment.git and then introduce security issues.

However, if your new solution is truly sudo-less to the the separate unprivileged PID namespace, then we should drop the sudo permissions and can keep everything in deployment.git. However, at least the netjail-init.sh seems to be definitively needing root, so that cannot live in deployment.git IMO.

I also don't like you cleaning up a *previous* run like this, because that means the netns will stay around until the script is run the next time. As a result, we'll likely have dozens of 'dead' netns namespace zombies in the system. I don't see the 'exec' as a real advantage here, and would rather have a script like mine that is written to reliably clean up after itself.

Similarly, your current script introduces a nice failure mode: if two users use the same $N at the same time, the second one kill the existing netns and then launch a new one. My script ensured that IF there is a conflicting use of $N, then the second user fails and does NOT disrupt the existing use of $N.

Florian Dold

2020-02-17 01:18

reporter   ~0015369

We should still put the scripts in deployment.git, even if we need to manually copy them and adjust permissions!

Regarding sudo: Since the network jail needs to create a pair of interfaces on the *host*, it can't actually run without root permissions. We could only do it without sudo for containers that have their own *completely* isolated network stack.

However, I've fixed the cleanup problem in an IMHO rather elegant way: Before we execute the target process, we delete *name* of the network namespace. What's left over is an anonymous namespace that will be automatically cleaned up by the kernel when the last process that references it exits. It's easy to verify using "lsns" that this works.

Thus, when our target executable exits or is killed in whatever way, the namespace is now automatically "deleted", including the veth devices connected to the bridge. This will also automatically delete the host's interface of the veth pair. No more device pollution, no matter in what way the process dies!

Also, there is not really any reason to assign different IP addresses inside each network jail, as we're behind NAT. I've kept them constant, and used a short UID for the device names. Thus now we do just

  $ sudo netjail.sh /bin/bash

without any "N", as it's simply not needed, since we can now have an arbitrary number of netjails without worrying about cleanup or clashes in $N.

Christian Grothoff

2020-02-17 01:28

manager   ~0015370

Not having read the code, I'm a tiny bit confused: how do you put the child into the anonymous netns after having deleted it? (Didn't read the code). Also, the NAT setup does not work: try doing a ping from two jails. The moment you start the second one, the network of the first one breaks now. Basically, the kernel's NAT doesn't remember the netns when routing replies back, and the virtual bridge actually acts as a switch and learns the second interface and stops forwarding to the first. So we do need $N (you could possibly try to probe for an available $N by ping'ing and then automatically finding a value that is available).

Christian Grothoff

2020-02-17 01:30

manager   ~0015371

Ah, I see now. I didn't know you could drop a namespace from within it.

Anyway, the $N hack still breaks the network.

Florian Dold

2020-02-17 13:33

reporter   ~0015372

Ah. I was somehow under the wrong impression that the netns would be considered for NAT. Oops.

As more of a "learning exercise" for myself, I wanted to see if I can make it work without static IPs, so I tried setting it up with DHCP. But it's actually so simple that I don't seen why we should not use it. The current setup is:

1. host runs a dnsmasq (as a systemd unit) that only serves DHCP, required configuration is pretty much just "--dhcp-range=tun0,10.42.42.1,10.42.42.200,2h"
2. we set up everything like before, skip static IP config
3. inside the netns container, we run dhclient (daemonized) and then the target executable
4. when the target executable finishes, we use dhclient to release the lease "nicely"

(If anything is killed during this, processes will be cleaned up as they're in a PID ns, and the lease will expire eventually.)

You can try it out as "sudo netjail.sh /bin/bash", currently it prints every command it runs for debugging.

The parallel ping from two netjails now works for me, I would consider this "resolved" unless you have any concerns.

Christian Grothoff

2020-02-17 13:56

manager   ~0015373

Well, I must admit I didn't even think about DHCP in this context. Feels a bit like overkill, but the result should be convenient and avoid pitfalls, so sure, why not. My only concern is that you bound the DHCP to 0.0.0.0, thereby exposing the service 'to the world'. We should bind it to 10.42.42.1 only, i.e. by setting

bind-interfaces=tap0

Also, you did not commit the dnsmasq configuration files to the Git, is there a reason for that?

Florian Dold

2020-02-17 14:13

reporter   ~0015374

Files are committed now. Good point regarding bind-interfaces. We're only listening on tun0 now.

Christian Grothoff

2020-02-17 16:31

manager   ~0015375

Sorry, not "resolved", as we do not use this yet. I still have to adopt the buildbots...

Christian Grothoff

2020-02-24 17:22

manager   ~0015402

The 'integrationtest' user now runs 'make check' (for libmicrohttpd, parts of GNUnet, twister, exchange, merchant and wallet) in its own netjail.

Issue History

Date Modified Username Field Change
2020-02-06 15:49 Florian Dold New Issue
2020-02-08 18:01 Christian Grothoff Note Added: 0015320
2020-02-08 18:01 Christian Grothoff Assigned To => Florian Dold
2020-02-08 18:01 Christian Grothoff Status new => feedback
2020-02-10 00:33 Christian Grothoff Note Edited: 0015320
2020-02-10 00:39 Christian Grothoff Note Added: 0015328
2020-02-10 01:33 Florian Dold Note Added: 0015329
2020-02-10 01:34 Florian Dold Note Edited: 0015329
2020-02-10 12:22 Florian Dold Assigned To Florian Dold => Christian Grothoff
2020-02-10 13:58 Christian Grothoff Note Added: 0015332
2020-02-10 14:29 Florian Dold Note Added: 0015333
2020-02-10 14:29 Florian Dold Status feedback => assigned
2020-02-10 14:46 Christian Grothoff Note Added: 0015334
2020-02-10 15:37 Florian Dold Note Added: 0015335
2020-02-10 16:31 Christian Grothoff Note Added: 0015336
2020-02-16 15:30 Christian Grothoff Note Added: 0015357
2020-02-16 16:44 Christian Grothoff Note Added: 0015358
2020-02-16 16:49 Florian Dold Note Added: 0015359
2020-02-16 17:02 Christian Grothoff Note Added: 0015360
2020-02-16 17:03 Christian Grothoff Note Added: 0015361
2020-02-16 17:15 Florian Dold Note Added: 0015362
2020-02-16 17:21 Florian Dold Note Added: 0015363
2020-02-16 17:43 Florian Dold Note Added: 0015364
2020-02-16 18:55 Christian Grothoff Note Added: 0015365
2020-02-16 20:06 Florian Dold Status assigned => feedback
2020-02-16 20:06 Florian Dold Note Added: 0015367
2020-02-16 21:34 Christian Grothoff Note Added: 0015368
2020-02-17 01:18 Florian Dold Note Added: 0015369
2020-02-17 01:18 Florian Dold Status feedback => assigned
2020-02-17 01:28 Christian Grothoff Note Added: 0015370
2020-02-17 01:30 Christian Grothoff Note Added: 0015371
2020-02-17 13:33 Florian Dold Note Added: 0015372
2020-02-17 13:56 Christian Grothoff Note Added: 0015373
2020-02-17 14:13 Florian Dold Note Added: 0015374
2020-02-17 15:46 Florian Dold Status assigned => resolved
2020-02-17 15:46 Florian Dold Resolution open => fixed
2020-02-17 16:31 Christian Grothoff Status resolved => assigned
2020-02-17 16:31 Christian Grothoff Note Added: 0015375
2020-02-23 22:16 Christian Grothoff Relationship added related to 0004439
2020-02-24 17:22 Christian Grothoff Status assigned => resolved
2020-02-24 17:22 Christian Grothoff Note Added: 0015402
2020-02-24 17:22 Christian Grothoff Product Version => git (master)
2020-02-24 17:22 Christian Grothoff Fixed in Version => 0.7.0
2020-02-24 17:22 Christian Grothoff Target Version => 0.7.0
2020-03-31 16:04 Christian Grothoff Status resolved => closed