Thursday, March 16, 2017

Is static linking the solution to all of our problems?

Almost all programming languages designed in the last couple of years have a strong emphasis on static linking. Their approach to dependencies is to build them all from source, compiled separately for each project. This provides many benefits, such as binaries that can be deployed everywhere and not needing to define or maintain a stable ABI in the language. Since everything is always recompiled and linked from scratch (apart from the standard library), ABI is not an issue.

The proponents of static linking often claim that shared libraries are unnecessary. Recompiling is fast and disks are big, so it makes more sense to link statically than to define and maintain an ABI for shared libraries, which is a whole lot of hard, thankless work.

To see if this is the case, let's do an approximation experiment.

Enter the Dost!

Let's assume a new programming language called Dost. This language is special in that it provides code that is just as performant as the equivalent C code and takes the same amount of space (which is no small feat). It has every feature anyone would ever need, does not require a garbage collector, and has a syntax that everyone loves. The only thing it does not do is dynamic linking. Let us further imagine that, by magic, all open source projects in the world get rewritten in Dost overnight. How will this affect a typical Linux distro?

Take for example the executables in /usr/bin. They are all implemented in Dost, and thus are linked statically. They are probably a bit larger than their original C versions which were linked dynamically. But by how much? How would we find out?

Science to the rescue

Getting a rough ballpark estimate is simple. Running ldd /usr/bin/executable gives a list of all libraries the given executable links against. If it were linked statically, the executable would have a duplicate copy of all these libraries. Put another way, each executable grows by the size of its dependencies. Then it is a matter of writing a script that goes through all the executables, looks up their dependencies, removes language standard libraries (libc, libstdc++, a few others) and adds up how much extra space these duplicated libraries would take.

The script to do this can be downloaded from this Github repo. Feel free to run it on your own machines to verify the results.
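
For illustration, here is a minimal sketch of the same idea as a small C++17 program. This is not the actual script from the repo, and the list of runtime libraries to skip is only illustrative:

// staticsize.cpp: rough estimate of the extra space static linking would need.
// Build with: g++ -std=c++17 -o staticsize staticsize.cpp
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <regex>
#include <set>
#include <string>

namespace fs = std::filesystem;

int main() {
    // Libraries a Dost-style toolchain would provide anyway, so they are not counted.
    const std::set<std::string> ignored{"libc.so.6", "libm.so.6",
                                        "libgcc_s.so.1", "libstdc++.so.6"};
    const std::regex dep_re(R"(=>\s+(\S+)\s+\()");
    uintmax_t total = 0;
    for (const auto &entry : fs::directory_iterator("/usr/bin")) {
        // Ask the dynamic linker which libraries this executable depends on.
        std::string cmd = "ldd " + entry.path().string() + " 2>/dev/null";
        std::string output;
        if (FILE *p = popen(cmd.c_str(), "r")) {
            char buf[4096];
            while (fgets(buf, sizeof(buf), p))
                output += buf;
            pclose(p);
        }
        // Each dependency would be duplicated inside this executable, so its
        // full size is added once per executable that links against it.
        for (std::sregex_iterator it(output.begin(), output.end(), dep_re), end;
             it != end; ++it) {
            fs::path lib((*it)[1].str());
            if (ignored.count(lib.filename().string()) == 0 && fs::exists(lib))
                total += fs::file_size(lib);
        }
    }
    std::cout << "Estimated extra space: " << total / (1024 * 1024) << " MB\n";
}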

Measurement results

Running that script on a Raspberry Pi running Raspbian, used for an IRC client and random compile tests, says that statically linked binaries would take an extra 4 gigabytes of space.

Yes, really.

Four gigabytes is more space than many people have on their Raspi SD card. Wasting all that on duplicates of the exact same data does not seem like the best use of those bits. The original shared libraries take only about 5% of this; static linking expands them 20-fold. Running the measurement script on a VirtualBox Ubuntu install says that on that machine the duplicates would take over 10 gigabytes. You can fit an entire Ubuntu install in that space. Twice. Even if this were not an issue for disk space, it would be catastrophic for instruction caches.

A counterargument people often make is that static linking is more efficient than dynamic linking because the linker can throw away those parts of the dependencies that are not used. Even if we assume that the linker does this perfectly, executables would on average need to use less than 5% of the code in their dependencies for static linking to take less space than dynamic linking, since a 20-fold expansion puts the break-even point at 1/20. This seems unlikely to be the case in practice.

In conclusion

Static linking is great for many use cases. These include embedded software, firmware and end user applications. If your use case is running a single application in a container or VM, static linking is a great solution that simplifies deployment and increases performance.

On the other hand, claiming that a systems programming language that does not provide a stable ABI and shared libraries can be used to build the entire userland of a Linux distribution is delusional.

Monday, March 13, 2017

Dependencies and unity builds with Meson

Prebuilt dependencies provided by the distro are awesome. You just install them and start working on your own project. Unfortunately there are many cases where distro packages are either not available or too old. This is especially common when compiling on non-Linux platforms such as Windows but happens on Linux as well when using jhbuild, Flatpak, Snappy or one of the many other aggregators. Dependencies obtained via Meson subprojects also fall into this category.

Unity builds

There is a surprisingly simple way of compiling projects faster: unity builds. The basic principle is that if your target has the files foo.cpp, bar.cpp and baz.cpp, you don't compile them separately. Instead you generate a file called target-unity.cpp, whose contents are this:

#include<foo.cpp>
#include<bar.cpp>
#include<baz.cpp>

Then you compile just this file. This makes the compilation go faster. A lot faster. As an example, converting Qt Creator to compile as a unity build made it compile 90% faster. Counterintuitively, it is even faster to compile a unity file with one core than to use four cores to build the files individually. If this is the first time you have encountered unity builds, this probably feels like a sham, something that just can't be possible. Not only is it possible, but unity builds are used in production in many places, especially in game development. As a different kind of example, SQLite ships as a single "amalgamation file", which is the same thing as a unity build. Unity builds also act as a sort of poor man's link time optimization, which works even on compilers that do not support LTO natively.
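
To make the LTO-like effect concrete, here is a hypothetical illustration (file and function names are made up):

// util.cpp
int clamp_percent(int v) { return v < 0 ? 0 : (v > 100 ? 100 : v); }

// report.cpp (in a regular build this would pull the declaration from a header)
int scale_to_byte(int v) { return clamp_percent(v) * 255 / 100; }

// Compiled separately, the call to clamp_percent() is an opaque call across
// object files unless LTO is enabled. In target-unity.cpp both definitions are
// in the same translation unit, so the compiler can inline clamp_percent()
// into scale_to_byte() without any LTO support.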

Unity builds have their own limitations and problems. Some of them are discussed at the end of this article.

Dependencies

Meson has had unity build support for a long time. The unfortunate downside is that incremental builds take a lot more time. This makes the edit-compile-debug cycle slower, which is annoying. There are now two merge requests outstanding (number one, number two) that aim to make the user experience better.

With these changes you can tell Meson to unity build only subprojects, not the main project. This means that all your deps build really fast but the master project is built incrementally. In most use cases subprojects are read only. Only the master project is edited. For most people dependencies are slower to build than projects using them, so this gives a nice productivity boost. The other merge request enables finer grained control by allowing the user to override unityness for each target separately.

Please try out the branches and write your feedback and comments to the merge requests.

Problems of unity builds

Not all projects can be built as unity builds out of the box. The most common problem is having static variables and functions with the same name in different source files. In a unity build these will clash and not compile. There are other problems of similar nature, especially for projects that do crazy things with the preprocessor. These problems are fixable but take some effort. Most open source projects probably won't compile as unity builds out of the box.
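
A hypothetical example of the most common clash: a file-local helper with the same name in two different files is perfectly valid when they are compiled separately, but becomes a redefinition once the unity file concatenates them.

// foo.cpp
static int parse_header(const char *s) { return s[0]; }
int foo() { return parse_header("foo"); }

// bar.cpp
static int parse_header(const char *s) { return s[1]; }  // same name, fine on its own
int bar() { return parse_header("bar"); }

// target-unity.cpp includes both files, so the compiler sees two definitions
// of parse_header() in one translation unit and stops with something like:
//   error: redefinition of 'int parse_header(const char*)'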

Any project that wants to support unity builds must, in practice, have gating CI that compiles the source as a unity build. Otherwise it will break every now and then.

Monday, February 13, 2017

Looking into what Rust can do that other languages can't ... or can they

In a recent blog post, What Rust Can Do That Other Languages Can't, the following statement is presented:

struct X {
  y: Y
}
impl X {
  fn y(&self) -> &Y { &self.y }
}

This defines an aggregate type containing a field y of type Y directly (not as a separate heap object). Then we define a getter method that returns the field by reference. Crucially, the Rust compiler will verify that all callers of the getter prevent the returned reference from outliving the X object. It compiles down to what you'd expect a C compiler to produce (pointer addition) with no space or time overhead.

Let's look into this a bit more deeply. But, since this is The Internet, first a pre-emptive statement.

Things not claimed in this blog post

To make things easier for forum commenters who have strong opinions but a hard time reading entire articles, here is a list of things that are not claimed. I repeat: not claimed.
  • that [other language] is as safe as Rust (it is most likely not)
  • that you should use [other language] instead of Rust
  • that you should use Rust instead of [other language]
  • that [other language] is safe out of the box (it probably is not)
  • that security feature X of language Y will prevent all bugs of type Z (it does not)

To business

Plain C code corresponding to the setup above is roughly the following:

#include <stdlib.h>

struct foo {
  int x;
};

struct bar {
  struct foo f;
};

/* Defined in another file; only the declaration matters here. */
void do_something(struct foo *f);

A simple function that demonstrates the bug is the following:

int main(int argc, char **argv) {
  struct foo *f;
  struct bar *expiring = malloc(sizeof(struct bar));
  f = &expiring->f;
  free(expiring);
  do_something(f);
  return 0;
}

This uses an obviously invalid pointer in a function call. Neither GCC nor Clang warns about this when compiling. Scan-build, however, does detect the bug:

expired.c:23:3: warning: Use of memory after it is freed
  do_something(f);
  ^~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bug found.

Address sanitizer also detects the invalid access operation inside do_something. Things get more interesting if we change the function so that expiring is on the stack instead of the heap.

int main(int argc, char **argv) {
  struct foo *b;
  {
    struct bar expiring;
    b = &expiring.f;
  }
  do_something(b);
  return 0;
}

This one is not detected at all. None of GCC, Clang, scan-build, ASan, Cppcheck or Valgrind sees anything wrong with this. Yet this seems so simple and obvious that it should be detectable automatically.

Recently this has become possible with Address sanitizer. It grew a new argument -fsanitize-address-use-after-scope, which checks for exactly this and finds the bug effortlessly:

==10717==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffe5d0c1920 at pc 0x00000050767f bp 0x7ffe5d0c18a0 sp 0x7ffe5d0c1898

This requires, of course, a run of the program that executes this code path. It would be nice to have this in a static analyzer so it could be found in even more cases (there is already a bug report for this for scan-build).

In conclusion

The security features of Rust are great. However, many of them (not all, but many) that were unique to Rust are now possible in other languages as well. If you are managing a project in C or C++ and you are not using asan + scan-build + other error checkers on your code, you should start using them all and fix the issues they detect. Do it even if you intend to rewrite your code in Rust, Go or any other language. Do it now! There is no excuse.

Wednesday, February 8, 2017

Enabling HTTPS is easy

To see just how easy it is, I looked at how well it is supported on a bunch of free software blog aggregator sites.

The cobbler's children have no shoes.

Sunday, January 29, 2017

Testing exception vs error code behaviour with real world code

Our previous looks at exception performance have used microbenchmarks and generated code (here and here). These are fun and informative tests but ultimately flawed. What actually matters is performance on real world code. The straightforward way of measuring this is to convert a program using error codes into one using exceptions. This is harder than it seems.

The problem is that most code using error codes is not exception safe, so this change cannot be done easily (I tried, would not recommend). Fortunately there is a simple solution: going the other way. A fully exception safe code base can easily be converted into one using GLib-style error objects with the following simple steps:

  1. replace all instances of throw with a call to a function such as Error* create_error(const char *msg)
  2. replace all catch statements with equivalent error object handling code
  3. alter the function signature of all functions creating or using error objects from void func() to void func(Error **e)
  4. change every call site to pass an error object and check if it points to non-null after the call, exit immediately if so
  5. repeat steps 3 and 4 recursively until the compiler errors go away

This implements the exception code flow with explicit, hand-written code.

For testing we used Parzip, a high performance PkZip implementation. For simplicity we removed all parallel code and also the parts that deal with Zip file creation. Thus we ended up with a simple single threaded Zip file unpacker. The full code is available in this repository.

Source code size

LOC is a poor metric in general but illuminating in this case, because it directly shows the difference between these two approaches. The measurement was done with wc and, ignoring the license header on each file, the implementation using exceptions is 1651 lines while the one with error objects is 1971 lines. This means that the error code version has roughly 19% more lines of source code than the equivalent code using exceptions.

Looking at the code it is easy to see where this difference comes from. As an extreme example, there is this function that reads data from a file with endianness swapping:

localheader read_local_entry(File &f) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le();
    h.gp_bitflag = f.read16le();
    h.compression = f.read16le();
    h.last_mod_time = f.read16le();
    h.last_mod_date = f.read16le();
    h.crc32 = f.read32le();

    /* And so on for a while. */
    return h;
}

And this is how it looks when using error objects:

localheader read_local_entry(File &f, Error **e) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le(e);
    if(*e) {
        return h;
    }
    h.gp_bitflag = f.read16le(e);
    if(*e) {
        return h;
    }
    h.compression = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_time = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_date = f.read16le(e);
    if(*e) {
        return h;
    }
    h.crc32 = f.read32le(e);
    if(*e) {
        return h;
    }

    /* And so on and so on. */
    return h;
}

This example nicely shows the way that exceptions can make code more readable. The former sample is simple, straightforward, linear and understandable code. The latter is not. It looks like idiomatic Go (their words, but also mine). Intermixing the happy path with error handling makes the code quite convoluted.
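
The difference shows up at the call sites too. With exceptions the error is handled once, in a catch block near the top of the program; with error objects the same check-and-bail pattern appears one final time. A sketch with made-up names (unpack_zip_file and the msg field are purely illustrative):

// Exception version: one handler, nothing in between.
try {
    unpack_zip_file("archive.zip");
} catch (const std::exception &e) {
    fprintf(stderr, "Unpacking failed: %s\n", e.what());
    return 1;
}

// Error object version: the final link in the hand-written error chain.
Error *e = nullptr;
unpack_zip_file("archive.zip", &e);
if (e) {
    fprintf(stderr, "Unpacking failed: %s\n", e->msg);
    delete e;
    return 1;
}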

Binary size

We tested the size of the generated code with GCC 6.2.0, Clang 3.8.1 and Clang trunk. We built the code on 64-bit Ubuntu 16.10 using -g -O2, and the error code version was built both with and without -fno-exceptions.

The results are, well, weird. Building with -fno-exceptions reduces code size noticeably in every case, but the other parts are odd. When not using -fno-exceptions, the code that uses exceptions is within a few dozen bytes of the size of the code that uses error objects. GCC manages to make the error code version smaller even though it has the above-mentioned 19% more lines of code. Some of this may be caused by the fact that -fno-exceptions links in less stuff from the C++ standard library. This was not researched further, though.

Looking at the ELF tables we find that the extra size taken by exceptions goes into the text segment rather than the data segment. One would have expected it to go into unwind tables in a read-only data segment instead of text (which houses runnable code).

Things get even weirder with noexcept specifications. We added those to all functions in the exception code base which do not take an error object pointer in the error code version (meaning they will never throw). We then measured the sizes again and found that the resulting binaries had exactly the same size. On all three compilers. One would have expected some change (such as smaller exception unwind tables or anything) but no.
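
For clarity, the change was of this form (the function name is hypothetical, localheader is the struct from the code above):

// Before: as far as the compiler knows, this might throw.
uint64_t data_offset(const localheader &h);

// After: declared as never throwing, mirroring the functions that take no
// Error** parameter in the error object version.
uint64_t data_offset(const localheader &h) noexcept;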

What have we learned from all this?

Mainly that things behave in more complex ways than one would expect. The binary sizes in particular are surprising, especially the way Clang produces the same binary size for both error objects and exceptions. Even with this contradictory information we can draw the following conclusions:
  • using exceptions can cause a noticeable reduction in lines of code compared to the equivalent functionality using error objects
  • compiling with -fno-exceptions can reduce binary size by 10-15%
Other than that, these measurements really raise more questions than they answer. Why does noexcept not change the output size? Why does Clang produce the same code size with exceptions and error objects? Are exception performance and code size fully optimized, or could they be improved further (this area probably has not received as much work because many compiler contributors do not use exceptions themselves)?

Saturday, January 21, 2017

A Python extension module using C, C++, FORTRAN and Rust

One of the advantages of using a general build system is that combining code written in different languages is easy. To demonstrate this I wrote a simple Python module called Polysnake. It compiles and links four different compiled languages into one shared extension module. The languages are C, C++, FORTRAN and Rust.

The build system setup consists of five declarations in Meson. It should be doable in four, but Rust is a bit special. Ignoring the boilerplate, the core business logic looks like this:

rustlib = static_library('func', 'func.rs')

py3_mod.extension_module('polysnake',
  'polysnake.c',
  'func.cpp',
  'ffunc.f90',
  link_with : rustlib,
  dependencies : py3_dep)

The code is available on Github.

Compiling and running

Compiling is done using the default Meson commands:

meson build
ninja -C build

Once built the main script can be run with this command:

PYTHONPATH=build ./main.py

The script just calls into the module and prints the string it returns. The output looks like this:

Combining many languages is simple.

This line is created in C.
This line is created in FORTRAN.
This line is created in C++.
This line is created in Rust.

Why not COBOL?

Saturday, January 14, 2017

Using a smart card for decryption from the command line

In Finland the national ID card contains a fully featured smart card, much like in Belgium and several other countries. It can be used for all sorts of cool things, like SSH logins. Unfortunately using this card for HW crypto operations is not easy and there is not a lot of easily digestible documentation available. Here's how you would do some simple operations from the command line.

The commands here should work on most other smart cards with very few changes. Note that these are just examples; they are not hardened for security at all. In production you'd use the libraries directly, keep all data in memory rather than putting it in temp files, and so on.

First you plug in the device and card and check the status:

$ pkcs15-tool -k

Private RSA Key [todentamis- ja salausavain]
Object Flags   : [0x1], private
Usage          : [0x26], decrypt, sign, unwrap
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:01;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 01
ID             : 45
MD:guid        : XXXXXXXXXXXX

Private RSA Key [allekirjoitusavain]
Object Flags   : [0x1], private
Usage          : [0x200], nonRepudiation
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:02;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 02
ID             : 46
MD:guid        : XXXXXXXXXXXX

This card has two keys. We use the first one whose usage is "decrypt". Its ID number is 45. Now we need to extract the public key from the card:

$ pkcs15-tool --read-public-key 45 > mykey.pub

Next we generate some data and encrypt it with the public key. The important thing to note here is that you can only encrypt a small amount of data this way, at most the size of the RSA modulus minus padding overhead (around 245 bytes for this 2048-bit key).

$ echo secret message > original

Then we encrypt this message with OpenSSL using the extracted public key.

$ openssl rsautl -encrypt -inkey mykey.pub -pubin -in original -out encrypted

The file encrypted now contains the encrypted message. The only way to decrypt it is to transfer the data to the smart card for decryption, because the private key can only be accessed inside the card. This is achieved with the following command.

$ pkcs11-tool --decrypt -v --input-file encrypted --output-file decrypted --id 45 -l -m RSA-PKCS --pin 1234

After this the decrypted contents have been written to the file decrypted. The important bit here is -m RSA-PKCS, which specifies the exact form of RSA to use. There are several, and if you use the wrong one the output will be random bytes without any errors or warnings.

What can this be used for?

Passwordless full disk encryption is one potential use. Combining a card reader with a Raspberry Pi and an electrical lock makes for a pretty spiffy DIY physical access control system.