Monday, February 13, 2017

Looking into what Rust can do that other languages can't ... or can they

In a recent blog post, What Rust Can Do That Other Languages Can't, the following statement is presented:

struct X {
  y: Y
}
impl X {
  fn y(&self) -> &Y { &self.y }
}

This defines an aggregate type containing a field y of type Y directly (not as a separate heap object). Then we define a getter method that returns the field by reference. Crucially, the Rust compiler will verify that all callers of the getter prevent the returned reference from outliving the X object. It compiles down to what you'd expect a C compiler to produce (pointer addition) with no space or time overhead.

Let's look into this a bit more deeply. But, since this is The Internet, first a pre-emptive statement.

Things not claimed in this blog post

To make things easier for forum discussers who have strong opinions but a hard time reading entire articles, here is a list of things that are not claimed. I repeat: not claimed.
  • that [other language] is as safe as Rust (it is most likely not)
  • that you should use [other language] instead of Rust
  • that you should use Rust instead of [other language]
  • that [other language] is safe out of the box (it probably is not)
  • that security feature X of language Y will prevent all bugs of type Z (it does not)

To business

Plain C code corresponding to the setup above is roughly the following:

struct foo {
  int x;
};

struct bar {
  struct foo f;
};

A simple function that demonstrates the bug is the following:

#include <stdlib.h>

void do_something(struct foo *f); /* defined elsewhere in the program */

int main(int argc, char **argv) {
  struct foo *f;
  struct bar *expiring = malloc(sizeof(struct bar));
  f = &expiring->f;
  free(expiring);
  do_something(f);
  return 0;
}

This uses an obviously invalid pointer in a function call. Neither GCC nor Clang warns about this when compiling. Scan-build, however, does detect the bug:

expired.c:23:3: warning: Use of memory after it is freed
  do_something(f);
  ^~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bug found.
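
For reference, scan-build is used simply by prefixing the normal compile command, roughly like this (expired.c being the file name from the diagnostic above):

$ scan-build clang -c expired.c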

Address sanitizer also detects the invalid access operation inside do_something. Things get more interesting if we change the function so that expiring is on the stack instead of the heap.

int main(int argc, char **argv) {
  struct foo *b;
  {
    struct bar expiring;
    b = &expiring.f;
  }
  do_something(b);
  return 0;
}

This one is not detected at all. Neither GCC, Clang, Scan-build, Asan, CppCheck nor Valgrind sees anything wrong with it. Yet this seems so simple and obvious that it should be detectable automatically.

Recently this has become possible with Address sanitizer. It grew a new argument -fsanitize-address-use-after-scope, which checks for exactly this and finds the bug effortlessly:

==10717==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffe5d0c1920 at pc 0x00000050767f bp 0x7ffe5d0c18a0 sp 0x7ffe5d0c1898

This requires, of course, a run of the program that executes this code path. It would be nice to have this in a static analyzer so it could be found in even more cases (there is already a bug report for this for scan-build).
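
For those who want to try it themselves, the check is enabled on top of the normal Address sanitizer flags, roughly like this (assuming the snippet above has been completed into a full program in expired.c):

$ clang -g -fsanitize=address -fsanitize-address-use-after-scope expired.c -o expired
$ ./expired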

In conclusion

The security features of Rust are great. However, many of the checks (not all, but many) that used to be unique to Rust are now possible in other languages as well. If you are managing a project in C or C++ and you are not using asan + scan-build + other error checkers on your code, you should use them all and fix the issues detected. Do it even if you intend to rewrite your code in Rust, Go or any other language. Do it now! There is no excuse.

Wednesday, February 8, 2017

Enabling HTTPS is easy

To see how easy it is, I looked at how well it is supported on a bunch of free software blog aggregator sites.

The cobbler's children have no shoes.

Sunday, January 29, 2017

Testing exception vs error code behaviour with real world code

Our previous looks at exception performance have used microbenchmarks and generated code (here and here). These are fun and informative tests but ultimately flawed. What actually matters is performance on real world code. The straightforward way of measuring this is to convert a program using error codes into one using exceptions. This is harder than it seems.

The problem is that most code using error codes is not exception safe, so this change cannot be done easily (I tried, would not recommend). Fortunately there is a simple solution: going the other way. A fully exception safe code base can easily be converted into one using GLib-style error objects with the following simple steps:

  1. replace all instances of throw with a call to a function such as Error* create_error(const char *msg)
  2. replace all catch statements with equivalent error object handling code
  3. alter the function signature of all functions creating or using error objects from void func() to void func(Error **e)
  4. change every call site to pass an error object and check whether it is non-null after the call, returning immediately if so
  5. repeat steps 3 and 4 recursively until the compiler errors go away
This implements the exception code flow with explicit, hand-written code. A sketch of what such an error object might look like is given below.
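
For concreteness, here is a minimal sketch of what such a GLib-style error object could look like. The names Error and create_error come from the steps above; everything else here is illustrative rather than the actual code used in the conversion.

#include <string>

/* A minimal GLib-style error object. */
struct Error {
    std::string message;
};

/* Called wherever the exception version would have thrown. */
Error* create_error(const char *msg) {
    return new Error{msg};
}

/* The caller that finally handles the error deletes it. */
void free_error(Error *e) {
    delete e;
}

The read_local_entry listings below show what the call sites end up looking like after the conversion.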

For testing we used Parzip, a high performance PkZip implementation. For simplicity we removed all parallel code and also the parts that deal with Zip file creation. Thus we ended up with a simple single threaded Zip file unpacker. The full code is available in this repository.

Source code size

LOC is a poor metric in general but illuminating in this case, because it directly shows the difference between these two approaches. The measurement is done with wc and if we ignore the license header on each file we find that the implementation using exceptions is 1651 lines and the one with error objects has 1971 lines. This means that error codes have roughly 19% more lines of source code than the equivalent code using exceptions. 

Looking at the code it is easy to see where this difference comes from. As an extreme example there is this function that reads data from file with endianness swapping:

localheader read_local_entry(File &f) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le();
    h.gp_bitflag = f.read16le();
    h.compression = f.read16le();
    h.last_mod_time = f.read16le();
    h.last_mod_date = f.read16le();
    h.crc32 = f.read32le();

    /* And so on for a while. */
    return h;
}

And this is how it looks when using error objects:

localheader read_local_entry(File &f, Error **e) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le(e);
    if(*e) {
        return h;
    }
    h.gp_bitflag = f.read16le(e);
    if(*e) {
        return h;
    }
    h.compression = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_time = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_date = f.read16le(e);
    if(*e) {
        return h;
    }
    h.crc32 = f.read32le(e);
    if(*e) {
        return h;
    }

    /* And so on and so on. */
    return h;
}

This example nicely shows the way that exceptions can make code more readable. The former sample is simple, straightforward, linear and understandable code. The latter is not. It looks like idiomatic Go (their words, but also mine). Intermixing the happy path with error handling makes the code quite convoluted.

Binary size

We tested the size of the generated code with GCC 6.2.0, Clang 3.8.1 and with Clang trunk. We built the code on 64 bit Ubuntu 16.10 using -g -O2, and the error code version was built both with and without -fno-exceptions.

The results are, well, weird. Building with -fno-exceptions reduces code size noticeably in every case but the other results are odd. When not using -fno-exceptions, the code that uses exceptions is within a few dozen bytes of the size of the code that uses error objects. GCC manages to make the error code version smaller even though it has the above-mentioned 19% more lines of code. Some of this may be caused by the fact that -fno-exceptions links in less stuff from the C++ standard library. This was not researched further, though.

Looking at the ELF tables we find that the extra size taken by exceptions goes in the text segment rather than data segment. One would have expected it to have gone to unwind tables in a readonly data segment instead of text (which houses runnable code).
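
The section-by-section numbers can be inspected with standard binutils tools, roughly like this (the binary name depends on what your build produces, here assumed to be parzip in the build directory):

$ size -A build/parzip
$ readelf -S build/parzip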

Things get even weirder with noexcept specifications. We added those to all functions in the exception code base which do not take an error object pointer in the error code version (meaning they will never throw). We then measured the sizes again and found that the resulting binaries had exactly the same size. On all three compilers. One would have expected some change (such as smaller exception unwind tables or anything) but no.
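
To be explicit about what this means in practice, adding a noexcept specification is just an annotation on the function declaration, along these lines (a made-up example type, not one from the Parzip sources):

#include <cstdint>

struct entryinfo {
    uint64_t size;
    /* Can never fail, so it is marked noexcept. */
    uint64_t compressed_size() const noexcept { return size; }
};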

What have we learned from all this?

Mainly that things have more complex behavior than one would expect. The binary sizes in particular are unexpected, especially the way Clang produces the same binary size for both error objects and exceptions. Even with this contradictory information we can make the following conclusions:
  • using exceptions can cause a noticeable reduction in lines of code compared to the equivalent functionality using error objects
  • compiling with -fno-exceptions can reduce binary size by 10-15%
Other than that these measurements really raise more questions than they answer. Why does noexcept not change output size? Why does Clang produce the same code size with exceptions and error objects? Are exception performance and code size fully optimized, or could they be improved further (this area probably has not had as much work done because many compiler contributors do not use exceptions internally)?

Saturday, January 21, 2017

A Python extension module using C, C++, FORTRAN and Rust

One of the advantages of using a general build system is that combining code written in different languages is easy. To demonstrate this I wrote a simple Python module called Polysnake. It compiles and links four different compiled languages into one shared extension module. The languages are C, C++, FORTRAN and Rust.

The build system setup consists of five declarations in Meson. It should be doable in four, but Rust is a bit special. Ignoring the boilerplate the core business logic looks like this:

rustlib = static_library('func', 'func.rs')

py3_mod.extension_module('polysnake',
  'polysnake.c',
  'func.cpp',
  'ffunc.f90',
  link_with : rustlib,
  dependencies : py3_dep)

The code is available on Github.

Compiling and running

Compiling is done using the default Meson commands:

meson build
ninja -C build

Once built the main script can be run with this command:

PYTHONPATH=build ./main.py

The script just calls into the module and prints the string it returns. The output looks like this:

Combining many languages is simple.

This line is created in C.
This line is created in FORTRAN.
This line is created in C++.
This line is created in Rust.

Why not COBOL?

Saturday, January 14, 2017

Using a smart card for decryption from the command line

In Finland the national ID card contains a fully featured smart card, much like in Belgium and several other countries. It can be used for all sorts of cool things, like SSH logins. Unfortunately using this card for HW crypto operations is not easy and there is not a lot of easily digestible documentation available. Here's how you would do some simple operations from the command line.

The commands here should work on most other smart cards with very little changes. Note that these are just examples, they are not hardened for security at all. In production you'd use the libraries directly, keep all data in memory rather than putting it in temp files and so on.

First you plug in the device and card and check the status:

$ pkcs15-tool -k

Private RSA Key [todentamis- ja salausavain]
Object Flags   : [0x1], private
Usage          : [0x26], decrypt, sign, unwrap
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:01;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 01
ID             : 45
MD:guid        : XXXXXXXXXXXX

Private RSA Key [allekirjoitusavain]
Object Flags   : [0x1], private
Usage          : [0x200], nonRepudiation
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:02;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 02
ID             : 46
MD:guid        : XXXXXXXXXXXX

This card has two keys. We use the first one whose usage is "decrypt". Its ID number is 45. Now we need to extract the public key from the card:

$ pkcs15-tool --read-public-key 45 > mykey.pub

Next we generate some data and encrypt it with the public key. The important thing to note here is that you can only encrypt a small amount of data, on the order of a few dozen bytes.

$ echo secret message > original

Then we encrypt this message with OpenSSL using the extracted public key.

$ openssl rsautl -encrypt -inkey mykey.pub -pubin -in original -out encrypted

The file encrypted now contains the encrypted message. The only way to decrypt it is to transfer the data to the smart card for decryption, because the private key can only be accessed inside the card. This is achieved with the following command.

$ pkcs11-tool --decrypt -v --input-file encrypted --output-file decrypted --id 45 -l -m RSA-PKCS --pin 1234

After this the decrypted contents have been written to the file decrypted. The important bit here is -m RSA-PKCS, which specifies the exact form of RSA to use. There are several and if you use the wrong one the output will be random bytes without any errors or warnings.
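
To check that the round trip worked you can compare the original and the decrypted file; cmp prints nothing if they are identical:

$ cmp original decrypted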

What can this be used for?

Passwordless full disk encryption is one potential use. Combining a card reader with a Raspberry Pi and an electrical lock makes for a pretty spiffy DIY physical access control system.

Sunday, January 8, 2017

Measuring execution performance of C++ exceptions vs error codes

In an earlier blog post we found out that C++ exceptions produce smaller executables than corresponding code using error codes. The most common comment on it was that executable size is not that important; performance is what matters. Since we have the tooling, let's do performance measurements as well.

Test setup

We generate code that calls a few other functions that call other functions and so on until we reach the bottommost function. In that function we either raise an exception or error or return success. This gives us two parameters to vary: the depth of the call hierarchy and what percentage of calls produce an error.

The code itself is simple. The C++ version looks like this:

int func0() {
    int num = random() % 5;
    if(num == 0) {
        return func1();
    }
    if(num == 1) {
        return func2();
    }
    if(num == 2) {
        return func3();
    }
    if(num == 3) {
        return func4();
    }
    return func5();
}

We generate a random number and based on that call a function deeper in the hierarchy. This means that every time we call the test it goes from the top function to the final function along a random path. We use the C random number generator and seed it with a known value so both C and C++ follow the same call chain, even though they are different binaries. Note that this function calls, on average, a function whose number is its own number plus three (the average of the five possible jumps). So if we produce 100 functions, the call stack is on average 100/3 = 33 entries deep when the error is generated.

The plain C version that uses error objects is almost identical:

int func0(struct Error **error) {
    int num = random() % 5;
    int res;
    if(num == 0) {
        res = func1(error);
    } else if(num == 1) {
        res = func2(error);
    } else if(num == 2) {
        res = func3(error);
    } else if(num == 3) {
        res = func4(error);
    } else {
        res = func5(error);
    }
    /* This is a no-op in this specific code but is here to simulate
     * real code that would check the error and based on that
     * do something. We must have a branch on the error condition,
     * because that is what real world code would have as well.
     */
    if(*error) {
        return -1;
    }

    return res;
}
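
For completeness, the bottommost function is the one that actually fails. In the exception version it looks roughly like the following sketch (the function name and the hard-coded 2% error rate are illustrative; the real generated code is in the Github repo linked below):

#include <cstdlib>
#include <stdexcept>

int func100() {
    if(random() % 100 < 2) { /* fail on roughly 2% of the calls */
        throw std::runtime_error("error");
    }
    return 0;
}

In the error code version the same function fills in the error object and returns instead of throwing.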

The code for generating this code and running the measurements can be found in this Github repo.

Results

We do not care so much about how long the executables take to run, only which one of them runs faster. Thus we generate output results that look like this (Ubuntu 16.10, GCC 6.2, Intel i7-3770):

CCeCCCCCCC
CCCCCCccec
eEcCCCeECE
cceEEeECCe
EEEcEEEeCE
EEEEEeccEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

In this diagram C means that plain C error codes are faster and E that exceptions are faster for that particular pair of parameters. For easier reading all cases where error codes win have been given a red background color. Each row represents a different number of functions. The top row has 50 functions (call depth of 16) and each row adds 50 so the bottommost row has 500 functions.

The columns represent different error rates. The leftmost column has 0% errors, the next one has 1% and so on. A capital letter means that the runtime difference was bigger than 10%. Each executable was executed five times and the fastest run was taken in each case. Full results look like the following:

GCC 6.2      Clang 3.8.1  Clang 4.0 trunk

CCeCCCCCCC   EecCcCcCcC   ECcECcEeCC
CCCCCCccec   CceEEeEcEC   EEeEECEcCC
eEcCCCeECE   EEEEEECEeE   EEEECECeCC
cceEEeECCe   EcEeEECEEE   EEEEceeEeC
EEEcEEEeCE   EEEeEEEEEe   EEEEEEEEEE
EEEEEeccEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE


Immediately we see that once the stack depth grows above a certain size (here 200/3 = 66), exceptions are always faster. This is not very interesting, because call stacks are usually not this deep (enterprise Java notwithstanding). For lower depths there is a lot of noise, especially for GCC. However we can see that for small error rates exceptions are sometimes faster, especially with Clang. Because of the noise we redid the measurements several times. This changed the individual measurements but produced the same overall shape, where small error rates favor exceptions but error codes become faster as the error rate grows.

On a Raspberry Pi the results look like this:

GCC 4.9.2

eeccCCCCCC
EeeeccccCC
Eeeeeecccc
Eeeeeeeccc
EEeeeeeeec
EEEEEeeeee
EEEEEEEEee
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

Here the results are a lot clearer and more consistent, probably due to the simpler architecture of the ARM processor. Exceptions are faster except for small call depths and large error rates. When using 50 functions, error objects only become noticeably faster at a 4% error rate.

Finally let's do the same measurements on an i5-6600K.

GCC 4.8.4    GCC 5.4.0   Clang 3.{4, 6, 8}

ECCCCCCCCCC  eeccCCCCCC  EeeccCCCCC
EecCCCCCCCC  EEEeeeeccc  EEEEEeeeee
EEecCCCCCCC  EEEEEEeeee  EEEEEEEEEE
EEEeecCCCCC  EEEEEEEEEe  EEEEEEEEEE
EEEEeecccCC  EEEEEEEEEE  EEEEEEEEEE
EEEEEeeeccc  EEEEEEEEEE  EEEEEEEEEE
EEEEEEeeeec  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEeee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEEe  EEEEEEEEEE  EEEEEEEEEE

Here we see that the compiler used can make a massive difference. GCC 4.8.4 favors error codes but 5.4.0 is the opposite as are all versions of Clang (which produce identical results). On this platform exceptions seem to be even more performant than on the Pi.

The top left corner seems to be the most interesting part of this entire diagram, as it represents shallow call chains with few errors. This is similar to most real world code used in, for example, XML or JSON parsing. Let's do some more measurements focusing on this area.

i5               Raspberry Pi

GCC    Clang     GCC    Clang
5.4.0  3.8       4.9.2  3.5.0

cCCCC  eCCCC     eCCCC  eCCCC
ecCCC  EeCCC     ecCCC  ecCCC
ecCCC  EeccC     eccCC  eccCC
eccCC  EEecc     eecCC  eecCC
eeccC  EEEee     eeccC  eeccC


In these measurements the parameters go from 10 functions on the top row to 50 on the bottom row in increments of 10, and the columns go from 0% error on the left to 4% on the right in increments of 1. Here we see that exceptions keep their performance advantage when there are no errors, especially when using Clang, but error codes get better as soon as the error rate goes up even a little.

Conclusions

Whenever a discussion on C++ exceptions occurs, there is That Guy who comes in and says "C++ exceptions are slow, don't use them". At this point the discussion usually breaks down into fighting about whether exceptions are fast or not. These discussions are ultimately pointless, because asking whether exceptions are "fast" is the wrong question.

The proper question to ask is "under which circumstances are exceptions faster". As we have seen here, the answer is surprisingly complex. It depends on many factors, including the platform, compiler, code size and error rate. Blanket statements about the performance of X as compared to Y are usually oversimplified, misleading and just plain wrong. Such is the case here as well.

(Incidentally if someone can explain why i7 has those weird performance fluctuations, please post in the comments.)

Thanks to Juhani Simola for running the i5 measurements.

Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same, slightly erroneous, points so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classical example PKZIP supports parallel decompression but its archives are a lot bigger than packed tar files.

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format consists of a bunch of metadata and an embedded tar file. RPM is the same, except that it has a cpio archive rather than tar. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tarlike format, only one CPU can be used for processing them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building Libreoffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to be. It was merely an experiment to see if a better file format was possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.