Monday, February 13, 2017

Looking into what Rust can do that other languages can't ... or can they

In a recent blog post, What Rust Can Do That Other Languages Can't, the following statement is presented:

struct X {
  y: Y
}
impl X {
  fn y(&self) -> &Y { &self.y }
}

This defines an aggregate type containing a field y of type Y directly (not as a separate heap object). Then we define a getter method that returns the field by reference. Crucially, the Rust compiler will verify that all callers of the getter prevent the returned reference from outliving the X object. It compiles down to what you'd expect a C compiler to produce (pointer addition) with no space or time overhead.

Let's look into this a bit more deeply. But, since this is The Internet, first a pre-emptive statement.

Things not claimed in this blog post

To make things easier for forum discussers who have strong opinions but a hard time reading entire articles, here is a list of things that are not claimed. I repeat: not claimed.
  • that [other language] is as safe as Rust (it is most likely not)
  • that you should use [other language] instead of Rust
  • that you should use Rust instead of [other language]
  • that [other language] is safe out of the box (it probably is not)
  • that security feature X of language Y will prevent all bugs of type Z (it does not)

To business

Plain C code corresponding to the setup above is roughly the following:

struct foo {
  int x;
};

struct bar {
  struct foo f;
};

A simple function that demonstrates the bug is the following:

#include <stdlib.h>

void do_something(struct foo *f); /* defined elsewhere in the program */

int main(int argc, char **argv) {
  struct foo *f;
  struct bar *expiring = malloc(sizeof(struct bar));
  f = &expiring->f;
  free(expiring);
  do_something(f);
  return 0;
}

This uses an obviously invalid pointer in a function call. Neither GCC nor Clang warns about this when compiling. Scan-build, however, does detect the bug:

expired.c:23:3: warning: Use of memory after it is freed
  do_something(f);
  ^~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bug found.
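
For reference, scan-build is used simply by prefixing the normal compile command, roughly like this (expired.c being the file name from the diagnostic above):

$ scan-build clang -c expired.c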

Address sanitizer also detects the invalid access operation inside do_something. Things get more interesting if we change the function so that expiring is on the stack instead of the heap.

int main(int argc, char **argv) {
  struct foo *b;
  {
    struct bar expiring;
    b = &expiring.f;
  }
  do_something(b);
  return 0;
}

This one is not detected at all. Neither GCC, Clang, Scan-build, Asan, CppCheck nor Valgrind sees anything wrong with it. Yet this seems so simple and obvious that it should be detectable automatically.

Recently this has become possible with Address sanitizer. It grew a new argument -fsanitize-address-use-after-scope, which checks for exactly this and finds the bug effortlessly:

==10717==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffe5d0c1920 at pc 0x00000050767f bp 0x7ffe5d0c18a0 sp 0x7ffe5d0c1898

This requires, of course, a run of the program that executes this code path. It would be nice to have this in a static analyzer so it could be found in even more cases (there is already a bug report for this for scan-build).
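
For those who want to try it themselves, the check is enabled on top of the normal Address sanitizer flags, roughly like this (assuming the snippet above has been completed into a full program in expired.c):

$ clang -g -fsanitize=address -fsanitize-address-use-after-scope expired.c -o expired
$ ./expired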

In conclusion

The security features of Rust are great. However, many of the checks (not all, but many) that used to be unique to Rust are now possible in other languages as well. If you are managing a project in C or C++ and you are not using asan + scan-build + other error checkers on your code, you should use them all and fix the issues detected. Do it even if you intend to rewrite your code in Rust, Go or any other language. Do it now! There is no excuse.

Wednesday, February 8, 2017

Enabling HTTPS is easy

To see how easy it is, I looked at how well it is supported on a bunch of free software blog aggregator sites.

The cobbler's children have no shoes.

Sunday, January 29, 2017

Testing exception vs error code behaviour with real world code

Our previous looks at exception performance have used microbenchmarks and generated code (here and here). These are fun and informative tests but ultimately flawed. What actually matters is performance on real world code. The straightforward way of measuring this is to convert a program using error codes into one using exceptions. This is harder than it seems.

The problem is that most code using error codes is not exception safe, so this change cannot be done easily (I tried, would not recommend). Fortunately there is a simple solution: going the other way. A fully exception safe code base can easily be converted into one using GLib-style error objects with the following simple steps:

  1. replace all instances of throw with a call to a function such as Error* create_error(const char *msg)
  2. replace all catch statements with equivalent error object handling code
  3. alter the function signature of all functions creating or using error objects from void func() to void func(Error **e)
  4. change every call site to pass an error object and check whether it is non-null after the call, returning immediately if so
  5. repeat steps 3 and 4 recursively until the compiler errors go away
This implements the exception code flow with explicit, hand-written code. A sketch of what such an error object might look like is given below.
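
For concreteness, here is a minimal sketch of what such a GLib-style error object could look like. The names Error and create_error come from the steps above; everything else here is illustrative rather than the actual code used in the conversion.

#include <string>

/* A minimal GLib-style error object. */
struct Error {
    std::string message;
};

/* Called wherever the exception version would have thrown. */
Error* create_error(const char *msg) {
    return new Error{msg};
}

/* The caller that finally handles the error deletes it. */
void free_error(Error *e) {
    delete e;
}

The read_local_entry listings below show what the call sites end up looking like after the conversion.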

For testing we used Parzip, a high performance PkZip implementation. For simplicity we removed all parallel code and also the parts that deal with Zip file creation. Thus we ended up with a simple single threaded Zip file unpacker. The full code is available in this repository.

Source code size

LOC is a poor metric in general but illuminating in this case, because it directly shows the difference between these two approaches. The measurement is done with wc and if we ignore the license header on each file we find that the implementation using exceptions is 1651 lines and the one with error objects has 1971 lines. This means that error codes have roughly 19% more lines of source code than the equivalent code using exceptions. 

Looking at the code it is easy to see where this difference comes from. As an extreme example there is this function that reads data from file with endianness swapping:

localheader read_local_entry(File &f) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le();
    h.gp_bitflag = f.read16le();
    h.compression = f.read16le();
    h.last_mod_time = f.read16le();
    h.last_mod_date = f.read16le();
    h.crc32 = f.read32le();

    /* And so on for a while. */
    return h;
}

And this is how it looks when using error objects:

localheader read_local_entry(File &f, Error **e) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le(e);
    if(*e) {
        return h;
    }
    h.gp_bitflag = f.read16le(e);
    if(*e) {
        return h;
    }
    h.compression = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_time = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_date = f.read16le(e);
    if(*e) {
        return h;
    }
    h.crc32 = f.read32le(e);
    if(*e) {
        return h;
    }

    /* And so on and so on. */
    return h;
}

This example nicely shows the way that exceptions can make code more readable. The former sample is simple, straightforward, linear and understandable code. The latter is not. It looks like idiomatic Go (their words, but also mine). Intermixing the happy path with error handling makes the code quite convoluted.

Binary size

We tested the size of the generated code with GCC 6.2.0, Clang 3.8.1 and with Clang trunk. We built the code on 64 bit Ubuntu 16.10 using -g -O2, and the error code version was built both with and without -fno-exceptions.

The results are, well, weird. Building with -fno-exceptions reduces code size noticeably in every case but the other results are odd. When not using -fno-exceptions, the code that uses exceptions is within a few dozen bytes of the size of the code that uses error objects. GCC manages to make the error code version smaller even though it has the above-mentioned 19% more lines of code. Some of this may be caused by the fact that -fno-exceptions links in less stuff from the C++ standard library. This was not researched further, though.

Looking at the ELF tables we find that the extra size taken by exceptions goes in the text segment rather than data segment. One would have expected it to have gone to unwind tables in a readonly data segment instead of text (which houses runnable code).
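
The section-by-section numbers can be inspected with standard binutils tools, roughly like this (the binary name depends on what your build produces, here assumed to be parzip in the build directory):

$ size -A build/parzip
$ readelf -S build/parzip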

Things get even weirder with noexcept specifications. We added those to all functions in the exception code base which do not take an error object pointer in the error code version (meaning they will never throw). We then measured the sizes again and found that the resulting binaries had exactly the same size. On all three compilers. One would have expected some change (such as smaller exception unwind tables or anything) but no.
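
To be explicit about what this means in practice, adding a noexcept specification is just an annotation on the function declaration, along these lines (a made-up example type, not one from the Parzip sources):

#include <cstdint>

struct entryinfo {
    uint64_t size;
    /* Can never fail, so it is marked noexcept. */
    uint64_t compressed_size() const noexcept { return size; }
};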

What have we learned from all this?

Mainly that things have more complex behavior than one would expect. The binary sizes in particular are unexpected, especially the way Clang produces the same binary size for both error objects and exceptions. Even with this contradictory information we can make the following conclusions:
  • using exceptions can cause a noticeable reduction in lines of code compared to the equivalent functionality using error objects
  • compiling with -fno-exceptions can reduce binary size by 10-15%
Other than that these measurements really raise more questions than they answer. Why does noexcept not change output size? Why does Clang produce the same code size with exceptions and error objects? Are exception performance and code size fully optimized, or could they be improved further (this area probably has not had as much work done because many compiler contributors do not use exceptions internally)?

Saturday, January 21, 2017

A Python extension module using C, C++, FORTRAN and Rust

One of the advantages of using a general build system is that combining code written in different languages is easy. To demonstrate this I wrote a simple Python module called Polysnake. It compiles and links four different compiled languages into one shared extension module. The languages are C, C++, FORTRAN and Rust.

The build system setup consists of five declarations in Meson. It should be doable in four, but Rust is a bit special. Ignoring the boilerplate the core business logic looks like this:

rustlib = static_library('func', 'func.rs')

py3_mod.extension_module('polysnake',
  'polysnake.c',
  'func.cpp',
  'ffunc.f90',
  link_with : rustlib,
  dependencies : py3_dep)

The code is available on Github.

Compiling and running

Compiling is done using the default Meson commands:

meson build
ninja -C build

Once built the main script can be run with this command:

PYTHONPATH=build ./main.py

The script just calls into the module and prints the string it returns. The output looks like this:

Combining many languages is simple.

This line is created in C.
This line is created in FORTRAN.
This line is created in C++.
This line is created in Rust.

Why not COBOL?

Saturday, January 14, 2017

Using a smart card for decryption from the command line

In Finland the national ID card contains a fully featured smart card, much like in Belgium and several other countries. It can be used for all sorts of cool things, like SSH logins. Unfortunately using this card for HW crypto operations is not easy and there is not a lot of easily digestible documentation available. Here's how you would do some simple operations from the command line.

The commands here should work on most other smart cards with very little changes. Note that these are just examples, they are not hardened for security at all. In production you'd use the libraries directly, keep all data in memory rather than putting it in temp files and so on.

First you plug in the device and card and check the status:

$ pkcs15-tool -k

Private RSA Key [todentamis- ja salausavain]
Object Flags   : [0x1], private
Usage          : [0x26], decrypt, sign, unwrap
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:01;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 01
ID             : 45
MD:guid        : XXXXXXXXXXXX

Private RSA Key [allekirjoitusavain]
Object Flags   : [0x1], private
Usage          : [0x200], nonRepudiation
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:02;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 02
ID             : 46
MD:guid        : XXXXXXXXXXXX

This card has two keys. We use the first one whose usage is "decrypt". Its ID number is 45. Now we need to extract the public key from the card:

$ pkcs15-tool --read-public-key 45 > mykey.pub

Next we generate some data and encrypt it with the public key. The important thing to note here is that you can only encrypt a small amount of data, on the order of a few dozen bytes.

$ echo secret message > original

Then we encrypt this message with OpenSSL using the extracted public key.

$ openssl rsautl -encrypt -inkey mykey.pub -pubin -in original -out encrypted

The file encrypted now contains the encrypted message. The only way to decrypt it is to transfer the data to the smart card for decryption, because the private key can only be accessed inside the card. This is achieved with the following command.

$ pkcs11-tool --decrypt -v --input-file encrypted --output-file decrypted --id 45 -l -m RSA-PKCS --pin 1234

After this the decrypted contents have been written to the file decrypted. The important bit here is -m RSA-PKCS, which specifies the exact form of RSA to use. There are several and if you use the wrong one the output will be random bytes without any errors or warnings.
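
To check that the round trip worked you can compare the original and the decrypted file; cmp prints nothing if they are identical:

$ cmp original decrypted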

What can this be used for?

Passwordless full disk encryption is one potential use. Combining a card reader with a Raspberry Pi and an electrical lock makes for a pretty spiffy DIY physical access control system.

Sunday, January 8, 2017

Measuring execution performance of C++ exceptions vs error codes

In an earlier blog post we found out that C++ exceptions produce smaller executables than corresponding code using error codes. The most common comment on it was that executable size is not that important; performance is what matters. Since we have the tooling, let's do performance measurements as well.

Test setup

We generate code that calls a few other functions that call other functions and so on until we reach the bottommost function. In that function we either raise an exception or error or return success. This gives us two parameters to vary: the depth of the call hierarchy and what percentage of calls produce an error.

The code itself is simple. The C++ version looks like this:

int func0() {
    int num = random() % 5;
    if(num == 0) {
        return func1();
    }
    if(num == 1) {
        return func2();
    }
    if(num == 2) {
        return func3();
    }
    if(num == 3) {
        return func4();
    }
    return func5();
}

We generate a random number and based on that call a function deeper in the hierarchy. This means that every time we call the test it goes from the top function to the final function along a random path. We use the C random number generator and seed it with a known value so both C and C++ follow the same call chain, even though they are different binaries. Note that this function calls, on average, a function whose number is its own number plus three (the average of the five possible jumps). So if we produce 100 functions, the call stack is on average 100/3 = 33 entries deep when the error is generated.

The plain C version that uses error objects is almost identical:

int func0(struct Error **error) {
    int num = random() % 5;
    int res;
    if(num == 0) {
        res = func1(error);
    } else if(num == 1) {
        res = func2(error);
    } else if(num == 2) {
        res = func3(error);
    } else if(num == 3) {
        res = func4(error);
    } else {
        res = func5(error);
    }
    /* This is a no-op in this specific code but is here to simulate
     * real code that would check the error and based on that
     * do something. We must have a branch on the error condition,
     * because that is what real world code would have as well.
     */
    if(*error) {
        return -1;
    }

    return res;
}
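
For completeness, the bottommost function is the one that actually fails. In the exception version it looks roughly like the following sketch (the function name and the hard-coded 2% error rate are illustrative; the real generated code is in the Github repo linked below):

#include <cstdlib>
#include <stdexcept>

int func100() {
    if(random() % 100 < 2) { /* fail on roughly 2% of the calls */
        throw std::runtime_error("error");
    }
    return 0;
}

In the error code version the same function fills in the error object and returns instead of throwing.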

The code for generating this code and running the measurements can be found in this Github repo.

Results

We do not care so much about how long the executables take to run, only which one of them runs faster. Thus we generate output results that look like this (Ubuntu 16.10, GCC 6.2, Intel i7-3770):

CCeCCCCCCC
CCCCCCccec
eEcCCCeECE
cceEEeECCe
EEEcEEEeCE
EEEEEeccEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

In this diagram C means that plain C error codes are faster and E that exceptions are faster for that particular pair of parameters. For easier reading all cases where error codes win have been given a red background color. Each row represents a different number of functions. The top row has 50 functions (call depth of 16) and each row adds 50 so the bottommost row has 500 functions.

The columns represent different error rates. The leftmost column has 0% errors, the next one has 1% and so on. A capital letter means that the runtime difference was bigger than 10%. Each executable was executed five times and the fastest run was taken in each case. Full results look like the following:

GCC 6.2      Clang 3.8.1  Clang 4.0 trunk

CCeCCCCCCC   EecCcCcCcC   ECcECcEeCC
CCCCCCccec   CceEEeEcEC   EEeEECEcCC
eEcCCCeECE   EEEEEECEeE   EEEECECeCC
cceEEeECCe   EcEeEECEEE   EEEEceeEeC
EEEcEEEeCE   EEEeEEEEEe   EEEEEEEEEE
EEEEEeccEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE


Immediately we see that once the stack depth grows above a certain size (here 200/3 = 66), exceptions are always faster. This is not very interesting, because call stacks are usually not this deep (enterprise Java notwithstanding). For lower depths there is a lot of noise, especially for GCC. However we can see that for small error rates exceptions are sometimes faster, especially with Clang. Because of the noise we redid the measurements several times. This changed the individual measurements but produced the same overall shape, where small error rates favor exceptions but error codes become faster as the error rate grows.

On a Raspberry Pi the results look like this:

GCC 4.9.2

eeccCCCCCC
EeeeccccCC
Eeeeeecccc
Eeeeeeeccc
EEeeeeeeec
EEEEEeeeee
EEEEEEEEee
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

Here the results are a lot clearer and more consistent, probably due to the simpler architecture of the ARM processor. Exceptions are faster except for small call depths and large error rates. When using 50 functions, error objects only become noticeably faster at a 4% error rate.

Finally let's do the same measurements on an i5-6600K.

GCC 4.8.4    GCC 5.4.0   Clang 3.{4, 6, 8}

ECCCCCCCCCC  eeccCCCCCC  EeeccCCCCC
EecCCCCCCCC  EEEeeeeccc  EEEEEeeeee
EEecCCCCCCC  EEEEEEeeee  EEEEEEEEEE
EEEeecCCCCC  EEEEEEEEEe  EEEEEEEEEE
EEEEeecccCC  EEEEEEEEEE  EEEEEEEEEE
EEEEEeeeccc  EEEEEEEEEE  EEEEEEEEEE
EEEEEEeeeec  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEeee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEEe  EEEEEEEEEE  EEEEEEEEEE

Here we see that the compiler used can make a massive difference. GCC 4.8.4 favors error codes but 5.4.0 is the opposite as are all versions of Clang (which produce identical results). On this platform exceptions seem to be even more performant than on the Pi.

The top left corner seems to be the most interesting part of this entire diagram, as it represents shallow call chains with few errors. This is similar to most real world code used in, for example, XML or JSON parsing. Let's do some more measurements focusing on this area.

i5               Raspberry Pi

GCC    Clang     GCC    Clang
5.4.0  3.8       4.9.2  3.5.0

cCCCC  eCCCC     eCCCC  eCCCC
ecCCC  EeCCC     ecCCC  ecCCC
ecCCC  EeccC     eccCC  eccCC
eccCC  EEecc     eecCC  eecCC
eeccC  EEEee     eeccC  eeccC


In these measurements the parameters go from 10 functions on the top row to 50 on the bottom row in increments of 10, and the columns go from 0% error on the left to 4% on the right in increments of 1. Here we see that exceptions keep their performance advantage when there are no errors, especially when using Clang, but error codes get better as soon as the error rate goes up even a little.

Conclusions

Whenever a discussion on C++ exceptions occurs, there is That Guy who comes in and says "C++ exceptions are slow, don't use them". At this point the discussion usually breaks down into fighting about whether exceptions are fast or not. These discussions are ultimately pointless, because asking whether exceptions are "fast" is the wrong question.

The proper question to ask is "under which circumstances are exceptions faster". As we have seen here, the answer is surprisingly complex. It depends on many factors, including the platform, compiler, code size and error rate. Blanket statements about the performance of X as compared to Y are usually oversimplified, misleading and just plain wrong. Such is the case here as well.

(Incidentally if someone can explain why i7 has those weird performance fluctuations, please post in the comments.)

Thanks to Juhani Simola for running the i5 measurements.

Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same, slightly erroneous, points so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classical example PKZIP supports parallel decompression but its archives are a lot bigger than packed tar files.

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format consists of a bunch of metadata and an embedded tar file. RPM is the same, except that it has a cpio archive rather than tar. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tarlike format, only one CPU can be used for processing them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building Libreoffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to be. It was merely an experiment to see if a better file format was possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.