Pierre-Henry Fröhring's Website

This article builds an Elixir service and describes the path from its specification to the executable deployed on a VM. The service may serve as a template for other services that handle requests beyond hello world.

The objective is to build and deploy a manageable hello world service using Elixir.

We say that a service is manageable if it satisfies the following properties:

understandable;
flexible;
efficient to debug;
reliable;
scalable;
performant;
secure;
portable.

The source code is available here: GitHub.
The service is manageable.

By understandable, we mean that the computations of the service can be explained as actor computations.
To simplify the presentation, we present the computations from a given point of view.
A point of view on an object is the computation graph obtained with respect to a given actor, for instance the client point of view or the programmer point of view.
Depending on the point of view, the object we describe may give rise to different actor computations that do not contradict one another.

From the client point of view, the service takes the form of a URL. When the URL receives a Hello message, it should reply within 1 second with a World message or an Error message.

A Hello message is an HTTP POST request with a body equal to {"type": "Hello"}.
A World message is a reply to that POST request with a body equal to {"type": "World"}.
An Error message is a reply to that POST request with a body equal to {"type": "Error", "msg": reason} where reason is a string.
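The three messages can be written down as plain Elixir maps, which is how a JSON decoder would typically present them to a handler. The sketch below is illustrative only: the module name `Protocol`, its function names, and the wording of the error reason are assumptions, not taken from the service's source code.

```elixir
defmodule Protocol do
  # Hypothetical module: the maps mirror the JSON bodies defined above.
  def hello, do: %{"type" => "Hello"}
  def world, do: %{"type" => "World"}
  def error(reason) when is_binary(reason), do: %{"type" => "Error", "msg" => reason}

  # A Hello message is answered with a World message.
  def reply(%{"type" => "Hello"}), do: world()

  # Anything else is answered with an Error message (the reason text is made up).
  def reply(other), do: error("unexpected message: " <> inspect(other))
end

IO.inspect(Protocol.reply(Protocol.hello()))
IO.inspect(Protocol.reply(%{"type" => "Wrong"}))
```

Keeping the protocol in pure functions like these makes it possible to check the client-facing contract without starting an HTTP server.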

From the programmer point of view, many computations may occur:

The happy path. Nothing crashes, everything is fine. When the client sends a message to the service URL, DNS receives a message. Then, the VM receives a message. Then, the BEAM process on the VM receives the message. Then, the handler in the BEAM process receives the message and computes a reply. Finally, the client receives the reply.

The following cases are built by asking, for each step of the happy path: what if such and such happens?

The handler crashes. Then, the supervisor receives a notification that the handler crashed for some reason. Then, the client receives an excuse note for the inconvenience while the logger receives a detailed crash report. Finally, the developers receive a crash report.
The BEAM crashes. Then, systemd receives a signal which triggers a restart of the BEAM process.
Clients cannot get enough of hello world! The alarm mechanism detects a surge of requests exceeding a threshold. Then, a notification is logged. Then, the developers are informed.
The VM crashes, and so on.
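The first case, a crashing handler, can be sketched with a `Task.Supervisor` from the Elixir standard library. This is a minimal illustration under stated assumptions, not the service's actual supervision tree: the names `CrashDemo`, `handle/1` and `safe_handle/3`, as well as the error texts, are hypothetical. The handler runs in its own process, so a crash is reported to the caller as an exit instead of taking the caller down, and the client still gets an Error message within the time limit.

```elixir
defmodule CrashDemo do
  # Hypothetical handler: only Hello is understood, anything else raises.
  def handle(%{"type" => "Hello"}), do: %{"type" => "World"}

  # Runs the handler in a separate, unlinked process and turns a crash or a
  # timeout into an Error message instead of crashing the caller.
  def safe_handle(supervisor, msg, timeout \\ 1_000) do
    task = Task.Supervisor.async_nolink(supervisor, fn -> handle(msg) end)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, reply} -> reply
      {:exit, _reason} -> %{"type" => "Error", "msg" => "internal error"}
      nil -> %{"type" => "Error", "msg" => "timeout"}
    end
  end
end

{:ok, sup} = Task.Supervisor.start_link()

IO.inspect(CrashDemo.safe_handle(sup, %{"type" => "Hello"}))
# The crash of the task is also reported through Logger.
IO.inspect(CrashDemo.safe_handle(sup, %{"type" => "Wrong"}))
```

The caller only sees a short Error message, while the detailed report of the failed task goes to the logger, which matches the split between the client's excuse note and the developers' crash report.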

Using an actor computation graph, the following summary may be built:

For a sysadmin, the service is essentially a directory that gets transformed into a deployed web service. The computation graph may be:

Because execution is logged using the standard Elixir infrastructure, and crashes of production code are logged as well, the programmer can compare crash reports against their understanding of the service. This makes debugging more efficient than if either of these elements were missing. For instance, here is a crash report:

    16:28:10.431 [error] #PID<0.388.0> running HelloWorld.Router (connection #PID<0.387.0>, stream id 1) terminated
    Server: localhost:8081 (http)
    Request: POST /
    ** (exit) an exception was raised:
        ** (CaseClauseError) no case clause matching: %{"type" => "Wrong"}
            (hello_world 0.1.0) lib/hello_world/router.ex:15: anonymous fn/2 in HelloWorld.Router.do_match/4
            (hello_world 0.1.0) deps/plug/lib/plug/router.ex:246: anonymous fn/4 in HelloWorld.Router.dispatch/2
            (telemetry 1.2.1) …/hello_world/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
            (hello_world 0.1.0) deps/plug/lib/plug/router.ex:242: HelloWorld.Router.dispatch/2
            (hello_world 0.1.0) lib/hello_world/router.ex:1: HelloWorld.Router.plug_builder_call/2
            (hello_world 0.1.0) deps/plug/lib/plug/error_handler.ex:80: HelloWorld.Router.call/2
            (plug_cowboy 2.7.1) lib/plug/cowboy/handler.ex:11: Plug.Cowboy.Handler.init/2
            (cowboy 2.12.0) …/hello_world/deps/cowboy/src/cowboy_handler.erl:37: :cowboy_handler.execute/2
            (cowboy 2.12.0) …/hello_world/deps/cowboy/src/cowboy_stream_h.erl:306: :cowboy_stream_h.execute/3
            (cowboy 2.12.0) …/hello_world/deps/cowboy/src/cowboy_stream_h.erl:295: :cowboy_stream_h.request_process/3
            (stdlib 5.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

Considering the computation graph of the service, we say that it is reliable because:

If handling of a request crashes, then: the service keeps running.
If the service cannot keep up with the number of requests, then: an alarm is sent to operators to request more resources.
If the service process crashes, then: systemd restarts the service process.

While crashing the service is not impossible, these properties make it hard.
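The second property, an alarm when the number of requests exceeds a threshold, can be sketched as a small counting process. This is only an illustration of the idea: the module `RequestAlarm`, its functions, and the log text are hypothetical and are not the alarm mechanism used by the service. The counter is reset explicitly here, where a real implementation would likely reset it on a timer.

```elixir
defmodule RequestAlarm do
  # Hypothetical alarm: counts requests in the current window and reports
  # :alarm once the count exceeds the configured threshold.
  use GenServer
  require Logger

  def start_link(threshold), do: GenServer.start_link(__MODULE__, threshold)

  # Records one request; returns :ok or :alarm.
  def hit(pid), do: GenServer.call(pid, :hit)

  # Starts a new counting window.
  def reset(pid), do: GenServer.call(pid, :reset)

  @impl true
  def init(threshold), do: {:ok, %{count: 0, threshold: threshold}}

  @impl true
  def handle_call(:hit, _from, state) do
    count = state.count + 1
    status = if count > state.threshold, do: :alarm, else: :ok
    if status == :alarm, do: Logger.warning("request surge: #{count} requests in window")
    {:reply, status, %{state | count: count}}
  end

  def handle_call(:reset, _from, state), do: {:reply, :ok, %{state | count: 0}}
end

{:ok, alarm} = RequestAlarm.start_link(2)
IO.inspect(Enum.map(1..3, fn _ -> RequestAlarm.hit(alarm) end))
```

Because the alarm is an ordinary process, it can be placed under the same supervisor as the rest of the service, so the reliability argument above also applies to the alarm itself.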

We consider the service performant if it can handle 10,000 messages in under 1 second on a mid-range PC. To test this hypothesis, the following code may be run in a Livebook:

    # Parameters
    port = 8082
    url = "http://localhost:#{port}/"
    number_of_requests = 10000
    duration = 1 # second

    # Libraries
    Mix.install([{:req, "~> 0.5.6"}])

    # Experiment
    defmodule Hello, do: defstruct [type: "hello"]
    defmodule World, do: defstruct [type: "world"]
    defmodule Error, do: defstruct [msg: "No error message."]

    # Supervisor under which one task per request is started.
    Supervisor.start_link([{Task.Supervisor, name: TaskSupervisor}], strategy: :one_for_one)

    defmodule Experiment do
      # Sends all requests concurrently and checks that every reply is a World.
      def result(url, number_of_requests, duration) do
        hellos = List.duplicate(%Hello{}, number_of_requests)
        tasks = Enum.map(hellos, fn(hello) -> ask(url, hello) end)
        # Task.await_many/2 exits the caller if the replies do not all arrive in time.
        replies = Task.await_many(tasks, duration * 1000)
        worlds = Enum.filter(replies, fn(rep) -> match?(%World{}, rep) end)
        length(hellos) == length(worlds)
      end

      # Posts one message as JSON in its own task; any failure becomes an Error.
      def ask(url, msg) do
        task = fn ->
          map = Map.from_struct(msg)

          with {:ok, reply} <- Req.post(url, json: map),
               %{"type" => "world"} <- reply.body do
            %World{}
          else
            _ -> %Error{}
          end
        end

        Task.Supervisor.async(TaskSupervisor, task)
      end
    end

    # Result
    conclusion =
      case Experiment.result(url, number_of_requests, duration) do
        true -> "The server can handle #{number_of_requests} requests per #{duration} second."
        false -> "The server cannot handle #{number_of_requests} requests per #{duration} second."
      end

    IO.puts(conclusion)

Assuming the code is correct and performant, the service can still fail under the sheer number of requests. In this case, adding more VMs becomes necessary to horizontally scale the service. The alarm mechanism logs a notification so that a system administrator may add more VMs.

Additional actors should be added so that new VMs can join the system seamlessly using standard Erlang mechanisms, e.g. epmd.
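As a rough sketch of what joining looks like at the Elixir level, `Node.connect/1` asks the local BEAM node to connect to another named node, with epmd resolving the node name to a port. The helper module `Cluster` and the node name below are hypothetical. The sketch also assumes that both nodes were started with a name and share the same cookie, which is not shown here.

```elixir
defmodule Cluster do
  # Hypothetical helper: tries to join another BEAM node by name and
  # translates the result of Node.connect/1 into a descriptive atom.
  def join(peer) when is_atom(peer) do
    case Node.connect(peer) do
      true -> :connected
      false -> :unreachable
      # Returned when the local node was not started as a distributed node.
      :ignored -> :not_distributed
    end
  end
end

# Hypothetical node name; a real deployment would use its own names.
IO.inspect(Cluster.join(:"hello_world@vm2.example"))
```

Run from a plain script or a Livebook cell without distribution enabled, the call is expected to report that the local node is not distributed rather than fail.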

We use the term flexible in the same way as described by Gerald Jay Sussman in the book: Software Design for Flexibility. An explanation is available on YouTube: Three Directions in Design. This property is illustrated to some degree by how the computational graph has been built by adding more and more nodes and edges to it.

Adding new protocols. Flexibility is also attained by adding protocols to existing actors. Each actor may run arbitrary computations, which means that it can learn new protocols. Pushed to its limit, the idea is that an actor may be taught a protocol on the fly, provided the protocol is explained to it appropriately. In the meantime, developers may add protocols to actors and then live-reload them, which gives a similar effect.
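One way to picture an actor that learns a protocol on the fly is a process that keeps a table from message types to handler functions and accepts new entries at runtime. The sketch below is an assumption-laden illustration, not code from the service: the module `Actor`, its functions, and the Ping/Pong messages are all hypothetical.

```elixir
defmodule Actor do
  # Hypothetical actor: its state is a map from message type to handler function.
  def start_link(protocols), do: Agent.start_link(fn -> protocols end)

  # Teaches the running actor how to answer a new message type.
  def teach(actor, type, fun) when is_function(fun, 1) do
    Agent.update(actor, &Map.put(&1, type, fun))
  end

  # Answers a message if its type is known, otherwise replies with an Error.
  def ask(actor, %{"type" => type} = msg) do
    case Agent.get(actor, &Map.fetch(&1, type)) do
      {:ok, fun} -> fun.(msg)
      :error -> %{"type" => "Error", "msg" => "unknown message type"}
    end
  end
end

{:ok, actor} = Actor.start_link(%{"Hello" => fn _msg -> %{"type" => "World"} end})

IO.inspect(Actor.ask(actor, %{"type" => "Ping"}))
Actor.teach(actor, "Ping", fn _msg -> %{"type" => "Pong"} end)
IO.inspect(Actor.ask(actor, %{"type" => "Ping"}))
```

The existing Hello handler is untouched when Ping is added, which is the sense of flexibility described above: new behaviour is added without modifying what already works.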

Adding new actors. For instance, we started with the happy path and then, for each new hypothesis (e.g. the handler crashes), we added communications and actors to deal with it (e.g. if the handler crashes, then: a supervisor is informed and restarts it). Everything else was preserved, and a new property was added to the system.

Securing the service has different meanings depending on the perspective adopted. From the client perspective, it means that sending a Hello message to the address of the service results in the reception of a World message in a timely manner or an error and nothing else. Given that the legitimate owner controls the server, this property may be implemented using certificates — i.e. HTTPS for the client.

Adopting other perspectives, more properties should be added. For instance, it may be appropriate to add more properties to the service in order to avoid a Supply Chain Attack.

Implementing a tight systemd service specification should improve security by constraining how the process and the underlying OS interact — e.g. by constraining where the process can read/write in the filesystem.

Portability is achieved by using the release mechanism of Elixir: Once a release is assembled, it can be packaged and deployed to a target, as long as the target runs on the same operating system (OS) distribution and version as the machine running the mix release command.
