TL;DR
- KIE Smart Router resilience against network issues needs to be tested in an automated way
- Testcontainers and Toxiproxy are perfect tools for easing these complex tests
- Build a temporary KIE Smart Router image from sources by means of a Dockerfile
- Parameterize tests and do not use sleep-before-check (polling is faster and more reliable)
- Adopt this new motto: “whatever you do not test against, will happen”
MOTIVATION: NETWORK WILL FAIL SOONER THAN LATER
Modern systems are mainly distributed and connectivity is a key part of their design and operation. However, as quality-enthusiasts, we realize that it is fairly complex to test connection issues in a deterministic and automated manner, because:
- it involves multiple components and their interactions.
- networking is built over abstractions.
- connection failures are unpredictable by nature.
In this article, the third of a series about complex automation testing with KIE Server (see I and II), we will focus on the KIE Smart Router to see how we can assure its reliability and robustness against one of the most important resilience killers: network issues.
KIE Smart Router is a component that acts as a gateway:
- hides the topology of the different KIE Servers from their clients, forwarding requests and aggregating responses into a single one before returning it.
- is really useful in dynamic deployment environments where the client is agnostic about KIE Servers distribution.
- handles multiple connections and has implemented stability patterns (like circuit-breaker) to cope with connection error scenarios.
Let’s assume it: network issues are inevitable. Sooner or later they will happen, and waiting for a critical outage in production to find out how the system performs is, without a shadow of a doubt, a recipe for pain. This is the main motivation of the present work: anticipate the disaster by writing automated tests that simulate common kinds of network failures to prove the KIE Smart Router's resilience.
CIRCUIT BREAKER PATTERN FOR NETWORK ISSUES
KIE Smart Router implements the well-known Circuit Breaker Pattern when there is a connection loss. The goal of this mechanism is to fail fast in order to prevent further consequences. Let’s see briefly how it works because later we will test it thoroughly.
The routing table is the brain of the Smart Router. Consequently, it contains the relationships among
- containers (identified by an alias as well as group-artifact-version) and
- server-ids (logical names in the network) with
- server locations needed for redirection.
There are two ways to populate this table:
- By manual operation
- Automatically, each KIE Server at startup will self-register into the Smart Router by passing its own id and location.
When a connection issue happens, the system immediately updates this routing table by removing the location of the failing KIE Server. Open circuit! The Smart Router won’t forward any request to that point of failure, which guards the system against cascading errors and slow responses. Apart from this, new requests are going to be balanced to another working server for that container (if provisioned).
Next, the KIE Smart Router spawns a different thread of execution for periodically pinging the failing KIE Server until it reaches a maximum number of configured retries. When one of them succeeds (server connection is back again), the location is annotated again in the routing table, ready for more routine operation. Closed-circuit!
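The open/close cycle described above can be sketched in plain Java. This is only an illustrative model under simplifying assumptions, not the Smart Router's actual implementation: the class `RoutingTableSketch` and all of its method names are hypothetical, and the retry loop runs synchronously here, whereas the real router pings from a separate thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified model of the routing table; not the Smart Router code.
class RoutingTableSketch {
    // container id -> locations of the KIE Servers currently able to serve it
    private final Map<String, List<String>> locationsByContainer = new LinkedHashMap<>();

    // Adds a location for a container (manual operation or self-registration).
    void register(String containerId, String location) {
        List<String> locations =
                locationsByContainer.computeIfAbsent(containerId, k -> new ArrayList<>());
        if (!locations.contains(location)) {
            locations.add(location);
        }
    }

    // Open circuit: stop routing requests to the failing location.
    void openCircuit(String containerId, String failingLocation) {
        List<String> locations = locationsByContainer.get(containerId);
        if (locations != null) {
            locations.remove(failingLocation);
        }
    }

    // Pings the failing location up to maxRetries times.
    // On the first successful ping the location is written back (closed circuit).
    boolean tryToCloseCircuit(String containerId, String location,
                              int maxRetries, Predicate<String> ping) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (ping.test(location)) {
                register(containerId, location);
                return true;
            }
        }
        return false;
    }

    // Read-only snapshot of the locations still routable for a container.
    List<String> locationsFor(String containerId) {
        return List.copyOf(locationsByContainer.getOrDefault(containerId, List.of()));
    }
}
```

A test against such a model would open the circuit for one location, check that only the remaining locations are returned, and then check that the location reappears once a ping succeeds within the allowed retries.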
NETWORK ISSUES TESTING SETUP
Once we have introduced our testing scenario, let’s see the setup for putting in place the resilience test cases.
As in the other examples in this series, we will take advantage of containerized applications and of the Testcontainers library and its utilities.
The following figure depicts the initial configuration. Firstly, we create a network containing several KIE servers (one of them can act as a controller) connected to the KIE Smart Router with a secure/non-secure connection. Each box represents a Linux container that exposes a port to the network:
- Secured KIE Servers: port 8443
- Non-secured KIE Servers: port 8080
- KIE Smart Router: port 9000
Moreover, each KIE Server deploys a different business application (kjar) in their respective KIE containers. Be aware of the different meaning of the word “container” here: the self-contained environments to run business applications. KIE containers are identified by a
All of these elements comprise the System-Under-Test (SUT). In front of it, there’s our test suite acting as a client application.
DO TESTING BY PROXY
Now, we want to provoke network issues in this setup in a controlled and deterministic manner from the client. The way we’ve chosen to do it is by means of “Toxiproxy” containers. Toxiproxy is an open-source tool for simulating abnormal network conditions (called toxics).
These toxics cause connection failures that emulate real network issues such as connection loss, poor bandwidth, timeouts, connection reset by peer, high latency and jitter, and data sliced into multiple smaller packets, among others.
As you can see in the figure, the Toxiproxy container is a proxy that intercepts all the traffic between the KIE Smart Router and the KIE server (upstream and downstream). It exposes port 8666 to the network.
It can simulate java.net exceptions like these (which fail immediately, so great for not making the tests too long):
| Exception | Toxics |
|---|---|
| java.net.SocketException: Unexpected end of file from server | timeout, limitdata |
| java.net.SocketException: Connection reset | resetPeer |
| javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake | timeout, limitdata |
| javax.net.ssl.SSLException: Connection reset | resetPeer |
TOXIPROXIES IN THE MIDDLE OF THE WIRE
We can initialize Toxiproxy in the code as shown below. Along with the out-of-the-box Shopify image, we will provide a shared network, network alias, and the log consumer to print out its logs:
@Container
public static ToxiproxyContainer toxiproxy =
    new ToxiproxyContainer(DockerImageName.parse("ghcr.io/shopify/toxiproxy:2.4.0")
            .asCompatibleSubstituteFor("shopify/toxiproxy"))
        .withNetwork(network)
        .withNetworkAliases(TOXIPROXY_NETWORK_ALIAS)
        .withLogConsumer(new Slf4jLogConsumer(logger).withPrefix("TOXIPROXY-1"));
Toxiproxy will proxy the target container by invoking:
proxy1 = toxiproxy.getProxy(kieServer1, KIE_HTTPS_PORT);
proxy3 = toxiproxy3.getProxy(kieServer3, KIE_PORT);
At this point, you might be wondering "ok, that’s the way Toxiproxy reaches the KIE server, but how should I configure the Smart Router to get to the Toxiproxy?"
Indeed, that’s a very good question. In this case, with self-registering, it’s the KIE Server that sends its location (KIE_SERVER_LOCATION) to be populated into the routing table.
When creating the KIE Server, we pass this Environment variable:
KIE_SERVER_LOCATION_node1/2/3 are defined as system properties (in the pom.xml, or they could be overridden at launch time):
org.kie.samples.server.location.node1 = https://toxiproxy:8666
org.kie.samples.server.location.node2 = https://kie-server-node2:8443
org.kie.samples.server.location.node3 = http://toxiproxy3:8666
To sum up, those KIE servers behind the Toxiproxy will pass their proxy network addresses.
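As a rough illustration of how such a property could be resolved on the test side, here is a minimal plain-Java sketch. Only the property names and URLs come from the example above; the class `ServerLocationSketch`, the helper `locationFor` and the fallback address used in `main` are assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical helper: resolves the location a KIE Server node should advertise.
class ServerLocationSketch {
    static final String PROPERTY_PREFIX = "org.kie.samples.server.location.";

    // Reads the system property for the node, falling back to a caller-supplied default.
    static URI locationFor(String node, String defaultLocation) {
        String value = System.getProperty(PROPERTY_PREFIX + node, defaultLocation);
        return URI.create(value);
    }

    public static void main(String[] args) {
        // node1 sits behind Toxiproxy, so it advertises the proxy address, not its own.
        System.setProperty(PROPERTY_PREFIX + "node1", "https://toxiproxy:8666");
        // The fallback address below is an invented placeholder.
        URI location = locationFor("node1", "https://kie-server-node1:8443");
        System.out.println(location.getHost() + ":" + location.getPort());
    }
}
```

The resolved value would then be handed to the KIE Server container as its KIE_SERVER_LOCATION, for instance through the `withEnv` method that Testcontainers offers on its containers.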
CONTAINERS ALL AROUND
So, these are the containers in place and their origin:
After that, we will create temporary images from them just for testing (a.k.a. images-on-the-fly) including business applications and the rest of the needed configuration.
The same goes for the KIE Smart Router image but, in this case, we do have to create it from scratch (there are no community binaries for it). Do not panic: KIE is an open-source initiative, and we can generate all that we need by means of a Dockerfile.
GENERATING THE KIE SMART ROUTER IMAGE
A Dockerfile is like the instruction manual for building an image. The build is layered, so repeated steps are skipped if nothing forces them to be executed again.
Starting from a JDK base (in this case, from JBoss, which already contains the jboss user), it will download the git and maven tools. Next, it will proceed with the cloning of the repository (its branch and URL are configurable; by default they point to the droolsjbpm-integration repo) and the compilation of the sources and their packaging.
Then, it will include some properties files (for configuring the KIE Smart Router, logging, and the certificate for TLS communication) and will execute this command to import the certificate into a trust Keystore (as it is a self-signed certificate, created ad-hoc for testing purposes):
keytool -importcert -noprompt -trustcacerts -alias toxiproxy-full-ks -file $ROUTER_HOME/kieks.crt -keystore /etc/pki/java/cacerts -storepass changeit
An aside about certificates and TLS communication:
In order to generate, on your localhost, a self-signed certificate valid for multiple hostnames with keytool, you must include the DNS names (network aliases) as Subject Alternative Names (SAN) if you don’t want to get a “no name matching” exception:
keytool -genkeypair -alias toxiproxy-full-ks -keyalg RSA -keysize 2048 -validity 365 -keystore serverks.pkcs12 -storetype PKCS12 -dname "cn=Kie Server,o=jbpm,c=ES" -keypass secret -storepass secret -ext san=dns:full-node1,dns:toxiproxy,dns:localhost
Secondly, we will export it into a .crt file. For example, naming it as kie.crt:
keytool -export -alias toxiproxy-full-ks -file kie.crt -keystore serverks.pkcs12
Enter keystore password:
Certificate stored in file <kie.crt>
KIE Smart Router image will use this one.
In the KIE Server, for enabling secure connections, we must execute this jboss-cli command as part of the initialization:
security enable-ssl-http-server --key-store-path=$JBOSS_HOME/standalone/configuration/serverks.pkcs12 --key-store-password=secret
From Dockerfile to image-on-the-fly
The entrypoint of the container will be the standard “java -jar …” command, including $ROUTER_PROPS to enable the file configuration and its watcher.
withEnv("ROUTER_PROPS", "-Djava.util.logging.config.file=$ROUTER_HOME/logging.properties -Dorg.kie.server.router.config.watcher.enabled=true -Dorg.kie.server.router.config.file=$ROUTER_HOME/smart_router.properties");
From this Dockerfile, the Testcontainers utility “ImageFromDockerfile” will build the image of the KIE Smart Router. The container definition also includes the network configuration and the LogConsumer (whose purpose is avoiding sleep calls), and it will wait for the expected message to consider the component “up and running”:
withNetwork(network);
withNetworkAliases(SMARTROUTER_ALIAS);
withExposedPorts(SMARTROUTER_PORT);
withLogConsumer(new Slf4jLogConsumer(logger).withPrefix("SMART-ROUTER"));
waitingFor(Wait.forLogMessage(".*KieServerRouter started on.*", 1).withStartupTimeout(Duration.ofMinutes(2L)));
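To make the idea behind waiting on a log message concrete, here is a plain-Java sketch of the check “a line matching this regular expression has appeared at least N times”. It is not how Testcontainers implements its wait strategy; `LogWaitSketch` and `seenEnough` are hypothetical names, and any log lines fed to it in a test are invented samples.

```java
import java.util.List;

// Hypothetical illustration of a "wait for log message" readiness check.
class LogWaitSketch {
    // Returns true once at least `times` of the collected lines fully match the regex.
    static boolean seenEnough(List<String> logLines, String regex, int times) {
        long matches = logLines.stream()
                .filter(line -> line.matches(regex))
                .count();
        return matches >= times;
    }
}
```

Because the check is driven by what the component actually logs, startup is considered complete as soon as the message shows up, rather than after a fixed pause that is either too short or needlessly long.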
NETWORK ISSUES TEST INSIGHT
You can find the code and configuration for this example here. Let’s give some hints on how to parameterize and structure tests so that they scale easily.
The test cases aim to validate whether the component fulfills the circuit breaker pattern, making it stable and usable under hard network conditions. The routing table (kie-server-router.json file) has to be consistently updated to open and close the circuit, and a polling mechanism is launched to check when the connections are recovered.
COMPLEX AS SYSTEM TESTS, STRAIGHTFORWARD AS UNIT TESTS
JUnit 5 parameterized tests with @MethodSource will allow us to define the different toxics to apply in each proxy for exercising the same tests in different contexts.
This static method “provideToxics” returns a Stream of Arguments that will be passed to each test. We can combine the toxics as we want in our testing matrix without interfering with the implementation of the tests.
Notice that we can even set up the properties of these toxics based on random values within a range (such as the waiting time before a timeout):
ToxicSupplier<Toxic, IOException> timeout3 = () -> proxy3.toxics().timeout("timeout", DOWNSTREAM, getRandomTimeout(2000,5000));
When a toxic is applied to a proxy (ToxicSupplier is a functional interface that defines a get method, like Supplier), the abnormal behavior of the network begins over that path.
On the other hand, when invoking the removeAllToxics method, toxics are completely wiped out. As a result, the network comes back to a healthy condition.
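The deferred nature of these suppliers can be sketched without any Toxiproxy dependency. In the snippet below, `ThrowingSupplier` is a hypothetical stand-in for the ToxicSupplier mentioned above, and this `getRandomTimeout` is an assumed implementation (inclusive bounds, milliseconds), not the one from the example project.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-in for ToxicSupplier: a Supplier whose get() may throw.
@FunctionalInterface
interface ThrowingSupplier<T, E extends Exception> {
    T get() throws E;
}

class ToxicParamsSketch {
    // Random value in [min, max], e.g. milliseconds to wait before a timeout.
    static long getRandomTimeout(long min, long max) {
        return ThreadLocalRandom.current().nextLong(min, max + 1);
    }

    public static void main(String[] args) throws IOException {
        // Defining the supplier has no effect on the network yet...
        ThrowingSupplier<String, IOException> timeout =
                () -> "timeout toxic with " + getRandomTimeout(2000, 5000) + " ms";
        // ...the "toxic" only takes effect when the test calls get().
        System.out.println(timeout.get());
    }
}
```

Passing suppliers rather than already-created toxics is what lets a parameterized test decide exactly when, within its arrange-act-assert steps, the failure starts.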
Controlling when the impediments start and stop within the arrange-act-assert steps is what gives these tests their power. The test suite is completely self-contained, managing the resources easily and in a predictable way, like in a unit test. Here, the unit is our SUT comprising several components and connections.
Finally, tests don’t rely on sleep functions to wait for the expected behavior of the SUT but actively poll the routing table or the logs to check whether the desired state has been reached before a timeout. This approach is not only faster but also less error-prone in CI environments.
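A minimal polling helper along these lines could look as follows. It is a sketch under assumptions, not the code used in the example project: `PollingSketch` and `pollUntil` are hypothetical names. The short pause between checks only spaces out the polls; the method returns as soon as the condition holds, unlike a fixed sleep-before-check.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper: check repeatedly instead of sleeping a fixed time.
class PollingSketch {
    // Re-checks the condition every `interval` until it holds or `timeout` elapses.
    static boolean pollUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;            // desired state reached, stop waiting right away
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;           // timed out without reaching the desired state
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a resilience test, the condition would typically inspect the routing table or the captured logs, for example checking that the failing location has disappeared after a toxic is applied and that it is back after the toxics are removed.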
CONCLUSION: AUTOMATE NETWORK ISSUES TESTING
The Beyoncé Rule, followed by Googlers and other major players in the industry, states that “if you liked it, then you shoulda put a test on it”. A test here means an automated test. For some critical aspects, like how the system handles network failures, these tests may have some complexity, but with new containerized tools (like Testcontainers and Toxiproxy) the effort is really worth it.
Network issues will never be completely prevented, and modern architectures (KIE Smart Router is a good example) follow resilience and stability patterns to minimize their effects. But you will only be confident that the system exhibits the desired behavior when you write an automated test for it and that test becomes part of the CI to catch regressions.
Investing in automated testing for a great variety of network issues is, without a shadow of a doubt, a recipe for success.