Geeks With Blogs
Josh Reuben


Financial Apps feel the need for speed – this can come via parallelization, and via infrastructure - fast messaging and non-blocking distributed memory management. This blogpost gives an overview + examples of various technologies that can squeeze performance out of your trading apps and clock cycles out of your modeling apps.

Low Latency via Infrastructure


· ZeroMQ is a messaging library - ‘messaging middleware’ , ‘TCP on steroids’ , ‘new layer on the networking stack’. not a complete messaging system , but is a simple messaging library to be used programmatically. Gives the flexibility and performance of low level socket interface plus ease of implementation of high level. It is designed for simplicity.

· Performance - ZeroMQ is orders of magnitude faster than most AMQP messaging systems as it doesn’t have the overhead. It leverages efficient transports such as reliable Multicast and makes use of intelligent message batching, minimizing not only protocol overhead but also system calls. You can choose the message encoding format such as BSON or ProtoBuff.

o ZeroMQ sockets can connect to multiple end points and automatically load balance messages over them. It is brokerless and thus has no single point of failure.

· ZeroMQ provides 4 different transports:

  • o INPROC an In-Process communication
  • o IPC an Inter-Process communication
  • o MULTICAST multicast via PGM, possibly encapsulated in UDP
  • o TCP a network based transport

· ZeroMQ provides message routing devices that can bind to different ports and forward messages from them according to pre-caned logic. ZeroMQ provides three kinds of devices:

  • o QUEUE, a forwarder for the request/response messaging pattern
  • o FORWARDER, a forwarder for the publish/subscribe messaging pattern
  • o STREAMER, a forwarder for the pipelined messaging pattern

· 0MQ supports the following message patterns:

· Clrzmq provides a C# Binding for the ZeroMQ API

o Get it from Source - or from Nuget package -

o server

// ZMQ Context, server socket

using (ZmqContext context = ZmqContext.Create())

using (ZmqSocket server = context.CreateSocket(SocketType.REP))



    while (true)


        // Wait for next request from client

        var message = server.Receive(Encoding.Unicode);

        // Do Some 'work'


        // Send reply back to client

        server.Send("blah", Encoding.Unicode);





using (ZmqContext context = ZmqContext.Create())

using (ZmqSocket client = context.CreateSocket(SocketType.REQ))



    var request = "blah";

    for (int requestNum = 0; requestNum < 10; requestNum++)


        client.Send(request, Encoding.Unicode);

        var reply = client.Receive(Encoding.Unicode);




Redis ( is a high performance NoSQL solution that provides network accessible shared memory – it provides a key-value / data structure store with support for lists & sets, as well as a non-blocking event bus. The performance is principally down to Redis keeping the entire dataset in memory, and only periodically syncing to disk. Redis supports 5 data structures: strings, lists, hashes, sets, & sorted sets.

BookSleeve ( is a .NET client (available as a Nuget package) for Redis provides pipelined, asynchronous, multiplexed and thread-safe access to redis:

using (var conn = new RedisConnection("localhost"))


    conn.Set(0, "foo", "bar");

var value = conn.GetString(0, "foo");

string s = await value;

LMAX Disruptor

Disruptor ( is ring buffer architecture for efficiently sending messages between threads without relying on shared queues (which CPU memory-access architecture causes contention). It leverages and improves upon some features of SEDA (which was serial) and the Actor model. It can handle 6 million TPS on a single thread. A Business Logic Processor runs in-memory using event sourcing and is surrounded by Disruptors - concurrency components that implements a network of queues that operate without needing locks. It allow consumers to wait on the results of other consumers without an intermediate queue.

LMAX Disruptor functions as a structured, ordered memory barrier set – multiple producer write barriers and consumer read barriers. There is no concept of entry deletion, just append.

readers can read concurrently and independently, and can optionally have dependencies. LMAX Disruptor uses a large pre-allocated ring of entries. Entry objects are pre-allocated adjacently and never get garbage collected.

Pre-allocating entries means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important to utilize CPU caches. And lots of efforts to avoid lock, CAS, even memory barrier (e.g. use a non-volatile sequence variable if there's only one writer). Different annotating readers write to different fields / cache lines, to avoid write contention.

· The .NET consumer interface IBatchHandler<T> specifies an OnAvailable method and an OnEndOfbatch method (reminds me of RX) - The consumer runs on a separate thread receiving entries as they become available.

public interface IBatchHandler<T>

    void OnAvailable(T sequence, T value);

    void OnEndOfBatch();



Setup the RingBuffer and barriers.

ringBuffer = new RingBuffer<ValueEntry>(

    ()=>new MyDataFactory(),

    1024, // Size of the RingBuffer



handler = new MyHandler(count) as IBatchHandler<long>;

producerBarrier = ringBuffer.CreateProducerBarrier();

//run the consumer in a new Thread



Publish messages to the disruptor

long sequence = _producerBarrier.NextEntry(out data);

// … do something to the data …

// append data



tear down the RingBuffer and stop consumer) threads:


Low Latency via Parallelization

The following parallelization technologies were investigated to determine latency capabilities given an indicator and a data set:

  • · C++ AMP
  • · C++ PPL Concrt
  • · TPL

The indicator under investigation was: standard deviation with a sliding window – a common measure of variance in stochastic calculus , used in Ito’s Lemma from Quant Finance

The data was made up of 50 million iterations - 21 days of x ticks (seconds).

TPL (.NET parallel CPU)

The Task Parallel Library (TPL) is a.NET Framework 4 API that simplifies parallelism and concurrency in applications. The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish.  The TPL is the preferred way to write multithreaded and parallel code in .NET.

ConCrt ( C++ parallel CPU)

The Concurrency Runtime programming framework for C++ abstracts the details of high performance parallelism. uses a cooperative task scheduler that implements a work-stealing algorithm to efficiently distribute work among computing resources. The Concurrency Runtime also provides synchronization primitives that use cooperative blocking to synchronize access to resources. By blocking cooperatively, the runtime can use the remaining quantum to perform another task as the first task waits for the resource. This mechanism promotes maximum usage of computing resources.

C++ AMP ( C++ parallel GPGPU)

C++ Accelerated Massive Parallelism (C++ AMP) accelerates execution of C++ code by executing it on data-parallel hardware of the graphics processing unit (GPU) found on a DirectX 11 graphics card. The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library and allows you to control how data is moved from the CPU to the GPU and back, so that you can improve performance. General-purpose computing on graphics processing units (GPGPU) allows you to perform computation in applications traditionally handled by the CPU, but across many more cores.

You can see my previous post about C++ AMP here:

Here is the AMP moving average code – it performed best (I wont show the TPL code because it was trivial and I wont show the ConCrt code – as I actually leveraged this API in preloading the array in the AMP code sample):

#include "stdafx.h"

#include <vector>

#include <random>

#include <iostream>

#include <amp.h>

#include <concrt.h>

#include <amp_math.h>


#include "timer.h"

#define SAMPLES 500000000 // for tiles to avoid invalid_compute_domain: 16777216

#define WindowSize 21 // for tiles to avoid invalid_compute_domain: 256

using namespace std;

using namespace concurrency;

using namespace concurrency::fast_math;

int _tmain(int argc, _TCHAR* argv[])


    auto &samples = *new std::array<float, SAMPLES>();

    auto &results = *new std::array<float, SAMPLES + WindowSize>();

// tr1 uniform distrib

    std::tr1::uniform_real<float> unif(1.2000F, 1.5555F);

mt19937 eng;

// concrt preload


    0, SAMPLES, 1,

    [&] (int i) {samples[i]=unif(eng);});


array_view<const float,1> av(SAMPLES, samples);

array_view<float,1> rv(SAMPLES + WindowSize, results);

Timer tAll;


parallel_for_each(rv.extent, [=] (index<1> idx) restrict (amp)


    float r = 0;

    for (int n = 0; n < WindowSize; ++n)


        r+= av[idx + n];


    auto avg = r / WindowSize;

    float sum = 0;

    for (int n = 0; n < WindowSize; n++)


        auto dif = av[idx + n] - avg;

        sum += dif * dif; // powf(dif,2);


    rv[idx + WindowSize] = rsqrtf(r) / WindowSize;


cout << results[1000] << endl;


std::cout << tAll.Elapsed() << " ms" << endl;

Benchmark Results

tested on DELL laptop i7 64 bit 8 Giga RAM with SSD drive.

C# TPL  (Standard deviation):
  • Data Size 50M:  Window size 21: 650 millisec
  • Data Size 200M: Window size 21: 2.56 sec
C++ ConCrt
  • Data Size 50M: Window size 21: 486 millisec
  • Data Size 200M:  Window size 21: 1.8 sec
  • Data Size 50M: Window size 21: 285 millisec
  • Data Size 200M:  Window size 21: 1.15 sec

(win 64 bit fork, not including serialization / deserialization time):

o 5M objects:

  • § Write: 40 sec
  • § Read: 9.1 sec
  • § Redis memory  1.7 G

o 10M objects:

  • § Write: 209 sec
  • § Read: 234  sec
  • § Redis memory  2.7 G


As can be seen from the benchmarks, C++ AMP outperformed C++ Concrt, and C++ Concrt outperformed C# TPL. However, It was revealed that C# TPL provided adequate performance for the given indicator and data set. Using C# TPL also avoids the development productivity overhead associated with C++. By architecting correctly, the processing engine C++ / C# can be swapped out to suite your needs.

Posted on Monday, March 26, 2012 9:05 PM Parallelism , Quant Programming , C++ | Back to top

Comments on this post: Low-Latency High-Performant Financial App Infrastructures

# re: Low-Latency High-Performant Financial App Infrastructures
Requesting Gravatar...
Maybe Streaminsight could be in this list?
BTW Great article. Bookmarked it. TNX!
Left by Leonid Ganelin on Apr 02, 2012 12:58 AM

# re: Low-Latency High-Performant Financial App Infrastructures
Requesting Gravatar...
Interesting article indeed !

Which version of TPL did you test with?

Would be interesting to see the results with the latest released version. V4.5.5rc or later?

Would be nice to have the source available on codeplex/github ;)
Left by James Pereira on Sep 10, 2012 8:26 PM

# re: Low-Latency High-Performant Financial App Infrastructures
Requesting Gravatar...
Great information! I might be able to use these ideas some time soon. - Morgan Exteriors
Left by Winston Monroe on Dec 29, 2016 1:44 PM

Your comment:
 (will show your gravatar)

Copyright © JoshReuben | Powered by: