Profiling glogg's search
Isolate what to profile
Profiling tools tend to spit out loads of data, so to avoid being overwhelmed, it is best to isolate a particular section of code of interest.
It is possible to just run the GUI application and manually trigger the action to analyse, but in this case we can use the tests added in optimising glogg.
Running the tests using the --gtest_filter option, we can choose to run only a particular test; here it will be allMatchingSearch.
Using callgrind to profile the search
So let’s use callgrind to produce a very detailed account of what happens when we run a search that matches every single line:
valgrind --tool=callgrind ./glogg_ptests --gtest_filter=*.allMatchingSearch
On my machine, this particular test took several hours, but I could of course have greatly reduced the number of lines to search (down from 5M) without affecting the results, since callgrind counts every single call rather than relying on statistical sampling.
Once the test has finished running, callgrind offers various ways to display the extensive data collected, but we will use an amazing GUI tool that lets us explore it interactively: kcachegrind.
This KDE tool (there is also a non-KDE Qt version: qcachegrind) is one of the most comprehensive and user-friendly performance browsers I have used. Let's load the results file callgrind wrote (callgrind.out.<pid>).
The left column is called the flat profile: it is simply a list of all functions sorted by the time spent in them (including their sub-calls for the “Incl.” column). If many different code paths are taken in your sample, it can be a bit difficult to read, with lots of unrelated processing competing for attention near the top. But since our test did pretty much one thing (searching), it is quite clear:
- About three quarters of the time is spent in our “search” thread, which sounds about right.
- The doSearch function and its children absorb most of the time, which is also expected.
- There are a bunch of QList operations and loads of costly calls to malloc. This is indeed interesting.
We can then look at the right panel to get a better view of how each function interacts with the others. kcachegrind offers a variety of pretty views, but they mostly present the same information in different forms. Here we are using the graphical “Callee Map” and the “Call Graph”.
It shows that most of doSearch's time is spent not actually searching but calling addAll one thousand times to store the results. We can already smell some good potential improvements here.
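The shape of that hot path can be sketched as follows (doSearch and addAll are the names from the profile; everything else, including the chunking scheme, is an illustrative guess rather than glogg's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Stand-in for the results container: the real code used a QList<qint64>,
// which heap-allocates a small object per element much like std::list does.
using ResultList = std::list<std::int64_t>;

// Append one chunk of matching line numbers to the results.
void addAll(ResultList& results, const std::vector<std::int64_t>& chunk) {
    for (std::int64_t line : chunk)
        results.push_back(line);   // each append can hit malloc
}

// Sketch of the search loop: scan the file in chunks, store every match
// (in the allMatchingSearch test, every single line matches).
ResultList doSearch(std::int64_t nbLines, std::size_t chunkSize) {
    ResultList results;
    std::vector<std::int64_t> chunk;
    for (std::int64_t line = 0; line < nbLines; ++line) {
        chunk.push_back(line);
        if (chunk.size() == chunkSize) {
            addAll(results, chunk);   // called nbLines / chunkSize times
            chunk.clear();
        }
    }
    if (!chunk.empty())
        addAll(results, chunk);
    return results;
}
```

With 5M lines, a chunk size of 5,000 would explain the thousand addAll calls seen in the profile (the actual chunk size here is a guess).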
Here we are using another great feature of callgrind/kcachegrind: code annotation. We can see our code, with each line annotated with the number of times it is run and, most importantly, the time spent in it. We can see addAll is indeed doing += operations on the QList, which results in LOTS of calls to malloc. We have our culprit!
When not to use QList
On closer inspection, the behaviour we have seen is quite obvious. QList is implemented as an array of pointers to objects, with some clever tricks to allow fast (-ish) insertion at the beginning and end. Marc Mutz has a nice write-up on QList and the other Qt containers if you are interested in the gory details.
This implementation results in a lot of dynamic allocations when the stored objects are small (here, a single 64-bit integer: the line number).
Of course now I feel pretty embarrassed to have chosen a QList when I first implemented this code, when a standard C++ vector (or Qt's QVector) was the obvious answer. Vectors store their objects contiguously in memory and, because they allocate memory in advance (doubling their capacity when they run out of spare space), are very efficient at handling lots of additions at their end (amortised O(1) per append, with only O(log n) reallocations for n insertions, if you are familiar with the “big O notation”).
My explanation is that I didn't really think about efficiency in my first implementation (which is generally not a bad thing; more on “when (not) to optimise” later). And perhaps because Qt's documentation clearly advises to use QList “for most purposes” (see QList). Oh well…
In this case the fix was just to change the QList into a C++ std::vector, a very simple change indeed.
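The change boils down to swapping the container type; a hypothetical before/after (the class and member names here are illustrative, not glogg's actual ones):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical results holder; only the container type needs to change.
class SearchResults {
public:
    void addMatch(std::int64_t line) { matchingLines_.push_back(line); }
    std::size_t count() const { return matchingLines_.size(); }

    // Before: QList<qint64> matchingLines_;  // heap allocation per line
    // After: a contiguous vector with amortised O(1) append.
    std::vector<std::int64_t> matchingLines_;
};
```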
Re-running our benchmark brings a smile to our face: allMatchingSearch is down from 712s to 117s, a speedup of about 6×. Not bad for a 15-minute job!
And a useful lesson learnt: when performance matters, don't trust your instinct, or even worse, the instinct of a library vendor that doesn't know your application: measure!