Skip to main content

CPU and I/O performance diagnostics in Node.js

Understanding the performance profile of your Node.js applications can be a difficult process, particularly while dealing with an ever-changing landscape in an evolving ecosystem.
In this post, I am going to explore looking at Node.js application performance using tools that are compatible with other programming languages, to provide a cross-language profiling toolset, that can be used on Linux systems with a recent Kernel.
We will be using a bunch of tools that are part of the Linux Kernel but don’t let that scare you, they are pretty simple once you understand the basics.
The first tool of the bunch is found in the “linux-tools” package and is suitably named perf, it can also be referred to as perf_events because poor SEO of perfand that the supporting documentation from Brendan Gregg and Vince Weaveruses perf_events. In our caseperf_events samples the Linux Kernel to provide a statistically relevant set of data of the runtime of your application without capturing all data points, that would either be too much or even more importantly slow down your application/machine.

How to collect performance samples

With perf_events available to use, we will now start to capture some data whilst our application is under load.
I am going to be using next.js to build a sample server-side rendered react application, which we will be using to show what these performance profiling tools will uncover.
git clone git@github.com:tomgco/hello-next-performance.git
cd hello-next-performance
npm install
npm run build 
NODE_OPTIONS="--perf-basic-prof --perf-prof-unwinding-info" npm run start# Ready on http://localhost:3000
(Note this might not work for you yet, as I have a PR in progress on the Node.js project to allow those parameters. PR #25565)
Once this application is up and running, you should be able to access it on http://localhost:3000. To produce some load on this application we will use an application named artillery which is a fantastic tool writing user-journey based load tests.
artillery quick -r 200 -d 0 http://localhost:3000
So, we now have our application running, and a sustained amount of load (150 arrivals per second), so the next step on our performance journey is to use perf_events to get an understanding of what is happening to it.
sudo perf record -F 99 -g -p $(pgrep -f next-start) -- sleep 60
Let's deconstruct the perf command:
  • sudo perf record — Record the perf events, ran as sudo as we need to interface with the Linux Kernel.
  • -F 99 — Sample the application at this frequency, 99 instead of 100 to minimise lockstep sampling
  • -g — Enables call-graph sampling, we will use this to make it easier to parse and make pretty graphs!
  • -p $(pgrep -f next-start) — Find our node.js applications process id.
  • -- sleep 60 Run our record for 60 seconds
After our command has ran for 60 seconds, we should see some output in our terminal telling us that the perfcommand has captured data.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.249 MB perf.data (742 samples) ]
perf_events should have now created a perf.data file, and the node.js application should’ve created some logs files in the format of isolate-0x*-v8.log as well as some files /tmp/perf-*.map . To get a human readable version of the perf.data file, we can now run another command which will output to a file of our choice.
sudo perf script -f --header > stacks.test.$(date --iso-8601)
Now we have our raw data, lets use some tools to inspect what we have!

FlameScope

The first tool which we are looking at is something developed by Netflix which “uses subsecond offset heat maps and flame graphs to analyze periodicactivity, variance, and perturbations.” I found out about this tool on Brendan Gregg’s blog post titled “FlameScope Pattern Recognition”.
Flamescope can be installed by following the instructions found on the Github repository: https://github.com/Netflix/flamescopeonce this is running, it will load any files found within the examples directory within the flamescope folder. We can view what we have been running by copying our stacks.test.xxxx files to the examples directory and investigate it at http://localhost:5000
Our example applications showing an interesting load pattern
What we can see in the image to our left, is that although the arrivals per second are constant, we still see some variability in load every 5 seconds.
What could cause this?
To further investigate, we can drill down into a flamegraph, which should show us what functions are taking up the most time. This is achieved by selecting a range within both the light coloured sections and the darker colour sections.
What becomes apparent is that in the darker sections we see that our applications is spending much longer in the “sha512_block_data_order_avx2” and the “node::crypto::PBKDF2Job::DoThreadPoolWork” sections. These functions are used by crypto.pbkdf2 to generate hashed passwords. https://github.com/tomgco/hello-next-performance/blob/master/pages/index.js#L20 Looking in deeper we can see that this application actually changes the strength of the password hash every 5 seconds (why, who knows! :D), explaining the pattern we see in the subsecond heatmap.
Finally, we can start to pin down some of our applications bottlenecks due to CPU utilisation. However, we all know that Node.js’ real strong point is the asynchronous I/O that it performs. This leads to the main performance impact on Node.js applications, functions that block the event loop.

Off-CPU analysis

Again I am going to refer back to another Brendan Gregg blog post “Linux eBPF Off-CPU Flame Graph”. Off-CPU can be achieved using the same tooling as earlier, but due to having to monitor the scheduler (which scales with load) it can very quickly become a huge overhead (think around a million events a second).
eBPF overcomes this limitation, for more details why, I recommend you check out the linked post.
So follow me as we dive into some off-CPU analysis.
For this example, we are going to take a different application https://github.com/tomgco/hello-next-performance/tree/io-bottle-neck, but follow mostly the same procedure.
  1. Build and run the application with the correct node options — NODE_OPTIONS="--perf-prof --perf-prof-unwinding-info --perf-basic-prof" npm run start
  2. Simultaneously run a load test — artillery quick -r 5 -d 0 http://127.0.0.1:3000/
The next step is now to install and use the bcctools from iovisor, once installed you should have an application named offcputime in /usr/share/bcc/tools (Or wherever your distribution installs them). This tool will allow us to capture the frames to dive deeper into the performance issues of the application. We are also going to be using a tool to generate a flamegraph from the data, which is another amazing tool from, you guessed it, Brendan Gregg named flamegraph.pl.
Whilst running artillery I also took note of the latencies we had from our application, which was:
    min: 24.7
    max: 72.4
    median: 38.6
    p95: 63.4
    p99: 72.4
Shall we see if we can reduce these values?
sudo python2 /usr/share/bcc/tools/offcputime -dfp$(pgrep -f next-start) 30 > out.offcpu
cat out.offcpu | grep -v __pthread_cond_wait > out.offcpu2
flamegraph.pl --color=io --title="Off-CPU flame" -countname=us < out.offcpu2 > out.svg
To make things easier for this blog post I have removed some lines from the outputted file, to make the SVG easier to read.
As you see in fig.1 the large towers contain the areas which contain the blocking I/O and the large fat lower towers are where a lot of time is spent.
A lot of the main time is caused by Garbage Collection and Memory Allocations, which is a good indication of the error at hand.
However, because this happens outside of the code routine of our application, it is very difficult to find out what is causing this.
So if we ignore this and drill into the large slim tower, hopefully, we will start to see lines of code in our application being referenced.
Here we can see the generated code by next.js and see a reference to our pages/index.js on line 144. We can also see that next is transpiling our application to a .next/ folder, by inspecting that we can see it is referencing our large array which we are populating for every request! Success!
With our new knowledge, I am going to optimise this code by caching the whole page, if the details haven’t changed! I followed the simple steps from this post https://medium.com/@az/i18n-next-js-app-with-server-side-rendering-and-user-language-aware-caching-part-1-ae1fce25a693 to include our own express application, with the implementation of an LRU cache.
After running artillery again, I begin to see a reduced response time for the whole application:
    min: 1.7
    max: 10
    median: 6.5
    p95: 8.3
    p99: 10
These figures are so much smaller than what we had before!
It now simplifies our off-CPU flamegraph, almost totally removing our GarbageCollection block, eliminating our code block which was blocking our application. Check out the source here! https://github.com/tomgco/hello-next-performance/tree/cached-server

As you have seen, using tools like the above in a load testing/production environment allows you to begin to understand the performance profile of your applications as well as gain a quick visual understanding of what your application is actually doing!
And the best thing with this tooling is that you don’t need to use it exclusively with Node.js, you can apply your knowledge here to other languages! Compiled or Jitted!

Comments

Popular posts from this blog

How to use Ngx-Charts in Angular ?

Charts helps us to visualize large amount of data in an easy to understand and interactive way. This helps businesses to grow more by taking important decisions from the data. For example, e-commerce can have charts or reports for product sales, with various categories like product type, year, etc. In angular, we have various charting libraries to create charts.  Ngx-charts  is one of them. Check out the list of  best angular chart libraries .  In this article, we will see data visualization with ngx-charts and how to use ngx-charts in angular application ? We will see, How to install ngx-charts in angular ? Create a vertical bar chart Create a pie chart, advanced pie chart and pie chart grid Introduction ngx-charts  is an open-source and declarative charting framework for angular2+. It is maintained by  Swimlane . It is using Angular to render and animate the SVG elements with all of its binding and speed goodness and uses d3 for the excellent math functio...

Understand Angular’s forRoot and forChild

  forRoot   /   forChild   is a pattern for singleton services that most of us know from routing. Routing is actually the main use case for it and as it is not commonly used outside of it, I wouldn’t be surprised if most Angular developers haven’t given it a second thought. However, as the official Angular documentation puts it: “Understanding how  forRoot()  works to make sure a service is a singleton will inform your development at a deeper level.” So let’s go. Providers & Injectors Angular comes with a dependency injection (DI) mechanism. When a component depends on a service, you don’t manually create an instance of the service. You  inject  the service and the dependency injection system takes care of providing an instance. import { Component, OnInit } from '@angular/core'; import { TestService } from 'src/app/services/test.service'; @Component({ selector: 'app-test', templateUrl: './test.component.html', styleUrls: ['./test.compon...

How to solve Puppeteer TimeoutError: Navigation timeout of 30000 ms exceeded

During the automation of multiple tasks on my job and personal projects, i decided to move on  Puppeteer  instead of the old school PhantomJS. One of the most usual problems with pages that contain a lot of content, because of the ads, images etc. is the load time, an exception is thrown (specifically the TimeoutError) after a page takes more than 30000ms (30 seconds) to load totally. To solve this problem, you will have 2 options, either to increase this timeout in the configuration or remove it at all. Personally, i prefer to remove the limit as i know that the pages that i work with will end up loading someday. In this article, i'll explain you briefly 2 ways to bypass this limitation. A. Globally on the tab The option that i prefer, as i browse multiple pages in the same tab, is to remove the timeout limit on the tab that i use to browse. For example, to remove the limit you should add: await page . setDefaultNavigationTimeout ( 0 ) ;  COPY SNIPPET The setDefaultNav...