How to set up a Headless Chrome Node.js server in Docker

Headless browsers have become very popular with the rise of automated UI tests in the application development process. There are also countless use cases for website crawlers and HTML-based content analysis.
In most of these cases, you don’t actually need a browser GUI because the work is fully automated. Running a GUI is also more expensive than spinning up a Linux-based server or scaling a simple Docker container across a cluster managed by an orchestrator such as Kubernetes.
But I digress. Put simply, it has become increasingly critical to have a Docker container-based headless browser to maximize flexibility and scalability. In this tutorial, we’ll demonstrate how to create a Dockerfile to set up a Headless Chrome browser in Node.js.

Headless Chrome with Node.js

Node.js is the main language interface used by the Google Chrome development team, and it has an almost native library for communicating with Chrome called Puppeteer. This library talks to the browser using the DevTools Protocol over either a WebSocket or a pipe, and it can do all kinds of things: take screenshots, measure page load metrics, connection speeds, and downloaded content size, and test your UI under different device simulations. Most importantly, Puppeteer doesn’t require a running GUI; it can all be done in headless mode.
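const puppeteer = require('puppeteer');
const fs = require('fs');

// Report failures instead of leaving an unhandled promise rejection
Screenshot('https://google.com').catch((err) => {
    console.error(err);
    process.exit(1);
});

async function Screenshot(url) {
    const browser = await puppeteer.launch({
        headless: true,
        args: [
            '--no-sandbox', // disables Chrome's sandbox
            '--disable-gpu',
        ],
    });

    try {
        const page = await browser.newPage();
        await page.goto(url, {
            timeout: 0, // 0 disables the navigation timeout
            waitUntil: 'networkidle0', // wait until the network has been idle for 500 ms
        });
        const screenData = await page.screenshot({encoding: 'binary', type: 'jpeg', quality: 30});
        fs.writeFileSync('screenshot.jpg', screenData);

        await page.close();
    } finally {
        // Close the browser even if navigation or the screenshot fails
        await browser.close();
    }
}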
Shown above is simple, working code for taking a screenshot with Headless Chrome. Note that we are not specifying Google Chrome’s executable path because Puppeteer’s npm package downloads a compatible browser build when it is installed. Chrome’s dev team did a great job of keeping the library usage very simple and minimizing the required setup. This also makes our job of embedding this code inside the Docker container much easier.
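The other capabilities mentioned earlier, such as device simulation and page metrics, follow the same pattern as the screenshot. Here is a minimal sketch, assuming a page object obtained from browser.newPage(); the measurePage helper and the viewport numbers are made up for illustration and do not correspond to a real device profile.

```javascript
// Hypothetical helper: load a URL in a phone-sized viewport and report basic metrics.
// `page` is a Puppeteer Page, e.g. the result of `await browser.newPage()`.
async function measurePage(page, url) {
    // Illustrative viewport values, not a real device profile
    await page.setViewport({width: 375, height: 667, deviceScaleFactor: 2, isMobile: true});

    // Wall-clock time until the network goes idle, measured from Node.js
    const startedAt = Date.now();
    await page.goto(url, {waitUntil: 'networkidle0'});
    const loadTimeMs = Date.now() - startedAt;

    // page.metrics() returns runtime counters such as JSHeapUsedSize and Nodes
    const metrics = await page.metrics();

    return {
        url,
        loadTimeMs,
        jsHeapUsedBytes: metrics.JSHeapUsedSize,
        domNodes: metrics.Nodes,
    };
}
```

Called as measurePage(page, 'https://google.com') right after browser.newPage(), it resolves to a plain object that can be logged or stored. The load time here is only a rough figure because it includes the idle wait that networkidle0 adds.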

Google Chrome inside a Docker container

Running a browser inside a container seems simple based on the code above, but it’s important not to overlook security. By default, everything inside a container runs as the root user, and the browser executes whatever JavaScript the pages it loads contain.
Of course, Google Chrome is designed to be secure, and it doesn’t allow browser-based scripts to access local files, but there are still potential security risks. You can minimize many of these risks by creating a dedicated user for running the browser. Chrome also has a sandbox enabled by default, which restricts external scripts from accessing the local environment. Keep in mind that the --no-sandbox flag used in the code samples in this tutorial turns that protection off.
Below is the Dockerfile sample responsible for the Google Chrome setup. We will choose Alpine Linux as our base container because it has a minimal footprint as a Docker image.
FROM alpine:3.6

RUN apk update && apk add --no-cache nmap && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/community >> /etc/apk/repositories && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/main >> /etc/apk/repositories && \
    apk update && \
    apk add --no-cache \
      chromium \
      harfbuzz \
      "freetype>2.8" \
      ttf-freefont \
      nss

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

....
....
The RUN instruction adds the Alpine edge repositories, which provide Chromium for Linux, and installs the libraries Chromium needs on Alpine. The tricky part is making sure Puppeteer does not also download its own copy of Chrome, which would only waste space in our container image. That is why we set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true environment variable.
After running the Docker build, we get our Chromium executable: /usr/bin/chromium-browser. This is the path we should give Puppeteer as its Chrome executable path.
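Rather than hard-coding that path, the launch options can be derived from the environment so the same code runs both inside and outside the container. Below is a minimal sketch; the buildLaunchOptions helper is hypothetical, and reading the PUPPETEER_EXECUTABLE_PATH variable explicitly is a choice made for this example.

```javascript
// Hypothetical helper: build Puppeteer launch options from the environment.
// When no path is configured, Puppeteer falls back to its own downloaded browser.
function buildLaunchOptions(env = process.env) {
    const options = {
        headless: true,
        args: ['--no-sandbox', '--disable-gpu'],
    };
    // e.g. /usr/bin/chromium-browser inside the Alpine image
    if (env.PUPPETEER_EXECUTABLE_PATH) {
        options.executablePath = env.PUPPETEER_EXECUTABLE_PATH;
    }
    return options;
}
```

With this in place, the script would call puppeteer.launch(buildLaunchOptions()), and the container image would set the variable to /usr/bin/chromium-browser.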
Now let’s jump to our JavaScript code and complete the Dockerfile.

Combining the Node.js server and the Chromium container

Before we continue, let’s adapt our code to run as a microservice that takes screenshots of given websites. For that, we’ll use Express.js to spin up a basic HTTP server.
// server.js
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

// /?url=https://google.com
// The handler must be async because it awaits Screenshot()
app.get('/', async (req, res) => {
    const {url} = req.query;
    if (!url || url.length === 0) {
        return res.status(400).json({error: 'url query parameter is required'});
    }

    try {
        const imageData = await Screenshot(url);

        res.set('Content-Type', 'image/jpeg');
        res.set('Content-Length', imageData.length);
        res.send(imageData);
    } catch (err) {
        // Express 4 does not handle rejected promises from async handlers, so respond here
        console.error(err);
        res.status(500).json({error: 'failed to take screenshot'});
    }
});

app.listen(process.env.PORT || 3000);

async function Screenshot(url) {
    const browser = await puppeteer.launch({
        headless: true,
        executablePath: '/usr/bin/chromium-browser',
        args: [
            '--no-sandbox',
            '--disable-gpu',
        ],
    });

    try {
        const page = await browser.newPage();
        await page.goto(url, {
            timeout: 0, // 0 disables the navigation timeout
            waitUntil: 'networkidle0',
        });
        const screenData = await page.screenshot({encoding: 'binary', type: 'jpeg', quality: 30});

        await page.close();

        // Binary data of an image
        return screenData;
    } finally {
        // Close the browser even if navigation or the screenshot fails
        await browser.close();
    }
}
This is the final step to complete the Dockerfile. After running docker build -t headless:node . (the trailing dot is the build context), we’ll have an image with the Node.js service and a Headless Chrome browser for taking screenshots.
Taking screenshots is fun, but there are countless other use cases. Fortunately, the process described above applies to almost all of them. For the most part, only minor changes to the Node.js code would be required. The rest is pretty standard environmental setup.
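One change worth considering for a service like this: the handler passes whatever arrives in the url parameter straight to the browser, including schemes such as file: that point at the container itself. Below is a minimal sketch of stricter validation using Node’s built-in URL class; the parseTargetUrl helper is hypothetical and is not part of the service above.

```javascript
// Hypothetical helper: accept only absolute http(s) URLs for the `url` query parameter.
// Returns the normalized URL string, or null when the input should be rejected.
function parseTargetUrl(input) {
    // Express yields an array when a query parameter is repeated; reject anything but a string
    if (typeof input !== 'string' || input.length === 0) {
        return null;
    }
    let parsed;
    try {
        parsed = new URL(input); // throws on relative or malformed input
    } catch (err) {
        return null;
    }
    // Reject schemes such as file: that could expose the container's own files
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return null;
    }
    return parsed.href;
}
```

In the handler, a null result would map to the same error response that a missing parameter produces, and the returned string would be passed to Screenshot().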

Common problems with Headless Chrome

Google Chrome eats a lot of memory during execution, so it’s no surprise that Headless Chrome does the same on the server side. If you keep a browser open and reuse the same instance many times, its memory usage tends to grow until your service eventually crashes.
The best solution is to follow the principle of one connection, one browser instance. While this is more expensive than managing multiple pages per browser, sticking to just one page and one browser will make your system more stable. Of course, this depends on your particular use case, and you may be able to find a middle ground that fits your needs.
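One way to enforce this principle in code is a small wrapper that launches a browser for a single task and always closes it afterwards. The following is only a sketch: the withBrowser helper is hypothetical, and it takes the launch function as a parameter so it does not depend on a particular Puppeteer configuration.

```javascript
// Hypothetical helper: run one task in a freshly launched browser, then always close it.
// `launch` is any function returning a Promise of a browser, e.g. () => puppeteer.launch(options).
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        // One browser, one page: the task gets the only page this browser will ever open
        const page = await browser.newPage();
        return await task(page);
    } finally {
        // Closing the browser also closes its pages, whether the task succeeded or threw
        await browser.close();
    }
}
```

A request handler could then call withBrowser(() => puppeteer.launch(options), (page) => ...) and return the result, without having to remember the cleanup on every error path.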
Take, for example, the official website of the performance monitoring tool Hexometer. Its environment includes a remote browser service that keeps hundreds of idle browsers in pools. These are designed to pick up new connections over WebSocket when there is a need for execution, while strictly following the principle of one page, one browser. This makes it a stable and efficient way to keep idle browsers alive and ready for work.
Puppeteer’s connection over WebSocket is pretty stable, and you can do something similar by building a custom service like browserless.io (there is an open-source version as well).
...
...

// Use connect(), not launch(), to attach to a browser that is already running
const browser = await puppeteer.connect({
    browserWSEndpoint: `ws://repo.treescale.com:6799`,
});

...
...
This will connect to the headless Chrome DevTools socket using the same browser management protocol.
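When the browser lives in a remote service, the cleanup step changes: the client should normally detach from the browser rather than shut it down. Below is a minimal sketch, assuming the remote service accepts plain DevTools WebSocket connections; the screenshotRemote helper is hypothetical, and the Puppeteer module is passed in as a parameter.

```javascript
// Hypothetical helper: take a screenshot using an already running remote browser.
// `puppeteer` is the required module; `wsEndpoint` is the remote DevTools WebSocket URL.
async function screenshotRemote(puppeteer, wsEndpoint, url) {
    const browser = await puppeteer.connect({browserWSEndpoint: wsEndpoint});
    try {
        const page = await browser.newPage();
        try {
            await page.goto(url, {waitUntil: 'networkidle0'});
            return await page.screenshot({type: 'jpeg', quality: 30});
        } finally {
            await page.close();
        }
    } finally {
        // disconnect() drops this client's connection and leaves the remote browser running;
        // close() would shut the remote browser down
        await browser.disconnect();
    }
}
```

Whether the remote browser is recycled after the client disconnects is up to the service that manages the pool.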

Conclusion

Having a browser running inside a container provides a lot of flexibility and scalability. It’s also a lot cheaper than traditional VM-based instances. Now we can simply use a container service such as AWS Fargate or Google Cloud Run to trigger container execution only when we need it and scale to thousands of instances within seconds.
