Create a Remote Jobs Automated Scraping Bot with Node.js, Express and Puppeteer

What is Web Scraping?

Web scraping is a technique for extracting data from websites into spreadsheets or databases on a server, either for data analytics or for building bots for different purposes.

What Are We Going to Create?

We will create a remote jobs scraping bot that runs automatically, scrapes the site every day, and serves the data through an Express server, so we can open the site and see the newly scraped job offerings.

The site we are going to scrape is remoteok.io.

Note: Make sure to get permission before scraping a website.

Install Dependencies

We will use Puppeteer, a headless browser API that provides a Chromium browser we can control in the background, much like a regular browser.

To automate the scraping, we have to run the script every day (or on whatever schedule you need). We can do this with cron, the time-based job scheduler utility on Linux, which is also available in Node.js through the cron package.

Finally, we will display the scraped jobs through an Express server; I will be using express-generator to scaffold an Express project with the Pug template engine.

Inspecting the Target Site (Remoteok.io)

The first step before scraping any website is to inspect its content so you know how to build your scripts, because scraping depends on understanding how the site's DOM is structured and which HTML elements and attributes you will need to access.

For the inspection process, you can use Chrome DevTools or Firefox's Developer Tools.

In this tutorial we scrape today's jobs only, so on remoteok.io we need to inspect the today's jobs section to see how everything is put together.

Make sure you identify the wrapping container of the elements you want to scrape so you can access the nested children and scrape them easily.

NOTE: Scraping scripts can easily go out of date because the target website's content is constantly changing, so keeping them up to date takes some effort.

Create the Scraping Script

Make sure to take a look at the Puppeteer Docs to understand how it works.

Let's first launch a browser with Puppeteer and navigate to the remoteok.io page.

We will save all jobs in an array (you could also create a database and store all the jobs there).

const puppeteer = require("puppeteer");

let jobs = [];

module.exports.run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://remoteok.io");
  //Scrape today's jobs (defined in the next section)
  await loadLatestJobs(page);

  await browser.close();
};

As you can see, we are using async/await to handle promises as if they were synchronous code, because all Puppeteer methods are promise-based.

We also export the main function as run so we can call it from outside the module, for example from our server.

Now we need to look for the today's jobs table body and get all the jobs (title, company and technologies).

Today's jobs are wrapped in a tbody (table body) container, and each job sits in a tr row.

For the title and company name, the elements carry the attributes [itemprop=title] and [itemprop=hiringOrganization] respectively, so we can easily access them through these selectors.

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Todays jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract data of the job
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
    }
  });
  //Make sure to wait for all rows promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}

When we map with an async function, we get back an array of promises, so we have to use Promise.all to wait for all of them to resolve before we can continue.
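
The getPropertyValue helper used in the listing above is never defined in this article; a minimal sketch of it, based on Puppeteer's elementHandle.getProperty and jsonValue methods, could look like this:

async function getPropertyValue(element, property) {
  //Read a DOM property (e.g. innerText) from a Puppeteer element handle
  const propertyHandle = await element.getProperty(property);
  return propertyHandle.jsonValue();
}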

Now we need to get all the technologies for a specific job. Each technology is stored in a hyperlink (a) element, and all the tags are wrapped by a .tags container.

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Todays jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract data of the job
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
      //Technologies elements (multiple tags for a single job)
      let technologies = [];
      const tags = await row.$$(".tag");
      technologies = await Promise.all(
        tags.map(async tag => {
          const tagContent = await tag.$("h3");
          return (
            await getPropertyValue(tagContent, "innerText")
          ).toLowerCase();
        })
      );
      //Remove all duplicates
      technologies = [...new Set(technologies)];
      //Add new Job
      addJob(titleValue, organizationName, ...technologies);
    }
  });
  //Make sure to wait for all rows promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}

Also, add a helper function that pushes a new job onto the jobs array with its title, company and technologies.

function addJob(title, company, ...technologies) {
  if (jobs) {
    const job = { title, company, technologies };
    jobs.push(job);
  }
}
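
The Express route shown later calls remoteJobsScraper.getJobs(), which is not defined in the listings above; assuming the jobs array lives in this same module, a simple accessor export could look like this:

//Expose the scraped jobs to other modules (e.g. the Express server)
module.exports.getJobs = () => jobs;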

Schedule the Script to Run Every Day

Crontab allows you to easily schedule scripts to run on a specific time interval.

const { CronJob } = require("cron");

const remoteJobsScraper = require("./remotejobs-scraper");

console.log("Scheduler Started");
//"* * * * *" fires every minute (handy for testing); use "0 0 * * *" to run once a day
const fetchRemoteJobsJob = new CronJob("* * * * *", async () => {
  console.log("Fetching new Remote Jobs...");
  await remoteJobsScraper.run();
  console.log("Jobs: ", remoteJobsScraper.getJobs());
});
//You need to explicitly start the cron job
fetchRemoteJobsJob.start();

The onTick callback is the main script function, called every time the scheduled job runs.

The cron job must be explicitly started, which gives a little more control over when jobs begin running.
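
If you prefer, the cron package also lets you start the job straight from the constructor and pin it to a time zone; a sketch of a daily schedule (the midnight time and the time zone here are assumptions, adjust them to your needs) could look like this:

//"0 0 * * *" runs every day at midnight; the fourth argument (true) starts the job immediately
const dailyJob = new CronJob("0 0 * * *", async () => {
  await remoteJobsScraper.run();
}, null, true, "Europe/London");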

Run Server and Display Jobs

The server is what starts the scheduler, which in turn runs the scraping bot on the specified interval.

So go into app.js and add a new GET route on the server at /jobs.

const remoteJobsScraper = require("../remotejobs-scraper");

app.get("/jobs", (req, res, next) => {
  //Get all fetched jobs and pass them to the index template for rendering
  res.render("index", {
    jobs: remoteJobsScraper.getJobs()
  });
});

Also, make sure to import the scheduler module in order to start the cron job once the server starts running.

The cron job will be automatically disposed of once the server is shut down.

//Start Scheduler
require("../scheduler");

To display the jobs, we will use the Pug template engine.

extends layout

block content
  h1 Here is the List of your Today's Remote Jobs
  ul  
    for job in jobs
      span Title: #{job.title} 
      span Company: #{job.company} 
      span Technologies: 
      for tech in job.technologies
        span #{tech} 
      br 
      br

Now, if you run the server on localhost:3000 and go to /jobs, you should see today's jobs scraped from remoteok.io.
