
Create a Remote Jobs Automated Scraping Bot with Node.js, Express and Puppeteer

What is Web Scraping?

Web scraping is a technique for extracting data from websites into spreadsheets or databases on a server, either for data analytics or for building bots for different purposes.

What Are We Going to Create?

We will create a remote jobs scraping bot that runs automatically, scrapes the site every day, and serves the data through an Express server, so we can open the website and see the newly scraped job offers.

The site we are going to scrape is remoteok.io.

Note: Make sure to get permission before scraping a website.

Install Dependencies

We will use Puppeteer, a headless browser API that gives us a Chromium browser we can control in the background, much like a regular browser.

To automate the scraping, we have to run the script every day (or on whatever interval you need). This can be done with cron, the time-based job scheduler utility on Linux, which is also available for Node.js through the cron package.

Finally, we will display the scraped jobs through an Express server. I will be using express-generator to scaffold an Express project with the Pug template engine.
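Assuming npm and a recent Node.js version, the setup could look like this (the project name remote-jobs-bot is just an example):

npx express-generator --view=pug remote-jobs-bot
cd remote-jobs-bot
npm install
npm install puppeteer cron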

Inspecting the Target Site (Remoteok.io)

The first step before scraping any website is to inspect the site's content so you know how to build your scripts, because scraping depends on knowing the structure of the website: how the DOM is laid out and which HTML elements and attributes you will need to access.

For the inspection process, you can use Chrome's DevTools or Firefox's Developer Tools.

In this tutorial, we scrape and get today's jobs only, so on remoteok.io we need to inspect the today's jobs section to see how everything is put together.

Make sure to identify the wrapping container of the elements you want to scrape, so you can access the nested children and scrape them easily.
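For example, you can sanity-check the selectors in the DevTools console before writing any Puppeteer code; these match the structure we rely on later in this post:

//Run in the DevTools console on remoteok.io
//Count the job rows inside the first table body
document.querySelectorAll("tbody tr").length;
//Read the first job title via its itemprop attribute
document.querySelector("[itemprop=title]").innerText;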

NOTE: Scraping scripts can easily go out of date because the target website's content is constantly changing, so it is a bit hard to keep them updated.

Create the Scraping Script

Make sure to take a look at the Puppeteer Docs to understand how it works.

Let's first launch a browser with Puppeteer and navigate to the remoteok.io page.

We will save all the jobs in an array (you could also create a database and store the jobs there).

const puppeteer = require("puppeteer");

let jobs = [];

module.exports.run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://remoteok.io");

  await browser.close();
};

As you can see, we are using async/await to handle asynchronous promises like synchronous code, because all methods in Puppeteer are promise-based.

We are also exporting the main function from the module as run so we can use it outside and call it from our server.

Now, we need to look for the today's jobs body and grab all the jobs (title, company, and technologies).

Today's jobs are wrapped in a tbody (table) container, and each job sits under a tr row.

For the title and the company name, the elements carry the [itemprop=title] and [itemprop=hiringOrganization] attributes respectively, so we can easily access them through attribute selectors.

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Today's jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract each job's data
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
    }
  });
  //Make sure to wait for all row promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}

When using map with an async callback, it returns an array of promises, so we have to call Promise.all (wait for all promises to resolve) before we can continue.
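The snippet above relies on a getPropertyValue helper that isn't shown yet; a minimal sketch using Puppeteer's getProperty API could look like this:

//Read a DOM property (e.g. innerText) from a Puppeteer element handle
async function getPropertyValue(element, property) {
  const propertyHandle = await element.getProperty(property);
  return propertyHandle.jsonValue();
}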

Now we need to get all the technologies for a specific job. Each technology is rendered as a tag, a hyperlink element with the .tag class, whose visible name sits in a nested h3 element.

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Today's jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract each job's data
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
      //Technology elements (multiple tags for a single job)
      let technologies = [];
      const tags = await row.$$(".tag");
      technologies = await Promise.all(
        tags.map(async tag => {
          const tagContent = await tag.$("h3");
          return (
            await getPropertyValue(tagContent, "innerText")
          ).toLowerCase();
        })
      );
      //Remove all duplicates
      technologies = [...new Set(technologies)];
      //Add new Job
      addJob(titleValue, organizationName, ...technologies);
    }
  });
  //Make sure to wait for all row promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}

Also, make sure to add a helper function that pushes a new job onto the jobs array with its title, company, and technologies.

function addJob(title, company, ...technologies) {
  const job = { title, company, technologies };
  jobs.push(job);
}
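One more detail: the Express route we add later calls remoteJobsScraper.getJobs(), so the scraper module also needs to expose the jobs array; a one-line export is enough:

//Expose the collected jobs so the server can render them
module.exports.getJobs = () => jobs;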

Schedule the Script to Run Every Day

The cron package lets you easily schedule scripts to run on a specific time interval, using the same expression syntax as the Linux crontab utility.

const { CronJob } = require("cron");

const remoteJobsScraper = require("./remotejobs-scraper");

console.log("Scheduler Started");
//"0 0 * * *" = run once a day at midnight ("* * * * *" would run every minute)
const fetchRemoteJobsJob = new CronJob("0 0 * * *", async () => {
  console.log("Fetching new Remote Jobs...");
  await remoteJobsScraper.run();
  console.log("Jobs: ", remoteJobsScraper.getJobs());
});
//You need to explicitly start the cron job
fetchRemoteJobsJob.start();

The onTick callback (the second argument to CronJob) is the main script function that gets called every time the scheduled job runs.

A cron job must be explicitly started; this gives you a little more control over your jobs.
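That control works the other way too: the cron package exposes a stop method, so you could, for instance, stop the job on a manual interrupt (this SIGINT handler is just an illustrative sketch):

//Optional: stop the scheduled job cleanly when the process is interrupted
process.on("SIGINT", () => {
  fetchRemoteJobsJob.stop();
  process.exit(0);
});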

Run the Server and Display the Jobs

The server hosts the scheduler, and the scheduler runs the scraping bot on the specified interval.

So go into app.js and add a new GET route at /jobs.

const remoteJobsScraper = require("../remotejobs-scraper");

app.get("/jobs", (req, res, next) => {
  //Get all fetched jobs and pass them to the index template for rendering
  res.render("index", {
    jobs: remoteJobsScraper.getJobs()
  });
});

Also, make sure to import the scheduler module so the cron job starts once the server starts running.

The cron job will be automatically disposed of once the server is shut down.

//Start Scheduler
require("../scheduler");

For displaying the jobs we will use Pug template engine.

extends layout

block content
  h1 Here is the List of Today's Remote Jobs
  ul
    for job in jobs
      li
        span Title: #{job.title} 
        span Company: #{job.company} 
        span Technologies: 
        for tech in job.technologies
          span #{tech} 

Now, if you run the server on localhost:3000 and go to /jobs, you should see today's jobs scraped from remoteok.io.
