
Node.js to Read Really, Really Large Datasets & Files

This blog post has an interesting inspiration point. Last week, someone in one of my Slack channels posted a coding challenge he’d received for a developer position with an insurance technology company.
It piqued my interest, as the challenge involved reading through very large files of data from the Federal Election Commission and displaying back specific data from those files. Since I’ve not worked much with raw data, and I’m always up for a new challenge, I decided to tackle this with Node.js and see if I could complete the challenge myself, for the fun of it.
Here are the four questions asked, and a link to the data set that the program was to parse through.
  • Write a program that will print out the total number of lines in the file.
  • Notice that the 8th column contains a person’s name. Write a program that loads in this data and creates an array with all name strings. Print out the 432nd and 43243rd names.
  • Notice that the 5th column contains a form of date. Count how many donations occurred in each month and print out the results.
  • Notice that the 8th column contains a person’s name. Create an array with each first name. Identify the most common first name in the data and how many times it occurs.
When you unzip the folder, you should see one main .txt file that’s 2.55GB and a folder containing smaller pieces of that main file (which is what I used while testing my solutions before moving to the main file).
Not too terrible, right? Seems doable. So let’s talk about how I approached this.
Processing large files is nothing new to JavaScript. In fact, the core functionality of Node.js includes a number of standard solutions for reading from and writing to files.
The most straightforward is fs.readFile(), wherein the whole file is read into memory and then acted upon once Node has read it. The second option is fs.createReadStream(), which streams the data in (and out), similar to other languages like Python and Java.
Since my solution needed to involve such things as counting the total number of lines and parsing through each line to get donation names and dates, I chose to use the second method: fs.createReadStream(). Then, I could use the rl.on('line', ...) function from the readline module to get the necessary data from each line of the file as I streamed through the document.
It seemed easier to me than having to split apart the whole file once it was read in and run through the lines that way.
Below is the code I came up with using Node.js’s fs.createReadStream() function. I’ll break it down below.
The very first thing I had to do to set this up was import the required modules from Node.js: fs (file system), readline, and stream. These imports allowed me to then create an instream and outstream and then the readline.createInterface(), which would let me read through the stream line by line and print out data from it.
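As a rough sketch of that setup (not the original code from this post), here is one way to wire fs.createReadStream() into readline.createInterface() and count lines. The function names and the tiny in-memory demo stream are my own assumptions, added so the snippet runs without the dataset.

```javascript
const fs = require('fs');
const readline = require('readline');
const { Readable } = require('stream');

// Count the lines of any readable stream, one line at a time.
function countLines(input) {
  return new Promise((resolve, reject) => {
    let lineCount = 0;
    const rl = readline.createInterface({ input, crlfDelay: Infinity });
    rl.on('line', () => { lineCount += 1; });
    rl.on('close', () => resolve(lineCount));
    input.on('error', reject);
  });
}

// For a real file, stream it from disk instead of loading it whole.
// (filePath is whatever path you pass in; nothing is read until this is called.)
function countLinesInFile(filePath) {
  return countLines(fs.createReadStream(filePath));
}

// In-memory demo: two chunks that together contain three lines.
const demo = countLines(Readable.from(['a|b\nc|d\n', 'e|f\n']));
demo.then((total) => console.log(`total lines: ${total}`));
```

Because readline only hands over one line at a time, the line counter itself never needs more than a few bytes of state, no matter how large the file is.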
I also added some variables (and comments) to hold various bits of data: a lineCount, a names array, a donation array and object, and a firstNames array and dupeNames object. You’ll see where these come into play a little later.
Inside of the rl.on('line', ...) function, I was able to do all of my line-by-line data parsing. In here, I incremented the lineCount variable for each line it streamed through. I used the JavaScript split() method to parse out each name and added it to my names array. I further reduced each name down to just the first name, while accounting for middle initials, multiple names, etc., with the help of the JavaScript trim(), includes(), and split() methods. And I sliced the year and month out of the date column, reformatted those to a more readable YYYY-MM format, and added them to the dateDonationCount array.
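The per-line parsing could look something like the sketch below. It rests on a few assumptions that the text above does not spell out: that fields are pipe-delimited, that names look like "LAST, FIRST MIDDLE", and that the date field starts with the digits YYYYMM. The parseLine() helper and the sample line are made up for illustration.

```javascript
// Assumptions (not stated in the post): pipe-delimited fields,
// names shaped like "LAST, FIRST MIDDLE", date field starting with YYYYMM.
const NAME_COLUMN = 7; // 8th column, zero-based
const DATE_COLUMN = 4; // 5th column, zero-based

function parseLine(line) {
  const fields = line.split('|');
  const name = fields[NAME_COLUMN] || '';
  const rawDate = fields[DATE_COLUMN] || '';

  // Take the part after the comma, then drop middle names or initials.
  const afterComma = (name.split(',')[1] || '').trim();
  const firstName = afterComma.includes(' ')
    ? afterComma.split(' ')[0]
    : afterComma;

  // "201701..." becomes "2017-01"; null when the field is too short.
  const yearMonth = rawDate.length >= 6
    ? `${rawDate.slice(0, 4)}-${rawDate.slice(4, 6)}`
    : null;

  return { name, firstName, yearMonth };
}

// Invented sample line, only here to exercise the function.
const sample = 'C001|N|TER|P|201701230300133512|15C|IND|DOE, JANE Q|SOMEWHERE';
console.log(parseLine(sample));
```

Inside a 'line' handler, the returned name, firstName, and yearMonth values would be what gets pushed into the names, firstNames, and dateDonationCount arrays.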
In the rl.on('close',...) function, I did all the transformations on the data I’d gathered into arrays and console.logged out all my data for the user to see.
The lineCount and the names at the 432nd and 43,243rd index required no further manipulation. Finding the most common name and the number of donations for each month was a little trickier.
For the most common first name, I first had to create an object of key-value pairs for each name (the key) and the number of times it appeared (the value). Then I transformed that into an array of arrays using the ES2017 function Object.entries(). From there, it was a simple task to sort the names by their value and print the largest value.
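A minimal sketch of that counting-and-sorting step is shown here. The mostCommonFirstName() helper and the sample names are hypothetical, and the tally object is created with Object.create(null) (my choice, not something the post mentions) so that a name can never collide with a built-in property such as "constructor".

```javascript
// Tally each first name, then sort the [name, count] pairs by count.
function mostCommonFirstName(firstNames) {
  const dupeNames = Object.create(null); // no inherited keys to collide with
  for (const name of firstNames) {
    dupeNames[name] = (dupeNames[name] || 0) + 1;
  }

  // Object.entries() turns { JOHN: 3 } into [['JOHN', 3]].
  const sorted = Object.entries(dupeNames).sort((a, b) => b[1] - a[1]);
  return sorted.length > 0
    ? { name: sorted[0][0], count: sorted[0][1] }
    : null;
}

// Invented sample data.
const top = mostCommonFirstName(['JOHN', 'MARY', 'JOHN', 'ROBERT', 'MARY', 'JOHN']);
console.log(top);
```

Sorting every pair is more work than strictly needed to find a single winner, but it keeps the code close to the description above and also makes it easy to print a top ten.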
Donations also required me to make a similar object of key-value pairs and create a logDateElements() function where I could use ES6’s string interpolation to nicely display the keys and values for each donation month. Then I created a new Map() from the dateDonations object’s array of key-value pairs and looped through it, calling the logDateElements() function on each entry. Whew! Not quite as simple as I first thought.
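Here is a small sketch of the month tally, assuming the dates have already been reduced to YYYY-MM strings. The sample data is invented, and the callback collects its output in an array (rather than logging directly) only so the result is easy to inspect.

```javascript
// Hypothetical input: one "YYYY-MM" string per donation.
const dateDonationCount = ['2017-01', '2017-02', '2017-01', '2017-03', '2017-01'];

// Build { '2017-01': 3, ... } from the list of months.
const dateDonations = {};
for (const month of dateDonationCount) {
  dateDonations[month] = (dateDonations[month] || 0) + 1;
}

// Map#forEach calls back with (value, key), in that order.
const report = [];
const logDateElements = (count, month) => {
  report.push(`Donations in ${month}: ${count}`);
};
new Map(Object.entries(dateDonations)).forEach(logDateElements);

console.log(report.join('\n'));
```

One detail worth knowing: Map's forEach() passes the value first and the key second, which is easy to get backwards when naming the callback parameters.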
But it worked. At least with the smaller 400MB file I was using for testing…
After I’d done that with fs.createReadStream(), I went back and also implemented my solutions with fs.readFile(), to see the differences. Here’s the code for that, but I won’t go through all the details here. It’s pretty similar to the first snippet, just more synchronous-looking. (Unless you use the fs.readFileSync() function, though, JavaScript will run this code just as asynchronously as all its other code, so not to worry.)
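To illustrate the read-it-all-then-split approach, here is a small self-contained sketch. It writes a throwaway demo file to the temp directory (the file name and contents are my own invention) and uses fs.readFileSync() purely to keep the demo short; fs.readFile(path, 'utf8', callback) hands the same string to a callback instead.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Split the full file contents into lines, ignoring one trailing newline.
function splitLines(contents) {
  const lines = contents.split('\n');
  if (lines[lines.length - 1] === '') lines.pop();
  return lines;
}

// Throwaway demo file so the sketch runs without the real dataset.
const demoPath = path.join(os.tmpdir(), `readfile-demo-${process.pid}.txt`);
fs.writeFileSync(demoPath, 'a|b\nc|d\ne|f\n');

// The whole file is in memory as one string before any line is processed.
const demoLines = splitLines(fs.readFileSync(demoPath, 'utf8'));
fs.unlinkSync(demoPath);

console.log(`total lines: ${demoLines.length}`);
```

The key difference from the streaming version is that the entire file, plus the array of lines produced by split(), has to fit in memory at once.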

If you’d like to see my full repo with all my code, you can see it here.
With my working solution, I added the file path for the 2.55GB monster file into the readFileStream.js file, and watched my Node server crash with a JavaScript heap out of memory error.
Fail. Whomp whomp…
As it turns out, although Node.js is streaming the file input and output, in between my code is still attempting to hold nearly all of the file’s contents in memory (every name and date gets pushed into an array), which it can’t do with a file that size. By default, Node caps the size of its JavaScript heap, and on my setup that cap was roughly 1.5GB.
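If you want to see what the heap cap is on your own machine, Node's built-in v8 module reports it. This is just a diagnostic sketch; the exact number depends on your Node version and system, so I am not assuming any particular value here.

```javascript
const v8 = require('v8');

// heap_size_limit is the maximum size the V8 heap may grow to, in bytes.
const heapLimitBytes = v8.getHeapStatistics().heap_size_limit;
const heapUsedBytes = process.memoryUsage().heapUsed;

const toMB = (bytes) => Math.round(bytes / 1024 / 1024);

console.log(`heap limit: ${toMB(heapLimitBytes)} MB, in use: ${toMB(heapUsedBytes)} MB`);
```

The limit can be raised by starting Node with the --max-old-space-size flag (the value is in megabytes), though for a file this large, holding less data in memory is usually the more robust fix.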
So neither of my current solutions was up for the full challenge.
I needed a new solution. A solution for even larger datasets running through Node.
I found my solution in the form of EventStream, a popular NPM module with over 2 million weekly downloads and a promise “to make creating and working with streams easy”.
With a little help from EventStream’s documentation, I was able to figure out how to, once again, read the file line by line and do what needed to be done, hopefully in a more memory-friendly way for Node.
Here’s my new code using the NPM module EventStream.
The biggest change was the pipe commands at the beginning of the file. All of that syntax is the way EventStream’s documentation recommends you break up the stream into chunks delimited by the \n character at the end of each line of the .txt file.
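To show the idea behind breaking a stream into newline-delimited chunks, here is a dependency-free sketch built on Node's own stream.Transform. To be clear, this is not EventStream's code or API, just an illustration of the same technique; splitOnNewline() and collectLines() are names I made up.

```javascript
const { Readable, Transform } = require('stream');

// Re-chunks incoming text so that every emitted chunk is exactly one line.
function splitOnNewline() {
  let leftover = '';
  return new Transform({
    readableObjectMode: true, // emit one string per line, even empty ones
    decodeStrings: false,     // leave string chunks as strings
    transform(chunk, encoding, callback) {
      // Note: Buffer chunks are decoded one by one here; for multi-byte text,
      // read the source with an encoding so chunks arrive as strings.
      const pieces = (leftover + chunk.toString()).split('\n');
      leftover = pieces.pop(); // the last piece may be an unfinished line
      for (const line of pieces) this.push(line);
      callback();
    },
    flush(callback) {
      if (leftover !== '') this.push(leftover);
      callback();
    },
  });
}

// Gather every line from a source stream (only sensible for small inputs).
async function collectLines(source) {
  const lines = [];
  for await (const line of source.pipe(splitOnNewline())) lines.push(line);
  return lines;
}

// Demo: the second line is deliberately split across two chunks.
const demoSplit = collectLines(Readable.from(['one\ntw', 'o\nthree']));
demoSplit.then((lines) => console.log(lines));
```

The important part is the leftover variable: a chunk boundary can fall in the middle of a line, so the unfinished tail has to be carried over and glued onto the next chunk.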
The only other thing I had to change was the names answer. I had to fudge that a little bit, since if I tried to add all 13 million names into an array, I again hit the out of memory issue. I got around it by just collecting the 432nd and 43,243rd names and adding them to their own array. Not quite what was being asked, but hey, I had to get a little creative.
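That workaround can be sketched as a tiny collector that counts names as they stream past and keeps only the ones at the requested positions. The createNameCollector() helper is hypothetical, and I am assuming positions are counted from 1 (so position 432 means the 432nd name seen).

```javascript
// Keep only the names at the requested 1-based positions,
// instead of storing every name in one enormous array.
function createNameCollector(positions) {
  const wanted = new Set(positions);
  const found = new Map();
  let seen = 0;

  return {
    // Call once per name, in file order.
    add(name) {
      seen += 1;
      if (wanted.has(seen)) found.set(seen, name);
    },
    // Names in the same order as the requested positions.
    result() {
      return positions.map((position) => found.get(position));
    },
  };
}

// Tiny demo with invented names; the real call would use [432, 43243].
const collector = createNameCollector([2, 4]);
['A', 'B', 'C', 'D', 'E'].forEach((name) => collector.add(name));
console.log(collector.result());
```

Memory use stays constant here: however many millions of names go through add(), only the handful that were asked for are ever stored.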
OK, with the new solution implemented, I again fired up Node.js with my 2.55GB file and my fingers crossed that this would work. Check out the results.
Woo hoo!
Success!
In the end, Node.js’s pure file and big data handling functions fell a little short of what I needed, but with just one extra NPM package, EventStream, I was able to parse through a massive dataset without crashing the Node server.
Stay tuned for part two of this series where I compare my three different ways of reading data in Node.js with performance testing to see which one is truly superior to the others. The results are pretty eye opening — especially as the data gets larger…
Thanks for reading. I hope this gives you an idea of how to handle large amounts of data with Node.js. Claps and shares are very much appreciated!
