
The Drastic Mistake Of Using Mongoose To Handle Your Big Data

Introduction

Mongoose is an incredibly popular and well-built library in the NPM universe. It is used extensively by many excellent programmers because of its Model-Schema structure. Indeed, a cursory Google search for examples of building any sort of stack with data models that include MongoDB will show that the authors almost always include Mongoose in their setup. It is a respected, well-maintained, and incredibly popular library. All the above is true, and the authors should be lauded for their excellent skills and their understanding of the needs of the community.
The above is not a disclaimer nor a cynical statement. Indeed, I delayed publishing this article for a long time because I assumed it might not be well received in the MongoDB community. However, the very essence of Mongoose, what it forces one to do with data and how it forces the programmer to think, is the point of this article. I am fully aware that what is written here may be incredibly unpopular with the current NodeJS and NPM community. However, based upon recent events in my own career, the time has come to explain why Mongoose should be avoided in almost any true Big Data scenario. Let me make this claim even harsher: it is also why I truly disdain the use of TypeScript within NodeJS, which likewise forces the typing of variables before they enter the system.

The Idea Behind NoSQL Databases

At its simplest, NoSQL is constructed to take small or massive amounts of data and store them in a collection or data store. Whatever you wish to call your database, the fact remains that the data being stored is just that: data. It is not yet, and within the data store should never be, limited by the idea that each aspect of it must fit a predefined structure. Of course, to many people, when storing something such as a name it seems to make complete sense to ensure that name is a ‘string’, or when storing an amount, to ensure it is stored as a ‘number’.
However, big data systems are not only about storing names, addresses, and choices. They are about picking up every piece of information you possibly can about the object at hand and storing it for later retrieval in whatever manner you wish. Indeed, it is critical to note here that the entire idea of simply collecting data has evolved. Data on its own, without any filtering for patterns or consistency, is no longer the goal. The goal is to take the data and, in whatever engine or code base you are using, filter it in such a way as to make it viable, and possibly one of the most important assets in your company’s perceived wealth.
NoSQL evolved from traditional SQL databases in order to avoid the need for triggers, traditional relationships, and even primary keys. It evolved because data had become so huge and so loosely defined in traditional terms that it required a different way of thinking about data as a whole and about the various pieces connected to it.
NoSQL, in short, is a form of non-linear thinking about putting data in and taking it out. The very point of NoSQL is that you are not supposed to ‘type’ information before it goes into the DB. The typing comes when you pull the data out, by way of smart aggregation, algorithms, and theoretical constructs, all based on the patterns and information you are seeking.
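To make this concrete, here is a minimal sketch using the native MongoDB Node.js driver (the connection string, database name, and 'events' collection are hypothetical, and a locally running mongod is assumed): documents are stored exactly as they arrive, and the ‘typing’ happens only on the way out, inside an aggregation pipeline.

const { MongoClient } = require('mongodb');

async function run() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const events = client.db('bigdata').collection('events');

  // Heterogeneous packets: amount arrives as a number, as a string, or not at all.
  await events.insertMany([
    { source: 'sensor-a', amount: 42 },
    { source: 'sensor-b', amount: '17.5', extra: { battery: 'low' } },
    { source: 'sensor-c' }
  ]);

  // "Type" the data on the way out: coerce amount to a double (defaulting to 0),
  // then total it per source.
  const totals = await events.aggregate([
    { $addFields: { amountNum: { $convert: { input: '$amount', to: 'double', onError: 0, onNull: 0 } } } },
    { $group: { _id: '$source', total: { $sum: '$amountNum' } } }
  ]).toArray();

  console.log(totals);
  await client.close();
}

run().catch(console.error);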

Linear Thinking

In defense of the Mongoose users, they have usually been trained in the SQL world, where every single item has a specific structure and the system is relational, built on primary keys, relations, and triggers. This is typical linear thinking. It works great for small things such as name-and-address databases, blog posts, book reviews, and so on; millions of examples could be given here. I myself used MySQL and PHP exclusively for many, many years. It was, and still is, an excellent system. And indeed, most of the examples you will find in any search, even of complete systems for NodeJS and MongoDB, make use of such a scenario and include Mongoose.
Yet it forces one to think in a linear fashion. It forces linear programming: go from Step 1 to Step 2 to Step 3, and so on. There are, of course, blobs and huge text fields in MySQL systems, but these require massive amounts of coding and preparation in order to pull out exactly the desired information. Traditional MySQL systems are simply not built for the massive, real-time data needs of today.
The very nature of MySQL fights against the use of massive amounts of ‘non-typed’ data.
Assuredly, it now allows for unstructured data, but the very essence of its tables, triggers, and relationships is built upon a structure.
More so, a MySQL structure can get incredibly complicated: Table A goes to Table B, which triggers a relationship in Table C, which picks up information from Table D, which must then sort through Table E to retrieve information. This chain can run to literally 20–30 tables. It is not only messy; it becomes almost impossible for anyone who is not constantly on top of the structure (a DBA) to keep it correct.
Yet even in such cases the main point remains: these systems are simply not built for the unstructured data being picked up these days.

Mongoose Structure

So it makes sense that one would gravitate to Mongoose in order to apply some sort of structure to the data being collected. More so, since Mongoose works incredibly well as middleware alongside Express, Passport, and the like, and supplies the needed connectivity to MongoDB in the normal fashion, it is by nature something one should not pass over without thinking clearly about what the entire system should do. Within the Schema-Model system, one can pass almost any type of function in order to achieve true CRUD. It also allows for callbacks, promises, and ES6. All the above is fantastic.
However, there is one caveat. By the nature of the way Mongoose works, you must define (type) the shape of the information up front, in a typical JSON-like schema. In practical situations, especially with huge amounts of data, any experienced data analyst will tell you that such a model is not only incredibly bad for the data itself but almost impossible to achieve.
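For readers unfamiliar with the pattern, this is roughly what that up-front typing looks like; a minimal sketch of a Schema-Model definition with invented field names:

const mongoose = require('mongoose');

// Every path must be declared and typed before a single document is stored.
const patientSchema = new mongoose.Schema({
  name:    { type: String, required: true },
  age:     Number,
  insured: { type: Boolean, default: false },
  visits:  [{ date: Date, notes: String }]
});

const Patient = mongoose.model('Patient', patientSchema);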

Typing the Data Before it ends up in your DB

The essence of applying a type to data before it ends up in your collection implies that you know exactly what data is coming in and exactly what type it is, whatever string, number, or other state it arrives in. Again, this can work for small, simple systems where you are concentrating on specific data you are 100% sure you are going to get. But what happens in real-time situations, or when receiving data in JSON packets, where sometimes the information simply is not there or an extra key-value suddenly appears? Or what happens when you are dealing with medical or insurance information, where huge amounts of differing data must be picked up in different types?
Of course, you can argue that one can go back and change the Schema-Model for that collection to include the changes. This keeps everything nicely wrapped up in the model, and it is certainly easy to read. Yet it does not reflect real-world scenarios, and the Schema-Model of a truly big data system would be extremely complex, once again requiring not only knowledge of MongoDB and all its possibilities but probably a DBA just to handle the Mongoose setup.
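Here is a sketch of that failure mode, reusing the hypothetical patient model idea from the earlier sketch and the usual defaults: with a strict schema, an unexpected key in an incoming packet never reaches the collection, while the same packet stored through the native driver is kept whole.

const mongoose = require('mongoose');
const { MongoClient } = require('mongodb');

const Patient = mongoose.model('Patient',
  new mongoose.Schema({ name: String, age: Number }));

async function store(packet) {
  await mongoose.connect('mongodb://localhost:27017/records');
  const client = await MongoClient.connect('mongodb://localhost:27017');

  // 'geneticMarkers' is not declared in the schema, so by default Mongoose
  // strips it before the document is written.
  await Patient.create(packet);

  // The native driver stores the packet exactly as it arrived.
  await client.db('records').collection('patients_raw').insertOne(packet);

  await mongoose.disconnect();
  await client.close();
}

store({ name: 'Jane Doe', age: 51, geneticMarkers: ['BRCA1'] }).catch(console.error);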

Extending middleware yet again for the wrong purpose

Any NodeJS programmer, if asked about middleware, will tell you that Express will usually be at the top of the stack. Routing, JWT, Passport, Helmet, and many others will then continue the stack. It is incredibly easy to get lost in such a stack, no matter how well you design it. I have seen package.json files that boggle the mind with the number of NPM modules used. Maybe all of them are needed, and honestly, I am truly an NPM believer. However, the stack is the stack. Middleware not introduced correctly, or introduced in the wrong order, can severely slow a system down or simply make it non-functional. Add to all this the forcing of types onto data, and you are dealing with systems that could probably react much faster, with far less latency and, most importantly, less rejection of non-formatted data.
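As a small illustration of how much ordering matters, here is a sketch of an Express stack (the '/ingest' route and the authenticate function are hypothetical): each piece is registered before the pieces that depend on it, and everything registered earlier runs on every request that follows.

const express = require('express');
const helmet = require('helmet');

const app = express();

app.use(express.json());    // 1. parse JSON bodies first, so req.body exists later
app.use(helmet());          // 2. security headers on every response
app.use(authenticate);      // 3. reject bad tokens here, before any route runs

app.post('/ingest', (req, res) => {
  // 4. the data routes only run once the stack above has passed the request on
  res.sendStatus(202);
});

function authenticate(req, res, next) {
  // placeholder for a JWT/Passport check; call next() only when the request is allowed
  next();
}

app.listen(3000);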

Conclusion

I wish to reiterate that the creators of Mongoose did an incredible job. They also allowed SQL programmers to join the NoSQL universe without forcing great changes to their way of thinking about data. Additionally, for small systems Mongoose may well be the way to go.
However, if you are truly dealing with complex big data scenarios, or with real-time packets coming in from all over, where values, types, and information can change from packet to packet, Mongoose should be left out of your equation. You must master MongoDB in all its nuances, including MapReduce, aggregation, and the rest. CRUD is no longer just CRUD; in big data systems it goes way beyond the typical SQL CRUD type of operation. It requires a great deal of understanding of the true nature of big data, far beyond the old name-address-phone systems, logins, and accepting some type of written data. When you develop beyond that, my advice is to put Mongoose aside and dive into creating a functional library with MongoDB native constructs. Avoid the middleware of Mongoose no matter how well you know it. In the end, in such systems, Mongoose will reject information that is not typed, and you will lose it. That can be disastrous in big data scenarios.
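What such a functional library might look like is sketched below (the database and collection names are hypothetical): a thin module of plain functions over the native driver, with no schema anywhere in the write path.

const { MongoClient } = require('mongodb');

let db;

async function connect(uri = 'mongodb://localhost:27017') {
  const client = await MongoClient.connect(uri);
  db = client.db('bigdata');
  return client;
}

// Accept whatever arrives: stamp it with a received time and store it untouched.
function ingest(collection, doc) {
  return db.collection(collection).insertOne({ ...doc, receivedAt: new Date() });
}

// Shape and type the data only on the way out, one aggregation pipeline per question.
function summarize(collection, pipeline) {
  return db.collection(collection).aggregate(pipeline).toArray();
}

module.exports = { connect, ingest, summarize };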
The choice is yours, and obviously this is an opinion piece. If you are working on a true Big Data system, my simple, humble opinion is to leave Mongoose out of the equation. Do not “type” (read: pre-define) your information. Do not set limits on what will come into your system. Take all the data you can possibly get, and then, with the correct algorithms and manipulations of the native MongoDB commands, you will be able to achieve the goal of not only collecting data but finding the patterns and connections in any possible way. This allows for flexibility and a non-linear thought process. This is what NoSQL was created for.
