
The Drastic Mistake Of Using Mongoose To Handle Your Big Data

Introduction

Mongoose is an incredibly popular and well-built library in the NPM universe. It is used extensively by many excellent programmers because of its Model-Schema structure. Indeed, a cursory Google search for examples of building any sort of stack with data models that include MongoDB will show that the authors almost always include Mongoose in their setup. It is a respected, well-maintained, and incredibly popular library. All the above is true, and the authors should be lauded for their excellent skills and their understanding of the needs of the community.
The above is not a disclaimer nor a cynical statement. Indeed, I delayed publishing this article for a long time because I assumed it might not be well received in the MongoDB community. However, the very essence of Mongoose, what it forces one to do with data and how it forces the programmer to think, is the point of this article. I am fully aware that what is written here may be incredibly unpopular with the current NodeJS and NPM community. However, based upon recent events in my own career, the time has come to explain why Mongoose should be avoided in almost any true Big Data scenario. Let me make this claim even harsher: it is also why I truly disdain the use of TypeScript within NodeJS, which likewise forces the typing of variables before they enter the system.

The Idea Behind NoSQL Databases

At its simplest, NoSQL is constructed to take small or massive amounts of data and store them in a collection or data store. Whatever you wish to call your database, the fact remains that the data being stored is just that: data. It is not yet, and within the data store should never be, limited by the idea that each aspect of it must fit a predefined structure. Of course, to many people, when storing something such as a name it seems to make complete sense to ensure that name is a ‘string’, or when storing an amount, to ensure it is stored as a ‘number’.
However, big data systems are not only about storing names, addresses, and choices. They are about picking up every piece of information you possibly can about the object at hand and storing it for later retrieval in whatever manner you wish. Indeed, it is critical to note here that the entire idea of simply collecting data has evolved. Data on its own, without any filtering for patterns or consistency, is no longer the goal. The goal is to take the data and, in whatever engine or code base you are using, filter it in such a way as to make it viable, and possibly one of the most important assets in your company’s perceived wealth.
NoSQL evolved from traditional SQL databases in order to avoid the need for triggers, traditional relationships, and even primary keys. It evolved because data had become so huge and so loosely defined in traditional terms that it required a different way of thinking about data as a whole and about the various pieces connected to it.
NoSQL, in short, is a form of non-linear thinking about putting data in and taking it out. The very point of NoSQL is that you are not supposed to ‘type’ information before it goes into the DB. The typing comes when you pull the data out, by way of smart aggregation, algorithms, and theoretical constructs, all based on the patterns and information you are seeking.
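To make this concrete, here is a minimal sketch using the native MongoDB Node.js driver (the connection string, database name, and 'events' collection are hypothetical, and a locally running mongod is assumed): documents are stored exactly as they arrive, and the ‘typing’ happens only on the way out, inside an aggregation pipeline.

const { MongoClient } = require('mongodb');

async function run() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const events = client.db('bigdata').collection('events');

  // Heterogeneous packets: amount arrives as a number, as a string, or not at all.
  await events.insertMany([
    { source: 'sensor-a', amount: 42 },
    { source: 'sensor-b', amount: '17.5', extra: { battery: 'low' } },
    { source: 'sensor-c' }
  ]);

  // "Type" the data on the way out: coerce amount to a double (defaulting to 0),
  // then total it per source.
  const totals = await events.aggregate([
    { $addFields: { amountNum: { $convert: { input: '$amount', to: 'double', onError: 0, onNull: 0 } } } },
    { $group: { _id: '$source', total: { $sum: '$amountNum' } } }
  ]).toArray();

  console.log(totals);
  await client.close();
}

run().catch(console.error);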

Linear Thinking

In defense of the Mongoose users, they have usually been trained in the SQL world, where every single item has a specific structure and the system is relational, built on primary keys, relations, and triggers. This is typical linear thinking. It works great for small things such as name-and-address databases, blog posts, book reviews, and so on; millions of examples could be given here. I myself used MySQL and PHP exclusively for many, many years. It was, and still is, an excellent system. And indeed, most of the examples you will find in any search, even of complete systems for NodeJS and MongoDB, make use of such a scenario and include Mongoose.
Yet it forces one to think in a linear fashion. It forces linear programming: go from Step 1 to Step 2 to Step 3, and so on. There are, of course, blobs and huge text fields in MySQL systems, but these require massive amounts of coding and preparation in order to pull out exactly the desired information. Traditional MySQL systems are simply not built for the massive, real-time data needs of today.
The very nature of MySQL fights against the use of massive amounts of ‘non-typed’ data.
Assuredly, it now allows for unstructured data, but the very essence of its tables, triggers, and relationships is built upon a structure.
More so, a MySQL structure can get incredibly complicated: Table A goes to Table B, which triggers a relationship in Table C, which picks up information from Table D, which must then sort through Table E to retrieve information. This chain can run to literally 20–30 tables. It is not only messy; it becomes almost impossible for anyone who is not constantly on top of the structure (a DBA) to keep it correct.
Yet even in such cases the main point remains: these systems are simply not built for the unstructured data being picked up these days.

Mongoose Structure

So it makes sense that one would gravitate to Mongoose in order to apply some sort of structure to the data being collected. More so, since Mongoose works incredibly well as middleware alongside Express, Passport, and the like, and supplies the needed connectivity to MongoDB in the normal fashion, it is by nature something one should not pass over without thinking clearly about what the entire system should do. Within the Schema-Model system, one can pass almost any type of function in order to achieve true CRUD. It also allows for callbacks, promises, and ES6. All the above is fantastic.
However, there is one caveat. By the nature of the way Mongoose works, you must define (type) the shape of the information up front, in a typical JSON-like schema. In practical situations, especially with huge amounts of data, any experienced data analyst will tell you that such a model is not only incredibly bad for the data itself but almost impossible to achieve.
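For readers unfamiliar with the pattern, this is roughly what that up-front typing looks like; a minimal sketch of a Schema-Model definition with invented field names:

const mongoose = require('mongoose');

// Every path must be declared and typed before a single document is stored.
const patientSchema = new mongoose.Schema({
  name:    { type: String, required: true },
  age:     Number,
  insured: { type: Boolean, default: false },
  visits:  [{ date: Date, notes: String }]
});

const Patient = mongoose.model('Patient', patientSchema);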

Typing the Data Before it ends up in your DB

The essence of applying a type to data before it ends up in your collection implies that you know exactly what data is coming in and exactly what type it is, whatever string, number, or other state it arrives in. Again, this can work for small, simple systems where you are concentrating on specific data you are 100% sure you are going to get. But what happens in real-time situations, or when receiving data in JSON packets, where sometimes the information simply is not there or an extra key-value suddenly appears? Or what happens when you are dealing with medical or insurance information, where huge amounts of differing data must be picked up in different types?
Of course, you can argue that one can go back and change the Schema-Model for that collection to include the changes. This keeps everything nicely wrapped up in the model, and it is certainly easy to read. Yet it does not reflect real-world scenarios, and the Schema-Model of a truly big data system would be extremely complex, once again requiring not only knowledge of MongoDB and all its possibilities but probably a DBA just to handle the Mongoose setup.
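Here is a sketch of that failure mode, reusing the hypothetical patient model idea from the earlier sketch and the usual defaults: with a strict schema, an unexpected key in an incoming packet never reaches the collection, while the same packet stored through the native driver is kept whole.

const mongoose = require('mongoose');
const { MongoClient } = require('mongodb');

const Patient = mongoose.model('Patient',
  new mongoose.Schema({ name: String, age: Number }));

async function store(packet) {
  await mongoose.connect('mongodb://localhost:27017/records');
  const client = await MongoClient.connect('mongodb://localhost:27017');

  // 'geneticMarkers' is not declared in the schema, so by default Mongoose
  // strips it before the document is written.
  await Patient.create(packet);

  // The native driver stores the packet exactly as it arrived.
  await client.db('records').collection('patients_raw').insertOne(packet);

  await mongoose.disconnect();
  await client.close();
}

store({ name: 'Jane Doe', age: 51, geneticMarkers: ['BRCA1'] }).catch(console.error);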

Extending middleware yet again for the wrong purpose

Any NodeJS programmer, if asked about middleware, will tell you that Express will usually be at the top of the stack. Routing, JWT, Passport, Helmet, and many others will then continue the stack. It is incredibly easy to get lost in such a stack, no matter how well you design it. I have seen package.json files that boggle the mind with the number of NPM modules used. Maybe all of them are needed, and honestly, I am truly an NPM believer. However, the stack is the stack. Middleware not introduced correctly, or introduced in the wrong order, can severely slow a system down or simply make it non-functional. Add to all this the forcing of types onto data, and you are dealing with systems that could probably react much faster, with far less latency and, most importantly, less rejection of non-formatted data.
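As a small illustration of how much ordering matters, here is a sketch of an Express stack (the '/ingest' route and the authenticate function are hypothetical): each piece is registered before the pieces that depend on it, and everything registered earlier runs on every request that follows.

const express = require('express');
const helmet = require('helmet');

const app = express();

app.use(express.json());    // 1. parse JSON bodies first, so req.body exists later
app.use(helmet());          // 2. security headers on every response
app.use(authenticate);      // 3. reject bad tokens here, before any route runs

app.post('/ingest', (req, res) => {
  // 4. the data routes only run once the stack above has passed the request on
  res.sendStatus(202);
});

function authenticate(req, res, next) {
  // placeholder for a JWT/Passport check; call next() only when the request is allowed
  next();
}

app.listen(3000);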

Conclusion

I wish to reiterate that the creators of Mongoose did an incredible job. They also allowed SQL programmers to join the NoSQL universe without forcing great changes to their way of thinking about data. Additionally, for small systems Mongoose may well be the way to go.
However, if you are truly dealing with complex big data scenarios, or with real-time packets coming in from all over, where values, types, and information can change from packet to packet, Mongoose should be left out of your equation. You must master MongoDB in all its nuances, including MapReduce, aggregation, and the rest. CRUD is no longer just CRUD; in big data systems it goes way beyond the typical SQL CRUD type of operation. It requires a great deal of understanding of the true nature of big data, far beyond the old name-address-phone systems, logins, and accepting some type of written data. When you develop beyond that, my advice is to put Mongoose aside and dive into creating a functional library with MongoDB native constructs. Avoid the middleware of Mongoose no matter how well you know it. In the end, in such systems, Mongoose will reject information that is not typed, and you will lose it. That can be disastrous in big data scenarios.
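What such a functional library might look like is sketched below (the database and collection names are hypothetical): a thin module of plain functions over the native driver, with no schema anywhere in the write path.

const { MongoClient } = require('mongodb');

let db;

async function connect(uri = 'mongodb://localhost:27017') {
  const client = await MongoClient.connect(uri);
  db = client.db('bigdata');
  return client;
}

// Accept whatever arrives: stamp it with a received time and store it untouched.
function ingest(collection, doc) {
  return db.collection(collection).insertOne({ ...doc, receivedAt: new Date() });
}

// Shape and type the data only on the way out, one aggregation pipeline per question.
function summarize(collection, pipeline) {
  return db.collection(collection).aggregate(pipeline).toArray();
}

module.exports = { connect, ingest, summarize };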
The choice is yours, and obviously this is an opinion piece. If you are working on a true Big Data system, my simple, humble opinion is to leave Mongoose out of the equation. Do not “type” (read: pre-define) your information. Do not set limits on what will come into your system. Take all the data you can possibly get, and then, with the correct algorithms and manipulations of the native MongoDB commands, you will be able to achieve the goal of not only collecting data but finding the patterns and connections in any possible way. This allows for flexibility and a non-linear thought process. This is what NoSQL was created for.
