NODE.JS: EXTRACT TEXT FROM IMAGE USING TESSERACT



In this article, we will see how to extract text from images in Node.js using Tesseract.

Let's start with a use case.

Suppose you have 300 screenshots on your phone, each containing an email address that you need for some reason, such as growing your network or email marketing.
Copying the email from each image into a CSV or Excel file by hand would take a lot of time.
So let's see how to automate this.


First, you need to install the pre-built binary package of Tesseract OCR (an optical character recognition engine) for your OS.
I have tested it on Windows 10.
For Windows 10, you can install it from here.
For other operating systems, you may check this link.
Once you have installed Tesseract with the Windows setup, you also need to add its install folder, probably 'C:\Program Files\Tesseract-OCR', to the PATH environment variable so that Tesseract can be run from any location.
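
A quick way to confirm the PATH setting is to try launching Tesseract from Node.js. The sketch below is only an illustration: the helper name isTesseractOnPath is hypothetical, and it assumes the executable is called tesseract and accepts the --version flag.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true when `tesseract --version`
// can be launched from the current PATH and exits cleanly.
function isTesseractOnPath() {
    const result = spawnSync('tesseract', ['--version'], { encoding: 'utf8' });
    // result.error is set when the executable cannot be found or started.
    return !result.error && result.status === 0;
}

console.log(isTesseractOnPath()
    ? 'Tesseract found on PATH'
    : 'Tesseract not found; check the PATH variable');
```

If the second message is printed, reopen the terminal after editing PATH, since already open terminals keep the old value.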

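Then install the textract library from npm (npm install textract). The code below also uses the jsonexport library, so install it the same way (npm install jsonexport).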

To read the paths of these 300 images in a loop, select all the images and rename them to a common name.
For example, if we rename them to 'image', the files become image(1) to image(300),
so we can build each image path dynamically from the loop index.
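
The naming scheme above can be sketched as a small function. The helper name buildImageNames is hypothetical and only illustrates how the loop index turns into a file name; it assumes every file has the same extension.

```javascript
// Hypothetical helper: builds 'image(1).jpg' ... 'image(count).jpg'.
function buildImageNames(baseName, count, extension) {
    const names = [];
    for (let i = 1; i <= count; i++) {
        names.push(baseName + '(' + i + ').' + extension);
    }
    return names;
}

const names = buildImageNames('image', 300, 'jpg');
console.log(names[0]);                // image(1).jpg
console.log(names[names.length - 1]); // image(300).jpg
```

If your renamed files look different (for example, with a space before the bracket), adjust the string that is built inside the loop.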

Node.js code:

const textract = require('textract');
const jsonexport = require('jsonexport');
const fs = require('fs');

const TOTAL = 300; // Number of images to process.
const emailList = []; // Stores every email that we extract.
let processed = 0; // Counts finished callbacks, including failed ones.

for (let i = 1; i <= TOTAL; i++) {
    const name = 'image(' + i + ').jpg'; // Image type is jpg.
    textract.fromFileWithPath(name, function (error, text) {
        processed++;
        if (error) {
            // Skip images that could not be read instead of crashing.
            console.log('Could not read ' + name + ':', error.message);
        } else {
            console.log(text); // Extracted text.
            // Simple split logic; adjust it to the layout of your images.
            const email = (text.split('Email')[1] || '').trim();
            emailList.push({ Email: email });
        }
        // Callbacks finish in any order, so write the CSV after the last one.
        if (processed === TOTAL) {
            jsonexport(emailList, function (err, csv) {
                if (err) return console.log(err);
                fs.writeFile('EmailList.csv', csv, function (err) {
                    if (err) throw err;
                    console.log('Congrats! Email list created with ' + emailList.length + ' emails');
                });
            });
        }
    });
}

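The code works as follows: the loop builds each image name, textract extracts the text from that image, and the part of the text after the word 'Email' is taken as the address. The callbacks run asynchronously, so the CSV is written only after all 300 images have been handled.
We have used the jsonexport library to convert the email list to CSV format, and then fs.writeFile to save it as EmailList.csv.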
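
The split on 'Email' depends on the exact layout of the screenshots. As an alternative (not the approach used above), a regular expression can pick the first email-like token out of the extracted text. This is a minimal sketch: the helper name extractEmail is hypothetical, and the pattern is deliberately simple rather than a full email validator.

```javascript
// Hypothetical helper: returns the first email-like token in the text, or null.
function extractEmail(text) {
    // Simple pattern: local part, '@', domain, a dot, and a TLD of 2+ letters.
    const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
    const match = String(text).match(pattern);
    return match ? match[0] : null;
}

console.log(extractEmail('Name: Jane Doe Email: jane.doe@example.com Phone: 555-0100'));
// jane.doe@example.com
console.log(extractEmail('no address here')); // null
```

OCR output can contain misread characters, so it is worth spot-checking a few rows of the generated CSV against the original images.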

I hope you liked this article. If you have any doubts, please let me know in the comment section.
