In recent years, computer vision has evolved drastically and has become widely available to programmers through open libraries such as OpenCV. OpenCV offers image manipulation capabilities for extracting information from images, such as object classification, facial characteristics and body poses, so that the computer sees what we see in real time, hence the term computer vision. At the same time, programming languages have become far more powerful and easier to use. Nowadays, for example, the Python programming language does practically everything, requires little code and runs in any environment, from a huge Google Cloud server to a “potato” in your basement. Python owes its huge success to its very strong community, which produces a wide variety of libraries with many features and functionalities. A drawback, however, is that Python is not easy to run in a browser; it usually runs server side and communicates with the browser through web services. That can produce a high payload and long delays, especially when dealing with a video stream. Therefore, the goal of this project is to provide computer vision capabilities on the browser side using existing JS libraries, manipulating an HTML5 video feed to extract information and implement CSS functionality (auto-scrolling).
State of the art
There are a few known projects published for eye tracking, which we will not look into further. Instead, I would advise our readers to take a look at this article, where a few known gaze recognition projects are compared. Two JavaScript libraries were used for the needs of this project. Face-api.js by Vincent Mühler was used to detect the face, extract some useful information on facial expression and, furthermore, recognise the positions of the eyes, nose, jawline etc. The OpenCV library, well known for its strength in Python projects, has luckily also been reshaped as a JavaScript library (opencv.js), albeit with a less friendly and less well-documented environment. Nevertheless, it provides adequate functions to process the frames of the eye. Last but not least, Antoine Lamé's GazeTracking project helped us dig into the world of iris recognition, so props to him.
How it works
An analysis of the library is presented here, in order to better understand how we are going to extract the iris from our frame and how the system will help us recognise the winking motion. The wink-scroll library is implemented in the following order:
FACE DETECTION -> LANDMARKS DETECTION -> EYE ISOLATION -> IMAGE PROCESSING -> CALIBRATION -> CENTROID OF IRIS
As a first step, a video element is created in JavaScript via the DOM, and the webcam stream is fed into it. In addition, a canvas element is placed on top of the video feed in absolute position; the face-api estimations are drawn on it for debugging purposes. Next, two more canvases are appended below the video, on which we draw the eye and the image-processed frames. Every DOM element has its CSS options altered via JavaScript as follows:
video.id = "NAME";
video.width = WIDTH;
video.height = HEIGHT;
video.style.position = "absolute";
video.style.top = DISTANCE_T + "px";
video.style.left = DISTANCE_L + "px";
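Putting these pieces together, a minimal sketch of the element setup and the webcam feed might look like the following (the element names, dimensions and positions are illustrative, not taken from the library):

// Create the video element and the debug overlay described above.
const video = document.createElement("video");
video.id = "webcam-feed";
video.width = 640;
video.height = 480;
video.autoplay = true;
document.body.appendChild(video);

const overlay = document.createElement("canvas");   // face-api debug drawings
overlay.width = video.width;
overlay.height = video.height;
overlay.style.position = "absolute";
overlay.style.top = "0px";
overlay.style.left = "0px";
document.body.appendChild(overlay);

// Feed the webcam stream into the video element.
navigator.mediaDevices.getUserMedia({ video: true })
  .then((stream) => { video.srcObject = stream; })
  .catch((err) => console.error("Webcam access denied:", err));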
Now that the “playground” is all set, we can start feeding the video frames to face-api. We use three characteristics to collect information from the API:
- Detections
- Expressions
- Landmarks
We use the first two characteristics (detections, expressions) for the graphic representation drawn on the first canvas. Also, for testing purposes we've added a function that changes the background color of the HTML body element whenever our mood changes.
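A minimal sketch of the detection call, assuming the code runs inside an async function and that the models are served from a /models path (the detector options, model path and colors are illustrative assumptions):

// Load the face-api models once.
await faceapi.nets.tinyFaceDetector.loadFromUri("/models");
await faceapi.nets.faceLandmark68Net.loadFromUri("/models");
await faceapi.nets.faceExpressionNet.loadFromUri("/models");

// Run detection on the current video frame.
const detections = await faceapi
  .detectAllFaces(video, new faceapi.TinyFaceDetectorOptions())
  .withFaceLandmarks()
  .withFaceExpressions();

// Debug helper: tint the page body according to the strongest expression.
if (detections.length === 1) {
  const expressions = detections[0].expressions;
  const mood = Object.keys(expressions).reduce(
    (a, b) => (expressions[a] > expressions[b] ? a : b)
  );
  document.body.style.backgroundColor = mood === "happy" ? "#eaffea" : "#f5f5f5";
}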
The landmarks characteristic replaces the dlib toolkit, which is normally used in Python to determine the 68 landmarks of a face (shown in the picture above) and has been pre-trained, giving us a pretty solid model to play around with. This component gives us access to the [x, y] coordinates of the 68 facial landmarks, which reflect the actual position of each characteristic on the camera feed. Furthermore, it arranges those landmarks into groups based on the portions of the face (e.g. getLeftEye, getJawline, etc.).
For the wink-scroll library we are going to use the leftEye component, which is actually the right eye, since the camera feed is mirrored. getLeftEye returns an array of x and y coordinates of the right-eye landmarks on the camera feed. After that, we can crop the frame to the eye region based on the previously extracted coordinates, as follows:
ctx.drawImage(
  video, startX, startY,            // frame and starting point of the crop
  disX, disY,                       // area to crop (width, height)
  0, 0,                             // place the result at 0, 0 in the canvas
  canvas2.width, canvas2.height);   // destination dimensions on the canvas
where ctx is the 2D context of canvas2, retrieved with the getContext function.
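A minimal sketch of how the crop region might be derived from the eye landmarks before calling drawImage (the padding and variable names are assumptions, following the detection sketch above):

// Bounding box of the mirrored "left eye" landmarks, with a small margin.
const eyePoints = detections[0].landmarks.getLeftEye();
const xs = eyePoints.map((p) => p.x);
const ys = eyePoints.map((p) => p.y);
const pad = 5;                                  // illustrative margin around the eye
const startX = Math.min(...xs) - pad;
const startY = Math.min(...ys) - pad;
const disX = Math.max(...xs) - startX + pad;    // width of the crop area
const disY = Math.max(...ys) - startY + pad;    // height of the crop area

const ctx = canvas2.getContext("2d");
ctx.drawImage(video, startX, startY, disX, disY, 0, 0, canvas2.width, canvas2.height);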
Since it would be complex to decide whose wink the script should listen to, the library assumes that only one person is present on camera. Therefore, we add an if condition that rules out multiple detections by checking that the length of the detections array is equal to 1. Now that we have the cropped eye, we can draw it on canvas2 and proceed with image processing in OpenCV.
Initially, we need to remove some noise from the low-quality web camera frame, which, cropped to the area of the right eye, gives us poor results. For that reason we use a bilateral filter:
let src = cv.imread('canvas2'); // read the cropped frame as an image
let dst = new cv.Mat(); // the destination is an empty Mat
cv.cvtColor(src, src, cv.COLOR_RGBA2RGB, 0); // convert the colorspace to RGB
cv.bilateralFilter(src, dst, 10, 15, 15); // apply the bilateral filter
The parameters of the filter (10, 15, 15) are taken from Antoine Lamé's project and, judging by the results, they work well for our project. src is the eye frame fetched from canvas2, and dst is the output of the applied filter, which we then use as the source for the next filter.
Moreover, we use the erode function, which also removes some noise by reducing the thickness of the edges.
cv.erode(dst, src, M, anchor, 3, cv.BORDER_CONSTANT, DEFAULT_BORDER_VALUE);
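The kernel M, the anchor and the border value are not defined in the snippet above; a minimal sketch of how they might be set up, following the pattern of the opencv.js erode tutorial (the kernel size is an assumption):

let M = cv.Mat.ones(3, 3, cv.CV_8U);                        // structuring element (kernel)
let anchor = new cv.Point(-1, -1);                          // default anchor: kernel centre
let DEFAULT_BORDER_VALUE = cv.morphologyDefaultBorderValue();
cv.erode(dst, src, M, anchor, 3, cv.BORDER_CONSTANT, DEFAULT_BORDER_VALUE);
M.delete();                                                 // opencv.js Mats must be freed manually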
The next thing we need to do is apply a binary threshold to the image, after moving from the RGB to the GRAY colorspace. But to do that we need a threshold value calibrated to the frame at hand, since different parameters affect the image, such as the quality of the web camera and the brightness of the room. The calibration process works as follows:
The threshold value is the number which, when applied to the filter, makes the iris of the eye distinguishable. That value can be obtained by calibrating the ratio of black pixels to total pixels to be around 0.48, the expected value for an average eye. We repeatedly adjust the threshold value, trying to reach the optimal iris/image ratio, and apply the filter as follows:
cv.threshold(src, dst, threshold, 255, cv.THRESH_BINARY);
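The calibration loop itself is not shown above; a minimal sketch of the idea, assuming the input has been converted to grayscale and using a simple linear sweep (the step size, range and function name are illustrative), might look like this:

cv.cvtColor(src, src, cv.COLOR_RGB2GRAY, 0);     // threshold works on a grayscale image

// Pick the threshold whose black-pixel ratio is closest to the ~0.48 target.
function calibrateThreshold(gray) {
  const target = 0.48;
  const tmp = new cv.Mat();
  let best = 127;
  let bestDiff = Infinity;
  for (let t = 5; t <= 250; t += 5) {
    cv.threshold(gray, tmp, t, 255, cv.THRESH_BINARY);
    const total = tmp.rows * tmp.cols;
    const blackRatio = (total - cv.countNonZero(tmp)) / total;
    const diff = Math.abs(blackRatio - target);
    if (diff < bestDiff) { bestDiff = diff; best = t; }
  }
  tmp.delete();                                  // free the temporary Mat
  return best;
}

const threshold = calibrateThreshold(src);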
Finally, we extract the contours of the shapes in the frame, from which we can compute the centroid that represents the position, or absence, of the iris (eye shut). After some testing, we observed that the centroid's y-axis value moves downwards when the eye is shut. To distinguish between blinking and winking (the eye shut for a longer period), we apply a moving average (using a buffer) to the centroid values, which also reduces some noise.
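A minimal sketch of the contour/centroid step and the smoothing buffer, under the assumption that dst holds the thresholded binary frame (the buffer length and helper names are illustrative):

// Find contours in the binary frame and take the centroid of the first one.
const contours = new cv.MatVector();
const hierarchy = new cv.Mat();
cv.findContours(dst, contours, hierarchy, cv.RETR_TREE, cv.CHAIN_APPROX_SIMPLE);

let centroidY = null;
if (contours.size() > 0) {
  const m = cv.moments(contours.get(0));
  if (m.m00 !== 0) centroidY = m.m01 / m.m00;    // y-coordinate of the centroid
}
contours.delete(); hierarchy.delete();

// Moving average over the last N centroid values to separate a blink from a wink.
const buffer = [];
function smooth(value, size = 10) {
  buffer.push(value);
  if (buffer.length > size) buffer.shift();
  return buffer.reduce((a, b) => a + b, 0) / buffer.length;
}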
We can then easily detect whether the eye is shut, by adding an if condition, and simply run a function that scrolls down all the portions of text classified as “winkScroll”. The last picture shows the debugging environment/playground we created, which consists of the camera element, two canvases for the eye, a debugging text and the scrollable text on the right.
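The scrolling step itself can be as simple as the sketch below; the winkScroll class name comes from the text above, while the scroll amount and function name are assumptions:

// When a wink is detected, scroll every element tagged with the "winkScroll" class.
function onWinkDetected() {
  document.querySelectorAll(".winkScroll").forEach((el) => {
    el.scrollBy({ top: 100, behavior: "smooth" });   // scroll down by ~100px
  });
}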
Final thoughts
Is this project something that has been created for the first time? No. However, we are using commonly known technologies in a very promising programming stack (HTML5, JavaScript), and we hope to motivate and inspire other scientists to use it and build on it. Furthermore, our library provides the web browser with the right tools to enhance the browsing experience for people with disabilities. Computer vision is a promising and evolving field, especially with the rise of Artificial Intelligence and Deep Neural Networks. Imagine the endless possibilities in the fields of Security, Retail, Automotive, Healthcare, Agriculture, Banking, Industry and Education. Moreover, people with special needs will have a chance in the future to improve their quality of life. All that matters now is to keep introducing computer vision functionalities to our programming environments and let scientists and programmers do their wonders, taking the computer's logic and understanding of its surroundings to the next level.