Sitting Posture Identifier using AI
A beginner's thought process while completing my final year project and using YOLO for the task
This post is more of a thought process from embarking on this journey rather than a line-by-line code tutorial. I will try to cover as much as I can for people who are learning AI (focused on image processing/computer vision) or trying to start a project, just as I was when I started this one.
Here is the link to try it out: Sitting Posture Identifier Application.
Magic portal to quickly jump around this article:
- A little intro
- Use cases that I can imagine of (Why, What)
- What approaches can we take
- Data collection
- Approaches in machine learning (CNN, RCNN, FRCNN…)
- Results
- Considerations
- What’s next
A little intro
My Final Year Project (FYP) for my degree started just when the COVID-19 pandemic hit. As online classes started and people began working from home, I realized my sitting posture declined over time as I continued sitting in front of my PC. We know what bad posture leads to over the years. This made me wonder: what if there was someone who could constantly monitor my posture and let me know once it starts declining (you get the idea)? At the same time, I was keen to explore and contribute to the healthcare sector.
Use cases that I can imagine of (Why, What)
Before thinking too hard about actually building the system, a few questions usually come to mind. What value would this system bring? Does it benefit more than just me?
Those questions were answered pretty easily for me. Having an internship with an airline made me realize that some flights are pretty lengthy! Unless you are seated in business class, the seats can be rather small, and I tend to slouch after a while. Imagine a small light bulb that blinks to remind you that your holiday could be ruined by backaches if you keep slouching or bending your neck forward to read the magazines.
And flights are not the only place this system could be implemented! It could be a reminder system in kindergartens so that good posture is cultivated from an early age, or perhaps a "pre-diagnosis" while you wait in a chiropractor's lounge.
What approaches can we take
Now that we have identified our problem and potential use cases, I think it is safe to proceed with building the system. Of course, it is essential to do some literature review first and see what the experts on this subject have already explored for this problem.
At the time of writing my FYP, most existing systems used an array of sensors placed on a chair (pressure) or even on the body (gyroscope), leaning more towards IoT. The sensor data is then fed into a system and rule-based matching is applied. However, this method is not scalable, is complex to set up, and is not cheap to implement. So if we are going to build this system, it has to overcome those three drawbacks.
The approach that immediately came to my mind was this: using a webcam and a browser, we could record the user and infer information from the captured image. We can safely say that most laptops have a webcam, and webcams are not that expensive for desktop users. This overcomes the cost and complexity of setting the system up anywhere, at any time. Any user, including newbies, could just open the link in their browser, point the webcam at the correct angle, and the system would be set up.
Since we are inferring information from an image, there are a few methods that could be used; for this project, we will be exploring the use of machine learning and computer vision to solve this particular posture problem.
Data collection
Now, with the draft approach of using machine learning established, data collection becomes a necessity for the machine to learn from. However, before blindly digging around for data, the question is: what data are we looking for?
From the literature review done previously, we know that certain factors contribute to bad posture, including but not limited to neck bending, slouching, and crossed legs. Based on this, the data we are looking for is full-body images of people sitting (sorry, shy people).
Typically, most tutorials out there would point you to Kaggle or some Google image repository for the data source (don't get me wrong, those are great places to get what you need). Unfortunately, I could not find a single dataset resembling humans sitting on chairs, so manual work was needed. I will not go into details on this, but my phone camera worked pretty hard during those few weeks.
A Google Form was created and sent to some of my friends with the guidelines above. Of course, some credit also goes to gaming chair advertisements on the internet and illustrations of good/bad postures. I ended up with around 90 images, at which point the pandemic hit and I could not go out for more. 90 images is a small number; however, certain methods such as image augmentation (resize, crop, rotate, etc.) can help increase the dataset size and variability. Luckily for me, my dataset was not biased towards any label, meaning the labels are distributed fairly evenly across the dataset (e.g. 40 images with good posture and 50 with bad posture).
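To give an idea of what augmentation looks like in practice, below is a minimal sketch using OpenCV and NumPy. The specific transforms and parameters here are illustrative choices for this post, not the exact pipeline used in the project.

```python
import cv2
import numpy as np

def augment(image):
    """Return a few simple variations of one training image:
    a horizontal flip, a small rotation, and a random crop."""
    h, w = image.shape[:2]
    variants = [cv2.flip(image, 1)]  # mirror the sitting pose

    # Rotate by a small random angle around the image centre
    angle = np.random.uniform(-10, 10)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    variants.append(cv2.warpAffine(image, matrix, (w, h)))

    # Random crop covering ~90% of the frame, resized back to the original size
    ch, cw = int(h * 0.9), int(w * 0.9)
    y, x = np.random.randint(0, h - ch), np.random.randint(0, w - cw)
    variants.append(cv2.resize(image[y:y + ch, x:x + cw], (w, h)))

    return variants
```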
Approaches in machine learning
This section will be the interesting one for most beginners! So many acronyms: what do they even mean? CNN? RCNN? SSD?
To decode those acronyms, I think Rohith explained them well here, but I will still briefly talk about them as we go along. OpenCV was used for loading, processing, and drawing bounding boxes on the images.
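As a quick reference before diving into the approaches, this is roughly how OpenCV handles those three tasks; the coordinates and label below are made up purely for the example.

```python
import cv2

# Load an image from disk (OpenCV reads it as BGR)
image = cv2.imread("sitting_sample.jpg")

# Example box: top-left corner (x, y) plus width and height, in pixels
x, y, w, h = 120, 80, 200, 340
color = (0, 255, 0)  # green for a "good" detection, (0, 0, 255) red for "bad"

cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
cv2.putText(image, "good", (x, y - 8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

cv2.imwrite("sitting_sample_annotated.jpg", image)
```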
CNN detection approach
Being new and clueless, I started with the most common one that you hear about: the Convolutional Neural Network (CNN). I opened a TensorFlow image classification tutorial here, copy-pasted the code, and replaced the dataset with my sitting posture images.
For this approach, I separated my dataset by categorizing the images into "good" and "bad" posture. Image enhancement (blurring, cropping to the region of interest) was applied to remove background noise. I ran it, randomly changed the number of epochs, and added and removed layers based on recommendations from Google searches. However, I then realized that the CNN returns results exactly as I trained it: either "good" or "bad". This is not what we want! We want a system that highlights what exactly is wrong with the posture. So, to be more accurate, the problem we are solving is not a simple binary classification problem (good or bad).
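For reference, the classifier I started from looked roughly like the sketch below: a small Keras CNN trained on a good/bad folder structure. This is a simplified outline in the spirit of the tutorial, not the exact code.

```python
import tensorflow as tf

# Expects a folder layout like dataset/good/*.jpg and dataset/bad/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", image_size=(180, 180), batch_size=16)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),  # two classes only: "good" or "bad"
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

model.fit(train_ds, epochs=10)
```

That final two-unit Dense layer is exactly the limitation described above: the network can only ever answer "good" or "bad" for the whole image.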
Multi-label image classification
As the title states: multi-label. The first thought would be: cool, now from an image we get attributes/labels such as neck, legs, and back, each with a value of 0 (good posture) or 1 (bad posture). So I created a CSV file with the first column being the image file name and the remaining columns being the attributes and their respective values. Same routine as before: I copied some code from a Google search on multi-label classification, ran it, and got the results.
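In code, that setup looks roughly like the sketch below: a CSV mapping each file name to its attribute values, and a network ending in one sigmoid output per attribute, trained with binary cross-entropy. Again, this is a simplified illustration rather than the exact code I copied.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# labels.csv columns: filename, neck, back, legs  (0 = good, 1 = bad)
df = pd.read_csv("labels.csv")
labels = df[["neck", "back", "legs"]].values.astype("float32")

def load_image(path):
    img = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (180, 180)) / 255.0

images = np.stack([load_image("images/" + name).numpy() for name in df["filename"]])

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(180, 180, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    # One sigmoid output per attribute so each can independently be 0 or 1
    tf.keras.layers.Dense(labels.shape[1], activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(images, labels, epochs=10, batch_size=16)
```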
This time, the results came back pretty jumbled and fairly inaccurate. The issue is that we may have a label called "neck", but we must always remember that the machine has no idea what a neck is and may associate that label with something else in the image that persists across the entire dataset (perhaps the type of chair, or even the shape of the person). We have to be more specific in this case.
RCNN, FRCNN, SSD then YOLO
As we mature in our understanding of how this works, Region-based CNNs (R-CNN), Faster R-CNN (FRCNN), or Single Shot Detectors (SSD) will appear sooner or later. It became clear that object detection might be the right approach for this problem. I have created my own simplified version of a manual R-CNN for testing and will share it in the next few posts!
But some might ask: those tutorials usually detect objects (apple, human, face, legs), so how can we get the system to know whether the legs are placed in a good posture? The most obvious answer could be: detect the legs, crop that region, and send it to a CNN to classify whether they are in a good position. However, in my opinion, this is computationally heavy and not suitable for a real-time system, and furthermore, it is a lot of work to train a separate CNN for each attribute.
Here comes what I did out of laziness. Instead of labeling the parts as neck, back, legs, and buttocks, I labeled them as:
neck_good
neck_bad
backbone_good
backbone_bad
buttocks_good
buttocks_bad
legs_crossed
legs_straight
You get the idea. Images with people having their necks upright would be labeled as neck_good and vice versa.
LabelImg, made by tzutalin, was used for labeling the images. You just need to draw a rectangle around the region and assign the respective label.
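LabelImg can also export annotations in the plain-text format YOLO expects: one .txt file per image, one line per box, containing the class index and the box centre/size normalized by the image dimensions. The helper below is my own sketch of that conversion, not code taken from LabelImg.

```python
def to_yolo_line(class_id, x, y, w, h, img_w, img_h):
    """Convert a pixel-space box (top-left x, y, width, height) into a
    YOLO annotation line with coordinates normalized to the range [0, 1]."""
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# e.g. class 0 = neck_good, drawn on a 640x480 image
print(to_yolo_line(0, 120, 80, 200, 340, 640, 480))
# -> 0 0.343750 0.520833 0.312500 0.708333
```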
Next is training a model. While there are a few tutorials on building an R-CNN, I came across YOLO (v3 at the time of doing my FYP). Rather than building something from scratch, I could utilize what was already there and apply my use case to it. This approach is known as transfer learning. Retraining the YOLO model took a long time on my PC as it has many complex layers. Luckily, Google Colab is free, and I do recommend people check it out! (I will share my Colab project once I have sorted it out.)
After setting it up, I let it run, with snapshots automatically saved to Google Drive every 1000 iterations.
Results
YOLO performed excellently. I quickly built a front-end web interface using VueJS to post images to a Python web server. The Python web server then passes the image to YOLO and returns the results as JSON to the front end.
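The server side is essentially a single endpoint. The sketch below uses Flask, with a placeholder detect_posture() helper standing in for the actual YOLO inference call; it is an outline of the flow, not the production code.

```python
import cv2
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

def detect_posture(image):
    # Placeholder for the YOLO inference step: in the real system this runs
    # the retrained model and collects (label, confidence, box) results.
    return [{"label": "neck_good", "confidence": 0.91, "box": [120, 80, 200, 340]}]

@app.route("/predict", methods=["POST"])
def predict():
    file = request.files["image"]
    # Decode the uploaded bytes in memory; nothing is written to disk
    data = np.frombuffer(file.read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    if image is None:
        return jsonify({"error": "could not decode image"}), 400
    return jsonify({"detections": detect_posture(image)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```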
Here is a snapshot of the result:
The YOLO results are highly accurate, and the bounding boxes are colored red for bad and green for good. In cases where there is no human in the image, a response indicating that no human was detected is returned.
To simulate what was intended, a "timeline" feature was created to capture an image at a fixed interval; a simple graph is then plotted to indicate posture degradation over time.
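Conceptually, the timeline just reduces each captured frame's detections to a single score and plots those scores over time. A hypothetical sketch of that reduction:

```python
def posture_score(labels):
    """Fraction of detected body parts labeled as good for one captured frame.
    `labels` is the list of class names returned for a single image,
    e.g. ["neck_good", "backbone_bad", "legs_straight"]."""
    good = {"neck_good", "backbone_good", "buttocks_good", "legs_straight"}
    if not labels:
        return None  # no person detected in this frame
    return sum(1 for label in labels if label in good) / len(labels)

# One score per captured frame builds the timeline to plot, e.g.:
timeline = [
    posture_score(["neck_good", "backbone_good", "legs_straight"]),  # 1.0
    posture_score(["neck_bad", "backbone_good", "legs_crossed"]),    # ~0.33
]
```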
Considerations
User privacy is the highest priority here. Remember that we are processing images of the user from their webcam. The principle of not storing any data on the system is applied: images sent for processing are processed, and the output is immediately returned to the user.
No information is written to the local disk. Building the prediction system in a Docker container and deploying it on Google Cloud Run also allows the instance to be destroyed once it is no longer in use. Feel free to view the source code above.
What’s next
Even after the goal of the project has been achieved (and my FYP is completed), there is still a lot that can be worked on: bringing the model to offline usage, better modeling of what a good posture is, and much more.
However, I hope this idea can be scaled for production use, especially in the healthcare sector, and I would love to be involved in it! So please reach out to me if you have any intention of working on this project or anything similar!