"When you try to write a program that does what a human can do, you will appreciate the greatness of nature"
Microsoft's Xbox Kinect is a wonderful device and an optional add-on to the Xbox gaming console. It lets you play games without using any hand-held controllers. For example, if you are playing Fruit Ninja, you can slice fruits just by making a slicing motion with your hand - Woop! The fruit on the screen gets sliced. There are many beautiful games you can play with the device.
But from a technical standpoint, it is interesting how the Kinect actually determines your position. The Kinect has three sensors - one is a camera, one is an infrared sensor, and one more, I think, is used to get wireframe images. In fact, there are special chips within the device that allow for motion and gesture recognition, 3D rendering, etc. Read more on Wikipedia here.
Now let's have a quick look at the problem statement. It's a tool being developed for an online apparel store. A common problem that people face is - I am not sure what size I have to buy. So imagine an online tool that tries to determine your size using your webcam.
Now let's have a look at the technical challenges. If you are standing in front of a webcam, the image that is obtained has you along with some background - probably a window, a table, a chair, a door, etc. The first challenge is to separate the person from the background.
The next challenge is perspective. If a relatively wider person stands farther from the camera, his width can appear the same as that of a thinner person who is closer. As far as the image is concerned, both of them have the same width. This is because an image is 2-dimensional - the depth is lost.
Of course, if you were doing this with a Kinect, neither of these is particularly difficult, primarily because the device can easily judge how far from the camera you are using IR beams: it beams, gets the reflection, and determines the distance. Identifying where you are is also not an issue, because it has specialised chips that do that work. In fact, it can even determine where your head, hands, legs, hips, etc. are.
Then why not use a Kinect? For obvious reasons - a Kinect is not a device that everybody has, while a webcam is something almost everyone has because it comes with laptops. Besides, I don't think many would be driven to purchase a Kinect just to try out some app - it costs about Rs. 9000.
OK, so we have established that we have to use a webcam, and that we have a couple of challenges that a Kinect - an expensive device with special hardware - can resolve.
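One simple way to tackle the first challenge with a plain webcam is background subtraction: capture the empty scene first, then paint every pixel that still matches it white, leaving only the person. Here is a minimal NumPy sketch of that idea - the function name and the threshold value are illustrative, not taken from the actual tool:

```python
import numpy as np

def extract_person(background, frame, threshold=30):
    """Whiten pixels that match a previously captured background frame.

    background, frame: H x W x 3 uint8 webcam frames.
    Pixels whose total colour difference from the background is below
    `threshold` are treated as background and painted white (255);
    everything else is assumed to be the person and kept unchanged.
    """
    # Per-pixel absolute difference, summed over the colour channels
    diff = np.abs(background.astype(int) - frame.astype(int)).sum(axis=2)
    result = frame.copy()
    result[diff < threshold] = 255  # paint the background white
    return result
```

A real implementation would also need to smooth and clean up the result - camera noise, shadows, and clothes that match the background are exactly the failure modes discussed further down.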
Great! Now we have been able to extract the person. However, the problem of perspective still remains: a fat person at a distance and a thin person who is closer appear to be the same width in a 2D image. The Kinect, as discussed previously, uses IR to solve this problem; a webcam can't. So instead, the idea was to ask the subject to hold an object of standard size and dimensions, something that everyone will have. A Rs. 100 note seemed ideal.
If the person is holding the note, we can measure the size of the note in the image. The size of the note in real life is already known, so we get a ratio between the two. We also know the size of the person in the image, and since the ratio is the same, we can work out the person's size in real life.
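The arithmetic above boils down to a single scale factor. A toy example with made-up numbers - the note width and both pixel counts here are hypothetical, not measurements from the tool:

```python
NOTE_REAL_CM = 15.7   # assumed real width of the note; plug in the measured value
note_pixels = 50      # width of the note in the captured image (hypothetical)
chest_pixels = 160    # width of the chest in the same image (hypothetical)

cm_per_pixel = NOTE_REAL_CM / note_pixels    # the scale of this image
chest_real_cm = chest_pixels * cm_per_pixel  # same scale applies to the person
print(round(chest_real_cm, 2))  # -> 50.24
```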
Great! Now we have resolved the second issue. However, there is a challenge. You need to be able to measure the size of the note in the image, and for that you have to know where the note is. A human can do that easily. A program cannot - it probably needs a note detection model. I tried to train one using OpenCV but unfortunately couldn't get it working.
So the next best solution was to draw a rectangle on the screen. What you have to do is hold the note against your body and align yourself so the note fits the rectangle. By doing this, the distance between the subject and the camera is always the same, no matter who is holding the note, fat or thin, so the ratio is also fixed.
The note has to be held against the body - this is the only way I know of to ensure that the ratio of the note in the image to the note in real life is the same as the ratio of your image to you in real life. If the note is held a foot away from your body, it defeats the entire purpose.
The next objective is to measure your chest. A human being knows where the chest begins and ends, and where the arms lie; the program is dumb. So I ask you to spread your arms apart - that way, your arms will not appear in the measurement of the chest. But I still have to determine where your chest is. What I do is first determine where your head is, using a standard face detection algorithm. I then move a little to the right of the head and start going down until I hit your arm - you had spread them wide, remember? I continue moving down until I pass the arm. I have almost reached the measurement region: the point of intersection between the horizontal line drawn from this point and the vertical line drawn from the centre of your face marks the centre of your chest. We have now reached the point of measurement.
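The head-then-arms walk described above can be sketched over a binary person mask. Here the face box is assumed to come from a standard detector (e.g. an OpenCV Haar cascade); the function name and the "a little to the right" offset are illustrative, not the original tool's values:

```python
import numpy as np

def find_chest_centre(mask, face_x, face_y, face_w):
    """Locate the chest centre on a binary person mask.

    mask: 2-D bool array, True where the person is (background already
    removed). (face_x, face_y, face_w): the face box from a standard
    face detector - assumed input here.

    Step to the right of the head, walk down until the outstretched arm
    is hit, keep walking until the arm ends; that row, intersected with
    the vertical line through the face centre, is the chest centre.
    """
    x = face_x + face_w + face_w // 2   # a little to the right of the head
    y = face_y
    h = mask.shape[0]
    while y < h and not mask[y, x]:     # walk down to the top of the arm
        y += 1
    while y < h and mask[y, x]:         # walk past the arm itself
        y += 1
    centre_x = face_x + face_w // 2     # vertical line through the face
    return centre_x, y                  # (column, row) of the chest centre
```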
I now measure the width of your chest there and get some value. If I use the ratio, will I get the right measurement? No, I won't. The human body is not a cuboid: though we see a flat image, it is a curved body in reality. So, in the next step, I ask you to turn to the right by 90 degrees, capture your side profile, and measure that width as well.
The cross-section of the human body around the chest area, when viewed from the top, is very close to an ellipse. What I have obtained here are the major axis and the minor axis of that ellipse.
The objective was to find the circumference, and there is a formula for that. We can use the ratio to get the real width and depth, and then compute the approximate circumference of the person. The formula is below.
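The original post's formula image isn't reproduced here, so as a stand-in, here is Ramanujan's well-known approximation for the perimeter of an ellipse - an assumption on my part about which formula was used; any good ellipse-perimeter approximation would do. The semi-axes a and b are half the measured front width and half the side depth:

```python
import math

def ellipse_circumference(a, b):
    """Approximate perimeter of an ellipse with semi-axes a and b,
    using Ramanujan's first approximation:
        C ~= pi * (3(a + b) - sqrt((3a + b)(a + 3b)))
    """
    return math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))

# Hypothetical semi-axes: 25 cm (half the front width), 12 cm (half the depth)
print(round(ellipse_circumference(25, 12), 1))  # -> 119.9
```

A sanity check on the formula: with a = b (a circle of radius r) it reduces exactly to 2*pi*r.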
Are we done? Not quite - there are other challenges too. For example, if you are wearing a loose shirt, the extra cloth stretches out and makes it appear that you are wider than you really are.
When you are done aligning, how will the program know? If you try to click a button, you will have to move closer to the computer, which means your position is lost. So I use a motion detection library there: you just have to wave your hand at the top left of the frame to indicate that you are done.
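I don't know which motion detection library the original tool used, but the core idea can be sketched as frame differencing restricted to the top-left corner of the frame - all parameter values below are illustrative:

```python
import numpy as np

def hand_waved(prev_frame, frame, region=80, threshold=25, min_pixels=200):
    """Detect motion in the top-left corner of the frame.

    prev_frame, frame: consecutive greyscale frames (2-D uint8 arrays).
    Only the top-left `region` x `region` square is compared; if enough
    pixels changed by more than `threshold`, we take it that the subject
    waved a hand there to say "I'm aligned".
    """
    a = prev_frame[:region, :region].astype(int)
    b = frame[:region, :region].astype(int)
    changed = np.abs(a - b) > threshold   # pixels that moved
    return changed.sum() >= min_pixels    # enough motion = a wave
```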
And what if you are standing right in front of a wall? Then a shadow might get cast on the wall. The shadow was not there when the background was captured but is present when the person's photo is captured, so the program thinks the shadow is part of the person. This also makes the person seem broader than he actually is.
And what if you are wearing a white shirt? I told you that I was making the background white. Since your shirt is also white, I will consider it part of the background - ergo, I won't know where your body is.
These are a couple of problems that still have to be solved; for now, the only way to avoid them is to avoid the situation: wear a darker, tighter shirt, and don't stand close to a wall.
I found that most people do not understand the challenges in this problem. When they try the program out, they expect things to be way simpler, and even the slightest inconvenience makes them say, "WTF is this shit?" I find it hard to explain the various challenges to them.
Like I said before, if you use specialised hardware like the Kinect, you don't have to do the dance of the sugar plum fairy. This project is purely an academic one, showing how to take such measurements with image processing and a simple webcam.
If you have a different or better way to solve these problems, I would definitely love to hear about it. I hope you liked what you read. Do leave me a comment with your opinions and suggestions.
A lot of the things that I have attempted to do programmatically are extremely trivial for a human being to achieve.
Comments from Facebook
"necessity" is the mother of invention. Now let us know about the mother :P
Hey this is a very interesting problem you are trying to solve. :)
You can check out the SIFT algorithm to find your hundred-rupee note in the frame; it's pretty good at such a job. You won't even need the whole damn note - any two distinct points on the note will do, because you can get your ratio from just that much with appropriate mapping.
I feel you could use the same algorithm to find other features (eyes, chest, arms, etc.) by comparing the input image against a template of a human body.
Let me know your thoughts on this approach and if you get anywhere this way. I worked on the SIFT algo a while back and it interests me what it can achieve. :)
Hey Chiranth Ashok - I vaguely remember trying that with OpenCV for note detection. I couldn't get it up and running; maybe I was doing something wrong. Another problem was that the program had to be done in Flash, so translating something that works in Python to something in Flash was another challenge.
Adarsh Basavalingappa - Mother of what man? :P I posted it because I was tired of listening to "WTF is this shit" :P Thought it was better to show why I was doing what I was doing
Flash? I did that using cpp.. check out this blog
I was able to get it running in a few hours I guess, you can do better :)
:) One of the first limitations is that it has to be done with ActionScript because it should run within a browser... But I'll definitely have a look at it. Seems interesting