[+QUICK INFO+]  
TEAM MEMBERS :   Kenny Teng, Jeremy Ng, Shirlene Lim
ASSIGNMENT :   Write a program that reads in a video stream of someone's hand making numbers in sign language. The program detects the position, orientation and shape of the hand in order to translate the signs being made. Three different algorithms were investigated to see which one works best.
SPECIFICATIONS :   The system is written in Matlab. Initially, we wanted to translate words in sign language, but due to time constraints we had to restrict it to numbers.
RESULT:   Got an A in the class
   

[+OVERVIEW+]

The input video sequence has to be processed to locate the position and orientation of the hand before any translation can be done. Since we had limited time for research, design and implementation, we made some assumptions to restrict the complexity of the system. First, the camera recording the input sequence is mounted at a fixed location and orientation, so the background of the feed is fixed and relatively constant. Second, the position of the hand within the field of view is kept within a specific range. This ensures that there is always a hand in the input sequence and that its size stays roughly the same. These assumptions limit the computation and error checking required, since the main goal of the system is hand gesture translation.

With that said, let me introduce the pre-processing part of the system. Before the experiment begins, a video sequence of the empty background is recorded to obtain a mean image of the background. This background image is later subtracted from every frame to extract the foreground, i.e. the hand. After proper thresholding to eliminate noise, this yields a binary image of the hand.
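The background subtraction step can be sketched as follows. This is a minimal illustration in plain Python rather than the project's Matlab code; the function names, the tiny 3x3 frames and the threshold of 30 gray levels are assumptions made for the example, not values from the project.

```python
def mean_background(frames):
    """Average equally sized grayscale frames pixel by pixel."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def foreground_mask(frame, background, threshold=30):
    """Binary image: 1 where the frame differs enough from the background."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


# Two frames of the empty scene (hypothetical pixel values).
empty_frames = [
    [[10, 12, 11], [10, 11, 12], [11, 10, 10]],
    [[12, 10, 11], [10, 13, 12], [11, 12, 10]],
]
background = mean_background(empty_frames)

# One frame in which a bright "hand" covers three pixels.
frame = [[11, 200, 12], [10, 210, 205], [11, 11, 10]]
mask = foreground_mask(frame, background)
print(mask)  # [[0, 1, 0], [0, 1, 1], [0, 0, 0]]
```

Small differences caused by sensor noise fall below the threshold and are dropped, which is the role thresholding plays in the description above.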
From there, simple heuristics are applied to the binary image to find the position and orientation of the hand. The position is found by scanning the image three ways to locate the extremities of the hand (as shown on the left). The orientation is found by scanning three lines at the base of the hand to obtain approximate direction vectors for the edges of the hand. From those vectors, a middle orientation vector is computed (as shown on the right).
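As a rough sketch of the position step, the code below finds three extremities of the hand in a binary image. The page does not say which three scan directions were used, so this example assumes scans from the top, the left and the right (with the wrist entering from the bottom of the frame). The function name and the sample mask are hypothetical, and plain Python is used instead of Matlab.

```python
def hand_extremities(mask):
    """Return (top_row, left_col, right_col) of the foreground, or None if empty.

    Assumption: the hand enters from the bottom edge, so only the top,
    left and right extremities are needed.
    """
    rows_with_hand = [r for r, row in enumerate(mask) if any(row)]
    if not rows_with_hand:
        return None
    cols_with_hand = [c for row in mask for c, v in enumerate(row) if v]
    return rows_with_hand[0], min(cols_with_hand), max(cols_with_hand)


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]
print(hand_extremities(mask))  # (1, 1, 3)
```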
[+BOUNDARY CONTOUR+]
Boundary contour (as we call it) is the process of determining the Euclidean distance from each point on the edge of the hand to the center of mass (COM). We used this method to differentiate fingers from non-fingers: fingers have a distinctive length, which lets us easily determine whether a given point is a finger, a thumb or neither.
The cut-off points obtained from the orientation step are used as the start and end points of an edge-following heuristic. We walk along the edge of the binary image, saving the points in sequence and computing the Euclidean distance between each point and the COM. The peaks (local maxima) of this distance are the points furthest from the COM, that is, they mark the positions of the fingertips.
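The distance-to-COM signature described above can be sketched as follows, in plain Python rather than the original Matlab. The example assumes the edge points have already been traced and stored in order; the edge-following step itself is not shown. All function names, the sample boundary and the sample COM are illustrative assumptions.

```python
import math


def center_of_mass(mask):
    """Mean (row, col) of all foreground pixels in a binary image."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(pts)
    return sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n


def distance_signature(boundary, com):
    """Euclidean distance from each ordered boundary point to the COM."""
    return [math.hypot(r - com[0], c - com[1]) for r, c in boundary]


def local_maxima(signal):
    """Indices of interior samples higher than the left neighbour and not
    lower than the right neighbour (a flat top counts once)."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] >= signal[i + 1]]


# COM of a small cross-shaped blob.
print(center_of_mass([[0, 1, 0], [1, 1, 1], [0, 1, 0]]))  # (1.0, 1.0)

# Hypothetical ordered boundary with two "fingertips" at rows 0.
boundary = [(4, 0), (2, 0), (0, 0), (2, 1), (0, 2), (2, 3), (4, 3)]
signature = distance_signature(boundary, com=(4.0, 1.5))
print(local_maxima(signature))  # [2, 4]
```

The two peak indices correspond to the boundary points (0, 0) and (0, 2), the two points furthest from the assumed COM.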
From the graphs, we determine the height of the maximum peak. For each peak, the smallest difference between the peak and its closest minimum is computed. If that value is above 20% of the height of the highest finger, the peak is encoded as ‘1’; peaks below that threshold are encoded as ‘0’. For example, [1 1 1 1 1] is categorized as five fingertips, thus representing the number ‘5’.
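One possible reading of this rule is sketched below in plain Python (not the project's Matlab code). For every peak in the distance signature, the code walks downhill to the valley on each side, takes the smaller of the two drops, and compares it with 20% of the tallest peak. Taking the smaller drop is an interpretation of the description above, and the function name and sample signal are made up for the example.

```python
def encode_peaks(signal, ratio=0.2):
    """Encode each peak as 1 (fingertip) or 0 (insignificant bump)."""
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i - 1] < signal[i] >= signal[i + 1]]
    if not peaks:
        return []
    tallest = max(signal[i] for i in peaks)
    codes = []
    for i in peaks:
        # Walk downhill to the adjacent valley on the left.
        left = i
        while left > 0 and signal[left - 1] <= signal[left]:
            left -= 1
        # Walk downhill to the adjacent valley on the right.
        right = i
        while right < len(signal) - 1 and signal[right + 1] <= signal[right]:
            right += 1
        drop = min(signal[i] - signal[left], signal[i] - signal[right])
        codes.append(1 if drop > ratio * tallest else 0)
    return codes


# Hypothetical signature: three clear fingertips and one small bump.
signal = [2, 10, 3, 9, 4, 4.5, 4, 8, 2]
codes = encode_peaks(signal)
print(codes)       # [1, 1, 0, 1]
print(sum(codes))  # 3
```

Summing the codes gives the number of raised fingertips, which matches the [1 1 1 1 1] example for the number 5.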
[+EQUIDISTANCE SKELETON+]
Skeletonization is the process of reducing foreground regions in a binary image to a skeletal remnant that largely preserves the extent and connectivity of the original region while throwing away most of the original foreground pixels. To see how this works, imagine that the foreground regions in the input binary image are made of some uniform slow-burning material. Light fires simultaneously at all points along the boundary of this region and watch the fire move into the interior. At points where the fire traveling from two different boundaries meets itself, the fire extinguishes itself, and the points at which this happens form the so-called “quench line”. This line is the skeleton. Under this definition it is clear that thinning also produces a sort of skeleton.
We classify each pixel as we move along the skeleton using the identifiers “endpoint”, “branch” and “normal point”. An “endpoint” is a pixel that has no other valid neighboring pixels. A “branch” is a pixel that has more than one valid neighboring pixel. A branch pixel is marked so that, after reaching the endpoint of one of the branching pathways, our window tracking returns to the branch point and moves along the next valid pathway from that pixel.
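The pixel classification can be approximated without tracing by counting the 8-connected neighbours of every skeleton pixel. This is a static simplification of the window-tracking approach described above, written in plain Python rather than Matlab: a pixel with at most one neighbour is treated as an endpoint, one with three or more as a branch, and anything else as a normal point. The function name and the Y-shaped test skeleton are assumptions for illustration.

```python
def classify_skeleton(skel):
    """Label every skeleton pixel by its number of 8-connected neighbours."""
    rows, cols = len(skel), len(skel[0])
    labels = {}
    for r in range(rows):
        for c in range(cols):
            if not skel[r][c]:
                continue
            neighbours = sum(
                skel[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if neighbours <= 1:
                labels[(r, c)] = "endpoint"
            elif neighbours >= 3:
                labels[(r, c)] = "branch"
            else:
                labels[(r, c)] = "normal"
    return labels


# A Y-shaped skeleton: two "fingers" joining into one stem.
skel = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
labels = classify_skeleton(skel)
print(sorted(p for p, kind in labels.items() if kind == "endpoint"))
# [(0, 0), (0, 4), (4, 2)]
print(labels[(2, 2)])  # branch
```

Neighbour counting can mislabel pixels next to a junction when the skeleton has right-angle corners, which is one reason a tracing approach that remembers where it came from may be preferable.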
Using the maximum distance stored as a reference, distances that are ¼ of the maximum are categorized as a thumb and encoded as ‘1’. Distances that are more than ½ of the maximum are categorized as a finger and encoded as ‘2’. All other distances are considered invalid and encoded as ‘0’. Based on the encoding, we determine the number the hand represents. For example, [2 2 2 2 1] is classified as the number ‘5’.
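A small sketch of this encoding, in plain Python rather than Matlab, is given below. The text leaves the exact thumb range open, so the example assumes that a thumb is any distance from ¼ up to ½ of the maximum; that boundary choice, the function name and the sample distances are all assumptions.

```python
def encode_distances(distances):
    """Encode each distance as 2 (finger), 1 (thumb) or 0 (invalid)."""
    if not distances:
        return []
    longest = max(distances)
    codes = []
    for d in distances:
        if d > longest / 2:
            codes.append(2)   # more than half of the maximum: finger
        elif d >= longest / 4:
            codes.append(1)   # assumed thumb range: a quarter to a half
        else:
            codes.append(0)   # too short to be a digit
    return codes


# Hypothetical branch distances in pixels.
codes = encode_distances([40, 38, 36, 30, 15, 5])
print(codes)                              # [2, 2, 2, 2, 1, 0]
print(sum(1 for c in codes if c != 0))    # 5
```

Counting the non-zero codes reproduces the [2 2 2 2 1] example for the number 5.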
[+THINNING+]
Thinning is a morphological operation that is used to remove selected foreground pixels from binary images. It can be used for several applications, but is particularly useful for skeletonization. It is commonly used to tidy up the output of edge detectors by reducing all lines to single pixel thickness. Thinning is normally only applied to binary images, and produces another binary image as output. The behavior of thinning is determined by the structuring elements used for the specific points being “thinned”.
To implement thinning, first translate the origin of the structuring element (its middle) to each possible pixel position in the image. If the foreground and background pixels in the structuring element exactly match the foreground and background pixels in the image, the image pixel underneath the origin of the structuring element is set to background. Otherwise it is left unchanged.
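A single thinning pass with one 3x3 structuring element can be sketched as below, in plain Python rather than the project's Matlab. In the element, 1 means foreground, 0 means background and None means "don't care". The particular element, the choice to skip border pixels and the function name are assumptions for the example; a full thinning normally repeats such passes with rotated elements until the image stops changing.

```python
def thin_once(image, element):
    """Apply one thinning pass with a 3x3 structuring element.

    Matches are tested against the input image, and matching pixels are
    set to background in a copy. Border pixels are left untouched.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if all(element[i][j] is None
                   or element[i][j] == image[r - 1 + i][c - 1 + j]
                   for i in range(3) for j in range(3)):
                out[r][c] = 0
    return out


# Element that matches pixels on the top edge of a region.
element = [
    [0, 0, 0],
    [None, 1, None],
    [1, 1, 1],
]
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
thinned = thin_once(image, element)
print(thinned[1])  # [0, 1, 0, 1, 0]
```

Only the middle pixel of the top edge has background above it and a full row of foreground below it, so it is the only pixel removed in this pass.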
For the thinning implementation, the classification stage is very similar to that of the equidistance skeleton, and the same process of tracing through the skeleton is needed. However, the algorithm is less complex since we do not need to handle the split ends seen in the skeleton image. Basically, we calculate the length of each branch and discard insignificant branches whose length is shorter than a given threshold. Relative to the longest branch, we count how many branches are short and how many are long. Long branches represent stretched fingers, while short branches represent folded fingers. We then encode them as binary values: 1 for stretched fingers and 0 for folded fingers. From the binary values, we can classify which number the hand represents.
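The branch-length classification can be illustrated with the short Python sketch below (the project itself used Matlab). The page gives neither the pruning threshold nor the cut-off between short and long branches, so `min_length=5` and `long_ratio=0.6` are placeholder values chosen for the example, and the function name is hypothetical.

```python
def encode_branches(lengths, min_length=5, long_ratio=0.6):
    """Drop insignificant branches, then encode long ones as 1, short as 0."""
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        return []
    longest = max(kept)
    return [1 if n >= long_ratio * longest else 0 for n in kept]


# Hypothetical branch lengths in pixels; 3 and 2 are pruned as noise.
codes = encode_branches([30, 28, 3, 12, 27, 2, 10])
print(codes)       # [1, 1, 0, 1, 0]
print(sum(codes))  # 3
```

Here three branches count as stretched fingers and two as folded fingers, so under these assumed thresholds the hand would be read as the number 3.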

Among the three algorithms, boundary contour was the most reliable and stable, and the equidistance skeleton performed worst. We attribute the poor performance of the equidistance skeleton to the fact that the skeletons had a lot of “noise”: there were many spurious branches that were mistaken for the key branch points where the fingers split. As a result, its results weren’t as stable or reliable. Thinning performed reasonably well because it had more “test” cases in the program: it was tested more carefully against more possible cases and classified accordingly. It also did not suffer from the branching “noise” of the equidistance skeleton method.

 

 
Paper
You can download our paper in pdf format.
©2007 Kenny Teng. All rights reserved