|
| [+QUICK
INFO+] |
|
| TEAM
MEMBERS : |
|
Kenny Teng
Jeremy Ng
Shirlene Lim |
| ASSIGNMENT
: |
|
Write a program that reads a video stream of someone's hand
signing numbers in sign language. The program detects the
position, orientation and shape of the hand in order to
translate the signs being made. Three different algorithms
were investigated to determine which method works best. |
| SPECIFICATIONS
: |
|
The system is written in MATLAB. Initially, we wanted to
translate words in sign language, but due to time
constraints we had to restrict it to numbers. |
| RESULT: |
|
Got
an A in the class |
| |
|
|
| [+OVERVIEW+] |
The input video sequence had to be processed to locate the
position and orientation of the hand before any translation
could be done. Because we had limited time for research,
design and implementation, we made some assumptions to
restrict the complexity of the system. The first assumption
is that the camera recording the input sequence is mounted
at a fixed location and orientation, so the background of
the feed is fixed and relatively constant. The second is
that the position of the hand within the field of view is
kept within a specific range. This ensures that there is
always a hand in the input sequence and that the size of
the hand stays roughly the same. These assumptions limit
the computation and error checking the system has to
perform, since its main goal is hand gesture translation. |
| With that said, here is the pre-processing part of the
system. Before the experiment begins, a video sequence of
the empty background is recorded to obtain a mean image of
the background. This background image is later subtracted
from every frame to extract the foreground, i.e. the hand.
After proper thresholding to eliminate noise, this yields
binary images of the hand. |
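The pre-processing step can be sketched roughly as follows. This is only an illustration in plain Python, not the original MATLAB code: the function names, the list-of-rows image layout and the threshold of 30 grey levels are all assumptions made for the example.

```python
def mean_background(frames):
    """Average equally sized grayscale frames (each a list of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def extract_foreground(frame, background, threshold=30):
    """Binary mask: 1 where the frame differs enough from the background.

    The threshold value is hypothetical; it only has to sit above the
    camera noise level.
    """
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


# Two slightly noisy recordings of the empty background.
empty_frames = [
    [[10, 12, 11], [11, 10, 12]],
    [[12, 10, 11], [11, 12, 10]],
]
background = mean_background(empty_frames)

# A frame in which a bright object covers the middle column.
frame = [[11, 200, 13], [12, 190, 11]]
mask = extract_foreground(frame, background)
print(mask)  # [[0, 1, 0], [0, 1, 0]]
```

Small differences caused by noise fall below the threshold and disappear from the mask, which is why the thresholding comes after the subtraction.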
 |
|
 |
From there, simple heuristics are applied to the binary
image to find the position and orientation of the hand.
The position is found by scanning the image from three
directions to locate the extremities of the hand (as
shown on the left). The orientation is found by scanning
three lines at the base of the hand to obtain approximate
direction vectors for the edges of the hand. From those
vectors, a middle orientation vector is computed
(as shown on the right). |
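A minimal sketch of both heuristics is given below, under several assumptions that are not stated in the text: the hand enters from the bottom of the image, so the three scans find the topmost, leftmost and rightmost foreground pixels; the scan lines are given from the bottom up; and each edge vector is taken between the lowest and highest scan line. The names `hand_extremities` and `orientation_vector` are hypothetical.

```python
def hand_extremities(mask):
    """Return (top row, leftmost column, rightmost column) of the foreground."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    return min(rows), min(cols), max(cols)


def orientation_vector(mask, base_rows):
    """Estimate the hand orientation from scan lines near its base.

    base_rows lists the scan rows from the bottom up. On each row the
    left and right edge of the hand is located; the left and right edge
    vectors run from the first to the last scan row, and their average
    is returned as (dx, dy) in image coordinates (y grows downwards).
    """
    edges = []
    for r in base_rows:
        cols = [c for c, v in enumerate(mask[r]) if v]
        edges.append((min(cols), max(cols)))
    (left0, right0), (left1, right1) = edges[0], edges[-1]
    dy = base_rows[-1] - base_rows[0]
    left_vec = (left1 - left0, dy)
    right_vec = (right1 - right0, dy)
    return ((left_vec[0] + right_vec[0]) / 2, dy)


# A small band of foreground leaning to the right as it goes up.
mask = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
extremities = hand_extremities(mask)
direction = orientation_vector(mask, [4, 3, 2])
print(extremities, direction)  # (0, 1, 4) (1.0, -2)
```

The negative `dy` simply reflects that the hand points towards the top of the image.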
 |
|
| [+BOUNDARY
CONTOUR+] |
| Boundary contour (as we call it) is the process of
determining the Euclidean distance from any point on the
edge of the hand in the binary image to the center of
mass (COM). We used this method to differentiate fingers
from non-fingers: fingers have a distinctive length,
which lets us easily determine whether a certain point
is a finger, a thumb or neither. |
| The cut-off points obtained from the orientation
determination process are used as the start point and
end point of an edge-tracing heuristic. We step through
the points along the edge of the binary image, saving
them in sequence, and at the same time compute the
Euclidean distance between each point and the COM. The
peaks (maxima) are the points furthest from the COM,
that is, they mark the positions of the fingertips. |
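The distance computation might look something like this. The sketch assumes the edge points have already been collected in order (the tracing itself is left out), and the helper names are made up for the example.

```python
import math


def center_of_mass(mask):
    """Mean (row, column) of all foreground pixels in a binary mask."""
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)


def distance_profile(contour, com):
    """Euclidean distance from each ordered contour point to the COM."""
    return [math.hypot(r - com[0], c - com[1]) for r, c in contour]


def local_maxima(profile):
    """Indices where the profile rises and then stops rising.

    The first and last samples are never reported, since they have
    only one neighbour.
    """
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] < profile[i] >= profile[i + 1]]


com = center_of_mass([[1, 1], [1, 1]])

# Toy contour measured against a COM placed at the origin.
contour = [(0, 1), (0, 3), (0, 1), (0, 5), (0, 2)]
profile = distance_profile(contour, (0, 0))
peaks = local_maxima(profile)
print(profile, peaks)  # [1.0, 3.0, 1.0, 5.0, 2.0] [1, 3]
```

Each index in `peaks` points back into the saved contour sequence, which gives the image position of a candidate fingertip.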
 |
|
|
From the graphs, we determine the height of the maximum
peak. For each peak, the smallest difference between
the peak and its closest minimum is computed. If that
value is above 20% of the height of the highest finger,
the peak is encoded as ‘1’, whilst peaks below that
threshold are encoded as ‘0’. For example,
[1 1 1 1 1] is categorized as five fingertips, thus
representing the number ‘5’. |
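One way to read the encoding rule is sketched below. It assumes that "closest minimum" means the nearest valley on either side of a peak, and that the smaller of the two drops is the value compared against 20% of the highest peak. `encode_peaks` is a hypothetical name.

```python
def encode_peaks(profile, ratio=0.2):
    """Encode each peak of a distance profile as 1 (finger) or 0."""
    peaks = [i for i in range(1, len(profile) - 1)
             if profile[i - 1] < profile[i] >= profile[i + 1]]
    if not peaks:
        return []
    highest = max(profile[i] for i in peaks)
    code = []
    for i in peaks:
        # Walk downhill on each side until the profile starts rising again.
        left = i
        while left > 0 and profile[left - 1] <= profile[left]:
            left -= 1
        right = i
        while right < len(profile) - 1 and profile[right + 1] <= profile[right]:
            right += 1
        drop = min(profile[i] - profile[left], profile[i] - profile[right])
        code.append(1 if drop > ratio * highest else 0)
    return code


# Three clear peaks and one small bump that should be rejected.
profile = [2, 10, 3, 9, 2, 3, 2.5, 8, 1]
code = encode_peaks(profile)
print(code, sum(code))  # [1, 1, 0, 1] 3
```

Counting the ones is consistent with the [1 1 1 1 1] example, where five fingertips represent the number 5.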
|
| [+EQUIDISTANCE
SKELETON+] |
| Skeletonization is the process of reducing foreground
regions in a binary image to a skeletal remnant that
largely preserves the extent and connectivity of the
original region while throwing away most of the original
foreground pixels. To see how this works, imagine that the
foreground regions in the input binary image are made of
some uniform slow-burning material. Light fires
simultaneously at all points along the boundary of this
region and watch the fire move into the interior. At points
where the fire traveling from two different boundaries
meets itself, the fire extinguishes itself, and the points
at which this happens form the so-called ‘quench line’.
This line is the skeleton. Under this definition it is
clear that thinning also produces a sort of skeleton. |
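The fire analogy can be simulated directly on a small image. The sketch below only records the step at which the fire reaches each pixel; it assumes 4-connected spreading and treats everything outside the image as background. It is an illustration of the idea, not the skeletonization routine used in the project.

```python
def burn_times(mask):
    """Step at which the fire reaches each foreground pixel (0 = background)."""
    rows, cols = len(mask), len(mask[0])
    time = [[0] * cols for _ in range(rows)]
    remaining = {(r, c) for r in range(rows) for c in range(cols) if mask[r][c]}
    step = 0
    while remaining:
        step += 1
        # Pixels with a burnt or background 4-neighbour catch fire now.
        front = {(r, c) for (r, c) in remaining
                 if any((r + dr, c + dc) not in remaining
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))}
        for r, c in front:
            time[r][c] = step
        remaining -= front
    return time


# A horizontal bar, three pixels thick.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
times = burn_times(mask)
for row in times:
    print(row)
```

In this example the pixels with burn time 2 lie along the middle of the bar, which is where the fires from the top and bottom edges meet.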
 |
| We classify each pixel as we move along the skeleton
using the identifiers “endpoint”, “branch” and
“normal point”. An “endpoint” is a pixel that has no
other valid neighboring pixels. A “branch” is a pixel
that has more than one valid neighboring pixel. A branch
pixel is marked so that, after reaching the endpoint of
one of the branching pathways, our window tracking
returns to the branch point and moves along the next
valid pathway from that pixel. |
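The tracking described above could be sketched as follows. The sketch assumes that a "valid" neighbor is an 8-connected skeleton pixel that has not been visited yet, and that the skeleton is one pixel wide; `trace_skeleton` and the label strings are names chosen for the example.

```python
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def trace_skeleton(skeleton, start):
    """Walk a skeleton from start and label every pixel that is reached."""
    rows, cols = len(skeleton), len(skeleton[0])
    visited = {start}
    labels = {}
    pending = []  # marked branch points that may still have untraced pathways
    current = start
    while True:
        r, c = current
        valid = [(r + dr, c + dc) for dr, dc in NEIGHBORS
                 if 0 <= r + dr < rows and 0 <= c + dc < cols
                 and skeleton[r + dr][c + dc]
                 and (r + dr, c + dc) not in visited]
        if current not in labels:
            if not valid:
                labels[current] = "endpoint"
            elif len(valid) > 1:
                labels[current] = "branch"
            else:
                labels[current] = "normal"
        if valid:
            if len(valid) > 1:
                pending.append(current)  # come back here for the other pathways
            current = valid[0]
            visited.add(current)
        elif pending:
            current = pending.pop()  # return to the most recent branch point
        else:
            return labels


# A Y-shaped skeleton traced from the bottom of its stem.
skeleton = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
labels = trace_skeleton(skeleton, (4, 2))
endpoints = sorted(p for p, kind in labels.items() if kind == "endpoint")
print(labels[(2, 2)], endpoints)  # branch [(0, 0), (0, 4)]
```

Keeping the marked branch points on a stack means the tracker always resumes from the most recently passed branch, which matches the "return to the branch point" behaviour in the text.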
 |
 |
Using the maximum distance as a reference, distances
that are ¼ of the maximum distance stored are
categorized as a thumb and encoded as ‘1’. Distances
that are more than ½ of the maximum distance are
categorized as a finger and encoded as ‘2’. All other
distances are invalid and encoded as ‘0’. Based on the
encoding, we determine the number the hand represents.
For example, [2 2 2 2 1] is classified as the number
‘5’. |
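As a rough illustration of this encoding, the sketch below makes one explicit assumption that the text does not spell out: a distance counts as a thumb when it lies between ¼ and ½ of the maximum distance. The function name and parameters are hypothetical.

```python
def encode_distances(distances, thumb_ratio=0.25, finger_ratio=0.5):
    """Encode each stored distance as 2 (finger), 1 (thumb) or 0 (invalid).

    Assumption: the thumb band is [thumb_ratio, finger_ratio] of the
    maximum distance; the original thresholds may have been applied
    differently.
    """
    longest = max(distances)
    code = []
    for d in distances:
        if d > finger_ratio * longest:
            code.append(2)
        elif d >= thumb_ratio * longest:
            code.append(1)
        else:
            code.append(0)
    return code


# Four long distances, one medium one and a short spur.
code = encode_distances([40, 38, 36, 30, 15, 4])
print(code)  # [2, 2, 2, 2, 1, 0]
```

Dropping the trailing 0 leaves [2 2 2 2 1], the pattern given above for the number 5.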
|
| [+THINNING+]
|
| Thinning is a morphological
operation that is used to remove selected foreground
pixels from binary images. It can be used for
several applications, but is particularly useful
for skeletonization. It is commonly used to tidy
up the output of edge detectors by reducing all
lines to single pixel thickness. Thinning is normally
only applied to binary images, and produces another
binary image as output. The behavior of thinning
is determined by the structuring elements used
for the specific points being “thinned”. |
| To
implement thinning, first, translate the
origin of the structuring element (middle)
to each possible pixel position in the image.
If foreground and background pixels in the
structuring element exactly match foreground
and background pixels in the image, the
image pixel underneath the origin of the
structuring element is set to background.
Otherwise it is left unchanged. |
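A single thinning pass with one structuring element can be sketched like this. The sketch assumes a 3x3 element whose cells are 1 (foreground), 0 (background) or `None` ("don't care", a common convention that the text does not mention), and it skips the image border for simplicity.

```python
def thin_once(image, element):
    """Apply one thinning pass with a 3x3 structuring element.

    All matches are tested against the input image; results are
    written to a copy so that earlier deletions do not affect later
    matches in the same pass.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            match = all(element[i][j] is None
                        or element[i][j] == image[r + i - 1][c + j - 1]
                        for i in range(3) for j in range(3))
            if match:
                out[r][c] = 0  # pixel under the element origin becomes background
    return out


# Element that removes pixels on a flat upper edge.
element = [
    [0, 0, 0],
    [None, 1, None],
    [1, 1, 1],
]
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
thinned = thin_once(image, element)
for row in thinned:
    print(row)
```

In practice the pass is repeated with rotated versions of the element until the image stops changing; only the single pass is shown here.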
 |
|
  |
For the thinning implementation, the classification
stage is similar to that of the equidistance skeleton,
and the same process of tracing through the skeleton
is needed. However, the algorithm is less complex,
since we do not need to take care of the split ends
seen in the skeleton image. Basically, we calculate
the length of each branch and discard insignificant
branches whose length is shorter than a given
threshold. Relative to the longest branch, we count
how many short branches and how many long branches
there are. Long branches represent stretched fingers,
while short branches represent folded fingers. We
then encode them as binary values: 1 for stretched
fingers and 0 for folded fingers. Based on the binary
values, we can classify which number the hand
represents. |
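The branch classification could be sketched as below. Both thresholds are invented for the example (a minimum length of 5 pixels, and "long" meaning at least 60% of the longest branch); the text does not give the values that were actually used.

```python
def classify_branches(lengths, min_length=5, long_ratio=0.6):
    """Drop insignificant branches, then encode the rest as 1 or 0.

    1 stands for a long branch (stretched finger) and 0 for a short
    branch (folded finger). min_length and long_ratio are hypothetical.
    """
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        return []
    longest = max(kept)
    return [1 if n >= long_ratio * longest else 0 for n in kept]


# Six traced branches, two of which are too short to matter.
code = classify_branches([30, 28, 3, 12, 26, 2])
print(code)  # [1, 1, 0, 1]
```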
|
Among the three algorithms, boundary contour was the most
reliable and stable. The equidistance skeleton performed
worst. We attribute this to the fact that its skeletons had
a lot of “noise”: many spurious branches were mistaken for
key branch points representing the branching of fingers, so
the results were not as stable or reliable. Thinning
performed reasonably well because it had more “test” cases
in the program: it was tested more carefully against more
possible cases and classified accordingly. It also did not
suffer from the “branching” noise of the equidistance
skeleton method.
|
|
Paper You can download our paper in pdf
format.
|

|
|
|