Sunday, March 19, 2006

Hash Tables

Couple of days back, I got into a discussion about hash tables with one of my friends. Initially we were confused about what good it is over an array. Before explaining the confusion, I'll describe an example. (You know I strongly believe what my sixth standard teacher told me once: "Eg makes it easy.")

Suppose we want to store a table containing roll numbers and names of students:
(1, Pundarikakshya Purakayastha)
(2, Prasenjit Sarvadhikari)
(... well I think it is better to shorten the names ...)
(3, Arun)
(4, Varun)
(5, Darth Vader)

Okay we can continue with this. Well we can store this data in an array containing five elements and store the ith entry in a[i-1]. That would be good enough. We were confused what is the point of defining something like a hash table. Well we realised the point is we don't construct hash tables for such data. To give a real example, hash tables are used to handle data like path names of Linux executables indexed by the command names, like (just as an example)
(ls, /bin/ls)
(rm, /bin/rm)
(shutdown, /sbin/shutdown)
(vi, /usr/bin/vi)
(acroread, /usr/local/bin/acroread)
... and so on

Suppose there are 1355 commands and if we put the whole data in a 2 x 1355 array, it might be a time consuming job to find the path for a command you want. First you have to find the position of the command in the array and then check the corresponding path. If the command lies towards the end of the array, it will take some time to find it out.

So the way a hash table solves the problem is that it uses a hash function to convert the command name into a number and then store the path in a one dimensional array at the corresponding place. Like if, for example, for a suitable hash function, "ls" converts to 6, the computer sets a[5] = "/bin/ls". And so on. So when the computer has to retrieve the path for ls, it calculates the value of the hash function at "ls" and gets 6 as it's value and then gets the path at a[5] to be "/bin/ls". That works much faster than searching an array for "ls".

Now, as usual, life is not all that fair and it's very hard to construct hash functions which are "good" in the sense that they do not give the same number for two different commands and at the same time the numbers it produces are not too huge compared to the total number of commands. Well if one is interested how one solves those problems, one can read the article on Hash Tables in Wikipedia or some advanced book on algorithms.

22 comments:

वैभव Vaibhav said...

Does that mean that if a hash function returns 1 for some key and 2000 for the next - the remaining slots are 'waisted' memory chunks?

Vivek said...

I don't know if memory chunks have "waists" but some memory is wasted. Usual algorithm followed is that you start with a small array and a corresponding hash function which takes values from 1 to the size of the array. Once the array gets almost full, create a bigger array with a new hash function and "transfer" the data in the smaller array to the bigger one.

Anyway the Wiki article gives a quite detailed description of these issues. So I felt a bit lazy to write more! :)

kate said...

*dissolves in the geekiness*

Vivek said...

@ Kate

Come on, it isn't that geeky! :D. After all it can't be. My Geek Test score is only 31.36095%

Nikhil Joshi said...

is this hash the same as those objects created by the hash key codes in network security?

Vivek said...

Nikhil

I know that the hash functions are used in cryptography. And those are similar to the ones I described but may have the extra property that it has to be one way and collision free. By one way, I mean it will be easy to calculate the value of the hash function for a given data but given the value of the hash function, it should be difficult to guess the data. By collision free, I mean the hash value should take different values for different data points.

Nikhil Joshi said...

Vivek

Actually, this is the point I never could understand. One way is ok...but how do you garauntee the colision free status? Isn't it equivalent to say that the hash function is one one and on-to (bijective)?...whereas, by hash code a BIG messege (even in MB's) is encoded in a very small one (generally a few hundred KB's)....I don't think it is possible to have a bijective function from a big space (MB) to a small space (KB), since both are discrete and finite spaces!

Please, clear my doubts or correct the funda...

Vivek said...

Yeah! You are right, there will be collisions as the hash function takes only finitely many values whereas the input for the hash function may have infinitely many possibilities. Well, the main points ensured for security are:

(Taking the hash function to be h)

1. Given a data m, it should be difficult to find another data point n such that
h(m) = h(n).

2. It should be difficult to find m and n with the same hash value. So for this the range of values the hash function takes should be large enough.

But the hash functions used for hash tables need not have these properties. There are other easier ways of handling collisions in Hash Tables.

Nikhil Joshi said...

thanks boss...can you suggest me some good reference...I would like to clear up/learn these fundae a bit more seriously.

send a mail, if possible!

Anonymous said...

thanks vivek-it was really very useful and comprehensive-
now atleast i know what it means
but i still wonder if it really helps-and doesnt it take extra memory space

Anonymous said...

anonymous was me
sumedha

Amit Kulkarni said...

Like vivek explained you seem to be confused on hashes used in 'network security' and what are commonly used as Hash Tables to store key-value (or alternatively name-value pairs).

Collisions in name-value pairs are acceptable, they just lead to degradation in performance. Collisions in security hashing functions are completely unacceptable.

There are many hashes in network security

1) for storing passwords and checking passwords, like the infamous NT Lanman hash

2) for checksumming to check if file you downloaded is the same as that on server etc.. This is also related with cryptography.. MD5 is now officially broken, a cracker can do it in a few days with enough machines. Other algorithms are now recommended.

When a collision occurs in cryptography, security conscious people stop using that algorithm. Yes, a collision is always inevitable for whatever hashing function there is. And yes, brute force combined with clever algorithms will always raise the barrier for checksumming/password storing algorithms. This is an entire branch of Mathematics and Computer Science. That is why pure random numbers are necessary, to inject some uncertainty and reduce the probability of collisions (and thus cracking).

And yes, it is entirely possible (but not always) to reduce Megs of data to a few kb. It depends on many factors.

If you don't take offense, IMHO no book is good enough to clear your fundamentals. You have to sweat to really understand some things.

Vivek said...

Thank you, Amit. That was a really nice explanation! :)

Amit Kulkarni said...

vivek,

Nice intent for your blog! If you know some people and don't mind asking them, maybe we all can group blog on Mathematics with different applications in mind :)

Right now I don't know a thing about Math as I consistently managed to cram and get through it all. But I have chosen a Math topic from which to make my future living. I love Math when it is explained and I understand its applications.

I am trying to setup a geography blog too but no takers yet.

Sincerely,
amit

Vivek said...

Amit,

That is definitely a good idea! But I think the job of formally writing down maths is already done well by Wikipedia. So as far as formal stull is concerned we can write there as well. My aim for this blog was to write what I like in an informal fashion so that everyone can read. Of course, I cannot write whole of Maths in that fashion.

Do you have some topics in mind? I'll ask around if people have some ideas on it. Then we can definitely form a Math blog!

I saw your blog on Forest. It is quite interesting! :)

Vivek

Amit Kulkarni said...

Hi vivek,

Nothing formal, just informal. I can't stand formal stuff anyway :)

I am trying to understand certain things in Math and apply it to Geography. Pure Random numbers fascinates me, because they can be used in a variety of ways in Geography i.e stratified random sampling. Most of the modern Math techniques have been under-utilized e.g. Combinatorial Analytics.

What I want to do is explain in simple terms what a Math term means, the significance, where it is likely to be used i.e its applications etc.

I am not sure if you know about it, but Computational Fluid Dynamics CFD , which is basically Mechanical Engg. is used to predict Blood flow within blood vessels and better understand it!!! These kinds of things which are apparent to a Mechanical Engg could be useful to a variety of fields. And I want to be able to facilitate the tapping of such domain specific knowledge and share it out.

amit

kate said...

I scored only 20% on the geek test.
validation! :)

Dr Shanta Laishram said...

I am a total geek who is not yet a major! My score was 16.96252 %

samudrika said...

For the record, I got 27.6%. Total geek! and proud of it.

@vivek
with the first three posts so 'successful' I wonder what the rest of your blog will be like!

Vivek said...

Now for the long awaited replies. I got a bit busy in the last few days and also was trying to do a bit of reseach on already written blogs to see how much work has already been done and what has to be done.

Well now I'll reply each of you one by one.

To Amit Kulkarni:

I've been looking at blogs for the last few days and I have found a couple of bloggers who are writing about maths (thanks to Samudrika for one of them) and found a lot of Mathematicians writing. Well as of forming a group, I'll start searching soon after my impending work gets over. I've been a moderator of an Yahoo group and sometimes it becomes an hassle to keep it going! :D.

Still I like your idea and definitely want to follow it up. I just need some time.

Other than that I had been thinking on writing about my subject in a way which won't scare away non-mathematicians. (And of course, to write in such a way that people won't ask, "So what?" at the end of it! :) ) I don't seem to be getting a good grip of it and that also says why I am delaying my next post.

To Kate:

Welcome to the world of geeks! And wish the geekiness helps you get some fundamental breakthroughs soon! :)

To Shanta:

Too bad! I thought you were a better geek than that! ;)

To Samudrika:

Thanks! :).

Well it's a bit scary if the starting is too good. Thanks to all my readers and I hope to keep my readers interested!

Anonymous said...

only 21.something % boo hoo hoo

Vivek said...

At P.P.

Don't cry yaar, being a total geek is good enough, I think! ;). About the other blog let's see! Just now I'm crushed under work!

At Voice Within

Don't worry! We'll help you get your geekiness! :)