The Accidental Search Engine.
Well, OK. It’s not an accident. I decided to experiment with writing a search engine. I’ve run into a few problems and still have quite a few parts I need to implement, but all in all, it has worked a lot better than I had planned.
There are the usual problems when you take on something this big. The first is the horsepower of the machines you are using, or in this case, machine. Memory matters because databases are memory intensive, and drive space matters because you are storing so much data to make this happen.
I did try to make it scalable. It is set up so that I can add more machines to handle different tasks, as well as to load balance the traffic coming in from users doing searches. A real test of this, though, won’t happen unless I add more machines, and that won’t happen unless I magically start making a lot of money off the Google ads running on the side.
There is a lot of work ahead. I’ve got some ideas about ranking web page relevance that I’m going to play around with. I also need to work on the crawler and how it scans and stores links. That part isn’t quite working how I want it to when I run multiple copies of it.
Another problem I have, which is mostly related to resources, is that I keep running into the big players on the internet. When I run into a site like Microsoft or eBay, the crawler winds up spending all of its time indexing the thousands of links these guys have to their own sites. This wouldn’t be a problem with unlimited resources, but I want a more diverse collection of pages. And coming up with an exclude list is no guarantee the results will be different. Imagine trying to come up with a list of all of the large sites on the internet.
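One way around the big-site problem that doesn't need an exclude list is to give every host the same page budget. Below is a minimal sketch of that idea in Python. It is not the crawler described here, just an illustration. The class name HostBudgetFrontier, its methods, and the max_per_host limit of 50 are all made up for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostBudgetFrontier:
    """Crawl queue that accepts at most `max_per_host` URLs per host.

    Hypothetical helper for illustration; the limit is an assumed value.
    """

    def __init__(self, max_per_host=50):
        self.max_per_host = max_per_host
        self.queue = deque()               # URLs waiting to be fetched
        self.seen = set()                  # URLs already accepted
        self.per_host = defaultdict(int)   # accepted-URL count per host

    def add(self, url):
        """Queue `url`; return False if it is unusable, a duplicate, or over budget."""
        host = urlsplit(url).hostname
        if host is None or url in self.seen:
            return False
        if self.per_host[host] >= self.max_per_host:
            return False                   # this host has used up its share
        self.seen.add(url)
        self.per_host[host] += 1
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

The crawler would call add() for every link it finds and next_url() to pick what to fetch. Once a host hits its budget, further links to it are simply dropped, so a site with thousands of internal links can't crowd out everyone else. The catch is that this counts by exact hostname, so a big site spread over many subdomains still gets one budget per subdomain.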
So there it is. Search on a very tight budget.