More search stuff
Ok, time for some boring stuff.
I’ve made some major improvements to the back end of my search engine. Response times are now under a second for most queries. There are still some edge cases that hang things up: I was able to construct a query that took 15 minutes. But I think the most common cases are handled. Boolean searches are now possible, and scoring is a bit more accurate, I hope.
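For anyone wondering what "Boolean search" means in practice, here is a minimal sketch of the general idea over an in-memory inverted index. It is an illustration only, not the code running on the back end: the function names (`build_index`, `boolean_search`), the `must` / `any_of` / `must_not` parameters, and the sample documents are all made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_search(index, all_ids, must=(), any_of=(), must_not=()):
    """AND the `must` terms, OR the `any_of` terms, then remove `must_not` hits."""
    results = set(all_ids)
    for term in must:
        # AND: keep only documents that contain this term
        results &= index.get(term.lower(), set())
    if any_of:
        # OR: a document qualifies if it contains at least one of these terms
        either = set()
        for term in any_of:
            either |= index.get(term.lower(), set())
        results &= either
    for term in must_not:
        # NOT: drop documents that contain this term
        results -= index.get(term.lower(), set())
    return results


# Hypothetical sample data, just to exercise the functions
docs = {
    1: "BBC news front page",
    2: "my blog about search engines",
    3: "BBC blog about search",
}
index = build_index(docs)
hits = boolean_search(index, docs.keys(), must=["bbc"], must_not=["news"])
```

With the sample data, `hits` contains only document 3: it mentions "bbc" but not "news". A real index would live in the database rather than in a dictionary, but the set operations are the same idea.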
I wiped the database clean and started repopulating it from a different seed address (my own), so the sites I link to and the ones I have mentioned here will most likely be in there.
Right now about 5,000 pages are indexed, and 221,000 URLs are left to be scanned. I read somewhere that you should get about 8 links per page scanned, but I am getting closer to 50 per page. I think that’s because I’ve scanned blogs and have managed to start indexing the BBC news site (I have a couple of BBC links in my blog).
I can now run multiple crawlers across multiple machines across the internet if I so desire. I’ve limited them to 1000 sites per night and I am only running two, so I don’t piss off my gracious host (when your T1 is free, you try to be nice so you can keep it free).
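To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the nightly cap means 1000 URLs fetched per crawler and that the queue stops growing, which it won't, since every scanned page adds more links. The constant and function names are just for illustration.

```python
import math

QUEUED_URLS = 221_000               # URLs still waiting to be scanned
CRAWLERS = 2                        # crawlers currently running
URLS_PER_CRAWLER_PER_NIGHT = 1_000  # assumption: the nightly cap counts URLs fetched


def nights_to_drain(queued, crawlers, per_crawler):
    """Nights needed to work through the queue if no new URLs were added."""
    return math.ceil(queued / (crawlers * per_crawler))


nights = nights_to_drain(QUEUED_URLS, CRAWLERS, URLS_PER_CRAWLER_PER_NIGHT)
print(nights)  # 111
```

So even with a frozen queue, two polite crawlers would need about 111 nights to catch up.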
Some things I plan on doing when I get the time…
1. Clean up the HTML
2. Create a logo
3. Write the typical “About”, “Legal”, and “Contact” pages
4. Create a page that lets people submit addresses to be indexed
5. Figure out a way to “update” the index for pages already scanned and how often to scan them
6. Write a better algorithm for getting the abstract from the article.
7. Use the document title in the link instead of the link itself
8. Compress similar pages together (like subpages of a blog)
9. Write a relevance ranking algorithm to complement in-text scoring.
10. Do more HTML parsing for relevance.
11. Figure out how to span multiple machines with the DB itself as it grows
12. Check out load balancing for both Apache and the DB. I’m sure my poor little web server couldn’t handle a heavy load.
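As a starting point for items 6 and 7 above, here is a minimal sketch that pulls the document title and a crude abstract out of a page using Python's standard `html.parser` module. It is one simple way to do it, not the engine's actual approach: the class name `TitleAndTextParser`, the helper `summarize`, and the 160-character cutoff are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> text and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def summarize(html, max_chars=160):
    """Return (title, abstract); the abstract is the first max_chars of visible text."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the abstract reads as a single line
    text = " ".join(" ".join(parser.text_parts).split())
    return parser.title.strip(), text[:max_chars]


# Hypothetical page, just to exercise the function
page = (
    "<html><head><title>My Blog</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><p>First  post here.</p></body></html>"
)
title, abstract = summarize(page)
```

The title can then be used as the link text in the results, with the abstract shown underneath. Cutting at a fixed character count can chop a word in half, and navigation text will leak into the abstract, so a better version would probably look for the first real paragraph instead.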
Maybe I should give my search site its own blog.
I think I might need to study up on information retrieval as well.