More Search Engine Stuff
I’ve spent some time this weekend improving my search engine. It’s amazing what a lack of money and plenty of boredom can cause. Not much of the work is immediately visible, but it’s there. Most of it went into the crawler and the formatting of the search results. I also wrote an “about” page.
The crawler had a problem: it never released URLs from the queue, so whenever the crawler started again, it would start with double the number of links it was supposed to scan. It has code to prevent rescanning web pages, but that check takes time. With 1000 URLs that isn’t a lot of time, but add 1000 more every time it runs and it starts to cause problems. The crawler now clears the queue for its specific instance before it determines what it needs to scan. This also handles the case where a crawler fails and is restarted. Along the way, I accidentally messed up the search data, so I had to reset the database.
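The queue fix can be sketched roughly as follows. This is only an illustration, not the engine’s actual code: the SQLite tables `scanned` and `queue`, the `crawler_id` column, and the function names are all assumptions made up for the example.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scanned (url TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE queue ("
        " crawler_id TEXT, url TEXT,"
        " PRIMARY KEY (crawler_id, url))"
    )
    return conn


def start_run(conn, crawler_id, candidate_urls):
    """Reset this crawler's queue, then enqueue only URLs not yet scanned."""
    with conn:  # commits on success, rolls back on error
        # Drop whatever a previous (possibly crashed) run left behind,
        # but only for this crawler instance.
        conn.execute("DELETE FROM queue WHERE crawler_id = ?", (crawler_id,))
        for url in candidate_urls:
            already = conn.execute(
                "SELECT 1 FROM scanned WHERE url = ?", (url,)
            ).fetchone()
            if already is None:
                conn.execute(
                    "INSERT OR IGNORE INTO queue (crawler_id, url) VALUES (?, ?)",
                    (crawler_id, url),
                )
    rows = conn.execute(
        "SELECT url FROM queue WHERE crawler_id = ? ORDER BY url", (crawler_id,)
    )
    return [row[0] for row in rows]
```

Because the delete is scoped to one `crawler_id`, other crawler instances keep their queues, and a restarted crawler begins from a clean slate instead of stacking a new batch on top of the old one.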
The search results now have an abstract based on where the keywords are located. Previously I just took the first 200 characters of the page. Now the abstract is 200 characters starting where the keyword or phrase was found. Failing that, it falls back on the beginning of the document.
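A minimal sketch of that abstract logic, assuming the page text is already available as a plain string. The function name `make_abstract` and the choice to try the whole phrase before the individual words are my own assumptions for the example, not necessarily how the engine does it.

```python
def make_abstract(text, query, length=200):
    """Return up to `length` characters starting at the first match.

    Tries the whole query as a phrase first, then each word in it.
    Falls back to the start of the document when nothing matches.
    """
    # Case-insensitive matching. This assumes lowercasing does not change
    # the string length, which holds for typical text but not for every
    # Unicode character.
    lowered = text.lower()
    candidates = [query] + query.split()
    for term in candidates:
        term = term.strip().lower()
        if not term:
            continue
        pos = lowered.find(term)
        if pos != -1:
            return text[pos:pos + length]
    # No keyword found: use the beginning of the document.
    return text[:length]
```

For example, `make_abstract(page_text, "search engine")` would start the abstract at the first occurrence of the phrase, or at the first occurrence of “search” or “engine” if the full phrase never appears.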
The about page now touches a bit on why I wrote it, where I got my information, and a little about me.
There’s a long way to go.
- Each page record needs a “time until next rescan” so the crawler knows when to revisit it.
- A few more supporting web pages need to be written, such as help, legal, and contact information. The usual type of stuff.
- It needs to be able to read sitemaps.
- People need to be able to submit URLs and sitemaps, and I need to figure out how to put those into the queue and prioritize them.
- I have a couple more ideas on optimizing scoring of the search results, but I want to be sure they don’t slow down the queries.
- The HTML layout really should be controlled by CSS. It should not be hard-coded tables and positioning. That way, I can change the formatting by adjusting the CSS instead of editing the page. It would also let me format all of the web pages at once instead of needing to edit each one.
- And lots more.
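For the sitemap and rescan items in the list above, here is one rough way they could fit together. It reads a standard sitemap `urlset` with Python’s built-in XML parser and turns the optional `changefreq` hint into a rescan interval. The function name, the interval table, and the one-week default are assumptions for illustration only. Sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical policy: map the sitemap's changefreq hint to a rescan delay.
RESCAN_AFTER = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}
DEFAULT_RESCAN = timedelta(weeks=1)  # assumed default when no hint is given


def read_sitemap(xml_text):
    """Return a list of (url, time_until_next_rescan) from a sitemap urlset."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue  # skip entries without a usable location
        freq = url.findtext("sm:changefreq", default="", namespaces=SITEMAP_NS)
        delay = RESCAN_AFTER.get(freq.strip().lower(), DEFAULT_RESCAN)
        entries.append((loc, delay))
    return entries
```

The returned pairs could then be fed into the crawl queue, with the delay stored on each record. Since submitted sitemaps come from strangers, the parser would need some hardening against malicious XML before this went anywhere near production.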