Welcome to the CozyNet Blog!

How to be a cool cat by rolling your own internal search engine

I spent some time over the past few days setting up an internal search for the CozyNet blog. It’s now on the main blog page, and it’s also available as a search engine option on the Homepage.

I had to look around for a while before I found something suitable, because I’m sure as heck not going to reinvent this wheel. I wouldn’t even know where to begin, so I figured there had to be something out there simple enough to plug in. Turns out, it was surprisingly hard to find anything up to snuff; and by up to snuff I mean completely server side, preferably PHP and MySQL based, with absolutely no javashid dependency!

I knew it could be done because I’ve seen these search pages all my life, mostly on old school sites, which is what I had in mind. I came up with Sphider, Sphinx, and a few others. Sphider doesn’t appear to be maintained and is a little complicated to set up on newer environments. Sphinx looks way too complicated for what it should be, and the others were all half-baked, bloated crap that would never fit here.

So I looked around a little more and found Xapian. It took some considerable reading in the documentation, but it actually worked surprisingly well. The funny thing about Xapian, though, is that its web search front end (Omega) is a CGI program. Not exactly PHP, but it works! It’s a blast from the past to say the least.

I did make some extensive edits to the main query search page so that it blends in with the rest of the site, and I fixed some bugs along the way. If you take a look at the source it’ll look like a mess, but you should have seen it before, because it was in even worse shape. There were missing tags, missing quotes, unnecessary spaces, etc. It’s also intermixed with other things that are unseen but cause large empty gaps in the page source.

So long as it works and looks nice, I’m fine with it really. It also works quite well in text based browsers, which I bet you don’t see very often!

I did notice, though, that it pulls the <title> element as the “caption” title for each result, which I’ve unfortunately set to simply “CozyNet Blog.” So every search result comes back with “CozyNet Blog” and a short summary, which isn’t exactly helpful for telling results apart. Going forward, I’ll put each post’s own title in the <title> element for each blog and video post so that they show up properly in the search. I might go through and update the older ones, but then again I might not.
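
For anyone in the same boat, one way to see which pages still carry the generic title is to search the HTML files for it. This is only a rough sketch: it assumes each page keeps its <title> element on a single line, and the list_generic_titles function name is made up for the example.

```shell
#!/bin/sh
# List HTML files under a directory whose <title> is still the generic site title.
# Assumption: <title>...</title> sits on one line. The function name is hypothetical.
list_generic_titles() {
  dir=$1
  title=${2:-CozyNet Blog}
  # grep -l prints only the names of matching files; -F matches the string literally
  find "$dir" -name '*.html' -exec grep -lF "<title>$title</title>" {} +
}

# Example call (prints one file path per line):
# list_generic_titles /var/www/www.cozynet.org/html/blogs
```

Anything it prints is a page that would still show up as plain “CozyNet Blog” in the results.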

Setup

If you want to set up Xapian on your site, it’s really not that difficult. It’s all command line based, so you’ll want to create a shell script or a daily cron job to update its database of indexed pages. However, you need some forethought and a properly organized hierarchy for your web content, or else you could index crap you don’t want showing up in search queries, like template files. You can filter some things, but proper organization goes a long way.

Here’s the command that I use to index the blogs:

omindex --db /var/lib/xapian-omega/data/default --mime-type html:text/html --mime-type-match=template.html:ignore --depth-limit=1 --url /blogs /var/www/www.cozynet.org/html/blogs && echo "Blogs done…"

  • The --db option and the file path after it set the target database to be updated. You can make up whatever database name you’d like.
  • The --mime-type option sets the types of files you want indexed. I have it set to only html so that images and audio files won’t junk up search results.
  • The --mime-type-match option filters by filename; here it skips “template.html” thanks to the “:ignore” bit on the end.
  • The --depth-limit=1 option only lets omindex go one directory deep.
  • And the --url option takes the URL path of the directory (starting from the webroot), and it’s followed by the full file path of the directory to scan.

Each time you run this command, it drops anything from the database that it didn’t see during that run, so in practice it replaces the previous contents. If you would like to add contents to the database without wiping what’s already there, just include -p after the omindex command to preserve them. The preserve option is only useful if you want to index additional directories into the same database, or update the existing database contents with new contents, which includes a “modified” timestamp feature in search query results.

In my shell script, the first omindex command rebuilds the database by writing new content to it, then the next omindex command includes the preserve flag so that additional directories can go into the same DB. You’ll want to try to keep everything in the same DB, or else you would have to create individual search forms dedicated to each DB on your site. That might not be a bad idea if you want to separate blogs, videos, and images; however, Xapian doesn’t have an image or video viewer, so just keep that in mind.
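
Here’s a rough sketch of what a script like that could look like. It isn’t a copy of mine. The database path, blogs directory, and omindex options come from the command above; the videos directory, the function names, and the DRY_RUN switch are made up for illustration. By default it only prints the commands it would run, so you can check them before letting it touch the database.

```shell
#!/bin/sh
# Sketch of a reindex script: the first section rebuilds, later ones add on with -p.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
DB=/var/lib/xapian-omega/data/default
WEBROOT=/var/www/www.cozynet.org/html

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: index_section SECTION [-p]
index_section() {
  section=$1
  preserve=${2:-}
  # $preserve is left unquoted on purpose so an empty value adds no argument
  run omindex $preserve --db "$DB" \
    --mime-type html:text/html \
    --mime-type-match=template.html:ignore \
    --depth-limit=1 \
    --url "/$section" "$WEBROOT/$section"
}

index_section blogs        # first run: no -p, so stale entries get cleared
index_section videos -p    # hypothetical second directory, kept in the same DB
```

To have it run daily, a crontab line along the lines of "0 3 * * * DRY_RUN=0 /path/to/reindex.sh" would do, where the time and script path are placeholders.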

To create the search box itself on your site, you can include the following in the HTML document. You’ll also want to make sure that CGI execution is enabled on your web server, by the way.

<form method="GET" action="/cgi-bin/omega/omega" role="search">
  <input name="P" placeholder="Search for..." size="50" autocomplete="off" accesskey="s" type="search" spellcheck="false"/>
  <span>
    <button type="submit">Search</button>
  </span>
</form>

Thanks for reading my blog!

