Saturday, January 31, 2009

Searching indexes on a maemo device: a customized solution

In a previous post I had a video showing one of the libferris inverted file metadata index plugins on maemo. Unfortunately that implementation was designed for a desktop machine, meaning for many many more files, slow disk head seeks, blazingly fast CPU(s).

While these videos are not as snazzy as some others, I make them so folks can get an idea of the real performance of the software. Saying its "acceptably fast" or takes less than a second is all well and good, but seeing the snail or rabbit move gives an immediate gut feeling if its "fast enough".

The new maemo custom designed metadata architecture uses the mmap() design. I always have my reservations about mmap() because it encourages data structure and algo choices that are directed at an in-memory world. That is all great if you can cache the entire memory mapped file in main memory, but often leads to kittens crying when there isn't enough RAM for that. For maemo with 128mb total RAM this is also an issue, but with the much faster disk seeks of flash storage the relatively poor data clustering of mmap()ed designs is a bit less important for performance. But of course, data clustering and physical layout is always interesting when playing the indexing game.


libferris maemo audio search by regex on URL from Ben Martin on Vimeo.

Shown in the video above is a search using the new "custom mobile device" libferris metadata index. As I continue to type a regular expression to match against the URL, libferris gives some search results on each character entered. Note that this is against an index with over 10,000 URLs in it. This index format will be available in libferris 1.2.7+. The index format sports a probabilistic regular expression prefilter for URL matching. This means that the results of the prefilter might include some false positives, but anything the filter drops out is definitely not a match. You have to run the regex itself on all results of the prefilter to drop out the false positives. The prefilter can drop the number of times a regular expression is evaluated down to less than 100 in 10,000 URLs given only a 3-4 character regex. This helps out oodles when you want to run raw regex searches on big (for a portable device) indexes and you want your results asap.

An unfortunate current side effect of the mmap() design is that the index is architecture dependant. Mainly because some secondary offset-pointer information is 64 or 32 bit width dependant, so its a matter of tracking those peskies down and clamping them to a fixed width. To get around that for now, I indexed the 10,000 odd audio files over NFS from an n810.

Some information for those looking to play, if you are using boost to memory map, you may well discover that atomic_cas32() calls __sync_val_compare_and_swap() which is not actually available on the maemo device (for me). This is with a modified scratchbox using gcc version 4.2.1.
I'm not sure if there is an accepted cas et al library for maemo/armel. I leave the hacks for __sync_val_compare_and_swap() to the reader and their accepted level of concurrency control risk.

No comments: