Wednesday, October 14, 2009

Libferris, Soprano, Extended Attributes... the Ménage à trois

If you want to store metadata in a filesystem, there are Extended Attributes (EA). The kernel interface allows you to store key=value metadata in these EAs for each file on your filesystem. The catch is that kernel EA are limited in size, sometimes performance is poor, and some systems do not support EA either natively or in the kernels that Linux distributions ship. The classic example of the latter is NFS, which can have EA support patched in, but many distros do not do that.

In libferris the EA interface is virtualized along with the filesystem. So if you are using XFS (or ext3/4 with the right options) then libferris will let you read and write EA to the kernel filesystem. For other filesystems, libferris stores the EA behind the scenes in RDF for you. The difference is not seen by applications, it's all just EA... magically every filesystem supports read/write EA.

Version 1.4.0 and above of libferris use Soprano and optionally Nepomuk for RDF support. To take RDF for a spin from the filesystem, let's use FerrisFUSE and the normal console tools...

$ mkdir -p /tmp/RDFTESTING/backing /tmp/RDFTESTING/fs
$ date >| /tmp/RDFTESTING/backing/df1.txt
$ ferrisfs -u /tmp/RDFTESTING/backing /tmp/RDFTESTING/fs

As you can see, backing is where the filesystem is and fs is where you can access backing through libferris. You could just as easily use an HTTP server or emacs as your backing filesystem; anything that libferris can see is up for grabs.

The commands below use the attr command to set and get an Extended Attribute. This assumes that the /tmp kernel filesystem does not allow EA to be set by users. If in doubt, use an NFS directory and you'll almost certainly not be able to attr -s directly on the backing filesystem.

$ cd /tmp/RDFTESTING/fs
$ cat df1.txt
Wed Oct 14 22:33:04 EST 2009
$ attr -s foo -V bar df1.txt
Attribute "foo" set to a 3 byte value for df1.txt:
bar
$ attr -g foo df1.txt
Attribute "foo" had a 3 byte value for df1.txt:
bar
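
If you want to check the assumption above on your own machine, a probe along these lines should do. This is only a sketch: the function name ea_supported and the scratch attribute name are made up for illustration and are not part of libferris. It uses the same attr command as the session above.

```shell
# ea_supported DIR
#   returns 0 if a user EA can be set on a scratch file in DIR,
#           1 if setting it fails (no user EA support, or not permitted),
#           2 if the probe could not run (attr missing, DIR not writable).
# The name ea_supported is hypothetical, not a libferris tool.
ea_supported() {
    dir=$1
    command -v attr >/dev/null 2>&1 || return 2
    probe=$(mktemp "$dir/ea-probe.XXXXXX" 2>/dev/null) || return 2
    if attr -q -s probe -V 1 "$probe" >/dev/null 2>&1; then
        rc=0
    else
        rc=1
    fi
    rm -f "$probe"
    return $rc
}

# Probe the backing directory used in this post.
status=0
ea_supported /tmp/RDFTESTING/backing || status=$?
case $status in
    0) echo "kernel EA work here, libferris can use them directly" ;;
    1) echo "no user EA here, libferris would fall back to RDF" ;;
    *) echo "could not probe (attr missing or directory not writable)" ;;
esac
```

Run it against the backing directory, not the fs mount point: through the mount every filesystem looks like it supports EA, which is the whole point.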

So, you might ask, where does all this metadata go and come from, and what does the RDF schema look like? The best solution would be to use SPARQL to query the data, but the default store is still a redland one with libferris 1.4.0 and its SPARQL is very, very slow: over a minute for a simple query on this data versus under 2 seconds using the sesame2 soprano backend. So for now the fastest way to explore the redland RDF store is to export the whole store and grep it. Hopefully virtuoso and/or my own soprano backend will save the day in the future :/

$ cd ~/.ferris/rdfdb
$ time sopranocmd --backend redland \
--settings name=myrdf export t

I have changed the URIs to use prefixes in the grep output... as you would expect, there is data attached to the df1.txt URL.

$ grep RDFTE t
ferris:uuid ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c .

That UUID node has an mtime and an out-of-band-ea bnode.

$ grep 93f22bd8-b8be-11de-8e06-001bfc4f043c t
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:mtime "1255523953"^^ .
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:out-of-band-ea _:r1255523601r5448r1 .

And the bnode has the EA foo=bar set on it.

$ grep r1255523601r5448r1 t
_:r1255523601r5448r1 ferris:user.foo "bar"^^ .
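
The three greps above can be rolled into one small helper. This is a rough sketch, not a tool that ships with libferris: the function name follow_ea is made up, and it assumes the prefixed, one-triple-per-line form shown in this post. The sample data is copied from the output above, except that the subject of the first triple is a placeholder I have assumed, since the post does not show it.

```shell
# follow_ea EXPORT PATTERN
#   Walk file node -> UUID node -> bnode and print "predicate value" for
#   each EA stored on the bnode. Hypothetical helper, not part of libferris.
#   Values containing spaces are not handled; this is only a sketch.
follow_ea() {
    export_file=$1
    pattern=$2
    # Step 1: the triple whose subject contains PATTERN names the UUID node.
    uuid=$(awk -v p="$pattern" \
        'index($1, p) && $2 == "ferris:uuid" { print $3; exit }' "$export_file")
    [ -n "$uuid" ] || return 1
    # Step 2: the UUID node points at the bnode via ferris:out-of-band-ea.
    bnode=$(awk -v s="$uuid" \
        '$1 == s && $2 == "ferris:out-of-band-ea" { print $3; exit }' "$export_file")
    [ -n "$bnode" ] || return 1
    # Step 3: every triple with the bnode as subject is one EA.
    awk -v s="$bnode" '$1 == s { print $2, $3 }' "$export_file"
}

# Sample export. The first subject is an assumed placeholder URL.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<file:///tmp/RDFTESTING/backing/df1.txt> ferris:uuid ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c .
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:mtime "1255523953"^^ .
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:out-of-band-ea _:r1255523601r5448r1 .
_:r1255523601r5448r1 ferris:user.foo "bar"^^ .
EOF

follow_ea "$sample" df1.txt
```

On a real export you would pass the file t from above instead of the sample, provided its URIs have been shortened to prefixes the same way.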

As you see, the UUID node has an mtime associated with it. This way libferris can tell if you have updated any RDF values for a file. It becomes like an additional ctime check available to libferris, and is used, for example, when indexing files.
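
To make the idea concrete, here is the kind of comparison an indexer could make with that mtime. This is a guess at the general shape, not libferris' actual code: both function names are made up, and the "last indexed" timestamps are invented for the example. The mtime value is epoch seconds, as in the export above.

```shell
# rdf_mtime EXPORT UUID_NODE
#   Print the epoch seconds stored under ferris:mtime for the UUID node.
#   Hypothetical helper, assuming the prefixed one-triple-per-line export.
rdf_mtime() {
    awk -v s="$2" '$1 == s && $2 == "ferris:mtime" {
        split($3, q, "\"")   # $3 looks like "1255523953"^^, keep the quoted part
        print q[2]
        exit
    }' "$1"
}

# needs_reindex RDF_MTIME LAST_INDEXED
#   Succeeds if the RDF metadata changed after the file was last indexed.
needs_reindex() {
    [ "$1" -gt "$2" ]
}

# Sample triple copied from the export shown in this post.
mtime_sample=$(mktemp)
cat > "$mtime_sample" <<'EOF'
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:mtime "1255523953"^^ .
EOF

m=$(rdf_mtime "$mtime_sample" ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c)
# 1255523000 is an invented "last indexed" time, earlier than the RDF mtime.
if needs_reindex "$m" 1255523000; then
    echo "RDF metadata is newer than the index, reindex the file"
fi
```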

The gain of having the UUID node use a bnode is that many files can share the same RDF metadata. This is useful if you can access the same file from multiple paths or if files get moved on file servers and you want to relink the old RDF metadata to the file with the new path. I mention file servers here because libferris will track the metadata for you if you use ferrismv/ferriscp, but if somebody moves a file on the file server you've got to have a way to tell libferris about that change.

Fileserver movements are a common enough thing that libferris can automatically relink RDF nodes for you. There are also smushing tools available to help with the task. But that's a story for another post.
