Thursday, June 14, 2012

Libferris and SQLite -- a powerful combination

This article as published back in 2009. This blog entry brings all the pieces together again.

The libferris virtual filesystem allows you to mount things that are not normally thought of as filesystems. Examples include XML files, Berkeley db4 ISAM files, relational databases, system log files, as well as applications like Firefox, Evolution, XWindow, emacs and Amarok.

Version 1.3.0 of libferris has enhanced support for SQLite databases. This includes both mounting an SQLite database as a filesystem and exposing any libferris virtual filesystem to SQLite as a virtual table. You can have as many libferris backed virtual tables as you like in SQLite, so you can join information from multiple sources (XML, syslog, db4 files) using SQL queries.

In this article series I will explain how to mount SQLite as a filesystem so you can manipulate it with the familiar command line tools (ls, cat, redirection), how to go the other way and expose libferris filesystems like mounted XML to SQLite for manipulation. I'll finish up with a discussion of how libferris lets you mount semi structured plain text files like log files as a filesystem and how you can then expose those to SQLite for analysis.

libferris tries to perform what are traditionally seen as filesystem mounting operations for you automatically. For example, if you list an XML file using ferrisls without using the -d option to list the entry itself rather than its contents, then libferris will try to mount that XML file and show it to you as a filesystem. This allows clients to access and store information into XML and db4 files without needing to know any of the details. A URL could be for a normal directory on an ext3 filesystem or for part of an XML file, it doesn't matter.

Mounting SQLite databases is now done automatically too. Consider the simple SQLite database shown below:

$ touch   /tmp/test.sqlite
$ sqlite3 /tmp/test.sqlite
SQLite version 3.5.9
Enter ".help" for instructions
sqlite> create table foo ( id int primary key, msg varchar );
sqlite> insert into foo values (4,'hello there');
sqlite> insert into foo values (5,'foo bar');

Without doing anything special, you can get at the data in this SQLite database through any libferris client. If you prefer you can also use FerrisFUSE to expose it through the Linux kernel to any program. For those unfamiliar, FUSE allows a filesystem that appears through the Linux kernel to be served by an application running as a normal application process. This means FerrisFUSE runs like a daemon and can use libferris to store information while making it available to the Linux kernel like any other traditional (ext3) filesystem. All of a sudden you can run vi or emacs to edit a tuple in an SQLite database. The libferris distribution contains clones of the traditional ls, cat, cp, etc. tools which use libferris to perform their work. This way you don't have to setup any FUSE mounts or the like if you just want to ls, cat or redirect information into a libferris filesystem.

When you mount a table or SQL query result from a relational database with libferris, each column from a tuple is exposed using the standard filesystem Extended Attribute (EA) interface. The EA interface allows key=value pairs to be associated with a file. You can think of EA simply as metadata for files and directories, the familiar lstat(2) information of size, mtime, owner and group of a file are exposed as EA by libferris. In fact, you can get and set the byte contents of a file through the content EA of a file. Making everything available through the EA interface allows you to filter and sort a directory based on lstat metadata as well as user supplied metadata -- it's all just EA.

Shown below, I first list the table from the SQLite database and then view a tuple with the fcat command. When you read a tuple as a file, libferris exposes it as an XML document. In places that do not have a direct mapping, libferris normally tries to do the most sane thing it can. For a tuple, you have to expose all the fields in some delimited format because libferris can't know what fields are the most important. So using XML makes the most sense because it offers explicit key=value serialization and is easily digestible by other applications. You can also use the -a option to fcat to select a single single column (EA) which avoids the XML.

$ ferrisls -lh  /tmp/test.sqlite/foo
                                 40                   4 
                                 36                   5 
$ fcat /tmp/test.sqlite/foo/5
<context  id="5"  msg="foo bar"  /> 
$ fcat -a msg  /tmp/test.sqlite/foo/4
hello there

As you can see above, the long listing format from ferrisls only shows the size and name of each file. All of the information shown by ls is exposed as EA by libferris. But there is no modification time, or file ownership privileges stored for a tuple in the SQLite database, so no value is available. This makes the ferrisls long listing seem fairly spartan.

You can also ask a libferris directory what it thinks is the most interesting EA (the recommended-ea) for the user. The ferrisls -0 option uses this feature to let you dig into strange filesystems. Shown below, you can see the msg and id column from the database are shown when using -0 to see the recommended-ea. Because the columns that are shown by recommended-ea can be arbitrary EA, you might like to use the --show-headings option to have ferrisls tell you the name of each EA.

$ ferrisls -0  /tmp/test.sqlite/foo
4 hello there 4 id 
5 foo bar 5 id 

$ ferrisls -0 --show-headings  /tmp/test.sqlite/foo
id msg name primary-key 
4 hello there 4 id 
5 foo bar 5 id 

ferrisls also lets you see its output in XML or RDF/XML using the --xml or --rdf options respectively. Apart from being able to easily parse the output, the XML output lets you see the names of the EA for each value.

$ ferrisls --xml  /tmp/test.sqlite/foo
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>

  <ferrisls id="" msg="" name="foo" primary-key="id" url="file:///tmp/test.sqlite/foo">
    <context id="4" msg="hello there" name="4" primary-key="id"/>
    <context id="5" msg="foo bar" name="5" primary-key="id"/>
  </ferrisls>

</ferrisls>

While such read access is a huge step forward for unlocking the data in an SQLite file, you might like to update data as well. Because the redirection commands in bash only know about kernel filesystems, there is the ferris-redirect command which performs a similar role to the shell >, >> and >| family of primitives. The ferris-redirect command reads from standard input and writes that data to a nominated URL. The -T option to ferris-redirect truncates the file before writing to it, and the --ea command is used to select an EA to write to instead of writing to the contents of the file.

$ echo -n "welcome to libferris" \
  | ferris-redirect -T --ea msg /tmp/test.sqlite/foo/5
$ ferrisls --xml  /tmp/test.sqlite/foo
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>

  <ferrisls id="" msg="" name="foo" primary-key="id" url="file:///tmp/test.sqlite/foo">
    <context id="4" msg="hello there" name="4" primary-key="id"/>
    <context id="5" msg="welcome to libferris" name="5" primary-key="id"/>
  </ferrisls>

</ferrisls>

I added support for mounting PostgreSQL database to libferris years ago. There are still some features unique to the PostgreSQL support in libferris such as calling database functions through a virtual filesystem. You can also mount other databases that QtSQL can access using libferris.

Going the other way

Lets now turn our attention to doing the opposite: exposing libferris as a relation to SQLite. You might have some data in an XML file that you need to join with data stored in Berkeley db4 or an existing SQLite database. Your first thought might be to whip up a Perl script to bring the data together. With libferris you don't have to write any of this glue code.

As SQLite includes support for virtual tables, you can expose any libferris virtual filesystem as a table to SQLite. Because XML is just another virtual filesystem to libferris you can easily expose it to SQLite and join the data with a normal SQL query.

To use libferris from SQLite, you'll need an sqlite client binary that can load shared libraries. Allowing binary libraries to be loaded into your SQLite session is a bit of a security issue, the code could potentially do anything. As such, the packaged client for Fedora does not allow loading shared libraries. It is very easy to download the sqlite source and recompile it with shared library support though, and you only have to install the single sqlite3 binary somewhere in order to get at libferris virtual tables.

Firstly you'll need to load the libferris support using the .load directive at an sqlite prompt. Data is exposed into SQLite as a virtual table using the create virtual table statement. The format for this statement varies depending on the code implementing the virtual table. For libferris, the parameters are: The URL to expose, a string containing key=value pairs to effect this exposure, and the EA you want exposed, expressed in the normal create table format. Note that you can only specify the name and datatype for each EA (column) you want, you can't put constraints on them like foreign or primary keys. To avoid confusion, I'll call the URL supplied as the first parameter to create virtual table the "base URL".

Issuing the below two statements at an sqlite prompt will bring libferris support into the current session and create a virtual table which shows the /tmp/test directory with various lstat(2) information as well as the MD5 checksum and file contents as columns in the new SQLite table. Note that there is an implicit first column in the created virtual table which is always the file URL (primary key) so you don't have to specify that manually.

$ sqlite3

.load libferrissqlitevtable.so
create virtual table fs using libferris(
        'file:///tmp/test',
        '',
        name text,
        size int,
        md5 text, 
        path text,
        mtime int,
        atime int,
        ctime int,
        is-dir int,
        mtime-display text,
        content text );

There is a short list of common EA descriptions that are available in libferris. If you are unsure what an EA looks like, use the --show-ea option to ferrisls to select which EA you want to see for a URL. --show-ea takes a comma separated list of which EA you want to see. The special ea-names EA will list the names of all EA that are available for a URL as a comma separated string. The second command below generates a newline separated version in a temporary file.

$ ferrisls --show-ea=name,foo URL
$ fls --show-ea=ea-names URL | tr ',' '\n' >|/tmp/o
$ vi /tmp/o

Turning our attention back to the create virtual table SQLite function again. There are currently two parameters which can be specified in the second parameter as comma separated key=value pairs. If you specify recursive=true then the table exposes the entire filesystem recursively from the base URL, otherwise only the directory of the base URL is exposed. This lets you choose if you want the directory or entire filesystem tree from the base URL to be exposed. If you specify add-rowid-to-base-filesystem=true then stable 64bit row-ids are added as needed to files in the exposed table. You need stable row-ids to allow the SQL update .. where and insert into statements to work as expected.

There are a few things to keep in mind when exposing a filesystem as a relational table. For example, EA that are shown in ferrisls and native libferris clients with a '-' in them are translated to use '_' instead because SQLite doesn't like the dash. Likewise, ':' is translated into '_colon_'. So in the above create virtual table, the is-dir and mtime-display EA will be available in the SQL columns is_dir and mtime_display respectively.

To get at the above virtual fs table, you can use the normal SQL select command:

select is_dir,name,url from fs 
order by url desc;
...
0|df2|file:///tmp/test/df2
0|df1|file:///tmp/test/df1
...
Because some calculated EA like the MD5 checksum are exposed as columns you can grab those from a query too. The libferris virtual table implementation includes code to allow you to pick files in a where clause using their path or URL and libferris will optimize the lookup in a manner similar to (but not quite as efficiently) as if you had created a relational database index on those columns. Note that if you use the URL in the where clause, you need to specify the file:// part too, and the URL must exactly match what you are looking for. If you are not getting the result you expect when searching for yourURL, use ferrisls --show-ea=url yourURL and make sure the result exactly equals yourURL.

select md5,size,mtime,mtime_display,name,url from fs 
where  path = '/tmp/test/df1';

Exposing a normal filesystem directory to SQLite is not very exciting, but keep in mind that anything you can see through libferris you can see through an SQLite virtual table backed by libferris. So if you have an XML file:

$ cat customers.xml 
<customers>
 <customer id="3"   givenname="Foo"   familyname="Bar"/>
 <customer id="15"  givenname="Bobby" familyname="McGee"/>
 <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>

You can expose it to SQLite and query based on any attributes in the XML file as shown below.

$ cat test-xml.sql 
sqlite> .load libferrissqlitevtable.so
sqlite>  create virtual table fs using libferris(
        'customers.xml/customers',
        '',
        name text,
        size int,
        id int,
        givenname text,
        familyname text
        content text 
        );

sqlite> select * from fs 
        order by givenname desc;

file:///ferris/exposures/sqlitevtab/customers.xml/customers/131|131|0|131|Ziggy|Stardust
file:///ferris/exposures/sqlitevtab/customers.xml/customers/3|3|0|3|Foo|Bar
file:///ferris/exposures/sqlitevtab/customers.xml/customers/15|15|0|15|Bobby|McGee

You can also create a filesystem index with libferris and expose desktop search results as a virtual sqlite table. For example, the following searches an image collection for a particular bridge using the EA Query (eaq://) libferris virtual filesystem. For more information on creating and using the desktop search functionality in libferris see the Linux Journal article.

$ cat query.sql
.load libferrissqlitevtable.so
create virtual table fs using libferris(
        'eaq://(url=~Akashi-Kaikyo-Bridge)',
        '',
        name text,
        size int,
        md5 text, 
        path text,
        mtime int, atime int, ctime int,
        is-dir int,
        mtime-display text,
        width int, height int, content text );

select is_dir,width,height,url from fs 
order by url desc;
...
0|1840|1232|file:///digital_pictures/2004-japan/Akashi-Kaikyo-Bridge-main-section2.png
0|1840|1232|file:///digital_pictures/2004-japan/Akashi-Kaikyo-Bridge-main-section.png
...

Updating XML through SQL commands

Imagine that you have the customers.xml file shown below. You have exposed it to SQLite and used it to join with information from tables in the database to resolve queries. All is well, but you would like to add a new customer to the XML file. While you might consider opening up a text or XML editor on the XML file itself, you would really like to just keep using SQL.

$ cat customers.xml
<customers>
 <customer id="3"   givenname="Foo"   familyname="Bar"/>
 <customer id="15"  givenname="Bobby" familyname="McGee"/>
 <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>

Currently you have to specify everything including the URL of the new file you wish to create in the SQL INSERT statement as shown below. Note that the ID does not need to be part of the URL, you only need to tell libferris you want to create a new customer element. A planned addition here is to allow libferris to understand the autoincrement keyword during the creation of the virtual SQLite table. This would let you be able to supply a NULL for the ID column and have one automatically generated for you, even though this is backed by an XML file.

.load libferrissqlitevtable.so
create virtual table fs using libferris(
        'file:///tmp/customers.xml/customers',
        'add-rowid-to-base-filesystem=true',
        id int,
        givenname text,
        familyname text
        );

insert into fs values ( 
        'file:///tmp/customers.xml/customers/customer',
        300,
        'this is the new given name',
        'new sirname here' );

At this stage your customers.xml file will include a new element for the entry you INSERTed above:

$ cat customers.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<customers ferris-sqlite-next-rowid="1">
 <customer familyname="Bar" givenname="Foo" id="3"/>
...
  <customer familyname="new sirname here" ferris-sqlite-rowid="1" 
      givenname="this is the new given name" id="300"/>
</customers>

The little differences that surfaced for the SQL INSERT statement completely disappear when using SQL UPDATE as shown below:

update fs set familyname = 'Anderson' where id = 300;
As you can see above, setting add-rowid-to-base-filesystem=true in the create virtual table statement causes libferris to save ferris-sqlite-rowid data for files in the virtual filesystem. If these attributes can be saved into the file itself, either directly as in XML, or to filesystem Extended Attributes then they are stored there. If neither of these are possible then libferris stores the ferris-sqlite-rowid data in a personal RDF store.

Plain text log files and SQLite

Those familiar with awk and Perl don't really think too much about splitting a log file into its various columns. But a log file stored as a plain text file on a filesystem is hard to directly query from a relational database. One might wish to know the number of connections from each IP address and the total number of bytes send to each of those IP addresses. For such a query the SQL is fairly quick to conjure up but the data is not in an agreeable format for SQL.

Version 1.3.0 of libferris also includes support for mounting text files with a regular structure as a virtual filesystem. In its most basic form this extends to splitting up each line of a file using a nominated separator character and making a virtual file for each line with the individual fields exposed as EA. For example, using the comma as the delimiter you could mount a comma separated file as a libferris filesystem. Once its a libferris filesystem you can then turn around and use SQLite to query that filesystem.

In an awk script, the fields that are generated by splitting up a line (record) of input are available through the $1, $2, etc. variables. By default the fields are exposed using similar names in libferris but you can also specify a comma separated list of identifiers for the fields. These identifiers give the name of an extended attribute (EA) which will have the field value. Perhaps an example would best clarify things.

The test2.csv file shown below contains three fields separated by commas. The first ferris-redirect command sets the ferris-recordfile-field-names EA for the file which instructs libferris the names you would like to give to the columns. The second ferris-redirect tells libferris what separates the fields in the file. This is like the FS variable from awk and can be a regular expression.

Finally the ferris-type is set to recordfile to tell libferris that it should try to mount the file by parsing it if the user tries to list it like a directory. Because recordfiles are just plain text files, libferris has no quick way of knowing if it can or should try to parse the file without you explicitly setting the ferris-type to recordfile.

Notice in the ferrisls command that the fields have the names which you specified using the ferris-recordfile-field-names attribute.

$ cat test2.csv
cola,colb,colc
seca,secb,secc
third1,third2,third3,third4
$ echo -n 'colA,secondColumn,lastColumn' \
   | ferris-redirect -T --ea=ferris-recordfile-field-names test2.csv 
$ echo -n ',' \
   | ferris-redirect -T --ea=ferris-recordfile-fs test2.csv 
$ echo -n recordfile \
   | ferris-redirect -T --ea=ferris-type test2.csv 
$ fls --xml test2.csv 
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>

  <ferrisls name="/tmp/testrecordfile/test2.csv" url="file:///tmp/testrecordfile/test2.csv">
    <context colA="cola" lastColumn="colc" name="0" secondColumn="colb"/>
    <context colA="seca" lastColumn="secc" name="1" secondColumn="secb"/>
    <context colA="third1" lastColumn="third3" name="2" secondColumn="third2"/>
  </ferrisls>

</ferrisls>

Since its annoying and tedious to set these field separators for each file, you can set the ferris-recordfile-metadata-url EA to an absolute or relative name where the metadata about field names and separators is kept. In the case shown below, ferris will look in ~/.ferris/recordfile-types.db/apache_access_log for the info about how to split up the file.

$ echo -n apache_access_log \
   | ferris-redirect -T -x --ea=ferris-recordfile-metadata-url /var/log/httpd/access_log
$ echo -n recordfile \
   | ferris-redirect -T --ea=ferris-type /var/log/httpd/access_log

The ferris-recordfile-metadata-url is simply a level of indirection. Instead of libferris looking at the ferris-recordfile-fs EA on the /var/log/httpd/access_log file it will instead look for the ferris-recordfile-fs EA on ~/.ferris/recordfile-types.db/apache_access_log. This way you can setup the information for field names and the regular expressions used to split up various file types in one place and simply refer files of that format to the correct recordfile-types.db URL.

Trying to parse apache access logs using a fixed regular expression for the field separator is not as simple as with comma separated files. This is because the separator changes as you move across a line in the access log. To enable parsing of less regular files libferris supports the ferris-recordfile-fsplit EA which defines a single regular expression which includes regular expression (captures) to explicitly pick out the fields from the input. The fsplit regular expression (regex) is normally much longer than a field separator (FS) regex but you can easily change what the regex searches for after each matched field.

Shown below is a partial implementation of a fsplit capturing regex for parsing apache access log files. The recordfile-types.db comes with libferris as part of the /etc/skel files for ~/.ferris. Note that you can easily explore and update the data stored in a Berkeley db file because libferris can mount those too.

$ fls --xml ~/.ferris/recordfile-types.db 
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
  <ferrisls name="/ferris/dot-ferris/recordfile-types.db" url="file:///ferris/dot-ferris/recordfile-types.db">

    <context ferris-recordfile-field-names="ip,f2,f3,date,req,response,sz,f8" 
       ferris-recordfile-fsplit="^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) ([^ ]+) ([^ ]+) \[(.+)\] &quot;(.*)&quot; ([0-9]+) ([0-9]+|\-).*" 
       name="apache_access_log"/>
      ...
  </ferrisls>
</ferrisls>

Once you have set the ferris-type to recordfile and set the metadata URL as shown above, then you can dig into the log file as a filesystem as shown below. Note that if you wanted to import the log somewhere else you can use the --xml or --rdf options to ferrisls to expose this filesystem as XML or RDF/XML respectively.

$ fls -0  access_log
0 - - 14/Mar/2009:14:37:06 +1000 585 192.168.100.200 GET / HTTP/1.1 200 
1 - - 14/Mar/2009:14:37:06 +1000 287 192.168.100.200 GET /favicon.ico HTTP/1.1 404 

But rather than exporting the data as XML, why not just use the SQLite virtual table interface to query it directly as shown below. First I make a new virtual table from the log file and then perform the most basic query on it. The second query shows how many requests each IP address made and how many bytes were sent to each IP address.

sqlite> .load libferrissqlitevtable.so
sqlite> create virtual table logfile using libferris(
        'file:///tmp/access_log',
        '',
        ip text,
        req test,
        response text,
        date text,
        sz int
        );

sqlite> select * from logfile order by ip desc;
file:///tmp/access_log/0|192.168.100.200|GET / HTTP/1.1|200|14/Mar/2009:14:37:06 +1000|585
file:///tmp/access_log/1|192.168.100.200|GET /favicon.ico HTTP/1.1|404|14/Mar/2009:14:37:06 +1000|287
...

sqlite> select count(*),max(ip),sum(sz) from logfile group by ip;
15|192.168.100.200|1798021
3|192.168.110.200|42722
1|192.168.130.200|16927
2|192.168.133.200|271977

Most log files contain a timestamp containing the date and time that a request or event occurred. In the above query, we declare the date column as part of the virtual table but don't use it in the query. There is one major glitch -- having the date field as plain text makes performing date comparisons or grouping by date almost impossible. You are at the mercy of the string to date functions of your SQL environment.

To get around this, support was added to libferris to have it parse arbitrary date and time EA and generate a new EA-epoch attribute exposing the UNIX epoch value (seconds since 1970). The epoch time is the lowest common denominator in that it requires no parsing for sqlite to interpret it properly and you can perform date and interval comparisons directly on the epoch.

The below command tells libferris to try to parse the date EA as a time value using the given strftime(3) time format string. The format fully matches the apache access_log time format and so libferris will generate a new date-epoch EA containing the seconds since 1970 for the date EA. Note that recordfile-types.db that ships with libferris has all of these modifications already in place for you but knowing how things are setup lets you parse files in other formats too.

As you can see, I now use the date-epoch column when creating the virtual table because that value is much more useful to SQLite. Selecting just the date_epoch column shows you the number of seconds as an integer. The date_epoch can be made more digestible with the datetime() SQLite function. If you have more than one EA that you want to parse into an epoch time, separate then with the pipe character, for example, a=%b/%Y|b=%H:%M:%S

echo -n 'date=%d/%b/%Y:%H:%M:%S %z' \
  | ferris-redirect --ea ferris-recordfile-field-timeparsers \
  recordfile-types.db/apache_access_log


sqlite> .load libferrissqlitevtable.so
sqlite> create virtual table logfile using libferris(
        'file:///tmp/access_log',
        '',
        ip text,
        req test,
        response text,
        date-epoch int,
        sz int
        );

sqlite> select url,ip,date_epoch,datetime(date_epoch,'unixepoch') from logfile order by ip desc;
file:///tmp/access_log/19|192.168.133.200|1237005517|2009-03-14 04:38:37
file:///tmp/access_log/20|192.168.133.200|1237005517|2009-03-14 04:38:37
...
This leads you to move interesting queries such as the one shown below which gives a day by day breakdown of bandwidth usage for the last 60 days. The first part of the where clause converts the current time into a UNIX epoch value so that the time from the log file can be subtracted and we can exclude events that are older than 60 days.

select strftime('%Y-%m-%d',datetime(date_epoch,'unixepoch')) as Date, sum(sz) as Bytes 
from logfile 
where strftime('%s',datetime('now')) - date_epoch <= 60*60*24*60 
group by Date;

Wrap up

If you are more familiar with XML technologies, libferris can also expose itself as a virtual document to XQuery. Currently libferris can expose itself to FUSE, SQLite, XQuery and as a virtual Xerces-C DOM. A core ideal of libferris is liberation, you should be able to see your data as a virtual filesystem, be it from an XML or db4 file, from a relational database, or from an application like Firefox or emacs. Once information is available to libferris you should be able to get at it from whatever tool best suits your current task, a filesystem, a table in a relational database, or a document in an XQuery.

1 comment:

Matthew Rubenstein said...

Howdy. Do you have any leads for using libferris and SQLite (or other SQL engine) to manage souce code (preferably Java)? For example: progressive builds at scopes finer than dirty files, dynamically mapping the dependencies starting with the dirty scope; relecting only private methods taking a specific type as an argument; renaming all instances of any variable sharing the substring to be renamed; counting depth of call graphs; storing old versions of code blocks; that kind of stuff. Any way to query scopes in source files that's already coded or even just researched as an example integration around libferris.