Unoffical empeg BBS

Quick Links: Empeg FAQ | RioCar.Org | Hijack | BigDisk Builder | jEmplode | emphatic
Repairs: Repairs

Topic Options
#319263 - 13/02/2009 15:01 SQL interface to flat file data
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
One of my projects at work involves collecting a large number of small, fixed-length data records. Billions of records per day, less than 50 bytes each. We currently store the data in hourly binary files arranged in a relatively simple directory hierarchy to provide for cheap and fast indexing of the type of data, date/time collected, etc.

This works very well for the command line tools we provide to analyze the data, but some users want a richer query interface. Several of our users have actually gone through the trouble to import our data into an RDBMS to provide for more advanced queries, but this is obviously not an ideal solution for more than a small portion of the data that's collected. Adding in some type of SQL interface to our data has been suggested, and I'm wondering if anyone has been down this road before and could offer some advice.

My boss is currently talking with the folks who sell c-tree, which purports to do some of what we want, but we'd prefer something that's free and open source, since our tools are released that way. It's also important that we change our current on-disk format as little as possible to add the SQL layer in. The goal is to support SQL queries of our data as efficiently as possible without using an RDBMS as our storage engine.

Ideas, anyone?
_________________________
- Tony C
my empeg stuff

Top
#319265 - 13/02/2009 15:50 Re: SQL interface to flat file data [Re: tonyc]
matthew_k
pooh-bah

Registered: 12/02/2002
Posts: 2298
Loc: Berkeley, California
Have you considered if you can work with SQLite? It's the defacto standard in "SQL on files", but it doesn't sound like it's a drop in solution.

Matthew

Top
#319266 - 13/02/2009 15:58 Re: SQL interface to flat file data [Re: tonyc]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Can you be more specific about the data? Are the records index/value mappings? Are the indexes always the same? What kind of indexing are you currently using?
_________________________
Bitt Faulk

Top
#319267 - 13/02/2009 16:12 Re: SQL interface to flat file data [Re: wfaulk]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
Originally Posted By: matthew_k
Have you considered if you can work with SQLite? It's the defacto standard in "SQL on files", but it doesn't sound like it's a drop in solution.


I could be wrong, but I think our requirement that it be able to partition data across multiple files on disk rules out SQlite, which doesn't have any partitioning support that I'm aware of.

Quote:
Can you be more specific about the data? Are the records index/value mappings? Are the indexes always the same? What kind of indexing are you currently using?


It's network flow data, much like Cisco Netflow. So, a 5-tuple of (src ip, src port, dest ip, dest port, protocol) plus start time, end time, and a tiny bit of extra data like TCP flag mask, next hop IP, etc. Each file contains an hour's worth of data from one sensor.

There's no indexing to speak of except for the directory and file structure, which is (generally)

/class/type/year/month/day/sensor-hour

"class" is usually some segment of the traffic based on where the sensors are located (at the network core, on the border, etc.) and "type" is usually used to indicate the direction of the traffic (in/out of the network.)

So, it's trivially easy to get a particular sensor's data, or all data for a particular hour/day, but getting all flows from a particular IP over a wide time range is a brute force chug through the files.
_________________________
- Tony C
my empeg stuff

Top
#319269 - 13/02/2009 17:25 Re: SQL interface to flat file data [Re: tonyc]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Originally Posted By: tonyc
It's also important that we change our current on-disk format as little as possible to add the SQL layer in.

Can you expand on this point a little, too? Do you mean you need it to stay in the same binary format it's in now, or you just want to keep the same indexing structure, or there can be no new files, or what?
_________________________
Bitt Faulk

Top
#319270 - 13/02/2009 17:33 Re: SQL interface to flat file data [Re: wfaulk]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
Ideally, we'd like to keep the exact same format we have and just write some glue code to hand records to the SQL layer. I can see a case for the creation of files that sit alongside each hourly data file to provide indexing, but we'd like to avoid fundamental changes to the existing data structure if possible. This is to keep compatibility with other tools that read/write our data, and to avoid having to convert many years of data that our customers have collected.
_________________________
- Tony C
my empeg stuff

Top
#319271 - 13/02/2009 17:37 Re: SQL interface to flat file data [Re: tonyc]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4180
Loc: Cambridge, England
Originally Posted By: tonyc
So, it's trivially easy to get a particular sensor's data, or all data for a particular hour/day, but getting all flows from a particular IP over a wide time range is a brute force chug through the files.

It'll still be a huge chug no matter how much SQL middleware you put between you and it. IMO the only way to get a sane outcome here is to write scripts to import the data into a real database. Fortunately this duplication isn't a problem, as the data never changes once it's generated (right?).

Peter

Top
#319272 - 13/02/2009 17:50 Re: SQL interface to flat file data [Re: peter]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Well, there are some potential intermediate steps, such as implementing some sort of indexing, but you're talking about reimplementing portions of a database engine. But I think you're basically right.

Tony, have you looked at MySQL's CSV engine? It doesn't actually provide any indexing, though, so it'll still be a big churn, albeit a somewhat automated one.
_________________________
Bitt Faulk

Top
#319273 - 13/02/2009 18:52 Re: SQL interface to flat file data [Re: wfaulk]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
The IMAP server Dovecot proves that this short of on-the-side indexing can work well. You can keep you email in dumb old mbox files (on big file for each folder) and dovecot creates indexes on the fly to provide lightning fast access to the fields it indexes.
_________________________
Remind me to change my signature to something more interesting someday

Top
#319298 - 14/02/2009 16:17 Re: SQL interface to flat file data [Re: matthew_k]
Mojo
Unregistered


I second sqlite.

Why would you need to partition the data across multiple database files? If you really need to, just make a separate database for each class/type/year combination, and make use of the attach database statement.

Top
#319379 - 16/02/2009 17:54 Re: SQL interface to flat file data [Re: ]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
Quote:
Well, there are some potential intermediate steps, such as implementing some sort of indexing, but you're talking about reimplementing portions of a database engine. But I think you're basically right.

There are libraries that will allow us to do b-tree indexing with very little fuss, and we've experimented with them. The problem in our case is that with such small records, and so many of them, indexing becomes a bit crazy, because the indexes end up being nearly as large as the data they're indexing.

Really, though, this is a bit of a tangent, since our customers don't have a problem with the speed of the current system. What they want is added flexibility on the query side.

Quote:
Tony, have you looked at MySQL's CSV engine? It doesn't actually provide any indexing, though, so it'll still be a big churn, albeit a somewhat automated one.


I don't think that's the right way to go. Our on-disk representation is as compact as possible, whereas CSV is very wasteful. Also, there would be a ton of conversion back and forth between textual representation of numbers, timestamps, IP addresses, etc. and the native C representations we use in our tools.

The goal here is to keep the existing record data structure as intact as possible. Adding stuff to each file or each record is acceptable, changing it to some file format that's managed by the database probably is not.

Quote:
Why would you need to partition the data across multiple database files?


Because that is how our collection system currently works, and I'm trying to provide parity with what we have without rewriting all of our tools. As I mentioned above, using this hierarchical directory structure and having files broken down with this granularity provides a hierarchical composite index of (type, class, year/month/day, sensor) at virtually no cost, and probably 80% of user queries take advantage of this pattern. This way, we optimize for the most common case by using the filesystem's very efficient data structures.

I had given some thought to SQlite, but my experience with it in the past was that it didn't scale to the level we need. Maybe I'll play around with a scaled-down set of our data to see if it's gotten better since then. I was hoping for something that wouldn't force us to adopt someone else's storage engine, but if we do have to drink the RDBMS kool-aid, SQLite is probably the least onerous to work with on disk.
_________________________
- Tony C
my empeg stuff

Top
#319388 - 16/02/2009 23:29 Re: SQL interface to flat file data [Re: tonyc]
Mojo
Unregistered


You could use an sqlite database in conjunction with your current layout. The data would just be duplicated in the two formats but you'd get the best of both worlds.

If disk space is important, you might consider using sqlite's feature of creating a database in memory. That way your software would have to chug through the data only once before performing any complex queries.

Good luck

Top