Self-organizing Media Database

Private Cloud Media Server

This project is a plugin framework for organizing the media on your computer (music, movies, documents, etc.) along with its meta-data and revisions.

This media database is part of a package that runs on a Gumstix Overo Fire, a resource-constrained device, so an event-oriented, lazily evaluated architecture is important.

Examples of data handled:

Documentation

Documentation should be in a machine-readable format (preferably JSON), which can be used to create a web page.

Handlers

Each handler is a separate application. It is preferred that the handlers be written in NodeJS, but Python is the best choice in many cases due to the availability of existing libraries that very nearly meet the requirements as-is.

Input:

Process:

Output:

The output format should represent all meta-data where reasonable, and at least as much as the common consumer applications (iTunes, Picasa, Adobe Reader, etc.) expose.

Example: IPTC defines a huge number of tags, but applications such as Picasa, Gthumb, and Flickr show the most relevant ones.

In cases where one format supports a feature that another doesn't, the output should accommodate the higher-fidelity case.

Example: One tag format allows only one Artist, another allows an array. An array should be used for the JSON output, even if only one item is present.
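The array rule could be sketched as follows. The field names in `MULTI_VALUE_FIELDS` and the function `normalize_tags` are hypothetical, chosen only for illustration; the real set of multi-value fields would depend on the tag formats being handled.

```python
import json

# Hypothetical set of fields that are always emitted as arrays.
MULTI_VALUE_FIELDS = {"artist", "genre", "composer"}


def normalize_tags(raw_tags):
    """Return a copy of raw_tags where multi-value fields are always lists."""
    normalized = {}
    for key, value in raw_tags.items():
        if key in MULTI_VALUE_FIELDS:
            # Wrap single values so consumers never need to check the type.
            is_sequence = isinstance(value, (list, tuple))
            normalized[key] = list(value) if is_sequence else [value]
        else:
            normalized[key] = value
    return normalized


# One format gives a single artist, another gives several.
single = normalize_tags({"title": "Song", "artist": "Some Artist"})
multi = normalize_tags({"title": "Duet", "artist": ["First", "Second"]})
print(json.dumps(single, sort_keys=True))
print(json.dumps(multi, sort_keys=True))
```

Both results carry `artist` as an array, so a consumer can iterate over it without special-casing the single-artist formats.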

The most important feature now is simply to read the various meta-data. The ability to add / remove / edit meta-data is a plus.

file

audio

Python's mutagen provides most of the necessary functionality. AtomicParsley is also very helpful.

Output should include bitrate if possible.
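A rough sketch of what an audio handler's output step might look like. The function `audio_to_dict` and the output keys are assumptions, not part of any existing handler. It only relies on the object having `tags` and `info` attributes, which is the shape mutagen file objects have, and it uses a stand-in object so it runs without mutagen installed.

```python
import json
from types import SimpleNamespace


def audio_to_dict(audio):
    """Convert a mutagen-style file object into a JSON-ready dict."""
    tags = audio.tags or {}
    info = audio.info
    return {
        # Every tag value is emitted as a list, even with one item.
        "tags": {key: list(values) for key, values in tags.items()},
        # Not every format reports these, so fall back to None.
        "bitrate": getattr(info, "bitrate", None),
        "length": getattr(info, "length", None),
    }


# Stand-in for a parsed file; the values are made up for the demo.
fake_audio = SimpleNamespace(
    tags={"artist": ["Some Artist"], "title": ["Song"]},
    info=SimpleNamespace(bitrate=192000, length=241.5),
)
audio_json = audio_to_dict(fake_audio)
print(json.dumps(audio_json, sort_keys=True))
```

With mutagen installed, something like `audio_to_dict(mutagen.File(path, easy=True))` should work for formats whose `info` exposes `bitrate`. `mutagen.File` returns `None` for files it does not recognize, so that case needs a check first.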

video

images

documents

NOTE: This is for future reference, not in the current plan

example

From the command line:

jpeg-handler ./path/to/fun-pic.jpeg --output-format json
{
  date_taken: 987343... // Unix timestamp
  md5sum: ABE853.... // hexdigest
  sha256sum: FEC74D... // hexdigest
}

As a library

pic = handleJpeg(read(filename), { defer_checksum: true })
puts pic.stream.checksum
puts pic.meta.title
puts pic.toJSON
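The `defer_checksum` option above could be sketched in Python with `functools.cached_property`, so that the digest is only computed when something first asks for it. The names `JpegStream`, `Picture`, and `handle_jpeg` are hypothetical, and SHA-256 is an arbitrary choice here.

```python
import hashlib
from functools import cached_property


class JpegStream:
    """Holds the raw bytes and computes the checksum on demand."""

    def __init__(self, data):
        self._data = data

    @cached_property
    def checksum(self):
        # Runs on first access only; the result is cached on the instance.
        return hashlib.sha256(self._data).hexdigest()


class Picture:
    def __init__(self, data, defer_checksum=True):
        self.stream = JpegStream(data)
        if not defer_checksum:
            # Touch the property now so the cost is paid up front.
            self.stream.checksum


def handle_jpeg(data, defer_checksum=True):
    return Picture(data, defer_checksum=defer_checksum)


pic = handle_jpeg(b"not really a jpeg", defer_checksum=True)
computed_before = "checksum" in vars(pic.stream)
digest = pic.stream.checksum
computed_after = "checksum" in vars(pic.stream)
print(computed_before, computed_after)
```

On a constrained device this keeps a directory scan cheap: files whose checksum is never requested are never hashed.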

The handler should be able to checksum the JPEG stream (not including the embedded JPEG thumbnail which may exist).
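One way to approximate this is to hash the file while skipping the metadata segments, since an embedded thumbnail lives inside an application segment (Exif uses APP1, JFIF extensions use APP0). The sketch below is a simplification under stated assumptions: it skips every APPn and COM segment, not just the thumbnail, and it does not handle fill bytes or standalone markers before the scan data. `jpeg_stream_checksum` is a hypothetical name.

```python
import hashlib
import struct


def jpeg_stream_checksum(data, algorithm="sha256"):
    """Hash a JPEG while ignoring APPn and COM (metadata) segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.new(algorithm)
    digest.update(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is image data.
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            digest.update(data[pos:end])
        pos = end
    raise ValueError("no SOS marker found")


def fake_jpeg(app1_payload):
    """Build a tiny JPEG-shaped byte string for the demo (not a real image)."""
    app1 = b"\xff\xe1" + struct.pack(">H", len(app1_payload) + 2) + app1_payload
    dqt = b"\xff\xdb" + struct.pack(">H", 4) + b"\x00\x01"
    scan = b"\xff\xda" + b"\x00\x02" + b"scan-bytes" + b"\xff\xd9"
    return b"\xff\xd8" + app1 + dqt + scan


same_image = (jpeg_stream_checksum(fake_jpeg(b"thumbnail A"))
              == jpeg_stream_checksum(fake_jpeg(b"a different thumbnail")))
print(same_image)
```

Two files that differ only in their metadata segments hash to the same value, which is what duplicate detection needs after a tag edit or a regenerated thumbnail.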

Media Database Framework & Server

Python / SQLAlchemy / PostgreSQL is probably the best fit for this part of the project.

For each handler there should be a Python class which integrates with SQLAlchemy to populate the database. PostgreSQL has a non-blocking adapter for NodeJS, which other parts of the project are written in.

Relationships:

Meta-data has sub meta-data. Example: a song has an album and an artist.
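The song / album / artist relationship could be sketched with plain foreign keys. This uses the standard-library `sqlite3` module instead of SQLAlchemy and PostgreSQL, purely so the example runs without dependencies. The table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE albums (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    artist_id INTEGER REFERENCES artists(id)
);
CREATE TABLE songs (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    album_id INTEGER REFERENCES albums(id),
    artist_id INTEGER REFERENCES artists(id)
);
""")

# Insert one artist, one album, and one song that points at both.
artist_id = conn.execute(
    "INSERT INTO artists (name) VALUES (?)", ("Some Artist",)).lastrowid
album_id = conn.execute(
    "INSERT INTO albums (title, artist_id) VALUES (?, ?)",
    ("Some Album", artist_id)).lastrowid
conn.execute(
    "INSERT INTO songs (title, album_id, artist_id) VALUES (?, ?, ?)",
    ("Some Song", album_id, artist_id))

# Resolve the sub meta-data with joins.
song_row = conn.execute("""
    SELECT songs.title, albums.title, artists.name
    FROM songs
    JOIN albums ON songs.album_id = albums.id
    JOIN artists ON songs.artist_id = artists.id
""").fetchone()
print(song_row)
```

The song keeps its own `artist_id` in addition to the album's, which leaves room for compilations where the two differ.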

Optimizations:

Application

The server should probably be implemented in Twisted as a simple HTTP server.

Scratchpad and Pseudo-Code

filescan:

Step through each file and operate on it

class filescan:
  moved_files = []
  handlers = []

  function filescan(path):
    for filename, filepath in fs.search(path):
      for handler in handlers:
        handler(filename, filepath)
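A runnable Python take on the filescan pseudo-code, assuming `os.walk` stands in for `fs.search` and that handlers are plain callables taking `(filename, filepath)`. The class name `FileScan` and the demo files are made up for illustration.

```python
import os
import tempfile


class FileScan:
    def __init__(self):
        self.handlers = []

    def add_handler(self, handler):
        self.handlers.append(handler)

    def filescan(self, path):
        """Walk path recursively and pass every file to every handler."""
        for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(path)):
            for filename in sorted(filenames):
                filepath = os.path.join(dirpath, filename)
                for handler in self.handlers:
                    handler(filename, filepath)


# Demo: scan a throwaway directory and record what the handler saw.
seen = []
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "music"))
    for relative in ("a.txt", os.path.join("music", "b.mp3")):
        with open(os.path.join(root, relative), "w") as handle:
            handle.write("x")
    scanner = FileScan()
    scanner.add_handler(lambda filename, filepath: seen.append(filename))
    scanner.filescan(root)
print(sorted(seen))
```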

mediadb: (To be built in Python as a Unix socket server)

// http://docs.python.org/library/socket.html#socket-example
// http://github.com/Kami/python-twisted-binary-file-transfer-demo/blob/master/server.py
class media:
  plugins = []

  function add(filepath):
    for plugin in plugins:
      plugin.process(filepath)

custom_frontend:

An application built on the framework

import filescan, mediadb

function safely_move_file(file, new_file):
  if not fs.exists(new_file):
    fs.mv(file, new_file)
  else:
    // identical content: overwriting loses nothing
    if same_checksum_on_file(file, new_file):
      fs.mv(file, new_file)
    else:
      safely_move_file(file, new_file + '_copy')

// handlers are called as handler(filename, filepath)
function force_utf8(filename, filepath):
  new_file = filename
  if not utf8_safe(filename):
    new_file = utf8_replace_invalid_chars(filename, '_')
  return new_file

function main:
  filescan.add_handler(force_utf8)
  mediadb.move_files = 'move' // 'copy', 'hardlink', 'symlink'
  filescan.add_handler(mediadb)
  filescan.filescan('~/')
  filescan.filescan('~/')
  query = ... 
  // see http://www.sqlalchemy.org/docs/05/ormtutorial.html#querying
  result = mediadb.session.query(query)
  result.toJSON()
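The `safely_move_file` idea could look roughly like this in Python. It is a sketch under a few assumptions: SHA-256 is used for the comparison, the recursion is written as a loop, and when the target already holds identical content the source is simply removed instead of overwriting the target. `file_checksum` is a hypothetical helper.

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, chunk_size=65536):
    """Hash a file in chunks so large media files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safely_move_file(source, target):
    """Move source to target without clobbering different content.

    Returns the path where the content ended up.
    """
    while os.path.exists(target):
        if file_checksum(source) == file_checksum(target):
            # Same content already there: dropping the source is enough.
            os.remove(source)
            return target
        target += "_copy"
    shutil.move(source, target)
    return target


# Demo: the target name is taken by a file with different content.
with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "new.txt")
    target = os.path.join(root, "existing.txt")
    with open(source, "w") as handle:
        handle.write("new content")
    with open(target, "w") as handle:
        handle.write("old content")
    final_name = os.path.basename(safely_move_file(source, target))
    with open(target) as handle:
        untouched = handle.read()
print(final_name)
```

Appending `_copy` after the extension follows the pseudo-code literally; a real frontend would probably insert it before the extension.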

Database

The database primarily stores 3 types of information:

Strings should always be Unicode (UTF-8).

Search Optimization:

Searchable strings should also be stored reversed and in lowercase, sometimes with punctuation removed.
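A minimal sketch of that normalization. The function name `searchable` and the `strip_punctuation` flag are assumptions, and "punctuation" here means ASCII punctuation only.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def searchable(text, strip_punctuation=False):
    """Lowercase the text, optionally drop punctuation, then reverse it."""
    lowered = text.lower()
    if strip_punctuation:
        lowered = lowered.translate(_PUNCTUATION_TABLE)
    return lowered[::-1]


print(searchable("Fun-Pic.JPEG"))                          # gepj.cip-nuf
print(searchable("Fun-Pic.JPEG", strip_punctuation=True))  # gepjcipnuf
```

Presumably the point of reversing is that an "ends with" search, such as matching a file extension, turns into a "starts with" search on the stored value. Note that slicing with `[::-1]` reverses code points, so strings with combining characters may need extra care.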

File meta data

File meta-data is used to detect duplicates and to track revisions.
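Duplicate detection from file meta-data could be sketched by grouping files on a `(size, md5sum)` fingerprint. The helper names and the sample paths are hypothetical, and the demo hashes in-memory bytes instead of real files to stay self-contained.

```python
import hashlib
from collections import defaultdict


def fingerprint(data):
    """Size plus MD5 hexdigest; the size is a cheap first discriminator."""
    return (len(data), hashlib.md5(data).hexdigest())


def find_duplicates(files):
    """Return sorted groups of paths whose contents share a fingerprint."""
    groups = defaultdict(list)
    for path, data in files.items():
        groups[fingerprint(data)].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


files = {
    "a/song.mp3": b"same bytes",
    "b/song-copy.mp3": b"same bytes",
    "c/other.mp3": b"different",
}
duplicates = find_duplicates(files)
print(duplicates)
```

In the database the same grouping would be a query on the stored `size` and `md5sum` columns, so no file needs to be re-read to find candidates.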

Example Schema:

CREATE TABLE IF NOT EXISTS Files (
    `id` INTEGER PRIMARY KEY,
    `viewable_id` INTEGER,
    `UUID` CHAR,
    `size` INTEGER,
    `md5sum` CHAR,
    `path_searchable` VARCHAR,
    `filename_searchable` VARCHAR,
    `extension` VARCHAR,
    `ctime` INTEGER,
    `mtime` INTEGER,
    `atime` INTEGER,
    `record_last_updated` DATE
);
-- Columns rarely used for search or sort; lazy-loaded when needed.
-- The UNIQUE constraint sits here because `path` and `filename` are defined in this table.
CREATE TABLE IF NOT EXISTS Files_viewable (
    `path` TEXT,
    `filename` TEXT,
    `sha256sum` CHAR,
    `uid` INTEGER,
    `gid` INTEGER,
    UNIQUE (`path`,`filename`)
);
CREATE INDEX IF NOT EXISTS pathIndex ON Files(`path_searchable`);
CREATE INDEX IF NOT EXISTS filenameIndex ON Files(`filename_searchable`);
CREATE INDEX IF NOT EXISTS md5Index ON Files(`md5sum`);

Search Optimization:

Items not likely to be used for search or sort are stored in a separate table and lazy-loaded when needed.

filename_searchable - limited to 255 chars, reverse, lowercase

path_searchable - limited to 255 chars, reverse, lowercase, '/' removed
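The search-then-lazy-load flow could be sketched as below, using `sqlite3` and a cut-down version of the two tables. Several things are assumptions: the helper functions, the explicit `id` column on `Files_viewable`, and the choice to reverse before truncating to 255 characters so that the end of the name survives.

```python
import sqlite3


def filename_searchable(filename, limit=255):
    # Assumption: reverse first, then truncate, so the extension is kept.
    return filename.lower()[::-1][:limit]


def path_searchable(path, limit=255):
    return path.lower().replace("/", "")[::-1][:limit]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Files_viewable (id INTEGER PRIMARY KEY, path TEXT, filename TEXT);
CREATE TABLE Files (
    id INTEGER PRIMARY KEY,
    viewable_id INTEGER,
    filename_searchable VARCHAR
);
CREATE INDEX filenameIndex ON Files(filename_searchable);
""")

for path, filename in [("/home/me/Pictures", "Fun-Pic.JPEG"),
                       ("/home/me/Music", "Song.mp3")]:
    viewable_id = conn.execute(
        "INSERT INTO Files_viewable (path, filename) VALUES (?, ?)",
        (path, filename)).lastrowid
    conn.execute(
        "INSERT INTO Files (viewable_id, filename_searchable) VALUES (?, ?)",
        (viewable_id, filename_searchable(filename)))

# "ends with .jpeg" becomes a prefix match on the reversed column.
prefix = filename_searchable(".jpeg")
matches = conn.execute(
    "SELECT viewable_id FROM Files WHERE filename_searchable LIKE ? || '%'",
    (prefix,)).fetchall()

# Only now touch the second table, and only for the rows that matched.
names = [conn.execute(
    "SELECT filename FROM Files_viewable WHERE id = ?", row).fetchone()[0]
    for row in matches]
print(names)
```

A prefix match is a form that B-tree indexes can typically serve, though whether a given database actually uses the index for `LIKE` depends on its collation settings.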

Updated at 2010-09-13