Indexing files with Scala and Elasticsearch

I was doing load testing in Elasticsearch., I’ve created a simple code in Scala to fetch files recursively and index them to Elasticsearch.

The code uses Java Mime Magic Library as a helper to get file description.

So let’s get started installing Elasticsearch

Installing and start Elasticsearch

curl -O http://cloud.github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.7.tar.gz
tar zxf elasticsearch-0.18.7.tar.gz
cd elasticsearch-0.18.7
bin/elasticsearch -f

Remove -f out if you don’t want start it in foreground.

Scala code

Our Scala code have 3 functions, one to list all files

def fetchFiles(path:String)(op:File => Unit){
  for (file <- new File(path).listFiles if !file.isHidden){
    op(file)
    if (file.isDirectory){
      fetchFiles(file.getAbsolutePath)(op)
    }
  }
}

A function to create the JSON.

def document(file:File) = {

  val json = jsonBuilder.startObject
  .field("name",file.getName)
  .field("parent",file.getParentFile.getAbsolutePath)
  .field("path",file.getAbsolutePath)
  .field("last_modified",new Date(file.lastModified))
  .field("size",file.length)
  .field("is_directory", file.isDirectory)

  if (file.isFile) {
    try{
      val m = Magic.getMagicMatch(file, true)
      json.field("description",m.getDescription)
      .field("extension",m.getExtension)
      .field("mimetype",m.getMimeType)
    }catch {
      case _ => json.field("description","unknown")
        .field("extension",file.getName.split("\\.").last.toLowerCase)
        .field("mimetype","application/octet-stream")
    }
  }
  json.endObject
}

Only files will be passed to Magic detection, there’s a treatment in case detector gets issue parsing the file. It’ll generate the final format to be indexed.

  {
      "name": "pragmatic-guide-to-git_p1_0.pdf",
      "parent": "/Users/shairon/Reference",
      "path": "/Users/shairon/Reference/pragmatic-guide-to-git_p1_0.pdf",
      "last_modified": "2010-11-26T18:55:43.000Z",
      "size": 1358963,
      "is_directory": false,
      "description": "PDF document",
      "extension": "pdf",
      "mimetype": "application/pdf"
  }

And finally the main

def main(args: Array[String]) = {
    val dir = new File(args(0))
    if (!dir.exists || dir.isFile || dir.isHidden) {
      printf("Directory not found %s\n",dir)
      System.exit(1)
    }

    val client = new TransportClient()
    client addTransportAddress(
      new InetSocketTransportAddress("0.0.0.0",9300)
    )
    fetchFiles( dir.getAbsolutePath){
      file => {
        printf("Indexing %s\n",file)
      client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))
        .setSource(document(file))
        .execute.actionGet
      }
    }
    client.close
}

As you may notice, we’re running Elasticsearch and the program in the same machine(0.0.0.0), if you want to run Elasticsearch in other machine, change the ip/hostname at

client addTransportAddress(new InetSocketTransportAddress("ip/host-name-here",9300))

The index name and type is in the line

client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))

it’s equivalent of a curl call

curl -XPUT 'http://0.0.0.0:9200/files/file/4cdb168a80e2adc397f44353b3223494' -d '...'

The only difference is port 9300, it’s used by Java Transport Client and 9200 is used straightforward by others clients.

Indexing

Indexing files is also simple. All we have to do is get this code put together so clone it https://github.com/shairontoledo/elasticsearch-filesystem-indexer

git clone git://github.com/shairontoledo/elasticsearch-filesystem-indexer.git
cd elasticsearch-filesystem-indexer

Install dependencies and compile it by maven

mvn install

Running

mvn exec:java -Dexec.mainClass=net.hashcode.fsindexer.Main -Dexec.args=/Users/me/directory/path

Set exec.args to a directory that you want to index.

Searching

After to index some files, you can search by

curl -XGET 'http://0.0.0.0:9200/files/file/_search?q=pdf&pretty=true'

You should see a response similar to

"took": 86,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
},
"hits": {
    "total": 767,
    "max_score": 0.34046465,
    "hits": [
        {
            "_index": "files",
            "_type": "file",
            "_id": "a277bda1f97f8ffa6885347b1c76b8d3",
            "_score": 0.34046465,
            "_source": {
                "name": "agile-web-development-with-rails_p1_0.pdf",
                "parent": "/Users/shairon/Reference",
                "path": "/Users/shairon/Reference/agile-web-development-with-rails_p1_0.pdf",
                "last_modified": "2011-11-03T09:59:09.000Z",
                "size": 6700177,
                "is_directory": false,
                "description": "PDF document",
                "extension": "pdf",
                "mimetype": "application/pdf"
            }
        },

      ...

Now you have data, you can improve your queries and get started with Elasticsearch in your Scala application.

4 comments

  1. Trackback: otis
  2. Trackback: alan
  3. Trackback: oscar
  4. Trackback: max

Post a comment

You may use the following HTML:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>