Category: ElasticSearch

Indexing files with Scala and Elasticsearch

While doing load testing with Elasticsearch, I created a simple program in Scala to fetch files recursively and index them into Elasticsearch.

The code uses the Java Mime Magic Library as a helper to get each file's description.

So let's get started by installing Elasticsearch.

Installing and starting Elasticsearch

curl -O http://cloud.github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.7.tar.gz
tar zxf elasticsearch-0.18.7.tar.gz
cd elasticsearch-0.18.7
bin/elasticsearch -f

Remove -f if you don't want to start it in the foreground.

Scala code

Our Scala code has three functions. The first lists all files recursively:

def fetchFiles(path:String)(op:File => Unit){
  for (file <- new File(path).listFiles if !file.isHidden){
    op(file)
    if (file.isDirectory){
      fetchFiles(file.getAbsolutePath)(op)
    }
  }
}
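As a quick, standalone sanity check of the traversal (this sketch is not part of the indexer, and the directory and file names are made up for the example):

```scala
import java.io.File
import java.nio.file.Files

// Same traversal as above: apply `op` to every non-hidden entry,
// recursing into directories.
def fetchFiles(path: String)(op: File => Unit): Unit = {
  for (file <- new File(path).listFiles if !file.isHidden) {
    op(file)
    if (file.isDirectory) fetchFiles(file.getAbsolutePath)(op)
  }
}

// Build a throwaway tree: root/docs/report.pdf
val root = Files.createTempDirectory("fsindexer-demo").toFile
val docs = new File(root, "docs")
docs.mkdir()
new File(docs, "report.pdf").createNewFile()

var seen = List.empty[String]
fetchFiles(root.getAbsolutePath) { file => seen ::= file.getName }
println(seen.sorted.mkString(", "))  // docs, report.pdf
```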

The second function creates the JSON document:

def document(file:File) = {

  val json = jsonBuilder.startObject
  .field("name",file.getName)
  .field("parent",file.getParentFile.getAbsolutePath)
  .field("path",file.getAbsolutePath)
  .field("last_modified",new Date(file.lastModified))
  .field("size",file.length)
  .field("is_directory", file.isDirectory)

  if (file.isFile) {
    try{
      val m = Magic.getMagicMatch(file, true)
      json.field("description",m.getDescription)
      .field("extension",m.getExtension)
      .field("mimetype",m.getMimeType)
    }catch {
      case _ => json.field("description","unknown")
        .field("extension",file.getName.split("\\.").last.toLowerCase)
        .field("mimetype","application/octet-stream")
    }
  }
  json.endObject
}

Only regular files are passed to Magic detection, and there is a fallback in case the detector has trouble parsing a file. The function generates the final document to be indexed:

  {
      "name": "pragmatic-guide-to-git_p1_0.pdf",
      "parent": "/Users/shairon/Reference",
      "path": "/Users/shairon/Reference/pragmatic-guide-to-git_p1_0.pdf",
      "last_modified": "2010-11-26T18:55:43.000Z",
      "size": 1358963,
      "is_directory": false,
      "description": "PDF document",
      "extension": "pdf",
      "mimetype": "application/pdf"
  }
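The fallback extension logic in the catch branch can be checked in isolation (a standalone sketch; the sample file names are invented, and note that a name with no dot falls back to the whole lowercased name):

```scala
// Fallback used when Magic fails: the text after the last dot, lowercased.
def fallbackExtension(name: String): String =
  name.split("\\.").last.toLowerCase

println(fallbackExtension("Report.PDF"))     // pdf
println(fallbackExtension("archive.tar.gz")) // gz
```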

And finally, the main function:

def main(args: Array[String]) = {
  val dir = new File(args(0))
  if (!dir.exists || dir.isFile || dir.isHidden) {
    printf("Directory not found %s\n", dir)
    System.exit(1)
  }

  val client = new TransportClient()
  client.addTransportAddress(
    new InetSocketTransportAddress("0.0.0.0", 9300)
  )
  fetchFiles(dir.getAbsolutePath) { file =>
    printf("Indexing %s\n", file)
    client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))
      .setSource(document(file))
      .execute.actionGet
  }
  client.close
}

As you may notice, we're running Elasticsearch and the program on the same machine (0.0.0.0). If you want to run Elasticsearch on another machine, change the IP/hostname in

client.addTransportAddress(new InetSocketTransportAddress("ip/host-name-here", 9300))

The index name and type are set in the line

client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))

It's equivalent to this curl call:

curl -XPUT 'http://0.0.0.0:9200/files/file/4cdb168a80e2adc397f44353b3223494' -d '...'

The only difference is the port: 9300 is used by the Java Transport Client, while 9200 is the HTTP port used directly by other clients.
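The document id is the MD5 hex of the file's absolute path, so re-indexing the same file updates the existing document instead of creating a duplicate. DigestUtils.md5Hex comes from Commons Codec; a JDK-only equivalent (shown here only to illustrate how the id is derived, with a made-up path) looks like this:

```scala
import java.security.MessageDigest

// MD5 hex digest of a string, equivalent to DigestUtils.md5Hex.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

println(md5Hex("/Users/me/some/file.pdf"))
```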

Indexing

Indexing files is also simple. All we have to do is put this code together, so clone it from https://github.com/shairontoledo/elasticsearch-filesystem-indexer

git clone git://github.com/shairontoledo/elasticsearch-filesystem-indexer.git
cd elasticsearch-filesystem-indexer

Install the dependencies and compile with Maven:

mvn install

Running

mvn exec:java -Dexec.mainClass=net.hashcode.fsindexer.Main -Dexec.args=/Users/me/directory/path

Set exec.args to a directory that you want to index.

Searching

After indexing some files, you can search with

curl -XGET 'http://0.0.0.0:9200/files/file/_search?q=pdf&pretty=true'

You should see a response similar to

"took": 86,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
},
"hits": {
    "total": 767,
    "max_score": 0.34046465,
    "hits": [
        {
            "_index": "files",
            "_type": "file",
            "_id": "a277bda1f97f8ffa6885347b1c76b8d3",
            "_score": 0.34046465,
            "_source": {
                "name": "agile-web-development-with-rails_p1_0.pdf",
                "parent": "/Users/shairon/Reference",
                "path": "/Users/shairon/Reference/agile-web-development-with-rails_p1_0.pdf",
                "last_modified": "2011-11-03T09:59:09.000Z",
                "size": 6700177,
                "is_directory": false,
                "description": "PDF document",
                "extension": "pdf",
                "mimetype": "application/pdf"
            }
        },

      ...

Now that you have data, you can refine your queries and get started with Elasticsearch in your Scala application.
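Since the documents carry structured fields, one easy refinement (an illustrative query, not from the original project) is to target a field with the Lucene query-string syntax instead of matching everywhere:

```shell
curl -XGET 'http://0.0.0.0:9200/files/file/_search?q=extension:pdf+AND+is_directory:false&pretty=true'
```

This requires a running Elasticsearch node, just like the search example above.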

ElasticSearch & Tika – the mapper-attachment plugin

ElasticSearch supports text extraction at indexing time through the mapper-attachment plugin: it takes a base64-encoded field, decodes it, and invokes Apache Tika to extract its content. I had some trouble getting it running from the current documentation, so I created a simple test project to make more sense of the mapping. I'm going to describe it here.

Mapping

ElasticSearch uses JSON for indexing and searching, and it also uses JSON in its configuration. The ElasticSearch Java client provides a set of builders that make JSON creation much easier through a DSL. In a few words, a mapping is an index layout configuration; for the test I created the following structure:

String idxName = "test";
String idxType = "attachment";
XContentBuilder map = jsonBuilder().startObject()
        .startObject(idxType)
          .startObject("properties")
            .startObject("file")
              .field("type", "attachment")
              .startObject("fields")
                .startObject("title")
                  .field("store", "yes")
                .endObject()
                .startObject("file")
                  .field("term_vector","with_positions_offsets")
                  .field("store","yes")
                .endObject()
              .endObject()
            .endObject()
          .endObject()
     .endObject();

This just holds the mapping in the object map; you can create the index and apply the mapping at the same time:

CreateIndexResponse resp = client.admin().indices().prepareCreate(idxName)
        .setSettings(ImmutableSettings.settingsBuilder()
            .put("number_of_shards", 1)
            .put("index.numberOfReplicas", 1))
        .addMapping("attachment", map)
        .execute().actionGet();

assertThat(resp.acknowledged(), equalTo(true));

As you can see above, CreateIndexResponse provides an acknowledged method to verify the response you got.

Indexing

The next step is to index a PDF document encoded in base64; I'm going to index it with id 80.

String pdfPath = ClassLoader.getSystemResource("fn6742.pdf").getPath();
String data64 = org.elasticsearch.common.Base64.encodeFromFile(pdfPath);
XContentBuilder source = jsonBuilder().startObject()
        .field("file", data64).endObject();

IndexResponse idxResp = client.prepareIndex().setIndex(idxName).setType(idxType).setId("80")
        .setSource(source).setRefresh(true).execute().actionGet();

Searching

Now you can perform a search on it:

QueryBuilder query = QueryBuilders.queryString("amplifier");

SearchRequestBuilder searchBuilder = client.prepareSearch().setQuery(query)
        .addField("title")
        .addHighlightedField("file");
SearchResponse search = searchBuilder.execute().actionGet();

Highlights

In the search result set you can get the hits and the highlights.

    assertThat(search.hits().totalHits(), equalTo(1L));
    assertThat(search.hits().hits().length, equalTo(1));
    assertThat(search.hits().getAt(0).highlightFields().get("file"), notNullValue());
    assertThat(search.hits().getAt(0).highlightFields().get("file").toString(), containsString("<em>Amplifier</em>"));

This is just a parsed view of the following search response:

{
  "took" : 136,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.005872132,
    "hits" : [ {
      "_index" : "test",
      "_type" : "attachment",
      "_id" : "80",
      "_score" : 0.005872132,
      "fields" : {
        "file.title" : "ISL99201"
      },
      "highlight" : {
        "file" : [ "\nMono <em>Amplifier</em>\nThe ISL99201 is a fully integrated high efficiency class-D \nmono <em>amplifier</em>. It is designed" ]
      }
    } ]
  }
}

It's very simple to get search working over files such as PDF and MS Office documents in your system, without a separate extraction step.