Indexing files with Scala and Elasticsearch
While load testing Elasticsearch, I wrote a small Scala program that walks a directory tree recursively and indexes every file and directory it finds into Elasticsearch.
The code uses the Java Mime Magic Library (jmimemagic) as a helper to get each file’s description, extension, and MIME type.
Let’s start by installing Elasticsearch.
Installing and starting Elasticsearch
curl -O http://cloud.github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.7.tar.gz
tar zxf elasticsearch-0.18.7.tar.gz
cd elasticsearch-0.18.7
bin/elasticsearch -f
Omit -f if you don’t want to start it in the foreground.
Scala code
Our Scala code has three functions. The first lists all files recursively:
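import java.io.File

// Calls op on every non-hidden entry under path, descending into subdirectories.
def fetchFiles(path: String)(op: File => Unit): Unit = {
  // listFiles returns null if path is not a readable directory
  val entries = Option(new File(path).listFiles).getOrElse(Array.empty[File])
  for (file <- entries if !file.isHidden) {
    op(file)
    if (file.isDirectory) {
      fetchFiles(file.getAbsolutePath)(op)
    }
  }
}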
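As a quick illustration of how the callback is used, here is a minimal, self-contained sketch that counts the files and directories under a throwaway tree. The `WalkDemo` object, the `countEntries` helper, and the temporary directory layout are made up for this example; the traversal follows the same shape as `fetchFiles` above and guards against `listFiles` returning null.

```scala
import java.io.File
import java.nio.file.Files

object WalkDemo {
  // Same traversal idea as in the article: call op on every non-hidden entry,
  // recursing into directories.
  def fetchFiles(path: String)(op: File => Unit): Unit = {
    // listFiles returns null when path is not a readable directory
    val entries = Option(new File(path).listFiles).getOrElse(Array.empty[File])
    for (file <- entries if !file.isHidden) {
      op(file)
      if (file.isDirectory) fetchFiles(file.getAbsolutePath)(op)
    }
  }

  // Builds a small temporary tree (hypothetical layout) and counts what the walk visits.
  // Returns (number of files, number of directories).
  def countEntries(): (Int, Int) = {
    val root = Files.createTempDirectory("walkdemo").toFile
    val sub = new File(root, "sub")
    sub.mkdir()
    new File(root, "a.txt").createNewFile()
    new File(sub, "b.txt").createNewFile()

    var files = 0
    var dirs = 0
    fetchFiles(root.getAbsolutePath) { f =>
      if (f.isDirectory) dirs += 1 else files += 1
    }
    (files, dirs)
  }

  def main(args: Array[String]): Unit = {
    val (files, dirs) = countEntries()
    println("files=" + files + " dirs=" + dirs)
  }
}
```

Note that the callback receives directories as well as files, which is why the indexer ends up storing both.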
Another builds the JSON document for a file:
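import java.io.File
import java.util.Date
import net.sf.jmimemagic.Magic
import org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder

def document(file: File) = {
  // Fields common to files and directories
  val json = jsonBuilder.startObject
    .field("name", file.getName)
    .field("parent", file.getParentFile.getAbsolutePath)
    .field("path", file.getAbsolutePath)
    .field("last_modified", new Date(file.lastModified))
    .field("size", file.length)
    .field("is_directory", file.isDirectory)
  if (file.isFile) {
    try {
      // Detect the type from the file contents (true enables extension hints)
      val m = Magic.getMagicMatch(file, true)
      json.field("description", m.getDescription)
        .field("extension", m.getExtension)
        .field("mimetype", m.getMimeType)
    } catch {
      // Detection failed: fall back to generic values
      case _: Exception =>
        json.field("description", "unknown")
          .field("extension", file.getName.split("\\.").last.toLowerCase)
          .field("mimetype", "application/octet-stream")
    }
  }
  json.endObject
}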
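Only regular files are passed to Magic detection. If the detector has trouble parsing a file, the catch block falls back to an "unknown" description, the extension taken from the file name, and the generic application/octet-stream MIME type. The function produces the final document to be indexed: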
{
  "name": "pragmatic-guide-to-git_p1_0.pdf",
  "parent": "/Users/shairon/Reference",
  "path": "/Users/shairon/Reference/pragmatic-guide-to-git_p1_0.pdf",
  "last_modified": "2010-11-26T18:55:43.000Z",
  "size": 1358963,
  "is_directory": false,
  "description": "PDF document",
  "extension": "pdf",
  "mimetype": "application/pdf"
}
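When detection fails, the extension is derived from the file name alone. As a sketch of that step in isolation, the hypothetical helper `fallbackExtension` below extracts the text after the last dot, but returns an empty string for names without an extension, whereas the `split("\\.").last` expression in `document` returns the whole name (for example "readme" for README). That difference is an assumption of this example, not what the original code does.

```scala
object ExtensionDemo {
  // Hypothetical helper: lower-cased text after the last dot, or "" when there is none.
  def fallbackExtension(name: String): String = {
    val dot = name.lastIndexOf('.')
    // dot <= 0 covers "README" (no dot) and ".bashrc" (leading dot only);
    // the second check covers names ending in a dot, such as "file."
    if (dot <= 0 || dot == name.length - 1) ""
    else name.substring(dot + 1).toLowerCase
  }

  def main(args: Array[String]): Unit = {
    println(fallbackExtension("Report.PDF"))     // pdf
    println(fallbackExtension("archive.tar.gz")) // gz
    println(fallbackExtension("README").isEmpty) // true
  }
}
```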
And finally, the main function:
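import java.io.File
import org.apache.commons.codec.digest.DigestUtils
import org.elasticsearch.client.transport.TransportClient
import org.elasticsearch.common.transport.InetSocketTransportAddress

def main(args: Array[String]): Unit = {
  // Expect the directory to index as the first argument
  if (args.isEmpty) {
    println("Usage: Main <directory>")
    System.exit(1)
  }
  val dir = new File(args(0))
  if (!dir.exists || dir.isFile || dir.isHidden) {
    printf("Directory not found %s\n", dir)
    System.exit(1)
  }
  // Connect to the node's transport port (9300)
  val client = new TransportClient()
  client.addTransportAddress(
    new InetSocketTransportAddress("0.0.0.0", 9300)
  )
  fetchFiles(dir.getAbsolutePath) { file =>
    printf("Indexing %s\n", file)
    // The document ID is the MD5 of the absolute path, so re-runs overwrite existing entries
    client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))
      .setSource(document(file))
      .execute.actionGet
  }
  client.close
}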
As you may have noticed, Elasticsearch and the program run on the same machine (0.0.0.0). If you want to run Elasticsearch on another machine, change the IP or hostname here:
client addTransportAddress(new InetSocketTransportAddress("ip/host-name-here",9300))
The index name and type are set in this line:
client.prepareIndex("files", "file", DigestUtils.md5Hex(file.getAbsolutePath))
It is equivalent to this curl call:
curl -XPUT 'http://0.0.0.0:9200/files/file/4cdb168a80e2adc397f44353b3223494' -d '...'
The only difference is the port: 9300 is used by the Java Transport Client, while 9200 is the HTTP port used directly by other clients.
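The document ID in both calls is the MD5 hex digest of the file’s absolute path, so indexing the same path again targets the same document. As a rough sketch of how the ID and the matching HTTP URL fit together, the example below computes the digest with the JDK’s `MessageDigest` instead of commons-codec’s `DigestUtils` (a substitution made here only to keep the sketch dependency-free). `DocIdDemo`, `docId`, and `docUrl` are hypothetical names, and the host and sample path are placeholders.

```scala
import java.security.MessageDigest

object DocIdDemo {
  // MD5 of the UTF-8 bytes of the path, rendered as 32 lower-case hex characters.
  def docId(path: String): String =
    MessageDigest.getInstance("MD5")
      .digest(path.getBytes("UTF-8"))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // URL of a document in the "files" index with type "file", as in the curl call.
  def docUrl(host: String, path: String): String =
    "http://" + host + ":9200/files/file/" + docId(path)

  def main(args: Array[String]): Unit =
    println(docUrl("localhost", "/tmp/example.pdf"))
}
```

Because the ID depends only on the path, moving or renaming a file produces a new document and leaves the old one in the index.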
Indexing
Indexing files is also simple. The complete code is available at https://github.com/shairontoledo/elasticsearch-filesystem-indexer, so all we have to do is clone it:
git clone git://github.com/shairontoledo/elasticsearch-filesystem-indexer.git
cd elasticsearch-filesystem-indexer
Install the dependencies and compile it with Maven:
mvn install
Running
mvn exec:java -Dexec.mainClass=net.hashcode.fsindexer.Main -Dexec.args=/Users/me/directory/path
Set exec.args to a directory that you want to index.
Searching
After indexing some files, you can search them with:
curl -XGET 'http://0.0.0.0:9200/files/file/_search?q=pdf&pretty=true'
You should see a response similar to this:
{
  "took": 86,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 767,
    "max_score": 0.34046465,
    "hits": [
      {
        "_index": "files",
        "_type": "file",
        "_id": "a277bda1f97f8ffa6885347b1c76b8d3",
        "_score": 0.34046465,
        "_source": {
          "name": "agile-web-development-with-rails_p1_0.pdf",
          "parent": "/Users/shairon/Reference",
          "path": "/Users/shairon/Reference/agile-web-development-with-rails_p1_0.pdf",
          "last_modified": "2011-11-03T09:59:09.000Z",
          "size": 6700177,
          "is_directory": false,
          "description": "PDF document",
          "extension": "pdf",
          "mimetype": "application/pdf"
        }
      },
      ...
Now that you have data, you can refine your queries and get started with Elasticsearch in your Scala application.
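As a starting point, the curl search shown earlier can also be issued from Scala over the HTTP port. The sketch below only builds the URI-search URL and reads the raw JSON response with `scala.io.Source`; `SearchDemo`, `searchUrl`, and `search` are hypothetical names, the host is a placeholder, and the `search` call assumes an Elasticsearch node is listening on port 9200.

```scala
import java.net.URLEncoder
import scala.io.Source

object SearchDemo {
  // Builds the same URI-search URL as the curl call, URL-encoding the query text.
  def searchUrl(host: String, query: String): String =
    "http://" + host + ":9200/files/file/_search?q=" +
      URLEncoder.encode(query, "UTF-8") + "&pretty=true"

  // Performs the GET and returns the raw JSON response as a string.
  // Requires a running Elasticsearch node reachable at host:9200.
  def search(host: String, query: String): String = {
    val source = Source.fromURL(searchUrl(host, query), "UTF-8")
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Default to the same query used in the curl example
    val query = if (args.nonEmpty) args(0) else "pdf"
    println(search("localhost", query))
  }
}
```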