Getting Started with Minion – A tutorial

Note: This is a work in progress.

Blurb:

Minion is a full-text search engine that supports ranked boolean querying as well as proximity and relational querying. It provides a simple API for indexing and retrieval, and for document similarity and document classification operations.

It’s rather new at the time of writing, so there aren’t many “getting started” (aka “Minion for Dummies”) tutorials out there. I’m going to attempt to address this in a small way.

1. Download the distribution from https://minion.dev.java.net/ (either as a tarball or from svn) and build it using “ant” or “ant jar”.

2. After the build, you should see a minion.jar file in <MINION_DIR>/dist and LabsUtil.jar in <MINION_DIR>/javalib.

3. Fire up your favourite IDE and create a Java Project. Add the two jars above to the project and you should be ready to go. Copy and paste the code as needed.
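
Before pasting the full example, it can help to confirm that the jars really ended up on the project's classpath. The following is a minimal sketch, not part of Minion: the ClasspathCheck class and its isOnClasspath helper are hypothetical names made up for this post, and the only Minion name used is com.sun.labs.minion.SearchEngineFactory, taken from the imports in the example further down. It only checks minion.jar; it says nothing about LabsUtil.jar.

```java
public class ClasspathCheck {

    // Returns true if the named class can be loaded from the current classpath.
    // The class is located but not initialized, so no library code is run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the imports of the tutorial example.
        String minionClass = "com.sun.labs.minion.SearchEngineFactory";
        if (isOnClasspath(minionClass)) {
            System.out.println(minionClass + " found");
        } else {
            System.out.println(minionClass
                    + " NOT found; check that minion.jar is on the build path");
        }
    }
}
```

If the check reports the class as missing, the usual culprit is that the jar was copied into the project folder but never added to the build path in the IDE.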

1. A simple example that indexes a few text strings and then queries them.


package com.il.minion.learn;

import java.util.List;
import com.sun.labs.minion.Result;
import com.sun.labs.minion.ResultSet;
import com.sun.labs.minion.SearchEngine;
import com.sun.labs.minion.SearchEngineException;
import com.sun.labs.minion.SearchEngineFactory;
import com.sun.labs.minion.SimpleIndexer;
import com.sun.labs.minion.FieldInfo;
import com.sun.labs.minion.FieldInfo.Attribute;

import java.util.EnumSet;

public class MinionVer1 {

	/**
	 * @param args
	 * @throws SearchEngineException
	 */
	public static void main(String[] args) throws SearchEngineException {

		String indexDir = "c:/tmp/minion";
		SearchEngine searchEngine = SearchEngineFactory
				.getSearchEngine(indexDir);

		// search engine config
		searchEngine.defineField(new FieldInfo("testbody", EnumSet.of(
				Attribute.INDEXED, Attribute.TOKENIZED, Attribute.VECTORED,
				Attribute.SAVED), FieldInfo.Type.STRING));

		searchEngine.defineField(new FieldInfo("id", EnumSet.of(
				Attribute.INDEXED, Attribute.TOKENIZED, Attribute.VECTORED,
				Attribute.SAVED), FieldInfo.Type.FLOAT));

		SimpleIndexer simpleIndexer = searchEngine.getSimpleIndexer();
		createSampleTestData(simpleIndexer);

		String query = "Linux";
		ResultSet rs = searchEngine.search(query);
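		// The element type is needed here; with a raw List the for-each
		// loop over Result below does not compile.
		List<Result> results = rs.getResults(0, 10); // could alternatively use
		// getAllResults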

		for (Result r : results) {
			System.out.println("result score = " + r.getScore() + " "
					+ r.getSingleFieldValue("testbody") + " *** " + r.getKey() + " "
					+ r.getSingleFieldValue("id") + " "
					+ r.getField("testbody").size());
		}
		searchEngine.close();
	}

	private static void createSampleTestData(SimpleIndexer si) {
		// strings borrowed from wikipedia entries
		si.startDocument("doc1");
		si
				.addField(
						"testbody",
						"Linux is a modular Unix-like operating system. It derives much of its basic design from principles "
								+ "established in Unix during the 1970s and 1980s. Linux uses a monolithic kernel, the Linux kernel, which handles "
								+ "process control, networking, and peripheral and file system access. Device drivers are integrated directly with the kernel.");
		si.addField("id", 1.0f);
		si.endDocument();

		si.startDocument("doc2");
		si
				.addField(
						"testbody",
						"The name Linux comes from the Linux kernel, originally written in 1991 by Linus Torvalds. "
								+ "The system's utilities and libraries usually come from the GNU operating system, announced in "
								+ "1983 by Richard Stallman. The GNU contribution is the basis for the alternative name GNU/Linux.");
		si.addField("id", 2.0f);
		si.endDocument();

		si.startDocument("doc3");
		si
				.addField(
						"testbody",
						"Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic, flightless birds living "
								+ "almost exclusively in the Southern Hemisphere. The number of penguin species is debated. Depending on which authority "
								+ "is followed, penguin biodiversity varies between 17 and 20 living species, all in the subfamily Spheniscinae. ");
		si.addField("id", 3.0f);
		si.endDocument();

		si.startDocument("doc4");
		si
				.addField(
						"testbody",
						"The Linux landscape is constantly changing and has a strong community of both developers and users. But where is Linux the most "
								+ "popular, and where are the different Linux distributions the most popular? To try to answer these questions, we have looked at "
								+ "data from Google with the highly useful Insights for Search, which gave us a number of interesting and often surprising results.");
		si.addField("id", 4.0f);
		si.endDocument();
		si.finish();
	}

}
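
The example hard-codes "c:/tmp/minion" as the index directory, which only makes sense on Windows. One possible way around that, sketched below, is to build the path from the JVM's temp directory instead. The IndexDirs class and tempIndexDir method are hypothetical helpers, not part of Minion, and since I have not confirmed whether Minion creates a missing index directory on its own, the sketch simply creates it up front.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexDirs {

    // Builds <system temp dir>/<name>, creates it if it does not exist yet,
    // and returns its absolute path as a String.
    static String tempIndexDir(String name) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), name);
        try {
            Files.createDirectories(dir); // no-op if the directory already exists
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dir.toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        String indexDir = tempIndexDir("minion");
        System.out.println("index directory: " + indexDir);
    }
}
```

The returned string could then be used in place of the literal assigned to indexDir in MinionVer1.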

