Finding subtopics from heterogeneous sources for result diversification

In our paper Combining Implicit and Explicit Topic Representations for Result Diversification (SIGIR’12), we present an approach that combines subtopics extracted from multiple heterogeneous sources and represented in different formats. We use this method to mine the subtopics of a query and apply it to the search result diversification task. Here is the abstract:

Result diversification aims to deal with ambiguous or multi-faceted queries by providing documents that cover as many subtopics of a query as possible. Various approaches to subtopic modeling have been proposed. Subtopics have been extracted internally, e.g., from documents retrieved in response to the query, and externally, e.g., from Web resources such as query logs. Internally modeled subtopics are often implicitly represented, e.g., as latent topic models, while externally modeled subtopics are often explicitly represented, e.g., as reformulated queries.

In this paper, we propose a framework that: i) combines both implicitly and explicitly represented subtopics; and ii) allows flexible combination of multiple external resources in a transparent and unified manner. Specifically, we use a random walk based approach to estimate the similarities of the subtopics mined from a number of heterogeneous resources, i.e., click logs, anchor text, and web n-grams. We then combine these with the internal (implicit) subtopics by constructing regularized topic models, where we use the similarities among the external subtopics to regularize the latent topics extracted from the top-ranked documents. Empirical results show that regularization with explicit subtopics extracted from a good resource leads to improved diversification results. These indicate that better (implicit) topic models are formed due to the regularization with (explicit) external resources. In our experiments, click logs and anchor text are shown to be more effective resources compared to web n-grams. Combining resources does not always lead to better results, but achieves a robust performance. This robustness is important for two reasons: it cannot be predicted which resources will be most effective for a given query, and it is not yet known how to reliably determine the optimal model parameters for building implicit topic models.

Explaining query modifications: An alternative interpretation of term addition and removal

When searching with a search engine, under what circumstances do you modify your queries, e.g., by adding or removing terms, to retrieve better results? In this paper, we investigate the motivation behind query modifications. Here is the abstract:

In the course of a search session, searchers often modify their queries several times. In most previous work analyzing search logs, the addition of terms to a query is identified with query specification and the removal of terms with query generalization. By analyzing the result sets that motivated searchers to make modifications, we show that this interpretation is not always correct. In fact, our experiments indicate that in the majority of cases the modifications have the opposite functions. Terms are often removed to get rid of irrelevant results matching only part of the query and thus to make the result set more specific. Similarly, terms are often added to retrieve more diverse results. We propose an alternative interpretation of term additions and removals and show that it explains the deviant modification behavior that was observed.

Using HBase with Jython

I started to play with HBase recently and decided to keep a log of what worked for me.
HBase version: hbase-0.90.2
Jython version: Jython 2.5.2

Steps:
1. Install Jython, HBase
2. Set up the classpath for HBase. The commands described on the HBase wiki worked fine for me with some minor changes.

-bash-4.1$ cd hbase-0.90.2

Start HBase:

-bash-4.1$ ./bin/start-hbase.sh

Check the classpath of the running HMaster process:

-bash-4.1$ ps auwx|grep java|grep org.apache.hadoop.hbase.master.HMaster|sed -r "s/.+?classpath //" | sed -r "s/ .+?//"

Copy the classpath that is printed and export it:

-bash-4.1$ export CLASSPATH=$copied_class_path
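The manual copying can probably be avoided by capturing the value directly. The following is only a sketch under a few assumptions: the HMaster process was launched with a -classpath argument, the classpath itself contains no spaces, and sed understands -E (GNU and BSD sed do). The function name extract_classpath and the variable hbase_cp are made up for illustration.

```shell
# Hypothetical helper: read `ps` output on stdin and print the value that
# follows the -classpath flag on the HMaster line (first match only).
extract_classpath() {
    grep 'org.apache.hadoop.hbase.master.HMaster' \
        | sed -n -E 's/.*-classpath ([^ ]+).*/\1/p' \
        | head -n 1
}

# "ww" asks ps not to truncate long command lines.
hbase_cp=$(ps auwwx | extract_classpath)

# Only export when something was found, so that a stopped HBase does not
# replace an existing CLASSPATH with an empty string.
if [ -n "$hbase_cp" ]; then
    export CLASSPATH="$hbase_cp"
fi
```

Because sed only prints lines on which the substitution succeeded, the grep process itself (which mentions HMaster but has no -classpath argument) is ignored.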

3. Start Jython. Since the command is quite long, I made an alias for it:

-bash-4.1$ alias pyhbase='HBASE_OPTS="-Dpython.path=$JYTHON_HOME" HBASE_CLASSPATH=$JYTHON_HOME/jython.jar $HBASE_HOME/bin/hbase org.python.util.jython'

The $JYTHON_HOME, $HBASE_HOME and $CLASSPATH variables, as well as the alias, can be stored in the ~/.profile file (for bash).
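As an illustration, a ~/.profile fragment could look like the sketch below. The install locations are placeholders, not the paths I actually used. It also defines a shell function, here called run_pyhbase (a hypothetical name), as an alternative to the alias: bash does not expand aliases in non-interactive scripts, whereas a function also works there.

```shell
# Placeholder install locations; adjust to wherever Jython and HBase live.
export JYTHON_HOME="$HOME/jython2.5.2"
export HBASE_HOME="$HOME/hbase-0.90.2"

# Same command as the alias, written as a function. "$@" forwards any
# arguments (for example a script name) to the Jython interpreter.
run_pyhbase() {
    HBASE_OPTS="-Dpython.path=$JYTHON_HOME" \
    HBASE_CLASSPATH="$JYTHON_HOME/jython.jar" \
    "$HBASE_HOME/bin/hbase" org.python.util.jython "$@"
}
```

The variables are read when the function is called, not when it is defined, so they can be changed later without redefining it.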

The Jython shell then appears:

-bash-4.1$ pyhbase
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_20
Type "help", "copyright", "credits" or "license" for more information.
>>>

To run a Jython script:

-bash-4.1$ pyhbase jython_script
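To give an idea of what such a script might contain, here is a minimal sketch that writes one to disk. It assumes the HBase 0.90 Java client classes (HBaseAdmin, HTable, Put, Get, Bytes) are reachable from Jython through the classpath set up above. The file name hbase_demo.py and the table, column family, and row names are placeholders.

```shell
# Write a small Jython script to disk. Quoting 'EOF' keeps the shell from
# expanding anything inside the script body.
cat > hbase_demo.py <<'EOF'
# Jython 2.5 (Python 2 syntax). Table, family and row names are placeholders.
from org.apache.hadoop.hbase import HBaseConfiguration, HTableDescriptor, HColumnDescriptor
from org.apache.hadoop.hbase.client import HBaseAdmin, HTable, Put, Get
from org.apache.hadoop.hbase.util import Bytes

conf = HBaseConfiguration.create()

# Create the table with one column family if it does not exist yet.
admin = HBaseAdmin(conf)
if not admin.tableExists("demo"):
    desc = HTableDescriptor("demo")
    desc.addFamily(HColumnDescriptor("cf"))
    admin.createTable(desc)

# Write one cell and read it back.
table = HTable(conf, "demo")
put = Put(Bytes.toBytes("row1"))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"))
table.put(put)

result = table.get(Get(Bytes.toBytes("row1")))
print Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting")))
table.close()
EOF
```

With HBase running, the script can then be started with pyhbase hbase_demo.py. It should print the value it has just stored.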

Useful links:
http://hbase.apache.org/book/quickstart.html (quick start with HBase)
http://hbase.apache.org/book/notsoquick.html (more complicated setting up for HBase)
http://wiki.apache.org/hadoop/Hbase/Jython (Using Jython to interact with HBase)