Text Processing Instructions



The following workflow was implemented by Paul M. King for the text processing needs of TRECVID and CIVR from 2008 to 2010. The procedure creates a text retrieval engine for the annotations associated with video shots. More information about the procedure can be found in the document “Overview of Text Processing in Verge”. More detailed background on how the Lemur search engine works can be found in “Overview of the Lemur Model for Verge”. All documentation is located in the docs/ directory.


Programs [1]


Help pages and usage documentation for each program can be displayed by invoking the following on the command line: perldoc [program_name]

Overview


  • format.pl - converts a plain text annotation file into an XML format recognized by the Lemur Toolkit indexer IndriBuildIndex

  • Search.pm - processes query strings (i.e., query expansion using Semantics.pm) and searches an index file created by the Lemur Toolkit using IndriRunQuery.

  • Parse.pm - supports Search.pm

  • Semantics.pm - supports Search.pm

  • quicklist.pl - generates an alphabetized list of unique terms from a keyword file
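As a rough illustration of what quicklist.pl is described as doing, here is a minimal sketch. It is not the script itself: the subroutine name unique_sorted_terms is hypothetical, and the keyword file format is not specified in this document, so the sketch assumes whitespace-separated terms and folds them to lower case.

```perl
use strict;
use warnings;

# Hypothetical sketch of the idea behind quicklist.pl (not the real script).
# Assumptions: terms are separated by whitespace and case is ignored.
sub unique_sorted_terms {
    my ($text) = @_;
    my %seen;
    $seen{ lc $_ } = 1 for split ' ', $text;   # one hash key per distinct term
    return sort keys %seen;                    # alphabetical order
}

# Print one unique term per line.
print "$_\n" for unique_sorted_terms("union trade Union labor\ntrade rally");
```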

Dependencies


  • Perl for Windows (Perl was originally written for Unix):

    • Strawberry Perl (recommended) - An open-source distribution with access to all modules in CPAN, the Perl module repository [2].

    • ActivePerl - A commercially supported Perl distribution. Unfortunately, it does not include all the modules we need.

  • File::Basename (Perl module)

  • base 'Exporter' (Perl module)

  • WordNet::QueryData (Perl module)

  • WordNet::stem (Perl module)

  • WordNet::Similarity::lesk (Perl module)

  • WordNet (Note: no binaries are available for Windows)

Configuration

Your Perl Library


You must ensure that the path to the Perl code library is correct in each of the three files: Search.pm, Parse.pm and Semantics.pm. If the path is wrong, you'll get an error when searching that says:

Can't locate [something] in @INC

The @INC array tells Perl where to find all the modules we are importing. If you've installed these modules in non-standard locations, you will need to modify the “use lib” line at the top of each program so that it gives the Perl interpreter the correct path to where you've installed the Perl scripts. IMPORTANT: Be sure to do this for each program! The following is an example:

use lib 'D:\Services\Apache\Apache2\htdocs\trec2009\tmodule';

NOTE: Don't forget the semicolon at the end of your lines!
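To make the "Can't locate ... in @INC" message less mysterious, the following sketch walks @INC in roughly the way Perl does when it resolves a module name. The subroutine locate_module is a hypothetical diagnostic helper, not part of Search.pm, Parse.pm or Semantics.pm; it only illustrates the lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first file in @INC that would satisfy
# "use Module::Name", or undef if there is none (the "Can't locate" case).
sub locate_module {
    my ($module) = @_;

    # Perl maps Foo::Bar to the relative path Foo/Bar.pm.
    (my $relative = "$module.pm") =~ s{::}{/}g;

    for my $dir (@INC) {
        next if ref $dir;                  # skip code-reference hooks in @INC
        my $candidate = "$dir/$relative";
        return $candidate if -f $candidate;
    }
    return;
}

# Check two of the modules listed under Dependencies.
for my $module (qw(File::Basename WordNet::QueryData)) {
    my $path = locate_module($module);
    print defined $path ? "$module found at $path\n"
                        : "$module is not in \@INC\n";
}
```

If a module is reported as missing, either install it or add the directory that contains it with a "use lib" line like the one shown above.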


Your WordNet Path


You must also ensure that the path to your local WordNet installation is correctly set at the beginning of the Semantics.pm program. See the Semantics.pm section under Further Information for details.

Indexing


We use Indri, which is a new search engine by the Lemur project. Support can be found here:

  • http://www.lemurproject.org/indri/

  • http://ciir.cs.umass.edu/~strohman/indri/

The general procedure for indexing an annotation file follows:

  1. Install the Lemur search engine [3]

  2. Create the directories:

    1. bin/ (Perl programs)

    2. etc/ (configuration files)

    3. corpus/ (annotation files for IndriBuildIndex, IndriDaemon)

  3. Format the annotation file

    1. move the annotation file to the bin/ directory and rename it with the .anno suffix

    2. invoke (from the bin/ directory): ./format.pl

    3. open the new [annotations].xml file in vi and check whether ^M characters (carriage returns left over from Windows line endings) appear at the end of each line. If so, remove them with the following vi command and save the file: :set ff=unix (NOTE: Failure to do this will result in errors when building the index.)

    4. move the output .xml file to the corpus/ directory

  4. Build the Index

    1. edit the configuration file (index.para) [4]

    2. invoke [5]: IndriBuildIndex index.para

    3. stopwords can be included in the parameter file (see the included file index-stopwords.para in the etc/ directory) or a stopword list can be referenced from the parameter file (as in index.para)
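The carriage-return cleanup described in step 3 can also be scripted instead of being done by hand in vi. The sketch below is one way to do it in Perl; the subroutine names and the example file name are hypothetical and are not part of format.pl.

```perl
use strict;
use warnings;

# Turn Windows line endings (CRLF) into Unix ones (LF) and drop a stray
# carriage return at the very end of the text.
sub strip_carriage_returns {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    $text =~ s/\r\z//;
    return $text;
}

# Rewrite a file in place. Returns the number of CRLF line endings found;
# the file is left untouched when that number is zero.
sub convert_file_to_unix {
    my ($path) = @_;

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $content = do { local $/; <$in> };      # slurp the whole file
    close $in;

    my $count = () = $content =~ /\r\n/g;      # count the CRLF pairs
    return 0 unless $count;

    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} strip_carriage_returns($content);
    close $out or die "Cannot close $path: $!";
    return $count;
}

# Example call with a hypothetical file name:
# my $fixed = convert_file_to_unix('annotations.xml');
```

Running the conversion before moving the .xml file to corpus/ should have the same effect as the :set ff=unix step.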

Searching


We use IndriRunQuery to conduct searches.

  1. Start the Lemur daemon [6]

    1. edit the configuration file (daemon.para)

    2. invoke: IndriDaemon daemon.para

    3. NOTE: There is a parameter set in the daemon.para file called “port” that may need to be adjusted if you have trouble starting the search daemon. We used 22 in 2009.

  2. Queries can be run from the command line

    1. edit the configuration file (query.para) [7]

    2. invoke: IndriRunQuery query.para

  3. In practice, however, we run queries from a Perl script called Search.pm

    1. NOTE: the configuration file is created within this script. (See explanation below.)

    2. general invocation syntax: ./Search.pm -q 'query terms' [options]

    3. simple search: ./Search.pm -q 'query terms'

    4. without query expansion: ./Search.pm -q 'query terms' -e

    5. specify number of concepts to list in the navigation file: ./Search.pm -q 'query terms' -c 10

    6. specify memory usage (can increase search speed): ./Search.pm -q 'query terms' -m 5120m

    7. specify number of results: ./Search.pm -q 'query terms' -h 25

    8. combo query with a side of fries: ./Search.pm -q 'query terms' -c 10 -m 5120m -h 25

  4. Full documentation can be found by invoking the following on the command line from within the directory where Search.pm resides: perldoc Search.pm (alternatively: perldoc ./Search.pm)
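Because Search.pm creates the query configuration itself, it may help to see roughly what such a generated parameter file could look like. The sketch below is not taken from Search.pm: the subroutine names are hypothetical, and the element names (server, count, memory, query) are assumptions based on the IndriRunQuery documentation, so check them against your Indri version.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: render an IndriRunQuery parameter file as a string.
# Element names are assumed, not copied from Search.pm.
sub query_parameters {
    my (%opt) = @_;
    my @lines = ('<parameters>');
    for my $name (qw(server count memory)) {
        next unless defined $opt{$name};       # optional settings
        push @lines, sprintf('  <%s>%s</%s>', $name, xml_escape($opt{$name}), $name);
    }
    push @lines, sprintf('  <query>%s</query>', xml_escape($opt{query}));
    push @lines, '</parameters>';
    return join("\n", @lines) . "\n";
}

print query_parameters(
    server => 'localhost:22',                  # daemon host and port (assumed values)
    count  => 25,                              # number of results, like -h 25
    memory => '5120m',                         # memory limit, like -m 5120m
    query  => '#combine(european union)',
);
```

The printed text could be saved as query.para and passed to IndriRunQuery as in step 2.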

Further Information

Search Parameters


All parameters that are ordinarily set in the configuration file for IndriRunQuery (query.para) or passed to IndriRunQuery as options can likewise be passed to Search.pm as options. They can also be hard-coded into the program by changing the values of the %params hash, located in the subroutine “params”. You will find important options here that may need to be changed, such as the server address. This is also where you can permanently code how large you want result sets to be.

The parameter for server is now set to localhost, but there is a line for the server used last year. Simply uncomment this line and comment out the localhost line if you use the same server.

Indri Query Language


The Search.pm program manages the construction of the rather complex search syntax required by the IndriRunQuery program. For example:

#weight( 1.0 european 1.0 union 0.5 #wsyn( 1.000 #od1(trades union) 1.000 #od1(labor union) 1.000 #od1(trade union) ))

Yikes! Looks scary, huh? It's not as hard as it looks. Documentation for this syntax can be found here:


  • http://ciir.cs.umass.edu/~metzler/indriquerylang.html

As mentioned above, the Perl search program constructs this query string for you and offers a much simpler interface. The trade-off is that it is less expressive than the underlying Indri query language: although the Perl program makes it easy to specify synonyms and broader and narrower terms, it does not let you easily change the retrieval weights for these terms. The weights can, however, be changed in the program within the formQuery subroutine.
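To show how a query string of this shape can be assembled, and where the weights enter, here is a minimal sketch. It is not the formQuery subroutine: build_weighted_query and its argument names are hypothetical, and the default weights are simply the ones that appear in the example query above.

```perl
use strict;
use warnings;

# Hypothetical helper (not formQuery from Search.pm): combine the original
# query terms and a group of synonym phrases into one #weight expression.
sub build_weighted_query {
    my (%args) = @_;
    my $term_weight = defined $args{term_weight} ? $args{term_weight} : 1.0;
    my $syn_weight  = defined $args{syn_weight}  ? $args{syn_weight}  : 0.5;

    # Each original term carries the same weight.
    my @parts = map { sprintf('%.1f %s', $term_weight, $_) } @{ $args{terms} };

    # Synonym phrases are matched as exact ordered phrases (#od1) and
    # grouped in a weighted synonym operator (#wsyn).
    my @synonyms = @{ $args{synonyms} || [] };
    if (@synonyms) {
        my @syn_parts = map { sprintf('1.000 #od1(%s)', $_) } @synonyms;
        push @parts, sprintf('%.1f #wsyn( %s )', $syn_weight, join(' ', @syn_parts));
    }

    return '#weight( ' . join(' ', @parts) . ')';
}

print build_weighted_query(
    terms    => [ 'european', 'union' ],
    synonyms => [ 'trades union', 'labor union', 'trade union' ],
), "\n";
```

Passing a different syn_weight, for example 0.25, lowers the influence of the expanded terms relative to the original query terms.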

Indri Retrieval Model


The retrieval model used by the Indri search engine is described here:

http://ciir.cs.umass.edu/~metzler/indriretmodel.html

Parse.pm


This module is called by the Search.pm program and handles the preprocessing of query terms before they are passed to the Indri search engine. The intention is simply to harmonize the methods used by format.pl (in the preparation of the annotation file before the index was created) with the preparation of the query string so that matches are optimized.
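The point of sharing one preprocessing routine can be illustrated with a small sketch. The rules below (lower-casing, replacing punctuation with spaces, collapsing whitespace) are assumptions chosen for illustration; the actual rules in Parse.pm and format.pl may differ, and normalize_text is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical normalizer: applied to annotation text at indexing time and
# to query text at search time so that both end up in the same form.
sub normalize_text {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z0-9\s]+/ /g;   # replace punctuation with a space
    $text =~ s/\s+/ /g;            # collapse runs of whitespace
    $text =~ s/^ | $//g;           # trim the leading and trailing space
    return $text;
}

my $annotation = normalize_text("European Union: trade-union rally.");
my $query      = normalize_text("  Trade union ");

print "annotation: $annotation\n";
print "query:      $query\n";
print "the query matches the annotation\n" if index($annotation, $query) >= 0;
```

Without the shared step, "trade-union" in the annotation and "Trade union" in the query would not line up as the same phrase.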

Semantics.pm


This module supports the Search.pm program by handling the creation of the navigation file, which specifies concepts (equivalent, broader, narrower and related terms). The navigation file is written to the bin/ directory and can be used to assist the end user in query refinement tasks. The module also handles query expansion, using synonyms to expand the original query string up to a predefined maximum number of terms.

WordNet is used to mine term relationships. The path to your local WordNet installation can be set at the beginning of the program on the following line:

$wn = WordNet::QueryData->new("c:/Program Files/WordNet/2.1/dict");

In addition, this module is capable of deriving similarity scores using the Lesk measure, which can be used for sorting and filtering concepts (e.g., putting the most relevant related terms first in the navigation file or removing strange narrower terms). This functionality is found in the simScore subroutine. Although it has been switched off due to performance problems, it can easily be re-enabled by uncommenting the relevant lines in the subroutine.
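The sorting and filtering idea can be sketched independently of WordNet. In the example below, rank_concepts is a hypothetical helper rather than the simScore subroutine, and the scores are made-up stand-in values so that the sketch runs without a WordNet installation; in real use the scoring callback might wrap the getRelatedness method of WordNet::Similarity::lesk.

```perl
use strict;
use warnings;

# Hypothetical helper: score each candidate against the query term, drop
# candidates below $min_score, and return the rest from most to least similar.
sub rank_concepts {
    my ($query_term, $candidates, $score, $min_score) = @_;

    my @scored = map  { [ $_, $score->($query_term, $_) ] } @$candidates;
    my @kept   = grep { defined $_->[1] && $_->[1] >= $min_score } @scored;

    # Sort by descending score; ties are broken alphabetically.
    return map  { $_->[0] }
           sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @kept;
}

# Stand-in scorer with invented values (not real Lesk scores).
my %fake_score = ('trade union' => 42, 'labor union' => 40, 'soldering' => 1);
my $scorer = sub {
    my ($query, $candidate) = @_;
    return $fake_score{$candidate};
};

my @ranked = rank_concepts('union', [ 'soldering', 'labor union', 'trade union' ], $scorer, 5);
print join(', ', @ranked), "\n";
```

Keeping the scorer as a callback means an expensive similarity measure can be swapped in or out without touching the ranking logic, which may help with the performance concern mentioned above.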



[1] A .pl suffix indicates a Perl script; a .pm suffix indicates a Perl module, which can be imported and used by other programs.

[2] http://www.cpan.org/

[3] http://www.lemurproject.org/

[4] http://www.lemurproject.org/lemur/indexing.php

[5] http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex

[6] http://www.lemurproject.org/docs/index.php/The_Indri_Daemon

[7] http://www.lemurproject.org/doxygen/lemur/html/IndriRunQuery.html


