[Apache(jakarta)] Solving the problem of searching Word documents with Lucene
Software Technology, Computers and Networking

Posted by lhwork on 2006/6/12 14:53:22

Lucene is a person's name: Lucene is Doug's wife's middle name; it's also her maternal grandmother's first name.

I read Che Dong's blog, which says that a parser for MS Word documents has to go through the COM object mechanism, because Word documents differ from ASCII-based RTF documents. In fact, Apache POI is fully capable of parsing MS Word documents. I modified someone else's example; treat it as a rough starting point meant to draw out better ideas, and please go easy on me.

Lucene does not prescribe the format of the data source. It only provides a general structure (the Document object) to accept input for indexing, although it seems that the input has to be text.

The exception class

    package org.tatan.framework;

    import java.io.PrintStream;
    import java.io.PrintWriter;

    public class DocumentHandlerException extends Exception {
      private Throwable cause;

      /**
       * Default constructor.
       */
      public DocumentHandlerException() {
        super();
      }

      /**
       * Constructs with message.
       */
      public DocumentHandlerException(String message) {
        super(message);
      }

      /**
       * Constructs with chained exception.
       */
      public DocumentHandlerException(Throwable cause) {
        super(cause.toString());
        this.cause = cause;
      }

      /**
       * Constructs with message and exception.
       */
      public DocumentHandlerException(String message, Throwable cause) {
        super(message);
        // Keep the nested exception so that getException() and
        // printStackTrace() can report it.
        this.cause = cause;
      }

      /**
       * Retrieves nested exception.
       */
      public Throwable getException() {
        return cause;
      }

      public void printStackTrace() {
        printStackTrace(System.err);
      }

      public void printStackTrace(PrintStream ps) {
        synchronized (ps) {
          super.printStackTrace(ps);
          if (cause != null) {
            ps.println("--- Nested Exception ---");
            cause.printStackTrace(ps);
          }
        }
      }

      public void printStackTrace(PrintWriter pw) {
        synchronized (pw) {
          super.printStackTrace(pw);
          if (cause != null) {
            pw.println("--- Nested Exception ---");
            cause.printStackTrace(pw);
          }
        }
      }
    }

The class that parses MS Word documents

    package org.tatan.framework;

    import org.apache.poi.hdf.extractor.WordDocument;

    import java.io.InputStream;
    import java.io.StringWriter;
    import java.io.PrintWriter;

    public class POIWordDocHandler {

      /**
       * Extracts the plain text of a Word document.
       * Returns null when the document contains no text.
       */
      public String getDocument(InputStream is)
        throws DocumentHandlerException {
        String bodyText = null;
        try {
          WordDocument wd = new WordDocument(is);
          StringWriter docTextWriter = new StringWriter();
          PrintWriter out = new PrintWriter(docTextWriter);
          wd.writeAllText(out);
          out.flush();
          docTextWriter.close();
          bodyText = docTextWriter.toString();
        }
        catch (Exception e) {
          throw new DocumentHandlerException(
            "Cannot extract text from a Word document", e);
        }
        if ((bodyText != null) && (bodyText.trim().length() > 0)) {
          return bodyText;
        }
        return null;
      }
    }

The class that builds the index

    package org.tatan.framework;

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Date;

    public class Indexer {

      public static void main(String[] args) throws Exception {
        File indexDir = new File("d:/testdoc/index");
        File dataDir = new File("d:/testdoc/msword");
        long start = new Date().getTime();
        int numIndexed = index(indexDir, dataDir);
        long end = new Date().getTime();
        System.out.println("Indexing " + numIndexed + " files took "
          + (end - start) + " milliseconds");
      }

      public static int index(File indexDir, File dataDir)
        throws Exception {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
          throw new IOException(dataDir
            + " does not exist or is not a directory");
        }
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir,
          new CJKAnalyzer(), true);
        writer.setUseCompoundFile(false);
        indexDirectory(writer, dataDir);
        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
      }

      private static void indexDirectory(IndexWriter writer, File dir)
        throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
          File f = files[i];
          if (f.isDirectory()) {
            indexDirectory(writer, f);  // recurse
          } else if (f.getName().endsWith(".doc")) {
            indexFile(writer, f);
          }
        }
      }

      private static void indexFile(IndexWriter writer, File f)
        throws Exception {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
          return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        POIWordDocHandler handler = new POIWordDocHandler();
        FileInputStream in = new FileInputStream(f);
        String body;
        try {
          body = handler.getDocument(in);
        } finally {
          in.close();  // always release the file handle
        }
        if (body == null) {
          return;  // no extractable text, skip this file
        }
        Document doc = new Document();
        doc.add(Field.UnStored("body", body));
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));
        writer.addDocument(doc);
      }
    }

Note: the body is added with the Field.UnStored method, so it is indexed for full-text search but not stored in the index.

The search class

    package org.tatan.framework;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;

    public class Searcher {

      public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.getDirectory("D:\\testdoc\\index", false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // AnalyzerUtils is not defined in this post; it must be available in
        // this package (it looks like the helper class from the
        // "Lucene in Action" sample code).
        Token[] tokens = AnalyzerUtils.tokensFromAnalysis(new CJKAnalyzer(), "一人一情");

        // Run one query per token, using the same analyzer as at index time.
        for (int i = 0; i < tokens.length; i++) {
          Query query = QueryParser.parse(tokens[i].termText(), "body", new CJKAnalyzer());
          Hits hits = is.search(query);
          for (int j = 0; j < hits.length(); j++) {
            Document doc = hits.doc(j);
            System.out.println(doc.get("filename"));
          }
        }
        is.close();
      }
    }

Note: do not use TermQuery here, because it fails to find Chinese text (a TermQuery is not passed through the analyzer). There is currently no real Chinese word segmentation; CJKAnalyzer only splits Chinese text into overlapping two-character tokens.
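To make the last note more concrete, here is a minimal sketch in plain Java of the overlapping two-character (bigram) splitting that CJKAnalyzer applies to Chinese text. It is a simplified stand-in, not Lucene's implementation: the class name BigramSketch and the method bigrams are made up for illustration, and the sketch ignores Latin text, stop words, token offsets, and characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Hypothetical helper: split a run of CJK characters into overlapping
    // two-character tokens. Inputs shorter than two characters yield no
    // tokens in this sketch.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [一人, 人一, 一情]
        System.out.println(bigrams("一人一情"));
    }
}
```

If the index holds two-character terms like these, a TermQuery built from the whole string "一人一情" looks for a single term that was never indexed, so it finds nothing. Passing the query text through QueryParser with the same analyzer produces the same two-character terms that were written at index time.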




Re: Solving the problem of searching Word documents with Lucene

Comment by AA (guest) on 2010/3/12 14:39:42

THANK YOU , COME ON!!



