--Successfully solving the difficulty of page-by-page text extraction from large PDF files

[Program] Successfully solving the difficulty of page-by-page text extraction from large PDF files

Original Work, Software Technology, Career

Posted by davidxiem on 2007/9/13 15:25:25

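CASE: IMME sent over a large number of PDF files. These are answer reports that they generated from their database with a tool. Each PDF file has between 400 and 900 pages, and each page holds one student's answer steps and timings, so we have to extract the text and graphics from the files, recognize them, and generate the reports that we use.

Problem: I found a fairly simple component, ASPOSE.PDF.KIT, that can do the extraction. Writing the program went smoothly, and the component also offers a way to extract page by page, so I planned to call that function once per page and be done. Unfortunately, at run time this approach showed a serious performance problem: memory usage grew rapidly.

Solution: In the end I stopped processing each file page by page. Instead I first extract all of the text from the file, then process the intermediate text file, using a marker in the text to identify where each page begins.

Partial code:

    private void Convert()
    {
        // Delete all existing records first if the box is checked
        if (checkBox1.Checked)
            new Extract().ExecuteCommand("delete from dataanalysis");

        string dir = ".\\pdf\\";

        string[] pdfnames = new string[5];
        pdfnames[0] = "Baiyan 07-1.pdf";
        pdfnames[1] = "Huijing 07-1.pdf";
        pdfnames[2] = "Nanzhuang 07-1.pdf";
        pdfnames[3] = "Shiyan 07-1.pdf";
        pdfnames[4] = "Tongji 07-1.pdf";

        // Number of pages in each PDF, in the same order as pdfnames
        int[] pdfpagecount = new int[5];
        pdfpagecount[0] = 419;
        pdfpagecount[1] = 729;
        pdfpagecount[2] = 837;
        pdfpagecount[3] = 355;
        pdfpagecount[4] = 565;

        // Aspose.Pdf.Kit.License lic = new Aspose.Pdf.Kit.License();
        // lic.SetLicense(".\\Aspose.Pdf.Kit.lic");

        for (int i = 0; i < 5; i++)
        {
            string textfilename = string.Format(".\\txt\\temp{0}.txt", i);

            // Whole-file extraction into one intermediate text file
            // (left commented out, as in the original listing)
            // PdfExtractor ext = new PdfExtractor();
            // ext.BindPdf(dir + pdfnames[i]);
            // ext.StartPage = 1;
            // ext.EndPage = pdfpagecount[i];
            // ext.ExtractText();
            // ext.GetText(textfilename);

            // Parse the intermediate text file page by page.
            // The original passed pdfpagecount[0] for every file,
            // which looks like a typo for pdfpagecount[i].
            new Extract().Begin(textfilename, pdfpagecount[i], prgbarProblem);

            // Wrap the file-level progress bar around when it is full
            if (prgbarFiles.Value == prgbarFiles.Maximum)
            {
                prgbarFiles.Value = prgbarFiles.Minimum;
            }
            prgbarFiles.PerformStep();
        }
    }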
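The post does not show how the intermediate text file is split back into pages. The following is only a minimal sketch of that step, not the author's `Extract.Begin` implementation. It assumes each page starts with a line beginning with a known marker. The class name `PageSplitter`, the method `SplitPages`, and the marker value `"Student:"` are all hypothetical, since the post does not say which marker the reports contain.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PageSplitter
{
    // Hypothetical marker, assumed to begin the first line of every page.
    // The original post does not name the marker it actually uses.
    public const string PageMarker = "Student:";

    // Reads the text line by line and yields one string per page,
    // so only the page currently being built is held in memory.
    // Any text before the first marker line is yielded as its own chunk.
    public static IEnumerable<string> SplitPages(TextReader reader, string marker)
    {
        StringBuilder page = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A marker line closes the page collected so far and opens a new one.
            if (line.StartsWith(marker, StringComparison.Ordinal) && page.Length > 0)
            {
                yield return page.ToString();
                page.Clear();
            }
            page.AppendLine(line);
        }

        // Emit the last page, which has no marker line after it.
        if (page.Length > 0)
        {
            yield return page.ToString();
        }
    }
}
```

A caller could open one of the intermediate files with `new StreamReader(".\\txt\\temp0.txt")`, pass it to `PageSplitter.SplitPages` together with the marker, and handle each page inside a `foreach` loop. Because the method is an iterator, pages are produced one at a time rather than loaded all at once, which fits the memory concern described above.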
