找回密码
 新注册用户
搜索
查看: 2885|回复: 0

How to read SGML files using Java

[复制链接]
发表于 2006-1-16 23:51:43 | 显示全部楼层 |阅读模式
I've got a text categorisation test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd"> The DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.

My questions is how to write a java program to read those 21578 documents or transform them into 21578 seperated text files.
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 新注册用户

本版积分规则

论坛官方淘宝店开业啦~
欢迎大家多多支持基金会~

Archiver|手机版|小黑屋|中国分布式计算总站 ( 沪ICP备05042587号 )

GMT+8, 2024-10-1 13:21

Powered by Discuz! X3.5

© 2001-2024 Discuz! Team.

快速回复 返回顶部 返回列表