This item is licensed under the Korea Open Government License (KOGL).

dc.contributor.author
전홍우
dc.date.accessioned
2018-11-02T04:56:05Z
dc.date.available
2018-11-02T04:56:05Z
dc.date.issued
2010-12
dc.identifier.other
K6
dc.identifier.uri
https://repository.kisti.re.kr/handle/10580/11008
dc.identifier.uri
http://www.ndsl.kr/ndsl/search/detail/report/reportSearchResultDetail.do?cn=TRKO201100007962
dc.description
funder: Ministry of Education, Science and Technology
dc.description
agency: Ministry of Education, Science and Technology
dc.description.abstract
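○ Growing demand for extracting diverse semantic information from full papers.
- Most full papers are published in PDF format.
- Most existing natural language processing research is limited to using abstracts only.
- There have been attempts to use full papers, but because the number of openly available papers is limited, researchers have had to build their own corpora.
- This corpus construction requires a great deal of time and labor and is a bottleneck in natural language processing research.
○ Development of a PDF-to-XML converter to meet the above demand.
- Analyzes the internals of a PDF document, extracts the text, and then reconstructs sentences, paragraphs, and sections to produce an XML document.
- Analyzes PDF styles per paper and per publisher to convert the PDF documents provided by each publisher into XML.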
dc.description.abstract
Most public data are published in PDF (Portable Document Format) because the format does not depend on a particular device or operating system. However, PDF processing is a bottleneck because extracting semantic information from PDF is difficult. There is therefore strong demand for converting PDF into structured formats such as XML. Such conversion makes it possible to analyze the text in a PDF and to extract information from full papers. In other words, the converter's output can be processed with various Natural Language Processing (NLP) techniques, allowing information to be analyzed and extracted more precisely. In addition, because corpora can be constructed cheaply, the data sparseness problem, one of the bottlenecks of probabilistic approaches, might be overcome.
There are several open and commercial converters from PDF to XML. MOBIPOCKET's PDF2XML can analyze the positions of all objects in a PDF, along with font and line-spacing information [1]. Besides MOBIPOCKET's PDF2XML, Matt's pdf2xml [2], Adobe's PDFtoRTF [3], and PDFBox from the Apache Software Foundation also attempt to convert PDF into XML. However, most previous work focuses only on reproducing the visual layout of the PDF. These approaches are therefore not sufficient to reconstruct sentences and words that are split across lines, pages, and tables.
The proposed approach aims to analyze the text in a PDF and to construct XML that takes word and sentence reconstruction into account. In addition, each publisher and article has its own style, so the areas of a page that hold the necessary information differ somewhat. The proposed approach therefore includes an analysis of the style of each article.
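The word and sentence reconstruction described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the function names, the trailing-hyphen heuristic, the naive sentence splitter, and the `paragraph` and `sentence` element names are all assumptions chosen for illustration. The input is assumed to be a list of text lines already extracted from a PDF by some other tool.

```python
import re
import xml.etree.ElementTree as ET


def rejoin_lines(lines):
    """Merge extracted lines into one string, repairing words split at a line end."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Assumption: a trailing hyphen marks a word broken across lines.
            # Genuine hyphenated compounds split at the hyphen would be merged too.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by space and a capital."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


def to_xml(lines):
    """Wrap each reconstructed sentence in a <sentence> element (hypothetical schema)."""
    paragraph = ET.Element("paragraph")
    for sentence in split_sentences(rejoin_lines(lines)):
        ET.SubElement(paragraph, "sentence").text = sentence
    return ET.tostring(paragraph, encoding="unicode")


if __name__ == "__main__":
    sample = ["PDF processing is a bottle-", "neck. Conversion to XML", "helps."]
    print(to_xml(sample))
```

A real converter would also need the layout information the abstract mentions (object positions, fonts, line spacing, and publisher-specific styles) to decide which lines belong to the same paragraph or section; this sketch covers only the text-level rejoining step.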
dc.publisher
한국과학기술정보연구원
dc.publisher
Korea Institute of Science and Technology Information
dc.title
PDF 파일을 XML 파일로 변환하는 도구 개발
dc.title.alternative
Conversion of PDF to XML-based Structural Information
dc.contributor.alternativeName
Jeon, Hong-U
dc.identifier.localId
TRKO201100007962
dc.identifier.url
http://www.ndsl.kr/ndsl/commons/util/ndslOriginalView.do?dbt=TRKO&cn=TRKO201100007962
dc.subject.keyword
문장 재구성
dc.subject.keyword
단어 재구성
dc.subject.keyword
문장 인식
dc.subject.keyword
띄어 쓰기
dc.subject.keyword
PDF
dc.subject.keyword
XML
dc.subject.keyword
Sentence reconstruction
dc.subject.keyword
Word reconstruction
dc.subject.keyword
Sentence detection
dc.subject.keyword
word segmentation
dc.type.local
Final report
dc.identifier.koi
KISTI2.1015/RPT.TRKO201100007962
Appears in Collections:
7. KISTI Research Outputs > Research Reports > 2010