Development of a Tool for Converting PDF Files to XML Files
Conversion of PDF to XML-based Structural Information
Jeon, Hong-U
Korea Institute of Science and Technology Information
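○ Growing demand for extracting diverse semantic information from full papers.
- Most full papers are published in PDF format.
- Most existing natural language processing research is limited to using abstracts only.
- There have been attempts to use full papers, but because the number of publicly available papers is limited, researchers have had to build their own corpora.
- This corpus construction is a bottleneck in natural language processing research, requiring a great deal of time and labor.
○ Development of a PDF-to-XML converter to meet the above demand.
- Analyzes the internals of a PDF document, extracts the text, then reconstructs sentences, paragraphs, and sections to produce an XML document.
- Analyzes PDF styles per paper and per publisher to convert the PDF documents provided by each publisher into XML.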
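
The reconstruction step described above can be illustrated with a minimal sketch. It assumes the text has already been extracted from the PDF as a list of physical lines per paragraph. The function names, the XML element names (`article`, `paragraph`, `sentence`), and the hyphen and sentence-boundary heuristics are all illustrative assumptions, not the tool's actual design.

```python
import re
import xml.etree.ElementTree as ET


def join_lines(lines):
    """Merge physical lines into one string, rejoining words split at line ends."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Assumption: a trailing hyphen marks a word broken across two lines.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def split_sentences(text):
    """Naive splitter: cut after '.', '!' or '?' when followed by a capital letter."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


def to_xml(paragraphs):
    """Build an XML string from paragraphs, each given as a list of lines."""
    root = ET.Element("article")  # hypothetical element names
    for lines in paragraphs:
        p = ET.SubElement(root, "paragraph")
        for sentence in split_sentences(join_lines(lines)):
            ET.SubElement(p, "sentence").text = sentence
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ["PDF files are hard to ana-", "lyze. Converting them to", "XML helps."]
    print(to_xml([sample]))
```

A trailing hyphen can also belong to a genuine compound word, so a real converter would probably need a dictionary check or similar before joining.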

Most public documents are published in PDF (Portable Document Format) because it does not depend on devices or operating systems. However, PDF processing is a bottleneck, because analyzing semantic information in PDF is difficult. There is therefore considerable demand for converting PDF to a structured format such as XML. Such conversion makes it possible to analyze the text in PDF files and to extract information from full papers. In other words, the converter's output can be processed with various Natural Language Processing (NLP) techniques, allowing information to be analyzed and extracted more precisely. In addition, since corpora can be constructed cheaply, the data sparseness problem, one of the bottlenecks of probabilistic approaches, might be overcome.
There are several open and commercial converters from PDF to XML. MOBIPOCKET’s PDF2XML can analyze the positions of all objects in a PDF, along with font and line-spacing information [1]. Besides MOBIPOCKET’s PDF2XML, Matt’s pdf2xml [2], PDFtoRTF of Adobe [3], and PDFBox of the Apache Software Foundation also attempt to convert PDF into XML. However, most previous work focuses only on reproducing the visual layout of the PDF. Previous approaches are therefore not sufficient to reconstruct sentences and words that are split across lines, pages, and tables.
The proposed approach aims to analyze the text in PDF files and to construct XML in which words and sentences are reconstructed. In addition, each publisher and article has its own style, so the page regions that hold the necessary information differ somewhat. The proposed approach therefore includes an analysis of the style of each article.
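
One way to picture the style analysis is a per-publisher profile that records where the body text sits on a page, so that running headers and page numbers can be filtered out before reconstruction. The sketch below is an assumption made for illustration only: the `StyleProfile` and `Fragment` types, the publisher keys, and the coordinate values are hypothetical and do not come from the proposed tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleProfile:
    """Hypothetical per-publisher layout description (values are illustrative)."""
    body_top: float     # distance from the page top where body text starts, in points
    body_bottom: float  # distance from the page top where body text ends, in points


@dataclass(frozen=True)
class Fragment:
    """A piece of extracted text with its vertical position on the page."""
    text: str
    y: float  # distance from the top of the page, in points


# Assumed profiles; a real system would derive these by analyzing sample PDFs.
PROFILES = {
    "publisher_a": StyleProfile(body_top=72.0, body_bottom=720.0),
    "publisher_b": StyleProfile(body_top=90.0, body_bottom=700.0),
}


def body_fragments(fragments, publisher):
    """Keep only the fragments that fall inside the publisher's body region."""
    profile = PROFILES[publisher]
    return [f for f in fragments if profile.body_top <= f.y <= profile.body_bottom]


if __name__ == "__main__":
    page = [
        Fragment("Journal of Examples", 40.0),  # running header
        Fragment("Body text of the paper.", 300.0),
        Fragment("12", 750.0),                  # page number
    ]
    for fragment in body_fragments(page, "publisher_a"):
        print(fragment.text)
```

Keeping the profile as plain data means that supporting another publisher only requires a new entry, not new code.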
Sentence reconstruction; Word reconstruction; Sentence detection; Word segmentation; PDF; XML
