KISTI Institutional Repository: Development of a Tool for Converting PDF Files to XML Files


This item is licensed under the Korea Open Government License.

Title: Development of a Tool for Converting PDF Files to XML Files

Alternative Title: Conversion of PDF to XML-based Structural Information

Author(s): 전홍우

Alternative Author(s): Jeon, Hong-U

Publisher: Korea Institute of Science and Technology Information (한국과학기술정보연구원)

Publication Year: 2010-12

Description: funder: Ministry of Education, Science and Technology
agency: Ministry of Education, Science and Technology

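Abstract: ○ Growing demand for extracting diverse semantic information from full papers.
- Most full papers are published in PDF format.
- Most existing natural language processing research is limited in scope, using only abstracts.
- Some studies have tried to use full papers, but because few papers are publicly available, researchers have had to build their own corpora.
- That corpus construction takes a great deal of time and labor and is a bottleneck in natural language processing research.
○ To meet this demand, a PDF-to-XML converter was developed.
- It analyzes the internal structure of a PDF document, extracts the text, and reconstructs sentences, paragraphs, and sections to produce an XML document.
- It analyzes the PDF style of each paper and each publisher so that the PDF documents a given publisher provides can be converted to XML.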

Most public data are published in PDF (Portable Document Format) because it does not depend on a particular device or operating system. However, PDF processing is a bottleneck, because analyzing semantic information in PDF is difficult. There is therefore strong demand for converting PDF into a structured format such as XML. Such conversion makes it possible to analyze the text in a PDF and to extract information from full papers. In other words, the converter's output can be processed with various Natural Language Processing (NLP) techniques, allowing information to be analyzed and extracted more precisely. In addition, because corpora can be constructed cheaply, the data sparseness problem, one of the bottlenecks of probabilistic approaches, might be overcome.
There are several open and commercial converters from PDF to XML. MOBIPOCKET’s PDF2XML can analyze the positions of all objects in a PDF, along with font and line-spacing information [1]. Besides MOBIPOCKET’s PDF2XML, Matt’s pdf2xml [2], Adobe's PDFtoRTF [3], and PDFBox from the Apache Software Foundation also attempt to convert PDF into XML. However, most previous work focuses only on reproducing the same visual layout as the PDF. These approaches are therefore not sufficient to reconstruct sentences and words that are split across lines, pages, and tables.
The proposed approach aims to analyze the text in a PDF and construct XML while reconstructing words and sentences. In addition, each publisher and article has its own style, so the regions of a page that hold the necessary information differ somewhat. The proposed approach therefore includes an analysis of each article's style.
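
As a rough illustration of the word and sentence reconstruction described above, the following Python sketch rejoins text lines that were split by PDF line breaks and wraps the resulting sentences in XML. It is not the report's implementation: the function names (rejoin_lines, split_sentences, to_xml), the element names (article, paragraph, sentence), and the heuristics are all assumptions made for this example. In particular, it assumes that a trailing hyphen always marks a word split across lines, which wrongly merges genuinely hyphenated words, and it uses a naive punctuation rule to detect sentence boundaries.

```python
import re
import xml.etree.ElementTree as ET


def rejoin_lines(lines):
    """Merge extracted text lines into one string, undoing line-break hyphenation."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across two lines.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when a capital letter follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


def to_xml(paragraphs):
    """Build article > paragraph > sentence elements from lists of text lines."""
    root = ET.Element("article")
    for lines in paragraphs:
        para = ET.SubElement(root, "paragraph")
        for sentence in split_sentences(rejoin_lines(lines)):
            ET.SubElement(para, "sentence").text = sentence
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: one paragraph whose words are broken across lines.
sample = [["PDF processing is a bottle-", "neck for text mining. Conver-", "sion to XML helps."]]
print(to_xml(sample))
```

A real converter would also need the per-publisher style analysis the abstract mentions, for example to tell body text apart from headers, footers, and table cells before rejoining lines.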

Keyword: PDF; XML; Sentence reconstruction; Word reconstruction; Sentence detection; Word segmentation

Files in This Item: There are no files associated with this item.

Appears in Collections: 7. KISTI Research Outputs > Research Reports > 2010

URI: https://repository.kisti.re.kr/handle/10580/11008
http://www.ndsl.kr/ndsl/search/detail/report/reportSearchResultDetail.do?cn=TRKO201100007962

Fulltext: http://www.ndsl.kr/ndsl/commons/util/ndslOriginalView.do?dbt=TRKO&cn=TRKO201100007962
