KISTI Institutional Repository: Development of a Preprocessor for Automatic Indexing of Korean Documents

This item is licensed under the Korea Open Government License.

Title: Development of a Preprocessor for Automatic Indexing of Korean Documents (한국어문서 자동색인을 위한 전처리기 개발)

Publisher: 한국과학기술정보연구원
Korea Institute of Science and Technology Information

Publication Year: 1997-12

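Abstract: 1. Title
Development of a Preprocessor for Automatic Indexing of Korean Documents
2. Purpose and Importance of the Research
1) Purpose of the research
Develop a model and a program for analyzing and normalizing numeric expressions
Develop a model and a program for analyzing title expressions
Develop a model and a program for recognizing location-name expressions
2) Importance of the research
Reprocessing the expressions that tagging analyzed incorrectly or failed to analyze (unregistered words) can raise the performance of automatic indexing. Because indexing performance is directly tied to the performance of a retrieval system, this work is highly important.
Previous research has concentrated on modeling tagging and indexing systems, while preprocessing, a problem that can be solved with comparatively little difficulty, has received little attention and remains underdeveloped. Yet the effect of a preprocessor on system performance is no smaller than that of improvements to the engine itself, and from this viewpoint the work is also of academic importance.
3. Content and Scope of the Research
Numeric expression study and model (modeling)
Numeric expression processing program (prototype)
Unit dictionary (prototype)
Title expression study and model (modeling)
Title expression processing program (prototype)
Title dictionary (prototype)
Location-name expression study and model (modeling)
Location-name expression recognizer (prototype)
4. Research Results and System Performance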
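Numeric expression processing module
Unit dictionary: 104 words (used as a trie dictionary)
Number-word dictionary: 50 words
Pre-unit and post-unit words: 5 and 2, respectively
Experimental accuracy: about 98% on 8,478 strings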
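The report does not describe how the trie-based unit dictionary is used, so the following is only a minimal sketch of one plausible approach: parse a leading number, then find the longest unit word in a character trie. The unit list, the number-word tables, and the function names (`build_trie`, `longest_match`, `parse_number`, `normalize`) are all illustrative assumptions, not the report's data or code.

```python
# Sketch only: normalize "<number><unit>" strings with a trie of unit words.
# UNITS and the numeral tables are illustrative, not the report's dictionaries.

NUMBER_WORDS = {"일": 1, "이": 2, "삼": 3, "사": 4, "오": 5,
                "육": 6, "칠": 7, "팔": 8, "구": 9}
SCALE_WORDS = {"십": 10, "백": 100, "천": 1000}
UNITS = ["미터", "킬로미터", "그램", "킬로", "킬로그램"]


def build_trie(words):
    """Build a character trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def longest_match(trie, text, start):
    """Return the end index of the longest trie word starting at `start`, or -1."""
    node, best = trie, -1
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if None in node:
            best = i + 1
    return best


def parse_number(text):
    """Parse leading Arabic digits or a simple Sino-Korean numeral below 10,000.

    Returns (value, index after the number), or (None, 0) if no number is found.
    """
    i = 0
    while i < len(text) and text[i].isdigit():
        i += 1
    if i > 0:
        return int(text[:i]), i
    total, digit, seen = 0, 0, False
    while i < len(text):
        ch = text[i]
        if ch in NUMBER_WORDS:
            digit = NUMBER_WORDS[ch]
        elif ch in SCALE_WORDS:
            total += (digit or 1) * SCALE_WORDS[ch]
            digit = 0
        else:
            break
        seen = True
        i += 1
    return (total + digit, i) if seen else (None, 0)


def normalize(text, unit_trie):
    """Return (value, unit) for a numeric expression, or None if it does not parse."""
    value, pos = parse_number(text)
    if value is None:
        return None
    while pos < len(text) and text[pos] == " ":  # optional space before the unit
        pos += 1
    end = longest_match(unit_trie, text, pos)
    if end != len(text):  # the unit must cover the rest of the string
        return None
    return value, text[pos:end]


UNIT_TRIE = build_trie(UNITS)
```

With these assumptions, `normalize("삼십오 미터", UNIT_TRIE)` and `normalize("35미터", UNIT_TRIE)` both yield the same normalized pair, which is the kind of uniform representation an indexer can rely on. The sketch ignores pre-unit and post-unit words and units that begin with a numeral character.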
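Title expression processing module
Title dictionary: 258 words (organized as a reverse trie dictionary)
Surname dictionary: 50 words (stored within the program)
Experiment on about 10,000 (name + title) expressions: 96%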
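A reverse trie stores each word back to front, which makes it cheap to find a dictionary word at the end of a string, where a title follows a name. The sketch below illustrates that idea under stated assumptions: the title list, the surname set, and the function names are hypothetical, and only single-syllable surnames are handled. It is not the report's implementation.

```python
# Sketch only: split "<name><title>" using a reverse trie of title words.
# TITLES and SURNAMES are illustrative, not the report's dictionaries.

TITLES = ["부장", "과장", "사장", "부사장", "대표이사"]
SURNAMES = {"김", "이", "박"}


def build_reverse_trie(words):
    """Build a trie over reversed words; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def longest_suffix(trie, text):
    """Return the length of the longest dictionary word that ends `text`, or 0."""
    node, best = trie, 0
    for depth, ch in enumerate(reversed(text), start=1):
        node = node.get(ch)
        if node is None:
            break
        if None in node:
            best = depth
    return best


def split_name_title(text, title_trie, surnames):
    """Return (name, title) if `text` is a surname-initial name plus a title."""
    n = longest_suffix(title_trie, text)
    if n == 0 or n == len(text):  # no title found, or nothing left for a name
        return None
    name, title = text[:-n].rstrip(), text[-n:]
    if not name or name[0] not in surnames:
        return None
    return name, title


TITLE_TRIE = build_reverse_trie(TITLES)
```

Scanning from the end picks the longest title, so `split_name_title("김철수부사장", TITLE_TRIE, SURNAMES)` separates "부사장" rather than the shorter "사장".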
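Location-name recognizer module
5,099 domestic location-name entries
Prefix bigrams: 3,278
Infix bigrams: 172
Postfix bigrams: 719
Recognition rate for location names: 99.6% (error: 0.4%)
False-recognition rate for non-location names: 14% (correct recognition: 86%)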
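The abstract lists the bigram inventories but not the decision rule, so the following sketch assumes one: a word is accepted as a location name when its first bigram is a known prefix bigram, its last bigram is a known postfix bigram, and every bigram in between is a known infix bigram. The four training names and the function names are illustrative assumptions.

```python
# Sketch only: bigram-based location-name recognition under an assumed rule.

def bigram_parts(name):
    """Split a name (length >= 2) into (prefix bigram, infix bigrams, postfix bigram)."""
    bigrams = [name[i:i + 2] for i in range(len(name) - 1)]
    return bigrams[0], bigrams[1:-1], bigrams[-1]


def train(names):
    """Collect prefix, infix, and postfix bigram sets from known location names."""
    prefix, infix, postfix = set(), set(), set()
    for name in names:
        if len(name) < 2:
            continue
        first, middle, last = bigram_parts(name)
        prefix.add(first)
        infix.update(middle)
        postfix.add(last)
    return prefix, infix, postfix


def looks_like_location(word, model):
    """Accept a word only if all of its bigrams appear in the matching sets."""
    if len(word) < 2:
        return False
    prefix, infix, postfix = model
    first, middle, last = bigram_parts(word)
    return first in prefix and last in postfix and all(b in infix for b in middle)


# Illustrative training data, not the report's 5,099 entries.
MODEL = train(["서울특별시", "대전광역시", "유성구", "안성시"])
```

A rule of this kind generalizes beyond the training list: with the names above, the unseen string "유성시" is accepted because "유성" and "성시" were each seen in different names. That generalization is also how such a recognizer can accept some non-location words.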
5. Recommendations for Use
Use as a preprocessor for an automatic indexer
Use in other systems that require normalization of linguistic phenomena

Summary of Research Results
1. Title
Preprocessing for the automatic indexing of Korean documents
2. Research Goals
Develop an algorithm for analyzing numeric expressions
Develop a program for the analysis of numeric expressions
Develop an algorithm for analyzing title expressions
Develop a program for the analysis of title expressions
Develop a method to recognize location nouns.
3. Importance of the Study
Preprocessing allows more nouns to be analyzed and expressions with highly variable patterns to be normalized. Automatic indexing can be significantly improved by reducing the number of missed nouns and by representing variable expressions uniformly. The quality of an information system depends critically on the accuracy of its indexing, so this study contributes to the practical improvement of information retrieval systems.
4. Research Results
1) numeric expression analysis
measurement nouns: a dictionary of 104 words was constructed.
numeric words: 50 words were identified.
pre-unit and post-unit words: 5 and 2 words were defined, respectively.
experiments
98% accuracy on 8,478 input strings
2) title expression analysis
title nouns: a dictionary of 258 words was constructed.
last names: 50 words were identified.
experiments
96% accuracy on about 10,000 title expressions
3) location noun identification
from 5,099 Korean domestic location names,
prefix bigrams: 3,278
infix bigrams: 172
postfix bigrams: 719
experiments
correct recognition of locations: 99.6% (error: 0.4%)
incorrect recognition of non-locations: 14% (correct: 86%)
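The two location figures are measured on different populations: the first on true location names, the second on non-location words. The sketch below shows how such rates are computed from a confusion matrix. The counts are hypothetical, chosen only to reproduce the reported percentages, because the summary does not give the test-set sizes.

```python
# Sketch only: recognition rate and false-recognition rate from hypothetical counts.

def recognition_rates(tp, fn, fp, tn):
    """Return (rate of locations recognized, rate of non-locations misrecognized)."""
    recognition_rate = tp / (tp + fn)        # share of true locations accepted
    false_recognition_rate = fp / (fp + tn)  # share of non-locations accepted
    return recognition_rate, false_recognition_rate


# Hypothetical counts: 1,000 location names and 1,000 non-location words.
rec, false_rec = recognition_rates(tp=996, fn=4, fp=140, tn=860)
print(f"{rec:.1%} recognized, {false_rec:.1%} falsely recognized")
```

Because the second rate depends on how many non-location words appear in real text, the overall share of correct decisions in a deployed indexer cannot be read off these two numbers alone.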

Appears in Collections: 7. KISTI Research Outcomes > Research Reports > 1997

URI: https://repository.kisti.re.kr/handle/10580/10497
http://www.ndsl.kr/ndsl/search/detail/report/reportSearchResultDetail.do?cn=TRKO200500060189
