KISTI Institutional Repository: Development of a Gene Mining Algorithm (유전자 발굴 알고리즘 개발)

This item is licensed under the Korea Open Government License.

dc.contributor.author: 유재천

dc.contributor.author: 정동수

dc.contributor.author: 박춘구

dc.date.accessioned: 2018-11-02T04:54:37Z

dc.date.available: 2018-11-02T04:54:37Z

dc.date.issued: 2002-11

dc.identifier.other: D2

dc.identifier.uri: https://repository.kisti.re.kr/handle/10580/10409

dc.identifier.uri: http://www.ndsl.kr/ndsl/search/detail/report/reportSearchResultDetail.do?cn=TRKO200500060094

dc.description: funder: Office for Government Policy Coordination (국무조정실)

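dc.description.abstract: I. Title
Development of a Gene Mining Algorithm
II. Purpose and Importance of the Research
It has already been a year since the human genome was completed. The DNA sequences of many organisms, including humans, have recently been determined, and the volume of information is growing exponentially. Genome analysis is reported to be complete or under way for about 800 species. Scientists can no longer carry out research without tools for analyzing this enormous amount of data.
To learn what a new gene means biologically (whether it is already known, how similar it is to existing genes, whether it is truly novel, what structure it takes when translated into protein, and what function it has), researchers run similarity or homology searches against databases that collect the protein amino acid sequences and DNA sequences biologists have already accumulated. The similar sequences returned by such a search provide indirect clues that help answer these questions.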
The most widely used algorithms in this process are dynamic programming (1981), BLAST (1990), and FASTA (1988). Dynamic programming finds the sequences in the database that are similar to the query gene with the optimal cost. Its drawback is that it takes a long time to compute and demands enormous computing power. To address this, an algorithm was developed that quickly finds sequences with a near-optimal cost: BLAST. BLAST uses a heuristic algorithm that does not examine every possible alignment. It is the most widely used sequence aligner to date, but it still leaves much room for improvement.
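
The cost of the dynamic programming search described above can be seen in a short sketch. The 1981 date suggests the Smith-Waterman local alignment recurrence, so the code assumes that recurrence; the scoring values (`match`, `mismatch`, `gap`) and the function names are illustrative assumptions, not taken from the report.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between two sequences.

    The scoring values are illustrative assumptions, not from the report.
    """
    best = 0
    prev = [0] * (len(target) + 1)  # previous row of the DP table
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            # Best of: restart (0), align the two bases, or open a gap.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_hit(query: str, database: dict) -> str:
    """Exhaustively score the query against every database entry."""
    return max(database, key=lambda seq_id: smith_waterman_score(query, database[seq_id]))
```

Every query-target pair fills a table with len(query) × len(target) cells, and `best_hit` repeats that for each database entry, which is why an exhaustive search becomes impractical for large databases.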
This study therefore investigates a new algorithm called 3D-Gene, which aims to be as fast as BLAST, to offer high accuracy, and to allow the database to be compressed.

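III. Content and Scope of the Research
In the first year of the project, a basic core algorithm containing only the fundamental functions of 3D-Gene is developed.
IV. Research Results
To test the performance of the algorithm, a computer simulation was run using 100 data sets. The simulation produced the following results.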
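(i) Accuracy:
Using the compressed database and the 3D-Gene algorithm with the number of candidates set to one, the ID of the query sequence was identified correctly in about 80% or more of cases at a substitution rate of 20% and a deletion rate of 5%. (In practice, 10 or more candidates are usually specified.)
(ii) Compression:
The memory required for the database was reduced by 70% or more.
(iii) Speed:
Because BLAST's search speed varies with its many search options, a quantitative comparison is difficult at this stage. However, 3D-Gene is estimated to be faster than the existing BLAST algorithm because it does not need the initial search for conserved zones: information on the Highly Conserved Start_Zone is stored in the database in advance.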
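
The accuracy test in (i) can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: the abstract does not describe 3D-Gene's internals, so a simple shared k-mer count stands in for the search, and the sequence length, `k`, the random database, and all function names are hypothetical. Only the substitution rate, deletion rate, data set size, and single-candidate setting come from the text.

```python
import random

ALPHABET = "ACGT"


def mutate(seq, sub_rate, del_rate, rng):
    """Apply independent per-base deletions and substitutions."""
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # base deleted
        if r < del_rate + sub_rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top1_id(query, index, k):
    """Stand-in search (NOT 3D-Gene): entry sharing the most k-mers with the query."""
    q = kmers(query, k)
    return max(index, key=lambda seq_id: len(q & index[seq_id]))


def top1_accuracy(n_seqs=100, length=300, sub_rate=0.20, del_rate=0.05, k=6, seed=0):
    """Fraction of mutated queries whose single best candidate is the true source."""
    rng = random.Random(seed)
    database = {f"seq{i}": "".join(rng.choice(ALPHABET) for _ in range(length))
                for i in range(n_seqs)}
    index = {seq_id: kmers(seq, k) for seq_id, seq in database.items()}
    hits = sum(top1_id(mutate(seq, sub_rate, del_rate, rng), index, k) == seq_id
               for seq_id, seq in database.items())
    return hits / n_seqs
```

Calling `top1_accuracy()` with the defaults mirrors the reported conditions (100 sequences, 20% substitution, 5% deletion, one candidate). The value it returns characterizes only this toy k-mer search and says nothing about 3D-Gene itself.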
V. Applications
This basic algorithm could be applied to a wide range of fields, including gene discovery, identification of gene function, and disease diagnosis.
(i) Comparative sequence analysis
(ii) Specific Gene Hunting
(iii) Evolutionary Relationship

dc.description.abstract: I. Title
The Development of Gene Mining Algorithm
II. Objective of the study and its importance
It has been a year since the human genome sequence was completed. Recently, the DNA of many organisms, including humans, has been sequenced, and the amount of sequence information is increasing rapidly. Whole-genome sequences for more than 800 organisms are either complete or being determined. Confronted with this wealth of genomic sequence, scientists cannot analyze biological information or interpret genetic messages without efficient sequence analysis tools. When a new DNA molecule is sequenced, the next step a biologist is eager to take is to search databases for DNA sequences similar to the newly obtained one. This routine is very important because sequence homology provides clues to evolution and gene function. There are many useful tools for sequence alignment, such as dynamic programming, FASTA, and BLAST. The computational cost of dynamic programming is proportional to the product of the lengths of the two sequences, so it is not suitable for searching large-scale sequence databases. To overcome this problem, BLAST, instead of adopting global alignment, finds the database sequence segments that are similar enough to a segment of the query sequence according to a local similarity score. Although BLAST has become the most widely used search engine for sequence databases, further improvements to sequence alignment are clearly needed.
Our goal in this research is to develop a new sequence alignment algorithm called "3D-Gene".

The advantage of the proposed algorithm is that it can align query sequences well against a highly compressed database while running as fast as the well-known BLAST.

III. Content and scope of the study
In the first year of the study, develop a core algorithm that includes the basic functions of 3D-Gene.

IV. Result of the study
To examine the effectiveness of the proposed algorithm for sequence alignment, a computer simulation was performed using 100 test data sets. The simulation results presented in this document show the following:
(i) Accuracy: query sequences could be identified with practical accuracy using a compressed database and a signal alignment.
(ii) Compression: the storage required for the database could be reduced by about 70%.
(iii) Speed: a quantitative comparison between the algorithms is difficult at present because BLAST's speed depends on its various search options. However, since 3D-Gene already stores its Highly Conserved Start_Zone data in the database, the algorithm is estimated to be faster than BLAST.
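
For a sense of scale on the compression figure in (ii), the sketch below packs each nucleotide into 2 bits instead of one 8-bit character, which alone cuts storage by 75%. The abstract does not describe how 3D-Gene compresses its database, so this is only a generic illustration; `pack`, `unpack`, and the A/C/G/T-only alphabet are assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly filled.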

V. Application
This algorithm can be applied in a wide range of fields, such as disease diagnosis, functional genomics, and gene mining:
(i) Comparative sequence analysis
(ii) Specific Gene Hunting
(iii) Evolutionary Relationship