Distinctive features of SPARK

1. Simplicity of data processing and computation

2. Computation speed

3. Scalability and throughput for large-scale datasets

4. Support for diverse data types

5. Ease of cluster computing and deployment using various cluster managers

6. Ability to work with a wide range of big data storage systems and sources

7. Rich APIs written in widely used, modern programming languages




What is SPARK?

Spark is a framework for fast in-memory big data processing and general-purpose cluster computing, with sophisticated APIs for advanced data analytics.


- Unlike Hadoop-based MapReduce, which is suited only to batch jobs, Spark is a general execution engine well suited to applying advanced analytics to both real-time and batch (static) data.



Traditional distributed computing

- The traditional data processing paradigm is generally the client-server model, which was used to move data to the code.



- The DB server (server for short) mainly performs the data operations and returns the results to the client program.

- The variety of operations and the number of client devices have also grown exponentially.

- As a result, the array of computing endpoints behind the server becomes complex.

- In addition, application (client) servers and DB servers must be added to store and process the increased workload.

- Consequently, data propagation between nodes and the back-and-forth transfer of data across this network also grow sharply, so the network itself becomes a performance bottleneck.


Thus, in this kind of computing paradigm, performance (in both scalability and throughput) is unquestionably a problem.



Moving the code to the data

- A new computing paradigm is sorely needed: instead of moving the data to the code, move the code (or application) to the data, so that the data is manipulated and processed, and the related computation happens, at the data's home (where it is stored).


- It is worth noting that moving the code to the data significantly changes the computing structure.

- The amount of data transferred over the network is greatly reduced.

- This is because only a small piece of software code is sent to the compute nodes, and only a small subset of the original data is returned as the result.

- The concepts that make this possible are Spark's RDDs, Datasets, and DataFrames, along with its other useful features.

1. Download and extract Anaconda

2. "conda: command not found" error -> set the PATH

- Add the following to your shell config file, or enter it in the terminal.

- With the current directory at ~:

export PATH="$HOME/anaconda3/bin:$PATH"

- Verify with conda --version


3. Download and extract Python (PyCharm)

4. Install TensorFlow

- Enter the following in the terminal:

pip install --ignore-installed --upgrade tensorflow

The most common approach to implementing a multiplexing server: select()


With the select function, you can gather several file descriptors in one place and observe them all at once.

* Observing a file descriptor can be read as observing a socket.

An fd_set variable is what is used to observe file descriptors. It is a bit array whose elements are 0 or 1.

With this type, three things are observed for each file descriptor:

- the read buffer

- the write buffer

- exceptions


For each observation scope, declare an fd_set variable and pass its address to the select function.

For any scope you are not interested in, simply pass NULL.
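
Below is a minimal C sketch of the fd_set macros used to build these sets (the descriptor name sock is an assumption for illustration); each macro flips or tests one bit in the bit array.

#include <sys/select.h>

void fd_set_demo(int sock)
{
    fd_set reads;

    FD_ZERO(&reads);             /* clear every bit to 0 */
    FD_SET(sock, &reads);        /* set the bit for sock to 1: observe it */

    if (FD_ISSET(sock, &reads))  /* is the bit for sock set? */
        FD_CLR(sock, &reads);    /* clear the bit: stop observing sock */
}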


The parameters of select are as follows.



readfds: register all the file descriptors you care about for the read buffer, i.e., 'whether received data exists', and pass the address of that variable.

writefds: register all the file descriptors you care about for the write buffer, i.e., 'whether data can be sent without blocking', and pass the address of that variable.

exceptfds: register all the file descriptors you care about for exceptions, i.e., 'whether an exceptional condition has occurred', and pass the address of that variable.


※ readfds, writefds, and exceptfds are all separate fd_set variables.



Now let's look at nfds, which indicates the number of file descriptors to observe.

The thing to be careful about here is that you do not pass the exact count of observed file descriptors;

instead, you pass 'the largest file descriptor value + 1' as the first argument of select.

This means select scans the fd_set bit array up to nfds and catches the file descriptors whose bit is 1.


Q. Couldn't select then detect changes on file descriptors we do not want to observe?

In the fd_set variable we set the bits of the file descriptors we care about to 1. This fd_set variable is then used only for initialization,

and we declare another fd_set variable and copy the original into it.

So before every call to select we re-initialize from the original fd_set (1), and pass the address of the copied fd_set (2) in select's fd_set argument.

select determines which file descriptors to observe from the bits set to 1, and it modifies the set every time it returns; without re-initializing each time, the set of observed descriptors would keep changing and become meaningless.



The purpose of select is to detect changes, so it does not return until a change occurs in one of the three observation scopes. To avoid this kind of infinite blocking, select returns 0 once the configured timeout expires.


On failure it returns -1; on success it returns the number of file descriptors on which a change occurred.
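
Putting the pieces together, here is a minimal C sketch of the pattern described above (the listening socket serv_sock and the echo behavior are assumptions for illustration): keep an original fd_set, copy it before every call, pass the largest descriptor + 1 as nfds, and distinguish timeout (0), error (-1), and changed descriptors.

#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

void select_loop(int serv_sock)
{
    fd_set reads, cpy_reads;            /* original set (1) and working copy (2) */
    int fd_max = serv_sock;
    struct timeval timeout;

    FD_ZERO(&reads);
    FD_SET(serv_sock, &reads);          /* interested in 'received data exists' */

    while (1) {
        cpy_reads = reads;              /* re-initialize before every select call */
        timeout.tv_sec = 5;
        timeout.tv_usec = 0;

        int cnt = select(fd_max + 1, &cpy_reads, NULL, NULL, &timeout);
        if (cnt == -1)                  /* failure */
            break;
        if (cnt == 0)                   /* timeout expired, no change */
            continue;

        for (int fd = 0; fd <= fd_max; fd++) {
            if (!FD_ISSET(fd, &cpy_reads))
                continue;
            if (fd == serv_sock) {      /* change on the listening socket: new connection */
                int clnt_sock = accept(serv_sock, NULL, NULL);
                FD_SET(clnt_sock, &reads);
                if (clnt_sock > fd_max)
                    fd_max = clnt_sock;
            } else {                    /* data arrived on a client socket */
                char buf[1024];
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) {         /* EOF or error: stop observing and close */
                    FD_CLR(fd, &reads);
                    close(fd);
                } else {
                    write(fd, buf, len);   /* echo back */
                }
            }
        }
    }
}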









Prerequisite: install the JDK and set the JAVA_HOME environment variable

sudo apt-get update

sudo apt-get -y install openjdk-8-jdk


echo $JAVA_HOME

If nothing is printed, or the literal text comes back unchanged, then:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


1. Go to the official Apache Spark download page

https://goo.gl/DLvWLR



2. Choose the package type

If you already have Hadoop installed, you can pick the type that matches that Hadoop version.


3. Download Spark


cd $HOME/Downloads

tar -xvf spark*

rm spark*tgz            ------- delete the downloaded Spark tgz archive


4. Move the extracted Spark directory from Downloads to bin/sparks


cd $HOME                ------ move to the home directory

mkdir -p bin/sparks    ------ create the bin/sparks directory

mv Downloads/spark-* bin/sparks    ------ move the extracted Spark directory from Downloads to bin/sparks


5. Create a symbolic link

cd $HOME/bin

ln -s sparks/spark-2.4.0-bin-hadoop2.7 spark    ------ create the symbolic link


Symbolic links are very useful for managing multiple versions of a program on Linux.

Run the tree -L 2 command in the ~/bin folder.

You can see that the spark folder is actually a symbolic link pointing to another folder inside the sparks directory.


The reason for using a symbolic link like this is that the Spark root folder can always be referenced in the same way.

To use a different version of Spark, simply change the symbolic link to point to that version's installation directory.


From now on, this $HOME/bin/spark directory is referred to as the Spark root.



In a database, consistency means "whether the data written to the database is valid according to all defined rules, including constraints, triggers, and cascades."

And correctness means "whether transactions are carried out correctly in every way the application programmer intends."


Therefore, consistency in stream processing is actually closer in meaning to correctness in databases.




Regarding other common uses of the term "streaming", there are a few I hear regularly; I introduce each of them below with a more precise, descriptive term.


1. Unbounded data

The key difference between unbounded data sets and bounded data sets is finiteness.

The terms streaming and batch are problematic when applied to data sets, because they imply the use of a specific type of execution engine to process those data sets.




2. Unbounded data processing

This is an ongoing mode of data processing applied to unbounded data.

I personally like using the term streaming for this type of data processing, but in this context the word streaming again implies the use of a streaming execution engine, which is terribly misleading: repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived, and conversely, well-designed streaming systems are more than capable of handling batch workloads over bounded data.




To summarize: bounded and unbounded data sets are distinguished by finiteness, and both streaming systems and batch systems can be used to process either type of data set.

streaming ≠ unbounded (infinite) data sets

batch ≠ bounded (finite) data sets


3. Low-latency, approximate, and/or speculative results

These kinds of results are most often associated with streaming engines. Batch systems have traditionally not been designed with low latency or speculative results in mind, and of course batch systems are fully capable of producing approximate results if instructed to do so. So, as with the terms above, it is far better to describe these results for what they are than by how they have historically manifested.



From here on, whenever I use the term "streaming", you can safely assume I mean an execution engine for processing unbounded data sets. When I mean any of the other concepts, I will explicitly say unbounded data, unbounded data processing, or low-latency / approximate / speculative results.


On the greatly exaggerated limitations of streaming


Let us briefly cover what streaming systems can and cannot do,

with the focus on what they can do. One of the biggest things I want to get across in this post is how capable a well-designed streaming system can be.

Streaming systems have long been relegated to a somewhat niche market of providing low-latency, inaccurate/speculative results, often paired with a more capable batch system that provides the eventually correct results (i.e., the Lambda Architecture).


The basic idea of the Lambda Architecture is to run a streaming system alongside a batch system, both performing essentially the same computation. The streaming system gives you low-latency, inaccurate results, and some time later the batch system comes along and provides the correct output.

The streaming system produces inaccurate results either because it uses an approximation algorithm or because the streaming system itself does not guarantee correctness.

We will come back to correctness shortly.


The Lambda Architecture, proposed by Twitter's Nathan Marz (creator of Storm), ended up being quite successful, because at the time it really was a fantastic idea.


However, streaming engines were weak on correctness, and batch engines were as inherently unwieldy as you'd expect. Unfortunately, maintaining a Lambda system is a huge hassle: you have to build, provision, and maintain two independent versions of your pipeline, and then also merge the results from the two pipelines at the end.


As someone who has spent years working on strongly-consistent streaming engines, I found the whole premise of the Lambda Architecture somewhat unsavory. I am a big fan of Jay Kreps' Questioning the Lambda Architecture post; it is one of the most visible statements against the necessity of dual-mode execution.


Frankly, I would go further and say that a well-designed streaming system provides a strict superset of batch functionality.

The problem with one-sided connection termination: close()

After calling close:

- The read buffer is discarded entirely the moment close is called.

- The write buffer flushes everything it holds, then sends 'EOF', which marks the end of the file, and the stream is terminated.

※ Strictly speaking, EOF marks the end of the file, not the end of the stream. When you send data and then call close, at some point the contents of the output stream are fully drained and 'EOF', marking the end of the file, is sent.

The sender keeps transmitting BUF_SIZE bytes at a time, and the last write may not fill BUF_SIZE, but the read blocked on that last chunk is still unblocked.

A read is also unblocked when EOF arrives. To distinguish the two cases, therefore, check read's return value: when EOF arrives, read returns 0 (it returns -1 on error), and that is how the receiver knows all writes (transmissions) have completed.


- The side that calls close closes its buffers, and the EOF tells the peer that the output stream has been closed.

- close means the complete destruction of the socket.

- If close shuts down both the read and write buffers, data arriving from outside is cut off. <-- the problem

Receiving messages continuously, or a message arriving split into two packets, is a separate issue --> implement the read in a while loop, as sketched below.
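
A minimal sketch of such a read loop, assuming a connected socket sock and an output file descriptor out_fd (e.g., for receive.dat); the names are illustrative. It loops until EOF and distinguishes EOF (0) from an error (-1).

#include <unistd.h>

#define BUF_SIZE 1024

ssize_t recv_all(int sock, int out_fd)
{
    char buf[BUF_SIZE];
    ssize_t total = 0, len;

    while ((len = read(sock, buf, BUF_SIZE)) > 0) {   /* loop: data may arrive in pieces */
        write(out_fd, buf, len);                      /* e.g. append to receive.dat */
        total += len;
    }
    if (len == 0)        /* EOF: the peer closed (or half-closed) its output stream */
        return total;
    return -1;           /* error */
}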


cf) No distinction is made between ordinary file descriptors and socket file descriptors.







half-close

- Because of this problem with one-sided termination, there is a need to keep one of the streams open.

- Wanting to terminate means there is no more data left to send.

- However, the sender does not yet know whether the peer also wants to terminate, so it should not close its input stream.

- For this reason, half-close generally means closing only the output stream while keeping the input stream open.

- shutdown() is used for this.

- When all the data has been sent and the output stream is closed with shutdown(), EOF is sent at the end of the output, deliberately signaling that the output stream has been closed.

※ EOF does not necessarily mean the stream is terminated. If the sender calls neither close() nor shutdown(), no stream is closed.

- Since the peer has closed its output stream anyway, the receiving side calls close to close both its output and input streams.

- When the sending side receives EOF back from the peer, it half-closes its remaining input stream as well. * Calling close or shutdown(sd, SHUT_RDWR) is also fine.




shutdown(int sockfd, int howto)

Returns 0 on success, -1 on failure.




close tears down the connection entirely.

When close is called, EOF is sent to the peer.

EOF means that data transmission has finished.

shutdown can likewise send EOF and close only one of the streams.


To confirm that the EOF was received, the peer sends back a 'Thank you' message.
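
Here is a minimal sketch of the sending side of this exchange, assuming a connected socket sock and a file descriptor file_fd for the file to send (both names are illustrative): send everything, half-close the output stream with shutdown, then keep reading to receive the 'Thank you' message.

#include <sys/socket.h>
#include <unistd.h>
#include <stdio.h>

#define BUF_SIZE 1024

void send_file_half_close(int sock, int file_fd)
{
    char buf[BUF_SIZE];
    ssize_t len;

    while ((len = read(file_fd, buf, BUF_SIZE)) > 0)   /* keep sending BUF_SIZE chunks */
        write(sock, buf, len);

    shutdown(sock, SHUT_WR);        /* half-close: only the output stream; EOF goes to the peer */

    len = read(sock, buf, BUF_SIZE - 1);   /* input stream still open: wait for "Thank you" */
    if (len > 0) {
        buf[len] = 0;
        printf("Message from peer: %s\n", buf);
    }
    close(sock);                    /* now the remaining stream can be closed too */
}

If close(sock) were called instead of shutdown(sock, SHUT_WR), the input stream would be gone as well and the 'Thank you' read would fail, which is exactly the error the exercise below asks you to observe.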


Try this exercise with the server and the client in different directories.

If it succeeds, receive.dat is created in the client's directory.

- Check the 'Thank you' message.

- Change shutdown to close and observe the error.



Byte order and network byte order

How network byte order and internet address conversion are related.

Whether Big Endian or Little Endian is used differs from system to system.

Big Endian - stores the most significant byte at the lowest address.

ex. 0x12345678 is stored, starting from the lowest address, as 0x12 0x34 0x56 0x78. (The most significant byte is 0x12.)


Little Endian - stores the most significant byte at the highest address.

ex. 0x12345678 is stored as 0x78 0x56 0x34 0x12.


Differences in how data is stored can cause problems when sending and receiving data, so a convention is needed for network transmission.

Regardless of whether the sender and the receiver are Big Endian or Little Endian, both sides must agree that bytes are transmitted starting from the lowest address (big endian).

Network byte order is Big Endian! (Transmit starting from the lowest data address; i.e., treat the byte at the lowest address as the most significant byte.)




Byte order conversion

htons - converts a value in host memory, whether the host is big endian or little endian, into network byte order (big endian).


Even if no change would occur on your machine, apply the conversion anyway for portability!


In 0x1234, each hex digit is 4 bits, so 0x12 represents one byte.
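
A minimal sketch of the conversion functions (the sample values are illustrative): htons handles 2-byte values such as ports, htonl handles 4-byte values such as IPv4 addresses.

#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    unsigned short host_port = 0x1234;        /* 2-byte value, e.g. a port number */
    unsigned int   host_addr = 0x12345678;    /* 4-byte value, e.g. an IPv4 address */

    unsigned short net_port = htons(host_port);   /* host order -> network order (big endian) */
    unsigned int   net_addr = htonl(host_addr);

    printf("host order:    0x%x, 0x%x\n", (unsigned)host_port, host_addr);
    printf("network order: 0x%x, 0x%x\n", (unsigned)net_port, net_addr);
    return 0;
}

On a little endian machine the printed network-order values appear byte-swapped; on a big endian machine they are unchanged.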



Converting a string to an integer in network byte order


inet_addr(const char* string)

Converts a dotted string of the form "( ).( ).( ).( )" into a 4-byte integer (in_addr_t: a 32-bit integer type).

On success, it returns the 32-bit integer value converted to Big Endian.

On failure, it returns INADDR_NONE.

It also checks the representable range: attempting to convert a value that exceeds the available bits returns INADDR_NONE.


inet_aton(const char* string, struct in_addr* addr);

Functionally the same as inet_addr, except that the result of the conversion is stored in an in_addr structure.

(struct sockaddr_in contains a struct in_addr member named sin_addr, and struct in_addr contains an in_addr_t member named s_addr.)

Returns 1 (true) on success, 0 (false) on failure.


inet_addr is used more often, because many environments do not support inet_aton.


inet_ntoa

Converts a 32-bit IP address into its ASCII dotted form (x.x.x.x).


Note. The address is always stored in network byte order -> use htonl.


Pass (the sockaddr_in object).sin_addr as the argument to inet_ntoa.
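
A minimal sketch converting in both directions (the address string is illustrative):

#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    const char *str_addr = "203.252.1.1";

    in_addr_t conv = inet_addr(str_addr);     /* "x.x.x.x" -> big-endian 32-bit integer */
    if (conv == INADDR_NONE) {
        printf("invalid address\n");
        return 1;
    }

    struct in_addr addr;
    addr.s_addr = conv;
    printf("integer (network order): 0x%x\n", (unsigned)conv);
    printf("back to string: %s\n", inet_ntoa(addr));   /* 32-bit address -> "x.x.x.x" */
    return 0;
}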





Initializing an internet address


Why memset is used: the last 8-byte member of the sockaddr_in structure is not used, so it all needs to be filled with 0.

(sockaddr_in structure variable).sin_addr.s_addr = inet_addr(IP address as a string)   - where the string has the form "( ).( ).( ).( )"

Use atoi to convert a string (such as the port number argument) to an integer.

Q. Why is no byte-order conversion applied to s_addr?

Because inet_addr already returns a 32-bit integer in big endian format.


The value initialized this way is then handed to bind.


Why a server sets the address information (i.e., passes the initialized value to bind):

"Send all data arriving at this IP and PORT to me!"


Why a client sets the address information (i.e., passes the initialized value to connect):

"Connect to this IP and PORT."
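
A minimal sketch of the client side, assuming the IP and port arrive as strings (for example from argv); the helper name connect_to is illustrative.

#include <arpa/inet.h>
#include <string.h>
#include <stdlib.h>
#include <sys/socket.h>

int connect_to(const char *ip_str, const char *port_str)
{
    int sock = socket(PF_INET, SOCK_STREAM, 0);

    struct sockaddr_in serv_addr;
    memset(&serv_addr, 0, sizeof(serv_addr));        /* zero everything, including sin_zero[8] */
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = inet_addr(ip_str);   /* already big endian, no extra conversion */
    serv_addr.sin_port = htons(atoi(port_str));      /* port string -> int -> network order */

    /* "connect to this IP and PORT" */
    if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) == -1)
        return -1;
    return sock;
}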



INADDR_ANY

INADDR_ANY is what is used to assign the IP address of the machine the program is currently running on to a socket.

It is mainly used when implementing server programs.

It is assigned in the form htonl(INADDR_ANY).

Since it is a 4-byte unsigned int, only the byte order needs to be arranged before it is assigned.


It means using the address that is already determined (the host's own IP).

However, when a computer has two network interface cards, you must specify the address explicitly;

with INADDR_ANY in that case you cannot tell which one will be used.



After creating a socket, a server can assign an address to it with the bind function; for a client, the address is assigned automatically inside the connect function.

The goal here is to understand what a socket address looks like and how it is bound to a socket.

The IPv4 internet address scheme

An IPv4 address is split into a network part and a host part; in class A, the first 8 bits identify the network and the remaining 24 bits identify the hosts within that network.

Class C is the most commonly used.

With class C, the first 3 bytes identify, say, the Inha University network, and the hosts inside that network are distinguished by the remaining 1 byte.

The first bit of a class A address is always 0.
The first two bits of a class B address are always 10.
The first three bits of a class C address are always 110.
In class C, the last byte of the IP address identifies one PC (system), and the port number distinguishes the many applications running on that one system. A port does not identify one application, though; each port identifies one socket.

An IP address plus a port number identifies one socket.

The port number is 16 bits, so its values range from 0 to (2^16 - 1) = 65535.

Ports 0-1023 are well-known ports whose uses are already assigned.

In what data structure are the IP address and port number stored?

The structure sockaddr_in is used to represent IPv4-based addresses. The IP address and port number are stored in a variable of this type.

struct sockaddr_in {
    sa_family_t    sin_family;
    uint16_t       sin_port;
    struct in_addr sin_addr;
    char           sin_zero[8];
};

It contains:

- sin_family, which indicates the address family,

- sin_port (2 bytes), the port number,

- sin_addr (4 bytes), the 32-bit IP address,

- sin_zero[8] (8 bytes), filled with 0 (not used)

- four members in total.

Here sin_addr has the structure type struct in_addr, which in turn contains a member s_addr of type in_addr_t.

That is where the 32-bit IPv4 internet address is stored.

in_addr_t is defined as uint32_t, so why is a separate type name used?

So that the name itself conveys the meaning: a distinct name for each attribute gives clarity and leaves room for the system to be extended (i.e., readability and extensibility).

sin_family holds the address family (AF, as opposed to the protocol family PF used when creating the socket) - it distinguishes IPv4 from IPv6.

A sockaddr_in variable is passed as an argument to the bind function, but since the parameter type is sockaddr, it has to be cast.

It helps to think of it this way: sockaddr came first, and the sockaddr_in structure type was created to express its details in a more organized form.

The sockaddr type has two members. One (sa_family) plays the same role as sin_family in sockaddr_in. The other is sa_data[14]: the 14 bytes of sin_port, sin_addr, and sin_zero[8] in sockaddr_in together make up sa_data. In other words, sockaddr_in is the structure that partitions sockaddr's 14-byte sa_data in a way suited to IPv4.

Therefore, since we declared the variable as sockaddr_in, we cast it to sockaddr when passing it to bind.
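
For reference, a minimal sketch showing the shape of the generic sockaddr type (simplified from <sys/socket.h>) and the resulting cast in a bind call; serv_sock and port are illustrative names.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* The generic address type that bind expects looks essentially like this:
   struct sockaddr {
       sa_family_t sa_family;    // same role as sin_family
       char        sa_data[14];  // 2-byte port + 4-byte address + 8-byte padding
   };
*/

int bind_any(int serv_sock, unsigned short port)
{
    struct sockaddr_in serv_addr;
    memset(&serv_addr, 0, sizeof(serv_addr));        /* zero out sin_zero[8] as well */
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* this machine's IP, in network order */
    serv_addr.sin_port = htons(port);

    /* we filled in a sockaddr_in, but bind's parameter type is sockaddr, hence the cast */
    return bind(serv_sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
}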


Why do I need a database?
• The primary goal of a DBMS is to provide an environment that is both convenient
and efficient for people to use in retrieving and storing information.

 

What is a database?

• A database-management system (DBMS) consists of a collection of interrelated
data and a collection of programs to access that data. The data describe
one particular enterprise.

 

History of database / Purpose of database

Keeping organizational information in a file-processing system has a number of major disadvantages:

• Data redundancy and inconsistency.

• Difficulty in accessing data.

• Data isolation.

• Integrity problems.

• Atomicity problems.

• Concurrent-access anomalies.

• Security problems.

 

These difficulties, among others, prompted the development of database systems.
In what follows, we shall see the concepts and algorithms that enable
database systems to solve the problems with file-processing systems.

 

By automating the create, read, update, and delete operations on data through programming, databases also brought large efficiency gains in data management.

 

Reference: http://avant.org/project/history-of-databases/

 

 

 

 


 

Textbook: Database System Concepts / Authors: Silberschatz, Korth, Sudarshan / Publisher: McGraw-Hill

 

 

Review of this book:

Intended for a first course in databases at the junior or senior undergraduate, or first-year graduate, level, this book covers concepts and algorithms based on those used in commercial or experimental database systems. It aims to present these concepts and algorithms in a general setting that is not tied to one particular database system.

 

Other good textbooks on Database Systems

"Database Management Systems, Third Edition", by Ramakrishnan and Gehrke

"Database Systems: The Complete Book", by Garcia-Molina, Ullman, and Widom

Goals of database course

Our primary focus is the design of the database, while also considering aspects of software design (UML).

 

Features of the Book

In most of this book, we use a university organization as a running example of a typical
data-processing application.

 

Each chapter has a list of review terms, in addition to a summary, which can help
readers review key topics covered in the chapter.


The exercises are divided into two sets: practice exercises and exercises. The
solutions for the practice exercises are publicly available on the Web site of the
book. Students are encouraged to solve the practice exercises on their own, and
later use the solutions on the Web site to check their own solutions. Solutions
to the other exercises are available only to instructors (see “Instructor’s Note,”
below, for information on how to get the solutions).


Many chapters have a tools section at the end of the chapter that provides
information on software tools related to the topic of the chapter; some of these
tools can be used for laboratory exercises. SQL DDL and sample data for the
university database and other relations used in the exercises are available on the
Web site of the book, and can be used for laboratory exercises.

 

 

 


 

 

Steps in designing database

The database design process can be divided into six steps. The ER model is most relevant to the first three steps.

Our primary focus is the design of the database, however,
and we will not discuss other aspects of software design(UML) in any detail.

 

There are many approaches to end-to-end software system design, covering all
the steps from identifying the business requirements to the final specifications
for a complete application, including workflow, user interfaces, and many aspects
of software systems that go well beyond databases and the data stored in
them. In this section, we briefly discuss an approach that is becoming popular,
called the unified modeling language (UML) approach.

UML, like the ER model, has the attractive feature that its constructs can be
drawn as diagrams. It encompasses a broader spectrum of the software design
process than the ER model:

 

(Omitted)

 

 

 

The database design process can be divided into six steps.

The ER model is most relevant to the first three steps:

 

1. Requirements Analysis

The very first step in designing a database
application is to understand what data is to be stored in the database,
what applications must be built on top of it, and what operations are
most frequent and subject to performance requirements. In other words,
we must find out what the users want from the database. This is usually
an informal process that involves discussions with user groups, a study
of the current operating environment and how it is expected to change,
analysis of any available documentation on existing applications that are
expected to be replaced or complemented by the database, and so on.

Several methodologies have been proposed for organizing and presenting
the information gathered in this step, and some automated tools have been
developed to support this process.

Database Design Tools: Design tools are available from RDBMS vendors
as well as third-party vendors. For example, see the following link for
details on design and analysis tools from Sybase:
http://www.sybase.com/products/application_tools
The following provides details on Oracle's tools:
http://www.oracle.com/tools

 

2. Conceptual Design

The information gathered in the requirements
analysis step is used to develop a high-level description of the data
to be stored in the database, along with the constraints known to hold over
this data. This step is often carried out using the ER model and is discussed
in the rest of this chapter. The ER model is one of several high-level,
or semantic, data models used in database design. The goal is to create
a simple description of the data that closely matches how users and developers
think of the data (and the people and processes to be represented in
the data). This facilitates discussion among all the people involved in the
design process, even those who have no technical background. At the same
time, the initial design must be sufficiently precise to enable a straightforward
translation into a data model supported by a commercial database
system (which, in practice, means the relational model).

3. Logical Database Design

We must choose a DBMS to implement
our database design, and convert the conceptual database design into a
database schema in the data model of the chosen DBMS. We will consider
only relational DBMSs, and therefore, the task in the logical design step
is to convert an ER schema into a relational database schema. We discuss
this step in detail in Chapter 3; the result is a conceptual schema,
sometimes called the logical schema, in the relational data model.

4. Schema Refinement

The fourth step in database design is to analyze
the collection of relations in our relational database schema to identify potential
problems, and to refine it. In contrast to the requirements analysis
and conceptual design steps, which are essentially subjective, schema refinement
can be guided by some elegant and powerful theory. We discuss
the theory of normalizing relations (restructuring them to ensure some
desirable properties) in Chapter 19.

 

5. Physical Database Design

 In this step, we consider typical expected
workloads that our database must support and further refine the database
design to ensure that it meets desired performance criteria. This step may
simply involve building indexes on some tables and clustering some tables,
or it may involve a substantial redesign of parts of the database schema
obtained from the earlier design steps. We discuss physical design and
database tuning in Chapter 20.

 

 

 

 

6. Application and Security Design

Any software project that involves
a DBMS must consider aspects of the application that go beyond the
database itself. Design methodologies like UML (Section 2.7) try to address
the complete software design and development cycle. Briefly, we must
identify the entities (e.g., users, user groups, departments) and processes
involved in the application. We must describe the role of each entity in every
process that is reflected in some application task, as part of a complete
workflow for that task. For each role, we must identify the parts of the
database that must be accessible and the parts of the database that must
not be accessible, and we must take steps to ensure that these access rules
are enforced. A DBMS provides several mechanisms to assist in this step,
and we discuss this in Chapter 21.

There are many kinds of diagrams in UML. Use case diagrams describe the
actions performed by the system in response to user requests, and the people
involved in these actions. These diagrams specify the external functionality
that the system is expected to support.

 

In the implementation phase, we must code each task in an application language
(e.g., Java), using the DBMS to access data. We discuss application development in Chapters 6 and 7.

 

In general, our division of the design process into steps should be seen as a
classification of the kinds of steps involved in design. Realistically, although
we might begin with the six step process outlined here, a complete database
design will probably require a subsequent tuning phase in which all six kinds
of design steps are interleaved and repeated until the design is satisfactory.
 

(Database Management Systems - 3rd Edition, 2.Introduction to database design, p26)

 

 

Contents in Context of Six steps above

PART 1.

2. INTRODUCTION TO DATABASE DESIGN

ER model used to create an initial design

Relationship between UML and ER model

 

3. THE RELATIONAL MODEL

How is data represented in the relational model (the most widely used model in DBMSs, though not the only one)?

How can we create, modify, and query tables using SQL?

How do we obtain a relational database design from an ER diagram?

What is the foundation for relational query languages like SQL?

What is the difference between procedural and declarative languages? 

 

4. RELATIONAL ALGEBRA AND CALCULUS

What is relational algebra, and why is it important?

What subset of mathematical logic is used in relational calculus, and how is it used to write queries?

 

5. SQL: QUERIES, CONSTRAINTS, TRIGGERS

How are queries expressed in SQL? How is the meaning of a query specified in the SQL standard?
How does SQL build on and extend relational algebra and calculus?

 

Part 2.

6. DATABASE APPLICATION DEVELOPMENT

How do application programs connect to a DBMS?

How can applications manipulate data retrieved from a DBMS?
How can applications modify data in a DBMS?

What are JDBC and SQLJ, and how are they used?

What are stored procedures?

 

7. INTERNET APPLICATIONS

How do we name resources on the Internet?

How do Web browsers and webservers communicate?

How do we present documents on the Internet? How do we differentiate between formatting and content?
What is a three-tier application architecture? How do we write three-tiered applications?
Why do we have application servers?

 

PART 3.

8. OVERVIEW OF STORAGE AND INDEXING

How does a DBMS store and access persistent data?

How does a DBMS organize files of data records on disk to minimize I/O costs?

What is the relationship between a file of data records and any indexes on this file of records?

 

9. STORING DATA: DISKS AND FILES

What are the different kinds of memory in a computer system?

What are the physical characteristics of disks and tapes, and how do they affect the design of database systems?

How does a DBMS keep track of space on disks? How does a DBMS access and modify data on disks?

What is the significance of pages as a unit of storage and transfer?

How does a DBMS create and maintain files of records?

How are records arranged on pages, and how are pages organized within a file?

 

10. TREE-STRUCTURED INDEXING

11.HASH-BASED INDEXING

 

PART 4.

12. OVERVIEW OF QUERY EVALUATION

What descriptive information does a DBMS store in its catalog?

What alternatives are considered for retrieving rows from a table?

Why does a DBMS implement several algorithms for each algebra operation?

What factors affect the relative performance of different
algorithms?

What are query evaluation plans and how are they represented?

Why is it important to find a good evaluation plan for a query? How is this done in a relational DBMS?

 

13. EXTERNAL SORTING

Why is sorting important in a DBMS?

Why is sorting data on disk different from sorting in-memory data?

How does external merge-sort work?
How do techniques like blocked I/O and overlapped I/O affect the design of external sorting algorithms?
When can we use a B+ tree to retrieve records in sorted order?

 

14. EVALUATING RELATIONAL OPERATORS

What are the alternative algorithms for selection?

Which alternatives are best under different conditions? How are complex selection conditions handled?
How can we eliminate duplicates in projection? How do sorting and hashing approaches compare?
What are the alternative join evaluation algorithms? Which alternatives are best under different conditions?
How are the set operations (union, intersection, set-difference, cross-product) implemented?
How are aggregate operations and grouping handled?
How does the size of the buffer pool and the buffer replacement policy affect algorithms for evaluating relational operators?

 

15. A TYPICAL RELATIONAL QUERY OPTIMIZER

How are SQL queries translated into relational algebra?

As a consequence, what class of relational algebra queries does a query optimizer concentrate on?
What information is stored in the system catalog of a DBMS and how is it used in query optimization?
How does an optimizer estimate the cost of a query evaluation plan?
How does an optimizer generate alternative plans for a query?

What is the space of plans considered? What is the role of relational algebra equivalences in generating plans?
How are nested SQL queries optimized?

 

16. OVERVIEW OF TRANSACTION MANAGEMENT

What four properties of transactions does a DBMS guarantee?
Why does a DBMS interleave transactions?
What is the correctness criterion for interleaved execution?
What kinds of anomalies can interleaving transactions cause?
How does a DBMS use locks to ensure correct interleavings?
What is the impact of locking on performance?
What SQL commands allow programmers to select transaction characteristics and reduce locking overhead?
How does a DBMS guarantee transaction atomicity and recovery from system crashes?

17. CONCURRENCY CONTROL

How does Strict 2PL ensure serializability and recoverability?
How are locks implemented in a DBMS?
What are lock conversions and why are they important?
How does a DBMS resolve deadlocks?
How do current systems deal with the phantom problem?
Why are specialized locking techniques used on tree indexes?
How does multiple-granularity locking work?
What is Optimistic concurrency control?
What is Timestamp-Based concurrency control?
What is Multiversion concurrency control?

 

18. CRASH RECOVERY

What steps are taken in the ARIES method to recover from a DBMS crash?
How is the log maintained during normal operation?
How is the log used to recover from a crash?
What information in addition to the log is used during recovery?
What is a checkpoint and why is it used?
What happens if repeated crashes occur during recovery?
How is media failure handled?
How does the recovery algorithm interact with concurrency control?

 

PART 5.

19. SCHEMA REFINEMENT AND NORMAL FORMS

What problems are caused by redundantly storing information?
What are functional dependencies?
What are normal forms and what is their purpose?
What are the benefits of BCNF and 3NF?
What are the considerations in decomposing relations into appropriate normal forms?
Where does normalization fit in the process of database design?
Are more general dependencies useful in database design?

 

20. PHYSICAL DATABASE DESIGN AND TUNING

What is physical database design?
What is a query workload?
How do we choose indexes? What tools are available?
What is co-clustering and how is it used?
What are the choices in tuning a database?
How do we tune queries and views?
What is the impact of concurrency on performance?
How can we reduce lock contention and hotspots?
What are popular database benchmarks and how are they used?

 

21. SECURITY AND AUTHORIZATION

What are the main security considerations in designing a database application?

What mechanisms does a DBMS provide to control a user's access to data?
What is discretionary access control and how is it supported in SQL?
What are the weaknesses of discretionary access control?

How are these addressed in mandatory access control?
What are covert channels and how do they compromise mandatory access control?
What must the DBA do to ensure security?
What is the added security threat when a database is accessed remotely?
What is the role of encryption in ensuring secure access?

How is it used for certifying servers and creating digital signatures?

 

 

PART 6.

22. PARALLEL AND DISTRIBUTED DATABASES

23. OBJECT-DATABASE SYSTEMS

24. DEDUCTIVE DATABASES

25. DATA WAREHOUSING AND DECISION SUPPORT

26. DATA MINING

27. INFORMATION RETRIEVAL AND XML DATA

28. SPATIAL DATA MANAGEMENT

29. FURTHER READING

30. THE MINIBASE SOFTWARE

 

 

 

 

 

10 Things you absolutely need to do

 

There are 10 essentials:

 

1. Understand the database’s purpose.

Despite the popularity of Scrum, rapid application development (RAD), and numerous other rapid-development techniques, you still need to know why you’re creating the database. The first essential is that you discover and understand the database’s purpose before doing anything else. Saying that the database’s purpose is “to store records” isn’t good enough. You need to understand the business reason for this database’s existence before it goes into development, much less production. The business reason needs to map not only to the technical specifications but also to the business purpose and the business processes that this database will be supporting. For example, will it be a retail point of sale (POS) database, an HR database, or part of a customer relationship management (CRM) package that will help you track customers and manage the sales cycle? You need to know this and many more details about why you’re creating the database before you start the design process.

 

 

 

2. Get the right tool.

You can’t create a viable database without software tools, no more than you could build a house without construction tools. So, you’re going to need data modeling software. Data modeling software, also called a CASE tool, is the data modeler’s and data designer’s best friend. A good CASE tool can help you quickly create different types of data models—such as functional decompositions, system and context diagrams, business process and data flow diagrams, and entity relationship models—each of which will help you communicate to your colleagues the visual components of the proposed system you’re building. Some CASE tools have features that enable sharing, coordination, merging, and version control for design teams. Last, but certainly not least, these tools effectively document what you’re doing and, by implication, why you’re doing it.

 

The cost of entry into the CASE tool market is not insignificant. However, the initial investment will be paid back in terms of shortened time-to-market for database projects and increased knowledge of corporate data and processes. You can read about six different CASE tools in "Comparative Review: Sizing Up Data Modeling Software". The comparison is based on capability and price, with some indication of how long it might take you to learn the package.

There is one thing to remember, though: No modeling tool will ever replace a person who understands how the business works. The CASE tool will only assist in creating visual representations of the business processes and structure.

 

 

 

3. Gather the requirements for the database.

Once you understand the overarching reason why you’re doing the database project and you’ve selected a tool that will help you visualize the as-is and to-be environments, you need to do a deep dive into requirements gathering. You need to understand the requirements well enough to be able to create visual representations of data, processes, and systems.

When gathering the requirements, don’t limit yourself to disk drives, virtual machines (VMs), and other technical requirements. Although you need to know the technical requirements, they aren’t pivotal to good database design. The database that you’re creating has to support business operations, so you need to interview company staff members and managers to understand how the business runs. Only then can you get a handle on the requirements that you’ll need to meet in order to create a viable database.

When you’re gathering requirements, you’re going to encounter conflicting business needs—count on it. You’ll have to wear your diplomat hat to get the parties involved to agree on some sort of compromise so that the project can move forward.

 

4. Be conscientious when modeling the data.

My favorite part of database design is modeling the data—that is, creating structures that will hold distinctly different data sets (entities or tables) and representing the relationships between pairs of these data sets. When you’re modeling the data, you’ll have a chance to reaffirm or correct what you found when gathering the requirements.

There’s not enough space here to do a deep dive into the details of how to model data, so I’ll point out the highlights and some constructs that you shouldn’t avoid simply because they seem a little complicated.

Point 1. Based on the answer to why you’re building the database, you need to use either transactional modeling techniques (for an OLTP database) or dimensional modeling techniques (for a relational data warehouse). In the SQL Server Magazine archive, you can find a wealth of information about how to do both types of data modeling. The Learning Path box lists some of those articles.

Point 2. You need to perform data modeling at multiple levels. You should create the following three models:

  • Conceptual data model (CDM). The CDM contains the semantics of the information domain that you’re modeling. (Taken from the Greek word semantiká, semantics refers to the study of meaning.) The CDM defines the scope of the database. Entity classes—the distinct data sets I mentioned earlier—represent things of interest to the organization, and the relationships between the entity pairs assert business rules and regulations. The CDM expresses both the as-is and to-be states and is often created during the Discovery stage.
  • Logical data model (LDM). The LDM describes the structure of the information domain. It includes criteria such as attribute set details, data types, key status, and individual attribute nullability.
  • Physical data model (PDM). The PDM defines how the data will be physically stored. You need to map the LDM to a specific database management system (DBMS) platform.

Point 3. This point is relevant for transactional data models only: You need to normalize the data and model to third normal form (3NF). Failure to model to 3NF in the early stages of design restricts your understanding of how data items interact with each other and most likely will inhibit a full understanding of the business requirements that the database must support. Once you move to later levels of design (e.g., the PDM), you can denormalize back to second normal form (2NF) if you must. Denormalizing a transactional data model isn’t recommended, however. It results in data duplication, which causes insert, update, and deletion anomalies and data synchronization problems. But if you must denormalize (for production efficiencies, perhaps), you’ll at least know exactly where the data duplication will occur and where to expect data synchronization problems. For more information about 2NF, 3NF, and data normalization, see "SQL By Design: Why You Need Database Normalization".

Point 4. Every transactional database contains four integrities: primary key integrity, referential integrity, domain integrity, and business rules integrity. Whether these integrities are maintained will determine if your data integrity will last beyond the first data load into the database. It will also determine whether the database can support the business rules and requirements that you gathered. For more information about the four integrities, see the Learning Path box.

Point 5. Never use a printed report as a template for a table. This is a common mistake that database design novices often make. Take, for example, an invoice. An invoice contains three distinct data sets:

  • Sales data. This data set includes information such as the date, time, and location of the sale, and the total invoice amount (with or without discounts, taxes, and shipping and handling).
  • Items sold. This data set includes information such as quantity, per-item cost, total cost of items, and references to items’ descriptions.
  • Product in inventory. This data set includes a complete description of each product and other information that’s necessary to maintaining an inventory.

Yet, in production databases, I’ve seen tables named Invoice that mix components of all three data sets. This is detrimental to data integrity and confusing for the office staff. A printed report is a business requirement and one of the outcomes of your database design. You can create a view, indexed or otherwise, to model a business report, but a business report shouldn’t be used as a template for a database table.

Point 6. Don’t be afraid to include supertype and subtype entities in your design in the CDM and onward. The subtypes represent classifications or categories of the supertype, such as employees who are staff members and employees who are authors. They’re both employees, and they’re both entitled to the employee benefits package. Yet some staff members are authors and some aren’t; authors have additional relationships and attributes that the staff members don’t have. Entities are represented as subtypes when it takes more than a single word or phrase to categorize the entity.

If a category has a life of its own, with separate attributes that describe how the category looks and behaves and separate relationships with other entities, then it’s time to invoke the supertype/subtype structure. Failure to do so will inhibit a complete understanding of the data and the business rules that drive data collection.

When it comes to implementing the supertype/subtype structure, the way you’ve modeled the business requirements and the relationships will determine whether you create one or many tables in the PDM. In this example, if you create an Employee table and an Author table, you could be duplicating data for the individual who is both a staff member and an author. This duplication would lead to nonkey data redundancy within the database, which inevitably causes insert, update, and deletion anomalies, unsynchronized data, and loss of data integrity. For more information about how to properly use supertypes and subtypes, see "Implementing Supertypes and subtypes."

Point 7. You must look out for multivalued attributes (MVAs), both expressed and implied. In its latest rendering, Microsoft Access has introduced the MVA structure. In doing so, it has veered away from the principles of relational data modeling and has truly become a nonrelational database. I recently reviewed a real estate system for a business that wanted to upgrade from Access to SQL Server. The database schema was riddled with MVAs. For instance, the building entity had an array of attributes named Picture1, Picture2, Deed of Record, Title, and so on—all within the same table. The owner wasn’t too surprised when he was told that SQL Server wouldn’t support the MVAs because his programmers had been having problems trying to access the attributes that were stored as MVA structures.

 

 

An MVA contains a collection of values for an attribute, as shown by the list of disks (e.g., Disk1Capacity, Disk2Capacity) in the SERVER table in Figure 1. This figure shows the schema for a server room inventory database. The solution to resolving this MVA is to flatten the structure so that each disk has its own row in a newly created table named DISK_DEVICES, as Figure 2 shows. The only constraint on the number of disks that the server can have is the number of bays in the server.


 

MVAs make retrieving summary data extremely difficult. They also cause problems when inserting, updating, and deleting data in a database. You can read more about MVAs in "Multivalued Attributes."

 

 

 

5. Enforce relationships.

The whole idea of a relational database is to identify and enforce relationships between entity pairs so that data integrity, as defined in the business rules and regulations and represented in the data models, is maintained. If you have the rule "each order must be associated with a current customer," you don’t want someone to write a program that lets a person place an order but not get a customer number. No application-level code can circumvent the rule to maintain data integrity.

The three types of relationships—one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N)—are easily transitioned from the data model to the implemented database. For information about how to do this, see "Logical Modeling."

 

 

 

6. Use the appropriate data types.

There are a myriad of reasons why you want to use the proper data type for an attribute. Reduced disk storage space, data accuracy, the ability to join or union two tables together, attribute synchronization with the application layer, and internationalization capability are just a few of the reasons why you want to make sure that you’re defining each attribute’s data type correctly.

Technically, using the correct data type for an attribute speaks to domain integrity, which is one of the four database integrities I mentioned previously. When enforced, domain integrity ensures that an attribute contains only the set of valid values that are relevant and meaningful to that attribute. Many data modeling software packages let you create data domains and use them throughout the model. For example, you can create a Zip+4 domain that’s defined as char(5)+'-'+char(4). As a result of using the Zip+4 domain, every zip code instance in the database is defined with the same data type, instead of some instances defined as char(), some as varchar(), and some as int. Adding a NOT NULL characteristic to the Zip+4 domain further enforces the domain integrity rule that states each occurrence of a zip code in the database must contain a value.

Implementing domain integrity in SQL Server is a manual effort if you don’t have a data modeling package that lets you create and enforce data domains. The tools at your disposal include

  • Defining appropriate data types and lengths, nullability, defaults, and check constraints when creating tables
  • Using user-defined types, user-defined functions, and stored procedures that evaluate data when it’s being entered or used
  • Using Declarative Referential Integrity (DRI) or triggers to enforce relationships between pairs of tables
  • Using lookup tables to restrict data values when they’re inputted or edited

For additional ideas on how to enforce domain integrity, see "SQL By Design: The Four Integrities."

7. Include indexes when modeling.

When you’re building the PDM, you should include indexes. This is a bit of a chicken-and-egg situation because until you know how the data is going to be used, you can’t predict with certainty which columns will need indexing. But you’ve already gathered the requirements, so you should be able to make an intelligent guess as to which data should be indexed and how to index it. SQL Server automatically indexes (and clusters, by default) primary key constraints, but it doesn’t automatically index foreign key constraints. Therefore, the first set of indexes you want to create are those for the foreign keys in child tables. The next set would be for those columns on which you’re going to sort. In other words, you want to create indexes for columns that would be in ORDER BY clauses. From the gathered requirements, you should be able to determine which columns will be used for filtering (i.e., in the WHERE clause), so based on your understanding of how many rows are in each table, you can make an intelligent decision regarding which columns to index. For additional information about indexing, see “Indexing Dos and Don'ts”.

 

8. Standardize the naming convention.

If you don’t already have one, adopt a naming convention for all database objects, including tables, columns, indexes, keys, named constraints and relationships, and code. Not having a naming convention can lead to errors in data retrieval and can even inhibit productivity. A database schema is rarely documented and evolves over time; an established naming convention for the objects helps modulate schema confusion. Here are a few rules for creating useful database object names:

  • Use names that are meaningful to the entire organization and clearly identify the database object.
  • Don’t get too wordy. Use the minimum number of characters necessary to identify the database object.
  • Don’t use names that implicitly or explicitly refer to more than one topic (for tables) or characteristic (for columns).
  • Use the singular form of a name wherever possible, especially for entities and tables. This helps to correctly distinguish the entity-to-entity relationship (i.e., 1:1, 1:M, or M:N).
  • Don’t use spaces in database object names (e.g., Employee ID). Other database management systems might not support the use of spaces in object names, thereby limiting transportability and cross-database integration.

When you choose a naming convention, remember that no one convention is perfect, but almost any naming convention is better than none at all. For additional information about naming conventions, see “Seven Deadly Sins”.

 

 

9. Store the code that touches the data in SQL Server.

Repeat after me, “I will not embed dynamic SQL code in my applications.” Doing so puts a database at risk for inconsistent and possibly inaccurate application of business rules and regulations. It also puts a database at risk for a SQL injection attack. In case you haven’t heard the term SQL injection before, it’s a security vulnerability that occurs whenever one programming or scripting language (in this case, T-SQL) is embedded inside another. A hacker can literally hijack the embedded T-SQL code, modify it on the fly, and gain access to the data in the database. The technique to do this is widely known and publicly available. For additional information about SQL injection, see the web-exclusive article "SQL Injection Attacks on the Rise."

The best way to avoid inconsistent rule application and SQL injection is to store code that touches a database’s data as a stored procedure or CLR object. When you need to add or modify data, you then call the appropriate stored procedure; the operation will execute the same way every time it’s called. For an introduction to the CLR, see the web-exclusive article "What's the CLR and Should I care?"

 

 

10. Document the work.

Documenting the work you do is probably your least favorite activity. But if you're following through with the essentials that I've discussed, you're already documenting. Now, that wasn't so bad, was it?

 

 

Reap the Benefits

If you did a good job discovering the database’s purpose and its requirements, and if you were conscientious while you worked through the myriad steps of the modeling process, then your database design will likely stand up to everything that the organization can throw at it. The database will support all the business requirements and comply with all the integrity, security, and privacy rules and regulations that govern your industry. It’ll perform well and be easy to use and maintain. And as the organization evolves into different lines of business, you’ll be able to easily extend the database’s functionality and easily scale it up and out to store and serve up many times more data than it was originally designed for.

 

 

(Copied from: http://www.itprotoday.com/software-development/database-design-essentials)
