[php] Character set conversion
로빈아빠
1. Where does character set information come from?
Character sets are assigned by IANA (Internet Assigned Numbers Authority).
The latest character set registry is available at the following address.
http://www.iana.org/assignments/character-sets
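2. Why convert between character sets?
When you support multiple languages, you sometimes have to take the character set into account.
Suppose a Chinese system sends some text to a Korean system.
Chinese systems commonly use the big5 encoding (for Traditional Chinese), and Korean systems use ksc_5601.
If text encoded in big5 (Chinese characters, presumably) is displayed as if it were ksc_5601,
garbage appears on the screen.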
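The garbling described above can be reproduced in a few lines. This is only a minimal sketch in PHP, assuming the iconv extension is available; the sample character "中" and the variable names are illustrative and do not come from the original post.

```php
<?php
// "中" in big5 is the two bytes 0xA4 0xA4 (sample chosen for illustration).
$big5 = "\xA4\xA4";

// Wrong: interpret the big5 bytes as if they were EUC-KR (ksc_5601).
$misread = iconv("EUC-KR", "UTF-8", $big5);

// Right: tell iconv the real source encoding.
$correct = iconv("BIG5", "UTF-8", $big5);

// Print both results as UTF-8 hex so the difference is visible on any terminal.
echo "misread: ", bin2hex((string) $misread), "\n";
echo "correct: ", bin2hex((string) $correct), "\n";
```

The same two bytes decode to different characters depending on which encoding the reader assumes, which is why the sender's encoding has to be known before anything is converted.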
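3. How do you convert?
Every encoding has its own rules. To convert from utf-8 to unicode (ucs-2),
you would have to compare the two encodings and write the conversion code yourself.
Doing that by hand for every pair of encodings is not really practical.
On unix systems, the convenient way to do it is
a tool called iconv.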
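To make the utf-8 to ucs-2 case concrete, here is a small sketch in PHP that converts one character by hand and then with iconv. It assumes the iconv extension is available and that the encoding name UCS-2BE (big-endian ucs-2) is accepted by the local iconv implementation; the sample character and variable names are illustrative.

```php
<?php
// "한" (U+D55C) is the three UTF-8 bytes ED 95 9C.
$utf8 = "\xED\x95\x9C";

// By hand: a 3-byte UTF-8 sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx,
// so the payload bits are masked out and shifted back together.
$b = array_values(unpack('C3', $utf8));
$codePoint = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
$manual = pack('n', $codePoint); // store as one big-endian 16-bit unit

// With iconv: name the two encodings and let the library do the work.
$viaIconv = iconv('UTF-8', 'UCS-2BE', $utf8);

echo bin2hex($manual), ' ', bin2hex((string) $viaIconv), "\n";
```

The manual branch handles only 3-byte sequences; covering every sequence length and every encoding pair this way is exactly the work iconv saves.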
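4. How do you use iconv?
- Command
Usage: iconv -f source-encoding -t target-encoding [file...]
You can run the iconv command directly from the console to convert text.
- Using the iconv library
When you are writing a program, use the library.
libiconv.a
* You need to pass -liconv when compiling.
* It provides functions such as iconv_open, iconv, and iconv_close.
See the iconv man page for details.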
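5. Where can you find which character sets iconv can convert?
It differs a little from system to system.
- compaq /usr/lib/nls/loc/iconv In the file names, the part to the left of _ is the from encoding and the part to the right is the target
- sun /usr/lib/iconv In the file names, the part to the left of % is the from encoding and the part to the right is the target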
* iconv man page
iconv(3)                                                            iconv(3)

NAME
  iconv - Converts a string of characters from one codeset to another codeset

LIBRARY
  The iconv library (libiconv)

SYNOPSIS
  The following syntax is for pre-XSH5.0-compliant interfaces on Tru64 UNIX V4 and V5 systems:

  #include <iconv.h>
  size_t iconv(
          iconv_t cd,
          const char **inbuf,
          size_t *inbytesleft,
          char **outbuf,
          size_t *outbytesleft);

  The following syntax is for pre-V4 Tru64 UNIX systems and XSH5.0-compliant interfaces on V5 systems:

  #include <iconv.h>
  size_t iconv(
          iconv_t cd,
          char **inbuf,
          size_t *inbytesleft,
          char **outbuf,
          size_t *outbytesleft);

STANDARDS
  Interfaces documented on this reference page conform to industry standards as follows:

  iconv(): XSH4.0, XSH4.2, XSH5.0

  Refer to the standards(5) reference page for more information about industry standards and associated tags.

PARAMETERS
  cd
      Specifies the conversion descriptor that points to the correct codeset converter
  inbuf
      Points to a variable that points to the beginning of a buffer that contains the characters to be converted
  inbytesleft
      Points to an integer that contains the number of bytes in inbuf still to be converted
  outbuf
      Points to a variable that points to the buffer that contains the characters that have been converted
  outbytesleft
      Points to an integer that contains the number of free bytes in outbuf

DESCRIPTION
  The iconv() function converts a string of characters in inbuf into a different codeset and returns the results in outbuf. The required converter is identified by cd, which must be a valid descriptor returned by a previous successful call to the iconv_open() function.

  On calling, the inbytesleft parameter indicates the number of bytes in inbuf to be converted and outbytesleft indicates the number of available bytes in outbuf.

  For codesets that include shift-state sequences, a call to iconv() in which inbuf is or points to a null pointer places the cd conversion descriptor into its initial shift state. In this case, as long as outbuf is not (or does not point to) a null pointer, the call places in outbuf the byte sequence that changes the output buffer to its initial shift state. If the output buffer is not large enough to hold the entire reset sequence, the call fails and sets errno to [E2BIG]. Any subsequent calls in which inbuf is not (or does not point to) a null pointer cause the conversion to take place from the current state of the conversion descriptor. See the RESTRICTIONS section for information about support of shift-state encoding.

  If a sequence of input bytes does not form a character that is valid in the input codeset, conversion stops after the previous successfully converted character. If the input buffer ends with an incomplete character or shift sequence, conversion stops after the last byte sequence that was successfully converted to a character. If the output is not large enough to hold the entire sequence of converted characters, conversion stops just prior to the input byte sequence that would cause the output buffer to overflow.

  On return from the call:

  + The inbuf value is updated to point to the byte following the last byte used successfully in conversion
  + The inbytesleft value is decremented to reflect the number of input bytes still not converted
  + The outbuf value is updated to point to the byte following the last output byte of successfully converted data
  + The outbytesleft value is decremented to reflect the number of bytes still available in the output buffer.
  + For codesets that include shift-state encoding, the conversion descriptor is updated to reflect the shift state in effect at the end of the last successfully converted byte sequence.

  It is possible for input data to include a character that is valid in the input codeset but for which an identical character does not exist in the output codeset. The output character for such cases is defined by the converter that iconv() applies when converting from one particular codeset to another. In other words, the output character in this case can vary from one codeset converter to another.

RESTRICTIONS
  Currently, the operating system does not include locales whose codesets use shift-state encoding. Some sections of this reference page refer to iconv() behavior with respect to conversion of shift sequences. This information is included only for your convenience in developing portable applications that run on multiple platforms, some of which may supply locales whose codesets do use shift-state encoding.

RETURN VALUES
  The iconv() function updates the variables pointed to by the call arguments to reflect the extent of the conversion and returns the number of non-identical conversions performed. If the function is successful and converts the entire input string, the value pointed to by inbytesleft will be 0 (zero). If an error occurs, the function returns (size_t)-1 and sets errno to indicate the condition.

ERRORS
  If any of the following conditions occur, the iconv() function sets errno to the corresponding value:

  [E2BIG]
      The outbuf buffer is too small to contain all the converted characters. The character that causes the overflow is not converted and inbytesleft indicates the bytes left to be converted, including the character that caused the overflow. The inbuf parameter points to the first byte of the characters left to convert.
  [EBADF]
      The cd parameter does not specify a valid converter descriptor.
  [EILSEQ]
      An input character does not belong to the input codeset. No conversion is attempted on the invalid character and inbytesleft indicates the bytes left to be converted, including the first byte of the invalid character. The inbuf parameter points to the first byte of the invalid character sequence. The values of outbuf and outbytesleft are updated according to the number of characters that were previously converted.
  [EINVAL]
      The last character or shift sequence in the inbuf parameter was not complete. The inbytesleft parameter indicates the number of input bytes still not converted.

RELATED INFORMATION
  Functions: iconv_close(3), iconv_open(3)
  Commands: iconv(1), genxlt(1)
  Others: iconv_intro(5), standards(5)
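* Converting in PHP
If your server supports the iconv function, just use @iconv (the leading @ suppresses PHP error messages). Note that iconv is not a core PHP function; it comes from the iconv extension, so it may not be available on every server.
If it is not supported, use the console command introduced above.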
Using the @iconv function
euc-kr -> UTF-8
$str = @iconv("UHC","UTF-8",$str);
UTF-8 -> euc-kr
$str = @iconv("UTF-8","UHC",$str);
big5 -> UTF-8
$str = @iconv("big5","UTF-8",$str);
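Because of the leading @, a failed conversion produces no visible message; iconv() just returns false and that value ends up in $str. The following is a hedged sketch of a small wrapper that makes the failure explicit. The function name convert_or_null is hypothetical and not part of the original post, and the sketch assumes PHP 7.1 or later and an iconv implementation that knows the name UHC.

```php
<?php
// Hypothetical helper, not part of the original post.
// Returns the converted string, or null when iconv() reports a failure.
function convert_or_null(string $str, string $from, string $to): ?string
{
    // "@" hides the message iconv() raises on bad input; the return value
    // is still false in that case, so it is checked explicitly.
    $out = @iconv($from, $to, $str);
    return $out === false ? null : $out;
}

$utf8 = "\xED\x95\x9C\xEA\xB8\x80"; // "한글" in UTF-8
$uhc  = convert_or_null($utf8, "UTF-8", "UHC");

if ($uhc === null) {
    echo "conversion failed\n";
} else {
    echo bin2hex($uhc), "\n"; // hex dump of the converted bytes
}
```

Checking for null before using the result avoids silently replacing good text with an empty value when one character cannot be converted.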
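Using the console iconv command
function euckr2utf8($str)
{
    // exec() returns only the last line of output, so newlines are hidden as a literal \n first.
    // str_replace() is used because ereg_replace() was removed in PHP 7.
    $str = str_replace("\n", "\\n", $str);
    // Escape single quotes so the text cannot break out of the shell quoting.
    $arg = "'" . str_replace("'", "'\\''", $str) . "'";
    // printf is used instead of echo because some shells expand \n inside echo.
    $str = exec("printf '%s' " . $arg . ' | iconv -c -f uhc -t utf-8');
    return str_replace("\\n", "\n", $str);
}
function utf82euckr($str)
{
    $str = str_replace("\n", "\\n", $str);
    $arg = "'" . str_replace("'", "'\\''", $str) . "'";
    $str = exec("printf '%s' " . $arg . ' | iconv -c -f utf-8 -t uhc');
    return str_replace("\\n", "\n", $str);
}
euc-kr -> UTF-8
$str = euckr2utf8($str);
UTF-8 -> euc-kr
$str = utf82euckr($str);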
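The console approach can also be written without putting the text on the command line at all. The sketch below pipes the string to iconv through stdin with proc_open(), which sidesteps both shell quoting and the newline workaround. The function name iconv_via_console is hypothetical, and the sketch assumes an iconv command is on the PATH and that proc_open() is not disabled on the host.

```php
<?php
// Hypothetical helper, not part of the original post.
// Sends $str to the iconv command through stdin and reads the result from
// stdout, so the text itself never needs shell quoting or newline escaping.
function iconv_via_console(string $str, string $from, string $to): ?string
{
    $cmd = 'iconv -c -f ' . escapeshellarg($from) . ' -t ' . escapeshellarg($to);
    $spec = [
        0 => ['pipe', 'r'], // the command's stdin
        1 => ['pipe', 'w'], // the command's stdout
        2 => ['pipe', 'w'], // the command's stderr (read and discarded)
    ];

    $proc = proc_open($cmd, $spec, $pipes);
    if ($proc === false) {
        return null; // the process could not be started
    }

    // Fine for short strings; very large inputs would need interleaved
    // reads and writes to avoid filling the pipe buffers.
    fwrite($pipes[0], $str);
    fclose($pipes[0]);

    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    stream_get_contents($pipes[2]);
    fclose($pipes[2]);

    // 127 is the shell's "command not found" status. Other non-zero codes
    // are not treated as fatal here, since -c asks iconv to drop
    // unconvertible characters and still produce output.
    $status = proc_close($proc);
    if ($out === false || $status === 127) {
        return null;
    }
    return $out;
}

$euckr = "\xC7\xD1\n\xB1\xDB"; // "한", newline, "글" in EUC-KR
$utf8  = iconv_via_console($euckr, 'uhc', 'utf-8');
echo $utf8 === null ? "console iconv not usable\n" : bin2hex($utf8) . "\n";
```

Passing the data over a pipe keeps multi-line input intact, at the cost of a little more code than a single exec() call.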
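* Function to check whether a string is UTF-8
function isutf8($str)
{
    $i = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        // Look at the lead byte and skip the number of bytes it announces.
        $sbit = ord(substr($str, $i, 1));
        if ($sbit < 128) {
            // single-byte ASCII, nothing to skip
        } else if ($sbit > 191 && $sbit < 224) {
            $i++;      // 2-byte sequence
        } else if ($sbit > 223 && $sbit < 240) {
            $i += 2;   // 3-byte sequence
        } else if ($sbit > 239 && $sbit < 248) {
            $i += 3;   // 4-byte sequence
        } else {
            return 0;  // not a valid lead byte
        }
    }
    return 1;
}
ps. The function above was written by KEBIL.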
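One caveat about isutf8(): it inspects only the lead byte of each sequence and skips the following bytes without checking them, so some non-UTF-8 input passes. For example, the EUC-KR bytes C7 D1 ("한") look like a two-byte lead followed by one skipped byte. A stricter check, sketched below, relies on the u modifier of PCRE, which makes preg_match() fail on malformed UTF-8. The function name is_valid_utf8 is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper, not part of the original post.
// With the "u" modifier, PCRE validates the subject as UTF-8 before matching,
// and preg_match() returns false if the subject is malformed.
function is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("\xED\x95\x9C")); // "한" in UTF-8
var_dump(is_valid_utf8("\xC7\xD1"));     // "한" in EUC-KR
```

The second input is the byte pair that the lead-byte check accepts; the stricter check rejects it because D1 is not a valid continuation byte.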
* Source for checking whether your hosting account supports the function method and the console method
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Conversion method check</title>
</head>
<body>
1. Check whether the iconv() function exists<br />
Result:
<?php
// The function name has to be passed as a string.
$test2 = function_exists('iconv');
if ($test2 == 1) echo "available"; else echo "not available";
?><br /><br />
2. Console iconv test<br />
Result:
<?php
// ASCII digits are the same in euc-kr and utf-8, so "112" should come back unchanged.
$test = exec('echo "112" |iconv -f euc-kr -t utf-8');
if ($test == "112") echo "usable"; else echo "not usable";
?><br /><br />
<?php
if ($test == 112 && $test2 == 1) {
    echo "You can use both method 1 and method 2.<br />Method 1 is the easiest.<br />However, if the text contains characters that cannot be converted (characters or symbols that exist in UTF-8 but not in euc-kr), the conversion may return no result. In that case, use method 2.";
} else if ($test != 112 && $test2 == 0) {
    echo "Neither method is available. Use method 3.";
} else if ($test != 112 && $test2 == 1) {
    echo "You can use method 1.";
} else if ($test == 112 && $test2 == 0) {
    echo "You can use method 2.";
}
?>
</body>
</html>