【C】Looking into how casts work in C

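I did not really understand casts, so I looked into them.

There were three things I did not understand well:
・Casts between signed and unsigned types
・Casts between types of different sizes
・What machine code a cast generates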

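Only integer types are covered here; floating-point types are not.
Details may differ between compilers, so read this to get a rough idea of the behavior.
For reference, the OS is an Ubuntu distribution and the compiler driver is GCC. The listings below are 32-bit x86 code in AT&T syntax.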

【Casts between signed and unsigned types】

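To give the result first: in this environment, a cast between a signed and an unsigned type just seems to copy the bytes.
(Only casts between types of the same size are considered in this section.)

For example, run the following code: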

#include <stdio.h>

// With the sign bit at 0 the difference is hard to see, so test values whose sign bit is 1

int main(void){
    char c = 0x80;
    unsigned char uc = c;
    short s = 0x8000;
    unsigned short us = s;
    int i = 0x80000000;
    unsigned int ui = i;

    // First, look at signed values converted to unsigned
    printf("uc = %x, us = %x, ui = %x\n", uc, us, ui);

    // Next, look at unsigned values converted back to signed
    c = uc, s = us, i = ui;
    printf("c = %x, s = %x, i = %x\n", c, s, i);

    return 0;
}

The output is as follows.

uc = 80, us = 8000, ui = 80000000
c = ffffff80, s = ffff8000, i = 80000000

The first line of the output shows that uc, us, and ui receive the values of c, s, and i unchanged.
In other words, signed → unsigned seems to copy the bit pattern as is.

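The second line, however, gives the impression that the unsigned → signed cast changed the values.
What actually happens is that a char or short argument passed to a variadic function such as printf is first promoted to int (4 bytes here), and for a signed variable the missing upper bits are filled with the sign bit. %x then prints that 4-byte value.
For example, "1000 0000" is widened to "1111 1111 1111 1111 1111 1111 1000 0000".
That is why the output looks like this, while the value actually stored in c is still "1000 0000".
So the unsigned → signed cast is also just a copy of the bit pattern.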

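Let's look at the assembly for this program as well.
This is the disassembly obtained with "objdump -S <executable name>".
Where a value is widened to 32 bits to be passed to printf, signed variables are loaded with a movs instruction (sign-extending mov) and unsigned variables with a movz instruction (zero-extending mov).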

 # (there is more code before this point)

 # char c = 0x80;
 53a:   c6 45 ea 80             movb   $0x80,-0x16(%ebp)

 # unsigned char uc = c;
 53e:   0f b6 45 ea             movzbl -0x16(%ebp),%eax
 542:   88 45 eb                mov    %al,-0x15(%ebp)

 # short s = 0x8000;
 545:   66 c7 45 ec 00 80       movw   $0x8000,-0x14(%ebp)

 # unsigned short us = s;
 54b:   0f b7 45 ec             movzwl -0x14(%ebp),%eax
 54f:   66 89 45 ee             mov    %ax,-0x12(%ebp)

 # int i = 0x80000000;
 553:   c7 45 f0 00 00 00 80    movl   $0x80000000,-0x10(%ebp)

 # unsigned int ui = i;
 55a:   8b 45 f0                mov    -0x10(%ebp),%eax
 55d:   89 45 f4                mov    %eax,-0xc(%ebp)

 # push ui → push us → push uc
 560:   0f b7 55 ee             movzwl -0x12(%ebp),%edx
 564:   0f b6 45 eb             movzbl -0x15(%ebp),%eax
 568:   ff 75 f4                pushl  -0xc(%ebp)
 56b:   52                      push   %edx
 56c:   50                      push   %eax

 # push addr_of("uc = %x, us = %x, ui = %x\n")
 56d:   8d 83 68 e6 ff ff       lea    -0x1998(%ebx),%eax
 573:   50                      push   %eax

 # call printf
 574:   e8 37 fe ff ff          call   3b0 <printf@plt>
 579:   83 c4 10                add    $0x10,%esp

 # c = uc
 57c:   0f b6 45 eb             movzbl -0x15(%ebp),%eax
 580:   88 45 ea                mov    %al,-0x16(%ebp)

 # s = us
 583:   0f b7 45 ee             movzwl -0x12(%ebp),%eax
 587:   66 89 45 ec             mov    %ax,-0x14(%ebp)

 # i = ui
 58b:   8b 45 f4                mov    -0xc(%ebp),%eax
 58e:   89 45 f0                mov    %eax,-0x10(%ebp)

 # push i → push s → push c
 591:   0f bf 55 ec             movswl -0x14(%ebp),%edx
 595:   0f be 45 ea             movsbl -0x16(%ebp),%eax
 599:   ff 75 f0                pushl  -0x10(%ebp)
 59c:   52                      push   %edx
 59d:   50                      push   %eax

 # push addr_of("c = %x, s = %x, i = %x\n")
 59e:   8d 83 83 e6 ff ff       lea    -0x197d(%ebx),%eax
 5a4:   50                      push   %eax

 # call printf
 5a5:   e8 06 fe ff ff          call   3b0 <printf@plt>
 5aa:   83 c4 10                add    $0x10,%esp

 # (the code continues)

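This assembly shows the details too: each same-size assignment stores the low bytes unchanged, and extension only happens when the arguments for printf are loaded.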

【Casts between different sizes】

There are three possible cases here:
"size before the cast < size after the cast", "size before the cast = size after the cast", and "size before the cast > size after the cast".

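「Size before the cast < size after the cast」

First, casts where the type before the cast is smaller than the type after it.
Here the behavior depends on the type before the cast.
If the source type is signed, the extra bytes are filled with copies of the MSB (sign extension).
If the source type is unsigned, the extra bytes are filled with zeros (zero extension).

For example, consider the following code.
The char arguments are implicitly promoted to int when they are passed to printf, so each one is a 1-byte → 4-byte conversion.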

#include<stdio.h>

int main(void){
    char c1 = 10;
    char c2 = -10;
    unsigned char uc1 = 10;
    unsigned char uc2 = 246;

    printf("c1 = %x, c2 = %x, uc1 = %x, uc2 = %x\n", c1, c2, uc1, uc2);
    return 0;
}

The output of this program is as follows.

c1 = a, c2 = fffffff6, uc1 = a, uc2 = f6

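As far as I can see, when the MSB of a signed char is 1 (plain char is signed with GCC on x86), the extended part is filled with 1s.
For unsigned char, the extended part is filled with zeros whether the MSB is 1 or 0.

This does not depend on whether the type after the cast is signed or unsigned.
The same result also seems to hold for short, long, long long and their unsigned versions.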

Let's check the assembly for the c1/c2/uc1/uc2 program above as well.
As before, only part of it is shown.

 # char c1 = 10;
 53a:   c6 45 e4 0a             movb   $0xa,-0x1c(%ebp)

 # char c2 = -10;
 53e:   c6 45 e5 f6             movb   $0xf6,-0x1b(%ebp)

 # unsigned char uc1 = 10;
 542:   c6 45 e6 0a             movb   $0xa,-0x1a(%ebp)

 # unsigned char uc2 = 246;
 546:   c6 45 e7 f6             movb   $0xf6,-0x19(%ebp)

 # set uc2, uc1, to registers (use movzbl)
 54a:   0f b6 75 e7             movzbl -0x19(%ebp),%esi
 54e:   0f b6 5d e6             movzbl -0x1a(%ebp),%ebx

 # set c2, c1 to registers (use movsbl)
 552:   0f be 4d e5             movsbl -0x1b(%ebp),%ecx
 556:   0f be 55 e4             movsbl -0x1c(%ebp),%edx

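 # padding: with the 5 pushes below this totals 0x20 bytes, probably to keep the stack 16-byte aligned at the call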
 55a:   83 ec 0c                sub    $0xc,%esp

 # push uc2→push uc1→push c2→push c1
 55d:   56                      push   %esi
 55e:   53                      push   %ebx
 55f:   51                      push   %ecx
 560:   52                      push   %edx

 # push addr_of("c1 = %x, c2 = %x, uc1 = %x, uc2 = %x\n")
 561:   8d 90 38 e6 ff ff       lea    -0x19c8(%eax),%edx
 567:   52                      push   %edx

 # call printf
 568:   89 c3                   mov    %eax,%ebx
 56a:   e8 41 fe ff ff          call   3b0 <printf@plt>
 56f:   83 c4 20                add    $0x20,%esp

This assembly also shows movs being used for the signed variables and movz for the unsigned ones.

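「Size before the cast = size after the cast」

This is the same as the first example, but let's confirm it once more.
When the sizes before and after the cast are equal, the byte sequence is copied as is.
For example, when casting from "signed int" to "unsigned int", if the bytes before the cast are -1 (ffffffff), the bytes after the cast are also ffffffff.

The following program confirms this.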

#include<stdio.h>
int main(void) {
    char c = -1;
    unsigned char uc = c;
    short s = -1;
    unsigned short us = s;
    int i = -1;
    unsigned int ui = i;
    long l = -1;
    unsigned long ul = l;
    long long ll = -1;
    unsigned long long ull = ll;

    printf("c = %x, uc = %x\n", c, uc);
    printf("s = %x, us = %x\n", s, us);
    printf("i = %x, ui = %x\n", i, ui);
    printf("l = %lx, ul = %lx\n", l, ul);
    printf("ll = %llx, ull = %llx\n", ll, ull);
    return 0;
}

The output of this program is as follows.
Looking at c and uc, for example, the byte ff of c has been copied into uc as ff.

c = ffffffff, uc = ff
s = ffffffff, us = ffff
i = ffffffff, ui = ffffffff
l = ffffffff, ul = ffffffff
ll = ffffffffffffffff, ull = ffffffffffffffff

The assembly is as follows.

 # char c = -1;
 53a:   c6 45 d2 ff             movb   $0xff,-0x2e(%ebp)

 # unsigned char uc = c;
 53e:   0f b6 45 d2             movzbl -0x2e(%ebp),%eax
 542:   88 45 d3                mov    %al,-0x2d(%ebp)

 # short s = -1;
 545:   66 c7 45 d4 ff ff       movw   $0xffff,-0x2c(%ebp)

 # unsigned short us = s;
 54b:   0f b7 45 d4             movzwl -0x2c(%ebp),%eax
 54f:   66 89 45 d6             mov    %ax,-0x2a(%ebp)

 # int i = -1;
 553:   c7 45 d8 ff ff ff ff    movl   $0xffffffff,-0x28(%ebp)

 # unsigned int ui = i;
 55a:   8b 45 d8                mov    -0x28(%ebp),%eax
 55d:   89 45 dc                mov    %eax,-0x24(%ebp)

 # long l = -1;
 560:   c7 45 e0 ff ff ff ff    movl   $0xffffffff,-0x20(%ebp)

 # unsigned long ul = l;
 567:   8b 45 e0                mov    -0x20(%ebp),%eax
 56a:   89 45 e4                mov    %eax,-0x1c(%ebp)

 # long long ll = -1;
 56d:   c7 45 e8 ff ff ff ff    movl   $0xffffffff,-0x18(%ebp)
 574:   c7 45 ec ff ff ff ff    movl   $0xffffffff,-0x14(%ebp)

 # unsigned long long ull = ll;
 57b:   8b 45 e8                mov    -0x18(%ebp),%eax
 57e:   8b 55 ec                mov    -0x14(%ebp),%edx
 581:   89 45 f0                mov    %eax,-0x10(%ebp)
 584:   89 55 f4                mov    %edx,-0xc(%ebp)

 # push uc, c
 587:   0f b6 55 d3             movzbl -0x2d(%ebp),%edx
 58b:   0f be 45 d2             movsbl -0x2e(%ebp),%eax
 58f:   83 ec 04                sub    $0x4,%esp
 592:   52                      push   %edx
 593:   50                      push   %eax

 # push string literal
 594:   8d 83 c8 e6 ff ff       lea    -0x1938(%ebx),%eax
 59a:   50                      push   %eax

 # call printf
 59b:   e8 10 fe ff ff          call   3b0 <printf@plt>
 5a0:   83 c4 10                add    $0x10,%esp

 # push us, s
 5a3:   0f b7 55 d6             movzwl -0x2a(%ebp),%edx
 5a7:   0f bf 45 d4             movswl -0x2c(%ebp),%eax
 5ab:   83 ec 04                sub    $0x4,%esp
 5ae:   52                      push   %edx
 5af:   50                      push   %eax

 # push string literal
 5b0:   8d 83 d9 e6 ff ff       lea    -0x1927(%ebx),%eax
 5b6:   50                      push   %eax

 # call printf
 5b7:   e8 f4 fd ff ff          call   3b0 <printf@plt>
 5bc:   83 c4 10                add    $0x10,%esp

 # push ui, i
 5bf:   83 ec 04                sub    $0x4,%esp
 5c2:   ff 75 dc                pushl  -0x24(%ebp)
 5c5:   ff 75 d8                pushl  -0x28(%ebp)

 # push string literal
 5c8:   8d 83 ea e6 ff ff       lea    -0x1916(%ebx),%eax
 5ce:   50                      push   %eax

 # call printf
 5cf:   e8 dc fd ff ff          call   3b0 <printf@plt>
 5d4:   83 c4 10                add    $0x10,%esp

 # push ul, l
 5d7:   83 ec 04                sub    $0x4,%esp
 5da:   ff 75 e4                pushl  -0x1c(%ebp)
 5dd:   ff 75 e0                pushl  -0x20(%ebp)

 # push string literal
 5e0:   8d 83 fb e6 ff ff       lea    -0x1905(%ebx),%eax
 5e6:   50                      push   %eax

 # call printf
 5e7:   e8 c4 fd ff ff          call   3b0 <printf@plt>
 5ec:   83 c4 10                add    $0x10,%esp

 # push ull, ll
 5ef:   83 ec 0c                sub    $0xc,%esp
 5f2:   ff 75 f4                pushl  -0xc(%ebp)
 5f5:   ff 75 f0                pushl  -0x10(%ebp)
 5f8:   ff 75 ec                pushl  -0x14(%ebp)
 5fb:   ff 75 e8                pushl  -0x18(%ebp)

 # push string literal
 5fe:   8d 83 0e e7 ff ff       lea    -0x18f2(%ebx),%eax
 604:   50                      push   %eax

 # call printf
 605:   e8 a6 fd ff ff          call   3b0 <printf@plt>
 60a:   83 c4 20                add    $0x20,%esp

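What I found interesting is that the 8-byte variables ll and ull are simply handled with two movl or pushl instructions each.
This is 32-bit code that can only move 32 bits at a time, so it is natural when you think about it...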

I also tried the unsigned-to-signed direction.

#include<stdio.h>

int main(void) {
    unsigned char uc = 0xab;
    char c = uc;
    unsigned short us = 0xabcd;
    short s = us;
    unsigned int ui = 0xabcdef12;
    int i = ui;
    unsigned long ul = 0xabcdef12;
    long l = ul;
    unsigned long long ull = 0xabcdef1234567890ULL;
    long long ll = ull;

    printf("c = %x, uc = %x\n", c, uc);
    printf("s = %x, us = %x\n", s, us);
    printf("i = %x, ui = %x\n", i, ui);
    printf("l = %lx, ul = %lx\n", l, ul);
    printf("ll = %llx, ull = %llx\n", ll, ull);
    return 0;
}

The output is as follows.

c = ffffffab, uc = ab
s = ffffabcd, us = abcd
i = abcdef12, ui = abcdef12
l = abcdef12, ul = abcdef12
ll = abcdef1234567890, ull = abcdef1234567890

Personally, I found this quite interesting to look into.

「Size before the cast > size after the cast」

This time I experiment with just two types, char and int.

#include<stdio.h>

int main(void){
    int i = -2;
    unsigned int ui = 0xfffffffe;
    char c1 = i;
    char c2 = ui;
    unsigned char uc1 = i;
    unsigned char uc2 = ui; 

    printf("c1 = %x, c2 = %x, uc1 = %x, uc2 = %x\n", c1, c2, uc1, uc2);
    return 0;
}

The output is as follows.
Only the least significant byte of the 4 bytes is taken (on little-endian x86, the byte at the lowest address).

c1 = fffffffe, c2 = fffffffe, uc1 = fe, uc2 = fe

The assembly is as follows. I leave the rest for you to read.

 # int i = -2;
 53a:   c7 45 e0 fe ff ff ff    movl   $0xfffffffe,-0x20(%ebp)

 # unsigned int ui = 0xfffffffe;
 541:   c7 45 e4 fe ff ff ff    movl   $0xfffffffe,-0x1c(%ebp)

 # char c1 = i; (only the low 8 bits of the 32 bits are used)
 548:   8b 55 e0                mov    -0x20(%ebp),%edx
 54b:   88 55 dc                mov    %dl,-0x24(%ebp)

 # char c2 = ui; (only the low 8 bits of the 32 bits are used)
 54e:   8b 55 e4                mov    -0x1c(%ebp),%edx
 551:   88 55 dd                mov    %dl,-0x23(%ebp)

 # unsigned char uc1 = i; (only the low 8 bits of the 32 bits are used)
 554:   8b 55 e0                mov    -0x20(%ebp),%edx
 557:   88 55 de                mov    %dl,-0x22(%ebp)

 # unsigned char uc2 = ui; (only the low 8 bits of the 32 bits are used)
 55a:   8b 55 e4                mov    -0x1c(%ebp),%edx
 55d:   88 55 df                mov    %dl,-0x21(%ebp)

 # push uc2, uc1, c2, c1
 560:   0f b6 75 df             movzbl -0x21(%ebp),%esi
 564:   0f b6 5d de             movzbl -0x22(%ebp),%ebx
 568:   0f be 4d dd             movsbl -0x23(%ebp),%ecx
 56c:   0f be 55 dc             movsbl -0x24(%ebp),%edx
 570:   83 ec 0c                sub    $0xc,%esp
 573:   56                      push   %esi
 574:   53                      push   %ebx
 575:   51                      push   %ecx
 576:   52                      push   %edx

 # push string literal
 577:   8d 90 48 e6 ff ff       lea    -0x19b8(%eax),%edx
 57d:   52                      push   %edx

 # call printf
 57e:   89 c3                   mov    %eax,%ebx
 580:   e8 2b fe ff ff          call   3b0 <printf@plt>
 585:   83 c4 20                add    $0x20,%esp

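【Summary】

These experiments showed the following.
・When the size before the cast < the size after the cast, the value is sign-extended (the MSB is copied) if the source type is signed, and zero-extended if it is unsigned.
・When the size before the cast = the size after the cast, the cast is just a copy of the bit pattern.
・When the size before the cast > the size after the cast, the low bytes are simply cut out.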

FromNand's Diary

Personal memos