Three Layer NNet execution times (CPU vs GPU)

The following description of the three-layer neural network project was compiled by Gemini 3 Pro.
＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝
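# Project Overview: PyTorch Neural Network (CUDA Support)
## 1. Requests and Goals from the User (Urata)
*   **Goal:** Modify `three_layer_nn_pytorch.py` to create a new script, `three_layer_nn_pytorch_cuda.py`, that supports NVIDIA GPU (CUDA) processing.
*   **Request 1:** Define a device for GPU processing and make the changes needed to move the data and the model to that device.
*   **Request 2:** Also run the original CPU script (`three_layer_nn_pytorch.py`) for comparison.
*   **Request 3:** Import the `time` module, then measure and print the execution time of both scripts.
*   **Request 4:** Run a benchmark and compare the execution times of the two scripts.
*   **Final request:** Save all of the work, answers, and requests so far in a single file.
## 2. Code Implementation
### A. CUDA Implementation (`three_layer_nn_pytorch_cuda.py`)
This script checks whether CUDA is available, moves the model and tensors to the GPU, and measures the execution time.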
```python
import torch
import time
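# Start the timer. `import torch` above has already run, so its cost is not included.
start_time = time.time()
import torch.nn as nn
import torch.optim as optim
# Define the device: use CUDA if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# 1. Data setup (with supervised targets)
# N: total number of samples, D_in: input dimension
# H1, H2: hidden-layer dimensions, D_out: output dimension
N, D_in, H1, H2, D_out = 500, 10, 100, 50, 1
# Generate synthetic input data
torch.manual_seed(42)
x_all = torch.randn(N, D_in)
# Generate the targets (ground-truth values)
# Relationship: y = sum(x^2) + noise
y_all = torch.sum(x_all**2, dim=1, keepdim=True) + 0.1 * torch.randn(N, 1)
# Split into training and test sets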
split_idx = int(N * 0.8)  # 80% for training, 20% for testing
x_train, x_test = x_all[:split_idx], x_all[split_idx:]
y_train, y_test = y_all[:split_idx], y_all[split_idx:]
# Move the data to the device
x_train = x_train.to(device)
y_train = y_train.to(device)
x_test = x_test.to(device)
y_test = y_test.to(device)
print(f"Data Shapes: Train x={x_train.shape}, y={y_train.shape} | Test x={x_test.shape}, y={y_test.shape}")
# 2. Model definition
model = nn.Sequential(
    nn.Linear(D_in, H1),
    nn.ReLU(),
    nn.Linear(H1, H2),
    nn.ReLU(),
    nn.Linear(H2, D_out)
)
model.to(device)
# Loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
print("Training PyTorch 3-Layer NN for 1000 steps...")
for t in range(1001):  # t = 0..1000, so the loss at step 1000 is logged as well
    # --- Forward pass ---
    y_pred = model(x_train)
    # --- Compute the loss ---
    loss = loss_fn(y_pred, y_train)
    if t % 100 == 0:
        print(f"Step {t}: Train Loss = {loss.item():.4f}")
    # --- Backward pass ---
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("Training Complete.")
# 3. Model evaluation
print("\n--- Model Evaluation ---")
model.eval()  # Put the model in evaluation mode
with torch.no_grad():  # Disable gradient computation
    y_pred_test = model(x_test)
    test_loss = loss_fn(y_pred_test, y_test)
    print(f"Test Loss: {test_loss.item():.4f}")
    # Compare the first five predictions with the ground truth
    print("\nFirst 5 Predictions vs Ground Truth:")
    for i in range(5):
        print(f"Pred: {y_pred_test[i].item():.4f} | True: {y_test[i].item():.4f}")
end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.4f} seconds")
```
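
The script above reports a single total, which folds CUDA start-up, host-to-device copies, and the training loop into one number. A minimal sketch of how those phases could be timed separately follows. The `phase` helper and the `timings` dictionary are hypothetical names introduced here and are not part of the original script. The optional `sync` argument is meant for a callable such as `torch.cuda.synchronize`, which waits for queued GPU work to finish so that the clock is not read too early. The workloads in the sketch are stand-ins so that it runs without a GPU.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical container: phase name -> elapsed seconds


@contextmanager
def phase(name, sync=None):
    """Time the enclosed block and store the result under `name`.

    `sync` is an optional zero-argument callable that is invoked before the
    clock is started and again before it is stopped (for example
    torch.cuda.synchronize when timing GPU work).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings[name] = time.perf_counter() - start


# Stand-in workloads (not the neural network) so the sketch runs anywhere.
with phase("setup"):
    data = [float(i) for i in range(10_000)]

with phase("compute"):
    total = sum(v * v for v in data)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

In the CUDA script, the `.to(device)` calls and the training loop could each be wrapped in their own `with phase(...)` block, passing `sync=torch.cuda.synchronize` only when `device.type == "cuda"`.
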
### B. CPU Implementation (`three_layer_nn_pytorch.py`)
This is the original script with execution-time measurement added.
```python
import torch
import time
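# Start the timer. `import torch` above has already run, so its cost is not included.
start_time = time.time()
import torch.nn as nn
import torch.optim as optim
# 1. Data setup (with supervised targets)
# N: total number of samples, D_in: input dimension
# H1, H2: hidden-layer dimensions, D_out: output dimension
N, D_in, H1, H2, D_out = 500, 10, 100, 50, 1
# Generate synthetic input data
torch.manual_seed(42)
x_all = torch.randn(N, D_in)
# Generate the targets (ground-truth values)
# Relationship: y = sum(x^2) + noise
y_all = torch.sum(x_all**2, dim=1, keepdim=True) + 0.1 * torch.randn(N, 1)
# Split into training and test sets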
split_idx = int(N * 0.8)  # 80% for training, 20% for testing
x_train, x_test = x_all[:split_idx], x_all[split_idx:]
y_train, y_test = y_all[:split_idx], y_all[split_idx:]
print(f"Data Shapes: Train x={x_train.shape}, y={y_train.shape} | Test x={x_test.shape}, y={y_test.shape}")
# 2. Model definition
model = nn.Sequential(
    nn.Linear(D_in, H1),
    nn.ReLU(),
    nn.Linear(H1, H2),
    nn.ReLU(),
    nn.Linear(H2, D_out)
)
# Loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
print("Training PyTorch 3-Layer NN for 1000 steps...")
for t in range(1001):  # t = 0..1000, so the loss at step 1000 is logged as well
    # --- Forward pass ---
    y_pred = model(x_train)
    # --- Compute the loss ---
    loss = loss_fn(y_pred, y_train)
    if t % 100 == 0:
        print(f"Step {t}: Train Loss = {loss.item():.4f}")
    # --- Backward pass ---
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("Training Complete.")
# 3. Model evaluation
print("\n--- Model Evaluation ---")
model.eval()  # Put the model in evaluation mode
with torch.no_grad():  # Disable gradient computation
    y_pred_test = model(x_test)
    test_loss = loss_fn(y_pred_test, y_test)
    print(f"Test Loss: {test_loss.item():.4f}")
    # Compare the first five predictions with the ground truth
    print("\nFirst 5 Predictions vs Ground Truth:")
    for i in range(5):
        print(f"Pred: {y_pred_test[i].item():.4f} | True: {y_test[i].item():.4f}")
end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.4f} seconds")
```
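
To put the size of this model in perspective, the parameter count and the number of multiply-adds per forward pass can be worked out directly from the layer sizes `D_in`, `H1`, `H2`, and `D_out` used in both scripts. The sketch below is plain arithmetic and does not need PyTorch; the helper name `linear_params` is introduced here purely for illustration.

```python
# Layer sizes taken from the two scripts above.
D_in, H1, H2, D_out = 10, 100, 50, 1
n_train = 400  # 80% of N = 500


def linear_params(fan_in, fan_out):
    """Weights plus biases of one fully connected layer."""
    return fan_in * fan_out + fan_out


layers = [(D_in, H1), (H1, H2), (H2, D_out)]
total_params = sum(linear_params(i, o) for i, o in layers)

# One multiply-add per weight per sample in the forward pass.
macs_per_sample = sum(i * o for i, o in layers)
macs_per_forward = macs_per_sample * n_train

print(f"Parameters: {total_params}")                          # 6201
print(f"Multiply-adds per sample: {macs_per_sample}")         # 6050
print(f"Multiply-adds per forward pass: {macs_per_forward}")  # 2420000
```

Roughly six thousand parameters and a few million multiply-adds per forward pass is a small amount of arithmetic per training step by the standards of current CPUs and GPUs.
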
## 3. Benchmark Results
Running both scripts produced the following output.
```text
Running CPU Benchmark... 
Data Shapes: Train x=torch.Size([400, 10]), y=torch.Size([400, 1]) | Test x=torch.Size([100, 10]), y=torch.Size([100, 1])
Training PyTorch 3-Layer NN for 1000 steps...
Step 0: Train Loss = 116.9025
Step 100: Train Loss = 5.9574
Step 200: Train Loss = 4.2802
Step 300: Train Loss = 3.1939
Step 400: Train Loss = 2.3619
Step 500: Train Loss = 1.6937
Step 600: Train Loss = 1.0309
Step 700: Train Loss = 0.3991
Step 800: Train Loss = 0.1677
Step 900: Train Loss = 0.0998
Step 1000: Train Loss = 0.0627
Training Complete.
--- Model Evaluation ---
Test Loss: 2.1775
First 5 Predictions vs Ground Truth:
Pred: 13.2606 | True: 15.5540
Pred: 12.4809 | True: 11.7322
Pred: 10.6773 | True: 10.5252
Pred: 7.9598 | True: 8.2041
Pred: 12.2528 | True: 11.4058
Total Execution Time: 3.1758 seconds
Running CUDA Benchmark... 
Using device: cuda
Data Shapes: Train x=torch.Size([400, 10]), y=torch.Size([400, 1]) | Test x=torch.Size([100, 10]), y=torch.Size([100, 1])
Training PyTorch 3-Layer NN for 1000 steps...
Step 0: Train Loss = 116.9025
Step 100: Train Loss = 5.9574
Step 200: Train Loss = 4.2802
Step 300: Train Loss = 3.1939
Step 400: Train Loss = 2.3619
Step 500: Train Loss = 1.6937
Step 600: Train Loss = 1.0310
Step 700: Train Loss = 0.3988
Step 800: Train Loss = 0.1675
Step 900: Train Loss = 0.0994
Step 1000: Train Loss = 0.0624
Training Complete.
--- Model Evaluation ---
Test Loss: 2.1802
First 5 Predictions vs Ground Truth:
Pred: 13.2754 | True: 15.5540
Pred: 12.4829 | True: 11.7322
Pred: 10.7247 | True: 10.5252
Pred: 7.9729 | True: 8.2041
Pred: 12.2627 | True: 11.4058
Total Execution Time: 5.9927 seconds
Done. 
```
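
The driver that printed the `Running CPU Benchmark...` and `Done.` lines is not included in this summary. One way such a driver could look is sketched below, assuming both scripts sit in the current working directory. The function name `run_script` and the overall structure are hypothetical and are not taken from the original project.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_script(path):
    """Run a Python script in a fresh interpreter; return (wall seconds, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, str(path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return time.perf_counter() - start, result.stdout


if __name__ == "__main__":
    benchmarks = [
        ("CPU", Path("three_layer_nn_pytorch.py")),
        ("CUDA", Path("three_layer_nn_pytorch_cuda.py")),
    ]
    for label, script in benchmarks:
        print(f"Running {label} Benchmark...")
        if not script.exists():
            print(f"  skipped: {script} not found")
            continue
        elapsed, output = run_script(script)
        print(output, end="")
        print(f"  wall-clock time including interpreter start-up: {elapsed:.4f} s")
    print("Done.")
```

Because each script runs in a fresh process, the externally measured time also covers interpreter start-up and `import torch`, which the scripts' own `Total Execution Time` figure excludes.
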
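## 4. Discussion and Conclusion
*   **Performance:** For this particular task, the CPU implementation (~3.18 s) was faster than the CUDA implementation (~5.99 s).
*   **Reasons:**
    1.  **Initialization overhead:** CUDA incurs a start-up cost. Because the timer starts before the device is set up, that cost is part of the measured total.
    2.  **Data transfer:** Moving data to the GPU (and back from it) takes time.
    3.  **Small workload:** The neural network and the dataset (N=500) are too small to benefit fully from the GPU's parallelism, so the overhead outweighs the gain in computation speed.
*   **Implication:** GPU acceleration pays off most for larger models and larger datasets, where there is enough computation to justify the cost of data transfer and initialization.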
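
One way to check the workload-size argument is to time training steps at several batch sizes and hidden widths on each available device. The sketch below is an untested-for-this-report experiment under stated assumptions: the function name `seconds_per_step`, the chosen sizes, and the step counts are all illustrative, and no results are claimed here. It measures the CPU always and CUDA only when `torch.cuda.is_available()` returns true.

```python
import time

import torch
import torch.nn as nn


def seconds_per_step(device, n_samples, hidden, steps=50):
    """Mean wall-clock seconds per training step, excluding one warm-up step."""
    model = nn.Sequential(
        nn.Linear(10, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    ).to(device)
    x = torch.randn(n_samples, 10, device=device)
    y = torch.randn(n_samples, 1, device=device)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def step():
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    def wait():
        # CUDA kernels run asynchronously; wait for them before reading the clock.
        if device.type == "cuda":
            torch.cuda.synchronize()

    step()  # warm-up: absorbs one-time initialization costs
    wait()
    start = time.perf_counter()
    for _ in range(steps):
        step()
    wait()
    return (time.perf_counter() - start) / steps


if __name__ == "__main__":
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices.append(torch.device("cuda"))
    # Illustrative sizes only; the first row is close to the workload in this report.
    for n_samples, hidden in [(400, 100), (4_000, 256), (16_000, 512)]:
        for device in devices:
            t = seconds_per_step(device, n_samples, hidden, steps=10)
            print(f"N={n_samples:>6} hidden={hidden:>4} {device.type:>4}: {t * 1e3:.3f} ms/step")
```

The warm-up step and the explicit synchronization keep one-time start-up costs and asynchronous kernel launches out of the per-step figure, so the comparison isolates the steady-state compute cost from the fixed overheads listed above.
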