
TensorFlow

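1. Before running anything, check how the GPUs are being used:
Command: nvidia-smi    Note: shows the current GPU usage
or
Command: watch nvidia-smi    Note: keeps refreshing the GPU usage view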
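If you would rather make this check from a script, nvidia-smi can also print machine-readable CSV. The following is only a sketch: the helper names (parse_gpu_csv, query_gpus, least_loaded) are made up for illustration, and it assumes nvidia-smi is on the PATH of the machine where query_gpus() is called.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn the CSV rows printed by the query above into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, used, total, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
            "util_pct": int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def least_loaded(gpus):
    """Return the index of the GPU with the least memory currently in use."""
    return min(gpus, key=lambda gpu: gpu["mem_used_mib"])["index"]
```

On a GPU machine, least_loaded(query_gpus()) gives a candidate index for CUDA_VISIBLE_DEVICES. Picking by memory in use is just one possible policy.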
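2. Select the GPUs to train on:
Method 1: set it inside the Python program (before the framework initializes CUDA):
Code: os.environ['CUDA_VISIBLE_DEVICES'] = '0'    Note: use GPU 0
Code: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'    Note: use GPUs 0 and 1
Method 2: set it when launching the Python program:
Command: CUDA_VISIBLE_DEVICES=2 python yourcode.py
Command: CUDA_VISIBLE_DEVICES=0,1 python yourcode.py
Note: no spaces are allowed on either side of the '='.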
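One detail worth spelling out for Method 1: the variable only takes effect if it is set before the framework first touches CUDA, so it is safest to set it at the very top of the script, before importing TensorFlow or PyTorch. A minimal sketch, where select_gpus is a hypothetical helper and not part of either framework:

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given physical GPU ids to this process.

    Must run before the deep learning framework initializes CUDA.
    """
    # Number the devices the same way nvidia-smi does.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


select_gpus([0, 1])
# Import tensorflow / torch only after this point.
```

Inside the process the visible GPUs are renumbered from 0, so after select_gpus([2, 3]) physical GPU 2 appears as device 0.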

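Note: by default TensorFlow claims essentially all of the memory on the GPUs the model is placed on, so even a program that needs very little memory ties up the whole card. This is why nvidia-smi sometimes shows a GPU with low or zero compute utilization but full memory, and while it is in that state nobody else can use that GPU. Our company currently has many models and too few cards, so GPUs have to be used "considerately": with memory growth enabled, whatever GPU memory the process has not claimed stays available to other users.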

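3. Two ways to limit GPU memory usage (this is the TensorFlow 1.x API; in TensorFlow 2.x the same calls are available under tf.compat.v1):
Method 1: cap the process at a fixed fraction of GPU memory:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # use at most 40% of the GPU memory
session = tf.Session(config=config)
Method 2: start with a minimal allocation and request more memory on demand (recommended):
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)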
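The snippets above use sessions, which TensorFlow 2.x no longer needs. A rough TensorFlow 2.x equivalent of Method 2 is sketched below. The wrapper enable_memory_growth is a made-up name, and it returns 0 when TensorFlow is not installed purely so the sketch can run anywhere. Memory growth has to be configured before the GPUs are first used.

```python
def enable_memory_growth():
    """Ask TensorFlow 2.x to grow GPU memory on demand instead of grabbing it all.

    Returns the number of GPUs configured (0 if TensorFlow is not installed
    or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU has been initialized by any op.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call enable_memory_growth() once at program start, before building any model.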

PyTorch

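1. Tell the program which GPUs it may use
  import os
  os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
  os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
2. Train the network on multiple GPUs
  Method 1: use torch.nn.DataParallel. This approach suffers from load imbalance: during parallel computation the outputs are gathered on the first GPU and the loss is computed there every time, so the first card is used noticeably more than the others.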

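import torch
import torch.nn as nn

# Decide whether to run on the CPU or the GPU
def get_device():
    if torch.cuda.is_available():
        return torch.device('cuda:0')
    else:
        return torch.device('cpu')

# `model` is assumed to be an nn.Module that was created earlier
if torch.cuda.device_count() > 1:  # is there more than one GPU?
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # This one line is all it takes
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))  # e.g. device_ids=[0, 1, 2]
model = model.to(get_device())  # parameters must live on the first device in device_ids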

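  Method 2: use DistributedDataParallel
  Official documentation: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel . This class is mainly intended to speed up multi-machine, multi-GPU training, but it works fine on a single machine with multiple GPUs as well. Compared with DataParallel it is more elegant and faster (the official documentation says so), and it does not suffer from load imbalance. Its only small drawback is that the setup is slightly more involved.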

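The basic workflow for distributed training in PyTorch is as follows:

1. Before calling any other function in the distributed package, call init_process_group to initialize the process group; this also initializes the distributed package.
2. If you need collective communication within a subgroup, create the subgroup with new_group.
3. Create the distributed parallel model: DDP(model, device_ids=device_ids)
4. Create a Sampler for the dataset.
5. Use the launcher torch.distributed.launch to run the script once on each host and start training.
6. Call destroy_process_group() to tear down the process group.
For train_dataset it is best not to use a hand-written sampler; otherwise you have to reimplement the distributed data partitioning yourself.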
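To make it clearer what "distributed data partitioning" means in steps 4 and the note above, here is a simplified pure-Python sketch of how a sampler like DistributedSampler splits indices across processes when shuffling is off and no samples are dropped. shard_indices is a name made up for this illustration; the real class also handles shuffling and per-epoch seeding.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices that process `rank` would iterate over.

    The index list is padded by repeating indices from the start so that it
    divides evenly, then each process takes every num_replicas-th index.
    """
    num_samples = math.ceil(dataset_len / num_replicas)  # samples per process
    total_size = num_samples * num_replicas

    indices = list(range(dataset_len))
    # Pad so that every process gets exactly num_samples indices.
    while indices and len(indices) < total_size:
        indices += indices[: total_size - len(indices)]

    # Strided slice: rank 0 takes positions 0, N, 2N, ...; rank 1 takes 1, N+1, ...
    return indices[rank:total_size:num_replicas]


for rank in range(3):
    print(rank, shard_indices(10, 3, rank))
```

With 10 samples and 3 processes each process gets 4 indices, and two indices are repeated as padding. A hand-written sampler would have to reproduce this kind of bookkeeping.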

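First, the script has to be upgraded so that it can run independently on each machine (node).
We want fully distributed training with a separate process on every GPU of every node; with two nodes of four GPUs each, as in the multi-machine launch commands further down, that comes to 8 processes in total.
Next, initialize the distributed backend, wrap the model, and prepare the data so that each process trains on its own exclusive subset of the data. The updated code is as follows: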
import argparse

import torch
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader

# Each process runs on 1 GPU device specified by the local_rank argument.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=0, type=int, help='local rank of this process on its node, passed in by the launcher')
args = parser.parse_args()

# Initializes the distributed backend which will take care of synchronizing nodes/GPUs.
torch.distributed.init_process_group(backend='nccl')

# Encapsulate the model on the GPU assigned to the current process.
# model, dataset, loss_function and optimizer are assumed to be defined elsewhere.
device = torch.device('cuda', args.local_rank)
model = model.to(device)
distrib_model = torch.nn.parallel.DistributedDataParallel(model,
                                                          device_ids=[args.local_rank],
                                                          output_device=args.local_rank)

# Restricts data loading to a subset of the dataset exclusive to the current process
sampler = DistributedSampler(dataset)

dataloader = DataLoader(dataset, sampler=sampler)
for inputs, labels in dataloader:
    optimizer.zero_grad()                                  # Clear gradients from the previous step
    predictions = distrib_model(inputs.to(device))         # Forward pass
    loss = loss_function(predictions, labels.to(device))   # Compute loss function
    loss.backward()                                        # Backward pass
    optimizer.step()                                       # Optimizer step

# Tear down the process group once training is finished
torch.distributed.destroy_process_group()
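The script above expects the launcher to pass --local_rank on the command line. Newer launchers such as torchrun instead export the local rank in the LOCAL_RANK environment variable, and some recent PyTorch releases are reported to spell the flag --local-rank. The sketch below accepts all three; resolve_local_rank is a hypothetical helper, not a PyTorch function.

```python
import argparse
import os


def resolve_local_rank(argv=None):
    """Find this process's local rank from the command line or the environment.

    Order of precedence: --local_rank / --local-rank flag, then the
    LOCAL_RANK environment variable, then 0 (single-process run).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # parse_known_args ignores the training script's own arguments.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", 0))
```

In the training script, the value returned by resolve_local_rank() would take the place of args.local_rank.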

Running on a single machine with multiple GPUs
python -m torch.distributed.launch --nproc_per_node=3 --nnodes=1 --node_rank=0 yourscript.py
Running on multiple machines with multiple GPUs
Server 1: python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)
Server 2: python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)
The two commands are identical except for the --node_rank argument; --nproc_per_node is the number of GPUs you use on each server.
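How the launcher flags map onto processes can be checked with a little arithmetic. The sketch below uses illustrative function names and assumes the usual convention that global ranks are numbered node by node.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of the process driving GPU `local_rank` on node `node_rank`."""
    return node_rank * nproc_per_node + local_rank


# The two-server example above: 2 nodes with 4 GPUs each.
print("world size:", world_size(nnodes=2, nproc_per_node=4))
for node in range(2):
    ranks = [global_rank(node, 4, local) for local in range(4)]
    print("server", node + 1, "global ranks:", ranks)
```

For the two-server commands this gives a world size of 8, with server 1 holding global ranks 0 to 3 and server 2 holding ranks 4 to 7.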
