Parallelizing TensorFlow · TensorFlow Machine Learning Cookbook, Second Edition (Chinese edition)

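# Parallelizing TensorFlow

To extend the reach of TensorFlow parallelization, we can also execute separate operations from our graph on entirely different machines in a distributed manner. This recipe shows you how.

## Getting ready

A few months after the release of TensorFlow, Google released Distributed TensorFlow, a major upgrade to the TensorFlow ecosystem. It allows a TensorFlow cluster to be set up across different worker machines, which then share the computational work of training and evaluating models. Using Distributed TensorFlow is as simple as setting up parameters for the workers and then assigning different jobs to different workers. In this recipe, we will set up two local workers and assign them different jobs.

## How to do it

1. First, we load TensorFlow and define our two local workers with a configuration dictionary (ports `2222` and `2223`), as follows:

```py
import tensorflow as tf

# Cluster for 2 local workers (tasks 0 and 1):
cluster = tf.train.ClusterSpec({'local': ['localhost:2222', 'localhost:2223']})
```

2. Now we start a server for each of the two workers and label each one with its task index:

```py
# Keep a separate reference to each in-process server
server_0 = tf.train.Server(cluster, job_name="local", task_index=0)
server_1 = tf.train.Server(cluster, job_name="local", task_index=1)
```

3. Next, we give each worker a task. The first worker initializes two matrices (each 25 by 25). The second worker computes the sum of all the elements of each matrix. We then let TensorFlow place the addition of the two sums automatically and print the output, as follows:

```py
mat_dim = 25
matrix_list = {}

# Worker 0 creates the two random matrices
with tf.device('/job:local/task:0'):
    for i in range(0, 2):
        m_label = 'm_{}'.format(i)
        matrix_list[m_label] = tf.random_normal([mat_dim, mat_dim])

# Worker 1 calculates the sum of each matrix
sum_outs = {}
with tf.device('/job:local/task:1'):
    for i in range(0, 2):
        A = matrix_list['m_{}'.format(i)]
        sum_outs['m_{}'.format(i)] = tf.reduce_sum(A)

# Sum all the sums (no device is given, so placement is automatic)
summed_out = tf.add_n(list(sum_outs.values()))

# Connect the session to the server for task 1
with tf.Session(server_1.target) as sess:
    result = sess.run(summed_out)
    print('Summed Values:{}'.format(result))
```

4. After entering the preceding code, we can run the following command at the command prompt:

```py
$ python3 parallelizing_tensorflow.py
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:206] Started server with target: grpc://localhost:2222
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:206] Started server with target: grpc://localhost:2223
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 252bb6f530553002 with config:
Summed Values:-21.12611198425293
```

## How it works

Using Distributed TensorFlow is quite straightforward. All you have to do is assign the worker addresses to a named job in the cluster specification and start a server for each task. Operations can then be assigned to the workers manually or automatically.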
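The relationship between the named job in the cluster dictionary and the `/job:.../task:...` device strings can be made explicit with a small sketch that does not need TensorFlow. The helper names `make_cluster_dict` and `device_for` are hypothetical and are not part of TensorFlow. The sketch assumes only that a task's index is its position in the job's address list, as in this recipe.

```python
def make_cluster_dict(job_name, host, ports):
    """Build a plain dictionary shaped like the one given to tf.train.ClusterSpec."""
    return {job_name: ['{}:{}'.format(host, port) for port in ports]}


def device_for(job_name, task_index):
    """Return the device string that pins operations to one task of a job."""
    return '/job:{}/task:{}'.format(job_name, task_index)


# Same job name and ports as the recipe (hypothetical helper, illustrative only).
cluster_dict = make_cluster_dict('local', 'localhost', [2222, 2223])

# Map each device string to the address of the worker that would serve it.
placement = {
    device_for(job, index): address
    for job, addresses in cluster_dict.items()
    for index, address in enumerate(addresses)
}

for device, address in sorted(placement.items()):
    print('{} -> {}'.format(device, address))
```

The resulting dictionary has the same shape as the one used to define the cluster, and the generated device strings match those used in the `tf.device` blocks, so each manually placed operation can be traced back to a worker address.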