定義并訓練圖以進行異步更新 · 精通 TensorFlow 1.x

# 定義并訓練圖以進行異步更新如前所述，并在此處的圖中顯示，在異步更新中，所有工作任務在準備就緒時發送參數更新，參數服務器更新參數并發回參數。參數更新沒有同步或等待或聚合： ![](https://img.kancloud.cn/5f/62/5f62bc264d8dea3d6ffccded8062a50c_552x261.png)The full code for this example is in?`ch-15_mnist_dist_async.py`. You are encouraged to modify and explore the code with your own datasets. 對于異步更新，將使用以下步驟創建和訓練圖： 1. 圖的定義在`with`塊內完成： ```py with tf.device(device_func): ``` 1. 使用內置的 TensorFlow 函數創建全局步驟變量： ```py global_step = tf.train.get_or_create_global_step() ``` 1. 此變量也可以定義為： ```py tf.Variable(0,name='global_step',trainable=False) ``` 1. 像往常一樣定義數據集，參數和超參數： ```py x_test = mnist.test.images y_test = mnist.test.labels n_outputs = 10 # 0-9 digits n_inputs = 784 # total pixels learning_rate = 0.01 n_epochs = 50 batch_size = 100 n_batches = int(mnist.train.num_examples/batch_size) n_epochs_print=10 ``` 1. 像往常一樣定義占位符，權重，偏差，logits，交叉熵，損失操作，訓練操作，準確率： ```py # input images x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_inputs]) # target output y_p = tf.placeholder(dtype=tf.float32, name='y_p', shape=[None, n_outputs]) w = tf.Variable(tf.random_normal([n_inputs, n_outputs], name='w' ) ) b = tf.Variable(tf.random_normal([n_outputs], name='b' ) ) logits = tf.matmul(x_p,w) + b entropy_op = tf.nn.softmax_cross_entropy_with_logits(labels=y_p, logits=logits ) loss_op = tf.reduce_mean(entropy_op) optimizer = tf.train.GradientDescentOptimizer(learning_rate) train_op = optimizer.minimize(loss_op,global_step=global_step) correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y_p, 1)) accuracy_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32)) ``` 當我們學習如何構建同步更新時，這些定義將會改變。 1. TensorFlow 提供了一個主管類，可以幫助創建訓練會話，在分布式訓練設置中非常有用。創建一個 supervisor 對象，如下所示： ```py init_op = tf.global_variables_initializer sv = tf.train.Supervisor(is_chief=is_chief, init_op = init_op(), global_step=global_step) ``` 1. 使用 supervisor 對象創建會話并像往常一樣在此會話塊下運行訓練： ```py with sv.prepare_or_wait_for_session(server.target) as mts: lstep = 0 for epoch in range(n_epochs): for batch in range(n_batches): x_batch, y_batch = mnist.train.next_batch(batch_size) feed_dict={x_p:x_batch,y_p:y_batch} _,loss,gstep=mts.run([train_op,loss_op,global_step], feed_dict=feed_dict) lstep +=1 if (epoch+1)%n_epochs_print==0: print('worker={},epoch={},global_step={}, \ local_step={},loss={}'. format(FLAGS.task_index,epoch,gstep,lstep,loss)) feed_dict={x_p:x_test,y_p:y_test} accuracy = mts.run(accuracy_op, feed_dict=feed_dict) print('worker={}, final accuracy = {}' .format(FLAGS.task_index,accuracy)) ``` 在啟動參數服務器時，我們得到以下輸出： ```py $ python3 ch-15_mnist_dist_async.py --job_name='ps' --task_index=**0** I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506 pciBusID: 0000:01:00.0 totalMemory: 15.89GiB freeMemory: 15.79GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1) E1213 16:50:14.023235178 27224 ev_epoll1_linux.c:1051] grpc epoll fd: 23 I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001} I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004} I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9001 ``` 在啟動工作任務時，我們得到以下三個輸出：工作器 1 的輸出： ```py $ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=**0** I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506 pciBusID: 0000:01:00.0 totalMemory: 15.89GiB freeMemory: 9.16GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1) E1213 16:50:37.516609689 27507 ev_epoll1_linux.c:1051] grpc epoll fd: 23 I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001} I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004} I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9002 I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 1421824c3df413b5 with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true worker=0,epoch=9,global_step=10896, local_step=5500, loss = 1.2575616836547852 worker=0,epoch=19,global_step=22453, local_step=11000, loss = 0.7158586382865906 worker=0,epoch=29,global_step=39019, local_step=16500, loss = 0.43712112307548523 worker=0,epoch=39,global_step=55513, local_step=22000, loss = 0.3935799300670624 worker=0,epoch=49,global_step=72002, local_step=27500, loss = 0.3877961337566376 worker=0, final accuracy = 0.8865000009536743 ``` 工作器 2 的輸出： ```py $ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=**1** I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506 pciBusID: 0000:01:00.0 totalMemory: 15.89GiB freeMemory: 12.43GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1) E1213 16:50:36.684334877 27461 ev_epoll1_linux.c:1051] grpc epoll fd: 23 I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001} I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004} I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9003 I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 2bd8a136213a1fce with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true worker=1,epoch=9,global_step=11085, local_step=5500, loss = 0.6955764889717102 worker=1,epoch=19,global_step=22728, local_step=11000, loss = 0.5891970992088318 worker=1,epoch=29,global_step=39074, local_step=16500, loss = 0.4183048903942108 worker=1,epoch=39,global_step=55599, local_step=22000, loss = 0.32243454456329346 worker=1,epoch=49,global_step=72105, local_step=27500, loss = 0.5384714007377625 worker=1, final accuracy = 0.8866000175476074 ``` 工作器 3 的輸出： ```py $ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=**2** I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506 pciBusID: 0000:01:00.0 totalMemory: 15.89GiB freeMemory: 15.70GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1) E1213 16:50:35.568349791 27449 ev_epoll1_linux.c:1051] grpc epoll fd: 23 I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001} I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004} I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://The full code for this example is in ch-15_mnist_dist_sync.py. You are encouraged to modify and explore the code with your own datasets.localhost:9004 I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session cb0749c9f5fc163e with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 55bf9a2b9718a571 with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true worker=2,epoch=9,global_step=37367, local_step=5500, loss = 0.8077645301818848 worker=2,epoch=19,global_step=53859, local_step=11000, loss = 0.26333487033843994 worker=2,epoch=29,global_step=70299, local_step=16500, loss = 0.6506651043891907 worker=2,epoch=39,global_step=76999, local_step=22000, loss = 0.20321622490882874 worker=2,epoch=49,global_step=82499, local_step=27500, loss = 0.4170967936515808 worker=2, final accuracy = 0.8894000053405762 ``` 我們打印了全球步驟和本地步驟。全局步驟表示所有工作器任務的步數，而本地步驟是該工作器任務中的計數，這就是為什么本地任務計數高達 27,500 并且每個工作器的每個周期都相同，但是因為工作器正在做按照自己的步伐采取全球性措施，全球步驟的數量在周期或工作器之間沒有對稱性或模式。此外，我們發現每個工作器的最終準確率是不同的，因為每個工作器在不同的時間執行最終的準確性，當時有不同的參數。