
# IO

* [`[Doc]` Buffer](sections/io.md#buffer)
* [`[Doc]` String Decoder](sections/io.md#string-decoder)
* [`[Doc]` Stream](sections/io.md#stream)
* [`[Doc]` Console](sections/io.md#console)
* [`[Doc]` File System](sections/io.md#file)
* [`[Doc]` Readline](sections/io.md#readline)
* [`[Doc]` REPL](sections/io.md#repl)

# Overview

Node.js is known for IO-intensive workloads. So here is the question: do you really understand what IO is, and what an IO-intensive workload is?

## Buffer

Buffer is the class Node.js uses to handle binary data, and every IO-related operation (network, file, and so on) is built on it. A Buffer instance looks very much like an array of integers, ***but its size is fixed***, and its memory is raw memory allocated outside the V8 heap. Once a Buffer instance has been created, the amount of memory it occupies cannot be changed.

Starting with Node.js v6.x, the `new Buffer()` API is deprecated. The reason is that it behaves differently depending on the type of its argument, so a developer who does not validate arguments, does not initialize the Buffer's contents correctly, or is simply unaware of this behavior can unknowingly introduce security and reliability problems into the code.

API|Purpose
---|---
Buffer.from()|Creates a Buffer from existing data
Buffer.alloc()|Creates an initialized Buffer
Buffer.allocUnsafe()|Creates an uninitialized Buffer

### TypedArray

After ES6 introduced TypedArray, Node.js reworked its Buffer implementation to be based on Uint8Array (one of the TypedArray types), which improved performance.

In practice, you need to be aware of the following behavior:

```javascript
const arr = new Uint16Array(2);
arr[0] = 5000;
arr[1] = 4000;

const buf1 = Buffer.from(arr); // copies the values of the typed array
const buf2 = Buffer.from(arr.buffer); // shares memory with the typed array

console.log(buf1);
// Output: <Buffer 88 a0>, the copy has only two elements
console.log(buf2);
// Output: <Buffer 88 13 a0 0f>

arr[1] = 6000;
console.log(buf1);
// Output: <Buffer 88 a0>
console.log(buf2);
// Output: <Buffer 88 13 70 17>
```

## String Decoder

The String Decoder module decodes Buffers into strings. It complements Buffer.toString and supports multi-byte UTF-8 and UTF-16 characters. For example:

```javascript
const StringDecoder = require('string_decoder').StringDecoder;
const decoder = new StringDecoder('utf8');

const cent = Buffer.from([0xC2, 0xA2]);
console.log(decoder.write(cent)); // ¢

const euro = Buffer.from([0xE2, 0x82, 0xAC]);
console.log(decoder.write(euro)); // €
```

It can also process a character whose bytes arrive in separate pieces:

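```javascript
const StringDecoder = require('string_decoder').StringDecoder;
const decoder = new StringDecoder('utf8');

decoder.write(Buffer.from([0xE2]));
decoder.write(Buffer.from([0x82]));
console.log(decoder.end(Buffer.from([0xAC]))); // €
```

## Stream

The built-in `stream` module is the foundation of several Node.js core modules. Streams, however, are a programming style that became popular long ago. The familiar C language shows what stream-style processing looks like:

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 4096   // assumed value; any fixed chunk size works
#define SUCCESS 0
#define FAILURE (-1)

int copy(const char *src, const char *dest) {
    FILE *fpSrc, *fpDest;
    char buf[BUF_SIZE] = {0};
    size_t lenSrc, lenDest;

    // Open the source file (binary mode so the bytes are copied unchanged)
    if ((fpSrc = fopen(src, "rb")) == NULL) {
        printf("Cannot open file '%s'\n", src);
        return FAILURE;
    }

    // Open the destination file
    if ((fpDest = fopen(dest, "wb")) == NULL) {
        printf("Cannot open file '%s'\n", dest);
        fclose(fpSrc);
        return FAILURE;
    }

    // Read up to BUF_SIZE bytes from src into buf
    while ((lenSrc = fread(buf, 1, BUF_SIZE, fpSrc)) > 0) {
        // Write the data in buf to dest
        if ((lenDest = fwrite(buf, 1, lenSrc, fpDest)) != lenSrc) {
            printf("Failed to write to file '%s'\n", dest);
            fclose(fpSrc);
            fclose(fpDest);
            return FAILURE;
        }
        // Clear buf after a successful write
        memset(buf, 0, BUF_SIZE);
    }

    // Close both files
    fclose(fpSrc);
    fclose(fpDest);
    return SUCCESS;
}
```

The use case is simple. Suppose you need to copy a 20 GB file. If you read all 20 GB into memory at once, you may run out of RAM or severely hurt performance. If you instead use a 1 MB buffer (buf), reading 1 MB and then writing 1 MB each time, the copy only ever uses 1 MB of memory no matter how large the file is.

Node.js works on the same principle as the C code above, except that reading and writing are made asynchronous through libuv and EventEmitter. On Linux/Unix you can get a feel for stream-style processing with the `|` operator.

### Stream types

Class|Use case|Methods to override
---|---|---
[Readable](https://github.com/substack/stream-handbook#readable-streams)|Read only|_read
[Writable](https://github.com/substack/stream-handbook#writable-streams)|Write only|_write
[Duplex](https://github.com/substack/stream-handbook#duplex)|Read and write|_read, _write
[Transform](https://github.com/substack/stream-handbook#transform)|Operates on written data, then exposes the result for reading|_transform, _flush

### Object mode

Streams created by the Node APIs operate only on strings and Buffer objects. A stream implementation can, however, work with other JavaScript value types (except null, which has a special meaning in streams). Such a stream is said to be in "object mode" (objectMode).

To create an object-mode stream, pass the `objectMode` option when constructing it. Trying to switch an existing stream into object mode is not safe.
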
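To make object mode concrete, here is a minimal sketch. The `records` array and its fields are hypothetical and exist only for illustration. The stream is built with the `objectMode` option at construction time, objects are pushed in, and `null` is reserved to signal the end of the stream.

```javascript
const { Readable } = require('stream');

// Hypothetical records, used only for illustration.
const records = [
  { id: 1, name: 'buffer' },
  { id: 2, name: 'stream' }
];

// objectMode is set when the stream is constructed.
const source = new Readable({
  objectMode: true,
  read() {} // no underlying resource; chunks are pushed manually below
});

for (const record of records) {
  source.push(record);
}
source.push(null); // null is reserved: it signals end-of-stream

// In object mode, each read() call returns one object, not a byte range.
const first = source.read();
const second = source.read();
const third = source.read(); // nothing left in the buffer

console.log(first, second, third);
```

Here `read()` is called directly to keep the sketch synchronous. Real code would more commonly consume the stream through `'data'` events or `pipe()`.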
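### Buffering

Take the C file-copy code at the start of this section as a model and set aside the asynchronous part. Buffering in a Node.js stream then means that data read from `src` into `buf` is not written straight to `dest`. It is first placed in a larger buffer, where it waits to be written to (consumed by) `dest`. In other words, the buffer decouples reading from writing.

Both Readable and Writable streams store data in an internal buffer, which can be accessed through `writable._writableState.getBuffer()` and `readable._readableState.buffer` respectively. The amount of data that may be buffered is set by the `highWaterMark` option passed to the stream constructor, expressed in bytes. For a stream in `objectMode`, the option is the number of objects that may be buffered.

#### Readable streams

When a Readable instance calls `stream.push()`, the data is pushed into the buffer. If the data is not consumed, that is, nobody calls `stream.read()` to read it, it stays in the buffer queue. Once the buffered data reaches the threshold given by `highWaterMark`, the Readable stream stops pulling data from the underlying resource until the currently buffered data has been consumed.

#### Writable streams

Repeatedly calling writable.write(chunk) on a Writable instance puts data into the stream's buffer. While the amount of buffered data is below `highWaterMark`, writable.write() returns true. Once it reaches the threshold, write() returns false. The chunk is still buffered, but the caller should stop writing and wait for the drain event before calling write again.

```javascript
// Write the data to the supplied writable stream one million times.
// Be attentive to back-pressure.
function writeOneMillionTimes(writer, data, encoding, callback) {
  let i = 1000000;
  write();
  function write() {
    let ok = true;
    do {
      i--;
      if (i === 0) {
        // last time!
        writer.write(data, encoding, callback);
      } else {
        // see if we should continue, or wait
        // don't pass the callback, because we're not done yet.
        ok = writer.write(data, encoding);
      }
    } while (i > 0 && ok);
    if (i > 0) {
      // had to stop early!
      // write some more once it drains
      writer.once('drain', write);
    }
  }
}
```

#### Duplex and Transform

Duplex and Transform streams are both readable and writable. Internally they maintain two buffers, one for reading and one for writing, so both sides can operate independently while data keeps flowing efficiently. For example, net.Socket is a Duplex stream: its Readable side lets you receive and consume data from the socket, and its Writable side lets you write data to the socket. Data is often written at a different speed than it is consumed, so it matters that the two sides can operate and buffer independently.

### pipe

A stream's `.pipe()` method attaches a Writable stream to a Readable stream, switches the Readable stream into flowing mode, and pushes all of its data to the Writable stream. During a pipe, chunks are passed by reference in `objectMode`, while in non-`objectMode` a copy of the data is passed on. The main purpose of pipe is to keep the amount of buffered data at an acceptable level, so that a speed difference between source and destination does not fill up memory.
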
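As a rough illustration of the paragraph above, the sketch below pipes a fast source into a deliberately slow destination. The names `pipeDemo` and `slowSink`, the chunk contents, the 10 ms delay, and the small `highWaterMark` values are all assumptions made up for this example. `pipe()` handles the pausing and resuming, so nothing accumulates without bound and the chunks arrive in order.

```javascript
const { Readable, Writable } = require('stream');

// Hypothetical helper: resolves with the chunks the slow sink received.
function pipeDemo() {
  return new Promise((resolve) => {
    const received = [];
    let sent = 0;

    // Fast source: produces five small chunks, then ends.
    const source = new Readable({
      highWaterMark: 4, // small threshold (bytes) so back-pressure kicks in early
      read() {
        sent += 1;
        this.push(sent <= 5 ? `chunk-${sent}` : null);
      }
    });

    // Slow destination: takes 10 ms to accept each chunk.
    const slowSink = new Writable({
      highWaterMark: 4,
      write(chunk, encoding, callback) {
        received.push(chunk.toString());
        setTimeout(callback, 10);
      }
    });

    // 'finish' fires once every chunk has been written and the sink has ended.
    slowSink.on('finish', () => resolve(received));

    // pipe() pauses the source while the sink is saturated and resumes it
    // on 'drain'.
    source.pipe(slowSink);
  });
}

pipeDemo().then((received) => console.log(received));
```

Doing the same by hand would mean checking the return value of `write()` and waiting for `'drain'`, which is the bookkeeping `pipe()` takes care of.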
For how pipe is implemented, see David Cai's article [An analysis of the pipe implementation in Node.js through its source code](https://cnodejs.org/topic/56ba030271204e03637a3870) (in Chinese).

## Console

[console.log is normally asynchronous, unless you use `new Console(stdout[, stderr])` with a file as the destination](https://nodejs.org/dist/latest-v6.x/docs/api/console.html#console_asynchronous_vs_synchronous_consoles). In the usual case, though, the implementation looks like this ([6.x source](https://github.com/nodejs/node/blob/v6.x/lib/console.js#L42)):

```javascript
// As of v8 5.0.71.32, the combination of rest param, template string
// and .apply(null, args) benchmarks consistently faster than using
// the spread operator when calling util.format.
Console.prototype.log = function(...args) {
  this._stdout.write(`${util.format.apply(null, args)}\n`);
};
```

If you want to implement console.log yourself, the following code is a starting point:

```javascript
let print = (str) => process.stdout.write(str + '\n');

print('hello world');
```

Note: this code handles neither multiple arguments nor placeholders (that is, what util.format does).

### The console.log.bind(console) problem

```javascript
// Source: https://github.com/nodejs/node/blob/v6.x/lib/console.js
function Console(stdout, stderr) {
  // ... init ...

  // bind the prototype functions to this Console instance
  var keys = Object.keys(Console.prototype);
  for (var v = 0; v < keys.length; v++) {
    var k = keys[v];
    this[k] = this[k].bind(this);
  }
}
```

## File

"Everything is a file" is one of the basic philosophies of Unix/Linux. Not only regular files but also directories, character devices, block devices, sockets, and so on are treated as files in Unix/Linux. In other words, all of these resources are operated on through an fd (file descriptor) and can be read and written with the same set of system calls. On Linux you can use ulimit to manage and limit fd resources to some extent.

Node.js wraps the set of standard POSIX file I/O operations. The module is loaded with require('fs'). All methods in the module have an asynchronous and a synchronous version.

You can obtain a file's descriptor with fs.open.

### Encoding

// TODO: UTF-8, GBK, encoding support in ES6, how to compute the length of a Chinese character, BOM

### stdio

stdio (standard input/output) refers to the three standard streams: the input stream (stdin), the output stream (stdout), and the error stream (stderr). In Node.js they correspond to three streams: `process.stdin` (Readable), `process.stdout` (Writable), and `process.stderr` (Writable).

An output function is the first function everyone learns when picking up any programming language.
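Examples are C's `printf("hello, world!");`, Python/Ruby's `print 'hello, world!'`, and JavaScript's `console.log('hello, world!');`.

Expressed as simplified C, such an output function is implemented roughly as follows:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

// Simplified sketch of a printf-style function. It is named my_fprintf
// here so that it does not clash with the real printf.
int my_fprintf(FILE *stream, const char *fmt, ...) {
    // 1. Allocate a temporary block of memory
    char *s = malloc(4096);
    if (s == NULL) return -1;

    // 2. Format the content to print; the result is stored in s
    va_list args;
    va_start(args, fmt);
    int len = vsnprintf(s, 4096, fmt, args);
    va_end(args);
    if (len < 0) { free(s); return -1; }
    if (len > 4095) len = 4095; // the output was truncated to fit s

    // 3. Write the contents of s to stream
    fwrite(s, 1, (size_t)len, stream);

    // 4. Free the temporary memory
    free(s);
    return len;
}
```

The step to understand is step 3, where the stream is stdout (the output stream).

When you run a program from a shell, the first thing the shell does is fork its own process. (That is why, if you use ps to look at a process you started from the shell, its parent pid is the pid of that shell.) During this step the shell's stdio is inherited by your program's process. So when your process writes data to stdout, it is writing to the shell's stdout, and the data shows up in the current shell.

Input works the same way. The process inherits the shell's stdin, so when you read from stdin you get the data typed into the shell.

(PS: the shell can be cmd or PowerShell on Windows, or bash, zsh, and so on under Linux.)

When you use ssh to run a command on a remote server, the command's output is also written to the stdout of a shell on the server. That remote shell, however, was forked from the sshd service, and its stdout is an fd inherited from sshd. This fd is in fact a socket, so the output ends up written to a socket and is transmitted over it to the stdout of the shell on your local machine.

If you understand the above, you can also see why a daemon process needs to close its stdio. If a daemon that has moved to the background does not close its stdio, output will inexplicably appear on your screen while you are working in the shell. This corresponds to the following part of the C implementation in the [daemon process](sections/process.md#守護進程) section:

```c
for (; i < getdtablesize(); ++i) {
   close(i);  // close every open fd
}
```

On Linux/Unix, fds are designed as integers starting from 0. You can run the following code to see for yourself.

```javascript
console.log(process.stdin.fd); // 0
console.log(process.stdout.fd); // 1
console.log(process.stderr.fd); // 2
```

The previous section's [Before the IPC channel is established, how do the parent and child processes communicate? If they do not, how is IPC set up?](sections/process.md#q-child) uses an environment variable to pass an fd. That approach now looks straightforward, because passing an fd is just passing an integer.

### How do you read user input synchronously?

If you understand the above, then in Node.js terms reading user input simply means reading data from the process's input stream (the process.stdin stream).

To read synchronously, use the synchronous readSync interface on stdin instead of the asynchronous read interface. The following comes from the ever-helpful Stack Overflow:

```javascript
/*
 * http://stackoverflow.com/questions/3430939/node-js-readsync-from-stdin
 * @mklement0
 */
const fs = require('fs');

const BUFSIZE = 256;
// Buffer.alloc replaces the deprecated `new Buffer(size)`.
const buf = Buffer.alloc(BUFSIZE);
let bytesRead;

module.exports = function() {
  const fd = ('win32' === process.platform)
    ? process.stdin.fd
    : fs.openSync('/dev/stdin', 'rs');
  bytesRead = 0;

  try {
    bytesRead = fs.readSync(fd, buf, 0, BUFSIZE);
  } catch (e) {
    if (e.code === 'EAGAIN') {
      // 'resource temporarily unavailable'
      // Happens on OS X 10.8.3 (not Windows 7!), if there's no
      // stdin input - typically when invoking a script without any
      // input (for interactive stdin input).
      // If you were to just continue, you'd create a tight loop.
      console.error('ERROR: interactive stdin input not supported.');
      process.exit(1);
    } else if (e.code === 'EOF') {
      // Happens on Windows 7, but not OS X 10.8.3:
      // simply signals the end of *piped* stdin input.
      return '';
    }
    throw e; // unexpected exception
  }

  if (bytesRead === 0) {
    // No more stdin input available.
    // OS X 10.8.3: regardless of input method, this is how the end
    //   of input is signaled.
    // Windows 7: this is how the end of input is signaled for
    //   *interactive* stdin input.
    return '';
  }

  // Process the chunk read. The final byte is dropped on the assumption
  // that it is the trailing newline.
  const content = buf.toString('utf8', 0, bytesRead - 1);

  return content;
};
```

## Readline

The `readline` module provides an interface for reading a Readable stream (such as process.stdin) one line at a time. You can also use it to read a file or a net or http stream, for example:

```javascript
const readline = require('readline');
const fs = require('fs');

const rl = readline.createInterface({
  input: fs.createReadStream('sample.txt')
});

rl.on('line', (line) => {
  console.log(`Line from file: ${line}`);
});
```

In terms of implementation, when readline reads from a TTY it listens with `input.on('keypress', onkeypress)` and treats the user pressing Enter as the start of a new line. When it reads from an ordinary stream, it buffers the data and uses a regular expression's .test to decide whether a new line has been reached.

PS: a quick plug. If you are writing a script, are not comfortable reading input asynchronously like this, and want to read user input synchronously, have a look at the [scanf](https://github.com/Lellansin/node-scanf/) module, a Node.js module used in a C-like way (it supports TypeScript).

## REPL

Read-Eval-Print-Loop (REPL)

Work in progress.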