learnyounode Lesson 8 – HTTP Collect

In this lesson we need to write a program that collects all of the data from an HTTP GET request and logs the number of characters and the complete string of characters received from the server.  But didn’t we just do that in the last exercise?  Not exactly.  The http module emits events as the request is processed.  In the previous exercise we used the .on() method to log each data event as it arrived, one chunk at a time, without ever collecting the chunks into a complete response.  Take a look at the official solution from the previous lesson again.

var http = require('http')

http.get(process.argv[2], function (response) {
  response.setEncoding('utf8')
  response.on('data', console.log)
  response.on('error', console.error)
}).on('error', console.error)  

If you look closely you can see that we are only acting upon two events.  If we get a data event, we log that chunk of data to the console.  If we get an error event, we log the error to the console.  Nothing in our code gathers the chunks from multiple data events together, and nothing tells us when the response has finished.

We get a couple of hints from learnyounode about how to collect all of the data sent from the server rather than handling each chunk separately.  The first hint is that we can use the end event to determine when we have received all of the data.  The second hint is that we can leverage existing Node modules from npm, such as bl and concat-stream, to help us solve this problem.

Official Solution

var http = require('http')
var bl = require('bl')

http.get(process.argv[2], function (response) {
  response.pipe(bl(function (err, data) {
    if (err)
      return console.error(err)
    data = data.toString()
    console.log(data.length)
    console.log(data)
  }))
})

In the official solution, the buffer list (bl) module is used.  The response from the GET request is piped into a bl instance using the .pipe() method.  bl accepts a callback as an argument and, once the stream has ended, passes all of the collected data to that callback as a single Buffer-like object.  It is not necessary to use the bl module to complete this task.  However, it is worth knowing about because bl includes a number of useful prototype methods in its API, such as bl.get(index), which returns the byte at the specified index, and bl.slice(start, end), which returns a new Buffer containing the bytes in the specified range.  These methods are useful for working with buffered stream data.  To learn more about the bl module, refer to its API documentation on GitHub.
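
concat-stream, the other module mentioned in the hints, works in much the same way.  Below is a rough sketch of the same program using it (an illustration, not the official solution); note that concat-stream does not handle stream errors itself, so we listen for them on the response and request separately.

var http = require('http')
var concat = require('concat-stream')

http.get(process.argv[2], function (response) {
  response.setEncoding('utf8')
  // concat-stream does not receive stream errors, so handle them here
  response.on('error', console.error)
  // concat-stream buffers every chunk piped into it and calls the
  // callback once, with the complete body, after the stream ends
  response.pipe(concat(function (body) {
    console.log(body.length)
    console.log(body)
  }))
}).on('error', console.error)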

My Solution

If you are curious to see how this problem can be solved without using an additional node module, refer to the alternate solution below.

var http = require('http')
var url = process.argv[2]
var body = ''

http.get(url, function (response) {
  response.on('data', function (chunk) {
    body += chunk
  })
  response.on('end', function () {
    console.log(body.length)
    console.log(body)
  })
})

The alternate solution appends each chunk of the response to the body variable as data events occur.  When the end event fires, the handler we registered with response.on('end', ...) logs the length of the body and then the body itself to the console.

It is worth mentioning that using the http module’s events is also a great way to avoid getting stuck in a set of deeply nested callbacks.  Notice how var body is assigned outside of the asynchronous http.get() call?  Each time we get a data event, we concatenate the next chunk of the response to body.  Then, when the end event is emitted, we know the asynchronous operation is complete and we can log the contents of body to the console.  Try it yourself by running node http-collect.js http://google.com on the command line.  You should see something like the following output:

219
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Compare this to the output of the curl command by running curl -v http://google.com.

* Rebuilt URL to: http://google.com/
*   Trying 2607:f8b0:4005:809::200e...
*   Trying 172.217.6.46...
* Connected to google.com (172.217.6.46) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Wed, 24 Jan 2018 16:19:28 GMT
< Expires: Fri, 23 Feb 2018 16:19:28 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact

You can see that the body is identical and that the Content-Length header matches the 219 characters we logged.  What happens if you try to log body outside of the callback?

var http = require('http')
var url = process.argv[2]
var body = ''

http.get(url, function (response) {
  response.on('data', function (chunk) {
    body += chunk
  })
  response.on('end', function () {
  })
})

console.log(body.length)
console.log(body)

Well, you get a length of 0 and an empty string…

Why?  The answer is timing.  http.get() is executed first and starts some asynchronous work.  Then both console.log() statements are executed, synchronously, right away.  All of this happens before we get an HTTP response back, so nothing has been concatenated to body before we log its contents.  The data and end handlers we registered only run later, once the response actually arrives, long after our logging has finished; Node.js does not pause the rest of the script to wait for asynchronous work to complete.
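
If the timing is hard to picture, here is a stripped-down sketch that uses setTimeout to stand in for the asynchronous request (the delay and variable names are purely for illustration):

var body = ''

// stands in for the asynchronous work kicked off by http.get()
setTimeout(function () {
  body += 'response body'
}, 100)

// these run immediately, before the timer callback fires,
// so body is still an empty string at this point
console.log(body.length)
console.log(body)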

What if we still want to pass the results of http.get() to another function?  Think back to Make It Modular.  Let’s apply the same thinking by wrapping http.get() in another function that accepts a callback, and then calling that callback inside the anonymous function we pass to response.on('end').

var http = require('http')
var url = process.argv[2]
var body = ''

var getBody = function (callback) {
  http.get(url, function (response) {
    response.on('data', function (chunk) {
      body += chunk
    })
    response.on('end', function () {
      return callback()
    })
  })
}

getBody(function () {
  console.log(body.length)
  console.log(body)
})

As you can see, the callback lets us coordinate logging the contents of body with the end event of our HTTP response.  There are other ways to pass around asynchronous data; take a look at this article to learn how to pass asynchronous data with promises.
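
To give a rough idea of that approach, here is a minimal sketch that wraps the same request in a Promise (a sketch of the general pattern, not the code from the linked article):

var http = require('http')

var getBody = function (url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (response) {
      var body = ''
      response.setEncoding('utf8')
      response.on('data', function (chunk) {
        body += chunk
      })
      // resolve once the complete body has been collected
      response.on('end', function () {
        resolve(body)
      })
      response.on('error', reject)
    }).on('error', reject)
  })
}

getBody(process.argv[2]).then(function (body) {
  console.log(body.length)
  console.log(body)
}).catch(console.error)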
