Making a scraper in Node.js... - Gerardo Grimaldi


Saturday, March 15, 2014

Making a scraper in Node.js...

Let's make a multi-link crawler for a multi-page query on a web page listing job opportunities.

First, let's make a module of it, in a separate file from our Node server (whatever it is), and load it with:

var worker = require('./worker.js');

This line will load the module from our Express router, or whatever framework you are using.

First we need two libraries:

var request = require("request");
var cheerio = require("cheerio");

One for making requests easier (request) and another for making jQuery-style selectors available on the server side (cheerio).

Now we need the list of pages of the main web site. This one builds a default paginated list of the jobs posted that day under one link, and paginates based on that link, so...

var url = "http://www.bumeran.com.ar/empleos-publicacion-hoy.html";

Now, to expose this functionality to the Node server, we make a "start" function and export it.

In this example we make a request to the URL above and read the body with cheerio, taking the number of pages of that list from the paginator at the bottom.

Then we pass that number of pages to the scraper function:

exports.start = function(req, res) {
    request(url, function(err, resp, body) {
        if (err || resp.statusCode !== 200) {
            console.log(err); //throw err;
            return;
        }
        var $ = cheerio.load(body);
        var pages = $(".paginador.box a:nth-last-child(2)").text().trim();
        console.log(pages);
        scraper(pages);
    });
};

The scraper function must vary the URL by the page number; in this case it will be:

var url = "http://www.bumeran.com.ar/empleos-publicacion-hoy-pagina-" + NUMBEROFPAGE + ".html";
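As a quick sketch, building the full list of page URLs from a page count is just string concatenation (the helper name `pageUrls` is my own, not from the post):

```javascript
// Hypothetical helper: build the list of paginated URLs for a given page count.
// Pages on this site are 1-indexed, hence (i + 1).
function pageUrls(pages) {
    var urls = [];
    for (var i = 0; i < pages; i++) {
        urls.push("http://www.bumeran.com.ar/empleos-publicacion-hoy-pagina-" + (i + 1) + ".html");
    }
    return urls;
}

console.log(pageUrls(2));
// [ 'http://www.bumeran.com.ar/empleos-publicacion-hoy-pagina-1.html',
//   'http://www.bumeran.com.ar/empleos-publicacion-hoy-pagina-2.html' ]
```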

The scraper function only receives the number of pages from the scraping step above.

It reads every list on every page and sends the specific links to the job-description pages to the scraperLinks function; this process runs asynchronously by the nature of Node.js.

function scraper(pages) {
    for (var i = 0; i < pages; i++) {
        var url = "http://www.bumeran.com.ar/empleos-publicacion-hoy-pagina-" + (i + 1) + ".html";
        request(url, (function(i) {
            return function(err, resp, body) {
                if (err || resp.statusCode !== 200) {
                    console.log(err); //throw err;
                    return;
                }
                var $ = cheerio.load(body);
                $(".aviso_box.aviso_listado").each(function(index, tr) {
                    console.log("Scraping..." + $(this).attr("href"));
                    scraperLinks($(this).attr("href"));
                });
            };
        })(i));
    }
}
NOTE: We pass the value of "i" into an immediately invoked function so that each callback captures its own copy. Otherwise, because the requests complete asynchronously, every callback would see the final value of the loop variable.
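A minimal sketch of why the wrapper matters, simulating the async behavior by building the callbacks first and running them after the loop has finished:

```javascript
// With the wrapper: each callback captures its own copy of i.
var wrapped = [];
for (var i = 0; i < 3; i++) {
    wrapped.push((function(i) {
        return function() { return i; };
    })(i));
}

// Without the wrapper: every callback shares the same loop variable.
var naive = [];
for (var j = 0; j < 3; j++) {
    naive.push(function() { return j; });
}

// By now both loops are done (i === 3, j === 3), which is when
// async callbacks would actually fire.
var results = wrapped.map(function(cb) { return cb(); });
var broken = naive.map(function(cb) { return cb(); });
console.log(results); // [0, 1, 2]
console.log(broken);  // [3, 3, 3]
```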


Lastly, scraperLinks takes each link to a details page and extracts the info we need with plain jQuery-style selectors:

function scraperLinks(link) {
    var url = "http://www.bumeran.com.ar" + link;
    request(url, function(err, resp, body) {
        if (err || resp.statusCode !== 200) {
            console.log(err); //throw err;
            return;
        }
        var $ = cheerio.load(body);
        var location = $(".aviso-resumen-datos tr td").last().text().trim();
        var detail = $("#contenido_aviso p:nth-child(2)").text();
        var title = $(".box h2").first().text().trim();
        var date = $(".aviso-resumen-datos tbody tr td").first().text().trim();
        console.log("Saving..." + url);
        saveAd(url, location, detail, title, date);
    });
}

In this case we get the title, date, location, and details of the published job, and send them to the saveAd function, which receives the values and stores them however you like.

Well, that's all!
