Website Scraper in JavaScript
When it comes to website scraping, there are many languages to choose from: Python, Ruby, C++, Java, and so on. But one language stands out for its simplicity and ease of use: JavaScript. The vast amount of data available on the internet has made web scraping an essential technique for extracting valuable information. With the power of JavaScript and a library called Puppeteer, we can create effective website scrapers to automate the process.
Web scraping
Web scraping, also called web harvesting or web data extraction, is an automated process of collecting publicly available data from targeted websites. Instead of gathering data manually, you can use web scraping tools to acquire a vast amount of information automatically. It involves three main steps:
- Retrieve content from the website: you make HTTP requests to specific URLs on the targeted website. Depending on your goals, experience, and budget, you can either buy a scraping service or acquire the tools to build a web scraper yourself.
- Parse the HTML: the web scraper extracts the specific information you need from the HTML according to your requirements.
- Store the data: the results are saved in CSV or JSON format, or in a database, for further use.
Use cases of web scraping
Businesses use it for various purposes, such as market research, brand protection, travel fare aggregation, price monitoring, SEO monitoring, and review monitoring.
- Web scraping is broadly used for market research: to stay competitive, companies need to know their market and analyze competitors' data.
- It is crucial for brand protection because it allows gathering data all over the web to ensure that there are no violations in terms of brand security.
- Travel companies also use web scraping for travel fare aggregation. With the help of web scrapers, they search for deals across multiple websites and publish the results on their websites.
- Since businesses need to keep up with the ever-changing prices in the market, scraping prices is vital to make accurate pricing strategies.
- Web scraping also allows companies to conduct SEO monitoring to track their results and progress in search rankings. It is likewise used for review monitoring to track customer reviews.
Is web scraping legal?
Website scraping isn't illegal by itself, and no clear law or regulation addresses it directly. However, there are situations where web scraping can become illegal that you should consider:
- Website scrapers should not log into websites and then download data. By logging in to a website, users agree to its Terms of Service (ToS), which may forbid activity like automated data collection.
- Scraping creative works: You have to make sure that you are not breaching laws that may be applied to copyrighted data, such as designs, layouts, articles, videos, etc.
So, website scraping is a legal activity as long as it doesn't break any laws regarding the source targets or the data itself. However, before getting involved in any sort of scraping activity, seek legal advice.
Prerequisite for web scraping in JavaScript
- To perform web scraping in JavaScript, familiarize yourself with the basics of the language, such as functions, arrays, loops, and conditional statements.
- Understanding concepts like callbacks, promises, or async/await will help you handle asynchronous operations effectively.
- You should know a little bit about web scraping libraries or tools built for JavaScript, like Puppeteer, Cheerio, or jsdom.
- You need to know the developer tools in your browser (Chrome or Firefox), which help you find the class or id of a specific image or piece of text on a website.
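Since nearly every Puppeteer call returns a promise, the asynchronous concepts in the list above are worth a quick illustration. This toy example (the `fetchTitle` function and its delay are made up to stand in for a slow network operation) shows the same result handled with a `.then()` callback and with async/await:

```javascript
// A promise that resolves after a short delay, standing in for a slow
// operation such as loading a page in a browser.
function fetchTitle() {
  return new Promise((resolve) => {
    setTimeout(() => resolve('Quotes to Scrape'), 100);
  });
}

// Handling the result with a .then() callback:
fetchTitle().then((title) => console.log('then:', title));

// The same thing with async/await, which reads like synchronous code:
(async () => {
  const title = await fetchTitle();
  console.log('await:', title);
})();
```

Both approaches print the same title; async/await is usually preferred in Puppeteer scripts because a scraping flow is a sequence of "wait for this, then do that" steps.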
Understanding Puppeteer library
Puppeteer is a popular and powerful Node.js library developed by the Chrome team at Google. It is used for more than just scraping: you can perform various browser automation tasks, such as generating screenshots and PDFs of web pages, testing web applications, and crawling websites. We can use it to programmatically control and automate the Chrome or Chromium browser.
Benefits of using Puppeteer for website scraping
- It provides a complete browser environment and allows you to scrape websites that rely heavily on JavaScript to render content. We can use it to handle dynamic content, run JavaScript code, and interact with pages like a real user.
- We can easily install this library in any Node.js project by using a simple command.
- Puppeteer provides convenient methods for extracting data from a website by traversing the DOM and interacting with elements.
- You can run it against headless Chromium, or combine it with libraries like Cheerio or jsdom for additional data manipulation.
Step-by-step Guide to Scraping a Site Data
To scrape a website, you need an integrated development environment (IDE) like Visual Studio Code for writing and executing code. Additionally, you will need Node.js installed on your system as the runtime environment for executing JavaScript outside the browser. To check whether Node.js is already installed, type the command "node -v" in your command prompt. If Node.js is installed, it will display the version; otherwise it will display an error.
If you have both VS Code and Node.js on your system, you can easily scrape your desired website. Let's see how to scrape a website.
1. Setting up Puppeteer
Go to VS Code and create your project folder. Then open your terminal, navigate to your project directory, and run the command below to create a "package.json" file.
npm init -y
To install the Puppeteer library, you need to run the below command in the terminal.
npm install puppeteer
Installing this package might take a while because it also downloads Chromium, the browser Puppeteer uses for scraping. After installing Puppeteer, you will see a new file called "package-lock.json", and the Puppeteer version listed in your "package.json".
2. Launching Browser
Create a JavaScript file called "index.js", then import the Puppeteer library into it and write the code to scrape the web. We can link the Puppeteer library in index.js like:
const puppeteer = require('puppeteer');
A Puppeteer script usually runs inside an anonymous async function. Most of our code lives in the body of this async function because most Puppeteer methods return promises, which makes sense: when you're web scraping, you're telling a bot to wait for something to happen before taking some sort of action.
There are two ways of launching a browser in Puppeteer:
- The headless way: it runs without opening a browser window on your computer.
- The non-headless way: it runs by opening up a browser window.
Let’s see how to launch a headless browser with an async function.
// launch headless browser
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Continue with scraping logic
await browser.close();
})();
In the above async function, the first line launches a browser, and the second line opens a new tab in that browser. We can then tell the constant "page" to go somewhere by giving it a link; in the next step, we will see how to do the scraping logic. At the end of the code, we close the headless browser.
3. Scraping logic
In this step, we will see how to navigate a webpage using Puppeteer and how to extract data.
Navigating to a Web Page
We can pass the URL of the web page we want to scrape to Puppeteer's 'goto' method.
await page.goto('https://quotes.toscrape.com/');
In this code snippet, we instruct the browser to navigate to the specified URL.
Take a Screenshot of the Web Page
We can take a screenshot by using Puppeteer's 'screenshot' method, like:
await page.screenshot({ path: 'screenshot.png' });
After this, our full index.js code will look like this:
const puppeteer = require('puppeteer');
//Creating function
(async () => {
//launching browser
const browser = await puppeteer.launch({headless:false});
// open new page
const page = await browser.newPage();
//Open this website -- wait until the page content is loaded
await page.goto('https://quotes.toscrape.com/');
//Taking screenshot of quotes.toscrape.com website
await page.screenshot({ path: 'screenshot.png' });
//close the browser
await browser.close();
})();
To run this project, just run the command below in your terminal.
node index.js
After running this code, a browser window will open on your computer and then close automatically, because we close the browser at the end of the code. The screenshot of the web page will be saved into your project folder in PNG format.
Extract Title from the Web Page
To get the title of a web page, we use Puppeteer's "title()" method and log the result, like:
const pageTitle = await page.title();
console.log('Page title:', pageTitle);
After this, our full index.js code will look like this:
const puppeteer = require('puppeteer');
//Creating function
(async () => {
//launching browser
const browser = await puppeteer.launch({headless:false});
// open new page
const page = await browser.newPage();
//Open the website to scrape
await page.goto('https://quotes.toscrape.com/');
//get title
const pageTitle = await page.title();
console.log('Page title:', pageTitle);
await browser.close();
})();
To run this project, just run the "node index.js" command in your terminal. See the output below: the title was successfully extracted.
Extract specific Data from the Web Page
Puppeteer provides various methods and selectors to target specific page elements. To extract specific data from your desired website, you need to find the class name of the element using Chrome's developer tools. Go to the website you want to scrape and "right click > Inspect" to find the class name of your specific section.
Here, we want to scrape the author name, so we need to find its class name. Look at the image below: we found the class name "author" for the author name section using the developer tools.
To extract specific Data from the Web Page, we need to create a function. Let’s see an example:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
// Continue with scraping logic
await page.goto('https://quotes.toscrape.com/');
const grabAuthor = await page.evaluate(() =>{
const author = document.querySelector(".author");
return author.innerHTML;
})
console.log(grabAuthor);
await browser.close();
})();
So, in the example code above, we store the result of page.evaluate() in the constant "grabAuthor". Inside the evaluated function, we select the element with the class name "author" using document.querySelector() and store it in the variable 'author', then return its "innerHTML". Finally, we log "grabAuthor". As the output above shows, this function returns the author's name perfectly.
Extract Multiple Pieces of Data from the Web Page:
In this example, we scrape both the quotes and the author names. We pass the ".quote" class name to the document.querySelectorAll() method to select every quote block. The author's name sits inside a "small" element. We use a forEach loop to store each item in an array and finally return them all.
Example:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
//navigating a website
await page.goto('https://quotes.toscrape.com/');
//Creating function and store a data one by one in array quotesArr
const grapQuotes = await page.evaluate(() =>{
const quotes = document.querySelectorAll(".quote");
let quotesArr = [];
quotes.forEach((quoteTag) => {
const quoteInfo = quoteTag.querySelectorAll("span");
const actualQuote = quoteInfo[0];
const actualAuthor = quoteInfo[1];
const authorName = actualAuthor.querySelector("small");
//Pushing data to quotesArr
quotesArr.push({
quote : actualQuote.innerHTML,
author : authorName.innerHTML,
});
});
return quotesArr;
})
console.log(grapQuotes);
await browser.close();
})();
Output:
Conclusion
Web scraping in JavaScript with Puppeteer offers a powerful and flexible solution for extracting data from websites. By following this step-by-step guide and exploring the provided examples, you can begin leveraging Puppeteer's capabilities to scrape websites and automate your data extraction tasks.