How to Extract an Image From a Website
Psst! You are reading a tutorial for version 7.3, which is slowly on its way out. The latest version 8.4 is more beginner-friendly with its newly designed task interface. Upgrade right now and check out the updated tutorial in our new help center.
Continue reading if you still want to complete your task in version 7.3. In this tutorial, we will show you how to use Octoparse to extract text, URLs (of links and images), and HTML.
But before we start, let's take a quick look at how Octoparse scrapes the data you need.
When building a new task, you usually begin by selecting the data on the web page that you want Octoparse to scrape. To select elements on the page, you create a selection, which generally takes two steps:
1. Click on your target data
2. Choose the appropriate action from "Action Tips", such as "Select all" or "Extract text of the selected element"
When you click the element you need, it is outlined in a green box. You may also notice other elements on the page highlighted in red boxes at the same time. This is because Octoparse works out the pattern that identifies the selected element on the web page and automatically highlights the other elements that match that pattern, since you may want to capture them all.
Once the selection is created, all similar elements across multiple pages are detected and added to the selection based on that pattern. Octoparse then repeats the extraction until every element in the selection has been extracted.
Now that you know a little more about how Octoparse works, let's see how to select and extract three specific types of data with Octoparse!
1) Extract Text
2) Extract the URL of a link or an image
3) Extract inner/outer HTML
1) Extract Text
Most data on the web is presented as human-readable text, such as news articles, product information, and blog posts. Once you can extract text and combine that skill with other techniques like pagination and list building, you can scrape data from almost any kind of web page.
Let's see how to select and extract the text data with Octoparse.
a. Click on any data you want
When you click the element you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.
b. Create the selection
Click "Select all". The similar elements that were highlighted in red turn green, and the selection appears in "Action Tips". Octoparse will then repeat the extraction until the text of every element in the selection has been extracted.
c. Extract text
Click "Extract text of the selected elements" to finish creating the selection.
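For readers who want to see the idea outside the point-and-click interface, here is a minimal Python sketch of "select all similar elements, then extract their text". It is not how Octoparse is implemented; the sample HTML, the tag/class "pattern", and the names TextBySelector and extract_text are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: two list items that share the same structure.
SAMPLE_HTML = """
<ul>
  <li><h2 class="title">First <em>book</em></h2><span class="price">$10</span></li>
  <li><h2 class="title">Second book</h2><span class="price">$12</span></li>
</ul>
"""


class TextBySelector(HTMLParser):
    """Collect the text of every element with a given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.current = []   # text fragments of the element being read
        self.results = []   # one cleaned string per matched element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close at the right end tag.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.current).split()))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def extract_text(html, tag, css_class):
    """Return the text of all elements matching the tag/class pattern."""
    parser = TextBySelector(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


titles = extract_text(SAMPLE_HTML, "h2", "title")
prices = extract_text(SAMPLE_HTML, "span", "price")
```

Here the tag name plus class plays the role of the pattern: clicking one title corresponds to choosing h2 with class "title", and "Select all" corresponds to collecting every element that matches it.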
2) Extract the URL of a link or an image
Colloquially, a URL is the address behind a hyperlink. With a single click on a link, you can open a new web page or go to a new website, just like what happens when you click on the title of a book on Amazon.
Besides web pages, a URL can also point to a specific file on the Internet, such as an image. Once you have the URL, you can download the corresponding file or image.
Let's see how to select and extract the URL of a link or an image with Octoparse.
a. Click on the link/image you want
When you click the link/image you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.
Tips!
When you select an item that has a URL, the selected tag at the bottom of "Action Tips" should be "A", which stands for anchor, the element that usually links one page to another. To create a pattern that correctly captures all the elements, make sure you select the right area.
b. Create the selection
Click "Select all". The similar elements that were highlighted in red turn green, and the selection appears in "Action Tips". Octoparse will then repeat the extraction until the URL of every element in the selection has been extracted.
c. Extract the URL
Click "Extract the URLs of the selected elements" or "Extract image URL in the loop" to finish creating the selection.
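To make it concrete what "the URL of a link or an image" means in the page source, here is a small Python sketch that reads the href of anchor ("A") tags and the src of image tags. This is an illustration under stated assumptions, not Octoparse's own code: the sample HTML, the base URL, and the names UrlCollector and collect_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page fragment and page address, used only for this example.
BASE_URL = "https://example.com/books/"
SAMPLE_HTML = """
<a href="/item/1"><img src="covers/1.jpg" alt="Book one"></a>
<a href="https://example.com/item/2"><img src="https://cdn.example.com/2.png"/></a>
<a name="top">No href here</a>
"""


class UrlCollector(HTMLParser):
    """Collect link URLs (a href) and image URLs (img src) as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.link_urls = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Relative addresses are resolved against the page's own URL.
            self.link_urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.image_urls.append(urljoin(self.base_url, attrs["src"]))


def collect_urls(html, base_url):
    """Return (link_urls, image_urls) found in the given HTML."""
    collector = UrlCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.link_urls, collector.image_urls


links, images = collect_urls(SAMPLE_HTML, BASE_URL)
```

Note that an anchor without an href (the third one in the sample) yields no URL, which is one reason to check that the right element is selected before extracting.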
Tips!
Can I just use Octoparse to directly get an image, not its URL, from the web page?
Unfortunately, you can't use Octoparse to extract the image itself. If you want to extract images, you can scrape the URLs of the images with Octoparse first, and then bulk download the images with a "download from URL" tool.
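If you prefer a script to a ready-made "download from URL" tool, the bulk download step can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function names (filename_for, fetch_bytes, download_all), the numbered file-name scheme, and the 30-second timeout are choices made for this example, and the URLs would come from your own Octoparse export.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url, index):
    """Build a local file name from the URL path, prefixed with a counter.

    The counter keeps files apart when two URLs end in the same name.
    """
    name = os.path.basename(urlparse(url).path) or "image"
    return f"{index:03d}_{name}"


def fetch_bytes(url):
    """Download the raw bytes behind a URL."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save every URL into dest_dir and return the list of saved paths.

    `fetch` can be swapped out, for example to add retries or to test offline.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        path = os.path.join(dest_dir, filename_for(url, index))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved


# Example call (commented out because it needs network access and real URLs):
# download_all(["https://example.com/covers/1.jpg"], "images")
```

The sketch does no error handling, so a single failed download stops the run; a real script would likely catch exceptions per URL and continue.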
3) Extract inner/outer HTML
Unlike text and URLs, data shown as icons cannot be extracted directly. When you want to extract visual, non-text content, like a star rating, you have to extract the inner/outer HTML of that content.
Besides icons, you can also scrape hidden text, charts, and graphs from a web page by extracting the HTML of these elements first.
To get the data behind the icons, you then need to apply regular expressions to clean up the extracted HTML.
First, let's see how to select and extract inner/outer HTML with Octoparse.
a. Click on the target data you want
When you click the element you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.
b. Extract inner/outer HTML
Click "Extract inner/outer HTML of the selected" in "Action Tips" to finish creating the selection.
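As an illustration of the regular-expression cleanup mentioned earlier in this section, the following Python sketch pulls a numeric star rating out of extracted outer HTML. The markup is invented for this example (real sites encode ratings in many different ways), and the names RATING_PATTERN and rating_from_html are hypothetical, so treat the pattern as a starting point to adapt rather than a ready-made rule.

```python
import re

# Made-up outer HTML, similar in spirit to what a star-rating icon might look like.
OUTER_HTML_SAMPLES = [
    '<span class="stars rating-45" title="4.5 out of 5 stars"></span>',
    '<span class="stars rating-30" title="3 out of 5 stars"></span>',
    '<span class="stars"></span>',
]

# Capture the number (integer or decimal) in front of "out of 5 stars".
RATING_PATTERN = re.compile(r'title="(\d+(?:\.\d+)?) out of 5 stars"')


def rating_from_html(outer_html):
    """Return the rating as a float, or None if the HTML has no rating."""
    match = RATING_PATTERN.search(outer_html)
    return float(match.group(1)) if match else None


ratings = [rating_from_html(html) for html in OUTER_HTML_SAMPLES]
```

Returning None for HTML without a rating keeps the cleaned column aligned with the scraped rows instead of silently dropping entries.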
Related articles:
Use lists to extract
Extract multiple pages through pagination
Extract behind a login
Extract from source code
Extract page-level data
Extract from a list of URLs
Source: https://www.octoparse.com/tutorial-7/extract-data