An updated version of this tutorial (based on the latest webpage) is now available.
If a task we create with Octoparse doesn’t work as expected, how can we find the bug in the task/workflow?
This article shows how to debug a task/workflow on your own by following the steps below:
Step 1: Manually click through each step in the workflow
Generally speaking, when we click on a step in the workflow, the corresponding process is displayed in the built-in browser and the details of that step are displayed in "Customize Action".
Since Octoparse executes the steps from the top down, we should click through the steps in the same top-down order.
The following example shows how to debug by manually clicking each step.
1. Click "Go To Web Page". The target web page opens in the built-in browser. The Go To Web Page action can also be customized in "Customize Action".
If the web page takes a long time to load, you may need to extend the Timeout.
2. Click "Pagination". The loop item information is shown in "Customize Action".
In some cases, if the page has no ">" (Next) button for pagination, we can try to handle the pagination via XPath instead. Please refer to "How to handle pagination with page numbers?" and "Extract multiple pages through pagination".
3. Click "Click to paginate" to check whether the next-page button is located accurately in the loop item area.
If the action works well, the next page displays in the built-in browser. If not, you may need to modify the XPath for the "Pagination".
For this step, we also need to check whether the website uses AJAX. If so, "AJAX Timeout" needs to be set.
4. Click "Loop Item". The loop item information is shown in "Customize Action". We can check whether the listed items are correct.
5. Click "Click Item". The selected web page opens in the built-in browser.
"Click Item" is quite similar to "Click to paginate". We should check whether it clicks the item selected in "Loop Item" and whether the website uses AJAX.
6. Click "Extract data". We can now view the extracted data.
If data is extracted into the wrong columns or is not extracted at all, the cause may be an inaccurate XPath.
We can solve this problem by referring to the following tutorials.
With the above operations, we can check whether every step of the task works.
· Before clicking the next step, we need to make sure the page is fully loaded, i.e., the loading signal has disappeared.
· When we click a "Click Item" or "Extract data" step in a loop, we should select an item in the loop other than the first one. By doing this, we can see whether the "Click Item" or "Extract data" step works for the other items as well.
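Several of the steps above come down to checking whether an XPath matches the element we expect (the next-page button, or a data field). The sketch below illustrates that idea outside Octoparse, using only Python's standard library. The sample markup, the class names, and the helper count_matches are hypothetical; note also that xml.etree.ElementTree supports only a subset of XPath and requires well-formed markup, so an expression that behaves one way here may behave differently in Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a pagination bar.
SAMPLE_PAGE = """
<div class="pager">
  <a class="page" href="/list?page=1">1</a>
  <a class="page" href="/list?page=2">2</a>
  <a class="next" href="/list?page=2">&gt;</a>
</div>
"""


def count_matches(markup, xpath):
    """Return how many elements in `markup` match the given XPath."""
    root = ET.fromstring(markup.strip())
    return len(root.findall(xpath))


if __name__ == "__main__":
    # An XPath for the next-page button should match exactly one element.
    print(count_matches(SAMPLE_PAGE, ".//a[@class='next']"))
    # An XPath that matches nothing explains a step that never clicks.
    print(count_matches(SAMPLE_PAGE, ".//a[@class='next-page']"))
```

A count of zero suggests the XPath is too strict or misspelled; a count larger than expected suggests it is too loose and may select the wrong element.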
Step 2: Run the task by Local Extraction
After checking the task manually, we can use local extraction to help with debugging. During local extraction, we can assume there is a bug when any of the following situations occurs:
· Getting no data extracted
When the reminder pops up, we should refer to "Why Octoparse stops and no data is extracted?".
· Extracting duplicate data
When the task keeps producing duplicate data, the problem most likely lies in its "Loop Item". We can find some solutions in the following article: "Why does Octoparse only extract the first item and duplicate?"
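To confirm that a run really produced duplicates, we can inspect the exported data outside Octoparse. The following is a minimal sketch that assumes the results were exported as CSV with a header row; the sample data and the helper find_duplicate_rows are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_text):
    """Return {row: count} for data rows that appear more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row) for row in reader)
    return {row: n for row, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical export in which the first item was extracted twice.
    sample = "title,price\nLamp,10\nLamp,10\nDesk,80\n"
    print(find_duplicate_rows(sample))  # {('Lamp', '10'): 2}
```

If the same row repeats once per item on the page, that pattern is consistent with an extraction step that always reads the first item instead of the current loop item.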
· Too much data missing when extracting
Sometimes, if we notice that the data we need is not completely extracted, there may be some minor problems in the task. They may be caused by:
The "Loop Item" does not cover all the items on the list of each listing page.
The web page does not load completely.
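One way to see how much data is missing is to count the rows and the empty cells per column in the exported results. This is a minimal sketch assuming a CSV export with a header row; the sample data and the helper count_empty_cells are hypothetical.

```python
import csv
import io


def count_empty_cells(csv_text):
    """Return (row_count, {column: number_of_empty_cells})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    empty = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in empty:
            # Treat missing and whitespace-only values as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return total, empty


if __name__ == "__main__":
    # Hypothetical export with one missing price and one missing title.
    sample = "title,price\nLamp,10\nDesk,\n,5\n"
    print(count_empty_cells(sample))  # (3, {'title': 1, 'price': 1})
```

A row count lower than the number of items on the listing pages points toward the "Loop Item" not covering every item, while empty cells scattered across otherwise complete rows are more consistent with pages that had not finished loading.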
· Extracting data at a relatively low speed
When using local extraction, if the extraction speed is low, we should first consider the local resources we use, such as the operating system, hardware capacity, IP address, and network bandwidth. We also need to consider the content of the website we want to scrape. For example, a website that contains a lot of images takes more time to load fully, so the task runs at a lower speed.
However, low speed can also be a sign of a bug.
For example, when we forget to set AJAX Timeout for some steps, Octoparse gets stuck at each of those steps and waits for 120 seconds by default.
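To get a feel for how much this default wait can slow a task down, here is a rough back-of-the-envelope estimate. It assumes, as a simplification, that every affected step waits the full timeout on every page; the function name and the example figures are hypothetical.

```python
DEFAULT_WAIT_SECONDS = 120  # default wait stated in this tutorial


def wasted_wait_minutes(pages, ajax_steps_per_page,
                        wait_seconds=DEFAULT_WAIT_SECONDS):
    """Estimate minutes spent waiting if each AJAX step waits in full."""
    return pages * ajax_steps_per_page * wait_seconds / 60


if __name__ == "__main__":
    # 50 pages, one AJAX click per page, default 120-second wait.
    print(wasted_wait_minutes(50, 1))  # 100.0
    # The same task with a 5-second wait per step.
    print(wasted_wait_minutes(50, 1, wait_seconds=5))
```

Under these assumptions, a single unconfigured step on a 50-page task adds roughly 100 minutes of pure waiting, which is why a slow run is worth checking for a missing AJAX Timeout.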
Step 3: Debug in Cloud Extraction (Optional)
When we finish debugging by clicking through the steps manually and running the extraction locally, we can move on to debugging in cloud extraction.
· Data missing when using cloud extraction
If we notice that some data is missing from the results of a cloud extraction, we should go back and review both the task and the website we are extracting from.
We can solve the data-missing problem with the tutorial: How to deal with data missing on cloud extraction?
· Getting no data extracted on the cloud
Sometimes a task runs well locally but extracts no data when we run it in the cloud.
In this case, we can refer to the tutorial: Why does cloud extraction get no data while local extraction works perfectly?