What is Web Harvesting?

Article Details
  • Written By: K.C. Bruning
  • Edited By: John Allen
  • Last Modified Date: 15 September 2016
  • Copyright Protected:
    2003-2016
    Conjecture Corporation

Web harvesting is the process by which specialized software collects data from the Internet and places it into files for an end user. It serves a function similar to, but more advanced than, the tasks a search engine performs. Also known as Web scraping, Web harvesting gives the user automated access to information on the Internet that search engines cannot process, because the software reads through a page's HTML code and pulls out the data embedded in it. The three major types of Web harvesting target Web content, Web structure, and Web usage.

Web content harvesting extracts information both from search result pages and from a deeper search of the content hidden within Web pages. This additional information is often hidden from search engines because it is buried in HTML code. The process scans a page in much the way a human reader would, discarding characters that do not form meaningful phrases in order to extract the useful elements.
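
As a rough illustration of discarding markup to keep the readable text, here is a minimal sketch that uses only Python's standard `html.parser` module. The class name `TextExtractor`, the helper `extract_text`, and the sample page are assumptions made up for this example; real harvesting tools are considerably more elaborate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags plus script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is not code or styling.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text fragments of an HTML string, in page order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page used only for this illustration.
SAMPLE_PAGE = """
<html><head><title>Widget shop</title><style>p { color: red; }</style></head>
<body><h1>Blue widget</h1><p>Price: <b>$4.99</b></p>
<script>var x = 1;</script></body></html>
"""

if __name__ == "__main__":
    for fragment in extract_text(SAMPLE_PAGE):
        print(fragment)
```

The tags, the style rule, and the script body are dropped, and only the title, heading, and price text remain.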

Rather than search for content, Web structure harvesting collects data about the way information is organized in specific areas of the Internet. The data collected provides valuable feedback from which improvements in areas such as information organization and retrieval can be made. It is a way to refine the very structure of the Web.
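
One way to picture structure harvesting is as building a map of which pages link to which. The sketch below assumes the pages have already been downloaded into a dictionary; the names `LinkCollector`, `build_link_graph`, and the `example.com` pages are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Records the absolute target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def build_link_graph(pages):
    """Map each page URL to the sorted, de-duplicated URLs it links to."""
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        collector.close()
        graph[url] = sorted(set(collector.links))
    return graph


# Hypothetical, already-downloaded pages (no network access needed).
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">Home</a> <a href="b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
}

if __name__ == "__main__":
    for page, targets in build_link_graph(PAGES).items():
        print(page, "->", targets)
```

The resulting dictionary describes how the pages are organized rather than what they say, which is the kind of data structure harvesting is after.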

Web usage harvesting tracks general access patterns and customized usage by Web users. Analyzing this usage data helps to clarify how users behave. This is another way to improve the function of the Web, but on an end-user level. It can help designers to improve their Web sites' user interfaces for maximum efficiency. The process also provides insight into what sorts of information users search for and how they go about finding it, which suggests how content should be developed going forward.
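
To make the idea of analyzing access patterns concrete, here is a small sketch that tallies page views and reconstructs each visitor's path through a site. The log format (`<visitor_id> <path>` per line, in time order), the sample data, and the function name `summarize_usage` are all assumptions for this example; real server logs carry more fields and need more careful parsing.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log: "<visitor_id> <path>", one request per line.
SAMPLE_LOG = """\
u1 /home
u1 /search?q=widgets
u1 /products/widget
u2 /home
u2 /products/widget
u3 /home
"""


def summarize_usage(log_text):
    """Return (views per page, ordered list of pages seen by each visitor)."""
    page_views = Counter()
    sessions = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        visitor, path = parts
        page = path.split("?", 1)[0]  # drop the query string
        page_views[page] += 1
        sessions[visitor].append(page)
    return page_views, dict(sessions)


if __name__ == "__main__":
    views, paths = summarize_usage(SAMPLE_LOG)
    print(views.most_common())
    for visitor, pages in paths.items():
        print(visitor, " > ".join(pages))
```

In this sample, one visitor reaches the product page through search while another goes there directly from the home page, the sort of difference a designer might look at when adjusting navigation.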

By collecting text and image data from HTML files, Web harvesting can perform more complex Web crawling that delves deeper into each document. It also analyzes the links that point to a piece of content in order to determine whether that content has importance and relevance across the Internet. This provides a more complete picture of how the information relates to and influences the rest of the Web.
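
A very simple stand-in for this kind of link analysis is to count how many other pages link to each page. The sketch below does only that; the article does not say which measure harvesting tools actually use, so treat inbound-link counting, the function name `inbound_link_counts`, and the sample graph as assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
LINK_GRAPH = {
    "home": ["products", "about"],
    "products": ["home", "pricing"],
    "about": ["home"],
    "pricing": ["home", "products"],
}


def inbound_link_counts(graph):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore links from a page to itself
                counts[target] += 1
    return counts


if __name__ == "__main__":
    for page, count in inbound_link_counts(LINK_GRAPH).most_common():
        print(page, count)
```

Pages with more inbound links are treated here as more prominent; more refined measures would also weigh how prominent the linking pages are themselves.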

Companies use Web harvesting for a wide array of purposes. It can be an effective way to collect data to be analyzed. Some of the more common data sets compiled are information about competitors, lists of different product prices, and financial data. Data may also be collected to analyze customer behavior.
