Web harvesting is the process by which specialized software collects data from the Internet and places it into files for an end user. It serves a function similar to, but more advanced than, the tasks a search engine performs. Also known as Web scraping, Web harvesting gives the user automated access to information on the Internet that search engines typically do not surface, because the harvesting software reads through a page's HTML code to reach the data embedded in it. The three major types of Web harvesting target Web content, Web structure, and Web usage.
Web content harvesting involves extracting information both from search result pages and from a deeper search of the content hidden within Web pages. This additional information is often missed by search engines because it is buried in HTML code. The process scans information much as human eyes would, discarding characters that do not form meaningful phrases in order to extract the useful elements.
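As an illustration of the idea of discarding markup and keeping readable text, here is a minimal sketch using Python's standard `html.parser` module. The class name `TextExtractor`, the function `extract_text`, and the sample page are assumptions made for this example; real harvesting tools are considerably more elaborate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects readable text, skipping the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the readable text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, invented for illustration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Price list</h1><p>Widget: <b>$4.99</b></p>"
    "<script>var x = 1;</script></body></html>"
)

print(extract_text(sample))  # prints: Price list Widget: $4.99
```

The tags, the style rule, and the script are dropped, and only the phrases a human reader would see remain.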
Rather than searching for content, Web structure harvesting collects data about the way information is organized in specific areas of the Internet. The data collected provides feedback that can be used to improve areas such as information organization and retrieval. It is a way to refine the very structure of the Web.
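One simple way to picture structure harvesting is to record which pages link to which. The sketch below, again limited to Python's standard library, builds such a map from HTML that is already in hand. The names `LinkCollector`, `outgoing_links`, and `site_map`, as well as the example.com pages, are hypothetical and serve only to illustrate the idea.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gathers the href targets of anchor tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def outgoing_links(base_url, html):
    """Return the absolute URLs that one page links to, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical pages, invented for illustration (no network access needed).
pages = {
    "https://example.com/": '<a href="/about">About</a> <a href="products.html">Products</a>',
    "https://example.com/about": '<a href="/">Home</a>',
}

# Map each page to the pages it points at.
site_map = {url: outgoing_links(url, html) for url, html in pages.items()}

for url, targets in site_map.items():
    print(url, "->", targets)
```

A map of this kind describes how a site is organized rather than what its pages say, which is the distinction drawn above.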
Web usage harvesting tracks general access patterns and customized usage by Web users. By analyzing Web usage, harvesting can clarify how users behave. This is another way to improve the function of the Web, but at the end-user level. It can help designers make their Web sites' user interfaces more efficient. The process also provides insight into what sorts of information users search for and how they go about finding it, giving an idea of how content should be developed going forward.
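To make the idea of access patterns concrete, the following sketch summarizes a small visit log. The log format (a list of user and path pairs in time order), the sample data, and the function names `page_popularity`, `user_paths`, and `transitions` are all assumptions for this example and do not reflect any particular server's log format.

```python
from collections import Counter, defaultdict

# Hypothetical visit log: (user, path) pairs in the order they occurred.
log = [
    ("alice", "/home"), ("alice", "/search"), ("alice", "/product/42"),
    ("bob", "/home"), ("bob", "/product/42"),
    ("carol", "/home"), ("carol", "/search"),
]


def page_popularity(log):
    """Count how often each page was visited."""
    return Counter(path for _, path in log)


def user_paths(log):
    """Group visits by user, preserving the order of each user's visits."""
    paths = defaultdict(list)
    for user, path in log:
        paths[user].append(path)
    return dict(paths)


def transitions(log):
    """Count how often users moved directly from one page to another."""
    counts = Counter()
    for path in user_paths(log).values():
        counts.update(zip(path, path[1:]))
    return counts


print(page_popularity(log).most_common(1))  # prints: [('/home', 3)]
print(transitions(log).most_common(1))      # prints: [(('/home', '/search'), 2)]
```

Page counts show what users look at, while the transition counts hint at how they move between pages to find it.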
By collecting text and image data from HTML and image files, Web harvesting can perform more complex Web crawling that delves deeper into each document. It also analyzes the links that point to that content in order to determine whether the information has importance and relevance across the Internet. This provides a more complete picture of how the information relates to and influences the rest of the Web.
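A very rough proxy for the link analysis described above is to count how many distinct pages link to each page. The sketch below does only that; it is a simplification chosen for illustration, not the method any specific harvesting tool uses, and the function name `inbound_link_counts` and the sample graph are hypothetical.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Count, for each page, how many other pages link to it.

    link_graph maps a page to the list of pages it links to. Repeated links
    from the same source are counted once, and self-links are ignored.
    """
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical link graph, invented for illustration.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}

counts = inbound_link_counts(graph)
print(counts.most_common(1))  # prints: [('c', 3)]
```

Here page "c" is linked from three other pages, which under this crude measure marks it as the most referenced page in the sample.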
Companies use Web harvesting for a wide array of purposes, most often as a way to collect data for analysis. Some of the more common data sets compiled are information about competitors, lists of product prices, and financial data. Data may also be collected to analyze customer behavior.