Getting Bulk Data Through Google: An empirical study

Storing information in a database is one of the major tasks of an information system, and efficient storage of data is important for future use. Information retrieval is the process of gathering information relevant to input queries from various sources or stored databases, and a search engine plays a central role in it. A web search engine builds an index against which queries are matched, which improves the quality of the information returned. To retrieve information, a search engine comprises several modules, such as a query processor, a searching and matching function, a document processor, and a page-ranking capability. This paper focuses on retrieving web documents for input queries and storing them in a database. A Google search API can be used to fetch the results; the data is analysed by processing it through these modules, and the content, available in different formats, is downloaded.


INTRODUCTION TO INFORMATION RETRIEVAL
Information retrieval is the process of accessing desired information from a storage system. The data can be structured or unstructured, and the retrieval process works for large data sets as well as small ones. It helps to reduce the time required to extract information. A search engine retrieves information for the specified keywords and lists the results. This paper focuses on text data retrieved from Google and the comparison of those documents. Information retrieval is important in many fields, such as business, education, marketing, and research. It surveys sites and creates a database by retrieving the information matching a query [1][3].

LITERATURE REVIEW
Nowadays search engines are very helpful for extracting information from the World Wide Web, which contains billions of web pages that are classified, indexed, and ranked. The journey of search engines began in the 1990s, and search engines today are far more advanced. New search engines have originated from time to time with different functionalities; some of them are still active, while the rest are inactive. Some of them are described here:
1. Archie: the first search engine, based on FTP sites, it stored listings of pages that could be downloaded, but the index was not linked to users because of limited space. It was launched in 1990 [3].
2. Veronica and Jughead: as Archie was not able to connect with the user, Veronica and Jughead were two search programs that worked on the GOPHER system. Gopher, Veronica, and Jughead were all launched in 1991. The Gopher system allowed users to find and distribute data over the web.
3. WWW Excite, Wanderer, Aliweb, and Basic Web Search were search engines launched in 1993 based on bots, but they are now inactive.
4. AltaVista, WebCrawler, Yahoo (1994): AltaVista was the first search engine that allowed users to insert and delete queries within a given time, and it processed natural-language queries with unlimited bandwidth. WebCrawler, also launched in 1994, indexed entire pages. Yahoo came into existence the same year and was capable of increasing the size of the directory of pages searched. All of these search engines are still active today [1].
5. Google, MSN (1996-1998).
DOI: 10.15415/jtmge.2016.72006

INFORMATION RETRIEVAL MODELS
Information retrieval is a method of gathering information related to input queries from various sources or stored databases. It is essentially a process of recovering the information available from various sources into a database. The process starts with a query entered by the user. These queries are not uniquely defined, because several results match a given query during the search. The queries are matched against the database information already stored on the computer, and the generated results are ranked by a page-ranking algorithm according to the relevancy of the information. An information retrieval model provides:
• A representation for input queries
• A representation for text documents
• A framework for relating input queries, documents, and the relationships among them

Boolean Model:
In the Boolean model, documents are sets of keywords and terms, and the queries input by the user are Boolean expressions over keywords, built from the AND, OR, and NOT operators. A document is predicted relevant to an input query expression if it satisfies a condition such as:

((input text OR information) AND retrieval AND NOT theory)

where OR is the union of two sets, AND is the intersection of two sets, and NOT inverts a set. The Boolean model has some disadvantages:
• OR: one matching word is as good as many
• AND: one missing word is as bad as missing all
• Queries are difficult to express, because a keyword can have several meanings.
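As an illustration (not part of the paper's implementation), the Boolean matching described above can be sketched in Java, with each document modeled as the set of terms it contains; the query here is a simplified form of the example expression:

```java
import java.util.Set;

public class BooleanModel {
    // Evaluates the query (text OR information) AND retrieval AND NOT theory
    // against a document modeled as its set of terms: OR matches if either
    // term is present, AND requires the term, NOT excludes documents with it.
    static boolean matches(Set<String> docTerms) {
        return (docTerms.contains("text") || docTerms.contains("information"))
                && docTerms.contains("retrieval")
                && !docTerms.contains("theory");
    }

    public static void main(String[] args) {
        System.out.println(matches(Set.of("information", "retrieval", "models")));
        System.out.println(matches(Set.of("information", "retrieval", "theory")));
    }
}
```

The second document is rejected solely because it contains "theory", which shows the AND/NOT rigidity noted above: one excluded word outweighs any number of matches.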

Vector Space Model:
This is an algebraic model that represents text documents as vectors of index terms. The model is well suited to filtering pages, checking the relevance of pages to the search terms or keywords, and indexing and ranking those pages. This approach gives a vector representation of the documents. If the cosine similarity value is 0, i.e. the angle between the objects is 90 degrees, then the documents do not share any attributes or words. Queries and documents are represented as follows:

query q = (w_{1,q}, w_{2,q}, ..., w_{n,q})
document d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

• A text document contains a list of key terms with their weights.
• Weight: a measure of the importance of a term for the information available in the specific text document.
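A minimal cosine-similarity computation over such term-weight vectors might look as follows in Java (an illustrative sketch; the term weights are assumed to be given, e.g. tf-idf values):

```java
import java.util.Map;

public class CosineSimilarity {
    // Cosine of the angle between a query vector and a document vector,
    // each given as a map from term to weight. A value of 0 means the
    // vectors share no terms; 1 means they have the same direction.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double w = d.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute
            qNorm += e.getValue() * e.getValue();
        }
        for (double w : d.values()) dNorm += w * w;
        if (qNorm == 0 || dNorm == 0) return 0;
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }
}
```

Ranking then amounts to sorting documents by their cosine score against the query vector, which is what makes this model suitable for relevance ranking rather than the yes/no answers of the Boolean model.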

Working of Search Engine
A search engine is designed to retrieve information according to the query or keyword entered by the user; the information retrieved for queries is stored on the WWW. The search engine retrieves a huge list of results that match the input terms, and it updates the information available on its index servers so that the latest information can be retrieved efficiently. The lists of results generated by web servers are of two different types: natural search results, and paid (pay-per-click) results. When the user searches for a term and puts the query into the search box, a large number of results is generated; the results most relevant to the query are filtered and presented to the user. Searching, indexing, and ranking are the major functions a search engine performs to produce results for an input query [1][4].
1. Crawling: this is the method of finding new and updated pages to add to the Google index. The detection and fetching of pages is done by a crawler, or spider; when a large amount of data is searched, the crawling is done by Googlebot. The crawling process determines how many pages are to be fetched from how many sites. It begins with a list of URLs of web pages obtained from previous crawling processes. As the Googlebot crawler explores the websites, it identifies the links on each page and adds them to the set of pages to crawl. The working of the crawling process is shown in fig 1.2.
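The frontier-based loop described above can be sketched as a breadth-first traversal (a simplified illustration that walks an in-memory link graph instead of fetching live pages; Googlebot's real scheduling and politeness policies are far more involved):

```java
import java.util.*;

public class Crawler {
    // Breadth-first crawl starting from a seed URL. `linkGraph` maps each
    // URL to the links found on its page (standing in for a real fetch+parse).
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();        // avoids re-crawling the same URL
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);                      // "fetch" the page
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) frontier.add(link);  // enqueue newly discovered links
            }
        }
        return visited;
    }
}
```

The `maxPages` bound plays the role of the crawl budget that decides how many pages are fetched before the process stops.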
2. Indexing: the web crawler processes each web page to build an enormous directory of words and their locations on each page. All the attributes of the pages are processed. The web crawler can process content of many types, but it is sometimes unable to process dynamic pages and media files [8].
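Such a directory of words is conventionally realized as an inverted index. A minimal sketch in Java, mapping each term to the set of document IDs that contain it (real indexes also record positions, frequencies, and other page attributes):

```java
import java.util.*;

public class InvertedIndex {
    // Maps each term to the sorted set of document IDs that contain it.
    private final Map<String, SortedSet<Integer>> index = new HashMap<>();

    void addDocument(int docId, String text) {
        // Split on non-word characters, lower-case, and record the posting.
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty())
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    Set<Integer> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }
}
```

A query term then resolves to its posting set in one map lookup, which is what makes matching against billions of indexed pages tractable.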

IMPLEMENTATION AND RESULTS
To extract results from the Google search engine, HttpUrlConnection, a user agent, or ApacheHttpClient are the different options for performing the task. A Google search is an HTTP GET request in which the query travels as a parameter of the URL string.
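Building that GET request URL amounts to URL-encoding the search terms and appending them as a query parameter; a small Java sketch (the `num` result-count parameter is an assumption following Google's common convention):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
    // Builds the search URL: the query travels as the q= parameter of a GET request.
    static String build(String query, int numResults) {
        return "https://www.google.com/search?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&num=" + numResults;
    }
}
```

For example, `build("information retrieval", 10)` percent-encodes the space so the query survives as a single parameter value.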
Jsoup is an open-source HTML parser that is able to fetch the results in the form of URLs, as shown in fig 1.4 (fig 1.3 shows the process of indexing and page ranking).
User Agent: the user agent identifies the browser and the relevant details of the computer system, such as the browser version number and the operating system in use. These details are sent to the servers of the pages being visited, and the web server uses this information to provide the content appropriate to the particular browser.
The user can also enter the desired number of results. The program finds the open-source files available on the web and downloads those pages into the database, as shown in fig 1.4; fig 1.5 presents the details of the files downloaded from the extracted URLs. This is achieved with the user agent Mozilla/5.0.
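Downloading an extracted URL while sending the Mozilla/5.0 user agent can be sketched with the standard HttpURLConnection (an illustrative helper, not the paper's exact code; `fileNameFor` is a hypothetical helper for naming the stored file):

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URI;

public class Downloader {
    static final String USER_AGENT = "Mozilla/5.0";

    // Hypothetical helper: derives a local file name from the URL path.
    static String fileNameFor(String url) {
        String path = URI.create(url).getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index.html" : name;
    }

    // Fetches the page, identifying itself as Mozilla/5.0, and saves it to dest.
    static void download(String url, File dest) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        try (InputStream in = conn.getInputStream();
             OutputStream out = new FileOutputStream(dest)) {
            in.transferTo(out);
        }
    }
}
```

Setting the User-Agent header is the step that lets the server return content as it would to a regular browser rather than to an anonymous client.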
The Google search engine uses a different page-ranking algorithm than other search engines. A Google Search API is used to retrieve the results, and an HttpUtility downloader is used to save each open-source URL into a repository; based on that repository, the comparison is carried out between the text documents. Search engines are still lacking in result quality in response to informative queries. The most significant advantage of the Google search engine is perhaps the sheer number of sites it indexes in comparison to the Yahoo search engine.
Gigablast was created to provide search-engine functionality on the least amount of hardware possible at the current state of technology.

CONCLUSION
Many search engines are available, but the Google search engine is a very useful tool in the present era of the internet. The user agent helps to extract the most relevant results based on a query. The Google search API is used to fetch the results generated by the search-engine implementation: it fetches the URL links into a database and then downloads the files linked to those URLs for the text comparison of the documents.