Structure of a Website
Question:
You are required to carry out an investigation on web information retrieval and web data mining. You should present the results of this investigation in the form of a report.
Answer:
Introduction
This paper presents an investigation into web information retrieval and web data mining. The focus is on analyzing how a website is structured. The discussion covers how a user retrieves information on the web and how the individual subpages of a website are linked to each other. If a website has a large number of subpages, it is important that a crawler can find all of them easily and quickly; the homepage should therefore link to the most important subpages. Files such as robots.txt and sitemap.xml also help the crawler complete its job easily and effectively. The paper further discusses how web search engines crawl, index, and rank web content, and explains the application of data mining concepts and techniques to web mining.
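To make the role of robots.txt concrete, the sketch below uses Python's standard `urllib.robotparser` module to read a hypothetical robots.txt (the domain, paths, and sitemap URL are invented for illustration) and check which pages a polite crawler is allowed to fetch:

```python
import urllib.robotparser

# A hypothetical robots.txt as a site owner might publish it: the Sitemap
# line points crawlers at sitemap.xml, and Disallow keeps them out of
# private areas of the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler consults the parser before fetching each URL.
print(parser.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(parser.can_fetch("*", "https://www.example.com/about.html"))         # True
```

In practice a crawler downloads robots.txt from the site itself; parsing an inline string here just keeps the sketch self-contained.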
Structure of a website
Before discussing the structure of a website, it is important to understand what a website is. A website is a collection of web pages and related content, commonly identified by a domain name such as google.com and published on at least one server. A website can be public or private. Publicly accessible websites together constitute the World Wide Web, while a private website cannot be accessed publicly and is reachable only on a private network (Paranjape et al., 2016). For example, the internal website of a company can be accessed only by its employees. Every website is dedicated to a particular purpose or topic, such as education, entertainment, commerce, news, or social networking. Hyperlinks between the web pages help the user navigate the site, which usually starts from a home page.
A website is usually made up of several webpages. These webpages are HTML documents stored in a directory on a domain. There is a difference between a domain and a website, and the two are often confused (Chen, 2018). The domain is the internet address of a website, through which its content can be called up.
HTML is also often mistaken for a programming language, but it is a page description language. Using it, a developer defines which elements are displayed at which point. During the design of a website, a single picture is not used for the entire site; instead, the developer defines in the code how the colors, any videos or pictures, and the fonts are arranged. The arrangement is also determined by the device used to access the website: fonts, colors, videos, and pictures may be laid out differently on a laptop computer than on a smartphone (Akolkar & Awati, 2016). A website designed to adapt to the requirements of a smartphone is said to have a responsive design.
The individual webpages of a website are stored in directories. Each page has its own URL, which can be used to access it directly: when a user enters the URL, he or she lands on the desired page. Search engines also store the URLs of individual pages; in a search snippet, the URL is displayed together with the meta title and meta description as a clickable link.
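The mapping between a URL, the domain, and the directory holding the page can be seen by taking a URL apart with Python's standard `urllib.parse` module (the URL below is a hypothetical example, not a real page):

```python
from urllib.parse import urlparse

# Hypothetical URL of a subpage stored in the "blog" directory of a domain.
url = "https://www.example.com/blog/responsive-design.html"
parts = urlparse(url)

print(parts.scheme)  # https
print(parts.netloc)  # www.example.com  (the domain)
print(parts.path)    # /blog/responsive-design.html  (directory and page)
```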
The structure of a website is very important because it not only helps in optimizing the SEO strategy but also invites the user to navigate the website. A good website structure works as a guide for Google (Hansmann & Stober, 2018) and is important for providing an excellent user experience. There are several types of websites, such as blogs, online shops, forums, and news websites.
Discussion of web search engine
A web search engine, also called an internet search engine, is a software system designed to carry out web searches. It searches the World Wide Web systematically for particular information specified in a written search query. In simple terms, a web search engine is a web-based tool that users employ to find information on the World Wide Web (Van den Bosch et al., 2016). Popular examples are Google and Yahoo. A web search engine uses an automated software application that navigates the web.
A search engine uses a complex mathematical formula to produce search results, and each search engine uses a different formula. The results for a particular query are displayed on the SERP (search engine results page). The algorithms of a search engine take key elements of a web page, such as the page title, keyword density, and content, compute a ranking, and place the result for the query on the results page accordingly. Because the algorithms of every search engine are unique, a website that ranks prominently on Google will not necessarily rank prominently on Yahoo, and vice versa. The ranking of a website is largely determined by the algorithm of the search engine. Search engines keep the details of their algorithms secret (Laura & Me, 2017), and those algorithms constantly go through revision and modification. In short, search engines are answer machines: they discover, understand, and organize the content of the web in order to provide results relevant to the questions users are asking.
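Real ranking formulas are secret and use hundreds of signals, but the idea of combining page elements such as title and keyword density into a score can be illustrated with a deliberately tiny toy model (the scoring weights and the sample pages are invented for illustration, not any engine's actual formula):

```python
def keyword_density(text, keyword):
    """Fraction of words in the text that match the keyword."""
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

def toy_score(page, query):
    """Toy ranking signal: a title match is weighted more heavily
    than keyword density in the body."""
    title_bonus = 1.0 if query.lower() in page["title"].lower() else 0.0
    return title_bonus + keyword_density(page["content"], query)

# Hypothetical pages in a tiny index.
pages = [
    {"title": "Python tutorial", "content": "learn python basics with python examples"},
    {"title": "Cooking at home", "content": "recipes for dinner"},
]

ranked = sorted(pages, key=lambda p: toy_score(p, "python"), reverse=True)
print(ranked[0]["title"])  # Python tutorial
```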
Discussion on how web search engines crawl, index, and rank web content
Search engines crawl, index, and rank web pages for various reasons, but the most important one is to show users what they want to see. A search engine like Google wants to keep its results relevant, and for that purpose it has a well-defined process for identifying the best web pages for a given search query (Kolobov et al., 2019). To provide even better results over time, search engines keep evolving this process. Producing the results of a search engine such as Google involves the following processes.
Crawling
The crawler follows links in order to discover the most significant pages on the web, so that the most relevant results can be provided for users' queries.
Every search engine has crawlers that traverse the World Wide Web to identify and discover the pages that are available and to find the best pages for a query. Crawlers travel by following the links that bind web pages together within a website and across the web; these links create a path along which the crawler can reach the billions of web pages that are available.
When a crawler looks at a web page, it examines the page through its Document Object Model (DOM) in order to see what is on the page. The DOM is the structure built from the page's HTML and JavaScript, and the crawler uses it to find links to other pages (Mahale et al., 2018), which helps the search engine discover new pages on the web. Crawling the entire web every day is an enormous task, so Google spreads its crawl over several weeks.
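The core of this step, walking a page's markup and collecting the links to follow next, can be sketched with Python's standard `html.parser` module (the page source below is hypothetical):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, the way a crawler walks a page
    to discover further pages to visit."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page source a crawler might have fetched.
html = '<html><body><a href="/about.html">About</a> <a href="/blog/">Blog</a></body></html>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['/about.html', '/blog/']
```

A real crawler would then queue these links, fetch each one (respecting robots.txt), and repeat.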
Indexing
Indexing means storing information about all the pages that have been retrieved so that they can be found again later. Indexing is the process of adding information about a web page to the search engine's index, a database of the web pages that have been crawled by the search engine's spider (Pavani & Sajeev, 2017). The index organizes and catalogs the following:
• A map of all pages that each page links to
• Detailed data on the nature of the content as well as topical relevance of each web page
• The clickable text of any links
• Other information about links. For example, where they are located on a web page and if they are ads or not.
The index is the database the search engine consults to retrieve information when a user submits a query. The search engine then uses algorithms to rank the web pages.
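The essential data structure behind this is an inverted index: a map from each word to the pages that contain it, so a query can be answered by a lookup instead of re-reading every page. A minimal sketch with invented page URLs and content:

```python
from collections import defaultdict

# Toy crawled pages (hypothetical URLs and content).
pages = {
    "example.com/python": "python is a programming language",
    "example.com/cooking": "cooking pasta is easy",
    "example.com/snakes": "a python is a snake",
}

# Indexing: map each word to the set of pages that contain it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Retrieval: look the query term up in the index instead of scanning pages.
print(sorted(index["python"]))  # ['example.com/python', 'example.com/snakes']
```

Production indexes also store positions, anchor text, and link data, as the list above describes, but the lookup principle is the same.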
Ranking
Ranking means determining what every page is about and how it should rank for relevant queries. To serve results to its end users, the search engine must perform the following critical steps:
• The search engine should interpret the intent of the query of the user.
• The search engine should identify web pages in the index related to the user query.
• The search engine should rank and return all web pages in order of importance as well as relevance to the user query.
Ranking is one of the major areas where search engine optimization comes in. With effective SEO, the relevance and importance of a web page for related queries can be influenced. The search engine's algorithm assigns importance and relevance to web pages using hundreds of signals (Tang et al., 2017). Thus, a search engine ranks web pages according to their importance and relevance to the user's query, and it keeps changing its algorithm in order to serve users better by improving query results.
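One well-known link-based importance signal is PageRank, which Tang et al. (2017) build on. A minimal power-iteration sketch over an invented three-page link graph (the damping factor 0.85 is the value commonly cited in the PageRank literature):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = rank[p] / len(outgoing)
                for q in outgoing:
                    new[q] += damping * share
            else:
                # Dangling page: spread its rank across all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical site: A and C both link to B, so B should rank highest.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
best = max(ranks, key=ranks.get)
print(best)  # B
```

Real engines combine a signal like this with many content and query-dependent signals; this sketch shows only the link-importance part.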
Explanation of the application of data mining concepts and techniques to web mining
Data mining can be defined as the process of discovering relevant or necessary information in a large set of data. It uses mathematical analysis to derive the trends and patterns that exist in the data (Shu et al., 2017). Such patterns cannot be discovered with traditional data exploration because there is too much data and the relationships are very complex.
The discovered trends and patterns can be captured in a data mining model, which can be applied to particular scenarios such as the following:
Risk and probability: selecting the best group of customers for a mailing, assigning probabilities to diagnoses, determining break-even points.
Finding sequences: identifying and analyzing the sequence of products or services that customers place in a shopping cart, and forecasting the next likely events.
Forecasting: estimating server load time, estimating sales, or predicting server downtime.
Grouping: separating events or customers into clusters of related items, and predicting and analyzing affinities (Ratner, 2017).
Recommendations: analyzing and determining which products or services are likely to be sold together.
Data mining is also known as knowledge extraction, knowledge discovery, information harvesting, or pattern/data analysis. It can be performed on many types of data, such as data warehouses, text databases, spatial and transactional databases, relational databases, streaming and multimedia databases, and web content.
There are several data mining techniques, such as classification, clustering, outlier detection, regression, prediction, sequential patterns, and association rules (Feng & Zhu, 2016).
Classification
The classification technique is used to retrieve relevant and important information about data by assigning the data items to different classes.
Clustering
Clustering is used to identify similar data; with this technique, the differences and similarities between data items can be understood.
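A common clustering algorithm is k-means; a minimal one-dimensional sketch over invented customer spending values shows the assign-then-recenter idea:

```python
def kmeans_1d(values, centers, iterations=20):
    """Minimal 1-D k-means: assign each value to the nearest center,
    then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer spending values with two obvious groups.
centers, clusters = kmeans_1d([10, 12, 11, 90, 95, 92], centers=[0, 100])
print(sorted(round(c) for c in centers))  # [11, 92]
```

The two centers settle near the two natural groups in the data, which is exactly the similarity structure clustering is meant to expose.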
Regression
Regression is a data mining technique used to identify and analyze the relationship between two or more variables. With regression, the likely value of one variable can be estimated from the others.
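The simplest case is fitting a line y = a*x + b to two variables by ordinary least squares; the data below (advertising spend vs. sales) is invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: advertising spend vs. sales, exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```

Once a and b are known, the likely value of y for a new x is simply a*x + b.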
Outlier detection
Outlier detection refers to finding data items in a dataset that do not match the expected behavior or pattern. It is used in a variety of domains, such as fault detection, intrusion detection, and fraud detection.
Association rules
Association rules help in finding associations between two or more items. With this technique, hidden patterns in a dataset can be discovered.
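The classic example is market-basket analysis: counting how often items appear together (support) and how often one item implies another (confidence). A sketch over invented shopping baskets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Support count: how often each pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule bread -> butter: of the baskets containing
# bread, what fraction also contain butter?
bread_count = sum(1 for b in baskets if "bread" in b)
confidence = pair_counts[("bread", "butter")] / bread_count
print(confidence)  # 0.75
```

Algorithms such as Apriori scale this idea to large transaction databases by pruning itemsets whose support is too low.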
Prediction
This data mining technique combines several other techniques, such as sequential patterns, clustering, and classification. Past events are analyzed in the right sequence in order to predict a future event (Pourghasemi et al., 2017).
Sequential pattern
Sequential pattern mining helps in discovering recurring patterns or trends in transaction data over a particular period of time.
Conclusion
Website structure is very important for helping a website rank. An effective structure will not necessarily put a website at the top of the results, but it is necessary for inviting users to navigate the site. The search engine plays an important role in retrieving relevant and important data: it uses complex algorithms that analyze web pages and assign rankings based on their relevance and importance. Data mining is a tool used to retrieve the required data from a large set of data, and several data mining techniques can be used to get the needed data from a database.
References
Akolkar, M. S., & Awati, M. C. (2016). An Approach to Improve Structure of Website for Effective User Navigation.
Chen, M. (2018). Improving website structure through reducing information overload. Decision Support Systems, 110, 84-94.
Feng, Z., & Zhu, Y. (2016). A survey on trajectory data mining: Techniques and applications. IEEE Access, 4, 2056-2067.
Hansmann, U., & Stober, T. (2018). U.S. Patent No. 9,996,511. Washington, DC: U.S. Patent and Trademark Office.
Kolobov, A., Peres, Y., Lubetzky, E., & Horvitz, E. (2019, July). Optimal freshness crawl under politeness constraints. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 495-504).
Laura, L., & Me, G. (2017). Searching the Web for illegal content: the anatomy of a semantic search engine. Soft computing, 21(5), 1245-1252.
Mahale, V. V., Dhande, M. T., & Pandit, A. V. (2018, August). Advanced web crawler for deep web interface using binary vector & page rank. In 2018 2nd International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) (pp. 500-503). IEEE.
Paranjape, A., West, R., Zia, L., & Leskovec, J. (2016, February). Improving website hyperlink structure using server logs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (pp. 615-624).
Pavani, K., & Sajeev, G. P. (2017, September). A novel web crawling method for vertical search engines. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 1488-1493). IEEE.
Pourghasemi, H. R., Yousefi, S., Kornejady, A., & Cerdà, A. (2017). Performance assessment of individual and ensemble data-mining techniques for gully erosion modeling. Science of the Total Environment, 609, 764-775.
Ratner, B. (2017). Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press.
Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22-36.
Tang, Y., Wei, L., Wang, W., & Xuan, P. (2017, October). Towards Intelligent Web Crawling–A Theme Weight and Bayesian Page Rank Based Approach. In International Conference on Web Information Systems Engineering (pp. 192-201). Springer, Cham.
Van den Bosch, A., Bogers, T., & De Kunder, M. (2016). Estimating search engine index size variability: a 9-year longitudinal study. Scientometrics, 107(2), 839-856.