CS 272 Software Development

CS 272-01, CS 272-02 • Fall 2021

Project 4 Search Engine



For this project, you will extend your previous project to create a search engine web interface using embedded Jetty and servlets.

This writeup is for the search engine functionality only. See the general Project 4 Writeup for more details.

The eligibility requirements for this project are the same as the design eligibility requirements for the previous projects. Specifically:

If you are missing a grade you should already have, please reach out to us on the course forums.

The functionality for this project is broken into 2 parts: core functionality and extra features. You must complete the core functionality before attempting extra functionality.

Core Functionality

In addition to maintaining the functionality of the previous project, your code must support the following core features using embedded Jetty and servlets for a total of 30 points:

Points Functionality Description
5 Web Form: Display a web page with a form that includes (at a minimum) a text box where users may enter a multi-word search query and a button to submit that query to the web server.
5 Query Processing: When the web form button is clicked, send the queries to a Jetty servlet and process those queries to match how the data is stored by your inverted index.
10 Partial Search: After processing the queries, the servlet should retrieve the partial search results of those queries from the index generated by the Driver class.
10 Search Results: The servlet should return the partial search results to the client (or web browser) as dynamically generated HTML with sorted (most relevant first) and clickable links.

You cannot earn credit for additional features until the core functionality is working properly.
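As a rough illustration of the Query Processing step, the sketch below normalizes a raw multi-word query into sorted, unique, lowercase words. It assumes the index stores lowercase ASCII words and leaves out any stemming your index may apply; the `QueryCleaner` class and its `clean` method are hypothetical names, not part of any provided code.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; the class and method names are illustrative only. */
class QueryCleaner {
    /**
     * Lowercases the raw text, removes every character that is not an
     * ASCII letter or whitespace, and returns the unique words in
     * sorted order. Stemming is intentionally omitted (assumption).
     */
    static Set<String> clean(String raw) {
        Set<String> words = new TreeSet<>();
        if (raw == null) {
            return words; // a missing form parameter yields no words
        }
        String letters = raw.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{Alpha}\\s]+", "");
        for (String word : letters.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Whatever cleaning you use here should match the cleaning used when the index was built; otherwise queries will not line up with the stored words.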

Additional Functionality

Once the core functionality is complete, you may implement 70 points of additional features. These features are broken into several categories. You may choose any combination of features from these categories.

Have a feature idea? You can propose an extra feature in a public post on the course forums. If approved, the instructor will post the number of points that feature will be worth on the final project.

User Tracking

The following features require your search engine to track user data. There are two implementation options (choose one):

  1. Base Functionality: Implemented by storing data in memory; only supports a single user.

  2. Extra Functionality: Implemented by storing data using session tracking or cookies; supports multiple users.

Ideally, you should use the same implementation option for all features in this subcategory. For example, if you implement search history using sessions, you should also implement visited results using sessions.

The possible features are:

Base Extra Functionality Description
5 10 Search History: Store a history of all search queries. Allow users to view and clear that history.
5 10 Visited Results: Store a history of all visited search results (i.e. results clicked on). Allow users to view and clear that history.
5 10 Favorite Results: Allow users to save favorite search results. Allow users to view and clear those favorites.
5 5 Time Stamps: Add timestamps to each item stored. Implement this for all related features to earn full credit.
5 5 Private Search: Allow users to set an option that turns off all tracking of user data. Implement this for all related features to earn full credit.
5 5 Last Visit Time: Track and display the last time a user visited your search engine. This is NOT the current time that the page was generated!

There are 30 to 45 points possible in this category, depending on whether you implement base or extra functionality.
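One possible shape for a search history store with time stamps and a private search option is sketched below. It is only a sketch: the `SearchHistory` class, its methods, and the `userId` key are hypothetical. In a servlet, that key might come from the session ID or a cookie value.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical per-user history store; all names are illustrative. */
class SearchHistory {
    /** One stored query plus the time it was recorded. */
    static final class Entry {
        private final String query;
        private final Instant when;

        Entry(String query, Instant when) {
            this.query = query;
            this.when = when;
        }

        String query() { return query; }
        Instant when() { return when; }
    }

    /** Maps a user identifier (assumed: session ID) to that user's entries. */
    private final Map<String, List<Entry>> history = new ConcurrentHashMap<>();

    /** Records the query with a time stamp unless private mode is on. */
    void add(String userId, String query, boolean privateMode) {
        if (privateMode) {
            return; // private search: store nothing
        }
        history.computeIfAbsent(userId, id -> new CopyOnWriteArrayList<>())
                .add(new Entry(query, Instant.now()));
    }

    /** Returns an unmodifiable copy of the user's history. */
    List<Entry> view(String userId) {
        return List.copyOf(history.getOrDefault(userId, List.of()));
    }

    /** Clears only the given user's history. */
    void clear(String userId) {
        history.remove(userId);
    }
}
```

Returning a copy from `view` keeps callers from modifying the stored list, which helps preserve encapsulation.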

Metadata

The following features require your search engine to track search metadata (not specific to users). There are two implementation options:

  1. Base Functionality: Track metadata in memory (non-persistent).

  2. Extra Functionality: Track metadata in the on-campus SQL database (persistent).

Ideally, you should use the same implementation option for all features in this subcategory. For example, if you implement page snippets using a database, you should also implement popular queries using a database.

The possible features are:

Base Extra Functionality Description
10 20 Page Snippets: When a web page is crawled, store a short snippet of the page. Display the snippet whenever that page is returned as a result.
10 20 Page Statistics: When a web page is crawled, store the page title (via the <title> tag in HTML), content length (via the Content-Length HTTP header), and timestamp of the crawl. Display these statistics whenever that page is returned as a result.
5 10 Most Visited Results: Track the number of times a page has been visited by any user. Allow users to see the top 5 visited pages.
5 10 Most Searched Queries: Track the number of times a multi-word query has been searched for. Allow users to see the top 5 most popular queries.
5 10 Reset Metadata: Allow users with an administrator password to clear all the metadata stored.

Some features require others to be implemented first. For example, Reset Metadata cannot be implemented until at least one of the other features that stores metadata is implemented.

There are 35 to 70 points possible in this category, depending on whether you implement base or extra functionality.
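For the in-memory (base) option, a counter like the one sketched below could back the Most Searched Queries feature. The `QueryCounter` class and its methods are hypothetical names. The sketch assumes queries are normalized before being recorded, and it breaks ties alphabetically, which is an arbitrary choice.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical thread-safe popularity counter; names are illustrative. */
class QueryCounter {
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    /** Atomically increments the count for the given query. */
    void record(String query) {
        counts.merge(query, 1, Integer::sum);
    }

    /** Returns up to limit queries, most searched first, ties alphabetical. */
    List<String> top(int limit) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Because `merge` on a `ConcurrentHashMap` is atomic, several servlet threads can call `record` at once without extra locking.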

Extendable Functionality

The following features have base functionality that can be extended with additional functionality. The base functionality must be implemented first.

Base Extra Functionality Description
5 10 New Seed (base): Allow a user to specify a new seed URL that should be added to the existing inverted index. If the URL has already been crawled, skip crawling that URL and output a warning to the user. Max Support (extended): In addition to entering a new seed URL, allow the user to also specify a maximum number of pages to crawl. This is the maximum number of new pages to crawl in addition to the pages already crawled. URLs that are already included in the inverted index should be skipped and should not contribute to this maximum count.
5 10 Index Browser (base): Allow users to browse your inverted index as an HTML page with all of the words stored, clickable links to all of the indexed URLs for those words, and the number of positions stored for that word and location (but not list all of the positions). Subindex Browser (extended): Allow users to enter a specific word and display the data stored in your inverted index for that specific word.
5 10 Location Browser (base): Allow users to browse all of the locations and their word counts stored by your inverted index as an HTML page with clickable links to all of the indexed URLs. Partial Location Search (extended): Allow users to browse all of the locations and their word counts for locations that start with the same text. For example, browse all locations that start with “http://www.cs.usfca.edu/~cs272”.

For example, if you implement base functionality for New Seed, you will earn 5 points. If you implement both base and extra functionality for New Seed (including Max Support), you will earn 10 points instead.

There are 15 to 30 points possible in this category, depending on whether you implement base or extra functionality.
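Partial Location Search can take advantage of sorted storage. The sketch below assumes word counts are kept in a `TreeMap` keyed by location; the `LocationCounts` class and its methods are hypothetical names. It is not thread-safe on its own, so a real servlet would need to protect access to it.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical word-count store; not thread-safe by itself. */
class LocationCounts {
    private final TreeMap<String, Integer> counts = new TreeMap<>();

    /** Stores the word count for a location. */
    void put(String location, int wordCount) {
        counts.put(location, wordCount);
    }

    /**
     * Returns a copy of every location that starts with the prefix.
     * Keys sharing a prefix are contiguous in sorted order, so the
     * loop starts at the prefix and stops at the first non-match.
     */
    SortedMap<String, Integer> startingWith(String prefix) {
        SortedMap<String, Integer> matches = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            matches.put(entry.getKey(), entry.getValue());
        }
        return matches;
    }
}
```

Starting at `tailMap(prefix, true)` avoids scanning locations that sort before the prefix, which matters once many pages have been crawled.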

Miscellaneous Functionality

The following miscellaneous features may also be implemented:

Points Functionality Description
5 Graceful Shutdown: Allow an administrator to trigger a graceful shutdown of your search engine without calling System.exit().
5 Search Statistics: Display the total number of results along with the time it took to calculate and fetch those results, and display the score and number of matches per search result listed.
5 Server Statistics: As a footer on every page, display the server uptime (i.e. time since the server was started), total number of words stored, total number of locations stored, and total number of queries conducted. This information can be stored in memory by the server.
5 Quick Search: Add a new button to your search form (in addition to the normal search button) that automatically redirects the user to the first search result instead of listing all of the search results. This is similar to the Google Search “I’m Feeling Lucky” button. Output a warning if there are no search results.
5 Reverse Sort Order: Allow the user to select an option to reverse the sort order of the search results using a checkbox on the search form.
5 Partial/Exact Search Toggle: Allow the user to toggle on/off partial versus exact search using a checkbox on the search form.
5 Web Framework: Design a search engine using any popular CSS/style framework to create a consistent style for all the web pages. For example, consider using Bulma, Bootstrap (Twitter), Pure.css, Material (Google), Semantic UI, and many more.
5 Search Brand: Design a search engine with a distinct brand, logo, and tagline. This includes creating a logo and tagline, and including it on all of the web pages. Do not use unlicensed unattributed media on your website.
5 Light/Dark Mode Toggle: Allow users to toggle between light mode (light colored background with dark text) and dark mode (dark colored background with light text) styles for your website.

There are 45 points possible in this category.
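For the uptime portion of Server Statistics, one approach is to record an `Instant` when the server starts and format the elapsed `Duration` for the footer. The `Uptime` class and the output format below are assumptions chosen for illustration.

```java
import java.time.Duration;

/** Hypothetical footer helper; the output format is an arbitrary choice. */
class Uptime {
    /** Formats a duration as days plus zero-padded hours:minutes:seconds. */
    static String format(Duration uptime) {
        long seconds = uptime.getSeconds();
        return String.format("%d days, %02d:%02d:%02d",
                seconds / 86400,
                (seconds % 86400) / 3600,
                (seconds % 3600) / 60,
                seconds % 60);
    }
}
```

A servlet could call `Uptime.format(Duration.between(started, Instant.now()))`, where `started` is an `Instant` captured once at launch and shared read-only across requests.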

Potential Deductions

There will be a brief review of your code for this project. If you have implemented additional functionality and your implementations have any of the following issues, points may be deducted from your project grade:

Points Deduction Description
-10 Multi-User Support: Deducted if your code does not support multi-user search. Users conducting search simultaneously should see results relevant to their own queries only.
-5 Thread Safety: Deducted if your code is not thread-safe. In-memory data accessed by different threads should be properly protected.
-5 Cross-Site Scripting (XSS) Vulnerabilities: Deducted if your code does not protect against cross-site scripting (XSS) attacks. Escape or sanitize any data from a user (either via the HTTP request or a database) prior to using it on an HTML page.
-5 SQL Injection Vulnerabilities: Deducted if your code does not protect against SQL injection attacks. Use prepared statements where appropriate any time your code accesses a database.
-5 Poor Encapsulation: Deducted if your code breaks encapsulation.
-5 Poor Code Style: Deducted if your code is not professional. Use professional formatting, variable names, Javadoc, exception handling, and address all compiler warnings.

While there are many ways to lose points, the total possible deduction is capped such that no more than 15 points total will be removed from your project grade due to the above issues.
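To make the XSS item concrete, the sketch below escapes the characters that are special in HTML before placing text in a results page. The `ResultRenderer` class and its methods are hypothetical names, and the list of URLs is assumed to be sorted by relevance already. An established escaping library is generally a safer choice than hand-rolled code like this.

```java
import java.util.List;

/** Hypothetical renderer; a maintained escaping library is preferable. */
class ResultRenderer {
    /** Replaces the five characters that are special in HTML text and attributes. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an ordered list of clickable links, keeping the given order. */
    static String toHtml(List<String> urls) {
        StringBuilder html = new StringBuilder("<ol>\n");
        for (String url : urls) {
            String safe = escape(url);
            html.append("  <li><a href=\"").append(safe).append("\">")
                    .append(safe).append("</a></li>\n");
        }
        return html.append("</ol>\n").toString();
    }
}
```

The same `escape` call should wrap anything echoed back to the page, including the query text itself and any values read from a database.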

Extra Credit

You may complete additional functionality as extra credit. There is no cap on how much extra credit you can earn for this specific project; however, the overall project category grade will be capped at 110% at the end of the semester.

For example, suppose you lost 10% because you submitted project functionality late and completed 130% worth of extra credit on the search engine project. Instead of earning 130% – 10% = 120% in the project category, your overall project category grade will be capped to 110% instead. This is a great way to make up missed points from late submissions, as well as boost your score if you struggled in the other grade categories.

Regardless of what you implemented, you will NOT earn points for the search engine core functionality if you are not passing all of the web crawler tests, and will NOT earn points for extra functionality if you have not fully implemented the core functionality!

Your main method must be placed in a class named Driver. The Driver class should accept the following additional command-line arguments:

  • -server port where -server indicates a search engine web server should be launched and the next argument port is the port the web server should use to accept socket connections. Use 8080 as the default value if it is not provided.

    If the -server flag is provided, your code should enable multithreading with the default number of worker threads even if the -threads flag is not provided.

The command-line flag/value pairs may be provided in any order, and the order provided is not the same as the order you should perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).
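A minimal sketch of handling the -server flag, independent of argument order, might look like the following. The `ServerArgs` class is a hypothetical name, and falling back to 8080 when the value is missing or not a number is an assumption about how invalid input should be treated. If you already have an argument parser from earlier projects, extending it is likely the better route.

```java
import java.util.OptionalInt;

/** Hypothetical helper for the -server flag; names are illustrative. */
class ServerArgs {
    static final int DEFAULT_PORT = 8080;

    /**
     * Returns the port to use if -server is present, or an empty value
     * if the flag is absent. A missing or non-numeric value falls back
     * to the default port (assumption).
     */
    static OptionalInt serverPort(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (!args[i].equals("-server")) {
                continue;
            }
            // The next argument is the port only if it is not another flag.
            if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                try {
                    return OptionalInt.of(Integer.parseInt(args[i + 1]));
                } catch (NumberFormatException e) {
                    return OptionalInt.of(DEFAULT_PORT);
                }
            }
            return OptionalInt.of(DEFAULT_PORT);
        }
        return OptionalInt.empty();
    }
}
```

Because the method only reports what was requested, Driver can still build the index first and launch the server last, regardless of where -server appeared.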

Your code should support all of the command-line arguments from the previous project as well.

The output of your inverted index and search results should be the same as in the previous project. As before, you should only generate output files if the necessary flags are provided.

There are no functionality tests for this project. Instead, you will demonstrate your search engine functionality to the instructor during your final code review appointment during finals week.

The following content from this semester may be helpful in completing this project:

  • The ServletBasics and ServletData lecture code illustrates how to use embedded Jetty and servlets to create a basic web interface.

  • The HeaderServer homework assignment illustrates how to use web forms with embedded Jetty and servlets.

  • The Sessions lecture code illustrates how to enable user tracking (optional).

  • JDBC lecture code illustrates how to connect servlets to a SQL database on campus (optional).

It is strongly recommended that you pass all of the homework tests before integrating that homework code into your project.

It is important to develop the project iteratively. Some considerations to make while developing are:

  • Start with using GET requests for basic search functionality.

  • POST requests are usually only useful for user tracking features, and it is possible to implement those features using only GET as well.

  • To ensure multi-user support, avoid static and instance members for storing anything related to search queries and results in your servlets.

  • For visited and favorite results, modify the search result links to direct back to your search engine, so that it may first store that the link was clicked on and then redirect as necessary.

  • For crawl metadata, modify the web crawler to store more information per crawl (instead of just which unique URLs have been crawled).

  • For graceful shutdown, you will need to create a special servlet combined with the ShutdownHandler in Jetty.
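The link rewriting suggested above for visited and favorite results can be sketched as follows. The `/visit` path, the `url` parameter name, and the `VisitLinks` class are all hypothetical; a tracking servlet mapped to that path would record the click and then redirect the browser to the decoded target.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical link helper; the path and parameter name are assumptions. */
class VisitLinks {
    /** Wraps a result URL so the click passes through the search engine first. */
    static String trackedLink(String target) {
        return "/visit?url=" + URLEncoder.encode(target, StandardCharsets.UTF_8);
    }

    /** Reverses the encoding applied by trackedLink to the parameter value. */
    static String decodeTarget(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

Servlet containers normally decode query parameters for you, so `decodeTarget` is mainly useful for testing. Before redirecting, consider checking that the target is a location your index actually knows about, so the servlet cannot be used to redirect visitors to arbitrary sites.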

The important part will be to test your code as you go. Use the JUnit tests provided for previous projects to come up with your own test cases.