Sensitive Information

Do you have any sensitive information on a server you're using to make web pages available? The very nature of web servers makes that information vulnerable to exploitation even if you have taken steps to limit its exposure. If you have a non-web file stored in public mode on a web server, that file may be accessed and copied by internet search engines whether you intended it or not. So: don't store any sensitive information on a web server; don't make files public for sharing; and do be aware that the files on your web server may be copied and retained by Google and others (see below) long after you think the file has been deleted.

Student data is protected by FERPA: only information defined as "directory" information may be released about students, and even that release should be controlled. Information about public employees is, in some circumstances, less restricted, but in all cases information that could lead to identity theft must be protected. See Guidelines for Data Security for more information.

If you don't want information made public, don't put it on the web or on a web server! For more information about web security, see the World Wide Web Security FAQ.

The Problem with Google

Google is a powerful search engine that allows you to find just about anything on the web. The problem is that Google is now a hacker's tool as well. The power of Google's search engine has enabled it to dig up all sorts of information that was never meant to be exposed.

Cached Pages

One particular feature enables Google to display older versions of web pages. As Google's robots crawl the web, they automatically take a "snapshot" of each page and archive it. These pages are "cached" (i.e., copied) on Google's servers, where they remain available to the public long after the information has been removed from your server. If you delete a page or change its permissions so it is no longer accessible, the cached copy may remain available on Google's servers for weeks or months.
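To see whether Google holds a cached copy of one of your pages, search for the page on Google and look for the Cached link in the result, or enter cache: followed by the page's full URL in the search box.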

To prevent search engines such as Google from displaying a cached copy of your page, put the following HTML code in the <HEAD> section:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

Be aware that this only removes the cached link for the page. Google will continue to index the page and display a snippet.
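Google also recognizes meta tags addressed specifically to its own robot. For example, to keep Googlebot alone from caching a page while leaving other search engines unaffected, you could use:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">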

Removing Part of Your Web Site

To indicate that certain directories or files should be excluded from a search engine's index, you can apply the Robots Exclusion Standard. This involves creating a robots.txt file and placing it at the document root of your web server. For example, to exclude all robots from crawling your site, your robots.txt file would consist of these two lines:

User-agent: *
Disallow: /

To exclude only Google's robot from crawling your /test directory, your robots.txt file would look like this:

User-agent: Googlebot
Disallow: /test
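Note that a robot obeys only the most specific User-agent group that matches it, so rules meant for every robot must be repeated in any robot-specific group. As a hypothetical example (the directory names are placeholders), the following robots.txt keeps all robots out of /private and additionally keeps Googlebot out of /test:

User-agent: *
Disallow: /private

User-agent: Googlebot
Disallow: /private
Disallow: /test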

Alternatively, if you want to exclude a particular page from being indexed, or if you lack permission to create files in the document root, you can use a robots meta tag. To exclude all robots from indexing a particular page on your site, put the following HTML code in the <HEAD> section:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
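Here NOINDEX tells the robot not to add the page to its index, and NOFOLLOW tells it not to follow the links on the page to crawl further. The two directives can also be used separately.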

You should be aware that the Robots Exclusion Standard does nothing to prevent a robot from crawling your site. It simply indicates to well-behaved robots (i.e., those programmed to respect your request that a page not be indexed) which pages they should ignore. It is not a method of access control. As the Web Robots FAQ describes it, "think of it as a 'No Entry' sign, not a locked door." In fact, because a robots.txt file publicly lists the paths you want hidden, it may even attract badly behaved robots to them. So the key is NOT to keep sensitive information anywhere near your web server, and never to make sensitive information "public" or shareable.
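If a page must remain on the server but should not be public, use the web server's own access controls rather than the Robots Exclusion Standard. As a minimal sketch, assuming an Apache server configured to allow such overrides, you could password-protect a directory with an .htaccess file like the following (the realm name and password-file path are placeholders):

# Require a valid username and password for everything in this directory
AuthType Basic
AuthName "Restricted Area"
# Password file created with the htpasswd utility (path is a placeholder)
AuthUserFile /usr/local/etc/.htpasswd
Require valid-user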

Google provides a good deal of information about removing content from its index if you need more details.

Last Updated: 8/13/14