Introduction to PRISM, XKeyscore, and other scary NSA programs
Access to the complete phone records of all Americans is only the tip of the iceberg. Details recently revealed by whistleblower Edward Snowden show that a top-secret NSA program named “XKeyscore” allows National Security Agency analysts to search, with no prior authorization, through vast databases containing the emails, online chats, social networking sessions, web searches, and browsing histories of millions of Americans – essentially everything a person does over the Internet. The revelation comes via a secret XKeyscore training document that found its way into the public realm. From the document, we can deduce a bit about how XKeyscore works and the technology behind it.
According to The Guardian:
“Training materials for XKeyscore detail how analysts can use it and other systems to mine enormous agency databases by filling in a simple on-screen form giving only a broad justification for the search. The request is not reviewed by a court or any NSA personnel before it is processed. Analysts can search by name, telephone number, IP address, keywords, the language in which the internet activity was conducted or the type of browser used.”
What types of data are collected?
The key to understanding Digital Network Intelligence (DNI) systems such as XKeyscore (and the other layers of the NSA’s PRISM system – TRAFFICTHIEF, PINWALE, MARINA, etc.) is to first consider the types of content we know the NSA obtains – email, chat logs, HTTP-based web data, etc. While email contents are archived by the host, chat sessions, social networking communications, and website visits may not be archived anywhere. This indicates that the data collection device likely sits on the wire (e.g., the fiber-optic cables that make up the Internet’s backbone), cataloging this information as it crosses, rather than ISPs logging everything and granting the NSA secured access to those private log files.
What type of data is searchable?
Another clue comes from the types of searches and filters that XKeyscore allows. The leaked training documents indicate that the NSA can search by “type of browser” used, language tags, and similar attributes. This shows that metadata is being collected and stored in real time, likely keyed off the user’s IP address and/or MAC address. This again hints at data being sniffed directly off the wires.
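To make this concrete: attributes like browser type and language are sitting in plain sight in every unencrypted HTTP request, in the User-Agent and Accept-Language headers. The sketch below (my own illustration, not anything from the leaked documents; all names are hypothetical) shows how a passive collector might pull those searchable fields out of a captured request and key them by source IP.

```python
# Hypothetical sketch: extracting searchable metadata (browser, language)
# from a captured HTTP request, keyed by source IP address.

def extract_http_metadata(src_ip: str, raw_request: bytes) -> dict:
    """Parse User-Agent and Accept-Language out of a raw HTTP request."""
    headers = {}
    for line in raw_request.decode("latin-1").split("\r\n")[1:]:
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "ip": src_ip,
        "browser": headers.get("user-agent", "unknown"),
        "language": headers.get("accept-language", "unknown"),
    }

raw = (b"GET /search?q=test HTTP/1.1\r\n"
       b"Host: example.com\r\n"
       b"User-Agent: Mozilla/5.0\r\n"
       b"Accept-Language: en-US\r\n\r\n")
record = extract_http_metadata("203.0.113.7", raw)
# record["browser"] == "Mozilla/5.0", record["language"] == "en-US"
```

Records like this, indexed in bulk, would support exactly the kind of “type of browser” and language filtering the training documents describe.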
How fresh is the data?
On the other hand, the easiest way to design such a system would be to require ISPs to log information and provide the NSA with remote (e.g., SFTP) access to it. With such an architecture, the NSA could schedule periodic downloads of the data and import it into XKeyscore (after which the data would pass through various layers that parse, decrypt, mine, and store the resulting datasets). But with this type of architecture, we would not expect the data to be “fresh”. The leaked documents hint that the data is very current, possibly even “real-time” current. And we know from the documents that “events” are monitored and analyzed to fire “triggers” when certain conditions are recognized. That again hints at an application located on the wires, collecting and saving real-time Internet traffic data as it passes by. Once the data has been collected, saved, and cataloged, it can be parsed and indexed by various application layers, allowing quick, real-time searches through the now “cleaned” records.
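The event/trigger behavior described above implies a streaming design rather than batch imports: records flow in, registered conditions are evaluated as each record arrives, and matches fire actions immediately. Here is a minimal sketch of that pattern – purely my own illustration of the architecture, with guessed names, not anything derived from the documents.

```python
# Illustrative event/trigger pipeline: conditions are checked against
# each record as it streams in, and matches fire actions in real time.

from typing import Callable

class TriggerEngine:
    def __init__(self):
        # Each trigger pairs a condition (predicate) with an action.
        self.triggers: list[tuple[Callable[[dict], bool],
                                  Callable[[dict], None]]] = []

    def register(self, condition, action):
        self.triggers.append((condition, action))

    def ingest(self, event: dict):
        # Evaluate every registered condition against the incoming event.
        for condition, action in self.triggers:
            if condition(event):
                action(event)

hits = []
engine = TriggerEngine()
engine.register(lambda e: e.get("proto") == "smtp",
                lambda e: hits.append(e["src"]))
engine.ingest({"proto": "smtp", "src": "198.51.100.2"})
engine.ingest({"proto": "http", "src": "198.51.100.3"})
# hits == ["198.51.100.2"] -- only the matching event fired the trigger
```

The key property is that nothing waits for a nightly download: a trigger fires the moment the matching traffic crosses the collection point, which is consistent with the “freshness” the documents hint at.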
Scraping data off the wires
All indications are that the NSA’s enormous Internet dragnet copies Internet traffic as it enters and leaves a country’s borders. And yes, that includes the United States, and yes, it requires at least some level of cooperation from several of the nation’s largest corporate entities – as evidenced by revelations regarding the business practices of Verizon, Facebook, and Google, and by Mark Klein’s 2006 revelation that AT&T allowed optical splitters to be installed in San Francisco, enabling the copying of all data flowing through AT&T’s network.
In addition, it has long been known that the United States taps directly into the undersea fiber-optic cables that form the backbone of the Internet, and that it has done so since at least the 1970s. In the early 1970s, the U.S. government learned that a heavily protected undersea cable ran parallel to the Kuril Islands off the eastern coast of Russia, providing a vital communications link between two major Soviet naval bases. In response, the National Security Agency launched Operation Ivy Bells, deploying fast-attack submarines and combat divers to drop waterproof recording pods on the lines. Every few weeks, the divers would return to gather the tapes and deliver them to the NSA, which would then binge-listen to their juicy disclosures.
There are more than 500,000 miles of flexible undersea cables (about the size of your average garden hose), with regeneration points placed along the cable route to amplify signals. Today, the U.S. cable-tapping program, known variously by the names OAKSTAR, STORMBREW, BLARNEY, and FAIRVIEW, accesses communications on the fiber cables and infrastructure as the data flows past – most likely tapped at the regeneration points, where the individual fibers are laid out separately, and where cables make landfall (if the host country or operating company grants permission). The NSA calls this tapping mechanism an “intercept probe”.
Another proposed method of intercepting data from undersea fiber-optic lines was explained by PC Pro:
“You can get these little cylindrical devices off eBay for about $1,000. You run the cable around the cylinder, causing a slight bend in cable. It will emit a certain amount of light, one or two decibels. That goes into the receiver and all that data is stolen in one or two decibels of light. Without interrupting transfer flow, you can read everything going on on an optical network. The loss is so small, that anyone who notices it might attribute it to a loose connection somewhere along the line. They wouldn’t even register someone’s tapping into their network.”
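It is worth sanity-checking the quoted figures. Fiber insertion loss in decibels relates input to output power by loss_dB = 10 · log10(P_in / P_out), so the fraction of light diverted by a bend tap follows directly (this worked example is mine, assuming the quote’s “one or two decibels” refers to insertion loss):

```python
# Convert a bend-tap insertion loss in dB to the fraction of optical
# power diverted: loss_dB = 10 * log10(P_in / P_out).

def tapped_fraction(loss_db: float) -> float:
    """Fraction of optical power diverted for a given insertion loss in dB."""
    return 1 - 10 ** (-loss_db / 10)

print(round(tapped_fraction(1.0), 3))  # 0.206 -> about 21% of the light
print(round(tapped_fraction(2.0), 3))  # 0.369 -> about 37% of the light
```

In other words, even a “small” 1 dB dip skims roughly a fifth of the optical power – plenty of signal for a receiver, yet easily written off as a lossy connector by an operator watching link budgets.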
In short, there is little doubt: massive amounts of assumed-to-be-private Internet data are being scraped off the Internet wires all around the planet.
Are ISPs and Corporations contributing to NSA’s data collection efforts?
The question that remains is: where is the data collection service located, and who allowed it to happen? Is the service located inside the ISP’s network (or inside a corporation’s network), with a secured service interface used by the NSA with the ISP’s (or company’s) full cooperation? Is it located outside the ISP’s walls, directly on the wires, much as a telephone wiretap functions? Or is the data collected from private corporations via secured web services acting as connection points? All evidence points to services located outside the ISPs’ walls, with nodes distributed around the world that continually scrape Internet data off the wire and save it for subsequent parsing and storage. The locations of the installed nodes (as illustrated in a map included in the leaked training document) align closely, but not perfectly, with Internet Exchange Points. They do, however, align with many, if not most, of the known locations of undersea cables – especially cables located along the borders of countries with which the United States maintains friendly relations. Again, not in every case, but in the majority of cases, the nodes sit along undersea Internet cables. That they are located along country borders hints at cooperation between the NSA and the countries hosting the collection nodes.
What about data collection nodes located inside the U.S.?
The placement of data collection points along country borders raises a very important question: why are there data collection points located within the United States? If NSA surveillance is intended, and permitted, only for foreign entities, what purpose would collection points within United States borders serve? It can be argued that these points are needed to snag data that might otherwise be missed on traffic routes not covered by the collection points installed in foreign countries. And we can see from the training documentation that even the ones within the United States sit along three major borders (as of 2008: the southern, northeastern, and northwestern borders). These points could be installed as redundant backup nodes, or they could be used to consolidate data collected elsewhere. Most likely, however, they serve the purpose we all suspect – capturing traffic that originates and terminates on American soil.
Tie it all together and you get…
As for the overall operation of the system, my guess is that the nodes are located along undersea cabling (which the NSA has been tapping since the 1970s) *and* at major Internet switching centers, the routing cores of many telecom companies, and major Internet traffic hubs – with the explicit cooperation of trusted commercial companies. And even where unequivocal cooperation is not involved, the NSA most certainly uses publicly available APIs from products such as Google Maps to build a more user-friendly and functional surveillance system for its analysts.
In short, it is one bad-ass data mining system that aggregates and utilizes data from many diverse connection points, with a data processing hub likely located at the NSA’s Utah Data Center that integrates everything from parking receipts and travel itineraries to bookstore purchases. The data is collected primarily along undersea cabling but is aggregated with data collected at Internet Exchange Points and data supplied by private corporations. The collected data is then decrypted (if needed), parsed, and stored in NSA databases, where it is accessed by various web-based front ends that utilize publicly available services (e.g., Google Maps) to enhance the presentation of information to the NSA analyst. Quite a feat – albeit likely illegal, or at least a violation of American citizens’ right to privacy (unless you subscribe to the President’s philosophy: “You can’t have 100 percent security and also then have 100 percent privacy and zero inconvenience.”)
Other Interesting points from the leaked documents:
- Searches can be by email address (termed a “strong” search) or by content (a “soft” search).
- Unfiltered data is held for 3 days and then rolled off, likely due to storage space limitations. Before data is rolled off, it passes through additional application layers where the data is parsed, cataloged, and saved.
- The system is driven by over 500 (as of 2008) Linux servers distributed “around the world”. It can be assumed that some sort of data processing occurs on these servers, which act as the individual collection points, before the data is passed on to other layers of the application system.
- Even encrypted data is stored. The training documents indicate that the encrypted data is then decrypted by PRISM.
- The service specifically allows the extraction of data regarding “exploitable machines” that the system finds – data that can then be used with zero-day exploits to compromise those machines.
- Although not directly derived from the leaked document, the code names for top-secret projects mentioned in it can be searched on social networking sites such as LinkedIn to find other interesting surveillance program codenames that dopey NSA employees post in their public profiles. I kid you not. Here’s one job description: Skilled in the use of several Intelligence tools and resources: ANCHORY, AMHS, NUCLEON, TRAFFICTHIEF, ARCMAP, SIGNAV, COASTLINE, DISHFIRE, FASTSCOPE, OCTAVE/CONTRAOCTAVE, PINWALE, UTT, WEBCANDID, MICHIGAN, PLUS, ASSOCIATION, MAINWAY, FASCIA, OCTSKYWARD, INTELINK, METRICS, BANYAN, MARINA
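The three-day retention window for unfiltered data mentioned above amounts to a rolling buffer: records expire after 72 hours unless a downstream stage has already parsed and cataloged them. A minimal sketch of that behavior, entirely my own illustration with hypothetical names:

```python
# Minimal sketch of a three-day rolling retention buffer: unfiltered
# records roll off after 72 hours; anything worth keeping would be
# parsed and cataloged by other layers before expiry.

from collections import deque

RETENTION_SECONDS = 3 * 24 * 3600  # 259,200 s

class RollingBuffer:
    def __init__(self):
        self.records = deque()  # (timestamp, record), oldest first

    def ingest(self, timestamp: float, record: dict):
        self.records.append((timestamp, record))

    def roll_off(self, now: float) -> list[dict]:
        """Drop (and return) records older than the retention window."""
        expired = []
        while self.records and now - self.records[0][0] > RETENTION_SECONDS:
            _, record = self.records.popleft()
            expired.append(record)  # would be parsed/cataloged before deletion
        return expired

buf = RollingBuffer()
buf.ingest(0, {"id": 1})
buf.ingest(100_000, {"id": 2})
old = buf.roll_off(now=300_000)
# old == [{"id": 1}]: record 1 (age 300,000 s) exceeded the window;
# record 2 (age 200,000 s) is retained.
```

This also explains why storage limits force the trade-off the documents describe: the raw firehose is too large to keep, so only parsed, indexed extracts survive past the window.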
You can read the entire XKeyscore training presentation (PDF) here.