Adventures in the Dark Web of Government Data

Thanks, everyone, for coming. I'm certainly excited to have the opportunity to share with you all some of my adventures and my unreal fascination and passion for public data. So yes, I'm Mark, from New York City originally. Sometimes I go around the city with my laptop and a large antenna and tune in to some of the fun things that can be overheard on the radio spectrum around the city. As mentioned, I do a lot of work with public and government data, and at the company I started, Enigma, we have a big open source search engine called Enigma Public of all of this stuff that we aggregate and bring together.

But I think, to kick things off, it would be helpful to get some clarity on terms: what exactly is government data, and does it really have a dark web? One of the easiest ways to think about this more expanded idea of government data is that it's the thing that's produced every time you come up against regulation in some way. We of course have these sprawling bureaucracies at federal, state, and local levels, and every time you touch them they have a way of kicking off some data exhaust. From a reconnaissance and open-source intelligence perspective, this can be really good.

One of the really interesting maps, I think, at least at the U.S.
federal level, to what's going on from a data collection perspective came out of this thing from 1980 called the Paperwork Reduction Act. Basically, what happened in the '70s is there was a massive proliferation of forms and government information collection instruments of all kinds, and it rose to the level where Congress passed a law. What that law said was, basically, every time the federal government wants to make a new form, they have to themselves fill out a form and register it with the Office of Management and Budget, which is part of the executive branch.

Just to show you here, this is an ordinary tax return, a 1040, and it has this OMB control number on it. This is great: any time you have a US federal government form, it will definitely have an OMB number on it somewhere. When a government agency wants to make a new form, they've got to apply to the OMB, and one of the things they have to do is justify why they need the form and also estimate what are called the burden hours of the form. Right now there are maybe about 10,000 different unique forms registered with the federal government, and what I find extraordinarily remarkable is that, according to the government's own estimates, they require 11.3 billion hours of people's time each year to fill out. So we can certainly extrapolate that there's a lot of information being produced here.

Just to flag this, if it's something that's interesting to you guys and you want to explore further: it's kind of hard to Google for, but it's called the current inventory report. I made a little bit.ly link that'll drop you right into the proper government site, and it is kind of fun, because there is an XML file that has all of this stuff structured in it, and you can go and play with it. And so, just to flesh out
what the real spectrum of information the government is producing looks like: if anyone's come into the U.S. from abroad, you've probably seen this form. It's one of the most filled out, with over 300 million of them a year. Things like the W-2, the tax form for if you're on payroll somewhere: there's a quarter billion of those produced a year. I was surprised to see that these friction ridge cards actually have about 90 million filled out every year, and I suppose it's not all strictly for people being arrested. This one that I just found on Google Images is for someone applying to be a pyrotechnic operator, so I suppose these things are produced in lots of ways.

Those are some of the more common forms, but there is also a really long tail here: everything from the 20 or 30 companies that actually fish off the Alaska coast near Russia, and have a specific form they need to fill out, to the importation of shelled peas from Kenya, and things like the petroleum supply reporting system, which I'm not sure exactly what it is, but it does sound like it could be interesting and juicy in some ways. Once you start to know that there is a form out there (they're of course not all public, but there's a really interesting tranche of them that are), you can start to go out and collect information.

Just as an example, this is what a Federal Election Commission form looks like. I don't know if you can see it, but this is a line item by line item disbursement schedule for all of the things that the Trump campaign spent money on, so we have a hundred and forty dollar Uber credit there. I'll talk a little bit more about this data set later, but the FCC licenses all commercial radios in some way in the country, and you can use that data set to actually find every McDonald's drive-thru in the country and also what frequencies it's linked
to the restaurant with. Certainly, if you've been in elevators, you'll see these little inspection placards. That's run at the state level, but that's data that you can get and use to learn all about what's going on inside of a building. Certainly aircraft registrations are really interesting; they have tail numbers and all sorts of interesting joins you can do with radios. This example is just the deed for the hotel that we're in. When you take it a step further, it's kind of cool, because whenever there are building permit applications filed, for changes of use in the space or for renovation, there are often architectural drawings and things like that, so this is also from the hotel.

The Department of Labor collects a lot of information. This is, I think, from OSHA, the Occupational Safety and Health Administration: a little bit of a sad case of someone who fell down an elevator shaft here, but there's a lot of information produced. H-1B visas: I couldn't really show it very clearly here, but these are the 14 or 15, or whatever it is, H-1B visas that Caesars applied for in '18, and you can see they're mostly tech, programming-looking people. This I just found and thought was kind of funny: it's the 401(k) plan that DEF CON Communications has, for the four people that are enrolled in it.

And this is actually one of my all-time favorite pieces. This is a customs declaration from the 1960s that the Apollo 11 mission filed upon coming back with moon rocks. It's just a lovely artifact of bureaucracy, I think, and it does give us some sense of the kinds of things that do appear hidden away in the state.

I think the takeaway that I want to leave you guys with, from having blown through all of that stuff, is that government bureaucracy can really be your friend. I think there's certainly a key set of sources of government data
that are in our toolkits, be they real estate records or corporate registrations or whatever, but this is a really deep and massive well of resources, and by thinking about what the processes are and how they potentially get reflected in data, you can start to develop all sorts of new avenues for research and exploration.

I have a personal interest in software-defined radios and the EM spectrum, and I was really curious to see how the public data that's available around usage of the electromagnetic spectrum could be used to ask different questions of the world. I'm sure this won't be a surprise to anybody in this room, but of course radio waves are all around us. They're in that really cool spectrum that takes us from all the visible light that we see around us to the FM stations in our car and the Wi-Fi, and all of these things are just waves of different lengths.

Marconi is often credited as being one of the inventors of radio, and it's kind of amazing: in its early days it was, not surprisingly, a terribly unregulated and quite chaotic technology. People were just broadcasting and creating all sorts of interference. And actually a lot of the regulatory regimes that we now have in the U.S.
are said to come as a result of the sinking of the Titanic, in part because the Titanic, being a new ship, did have a radio operator on it and was sending out SOS messages, but the thought was that there was so much interference from the land-based stations that a lot of those messages weren't received. That led eventually, in 1912, to Congress passing what was called the Radio Act, which became the precursor to setting up the FCC regulatory regime that we have today. So we of course now live in a world where there is a lot more attention and regulation around the radio spectrum, and that's actually really cool and exciting when it comes to trying to understand how this spectrum is being used.

I'm just curious, show of hands: has anyone seen this map before? It looks like about maybe 20% of people. I'll keep coming back to this throughout the remainder of the talk, because I think it's a really good touchstone to understand how a lot of these things exist next to each other. I know it's a little difficult to get much detail on the screen, but it goes basically from maybe three kilohertz all the way at the top to 300 gigahertz all the way at the bottom, and each one of those little blocks is basically a reserved set of uses that that bit of the spectrum can be used for. You can see here the FM radio band, of course, 88 megahertz to 108 megahertz roughly, and that's blocked off there. But what's interesting is you can start to see that these things exist next to and alongside other uses of the spectrum. Further down, in the 150s, 160 megahertz range, is where this thing called AIS, which is maritime ship positioning data, is transmitted, and then further down, in the next block, at ten hundred and
ninety megahertz, or just about a gig, that's where all of the aircraft are broadcasting their position data. I just call that out to show how these different protocols and uses of the spectrum do have a kind of continuity to them.

Of course there's a ton of politics and money at stake here. As we know, recently, as analog television has all been shut down, that spectrum is getting sold off, and just last year you had twenty billion dollars being spent, mostly by the big telco companies, to get access to some of that stuff that was freed up. So, needless to say, this is a very high-stakes, if somewhat obscure and invisible, place where data is produced.

In the U.S., of course, the FCC is the main regulatory body here, and they collect a ton of different information and release it in two different ways. The first one, which has the most data, is this thing called the Universal Licensing System. There are maybe fifteen or sixteen different kinds of licenses that wind up getting given out, and each one has a lot of detailed information associated with it. As part of an open data initiative, the FCC has done some work to unify all of that into this database called the License View database. I think it's maybe a hundred columns that are harmonized across all of these things. What's nice is it collects all of these different licenses in one place, and it basically pops out in one CSV file.

This is a bit.ly link to a GitHub repo I made which makes it relatively easy: if you have a Postgres server running, you can run the script and it'll download the most recent version of this database, geocode it, geo-index it, and make it searchable for you. And the cool thing is, once you do that, you can actually start to use this data to ask really targeted and
specific questions about your local environment. So here I just did a search of a one-kilometer radius around the Caesars hotel and said, basically, for all the licenses that have been given out within a kilometer of here, who has the most of them, and what are the rank-ordered counts of how the spectrum is being used?

Probably not super surprisingly, the top three are all like Nextel; this is your cell phone stuff. But then, digging in, it was interesting for me to start to learn where I am and what's going on around here. Perini Building Company is a legitimate construction firm that has no ties with the Mafia, but they have done a lot of the casino construction in Las Vegas, and they're certainly one of the biggest holders around here. Then, drilling down, we of course see that a bunch of the casinos themselves are really big recipients of licenses. I was kind of surprised to see this firm, and they'll come up later in the talk, Recon Robotics, which by their own tagline is the world leader in tactical micro-robot and personal sensor systems. They have a good 32 licenses right in this part of Las Vegas, and that in fact puts them on par with DEF CON, who I was very impressed to see is quite fastidious about making sure that the official FCC licenses are all filled out.

One other thing I'd call out here, which I think is really important to keep in mind when you're working with these government data sets, is that there can often be a lot of confusion and difficulty when it comes to doing entity recognition and resolution. Towards the bottom here we have PHWLV LLC, which I saw and thought, what is that? In fact it's the parent company, a Planet Hollywood holding, for this casino and many others.

So then, now that you can start to identify what's going on around you geographically,
how can you start to use and apply that? It's been quite amazing to see in the last several years how cheap software-defined radios have gotten and how much that's really opened up. For those of you who don't know, for literally 20 bucks you can get a little USB dongle that will let you tune in to a pretty broad spectrum. I think these will go from maybe 50 or 60 megahertz to just over a gigahertz, something like that, and they're really cool and very easy to get started with. This is a program called Gqrx, which is just a really simple tuner, so if you plug in one of these USB dongles and put in a frequency, you can listen to whatever might be coming across it.

What's interesting is we can start to not only look at the clustering of radio licenses around us, but actually dig into them a bit more specifically, and what's really nice about these is you do get some very high-resolution information about how organizations operate and function. This one is for the Caesars hotel; it's one of many that they have. What's interesting is that the person who actually filled out this license, his name is Eric Dominguez, is the VP of facilities and engineering here, and what's also included is his phone number and an email address. And it is his direct line. I called him, so I know it to be true. These things become interesting when you're trying to think about other ways of understanding a target or a place of interest, and finding things that let you have a lot of base knowledge about what's going on.

If anyone's interested, these are a big tranche of the radio frequencies that Caesars Palace itself has licenses for. There are other ones, under other entities that came up in my first search, but they can be ferreted out. And just to remind us
to keep all of this in context, we can see these Caesars Palace radios are in the 450 meg zone, but then just a little bit down the spectrum we've got the radio frequencies being used for the control infrastructure around the water system in Las Vegas. So it's a very rich and crowded space.

But of course this isn't only limited to these sorts of things. NOAA-19 is a weather satellite that's flying around overhead, and it operates in the 137 megahertz range. A friend of mine in New York actually built an antenna and a GCal reminder, so that whenever this weather satellite is over the eastern seaboard he can bring this thing outside and actually download the images, because of course, with these satellites, the images are coming down unencrypted and are there for the gathering. That's the URL for it, if anyone's interested.

I was also very curious to see in what ways different kinds of public data could start to get joined with what we know is available on the radio spectrum, in order to do things like maybe look inside of a cargo ship. Of course, today ships are really diverse radio stations in and of themselves. You can see here you have GPS antennas and maybe satellite TV and ham radio antennas, but importantly, up here on the top left is an AIS antenna. AIS stands for Automatic Identification System, and it's basically a radio protocol that is used for navigation and safety. Whenever a ship is under way, it broadcasts some information on this channel, and it all basically lives around, I guess, 160, 162 megahertz; there are two different channels that it goes on. What's interesting is, if you have line of sight or a decent antenna, you can actually, using one of these $20 dongles as an example, receive those AIS messages that the ship is sending off. And so here you can see in this
text box, or whatever it is, what the raw demodulated packets look like. There's a great Python library called libais, and there are many other ones, where people have taken the spec and done all the decoding. The data you're getting when you're listening to these ships basically breaks down to what you're seeing here. This tells you things like the position and the heading and rate of turn and things like that, but importantly it also has this thing called an MMSI. MMSI stands for Maritime Mobile Service Identity; it's basically like the cell phone number of the ship, and you can use that to then join with a second-order piece of government data. Here I wrote an API, which is all linked in that repo that I showed earlier, to connect to the International Telecommunication Union, take that MMSI identifier of the ship, and turn it into the vessel name and some other information about the ship itself.

Once you have these few pieces, you can get to the place where you can actually look inside of a ship. The way that you do that, the conceit here, is by taking bills of lading data, which often get filed before a ship hits the port and which explain, basically for the purposes of customs taxation, everything that's inside of the ship. That data is made available in a very crazy way: the only way that anyone can get access to it is by going to the Customs and Border Protection office in Washington, DC, giving them a $100 certified check, and getting a CD in return. But through Enigma Public we actually gather all of that; it's free, with an API on it, and so you're able to stack all of these things together.

I'm just watching the time. Okay, so that's one example. Another one I'll just quickly talk about is using ADS-B data, which is very similar to AIS but it's for aircraft, and
there's a really interesting piece of work that was done by BuzzFeed, specifically around looking for the extent to which governments were using Stingray devices, which are often put in aircraft and flown in circles when they're going after a target. Stingrays, of course, are ways to track and intercept very specific cell phones. What they did that was really smart is they were able to take all of the ADS-B flight data (there are companies like FlightAware and others that aggregate it for the whole US) and apply some basic analytics to it, to look for all of the flight patterns over cities where planes were just flying in circles a lot. Based on that, they were able to identify not only airplanes that were very clearly registered to Homeland Security or to a police department, but in addition all of these new companies that were shell companies being used by the government, which they were able to back into once they knew that those companies were potentially of interest because of these unusual flight patterns.

I think when we think about all of the different radio devices that surround us all the time, there are a lot of different opportunities and examples of taking this contextual public data and applying it to those devices.

Just in closing, since we're coming up on time, I want to tell you about another investigation that I did, around trying to understand the surveillance infrastructure along the US-Mexico border. What you're looking at here is a slightly interpolated map of all the radio licenses that are within 10 kilometers of the US-Mexico border. When I was looking at them, you see these normal dispersion patterns around cities, where of course the radio towers and uses are all over the place, but what I was very
interested in was seeing, out in some of these more remote desert frontier places, these very regularly spaced towers that were being put up along the border. This one in particular was put up by a company called IMSAR, and so I started looking: what does IMSAR do? Well, they make the ground radar packages that go on Predator drones and other things like that. So I thought it could be interesting to try to dig in and get a sense of who and what else is happening along the border.

This is just a count of all of these entities that are showing up doing experimental work specifically along the border. I'll just call out that Recon Robotics, the company I had mentioned earlier, which is also doing a lot of work around this hotel, shows up here too. But then I actually wanted to look at all these companies, and basically found, and it's not so surprising, that the vast majority of them are defense contractors of different stripes. And so, starting to go through and look at who these companies are and what they're doing, I stumbled upon all of this really fascinating technology, I suppose. TCOM makes these aerostat blimps that are used as surveillance platforms. Leonardo DRS is an Italian defense contractor, and they purport to have the most widely used ground surveillance radar, and you see a lot of these interesting packages. Elta is an Israeli defense company that does a lot of border security work and is also working there, as is Elbit Systems.

What's really interesting is you can again pivot from these very specific licenses, or these aggregates of licenses, to then go and look at where the sites are and where these sorts of things are happening. So it was kind of incredible for me to then actually be able to go over to
Google Maps, punch these things up, and start to see all of the sites where these bits of exploration and prototyping of this virtual fence are starting to happen. Just as a last piece of context, a bunch of these were part of an older program that Boeing had, which wound up being a massive disaster. They were supposed to be able to cover the entire border for 7 billion dollars, but wound up spending a billion dollars to only do 50 miles, and the thing didn't even work.

The thing that I'll leave you with, and that hopefully came across in the talk and through these examples of what's possible with data more generally, is to really think about not only where these deeper, perhaps unseen, bits of data are, but how they can be put together to tell us broader stories. So anyway, thank you very much. [Applause]
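
Editorial sketch, not part of the talk: the radius search described earlier (all licenses within a kilometer of a point, rank-ordered by who holds the most) can be approximated in a few lines of Python. This is only an illustration under stated assumptions. The record fields `licensee`, `lat`, and `lon` and the sample rows are hypothetical, not the actual License View column names or data, and the speaker's own repo does this inside a geo-indexed Postgres database rather than in plain Python.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def licensee_counts(records, lat, lon, radius_km=1.0):
    """Count licenses per licensee within radius_km of (lat, lon), largest first."""
    nearby = Counter(
        r["licensee"]
        for r in records
        if haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km
    )
    return nearby.most_common()


# Made-up sample rows; field names are hypothetical, not real License View columns.
records = [
    {"licensee": "EXAMPLE CASINO LLC", "lat": 36.1162, "lon": -115.1745},
    {"licensee": "EXAMPLE CASINO LLC", "lat": 36.1170, "lon": -115.1750},
    {"licensee": "EXAMPLE CARRIER INC", "lat": 36.1165, "lon": -115.1740},
    {"licensee": "FAR AWAY RADIO CO", "lat": 36.3000, "lon": -115.3000},
]

for name, count in licensee_counts(records, 36.1162, -115.1745):
    print(count, name)
```

A linear scan like this is fine for a few thousand rows; for the full nationwide file, a spatial index in the database is the more practical choice.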