Last year, Rapid7 Labs launched the Open Data Portal on our Insight platform, putting our planetary-scale internet telemetry data into the hands of data scientists, threat intelligence analysts, enterprise teams, and individual security researchers, all for free. All you need to do is request access (we do some light vetting in an effort to ensure the data goes into the hands of defenders, not attackers), and once you gain access to the platform, you can search for and select from our wide array of datasets.
Interactive use is all well and good, but we also provide API access that makes it possible to set up automated operational or data science workflows. We recently published an R package, `ropendata`, on CRAN for accessing the Rapid7 Open Data API.
Let's take a look at how you can use `ropendata` in R to search for available studies, download datasets, and explore the data.
First, you'll need to install the package, which is as simple as:
install.packages("ropendata")
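One setup note: the API calls below are authenticated. `ropendata` reads your Open Data API key from an environment variable (`RAPID7_OPENDATA_API_KEY`, if memory serves; treat that name as an assumption and confirm it in the package documentation), so the setup might look like:
# env var name is my recollection, not gospel; check the ropendata docs to confirm
Sys.setenv(RAPID7_OPENDATA_API_KEY = "your-api-key-here")
# or, better, put it in ~/.Renviron so it persists across sessions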
Now, we grab the current list of studies and take a look at them:
library(ropendata)
library(tidyverse)
studies <- list_studies()
glimpse(studies)
## Observations: 13
## Variables: 15
## $ uniqid <chr> "sonar.ssl", "sonar.fdns_v2", "sonar.cio", "sona…
## $ name <chr> "SSL Certificates", "Forward DNS (FDNS)", "Criti…
## $ short_desc <chr> "X.509 certificate metadata observed when commun…
## $ long_desc <chr> "The dataset contains a collection of metadata r…
## $ study_url <chr> "https://github.com/rapid7/sonar/wiki/SSL-Certif…
## $ study_name <chr> "Project Sonar: IPv4 SSL Certificates", "Forward…
## $ study_venue <chr> "Project Sonar", "Project Sonar", "RSA Security …
## $ study_bibtext <chr> "", "", "", "", "", "", "", "", "", "", "", "", …
## $ contact_name <chr> "Rapid7 Labs", "Rapid7 Labs", "Rapid7 Labs", "Ra…
## $ contact_email <chr> "research@rapid7.com", "research@rapid7.com", "r…
## $ organization_name <chr> "Rapid7", "Rapid7", "Rapid7", "Rapid7", "Rapid7"…
## $ organization_website <chr> "http://www.rapid7.com", "http://www.rapid7.com/…
## $ created_at <chr> "2018-06-07", "2018-06-20", "2018-05-15", "2018-…
## $ updated_at <chr> "2019-02-09", "2019-02-09", "2013-04-01", "2018-…
## $ sonarfile_set <list> [<"20190209/2019-02-09-1549672918-https_get_208…
If you've ever perused the Open Data portal, the metadata elements will look familiar. Even if all of this is brand-new territory, you can see that you have access to the study name, description, and the timestamps for when each study was created and last updated. Let's take a look at the main study categories:
select(studies, name, uniqid) %>%
  arrange(name) %>%
  print(n = 20)
## # A tibble: 13 x 2
## name uniqid
## <chr> <chr>
## 1 Critical.IO Service Fingerprints sonar.cio
## 2 Forward DNS (FDNS) sonar.fdns_v2
## 3 Forward DNS (FDNS) -- ANY 2014-2017 sonar.fdns
## 4 HTTP GET Responses sonar.http
## 5 HTTPS GET Responses sonar.https
## 6 More SSL Certificates (non-443) sonar.moressl
## 7 National Exposure Scans sonar.national_exposure
## 8 Rapid7 Heisenberg Cloud Honeypot cowrie Logs heisenberg.cowrie
## 9 Reverse DNS (RDNS) sonar.rdns_v2
## 10 Reverse DNS (RDNS) -- 2013-2017 sonar.rdns
## 11 SSL Certificates sonar.ssl
## 12 TCP Scans sonar.tcp
## 13 UDP Scans sonar.udp
For this introductory post, we're going to use one of our smaller datasets, which should make it easier to work with the data, especially for those new to internet scan data and security data analysis in general.
Let's see what Rapid7 Labs has been doing in the UDP space recently:
filter(studies, uniqid == "sonar.udp") %>%
  pull(sonarfile_set) %>%
  flatten_chr() %>%
  head(10)
## [1] "2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz"
## [2] "2019-02-04-1549300200-udp_coap_5683.csv.gz"
## [3] "2019-02-04-1549296290-udp_ripv1_520.csv.gz"
## [4] "2019-02-04-1549292633-udp_chargen_19.csv.gz"
## [5] "2019-02-04-1549289039-udp_qotd_17.csv.gz"
## [6] "2019-02-04-1549285686-udp_dns_53.csv.gz"
## [7] "2019-02-04-1549284002-udp_wdbrpc_17185.csv.gz"
## [8] "2019-02-04-1549281938-udp_mssql_1434.csv.gz"
## [9] "2019-02-04-1549281910-udp_bacnet_rpm_47808.csv.gz"
## [10] "2019-02-04-1549271093-udp_upnp_1900.csv.gz"
Ah, that Ubiquiti study was pretty fun and informative, and the blog post we wrote about it garnered a great deal of attention. Let's see how big it is:
get_file_details(
  study_name = "sonar.udp",
  file_name = "2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz"
) %>%
  glimpse()
## Observations: 1
## Variables: 4
## $ name <chr> "2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.g…
## $ fingerprint <chr> "1669feb358ef7bc13fb28915c95b8a315770ed67"
## $ size <int> 39740649
## $ updated_at <chr> "2019-02-04"
Sweet! It's only around 38 MB. The `get_file_details()` function has an `include_download_link` parameter that defaults to `FALSE`, since every time you generate a download link, it goes against your download credits. Those credits exist primarily to prevent abuse from errant automation scripts, and the limits are fairly high, so it's unlikely you'll run into any issues. So, we'll re-issue the call for details, include the link, and download the study data:
get_file_details(
  study_name = "sonar.udp",
  file_name = "2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz",
  include_download_link = TRUE
) -> ubi_dl

download.file(ubi_dl$url[1], "~/Data/2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz")
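If you'd like a sanity check on the download, the `fingerprint` field from the details call looks like a SHA-1 digest (an inference from its 40-hex-character length, not something confirmed above), so a hedged verification with the digest package might look like:
library(digest)

# compare the local file's SHA-1 to the fingerprint from get_file_details();
# treating the fingerprint as SHA-1 is an assumption based on its length
local_sha1 <- digest(
  "~/Data/2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz",
  algo = "sha1", file = TRUE
)
identical(local_sha1, ubi_dl$fingerprint[1])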
Since it's a CSV, there's nothing super-special we need to do to read it into R. We're using `readr::read_csv()` here, but you may want to give `data.table::fread()` a try as well, since this is going to end up being a structure with about half a million rows, including payload data, and `fread()` is super fast.
read_csv(
  file = "~/Data/2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz",
  col_types = "dcdcdddc"
) -> ubi_df
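For reference, the `data.table` route mentioned a moment ago might look like this sketch (it assumes `data.table` is installed, along with `R.utils`, which `fread()` uses to decompress gzipped input):
# an alternative to the read_csv() call above; note that fread() guesses
# column types rather than taking the compact col_types spec
library(data.table)

fread("~/Data/2019-02-04-1549303426-udp_ubiquiti_discovery_10001.csv.gz") %>%
  as_tibble() -> ubi_df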
select(ubi_df, -daddr)
## # A tibble: 503,997 x 7
## timestamp_ts saddr sport dport ipid ttl data
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1549303435 177.38.2… 10001 10001 0 51 0100009302000a002722bccf9db12…
## 2 1549303435 176.122.… 10001 10001 25312 45 0100009902000a0027222c17d4b07…
## 3 1549303435 177.44.1… 10001 10001 0 47 0100009302000a0000221ae991b12…
## 4 1549303435 187.121.… 10001 10001 0 49 0100009302000a0027220ee9a0bb7…
## 5 1549303435 211.228.… 10001 10001 0 49 010000000000001605040001
## 6 1549303435 138.117.… 10001 10001 2 47 0100007402000a0027225874dd8a7…
## 7 1549303435 195.116.… 10001 10001 0 45 0100009802000a802aa82679f2c37…
## 8 1549303435 191.37.1… 10001 10001 0 51 0100009302000a44d9e77e44b5bf2…
## 9 1549303435 198.204.… 10001 10001 574 51 010000b202000a0027225fabb3c0a…
## 10 1549303435 221.157.… 10001 10001 0 48 010000000000001605040001
## # … with 503,987 more rows
Note that I redacted the IP addresses (the `daddr` field) solely to make it a bit harder for attackers or infosec pranksters to poke at those nodes.
Along with the redacted `daddr` field, the other salient fields for analysis are `saddr`, which identifies the Sonar study node that performed the probe (we publish these so you can avoid alerting on them and let our scans work, helping us keep the internet safe); `dport`, the port we scanned; and `data`, which holds the response payload from the UDP probe.
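Before going further, a couple of quick sanity checks on those fields are a good habit with scan data; a short sketch using the column names above:
# every probe in this study should have targeted UDP/10001
distinct(ubi_df, dport)

# and saddr should collapse to the small set of published Sonar study nodes
count(ubi_df, saddr, sort = TRUE)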
Let's first take a look at where these Ubiquiti nodes are. To do that, we'll use the `rgeolocate` package to geolocate them using the MaxMind free databases:
library(rgeolocate)
bind_cols(
  ubi_df,
  maxmind(
    ips = ubi_df$daddr,
    file = "/data/maxmind/prod/GeoLite2-City_20190205/GeoLite2-City.mmdb",
    fields = c("country_code", "country_name")
  )
) -> ubi_df

count(ubi_df, country_name, sort = TRUE) %>%
  mutate(pct = n / sum(n))
| country_name | n | pct |
|---|---:|---:|
| United Kingdom | 265,322 | 52.64% |
| United States | 238,675 | 47.36% |
Note that the results are a bit less granular with the free datasets than they are with the paid ones we use in our internal data pipelines.
Now, if you try to decode that hex-encoded data, you'll soon find that it's unreadable raw Ubiquiti Discovery Protocol binary data and fairly unusable in its current form. However, R folks are in luck, as I've [written a handy decoder for it](https://git.sr.ht/~hrbrmstr/udpprobe) that you can use if you install another package:
devtools::install_git("https://git.sr.ht/~hrbrmstr/udpprobe")
library(udpprobe)
As noted, the `data` column is the hex-encoded version of the response payload, which means every two characters represent one byte value. We'll need to get this into an R raw vector format so we can decode it. While we could do this in R, a small C++ helper function will speed things up dramatically:
library(Rcpp)
cppFunction(depends = "BH", '
List dehexify(StringVector input) {

  List out(input.size()); // make room for our return value

  for (unsigned int i=0; i<input.size(); i++) { // iterate over the input
    if (StringVector::is_na(input[i]) || (input[i].size() == 0)) {
      out[i] = StringVector::create(NA_STRING); // bad input
    } else if (input[i].size() % 2 == 0) { // likely to be ok input
      RawVector tmp(input[i].size() / 2); // only need half the space
      std::string h = boost::algorithm::unhex(Rcpp::as<std::string>(input[i])); // do the work
      std::copy(h.begin(), h.end(), tmp.begin()); // copy it to our raw vector
      out[i] = tmp; // save it to the List
    } else {
      out[i] = StringVector::create(NA_STRING); // bad input
    }
  }

  return(out);

}
', includes = c('#include <boost/algorithm/hex.hpp>'))
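For comparison, the same conversion can be done in a few lines of base R; here's a minimal sketch (the `hex_to_raw()` helper is mine, not part of any package, and it will be far slower than the C++ version across half a million payloads):
# split the hex string into two-character chunks, then convert each to a raw byte
hex_to_raw <- function(x) {
  as.raw(strtoi(substring(x, seq(1, nchar(x), by = 2), seq(2, nchar(x), by = 2)), base = 16L))
}

hex_to_raw("010000000000001605040001")  # one of the short payloads printed above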
Let's test it out:
parse_ubnt_discovery_response(
  unlist(dehexify(ubi_df[["data"]][[1]]))
)
## [Model: AG5-HP; Firmware: XM.ar7240.v5.6.3.28591.151130.1749; Uptime: 0.3 (hrs)]
Looking good! Let's decode all of them:
# infix helper for assigning a default value 'b' in the event the length of 'a' is 0
`%l0%` <- function(a, b) if (length(a)) a else b
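# a quick illustration of what %l0% does:
#   character(0) %l0% NA_character_   returns NA (zero-length input falls back to the default)
#   "AG5-HP"     %l0% NA_character_   returns "AG5-HP" (non-empty input passes through)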
ubi_df %>%
  # scanning the internet is dark and full of terrors, and some of these are
  # dead responses despite stage-2 processing
  filter(!is.na(data)) %>%
  # turn it into something we can use
  mutate(decoded = dehexify(data)) %>%
  # this takes a bit, since the parser was originally meant just to show how to
  # work with binary data in R directly and is not optimized for production use
  mutate(decoded = map(decoded, parse_ubnt_discovery_response)) %>%
  # extract some useful elements; note that we still need to be careful
  # to ignore fields that are potentially malformed or missing; again, scanning
  # the internet is fraught with peril, especially when it comes to UDP
  mutate(
    name = map_chr(decoded, ~.x$name %l0% NA_character_),
    firmware = map_chr(decoded, ~.x$firmware %l0% NA_character_),
    model = map_chr(decoded, ~.x$model_short %l0% .x$model_long %l0% NA_character_)
  ) %>%
  select(name, firmware, model) %>%
  filter(!is.na(firmware)) -> device_info

print(device_info)
## # A tibble: 483,281 x 3
## name firmware model
## <chr> <chr> <chr>
## 1 bjs.erenildo XM.ar7240.v5.6.3.28591.151130.17… AG5-HP
## 2 HACKED-ROUTER-HELP-SOS-HAD-DEFAULT-… XM.ar7240.v5.3.5.11245.111219.20… LM5
## 3 85171 Sandra Mara XM.ar7240.v6.0-beta8.28865.16030… N5N
## 4 Elcio Donizette Vieira XM.ar7240.v5.6.3.28591.151130.17… AG5-HP
## 5 ag-6672 XM.ar7240.v5.3.5.11245.111219.20… AG5
## 6 kazimierzow126 XW.ar934x.v5.6.5.29033.160515.21… AG5-HP
## 7 ZANETE CARMINATI XW.ar934x.v5.6.2.27929.150716.11… AG5-HP
## 8 cpe-hannah@digitalpath.net XM.ar7240.v5.6.dpn.5014.160726.1… NB5
## 9 LOCO Kwiatkowska M XS5.ar2313.v4.0.4.5074.150724.13… LC5
## 10 UBNT-2155 XW.ar934x.v6.0.30097.161219.1705 P5B-3…
## # … with 483,271 more rows
We can also take a look at some of the extracted data. First, let's see the top 20 Ubiquiti models using the raw model name response:
count(device_info, model, sort = TRUE) %>%
  mutate(pct = n / sum(n)) %>%
  slice(1:20)
| model | n | pct |
|---|---:|---:|
| AG5-HP | 121,308 | 25.1% |
| LM5 | 88,888 | 18.4% |
| LB5 | 42,156 | 8.7% |
| N5N | 33,509 | 6.9% |
| P5B-400 | 18,543 | 3.8% |
| P5B-300 | 18,014 | 3.7% |
| NB5 | 17,245 | 3.6% |
| N5B-16 | 16,189 | 3.3% |
| NS5 | 11,862 | 2.5% |
| LC5 | 11,441 | 2.4% |
| WOM5AMiMo | 11,429 | 2.4% |
| N2N | 8,886 | 1.8% |
| LAP | 8,822 | 1.8% |
| LAP-HP | 6,059 | 1.3% |
| BS2 | 5,725 | 1.2% |
| ERLite-3 | 4,972 | 1.0% |
| LM2 | 4,698 | 1.0% |
| AG5 | 3,778 | 0.8% |
| NS2 | 3,649 | 0.8% |
| ER-X | 3,057 | 0.6% |
We can also see if any have (theoretically) been "hacked":
filter(device_info, str_detect(name, "HACKED")) %>%
count(name, sort=TRUE)
| name | n |
|---|---:|
| HACKED-ROUTER-HELP-SOS-HAD-DUPE-PASSWORD | 8,813 |
| HACKED-ROUTER-HELP-SOS-WAS-MFWORM-INFECTED | 3,852 |
| HACKED-ROUTER-HELP-SOS-DEFAULT-PASSWORD | 1,616 |
| HACKED-ROUTER-HELP-SOS-VULN-EDB-39701 | 1,135 |
| HACKED-ROUTER-HELP-SOS-HAD-DEFAULT-PASSWORD | 1,047 |
| HACKED-ROUTER-HELP-SOS-WEAK-PASSWORD | 110 |
| HACKED-ROUTER-HELP-SOS-CLONEPW-LEAKED-BY-MFW | 33 |
| HACKED | 1 |
| HACKED ROUTER | 1 |
| HACKED-ROUTER | 1 |
| YOU HAVE BEEN HACKED | 1 |
Yikes! If you read the aforementioned blog post, these numbers may make it look like things are actually getting better. While that is a possibility, experience has shown us that this could just be standard scan variance due to routing conditions and device reachability issues.
Fin
I hope you've enjoyed this first foray into using our Open Data API and working with data from one of our scans. Drop us a note at research@rapid7.com with any questions about this post, the Open Data portal/API, or the new `ropendata` package.
You can find the code used in this post over at Rapid7’s GitHub.