2019 Layer8 Conference “Getting the Good Stuff” Talk Companion Post

I was fortunate enough to present a talk at the Layer8 OSINT and Social Engineering conference in June 2019. Below are the notes to the demo I executed during that talk so that you can follow along.

Talk title: Getting the Good Stuff: Understanding the Web to Achieve Your OSINT Goals

Abstract: As OSINTers we need to look beyond what is rendered in a web browser. Much like an ocean, the web pages we visit contain a wealth of data under the surface. If you understand how to access that information, you can find pivot points to continue your research.

Come and learn how to decode web traffic using simple tools, to retrieve Google Analytics codes and social media IDs from web content, and how to interact with APIs (Application Programming Interfaces) to grab your OSINT data. This will not be a “use this tool and it’ll do all the hard work” talk but instead, will give you the confidence and understanding of how the web works so that you can develop your own techniques to harvest OSINT data.

I’ve made 10 Minute Tip videos for each of the sections below and posted them to the YouTube channel linked from https://osintcurio.us/10-minute-tips/.

You take the red pill—you stay in Wonderland, and I show you how deep the rabbit hole goes. Remember: all I’m offering is the truth. Nothing more.

Morpheus, The Matrix

Most of us have seen, or at least heard of, the quote above from The Matrix. It holds true in our OSINT work as well as in that movie’s fictional dystopia. We can look at what appears on our screens, in our web browsers, and tell ourselves that this is all the data about a person or a site, or we can choose to look beyond what is rendered in our browsers and seek other sources of data.

This talk is about being OSINT Curious and breaking free from the rendered world.

Live demo

HTML Source Code Data Harvesting

Sometimes we can find interesting information inside the HTML source code that is sent to our web browsers. Remember that only some of that code gets rendered in the web browser that you look at while other content helps load additional page content or has comments that may be interesting to you as an OSINTer. Let’s take a look at a couple of HTML pages with fun things in their source code.

  • In a browser, go to view-source:https://www.smule.com/ and examine the HTML comments (see the ASCII art?)
  • In a browser, go to view-source:https://keybase.io and examine the HTML comments (there is a note to you in there…see it?)
  • In a browser, go to view-source:https://www.flickr.com/ and examine the HTML comments. Flickr also has ASCII art, along with the message “You’re reading. We’re Hiring.” Let’s see if other sites have that message in their pages. Perhaps we can connect multiple web sites using this string?
  • For more efficient searching of source code, use https://censys.io
    • To find other sites with the same Flickr.com HTML comments: https://censys.io/domain?q=%22You%27re+reading+We%27re+hiring%22
    • Do you see that several other sites have that content? Looks like some other Flickr sites and then one called permission.io. Load that web site and look at the source code for the string “You’re reading”.
    • Do you find the same string in there? Yes…it is there, but in a slightly different format (a comma instead of a period). These strings can sometimes be used to find links between seemingly-distinct websites.
    • Look at the source code for permission.io again. Right below the string you searched for above is a section that starts with “<!--”. This code is commented out and will not render in your browser, yet it references the forum.permission.io site.
    • Go back to the main permission.io page…this “forum” subdomain is not referenced there. We discovered it by reading the HTML source code.
  • Analytics and Tracking Codes
    • Tracking and analytics codes can help us show the relationships between seemingly unrelated domains and web sites.
    • In a browser, go to view-source:https://ge.com
    • Search for and extract the Google Analytics code: UA-10221857 from the source of the page
    • Go back to Censys.io and look for other sites with that same code: https://censys.io/domain?q=%22UA-10221857%22
      • I found 2 sites: ge.com and gecompany.com, and it’s easy to see how they are related.
    • Try SpyOnWeb: http://spyonweb.com/ua-10221857
      • I found 7 sites this time, including some domains that are no longer active and have interesting names.
    • Now visit the https://builtwith.com/relationships/ge.com site
      • Huge number of related sites based on Google Analytics code but also other trackers, tags, and IP addresses
      • Has a dynamically generated graphic showing the time when each other domain used the trackers or had a similar IP address
  • Summary: HTML Source Code for OSINT
    • Find hidden content not referenced in the rendered web page
    • Pivot on content discovered to find other hosts/web sites with it
    • Find commented-out or deprecated references to pages that are still live
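The manual steps above can be sketched as a small script. This is a minimal illustration using only the standard library; the sample source below is invented, so swap in a page you have fetched yourself (e.g. with urllib.request):

```python
# Sketch: pull HTML comments and Google Analytics UA- codes
# out of raw page source with regular expressions.
import re

def html_comments(source):
    """Return the text of every <!-- ... --> comment in the source."""
    return re.findall(r"<!--(.*?)-->", source, re.DOTALL)

def ua_codes(source):
    """Return any Google Analytics tracking IDs (UA-XXXXXXXX or UA-XXXXXXXX-N)."""
    return sorted(set(re.findall(r"UA-\d{4,10}(?:-\d+)?", source)))

# Invented sample standing in for a real page's source.
sample = """
<html><head>
<!-- You're reading. We're Hiring. -->
<script>ga('create', 'UA-10221857-1');</script>
</head></html>
"""

print(html_comments(sample))  # comment text, including the hiring message
print(ua_codes(sample))       # ['UA-10221857-1']
```

Once a UA code is extracted, pivot on it in Censys, SpyOnWeb, or BuiltWith exactly as shown above.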

XHR (XMLHttpRequest) and JSON

Sometimes web pages load a main portion, the HTML, and then use JavaScript within your web browser to make successive calls to web sites to pull additional data. Developers can use an XMLHttpRequest (XHR) to make these additional requests from your browser.

These XHRs usually use a data format called JSON (JavaScript Object Notation) which is easy for JavaScript and other programming languages to parse. Let’s see an example.

  • In a browser, go to http://technisette.com
  • This should redirect your browser to the StartMe page at https://start.me/p/m6XQ08/osint
  • Select some content that is displayed on that page such as “Select a category” and copy it.
  • Right click on the page and view page source
  • Look for that string you copied in the page source. You see it? Nope? Let’s look for an XHR.
  • Launch the developer tools (press F12 on most modern browsers)
  • Press the Network tab in the developer tools window
  • Reload the page (using F5 or the reload/refresh button)
  • Look at all the resources loading in the network pane. To filter for XHRs, just click the “XHR” option in the filter bar. This should leave you with two entries: one named m6XQ08.json and one named settings.
  • Look at the information in the https://start.me/p/m6XQ08.json file by bringing it up in your web browser
    • WARNING: While Firefox has a built-in JSON decoder, Chrome does not. If using Chrome, there are Extensions like JSON Viewer that can “prettify” the JSON.
  • Look for that string you copied in the page source. You see it? Yes! It is there. The developers of the start.me pages use JSON files (called via XHR) to retrieve additional data for the pages.
  • Do you see data in the JSON file that is not present in the web page when it is rendered?
    • Technisette’s email address?
    • The unique “owner id” for her account
    • Timestamps for page creation and modification
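You can harvest the same unrendered fields with a script instead of eyeballing the JSON. The helper below walks any nested JSON structure looking for key names of interest; the sample payload is an invented stand-in for the real start.me file (its exact layout may differ), and the commented-out urllib call shows how you might fetch the live file:

```python
# Sketch: hunt through nested JSON for fields that never
# appear in the rendered page (emails, owner IDs, timestamps).
import json

def find_keys(obj, wanted, path=""):
    """Yield (path, value) pairs for every key in `wanted` found anywhere in obj."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            p = f"{path}/{k}"
            if k in wanted:
                yield p, v
            yield from find_keys(v, wanted, p)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_keys(item, wanted, f"{path}[{i}]")

# To fetch the live file you could use:
#   import urllib.request
#   data = json.load(urllib.request.urlopen("https://start.me/p/m6XQ08.json"))
sample = json.loads("""
{"page": {"owner_id": 12345,
          "owner": {"email": "user@example.com"},
          "created_at": "2016-01-01T00:00:00Z"}}
""")

for path, value in find_keys(sample, {"email", "owner_id", "created_at"}):
    print(path, "->", value)
```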

Neat example…but how would we use these for OSINT? OK, let’s go to the social media platform called TikTok.com.

Putting It Together: TikTok

The social media site TikTok (https://www.tiktok.com/en/) allows users to post short videos on its site. Other users can comment on those videos. Let’s examine how it all works to cement our understanding of looking in the source code, examining XHRs, and decoding JSON.

  • Let’s find a user’s profile to examine. Go to https://www.tiktok.com/en/trending in your web browser and click on a video.
  • In the upper right of the video page is the poster of the video. Click on that person’s profile pic.
  • I chose the random user profile https://www.tiktok.com/share/user/10778167?langCountry=en for my work.
  • Look at the number of followers that your chosen user has. In the case of the above user, in June 2019 they had “2.3k following, 507.8k fans, 6.9m hearts”. See how these values are rounded? Let’s see if we can get more detail.
  • TikTok embeds JSON data in the profile page. View the page source for your profile or use the one I’m doing view-source:https://www.tiktok.com/share/user/10778167?langCountry=en
  • There is a lot of source code in there. Search for “INIT_PROPS”.
  • Select and copy from the {" immediately to the right of INIT_PROPS all the way down to the first }]}]}} to the left of the first </script> tag.
  • Visit the CyberChef page (https://gchq.github.io/CyberChef/) to decode this JSON.
  • Paste the JSON you copied into the Input field on the right.
  • On the left, in the Operations panel, type: “json” (no quotes) in the Search… field.
  • Click and drag the JSON Beautify operation from the blue section under the search field to the Recipe pane in the middle. Drop it in there by letting go of the mouse button.
  • Once you do that, JavaScript in your browser beautifies the JSON input and places the result in the Output pane. This should be human-readable. If not, you may have copied too many characters or too few.
  • Scroll down in the output pane until you see the “userData” item. That has the data about the user. Scroll down a little further and you will see the exact number of followers, fans, and hearts instead of the rounded numbers we saw earlier.
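The copy-and-CyberChef step can also be automated. The sketch below finds the embedded INIT_PROPS blob in the HTML and decodes it with json.loads; the sample HTML is a made-up miniature of the real TikTok page (whose markup changes over time), so treat the regex and field names as assumptions to verify against the live source:

```python
# Sketch: extract and decode the JSON embedded in the
# profile page instead of copying it by hand.
import json
import re

def extract_init_props(html):
    """Return the decoded INIT_PROPS object, or None if not found."""
    m = re.search(r"INIT_PROPS\s*=\s*(\{.*?\})\s*</script>", html, re.DOTALL)
    return json.loads(m.group(1)) if m else None

# Invented miniature of the profile page's source.
sample_html = """
<script>window.INIT_PROPS = {"userData":
  {"followingCount": 2301, "fanCount": 507843, "heartCount": 6900123}}
</script>
"""

props = extract_init_props(sample_html)
print(props["userData"])  # exact counts, not the rounded 2.3k/507.8k/6.9m
```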

So, we have some detailed profile data. Now let’s switch to a TikTok video and harvest the JSON data.

  • I’ve clicked on a random video from the user above: https://www.tiktok.com/share/video/6698401324473519366?langCountry=en
  • Looking at the comments for the video, we see “Can u sing without me by Harley”. Let’s look for that in view-source:https://www.tiktok.com/share/video/6698401324473519366?langCountry=en
  • It is not there…is it? Let’s launch our Developer Tools (F12) and see if this site uses XHRs.
  • Go back to the regular video page and press F12 then reload the page.
  • Select the XHR filter (if it is not already selected). You should see a bunch of XHR requests. Two of them are probably for the list?id=6698401324473519366 resources. We want the one that has the count right after the id. Right click on that entry and Open in New Tab (works in Firefox and Chrome)
  • Now search for the “Can u sing without me by Harley” string. You find it? Yup. This JSON page has ALL the video comments along with who made them, profile IDs, date and time stamps (in UNIX/Epoch time), and more! OSINTing!
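Those UNIX/Epoch timestamps are easy to turn into readable dates. A quick sketch follows; note that the field names here (create_time, text, the user id) are illustrative stand-ins, not TikTok’s exact schema:

```python
# Sketch: convert Epoch-second timestamps from a comment
# JSON payload into human-readable UTC times.
import json
from datetime import datetime, timezone

# Invented sample mimicking a comment-list JSON response.
sample_comments = json.loads("""
{"comments": [
  {"text": "Can u sing without me by Harley",
   "user": {"uid": "10778167"},
   "create_time": 1559347200}
]}
""")

for c in sample_comments["comments"]:
    when = datetime.fromtimestamp(c["create_time"], tz=timezone.utc)
    print(f'{when:%Y-%m-%d %H:%M:%S} UTC  uid={c["user"]["uid"]}  {c["text"]}')
```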

APIs (Application Programming Interfaces)

We now understand that this XHR and JSON data can hold vast quantities of data we want for our OSINTing. Let’s shift to another place that serves JSON: APIs.

We may have already introduced APIs to you in the TikTok example above, as there is most likely an API (Application Programming Interface) servicing the TikTok XHR requests. In other words, when programs (think mobile apps and scripts written in languages like Python) request data from web sites, they often make those requests to APIs so that the data comes back in a format that is easily digestible by the script or app. This data can contain content that is not found in the comparable web page you’d see in your browser.

Want an example? OK. Let’s use the OpenCorporates.com site:

Did you see the extra information in the JSON data from the API? The lesson here: always check for an API and compare its data to what is rendered in your browser. Look for the word JSON or API on the web site.

In fact, the word JSON appears at the bottom of the regular web page. If you were OSINTCurious and clicked it, you would have been rewarded with this extra data in the JSON API!
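As a sketch of what working with such an API looks like, the snippet below builds an OpenCorporates-style company URL and picks fields out of a JSON response. The URL layout and the sample response are assumptions for illustration (trimmed and invented), so check them against OpenCorporates’ own API documentation before relying on them:

```python
# Sketch: build a company API URL and mine the nested JSON
# response for fields the rendered page may not show.
import json

def company_url(jurisdiction, number):
    """Assumed URL pattern for a company record (verify against the API docs)."""
    return f"https://api.opencorporates.com/companies/{jurisdiction}/{number}"

# Invented, trimmed stand-in for a real API response.
sample_response = json.loads("""
{"results": {"company": {
    "name": "EXAMPLE LTD",
    "company_number": "01234567",
    "incorporation_date": "1999-04-01",
    "registered_address_in_full": "1 Example Street, London",
    "officers": [{"officer": {"name": "J SMITH", "position": "director"}}]}}}
""")

company = sample_response["results"]["company"]
for field in ("name", "incorporation_date", "registered_address_in_full"):
    print(f"{field}: {company[field]}")
for o in company["officers"]:
    print("officer:", o["officer"]["name"], "-", o["officer"]["position"])
```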

Unlinked Web Files

When some people use a search engine like DuckDuckGo, Google, or Yandex, they might think that these applications will show them any file that exists on a crawled web site. That simply is not true. Web site owners can tell search engines to ignore certain files or directories by putting those forbidden resources in a file called robots.txt and placing it at the root of their domain. For more info about this file, visit http://www.robotstxt.org/robotstxt.html.

Let’s look at some of these robots.txt files as they reveal directories and files that, as OSINTers, we may WANT to visit and that are not indexed by search engines.

So why does this matter? Web site owners can prevent search engines from indexing certain resources. When you search on a specific term, pages that exist on these sites will not come up because of their robots.txt files. A gentle reminder that search engines don’t index the entire internet!
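A quick way to pull out those interesting paths is to scan the file for Disallow lines. Python ships a robots.txt parser, but for OSINT we want the raw paths, so a simple line scan of the fetched file works; the sample content below is invented:

```python
# Sketch: list the paths a site asks crawlers to skip --
# exactly the directories an OSINTer may want to visit directly.
def disallowed_paths(robots_txt):
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

# Fetch a live file with e.g.:
#   import urllib.request
#   text = urllib.request.urlopen("https://example.com/robots.txt").read().decode()
sample = """\
User-agent: *
Disallow: /private/   # not indexed, but still reachable
Disallow: /old-site/
Allow: /public/
"""

for p in disallowed_paths(sample):
    print(p)
```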

Wrapping It Up

This blog post and companion conference talk are meant to show:

  1. As OSINTers, the more we understand about platforms, web applications, and how they work, the greater our collection opportunities.
  2. We should not rely solely on what we see rendered in our web browsers.
  3. Diving deeper into source code, APIs, and XHRs helps us gather more data in a format (JSON) that is easy to understand and parse.
  4. OSINT can be REALLY fun! There are “Easter Eggs” (surprise gifts) in many of the sites you might use daily.
  5. Be OSINT Curious and click on things, look for and use APIs, and examine source code!
  6. Seek to understand the platforms you are using…not just harvest data from them.
  7. Now that you know these places are out there and have valuable OSINT data, you will never be able to go back. Welcome to the “OSINT Matrix”.
