URIs: The Anatomy of a Web Address

Background

We often take for granted the ease with which we can retrieve “stuff” over the Internet. Recently, at work, I was asked what sounded like an easy enough question, but the answer depended on several layers of prerequisite understanding. A simple question turned into a fairly lengthy explanation.

In case that explanation is helpful to you, I’ve posted it here.

How do we ask for “stuff” on the Internet?

… and what, exactly, is a “URI”? And why do we call it that instead of a “URL”?

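Uniform

  • URIs follow a uniform, industry-accepted syntax (de facto or ratified standards) for addressing and/or identifying resources via web technologies. The “U” officially stands for “Uniform”, although “Universal” appears in some early documents.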

Resource

  • A “resource” is a “thing”. That thing may have references to, requests for, and/or dependencies upon other resources.
  • Most commonly, “resources” are HTML documents, images, CSS files, JS files, icons, videos, audio files, .PDF files, etc.
  • A resource may also be a reference to an action or intent (dial a certain phone number, send an email to a specific address, send a text message to a certain number, open an installed application to a certain state, etc.).

Identifier

  • Resources are “identified”, and how they are handled depends on the context of the request (protocol/pseudo-protocol handler requested, phone vs. desktop computer, etc.).
  • Like most web technologies, we cannot explicitly control how a device handles these. The best we can do is indicate to the device (the “User Agent”) how we would like it to handle our request.
  • The reason we use URI instead of URL is that the “identifier” part of a URI covers how a resource is named and how it might be handled, not just how it should be “located” (the “L” in “URL” stands for “Locator”). Every URL is a URI, but not every URI is a URL.

Protocol/Pseudo-Protocol Handlers

  • This is the first section of the URI (formally called the “scheme”) and indicates the protocol by which the resource should be requested. The most common are http:// and https://. Both indicate use of the “Hypertext Transfer Protocol”, the first being the standard variant of the protocol, the second being the “Secure” (encrypted) variant.
  • Other protocols include (but are not limited to) FTP://, FTPS://, SFTP://, archie, gopher, etc.

Pseudo-protocol handlers are things like:

  • mailto: to ask the User Agent to send an email using their configured email client to the supplied email address (with subject and body content optionally supplied)
  • tel: to ask the User Agent to open the configured dialer app and pre-populate the supplied phone number (user must interact with the “call” button to initiate the phone call)
  • skype: to ask the User Agent to open the app which handles Skype Communications and open a communique to the supplied Skype user
  • wire: to ask the User Agent to open the app which handles Wire Encrypted Communications and open a communique to the supplied Wire user
  • etc.

Although the protocol/pseudo-protocol handlers are not case-sensitive, the industry-best-practice is to refer to them in uppercase in documents like this, but in lowercase when in use (like on web pages).
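To make the “first section of the URI” idea concrete, here is a minimal sketch using Python’s standard urllib.parse module. The example addresses (example.com, the 555 phone number) are placeholders I made up for illustration, and scheme_of is just a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Placeholder URIs for illustration only.
examples = [
    "HTTPS://www.example.com/products",          # scheme typed in uppercase
    "mailto:someone@example.com?subject=Hello",  # pseudo-protocol, no "//"
    "tel:+15555550100",                          # pseudo-protocol for a dialer
]


def scheme_of(uri: str) -> str:
    """Return the part before the first colon, lowercased by urlsplit."""
    return urlsplit(uri).scheme


for uri in examples:
    print(f"{scheme_of(uri):8} <- {uri}")
```

Note that the uppercase HTTPS comes back as https, which matches the point that the handler is not case-sensitive.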

IP Address

Think of an IP address like a “phone number” for your computer. In order to call you, I have to know your number. Since it’s unlikely I have it memorized, I usually consult with a directory (a Rolodex, your business card, my hand-written address book, a phone book; the company Active Directory, LDAP, the address book in my smartphone or computer; etc.). This “directory” lets me look up your name, which is associated with your number, which my phone then uses to call your phone.

Computers operate much the same way. Their “phone numbers” are called IP addresses. There are currently two primary versions of IP (Internet Protocol): IPv4, and IPv6.

IPv4 addresses look like 255.255.255.255 (where each octet can be 0 to 255). We have more devices than IPv4 addresses today, and have been limping along with workarounds for some years now. One solution to the shortage is IPv6, which, among other things, changes the formatting of the IP address. Without getting into a lot of detail, this would be like making your phone number 30 characters long, including both letters and numbers (and you thought memorizing an 11-digit phone number was hard!).
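If you want to see the two formats side by side, the sketch below uses Python’s standard ipaddress module. The two addresses come from ranges reserved for documentation, and is_valid_ip is a hypothetical helper name of my own.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.10")   # reserved documentation IPv4 address
v6 = ipaddress.ip_address("2001:db8::1")  # reserved documentation IPv6 address

print(v4.version, list(v4.packed))  # the four octets of the IPv4 address
print(v6.version, v6.exploded)      # the IPv6 address with nothing abbreviated

# Total number of possible IPv4 addresses (2 to the 32nd power).
print(ipaddress.ip_network("0.0.0.0/0").num_addresses)


def is_valid_ip(text: str) -> bool:
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False
```

Written out in full, an IPv6 address is 32 hexadecimal digits plus 7 colons, which is where the “letters and numbers” in the phone number comparison come from.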

DNS (Domain Name System)

Just like your smartphone or the white pages help you look up a person’s phone number by their name, DNS is a system through which User Agents can look up a computer’s address by an “easy to remember” name:

  • JoeLevi.com
  • Lifetime.com
  • HomeDepot.com
  • etc.

DNS uses TLDs, domains, and subdomains to pair computer names with their addresses.
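The “look up a name, get back a number” step can be sketched with Python’s standard socket module. This is only an illustration of the idea: resolve is a hypothetical helper name, and it simply asks whatever resolver your operating system is configured to use.

```python
import socket


def resolve(name: str) -> list[str]:
    """Ask the operating system's resolver for the addresses behind a name."""
    results = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr); the address
    # is the first item of sockaddr. The set removes duplicates.
    return sorted({info[4][0] for info in results})
```

Calling resolve("example.com") needs a working network connection, and the addresses it returns can change over time, so I have not shown any output here.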

Top-Level Domain (TLD)

Back in the early Internet, we had .COM, .ORG, .NET, .EDU, .GOV, .MIL, and a very few other Top-Level Domains. As new countries started using the internet, they got their own country code TLDs (.TV, .LY, .UK, .AU, .MX, etc.). Eventually, some of the ccTLDs got repurposed:

  • .TV (the ccTLD for the country of Tuvalu) was repurposed for Television stations, shows, and related sites
  • .LY (the ccTLD for the country of Libya) was repurposed for URL shorteners (bit.ly, etc.), and “plays on words” (happi.ly, etc.)
  • etc.

Today we have a lot of other TLDs (.site, .booking, .lawyer, .gop, .republican, etc.).

Domains registered under these TLDs are controlled and maintained by “registrars” like Register.com, MyDomain, GoDaddy, Google, etc.

The TLD is not case-sensitive.

Domain

The highest level a person or company can register and associate with a computer is the Domain (JoeLevi.com, Lifetime.com, etc.). By registering the domain, that entity has control over where it is pointed and the contact information associated with it.

The Domain is not case-sensitive.

Subdomain

A subdomain is used to add specific uses to the domain:

  • mail.something.com, smtp.something.com, pop3.something.com could each be used for various email services
  • ftp.something.com could be used for a file-share (accessible via http, https, ftp, ftps, or sftp)
  • www.something.com could be used to access the web server (via http or https)
  • uk.something.com could be used to access the version of the website specific to visitors from the United Kingdom
  • etc.
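The subdomain / domain / TLD breakdown above can be sketched in a few lines of Python. This is a deliberately naive version: it assumes the registered domain is always the last two labels, which is not true for names like something.co.uk, and split_hostname is a hypothetical helper name.

```python
def split_hostname(hostname: str) -> dict[str, str]:
    """Naively split a hostname into subdomain, domain, and TLD."""
    # Hostnames are not case-sensitive, so lowercase before splitting.
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"expected at least a domain and a TLD: {hostname!r}")
    return {
        "subdomain": ".".join(labels[:-2]),  # empty string when there is none
        "domain": ".".join(labels[-2:]),     # assumption: the last two labels
        "tld": labels[-1],
    }


print(split_hostname("www.something.com"))
print(split_hostname("UK.Something.COM"))
```

A real implementation would need a list of known public suffixes to handle multi-part endings correctly.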

Folder/Directory (“route parameters”)

From a skeuomorphic or even “desktop computer” analogy, a Domain could be considered to be a building belonging to a corporation, a Subdomain could be a filing cabinet inside that building (labeled WWW or MAIL, etc.), and a Folder or Directory could be considered a hanging folder inside that cabinet with its label (CUSTOMER SUPPORT, WARRANTY REGISTRATIONS, etc.).

Inside those hanging folders can be individual documents (TABLE-OF-CONTENTS.doc, 2017-REGISTRATION-REPORT.xls, etc.), and/or manila folders (2017-REGISTRATIONS, 2016-REGISTRATIONS, etc.).

Inside those manila folders can be individual documents, and/or even more manila folders.

And so on.

Folder/Directory names ARE case-sensitive as far as the URI itself is concerned (some servers choose to ignore the casing), and industry-best-practices tell us they should be in all lowercase.

Like the “file cabinet” analogy, a webserver can get set up like your desktop computer, with drives (cabinets), folders, subfolders, and files within them. This is ironically called a “digital analog”.

Unlike the “file cabinet”/desktop computer “digital analog”, most modern websites do not use static folders, subfolders, and files. Instead they utilize a concept called “routing” that goes something like this:

  • https://sub.domain.com/(parameter 1)/(parameter 2)/(parameter 3)

In this example, the values of “parameter 1” and “parameter 2” might correspond to a particular “folder” of content, but might also display pages “within” it in a certain way.

  • If the value of “Parameter 1” is “products”, the data retrieved and formatted might come from a table called “products” in the database
  • If “Parameter 2” is representative of a category of products (“basketball”, for example), the site may do something like this:
    • Limit the database request to just products of type “basketball”
    • Style the page with the color-scheme of the “basketball” product line
    • Pull in the text and images that relate to the “basketball” product line
    • Display a list of products within the “basketball” product line
  • If “Parameter 3” is representative of a specific model number or sku of a particular product (“12345”, “6405”, etc.), the site may do something like this:
    • Display the particular product including its featured image, and related image set
    • Display the variations and packaging options for the particular product
    • Display the stock/inventory level of the particular variations/packaging options for the particular product
    • Allow the customer to add the specific variation/packaging option for the particular product to their shopping cart
    • Display marketing information about the particular product
    • Display technical specifications about the particular product
    • Display assembly instructions for the particular product
    • Display helpful videos about the particular product
  • etc.
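The routing idea above can be sketched without any web framework. In this minimal Python example the labels “section”, “category”, and “sku” are names I picked for the three parameters, and parse_route and choose_view are hypothetical helpers; a real site would hand this job to its framework’s router.

```python
from urllib.parse import urlsplit

# Assumed labels for parameter 1, parameter 2, and parameter 3.
PARAMETER_NAMES = ("section", "category", "sku")


def parse_route(url: str) -> dict[str, str]:
    """Map the path segments of a URL onto the named route parameters."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return dict(zip(PARAMETER_NAMES, segments))


def choose_view(route: dict[str, str]) -> str:
    """Pick what to render based on how many parameters were supplied."""
    if "sku" in route:
        return "product detail"
    if "category" in route:
        return "category listing"
    if "section" in route:
        return "section landing page"
    return "home page"


route = parse_route("https://sub.domain.com/products/basketball/12345")
print(route, "->", choose_view(route))
```

No folder called “products” has to exist on the server; the values are simply inputs that decide which data to fetch and how to present it.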

Casing considerations

As mentioned previously, the domain, subdomain, and TLD are not case-sensitive, but in practice they’re generally written in all lowercase (to emphasize the visual importance of the Domain) or in all uppercase (to match the stylistic implementation of the URI):

  • www.lifetime.com
  • www.Lifetime.com
  • www.LIFETIME.com
  • WWW.LIFETIME.COM
  • www.joelevi.com
  • www.JoeLevi.com
  • www.JOELEVI.com
  • WWW.JOELEVI.COM

Utilizing mixed casing for the domain is advisable to help with readability:

  • www.Lifetime.com
  • www.EmotionKayaks.com
  • www.TheLifetimeSettlement.com

Everything after the TLD should be all lowercase except in rare cases. On our server, the same page is returned for www.Lifetime.com/myproduct no matter how the path is cased. Search engines, however, treat each casing as a different address and index each one separately, splitting search engine ranking between all of these:

  • www.Lifetime.com/MyProduct
  • www.Lifetime.com/MYPRODUCT
  • www.Lifetime.com/MYproduct
  • www.Lifetime.com/myPRODUCT
  • www.Lifetime.com/MyPrOdUcT
  • www.Lifetime.com/mYpRoDuCt
  • etc.

This is referred to as “search engine schizophrenia” and is a fantastic way to kill your search engine ranking.
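One way to picture the fix is to fold every variant down to a single lowercase address before it is ever linked or published. The sketch below assumes, as on our server, that the path casing does not change which page is returned; lowercase_url is a hypothetical helper name and the https:// prefix is added just so the addresses parse as full URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_url(url: str) -> str:
    """Lowercase the host and path, leaving query and fragment untouched."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.lower(),
         parts.query, parts.fragment)
    )


variants = [
    "https://www.Lifetime.com/MyProduct",
    "https://www.Lifetime.com/MYPRODUCT",
    "https://www.Lifetime.com/MYproduct",
    "https://www.Lifetime.com/myPRODUCT",
    "https://www.Lifetime.com/MyPrOdUcT",
    "https://www.Lifetime.com/mYpRoDuCt",
]

# Six differently cased links collapse into one address.
unique = {lowercase_url(v) for v in variants}
print(len(variants), "variants ->", len(unique), "address:", unique)
```

Do not apply this blindly to a server that really does serve different content for different casings.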

Default Documents

The default document is the one that is loaded from inside a folder when no filename and extension are supplied. This could be:

  • index.htm
  • index.html
  • home.htm
  • home.html
  • index.asp
  • index.aspx
  • index.php
  • default.htm
  • default.html
  • default.asp
  • default.aspx
  • etc.

The best practice is to omit the default document from the URL so that search engines and other linking sources point only to the folder. That way, if your technologies change, the folder structure stays intact, and you don’t lose all your page indexing and SEO-mojo when you switch from one tech to another.
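As a rough illustration of “point only to the folder”, here is a small Python sketch that trims a default document off the end of a URL. The set of filenames is taken from the list above, strip_default_document is a hypothetical helper name, and example.com is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit

# Filenames from the list above; extend to match your own server's settings.
DEFAULT_DOCUMENTS = {
    "index.htm", "index.html", "home.htm", "home.html",
    "index.asp", "index.aspx", "index.php",
    "default.htm", "default.html", "default.asp", "default.aspx",
}


def strip_default_document(url: str) -> str:
    """Return url with a trailing default document removed from its path."""
    parts = urlsplit(url)
    folder, _, filename = parts.path.rpartition("/")
    if filename.lower() in DEFAULT_DOCUMENTS:
        # Keep the trailing slash so the link still points at the folder.
        parts = parts._replace(path=folder + "/")
    return urlunsplit(parts)


print(strip_default_document("https://www.example.com/products/index.php"))
print(strip_default_document("https://www.example.com/products/12345"))
```

The first address loses its index.php; the second is returned unchanged because it does not end in a default document.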

Link rel canonical

To overcome the challenges of including/not including:

  • default documents
  • subdomains

And to address the casing issues associated with “search engine schizophrenia”, the industry has adopted the link rel canonical tag as a de facto standard (it is a link element placed in the page’s head, not a meta tag). This tag, essentially, tells the user agent “I really don’t care how you came to this page, but rather than giving you an error message or redirecting you a bunch of times, I’m going to load the page anyway, but THIS is what the REAL URI of this page/resource is”.

Not every user agent respects the link rel canonical tag, and not all of them treat it the same way, but it’s the only thing we can reliably do to “suggest” to the user agent what the actual address for the resource should be.
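For reference, this is roughly what the tag looks like when a page generates it. The sketch is in Python only to keep it runnable; canonical_link is a hypothetical helper name and the example.com address is a placeholder for whatever you decide the one “real” URI is.

```python
from html import escape


def canonical_link(preferred_url: str) -> str:
    """Build the link element that names the preferred URI for a page."""
    # Escape the URL so characters like & and quotes are safe in an attribute.
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


# The same tag is emitted no matter which variant of the address was requested.
print(canonical_link("https://www.example.com/products/"))
```

The resulting element belongs in the head section of the page’s HTML.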

Got all that?
