For our FolioProjects SaaS, we do not use Google Analytics, for reasons we cover below. Instead, we store visitor records in a relational database management system and share them with platform users. We capture a range of information about each visitor, including the technology they use to access the SaaS, their location, and the pages they visit.
In this article, we review the pros and cons of managing your own visitor logs vs using a 3rd party platform like Google Analytics.
Table Of Contents
- Capturing Visitor Records
- Visitor Count Discrepancies
- Invisible Messages From Visitors
- Malicious Attacks In The URL
- Which Records To Store And How
- Pros and Cons
Capturing Visitor Records
There are many providers that will attempt to assist you in capturing and analyzing website visitor logs. Although there are other popular platforms like Adobe Experience Cloud, we use Google Analytics as the example throughout this article simply because our observations involved that particular platform.
Most servers also store raw visitor logs, and depending on the type of server you use, there will be applications available to analyze them. For example, on Linux servers, you can analyze the access logs with a free application like AWStats. Ultimately, logs are mostly helpful once analyzed for patterns, trends, and anomalies.
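As a rough illustration of what analyzing raw logs looks like, here is a minimal Python sketch that parses access-log lines in the common Combined Log Format and counts the most requested paths. The pattern and sample lines are assumptions; adjust them to match your server's actual log configuration.

```python
import re
from collections import Counter

# Combined Log Format as used by default Apache/nginx setups (an assumption;
# adjust the pattern if your server logs a different format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def top_paths(lines, n=5):
    """Count the most requested paths in raw access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts.most_common(n)

# Illustrative log lines, not real traffic.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2023:13:55:40 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.58.0"',
]
print(top_paths(sample))  # [('/', 2), ('/pricing', 1)]
```

A dedicated tool like AWStats does far more, of course; the point is that raw logs are plain text you can mine with a few lines of code.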
In our records, we capture a range of information, including the user’s IP, landing page, and location data such as lat/lon. We at BeProSoftware analyze the raw logs almost daily. In this article, we discuss some interesting aspects we noticed about the visitor records, especially when compared to 3rd party analytics solutions.
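To illustrate what storing visitor records relationally can look like, here is a minimal sketch using SQLite. The table and column names are hypothetical examples, not our actual schema.

```python
import sqlite3

# Hypothetical schema for illustration; the column names are an assumption,
# not FolioProjects' actual table layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        ip TEXT,
        landing_page TEXT,
        user_agent TEXT,
        lat REAL,
        lon REAL,
        visited_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Record one illustrative visit.
conn.execute(
    "INSERT INTO visits (ip, landing_page, user_agent, lat, lon) VALUES (?, ?, ?, ?, ?)",
    ("203.0.113.7", "/pricing", "Mozilla/5.0", 40.71, -74.01),
)
count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(count)  # 1
```

Once the records are in a database, joining them against users, content, or ad campaigns is a plain SQL query rather than a feature request to a 3rd party.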
Visitor Count Discrepancies
We recently observed this phenomenon while running LinkedIn ads for a YouTube video. The YouTube analytics initially showed a visitor count of (X) but later revised the count down to (Y). The number of users sent (Z), as reported by the LinkedIn ads, was higher than both (X) and (Y). We have no clear answer as to why these 3 numbers differ.
Invisible Messages From Visitors
We are able to capture information like user_agent from all traffic to our servers. In this server variable, we capture details about the technology the user is utilizing to access our SaaS, typically including the browser and operating system being used. This is invaluable information for understanding which technologies we should pay special attention to while building and testing our SaaS, whether for catering to traffic with particular demographics or for defending against patterns confirmed to be malicious.
While reviewing these records, we began to see a range of anomalies like names and messages in this field. Some turned out to be large corporations that regularly mine the internet and provide details on how to opt out of their scans. Others indicated potentially malicious intent, like “curl/7.58.0“. We even received lessons through code references left in the user_agent record, like “masscan/1.3 (https://github.com/robertdavidgraham/masscan)“.
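Flagging these anomalies can be as simple as a substring check against known tool signatures. A minimal sketch, with an illustrative (not exhaustive) signature list:

```python
# Known scanner/tool signatures to watch for in the user_agent field.
# This list is illustrative, not exhaustive; tune it to your own traffic.
SUSPICIOUS_AGENTS = ("curl", "masscan", "python-requests", "zgrab", "nikto")

def flag_user_agent(user_agent):
    """Return True when a user_agent string matches a known tool signature."""
    ua = user_agent.lower()
    return any(sig in ua for sig in SUSPICIOUS_AGENTS)

print(flag_user_agent("curl/7.58.0"))                                # True
print(flag_user_agent("masscan/1.3 (https://github.com/robertdavidgraham/masscan)"))  # True
print(flag_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Keep in mind that user_agent is self-reported and trivially spoofed, so treat matches as a signal to investigate, not proof of intent.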
Malicious Attacks In The URL
While monitoring visitor traffic, we noticed traffic to URLs with extra parameters attached. Some were relatively low risk, attempting to learn about the server setup, like /?phpinfo=1. Many were hunting for specific applications, like /?XDEBUG_SESSION_START=phpstorm. Unfortunately, we also encountered obvious malicious attempts like /?a=fetch&content=die(@md5(HelloThinkCMF)), which was designed to exploit a security hole in a particular CMS.
You could capture and block the IPs of visitors who attempt to access URLs you identify as malicious. However, note that this list could grow large, since malicious server requests are usually automated and arrive from changing IPs. Understanding what to look for, and the tools available for addressing issues, will help you make use of the visitor records you are capturing.
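Here is a sketch of how you might scan captured records for suspicious URL patterns like those above and collect the offending IPs. The parameter and fragment lists are examples drawn from the probes mentioned, not a complete rule set:

```python
from urllib.parse import urlsplit, parse_qs

# Query parameters seen in reconnaissance probes, plus substrings that
# commonly indicate exploit attempts. Both lists are illustrative.
SUSPICIOUS_PARAMS = {"phpinfo", "XDEBUG_SESSION_START"}
SUSPICIOUS_FRAGMENTS = ("die(", "md5(", "eval(", "base64_decode(")

def is_suspicious_request(url):
    """Check the query string of a requested URL for known probe patterns."""
    query = urlsplit(url).query
    params = parse_qs(query, keep_blank_values=True)
    if SUSPICIOUS_PARAMS & set(params):
        return True
    return any(frag in query for frag in SUSPICIOUS_FRAGMENTS)

def offending_ips(records):
    """records: iterable of (ip, url) pairs pulled from the visit log."""
    return {ip for ip, url in records if is_suspicious_request(url)}

# Illustrative records, not real traffic.
records = [
    ("203.0.113.7", "/pricing"),
    ("198.51.100.2", "/?phpinfo=1"),
    ("198.51.100.9", "/?a=fetch&content=die(@md5(HelloThinkCMF))"),
]
print(sorted(offending_ips(records)))  # ['198.51.100.2', '198.51.100.9']
```

The resulting set could feed a firewall blocklist, though as noted, automated attackers rotate IPs, so expect to prune it.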
Every time your server is accessed, resources that you pay for are being used. It’s important to understand how your resources are being used and manage that utilization accordingly. Some traffic should be blocked, others redirected, and some served content. You should establish rules on how you react to server requests.
One popular technique is to utilize firewalls for intercepting the requests and addressing them with redirects or tailored responses. This reduces the waste of resources and preserves your servers’ capacity for actual customers.
You could also optimize the response to generic requests like these through cache servers. A solution like Varnish, placed in front of your web servers, can quickly respond to generic requests, both malicious and valid. This method stores the content of web pages and quickly relays them when requested, versus having the web server rebuild the page from scratch.
Lastly, you could mitigate the issue at the web server itself, through server-level instructions or in the application code. Note that the earlier measures, like the firewall and caching, could also run on the same server where the web pages are generated; separating these layers into individual servers has cost implications.
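For the in-code option, here is a minimal sketch of application-layer blocking as WSGI middleware in Python. The blocklist contents and the response body are illustrative assumptions, not a production configuration:

```python
# Minimal WSGI middleware sketch: reject requests from blocked IPs at the
# application layer when no separate firewall or cache tier is in place.
BLOCKED_IPS = {"198.51.100.2"}  # illustrative; load from your visit log in practice

class BlocklistMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        if ip in BLOCKED_IPS:
            # Short-circuit with a 403 before the app does any real work.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)

def hello_app(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

app = BlocklistMiddleware(hello_app)
```

Blocking in code is the cheapest to set up but the most expensive per request, since every probe still reaches your application server; that is the trade-off against a dedicated firewall or cache tier.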
Which Records To Store And How
Deciding which visitor traffic to keep, and for how long, has a lot of implications. The more information you store, the more costly it becomes. At some point, you will need to ask yourself how much information is needed. For example, are visitor logs from two years ago still useful?
One popular method is to store all records possible for a set amount of time. After that timeframe, you can either delete the older records or engage in a technique known as warehousing, where you move data to cheaper yet slower storage, also known as long-term storage or archiving.
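The retention approach above can be sketched as a simple partition of records into a hot set and an archive set. The two-year cutoff here is just an example value:

```python
import datetime

RETENTION_DAYS = 730  # roughly two years; an example cutoff, not a recommendation

def partition_records(records, today):
    """Split visit records into those to keep hot and those to archive.

    records: iterable of (visited_at_date, payload) pairs.
    """
    cutoff = today - datetime.timedelta(days=RETENTION_DAYS)
    hot, archive = [], []
    for visited_at, payload in records:
        (hot if visited_at >= cutoff else archive).append(payload)
    return hot, archive

today = datetime.date(2024, 1, 1)
records = [
    (datetime.date(2023, 12, 1), "recent visit"),
    (datetime.date(2021, 6, 1), "old visit"),
]
hot, archive = partition_records(records, today)
print(hot, archive)  # ['recent visit'] ['old visit']
```

In practice the same split is usually done in the database itself (e.g. deleting or exporting rows older than the cutoff on a schedule), with the archive set written to cheaper storage.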
Most servers store access logs. For example, the Apache web server which runs on Linux servers automatically captures this data. As noted before, we like to use a relational database management system when relevant. Logs are files on your file system taking up hard drive space. Depending on your database setup, it will also be a file on your file system taking up hard drive space, or it could be a dedicated instance. Understanding how each technology works and its limitations will help you to make the right decision for your organization.
Pros and Cons
In case we missed any during the article, we sum up the pros and cons below.
Google Analytics Pros
- Popular, making it relatively easy to get help
- Many products are designed to integrate quickly with it
- Graphical analysis of data
Google Analytics Cons
- Several data points, like user_agent and IP, are missing or summarized
- Many records are not counted for a range of reasons
- The visitor summary is updated only once per day
- Forced changes to interface e.g. GA4 rollout
Custom Recording Pros
- Clear record of all visits regardless of the scenario, e.g. bot or real visitor
- Ability to check and analyze visitor details at will
- Potential for deeper analysis with more data points available e.g. lat/lon and continent.
- Ability to combine visitor data with other information for context e.g. user ID or content ID
Custom Recording Cons
- Takes more effort to analyze the data, even if using 3rd party tools in conjunction with logs
- Increase in storage costs for storing lots of data or potential loss of details if storage is limited
- Another aspect that needs to be secured
Managing your own visitor logs has pros and cons. As a pro, you are able to view raw visitor information before it is analyzed by a third party, where aspects may be summarized or missed. On the other hand, utilizing a 3rd party provider like Google Analytics is relatively easy.
We personally look at the situation this way: if you are paying for traffic, you should be analyzing it yourself as well as through 3rd party providers. If security is important to you, you should be recording and analyzing these records both yourself and with 3rd party providers. However, for most websites online, the information is noise and is best left to analytics software.