Explaining the Code: Log Analytics Kata

The Problem

Recently, I worked on a programming kata based on the problem below:

Given an arbitrary set of webserver log files in the industry-standard Common Log Format, produce a histogram of the number of unique page views per hour. The solution may be written in any language or combination of languages. You may use any and all resources at your disposal to solve this problem. Source: https://www.quora.com/Why-is-it-so-hard-for-me-to-get-a-software-developer-job?share=1

Demo

You can see the live working example: http://log.helloima.ninja/.

Notes: Wait a second or two for the table of requests to load. The first time, you might see "We couldn't find any requests" — the app is still processing the large log file in the background. The hide/unhide behavior doesn't quite work right yet.

Code

You can see the code on GitHub.

The Code Explained

Now on to the purpose of this post: an explanation. A kata wouldn't be as useful if you didn't get a chance to look over the code and reflect on it. We'll take a look at the parts and pieces, and I'll try to explain my thinking while creating it.

Disclaimer: It's been some time since I wrote this code, and I worked on it on and off, so I may have missed a thing or two.

The Approach

There are a lot of different ways to approach this problem. My overall solution is quite simple. Given a log file (or a sample file), the backend returns a list of objects, each containing the visitors, the number of unique visitors, and the hour they belong to. We want to show all the results for a given hour. Why this way? It's cleanest to let the front end deal only with presentation: it displays the data from the backend in a table and, of course, renders it in a useful graph.
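To make that shape concrete, one element of the backend's response might look like this (a hypothetical example — the field names follow the Request getters, and the IPs are made up):

```json
[
  {
    "visitors": ["192.168.0.1", "192.168.0.1", "10.0.0.2"],
    "date": "2015-10-10T13:00:00",
    "totalUniqueVisitors": 2
  }
]
```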

Structure

The application is built using Spring Boot. On the backend, we have a Spring application exposed via a REST web service. The front end, on the other hand, is a simple AngularJS application. With that out of the way, let's dig deeper into each part.

Back End

We'll start off with the entry point, the REST web service.

@RequestMapping(value = "access", method = RequestMethod.GET)
public ArrayList<Request> getRequestWithAccessLog() throws IOException{
    return new LogFileParser(ACCESS_LOG_FILE_PATH).getRequests();
}

You'll see in RequestResource.java that I initially created three entry points, but only one method, getRequestWithAccessLog, is used in the current live version. As with any resource, we don't want to add business logic here; it should simply be concerned with grabbing the result from another class that does the processing. Since this is just a kata, I've hard-coded a sample log file path. We return an ArrayList of Request objects because we want to expose the visitors for each hour.

private ArrayList<String> visitors; //IPAddress
private LocalDateTime date;

public Request(ArrayList<String> visitors, LocalDateTime date){
    this.visitors = visitors;
    this.date = date;
}

//getters and setters
public int getTotalUniqueVisitors() {
    return VisitorsCalculator.computeUniqueVisitors(visitors);
}

Let's look at the domain for a quick second. You'll notice that again we have a variable holding all the visitors. The date field is a container for the date, time and, most importantly, the hour of this request instance. Each Request is a snapshot of all the visitors within a specific hour on a specific date. We also call a util class to compute the total number of unique visitors.
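VisitorsCalculator itself isn't shown here, but counting unique visitors from a list of IP strings is essentially a one-liner with a HashSet. A minimal sketch (assuming the class just deduplicates the list):

```java
import java.util.HashSet;
import java.util.List;

public class VisitorsCalculator {

    // Deduplicate the IP list; the set's size is the unique-visitor count.
    public static int computeUniqueVisitors(List<String> visitors) {
        return new HashSet<>(visitors).size();
    }
}
```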

Processing the Log File

The bulk of the work lies in processing the log file and making sense of the information. We have the LogFileParser.java to handle this.

//...
private static final String TIMESTAMP_FORMAT = "dd/MMM/yyyy:HH:mm:ss Z";
private static final String SAMPLE_FILE_FILE_PATH = "/sample/sample.log";
private ArrayList<String> logFile;
private ArrayList<Request> requests;

//...
private void init(String filePath) throws IOException{
    logFile = new ArrayList<>();
    requests = new ArrayList<>();

    InputStream inputStream = getClass().getResourceAsStream(filePath);
    GrokUtil grokUtil = new GrokUtil();

    if (inputStream == null) {
        throw new IOException("File Not Found");
    } else {
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));

        try {
            String line;

            //Explained below.

        } catch (IOException ex) {
            throw new IOException(ex);
        } catch (GrokException e) {
            e.printStackTrace();
        }
    }
}

The other parts of this file are mostly self-explanatory, so let's just discuss the part that has logic and processing. The first section initializes the things we'll need. You'll also notice that in new ArrayList<>() we don't repeat the type parameter; that's the diamond operator introduced in Java 7. Moving on, we simply try to get the file and see if it's there. If it isn't, we throw a "File Not Found" IOException. We then read the file line by line.

while ((line = bufferedReader.readLine()) != null) {

	String json = grokUtil.parseApacheLogLine(line);
	ApacheAccess apacheAccess = mapJsonToApacheAccess(json);

	LocalDateTime dateTime = roundDownOnHour(apacheAccess.getTimestamp());
	Request existingRequest = findRequestWithDateTime(dateTime);

	if(existingRequest != null){
		existingRequest.getVisitors().add(apacheAccess.getClientip());
	}else{
		ArrayList<String> visitors = new ArrayList<>();
		visitors.add(apacheAccess.getClientip());

		Request request = new Request(visitors, dateTime);
		requests.add(request);
	}
}

Not Reinventing the Wheel

You may have asked, "What the heck is Grok?" Grok is a nifty little tool I discovered while trying to parse logs for another project. It's well described below.

Parse arbitrary text and structure it.

Grok is currently the best way in logstash to parse crappy unstructured log data into something structured and queryable.

This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption.

https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html

It's a really nice regex parser and library. Regex can be a headache. The cool thing about Grok is that there are patterns you can store and reuse. In this case, somebody already built the regex pattern for Nginx logs. Why bother writing it yourself when we can just leverage what someone has built? I used a Java implementation of Grok from here.
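To see what Grok saves us from, here's roughly the regex you'd otherwise hand-roll for a Common Log Format line. This is a simplified illustration, not the pattern Grok actually ships:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClfRegexDemo {

    // Simplified Common Log Format: client IP, ident, user,
    // [timestamp], "request line", status code, response size.
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    public static String clientIp(String line) {
        Matcher matcher = CLF.matcher(line);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

With Grok, all of that is hidden behind a named, pre-tested pattern, and the captured fields come back with names instead of group indexes.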

As you'll notice from the code snippet above, we let the utility class deal with extracting the useful information from each log line.

Since the format it returns is JSON, we import and use one of the most popular JSON-to-object mappers out there, Jackson. As the name mapJsonToApacheAccess implies, it transforms the JSON and puts the values into our domain object.
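A hedged sketch of what mapJsonToApacheAccess might boil down to. The ApacheAccess field names here are assumptions based on the getters used in the parser, and since Grok's JSON output carries more fields than the domain needs, unknown properties are ignored:

```java
import java.io.IOException;

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonMappingSketch {

    // Minimal stand-in for the real ApacheAccess domain class.
    public static class ApacheAccess {
        private String clientip;
        private String timestamp;

        public String getClientip() { return clientip; }
        public void setClientip(String clientip) { this.clientip = clientip; }
        public String getTimestamp() { return timestamp; }
        public void setTimestamp(String timestamp) { this.timestamp = timestamp; }
    }

    private static final ObjectMapper MAPPER = new ObjectMapper()
            // Grok emits more fields than our domain cares about.
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    public static ApacheAccess mapJsonToApacheAccess(String json) throws IOException {
        return MAPPER.readValue(json, ApacheAccess.class);
    }
}
```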

Continuing On

Now that we have the information we need from the log line and in a variable, we'll do some processing on it.

	LocalDateTime dateTime = roundDownOnHour(apacheAccess.getTimestamp());
	Request existingRequest = findRequestWithDateTime(dateTime);

	if(existingRequest != null){
		existingRequest.getVisitors().add(apacheAccess.getClientip());
	}else{
		ArrayList<String> visitors = new ArrayList<>();
		visitors.add(apacheAccess.getClientip());

		Request request = new Request(visitors, dateTime);
		requests.add(request);
	}

Again, we want to collect all the visitors in a given hour. We have the current ApacheAccess (containing the IP address and the time of access) in memory. We compute the hour they visited, rounding down since we only want to capture the hour. We then check whether a Request object (containing all the visitors for a given hour) already exists for this hour. If there isn't one, we create a new object, adding the visitor's IP address and the hour it happened. If there is an existing object for that hour, we just add the current IP address to its list.

That may sound a bit confusing at first, so take a moment to read through it. After all this, we end up with a list of Request objects (visitors in a given hour). Neat.

Once we have this, we simply hold all these Request objects in the requests ArrayList, which the REST service accesses via getRequests after initialization.
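The two helpers used in the loop aren't shown above. roundDownOnHour is the interesting one; given the TIMESTAMP_FORMAT constant in the parser, it plausibly looks something like the sketch below (assuming it takes the raw timestamp string — findRequestWithDateTime would then just be a linear scan over requests comparing dates):

```java
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Locale;

public class HourBucketSketch {

    // Same pattern as TIMESTAMP_FORMAT in LogFileParser,
    // e.g. "10/Oct/2015:13:55:36 +0800".
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Parse the log timestamp and drop minutes/seconds, so every request
    // within the same hour maps to the same bucket key.
    public static LocalDateTime roundDownOnHour(String timestamp) {
        return ZonedDateTime.parse(timestamp, TIMESTAMP_FORMAT)
                .toLocalDateTime()
                .truncatedTo(ChronoUnit.HOURS);
    }
}
```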

Front End

The front end is much simpler. We have an HTML page where we load the different CSS and JavaScript components. The important ones here are AngularJS and its controllers, Bootstrap, and Chart.js. We have a RequestFactory which simply calls our REST web service. The HomeController loads the data from the factory and initializes the chart based on that data.

A List of Improvements and Fixes

After working on this for some time (mostly on setting up the project), I saw some things I could improve or fix later on. I don't know if they're worth the effort; since this is a kata, I don't want to spend too long on it. It's more about challenging myself to see if I can solve a problem.

Todo

  • Mask the UI elements during AJAX loads
  • Adjust the graph size for mobile
  • Improve speed further
  • Feature: drop a text file and render a graph from it
  • Display/download the text file used in the graph

Issues

  • UI shows the wrong day of the month
  • Graph placeholder not hiding (Angular ng-hide)
  • Initially shows "no requests", but if you wait a second the data gets processed. Some
    issue with Angular's waiting and hiding not working properly.

Other Notes, Thoughts

Docker: I used Docker to deploy the live version of the kata. I wrote a tutorial on it here, if you're interested.

TDD: We don't use TDD at my current work, so any chance I get to use it is nice. I tried to go full Red-Green-Refactor during development. It's really nice to have this workflow, as I can always see if a change I made breaks the application. I still need more practice to get the hang of the flow, get used to the shortcuts, etc.

Bower: This is the first time I've used Bower from scratch; I'd only used it a little with JHipster. From my short experience, it's quite nice. It's interesting how you never realize something was a problem or a hassle until you get used to the better way (similar to Maven on the backend). I will probably continue to use it.

Angular Routing and Spring Boot Issues: AngularJS has a router, like most single-page application frameworks. Spring MVC has similar functionality. If you have both (which isn't always obvious with Spring Boot), you may be confused about why your page isn't loading in Angular. I might test this issue further and see what's a good way to resolve it.

Spring Boot: The problem with a heavily automated framework like Spring Boot is that it can be hard to figure out what is and isn't configured under the hood. In more traditional Spring, you can see exactly (via annotations or XML) how things are wired; in Boot, it's sometimes not immediately obvious. That said, I think Spring Boot is extremely cool. Having an embedded server makes it so easy to deploy and run applications.

I'm sure there will be cons if you go this route in full (via microservices), but then again, what approach doesn't have its cons? Microservices seem like the state that Service-Oriented Architecture aimed for (before becoming over-engineered, but that's a topic for another time).