PixelCarpet

The paper about the Pixel Carpet is one of the results from a collaboration between data visualization researchers from FHP and computer security engineers of various institutions. It builds on the observation that security engineers know their data and the requirements of their work very well. However, they might not be acquainted with advanced visualization techniques. Visualization researchers, on the other hand, know methods to visualize and analyze data but usually lack insight into the specific requirements of computer network security. The paper revolves around two main contributions:

  • results and learnings from a co-creative approach of jointly developing visualizations
  • a pixel oriented visualization technique that graphically represents multi-dimensional data sets (such as computer log files), reflecting ideas from the collaboration

You can get and read the full paper here (27 MB or 4 MB without video). Please feel free to comment on this post or contact us for any details.

Landstorfer, Herrmann, Stange, Dörk, Wettach (2014): Weaving a Carpet from Log Entries: A Network Security Visualization Built with Co-Creation. In: Visual Analytics Science and Technology (VAST), 2014 IEEE Conference on (to appear)

Co-creative Approach

User-centered approaches are well known in the visualization community (although not always implemented) [D'Amico et al. 2005, Munzner et al. 2009]. Jointly developing the visualizations themselves, however, is rather rare. As we have had very good experience with co-creative techniques in design and innovation, we wanted to apply them to the domain of data visualization as well. For example, we experimented with data sets during a day-long workshop with a larger group of stakeholders (a session we called the “data picnic” because everyone brought his/her own data and tools).

Visualization

For this paper, we focused on a pixel-oriented technique [Keim 2000] to fulfill requirements such as visualizing the raw data and giving a chronological view that preserves the course of events. We stack graphical representations of various parameters of a log line (such as IP, user name, request or message) into a small column per log line. Lining up these stacks produces a dense visual representation with distinct patterns, which is why we call it the Pixel Carpet. Other subgroups of our research group took different approaches, which can be found elsewhere in this blog.
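As an illustration of the stacking idea only (the actual demonstrator is plain HTML/JavaScript, see below), here is a minimal Python/matplotlib sketch that maps each categorical log field to a color and stacks the fields into one column per log line; the field names and the hash-based color mapping are assumptions:

```python
# Minimal sketch of the "multi pixel" stacking idea in Python/matplotlib
# (the real demonstrator is HTML/JavaScript; field names are examples).
import hashlib
import numpy as np
import matplotlib.pyplot as plt

log_lines = [
    {"ip": "10.0.0.1", "user": "alice", "request": "GET /index.html", "status": "200"},
    {"ip": "10.0.0.2", "user": "bob",   "request": "GET /admin",      "status": "403"},
    {"ip": "10.0.0.1", "user": "alice", "request": "GET /style.css",  "status": "200"},
]
fields = ["ip", "user", "request", "status"]

def field_color(value):
    """Map a categorical value to a stable RGB color via hashing."""
    digest = hashlib.md5(value.encode()).digest()
    return [b / 255 for b in digest[:3]]

# One column per log line, one row per field: a dense "carpet" of multi pixels.
carpet = np.array([[field_color(line[f]) for line in log_lines] for f in fields])

plt.imshow(carpet, aspect="auto", interpolation="nearest")
plt.yticks(range(len(fields)), fields)
plt.xlabel("log line (chronological)")
plt.show()
```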

Snapshot of the Pixel Carpet interface. Each “multi pixel” represents one log line, as it appears at the bottom of the screen.

Data and Code

Our data sources included an ssh log (~13,000 lines, unpublished for privacy reasons), an Apache (web server) access log (~145,000 lines, unpublished), and a further log of ~4,500 lines (raw data available, including countries from ip2geo: .csv | .json).
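A rough sketch of the preprocessing step in Python, assuming the Apache common log format (the actual logs may be formatted differently): each line is split into the fields that the carpet later displays.

```python
# Sketch: parsing one line of an Apache access log (common log format)
# into the fields used by the visualization; the exact format is an assumption.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '203.0.113.7 - frank [10/Oct/2013:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326'
match = LOG_PATTERN.match(line)
if match:
    entry = match.groupdict()
    # e.g. {'ip': '203.0.113.7', 'user': 'frank', 'time': '10/Oct/2013:13:55:36 +0200',
    #       'request': 'GET /index.html HTTP/1.1', 'status': '200', 'size': '2326'}
    print(entry)
```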

We implemented our ideas in a demonstrator in plain HTML/JavaScript (demo online – caution, it will heavily stress your CPU). It helped us iterate quickly and evaluate the idea at various stages, also with new stakeholders. While the code achieves what we need, we are aware that its computing performance is rather poor. If you want to take a look or even improve it, you can find it on GitHub.

To bring it closer to a production tool, we would turn the Pixel Carpet into a plugin for state-of-the-art data processing engines such as Elasticsearch/Kibana or Splunk (scriptable with d3.js since version 6).

Time Series Visualizations – An overview

“Time-series — sets of values changing over time”
A Tour Through the Visualization Zoo 
http://hci.stanford.edu/jheer/files/zoo/

This description of the term “time series” is very close to the definition in the Oxford dictionary, which adds that the term has a statistical background and that the intervals within a time series are often equal.
http://www.oxforddictionaries.com/definition/english/time-series?q=time-series

Within our research project we are mainly interested in the visualization part of this vast field of statistics. In his book “The Visual Display of Quantitative Information”, Edward Tufte describes time-series visualizations as follows:

“With one dimension marching along to the regular rhythm of seconds, minutes, hours, days, weeks, months, years, centuries, or millennia, the natural ordering of the time scale gives this design a strength and efficiency of interpretation found in no other graphic arrangement.” 
Edward R. Tufte
The Visual Display of Quantitative Information
p. 28

Classical datasets for time-series visualizations are temperature, wind, condensation (or any other kind of weather measurement), stock data, population change, electricity usage, etc. The field is so vast that Tufte cites a study of graphics published between 1974 and 1980 in which 75% of the graphics were time-series visualizations. Obviously, more than 30 years later the field has changed, but time series still seem to be an important part of the area.

In my opinion, most network security data does not initially provide values changing over time. Flow data, for example, is structured through nodes and edges with additional information. Such single incidents in time do not have the same characteristics as typical time-series datasets, where one value changes over time. But at a certain level of abstraction (for example by counting incidents within fixed time frames, as sketched below) or by combining time series with other methods like network visualizations, this kind of graphic could be very helpful for us.
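As a sketch of such an abstraction, the following pandas snippet counts events per fixed time window; the field names and the 15-minute window are assumptions:

```python
# Sketch: turning individual security events into a time series by counting
# events per fixed time window (field names and window size are assumptions).
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-08-01 10:00:03", "2013-08-01 10:00:41",
        "2013-08-01 10:07:12", "2013-08-01 10:31:55",
    ]),
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
})

# Count events per 15-minute bin -> a classic time series we can plot.
counts = events.set_index("timestamp").resample("15min").size()
print(counts)
```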

This article first summarises a few classical time-series examples and then looks at recent developments in the field.

The first known time-series visualization was created in the tenth or possibly eleventh century. It shows the changing positions of the planets, with time on the x-axis.

As we will see, using the x-axis is still the most common way of presenting time-series graphics. Nathan Yau gives an overview of the most common forms of time-series visualizations in his book “Data Points”: in his view bar graphs, line charts, dot plots, and dot-bar graphs. All of these charts are similar in what they do; the only difference is the graphical representation of the data. While all of them put the time dimension on the x-axis, Nathan Yau also gives two examples of different representation methods: radial plots, which are similar to line charts but circular, and calendar heat maps.

Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky from Stanford University give a different overview of time-series visualizations in their article “A Tour Through the Visualization Zoo”. Their overview starts with index charts, which are interactive line charts.

Index Chart

Stacked graphs are area charts stacked on top of each other; they are also called stream graphs. What makes them special is that they provide a visual summation of all time-series values.
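A minimal sketch of such a stacked graph with made-up data (matplotlib’s stackplot; baseline="wiggle" instead of "zero" would give the classic stream-graph shape):

```python
# Sketch: a stacked graph, where the top edge of the stack is the visual sum
# of all series; the data here is made up.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(24)                      # e.g. hours of one day
series_a = np.random.poisson(5, 24)    # e.g. requests from group A
series_b = np.random.poisson(3, 24)
series_c = np.random.poisson(2, 24)

plt.stackplot(t, series_a, series_b, series_c,
              labels=["A", "B", "C"], baseline="zero")
plt.legend(loc="upper left")
plt.xlabel("hour")
plt.ylabel("count (stacked = total)")
plt.show()
```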

The controversy around stacked graphs is considerable. Alberto Cairo, graphics director at El Mundo Online, wrote in a blog article that a stacked graph was “one of the worst graphics the New York Times have published – ever!” On the other hand, the authors of the first paper on stacked graphs wrote that they were “simplifying the user’s task of tracking individual themes through time by providing a continuous ‘flow’ from one time point to the next” and, furthermore, “we believe this metaphor is familiar and easy to understand and that it requires little cognitive effort to interpret the visualization”. Both points seem valid to me: the cognitive effort needed for some contemporary visualizations is so high that it becomes hard to understand them without investing a lot of effort. Stacked graphs are very simple to understand for the complexity they hold, but the information that can actually be extracted from them is questionable. Andy Kirk from visualisingdata.com credits both sides very fairly in his blog article about the graphs with these comments:

“… a streamgraph is a fantastic solution to displaying large data sets to a mass audience.”

“The main problem facing static streamgraphs lies in the difficulty of reading data points formed by uncommon shapes.”

Tools: D3, Processing

Papers: ThemeRiver: Visualizing Theme Changes over Time; Stacked Graphs – Geometry & Aesthetics

Examples: The Ebb and Flow of Movies, How Different Groups Spend Their Day, Trace (this one is about visualizing wireless networks)

 

Stacked Graph

Small multiples are multiple time-series graphs (what kind of graphs is another question; in this case, area charts) arranged within a grid. Small multiples are more useful for understanding each dataset on its own rather than as a summary, as opposed to stacked graphs.

Small Multiples
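A rough sketch of such a grid in Python/matplotlib; the host names and counts are made up:

```python
# Sketch of small multiples: the same kind of chart (here, area charts)
# repeated in a grid, one panel per series; the data is made up.
import numpy as np
import matplotlib.pyplot as plt

hosts = ["web-01", "web-02", "db-01", "mail-01"]          # hypothetical hosts
t = np.arange(48)                                          # e.g. half-hour bins
data = {h: np.random.poisson(5, len(t)) for h in hosts}

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)   # shared axes ease comparison
for ax, host in zip(axes.flat, hosts):
    ax.fill_between(t, data[host])
    ax.set_title(host, fontsize=8)
plt.tight_layout()
plt.show()
```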

The last example from the article are horizon graphs. These are essentially also area charts, in which the value range is sliced into bands that are mirrored and layered, distinguished by opacity. This is especially interesting in combination with small multiples, because the “data density” is much higher than with classic area charts, which leads to more information in a smaller space, an important factor when we are dealing with big datasets.

Horizon Graph

There is some interesting research about the usefulness of horizon graphs that I recommend: Tool, Paper, Article
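To make the band-slicing idea concrete, here is a rough sketch of a horizon graph in Python/matplotlib; the toy series, band count, and colors are my own assumptions, and real implementations differ in detail:

```python
# Sketch of the horizon-graph idea: slice the value range into bands,
# mirror negative values, and overlay the bands with increasing opacity.
import numpy as np
import matplotlib.pyplot as plt

values = np.sin(np.linspace(0, 6 * np.pi, 300)) * 3   # toy series in [-3, 3]
band_height = 1.0
n_bands = 3

fig, ax = plt.subplots(figsize=(8, 1))                 # very flat: high data density
for i in range(n_bands):
    lower = i * band_height
    # positive part of this band
    pos = np.clip(values - lower, 0, band_height)
    # negative part, mirrored upwards
    neg = np.clip(-values - lower, 0, band_height)
    ax.fill_between(range(len(values)), pos, color="steelblue", alpha=0.35)
    ax.fill_between(range(len(values)), neg, color="firebrick", alpha=0.35)
ax.set_yticks([])
plt.show()
```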

 

The list of graphics from the Stanford group is much more contemporary than the examples from Nathan Yau, but still all of these examples use the same mechanism to visualize time-series data: one axis serves as the dimension for time. This now more than 1,000-year-old way to visualize time is helpful and very common, but might not always be the best choice. As we know from scatterplot visualizations, the two spatial dimensions of a graphic are perhaps the most powerful ones for pattern recognition, and time might not be the main factor for identifying those patterns. So what other ways are there to use time as a dimension within a visualization, apart from space?

Animation:
At least since Hans Rosling’s famous TED talks, using animation to display time has been common, and it seems to be the most obvious way to visualize time: quite literally, through time. But the technique needs to be used with caution.
Tamara Munzner’s visualization principles (page 59) give great insight into why visualizing time with animation is dangerous:

Principle: external cognition vs. internal memory

  • easy to compare by moving eyes between side-by-side views – harder to compare a visible item to the memory of what you saw

Implications for animation

  • great for choreographed storytelling
  • great for transitions between two states
  • poor for many states with changes everywhere

There is also a paper about the topic which gives more insights into the problem.

Small multiples:
I already mentioned small multiples above, but as noted before, the idea behind small multiples is more a frame for visualizations than an actual kind of visualization. In the same way, we can use each multiple as a time frame. A beautiful example of small multiples with time as a dimension comes from the NYTimes graphics department.

Binning time in bubbles:
The idea here is to use bubble charts in which the time dimension is binned by minutes, days, years, etc. into one bubble per bin, and the bubbles are compared to each other. In the Nasdaq 100 Index example, each year is represented by one bubble.

Scatterplots:
Scatterplots in which time is displayed as connected points plotted against two variables. This is similar to the animation idea, but in this case the animated dots leave a path behind. Here, too, the NYTimes has a good example.
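A minimal sketch of such a connected scatterplot with made-up data, where each point is a year and the line traces the path through time:

```python
# Sketch of a connected scatterplot: two variables plotted against each other,
# with consecutive time steps connected so that time becomes the path itself.
import matplotlib.pyplot as plt

years    = [2008, 2009, 2010, 2011, 2012]      # made-up data
metric_x = [10, 12, 15, 14, 18]                # e.g. traffic volume
metric_y = [3.2, 2.8, 2.9, 3.5, 3.1]           # e.g. error rate

plt.plot(metric_x, metric_y, "-o")             # the path through time
for year, x, y in zip(years, metric_x, metric_y):
    plt.annotate(str(year), (x, y), textcoords="offset points", xytext=(5, 5))
plt.xlabel("variable 1")
plt.ylabel("variable 2")
plt.show()
```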

Raphael Marty on the need for more human eyes in sec monitoring

Raphael Marty spoke at the 2013 ACM conference on Knowledge Discovery and Data Mining (KDD’13). It is a very enlightening talk if you want to learn about the current status of visualization in computer network security and its core challenges. Ever-growing data traffic and persistent problems like false positives in automatic detection cause headaches for network engineers and analysts today, and Marty often admitted that he has no idea how to solve them. As he has worked for IBM, HP/ArcSight, and Splunk, some of the most prestigious companies in this area, this is likely not due to a lack of expertise.

Marty also generously provided the slides for his talk.

Some key points I took away:

Algorithms can’t cope with targeted or unknown attacks – monitoring needed

Today’s attacks are rarely massive or brute force, but targeted, sophisticated, more often nation-state sponsored, and “low and slow” (this is particularly important as it means you can’t look for the typical spikes that signal a mass event – you have to look at long-term issues).

Today’s automated tools find known threats and work with predefined patterns – they don’t find unknown attacks (0-days), and the more “heuristic” tools produce lots of false positives (i.e. they increase the workload for analysts instead of reducing it).

According to Gartner, automatic defense systems (prevention) will become entirely useless from 2020 on. Instead, you have to monitor and watch out for malicious behaviour (human eyes!); it won’t be solved automatically.

Some figures for current data amounts in a typical security monitoring setup:

[Slide from Marty’s talk: data amounts per detection technology]

So, if everything works out nicely, you still end up with 1000 (highly aggregated/abstracted) alerts that you have to investigate to find the one incident.

Some security data properties:

[Slide from Marty’s talk: security data properties]

Challenges with data mining methods

  • Anomaly detection – but how do you define “normal”? (see the sketch after this list)
  • Association rules – but data is sparse, there is little continuity in web traffic
  • Clustering – no good algorithms available (for categorical data, such as user names, IP addresses)
  • Classification – data is not consistent (e.g. machine names may change over time)
  • Summarization – disregards “low and slow” values, which are important
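To illustrate the anomaly detection point from the list above (this is my toy sketch, not Marty’s method): a rolling baseline with a z-score threshold catches a sudden spike in binned event counts, but would miss an attacker who stays “low and slow” just under the threshold; window sizes and thresholds are made up.

```python
# Toy illustration of the "what is normal?" problem: a rolling baseline with a
# z-score threshold flags a sudden spike, but a "low and slow" attacker who
# stays just under the threshold is never flagged. All parameters are made up.
import numpy as np
import pandas as pd

np.random.seed(0)
# ~normal traffic (Poisson around 10 events/hour) with one big spike of 60
counts = pd.Series(np.r_[np.random.poisson(10, 48), [60], np.random.poisson(10, 48)])

# baseline and spread from the *previous* 12 bins only
baseline = counts.shift(1).rolling(window=12, min_periods=3).mean()
spread   = counts.shift(1).rolling(window=12, min_periods=3).std()
z_scores = (counts - baseline) / spread.replace(0, 1.0)

alerts = counts[z_scores > 5]   # the single big spike is detected ...
print(alerts)                   # ... slowly adding one extra event per hour would not be
```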

How can visualization help?

  1. make algorithms at work transparent to the user
  2. empower human eyes for understanding, validation, exploration, because they bring
    • supreme pattern recognition
    • memory for contexts
    • intuition!
    • predictive capabilities

This is of course a to-do list for our work!

The need for more research

What is the optimal visualization?

– It depends very much on the data at hand and your objectives. But there is also very little research on that, which I am actually missing. E.g. what is a good visualization for firewall data?

And he even shares one of our core problems, the lack of realistic test data:

That’s hard. VAST has some good sets, or you can look for cooperation with companies.


Best in Big Data 2013: On the relevance of user interfaces for big data

 

Network traffic data becomes “big data” very quickly, given today’s transaction speeds and online data transfer volumes. Consequently, we attended the Best in Big Data congress in Frankfurt/Main to learn about big data approaches for our domain, but also for others.

 

[official pix not available yet]

Most of the presentations seemed to be made by big companies to sell to other big companies. The business value of big data, and how to deal with it in enterprise contexts, consumed most of the slides. In a couple of statements you could hear that big data technology is now well enough understood and widespread that the discussion can focus on use and business cases instead. Big data might also move away from IT departments and get closer to domain experts.

For my user experience perspective, I missed aspects like:

  • user interface: how do people get in touch with these vast amounts of data? Do they get automatically aggregated information? How and by whom are the aggregation methods defined? Do they use visualizations (this was naturally quite important to me)? Analysis tools? How are these different from the traditional ones?
  • use cases: although there were examples of how to put big data into practice, they were mostly presented on an architecture level, with few details on the user level and few output examples.
  • consumer perspective: the consumer was mostly an object of analysis, and little effort was visible to empower consumer decisions through big data. A 10-minute exception was Sabine Haase/Morgenpost, who presented the flight route radar. As far as I understood, this project did not use big data techniques very much. It appeared as if it was the “social project” that you need to include.

Haase was also one of only two women on stage, and she was even acting as a substitute for her male colleagues. There were a couple of women in the audience, but overall it appeared to be a rather masculine topic or event.

There are a couple of aspects that I find worth mentioning in detail:

Better user interfaces

Klaas Bollhoefer from The Unbelievable Machine and Stephan Thiel from StudioNAND made a passionate plea for taking the user interface for big data more seriously. At the moment, it is still the case that a lot of effort (and budget) is spent on data aggregation, storage, processing, etc. “With an additional 5,000 bucks we create some interface, at the end.” was a common attitude. Bollhoefer found this particularly ill-balanced and counterproductive for an effective use of information. Obviously, the decisive people in companies knew too little about visualization and design, and thought too little about the eventual users of such a system.


One important feature for analysis tools was direct manipulation of the data with an immediately updating visualisation (think of Bret Victor): this way, the user can try out various deviating values and play through a couple of “what if” scenarios, such as “if we get a higher conversion rate on our webshop, what would that mean for our profits?” This is something that even otherwise well-designed products such as Google Analytics don’t provide yet.
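As a rough sketch of what such direct manipulation can look like in a notebook today (my own example, not something shown at the congress; all numbers and the profit model are made up):

```python
# Minimal "what if" sketch (in a Jupyter/IPython notebook with ipywidgets):
# drag the slider and the projected margin updates immediately.
# All numbers and the profit model are made up for illustration.
from ipywidgets import interact

VISITORS_PER_MONTH = 50_000
MARGIN_PER_ORDER = 12.0          # Euro

@interact(conversion_rate=(0.5, 5.0, 0.1))
def projected_profit(conversion_rate=2.0):
    orders = VISITORS_PER_MONTH * conversion_rate / 100
    print(f"{orders:.0f} orders -> {orders * MARGIN_PER_ORDER:,.0f} EUR margin per month")
```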

Unfortunately, Klaas and Stephan hardly showed any examples of systems that work this way, from data visualization or other domains. I couldn’t agree more with their statements, but some more visuals would have made them far more compelling to the hardly design-literate audience.

 

From the exhibiting companies, splunk and tableau showed very promising tools that take many of these demands into account. splunk keeps you close to the “raw” data but provides a variety of mini-statistics and context tools that give the user a quick understanding of the data set and put her in control.

tableau, a spin-off of the Stanford visualization group, has a drag-and-drop interface for data manipulation and super quick access to a wide variety of visualizations to try and combine. Both companies stated that, thanks to their tools, they had found new insights in their clients’ data within hours.

 

Data ethics and privacy

Big data is keen on data, of course, so the collection and origins of this data might be a little off the radar. This was certainly true for the Best in Big Data congress. Unintentionally, a video by IBM raised these thoughts: it asked questions like “Do you know my style? Do you know what I’m buying?” Obviously, it wanted to make the case for more profiling of consumers by means of big data. But the questions went on like “Do you know that I tweet about you right now?” and ended in “Know me.”

“… powered by NSA”, commented Wolfgang Hackenberg, lawyer and member of the Steinbeis transfer center pvm. Despite some awareness of the privacy topic, his talk unfortunately didn’t get to the real dilemmas, let alone propose solutions. In a substantial talk/article from 2012, danah boyd pointed out that taking personal information and statements out of context very often already violates privacy per se: people make statements in contexts that they understand and find appropriate. If you remove or change the context, a statement might be embarrassing or otherwise open to misinterpretation. Big data collection methods tend to be highly susceptible to this offending behaviour – hence, people feel uneasy about it. Hackenberg admitted that he doesn’t want to be fully screened himself and that big data for personal information necessarily means the “transparent user”. But he also found the strict German and European legislation on privacy simply a burden in international competition for all companies in this domain.

One way could be to involve the “data sources” more in this process and offer them the results of the data analysis. But as I mentioned above, consumer-facing ideas were very rare. There is room for improvement.

 


A remarkable feature of the congress was the venue, the Frankfurt Waldstadion (soccer stadium): every break allowed the audience to step out of the room and enjoy the sun in the special atmosphere of the stadium’s stands: a big room for big thoughts.

 

 

 


IPython: interactive/self-documenting data analysis

IPython is an “interactive” framework for writing Python code. Code snippets can be run at the programmer’s will and the output is displayed right below the code. Together with rich input, from HTML markup to iframes, an entire workflow can be fully documented. This is very handy for learning, of course, but also for making a complex analysis of a computer incident available and transparent to later readers. As everything (documentation, code, output) gets “statically” saved in JSON, the documentation is even independent of the availability of data sources. (Note: there is also a special “Notebook viewer” available online, so the reader doesn’t have to know or have IPython her/himself.)

As a couple of powerful visualization and analysis libraries are available for Python (such as pandas), this is (almost) ideal for recording an analyst’s way to a result.
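To give an impression, a single cell of such a notebook analysis might look like this (the file name, column names, and aggregation are just assumptions):

```python
# What a single notebook cell in such an analysis might look like
# (file name and column names are assumptions).
import pandas as pd

log = pd.read_csv("access_log.csv", parse_dates=["timestamp"])

# Requests per hour, per status code. The output (table and plot) is stored
# in the notebook's JSON right below the cell, so the documented result
# survives even if access_log.csv is no longer available.
per_hour = (log.set_index("timestamp")
               .groupby("status")
               .resample("1h")
               .size()
               .unstack(0, fill_value=0))
per_hour.plot()
```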

Ideas for improvement:

  1. Make it even more interactive/auto-updating, so that changes in one place (“cell”) show up in other places at once (maybe even work with real-time sources?) – maybe moving towards frameworks like puredata/MAX: this would help explore various parameters of the analysis functions.
  2. Think about some auto-recording functions so that documentation becomes easier and the “author” has to think less about it. This might be especially feasible in the narrow context of network security analysis, where certain procedures are standardized or very common.

See how it works, e.g. with PCAPs (German).

Thanks to Genua, who recorded their internal training so well and shared it so generously!


GED VIZ / Boris Müller & Raureif / Bertelsmann Foundation / 2013

GED VIZ is an HTML5 visualization of economic and demographic relations between countries, shown as network relations. It is highly customizable: different datasets, all countries worldwide, and a timeline can be selected. The customized graphic can be exported and used externally.

 

GED VIZ


512 Paths to the White House / Mike Bostock / 2012

Mike Bostock and Shan Carter’s 512 Paths to the White House shows all possible paths to victory for the two 2012 US presidential candidates, Mitt Romney and Barack Obama.

512 Paths to the White House


Every Day Of My Life / Marcin Ignac / 2010

Marcin Ignac’s visualization “Every Day of My Life” is a static poster visualizing his computer usage statistics from the last 2.5 years. Each line represents one day. Colored areas represent different applications, while black represents the time his computer was turned off. Through this, his sleeping patterns, coffee breaks, and sleepless nights in front of the computer become visible.

Every Day Of My Life


Scatterplot Matrix / Mike Bostock / 2013

The scatterplot matrix visualization from Mike Bostock plots each variable of the dataset against every other variable. By brushing a range within one cell, the selected data points get highlighted in every cell.

Scatterplot Matrix
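A static (non-brushable) counterpart can be produced with pandas in a few lines; the columns here are made up for illustration:

```python
# Sketch: a static scatterplot matrix of the same kind, via pandas;
# the column names and random data are made up.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.DataFrame(np.random.randn(200, 4),
                  columns=["bytes_in", "bytes_out", "duration", "packets"])
scatter_matrix(df, diagonal="hist", alpha=0.5)
```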

 


 

Map your moves / Moritz Stefaner / 2010

Map Your Moves represents more than 4,000 moves into and out of New York, reported by over 1,700 people. Each circle represents one zip code in the New York area; its size represents the number of people moving from or to it. Red represents people moving into the city, blue represents people moving out of New York.

Map your moves
