We are constantly told that everything is better with data. Data provides an unseen hand that guides us behind the veil of technology.
When was the last time we went shopping or found an interesting video online without assistance from recommendation algorithms? Data is everywhere! And based on some meetings I attend, I am left with the impression that data is often treated as a mystical force that surrounds us. However, have we improved much since the olden times when we tried to turn lead into gold in laboratories? Today we work in labs and try to turn data into gold, sometimes more and sometimes less methodically.
We decide on which outcomes we want to achieve and look for key performance indicators to steer our journey to success. Then we extract, transform, load, and pipeline data from silos to lakes aiming to create systems that work on a feedback loop to improve themselves. All this is to gradually reduce the impact of the unknown risk that is the human factor in these equations.
And indeed, we are flawed. Humans are biased, subjective, and sometimes motivated by forces that should not be a part of the decision process. Originally, a cluster of experts formulated a vision and a plan for their respective areas. Now, data is the arbiter of success, fed into a simulation: effectively, a machine that learns, placed at the center of all things. A machine that works on numbers and is factually correct in its divinations. We are unburdened of our responsibility, as we can now be led by scientific fact instead of following our gut feelings.
Of course, we would never relinquish control that easily. After all, we select which data to input, which algorithms are used, and ultimately, we are informed in our decisions by data.
But I found that this is not necessarily true.
We started taking a more serious look at utilizing data roughly a decade ago. We felt that we had good insight into what happens in a computer network, given that we build the boxes that comprise computer networks. Almost every enterprise network device provides data points, from powerful core routers to edge switches and access points (APs). The prevailing assumption has been that value will look back at you if you look long enough at all the data.
It did not.
While it is useful and necessary to know what data is available, just poking around and waiting for patterns to emerge has never seemed to work out for us. Maybe we are not very good at harvesting use cases from the data by itself, but I am sure that we would and should have found more in all that time.
One reason could be that network data simply does not carry a lot of value beyond its own area of application. However, we have found use cases that were network-adjacent. Essentially, we used the data extracted from networks as the glue to tie different data stores together. What can we conclude from this? We learned that the use case must be defined first, and then we can look at data to build a solution that provides true value.
We now have an interesting problem - if data itself does not easily convey its value and we must start with a use case, how objective and factually correct is the decision-making process afterward? Obviously, you must be smart when selecting which data to look at, how to filter it, and how to measure success. But all of this leaves enough room for human misinterpretation.
I recently saw an interesting video by a German outlet that is part of the public broadcast service - Jonas Ems, Gnu, LeFloid: Macht Social Media krank? | STRG_F. Their video highlighted how stressful it is to produce content. One of the contributing factors to the stress is the data that comes back from YouTube about existing content.
Most of us think that more is better when it comes to feedback for adjusting your content for maximum reach and success – I know I would. However, if we take a closer look, we see a systematic loop emerge that appears to be quite toxic for the human participants in this process. As shown in Figure 1, this loop starts with publishing content and cycles to content adjustments based on consumption data and algorithms.
Figure 1 - Systematic data loop
Have you ever had this weird déjà vu where you see a video thumbnail on YouTube and could have sworn that you saw it with a different title? This happens because the uploader adjusted the title to attract more viewers. For similar reasons, many videos use click-bait titles like “I am quitting!” or “It is finally over!” – but then the viewers discover that nothing this shocking transpires in the video. The truth is that the analytics provide feedback that these bombastic titles work, so they are employed to make people click on tiny pictures.
My point is not to criticize the best practices of a successful channel operation on a video platform – instead, we should focus on what is driving this behavior. As depicted in Figure 1, the cycle includes a major black box – the algorithm, which might prove problematic.
It is inherently a good idea to optimize your output for maximum effect. A content creator might start crafting a better intro because the system tells them that a fair share of viewers doesn’t make it past the first five seconds of the video. When viewers drop off that early, the algorithm ranks the clip lower, which in turn reduces potential viewership. This leads to an unending cycle of optimizations, which might be fine for a machine but is bad for human beings.
I can go so far as to make the case that content producers are data-driven companies. They are creating products for a market that directly feeds back into the content creation process based on raw numbers. There is not much difference between an application developer doing A/B testing to verify the impact of a particular feature versus changing the title of a YouTube video to increase user engagement.
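To make the comparison concrete, here is a minimal sketch of the kind of check that sits behind both an app feature test and a title swap: a two-proportion z-test on click-through rates. The function name and the click and impression counts are invented for illustration; they are not taken from YouTube or from any real analytics tool.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate.

    Uses the pooled-proportion normal approximation, which is only
    reasonable when both variants have plenty of impressions.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: the same video shown with two different titles.
z, p = two_proportion_z_test(clicks_a=400, views_a=10_000,
                             clicks_b=460, views_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says that title B attracted more clicks than chance alone would explain. It says nothing about whether clicks were the right thing to optimize for in the first place.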
You might nod and smile while leaning back and thinking that this could never happen to your business. After all, you make the business decisions, and ideally, data is just there to guide your decisions. But take a moment to consider how you use the data. It is very tempting to argue that a specific product should be discontinued because the forecast numbers are going down. More importantly, it feels like the right decision to make, as you can back it up with facts. And even if it does not lead to the expected results, nobody is to blame, as the decision was made with the best intentions. Yet, somebody might be to blame if the system providing the data is looking at information that is accurate but provides the wrong foundation for the decision to be made.
I want to end this blog on a lighter note by offering you a path to solve this problem. However, I won’t, because I don’t have a good answer.
At Extreme Networks, we try to mitigate this problem by first defining the use case, and then looking at the data. We regularly re-evaluate data sets and ask the most critical question, “Is there a causal connection or just a correlation between the input data and the desired outcome?”
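The causation-versus-correlation question can be illustrated with a small simulation. This is a sketch under invented assumptions, not a description of any real data set or of how we evaluate data internally: a hidden factor (here called `users`) drives two hypothetical metrics, `traffic` and `tickets`, which then correlate strongly even though neither influences the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, zs):
    """What is left of ys after removing a least-squares linear fit on zs."""
    n = len(zs)
    mean_z, mean_y = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for z, y in zip(zs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical hidden driver: how many users are connected.
users = [rng.gauss(0, 1) for _ in range(n)]
# Both metrics depend on the hidden driver, but not on each other.
traffic = [u + rng.gauss(0, 0.5) for u in users]
tickets = [u + rng.gauss(0, 0.5) for u in users]

raw = pearson(traffic, tickets)
# Partial correlation: correlate what remains after accounting for users.
controlled = pearson(residuals(traffic, users), residuals(tickets, users))

print(f"raw correlation:        {raw:.2f}")
print(f"controlled for users:   {controlled:.2f}")
```

In this setup the raw correlation is strong, while the controlled value lands close to zero. In practice the hidden driver is rarely sitting in your data set with a convenient label, which is exactly why the question has to be asked by a person.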
Becoming a data-driven company is not a single step that magically transforms your business into the digital age. It is an ongoing process that does offer enormous benefits by streamlining the decision process through automation. But be aware that this decision process can swing in both directions. If the input data is bad, but the process is correct, a remedy usually becomes apparent in the short term. Should the input data be good, but the system behind it flawed, automation might lead you down the wrong path, maybe even to the point of no return.
There are certainly people who prefer to follow their gut instead of an automated data pipeline, and I don’t want to go so far as to say that one is better than the other. I do think that it is generally a good idea to reserve the same skepticism for both. After all, you wouldn’t blindly follow the results of a single study or paper without looking at the methods employed, would you?