In an era of big data and technological advancements in machine learning, is this dream finally a reality?
To find out, we charged today’s guest, Johnna Verry, Intern at Pivot Point Security, with putting machine learning to the test to see if it can really be the breakthrough we need in predictive security. She joins me to share the results.
In this episode, we discuss:
- The challenge of — and tools necessary for — scraping and cleaning data for use in machine learning
- The types of machine learning algorithms and how they work
- The results of Johnna’s research and what they mean for the future
To hear this episode, and many more like it, you can subscribe to The Virtual CISO Podcast here.
If you don’t use Apple Podcasts, you can find all our episodes here.
Time-Stamped Transcript
This transcript was generated primarily by an automated voice recognition tool. Although the tool is 99% accurate, you may find some small discrepancies between the written content and the native audio file.
Narrator (Intro/Outro) (00:06):
You’re listening to The Virtual CISO Podcast, a frank discussion providing the best information security advice and insights for security, IT and business leaders. If you’re looking for no BS answers to your biggest security questions or simply want to stay informed and proactive, welcome to the show.
John Verry (00:25):
Hey there. And welcome to yet another episode of The Virtual CISO Podcast. With you as always your host John Verry and with me today, Ms. Johnna Verry, Johnna thanks for coming on today.
Johnna Verry (00:36):
Thanks for having me.
John Verry (00:37):
Cool.
Johnna Verry (00:38):
I’m sure you can recognize the last name a little.
John Verry (00:42):
I noticed that the last name has this strikingly similar spelling to my last name. I guess we come somewhere from far enough back in the same genetic tree or something. It’s a German name if I’m not mistaken, you might have some German in your background Johnna?
Johnna Verry (00:55):
I don’t know, actually, I think it’s more Italian but you never know.
John Verry (01:01):
The Italian ruined all that was good German genes. All right. So I always like to start easy. Tell us a little bit about who you are and what is it that you do every day?
Johnna Verry (01:10):
So my name is Johnna Verry. I’m an undergraduate engineering student at James Madison University in Harrisonburg, Virginia. I study general engineering, although my capstone is in a concentration of almost biomedical, and this summer and last summer I have been an intern at Pivot Point Security.
John Verry (01:32):
Cool. Thank you for that. And before we get down to business, got to know what’s your drink of choice?
Johnna Verry (01:38):
So if there are any future employers listening to the podcast, I do not drink alcohol, although I am 21. We got to get up at 6:00 every morning and get to work, alcohol is not your favorite choice. But if I was having an alcoholic drink on a weekend, sours or goses are probably my favorite.
John Verry (02:01):
I’ve always disliked sours, but I have somebody in my family who started to drink sours and has started to turn me on to a couple of them. So I’m getting a little bit more flexible, I’m willing to dabble out of the Weiss beers and out of the stouts, so interesting. All right. So let’s get down to business. The reason why I asked you to be on the podcast is, as most people know, machine learning, or if you want to call it big data, data analytics, artificial intelligence, these are really key emerging technologies in many fields, including the field of information security. So the proof of concept that we asked you to work on stems from this: we act as the virtual CISO for an organization that has a presence in all 50 states in the nation.
John Verry (02:48):
And as we’ve taken on the responsibility to manage the security and the risk associated with each of the 50 affiliates of the entity, we were beginning to wonder how could we possibly enhance our visibility into potential attacks against this organization or against any organization for that matter. So that was really if you think back to eight weeks ago, 10 weeks ago, 12 weeks ago, whatever it was, that was the first charge that you got. So we didn’t give you a lot more than that, we kind of threw you in the deep end of the pool to be honest with you. So where did you start? That’s a big task. How does somebody go about breaking that down and starting that process?
Johnna Verry (03:28):
So I knew we started with the how might we, which is how almost any good engineering project starts. And it was how might we move security from reactive to proactive. So with that whole idea, we wanted to get data from Threat Exchange and from information sharing and analysis centers, but unfortunately they didn’t have the API to do that. So from the start, it was, “I have no idea how to get this data.” And machine learning is all about data, 80% of it is data and preprocessing and cleaning. So not being able to get data easily is definitely a big challenge to start with. So luckily, I came into some really cool YouTubers and used different resources online and figured that I was going to have to do web scraping for this, which was definitely a lot to hear, having never web scraped before.
John Verry (04:27):
Real quick for someone who might not be familiar, when you say web scrape, what does that mean?
Johnna Verry (04:32):
So to web scrape, I used a Python package; the whole project was done in Python. The Python package that I used is Beautiful Soup. And basically what Beautiful Soup and web scraping is, is you are giving Python an HTML web URL and you are saying, “Go to this webpage and get this data that I want… or these numbers, this table.” So I was taking a table, let’s say, and then you’re parsing that data and then you’re pulling it and putting it into a DataFrame using, let’s say, another package like pandas.
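As a rough illustration of the step described here, the sketch below parses one HTML table with Beautiful Soup and loads it into a pandas DataFrame. The HTML snippet, the table id, and the column names are made up for the example; they are not the data or the pages used in the project, and a real scraper would fetch the page from a URL first.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a page that would normally be
# downloaded from a URL before parsing.
HTML = """
<table id="incidents">
  <tr><th>date</th><th>state</th><th>incidents</th></tr>
  <tr><td>2021-06-01</td><td>CA</td><td>7</td></tr>
  <tr><td>2021-06-02</td><td>NJ</td><td>9</td></tr>
</table>
"""


def table_to_dataframe(html: str, table_id: str) -> pd.DataFrame:
    """Parse one HTML table into a DataFrame whose cells are strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    rows = table.find_all("tr")
    # The first row holds the column headers, the rest hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


df = table_to_dataframe(HTML, "incidents")
print(df)
```

Every value comes back as text at this point, which is one reason the cleaning step discussed later in the conversation matters.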
John Verry (05:07):
Cool. Go ahead.
Johnna Verry (05:10):
So I was just going to say that I ended up having to go the web scraping route just because that data wasn’t readily available for download.
John Verry (05:16):
Gotcha. How did you end up on Python? Because when we talk about big data you hear things like R, and there are other types of tools that you might use. Was it because you hit that Beautiful Soup right away and then you started to think, “Can I use this? Does it plug in nicely to the other things I need?”
Johnna Verry (05:35):
So last summer I had looked at the feasibility of creating a PDF scraper for Pivot Point. And for that, I had downloaded PyCharm, and from reading a lot about not even machine learning but just coding itself, they often talk about how Python is a very easy language to learn, first off, very intuitive. Almost like writing sentences or fragments that are broken English in order to write the code. And so the nice thing about using Python, especially with the IDE that I was using, is it’s all very integrated. So it’s very easy to pull packages such as Beautiful Soup or pandas or other ones that I might talk about later on in this call and start working with them right away.
John Verry (06:21):
Cool. Now you’ve talked about pandas and I’ve played around a little bit personally with Python, just for fun, which sounds stupid but it’s kind of a cool language. And the whole idea of being able to do some of the things that you’re talking about doing is really cool. So pandas is a really popular package. What made you pick pandas? What does pandas bring to the table?
Johnna Verry (06:39):
So pandas does a lot with just working with data itself, so creating DataFrames and being able to manipulate those DataFrames. So for example, I could create a DataFrame using pandas and then be able to say, “Hey, pandas, do a .iloc,” which just means it’s an index lookup, almost like a, I guess, maybe-
John Verry (06:58):
VLOOKUP.
Johnna Verry (06:58):
… VLOOKUP in itself, which I don’t know how to do.
John Verry (07:03):
Which we just lost everybody. So we are not going to talk about VLOOKUPS anymore in the podcast, we promise don’t turn us off.
Johnna Verry (07:09):
But it’s a lot of dot functions where you can easily say, “Hey, find me any rows with the name John and then drop them.”
John Verry (07:20):
That’s a little harsh, wasn’t it?
Johnna Verry (07:28):
But you get the point.
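A minimal sketch of the two pandas operations mentioned in this exchange: a positional lookup with `.iloc` and dropping every row that matches a name. The DataFrame contents are invented for the example.

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
df = pd.DataFrame(
    {"name": ["John", "Johnna", "Alex"], "incidents": [3, 7, 5]}
)

# .iloc selects by integer position: row 1, column 0.
second_name = df.iloc[1, 0]

# A boolean mask keeps every row whose name is not "John".
without_john = df[df["name"] != "John"].reset_index(drop=True)

print(second_name)
print(without_john)
```

The original `df` is left unchanged; the filtered result is a new DataFrame.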
John Verry (07:29):
Unfortunately, I got the point. Thank you. Sorry. So let me see if I understand this. So we know we need to get this data, we know we need to get a lot of data. You figure out that you’re going to use Beautiful Soup to go and get the data, because you don’t have an application programming interface that’s elegant to retrieve the data in a good structure, an optimal structure. We get this data, so now we have the data, you used Beautiful Soup, you ended up in a pandas DataFrame and now we have the data and we have the understanding of where we’re going. Do I have that all right so far?
Johnna Verry (08:00):
So unfortunately we don’t have the data. I ran into a little bit more trouble where… So Beautiful Soup is really cool and the reason why they call it Beautiful Soup is not because it was, “What weirdest name could we pull?” Unfortunately, it has nothing to do with soup, but what is the soup is the table you want to take it to, what you’re telling the program to look at. So my soup in this case, if I wanted a table off a webpage, would be, “Hey, the soup is this table,” titled maybe The Virtual CISO Podcast. And so, unfortunately, Beautiful Soup only goes as far as what is on the original web URL you give it. So if you need to go to the next page in order to get more data, it cannot be told to do that. So where you could go from there, which is what I ended up doing, is a package, which is really cool, called Selenium, which actually implements a web driver and controls the computer so that you could tell it, “Click to the next page, click this, and take this data and continually do that.”
John Verry (09:08):
Gotcha. So if that table extended across multiple pages you would actually literally be able to programmatically click the next button and capture that next piece of data.
Johnna Verry (09:18):
Yep. And keep doing it so you could run loops and have it keep doing it.
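The loop described here could look roughly like the sketch below. It assumes the site exposes a link whose visible text is "Next"; that label, the function names, and the page limit are illustrative choices, not details from the project. The Selenium import is deferred into `make_driver` so the paging logic can be read on its own, and a real page would usually also need an explicit wait after each click.

```python
# Same string value as selenium.webdriver.common.by.By.LINK_TEXT.
LINK_TEXT = "link text"


def make_driver():
    """Start a Chrome session (requires Selenium and a local Chrome install)."""
    from selenium import webdriver

    return webdriver.Chrome()


def collect_pages(driver, start_url, next_label="Next", max_pages=50):
    """Return the HTML of each page, following the 'next' link until it is gone."""
    driver.get(start_url)
    pages = []
    for _ in range(max_pages):
        pages.append(driver.page_source)
        # find_elements returns an empty list when nothing matches.
        links = driver.find_elements(LINK_TEXT, next_label)
        if not links:
            break
        links[0].click()
    return pages
```

A typical call would be `driver = make_driver()`, then `pages = collect_pages(driver, url)`, then `driver.quit()`; each returned HTML string can then be handed to Beautiful Soup for parsing.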
John Verry (09:21):
Cool. Just curious can Selenium log into a page? So could you do this if this was a site that you had needed paid access to as an example?
Johnna Verry (09:31):
So Selenium can type as well. So as easy as, “Hey, go to Google and type in the search bar Pivot Point Security and then click the link that includes Virtual CISO in the title.”
John Verry (09:45):
Really. That’s pretty cool.
Johnna Verry (09:48):
Yes. Pretty cool tool.
John Verry (09:51):
You have me thinking about how we could use that for like vendor risk management and vendor due diligence, where you’d want to say, “Search for this company and tell me if there’s any reputational challenges or tell me if they’ve been involved in a breach or something of that nature.” So it sounds like I could programmatically do that with a tool like that.
Johnna Verry (10:06):
Yeah.
John Verry (10:07):
Oh, that’s cool. That’s cool. All right. So correct me if I’m wrong.
Johnna Verry (10:11):
So now we have the data.
John Verry (10:11):
Now we have the data. So now we have the data, which I understand how you got. And now you have your objectives, what we were trying to accomplish. So what’s next? Are we starting to attack the machine learning component and if so what did that process look like?
Johnna Verry (10:25):
So now that you have the data, what logically would be next is the machine learning. Although, what is interesting about when they teach you machine learning is they don’t really teach you cleaning the data or preprocessing data, which is maybe 80% of the whole machine learning process. So that definitely took a long time figuring out what are we looking for when we are pre-processing data, what is clean data? And definitely a challenge of machine learning itself is having clean data, finding clean data.
John Verry (11:01):
Gotcha. So when we say clean data we’re talking about things where let’s say that we picked eight attributes of security information that we were looking at maybe geographic location, type of attack, things of that nature. Is it that sometimes we don’t have data in a field would be one form of “non clean data?” Would the other one be maybe that someone refers to something as malware and someone refers to something as ransomware although it’s the same thing, so that data normalization might need to take place. Are those the types of things when you say dirty data that you’re referring to, and if so, are those the two main ones and are there other types that you had to address?
Johnna Verry (11:39):
So those are definitely big ones: null values, and data that, when it’s transferred over into a DataFrame, just does not transfer how you’d want it to.
John Verry (11:49):
Would it be a data type like something’s a float versus, I don’t know, a text or normal…
Johnna Verry (11:56):
When you’re running machine learning, if it’s not a classification type of machine learning, it needs to be all floats because if you’re thinking about it, you’re working with numbers. So if something’s an object, the program’s going to have no idea what to do with it in terms of numerical processing.
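The cleaning issues raised in this exchange (missing values, inconsistent labels, and text where numbers are expected) can be sketched with pandas as follows. The column names and values are invented, and dropping incomplete rows is just one possible policy; filling them in is another.

```python
import pandas as pd

# Made-up scraped rows: inconsistent labels, a missing label, a non-numeric count.
raw = pd.DataFrame(
    {
        "attack_type": ["Malware", "ransomware ", None, "malware"],
        "incidents": ["7", "9", "n/a", "4"],
    }
)

clean = raw.copy()
# Normalize label spelling so "Malware" and "malware" count as one category.
clean["attack_type"] = clean["attack_type"].str.strip().str.lower()
# Convert text to floats; anything unparseable becomes NaN.
clean["incidents"] = pd.to_numeric(clean["incidents"], errors="coerce")
# Drop rows that still have a missing value.
clean = clean.dropna().reset_index(drop=True)

print(clean)
print(clean.dtypes)
```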
John Verry (12:16):
Okay. So let me see if I understand it. So I know one of the things that we were looking at was we were getting these alerts from the Open Threat Exchange or from the ISAC and there was a geographic location associated with them. Let’s use states, California and Hawaii or Texas or whatever it would be. So you’re saying that I can’t machine learn using California or CA, you have to convert those text values into a numeric value to be able to use them?
Johnna Verry (12:45):
So what you’d actually do with that, so let’s say I wanted to have all 50 states in my code. The two ways of doing it in Python that I worked with were pandas get_dummies, which is getting dummy variables for those states, or one-hot encoding, which is just a fancy term for encoding those states as numbers. So maybe California would be 001 while New Jersey would be 010. And those numbers themselves have no weight, as in one doesn’t mean it’s better than something else; it’s no weighting system. What it does is it allows you to identify; it’s an identification and encoding process.
John Verry (13:32):
Gotcha. So it just creates a unique identifier. The absolute number in and of itself is not all that significant.
Johnna Verry (13:39):
It’s always a combination of zeros and ones.
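A small sketch of the encoding step using `pandas.get_dummies`. The states and counts are illustrative; each state becomes its own 0/1 column, so no state is ranked above another.

```python
import pandas as pd

# Made-up alerts with a text-valued state column.
alerts = pd.DataFrame(
    {"state": ["CA", "NJ", "TX", "CA"], "incidents": [7, 9, 4, 6]}
)

# One indicator column per state; dtype=float keeps everything numeric.
encoded = pd.get_dummies(alerts, columns=["state"], dtype=float)

print(encoded)
```

Each row has exactly one state column set to 1, which matches the "combination of zeros and ones" description.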
John Verry (13:41):
Gotcha. All right, cool. So now are we ready? I’m getting tired already and I don’t feel like we got to the machine learning stuff yet.
Johnna Verry (13:53):
Now we’re at machine learning. Well, I started with machine learning, but unfortunately I only got so far because you learn machine learning and then none of it ran. I got error after error.
John Verry (14:02):
So basically what you’re doing is you’re lying, you’re creating a false-
Johnna Verry (14:06):
This actually was in my process. I’m going to be quite honest. My process was learning-
John Verry (14:12):
I’m going to jump the machine learning, if it doesn’t work, “Oh, crap. Let me go backwards.”
Johnna Verry (14:16):
All of the red errors were very welcoming to learning machine learning, but no, honestly, the first step of machine learning is there’s two major types of machine… There’s three really, neural networks get into machine learning too, but the two types that I originally was looking at were supervised versus unsupervised machine learning. So the first step was honestly, which one would this project be? Because the process of doing both is different.
John Verry (14:46):
All right. So I’ll bite what’s supervised versus unsupervised learning?
Johnna Verry (14:48):
So I ended up going with a supervised for this project but supervised versus unsupervised, in supervised machine learning I’m giving you a dataset and I’m almost giving you the answer. So I’m saying all of these factors, so they’re called descriptive features. So my descriptive features in this case might be political, technical factors that would cause a cyber attack to occur. And then I have a target feature, which would be maybe the number of cyber incidents or something along that line.
Johnna Verry (15:21):
And so when I’m using supervised machine learning, I’m running a dataset through it and it originally has the answer. It knows that these descriptive features link to this target feature. And you’re running through training and test sets in order to be able to eventually predict it on its own, but it knows the fundamental combination that gets it to the target. Meanwhile, in unsupervised, you are giving it the descriptive features and all of that kind of stuff and you’re saying, “Figure out the pattern. Continually work with it and figure out the pattern without really me telling you.”
John Verry (16:00):
In the latter one, the way I think of machine learning, and I know nothing about the field. The way I think machine learning is I’ve got these inputs or attributes if we can call them that, and I know that they somehow positively or negatively influence an outcome. And I would have thought that I would create a giant set of correlating data inputs and outputs, I would feed that to a machine and the machine would look at that and say, “Hey, after I’ve analyzed a million of these inputs and outputs, I can tell you that these inputs are the most predictive of what an output is and that way I can put in the next day, I can put the inputs in and before we know the output, it’s going to try to predict it.” So it sounds more like that’s the first one.
Johnna Verry (16:48):
Supervised.
John Verry (16:48):
Okay. So in the unsupervised, what’s different there? If we don’t ever tell it what the outcome was, how would it get us there?
Johnna Verry (16:58):
So in supervised it has labeled input and output data, while unsupervised does not. So it works on its own in order to discover an inherent pattern within the data.
John Verry (17:11):
That’s cool.
Johnna Verry (17:12):
And in the other sense, you’re kind of saying, “Hey, here’s the pattern-” We know that these numbers amount to these values so that it can more easily figure out the coefficients that get you there, but unsupervised, it runs it a million times in order to figure out what is the inherent pattern to the data.
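To make the contrast concrete, the sketch below fits one supervised model (linear regression, which is given the target values) and one unsupervised model (k-means clustering, which sees only the features). The data is synthetic, and k-means is used only as a familiar example of unsupervised learning; the conversation does not say which unsupervised method, if any, was tried.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic descriptive feature with two obvious groups of values.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
# Made-up labeled target for the supervised case.
y = 2.0 * X[:, 0] + 1.0

# Supervised: the model is shown both X and the known answers y.
supervised = LinearRegression().fit(X, y)

# Unsupervised: the model is shown X only and must find structure itself.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(supervised.coef_, supervised.intercept_)
print(unsupervised.labels_)
```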
John Verry (17:31):
Oh, that’s cool. So I know that and you use the term coefficients there and I know that we… I shouldn’t be leaning back. I know that you at one point when you were explaining this to me, you wrote down on a piece of paper and it was coefficient times the one attribute plus another coefficient times another attribute. And it was almost like a, I’m not sure if it was a polynomial equation, if I could…
Johnna Verry (17:54):
Almost like a slope.
John Verry (17:55):
y = mx +…
Johnna Verry (17:56):
y = mx + b.
John Verry (17:59):
So realistically, at the end of the day, what machine learning is doing is creating an equation. It’s figuring out the correlation values for each of those attributes in the proper weightings and it’s literally a mathematical equation that creates the answer?
Johnna Verry (18:13):
I did skip over a little bit of something again.
John Verry (18:18):
Because I’m not that smart and you realize I wouldn’t have understood it. Maybe the people listening will-
Johnna Verry (18:20):
I’m so sorry.
John Verry (18:20):
… so you can do it.
Johnna Verry (18:24):
There are also two types, besides supervised and unsupervised: classification and regression machine learning. So this project entailed regression, which is basically multivariable linear regression, which is exactly what we were talking about, where it’s y = mx + b with a lot of different independent variables that give you a dependent variable. And then there’s classification, which is kind of the same process as if I give you a bunch of images of animals, the machine can recognize which one would be a cat versus which one would be a dog. So that’s just a classification thing, but this one, obviously, because the data is both numerical and continuous, would be a regression problem.
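Written out, the "y = mx + b with a lot of different independent variables" form is the standard multivariable linear regression equation. The symbols below are chosen for illustration: each x_i is one descriptive feature, each m_i is its learned coefficient, and b is the intercept.

```latex
\hat{y} = b + m_1 x_1 + m_2 x_2 + \cdots + m_n x_n
```

With a single feature (n = 1) this reduces to the familiar y = mx + b.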
John Verry (19:12):
Cool. And then I’m assuming that you use some type of, again, because you were in Python, are there packages that are specific to machine learning?
Johnna Verry (19:22):
So there are a lot, and honestly, there are ones that are probably way more advanced for people who are working let’s say at Amazon has data analytic-
John Verry (19:37):
Degrees.
Johnna Verry (19:37):
… with data analytics degrees, but Google itself actually does have a package. But how I did it was I used scikit-learn, which is just a very easy package to work with. A lot of people prefer it, honestly, especially if you’re doing the type of machine learning that this project entailed, more just linear regression and stuff like that. scikit-learn has a million and one ways to analyze the data within it, just built-in features and models to run.
John Verry (20:08):
Cool. So now we got clean data. You figured out what the right approach was to these. How does the actual machine learning work? Is it what we talked about? What’s training? I heard you referred to the term regression testing. How does this work? Now that I’ve got all this data and I’ve got this strategy, take me through the next step.
Johnna Verry (20:30):
So once you have the data, the first thing that you would look at is, which are my descriptive features and which are my target. So descriptive features would be all of the things that combined would get you to the target. So, like I said, in this case that’d be if you’re looking at different technical, political, or economic factors that predict cyber attacks, and then the target feature is what you want to predict.
Johnna Verry (20:58):
So maybe that might be the number of incidents on a day. And so once you have those, you’re splitting it into X and Y variables based on these are my descriptive features and this is my target feature. And then you would run through a linear regression model within scikit-learn if you decide to use that package. You can use the train-test split in their model, which is basically you’re saying, “Okay, split my data into training and test sets.”
Johnna Verry (21:32):
So I could give it a test size of 30%, only take 30% of it for the testing purposes and train on the rest. So it’s going to run through the data that I’m giving it and say, “Okay, what coefficients or what correlated values allow me to get to these targets?” So if they know the targets to begin with, they’re working their way in order to figure out how they got there in the first place. So what coefficients would give me a value of seven on this day and what coefficients would give me a value of nine on this day.
Johnna Verry (22:08):
And so once it understands and calculates the coefficients and it runs through it, however many times that you’re telling it to based on its random state, you’re saying, “Okay, now I’m going to give you the test set. And what that does not include is the output data that you originally gave it.” So it doesn’t know anymore how many incidents happened on that day, all it knows is the descriptive features you are putting in. And at that point its job is to use the correlative values that it originally calculated and have a predictive output of what it would have been for that day.
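The workflow described over the last few turns (split into descriptive features and a target, hold out 30% for testing, fit a linear regression, then predict on the held-out features) can be sketched with scikit-learn as below. The data is synthetic, generated from known coefficients, and is not the project's data. Note that in scikit-learn's `train_test_split`, `random_state` is a seed that makes the shuffle reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic descriptive features (X) and target (y) from known coefficients.
X = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = X @ true_coefs + 5.0 + rng.normal(scale=0.1, size=200)

# Hold out 30% of the rows for testing and train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fitting estimates one coefficient per descriptive feature plus an intercept.
model = LinearRegression().fit(X_train, y_train)

# The model sees only the test features here, never y_test.
predicted = model.predict(X_test)

print(model.coef_, model.intercept_)
```

Comparing `predicted` against `y_test` is what tells you whether the learned coefficients generalize.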
John Verry (22:46):
And then it measures how accurate it actually was, gives you some type of correlation value or something of that nature. So that way you can say, “Hey, based on this training and this particular test set here was the accuracy of my predictions.”
Johnna Verry (23:04):
You tell it to but you can output the actual. So the actual and then the predicted values for those descriptive features that you gave it on the test set. And then you can run accuracy, RMSE, a bunch of different accuracy and precision-
John Verry (23:24):
Measures.
Johnna Verry (23:24):
… testing on it. Yeah, measures. In order to see is it actually being accurate or is it just outputting values that don’t have any statistical significance.
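A short sketch of scoring predictions against actual values. RMSE is the measure named in the conversation; R² is added here as an assumption about what "accuracy" could mean for a regression model, since classification accuracy does not apply to continuous targets. The numbers are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted incident counts for four test days.
actual = np.array([7.0, 9.0, 4.0, 6.0])
predicted = np.array([6.5, 9.5, 4.0, 7.0])

# Root mean squared error: typical size of a miss, in the target's units.
rmse = mean_squared_error(actual, predicted) ** 0.5

# R^2: share of the variance in the actual values explained by the model.
r2 = r2_score(actual, predicted)

print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```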
John Verry (23:36):
Gotcha. Let’s say to keep the math easy you had a thousand data points in your set. Combinations of the attributes and the result. And you ran that function that you referred to and you ran in that 70/30. So it would originally take 70% of that train on it and then use 30%. Will it do that iteratively and grab just 700 different… Like randomly grabbed 700, so that way it can almost use the same data iteratively in different combinations and learn from that part of the process as well?
Johnna Verry (24:10):
So when you are running training tests you give it two parameters, both its test size, so how much data do I want it to test on versus train on, and then its random state, which is what you’re saying, how many times do I want you to go back in, pick up different variations-
John Verry (24:27):
That’s cool.
Johnna Verry (24:28):
… of the set in order to actually… If you’re giving it the same one every time it’s not going to get much out of it. But if you’re continually telling it to go in and grab different combinations of the data to work with each time, it gets more accurate correlated values for all the descriptive features.
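In scikit-learn, `random_state` is a seed for the shuffle rather than a repetition count. One way to get the repeated resampling described here, drawing a different random 70/30 split each round, is `ShuffleSplit` combined with `cross_val_score`. The sketch below assumes that approach and uses synthetic data; it is not necessarily how the project did it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(1)

# Synthetic data with two descriptive features.
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Five independent random 70/30 splits; random_state only fixes the seed.
splitter = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# Fit and score a fresh model on each split.
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")

print(scores)
```

The spread of the five scores gives a sense of how sensitive the result is to which rows end up in the test set.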
John Verry (24:49):
Makes sense. I know when we were on the front side of this process, there were different models that you could use. But when you’re on the backside of this process are there different models that you would use for that training and regression testing as well?
Johnna Verry (25:02):
So I actually ran two different models. One was multivariable linear regression, which was what we were originally just talking about, the combination. And then the second would be ridge regression, which is just a different type of regression, but ridge regression works best in predictive models where the descriptive features are related. So when maybe two of the features relate to each other more strongly than some of the others, it works best because it recognizes the relation between the descriptive features themselves as well.
John Verry (25:40):
Gotcha. That sounds like I know in health studies they’ll call them longitudinal health studies. And what they’ll do is they’ll say, “We know that there’s a correlation between the variables that we’re using.” So as an example someone who eats healthy and someone who exercises or someone eats healthy and doesn’t smoke they are correlative values. So I would imagine ridge, you said ridge regression testing, that ridge regression testing would be a better choice there if I had those confounding variables, they conflate together.
Johnna Verry (26:11):
So ridge regression is actually, even if you did a quick Google search, they’re going to tell you that it’s almost always used in engineering applications or engineering mathematics, just because almost always the descriptive factors are correlated in some manner, like you’re saying.
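The sketch below compares ordinary linear regression with ridge regression when two descriptive features are almost copies of each other. The data is synthetic and `alpha=1.0` is an arbitrary illustrative penalty strength. With near-duplicate features, ordinary least squares can assign large offsetting coefficients, while ridge's penalty shrinks them toward a more stable split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two strongly correlated features: x2 is x1 plus a little noise.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```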
John Verry (26:31):
Right. Like fast people tend to be good athletes and very often are strong.
Johnna Verry (26:39):
And usually taller.
John Verry (26:40):
And usually taller. I got you. All right. So we went through the whole process, you got your machine learning, so now time for the big unveil, does it work? Can you predict attacks for me?
Johnna Verry (26:54):
So I want to say yes, it has a lot of promise.
John Verry (26:57):
I want you to say yes too, so say yes and we’ll be done with this.
Johnna Verry (27:00):
Yes. It works. Are we logging off?
John Verry (27:02):
And it’s super accurate and we’ll know tomorrow exactly what’s going to happen.
Johnna Verry (27:05):
I’m going to make you a million dollars tomorrow actually, I’m signing a contract. So it definitely shows a lot of promise. The funny part about machine learning, data analytics is that what is good is relative or what is accurate is relative. So in a field like cyber security, if there is no way currently, if it’s all reactive and there’s no way currently that people are predicting cyber attacks, then 60% accuracy is-
John Verry (27:32):
Fantastic.
Johnna Verry (27:33):
… being accepted. It’s great. So if you’re telling me your competitor put out a product last week that does 60%, then that’s no longer as accurate as I need it to be, it needs to be 80 now in order for it to be useful. But what did come out very well, so the predictive versus actual values of the data that I used were very close in value. And I was able to see through the coefficients calculated that it was recognizing based on the positive and negative, it was saying that basically these data points were correlated positively versus these were negatively, so it was showing that correctly as well, which is very promising.
John Verry (28:16):
So you mean if we saw increased geopolitical, a measure of geopolitical activity, if that was up and we saw an increase in the number of attacks at that point in time. Or we might’ve seen the opposite, where things were quieter or where some other factor that would negatively correlate, you saw a negative correlation?
Johnna Verry (28:38):
Inversely versus proportionately related. So let’s say if a honeypot activity is up, if that’s up typically cybersecurity incidents would be up for that day or cyber attacks, then that would be positively correlated.
John Verry (28:53):
Gotcha.
Johnna Verry (28:53):
Versus if something’s usually down, which would cause the increase… It’s based on is it inversely or proportionately related.
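Reading the direction of each relationship off a fitted model can be sketched as follows. The data is synthetic, and the second feature name is a hypothetical placeholder for some inversely related factor; only honeypot activity is mentioned in the conversation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical feature names; the second is a stand-in, not from the project.
feature_names = ["honeypot_activity", "hypothetical_inverse_factor"]

# Synthetic data: the first feature raises the target, the second lowers it.
X = rng.normal(size=(300, 2))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.2, size=300)

model = LinearRegression().fit(X, y)

# The sign of each coefficient gives the direction of the relationship.
directions = {
    name: ("positive" if coef > 0 else "negative")
    for name, coef in zip(feature_names, model.coef_)
}

print(directions)
```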
John Verry (29:01):
Cool. So at this point, you’re ready to say, “This looks promising.” So if somebody said to you, “What would you need to say something more positive definitively, is this going to work or not?” What would you need to do to say, “This absolutely works or it doesn’t work that good?”
Johnna Verry (29:21):
A lot more data would be number one and more easily accessible data would also be nice. Although I guess I can’t be too picky but way more data. Something of this scale, one, you need a lot more data, but you’d also want representative data of different situations. Like maybe if we’re talking about 50 states, you’re going to want data from each state instead of just a couple of them you’re predicting all over the map and there’s no correlation there. So you definitely want more data, you’d also want cleaner data, especially in forums or the way they’re getting the data currently a lot of it, it’s being user input both vocab terminology about how they’re inputting things.
Johnna Verry (30:08):
Making sure they’re doing it correctly as well, are we labeling something that it’s not supposed to be labeled as. And making sure that in the future, if this was taken from some type of system like that, that there was meaningful categories to the data that’s being input. We don’t want too many, whereas people are going in and vastly trying to fill something out. Maybe there’s so many categories or especially categories where they don’t even know what they’re really supposed to be putting in those categories. So the data itself has to have a lot of significance and it needs to be input in a baseline… How would you say that?
John Verry (30:46):
Standardized-
Johnna Verry (30:46):
Standardized. Yeah. It needs to be standardized.
John Verry (30:48):
… standardized, normalized way. So the idea behind this, the reason why we were excited to work on the project with you was it’s a cool idea. And then on top of that if you look at the presidential executive order, one of the key components of the presidential executive order was broader information sharing. So in a weird way this is what the folks there are going to be dealing with right now. How do we get the data in, in a standardized format? How do we make sure that the data that’s being reported is being reported in a consistent format?
John Verry (31:19):
How do we know that the people that are actually responsible for inputting the data know enough to put in the right attributes? If we’re going to classify an attack as being an APT, a RAT or ransomware, as a subclass of malware, how will they know which that is? Because if they’re not putting the right value in there, they’re destroying the value of your data set, right?
Johnna Verry (31:43):
Or skewing it as well. If I’m thinking that I have a thousand malware attacks of this nature and they weren’t actually that nature, it’s not going to do much good when it comes to the actual algorithm.
John Verry (31:55):
In fact, it actually might have a negative impact.
Johnna Verry (31:58):
It will.
John Verry (31:58):
Because you might actually make a bad prediction either positively or negatively, in both of those cases that’s going to denigrate the effectiveness of what we’re trying to do.
Johnna Verry (32:09):
Exactly.
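The labeling concern discussed above can be sketched in a few lines of Python. This is only an illustration, assuming a small controlled vocabulary: the category set, the alias table, and the `attack_type` field name are all hypothetical and are not taken from the project discussed in this episode. The idea is to map free-text input onto a fixed set of labels and set aside anything unrecognized instead of letting it into the data set.

```python
# Hypothetical controlled vocabulary of attack subclasses.
CANONICAL_LABELS = {"apt", "rat", "ransomware"}

# Hypothetical aliases that people might type into a free-text field.
ALIASES = {
    "advanced persistent threat": "apt",
    "remote access trojan": "rat",
    "ransom ware": "ransomware",
}


def normalize_label(raw):
    """Return the canonical label for raw input, or None if it is unknown."""
    # Lowercase, trim, and collapse repeated whitespace before lookup.
    key = " ".join(raw.strip().lower().split())
    key = ALIASES.get(key, key)
    return key if key in CANONICAL_LABELS else None


def split_reports(reports):
    """Separate reports with a recognized attack type from those without."""
    clean, rejected = [], []
    for report in reports:
        label = normalize_label(report["attack_type"])
        if label is None:
            rejected.append(report)  # hold back for manual review
        else:
            clean.append({**report, "attack_type": label})
    return clean, rejected


# Made-up example reports.
reports = [
    {"id": 1, "attack_type": "  Advanced Persistent  Threat "},
    {"id": 2, "attack_type": "RAT"},
    {"id": 3, "attack_type": "some virus thing"},
]
clean, rejected = split_reports(reports)
```

Here the first two reports end up in `clean` with the labels `apt` and `rat`, and the third lands in `rejected` so a mislabeled row cannot skew the training data.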
John Verry (32:10):
The other thing, which I think, and you talked about this early, I think an API is a must, right?
Johnna Verry (32:16):
I think that it’s almost essential. It’s very painful, for one, to have to use packages such as Selenium and Beautiful Soup in order to do it. But in the long term of a project like this, or creating some type of tool like this, it wouldn’t be able to work, I don’t think.
John Verry (32:37):
There’s idiosyncrasies in HTML, we all know that. And then you’ve got the situation where if somebody alters a page or alters a tab or changes the location of a page, or your password rotates and now all of a sudden your login’s failing and suddenly your tool’s not able to get to the data, that’s not a good situation.
Johnna Verry (32:59):
Exactly. It just leaves way too many chances for an error to occur. And you don’t need errors in aspects like that.
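The fragility described above can be shown with a small Python sketch. It uses only the standard library (`html.parser` and `json`) as a stand-in for Selenium and Beautiful Soup, and the page markup, the `incident-count` class, and the `incident_count` JSON field are all made up for the illustration. A scraper tied to page markup silently loses the value when the page layout changes, while a structured API response keeps working as long as the field name stays the same.

```python
import json
from html.parser import HTMLParser


class CountScraper(HTMLParser):
    """Captures the text of the first <td class="incident-count"> cell."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "td" and ("class", "incident-count") in attrs:
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell and self.value is None:
            self.value = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


def scrape_count(html):
    """Return the count found in the page, or None if the markup changed."""
    parser = CountScraper()
    parser.feed(html)
    return int(parser.value) if parser.value is not None else None


def api_count(payload):
    """Return the count from a JSON body with an 'incident_count' field."""
    return int(json.loads(payload)["incident_count"])


# Made-up page before and after a redesign, plus a made-up API response.
old_page = '<table><tr><td class="incident-count">42</td></tr></table>'
new_page = '<div><span class="count">42</span></div>'
api_body = '{"incident_count": 42, "state": "VA"}'

before_redesign = scrape_count(old_page)
after_redesign = scrape_count(new_page)
from_api = api_count(api_body)
```

After the redesign the scraper returns `None` even though the number is still on the page, which is the kind of silent failure an API with a stable schema avoids.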
John Verry (33:06):
Anything else there?
Johnna Verry (33:08):
I think just one final thought would be that-
John Verry (33:10):
Hold on a second, hold on a second, I have one more question for you.
Johnna Verry (33:15):
I’m saying one final thought about machine learning-
John Verry (33:17):
Okay.
Johnna Verry (33:18):
… would be that while this proof of concept was based on regression, as I was working on this project I was thinking you could also do the classification version of this, where it’s rating attacks, or looking at attacks and being able to tell you which have high severity versus what you’d classify as low danger or medium risk. Putting attacks into different classifications like that would be another way to use this, or a similar version of this.
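As a rough illustration of the regression-versus-classification point, the Python sketch below turns a numeric severity score (the kind of value a regression model would predict) into the low, medium, and high classes mentioned above. The 0 to 10 scale and the cutoff values are assumptions made for the example, not figures from the project.

```python
from collections import Counter


def severity_class(score, low_cutoff=4.0, high_cutoff=7.0):
    """Bucket a numeric severity score into a class label.

    The cutoffs are hypothetical; a real project would choose them from
    the data or from an agreed scoring scheme.
    """
    if score < low_cutoff:
        return "low"
    if score < high_cutoff:
        return "medium"
    return "high"


# Made-up severity scores on an assumed 0-10 scale.
scores = [2.5, 4.0, 6.9, 7.0, 9.8]
labels = [severity_class(s) for s in scores]
class_counts = Counter(labels)
```

With labels like these as the target, the same features could feed a classifier instead of a regressor, and `class_counts` gives a quick check on whether any class is too rare to learn from.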
John Verry (33:52):
And just out of curiosity let’s say we’ve got these different models that you’d want to test, we can test them all off the same data set. In fact, we can iteratively… you could build this in such a way that we could be using one set of data for predictions, consistently testing multiple models against it and could always shift gears if we found a model based on additional data sets might do better.
Johnna Verry (34:14):
Yeah.
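Testing several models against the same data set, as described above, might look something like the following Python sketch. The two models (a mean baseline and a least-squares line), the error metric, and the numbers are all stand-ins chosen for the illustration; they are not the models or data used in the project.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def compare(fitters, train, test):
    """Fit every model on the same training split and score each one on
    the same held-out split. Returns (best_name, scores_by_name)."""
    scores = {
        name: mean_abs_error(fit(*train), *test)
        for name, fit in fitters.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Made-up data: x is a month index, y is a count of reported attacks.
train = ([1, 2, 3, 4, 5, 6], [12, 15, 19, 22, 24, 29])
test = ([7, 8], [31, 35])

best, scores = compare({"mean": fit_mean, "line": fit_line}, train, test)
```

Because every model is scored on the same held-out split, `compare` can be rerun whenever new data arrives, and the preferred model can change if another one starts scoring better.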
John Verry (34:14):
Okay. Cool stuff. So how has working on the project changed your view about being an engineer? Last year you worked with us and you were more in a I’ll call it a not uber technical role, you were very good and helped us a lot at translating technical information into information that a business individual could understand from a marketing materials perspective. This was a little bit more what would happen when you get out into the real world and start using your engineering degree. How has this changed your view? Do you think it changed your future job path in any way?
Johnna Verry (34:52):
So I wouldn’t say this changed my thought process on engineering in the sense that I’ve always looked at engineering as a thought process or a way of looking at problems approaching problems. But I think more than anything that’s what they’re teaching you as an undergraduate. But I will say that it’s no wonder that every job application you look at nowadays has some type of programming language on there required or wanted skills for any who apply. I think that looking at engineering undergraduate degrees, that this should be something that’s taught in every undergraduate program, although they skip over it a lot.
Johnna Verry (35:32):
And then I’d say that I thought I wanted to do biomedical, and that definitely could still be on the table, but this has definitely been eye-opening in the sense of big data or going along the security path with my degree. So that will be really interesting. Maybe I’ll work on something with securing the Internet of Bodies, or working with Internet of Bodies and cyber attacks against those.
John Verry (36:01):
That’s what I was going to say, the nice thing about this and the nice thing about, and I agree with you about your thoughts on engineering, I’m an engineer as you know. The thing that I like about something like this is that you can do all of those, you don’t have to make a choice. Because like you said let’s say you were going to get involved in a wearables company. Think about the amount of data that a wearable device is generating on a given day and then multiply that across the tens of thousands, hundreds of thousands of users that are wearing them.
John Verry (36:28):
And then think about the security implications of people’s biomedical information being pushed from an Internet of Things device up to cloud infrastructure, that is technically visible to the world. It’s properly secured, but it’s got to be visible because the device has got to talk to it. You could take advantage of the biomedical, take advantage of the big data and take advantage of the information security all in the same role.
Johnna Verry (36:57):
Data analytics honestly you could do anything with it, which is extremely cool.
John Verry (37:04):
Exactly. Anything else we didn’t chat about that you’d like to before we say goodbye?
Johnna Verry (37:11):
I don’t believe so.
John Verry (37:12):
Cool. I always ask, hopefully you prepared for this. Give me a fictional character or a real person that you would think would… I usually say an amazing or horrible CISO; because of your work for us, a big data analyst, and why. So a fictional character or a real person you think would make an amazing or a horrible big data analyst, and why?
Johnna Verry (37:35):
I’m going to have to go with my favorite show, which happens to be Rick and Morty. If you have ever watched Rick and Morty, you are a Rick and Morty fan.
John Verry (37:47):
I am.
Johnna Verry (37:47):
I believe that Rick would be a great CISO although he might get ahead of himself, but the ability to be able to go to a million different timelines and either A stop cyber attacks from occurring or anything like that, or being able to see what happens and then go back to your timeline and stop it from happening or be able to fix it easily is I think, could you get any better?
John Verry (38:17):
Basically, you cheated the system. So you’re basically saying you don’t need to do data analytics when you can go to the future, know what happens and then come back and tell people.
Johnna Verry (38:27):
Yes.
John Verry (38:28):
I can’t argue with you. I think you cheated, but I can’t argue with the answer.
Johnna Verry (38:34):
Maybe a little bit, but it’s fair, there was no limitations to the question.
John Verry (38:40):
You know what, knowing you’re an engineer, I should have set a better… a more finite set of potential answers.
Johnna Verry (38:46):
Should have been requirements. Strict requirements.
John Verry (38:48):
That’s exactly right. All right. So assuming some listener out there is saying, “Oh my God, I have to hire this girl next May, when she graduates from JMU with her engineering degree.” LinkedIn, the best way to get in touch with you?
Johnna Verry (39:07):
LinkedIn is good. It’s underneath my name Johnna Verry so-
John Verry (39:11):
Excellent.
Johnna Verry (39:12):
… usually found hopefully.
John Verry (39:13):
Excellent. Well, listen, thank you for coming on today this was fun.
Johnna Verry (39:16):
Thank you for having me.
John Verry (39:18):
You’ve been listening to The Virtual CISO Podcast, as you’ve probably figured out we really enjoy information security. So if there’s a question we haven’t yet answered or you need some help, you can reach us at [email protected]. And to ensure you never miss an episode, subscribe to the show in your favorite podcast player. Until next time, let’s be careful out there.