Inside Analysis

Sifting for Diamonds: Finding Value in Text Analytics

In this episode of Inside Analysis, Bloor Group CEO Eric Kavanagh interviews Stu Shulman of Texifter, which offers a SaaS-based solution for topic modeling, sentiment detection, and information retrieval on large volumes of unstructured text.

Eric Kavanagh: Okay. Ladies and gentlemen, welcome back once again to Inside Analysis. This is Eric Kavanagh, your host, and I’m very pleased to have Stu Shulman with me today from Texifter. First of all, Stu, welcome to the show.

Stu Shulman: Well, thanks very much. It’s a pleasure to be here.

Eric Kavanagh: Sure. So, tell me a bit about this company, Texifter. It’s T-E-X-I-F-T-E-R, if I recall correctly, and it suggests text sifting. Can you talk about what that means?

Stu Shulman: Sure. Texifter is a company that grew out of academic research funded by the National Science Foundation. We got about 10 years of funding to explore the intersection of citizens and government when they make comments about proposed regulations, and many of your listeners will perhaps have sent a comment to the federal government as part of a mass email campaign. Somebody has to read those comments on the other end, and we set Texifter up as a way of commercializing technology that allows federal government agencies to review those comments more efficiently.

Eric Kavanagh: Yeah, that’s really interesting stuff. So you’re really going in there and sifting through the text comments and you’re trying to separate the wheat from the chaff. Is that fair to say?

Stu Shulman: The basic problem is that mass email campaigns are driven by form letters, and in regulatory rulemaking a good comment is one that makes an original point. It’s not a plebiscite and it’s not a vote, so they’re not counting up to see who can get the most clicks on the website or the most pass-throughs on a form letter. What was happening back in 2002, 2003, 2004 was that groups were learning they could grow their membership and earn valuable donations by piggybacking fundraising efforts on the back of these mass email campaigns. So groups started generating lots of duplicative and near-duplicate comments, and we built a solution, a little bit like plagiarism detection, for finding the duplicates and clustering together the similar comments.
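To make that concrete, here is a minimal Python sketch of this kind of near-duplicate detection: break each comment into word n-grams, or "shingles," and greedily group comments whose shingle sets have high Jaccard similarity. This illustrates the general technique only; it is not Texifter’s actual algorithm, and the threshold and sample comments are invented.

```python
import re

# Illustrative near-duplicate clustering via word shingles and Jaccard
# similarity. Not Texifter's actual algorithm; the threshold is arbitrary.

def shingles(text, n=3):
    """Return the set of word n-grams ("shingles") in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_near_duplicates(comments, threshold=0.8):
    """Greedily group comments whose shingle overlap exceeds the threshold."""
    clusters = []  # each entry: (representative shingle set, member comments)
    for text in comments:
        s = shingles(text)
        for rep, members in clusters:
            if jaccard(rep, s) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((s, [text]))
    return [members for _, members in clusters]

comments = [
    "Please protect our wetlands from the proposed rule change.",
    "Please protect our wetlands from the proposed rule change!!",
    "I support the rule because it reduces compliance costs.",
]
for group in cluster_near_duplicates(comments):
    print(len(group), "similar comment(s), e.g.:", group[0])
```

The first two comments collapse into one cluster despite the punctuation difference, which is the form-letter effect Shulman is describing.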

Eric Kavanagh: That’s brilliant. That’s really good stuff because you’re trying to basically watch out for people who are trying to game the system, right?

Stu Shulman: Well, some people would call it gaming the system. Other people would say it’s giving people their first step in a democratic process. Regulatory rulemaking is open to the public. You don’t have to be a trade association or a lobbyist to engage in commenting on proposed regulations, and they do have the binding power of law. It’s important for people to know that this process exists, but I think it’s also incumbent on the groups to educate their members that sending a form letter, or a slightly modified form letter with a few dirty words or curses in it, doesn’t in fact carry the day in regulatory rulemaking.

Eric Kavanagh: Yeah, that’s interesting. What about in the commercial world or the capitalist world? What are some applications for this technology in the real world?

Stu Shulman: It turns out that some of the properties that allow us to cluster form letters and near duplicates together work really well on data streams like Twitter. Twitter has lots of duplicates and lots of near duplicates, and if you’re doing social media research, you might find it useful to find those duplicates and near duplicates and see what’s going on in the overall landscape of Twitter data. Automatically finding and sorting social media into large clusters of similar comments also allows you to classify that data more quickly, and perhaps build a machine classifier that learns over time how to sort the data.
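As a small illustration of how heavy duplication shows up in Twitter data, the sketch below collapses retweets and lightly edited copies of a message by normalizing the text before counting. The normalization rules and sample tweets are invented for this example.

```python
import re
from collections import Counter

def normalize(tweet):
    """Strip retweet prefixes, @mentions, URLs, and punctuation so that
    trivially different copies of one message collapse to the same key."""
    t = tweet.lower()
    t = re.sub(r"^rt\s+", "", t)        # leading retweet marker
    t = re.sub(r"@\w+", "", t)          # user mentions
    t = re.sub(r"https?://\S+", "", t)  # links
    t = re.sub(r"[^a-z0-9\s]", "", t)   # punctuation
    return " ".join(t.split())

tweets = [
    "RT @acme: Big news on the proposed rule! https://t.co/abc123",
    "Big news on the proposed rule!",
    "A totally different take on the rule here.",
]
counts = Counter(normalize(t) for t in tweets)
for text, n in counts.most_common():
    print(n, text)  # the retweet and the original collapse into one cluster
```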

Eric Kavanagh: What kind of functionality do you have for sentiment analysis, and how is that done, if you do that kind of stuff?

Stu Shulman: We do support that. Part of our technology base allows our users to create what we call custom machine classifiers, or sifters, and those are based on human training. Any set of categories can be created and trained by the user; the typical ones we hear are positive, negative, neutral. The important thing from our point of view is not to get into sentiment analysis without a clear understanding of what’s possible and what’s not, as well as some of the important steps we have for doing it.

For example, if you take some social data, you’ve got to clean the data up first. You can’t take a raw data stream with a common keyword in it and not do some work to make sure the data in that stream is relevant to what you’re interested in.

Let’s take the example of smoking. If I’m collecting Twitter data on smoking, then maybe I have a strong interest in understanding the sentiment around smoking tobacco, but not around smoking marijuana, which is very common in that feed, as well as smoking barbecue or smoking hot boys and smoking hot girls, which is pornography. If we did a sentiment analysis around smoking and didn’t clean the data first to get just the tobacco tweets, we’d get the false impression that tobacco smoking is very popular, because people love marijuana, they love pornography and they love barbecue.
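A crude version of that cleaning step might look like the following: keep a tweet only if it matches tobacco-related vocabulary and none of the confounding senses of "smoking." The keyword lists here are invented, and in practice this filtering is better done with a trained relevance classifier of the kind Shulman describes.

```python
# Crude keyword-based relevance filter for the "smoking" example. The
# vocabulary lists are invented; a production system would use a trained
# relevance classifier instead of hand-picked keywords.
CONFOUNDS = {"weed", "marijuana", "420", "brisket", "barbecue", "bbq", "hot"}
TOBACCO = {"cigarette", "cigarettes", "tobacco", "nicotine", "quit"}

def is_tobacco_tweet(tweet):
    """Keep a tweet only if it matches tobacco vocabulary and none of
    the confounding senses of the word "smoking"."""
    words = set(tweet.lower().split())
    return not (words & CONFOUNDS) and bool(words & TOBACCO)

raw = [
    "Trying to quit smoking cigarettes this year",
    "smoking hot girls at the beach",
    "smoking a brisket all day, barbecue heaven",
]
print([t for t in raw if is_tobacco_tweet(t)])
# only the tobacco tweet survives the filter
```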

Eric Kavanagh: That’s a really, really good point. Talking about using big data, it’s important to recognize that you do have these variations in semantics, and frankly, words just mean very different things. You’ve got to clean that up before you start your formal analysis process, I suppose, right?

Stu Shulman: I agree. Data cleaning is fundamental. You’re absolutely right there. Another thing to think about with sentiment analysis is: don’t do it too early in the process, because text analytics is a series of steps where you’re taking a very big stream of data and reducing it down to smaller and smaller piles. If you save sentiment analysis for the last step, and do, for example, the relevance first, then the topics, and then do a sentiment analysis on a particular topic, the results are going to be much more satisfying than if you try to jump straight to sentiment analysis without having those preliminary steps in place.
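The funnel he describes can be pictured as three successive passes over a shrinking pile of data. In this sketch the three stage functions are trivial keyword stand-ins for what would really be trained classifiers, and the data is invented.

```python
# Sketch of the funnel: relevance first, then topic, then sentiment on a
# single topic's slice. Each stage function is a keyword stand-in for a
# trained classifier.

def is_relevant(text):
    return "rule" in text.lower()

def topic_of(text):
    return "cost" if "cost" in text.lower() else "environment"

def sentiment_of(text):
    return "negative" if "oppose" in text.lower() else "positive"

stream = [
    "I oppose the rule, the compliance cost is too high",
    "The rule will protect rivers and I support it",
    "Unrelated chatter about lunch",
]

relevant = [t for t in stream if is_relevant(t)]   # step 1: relevance
by_topic = {}
for t in relevant:                                 # step 2: topics
    by_topic.setdefault(topic_of(t), []).append(t)
for topic, texts in by_topic.items():              # step 3: sentiment
    print(topic, [(sentiment_of(t), t) for t in texts])
```

Note how the irrelevant item never reaches the sentiment step, which is exactly the ordering Shulman argues for.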

Eric Kavanagh: Yeah, that’s a really good point, and it’s kind of like buying any one of the interesting tools in this whole space. A lot of people want to buy the tool and start using the tool, but you’ve got to prepare the battlefield first. You’ve got to really think about what it is you’re trying to accomplish, because as you say, if you pull the trigger too soon, you’ll get very bad results. They may be accurate in a certain context, but it’s not the context that’s meaningful for you and your use case, right?

Stu Shulman: Exactly. I’ve got a little pitch about “Is there a magic button?” and the answer is no. There is no magic button. You need to work your way through the steps necessary to reach a valid inference. For example, if you are thinking about sentiment around price, you’re going to want to work on data that’s about price before you jump into, say, a sentiment about service. It’s going to be very different.

Eric Kavanagh: You have to build models. Can you walk through what it looks like as you’re putting these tools into place? I think you actually have two different software products. Maybe say what they are, and then just quickly talk about what goes into the step-by-step process of making them sing.

Stu Shulman: The main tool for doing text analytics from our shop is called DiscoverText, and the process combines what humans do well with what machines do well. We like to talk about the five pillars of text analytics: search, filtering, classification using machines, coding using humans, and clustering using machines. It’s a mix of humans and machines, and you can take those various techniques, whether that’s searching for things that are relevant or not relevant and sorting them into big piles, or filtering based on important metadata values.

For us, the human coding is really very important. It’s not a view that everybody wants to hear, that there’s some role for training the machines, but we think the results are much more accurate, and the humans who are interpreting the results have a better understanding of the data behind them if they’re involved at some level in the human coding that leads to the training of the machine classifiers.

We do support an integrated set of tools that can be used independently or in conjunction, and what we’re doing from an educational point of view is trying to highlight the role that humans play in making these machine systems work. It’s not a magic button. It is a case where an analyst still needs to be an analyst or a researcher, and they need to use the tools. There are, of course, ways to spread the pain of coding around.

Our favorite method is crowdsourcing, and we support that within our web-based text analytics package. You can put a group of 2 or 4 or 10 or 20 or 50 coders to work labeling your data. Depending on the scope of the project and its requirements, you may need more or fewer coders.

What we’re able to do with those human coders is take the time it used to take to train a machine to do this work from weeks or months down to hours, and our goal over the next year or so is to get it down to a matter of minutes to train a custom machine classifier.
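One way to picture that human-in-the-loop workflow: several crowd coders label each item, disagreements are resolved by majority vote, and the resolved labels train a classifier. The sketch below uses scikit-learn as a stand-in; it is not necessarily what DiscoverText uses internally, and the labeled data is invented.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical data: each item was coded by three crowd coders.
items = {
    "love this product, works great":      ["positive", "positive", "neutral"],
    "terrible support, very disappointed": ["negative", "negative", "negative"],
    "arrived on time, nothing special":    ["neutral", "neutral", "positive"],
    "best purchase I have made all year":  ["positive", "positive", "positive"],
}

def majority(labels):
    """Resolve one item's crowd labels by simple majority vote."""
    return Counter(labels).most_common(1)[0][0]

texts = list(items)
gold = [majority(v) for v in items.values()]

# Train a simple bag-of-words classifier on the resolved labels.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, gold)
print(classifier.predict(["the support was terrible"]))  # likely 'negative'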

Eric Kavanagh: Well, that is cool, cool stuff, folks. For more information, you go to texifter.com. Is that right?

Stu Shulman: That’s right. Texifter.com for the company, and discovertext.com for our main product. We also have a new product that allows folks to search the full history of Twitter and get a free estimate on what it would cost to license the data for a day or a week or a month out of the history of Twitter on a specific set of keywords. That’s called SIFTER, and you can find it at sifter.texifter.com.

Eric Kavanagh: That is fantastic stuff, folks. We’ve been talking to Stu Shulman. This is great. Thank you so much for your time.

Stu Shulman: Thanks very much, Eric. Take care.

Eric Kavanagh: Okay. Bye-bye.
