Annotation Query Language (AQL) is the primary language in the InfoSphere® BigInsights™ Text Analytics component for building extractors that pull structured information from unstructured or semistructured text.
Use cases - Most of the world's data is unstructured or semi-structured, as found in social media, machine logs, call center logs, email, and financial services.
AQL is a SQL-style programming language for text mining. It was developed in 2004 by IBM and is used in several other IBM products, such as Streams and SPSS. It is used for developing text analytics extractors in the BigInsights Extractor. Watson used text analytics for operations such as reading from Wikipedia and doing text search. AQL has an Eclipse-based development environment that can be added as a plugin.
Text analytics uses the following terminology:
- Regular expression: a pattern for values such as phone numbers and email addresses.
- Dictionary: a file that contains the possible terms (dictionary file).
- AQL script: combines regular expressions and dictionaries (the AQL extractor program).
- Labeled text: an annotation.
Information extraction - You can read a text file and pull out the information in it. For example, text analytics knows names, titles, and organizations from its dictionaries, so as soon as it finds those values, they are presented in order.
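To make the terminology concrete, here is a minimal Python sketch of the same idea: regular expressions plus dictionaries producing labeled text. It is not AQL and does not use any BigInsights API. The dictionary contents, the simplified patterns, and the `extract` function are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical dictionaries; in practice these would come from dictionary files.
TITLES = {"mr", "ms", "dr", "ceo"}
ORGANIZATIONS = {"ibm", "acme"}

# Simplified patterns for illustration only; real phone and email formats vary more.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(text):
    """Return (label, matched_text, start, end) annotations sorted by position."""
    annotations = []
    # Regular-expression matches.
    for label, pattern in (("Phone", PHONE_RE), ("Email", EMAIL_RE)):
        for m in pattern.finditer(text):
            annotations.append((label, m.group(), m.start(), m.end()))
    # Dictionary matches, one word at a time (case-insensitive).
    for m in re.finditer(r"\w+", text):
        word = m.group().lower()
        if word in TITLES:
            annotations.append(("Title", m.group(), m.start(), m.end()))
        elif word in ORGANIZATIONS:
            annotations.append(("Organization", m.group(), m.start(), m.end()))
    # Present the annotations in the order they appear in the text.
    annotations.sort(key=lambda a: a[2])
    return annotations


if __name__ == "__main__":
    sample = "Dr Smith of IBM: call 555-123-4567 or mail smith@example.com"
    for label, value, start, end in extract(sample):
        print(label, value, start, end)
```

Each tuple plays the role of an annotation: a label attached to a span of the original text.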
Sentiment analysis - This is used to extract opinions from reviews and comments about a product or brand, for example movie reviews at imdb.com or rottentomatoes.com. It can be used to identify the good or bad parts of a movie or review. But sentiment analysis has challenges. For example:
- It is hard to understand human language.
- It is hard to differentiate between positive and negative comments (for example, sarcasm).
- Lastly, it is hard to interpret something the way a human would.
Two things to consider while working with text analytics are precision and recall. Precision is about accuracy: how many of the extracted results are correct. Recall is about completeness: how many of the correct results were actually extracted. If you tune for precision, you usually have to compromise on recall, and vice versa; the higher both are, the better the extractor. Names can be confusing: for example, Morgan Stanley can be the name of a person as well as an organization, so a tool can produce a false positive when it sees the name Morgan Stanley. There are four kinds of outcome:
- False positive - Be most careful about this one, because you are getting false data.
- False negative - You can live with it, because it only means some information is left out.
- True positive - A correct result that was extracted.
- True negative - An incorrect candidate that was correctly left out.
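The standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). A small Python sketch shows how the two are computed from what an extractor returned and what it should have returned. The function name and the sample organization names are made up for illustration.

```python
def precision_recall(extracted, expected):
    """Compute (precision, recall) from extracted items and the expected (correct) items."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)    # correct and extracted
    false_positives = len(extracted - expected)   # extracted but wrong
    false_negatives = len(expected - extracted)   # correct but missed
    found = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical run: the extractor wrongly reported a person as an organization
    # (one false positive) and missed two real organizations (two false negatives).
    extracted = ["IBM", "Acme", "Morgan Stanley"]
    expected = ["IBM", "Acme", "Globex", "Initech"]
    print(precision_recall(extracted, expected))
```

In this made-up run, two of the three extracted items are correct and two of the four expected items were found.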
Process in AQL: Tokenization -> Concept Extraction -> Syntactic Analysis -> Disambiguation. Concept extraction uses dictionaries and regular expressions; syntactic analysis uses patterns or parts of speech; disambiguation is context analysis.
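The four stages can be sketched in plain Python, using the Morgan Stanley ambiguity as the example. This is only an illustration of the flow, not how the AQL runtime is implemented. The dictionaries, the function names, and the simple "look at the neighboring word" rule are all assumptions made for the sketch.

```python
import re

# Hypothetical dictionaries used only for this sketch.
PERSON_TITLES = {"mr", "ms", "dr"}
ORG_WORDS = {"inc", "corp", "bank"}
AMBIGUOUS_NAMES = {("morgan", "stanley")}


def tokenize(text):
    """Tokenization: split the text into word tokens."""
    return re.findall(r"\w+", text)


def extract_concepts(tokens):
    """Concept extraction: positions where a dictionary name starts."""
    return [
        i for i in range(len(tokens) - 1)
        if (tokens[i].lower(), tokens[i + 1].lower()) in AMBIGUOUS_NAMES
    ]


def analyze_syntax(tokens, i):
    """Syntactic analysis (simplified): the word before and the word after the name."""
    before = tokens[i - 1].lower() if i > 0 else ""
    after = tokens[i + 2].lower() if i + 2 < len(tokens) else ""
    return before, after


def disambiguate(before, after):
    """Disambiguation: use the context to choose a label."""
    if before in PERSON_TITLES:
        return "Person"
    if after in ORG_WORDS:
        return "Organization"
    return "Unknown"


def run_pipeline(text):
    tokens = tokenize(text)
    results = []
    for i in extract_concepts(tokens):
        before, after = analyze_syntax(tokens, i)
        results.append((" ".join(tokens[i:i + 2]), disambiguate(before, after)))
    return results


if __name__ == "__main__":
    print(run_pipeline("Mr Morgan Stanley visited the Morgan Stanley bank"))
```

The same two words receive different labels depending on what surrounds them, which is the point of the disambiguation stage.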
Dictionaries are collections of words that can be imported from files or created by hand, including inline. BigInsights allows the use of external dictionaries. Regular expressions match textual values from unstructured files. You can write a regular expression using the regular expression builder. Using the regular expression generator, you can get a regular expression for particular sample words, and you can add more words to improve it.
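As a rough illustration of generating a pattern from sample words, the Python sketch below builds one regular expression that matches any of the samples, and a second one after adding a word. This is a deliberately simple stand-in: it only builds an alternation and makes no claim about how the BigInsights generator derives its patterns. The function name and the sample words are hypothetical.

```python
import re


def regex_from_samples(samples):
    """Build a case-insensitive pattern that matches any of the sample words or phrases."""
    # Escape each sample so punctuation is matched literally, and try longer
    # samples first so "fatal error" wins over "error".
    alternatives = sorted({re.escape(s) for s in samples}, key=lambda s: (-len(s), s))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


if __name__ == "__main__":
    samples = ["error", "fatal error"]
    pattern = regex_from_samples(samples)
    print(pattern.findall("Fatal error at boot, then another ERROR"))

    # Adding more sample words widens what the pattern matches.
    wider = regex_from_samples(samples + ["warning"])
    print(wider.findall("Warning: disk almost full"))
```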
It also allows you to define rules to find patterns, for example for causal analysis (problem, annotator~reference point).
AQL language -> Optimizer -> Compiled plan. You put the compiled plan into Streams and BigInsights, where you run it to extract the information. Hadoop/JAQL does not understand a JSON file if a record spans multiple lines.
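One way to work around the multi-line limitation is to rewrite the file before loading it, so that each record sits on a single line. The Python sketch below does this with the standard `json` module. It assumes the input is either a JSON array of records or a single JSON object, and that one compact record per line is what the downstream reader accepts; the function name is hypothetical.

```python
import json


def to_single_line_records(json_text):
    """Parse JSON that may be spread over many lines and return one compact record per line."""
    data = json.loads(json_text)
    # Treat a top-level array as a list of records; anything else is a single record.
    records = data if isinstance(data, list) else [data]
    return "\n".join(json.dumps(record, separators=(",", ":")) for record in records)


if __name__ == "__main__":
    pretty = '[\n  {"id": 1,\n   "text": "hello"},\n  {"id": 2,\n   "text": "bye"}\n]'
    print(to_single_line_records(pretty))
```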