INSIDE NetDocuments AI Workshop
Oct. 25, 2017
Within NetDocuments' development facility is a small conference room called Sundance. The SOLR team is huddled around an updated Python script, scratching their heads: could this really be right? Is it possible to take 6.1B documents (that is B as in billion) and create an entity-extraction machine learning algorithm that can automatically file any email or document submitted to the platform? Mou is reconciling the results with her code. She sits back and smiles. "I think we have a nice balance between precision and a flexible model."
Mou leads the NetDocuments SOLR engineering team. SOLR is the open-source search engine that indexes 145 documents per second every business day, with peaks of 350 documents per second. Mou and her team have devised a way to…
The team evaluates the results and turns to the whiteboard, which is covered with boxes, arrows, and annotations. The team summarizes the essence of AI: tables of candidates and anti-candidates for each model relevant to the task. The product managers in the room look at Mou and say, "What?"
Once again she collects herself, picks up an erasable ink pen, and moves to the chaotic whiteboard. "Look," Mou starts, "these algorithms are just lookup tables. The key is the terms we submit, and the value is the label. We have hundreds of labels in our ND models. A few common ones are plaintiff, expert, judge, and court. I submit to our algorithm as many examples in each category as I can: ‘This is a judge. This is a plaintiff. This isn’t a judge,’ and so on. Our model keeps those decisions in a table. Then, if a new example comes along, or if I tell it to watch for new examples, the algorithm just goes and looks at all those examples we fed it. Which rows in the table look similar? And how similar? It’s trying to decide, ‘Is this new thing a judge? I think so.’ If it’s right, the entity gets put in the ‘This is a motion-pleading-court document’ group, and if it’s wrong, it gets put in the ‘This isn’t a judge’ group. Next time, it has more data to look up." Her team has created hundreds of similar models in the legal domain, consistent with customer privacy.
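To make the whiteboard picture concrete, here is a minimal Python sketch of a lookup-table model. It is an illustration only, not NetDocuments' code: the class name LookupTableModel, the word-overlap (Jaccard) similarity, and the sample phrases are all assumptions chosen to keep the example small.

```python
from collections import Counter


class LookupTableModel:
    """Hypothetical toy model: a table of (terms, label) rows."""

    def __init__(self):
        # Each row is (set of lowercase words, label).
        self.rows = []

    def add_example(self, text, label):
        """Store one labeled decision, e.g. 'This is a judge.'"""
        self.rows.append((frozenset(text.lower().split()), label))

    @staticmethod
    def similarity(a, b):
        """Jaccard overlap between two word sets (an assumed choice)."""
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def predict(self, text, k=3):
        """Look up the k most similar rows and let them vote on the label."""
        if not self.rows:
            return None, 0.0
        tokens = frozenset(text.lower().split())
        scored = sorted(
            ((self.similarity(tokens, row), label) for row, label in self.rows),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        votes = Counter()
        for score, label in scored:
            votes[label] += score  # closer rows count for more
        label, score = votes.most_common(1)[0]
        return label, score


model = LookupTableModel()
model.add_example("the honorable judge smith presiding", "judge")
model.add_example("judge jones of the district court", "judge")
model.add_example("plaintiff acme corp alleges", "plaintiff")
model.add_example("plaintiff john doe filed suit", "plaintiff")

label, score = model.predict("honorable judge brown")

# Once a reviewer confirms the guess, the new row goes back into the table,
# so the next lookup has more data to draw on.
model.add_example("honorable judge brown", label)
```

The feedback step at the end mirrors Mou's description: every confirmed or corrected decision becomes another row to look up next time.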
Mou's team is hyper-focused on a few big challenges. One is how to resolve names that are similar, but not identical, to those stored in the table. Part of machine learning is learning similarity functions: you know more when you see more. Another challenge is what happens when the table grows really large. The value of machine learning is that the algorithms can "roughly estimate what the corresponding value should be based on learning models," says Mou.
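The name-resolution challenge can be sketched in a few lines of Python. This is a hand-written stand-in, not a learned similarity function and not the team's method: it uses the standard library's difflib.SequenceMatcher, and the function names, the 0.8 threshold, and the sample names are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def name_similarity(a, b):
    """Character-level similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, known_names, threshold=0.8):
    """Return the closest stored name, or None if nothing is close enough.

    The threshold is an assumed value; a real system would tune or learn it.
    """
    best_name, best_score = None, 0.0
    for candidate in known_names:
        score = name_similarity(name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


known = ["John Smith", "Jane Doe"]
match = best_match("Jon Smith", known)   # a close misspelling
no_match = best_match("Zzz Qqq", known)  # nothing similar in the table
```

Comparing a new name against every stored row is fine for a small table, but it is exactly the step that becomes expensive as the table grows, which is the second challenge the team describes.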
The engineering team frowns when a plucky product manager says, "Wow, that sounds boring, almost mechanical. So much of the conversation around AI is awash in mystical descriptions of its near-magic capabilities." Mou doesn’t like the mystical framing and tries to use more prosaic terms. "Sure, it’s powerful, but not magical. It has limitations. You need data at scale, for example." During presentations, she frequently draws a picture of a wizard hat with a one under it and an n-dimensional table, a modern version of the factory. The contrast defines the @NetDocuments approach to AI as the factory, because “wizards don’t scale.”
The @NetDocuments approach is unique. We are building AI workbenches to achieve our goal of an Invisible DMS, whereby customers and business partners can submit any content: email, documents, pitch books. Once the content is submitted, our SOLR platform processes the document, classifies it, and extracts parties, names, dates, etc. The results are available for personalized search, and governance or marketing teams can query the collected names for business processes outside the document. Consider the possibilities available as document content is unlocked. Contact us with your ideas.