Haystac

Solutions Integrator

Unstructured Data, the hidden digital threat

80% of global data will be unstructured by 2025, and it's a near certainty your organization has vast amounts of PII (personally identifiable information) stored throughout your systems in the form of unstructured data. Worse yet, that information is accessed, shared, copied, and stored in an unprotected state. Putting off data discovery and classification may be unwise: regulatory bodies are mandating data protection and security, and the cybersecurity threat landscape has expanded dramatically. We have taken action.

Team-IIS has partnered with Haystac, the Content Intelligence Company™ and developer of Indago™, to enhance our Information Governance practice with true AI content analytics for unstructured data and image classification. Haystac Indago™ is a comprehensive, enterprise-grade unstructured data analytics platform: find, analyze, and organize your institution's unstructured data, regardless of volume, type, or location. Indago™ is the only platform capable of analyzing both scanned images and born-digital content with the same engine at multi-petabyte scale.

Indago™ Enterprise analyzes data from all unstructured data sources: file shares, Microsoft 365, Google Drive, email servers, ECM systems (e.g., SharePoint, FileNet, Documentum, OnBase, IBM, Oracle), and any archival or social media platform. It searches, crawls, profiles, classifies, extracts data points, secures sensitive information, and cleans redundant, outdated, or trivial (ROT) data.

• Fast and quantifiable ROI;
• Fast implementation. On-premise or cloud; All-in-one platform;
• No data scientist required;
• Scalable – from TBs to PBs of data;
• Integrates with existing infrastructure – on-premise/cloud repositories, ECM, DLP, information/mission-critical applications;
• Flexible licensing (perpetual or subscription);
• Neural-network classification, transfer learning;
• False-positive reduction/novelty detection;
• Fuzzy-rules data point detection and extraction;
• Rules-enhanced classification and Q/C;
• Fuzzy-rules analytics;
• Duplicate/near-duplicate detection and analytics;
• Automatic extractive summary;
• Business use case and e-discovery classifications;

• Retention, disposition, confidentiality classification;
• Automatic sensitive data anonymization;
• Ease of Use;
• Automatic data profiling;
• Agent-less ingestion from any data source;
• Interactive dashboard and menu-driven user portal;
• Integrated data modeling feed-back and Q/C;
• Distance-from-original mapping and analysis for near duplicates;
• Faceted search, including keyword, Regex and contextual search;
• Exclusive “more-like-this” search;
• Extensive document (text and HTML), data-point, and metadata viewer;
• JSON export;
• Standard and custom-reporting facility;

Indago™ Enterprise comes complete with full-featured connectors for virtually any data source.

Agent-less ingestion from virtually any data source, either on-premise or in the cloud.

Profiles and clusters content based on semantic similarity. Graphically reports on the data landscape without human intervention. Global and Segmented Clustering.
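For illustration only, the sketch below shows one way content can be grouped by semantic similarity, using off-the-shelf TF-IDF vectors and k-means on a few hypothetical documents; it is not Haystac's proprietary profiling engine.

```python
# Minimal, illustrative sketch of clustering documents by semantic similarity.
# Assumes scikit-learn; the sample documents are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Invoice 4512 for professional services rendered in March",
    "Invoice 4513 for consulting services rendered in April",
    "Employment agreement between the company and the new hire",
    "Offer letter and employment terms for incoming staff",
]

# Represent each document as a TF-IDF vector (a simple stand-in for semantic features).
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Global clustering: group the whole corpus by similarity without predefined categories.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for doc, label in zip(documents, kmeans.labels_):
    print(label, doc)
```

Segmented clustering would apply the same idea within a subset of the corpus, such as a single repository or an existing cluster.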

Supervised ML methods for system training. Fuzzy rules for (meta)data identification, extraction, analytics and export (JSON, XML or CSV).
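To make the extraction-and-export step concrete, here is a minimal, hypothetical sketch of rule-based data-point extraction with JSON output; the patterns and field names are invented for illustration and do not represent Indago's rule syntax.

```python
# Illustrative sketch of rule-based data-point extraction and JSON export.
# The patterns and field names are hypothetical.
import json
import re

text = "Agreement No. INV-2024-0042, effective 01/15/2024, total due $12,300.00."

# "Fuzzy" rules: patterns that tolerate minor formatting variations.
rules = {
    "agreement_number": r"(?:Agreement|Contract)\s*No\.?\s*([A-Z]{2,4}-\d{4}-\d{3,5})",
    "effective_date":   r"effective\s+(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total_due":        r"total\s+due\s+\$?([\d,]+\.\d{2})",
}

extracted = {}
for field, pattern in rules.items():
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match:
        extracted[field] = match.group(1)

# Export the extracted data points as JSON (XML or CSV exports would follow the same idea).
print(json.dumps(extracted, indent=2))
```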

Extended/enriched metadata for faceted searches: NoSQL, keyword, regex, “more-like-this”, etc.

Supports legal requirements for defensible (repeatable and quantifiable) retention, disposition and legal-hold.

Connectors for FileNet, Content Server, Documentum, SharePoint, AODocs, Laserfiche, OnBase, etc.

Proprietary deep learning methods applied to free-text documents. Reduces indexing errors and delivers faster, more relevant content searches and retrievals.

Virtually eliminates false positives in the data models.

Data models can be initiated with one root, two categories, and as few as five exemplars per category. Integrated feedback mechanisms include graphical analysis and Q/C of data models.

Applies fuzzy-rules data analytics to identify sensitive information (PII, PCI, PHI, restricted or confidential content). Fully searchable and exportable.

Selectively anonymizes sensitive and private information (e.g., PII, PCI, and HIPAA-regulated PHI).
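As a simplified illustration of selective anonymization (the detection patterns below are generic examples, not the product's detection models), the sketch replaces detected sensitive values with labeled placeholders:

```python
# Minimal sketch of selective anonymization of sensitive data points.
# The regex patterns are simplified illustrations only.
import re

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def anonymize(text: str) -> str:
    """Replace detected sensitive values with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

record = "Contact john.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
print(anonymize(record))
```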

Supports integration with existing RPA and data security tools such as UiPath, Automation Anywhere, Digital Guardian, Stealthbits, and more.

Enables subject matter experts to quickly examine and determine the relevance of large documents. Automatically identifies and highlights key phrases based on user-selected metrics.
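A minimal sketch of the extractive-summary idea, assuming a simple word-frequency score as the metric (the metrics Indago exposes to users are not reproduced here):

```python
# Sketch of an extractive summary: score sentences by word frequency and
# return the highest-scoring ones. Illustrative only.
import re
from collections import Counter

def extractive_summary(text: str, max_sentences: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    # Score each sentence by the total frequency of its words (a crude key-phrase metric).
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    return scored[:max_sentences]

doc = ("The agreement covers data retention and disposition. "
       "Retention schedules are reviewed annually. "
       "The vendor provides quarterly reports. "
       "Disposition of expired records requires legal approval.")
print(extractive_summary(doc))
```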

Addresses content-driven information security and data privacy requirements for data at rest.

Indago™ Images outperforms legacy OCR classifiers by applying Haystac’s proprietary visual anchor technology, the most advanced classification and data extraction technology available, powered by deep learning.

Haystac’s Visual/Content Intelligence (VCI) builds on NLP and machine learning by combining text and visual analytics in a unified vector space model. This combination sidesteps NLP’s “cold-start” problem and dramatically reduces the requirements for prior organizational structures like taxonomies, ontologies, or editorially selected training sets. Instead, clustering and classification techniques organize information visually.

VCI produces high-quality results by exploiting the frequent presence of templates in the enterprise. These templates go unnoticed by text-only analysis, which often starts by discarding formatting and images. VCI samples non-text clues such as lines, shapes, boundary boxes, bar codes, and photographs. Combining these visual clues with text provides a decisive advantage for clustering and classification.

Haystac VCI makes sense of unstructured information following a much simpler workflow:
  • A few document types are found by searching through the initial raw content set or applying a filter/rule;
  • The document(s) are then labeled as a class;
  • The labeled class is visually generalized (clustered) across the remaining documents;
  • Exceptions are placed in a queue; the user can identify them as a new type or merge them with an existing type, updating the classification models (see the sketch below);
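For illustration, the sketch below mimics this label-a-few, generalize, and queue-exceptions loop using TF-IDF vectors, a nearest-centroid rule, and a similarity threshold; the documents, threshold, and method are assumptions, not Haystac's implementation.

```python
# Hedged sketch of the "label a few exemplars, generalize, queue exceptions" loop.
# Assumes scikit-learn and NumPy; the exemplars and threshold are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

labeled = {  # a few exemplars per document type, found by search or a filter/rule
    "invoice":  ["Invoice 101 amount due", "Invoice 102 payment terms"],
    "contract": ["Master services agreement terms", "Contract renewal clause"],
}
unlabeled = ["Invoice 240 amount payable", "Holiday party announcement"]

all_texts = [t for docs in labeled.values() for t in docs] + unlabeled
vec = TfidfVectorizer().fit(all_texts)

# One centroid per labeled class.
names, centroids = [], []
for name, docs in labeled.items():
    names.append(name)
    centroids.append(np.asarray(vec.transform(docs).mean(axis=0)))
centroids = np.vstack(centroids)

# Generalize the labels across the remaining documents; queue low-similarity exceptions.
THRESHOLD = 0.2
for doc in unlabeled:
    sims = cosine_similarity(vec.transform([doc]), centroids)[0]
    best = sims.argmax()
    if sims[best] >= THRESHOLD:
        print(f"{doc!r} -> {names[best]}")
    else:
        print(f"{doc!r} -> exception queue (possible new type)")
```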

Beyond classification and labeling, VCI “visual analysis” provides excellent insight into fielded structures such as forms and labels. VCI technology turns unstructured information into data without modeling.

Visual Analytics: Classification & Clustering with More Input

Underlying VCI is a layered approach to detecting and extracting non-text features. Edge detection, sampling, and visual similarity algorithms, among others, are used to identify frames, table edges, logos, etc. These small, specially “anchored” extracted regions are normalized and then modeled in a unified vector space. Text extracted from the documents is also modeled in a vector space. Unified clustering and classification via SVM or other algorithms are then applied.
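The hedged sketch below shows the general shape of such a unified text-plus-visual vector space classified with a linear SVM; the visual features (line counts, box counts, a logo flag) and sample documents are invented stand-ins for the anchored regions VCI actually extracts.

```python
# Illustrative sketch of a unified text + visual feature space classified with an SVM.
# Feature choices and data are hypothetical; the anchor detection itself is proprietary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Extracted text per document plus a few simple visual cues
# (e.g., counts of detected lines and boxes from a layout-analysis step, logo present).
texts = ["invoice total amount due", "purchase order ship to",
         "invoice balance payable", "purchase order bill to"]
visual = np.array([  # [line_count, box_count, logo_present]
    [12, 4, 1],
    [30, 9, 0],
    [11, 5, 1],
    [28, 8, 0],
], dtype=float)
labels = ["invoice", "purchase_order", "invoice", "purchase_order"]

vec = TfidfVectorizer()
text_features = vec.fit_transform(texts).toarray()

# Normalize the visual features and concatenate them with the text vectors
# to form one unified vector space.
visual_norm = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-9)
unified = np.hstack([text_features, visual_norm])

clf = LinearSVC().fit(unified, labels)

# Classify a new document using both kinds of evidence at once.
new_text = vec.transform(["invoice amount payable"]).toarray()
new_visual = np.array([[13.0, 4.0, 1.0]])
new_visual /= np.linalg.norm(new_visual)
print(clf.predict(np.hstack([new_text, new_visual])))
```

Concatenating normalized visual features with the text vectors lets a single classifier weigh both kinds of evidence at once, which is the advantage described above.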

There are several benefits to this method when compared to legacy OCR, primitive image search, and textual analysis:
  • Allows high-quality classification and clustering without pre-definition of categories;
  • Uses otherwise discarded data to discern major classes of documents and variations;
  • Uses anchored OCR data extraction, which yields higher accuracy and is skew/rotation independent;
  • Uses boundary/border detection support;
  • Supports multi-line/multi-column format;