As a follow-up to our blog on Computer Science and Library Science – Can we get along?, we wanted to offer a few therapeutic and pragmatic ways to answer the question with a 'yes' – sort of.
What are the personas and differences?
– Computer Science makes repetition fast and more accurate. It applies the principles of mathematics, engineering, and logic to a plethora of functions, including algorithm formulation, software and hardware development, and artificial intelligence. Computer science applies variables, repetition and loops, lists and arrays, functions, text and pixel manipulation, randomness, algorithms, and tree building. All of this 'computing' capability to process and store more and more is doubling roughly every two years.
– Library Science makes investigation complete and academically accurate. It is all about the classification and use of data to preserve knowledge and provenance and to promote discovery and use. Library Science employs keyword, subject, and facet searching and sorting; profiles and schemas; citation linking; semantic relationships; source weighting; subject area categorization and taxonomy; and data that is unstructured, structured, and everything in between.
Recall that both fields care about accuracy. Computer Science treats 'knowledge' as confidence-interval-based, while Library Science needs to retrieve the specific object of interest. Think of a criminal investigation: computer science can be used to filter down and weed out the candidates, but library science is needed for the final selection that leads to conviction or exoneration.
Top Ten ways to fuse the two worlds
Establish enterprise standards based on immediate use needs (not theory)
A lack of standards, templates, and interfaces is the main reason for interoperability issues. Users currently spend large amounts of time addressing data product interoperability problems: filling in missing fields to fix join issues, detecting wrong fields, struggling with discovery because of vague legacy column names, and chasing orphaned records with missing keys. Find out where your users are wasting time, missing decisions, and hurting outcomes, then write rules for those pain points first.
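As a starting point, a minimal sketch of what "finding where users waste time" can look like in code, assuming pandas DataFrames and hypothetical field names (`record_id`, `source_id`, `updated_date`) that stand in for your own standard:

```python
import pandas as pd

# Hypothetical standard: every data product must carry these join keys and fields.
REQUIRED_FIELDS = {"record_id", "source_id", "updated_date"}

def interoperability_report(df: pd.DataFrame, parent_keys: set) -> dict:
    """Flag the common time-wasters: missing fields, vague column names,
    and orphaned records whose keys don't resolve to the parent table."""
    missing = REQUIRED_FIELDS - set(df.columns)
    vague = [c for c in df.columns if len(c) <= 3 or c.lower().startswith("col")]
    orphaned = 0
    if "source_id" in df.columns:
        orphaned = int((~df["source_id"].isin(parent_keys)).sum())
    return {"missing_fields": sorted(missing),
            "vague_columns": vague,
            "orphaned_records": orphaned}
```

The report is deliberately small: it only covers the issues named above, and the rules should grow from observed user pain, not from theory.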
Pattern-Match Gap Validation in Standards
Establish non-invasive, incremental data governance of the standards through rules-based data validations. Whether you use FAIR data principles, 5-star open data principles, or schema profiling, validate the inputs against the standards prior to transformation and capture the scoring.
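A minimal sketch of rules-based scoring before transformation. The rules and weights below are illustrative placeholders, loosely inspired by FAIR-style checks, not an official scoring scheme:

```python
# Each rule: (name, check function over a metadata dict, weight).
RULES = [
    ("has_identifier",   lambda meta: bool(meta.get("identifier")), 3),
    ("has_license",      lambda meta: bool(meta.get("license")), 2),
    ("machine_readable", lambda meta: meta.get("format") in {"csv", "json", "parquet"}, 2),
    ("has_description",  lambda meta: len(meta.get("description", "")) > 20, 1),
]

def score_inputs(meta: dict) -> dict:
    """Validate incoming metadata against the standard and capture a score
    before any transformation runs."""
    results = {name: check(meta) for name, check, _ in RULES}
    earned = sum(w for (name, _, w) in RULES if results[name])
    total = sum(w for (_, _, w) in RULES)
    return {"checks": results, "score": round(earned / total, 2)}
```

Because the check runs before transformation, a low score becomes advisory metadata attached to the source rather than a gate that blocks it.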
Adopt a culture of 'continuous improvement' towards standards
Establish a minimum viable product for your standards instead of an arbitrary 'should be done' epic – for example, start with just enough rules to support the most common discovery paths. Additional rules over time can 'gamify' improvement, making sources progressively more accessible, traceable, citable, and so on.
Automate Validation
Robotic process automation, or orchestration using newer data pipeline concepts such as ledger-based serverless processing, can lower the burden of standards validation. Shifting to issue, warning, and risk detection – advisory services that simply make suppliers aware of problems – gets the data out there while they improve, and is much better than forcing full compliance. Drive data pipelines by metadata to create low-code environments: read the metadata against known schemas, value lists, and profiles, and detect time and size changes in source metadata and data files/services.
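A small sketch of the advisory (rather than blocking) posture, assuming a local file delivery and a simple append-only JSON-lines ledger; the file names and ledger format are placeholders:

```python
import hashlib
import json
from pathlib import Path

def detect_source_changes(path: str, known: dict) -> list[str]:
    """Advisory-only checks on an incoming file: size and content changes are
    reported as warnings so the data still flows while the supplier improves."""
    warnings = []
    p = Path(path)
    size = p.stat().st_size
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    if known and size != known.get("size"):
        warnings.append(f"size changed: {known.get('size')} -> {size}")
    if known and digest != known.get("sha256"):
        warnings.append("content changed since last ledger entry")
    # Append to a simple ledger instead of rejecting the delivery outright.
    with open("ledger.jsonl", "a") as ledger:
        ledger.write(json.dumps({"path": path, "size": size, "sha256": digest,
                                 "warnings": warnings}) + "\n")
    return warnings
```

In a serverless setup, the same function would run as an event-triggered step, with the ledger living in object storage or a managed table rather than a local file.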
Gamify and Weight Validation in discovery results
Consider how the resulting advisory points can be gamified to show users a dataset's readiness, quality, and advisory results. The same results can be fed back to suppliers to show the traffic impact, and the deterrence, caused by gaps in compliance. They can also enhance in-the-wild SEO (search engine optimization) results, which creates more traffic and use while maintaining authoritative provenance and linking.
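One way to weight validation into discovery, sketched with made-up result records and a simple blended score (the `relevance` and `readiness` fields and the 0.3 weight are assumptions, not a prescribed formula):

```python
def rank_results(results: list[dict], readiness_weight: float = 0.3) -> list[dict]:
    """Blend plain search relevance with the advisory readiness score so that
    better-curated datasets float upward without hiding the rest."""
    def blended(r: dict) -> float:
        return (1 - readiness_weight) * r["relevance"] + readiness_weight * r["readiness"]
    return sorted(results, key=blended, reverse=True)

hits = [
    {"title": "Hydrology gauges", "relevance": 0.91, "readiness": 0.40},
    {"title": "Hydrology gauges (curated)", "relevance": 0.88, "readiness": 0.95},
]
print([h["title"] for h in rank_results(hits)])  # curated version ranks first
```

Showing suppliers how the weight moves their datasets in the rankings is the "deterrence" lever: compliance literally buys visibility.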
Add in Data Anomaly Detection
Metadata matching is only the start. Find ways to bring in the data itself and automate schema change validation, deterministic data quality (DQ) checks, value list validation, key integrity, CRUD change detection, and typecast validation; these also help surface supplier data integrity and veracity issues.
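A compact sketch of those deterministic checks over the data itself, assuming pandas and illustrative column names (`status`, `record_id`, `updated_date`):

```python
import pandas as pd

VALID_STATUS = {"active", "retired", "pending"}   # illustrative value list

def anomaly_checks(df: pd.DataFrame, prior_row_count: int) -> dict:
    """Deterministic data-quality checks beyond metadata matching:
    value lists, key integrity, typecasts, and crude CRUD change detection."""
    return {
        "bad_status_values": int((~df["status"].isin(VALID_STATUS)).sum()),
        "duplicate_keys": int(df["record_id"].duplicated().sum()),
        "null_keys": int(df["record_id"].isna().sum()),
        "uncastable_dates": int(pd.to_datetime(df["updated_date"],
                                               errors="coerce").isna().sum()),
        "row_delta_vs_last_load": len(df) - prior_row_count,
    }
```

The row delta against the last load is the crudest possible CRUD signal, but it is often enough to flag a supplier feed that silently dropped or duplicated records.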
Consider using NLP or more AI/ML to further augment unstructured pattern matching and gap detection
Natural language processing, using the rapid key-value capabilities of NoSQL and big-data tools such as Elastic MapReduce, can help add more validation rules and learn as it goes, across unstructured data as well as structured. This can help uncover legal rule applications, taxonomic references, and even geospatial areas of interest that might take humans weeks to months to find. Keep exploring AI, ML, DL, and NLP models, simulations, and libraries to create more unsupervised AI solutions and keep growing the possibilities. AI needs to be nurtured and developed like a child, and it can take time to mature.
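Before reaching for heavy models, even simple pattern extraction can turn unstructured text into indexable key-value pairs. The sketch below is an assumption-laden starting point: the regular expressions only approximate a CFR-style legal citation and a Linnaean binomial, and a real pipeline would hand these candidates to an NLP model for confirmation:

```python
import re

# Illustrative patterns only: a CFR-style legal citation and a two-word Latin binomial.
LEGAL_CITATION = re.compile(r"\b\d+\s+C\.?F\.?R\.?\s+§?\s*\d+(\.\d+)?\b")
TAXON_BINOMIAL = re.compile(r"\b[A-Z][a-z]+\s[a-z]{3,}\b")

def extract_references(text: str) -> dict:
    """Pull candidate legal and taxonomic references out of unstructured text
    so they can be indexed as key-value pairs alongside the structured metadata."""
    return {
        "legal_refs": [m.group(0) for m in LEGAL_CITATION.finditer(text)],
        "taxa_candidates": TAXON_BINOMIAL.findall(text),
    }
```

The binomial pattern will over-match ordinary sentences, which is exactly where an ML or NLP model earns its keep: the rules recall candidates cheaply, the model prunes them.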
Expand standards by linking them to the single or multiple policy drivers behind them
Give the reason a rule exists by linking it to local, agency, or oversight policy guidance. This helps show where a certain dataset is out of compliance with a policy. The best practices that computer scientists apply are great, yet layering in the library-science-governed policy driving those best practices adds teeth.
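A minimal sketch of carrying the policy driver alongside each rule; the rule names reuse the earlier scoring sketch and the policy identifiers are placeholders, not real citations:

```python
# Each validation rule carries the policy that motivates it, so a failed check
# can be reported as "dataset X is out of compliance with policy Y".
RULE_POLICY_MAP = {
    "has_identifier": {"policy": "Agency Data Policy 1.2 (placeholder)", "level": "agency"},
    "has_license":    {"policy": "Local Open Data Directive (placeholder)", "level": "local"},
}

def explain_failures(failed_rules: list[str]) -> list[str]:
    """Translate bare rule failures into policy-compliance statements."""
    return [f"Rule '{r}' failed -> non-compliant with {RULE_POLICY_MAP[r]['policy']}"
            for r in failed_rules if r in RULE_POLICY_MAP]
```

The point of the mapping is the message: "this column is missing" is a best practice, "this dataset is out of compliance with the agency data policy" is a mandate.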
Provide Asset Inventory Portfolio Views
Provide validation exploration by various facets – by policy, organization, communities of use/themes, and communities of supply – allowing readiness patterns to be compared across those facets. This surfaces trends showing which areas are doing well and which are falling behind.
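A small sketch of a faceted portfolio roll-up, using made-up validation results and pandas group-bys standing in for whatever BI or catalog tool actually serves the view:

```python
import pandas as pd

# Hypothetical validation results, one row per dataset.
results = pd.DataFrame([
    {"dataset": "gauges", "organization": "Hydrology",  "theme": "Water",  "readiness": 0.95},
    {"dataset": "wells",  "organization": "Hydrology",  "theme": "Water",  "readiness": 0.60},
    {"dataset": "census", "organization": "Demography", "theme": "People", "readiness": 0.80},
])

# Compare readiness patterns by facet to spot which communities lead or lag.
by_org = results.groupby("organization")["readiness"].agg(["mean", "count"])
by_theme = results.groupby("theme")["readiness"].mean().sort_values()
print(by_org, by_theme, sep="\n")
```

The same aggregation by policy or community of supply turns individual validation scores into the portfolio view leadership actually acts on.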
Make data available via APIs and in multiple formats and ontologies
Ensure the resulting metadata and data are made available in native format as well as in transformed formats and ontologies, with APIs that enable various planner, analyst, and scientist goals. You may still need a primary warehouse and a single ontology for your MIS and decision-making purposes, while also allowing the data to be manipulated in different vocabularies and ontologies to suit different modeling, linking, brokering, simulation, and analytic purposes. This approach puts use in the driver's seat, while still allowing records compliance for snapshot ledgers.
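A toy sketch of serving the same catalog record in more than one format through an API, using Flask; the endpoint path, record fields, and format parameter are all assumptions for illustration:

```python
import csv
import io

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# Tiny in-memory catalog record; a real service would read from the warehouse.
RECORD = {"identifier": "ds-001", "title": "Stream gauges", "format": "csv"}

@app.get("/datasets/<dataset_id>")
def get_dataset(dataset_id: str):
    """Return the same record in whichever format the caller asks for."""
    fmt = request.args.get("format", "json")
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=RECORD.keys())
        writer.writeheader()
        writer.writerow(RECORD)
        return Response(buf.getvalue(), mimetype="text/csv")
    return jsonify(RECORD)
```

Swapping vocabularies or ontologies follows the same shape: one canonical record in the warehouse, with per-request transformation into the representation each consumer needs.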
BONUS: Data lakes and ledger models allow for temporospatial querying
If point, line, polygon, or pixel data is updated, storage and compute APIs are now fast enough that new models can save versions not only for posterity but also for cross-version comparison. Imagine examining versions within a dataset as well as across datasets. That alone could empower data scientists, yet overlaying it with deep learning models now makes time-and-place comparison possible: watching pixel value changes in imagery (2D) and 3D data (e.g. LiDAR, seismic, ocean & atmospheric data, and more), polygon boundary changes over time, and histogram views over time, such as for a census.
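As the simplest possible flavor of cross-version comparison, a sketch that diffs two ledger versions of the same raster tile; the toy arrays and threshold are stand-ins for real imagery, and a deep learning model would consume exactly this kind of version pair:

```python
import numpy as np

def pixel_change_map(version_a: np.ndarray, version_b: np.ndarray,
                     threshold: float = 10.0) -> np.ndarray:
    """Compare two versions of the same raster tile and return a boolean mask
    of pixels whose values moved more than the threshold between snapshots."""
    return np.abs(version_b.astype(float) - version_a.astype(float)) > threshold

# Toy 4x4 tiles standing in for two ledger versions of the same imagery.
a = np.random.randint(0, 200, (4, 4))
b = a.copy()
b[1, 1] += 60  # simulate a change between snapshots
print(pixel_change_map(a, b).sum(), "pixel(s) changed")
```

The same pattern generalizes to polygon boundaries or census histograms: keep the versions, and the comparison becomes a query rather than a reconstruction project.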
Establish your approach to a non-invasive, Agile, incremental data governance framework that includes standards functions, so you can systematically manage discovery, data management, policies, standards, and practices across the entire data life cycle (creation, processing, storage, usage, records). That covers the library science side. Then further enhance the standards with the latest capabilities in computer science – deeper temporal storage, cloud serverless automation and orchestration, and ML/NLP services for anomaly detection and data standards validation – which provide the resulting metadata insights to support end usage.