Computer Science and Library Science Got Hitched in the 1960s

Library science experts in data management and metadata started us down the path toward the Master Data Management concepts we know most recently from the DMBoK (Data Management Body of Knowledge), beginning in the 1960s and, in smaller circles, even earlier. That work drew on information practices from the offline world a hundred years before: card catalog systems shaped how we approach metadata by citing our datasets and sources. From there came myriad data catalogs, from academic community models like ResearchGate, to university facilities, to aggregating harvesters like the data.gov catalog.


Many of these systems now reach beyond their envisioned limits, yet they are still adapting well.

  • Information Metadata – Management Information Systems (MIS) use the DMBoK well in establishing shared lists and vocabularies. Yet integrating MIS across agencies, where those lists carry no authority, causes integration issues.
  • Data Products Metadata – More physical assets are being scanned and discovered (e.g., scanned maps and scanned publications), and metadata standards help in tagging them. Yet once that metadata gets out into the wild world of the internet, we see, yet again, how differently organizations tag and save it, making discovery very difficult using traditional keyword and facet searching.
  • Data Products Linking – Digital object identifiers (DOIs) and other persistent URLs help address the ethereal nature of data versus the physical constructs of books (links can move and references break, where physical items only disappear when books aren’t returned), yet the linking between metadata and data still lags. A minimal sketch of persistent-identifier resolution follows this list.
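To make the persistence idea concrete, here is a minimal sketch of resolving a DOI through the public doi.org resolver. The DOI shown is a hypothetical placeholder, and the sketch assumes the `requests` library is installed:

```python
import requests

def resolve_doi(doi: str) -> str:
    """Resolve a DOI to its current landing URL via the doi.org resolver.

    The DOI stays stable even when the landing page moves; the resolver
    maintains the mapping, which is the whole point of persistent IDs.
    """
    resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    resp.raise_for_status()
    return resp.url  # the current location of the dataset or paper

# Example usage (hypothetical DOI shown; substitute any registered DOI):
# print(resolve_doi("10.1234/example-dataset"))
```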

Ten Years Later, Computer Science and Library Science Divorced. 

Ok, that’s a bit dark, yet it has some truth to it. Through the 1970s, computer scientists stopped advancing library science constructs. Computer science found more alluring work to do, even as the cross-organizational heartburn persisted.

The relationship got rocky. Library science is not a dying field, but it is a hybrid field now, with physical and data assets on par with each other in value. The two had a child to save the marriage, BUT things stayed rocky – they didn’t even name the child. That child would later name itself “Data Science” around 2010. Data Science became the poster child for the “latchkey kid,” finding its way much later in life. We will tell that part of the story later.

Computer scientists were dealing with massive sprawl in data and information all the time. With Moore’s law advancing computing power every 18 months to 2 years, and similar laws advancing storage, networking, and video, computer science got bored with library science. Well, maybe not bored, but it just did not have the time for data management. Major stakeholders, like science, became enamored with computational modeling as the 3rd paradigm of science. Consequently, they lost interest in the slower process of cataloging their data for future and cross-organizational use. Treating data itself as the 4th paradigm of science didn’t even surface as a cultural issue until the 2000s. Furthermore, universities never really taught any semblance of data management; the jobs and research dollars were in computing, robotics, and integration with engineering.

And Things Are Still Crazy Today

To this day, the explosion of corporate and organizational conversion of paper to electronic records, information workflow, management information systems, and lots of amazing IT salespeople has focused on grabbing data to support MIS needs. Yet data was not considered for front-office use and improvement until the latter part of the 2000s, and that work has only barely started now.

Library science continued to focus on Master Data Management (MDM) to address enterprise MIS. MDM could not, and never will, address cross-organizational issues correctly. Consortiums languished in trying to bring organizations together without clear ROI at the time. Pushing forward into satellite and sensor data, MDM constructs cannot handle the massive data growth, so MIS shops chose to reduce the fidelity of data to what they needed to store for compliance. It simply became too much.

As paper moved to electronic in the ’90s, the internet created a data deluge (each time it is measured, roughly 90% of the world’s data was created within the last 1-2 years). It is almost as if the relationship of computer science and library science never existed, as data and information regressed to mere content cross-organizationally. Meaning, achieving FAIR (Findable, Accessible, Interoperable, Reusable) data continues to be the bane of existence for corporations and organizations. Data management still follows linear library science models and remains underfunded. Where steno pools and records clerks dominated floors of corporate offices and buildings into the last century, those now-dormant floors came with an assumption that data management would be self-managing… Nope.

Computer Science Enjoyed its Mid-Life Crisis in the ’90s and 2000s. 

With search engines providing some semblance of the F (Findable) in FAIR in the late ’90s, there was some rekindling of library science and computer science’s ‘love affair’. The HTML keywords meta tag effort helped a bit with content, though it would be years before semantic schemas attempted to connect and link data at a feature level. Many consortiums acted as bridges and counselors to create metadata standards, exchange standards, ontologies, and vocabularies. Yet after months of getting nowhere, computer scientists went on the lam again, bored, to work on different ways to solve discovery. If you thought it was dark before, it didn’t get much better.

At the turn of the millennium, with computer science (CS) hitting its true middle-age stride, CS doubled down on research advancing the artificial intelligence side of bringing cross-enterprise data together. Like buying a hot rod, CS throttled down the interstate, dodging social ethics impacts left and right while proudly stoking a battle of Machine Learning vs. Ontology.

The world saw data advance toward decision, advisor, and wisdom. CS subsequently sped through various computing models: from data processing models, to information systems models, to knowledge platform models, and now into the early R&D phases of intelligence advisor models. The stacks of massive hardware and amazing data and compute platforms, with high capability across the 4 V’s (volume, velocity, variety, veracity), are moving computer scientists further away from working closely with library scientists.

So What Happened?

Instead of human-focused vocabularies to enable scientists, modelers, and analysts directly, a right turn was made to enable computing solutions by creating training data to support AI and machine learning. Modelers, analysts, and scientists have begun to focus on establishing rules, signals, and algorithms on this training data so computers can bring the data together. In short: let the computer figure it out. How fun is this? Well, much more fun than making vocabularies. There is the promise of a web linking multiple taxonomies, vocabularies, and ontologies, yet it appears the audience for the results will be massive computing units, left to discern for themselves the clear data-to-data mappings envisioned back in the 1960s.

Computer science keeps advancing tools that avoid directly linking data, instead creating compute solutions to handle the data deluge of mapping. And with fun new toys, too: NoSQL, data lakes, BigTable, MapReduce, and data pipelines (ETL to ELT). CS even began creating some tools for the kid (Data Science): neural networks, NLP, deep learning, Bayesian probabilities, temporo-spatial analytics, predictive models, graph theory, and so many more toys. They play together a lot, but they are still struggling to find stable impact amid so much growth and change. They pump data through to see how big and fast they can process better outputs.

The Reality of the Situation

And, yes, they both ignored the need for library science. Who wants to convince 5,000 science institutions to adopt a new metadata standard – institutions who, by the way, want to avoid you like the plague, since their funding depends on their study output, not on whether others can find it? Yet who can blame them – which would you do? Would you rather work on artificial intelligence that generates algorithms itself, mass-producing common semantic snippets to get us richer content in discovery, or sit on boards that hand-craft Web Ontology Language (OWL) models to get us the description frameworks we need to do analysis?

Library Science in its Later Ages Began Re-Evaluating its Re-Invention, Too.

Library scientists didn’t sit back. They advanced local MIS solutions. Many organizations established their local “malls” of data stores, and many did it quite well. Records policies advanced greatly as the days of paper, while taking their time, passed. Content is tagged more and more, so records became searchable and could integrate with MIS solutions. Without computer science building advanced data-to-knowledge solutions, library scientists brought scientists, modelers, and analysts along to teach themselves programming. Lots of library MIS solutions were created and continue to enable local catalogs and local sharing.

Since this hasn’t addressed the enterprise need nor the “in the wild” need, fractured compliance investments continue to proliferate. These mostly stop at local online libraries we call searchable catalogs. They are customized, and tons of money are put into them – like building an amazing restaurant in a mall no one visits anymore. They do not integrate well and are not discoverable beyond the originally intended audience. These local library solutions support their patrons well, yet the reality is that most users go to “in the wild” searches now, so many catalogs are abandoned data stores in a mall we all once used to visit.

Many library scientists have become “woke” to the data knowledge age, and the allure of information stores parading as data stores is fading. Many are now pursuing these top 3 things:

1. Simplify Their Catalog Role to a Store to be Found in the Wild

You still need to enumerate your assets, and you can make them FAIR by completing basic metadata. Stop overengineering features in the catalog. Heck, can you get out of the catalog-hosting business altogether and become a node in someone else’s catalog? You likely started down this route because you could; that aside, these Swiss army knife catalogs that do everything need to be whittled down. Make the data in your catalog discoverable via schema.org tagging, as sketched below. This shift from retail to wholesale is trending.
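As a minimal sketch of what schema.org tagging looks like in practice (the dataset name, URL, and field values here are hypothetical), a catalog page can emit a schema.org Dataset record as JSON-LD, which is what mainstream search engine crawlers read to make data findable “in the wild”:

```python
import json

# Hypothetical dataset record; only a small base level of metadata.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Watershed Sensor Readings",           # hypothetical
    "description": "Hourly stream-gauge readings, 2015-2020.",
    "url": "https://data.example.org/watershed-sensors",   # hypothetical
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["hydrology", "sensors", "time series"],
}

# Emit the <script> block a catalog page would embed in its HTML <head>
# so search-engine crawlers can discover the dataset.
print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```

The shift this represents: the catalog stops being a destination app and starts being a feed that the wild can index.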

2. Prioritizing Assessed Records for Discovery Over Simply Having a Lot of Records.

Sure, the policy wonks want to hear you have the “most” records. Fine, stuff them in there for your compliance reporting if you must. Yet make your default search show only the good records, and let users decide whether to check the flag that takes them into the deep, dark bowels of your catalog. You would do the same in a library: nicely organized books on the shelves, while power users get access to the stacks downstairs.

It should be the same online. By assessing records, many catalogs create peer pressure to get on the good list. Look, if people won’t play, apply peer pressure – it works for movies online. Be a meta-critic: let people know whether the metadata stinks, that it is what it is, and, most importantly, that we’re working on making it better. Maybe peer pressure will kick in, or someone with policy control will get up and seek the funding needed to make the metadata not embarrassingly bad.
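A minimal sketch of the default-search idea, assuming hypothetical record fields like `assessed` and a user-controlled `include_unassessed` flag:

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    assessed: bool  # hypothetical quality flag set by metadata review

def search(records: list[Record], query: str,
           include_unassessed: bool = False) -> list[Record]:
    """Default search returns only assessed records; users must opt in
    to dig into the deep, unassessed bowels of the catalog."""
    hits = [r for r in records if query.lower() in r.title.lower()]
    if not include_unassessed:
        hits = [r for r in hits if r.assessed]
    return hits

catalog = [
    Record("Stream gauge readings", assessed=True),
    Record("stream sensor dump (raw)", assessed=False),
]
print(search(catalog, "stream"))                           # assessed only
print(search(catalog, "stream", include_unassessed=True))  # everything
```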

3. Populating the Record Metadata They Can Maintain – Rather Than Stuffing Unmaintainable Fields

Save the enhanced fields for your rock stars. Make a small base level of metadata mandatory for discovery, make some fields preferred, and treat the rest as nice to have. Those complex design-time metadata constructs from the ’90s need an enema. Figure out what you can actually maintain; if people want more, they need to demonstrate and fund the ROI to do it. The over-engineered, “hope-based” rather than return-on-investment-based metadata models need to go. Clean it out. You need minimalization, even though it is scary. A sketch of the tiered idea follows.
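Here is a minimal sketch of such a tiered metadata policy, with hypothetical field names; the point is that only the mandatory tier blocks publication, while preferred and nice-to-have gaps are merely reported:

```python
# Hypothetical field tiers; tune them to what your organization can maintain.
MANDATORY = {"title", "description", "publisher", "license"}
PREFERRED = {"keywords", "spatial_coverage", "temporal_coverage"}
# Everything else is nice to have.

def check_record(record: dict) -> tuple[bool, list[str]]:
    """Return (publishable, report). Only missing MANDATORY fields
    block publication; missing PREFERRED fields are merely flagged."""
    missing_mandatory = sorted(MANDATORY - record.keys())
    missing_preferred = sorted(PREFERRED - record.keys())
    report = [f"missing mandatory: {f}" for f in missing_mandatory]
    report += [f"missing preferred: {f}" for f in missing_preferred]
    return (not missing_mandatory, report)

ok, report = check_record({"title": "Gauge data", "license": "CC-BY-4.0"})
print(ok)      # False: description and publisher are missing
print(report)
```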

There are many other honorable mentions:

  • Moving from local solutions to collaborative ones.
  • Creating engagement with suppliers through gamified, incentive-driven metadata improvement programs that offer things suppliers need (give a dog a bone).
  • Moving to training-data-driven models to inform standards, versus consensus discussion-based or hero-based standards development – starting to play with and understand the viability of computer scientists’ toys.
  • Stopping work solely for MIS and working toward the objective of reducing data to support agile business tasks.
  • Or, our favorite heretical call: consider a new catalog concept altogether – NONE! Before you burn me at the stake, hear this heretic out. Can you get your constituents to publish their metadata with semantic tags and flags, and then use “the wild” to keep a management- or status-only view of the catalog by crawling the web? Now people aren’t searching for you; you’re finding what your suppliers put out there. You can meet your quality and compliance reporting, post snapshots for the record, and let the 3rd layer of the internet – the semantic web – be your source. A crawling sketch follows this list.
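As a minimal sketch of that “no catalog” idea (the supplier URL is hypothetical, and this assumes the `requests` library plus Python’s standard `html.parser`), a status-only view can be built by harvesting the JSON-LD that suppliers already publish on their own pages:

```python
import json
import requests
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = ""
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer += data

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            if self.buffer.strip():
                self.blocks.append(json.loads(self.buffer))
            self.buffer = ""

def harvest(urls: list[str]) -> list[dict]:
    """Crawl supplier pages and return their embedded schema.org records."""
    records = []
    for url in urls:
        parser = JSONLDExtractor()
        parser.feed(requests.get(url, timeout=10).text)
        records.extend(parser.blocks)
    return records

# Hypothetical supplier page that already embeds Dataset JSON-LD.
print(harvest(["https://supplier-a.example.org/data.html"]))
```

The design choice: the “catalog” becomes a periodic crawl plus a compliance snapshot, while discovery itself is delegated to the wild.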

Computer Science and Library Science Hang out Again From Time to Time.

Library science is beginning to approach metadata with the understanding that it won’t be perfect, and that a human-readable flow need not be the primary target. They are starting to see that search engine optimization via schema.org tagging is a great way to start; this addresses the F in FAIR. And some library scientists are exploring ways to feed the varied and growing cross-organizational ontologies and vocabularies into the semantic web, knowing that computers are increasingly the audience, even when that means AI for discovery. It may take some serious couples counseling to let AI lead instead of a human user, as it’s a massive change. It feels like we’re still pre-Apollo: the program isn’t getting the progress it needs, so start participating in and funding more pilot programs.

And to be clear, we’re not saying library science and computer science should merely drift to the middle for the sake of their only child, Data Science – though a little focus on the child as a mutual goal would benefit data science greatly. Rather, we’re saying that both need to address the same problem: both must manage the data explosion, extract knowledge from data, and reduce data for the user. That is what we are solving. The diagram from PNNL captures the overall context for knowledge discovery; in this work we’re focused on the core portion, knowledge extraction, realizing that the upstream and downstream tasks also need to be addressed for an end-to-end workflow.

Take Care of Each Other – You Do Need Each Other.

The takeaways:

  • Computer Science – We get it: a computing world whose power doubles every 2 years is impossible to ignore.
  • Library Science – We get it: it is hard to shift stakeholders from their deep origins in physical records for human discovery, yet adapting record features to support compute-based audiences and models is the critical knowledge need of today.

Yet, you two need to work together.

Computer scientists – Use caution before recklessly ignoring the need for solutions that reduce data using organizing and data management constructs. AI and its kin can only establish meaningful confidence intervals if we avoid garbage in, garbage out; the taxonomic and vocabulary principles in context that library science brings will get us to more robust models. AI needs training data. Training data needs a solid, heuristic-ready form. That requires tagging and flagging, which needs metadata, which needs some semblance of agreed taxonomy, categorization, ontology, and vocabulary – overall semantic solutions. A small sketch of that dependency chain follows.
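To illustrate the dependency chain in a minimal sketch (the vocabulary and raw tags are hypothetical), training data becomes more robust when free-text tags are normalized against an agreed vocabulary before a model ever sees them:

```python
# Hypothetical agreed vocabulary: canonical term -> accepted variants.
VOCABULARY = {
    "hydrology": {"hydrology", "hydro", "water science"},
    "remote sensing": {"remote sensing", "satellite imagery", "eo"},
}

def normalize_tags(raw_tags: list[str]) -> list[str]:
    """Map free-text tags onto the agreed vocabulary; unknown tags are
    dropped and reported so curators can extend the vocabulary."""
    canonical, unknown = [], []
    for tag in raw_tags:
        t = tag.strip().lower()
        for term, variants in VOCABULARY.items():
            if t in variants:
                canonical.append(term)
                break
        else:
            unknown.append(tag)
    if unknown:
        print(f"needs curation: {unknown}")  # garbage kept out of training
    return canonical

# Cleaned labels, ready to attach to training examples.
print(normalize_tags(["Hydro", "satellite imagery", "misc stuff"]))
```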

Library scientists – The priority is moving away from delivering local data stores as online libraries and toward simplified, high-quality data feeds for the wild to consume. We need to focus more on feeding broader pipelines and less on local apps. Heed the top 3 and the honorable-mention suggestions, and see what computer science can do with all that – from there, yours will be the most crucial role in quality training data and more integrated data workflows.