Latest News | Reprex

iotables: Integrate Data from Reliable Statistical Sources for Economic and Environmental Impact Analysis

Sat, 24 Sep 2022 08:13:55 +0200

The iotables R package is an open-source, scientifically validated package that collects and integrates data from reliable statistical sources and performs economic and environmental impact analysis. It is the backbone of our connected financial-sustainability reporting tool, Eviota, because it can analyze the value chains of 64 industries in every European country for free. It is your starting point to calculate employment, tax, or greenhouse gas multipliers for various policy actions in your country.

The 0.9.1 version of iotables was released today on CRAN. This new minor release contains a bug fix reported by a user and some documentation improvements. Check out our product page and our examples.

The iotables package works with the open-source R statistical environment. We programmed every example of the Eurostat Manual of Supply, Use and Input-Output Tables, which can be used both as a control tool for our package and as a comprehensive, extended handbook on use.
We have also started to meet the demands of a more global audience by adding more and more examples from the Handbook on Supply and Use Tables and Input-Output Tables with Extensions and Applications published by the United Nations. Our goal is to make both iotables, our free analytic and data processing tool, and Eviota, our premium ESG reporting solution, applicable anywhere in the world.

# From CRAN:
install.packages("iotables")

# From GitHub (development version)
devtools::install_github("rOpenGov/iotables")

# with vignettes:
devtools::install_github("rOpenGov/iotables", build_vignettes = TRUE)

The iotables tool has hundreds of technical and professional users in the world. We know that it is not for the faint of heart. In 2022 we started the development of a tool that will create impact assessments for small companies or civil society organizations without the need to learn and use R.

Dutch AI Coalition Working Group Culture and Media

Thu, 22 Sep 2022 19:30:00 +0200

Reprex presented its Digital Music Observatory and Cultural Creative Sectors Industries Data Observatory as platforms for developing and evaluating trustworthy AI in the cultural domains. We hope to find new partners within the NLAIC community to join our open, collaborative projects.

We have reviewed more than 80 data observatories in the world, and we are building five modern ones.

It was particularly important for us to get away from The Hague, and meet organizations like DEN and the KB to find out how our ambitious plans could connect to their excellent work. Reprex is a finalist in The Hague Innovators Challenge 2022, and we would like to bring at least one global observatory, the planned European Music Observatory, into our beautiful and smart city. While knowledge graphs are virtual and live in the web 3.0, the Dutch AI Coalition and the country’s future competitiveness need to ensure that essential knowledge graphs will be managed by the ecosystem of Netherlands-based researchers, institutions, and startups. The ethical consciousness shown by the members of our Culture AI Lab shows that this is probably the best for future human generations globally, too.

The SABIO is one of the most interesting in the world and could be connected easily with our Cultural Creative Sectors Industries Data Observatory prototype.

The Culture AI Lab presented a handful of very interesting, ethical projects. Pressing Matter responds to growing concerns in the Netherlands and Europe about how to deal with the legacies of colonialism in museums and builds innovative tools for museums (and broader society) to address the question of ownership of objects collected in the colonial period. Dr Emily Hansell Clark, former editor of our Data&Lyrics blog, presented the Polyvocal Interpretation of Contested Colonial Heritage project.

Both projects are conceptually and technologically relevant to our Listen Local project. Our project aims to prevent the colonization or start the de-colonization of the local music ecosystem and make local artists of Utrecht, Vilnius, or Sarajevo visible and audible in their own cities’ public spaces or on the smartphones of their town.

The most compelling use case of the Listen Local project is finding out why music recommender systems do not recommend some music at all. Or why is it so hard to connect Utrecht-based artists with fans living in or visiting Utrecht on the Spotify or YouTube platform? The Responsible Recommenders in the Public Library Sector project is looking for similar answers for librarians, to avoid all recommendations to visitors pointing to U.S. authors and publishers.

Savvina Daniil’s excellent presentation raised very similar questions to our Feasibility Study On Promoting Slovak Music In Slovakia & Abroad.

Our deep dive into legislative and regulatory issues of AI highlighted that the past decade unleashed global web-based tools that have the potential to undermine our democratic and cultural cohesion. Reassuringly, we have seen that our thinking about the dangers of AI on European culture and the technological solutions to combat them are very widely shared by NLAIC Culture and Media members. We hope that our partners’ policy work in the Digital Music Observatory and our forming new observatories will support policy design and decision-making that will protect the Netherlands and the EU from some of these threats.

Learn more about the Dutch AI Coalition’s Culture and Media Working Group (in Dutch).

Reprex Nominated for The Hague Innovators Award

Tue, 13 Sep 2022 08:12:00 +0200

Reprex is a finalist for The Hague Innovators Award 2022 and for the audience prize in the startup category, alongside our respected competitors Sibö, WECO, STHRIVE and ECOBLOQ.

The transition towards a sustainable and inclusive economy depends on collaboration. That is why we are bringing together startups, scale-ups, investors, policymakers, and other impact makers from around the world in The Hague for the 7th edition of ImpactFest.

With The Hague Innovators Challenge, the municipality of The Hague challenges startups, scale-ups, and students to present their innovative ideas for global issues, as described in the UN Sustainable Development Goals (SDGs).

statcodelists: use standard, language-independent variable codes to help international data interoperability and machine reuse in R

Wed, 29 Jun 2022 08:12:00 +0100

Visit the documentation website of statcodelists on statcodelists.dataobservatory.eu/.

The goal of statcodelists is to promote the reuse and exchange of statistical information and related metadata by making the internationally standardized SDMX code lists available for the R user. SDMX, the Statistical Data and Metadata eXchange, has been published as an ISO International Standard (ISO 17369). The metadata definitions, including the code lists, are updated regularly according to the standard. The authoritative version of the code lists made available in this package is https://sdmx.org/?page_id=3215/.

Purpose

Cross-domain concepts in the SDMX framework describe concepts relevant to many, if not all, statistical domains. SDMX recommends using these concepts whenever feasible in SDMX structures and messages to promote the reuse and exchange of statistical information and related metadata between organisations.

Code lists are predefined sets of terms from which some statistical coded concepts take their values. SDMX cross-domain code lists are used to support cross-domain concepts. What are these cross-domain coded concepts?

Geographical codes, like NL: the Netherlands in the CL_AREA code list.
Standard industry codes, like J631 for Data processing, hosting and related activities in NACE Rev. 2 in Europe. (Beware: it is J592 in Australia and New Zealand, see CL_ACTIVITY_ANZSIC06.)
Occupations, like OC2521 for Database designers and administrators in CL_OCCUPATIONS.
Time formatting standards, like CCYY for annual data series in CL_TIME_FORMAT.

Check out the available code lists on the package homepage.

The use of common code lists will help users to work even more efficiently, easing the maintenance of and reducing the need for mapping systems and interfaces delivering data and metadata to them. A very obvious advantage of using the code systems is that you can retrieve data from national sources regardless of the natural language used in North Macedonia, Japan, the U.S. or the Netherlands. While the data labels may change to be locally human-readable, computers and geeks can read the codes and understand them immediately, provided that they use the standard codes.
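To make the point about language-independent codes concrete, here is a minimal base R sketch. The three data frames, their column names, and all values are hypothetical and invented for illustration; they are not objects from the statcodelists package. Only the area code NL is taken from the text above, and the other codes are assumed to follow the same two-letter pattern.

```r
# Hypothetical code list: standard area codes with one English label each
code_list <- data.frame(
  id   = c("NL", "JP", "MK"),
  name = c("Netherlands", "Japan", "North Macedonia"),
  stringsAsFactors = FALSE
)

# Two hypothetical sources that label the same areas in different languages
source_a <- data.frame(
  geo   = c("NL", "JP"),
  label = c("Nederland", "Japan"),
  value = c(10, 20),
  stringsAsFactors = FALSE
)
source_b <- data.frame(
  geo   = c("JP", "MK"),
  label = c("Nihon", "Severna Makedonija"),
  value = c(30, 40),
  stringsAsFactors = FALSE
)

# Stack the sources and keep only the standard code, dropping local labels
combined <- rbind(source_a, source_b)[, c("geo", "value")]

# Attach one consistent, human-readable label by joining on the code
harmonised <- merge(combined, code_list, by.x = "geo", by.y = "id")
```

Because both sources share the code column, the local labels never need to be translated or matched by hand.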

Our data observatories are rolling out SDMX coding across all datasets to help data ingestion and interoperability, data findability and data reuse. statcodelists can help the use of standard SDMX codes in your R workflow, both for downloading data from statistical agencies and for producing publication-ready datasets that the rest of the world (and even APIs) will understand.

Installation

You can install statcodelists from CRAN:

install.packages("statcodelists")

Further recommended code values for expressing general statistical concepts like not applicable, etc., can be found in section Generic codes of the Guidelines for the creation and management of SDMX Cross-Domain Code Lists.

For further code lists used by reliable statistical agencies but not harmonized on SDMX level, please consult the SDMX Global Registry Codelists page.

The creator of this package is not affiliated with SDMX, and this package has not been endorsed by SDMX.

Code of Conduct

Please note that the statcodelists project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Developing a software-as-service solution for micro-, and small enterprises

Thu, 09 Jun 2022 09:40:00 +0100

The music sector must increase its environmental and social (ESG) sustainability management to meet the challenges of the climate emergency and to make the music sector a fairer, more just workplace for womxn and artists coming from minorities and small countries. The European Union will make target setting and audited reporting mandatory in environmental and social sustainability for large companies. The application of these new accounting, reporting and disclosure rules is optional for the music sector, where almost all entities are micro or small enterprises and civil society organizations.

Even if music organizations are not pushed by regulators to adopt these new standards, it is in their best interest to take the initiative on the principle of subsidiarity, and develop tools that can be applied as an extension to their simplified financial and tax reporting. Music organizations and businesses that can prove that they are making progress in reducing their carbon footprint, making their water use more sustainable, and providing equal opportunities for womxn will be eligible for new, green bank and insurance products (which are particularly important in live music) and can attract new sponsors and donors.

Compliance with these new rules is very costly, because tools are being developed for stock-exchange listed big companies and financial institutions. The Commission’s impact assessment (SWD/2021/150 final) estimates the cost of compliance with the Corporate Sustainability Reporting Directive at more than 4 bn euros for European companies, or around 10,000 euros per company. Reprex, working together with large accounting, audit and value-based banking partners, scientific, research and industry partners in the Digital Music Observatory open knowledge collaboration, hopes to bring down this cost below 500 euros, which will immediately pay off when a music organization receives green money.

We are working on a simple interface that can connect the accounting system of micro and small enterprises with new methodologies, starting with greenhouse gas reporting with Reprex’s open source EEIO application iotables. We will keep many aspects of our software and data solution open, so that later methodological innovations and scientific achievements can be easily incorporated into the system. Reprex’s minimum viable product will be created in four iteration rounds in Malta, Czechia, Bulgaria and Belgium. However, our testing is open for any amount of donations to any music entities in the European Union who can provide input data in English or Dutch, or are able to pay for their translation and localization costs.

Link: Final List of Awarded Projects by MusicAIRE

Trustworthy Autonomous Recommender Systems on Music Streaming Platforms

Mon, 28 Feb 2022 19:00:00 +0100

Currently almost 60% of the global recording industry sales are made via streaming platforms. Given the enormity of choice on these platforms, and that music listening is a low-key, routine consumption choice, consumers rely more and more on the recommendations of autonomous recommendation systems. Streaming platforms are two-sided markets, where recommendations are deployed to enhance the user experience on the consumer side, but they also decide the fate of the investments that composers, lyricists, producers, and performers made into the music. We are going to contribute to research on how such systems may tilt the competitive playing field between content providers, and more specifically, between major labels and independents.

Reprex maintains the Digital Music Observatory and the Listen Local system for granular microdata about music use in small territories (i.e., on small country or sub-national level.) We will provide data/expertise in music streaming and recommendation systems and links to many relevant stakeholders with our considerable experience running experiments on music platforms.

A research team of the University of East Anglia (UEA), the University of Liverpool (UoL), the University of London (City), and King’s College (KCL), supported by the Competition and Markets Authority of the United Kingdom and Reprex, won a prestigious research grant to understand how recommender systems on music streaming platforms can employ trustworthy AI.

The researchers will explore the relationship between the autonomous recommendation systems and entry barriers via simulation. Working closely with Reprex, they will simulate sets of users, and iteratively generate recommendation lists, which the simulated users will react to by deciding how long to engage for and which recommendations to listen to. Through their engagement, their user profiles will be updated based on what they listen to, which will feed into future recommendations.
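The feedback loop described above can be sketched in a few lines of base R. This is a toy illustration under our own assumptions, not the research team's simulation design: the catalogue size, the popularity-times-taste scoring rule, the profile update step of 0.05, and all variable names are hypothetical, and the sketch leaves out the decision on how long a user stays engaged.

```r
set.seed(42)  # fix the random seed so the sketch is reproducible

n_users  <- 50   # hypothetical number of simulated users
n_tracks <- 20   # hypothetical catalogue size
n_rounds <- 10   # recommendation rounds
list_len <- 5    # length of each recommendation list

# Hypothetical catalogue: the first five tracks start with a head start in plays
plays <- c(rep(100, 5), rep(10, n_tracks - 5))

# Hypothetical user profiles: one taste weight per user and track
profiles <- matrix(runif(n_users * n_tracks), nrow = n_users)

for (round in seq_len(n_rounds)) {
  for (u in seq_len(n_users)) {
    # Assumed scoring rule: current popularity weighted by the user's taste
    score    <- plays * profiles[u, ]
    rec_list <- order(score, decreasing = TRUE)[seq_len(list_len)]

    # The simulated user picks one recommended track in proportion to taste
    pick   <- sample.int(list_len, size = 1, prob = profiles[u, rec_list])
    chosen <- rec_list[pick]
    plays[chosen] <- plays[chosen] + 1

    # Listening updates the profile, which feeds into the next recommendations
    profiles[u, chosen] <- min(1, profiles[u, chosen] + 0.05)
  }
}

# Share of all plays captured by the tracks that started with a head start
head_share <- sum(plays[1:5]) / sum(plays)
```

Comparing `head_share` across different starting conditions or scoring rules is one simple way to see whether early popularity hardens into an entry barrier.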

See our Feasibility Study for Listen Local.

The empirical experiments of the project aim to explore how autonomous recommendation systems are driving consumer choice in a real-life setting, and to establish causality between the recommendation systems and the barrier to entry. As part of the second work package, the researchers will conduct randomised trials by inviting participants to stream music through our own user interface. Reprex has extensive experience conducting similar experiments in the music domain (various online and field experiments, and high-quality surveys).

Link: Eight new TAS research projects announced

Reproducible Economic Impact Assessment

Sun, 19 Dec 2021 13:00:00 +0100

Get started with iotables.

We made an important, peer-reviewed release of iotables in the last week as a preparation to increase the functionality of our open-source software. The official release of the iotables R package currently works with economic impact assessments, and can evaluate the likely direct, indirect and multiplied employment, tax, wage, or gross value added impacts of various policy changes in about 30 countries.
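For readers new to input-output analysis, the following base R sketch shows the arithmetic behind direct and total (direct plus indirect) effects in a two-industry economy. It deliberately does not call any iotables function; the industries, the flow matrix, and the employment figures are hypothetical, and the multiplier shown is one common definition (total effect divided by direct effect).

```r
# Hypothetical inter-industry flows: rows sell to columns (million EUR)
flows <- matrix(
  c(10, 20,
    30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("agriculture", "services"),
                  c("agriculture", "services"))
)
output     <- c(agriculture = 100, services = 200)  # total output by industry
employment <- c(agriculture = 50,  services = 40)   # jobs by industry

# Input coefficients: each column of flows divided by that industry's output
A <- sweep(flows, 2, output, "/")

# Leontief inverse: output required, directly and indirectly,
# per unit of final demand
L <- solve(diag(nrow(A)) - A)

# Direct employment effect: jobs per unit of output
direct_effect <- employment / output

# Total employment effect: direct effect propagated through the value chain
total_effect <- as.vector(direct_effect %*% L)
names(total_effect) <- colnames(flows)

# One common multiplier definition: total effect relative to direct effect
multiplier <- total_effect / direct_effect
```

The same pattern applies to tax, wage, or emission indicators: only the `employment` vector changes, while the Leontief inverse stays the same.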

Originally the package was developed to calculate the economic impact of the Hungarian film tax shelter and the impact of the music sector on the Slovak economy. (See Slovak Music Industry Report).

The new CRAN release improved the documentation of the functions and removed most outdated dependencies. The new development version (which has not gone through peer review yet) is adding new functionality for environmental impact analysis with the following pollutants: Carbon dioxide without emissions from biomass (CO2), Carbon dioxide from biomass (Biomass CO2), Nitrous oxide (N2O), Methane (CH4), Perfluorocarbons (PFCs), Hydrofluorocarbons (HFCs), Sulphur hexafluoride (SF6) including nitrogen trifluoride (NF3), Nitrogen oxides (NOx), Non-methane volatile organic compounds (NMVOC), Carbon monoxide (CO), Particulate matter < 10μm (PM10), Particulate matter < 2.5μm (PM2.5), Sulphur dioxide (SO2), Ammonia (NH3) and their combinations (see Reference Metadata in Euro SDMX Metadata Structure (ESMS)).

Our aim is to develop new sustainable finance applications, and understand the sustainability impacts of banks’ lending activities and insurers’ underwriting activities on climate change mitigation and adaptation, biodiversity, preservation of water reserves, preventing pollution, and promoting the circular economy.

EU Taxonomy on Sustainable Activities

The European Commission created an EU Taxonomy Compass, which provides a visual representation of the contents of the EU Taxonomy, starting with the Delegated Act on the climate objectives, as adopted on 4 June 2021. Whilst you can download the EU Taxonomy in xlsx or json format, these files are not tidy datasets, and they are not particularly well-suited for calculations, filtering, or inclusion in applications.

Reprex created a tidy version of the EU Taxonomy for developing better sustainability indicators in the Green Deal Data Observatory.

Open Data

EU Taxonomy on Sustainable Activities (Tidy) download.

Using our iotables is not for the faint of heart. It is scientific software, and it requires a good command of national accounts, input-output economics and sustainability to work with. Our Green Deal Data Observatory is designed to be an API of scientific software, and produce clean, ready-to-use data for researchers, policy-makers and business planners who do not have the skills to work with scientific software. We are planning to release well-designed datasets that go through dozens of checks to make sure they have the best data quality.

Do you want to develop input-output models for any European country to measure the direct and indirect greenhouse gas impacts of policy actions? Do you need well-formatted data on interindustry linkages or other relevant topics for sustainable economy or sustainable finance research? Get in touch with us – we are happy to help and test our new software tool with data you need, and create high-quality, open datasets that are ready to use.

Jumping Ahead With the Digital Music Observatory

Thu, 02 Dec 2021 13:00:00 +0100

Our Digital Music Observatory project spent a year in the JUMP Music Market Accelerator’s program. Over the course of 9 months, co-founder Daniel Antal met many stakeholders from almost all European countries, met other new music technology startups and projects, and received mentoring and other professional help to further develop the project.

The Digital Music Observatory is one of several initiatives to fill the data gaps of the fragmented European music ecosystems. While most of Europe’s music is available and promoted on data-heavy, AI-driven autonomous platforms like TikTok, Spotify, YouTube, and Deezer, music labels, publishers, and national export offices lack the necessary data solutions to remain competitive.

Daniel is pitching for partnership with Music Tech Europe at Linecheck and finding a music city that wants to be the seat of the future European Music Observatory. Photo: Wen Liu.

One of the recurring themes of 2021 was the notion that the music streaming economy is broken. Several JUMP fellows are working on various projects that aim to fix this, and our Digital Music Observatory has both the data and track record to provide evidence and test ideas about possible solutions – change in pricing, better targeting in export and domestic markets, and checking for algorithmic biases. See what we have done in the field this year in the UK IPO-initiated Music Creators’ Earning project; understanding algorithmic recommendation problems with the support of the Slovak Arts Council, and making recommendations about better music metadata and copyright regulation with our research consortium.

The other very interesting theme of the year was the emergence of new, immersive music tech companies. We hope that our Digital Music Observatory can grow into a hub for their data needs, too. How is the world of 2.7 billion gamers and music lovers forming a new market for Ristband? We would also like to curate data about the healing effects of sound, and work in the future with immersive, functional music providers like Flower of Sound who place music and sound design into a less stressful, healthier acoustic environment.

We were often criticized for placing too little emphasis on data visualization. Our next priority is to provide clear, beautiful infographics and charts for all of our datasets.

There were many professionals who helped us in the JUMP program. We are particularly thankful to Alessanra di Caro (partnership building), Elodie Crouzet (program coordination), Steve Farris (mentoring), Veronique Friedrich (team building), Thierry Giesler (improving our pitch) and Anna Zò (Music Tech Europe).

Are you a data user? Give us some feedback! Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please give us any feedback!

How We Add Value to Public Data With Better Curation And Documentation?

Mon, 08 Nov 2021 09:00:00 +0000

In this example, we show a simple indicator: the Turnover in Radio Broadcasting Enterprises in many European countries. This is an important demand driver in the Music economy pillar of our Digital Music Observatory, and an important indicator in our more general Cultural & Creative Sectors and Industries Observatory. Of course, if you work with competition policy or antitrust, then any industry may be interesting to you–but not all of them are well-served with data.

This dataset comes from a public datasource, the data warehouse of the European statistical agency, Eurostat. Yet it is not trivial to use: unless you are familiar with national accounts, you will not find this dataset on the Eurostat website.

The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder.

Our version of this statistical indicator is documented following the FAIR principles: our data assets are findable, accessible, interoperable, and reusable. While the Eurostat data warehouse partly fulfills these important data quality expectations, we can improve them significantly. And we can improve the dataset itself, too, as we will show in the next blogpost.

Findable Data

Our data observatories add value by curating the data–we bring this indicator to light with a more descriptive name, and we place it in a domain-specific context with our Digital Music Observatory and Cultural & Creative Sectors and Industries Observatory and a policy-specific context with our Competition Data Observatory and Green Deal Data Observatory. While many people may need this dataset in the creative sectors, or among cultural policy designers, most of them have no training in working with national accounts, which implies deciphering national account data codes in records that measure economic activity at a national level. Our curated data observatories bring together many available data around important domains. Our Digital Music Observatory, for example, aims to form an ecosystem of music data users and producers.

We added descriptive metadata that help you find our data and match it with other relevant data sources. For example, we add keywords and standardized metadata identifiers from the Library of Congress Linked Data Services, probably the world’s largest standardized knowledge library description. This ensures that you can find relevant data around the same key term ("Radio broadcasting") in addition to our turnover data. This allows connecting our dataset unambiguously with other information sources that use the same concept, but may be listed under different keywords, such as Radio–Broadcasting, or Radio industry and trade, or maybe Hörfunkveranstalter in German, or Emitiranje radijskog programa in Croatian or Actividades de radiodifusão in Portuguese.

Accessible Data

Our data is accessible in two forms: in csv tabular format (which can be read with Excel, OpenOffice, Numbers, SPSS and many similar spreadsheet or statistical applications) and in JSON for automated importing into your databases. We can also provide our users with SQLite databases, which are fully functional, single user relational databases.

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This makes the data easier to clean, and far easier to use in a much wider range of applications than the original data we used. In theory, this is a simple objective, yet we find that even governmental statistical agencies–and even scientific publications–often publish untidy data. This poses a significant problem that implies productivity losses: tidying data requires long hours of investment, and if a reproducible workflow is not used, data integrity can also be compromised: chances are that the process of tidying will overwrite, delete, or omit a data point or a label.

While the original data source, the Eurostat data warehouse, is accessible, too, we added value by bringing the data into a tidy format. Tidy data can immediately be imported into a statistical application like SPSS or STATA, or into your own database. It is immediately available for plotting in Excel, OpenOffice or Numbers.
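As a small illustration of what tidying means in practice, the base R sketch below reshapes a wide table (one column per year) into the tidy form described in this section. The table, its country codes, and its values are hypothetical and only stand in for the kind of layout a statistical agency might publish; our own pipeline is not shown here.

```r
# Hypothetical untidy table: one row per country, one column per year
wide <- data.frame(
  geo    = c("NL", "SK"),
  `2014` = c(100, 20),
  `2015` = c(105, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column other than the country code holds one year of observations
years <- setdiff(names(wide), "geo")

# Tidy form: each variable is a column, each observation is a row
tidy <- data.frame(
  geo   = rep(wide$geo, times = length(years)),
  time  = rep(as.integer(years), each = nrow(wide)),
  value = unlist(wide[years], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

In the tidy table the missing Slovak value for 2015 is an explicit row with `NA`, so it can be counted, flagged, or imputed instead of hiding in an empty cell.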

Interoperability

Our data can be easily imported together with, or joined with, data from other internal or external sources.

All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our API

All our indicators come with standardized descriptive metadata, following two important standards, the Dublin Core and DataCite–implementing not only the mandatory, but the recommended descriptions, too. This will make it far easier to connect the data with other data sources, e.g. turnover with the number of radio broadcasting enterprises or radio stations within specific territories.

Our passion for documentation standards and best practices goes much further: our data uses Statistical Data and Metadata eXchange standardized codebooks, unit descriptions and other statistical and administrative metadata.

Reuse

All our datasets come with standardized information about reusability. We add citation, attribution data, and licensing terms. Most of our datasets can be used without commercial restriction after acknowledging the source, but we sometimes work with less permissive data licenses.

In the case presented here, we added further value to encourage re-use. In addition to tidying, we significantly increased the usability of public data by handling missing cases. This is the subject of our next blogpost.

Are you a data user? How could we serve you better?

Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please get in touch with us!

How We Add Value to Public Data With Imputation and Forecasting

Mon, 08 Nov 2021 10:00:00 +0100

Public data sources are often plagued by missing values. Naively you may think that you can ignore them, but think twice: in most cases, missing data in a table is not missing information, but rather malformatted information. This approach of ignoring or dropping missing values will not be feasible or robust when you want to make a beautiful visualization, or use data in a business forecasting model, a machine learning (AI) application, or a more complex scientific model. All of the above require complete datasets, and naively discarding missing data points amounts to an excessive waste of information. In this example we continue working with a not-so-easy-to-find public dataset.

In the previous blogpost we explained how we added value by documenting data following the FAIR principles and with the professional curatorial work of placing the data in context, and linking it to other information sources, such as other datasets, books, and publications, regardless of their natural language (i.e., whether these sources are described in English, German, Portuguese or Croatian). Photo: Jack Sloop.

Completing missing datapoints requires statistical production information (why might the data be missing?) and data science knowhow (how to impute the missing value). If you do not have a good statistician or data scientist in your team, you will need high-quality, complete datasets. This is what our automated data observatories provide.

Why is data missing?

International organizations offer many statistical products, but usually they are on an ‘as-is’ basis. For example, Eurostat is the world’s premier statistical agency, but it has no right to overrule whatever data the member states of the European Union, and some other cooperating European countries, give to it. And it cannot force these countries to hand over data if they fail to do so. As a result, there will be many data points that are missing, and often data points that have wrong (obsolete) descriptions or geographical dimensions. We will show the geographical aspect of the problem in a separate blogpost; for now, we only focus on missing data.

Some countries have only recently started providing data to the Eurostat umbrella organization, and it is likely that you will find few datapoints for North Macedonia or Bosnia-Herzegovina. Other countries provide data with some delay, and the last one or two years are missing. And there are gaps in some countries’ data, too.

See the authoritative copy of the dataset.

This is a headache if you want to use the data in some machine learning application or in a multiple or panel regression model. You can, of course, discard countries or years where you do not have full data coverage, but this approach usually wastes too much information–if you work with 12 years, and only one data point is missing, you would be discarding an entire country’s 11 years’ worth of data. Another option is to estimate the values, or otherwise impute the missing data, when this is possible with reasonable precision. This is where things get tricky, and you will likely need a statistician or a data scientist onboard.
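The cost of discarding incomplete countries is easy to verify with a toy panel in base R. The country codes, the years, and the values below are hypothetical placeholders; the sketch only counts how many observed data points are thrown away when a single value is missing.

```r
# Hypothetical panel: 3 countries observed over 12 years
panel <- expand.grid(
  time = 2008:2019,
  geo  = c("AA", "BB", "CC"),
  stringsAsFactors = FALSE
)
panel$value <- seq_len(nrow(panel))  # placeholder values

# A single missing data point for one country in one year
panel$value[panel$geo == "BB" & panel$time == 2015] <- NA

# Naive approach: keep only countries with full coverage
incomplete <- unique(panel$geo[is.na(panel$value)])
balanced   <- panel[!panel$geo %in% incomplete, ]

# Observed data points that the naive approach throws away
discarded_observed <- sum(!is.na(panel$value)) - nrow(balanced)
```

Here one missing cell removes 11 perfectly good observations from the balanced panel.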

What can we improve?

Consider that the data is only missing from one year for a particular country, 2015. The naive solution would be to omit 2015 or the country at hand from the dataset. This is pretty destructive, because we know a lot about the radio market turnover in this country and in this year! But leaving 2015 blank will not look good on a chart, and will make your machine learning application or your regression model stop.

A statistician or a radio market expert will tell you that you more or less know the missing information: the total turnover was certainly not zero in that year. With some statistical or radio domain-specific knowledge you will use the 2014 or 2016 value, or a combination of the two, and keep the country and year in the dataset.

Our improved dataset adds backcasted data (using the best time series model fitting the country’s actually present data), forecasted data (again, using the best time series model), and approximated data (using linear approximation). In a few cases, we add the last or next known value. To give a few quantitative indicators about our work:

Increased number of observations: 65%
Reduced missing values: -48.1%
Increased non-missing subset for regression or AI: +66.67%
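The simplest of these gap-filling steps can be sketched in a few lines. The following is a minimal illustration in Python, not the code behind our indicators: it covers only linear approximation between two known values and carrying the last or next known value at the edges (the time series models used for backcasting and forecasting are left out). The function name, the status labels, and the turnover figures are assumptions made for this example.

```python
from typing import Optional


def fill_gaps(values: list[Optional[float]]) -> list[tuple[float, str]]:
    """Return (value, status) pairs for a yearly series with gaps.

    Status is 'observed', 'interpolated' (linear approximation between the
    nearest known neighbours) or 'carried' (last or next known value).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")

    filled = []
    for i, value in enumerate(values):
        if value is not None:
            filled.append((value, "observed"))
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            estimate = values[lo] + weight * (values[hi] - values[lo])
            filled.append((estimate, "interpolated"))
        else:
            # Gap at the start or the end: reuse the nearest known value.
            nearest = before[-1] if before else after[0]
            filled.append((values[nearest], "carried"))
    return filled


# Hypothetical turnover series for 2013-2017; 2015 and 2017 are missing.
series = [100.0, 110.0, None, 130.0, None]
result = fill_gaps(series)
for year, (value, status) in zip(range(2013, 2018), result):
    print(year, value, status)
```

Keeping the status next to each value means an estimate can never be mistaken for an observation, and a user with a better method can replace only the estimated points.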

If your organization is working with panel (longitudinal multiple) regressions or various machine learning applications, then your team knows that not having the +66.67% gain would be a deal-breaker in the choice of models and in the accuracy of estimates, KPIs, or other quantitative products. They also know that they would spend about 90% of their data resources on achieving this +66.67% gain in usability.

If you happen to work in an NGO, a business unit or a research institute that does not employ data scientists, then it is likely that you can never achieve this improvement, and you have to give up on a number of quantitative tools or visualizations. If you have a data scientist onboard, that professional can use our work as a starting point.

Can you trust our data?

We believe that you can trust our data more than the original public source. We use statistical expertise to find out why data may be missing. Often, it is present in a wrong location (for example, because the name of a region changed).

If you are reluctant to use estimates, think about discarding known actual data from your forecast or visualization, because one data point is missing. How do you provide more accurate information? By hiding known actual data, because one point is missing, or by using all known data and an estimate?

Our codebooks and our API use the Statistical Data and Metadata eXchange documentation standards to clearly indicate which data is observed, which is missing, which is estimated, and of course, also how it is estimated. This highlights another important aspect of data trustworthiness: if you have a better idea, you can replace our estimates with better ones.

Our indicators come with standardized codebooks that contain not only the descriptive metadata, but also administrative metadata about the history of the indicator values. You will find very important information about the statistical method we used to fill in the data gaps, and even links to the reliable, peer-reviewed scientific statistical software that made the calculations. For data scientists, we record plenty of information about the computing environment, too; this can come in handy if your estimates need external authentication, or if you suspect a bug.

Avoid the data Sisyphus

If you work in an academic institution, in an NGO or a consultancy, you can never be sure who downloaded the Annual detailed enterprise statistics for services (NACE Rev. 2 H-N and S95) Eurostat folder from Eurostat. Did they modify the dataset? Did they already make corrections for the missing data? What method did they use? To prevent many potential problems, you will likely download it again, and again, and again…

See our blogpost, The Data Sisyphus.

We have a better solution. You can always rely on our API to import directly the latest, best data, but if you want to be sure, you can use our regular backups on Zenodo. Zenodo is an open science repository managed by CERN and supported by the European Union. On Zenodo, you can find an authoritative copy of our indicator (and its previous versions) with a digital object identifier, in this case, 10.5281/zenodo.5652118. These datasets will be preserved for decades, and nobody can manipulate them. You cannot accidentally overwrite them, and we have no backdoor access to modify them.

Get the data

How can we do better?

Are you a data user?

Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please get in touch with us!

Reprex Joins RECREO Research Consortium To Develop Innovation Indicators

Sat, 06 Nov 2021 16:00:00 +0000

The Scuola Superiore di Studi Universitari e di Perfezionamento Sant’Anna and Università degli Studi di Trento (Italy); University of Glasgow (United Kingdom); Universiteit van Amsterdam and Stichting Europeana from the Netherlands; the National University of Ireland Maynooth (Ireland); Tartu Ulikool (Estonia); Szegedi Tudományegyetem (Hungary); Fundacion Santa Maria La Real del Patrimonio Historico from Spain; the Katholieke Universiteit Leuven (Belgium); Culture Action Europe AISBL and IDEA Strategische Economische Consulting (Belgium) and Reprex created the REshaping CCSI REsearch: Open data, policy analysis and methods for evidence-based decision-making (RECREO) consortium, which will mainly develop new policy evidence in the field of innovation and inclusiveness for the creative and cultural sectors and industries. The Consortium applies for a Horizon Europe grant with the HORIZON-CL2-2021-HERITAGE-01-03 Cultural and creative industries as a driver of innovation and competitiveness call of the European Commission.

Policymakers face challenges when trying to implement a strict evidence-based approach to decision-making in the field of cultural and creative sectors and industries (CCSI). This is mostly due to four phenomena:

Evidence dissonances in mapping, measuring and analysis of key indicators, which lead to improper generalizations and gaps in decisionmakers’ knowledge and stakeholders’ awareness;
Fragmentation of hubs of production and concentration of platforms, which create statistical biases and have features that hardly fit with traditional impact assessment methods;
Datafication, which is revolutionizing CCSI but remains difficult to investigate, thus broadening knowledge gaps; and
Stakeholders’ fragmentation and conflicting interests, which hinder their engagement, awareness-raising and uptake of policy inputs.

With its cross-disciplinary consortium of academics, practitioners and a strong network of stakeholders, engaged via participatory research strategies, RECREO will help policymakers and stakeholders tackle such challenges by generating new knowledge and methods to fill in knowledge and awareness gaps. RECREO will achieve this goal through four actions.

First, it will generate a wide array of horizontal and sector-specific datasets, made openly accessible via the CCSI Data Observatory and the Evidence Synthesis Platform. Second, it will offer an unprecedented EU and comparative mapping and impact assessment of key regulatory and policy measures relevant for CCSI, made available on the Law and Policy Observatory. Third, it will develop innovative methods to measure and assess CCSI innovation, competitiveness and spill-over effects, emphasizing inclusiveness, diversity and sustainability. Last, it will offer policy recommendations and best practices aimed at supporting the sustainable growth and competitiveness of culturally diverse CCSI, and their cross-fertilization with cultural heritage promotion and preservation.

Reprex on MaMA

Fri, 15 Oct 2021 19:00:00 +0000

Reprex’s co-founder and the main developer of the Digital Music Observatory, Daniel Antal, and the Digital Music Observatory’s curator, Marie Zhorová, participated in the MaMA Festival & Convention in Paris on 13-15 October within the JUMP Music Market Accelerator Program. We introduced our Digital Music Observatory to national music organizations and encouraged them to try out a cooperation with us. (See Use Cases below.)

Our main aim was to find new users for our Digital Music Observatory, and to find partners for a future Horizon Europe R&D project to develop the scientific pillars of the Observatory in a manner that meets practical industry needs and the feature requirements laid out in the Feasibility Study for a European Music Observatory.

Our concept was introduced in Le Trianon to a wider audience during the JUMP Music Market Accelerator Pitch Session and in one-to-one meetings to representatives of French national organizations. We have also started to investigate the possibility to cooperate with two startups to bring our data services closer to artists, labels, and publishers.

Use Cases

Fair Streaming

Daniel introduced our work for the UK IPO’s Music Creators’ Earnings in the Digital Era Project about the justified and not-justified differences among music rightsholders’ earnings and the diminishing market value of streams. We believe that our UK approach is a particularly interesting complement to the distribution analysis performed by the Centre national de la musique and Deloitte in France.

Fair Value

Daniel introduced to collective management professionals our innovative approach to private copying valuation, royalty price setting, estimating the value transferred to media platforms, and other topics of interest for collective management and rights management organizations. Our approach has a proven track record of increasing revenues for creators.

Open Music Observatory

We introduced our approach to building the European Music Observatory in a decentralized way, relying not only on the resources of Creative Europe but also on Open Science, Horizon Europe, bringing the music industry, music research in universities and cultural policy under one open collaboration. Because France is building its own music observatory of a kind, the decentralized approach could particularly benefit French stakeholders.

Listen Local

Why Data Observatory?

Our use cases highlight the value of having a wide range of data available for the industry players, researchers and policy-makers. In the era of big data, and when open data is becoming legally more and more available, it is important to have one place with a single data collection method. Copernicus built a permanent observatory for the ongoing observation of celestial bodies. We built an automated data observatory to permanently collect data about music.

CCSI Data Observatory

Wed, 06 Oct 2021 16:00:00 +0000

The creative and cultural sectors and industries are mainly made of networks of freelancers and microenterprises, with very few medium-sized companies. Their economic performance, problems, and innovation capacities are hidden. Our open collaboration to create this data observatory is committed to changing this. Relying on modern data science, the re-use of open governmental data, open science data, and novel harmonized data collection, we aim to fill in the gaps left in the official statistics of the European Union.

We believe that introducing Open Policy Analysis standards with open data, open-source software and research automation can help us better understand how creative people and their enterprises and institutions add value to the European economy, how they create jobs, innovate, and increase the well-being of a diverse European society. Our collaboration is open to individuals and citizen scientists.

The new observatory can be reached on ccsi.dataobservatory.eu and will be institutionally hosted by IViR, the Institute for Information Law of the University of Amsterdam, where Reprex’s co-founder, Daniel Antal, will coordinate the development of this new, open scientific tool. Reprex will continue to develop the working model of the data observatory and continue to build open source software tools within the rOpenGov community and the R-Universe initiative of rOpenSci.

The Scuola Superiore di Studi Universitari e di Perfezionamento Sant’Anna and Università degli Studi di Trento (Italy); University of Glasgow (United Kingdom); Universiteit van Amsterdam and Stichting Europeana from the Netherlands; the National University of Ireland Maynooth (Ireland); Tartu Ulikool (Estonia); Szegedi Tudományegyetem (Hungary); Fundacion Santa Maria La Real del Patrimonio Historico from Spain; the Katholieke Universiteit Leuven (Belgium); Culture Action Europe AISBL and IDEA Strategische Economische Consulting (Belgium) and Reprex created the RECREO consortium, which will mainly develop new policy evidence in the field of innovation and inclusiveness for the creative and cultural sectors and industries. The Consortium applies for a Horizon Europe grant with the HORIZON-CL2-2021-HERITAGE-01-03 Cultural and creative industries as a driver of innovation and competitiveness call of the European Commission.

Research & Analysis: Music Creators’ Earnings in the Digital Era

Thu, 23 Sep 2021 08:00:00 +0000

Reprex, with its Digital Music Observatory team, was commissioned to prepare an analysis of the justified and not justified differences in music creators’ earnings. We have posted our most important findings in an earlier blogpost (Music Creators’ Earnings in the Streaming Era. United Kingdom Research Cooperation With the Digital Music Observatory).

The UK Intellectual Property Office has published the entire report on the music creators’ earnings, and we have made our detailed analysis available in a side-publication. Reprex also signed an agreement with the researchers of the Music Creators’ Earnings project to deposit all data published in the report in the Digital Music Observatory, and to promote the building of the observatory further.

The research questions asked in this report are related to the Music Creators’ Earnings Project (MCE), exploring issues concerning equitable remuneration and earnings distributions. We were tasked with providing a longitudinal analysis of earnings development and relating our findings to equitable remuneration. The starting point of our work was a very broadly defined problem: how much money music creators (rightsholders) earn from streaming, how these earnings are distributed, and how the earnings and their distribution have developed during the last decade.

The highly globalized music industry generates two important international reports, as well as several national reports, but these are not suitable for the analysis of the typical or average rightsholder, nor for small labels and publishers who do not represent a large and internationally diversified portfolio of music works or recordings. Copyright and neighboring right revenues are collected in national jurisdictions. Because British artists are almost never constrained by their use of language, and the UK music industry is highly competitive in the global music markets, even relatively less known rightsholders earn revenues from dozens of national markets. The lack of market information on music sales volumes and prices for each jurisdiction, and the unaccounted-for national, domestic, and foreign revenues, make the analysis of rightsholders’ earnings, or of the economics of a certain distribution channel like music streaming or media platforms, impossible.

The Effect of International Diversification on Revenues - a combination of international price differences and exchange rate fluctuations.

While total earnings are reported by international and national organizations, they hide five important economic variables: changes in sales volumes, changes in prices, market share in various national jurisdictions (which have their own volume and price movements), the exchange rates applied, and the share of the repertoire exploited. Even worse, the global music industry has no comprehensive database of rightsholders, music works, and recordings – this is the data gap that we would like to fill with the Digital Music Observatory.
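How a single total can hide these movements is easy to show with a toy calculation. The sketch below, in Python, is not the method of our report: it assumes that revenue from one national market is simply streams × local price × exchange rate, and it covers only three of the five variables (volume, price, and exchange rate). The class name, field names, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MarketYear:
    streams: float        # sales volume (number of streams)
    price_local: float    # average price per stream in local currency
    gbp_per_local: float  # exchange rate applied when revenue is converted

    def revenue_gbp(self) -> float:
        return self.streams * self.price_local * self.gbp_per_local


def decompose(old: MarketYear, new: MarketYear) -> dict[str, float]:
    """Split the revenue ratio new/old into multiplicative factors."""
    return {
        "volume": new.streams / old.streams,
        "price": new.price_local / old.price_local,
        "exchange_rate": new.gbp_per_local / old.gbp_per_local,
        "revenue": new.revenue_gbp() / old.revenue_gbp(),
    }


# Hypothetical foreign market: 50% more streams, but a lower price per
# stream and a weaker local currency.
old = MarketYear(streams=1_000_000, price_local=0.004, gbp_per_local=0.90)
new = MarketYear(streams=1_500_000, price_local=0.003, gbp_per_local=0.80)
factors = decompose(old, new)
for name, factor in factors.items():
    print(f"{name}: {factor:.3f}")
```

In this made-up case the reported revenue is unchanged, although the rightsholder needed 50% more streams to earn it. Without volume, price, and exchange rate data for each jurisdiction, the three effects cannot be told apart.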

Our report highlights some important lessons. First, we show that in the era of global music sales platforms it is impossible to understand the economics of music streaming without international data harmonization and advanced surveying and sampling. Paradoxically, without careful adjustments for accruals, market shares in jurisdictions, and disaggregation of price and volume changes, the British industry cannot analyze its own economics because of its high level of integration into the global music economy. Furthermore, the replacement of former public performance, mechanical licensing, and private copying remunerations (which have been available to British rightsholders in their European markets for decades) with less valuable streaming licenses has left many rightsholders poorer. Making adjustments to the distribution system without modifying the definition of equitable remuneration rights or the pro-rata distribution scheme of streaming platforms opens up many conflicts without solving enough fundamental problems. Therefore, we suggest participation in international data harmonization and policy coordination to help regain the historical value of music.

Context

The idea of our Digital Music Observatory was brought to the UK policy debate on music streaming by the Written evidence submitted by The state51 Music Group to the Economics of music streaming review of the UK Parliament’s DCMS Committee¹.

The music industry requires a permanent market monitoring facility to win fights in competition tribunals, because it is increasingly disputing revenues with the world’s biggest data owners. This was precisely the role of the former CEEMID² program, which was initiated by a group of collective management societies. Starting with three relatively data-poor countries, where data pooling allowed rightsholders to increase revenues, the CEEMID data collection program was extended in 2019 to 12 countries. The final regional report, released after the detailed Hungarian, Slovak and Croatian reports of CEEMID, was sponsored by Consolidated Independent (of the state51 music group).

CEEMID was eventually transformed into the Demo Music Observatory in 2020³, following the planned structure of the European Music Observatory, and validated in the world’s 2nd ranked university-backed incubator, the Yes!Delft AI+Blockchain Validation Lab. In 2021, under the final name Digital Music Observatory, it became open to any rightsholder or stakeholder organization or music research institute, and it is being launched with the help of the JUMP European Music Market Accelerator Programme, which is co-funded by the Creative Europe Programme of the European Union.

In December 2020, we started investigating how the music observatory concept could be introduced in the UK, and how our data and analytical skills could be used in the Music Creators’ Earnings in the Streaming Era (in short: MCE) project, which is taking place in parallel with the heated political debates around the DCMS inquiry. After the state51 music group gave permission for the UK Intellectual Property Office to reuse the data that was originally published as the experimental CEEMID-CI Streaming Volume and Revenue Indexes, we came to a cooperation agreement between the MCE Project and the Digital Music Observatory. We provided a detailed historical analysis and computer simulation for the MCE Project, and we will host all the data of the Music Creators’ Earnings Report in our observatory, hopefully no later than early July 2021.

The Digital Music Observatory contributes to the Music Creators’ Earnings in the Streaming Era project with understanding the level of justified and unjustified differences in rightsholder earnings, and putting them into a broader music economy context.

We started our cooperation with the two principal investigators of the project, Prof David Hesmondhalgh and Dr Hyojung Sun, back in April and will start releasing the findings and the data in July 2021.

Join us

Do you need high-quality data for your music business or institution? Are you a music researcher? Join our open collaboration Digital Music Observatory team as a data curator, developer or business developer.

Footnote References

¹ state51 Music Group. 2020. “Written Evidence Submitted by The state51 Music Group. Economics of Music Streaming Review. Response to Call for Evidence.” UK Parliament website. https://committees.parliament.uk/writtenevidence/15422/html/.
² Artisjus, HDS, SOZA, and Candole Partners. 2014. “Measuring and Reporting Regional Economic Value Added, National Income and Employment by the Music Industry in a Creative Industries Perspective. Memorandum of Understanding to Create a Regional Music Database to Support Professional National Reporting, Economic Valuation and a Regional Music Study.”
³ Antal, Daniel. 2021. “Launching Our Demo Music Observatory.” Data & Lyrics. Reprex. https://dataandlyrics.com/post/2020-09-15-music-observatory-launch/.

The Data Sisyphus

Thu, 08 Jul 2021 09:00:00 +0000

Sisyphus was punished by being forced to roll an immense boulder up a hill only for it to roll down every time it neared the top, repeating this action for eternity. This is the price that project managers and analysts pay for the inadequate documentation of their data assets.

When was a file downloaded from the internet? What has happened to it since? Are there updates? Was the bibliographical reference recorded for quotations? Were missing values imputed? Was the currency translated? Who knows about it – who created a dataset, who contributed to it? Which is an intermediate format of a spreadsheet file, and which is the final version, checked and approved by a senior manager?

Big data creates inequality and injustice. One aspect of this inequality is the cost of data processing and documentation – a greatly underestimated, and usually not reported, cost item. In small organizations, where there are no separate data science and data engineering roles, data is usually supposed to be processed and documented by (junior) analysts or researchers. This is a very important source of the gap between Big Tech and them: the data usually ends up very expensive, ill-formatted, and not readable by computers that use machine learning and AI. Usually the documentation steps are completely omitted.

“Data is potential information, analogous to potential energy: work is required to release it.” – Jeffrey Pomerantz

Metadata, which is information about the history of the data and about how it can be technically and legally reused, has a hidden cost. Cheap or low-quality external data comes with poor or no metadata, and small organizations lack the resources to add high-quality metadata to their datasets. However, this only perpetuates the problem.

The hidden cost item behind the unbillable hours

As we have shown with our research partners, such metadata problems are not unique to data analysis. Independent artists and small labels are suffering on music or book sales platforms, because their copyrighted content is not well documented. If you automatically document tens of thousands of songs or datasets, the documentation cost is very small per item. If you do it manually, the cost may be higher than the expected revenue from the song, or the total cost of the dataset itself. (See our research consortium’s preprint paper: Ensuring the Visibility and Accessibility of European Creative Content on the World Market: The Need for Copyright Data Improvement in the Light of New Technologies.)

In the short run, small consultancies, NGOs, or, as a matter of fact, musicians, seem to logically give up on high-quality documentation and logging. In the long run, this has two devastating consequences. Computers, such as those running machine learning algorithms, cannot read their documents, data, or songs. And as memory fades, the ill-documented resources need to be re-created, re-checked, and reformatted. Often, they are even hard to find on your internal server or laptop archive.

Metadata is a hidden destroyer of the competitiveness of corporate or academic research, or independent content management. It is never quoted on external data vendor invoices, and it is not planned as a cost item, because metadata, the description of a dataset, a document, a presentation, or a song, is meaningless without the resource that it describes. You never buy metadata. But if your dataset comes without proper metadata documentation, you are bound, like Sisyphus, to search for it, to re-arrange it, to check its currency units, its digits, its formatting. Data analysts are reported to spend about 80% of their working hours on data processing and not data analysis – partly because data processing is a very laborious task that computers can do far more cheaply at scale, and partly because they do not know if the person who sat before them at the same desk has already performed these tasks, or if the person responsible for quality control checked for errors.

Uncut diamonds need to be cut, polished, and you have to make sure that they come from a legal source. Data is similar: it needs to be tidied up, checked and documented before use. Photo: Dave Fischer.

Undocumented data is hardly informative – it may be a page in a book, a file in an obsolete file format on a governmental server, or an Excel sheet that you do not remember having checked for updates. Most data is useless, because we do not know how it can inform us, or we do not know if we can trust it. The processing can be a daunting task, not to mention the most boring and often neglected documentation duties after the dataset is final and pronounced error-free by the person in charge of quality control.

Our observatory automatically processes and documents the data

The good news about documentation and data validation costs is that they can be shared. If many users need GDP/capita data from all over the world in euros, then it is enough if only one entity, a data observatory, collects all GDP and population data expressed in dollars, korunas, and euros, and makes sure that the latest data is correctly translated to euros, and then correctly divided by the latest population figures. These tasks are error-prone, and should not be repeated by every data journalist, NGO employee, PhD student or junior analyst. This is one of the services of our data observatory.
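The GDP/capita example can be written down as a short calculation. The following Python sketch only illustrates the two error-prone steps (currency translation, then division by population); it is not our production code, and the country codes, GDP figures, populations, and exchange rates are all made up for the example.

```python
# Hypothetical inputs, for illustration only: GDP in local currency units,
# the reporting currency, and the population of three fictional countries.
gdp_local = {
    "A": {"gdp": 500e9, "currency": "USD", "population": 10_000_000},
    "B": {"gdp": 6_000e9, "currency": "CZK", "population": 10_500_000},
    "C": {"gdp": 400e9, "currency": "EUR", "population": 9_000_000},
}

# Hypothetical exchange rates: euros per one unit of each currency.
eur_per_unit = {"USD": 0.9, "CZK": 0.04, "EUR": 1.0}


def gdp_per_capita_eur(record: dict, rates: dict) -> float:
    """Translate GDP to euros, then divide by population."""
    currency = record["currency"]
    if currency not in rates:
        raise KeyError(f"no exchange rate for {currency}")
    if record["population"] <= 0:
        raise ValueError("population must be positive")
    return record["gdp"] * rates[currency] / record["population"]


table = {
    country: round(gdp_per_capita_eur(record, eur_per_unit), 2)
    for country, record in gdp_local.items()
}
print(table)
```

The explicit checks matter more than the arithmetic: a missing exchange rate or a zero population should stop the process rather than silently produce a wrong indicator that is then copied by every downstream user.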

The tidy data format means that the data has a uniform and clear data structure and semantics; therefore, it can be automatically validated for many common errors and can be automatically documented by either our software or any other professional data science application. It is not as strict as the schema for a relational database, but it is strict enough to make, among other things, importing into a database easy.
The descriptive metadata contains information on how to find the data, access the data, join it with other data (interoperability), use it, and reuse it, even years from now. Among other things, it contains file format information and intellectual property rights information.
The processing metadata makes the data usable in strictly regulated professional environments, such as in public administration, law firms, investment consultancies, or in scientific research. We give you the entire processing history of the data, which makes peer-review or external audit much easier and cheaper.
The authoritative copy is held at an independent repository. It has a globally unique identifier that protects you from accidental data loss and from mixing it up with an unfinished and untested version.
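To make the first of these points concrete, here is a minimal Python sketch of what "tidy" means in practice: a wide table (one column per year) is reshaped so that each row is one observation, after which simple automatic checks become possible. This is an illustration only, not our software; the column names `geo`, `time`, and `value`, the function names, and the numbers are assumptions.

```python
def to_tidy(wide_rows: list[dict], id_column: str) -> list[dict]:
    """Reshape rows with one column per year into one row per observation."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({"geo": row[id_column], "time": int(column), "value": value})
    return tidy


def validate(tidy: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    seen = set()
    for obs in tidy:
        key = (obs["geo"], obs["time"])
        if key in seen:
            problems.append(f"duplicate observation for {key}")
        seen.add(key)
        value = obs["value"]
        if value is not None and not isinstance(value, (int, float)):
            problems.append(f"non-numeric value for {key}")
    return problems


# Hypothetical wide table: one row per country, one column per year.
wide = [
    {"geo": "HU", "2019": 1.5, "2020": None},
    {"geo": "SK", "2019": 2.0, "2020": 2.1},
]
tidy = to_tidy(wide, "geo")
problems = validate(tidy)
print(tidy)
print(problems)
```

Because every observation has the same shape, the same checks work for any indicator, and a missing value stays an explicit row instead of an empty cell that is easy to overlook.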

Cutting the dataset to a format with clear semantics and documenting it with the FAIR metadata concept exponentially increases the value of data. It can be published or sold at a premium. Photo: Andere Andre.

While humans are much better at analysing the information, and human agency is required for trustworthy AI, computers are much better at processing and documenting data. We apply several important concepts to our data service: we always process the data to the tidy format, we create an authoritative copy, and we always automatically add descriptive and processing metadata.

The value of metadata

Metadata is often more valuable and more costly to make than the data itself, yet it remains an elusive concept for senior or financial management. Metadata is information about how to correctly use the data and has no value without the data itself. Data acquisition, such as buying from a data vendor, or paying an opinion polling company, or external data consultants appears among the material costs, but metadata is never sold alone, and you do not see its cost.

In most cases, the reason why there is no gold rush for open data is the fact that while the EU member states release billions of euros’ worth of data for free, or at very low cost, annually, it comes without proper metadata.

Data-as-Service

Reusable, legal, easy-to-import, interoperable, always fresh data in tidy formats with a modern API. Photo: Edgar Soto.

If the data source is cheap or of low quality, you do not even get the metadata. If you do not have it, its cost will show up as a human resource cost in research (when your analysts or junior researchers spend countless hours finding out the missing metadata information on the correct use of the data) or in sales costs (when you try to reuse a research, consulting or legal product and you have to comb through your archive and retest elements again and again).

The data, together with the descriptive and administrative metadata, and links to the use license and the authoritative copy can be found in our API. Try it out!

Including Indicators from Arab Barometer in Our Observatory

Mon, 28 Jun 2021 09:00:00 +0000

A new version of the retroharmonize R package, which supports the retrospective (ex post) harmonization of survey data, was released yesterday on CRAN after peer review. It allows us to compare opinion polling data from the Arab Barometer with the Eurobarometer and Afrobarometer. This is the first version that is released in the rOpenGov community, a community of R package developers working on open government data analytics and related topics.

Surveys are the most important data sources in social and economic statistics – they ask people about their lives, their attitudes and self-reported actions, or record data from companies and NGOs. Survey harmonization makes survey data comparable across time and countries. It is very important, because often we do not know without comparison if an indicator value is low or high. If 40% of the people think that climate change is a very serious problem, it does not really tell us much without knowing what percentage of the people answered this question similarly a year ago, or in other parts of the world.

With the help of Ahmed Shabani and Yousef Ibrahim, we created a third case study after the Eurobarometer, and Afrobarometer, about working with the Arab Barometer harmonized survey data files.

Ex ante survey harmonization means that researchers design questionnaires that ask the same questions with the same survey methodology at repeated, distinct times (waves), or across different countries with carefully harmonized question translations. Ex post harmonization means that the resulting data has the same variable names and the same variable coding, and can be joined into a tidy data frame for joint statistical analysis. While seemingly a simple task, it involves plenty of metadata adjustments, because established survey programs like Eurobarometer, Afrobarometer or Arab Barometer have several decades of history, and several decades of coding practices and file formatting legacy.

Variable harmonization means that if the same question is called Q108 in one microdata source and eval-parl-elections in the other, then we make sure that they get a harmonized, machine-readable name without spaces and special characters.
Variable label harmonization means that the same questionnaire items get the same numeric coding and same categorical labels.
Missing case harmonization means that various forms of missingness are treated the same way.
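The first of these steps, variable name harmonization, can be sketched as follows. This Python example is only an illustration of the idea and does not use or reproduce the retroharmonize API; the function name, the crosswalk structure, and the survey identifiers `survey_a` and `survey_b` are hypothetical.

```python
import re


def harmonize_name(label: str) -> str:
    """Turn a free-form variable label into a machine-readable name:
    lower case, with runs of spaces and special characters replaced by '_'."""
    name = label.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


# A hypothetical crosswalk: the same question has different names in two
# microdata sources and receives one harmonized name.
crosswalk = {
    ("survey_a", "Q108"): harmonize_name("eval-parl-elections"),
    ("survey_b", "eval-parl-elections"): harmonize_name("eval-parl-elections"),
}


def rename(survey: str, variable: str) -> str:
    """Look up the harmonized name; fall back to a cleaned original name."""
    return crosswalk.get((survey, variable), harmonize_name(variable))


print(rename("survey_a", "Q108"))
print(rename("survey_b", "eval-parl-elections"))
```

Names such as Q108 carry no meaning by themselves, so a crosswalk of this kind has to be curated by someone who has read the questionnaires; only the cleaning of spaces and special characters can be fully automated.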

For the evaluation of the economic situation dataset, get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare.

In our new Arab Barometer case study, the evaluation of parliamentary elections has the following labels. We code them consistently as 1: free_and_fair, 2: some_minor_problems, 3: some_major_problems and 4: not_free.

“0. missing”	“1. they were completely free and fair”
“2. they were free and fair, with some minor problems”	“3. they were free and fair, with some major problems”
“4. they were not free and fair”	“8. i don’t know”
“9. declined to answer”	“Missing”
“They were completely free and fair”	“They were free and fair, with some minor breaches”
“They were free and fair, with some major breaches”	“They were not free and fair”
“Don’t know”	“Refuse”
“Completely free and fair”	“Free and fair, but with minor problems”
“Free and fair, with major problems”	“Not free or fair”
“Don’t know (Do not read)”	“Decline to answer (Do not read)”
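
The label harmonization can be sketched in a few lines of base R. This is only an illustration under our own assumptions: the function harmonize_election_label() is hypothetical, it uses simple pattern matching on a subset of the labels listed above, and it is not the retroharmonize implementation.

```r
# A subset of the raw labels found in different survey waves.
raw_labels <- c(
  "1. they were completely free and fair",
  "2. they were free and fair, with some minor problems",
  "3. they were free and fair, with some major problems",
  "4. they were not free and fair",
  "They were free and fair, with some minor breaches",
  "Free and fair, with major problems",
  "Not free or fair",
  "Completely free and fair",
  "8. i don't know",
  "Decline to answer (Do not read)"
)

# Hypothetical helper: map each raw label to one harmonized label.
harmonize_election_label <- function(x) {
  x <- tolower(x)
  x <- sub("^[0-9]+\\.[[:space:]]*", "", x)  # strip prefixes like "1. "
  out <- rep(NA_character_, length(x))       # unmatched labels stay NA
  out[grepl("completely", x)] <- "free_and_fair"
  out[grepl("minor", x)]      <- "some_minor_problems"
  out[grepl("major", x)]      <- "some_major_problems"
  out[grepl("not free", x)]   <- "not_free"
  out
}

valid_levels <- c("free_and_fair", "some_minor_problems",
                  "some_major_problems", "not_free")

harmonized_label <- harmonize_election_label(raw_labels)
harmonized_code  <- match(harmonized_label, valid_levels)  # 1 to 4, or NA

print(data.frame(raw_labels, harmonized_label, harmonized_code))
```

In this sketch the "don't know" and "decline to answer" labels are left as NA; in practice the different forms of missingness should be kept apart.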

Of course, this harmonization is essential to get clean results like this:

For evaluation or reuse of the parliamentary elections dataset, get the replication data and the code from the Zenodo open repository.

In our case study, we had three forms of missingness: the respondent did not know the answer, the respondent did not want to answer, and lastly, in some cases the respondent was not asked, because the country held no parliamentary elections. In numerical processing, all these answers must be left out, for example, when calculating averages, but in a more detailed, categorical analysis they represent very different cases. A high level of refusal to answer may in itself be an indicator of the suppression of democratic opinion forming.
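
The difference between the numerical and the categorical treatment of missing answers can be illustrated with a short base R sketch. The answers and the category names (do_not_know, declined, inap) are made up for this example and are not Arab Barometer data or retroharmonize conventions.

```r
# Illustrative responses (made-up values).
answers <- c("free_and_fair", "not_free", "do_not_know", "declined",
             "some_minor_problems", "inap", "not_free")

valid_levels   <- c("free_and_fair", "some_minor_problems",
                    "some_major_problems", "not_free")
missing_levels <- c("do_not_know", "declined", "inap")

# Numeric view: every form of missingness becomes NA and is left out
# of the average.
numeric_code <- match(answers, valid_levels)
mean_score   <- mean(numeric_code, na.rm = TRUE)

# Categorical view: each form of missingness stays a separate category,
# so refusals can be counted and compared across countries.
categorical <- factor(answers, levels = c(valid_levels, missing_levels))
counts <- table(categorical)

print(mean_score)
print(counts)
```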

Survey harmonization with many countries entails tens of thousands of small data management tasks, which, unless automatically documented, logged, and created with reproducible code, make for a hopelessly error-prone process. We believe that our open-source software will bring to light much new statistical information which, while legally open, was never processed due to the large investment needed.

We have also started building experimental APIs whose data is produced by running retroharmonize regularly. We will place cultural access and participation data in the Digital Music Observatory, climate awareness, policy support and self-reported mitigation strategies into the Green Deal Data Observatory, and economy and well-being data into our Economy Data Observatory.

Further plans

Retrospective survey harmonization is a far more complex task than this blogpost suggests, because established survey programs have gathered decades of legacy data in legacy coding schemes and legacy file formats. Putting the data right, and especially putting the invaluable descriptive and administrative (processing) metadata right, is a huge undertaking. We are releasing example code, datasets and charts so that researchers can compare our harmonized results with theirs and help improve our software.

Use our software

The retroharmonize R package can be freely used, modified and distributed under the GPL-3 license. For the main developer and contributors, see the package homepage. If you use it for your work, please kindly cite it as:

Daniel Antal (2021). retroharmonize: Ex Post Survey Data Harmonization. R package version 0.1.17. https://doi.org/10.5281/zenodo.5034752

Download the BibLaTeX entry.

Tutorial to work with the Arab Barometer survey data

Daniel Antal, & Ahmed Shaibani. (2021, June 26). Case Study: Working With Arab Barometer Surveys for the retroharmonize R package (Version 0.1.6). Zenodo. https://doi.org/10.5281/zenodo.5034759

For the replication data to report potential issues and improvement suggestions with the code:

Daniel Antal, & Ahmed Shaibani. (2021). Replication Data for the retroharmonize R Package Case Study: Working With Arab Barometer Surveys (Version 0.1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5034741

Experimental API

We are also experimenting with the automated placement of authoritative and citeable figures and datasets in open repositories. For the climate awareness dataset, get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare. Our plan is to release open data in a modern API with rich descriptive metadata meeting the Dublin Core and DataCite standards, and further administrative metadata for correct coding, joining and further manipulation of the data, or for easy import into your database.

Join our open source effort

Want to help us improve our open data service? Include Latinobarómetro and the Caucasus Barometer in our offering? Join the rOpenGov community of R package developers, and our open collaboration to create the automated data observatories. We are not only looking for developers, but for data curators and service design associates, too.

Open Data - The New Gold Without the Rush

Fri, 18 Jun 2021 17:00:00 +0000

If open data is the new gold, why do even those who release it fail to reuse it? We created an open collaboration of data curators and open-source developers to dig into novel open data sources and/or increase the usability of existing ones. We transform reproducible research software into research-as-service.

Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.

Most open data is not public: it is not downloadable from the Internet. In EU parlance, “open” only means a legal entitlement to get access to it. And even in the rare cases when data is open and public, it is often marred by data quality issues. We are working on the prototypes of a data-as-service and research-as-service built with open-source statistical software that taps into various and often neglected open data sources.

We are in the prototype phase in June, and we intend to have a well-functioning service by the time of the conference. Because we are working only with open-source software elements, our technological readiness level is already very high. The novelty of our process is that we are trying to further develop and integrate a few open-source technology items into technologically and financially sustainable data-as-service and even research-as-service solutions.

Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations’ open data. Instead, they use various, and often not well processed, proprietary sources.

We are taking a new and modern approach to the data observatory concept, and modernizing it with the application of 21st century data and metadata standards, and the new results of reproducible research and data science. Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories, or permanent data collection and dissemination points, but even these do not use the open data of these organizations and their members. We are building open-source data observatories, which run open-source statistical software that automatically processes and documents reusable public sector data (from public transport, meteorology, tax offices, taxpayer funded satellite systems, etc.) and reusable scientific data (from EU taxpayer funded research) into new, high quality statistical indicators.

We are building various open-source data collection tools in R and Python to bring up data from big data APIs and legally open, but not public, and not well served data sources. For example, we are working on capturing representative data from the Spotify API or creating harmonized datasets from the Eurobarometer and Afrobarometer survey programs.
Open data is usually not public; whatever is legally accessible is usually not ready to use for commercial or scientific purposes. In Europe, almost all taxpayer funded data is legally open for reuse, but it is usually stored in heterogeneous formats, processed for an original governmental or scientific need, and documented to varying and often low standards. Our expert data curators are looking for new data sources that should be (re-)processed and re-documented to be usable for a wider community. We would like to introduce our service flow, which touches upon many important aspects of data scientist, data engineer and data curatorial work.
We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.
We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.
We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.
We maintain observatory websites (see: Digital Music Observatory, Green Deal Data Observatory, Economy Data Observatory) where not only the data is available, but we provide tutorials and use cases to make it easier to use them. Our mission is to show a modern, 21st century reimagination of the data observatory concept developed and supported by the UN, EU and OECD, and we want to show that modern reproducible research and open data could make the existing 60 data observatories and the planned new ones grow faster into data ecosystems.

We are working around the open collaboration concept, which is well-known in open source software development and reproducible science, but we try to make this agile project management methodology more inclusive, and include data curators, and various institutional partners into this approach. Based around our early-stage startup, Reprex, and the open-source developer community rOpenGov, we are working together with other developers, data scientists, and domain specific data experts in climate change and mitigation, antitrust and innovation policies, and various aspects of the music and film industry.

Our open collaboration is truly open: new data curators, developers and service designers, even volunteers and citizen scientists are welcome to join.

Our open collaboration is truly open: new data curators, data scientists and data engineers are welcome to join. We develop open-source software in an agile way, so you can join in with an intermediate programming skill to build unit tests or add new functionality, and if you are a beginner, you can start with documentation and testing our tutorials. For business, policy, and scientific data analysts, we provide unexploited, exciting new datasets. Advanced developers can join our development team: the statistical data creation is mainly made in the R language, and the service infrastructure in Python and Go components.

There are Numerous Advantages of Switching from a National Level of the Analysis to a Sub National Level

Wed, 16 Jun 2021 12:00:00 +0000

The new version of our rOpenGov R package regions was released today on CRAN. This package is one of the engines of our experimental open data-as-service Green Deal Data Observatory, Economy Data Observatory, and Digital Music Observatory prototypes, which aim to place open data packages into open-source applications.

In international comparisons, the use of nationally aggregated indicators often has many disadvantages: countries exhibit very different levels of homogeneity, and the number of observations is often too limited for a cross-sectional analysis. When comparing European countries, a few missing cases can limit the cross-section of countries to around 20 cases, which rules out the use of many analytical methods. Working with sub-national statistics has many advantages: the similarity of the aggregation level and the high number of observations can allow more precise control of model parameters and errors, and the number of observations grows from 20 to 200-300.

The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names and their codes change very frequently.

Yet the change from national to sub-national level comes with a huge data processing price. National boundaries are relatively stable, with only a handful of changes in each recent decade, because changing them requires a more-or-less global consensus. But states are free to change their internal administrative boundaries, and they do so frequently. This means that the names, identification codes and boundary definitions of sub-national regions change very often, and joining data from different sources and different years can be very difficult.

Our regions R package helps the data processing, validation and imputation of sub-national, regional datasets and their coding.
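
To illustrate why changing regional codes make joins difficult, here is a minimal base R sketch that recodes an older dataset with a correspondence table before joining it with a newer one. All region codes, values and the correspondence table are made up for the example, and the sketch deliberately does not use the API of the regions package.

```r
# Illustrative data: the same indicator under two code vintages.
# All codes and values are hypothetical.
old_data <- data.frame(geo = c("XX01", "XX02", "XX03"), year = 2015,
                       value = c(10, 20, 30), stringsAsFactors = FALSE)
new_data <- data.frame(geo = c("XX01", "XXA2", "XX03"), year = 2019,
                       value = c(11, 21, 31), stringsAsFactors = FALSE)

# Hypothetical correspondence table: XX02 was relabelled XXA2 without a
# boundary change; the other two regions kept their codes.
correspondence <- data.frame(old_code = c("XX01", "XX02", "XX03"),
                             new_code = c("XX01", "XXA2", "XX03"),
                             stringsAsFactors = FALSE)

# Attach the new code to every old observation.
recoded <- merge(old_data, correspondence,
                 by.x = "geo", by.y = "old_code", all.x = TRUE)

# Codes with no known successor need manual review (or imputation).
unmatched <- recoded$geo[is.na(recoded$new_code)]

recoded$geo <- recoded$new_code
recoded$new_code <- NULL

# After recoding, the two vintages form one consistent panel.
panel <- rbind(recoded, new_data)
panel <- panel[order(panel$geo, panel$year), ]
print(panel)
```

A naive join on the original codes would have silently dropped the relabelled region; real boundary changes (splits and merges) additionally require aggregation or imputation, which this sketch does not cover.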

Switching from a national level of analysis to a sub-national level has numerous advantages, but it comes with a huge price in data processing, validation and imputation, and the regions package aims to help with this process.

You can review the problem, and the code that created the two map comparisons, in the Maping Regional Data, Maping Metadata Problems vignette article of the package. A more detailed problem description can be found in Working With Regional, Sub-National Statistical Products.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed parallel with other rOpenGov packages.

Get the Package

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package as: Daniel Antal, Kasia Kulma, Istvan Zsoldos, & Leo Lahti. (2021, June 16). regions (Version 0.1.7). CRAN. http://doi.org/10.5281/zenodo.4965909

Join us

Join our open collaboration Economy Data Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Open Data is Like Gold in the Mud Below the Chilly Waves of Mountain Rivers

Thu, 10 Jun 2021 07:00:00 +0000

Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine.

As the founder of the automated data observatories that are part of Reprex’s core activities, what type of data do you usually use in your day-to-day work?

The automated data observatories are the results of syndicated research, data pooling, and other creative solutions to the problem of missing or hard-to-find data. The music industry is a very fragmented industry, where market research budgets and data are scattered among tens of thousands of small organizations in Europe. Working for the music and film industry as a data analyst and economist was always a pain, because most of the effort went into trying to find any data that could be analyzed. I spent most of the last 7-8 years trying to find any sort of information, from satellites to government archives, that could be formed into actionable data. I see three big sources of information: textual, numeric, and continuous recordings from on-site, offsite, and satellite sensors. I am much better with numbers than with natural language processing, and I am improving with sensory sources. But technically, I can mint any systematic information (the text of an old book, a satellite image, or an opinion poll) into datasets.

For you, what would be the ultimate dataset, or datasets that you would like to see in the Economy Data Observatory?

I am a data scientist now, but I used to be a regulatory economist, and I have worked a lot with competition policy and monopoly regulation issues. Our observatories can automatically monitor market and environmental processes, which would allow us to get into computational antitrust. Peter Ormosi, our competition curator, is particularly interested in killer acquisitions: approved mergers of big companies that end up piling up patents that are not used. I am more interested in describing systematically which markets are getting more concentrated and more competitive, in real time. Does data concentration coincide with market concentration?

To bring an example from the realm of our Digital Music Observatory, which was a prototype for this one, I have been working for some time on creating streaming volume and price indexes, like the Dow Jones Industrial Average or the various bond market indexes, that tell us more about price, demand, and potential revenue in music streaming markets all over the world. We did a first take on this in the Central European Music Industry Report, and recently we iterated on the model for the UK Intellectual Property Office and the UK Music Creators’ Earnings project. We want to take this further to create a pan-European streaming market index, and we will probably be the first to actually be able to report on music market concentrations, and in fact, more or less in real time.

We would like to further develop our 20-country streaming indexes into a global music market index.

Is there a number or piece of information that recently surprised you? If so, what was it?

There were a few numbers that surprised me, and some of them were brought up by our observatory teams. Karel is talking about the fact that not all green energy is green at all: many hydropower stations contribute to the greenhouse effect rather than reduce it. Annette brought up the growing interest in the Dalmatian breed after the Disney 101 Dalmatians movies, and it reminded me of the astonishing growth in interest in chess sets, chess tutorials, and platform subscriptions after the success of Netflix’s The Queen’s Gambit.

‘The Queen’s Gambit’ Chess Boom Moves Online, by Rachael Dottle on bloomberg.com

Annette is talking about the importance of cultural influencers, and on that theme, what could be more exciting than the fact that Netflix’s biggest success so far is not a detective series or a soap opera but a coming-of-age story of a female chess prodigy? Intelligence is sexy, and we are in the intelligence business.

But to tell a more serious and more sobering number, I recently read with surprise that there are more people smoking cigarettes on Earth in 2021 than in 1990. Population growth in developing countries replaced the shrinking number of developed country smokers. While I live in Europe, where smoking is strongly declining, it reminds me that Europe’s population is a small part of the world. We cannot take for granted that our home-grown experiences about the world are globally valid.

Do you have a good example of really good, or really bad use of data?

FiveThirtyEight.com had a wonderful podcast series, produced by Jody Avirgan, called What’s the Point. It is exactly about good and bad uses of data, and each episode is super interesting. Maybe the most memorable is Why the Bronx Really Burned. New York City tried to measure fire response times, identify redundancies in service, and close or re-allocate fire stations accordingly. What resulted, though, was a perfect storm of bad data: The methodology was flawed, the analysis was rife with biases, and the results were interpreted in a way that stacked the deck against poorer neighborhoods. It is similar to many stories told in a very compelling argument by Catherine D’Ignazio and Lauren F. Klein in their much celebrated book, Data Feminism. Usually, the bad use of data starts with a bad data collection practice. Data analysts in corporations, NGOs, public policy organizations and even in science usually analyze the data that is available.

You can find these examples, together with many more that our contributors recommend, in the motivating examples of the Create New Datasets and the Remain Critical parts of our onboarding material. We hope that more and more professionals and citizen scientists will help us to create high-quality and open data.

The real power lies in designing a data collection program. A consistent data collection program usually requires an investment that only powerful organizations, such as government agencies, very large corporations, or the richest universities can afford. You cannot really analyze the data that is not collected and recorded; and usually what is not recorded is more interesting than what is. Our observatories want to democratize the data collection process and make it more available, more shared with research automation and pooling.

From your perspective, what do you see being the greatest problem with open data in 2021?

I have been involved with open data policies since 2004. The problem has not changed much: more and more data are available from governmental and scientific sources, but in a form that makes them useless. Data without a clear description and clear processing information is useless for analytical purposes: it cannot be integrated with other data, and it cannot be trusted and verified. If researchers or government entities that fall under the Open Data Directive release data for reuse without descriptive or processing metadata, it is almost as if they did not release anything. You need this additional information to make valid analyses of the data, and reverse-engineering it may cost more than recollecting the data in a properly documented process. Our developers, particularly Leo and Pyry, talk eloquently about why you have to be careful even with governmental statistical products, and constantly be on the lookout for data quality problems.

Our API not only publishes descriptive and processing metadata alongside our data, but we also make all critical elements of our processing code available for peer-review on rOpenGov.

What do you think the Economy Data Observatory, and our other automated observatories do, to make open data more credible in the European economic policy community and be accepted as verified information?

Most of our work is in research automation, and a very large part of our efforts is aimed at reverse engineering missing descriptive and processing metadata. In a way, I like to compare our working method to that of the open-source intelligence platform Bellingcat. They were able to use publicly available, scattered information from satellites and social media to identify each member of the Russian military company that illegally entered the territory of Ukraine and shot down Malaysia Airlines flight MH17 with 298, mainly Dutch, civilians on board.

How we create value for research-oriented consultancies, public policy institutes, university research teams, journalists or NGOs.

We do not do such investigations, but we work very similarly to them in how we filter through many data sources and attempt to verify them when their descriptions and processing history are unknown. In the last few years, we were able to restore the metadata of many European and African open data surveys, economic impact and environmental impact data, and many other open data that had been lying around for many years without users.

Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine. I think we will come to as surprising and strong findings as Bellingcat, but we are not focusing on individual events and stories, but on social and environmental processes and changes.

Join us

Comparing Data to Oil is a Cliché: Crude Oil Has to Go Through a Number of Steps and Pipes Before it Becomes Useful

Mon, 07 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, and as an economic sociologist, what type of data do you usually use in your work?

Generally speaking, what interests me is people’s access to (or inequalities in accessing) different types of resources and their ability to transform these resources into other types of resources. The data I usually work with is the kind of data that is actually nicely covered by existing rOpenGov tools: data about population demographics and administrative units from Statistics Finland, statistical information on welfare and health from Sotkanet, and also data from Eurostat. Aside from these, a lot of my information of course comes from surveys and texts scraped from the internet.

We are placing the growing number of rOpenGov tools in a modern application with a user-friendly service and a modern data API.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

Late spring and early summer is, at least for me, defined by the Eurovision Song Contest. Every year, watching the contest makes me ponder the state of the music industry in my home country, Finland, as well as in Europe. Was the song produced by homegrown talent or was it imported? Was it better received by the professional jury or the public? How well does the domestic appeal of an artist translate to the international stage? Many interesting phenomena are difficult to quantify in a meaningful way, and writing a catchy song with international appeal is probably more an art than a science. Nevertheless, that should not deter us from trying, as music, too, is bound by certain rules and regularities that can be researched.

Music, too, is bound by certain rules and regularities that can be researched. Our Digital Music Observatory and its Listen Local experimental App does this exactly, and we would love to create Eurovision musicology datasets. Photo: Eurovision Song Contest 2021 press photo by Jordy Brada

Why did you decide to join the EU Datathon challenge team and why do you think that this would be a game changer for researchers and policymakers?

The challenge has, in my opinion, great potential in leading by example when it comes to open data access and reproducible research. Comparing data to oil is a common phrase but fitting in the sense that crude oil has to go through a number of steps and pipes before it becomes useful. Most users and especially policymakers appreciate ease-of-use of the finished product, but the quality of the product and the process must also be guaranteed somehow. Openness and peer-review practices are the best guarantors in the field of data, just as industrial standards and regulations are in the oil industry.

We provide many layers of fully transparent quality control about the data we are placing in our data APIs and provide for our end-users.

Join us

Creating Algorithmic Tools to Interpret and Communicate Open Data Efficiently

Fri, 04 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, what type of data do you usually use in your work?

As an academic data scientist whose research focuses on the development of general-purpose algorithmic methods, I work with a range of applications from life sciences to humanities. Population studies play a big role in our research, and often the information that we can draw from public sources - geospatial, demographic, environmental - provides invaluable support. We typically use open data in combination with sensitive research data but some of the research questions can be readily addressed based on open data from statistical authorities such as Statistics Finland or Eurostat.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

One line of our research analyses the historical trends and spread of knowledge production, in particular book printing based on large-scale metadata collections. It would be interesting to extend this research to music, to understand the contemporary trends as well as the broader historical developments. Gaining access to a large systematic collection of music and composition data from different countries across long periods of time would make this possible.

Why did you decide to join the challenge and why do you think that this would be a game changer for researchers and policymakers?

Joining the challenge was a natural development based on our overall activities in this area; the rOpenGov project has been around for a decade now, since the early days of the broader open data movement. This has also created an active international developer network and we felt well equipped for picking up the challenge. The game changer for researchers is that the project highlights the importance of data quality, even when dealing with official statistics, and provides new methods to solve these issues efficiently through the open collaboration model. For policymakers, this provides access to new high-quality curated data and case studies that can support evidence-based decision-making.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Regarding open government data, one of my favorites is not a single data source but a data representation standard. The px format is widely used by statistical authorities in various countries, and this has allowed us to create R tools that allow the retrieval and analysis of official statistics from many countries across Europe, spanning dozens of statistical institutions. Standardization of open data formats allows us to build robust algorithmic tools for downstream data analysis and visualization. Open government data is still too often shared in obscure, non-standard or closed-source file formats and this is creating significant bottlenecks for the development of scalable and interoperable AI and machine learning methods that can harness the full potential of open data.

Regarding open government data, one of my favorites is not a single data source but a data representation standard, the Px format.

From your perspective, what do you see being the greatest problem with open data in 2021?

Although there are a variety of open data sources available (and the numbers continue to increase), the availability of open algorithmic tools to interpret and communicate open data efficiently is lagging behind. One of the greatest challenges for open data in 2021 is to demonstrate how we can maximize the potential of open data by designing smart tools for open data analytics.

What can our automated data observatories do to make open data more credible in the European economic policy community and be accepted as verified information?

The role of the professional network backing up the project, and the possibility of getting critical feedback and later adoption by the academic communities will support the efforts. Transparency of the data harmonization operations is the key to credibility, and will be further supported by concrete benchmarks that highlight the critical differences in drawing conclusions based on original sources versus the harmonized high-quality data sets.

How can we ensure the long-term sustainability of the efforts?

The extent of open data space is such that no single individual or institution can address all the emerging needs in this area. The open developer networks play a huge role in the development of algorithmic methods, and strong communities have developed around specific open data analytical environments such as R, Python, and Julia. These communities support networked collaboration and provide services such as software peer review. The long-term sustainability will depend on the support that such developer communities can receive, both from individual contributors as well as from institutions and governments.

Join us

Economic and Environment Impact Analysis, Automated for Data-as-Service

Thu, 03 Jun 2021 16:00:00 +0000

We have released a new version of iotables as part of the rOpenGov project. The package, as the name suggests, works with European symmetric input-output tables (SIOTs). SIOTs are among the most complex governmental statistical products. They show how each country’s 64 agricultural, industrial, service, and sometimes household sectors relate to each other. They are estimated from various components of GDP and from tax collection data, at least every five years.

SIOTs offer great value to policy-makers and analysts who want to make more than educated guesses on how a million euros, pounds or Czech korunas spent on a certain sector will impact other sectors of the economy, employment or GDP. What happens when a bank starts to give new loans and advertise them? How is an increase in economic activity going to affect the amount of wages paid, and where will consumers most likely spend their wages? As national economies begin to reopen after the COVID-19 pandemic lockdowns, one timely use of SIOTs is to calculate the direct and indirect employment effects or value added effects of government grant programs to sectors such as the cultural and creative industries, or to actors such as venues for performing arts, movie theaters, bars and restaurants.

Making such calculations requires a bit of matrix algebra and an understanding of input-output economics, direct and indirect effects, and multipliers. Economists, grant designers and policy makers have those skills, but until now, such calculations were made either in cumbersome Excel sheets or in proprietary software, because the key to these calculations is to keep vectors and matrices, which have at least one dimension of 64, perfectly aligned. We made this process reproducible with iotables and eurostat on rOpenGov.

Our iotables package creates direct, indirect effects and multipliers programmatically. Our observatory will make those indicators available for all European countries.

Accessing and tidying the data programmatically

The iotables package is in a way an extension to the eurostat R package, which provides programmatic access to the Eurostat data warehouse. The reason for releasing a new package is that working with SIOTs requires plenty of meticulous data wrangling based on various metadata sources, apart from actually accessing the data itself. When working with matrix equations, the bar is higher than with tidy data. Not only must your rows and columns match, but their ordering must strictly conform to the quadrants of the matrix system, including the connecting trade or tax matrices.

When you download a country’s SIOT table, you receive a long form data frame, a very-very long one, which contains the matrix values and their labels like this:

## Table naio_10_cp1700 cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds
# we save it for further reference here
saveRDS(naio_10_cp1700, "not_included/naio_10_cp1700_date_code_FF.rds")
# should you need to retrieve the large tempfiles, they are in
dir (file.path(tempdir(), "eurostat"))
dplyr::slice_head(naio_10_cp1700, n = 5)
## # A tibble: 5 x 7
##   unit    stk_flow induse  prod_na geo       time        values
##   <chr>   <chr>    <chr>   <chr>   <chr>     <date>       <dbl>
## 1 MIO_EUR DOM      CPA_A01 B1G     EA19      2019-01-01 141873.
## 2 MIO_EUR DOM      CPA_A01 B1G     EU27_2020 2019-01-01 174976.
## 3 MIO_EUR DOM      CPA_A01 B1G     EU28      2019-01-01 187814.
## 4 MIO_EUR DOM      CPA_A01 B2A3G   EA19      2019-01-01      0
## 5 MIO_EUR DOM      CPA_A01 B2A3G   EU27_2020 2019-01-01      0

The metadata reads like this: the units are in millions of euros, we are analyzing domestic flows, and the national account items B1-B2 for the industry A01. The information of a 64x64 matrix (the SIOT) and its connecting matrices, such as taxes, employment, or CO₂ emissions, must be placed in exactly one correct ordering of columns and rows. Every single data wrangling error will usually lead to an error (the matrix equation has no solution) or, what is worse, to an algebraic error that is very difficult to trace. Our package not only labels this data meaningfully, but creates very tidy data frames that contain each necessary matrix or vector with a key column.

The iotables package contains the abbreviations and human-readable labels of three statistical vocabularies: the so-called COICOP product codes, the NACE industry codes, and the vocabulary of the ESA2010 definition of national accounts (which is the government equivalent of corporate accounting).

Our package currently solves all equations for direct, indirect effects, multipliers and inter-industry linkages. Backward linkages show what happens with the suppliers of an industry, such as catering or advertising in the case of music festivals, if the festivals reopen. The forward linkages show how much extra demand this creates for connecting services that treat festivals as a ‘supplier’, such as cultural tourism.

Let’s see an example

## Downloading employment data from the Eurostat database.
## Table lfsq_egan22d cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/lfsq_egan22d_date_code_FF.rds

and match it with the latest structural information from the Symmetric input-output table at basic prices (product by product) Eurostat product. A quick look at the Eurostat website already shows that there is a lot of work ahead to make the data look like an actual symmetric input-output table. Download it with iotable_get(), which does basic labelling and preprocessing on the raw Eurostat files. Because of the size of the unfiltered dataset on Eurostat, the following code may take several minutes to run.

sk_io <- iotable_get ( labelled_io_data = NULL,
source = "naio_10_cp1700", geo = "SK",
year = 2015, unit = "MIO_EUR",
stk_flow = "TOTAL",
labelling = "iotables" )
## Reading cache file C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds
## Table naio_10_cp1700 read from cache file: C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds
## Saving 808 input-output tables into the temporary directory
## C:\Users\...\Temp\RtmpGQF4gr
## Saved the raw data of this table type in temporary directory C:\Users\...\Temp\RtmpGQF4gr/naio_10_cp1700.rds.

The input_coefficient_matrix_create() function creates the input coefficient matrix, which is used for most of the analytical functions.

$a_{ij} = X_{ij} / x_j$

It checks the correct ordering of columns, and furthermore it replaces 0 values with 0.000001 to avoid division by zero.
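The formula above can be illustrated with a minimal base R sketch on a toy 3-sector table. The sector names and numbers are hypothetical, not Eurostat data, and this is not the package's own implementation; it only shows that each column of the flow matrix is divided by the total output of the using sector, after checking that the labels line up.

```r
# Toy 3-sector flow matrix (hypothetical numbers, not Eurostat data).
# Rows are supplying sectors, columns are using sectors.
sectors <- c("agriculture", "manufacturing", "services")
X <- matrix(
  c(10, 20,  5,
    15, 40, 10,
     5, 30, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(sectors, sectors)
)

# Total output of each sector, in the same order as the columns of X.
x <- c(agriculture = 100, manufacturing = 200, services = 150)

# Guard against misaligned labels before doing any algebra.
stopifnot(identical(colnames(X), names(x)))

# a_ij = X_ij / x_j: divide every column j by the output of sector j.
A <- sweep(X, MARGIN = 2, STATS = x, FUN = "/")
```

The explicit label check mirrors the point made earlier: with 64 sectors, a silent reordering of columns produces a valid-looking but wrong matrix.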

input_coeff_matrix_sk <- input_coefficient_matrix_create(
data_table = sk_io
)
## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.

Then you can create the Leontief inverse, which contains all the structural information about the relationships of the 64x64 sectors of the chosen country, in this case Slovakia, ready for the main equations of input-output economics.

I_sk <- leontieff_inverse_create(input_coeff_matrix_sk)
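As a rough illustration of what such an inverse contains, the textbook Leontief inverse is (I - A)^-1, where A is the input coefficient matrix. The base R sketch below uses a hypothetical 2-sector matrix; it is an assumption-laden toy and not how the package function is implemented, which also handles labels and empty sectors.

```r
# Hypothetical 2-sector input coefficient matrix (not Eurostat data).
sectors <- c("goods", "services")
A <- matrix(
  c(0.2, 0.3,
    0.1, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(sectors, sectors)
)

# Leontief inverse: L = (I - A)^-1
I <- diag(nrow(A))
L <- solve(I - A)

# Total output needed to satisfy a final demand vector f: x = L %*% f
f <- c(goods = 10, services = 0)
x <- L %*% f
```

Each column of L shows how much output every sector must produce, directly and indirectly, to deliver one unit of final demand for the column's sector.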

And take out the primary inputs:

primary_inputs_sk <- coefficient_matrix_create(
data_table = sk_io,
total = 'output',
return = 'primary_inputs')
## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.

Now let’s see what happens if the government tries to stimulate the economy in three sectors, agriculture, car manufacturing, and R&D, with a billion euros. Direct effects measure the initial, direct impact of the change in demand and supply for a product. When production goes up, it will create demand in all supply industries (backward linkages) and create opportunities in the industries that use the product themselves (forward linkages).

library(dplyr) # provides %>%, select(), filter() and all_of()
direct_effects_create( primary_inputs_sk, I_sk ) %>%
  select ( all_of(c("iotables_row", "agriculture",
                    "motor_vechicles", "research_development"))) %>%
  filter (.data$iotables_row %in% c("gva_effect", "wages_salaries_effect",
                                    "imports_effect", "output_effect"))
##            iotables_row agriculture motor_vechicles research_development
## 1        imports_effect   1.3684350       2.3028203            0.9764921
## 2 wages_salaries_effect   0.2713804       0.3183523            0.3828014
## 3            gva_effect   0.9669621       0.9790771            0.9669467
## 4         output_effect   2.2876287       3.9840251            2.2579634

Car manufacturing requires many imported components, so each extra unit of demand will create a large importing activity. R&D will create the most local wages (and support the most jobs) because research is job-intensive. As we can see, the effects on imports, wages, gross value added (which will end up in the GDP) and output are very different in these three sectors.

This is not the total effect, because some of the increased production will translate into income, which in turn will be used to create further demand in all parts of the domestic economy. The total effect is characterized by multipliers.

Then solve for the multipliers:

multipliers_sk <- input_multipliers_create(
primary_inputs_sk %>%
filter (.data$iotables_row == "gva"), I_sk )
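For orientation, the relationship between effects and multipliers can be written in the textbook form used in input-output manuals. The symbols b, L, e and m are notation introduced here for illustration, and the package's internal computation may differ in details such as the treatment of zero coefficients.

```latex
\begin{aligned}
L   &= (I - A)^{-1}, \qquad a_{ij} = X_{ij} / x_j \\
e_j &= \sum_i b_i \, L_{ij} \\
m_j &= e_j / b_j
\end{aligned}
```

Here b_i is the primary input coefficient of industry i (for example, gross value added per unit of output), e_j is the effect of one unit of final demand for product j, and m_j is the corresponding multiplier. Under this reading, an industry with a small own GVA coefficient but long domestic supply chains can show a large GVA multiplier.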

And select a few industries:

set.seed(12)
multipliers_sk %>%
tidyr::pivot_longer ( -all_of("iotables_row"),
names_to = "industry",
values_to = "GVA_multiplier") %>%
select (-all_of("iotables_row")) %>%
arrange( -.data$GVA_multiplier) %>%
dplyr::sample_n(8)
## # A tibble: 8 x 2
##   industry               GVA_multiplier
##   <chr>                           <dbl>
## 1 motor_vechicles                  7.81
## 2 wood_products                    2.27
## 3 mineral_products                 2.83
## 4 human_health                     1.53
## 5 post_courier                     2.23
## 6 sewage                           1.82
## 7 basic_metals                     4.16
## 8 real_estate_services_b           1.48

Vignettes

The Germany 1990 vignette provides an introduction to input-output economics and re-creates the examples of the Eurostat Manual of Supply, Use and Input-Output Tables, by Jörg Beutel (Eurostat Manual).

The United Kingdom Input-Output Analytical Tables vignette by Daniel Antal, based on the work edited by Richard Wild, is a use case on how to correctly import data from outside Eurostat (i.e. not with eurostat::get_eurostat()) and join it properly to a SIOT. We also used this example to create unit tests of our functions from a published, official government statistical release.

Finally, Working With Eurostat Data is a detailed use case that covers all the current functionalities of the package by comparing two economies, Czechia and Slovakia, and guides you through a lot more examples than this short blogpost.

Our package was originally developed to calculate GVA and employment effects for the Slovak music industry (see our Slovak Music Industry Report), and similar calculations for the Hungarian film tax shelter. We can now programmatically create reproducible multipliers for all European economies in the Digital Music Observatory, and create further indicators for economic policy making in the Economy Data Observatory.

Environmental Impact Analysis

Our package allows the calculation of various economic policy scenarios, such as changing the VAT on meat or effects of re-opening music festivals on aggregate demand, GDP, tax revenues, or employment. But what about the CO₂, methane and other greenhouse gas effects of the reopening festivals, or the increasing meat prices?

Technically our package can already calculate such effects, but to do so, you have to carefully match further statistical vocabulary items used by the European Environment Agency about air pollutants and greenhouse gases.

The last released version of iotables is Importing and Manipulating Symmetric Input-Output Tables (Version 0.4.4). Zenodo. https://doi.org/10.5281/zenodo.4897472, but we are already working on a new major release. In that release, we are planning to build the necessary vocabulary into the metadata functions to increase the functionality of the package, and create new indicators for our Green Deal Data Observatory. This experimental data observatory is creating new, high quality statistical indicators from open governmental and open science data sources that have not seen the daylight yet.

rOpenGov and the EU Datathon Challenges

rOpenGov, Reprex, and other open collaboration partners teamed up to build further on our expertise in open source statistical software development: we want to create a technologically and financially feasible data-as-service to put our reproducible research products into wider use for the business analyst, scientific researcher and evidence-based policy design communities.

rOpenGov is a community of open governmental data and statistics developers with many packages that make programmatic access to and work with open data possible in the R language. Reprex is a Dutch startup that teamed up with rOpenGov and other open collaboration partners to create a technologically and financially feasible service to exploit reproducible research products for the wider business, scientific and evidence-based policy design community. Open data is a legal concept: it means that you have the right to reuse the data, but often the reuse requires significant programming and statistical know-how. We entered the annual EU Datathon competition in all three challenges with our applications to provide not only open-source software, but also daily updated, validated, documented, high-quality statistical indicators as open data in an open database. Our iotables package is one of our many open-source building blocks to make open data more accessible to all.

Join our open collaboration Digital Music Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or economic policies, particularly computational antitrust, innovation and small enterprises? Check out our Economy Data Observatory team!

New Indicators for Computational Antitrust

Wed, 02 Jun 2021 17:00:00 +0000

As someone who’s worked in data for almost 20 years, what type of data do you usually use in your research?

In my field (industrial organisation, competition policy), company level financial data, and product price and sales data have been the conventional building blocks of research papers. Ideally this has been the sort of data that I would seek out for my work. Of course as academic researchers we often get knocked back by the reality of data access and availability. I would think that industrial organisation is one of those fields where researchers have to be quite innovative in terms of answering interesting and relevant policy questions, whilst having to operate in an environment where most relevant data is proprietary and very expensive. Against this backdrop, I have worked with neatly organised proprietary datasets, self-assembled data collections, and also textual data.

From your experience working with various data sets, models, and frameworks, what would be the ultimate dataset, or datasets that you would like to see from the Economy Data Observatory?

There seems to be an emerging consensus that market concentration and markups have been continuously increasing across the economy. But most of these works use industry classification to define markets. One of the things I’d really like to see coming out of the Economy Data Observatory is a mapping of what we call antitrust markets.

Mapping NACE to Antitrust Markets.

Available datasets use standard industry classification (such as NACE in the EU), which is often very different from what we call a product market in microeconomics. Product markets are defined by demand- and supply-side substitutability, which is a dynamically evolving feature and difficult to capture systematically on a wider scale. But with the recent proliferation of data and the growth (and fall in price) of computing power, I am positive that we could attempt to map out the European economy along these product market boundaries. Of course this is not without challenges. For example in digital markets, traditional ways to define markets have caused serious challenges to competition authorities around the world.

I believe that there is an immensely rich, and largely unexplored source of information in unstructured textual data that would be hugely useful for applied microeconomic works, including my own area of IO and competition policy. This includes a large corpus of administrative and court decisions that relate to businesses, such as merger control decisions of the European Commission. To give two examples from my experience, we’ve used a large corpus of news reports related to various firms to gauge the reputational impact of European Commission cartel investigations, or we’ve trained an algorithm to be able to classify US legislative bills and predict whether they have been lobbied or not. Finding a way to collect and convert this unstructured data into a format that is relevant and useful for users is not a trivial challenge, but is one of the most exciting parts of our Economy Data Observatory plans (see related project plan).

What is an idea that you consider will be a game changer for researchers and/or policymakers?

Partly talking in the past tense, the use of data driven approaches, automation in research, and machine learning have been increasingly influential and I think this trend will extend to all areas of social science. 10 years ago, to do machine learning, you had to build your models from scratch, typically requiring a solid understanding of programming and linear algebra. Today, there are readily available deep learning frameworks like TensorFlow, Keras, PyTorch, to design a neural network for your own application. 10 years ago, natural language processing would have only been relevant for a small group of computational linguists. Today we have massive word embedding models trained on an enormous corpus of texts, at the fingertips of any researcher. 10 years ago, the cost of computing power would have made it prohibitive for most researchers to run even relatively shallow neural networks. Today, I can run complex deep learning models on my laptop using cloud computing servers. As a result of these developments, whereas 10 years ago one would have needed a small (or large) research team to explore certain research questions, much of this can now be automated and be done by a single researcher. For researchers without access to large research grants and without the ability to hire a research team, this has truly been an amazing victory for the democratisation of research.

You can already try out our API.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

As a competition economist, I tend to need very specific data for each research question I’m working on, which has to be collected from scratch. On the other hand, most works do require us to use data that has already been collected and made available. For example, access to census data has been immensely useful in ensuring that we can control for local demographic features, in papers where local competition plays a role. Census data is made readily available by most governments, but I particularly liked the Australian data, partly because they run a census every 5 years, but also because they have made the data available through a great table making tool.

Is there a number that recently surprised you? What was it?

I have these moments of surprise fairly frequently. To give one example from something I’m currently working on, looking at the distributional impact of increasing market concentration, we’ve found that low income households experience a larger increase in the petrol retail margin when market concentration increases than high income households. This fits nicely with theoretical works on search in homogeneous costs, i.e. low income households are less good at engaging with the market, and, as a result, if suppliers can price discriminate, they will charge a higher margin to these households.

The figure below shows our raw data (18 years of petrol station level daily price data from Western Australia) for low and high income areas, and the increase in the margin following an increase in market concentration (vertical dotted line). The left hand side, low income areas, displays a large increase in the margin (when compared to a control group), whereas the right hand side, high income areas, shows no change. In our paper of course we build a fairly data intensive quasi experiment for identification of the treatment effect of changing market concentration on the price margin applied to various demographic groups.

Surprising findings: market concentration and margin changes for petrol stations.

Do you have a good example of really good, or really bad use of data science/data curation?

Out of professional courtesy I really wouldn’t like to mention names from academic research as examples of bad use of data. But there are ample examples from newspaper coverage of data related work, or simply the misuse of data by newspapers. This may be intentional but is often a result of journalists not having the necessary training in using and analysing data.

When the press finds a piece of academic research interesting, often bad things come out of it. This is often because not all journalists are well equipped to interpret scientific findings. As a result, sometimes conclusions are drawn as a result of a misinterpretation of good data analysis. Correlation interpreted as causation is a frequently recurring example. Equally bad is when press coverage changes the incentives for producing good research: scientists work too hard for their work to be noticed by the press, and sacrifice scientific rigour in data analysis for the sake of media attention. There can also be less discernible but equally damaging errors.

In many instances, requiring researchers to pre-disclose the tests they are going to run on the data helps maintain credibility. Moreover, I am always a bit suspicious if the authors do not give access to their data for reproduction.

Our Economy Data Observatory places all new indicators on Zenodo with a DOI, and asks future individual contributors to deposit their data for replication there.

What do you see as the greatest challenge with open data in 2021?

The things I mentioned above about the democratisation of research driven by automation and access to big data do raise serious challenges as well. The obvious one is to do with the fact that there are enormous economies of scale in the use of data. As such, larger players will always be better positioned to outdo their smaller competitors, simply as a result of their superior data and infrastructure (for example having more granular consumer data allowing them to offer a better designed, customised experience for the consumer). Like many others, I see this as the biggest challenge for open data: to level the playing field for smaller players. This is not a trivial task at all; and even if, miraculously, small businesses could access the same data as the biggest players, they still would not have the capacity or the ability to use this data. So allowing access to data alone is unlikely to solve any of these problems. I would say that fostering engagement with open data is probably as big a challenge as creating the open data in the first place.

How do you envision the Economy Data Observatory making open data more credible in the European economic policy community and accepted as verified information?

I think starting with a focused agenda is a good idea. For example, linking up with the Centre for Competition Policy means that we have an initial focus of competition policy relevant economic data. This is still a large domain, but it is one where we have ample expertise. Starting with specific research questions such as linking competition enforcement and merger decisions to related information on innovation and ownership data puts the Economy Data Observatory at the heart of some of the most topical policy questions, such as the role of killer acquisitions (acquisitions with the intent to kill off sources of rival innovation), or common ownership, both of which are increasingly discussed in policy and practitioner circles. Once we have established ourselves as a credible source of data in the competition policy community, we can look into joining this up with other policy areas, and also with our other Data Observatories (Music and Green Deal).

Join us

Reprex is Contesting all Three Challenges of the EU Datathon 2021 Prize

Fri, 21 May 2021 20:00:00 +0000

Reprex, a Dutch start-up enterprise formed to utilize open source software and open data, is looking for partners in an agile, open collaboration to win at least one of the three EU Datathon Prizes. We are looking for policy partners, academic partners and a consultancy partner. Our project is based on agile, open collaboration with three types of contributors.

With our competing prototypes we want to show that we have a research automation technology that can find open data, process it and validate it into high-quality business, policy or scientific indicators, and release it with daily refreshments in a modern API.

We are looking for institutions to challenge us with their data problems, and sponsors to increase our capacity. Over the next 5 months, we need to find a sustainable business model for a high-quality and open alternative to other public data programs.

The EU Datathon 2021 Challenge

“To take part, you should propose the development of an application that links and uses open datasets.” - our data curator team
“Your application … is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.” - this application is developed by our technology contributors
“Your application should showcase opportunities for concrete business models or social enterprises.” - our service development team is working to make this happen!
We use open source software and open data. The applications are hosted on the cloud resources of Reprex, an early-stage technology startup currently building a viable, open-source, open-data business model to create reproducible research products.
We are working together with experts in the domain as curators (check out our guidelines if you want to join: Data Curators: Get Inspired!).
Our development team works on an open collaboration basis. Our indicator R packages, and our services are developed together with rOpenGov.

Mission statement

We want to win an EU Datathon prize by making the vast, already-available governmental and scientific open data usable for policy-makers, scientific researchers, and business researcher end-users.

“To take part, you should propose the development of an application that links and uses open datasets. Your application should showcase opportunities for concrete business models or social enterprises. It is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.”

We aim to win at least one first prize in the EU Datathon 2021. We are contesting all three challenges, which are related to the EU’s official strategic policies for the coming decade.

Challenge 1: A European Green Deal

Our Green Deal Data Observatory connects socio-economic and environmental data to help understand and combat climate change.

Challenge 1: A European Green Deal, with a particular focus on the European Climate Pact, the Organic Action Plan, and the New European Bauhaus, i.e., mitigation strategies.

Climate change and environmental degradation are an existential threat to Europe and the world. To overcome these challenges, the European Union created the European Green Deal strategic plan, which aims to make the EU’s economy sustainable by turning climate and environmental challenges into opportunities and making the transition just and inclusive for all.

Our Green Deal Data Observatory is a modern reimagination of existing ‘data observatories’; currently, there are over 70 permanent international data collection and dissemination points. One of our objectives is to understand why the dozens of the EU’s observatories do not use open data and reproducible research. We want to show that open governmental data, open science, and reproducible research can lead to a higher quality and faster data ecosystem that fosters growth for policy, business, and academic data users.

We provide high quality, tidy data through a modern API which enables data flows between public and proprietary databases. We believe that introducing Open Policy Analysis standards with open data, open-source software, and research automation can help the Green Deal policymaking process. Our collaboration is open for individuals, citizen scientists, research institutes, NGOs, and companies.

Challenge 2: An economy that works for people

Our Economy Data Observatory will focus on competition, small and medium-sized enterprises and robotization.

Challenge 2: An economy that works for people, with a particular focus on the Single market strategy, and particular attention to the strategy’s goals of 1. Modernising our standards system, 2. Consolidating Europe’s intellectual property framework, and 3. Enabling the balanced development of the collaborative economy strategic goals.

Big data and automation create new inequalities and injustices and have the potential to create a jobless growth economy. Our Economy Data Observatory is a fully automated, open source, open data observatory that produces new indicators from open data sources and experimental big data sources, with authoritative copies and a modern API.

Our observatory monitors the European economy to protect consumers and small companies from unfair competition, both from data and knowledge monopolization and robotization. We take a critical Small and Medium-Sized Enterprises (SME)-, intellectual property, and competition policy point of view of automation, robotization, and the AI revolution on the service-oriented European social market economy.

We would like to create early-warning, risk, economic effect, and impact indicators that can be used in scientific, business, and policy contexts for professionals who are working on re-setting the European economy after a devastating pandemic in the age of AI. We are particularly interested in designing indicators that can be early warnings for killer acquisitions, algorithmic and offline discrimination against consumers based on nationality or place of residence, and signs of undermining key economic and competition policy goals. Our goal is to help small and medium-sized enterprises and start-ups to grow, and to furnish data that encourages the financial sector to provide loans and equity funds for their growth.

Challenge 3: A Europe fit for the digital age

Our Digital Music Observatory is not only a demo of the European Music Observatory, but also a testing ground for data governance, Digital Services Act, and trustworthy AI problems.

Challenge 3: A Europe fit for the digital age, with a particular focus on Artificial Intelligence, the European Data Strategy, the Digital Services Act, Digital Skills and Connectivity.

The Digital Music Observatory (DMO) is a fully automated, open source, open data observatory that creates public datasets to provide a comprehensive view of the European music industry. It provides high-quality and timely indicators in all four pillars of the planned official European Music Observatory as a modern, open source and largely open data-based, automated, API-supported alternative solution for this planned observatory. The insight and methodologies we are refining in the DMO are applicable and transferable to about 60 other data observatories funded by the EU which do not currently employ governmental or scientific open data.

Music is one of the most data-driven service industries where most sales are currently executed by AI-driven autonomous systems that influence market shares and intellectual property remuneration. We provide a template that enables making these AI-driven systems accountable and trustworthy, with the goal of re-balancing the legitimate interests of creators, distributors, and consumers. Within Europe, this new balance will be an important use case of the European Data Strategy and the Digital Services Act.

The DMO is a fully functional service that can serve as a testing ground of the European Data Strategy. It can showcase the ways in which the music industry is affected by the problems that the Digital Services Act and European Trustworthy AI initiatives attempt to regulate. It is being built in open collaboration with national music stakeholders, NGOs, academic institutions, and industry groups.

Our Product/Market Fit was validated in the world’s 2nd ranked university-backed incubator program, the Yes!Delft AI Validation Lab. We are currently developing this project with the help of the JUMP European Music Market Accelerator program.

Problem Statement

The EU has an 18-year-old open data regime, and it makes public taxpayer-funded data worth tens of billions of euros per year; the Eurostat program alone handles 20,000 international data products, including at least 5,000 pan-European environmental indicators.

As open science principles gain increased acceptance, scientific researchers are making hundreds of thousands of valuable datasets public and available for replication every year.

The EU, the OECD, and UN institutions run around 100 data collection programs, so-called ‘data observatories’ that more or less avoid touching this data, and buy proprietary data instead. Annually, each observatory spends between 50 thousand and 3 million EUR on collecting untidy and proprietary data of inconsistent quality, while never even considering open data.

Our automated data observatories are modern reimaginations of the existing observatories that do not use open data and research automation.

The problem with the current EU data strategy is that while it produces enormous quantities of valuable open data, in the absence of common basic data science and documentation principles, it seems often cheaper to create new data than to put the existing open data into shape.

This is an absolute waste of resources and efforts. With a few R packages and our deep understanding of advanced data science techniques, we can create valuable datasets from unprocessed open data. In most domains, we are able to repurpose data originally created for other purposes at a historical cost of several billions of euros, converting these unused data assets into valuable datasets that can replace tens of millions’ worth of proprietary data.

What we want to achieve with this project – and we believe such an accomplishment would merit one of the first prizes – is to add value to a significant portion of pre-existing EU open data (for example, available on data.europa.eu/data) by re-processing and integrating them into a modern, tidy database with API access, and to find a business model that emphasises a triangular use of data in 1. business, 2. science and 3. policy-making. Our mission is to modernize the concept of data observatories.

Recommendation Systems: What can Go Wrong with the Algorithm?

Thu, 06 May 2021 07:10:00 +0000

Traitors in a war used to be executed by firing squad, and it was a psychologically burdensome task for soldiers to have to shoot former comrades. When a 10-marksman squad fired 8 blanks and 2 live rounds, the traitor would be 100% dead, and the soldiers firing would walk away with a semblance of consolation in the fact they had an 80% chance of not having been the one that killed a former comrade. This is a textbook example of assigning responsibility and blame in systems. AI-driven systems such as the YouTube or Spotify recommendation systems, the shelf organization of Amazon books, or the workings of a stock photo agency come together through complex processes, and when they produce undesirable results or, on the contrary, improve life, it is difficult to assign blame or credit.

This is the edited text of my presentation on Copyright Data Improvement in the EU – Towards Better Visibility of European Content and Broader Licensing Opportunities in the Light of New Technologies - download the entire webinar’s agenda.

Assigning and avoiding blame.

If you do not see enough women on streaming charts, or if you think that the percentage of European films on your favorite streaming provider—or Slovak music on your music streaming service—is too low, you have to be able to distribute the blame in more precise terms than just saying “it’s the system” that is stacked up against women, small countries, or other groups. We need to be able to point the blame more precisely in order to effect change through economic incentives or legal constraints.

This is precisely the type of work we are doing with the continued support of the Slovak national rightsholder organizations, as well as in our research in the United Kingdom. We try to understand why classical musicians are paid less, or why 15% of Slovak, Estonian, Dutch, and Hungarian artists never appear on anybody’s personalized recommendations. We need to understand how various AI-driven systems operate, and one approach would at the very least model and assign blame for undesirable outcomes in probabilistic terms. The problem is usually not that an algorithm is nasty and malicious; algorithms are often trained through “machine learning” techniques, and often, machines “learn” from biased, faulty, or low-quality information.

Outcomes: What Can Go Wrong With a Recommendation System?

In complex systems there are hardly ever singular causes that explain undesired outcomes; in the case of algorithmic bias in music streaming, there is no single bullet that eliminates women from charts or makes Slovak or Estonian language content less valuable than that in English. Some apparent causes may in fact be “blank cartridges,” and the real fire might come from unexpected directions. Systematic, robust approaches are needed in order to understand what it is that may be working against female or non-cisgender artists, long-tail works, or small-country repertoires.

Some examples of “undesirable outcomes” in recommendation engines might include:

Recommending too small a proportion of female or small country artists; or recommending artists that promote hate and violence.
Placing Slovak books on lower shelves.
Making the works of major labels easier to find than those of independent labels.
Placing a lower number of European works on your favorite video or music streaming platform’s start window than local television or radio regulations would require.
Filling up your social media newsfeed with fake news about covid-19 spread by some malevolent agents.

These undesirable outcomes are sometimes illegal as they may go against non-discrimination or competition law. (See our ideas on what can go wrong – Music Streaming: Is It a Level Playing Field?) They may undermine national or EU-level cultural policy goals, media regulation, child protection rules, and fundamental rights protection against discrimination without basis. They may make Slovak artists earn significantly less than American artists.

Metadata problems: no single bullet theory

In our work in Slovakia, we reverse engineered some of these undesirable outcomes. Popular video and music streaming recommendation systems have at least three major components based on machine learning:

The users’ history – Is it that users’ history is sexist, or perhaps the training metadata database is skewed against women?
The works’ characteristics – are Dvorak’s works as well documented for the algorithm as Taylor Swift’s or Drake’s?
Independent information from the internet – Does the internet write less about women artists?

In the making of a recommendation or an autonomous playlist, these sources of information can be seen as “metadata” concerning a copyright-protected work (as well as its right-protected recorded fixation). More often than not, we are not facing a malicious algorithm when we see undesirable system outcomes. The usual problem is that the algorithm is learning from data that is historically biased against women or biased for British and American artists, or that it is only able to find data in English language film and music reviews. Metadata plays an incredibly important role in supporting or undermining general music education, media policy, copyright policy, or competition rules. If a video or music streaming platform’s algorithm is unaware of the music that music educators find suitable for Slovak or Estonian teenagers, then it will not recommend that music to your child.

Furthermore, metadata is very costly. In the case of cultural heritage, European states and the EU itself have been traditionally investing in metadata with each technological innovation. For Dvorak’s or Beethoven’s works, various library descriptions were made in the analogue world, then work and recording identifiers were assigned to CDs and mp3s, and eventually we must describe them again in a way intelligible for contemporary autonomous systems. In the case of classical music and literature, early cinema, or reproductions of artworks, we have public funding schemes for this work. But this seems not to be enough. In the current economy of streaming, the increasingly low income generated by most European works is insufficient to even cover the cost of proper documentation, which then sends that part of the European repertoire into a self-fulfilling oblivion: the algorithm cannot “learn” its properties and it never shows these works to users and audiences.

Until now, in most cases, it was assumed that providing high-quality metadata is the duty of artists or their representatives, but in the analogue era, or in the era of individual digital copies, we did not anticipate that the sales value would not even cover the documentation cost. We must find technical solutions with interoperability and new economic incentives to create proper metadata for Europe’s cultural products. With that, we can cover one area out of the three possible problem terrains.

But this is not enough. We need to address the question of how new, better algorithms can learn from user history and avoid amplifying pre-existing bias against women or hateful speech. We need to make sure that when algorithms are “scraping” the internet, they do so in an accountable way that does not make small language repertoires vulnerable.

Incentives and investments into metadata

In our paper we argue for new regulatory considerations to create a better and more accountable playing field for deploying algorithms in a quasi-autonomous system, and we suggest further research to align economic incentives with the creation of higher quality and less biased metadata. The need for further research on how these large systems affect various fundamental rights, consumer or competition rights, or cultural and media policy goals cannot be overstated. The first step is to open and understand these autonomous systems. It is not enough to say that the firing squads of Big Tech are shooting women out from charts, ethnic minority artists from screens, and small language authors from the virtual bookshelves. We must put a lot more effort into researching the sources of the problems that make machine learning algorithms behave in a way that is not compatible with our European values or regulations.

*This blogpost was first published on our general interest blog Data & Lyrics

Feasibility Study On Promoting Slovak Music In Slovakia & Abroad

Thu, 25 Mar 2021 11:00:00 +0000

How to help promote local music?

The new study opens the question of local music promotion within the digital environment. The Slovak Performing and Mechanical Rights Society (SOZA), the State51 music group in the United Kingdom, and the Slovak Arts Council commissioned Reprex to create a feasibility study which provides recommendations for better use of quotas for Slovak radio stations and which also maps the share and promotion of Slovak music within large streaming and media platforms such as Spotify.

What should a good local content policy (radio quota, recommendation system, streaming quota) achieve?

The study proposes best practices for the introduction of mandatory quotas for Slovak radio stations and points out how current recommendation systems used by large platforms such as Spotify, YouTube, or Apple hardly consider local music from smaller countries. Local music stands against competition consisting of millions of songs from the whole world, and for ordinary Slovak musicians, whose music doesn’t belong to the global hits playlists, it is almost impossible to get recommended by the recommendation systems of large platforms.

Listen Local App for discovering new music

We aimed to create a demo version of a utility-based, transparent, accountable recommendation system.

The solution to this problem could be the Listen Local App, built on a comprehensive reference database of local music, which we created as a demo version within the study. The app aims to help listeners discover more local music; the app also presents new and alternative ways for large digital platforms to recommend local artists. Through Listen Local, listeners search for artists and bands based on their taste and the city they are situated in. In this way, listeners can easily search for music by artists from particular cities or from the town they are about to visit. We are releasing today the feasibility study in English and Slovak. We call for an open consultation to evaluate the results of this work and continue developing the Slovak Music Database, the Listen Local recommendation, and the AI validation system.

Check out the Demo Listen Local App. We explain here why.

Screenshot of the first version of the demo app.

Database

The Slovak Music Database is connected to Reprex’s flagship project, the Demo Music Observatory, an open collaboration-based demo version of the planned European Music Observatory, currently being further developed in the JUMP Music Market Accelerator Programme supported by Music Moves Europe.

The project website contains the demo version of the Slovak Music Database.

Download the Study

You can download the study here in Slovak or in English.

Next steps

In the next phase of the work, we add further data to our Slovak Demo Music Database and carry out more and more experiments and educational activities to understand how Slovak music can become more visible and targeted. We are also bringing this project into an international collaboration for better utilization of R&D efforts and experiences throughout Europe. This agile project method originated in reproducible scientific practice and open-source software development and allows participation in large projects on any scale: from individual musicians and educators to large research universities and music distributors. Anyone can join in on the effort.

Reprex is looking for further international partners; Reprex is currently part of the Dutch AI Coalition and the European AI Alliance project. SOZA and Reprex are committed to opening this project for international collaboration while ensuring that a significant part of the R&D activities remains in the Slovak Republic.

We are preparing informal, online information sessions for artists, promoters, researchers, and developers to join our project.

Contributors

The Reprex team who contributed to the English version:

Budai, Sándor, programming and deployment
Dr. Emily H. Clarke, musicologist
Stef Koenis, musicologist, musician
Dr. Andrés Garcia Molina, data scientist, musicologist, editor
Kátya Nagy, music journalist, research assistant;

and the Slovak version:

Dáša Bulíková, musician, translator
Dominika Semaňáková, musicologist, editor, layout.

Special thanks to Tammy Nižňanska & the Youniverse for the case study.

Where Are People More Likely To Treat Climate Change as the Most Serious Global Problem?

Sat, 06 Mar 2021 00:00:00 +0000

library(regions)
library(lubridate)
library(dplyr)
if ( dir.exists('data-raw') ) {
  data_raw_dir <- "data-raw"
} else {
  data_raw_dir <- file.path("..", "..", "data-raw")
}

The first results of our longitudinal table were difficult to map, because the surveys used an obsolete regional coding. We will adjust the wrong coding, when possible, and join the data with the European Environment Agency’s (EEA) Air Quality e-Reporting (AQ e-Reporting) data on environmental pollution. We recoded the annual level for every available reporting station [not shown here] and all values are in μg/m3. The period under observation is 2014-2016. Data file: https://www.eea.europa.eu/data-and-maps/data/aqereporting-8 (European Environment Agency 2021).
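
The station-level preprocessing is not shown in this post. As a rough sketch of what aggregating station readings to a regional level can look like, assuming a simple unweighted mean and using made-up station names, region codes, and values (none of them come from the EEA file):

```r
# Hypothetical station-level annual values in μg/m3; all names and numbers are illustrative.
stations <- data.frame(
  station = c("s1", "s2", "s3", "s4"),
  region  = c("XX01", "XX01", "XX02", "XX02"),
  pm10    = c(20, 30, 40, 60)
)

# One row per region, holding the unweighted mean of its stations.
regional_pm10 <- aggregate(pm10 ~ region, data = stations, FUN = mean)
print(regional_pm10)
```

A real aggregation may need weights or a minimum number of stations per region; this sketch makes no such adjustment.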

Recoding the Regions

Recoding means that the boundaries are unchanged, but the country changed the names and codes of regions because there were other boundary changes which did not affect our observation unit. We explain the problem and the solution in greater detail in our tutorial that aggregates the data on regional levels.

panel <- readRDS(file.path(data_raw_dir, "climate-panel.rds"))
climate_data_geocode <- panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  recode_nuts()
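
The idea behind recoding can be illustrated with a plain lookup table. This is only a sketch in base R and not the implementation of recode_nuts(); the region codes and the correspondence table below are made up for illustration.

```r
# Hypothetical correspondence table: obsolete code -> current code.
correspondence <- c(XX01 = "XXA1", XX02 = "XXA2")

# Codes as they might appear in survey files of different vintages.
survey_codes <- c("XX01", "XXA2", "XX02", "ZZ99")

# Replace codes found in the table; leave all other codes unchanged.
recoded <- ifelse(
  survey_codes %in% names(correspondence),
  unname(correspondence[survey_codes]),
  survey_codes
)
print(recoded)
```

Because the boundaries are unchanged, such a recoding is a pure relabelling and no values need to be re-estimated.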

Let’s load the air pollution data and join it by the corrected geocodes:

load(file.path("data", "air_pollutants.rda")) ## good practice to use system-independent file.path
climate_awareness_air <- climate_data_geocode %>%
  rename ( region_nuts_codes = .data$code_2016) %>%
  left_join ( air_pollutants, by = "region_nuts_codes" ) %>%
  select ( -all_of(c("w1", "wex", "date_of_interview",
                     "typology", "typology_change", "geo", "region"))) %>%
  mutate (
    # remove special labels and create NA_numeric_
    age_education = retroharmonize::as_numeric(age_education)) %>%
  mutate_if ( is.character, as.factor) %>%
  mutate (
    # we only have responses from 4 years, and this should be treated as a categorical variable
    year = as.factor(year)
    ) %>%
  filter ( complete.cases(.) )
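
A left join keeps every respondent, and a respondent whose region has no pollution record gets missing values, which the final complete-case filter then drops. A minimal base R sketch of this behaviour, with made-up respondents, region codes, and values:

```r
# Hypothetical respondents; "XX99" has no match in the pollution table.
respondents <- data.frame(
  rowid = c("r1", "r2", "r3"),
  region_nuts_codes = c("XX01", "XX02", "XX99"),
  stringsAsFactors = FALSE
)

# Hypothetical regional pollution values.
air <- data.frame(
  region_nuts_codes = c("XX01", "XX02"),
  pm10 = c(28.3, 49.5),
  stringsAsFactors = FALSE
)

# all.x = TRUE is the base R equivalent of a left join.
joined <- merge(respondents, air, by = "region_nuts_codes", all.x = TRUE)

# Keep only rows without any missing value.
joined_complete <- joined[complete.cases(joined), ]

print(nrow(joined))          # all respondents
print(nrow(joined_complete)) # respondents with a pollution record
```

Comparing the two row counts is a quick way to see how many respondents are lost because their region could not be matched.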

The climate_awareness_air data frame contains the answers of 75086 individual respondents. 17.07% thought that climate change was the most serious world problem and 33.6% mentioned climate change as one of the three most important global problems.

summary ( climate_awareness_air )
## rowid serious_world_problems_first
## ZA5877_v2-0-0_1 : 1 Min. :0.0000
## ZA5877_v2-0-0_10 : 1 1st Qu.:0.0000
## ZA5877_v2-0-0_100 : 1 Median :0.0000
## ZA5877_v2-0-0_1000 : 1 Mean :0.1707
## ZA5877_v2-0-0_10000: 1 3rd Qu.:0.0000
## ZA5877_v2-0-0_10001: 1 Max. :1.0000
## (Other) :75080
## serious_world_problems_climate_change isocntry
## Min. :0.000 BE : 3028
## 1st Qu.:0.000 CZ : 3023
## Median :0.000 NL : 3019
## Mean :0.336 SK : 3000
## 3rd Qu.:1.000 SE : 2980
## Max. :1.000 DE-W : 2978
## (Other):57058
## marital_status age_education
## (Re-)Married: without children :13242 18 :15485
## (Re-)Married: children this marriage :12696 19 : 7728
## Single: without children : 7650 16 : 5840
## (Re-)Married: w children of this marriage: 6520 still studying: 5098
## (Re-)Married: living without children : 6225 17 : 5092
## Single: living without children : 4102 15 : 4528
## (Other) :24651 (Other) :31315
## age_exact occupation_of_respondent
## Min. :15.0 Retired, unable to work :22911
## 1st Qu.:36.0 Skilled manual worker : 6774
## Median :51.0 Employed position, at desk : 6716
## Mean :50.1 Employed position, service job: 5624
## 3rd Qu.:65.0 Middle management, etc. : 5252
## Max. :99.0 Student : 5098
## (Other) :22711
## occupation_of_respondent_recoded
## Employed (10-18 in d15a) :32763
## Not working (1-4 in d15a) :37125
## Self-employed (5-9 in d15a): 5198
##
##
##
##
## respondent_occupation_scale_c_14
## Retired (4 in d15a) :22911
## Manual workers (15 to 18 in d15a) :15269
## Other white collars (13 or 14 in d15a): 9203
## Managers (10 to 12 in d15a) : 8291
## Self-employed (5 to 9 in d15a) : 5198
## Students (2 in d15a) : 5098
## (Other) : 9116
## type_of_community is_student no_education
## DK : 34 Min. :0.0000 Min. :0.000000
## Large town :20939 1st Qu.:0.0000 1st Qu.:0.000000
## Rural area or village :24686 Median :0.0000 Median :0.000000
## Small or middle sized town: 9850 Mean :0.0679 Mean :0.008151
## Small/middle town :19577 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
##
## education year region_nuts_codes country_code
## Min. :14.00 2013:25103 LU : 1432 DE : 4531
## 1st Qu.:17.00 2015: 0 MT : 1398 GB : 3538
## Median :18.00 2017:25053 CY : 1192 BE : 3028
## Mean :19.61 2019:24930 SK02 : 1053 CZ : 3023
## 3rd Qu.:22.00 EL30 : 974 NL : 3019
## Max. :30.00 EE : 973 SK : 3000
## (Other):68064 (Other):54947
## pm2_5 pm10 o3 BaP
## Min. : 2.109 Min. : 5.883 Min. : 66.37 Min. :0.0102
## 1st Qu.: 9.374 1st Qu.: 28.326 1st Qu.: 90.89 1st Qu.:0.1779
## Median :11.866 Median : 33.673 Median :102.81 Median :0.4105
## Mean :12.954 Mean : 38.637 Mean :101.49 Mean :0.8759
## 3rd Qu.:15.890 3rd Qu.: 49.488 3rd Qu.:110.73 3rd Qu.:1.0692
## Max. :41.293 Max. :123.239 Max. :141.04 Max. :7.8050
##
## so2 ap_pc1 ap_pc2 ap_pc3
## Min. : 0.0000 Min. :-4.6669 Min. :-2.21851 Min. :-2.1007
## 1st Qu.: 0.0000 1st Qu.:-0.4624 1st Qu.:-0.49130 1st Qu.:-0.5695
## Median : 0.0000 Median : 0.4263 Median : 0.02902 Median :-0.1113
## Mean : 0.1032 Mean : 0.1031 Mean : 0.04166 Mean :-0.1746
## 3rd Qu.: 0.0000 3rd Qu.: 0.9748 3rd Qu.: 0.57416 3rd Qu.: 0.3309
## Max. :42.5325 Max. : 2.0344 Max. : 3.25841 Max. : 4.1615
##
## ap_pc4 ap_pc5
## Min. :-1.7387 Min. :-2.75079
## 1st Qu.:-0.1669 1st Qu.:-0.18748
## Median : 0.0371 Median : 0.01811
## Mean : 0.1154 Mean : 0.06797
## 3rd Qu.: 0.3050 3rd Qu.: 0.34937
## Max. : 3.2476 Max. : 1.42816
##

Let’s see a simple CART tree! We remove the regional codes, because there are very serious regional differences in climate awareness. These differences, together with education level and the survey year, are the most important predictors of thinking about climate change as the most important global problem in Europe.

# Classification Tree with rpart
library(rpart)
# grow tree
fit <- rpart(as.factor(serious_world_problems_first) ~ .,
             method = "class", data = climate_awareness_air %>%
               select ( - all_of(c("rowid", "region_nuts_codes"))),
             control = rpart.control(cp = 0.005))
printcp(fit) # display the results
##
## Classification tree:
## rpart(formula = as.factor(serious_world_problems_first) ~ .,
## data = climate_awareness_air %>% select(-all_of(c("rowid",
## "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
##
## Variables actually used in tree construction:
## [1] age_education isocntry
## [3] serious_world_problems_climate_change year
##
## Root node error: 12817/75086 = 0.1707
##
## n= 75086
##
## CP nsplit rel error xerror xstd
## 1 0.0240566 0 1.00000 1.00000 0.0080438
## 2 0.0082703 3 0.92783 0.92783 0.0078055
## 3 0.0050000 5 0.91129 0.91425 0.0077588
plotcp(fit) # visualize cross-validation results
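
The errors in the table printed by printcp() are relative to the root node, so the first row is always 1. Multiplying them by the root node error gives misclassification rates. The sketch below only redoes this arithmetic with the numbers printed above; the data frame and its column names are our own helpers, not rpart objects.

```r
# Root node error as printed above: misclassified / all observations.
root_node_error <- 12817 / 75086

# Values copied from the printcp() output.
cp_errors <- data.frame(
  nsplit    = c(0, 3, 5),
  rel_error = c(1.00000, 0.92783, 0.91129),
  xerror    = c(1.00000, 0.92783, 0.91425)
)

# Convert relative errors into misclassification rates.
cp_errors$abs_error  <- cp_errors$rel_error * root_node_error
cp_errors$abs_xerror <- cp_errors$xerror * root_node_error

print(round(cp_errors, 4))
```

Read this way, the tree with five splits lowers the cross-validated misclassification rate only modestly compared with always predicting the majority class.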

summary(fit) # detailed summary of splits
## Call:
## rpart(formula = as.factor(serious_world_problems_first) ~ .,
## data = climate_awareness_air %>% select(-all_of(c("rowid",
## "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
## n= 75086
##
## CP nsplit rel error xerror xstd
## 1 0.024056592 0 1.0000000 1.0000000 0.008043837
## 2 0.008270266 3 0.9278302 0.9278302 0.007805478
## 3 0.005000000 5 0.9112897 0.9142545 0.007758824
##
## Variable importance
## serious_world_problems_climate_change isocntry
## 31 26
## country_code BaP
## 20 8
## pm2_5 ap_pc1
## 4 3
## age_education pm10
## 2 2
## education ap_pc2
## 2 1
## year
## 1
##
## Node number 1: 75086 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.1706976 P(node) =1
## class counts: 62269 12817
## probabilities: 0.829 0.171
## left son=2 (25229 obs) right son=3 (49857 obs)
## Primary splits:
## serious_world_problems_climate_change < 0.5 to the right, improve=2214.2040, (0 missing)
## isocntry splits as RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve= 728.0160, (0 missing)
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve= 673.3656, (0 missing)
## BaP < 0.4300347 to the right, improve= 310.6229, (0 missing)
## pm2_5 < 13.38264 to the right, improve= 296.4013, (0 missing)
## Surrogate splits:
## age_education splits as ----RRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRL-RRR-RRRRRRRRR--RRRLLR--R-R, agree=0.664, adj=0, (0 split)
## pm10 < 7.491315 to the left, agree=0.664, adj=0, (0 split)
##
## Node number 2: 25229 observations
## predicted class=0 expected loss=0 P(node) =0.3360014
## class counts: 25229 0
## probabilities: 1.000 0.000
##
## Node number 3: 49857 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.2570752 P(node) =0.6639986
## class counts: 37040 12817
## probabilities: 0.743 0.257
## left son=6 (34631 obs) right son=7 (15226 obs)
## Primary splits:
## isocntry splits as RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve=1454.9460, (0 missing)
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve=1359.7210, (0 missing)
## BaP < 0.4300347 to the right, improve= 629.8844, (0 missing)
## pm2_5 < 13.38264 to the right, improve= 555.7484, (0 missing)
## ap_pc1 < -0.005459537 to the left, improve= 533.3579, (0 missing)
## Surrogate splits:
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, agree=0.987, adj=0.957, (0 split)
## BaP < 0.1749425 to the right, agree=0.775, adj=0.264, (0 split)
## pm2_5 < 5.206993 to the right, agree=0.737, adj=0.140, (0 split)
## ap_pc1 < 1.405527 to the left, agree=0.733, adj=0.126, (0 split)
## pm10 < 25.31211 to the right, agree=0.718, adj=0.076, (0 split)
##
## Node number 6: 34631 observations
## predicted class=0 expected loss=0.1769802 P(node) =0.4612178
## class counts: 28502 6129
## probabilities: 0.823 0.177
##
## Node number 7: 15226 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.4392487 P(node) =0.2027808
## class counts: 8538 6688
## probabilities: 0.561 0.439
## left son=14 (11607 obs) right son=15 (3619 obs)
## Primary splits:
## isocntry splits as LL---LLR--L-L----------LL---R--, improve=337.5462, (0 missing)
## country_code splits as LL---LR--L-L--------LL---R--, improve=337.5462, (0 missing)
## age_education splits as ----LLLLLL-LLLRRRRRRR-RRRRRRRRRL-RRRRRRLLRR-RRRRLLRLRL-RRLRRR-RRR-LLLLRRR-----LR-----L-R, improve=294.0807, (0 missing)
## education < 22.5 to the left, improve=262.3747, (0 missing)
## BaP < 0.053328 to the right, improve=232.7043, (0 missing)
## Surrogate splits:
## BaP < 0.053328 to the right, agree=0.878, adj=0.485, (0 split)
## pm2_5 < 4.810361 to the right, agree=0.827, adj=0.271, (0 split)
## ap_pc2 < 0.8746175 to the left, agree=0.792, adj=0.124, (0 split)
## so2 < 0.3302972 to the left, agree=0.781, adj=0.078, (0 split)
## age_education splits as ----LLLLLL-LLLLLLLRLR-LRRLRRRRRR-RRRRLLLLLR-LRLRLLRRLL-LLRLLR-LLR-RRLLLLL-----RR-----R-L, agree=0.779, adj=0.071, (0 split)
##
## Node number 14: 11607 observations, complexity param=0.008270266
## predicted class=0 expected loss=0.3804601 P(node) =0.1545827
## class counts: 7191 4416
## probabilities: 0.620 0.380
## left son=28 (7462 obs) right son=29 (4145 obs)
## Primary splits:
## age_education splits as ----LLLLLL-LRRRRRRRRR-RRLRRLRRLL-RRRRLRLLRR-RLRLLLRLRL-RR-RR--RRL-L-LLRRR------------L-R, improve=123.71070, (0 missing)
## year splits as R-LR, improve=107.79460, (0 missing)
## education < 20.5 to the left, improve= 90.28724, (0 missing)
## occupation_of_respondent splits as LRRLRRRRRLRLLLRLLL, improve= 84.62865, (0 missing)
## respondent_occupation_scale_c_14 splits as LRLLLRRL, improve= 68.88653, (0 missing)
## Surrogate splits:
## education < 20.5 to the left, agree=0.950, adj=0.861, (0 split)
## occupation_of_respondent splits as LLLLRLLRRLRLLLRLLL, agree=0.738, adj=0.267, (0 split)
## respondent_occupation_scale_c_14 splits as LRLLLLRL, agree=0.733, adj=0.251, (0 split)
## is_student < 0.5 to the left, agree=0.709, adj=0.186, (0 split)
## age_exact < 23.5 to the right, agree=0.676, adj=0.094, (0 split)
##
## Node number 15: 3619 observations
## predicted class=1 expected loss=0.3722023 P(node) =0.04819807
## class counts: 1347 2272
## probabilities: 0.372 0.628
##
## Node number 28: 7462 observations
## predicted class=0 expected loss=0.326052 P(node) =0.09937938
## class counts: 5029 2433
## probabilities: 0.674 0.326
##
## Node number 29: 4145 observations, complexity param=0.008270266
## predicted class=0 expected loss=0.4784077 P(node) =0.05520337
## class counts: 2162 1983
## probabilities: 0.522 0.478
## left son=58 (2573 obs) right son=59 (1572 obs)
## Primary splits:
## year splits as L-LR, improve=40.13885, (0 missing)
## occupation_of_respondent splits as LRLLRRRRRLRLLLRLLL, improve=18.33254, (0 missing)
## marital_status splits as LRRRLRRRLRRLRLLRRRRRRLRLRLLRR, improve=17.86888, (0 missing)
## type_of_community splits as LRLRL, improve=17.55254, (0 missing)
## age_education splits as ------------LLRRRRRRR-RR-RL-RR---LRRR-R--LR-R-R---R-R--RR-RR--RR------RRR--------------R, improve=14.66121, (0 missing)
## Surrogate splits:
## type_of_community splits as LLLRL, agree=0.777, adj=0.412, (0 split)
## marital_status splits as RRLLLLLRLLLLLLLRRRLLLLLLRLRLL, agree=0.680, adj=0.155, (0 split)
## isocntry splits as LL---LL---L-R----------LL------, agree=0.669, adj=0.127, (0 split)
## country_code splits as LL---L---L-R--------LL------, agree=0.669, adj=0.127, (0 split)
## o3 < 83.06345 to the right, agree=0.650, adj=0.076, (0 split)
##
## Node number 58: 2573 observations
## predicted class=0 expected loss=0.4240187 P(node) =0.03426737
## class counts: 1482 1091
## probabilities: 0.576 0.424
##
## Node number 59: 1572 observations
## predicted class=1 expected loss=0.43257 P(node) =0.02093599
## class counts: 680 892
## probabilities: 0.433 0.567
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree: Climate Change Is The Most Serious Threat")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout
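
For readers who do not have the panel file, the same workflow can be tried on simulated data. This is a minimal sketch: the data frame, its column names, and the probabilities are invented, and the pruning step is an addition that the analysis above does not perform.

```r
library(rpart)

set.seed(2021)
n <- 2000

# Simulated respondents; none of these values come from the survey.
toy <- data.frame(
  education = sample(14:30, n, replace = TRUE),
  year      = factor(sample(c(2013, 2017, 2019), n, replace = TRUE)),
  pm10      = runif(n, min = 5, max = 120)
)

# By construction, only education drives the simulated answer.
p_first <- ifelse(toy$education > 20, 0.7, 0.2)
toy$climate_first <- factor(rbinom(n, size = 1, prob = p_first))

# Grow a classification tree with the same complexity parameter as above.
toy_fit <- rpart(climate_first ~ ., data = toy, method = "class",
                 control = rpart.control(cp = 0.005))

# Prune back to the complexity parameter with the lowest cross-validated error.
toy_cp     <- toy_fit$cptable
best_cp    <- toy_cp[which.min(toy_cp[, "xerror"]), "CP"]
toy_pruned <- prune(toy_fit, cp = best_cp)

print(toy_pruned$variable.importance)
```

Since only education enters the simulated probabilities, it should dominate the variable importance, which makes the toy example a convenient check on how to read the output.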

saveRDS ( climate_awareness_air , file.path(tempdir(), "climate_panel_recoded.rds"), version = 2)
# not evaluated
saveRDS( climate_awareness_air, file = file.path("data-raw", "climate-panel_recoded.rds"))
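
Saving with version = 2 keeps the file readable by R versions older than 3.5.0, which cannot read the newer default serialization format. A small round-trip sketch with a made-up data frame (the object and file name are illustrative only):

```r
# Hypothetical miniature panel, written to a temporary directory.
toy_panel <- data.frame(
  rowid = c("a", "b"),
  year  = factor(c(2013, 2019))
)

tmp_file <- file.path(tempdir(), "toy_panel.rds")

# version = 2 selects the older serialization format.
saveRDS(toy_panel, file = tmp_file, version = 2)

# Read the file back and check that nothing changed.
restored <- readRDS(tmp_file)
print(identical(toy_panel, restored))
```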

Our Music Observatory in the JUMP European Music Market Accelerator: Meet the 2021 Fellows and their Tutors

Thu, 04 Mar 2021 15:00:00 +0200

According to the announcement of JUMP, the European Music Market Accelerator, after a careful screening of all applications received, the selection committee composed of all JUMP board members has selected the most promising ideas and projects to be developed together with renowned tutors for this 2021 fellowship.

For nine months, the 20 fellows living in many European countries will develop their innovative projects, while receiving a comprehensive 360° training. In addition to specialised workshops by highly qualified experts, each fellow will receive one-on-one tutoring sessions from the most renowned music professionals coming from all over Europe.

The 20 selected projects cover a great variety of urgent needs faced within the music sector. They will:

help foster social change with projects focusing on diversity in the industry, more fairness and transparency, as well as raising awareness of timely issues.
enhance technological development with projects using blockchain, immersive sound and VR and AR.
build bridges between different key actors of the ecosystem.

Download the entire JUMP press release.

Reprex’s project, the automated Demo Music Observatory, will be represented by Daniel Antal, co-founder of Reprex, among the other building bridges projects. This project offers a different approach to the planned European Music Observatory based on the principles of open collaboration, which allows contributions from small organizations and even individuals, and which provides higher levels of quality in terms of auditability, timeliness, transparency and general ease of use. Our open collaboration approach allows us to power trustworthy, ethical AI systems like our Listen Local, which we started in Slovakia with the support of the Slovak Arts Council.

JUMP fellows building bridges between different key actors of the ecosystem.

Apart from our Demo Music Observatory, the building bridges section includes Groovly with Martin Zenzerovich, From Play To Rec by Jeremy Dunne, Hajde Radio by Thibaut Boudaud, LowDee by Alex Davidson and ONO-HU! by Gina Akers.

Meet all the JUMP 2021 Fellows, including the technology and social change professionals!

Reprex is a start-up company based in the Netherlands and the United States that validated its early products in the Yes!Delft AI+Blockchain Lab in the Hague. In 2021 we joined the Dutch AI Coalition – NL AIC and requested membership in the European AI Alliance. Reprex is committed to applying reproducible research in an open collaboration with our business, scientific, policy and civil society partners, and to facilitating the use of open data and open-source software. Many fellows in the program are connected to other regions, like North America and Australia – because music is one of the most globalized industries and forms of art in the world! We are very excited to collaborate with our peers in new European territories, and in Canada and Australia.

Hope to meet you in these great events - maybe not only online!

Further links:

From Play to Rec on Facebook
HAJDE FR/EN

What is Retrospective Survey Harmonization?

Thu, 04 Mar 2021 00:00:00 +0000

Reproducible ex post harmonization of survey microdata

Retrospective survey harmonization allows the comparison of data from opinion polls conducted in different countries or at different times. In this example we are working with data from surveys that were ex ante harmonized to a certain degree – in our tutorials we are choosing questions that were asked in the same way in many natural languages. For example, you can compare what percentage of the European people in various countries, provinces and regions thought climate change was a serious world problem back in 2013, 2015, 2017 and 2019.

We developed the retroharmonize R package to help this process. We have tested the package extensively with about 80 Eurobarometer and 5 Afrobarometer survey files, and to a lesser extent with Arab Barometer files. This allows the comparison of various survey answers in about 70 countries. These policy-oriented survey programs were designed to be harmonized to a certain degree, but their ex post harmonization is still necessary, challenging and error-prone. Retrospective harmonization includes harmonizing the different codings used for questions and answer options, the post-stratification weights, and the different file formats.

Eurobarometer, Afrobarometer, Arab Barometer and Latinobarómetro make survey files that are harmonized across countries available for research under various terms. Our retroharmonize is not affiliated with them, and to run our examples, you must visit their websites, carefully read their terms, agree to them, and download their data yourself. The value we add is that we help to connect their files across time (from different years) or across these programs.

The survey programs mentioned above publish their data in the proprietary SPSS format. This file format can be imported and translated to R objects with the haven package; however, we needed to re-design haven’s labelled_spss class, which is, in turn, a modification of the labelled class, to maintain far more metadata. The haven package was designed and tested with data stored in individual SPSS files.

The author of labelled, Joseph Larmarange, describes two main approaches to working with labelled data, such as SPSS’s method of storing categorical data, in the Introduction to labelled.

Two main approaches of labelled data conversion.

Our approach is a further extension of Approach B. Survey harmonization in our case always means joining data from several SPSS files, which requires consistent coding across several data sources. This means that data cleaning and recoding must take place before conversion to factors, character or numeric vectors. This is particularly important with factor data (and their simple character conversions) and numeric data that occasionally contains labels, for example, to describe the reason why certain data is missing. Our tutorial vignette labelled_spss_survey gives you more information about this.

In the next series of tutorials, we will deal with an array of problems. These are not for the faint of heart – you need to have a solid intermediate level of R to follow.

Tidy, joined survey data

The original files’ identifiers may not be unique, so we have to create new, truly unique identifiers. Weighting may not be straightforward.
Neither the number of observations nor the number of variables (which represent the survey questions and their translation to coded data) is the same. Certain data may be present in only one survey and not the other. This means that you will likely run loops on lists and not data.frames, but eventually you must carefully join them.
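
The two points above can be sketched in base R. This is a minimal illustration and not the retroharmonize workflow itself: the two small data frames, their column names and the survey names eb_2013 and eb_2015 are made up for the example.

```r
# Two hypothetical survey waves: ids overlap, and the variables differ
wave_2013 <- data.frame(
  id     = c(1, 2, 3),
  age    = c(34, 51, 27),
  weight = c(0.9, 1.1, 1.0)
)
wave_2015 <- data.frame(
  id     = c(1, 2),
  age    = c(45, 62),
  weight = c(1.2, 0.8),
  region = c("North", "South")
)

# Keep the surveys in a named list rather than in one data.frame
surveys <- list(eb_2013 = wave_2013, eb_2015 = wave_2015)

# Variables that are present in every survey
common_vars <- Reduce(intersect, lapply(surveys, names))

# Loop over the list: subset to the common variables and
# build an identifier that is unique across surveys
harmonized <- lapply(names(surveys), function(survey_name) {
  df <- surveys[[survey_name]][, common_vars, drop = FALSE]
  df$unique_id <- paste(survey_name, df$id, sep = "_")
  df$survey <- survey_name
  df
})

# Only now join the pieces into a single table
joined <- do.call(rbind, harmonized)
```

Dropping region is a simplification; in a real project you may prefer to keep survey-specific variables and fill them with missing values in the other surveys.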

Class conversion

Similar questions may be imported from a non-native R format, in our case from SPSS files, in an inconsistent manner. SPSS’s variable formats cannot be translated unambiguously to R classes. retroharmonize introduced a new S3 class system that handles this problem, but eventually you will have to choose if you want to see a numeric or character coding of each categorical variable.
The harmonized surveys, with harmonized variable names and harmonized value labels, must be brought to consistent R representations (most statistical functions will only work on numeric, factor or character data) and carefully joined into a single data table for analysis.
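
To illustrate why the choice of representation matters, here is a small base R sketch. It does not use the retroharmonize classes; the codes and value labels are invented, and the named vector only imitates how value labels accompany numeric codes in an SPSS file.

```r
# A hypothetical categorical variable: numeric codes plus value labels
codes  <- c(1, 2, 1, 9, 2)
labels <- c("Mentioned" = 1, "Not mentioned" = 2, "Refusal" = 9)

# Character representation: look up the label that belongs to each code
as_label <- names(labels)[match(codes, labels)]

# Factor representation, with the levels in the order of the codes
as_fct <- factor(as_label, levels = names(labels))

# Pitfall: converting the factor back to numbers returns the level
# positions (1, 2, 3), not the original survey codes (1, 2, 9)
level_index <- as.integer(as_fct)

# To get the original codes back, go through the labels again
recovered_codes <- unname(labels[as.character(as_fct)])
```

Because the level positions and the survey codes differ, it is safer to decide on the numeric or character coding while the labels are still attached.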

Harmonization of variables and variable labels

The same variables may come with dissimilar variable names and variable labels. It may be a challenge to match age with age. We need to harmonize the names of variables.
The harmonized variables may have different labeling. One survey may label refused answers as declined and the other as refusal. On a simple choice, climate change may be ‘Climate change’ or ‘Problem: Climate change’. Binary choices may have survey-specific coding conventions. Value labels must be harmonized. There are good tools to do this in a single file, but we have to work with several of them.
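
A value label harmonization step of this kind could look like the following base R sketch. The function name harmonize_labels, the lookup table and the target labels are assumptions made for the illustration; they are not part of the retroharmonize API.

```r
# Hypothetical helper: map survey-specific value labels to one coding
harmonize_labels <- function(x) {
  x <- tolower(trimws(x))
  # Drop a survey-specific prefix such as "Problem: "
  x <- sub("^problem:[[:space:]]*", "", x)
  # Lookup table: names are cleaned raw labels, values are harmonized labels
  lookup <- c(
    "declined"       = "declined",
    "refusal"        = "declined",
    "refused"        = "declined",
    "climate change" = "climate_change"
  )
  out <- unname(lookup[x])
  # Labels without a rule are kept in their cleaned form for later review
  unmatched <- is.na(out)
  out[unmatched] <- x[unmatched]
  out
}

raw_labels <- c("Climate change", "Problem: Climate change",
                "Refusal", "declined", "Poverty")
harmonize_labels(raw_labels)
```

Keeping unmatched labels instead of turning them into missing values makes it easier to spot the labels that still need a rule.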

Missing value harmonization

There are likely to be various types of missing values. Working with missing values is probably where most human judgment is needed. Why are some answers missing: was the question not asked in some questionnaires? Is there a coding error? Did the respondent refuse the question, or say that she did not have an answer? retroharmonize has a special labelled vector type that retains this information from the raw data, if it is present, but you must make the judgment yourself – in R, eventually you will either create a missing category, or use NA_character_ or NA_real_.
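
The two end states mentioned above, a missing category or NA, can be sketched in base R as follows. The codes 98 and 99 and all labels are hypothetical, and the sketch uses plain vectors rather than the special retroharmonize vector type.

```r
# Hypothetical coded answers: 98 and 99 are labelled missing codes,
# NA means that the question was not asked in that questionnaire
codes <- c(1, 2, 98, 99, NA, 1)
value_labels <- c("1" = "mentioned", "2" = "not_mentioned",
                  "98" = "declined", "99" = "do_not_know")
missing_codes <- c(98, 99)

# Option A: keep the reason for missingness as explicit categories
answer_chr <- unname(value_labels[as.character(codes)])
answer_chr[is.na(codes)] <- "not_asked"
with_missing_category <- factor(
  answer_chr,
  levels = c(unname(value_labels), "not_asked")
)

# Option B: collapse every kind of missing answer to NA_real_
as_numeric <- ifelse(codes %in% missing_codes, NA_real_, codes)
```

Option A keeps the distinction between a refusal and a question that was never asked; Option B is what most numeric summaries expect, but the reason for missingness is lost.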

That’s a lot to put on your plate.

It is unlikely that you will be able to work with completely unfamiliar survey programs if you do not have a strong intermediate level of R. Our package comes with tutorials for Eurobarometer and Afrobarometer, and our development version already covers Arab Barometer. The tutorials highlight some peculiar issues with these survey programs, which we hope will give a head start to less experienced R users.

Open Data Day Interview: Mapping Data with Milos Popovic

Wed, 03 Mar 2021 22:23:00 +0200

Milos Popovic is a researcher, a data scientist, Marie Curie postdoc & Top 10 dataviz & R contributor on Twitter according to NodeXL. He took part in policy debates about terrorism and military intervention and appeared on a number of TV channels including N1 (the CNN affiliate in the Western Balkans), Serbian National Television and Al-Jazeera Balkans. His research interests are at the intersection of civil war dynamics and postwar politics in the Balkans. He is going to join the Data & Lyrics team on International Open Data Day to help us put harmonized environmental degradation perception and environmental sensory data on maps. We asked him four questions about his passion, mapping data. Please join us on 6 March 2021 at 9.30 EST / 15.30 CET for an informal digital coffee.

As a researcher, why are you so much drawn into maps? Is this connected to your interest in territorial conflicts, or you have some other inspiration?

That’s a great question that really makes me pause and look back at the past 5 years. My mapping story started out of curiosity: I found interesting data on the post-WWII violence in Serbia and thought how cool it would be to make a map in R. I quickly made an unimpressive choropleth map and noticed some unexpected patterns. Then I realized just how much unused violence and census data sits out there while we have no clue about geographic patterns. So, it began. I started off with map-making but my curiosity took me to the world of georeferencing and geospatial analysis. In the process, I created over 300 maps hosted on my website as well as dozens of shapefiles from scratch.

I used to think that my interest is linked to growing up in a war-torn country. But, as my map-making evolved, I discovered that my passion is to use maps as a way to democratize the data: to take the scores of unused, and often buried datasets, place them on the map and share the dataviz with people.

Can you show us an example of the best use of mapped data, and the best map that you have personally created? What is their distinctive value?

I’m immensely proud of my work that required making the shapefiles from scratch. For instance, my shapefile of over 1500 Kosovo cadastral settlements came into being after I turned dozens of high-resolution raster files into a shapefile fully compatible with Open Street Maps. After months of hard work, I managed to merge the shapefile with the 2011 Kosovo census and present several laser-focused demographic maps to my audience. Same goes for the settlement shapefile of Republika Srpska [the Serb-speaking entity of Bosnia-Herzegovina – the editor], which I made out of a pdf file and merged with the 2013 census data. Whereas most existing maps take a bird’s eye view, my work offers a more fine-grained view of the local dynamics to stakeholders.

Another similar undertaking was my transformation of the pre-WWII German military map of Yugoslavia into a unique shapefile of a few hundred Yugoslav municipalities. I combined this shapefile with the 1931 census data, 80 years after it was first published (better late than never!). It took me almost a year to complete this tremendous project but I enjoyed every bit of it. I have teamed up with my brother who is a web developer and we even made an interactive map of Yugoslavia based on the 1931 census.[The screenshot of this interactive map is the top image in the post – the editor] We hope this project would serve not only scholars but also history enthusiasts to better understand a history of the country that is no more.

Check out Milos’s beautiful static and interactive maps on https://milosp.info/

What do you think about collaboration based on open data and open-source software that processes such data?

It’s a fantastic opportunity for small teams to bypass traditional gatekeepers such as state institutions or big companies and use open source apps for the benefit of their local communities. For example, the access to Open Street Map allows small teams to map pressing communal issues such as crime, diseases, or environmental degradation and come up with innovative solutions. In my work, too, OSM has helped me create several fine-grained maps that shed more light on local problems in Serbia such as pollution, car accidents or violence.

We are hoping to bring together environmental, sensory data and public attitude data on environmental issues. How can mapping help? What do you expect from this project?

More than ever, we are compelled to figure out how maladies spread locally. Without mapping the hotspots, our understanding of the consequences of, for example, viral transmission or pollution is shrouded with a lot of uncertainty. We might have no clue how environmental issues shape public attitudes in localities until we use the mapping to turn on the light. Mapping would help this project pin down geographic clusters that require immediate attention from the private and public stakeholders.

Please join us for a digital coffee, tea or beer on International Open Data Day - we will put never seen data on maps, and discuss how to build successful open collaborations, with little, independent contributions to build large data observatories. Make sure you check out Milos’ amazing website, too!

This blogpost was originally posted on our Data & Lyrics blog and its mutation on Medium.

Eurobarometer Surveys Used In Our Project

Wed, 03 Mar 2021 00:00:00 +0000

In our tutorial series, we are going to harmonize the following questionnaire items from five Eurobarometer harmonized survey files. The Eurobarometer survey files are harmonized across countries, but they are only partially harmonized in time.

All data must be downloaded from the GESIS Data Archive in Cologne. We are not affiliated with GESIS and you must read and accept their terms to use the data.

Eurobarometer 80.2 (2013)

GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0, https://doi.org/10.4232/1.12792

Data file: ZA5877 data file (European Commission 2017).
Questionnaire: Eurobarometer 80.2 Basic Bilingual Questionnaire
Citation: ZA5877 Bibtex

QA1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QA1b Which others do you consider to be serious problems? (multiple choice)

QA2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

QA4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QA4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU could benefit the EU economically (agreement-disagreement 4-scale)

QA5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 83.4 (2015)

European Commission, Brussels; Directorate General Communication, COMM.A.1 ‘Strategy, Corporate Communication Actions and Eurobarometer’ GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0, https://doi.org/10.4232/1.13146

Data file: ZA6595 data file (European Commission 2018).
Questionnaire: Eurobarometer 83.4 Basic Bilingual Questionnaire
Citation: ZA6595 Bibtex

Eurobarometer 87.1 (2017)

European Commission, Brussels; Directorate General Communication, COMM.A.1 ‘Strategic Communication’; European Parliament, Directorate-General for Communication, Public Opinion Monitoring Unit GESIS Data Archive, Cologne. ZA6861 Data file Version 1.2.0, https://doi.org/10.4232/1.12922

Data file: ZA6861 data file.
Questionnaire: Eurobarometer 87.1 Basic Bilingual Questionnaire
Citation: ZA6861 Bibtex

QC1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QC1b Which others do you consider to be serious problems? (multiple choice)

QC2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

Qc4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

Qc4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

Qc4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

Qc4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

Qc4 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

Qc5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 90.2 (2018)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0, https://doi.org/10.4232/1.13289

Data file: ZA7488 data file (European Commission 2019a)
Questionnaire: Eurobarometer 90.2 Basic Bilingual Questionnaire
Citation: ZA7488 Bibtex

QB5 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

Eurobarometer 91.3 (2019)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0, https://doi.org/10.4232/1.13372

Data file: ZA7572 data file (European Commission 2019b).
Questionnaire: Eurobarometer 91.3 Basic Bilingual Questionnaire
Citation: ZA7572 Bibtex

QB4 To what extent do you agree or disagree with each of the following statements? - Taking action on climate change will lead to innovation that will make EU companies more competitive (N) (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Adapting to the adverse impacts of climate change can have positive outcomes for citizens in the EU (agreement-disagreement 4-scale)

QB5 Have you personally taken any action to fight climate change over the past six months? (binary)

References

European Commission, Brussels. 2017. “Eurobarometer 80.2 (2013).” GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0, https://doi.org/10.4232/1.12792.

European Commission, Brussels. 2018. “Eurobarometer 83.4 (2015).” GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0, https://doi.org/10.4232/1.13146.

European Commission, Brussels. 2019a. “Eurobarometer 90.2 (2018).” GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0, https://doi.org/10.4232/1.13289.

European Commission, Brussels. 2019b. “Eurobarometer 91.3 (2019).” GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0, https://doi.org/10.4232/1.13372.

Music Streaming: Is It a Level Playing Field?

Tue, 23 Feb 2021 21:23:00 +0200

Our article, Music Streaming: Is It a Level Playing Field? is published in the February 2021 issue of CPI Antitrust Chronicle, which is fully devoted to competition policy issues in the music industry.

The dramatic growth of music streaming over recent years is potentially very positive. Streaming provides consumers with low cost, easy access to a wide range of music, while it provides music creators with low cost, easy access to a potentially wide audience. But many creators are unhappy about the major streaming platforms. They consider that they act in an unfair way, create an unlevel playing field and threaten long-term creativity in the music industry.

Our paper describes and assesses the basis for one element of these concerns, competition between recordings on streaming platforms. We argue that fair competition is restricted by the nature of the remuneration arrangements between creators and the streaming platforms, the role of playlists, and the strong negotiating power of the major labels. It concludes that urgent consideration should be given to a user-centric payment system, as well as greater transparency of the factors underpinning playlist creation and of negotiated agreements.

You can read the entire issue and the full text of our article on Competition Policy International in pdf.

Daniel Antal, co-founder of Reprex Was Selected into the 2021 Fellowship Program of the European Music Market Accelerator

Mon, 22 Feb 2021 21:23:00 +0200

Daniel Antal, co-founder of Reprex, was selected into the 2021 Fellowship program of JUMP, the European Music Market Accelerator. JUMP provides a framework for music professionals to develop innovative business models, encouraging the music sector to work on a transnational level. The European Music Market Accelerator, composed of MaMA Festival and Convention, UnConvention, MIL, Athens Music Week, Nouvelle Prague and Linecheck, will support him in the development of our two interrelated projects over the next nine months.

Our Demo Music Observatory is a demo version of the European Music Observatory based on open data, open source, automated research in open collaboration with music stakeholders. We hope that we can further develop our business model and find new users, and help the recovery of the festival and live music segment.
Listen Local is our AI system that validates third-party music AI, such as Spotify’s or YouTube’s recommendation systems, and provides trustworthy, accountable, transparent alternatives for the European music industry. We hope to expand our pilot project from Slovakia to several European countries in 2021.

Reprex is committed to applying reproducible research in an open collaboration with our business, scientific, policy and civil society partners, and to facilitating the use of open data and open-source software.

Reprex Joins The Dutch AI Coalition

Tue, 16 Feb 2021 17:10:00 +0200

Reprex, our start-up, is based in the Netherlands and the United States, and validated its early products in the Yes!Delft AI+Blockchain Lab in the Hague. In 2021, we decided to join the Dutch AI Coalition – NL AIC.

The NL AIC is a public-private partnership in which the government, the business sector, educational and research institutions, as well as civil society organisations collaborate to accelerate and connect AI developments and initiatives. The ambition is to position the Netherlands at the forefront of knowledge and application of AI for prosperity and well-being. We are continually doing so with due observance of both the Dutch and European standards and values. The NL AIC functions as the catalyst for AI applications in our country.

We are particularly looking forward to participating in the Culture working group of NLAIC, but we will also take a look at the Security, Peace and Justice and the Energy and Sustainability working groups. Reprex is committed to use and further develop AI solutions that fulfil the requirements of trustworthy AI, a human-centric, ethical, and accountable use of artificial intelligence. We are committed to develop our data platforms, or automated data observatories, and our Listen Local system in this manner. Furthermore, we are involved in various scientific collaborations that are researching ideas on future regulation of copyright and fair competition with respect to AI algorithms.

We are committed to applying reproducible research in an open collaboration with our business, scientific, policy and civil society partners, and to facilitating the use of open data and open-source software.

Ensuring the Visibility and Accessibility of European Creative Content on the World Market: The Need for Copyright Data Improvement in the Light of New Technologies

Sat, 13 Feb 2021 18:10:00 +0200

The majority of music sales in the world is driven by AI-algorithm powered robots that create personalized playlists and recommendations, and help program radio music streams or festival lineups. It is critically important that an artist’s work is documented and described in a way that the algorithm can work with.

In our research paper – soon to be published – made for the Listen Local Initiative we found that 15% of Dutch, Estonian, Hungarian, or Slovak artists had no chance to be recommended, and they usually end up on Forgetify, an app that lists never-played songs of Spotify. In another project with rights management organizations, we found that about half of the rightsholders are at risk of not getting all their royalties from the platforms because of poor documentation.

But how come distributors give streaming platforms songs that are not properly documented? What sort of information is missing for the European repertoire’s visibility? Reprex is exploring this problem in a practical cooperation with SOZA, the Slovak Performing and Mechanical Rights Society, and in an academic cooperation that involves leading researchers in the field. In a manuscript co-authored with Martin Senftleben, director of the Institute for Information Law in Amsterdam, and eminent researchers in copyright law and music economics, Reprex’s co-founder makes the case that Europe must invest public money to resolve this problem, because in the current scenario, the documentation costs of a song exceed the expected income from streaming platforms.

In the European Strategy for Data, the European Commission highlighted the EU’s ambition to acquire a leading role in the data economy. At the same time, the Commission conceded that the EU would have to increase its pools of quality data available for use and re-use. In the creative industries, this need for enhanced data quality and interoperability is particularly strong. Without data improvement, unprecedented opportunities for monetising the wide variety of EU creative content and making this content available for new technologies, such as artificial intelligence training systems, will most probably be lost. The problem has a worldwide dimension. While the US have already taken steps to provide an integrated data space for music as of 1 January 2021, the EU is facing major obstacles not only in the field of music but also in other creative industry sectors. Weighing costs and benefits, there can be little doubt that new data improvement initiatives and sufficient investment in a better copyright data infrastructure should play a central role in EU copyright policy. A trade-off between data harmonisation and interoperability on the one hand, and transparency and accountability of content recommender systems on the other, could pave the way for successful new initiatives.

Download the manuscript from SSRN.

Our Slovak Demo Music Database project is a good example of this. We started to systematically collect publicly available information about Slovak artists (in our write-in process) and to ask them for further, GDPR-protected data (in our opt-in process) to create a comprehensive database that can help recommendation engines as well as market-targeting or educational AI apps.

We believe that one of the problems of current AI algorithms is that they work solely, or almost solely, with English-language documentation, putting other repertoires, particularly those in small languages, at risk of being buried below well-documented music mainly arriving from the United States.

We are looking for rightsholders and their organizations, artists, researchers to work with us to find out how we can increase the visibility of European music.

Demo Slovak Music Database

Thu, 17 Dec 2020 17:10:00 +0200

We are finalizing our first local recommendation system, Listen Local Slovakia, and the accompanying Demo Slovak Music Database. Our aim is to show:

how the Slovak repertoire is seen by media and streaming platforms;
what the possibilities are for giving greater visibility to the Slovak repertoire in radio and streaming platforms;
which specific problems make certain artists and music almost invisible.

In the next year, we would like to create a modern, comprehensive national music database that serves music promotion in radio, streaming, live music within Slovakia and abroad.

To train our locally relevant, alternative recommendation system, we filled the Demo Slovak Music Database from two sources. In the opt-in process we asked artists to participate in Listen Local, and we selected those artists who opted in from Slovakia, or whose language is Slovak. In the write-in process we collected publicly available data from other artists that our musicology team considered to be Slovak, mainly on the basis of their language use, residence, and other public biographical information. The following artists form the basis of our experiment. (If you want to be excluded from the write-in list, write to us; if you want to be included, please fill out this form.)

Click here to view the table on a separate page

Modern recommendation systems usually rely on data provided by artists or their representatives, data on who is listening to their music and how, data on what other music the artists’ audience listens to, and certain musicological features of the music. Usually they collect data from various data sources, but these are mainly English-language sources.

The problem with these recommendation systems is that they do not help music discovery, and make starting new acts very difficult. Recommendation systems tend to help already established artists, and artists whose work is well described in the English language.

Our alternative recommendation system is a utility-based system that gives a user-defined priority to artists released in Slovakia, or artists identified as Slovak, or both. The system can be extended with priorities for the language of the lyrics, too. Currently, our app is a demonstration of a more comprehensive, database-driven tool that can support various music discovery, recommendation or music export tools. Our Feasibility Study on building such tools and our Demo App are currently under consultation with Slovak stakeholders.

Listen Local is developing transparent algorithms and open source solutions to find new audiences for independent music. We want to correct the injustice and inherent bias of market leading big data algorithms. If you want your music and audience to be analysed in Listen Local, fill this form in. We will include you in our demo application for local music recommendations and our analysis to be revealed in December.

Listen Local: Why We Need Alternative Recommendation Systems

Mon, 14 Dec 2020 17:10:00 +0200

The first version of our demo application

Recommendation systems utilize knowledge about music content and their audiences while also pursuing the objectives or needs of recommenders.

The simplest recommendation systems just follow the charts: for example, they select from well-known current or perennial greatest hits. Such a system may work well for an amateur DJ in a home party or a small local radio that just wants to make sure that the music in its programme will be liked by many people. They reinforce existing trends and make already popular songs and their creators even more popular.

If the recommendation engine is supported by big data and a machine learning system – or increasingly, a combination of several machine learning algorithms – the general modus operandi is to exploit information about both content and users in order to achieve certain goals.

How do algorithmic recommendation systems work?

Spotify’s recommendation system is a mix of content-based and collaborative filtering that exploits information about users’ past behaviour (e.g. liked, skipped, and re-listened songs), the behaviour of similar users, as well as data collected from the users’ social media and other online activities, or from blogs. Deezer uses a similar system that is boosted by the acquisition of Last.fm: for example, big data created from user comments is used to understand the mood of the songs.

As of 2020, Spotify makes 16 billion music recommendations each month.

YouTube, which plays an even larger role in music discovery, uses a system composed of two neural networks: one for candidate generation and one for ranking. The candidate generation deep neural network works on the basis of collaborative filtering, while the ranking system is based on content-based filtering and a form of utility ranking that takes into consideration the user’s languages, for example.

What these systems have in common is that they maximize the corporate key performance indicators of the algorithms’ creators. Spotify wants to be ‘your playlist to life’ and to increase the amount of music played in the background during work or sports, during travelling, or in active music listening; that is, it maximizes the number of hours spent using the service and leaves no empty time slots for other music providers, such as radio stations. YouTube and Netflix have similar targets. In many ways these resemble the targets of commercial radio stations, which want to maximize the time spent listening to the broadcast stream. Radio stations and YouTube, in particular, have similar goals because they are mainly financed through advertising. For Spotify or Netflix, the key financial motivation is to keep users from cancelling their subscriptions or switching to other providers, such as Amazon, Apple or Deezer.

What is the problem with black box recommendation systems?

What they also have in common is that they do not aim to give a fair chance to each uploaded song, to serve every artist equally, or to provide any kind of equal opportunity for English, Slovak or Farsi language content.

They tend to reinforce trends similarly to music charts, but with far greater efficiency. As the Dutch comedian, author and journalist Arjen Lubach explains about the YouTube algorithm, to keep their personal recommendations engaging all day and all of the night, these systems create a comfortable universe for the user that lets little distraction in. A user who wants to listen to global hit music, or stoner rock, will never be distracted with anything else.

Zondag met Lubach on Dutch public broadcaster VPRO. Click settings sign to change the language of the captions.

The problem with such hyper-personalized media is that they leave no room for public activities. Public broadcasters, which had a monopoly on television broadcasting in most European countries until the early 1990s, for example, aimed to air a diversity of news and knowledge and to give access to local culture. Many countries on all continents have maintained local content guidelines for broadcasting on public, commercial and community television and radio channels, for example, for local music and films, and for reliable news as a public service. Personalized media, social media and streaming platforms do not have such obligations.

Black box recommendation systems usually maximize a corporate key performance indicator, and they are not subject to the usual public service or local content regulations that apply to traditional broadcast media.
The goal and the steps that the algorithm is pursuing are not known to content creators, and they do not know when the algorithm will work for their benefit or against them.

Transparent and regulated AI

In our view, utility-based recommendation systems can provide a bridge between current, corporate-owned systems that maximize a media or streaming platform’s business indicators and the public service and local content requirements that apply to broadcasting.

Public service requirements or local content requirements (“national quotas”) set for commercial broadcasting are similar to utility-based or knowledge-based recommendation systems. A utility-based recommendation system, for example, would, given two candidates for a playlist, prefer the one that has a Slovak composer, or a performer from Wales, or Farsi lyrics.
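The preference described above can be sketched in a few lines of R. This is only a minimal illustration under stated assumptions, not the code of our Demo App: the candidate table, the column names (base_score, released_in_sk, artist_is_sk), the function rank_by_utility() and the weights are all hypothetical.

```r
# Hypothetical candidate tracks; all names and scores are invented for illustration.
candidates <- data.frame(
  track          = c("A", "B", "C", "D"),
  base_score     = c(0.90, 0.85, 0.60, 0.55),  # e.g. similarity to a seed playlist
  released_in_sk = c(FALSE, TRUE, TRUE, FALSE),
  artist_is_sk   = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Utility = base score plus user-defined bonuses for local content.
# Logical columns are treated as 0/1 in the arithmetic.
rank_by_utility <- function(df, w_release = 0.1, w_artist = 0.3) {
  df$utility <- df$base_score +
    w_release * df$released_in_sk +
    w_artist  * df$artist_is_sk
  df[order(df$utility, decreasing = TRUE), ]
}

ranked <- rank_by_utility(candidates)
print(ranked[, c("track", "utility")])
```

With both weights set to zero the ranking falls back to the base score alone, so the user-defined priority is an explicit, inspectable part of the ranking rather than a hidden one.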

Our Demo App creates recommendations on the basis of a pre-existing radio or personal streaming playlist that, by choice, contains a pre-defined ratio of music produced in Slovakia, or performed by Slovak artists. We will soon add Dutch and Hungarian choices to this demo, but naturally, we could add any city’s, region’s, province’s or country’s preferences to the app.
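A pre-defined ratio of local music can be illustrated with a simple quota rule. The sketch below is an assumption-laden example, not the Demo App's implementation: the function build_playlist(), the columns score and is_local, and the example data are hypothetical.

```r
# Hypothetical pool of tracks with a relevance score and a local-content flag.
pool <- data.frame(
  track    = c("t1", "t2", "t3", "t4", "t5", "t6"),
  score    = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
  is_local = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Pick n tracks so that at least `local_share` of them are local (if available),
# taking the best-scoring tracks within each group.
build_playlist <- function(df, n, local_share) {
  df <- df[order(df$score, decreasing = TRUE), ]
  n_local <- min(ceiling(n * local_share), sum(df$is_local))
  local_pick <- head(df[df$is_local, ], n_local)
  rest <- df[!(df$track %in% local_pick$track), ]
  other_pick <- head(rest, n - n_local)
  out <- rbind(local_pick, other_pick)
  out[order(out$score, decreasing = TRUE), ]
}

playlist <- build_playlist(pool, n = 4, local_share = 0.5)
print(playlist)
```

Here the quota acts as a hard constraint; a softer variant would add a bonus to the score of local tracks instead of reserving slots for them.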

Our Demo App and the accompanying Feasibility Study in Slovakia show how a regulator can create better broadcasting regulations by learning from the experience with AI-driven streaming platforms, and how it can apply the goals of local content requirements (such as a certain visibility for Slovak or city-based music) and public service requirements (for example, spreading reliable information or stopping hateful music).

Access for all

We do not believe that the current heated discussion on the re-regulation of AI and music streaming will solve all the problems of independent artists, bands from ethnic or racial minorities, or otherwise vulnerable producers. New regulation can limit the unintended collateral damage of big data algorithms deployed by big corporations, but it will not bring the benefits of AI to these creators.

Take the example of the controversial new initiative that lets artists, labels and publishers promote their music in exchange for a cut in royalty rates. While many artists feel it is unjust and even corrupt, it is an answer to the growing need to influence how the recommendation algorithms promote certain music at the expense of tens of millions of sound recordings that are not recommended.

We believe that algorithms create value for the users, and if artists are not happy to pay corporations to influence black box algorithms, then they must collaborate, share data, and build large enough data pools so that they can deploy white, transparent algorithms that work for them.

Our Feasibility Study shows why it is important that creators have control over what data describes their music, their biographies and other information online, because corporate streaming platforms use this information for their algorithms.
We show that with relatively little effort creators can pool enough information to create alternative recommendation systems that follow a more agreeable goal: systems that are sensitive to local content requirements, that give more access to new artists, women or black performers, or that suppress hateful lyrics.
The benefit of an open algorithm and pooled data is that artists can actively look for audiences in various age groups or in cities that are accessible to them on a performing tour after the pandemic.

Overall, we want to show that regulating black box, private algorithms and data monopolies is only a first step towards damage control. Only deploying white, transparent algorithms and building collaborative or open data pools can guarantee fairness on digital platforms, in recommendations, and generally in the use of AI.

Reproducible research in practice: empirical study on the structural conditions of book piracy in global and European academia

Sat, 05 Dec 2020 08:10:00 +0200

PLOS One is the fourth most influential multidisciplinary journal after Nature, Science, and Proceedings of the National Academy of Sciences of the United States of America (based on the H index). On December 3, 2020 it published a paper co-authored by Dr. Balazs Bodo, associate professor at the Institute for Information Law (IViR); Daniel Antal (Reprex, Demo Music Observatory), a data scientist interested in reproducible research, as an independent researcher; and Zoltan Puha, a Data Science PhD at Tilburg University, JADS. PLOS (Public Library of Science) is a nonprofit Open Access publisher, empowering researchers to accelerate progress in science and medicine by leading a transformation in research communication.

The article utilizes our reproducible datasets created with our regions package, and builds on many years of expertise in empirical research in the field of music and audiovisual piracy, home copying and private copying compensation (see for example Private Copying in Croatia). Our aim is to provide reliable, high quality indicators for the creative industries not only on the national level, but on the provincial, state, regional and metropolitan area level, too, because these levels are often more relevant for creators, performers and policy-makers.

The topic of the paper is Library Genesis (LG), the biggest piratical scholarly library on the internet, which provides copyright infringing access to more than 2.5 million scientific monographs, edited volumes, and textbooks. The paper uses advanced statistical methods to explain why researchers around the globe use copyright infringing knowledge resources. The analysis is based on a huge usage dataset from LG, as well as data from the World Bank, Eurostat, and Eurobarometer, to identify the role of macroeconomic factors, such as R&D and higher education spending, GDP, and researcher density, in scholarly copyright infringing activities.

We created a global and a far more detailed European model for pirate book downloads.

The main finding of the paper is that open access, even if it is radical, is not a panacea. The hypothesis of the research was that researchers in low-income regions use piratical open knowledge resources relatively more to compensate for the limitations of their legal access infrastructures. The authors found evidence to the contrary. Researchers in high-income countries and European regions with access to high quality knowledge infrastructures and high levels of funding use radical open access resources more intensively than researchers in lower-income countries and regions with less well-resourced libraries. This means that while open knowledge is an important resource to close the knowledge gap between centre and periphery, equality in access does not translate into equality in use. Structural knowledge inequalities are both present and being reproduced in the context of open access resources.

The paper is unique not just because of the data it is based on. It also sets new standards in interdisciplinary legal research by publishing the paper, the data and the software code at the same time in open access repositories, following reproducible research best practices. These are the practices that we want to promote in our Demo Music Observatory and in further data observatories that serve business, evidence-based policy and scientific research.

Feasibility Study For The Establishment Of A European Music Observatory & The Demo Observatory

Mon, 16 Nov 2020 07:03:00 +0200

The Feasibility study for the establishment of a European Music Observatory was published on 13 November. Our private observatory, CEEMID, was consulted in the creation of the Feasibility Study, and some of our recommendations found their way into the consultant’s document. We created a Demo Music Observatory to provide practical guidance on the decisions facing the European stakeholders, and to answer the questions that were left open in the Feasibility Study, particularly on data integration and the institutional model, where a wrong choice can lead to very long delivery times and to quality control and budgeting problems.

We have been developing our Demo Music Observatory in the world’s 2nd ranked university-backed incubator program, the Yes!Delft AI Validation Lab since 15 September 2020. Our aim is to show a better organizational model, examples of research automation and other data integration innovation that can reduce the budgetary needs of the European Music Observatory by 80-90% and provide far more timely, accurate, and relevant service than most data observatories in Europe.

CEEMID has been creating a similar data observatory to the foreseen European Data Observatory, solely based on the contribution of about 60 European stakeholders. As the Feasibility Study suggests, we would be happy to transfer much of CEEMID’s content to the European Data Observatory, which could potentially fill up about 50-70% of the envisioned observatory. We are building our Demo Music Observatory based on the 2000 pan-European indicators collected by CEEMID since 2014.

Challenge Our Demo Observatory: Check out the Music Diversity & Circulation Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us — we will try to put the data there from our repositories.

Illusory data gap: active and music participation is available on EU level both for gender groups or four ethnic minorities – this is regularly featured in various European CAP surveys and in our national CAP surveys, too.

The Feasibility Study is based on perceived gaps between the data needs of the European stakeholders and data availability. Earlier this year we showed the European stakeholders that many of these data gaps are illusory. We would like to give about 50 indicators with full documentation and automated weekly, monthly, quarterly, or annual refreshment for free to all music industry users. We would like to challenge the stakeholders to formulate data requests to us and to think together about how the European music industry could build a better observatory faster and at a lower cost.

Challenge Our Demo Observatory: Check out the Music Economy Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us — we will try to put the data there from our repositories.

The Feasibility Study concludes that a “European Music Observatory would require a very significant allocation of funds, beyond what could be currently expected from the possible budget of the future Creative Europe programme”. While the Feasibility Study does not provide cost options or any cost-benefit analysis, we are certain that this is an exaggeration. Most European data observatories operate with an annual subsidy of 20,000 to 200,000 euros. We want to show with our Demo Music Observatory what can be achieved with an annual budget of 20,000 euros, 50,000 euros, 100,000 euros or 200,000 euros.

Challenge Our Demo Observatory: Check out the Music, Society and Citizenship Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us — we will try to put the data there from our repositories.

Product/Market Fit Validation in Yes!Delft

Fri, 25 Sep 2020 15:31:39 +0000

We would like to validate our product/market fit in two segments, business/policy research and scientific research, with a supporting role given to data journalism. Because we want to follow a bootstrapping strategy, we must focus on those clients where we find the highest value proposition, which is of course easier said than done. We see much interest in our offering from other continents, therefore we truly welcome the opportunity to do this on a truly global business canvas in one of the world’s top five incubators, the number 2 university-backed incubator in the world, second to none in Europe: the Yes!Delft AI+Blockchain Validation Lab.

In Europe hundreds of thousands of microenterprises, such as record labels, video producers or book publishers are facing data and AI giants like Google’s YouTube, Apple Music, Spotify, Netflix or Amazon. If the recommendation engines of these giants do not recommend their songs, films or books, then their investments are doomed to fail, because about half of the global sales are driven by AI algorithms. When they make a claim for the missing money, they will immediately find themselves in a dispute with gigabytes of data that they can only handle with a data scientist, even though they do not even have an IT professional or an HR professional to make the hire.

An awful lot of money, creativity and real values are at stake, and we want to be on the creators’ side, their technicians’ side, their managers’ side when they want to get a fair share of the pie and when they want to help these industry leaders make the pie grow.

UNESCO and the EU have been promoting so-called data observatories as an organizational solution to the fragmentation problem. These observatories pool the business, policy, and scientific research needs of various domains, like music. This is an idea that we really like, and we believe that our research automation solutions can help these observatories grow faster as ecosystems and create better quality and more timely data and research products at a far lower cost.

We define ourselves as a reproducible research company inspired by the philosophy of open collaboration, based on open-source software and open data. We want to explore various revenue models around these ideas.

We are not committed to open source licensing if more permissive licensing policies provide us with better opportunities.
We would like to explore various data-as-service models, because we do not want to be locked into the position of cheap open data vendors.
We want to deploy AI applications that really help earn money in these sectors with playlisting, recommendation engines, forecasting applications, or royalty valuations, because our open collaboration approach brings up enough data sooner than its alternatives: it manages inherent conflicts of interest, fragmentation, and decentralization better than hierarchical solutions.

Timeline

In January CEEMID reached its peak: we introduced a 12-country reproducible research project made with only freelancers in Brussels, presented as best use case of evidence-based policy design.
In February Daniel visited the Yes!Delft Co-Lab to find out who would be the best co-founder to re-launch CEEMID as an enterprise.
In April we started to release our data as open data for validation.
One month ago we started-up.
Then we launched the music.dataobservatory.eu project.
A few other data observatories.

Bonus:

Palato in the Hague, where we took our selfie and had an absolutely amazing dinner after the pitch. Check them out!

Reproducible Survey Harmonization: retroharmonize Is Released

Mon, 21 Sep 2020 11:31:39 +0000

Our original intention was to make surveying more accessible for music and creative industry partners by relying more on already existing survey data and by better designing complementary, smaller surveys, because surveying and opinion polling are becoming increasingly expensive in the developed world. People are less and less likely to sit down for an interview in their homes. We have tried to harmonize our custom surveys, particularly with Kantar in Hungary and Focus in Slovakia, with existing EU projects. But we ended up making a part of international survey harmonization, across countries and throughout years, easier to automate.

Surveys are like sensors for natural sciences and industrial production. They are essential for almost any social and economic statistical indicator, for calculating inflation, parts of the GDP, or participation in education programs. Making surveys easier to harmonize and exploiting more of the already existing survey data can bring down research costs and increase research value at the same time. (See our earlier blog post Increase The Value Of Market Research With Open Data And Survey Harmonization.)

So, if you are an R user, you can use install.packages("retroharmonize") to get the released 0.1.13 version and work through tutorials with real Eurobarometer or Afrobarometer microdata. With devtools::install_github("antaldaniel/retroharmonize") you can already install the current development version 0.1.14, which handles Perl-like regular expressions; these will be necessary for our next tutorial, currently in the making, for Arab Barometer.

Launching Our Demo Music Observatory

Tue, 15 Sep 2020 08:00:39 +0000

Today, on 15 September 2020, we officially launched our minimal viable product as we promised to partners back in February. This was a particularly difficult period for everybody. We aspired to deliver by September in a very different environment, our hopes for commissioned work went up in flames with the pandemic, and our targeted users, musicians and music entrepreneurs, talent managers, music venues lost most of their income. The organizations helping them, granting authorities, export offices and collective management societies are overwhelmed with the problem. During these troublesome times, our team expanded, attracted great new talent, and kept working.

Our first product is the Demo Music Observatory, a collaborative, automated research-based observatory for the music industry, which is particularly hard hit by the COVID-19 crisis. Not only did great artists, composers, technicians and managers fall victim to the virus, but musicians lost about 50–90% of their income from live music. This translates to a 100% loss for the live music technicians and managers.

See our earlier blogpost on what you see on the video.

The music industry was never a place for great job security. To put on a show, you usually need a network of 10–200 artists, technicians and managers to work together as freelancers without all those social benefits that many people enjoy in other walks of life. We have been trying to figure out how to help this microenterprise and freelancer-network based industry with research for five years. Our aim is to make them competitive when they are talking with their buyers: Google, Apple, Spotify, who are really heavyweight data and AI pros. We also want to help them better plan their tours when they are back on the road, by understanding what sort of audiences and purchasing power await them in different European cities.

We are launching at a time when the music industry is crying for help. Therefore, we have decided to make our demo observatory open and unfinished. Over the last 7 years, we have built up about 2000 music and creative sector indicators to be used for business KPIs, forecasting targets, grant evaluations, royalty valuations, concert demography target group analysis and other professional uses. We would like to open up, based on your needs, about 50 well-designed indicators, and pledge to keep them refreshed daily, corrected, documented, citable and downloadable. Also, feel free to use our most valuable source code: use it for your own purposes, even modify it, as long as you keep it open.

For our smaller partners, we follow what musicians do these days on Bandcamp: name your price. We make a pledge to our small partners: if you need reliable data to plan your next grant calls, calculate royalties, compensations, predict hit candidates, give us the job—and name your price. Post-corona, you can take for a dollar the best music from Bandcamp. You can take our research products, for a limited period, for any amount you name, as long as it is for a good cause and serves the industry, musicians, technicians or managers. In return, we ask for your feedback. Help us validate whether we are on the right track, tell us how we can cooperate after the pandemic, in better times.

Our larger and better funded partners? We ask you to pay the price we name, because we believe that it is a well-justified, fair and competitive price, set by pricing experts.

We appreciate it if you take a look at our offering, or if you pass this blogpost on to your colleagues in the industry. Our main target audience initially is music professionals in broader Europe, but we are planning to cover all major global markets very soon, too. Feedback from the U.S., Australia, Canada, Colombia, Brazil & Argentina is particularly welcome as we have great plans over there!

Who are we?

We started our operations on 1 September 2020 on the basis of CEEMID, a pan-European data observatory that created about 2000 music and creative industry indicators for its users. In the coming days, we are gradually opening up about 50 music industry and 50 broader creative industry indicators in a fully reproducible workflow, with indicators for business and policy decisions that are refreshed and re-processed daily, well formatted and documented.

We would like to validate this approach in one of the world’s most prestigious university-backed incubator programs, the Yes!Delft AI/Blockchain Validation Lab. We are finalists in their selection, and all help from our friends in the music industry before 23 September is more than appreciated. If we get there, we can rely on probably the best pros in Europe to make our offering better tailored and financially sustainable.

Get in touch!

We use the very simple and extremely secure keybase.io, a kind of mix of WhatsApp, Skype, Google Drive, OneDrive and Zoom. You can get in touch with us on that platform at any time here.

You can easily contact Daniel or Kátya on LinkedIn and, of course, we have a usually working email contact form, too. Our email is name.surname at our main domain.

Video credits

Data acquisition and processing: Daniel Antal, CFA and Marta Kołczyńska, PhD (survey data).
Documentation automation: Sandor Budai
Video art: Line Matson
Music: Moon Moon Moon.

Creating An Automated Data Observatory

Fri, 11 Sep 2020 16:00:39 +0000

We are building data ecosystems, so-called observatories, where scientific, business, policy and civic users can find factual information, data and evidence for their domain. Our open source, open data, open collaboration approach allows us to connect various open and proprietary data sources, and our reproducible research workflows allow us to automate data collection, processing, publication, documentation and presentation.

Our scripts check data sources, such as Eurostat’s Eurobase, Spotify’s API and other music industry sources, every day for new information. They process any data corrections or new disclosures, interpolate, backcast or forecast missing values, and make currency translations and unit conversions. This is illustrated in an earlier post.
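Two of the processing steps named above, interpolation and currency translation, can be illustrated with base R. This is a minimal sketch under stated assumptions, not our production workflow: the indicator values, the column names and the constant exchange rate huf_per_eur are all made up for the example.

```r
# Hypothetical annual indicator in Hungarian forint with one missing year.
indicator <- data.frame(
  year      = 2015:2019,
  value_huf = c(100, 110, NA, 130, 140)
)

# Linear interpolation of the missing value between neighbouring years,
# using stats::approx() on the observed points only.
known <- !is.na(indicator$value_huf)
indicator$value_huf_filled <- approx(
  x    = indicator$year[known],
  y    = indicator$value_huf[known],
  xout = indicator$year
)$y

# Currency translation with an assumed, invented constant rate.
huf_per_eur <- 320
indicator$value_eur <- indicator$value_huf_filled / huf_per_eur

print(indicator)
```

A real pipeline would use period-specific exchange rates and would flag which values were observed and which were estimated, so that the estimation remains documented.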

For direct access to the file visit this link.

In the video we show how the creation of an observatory website with well-formatted statistical data dissemination, a technical document in PDF and an ebook can be automated. In our view, our technology is particularly useful in business and scientific research projects, where it is important that the most timely and correct data is always the data being analyzed, and that it remains automatically documented and cited. We are ready to deploy public, collaborative, or private data observatories on short notice.

Data processing costs can be as high as 80% of the total for any in-house AI deployment project. We work mainly with organizations that do not have an in-house data science team and that acquire their data from outside the organization anyway. In their case, this rate can be as high as 95%, meaning that getting and processing the data for deploying AI can be about 20x more expensive than the AI solution itself.
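The arithmetic behind these shares is easy to check. If data work takes a share s of the total project cost, the AI solution takes the remaining 1 - s, so data work costs s / (1 - s) times as much as the AI solution. The helper name data_to_ai_cost_ratio() below is hypothetical and only serves the illustration.

```r
# Ratio of data cost to AI-solution cost when data work takes `share`
# of the total project cost (0 <= share < 1).
data_to_ai_cost_ratio <- function(share) {
  stopifnot(share >= 0, share < 1)
  share / (1 - share)
}

data_to_ai_cost_ratio(0.80)  # about 4
data_to_ai_cost_ratio(0.95)  # about 19
```

An 80% share therefore means data work costs about four times as much as the AI solution, and a 95% share means about nineteen times as much, which the text rounds to 20x.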

AI solutions require a large amount of standardized, well-processed data to learn from. We want to radically decrease the cost of data acquisition and processing for our users so that exploiting AI comes within their reach. This is particularly important in one of our target industries, the music industry, where most global sales are algorithmic and AI-driven. Artists, bands, small labels, publishers, even the national associations of small countries cannot remain competitive if they cannot participate in this technological revolution.

We would like to validate this approach in one of the world’s most prestigious university-backed incubator programs, in the Yes!Delft AI/Blockchain Validation Lab.

Video credits

Data acquisition and processing: Daniel Antal, CFA and Marta Kołczyńska, PhD (survey data).
Documentation automation: Sandor Budai
Video art: Line Matson
Music: Moon Moon Moon.

Starting-up

Mon, 24 Aug 2020 10:15:00 +0000

The big day has come: the co-founders signed the documents at the public notary and started the registration of a reproducible research start-up in Leiden. We got a lot of support from our friends! Your encouragement gives us a lot of energy to accomplish our first milestones, and to get Reprex B.V. going!

Reprex means ‘reproducible example’ in data science. When you are stuck with a problem, creating a reproducible example allows other computer scientists, statisticians, programmers or data users to solve it. In 80% of cases, you find the solution yourself while creating a generalized example. In the other 20%, you can easily reach out for help.

In the coming days, we are launching demo versions of our headline products, the data observatories. music.dataobservatory.eu will be a fully automated online service that collects, processes, cleans, and publishes scientifically valid data about European music every day. Very soon afterwards we will launch two other observatories.

Organizations in the creative and cultural sector, NGOs, most research institutions and data journalism teams are usually very small, and they do not have internal IT or data science capacities. We would like to provide them with a transparent, high-quality, and fully open-source solution to acquire data, process it without errors, document it and make sense of it. With our work, we would like to embrace the idea of open collaboration among creative enterprises, scientific researchers, NGOs, data journalists and policymakers.

Our work will comply with the Open Policy Analysis standards developed by the Berkeley Initiative for Transparency in the Social Sciences & Center for Effective Global Action and the four principles of reproducible research: reviewability, replicability, confirmability and auditability. We believe that these standards apply equally to reproducible finance, to presenting empirical evidence in courts, to advocating sound policies and to producing high-quality journalism.

Do you want to help our start?

We would like to enter the Validation Lab of one of the best artificial intelligence incubators in early September. Talented team members, letters of intent and assignments from organizations will give a lot of credibility to our start. Meet our team ».

Put us in contact with people who love to write code in R and are interested in automating business and social science research and primary data collection such as surveying. Check out what sort of code we create »
Introduce us to people who need data and information to make better-informed decisions and analyses in music, film, book publishing, photography services or socially responsible finance.
Share contacts of data journalists who would like to develop stories from big survey programs like Eurobarometer, Afrobarometer and Latinobarómetro, or base their storytelling on data and its visualizations. See our survey harmonization examples »

Do you know such people? Send them this post, or connect us in an email or social media message!

Thanks again for your good wishes and encouragement. We hope to hear from you soon!