R | Reprex

Learn R with Reprex

Fri, 07 Oct 2022 12:35:00 +0200

Big Data Creates Inequalities

Only the largest corporations, best-endowed universities, and rich governments can afford data collection and processing capacities that are large enough to harness the advantages of AI.

Big data that works for all

No matter how big the problem or how small your team, `Reprex` fills your reports, dashboards, newsletters, and books with data and its visualization.
Learn R with us: you can reduce the inequalities by joining the open source movement, learning to run open source software, asking for help, improving the tutorials and the documentation, and eventually learning to make the computer work for you.
Contributor Covenant: Participating in open source is often a highly collaborative experience. We’re encouraged to create in public view, and we’re incentivized to welcome contributions of all kinds from people around the world. This makes the practice of open source as much social as it is technical.

Get Inspired

Find more interesting and better data: you don’t have to be a data scientist or write code to contribute to our projects.
Data feminism: Catherine D’Ignazio and Lauren Klein present a new way of thinking about data science and data ethics—one that is informed by intersectional feminist thought. Highly inspirational, free, open-source book.
RLadies is a world-wide organization to promote gender diversity in the R community.

Contributor Covenant

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Run code from tutorials

retroharmonize.dataobservatory.eu
🖱 Get started
[🖱️ Articles](https://retroharmonize.dataobservatory.eu/articles/index.htm

Find help, ask for help: reprex

Documentation for better tutorials

Debugging and testing code

Contribute to documentation

R is a functional language

R is both a statistical environment and a programming language
R, the open source and further developed version of the S language, is mainly functional
If you have done a task at least twice, the third time you had better write a function so that you can keep doing it forever.
Most of your effort will be to find a well-written function for your work
If you cannot find a function, you will modify somebody else’s function, or eventually write your own
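
A minimal sketch of the "third time, write a function" rule from the list above. The helper name `share_pct()` and the visitor numbers are hypothetical, invented only for illustration.

```r
# Hypothetical helper: percentage share of each value in a numeric vector.
# Written once, it replaces arithmetic that would otherwise be copy-pasted.
share_pct <- function(x, digits = 1) {
  stopifnot(is.numeric(x))
  round(100 * x / sum(x, na.rm = TRUE), digits)
}

# Without a function, the same arithmetic is repeated for every dataset:
visitors_2021 <- c(120, 30, 50)
round(100 * visitors_2021 / sum(visitors_2021), 1)

# With the function, the task becomes a single call that can be tested:
share_pct(visitors_2021)
share_pct(c(1, 1, 2), digits = 0)
```

Once the function exists, improving it (for example, how it treats missing values) improves every report that calls it.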

R + YAML + markdown = web ready

Learn YAML in Y minutes: tell the computer what you want to do with a document
R Markdown basics: it is just plain markdown that allows you to insert little R program ‘chunks’.
Awesome markdown editors and pre-writers: find a convenient tool
Google Docs to markdown: practice by translating your Google Docs text to markdown. It is very easy.
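
To make the three ingredients concrete, here is a small sketch in base R that writes a minimal R Markdown file: a YAML header, plain markdown text, and one R chunk. The title and the chunk content are placeholders we made up; rendering is left commented out because it assumes the rmarkdown package is installed.

```r
# Build a minimal R Markdown document as plain text lines.
fence <- strrep("`", 3)  # three backticks, built here to keep the example readable

rmd_lines <- c(
  "---",                          # YAML header: what to do with the document
  "title: \"My first report\"",   # placeholder title
  "output: html_document",
  "---",
  "",
  "## A heading in plain markdown",
  "",
  "Some text, followed by a little R chunk:",
  "",
  paste0(fence, "{r}"),           # start of an R code chunk
  "summary(cars)",                # 'cars' is a dataset shipped with R
  fence                           # end of the chunk
)

rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# If the rmarkdown package is installed, this renders the file to HTML:
# rmarkdown::render(rmd_file)
```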

Package and release: a team effort

Our open source development projects

🔢 dataset: Synchronize datasets with global knowledge hubs
#️⃣ statcodelists: Make your data codes understood globally
♻️ iotables: Create economic or environmental impact assessments in any EU country
🌍 regions: Create more granular statistics from raw survey data in any EU country
✅ retroharmonize: Harmonize question banks, recycle answers from past surveys
⏭️ all on one page

Create with us

Questions?

Email | Keybase

LinkedIn: Daniel Antal - Reprex | Home

statcodelists: use standard, language-independent variable codes to help international data interoperability and machine reuse in R

Wed, 29 Jun 2022 08:12:00 +0100

Visit the documentation website of statcodelists on statcodelists.dataobservatory.eu/.

The goal of statcodelists is to promote the reuse and exchange of statistical information and related metadata by making the internationally standardized SDMX code lists available to the R user. SDMX, the Statistical Data and Metadata eXchange, has been published as an ISO International Standard (ISO 17369). The metadata definitions, including the code lists, are updated regularly according to the standard. The authoritative version of the code lists made available in this package is https://sdmx.org/?page_id=3215/.

Purpose

Cross-domain concepts in the SDMX framework describe concepts relevant to many, if not all, statistical domains. SDMX recommends using these concepts whenever feasible in SDMX structures and messages to promote the reuse and exchange of statistical information and related metadata between organisations.

Code lists are predefined sets of terms from which some statistical coded concepts take their values. SDMX cross-domain code lists are used to support cross-domain concepts. What are these cross-domain coded concepts?

Geographical codes, like NL: the Netherlands in the CL_AREA code list.
Standard industry codes, like J631 for Data processing, hosting and related activities (NACE Rev 2 in Europe; beware, it is J592 in Australia and New Zealand, see CL_ACTIVITY_ANZSIC06).
Occupations, like OC2521 for Database designers and administrators in CL_OCCUPATIONS.
Time formatting standards, like CCYY for annual data series in CL_TIME_FORMAT.

Check out the available code lists on the package homepage.

The use of common code lists will help users to work even more efficiently, easing the maintenance of and reducing the need for mapping systems and interfaces delivering data and metadata to them. A very obvious advantage of using the code systems is that you can retrieve data from national sources regardless of the natural language used in North Macedonia, Japan, the U.S. or the Netherlands. While the data labels may change to be locally human-readable, computers and geeks can read the codes and understand them immediately, provided that they use the standard codes.
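
The sketch below illustrates why codes travel better than labels. It does not call statcodelists: the small `cl_area` table is a hand-made stand-in in the spirit of CL_AREA, and the two "sources" with English and German labels are hypothetical.

```r
# Illustrative mini code list in the spirit of CL_AREA (not the package's data).
cl_area <- data.frame(
  id   = c("NL", "JP", "MK"),
  name = c("Netherlands", "Japan", "North Macedonia")
)

# Two hypothetical sources that label the same areas in different languages.
source_en <- data.frame(geo   = c("NL", "JP"),
                        label = c("Netherlands", "Japan"),
                        value = c(10, 20))
source_de <- data.frame(geo   = c("NL", "MK"),
                        label = c("Niederlande", "Nordmazedonien"),
                        value = c(11, 30))

# The labels do not match across sources, but the codes do:
# drop the local labels and attach one standard name by code.
combined <- rbind(source_en, source_de)
combined <- merge(combined[, c("geo", "value")], cl_area,
                  by.x = "geo", by.y = "id")
combined
```

Both Dutch observations end up under the same name, although one source called the country "Niederlande".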

Our data observatories are rolling out SDMX coding across all datasets to help data ingestion and interoperability, data findability and data reuse. statcodelists can help the use of standard SDMX codes in your R workflow–both for downloading data from statistical agencies and to produce publication-ready datasets that the rest of the world (and even APIs) will understand.

Installation

You can install statcodelists from CRAN:

install.packages("statcodelists")

Further recommended code values for expressing general statistical concepts like not applicable, etc., can be found in section Generic codes of the Guidelines for the creation and management of SDMX Cross-Domain Code Lists.

For further code lists used by reliable statistical agencies but not harmonized at the SDMX level, please consult the SDMX Global Registry Codelists page.

The creator of this package is not affiliated with SDMX, and this package has not been endorsed by SDMX.

Code of Conduct

Please note that the statcodelists project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Including Indicators from Arab Barometer in Our Observatory

Mon, 28 Jun 2021 09:00:00 +0000

A new version of the retroharmonize R package – which is working with retrospective, ex post harmonization of survey data – was released yesterday after peer-review on CRAN. It allows us to compare opinion polling data from the Arab Barometer with the Eurobarometer and Afrobarometer. This is the first version that is released in the rOpenGov community, a community of R package developers on open government data analytics and related topics.

Surveys are the most important data sources in social and economic statistics – they ask people about their lives, their attitudes and self-reported actions, or record data from companies and NGOs. Survey harmonization makes survey data comparable across time and countries. It is very important, because often we do not know without comparison if an indicator value is low or high. If 40% of the people think that climate change is a very serious problem, it does not really tell us much without knowing what percentage of the people answered this question similarly a year ago, or in other parts of the world.

With the help of Ahmed Shaibani and Yousef Ibrahim, we created a third case study, after the Eurobarometer and Afrobarometer, about working with the Arab Barometer harmonized survey data files.

Ex ante survey harmonization means that researchers design questionnaires that ask the same questions with the same survey methodology at repeated, distinct times (waves), or across different countries with carefully harmonized question translations. Ex post harmonization means that the resulting data has the same variable names, same variable coding, and can be joined into a tidy data frame for joint statistical analysis. While seemingly a simple task, it involves plenty of metadata adjustments, because established survey programs like Eurobarometer, Afrobarometer or Arab Barometer have several decades of history, and several decades of coding practices and file formatting legacy.

Variable harmonization means that if the same question is called Q108 in one microdata source and eval-parl-elections in the other, then we make sure that they get a harmonized, machine-readable name without spaces and special characters.
Variable label harmonization means that the same questionnaire items get the same numeric coding and same categorical labels.
Missing case harmonization means that various forms of missingness are treated the same way.
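
The first of the three steps above, variable name harmonization, can be sketched in base R with a crosswalk table. This is only an illustration of the idea, not the retroharmonize API: the two waves, the helper `rename_vars()` and the harmonized name `eval_parl_elections` are hypothetical.

```r
# Two hypothetical survey waves that store the same question under different names.
wave_1 <- data.frame(id = 1:3, Q108 = c(1, 4, 9))
wave_2 <- data.frame(id = 4:6, `eval-parl-elections` = c(2, 8, 3),
                     check.names = FALSE)

# Crosswalk: original name in each source -> harmonized, machine-readable name.
crosswalk <- data.frame(
  source   = c("wave_1", "wave_2"),
  var_orig = c("Q108", "eval-parl-elections"),
  var_new  = c("eval_parl_elections", "eval_parl_elections")
)

# Hypothetical helper: rename the variables of one source using the crosswalk.
rename_vars <- function(df, source_name, crosswalk) {
  cw  <- crosswalk[crosswalk$source == source_name, ]
  hit <- match(cw$var_orig, names(df))
  names(df)[hit] <- cw$var_new
  df
}

# After renaming, the waves can be stacked into one data frame.
harmonized <- rbind(
  rename_vars(wave_1, "wave_1", crosswalk),
  rename_vars(wave_2, "wave_2", crosswalk)
)
names(harmonized)
```

Keeping the crosswalk as data, rather than as scattered rename calls, is what makes the process easy to document and reproduce.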

For the evaluation of the economic situation dataset, get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare.

In our new Arab Barometer case study, the evaluation of parliamentary elections has the following labels. We code them consistently 1: free_and_fair, 2: some_minor_problems, 3: some_major_problems and 4: not_free.

“0. missing”	“1. they were completely free and fair”
“2. they were free and fair, with some minor problems”	“3. they were free and fair, with some major problems”
“4. they were not free and fair”	“8. i don’t know”
“9. declined to answer”	“Missing”
“They were completely free and fair”	“They were free and fair, with some minor breaches”
“They were free and fair, with some major breaches”	“They were not free and fair”
“Don’t know”	“Refuse”
“Completely free and fair”	“Free and fair, but with minor problems”
“Free and fair, with major problems”	“Not free or fair”
“Don’t know (Do not read)”	“Decline to answer (Do not read)”
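
As a rough sketch of how such label variants can be collapsed, the base R function below maps a few of the labels listed above to the four harmonized categories plus the forms of missingness. The function name and the pattern rules are our own simplification for this post, not the code of the case study, and the labels are typed with plain ASCII apostrophes.

```r
# Hypothetical helper: collapse label variants into harmonized categories.
harmonize_election_label <- function(x) {
  x <- tolower(x)
  x <- sub("^[0-9]+\\.[[:space:]]*", "", x)   # drop a leading "1. ", "8. " etc.
  out <- rep(NA_character_, length(x))
  out[grepl("completely free", x)] <- "free_and_fair"
  out[grepl("minor", x)]           <- "some_minor_problems"
  out[grepl("major", x)]           <- "some_major_problems"
  out[grepl("not free", x)]        <- "not_free"
  out[grepl("know", x)]            <- "do_not_know"
  out[grepl("declin|refus", x)]    <- "declined"
  out[grepl("^missing$", x)]       <- "missing"
  out
}

raw_labels <- c(
  "1. they were completely free and fair",
  "They were free and fair, with some minor breaches",
  "Free and fair, with major problems",
  "Not free or fair",
  "8. i don't know",
  "Decline to answer (Do not read)",
  "Missing"
)

labels_harmonized <- harmonize_election_label(raw_labels)

# Numeric codes 1 to 4 for the valid answers; every form of missingness is NA.
valid_levels <- c("free_and_fair", "some_minor_problems",
                  "some_major_problems", "not_free")
codes <- match(labels_harmonized, valid_levels)
data.frame(raw_labels, labels_harmonized, codes)
```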

Of course, this harmonization is essential to get clean results like this:

To evaluate or reuse the parliamentary elections dataset, get the replication data and the code from the Zenodo open repository.

In our case study, we had three forms of missingness: the respondent did not know the answer, the respondent did not want to answer, and lastly, in some cases the respondent was not asked, because the country held no parliamentary elections. While in numerical processing all these answers must be left out when calculating averages, for example, in a more detailed, categorical analysis they represent very different cases. A high level of refusal to answer may in itself be an indicator of the suppression of democratic opinion forming.
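
The difference between the numeric and the categorical view of missingness can be shown with a few made-up answers. The category names `do_not_know`, `declined` and `not_asked` are labels we chose for this sketch.

```r
# Hypothetical harmonized answers from seven respondents.
answers <- c("free_and_fair", "not_free", "do_not_know", "declined",
             "not_asked", "some_minor_problems", "declined")

valid_levels <- c("free_and_fair", "some_minor_problems",
                  "some_major_problems", "not_free")

# Numeric view: every form of missingness becomes NA and drops out of the mean.
numeric_code <- match(answers, valid_levels)
mean(numeric_code, na.rm = TRUE)

# Categorical view: the three kinds of missingness stay distinguishable.
missing_profile <- table(factor(
  answers,
  levels = c(valid_levels, "do_not_know", "declined", "not_asked")
))
missing_profile
```

The average is computed from three valid answers only, while the table still shows that two respondents declined to answer.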

Survey harmonization with many countries entails tens of thousands of small data management tasks, which, unless automatically documented, logged, and created with reproducible code, is a hopelessly error-prone process. We believe that our open-source software will bring to light much new statistical information, which, while legally open, was never processed due to the large investment needed.

We also started building experimental APIs whose data is produced by running retroharmonize regularly. We will place cultural access and participation data in the Digital Music Observatory, climate awareness, policy support and self-reported mitigation strategies into the Green Deal Data Observatory, and economy and well-being data into our Economy Data Observatory.

Further plans

Retrospective survey harmonization is a far more complex task than this blogpost suggests, because established survey programs have gathered decades of legacy data in legacy coding schemes and legacy file formats. Putting the data right, and especially putting the invaluable descriptive and administrative (processing) metadata right, is a huge undertaking. We are releasing example code, datasets and charts for researchers to compare our harmonized results with theirs, and to improve our software.

Use our software

The retroharmonize R package can be freely used, modified and distributed under the GPL-3 license. For the main developer and contributors, see the package homepage. If you use it for your work, please kindly cite it as:

Daniel Antal (2021). retroharmonize: Ex Post Survey Data Harmonization. R package version 0.1.17. https://doi.org/10.5281/zenodo.5034752

Download the BibLaTeX entry.

Tutorial to work with the Arab Barometer survey data

Daniel Antal, & Ahmed Shaibani. (2021, June 26). Case Study: Working With Arab Barometer Surveys for the retroharmonize R package (Version 0.1.6). Zenodo. https://doi.org/10.5281/zenodo.5034759

For the replication data to report potential issues and improvement suggestions with the code:

Daniel Antal, & Ahmed Shaibani. (2021). Replication Data for the retroharmonize R Package Case Study: Working With Arab Barometer Surveys (Version 0.1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5034741

Experimental API

We are also experimenting with the automated placement of authoritative and citeable figures and datasets in open repositories. For the climate awareness dataset, get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare. Our plan is to release open data in a modern API with rich descriptive metadata meeting the Dublin Core and DataCite standards, and further administrative metadata for correct coding, joining and further manipulating the data, or for easy import into your database.

Join our open source effort

Want to help us improve our open data service? Include Latinobarómetro and the Caucasus Barometer in our offering? Join the rOpenGov community of R package developers, and our open collaboration to create the automated data observatories. We are not only looking for developers, but data curators and service design associates, too.

Open Data - The New Gold Without the Rush

Fri, 18 Jun 2021 17:00:00 +0000

If open data is the new gold, why do even those who release it fail to reuse it? We created an open collaboration of data curators and open-source developers to dig into novel open data sources and/or increase the usability of existing ones. We transform reproducible research software into research-as-service.

Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.

Most open data is not public: it is not downloadable from the Internet. In EU parlance, “open” only means a legal entitlement to get access to it. And even in the rare cases when data is open and public, it is often marred by data quality issues. We are working on the prototypes of a data-as-service and research-as-service built with open-source statistical software that taps into various and often neglected open data sources.

We are in the prototype phase in June, and we intend to have a well-functioning service by the time of the conference. Because we are working only with open-source software elements, our technological readiness level is already very high. The novelty of our process is that we are trying to further develop and integrate a few open-source technology items into technologically and financially sustainable data-as-service and even research-as-service solutions.

Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations’ open data. Instead, they use various, and often not well processed, proprietary sources.

We are taking a new and modern approach to the data observatory concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science. Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories, or permanent data collection and dissemination points, but even these do not use the open data of these organizations and their members. We are building open-source data observatories, which run open-source statistical software that automatically processes and documents reusable public sector data (from public transport, meteorology, tax offices, taxpayer funded satellite systems, etc.) and reusable scientific data (from EU taxpayer funded research) into new, high quality statistical indicators.

We are building various open-source data collection tools in R and Python to bring up data from big data APIs and legally open, but not public, and not well served data sources. For example, we are working on capturing representative data from the Spotify API or creating harmonized datasets from the Eurobarometer and Afrobarometer survey programs.
Open data is usually not public; whatever is legally accessible is usually not ready to use for commercial or scientific purposes. In Europe, almost all taxpayer funded data is legally open for reuse, but it is usually stored in heterogeneous formats, processed into an original government or scientific need, and with various and low documentation standards. Our expert data curators are looking for new data sources that should be (re-) processed and re-documented to be usable for a wider community. We would like to introduce our service flow, which touches upon many important aspects of data scientist, data engineer and data curatorial work.
We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.
We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.
We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.
We maintain observatory websites (see: Digital Music Observatory, Green Deal Data Observatory, Economy Data Observatory) where not only the data is available, but we provide tutorials and use cases to make it easier to use them. Our mission is to show a modern, 21st century reimagination of the data observatory concept developed and supported by the UN, EU and OECD, and we want to show that modern reproducible research and open data could make the existing 60 data observatories and the planned new ones grow faster into data ecosystems.

We are working around the open collaboration concept, which is well-known in open source software development and reproducible science, but we try to make this agile project management methodology more inclusive, and include data curators, and various institutional partners into this approach. Based around our early-stage startup, Reprex, and the open-source developer community rOpenGov, we are working together with other developers, data scientists, and domain specific data experts in climate change and mitigation, antitrust and innovation policies, and various aspects of the music and film industry.

Our open collaboration is truly open: new data curators, developers and service designers, even volunteers and citizen scientists are welcome to join.

Our open collaboration is truly open: new data curators, data scientists and data engineers are welcome to join. We develop open-source software in an agile way, so you can join in with an intermediate programming skill to build unit tests or add new functionality, and if you are a beginner, you can start with documentation and testing our tutorials. For business, policy, and scientific data analysts, we provide unexploited, exciting new datasets. Advanced developers can join our development team: the statistical data creation is mainly made in the R language, and the service infrastructure in Python and Go components.

There are Numerous Advantages of Switching from a National Level of the Analysis to a Sub National Level

Wed, 16 Jun 2021 12:00:00 +0000

The new version of our rOpenGov R package regions was released today on CRAN. This package is one of the engines of our experimental open data-as-service Green Deal Data Observatory, Economy Data Observatory, and Digital Music Observatory prototypes, which aim to place open data packages into open-source applications.

In international comparisons, the use of nationally aggregated indicators often has many disadvantages: they exhibit very different levels of homogeneity, and the number of observations is often very limited for a cross-sectional analysis. When comparing European countries, a few missing cases can limit the cross-section of countries to around 20 cases, which rules out many analytical methods. Working with sub-national statistics has many advantages: the similarity of the aggregation level and the high number of observations allow more precise control of model parameters and errors, and the number of observations grows from 20 to 200-300.

The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names, and their codes change very frequently.

Yet the change from national to sub-national level comes with a huge data processing price. National boundaries are relatively stable, with only a handful of changes in each recent decade: changing them requires a more-or-less global consensus. But states are free to change their internal administrative boundaries, and they do so frequently. This means that the names, identification codes and boundary definitions of sub-national regions change very frequently. Joining data from different sources and different years can be very difficult.
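
A toy example of the joining problem, in base R only. The region codes below are invented (they are not real NUTS codes) and the correspondence table is hypothetical; this sketch only shows the kind of bookkeeping involved, not how the regions package does it.

```r
# Hypothetical correspondence table between two boundary definitions.
# The codes are invented for illustration only.
correspondence <- data.frame(
  code_old = c("XX01", "XX02", "XX03"),
  code_new = c("XX01", "XXA2", NA),   # unchanged, recoded, split (no 1:1 successor)
  change   = c("unchanged", "recoded", "split")
)

old_data <- data.frame(geo = c("XX01", "XX02", "XX03"),
                       year = 2012, value = c(5, 7, 9))
new_data <- data.frame(geo = c("XX01", "XXA2"),
                       year = 2018, value = c(6, 8))

# Translate the old codes before joining; keep a note of what happened to each.
idx <- match(old_data$geo, correspondence$code_old)
old_data$geo_harmonized <- correspondence$code_new[idx]
old_data$change         <- correspondence$change[idx]

# Only regions with a one-to-one successor can be stacked directly;
# the split region needs imputation or re-aggregation instead.
panel <- rbind(
  old_data[!is.na(old_data$geo_harmonized), c("geo_harmonized", "year", "value")],
  data.frame(geo_harmonized = new_data$geo, new_data[, c("year", "value")])
)
panel[order(panel$geo_harmonized, panel$year), ]
```

A naive join on the raw `geo` column would silently lose the recoded region, which is exactly the kind of error that is hard to spot by eye.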

Our regions R package helps the data processing, validation and imputation of sub-national, regional datasets and their coding.

Switching from a national level of analysis to a sub-national level has numerous advantages, but it comes with a huge price in data processing, validation and imputation, and the regions package aims to help this process.

You can review the problem, and the code that created the two map comparisons, in the Maping Regional Data, Maping Metadata Problems vignette article of the package. A more detailed problem description can be found in Working With Regional, Sub-National Statistical Products.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed parallel with other rOpenGov packages.

Get the Package

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package as: Daniel Antal, Kasia Kulma, Istvan Zsoldos, & Leo Lahti. (2021, June 16). regions (Version 0.1.7). CRAN. http://doi.org/10.5281/zenodo.4965909

Join us

Join our open collaboration Economy Data Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Creating Algorithmic Tools to Interpret and Communicate Open Data Efficiently

Fri, 04 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, what type of data do you usually use in your work?

As an academic data scientist whose research focuses on the development of general-purpose algorithmic methods, I work with a range of applications from life sciences to humanities. Population studies play a big role in our research, and often the information that we can draw from public sources - geospatial, demographic, environmental - provides invaluable support. We typically use open data in combination with sensitive research data but some of the research questions can be readily addressed based on open data from statistical authorities such as Statistics Finland or Eurostat.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

One line of our research analyses the historical trends and spread of knowledge production, in particular book printing based on large-scale metadata collections. It would be interesting to extend this research to music, to understand the contemporary trends as well as the broader historical developments. Gaining access to a large systematic collection of music and composition data from different countries across long periods of time would make this possible.

Why did you decide to join the challenge and why do you think that this would be a game changer for researchers and policymakers?

Joining the challenge was a natural development based on our overall activities in this area; the rOpenGov project has been around for a decade now, since the early days of the broader open data movement. This has also created an active international developer network and we felt well equipped for picking up the challenge. The game changer for researchers is that the project highlights the importance of data quality, even when dealing with official statistics, and provides new methods to solve these issues efficiently through the open collaboration model. For policymakers, this provides access to new high-quality curated data and case studies that can support evidence-based decision-making.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Regarding open government data, one of my favorites is not a single data source but a data representation standard. The px format is widely used by statistical authorities in various countries, and this has allowed us to create R tools that allow the retrieval and analysis of official statistics from many countries across Europe, spanning dozens of statistical institutions. Standardization of open data formats allows us to build robust algorithmic tools for downstream data analysis and visualization. Open government data is still too often shared in obscure, non-standard or closed-source file formats and this is creating significant bottlenecks for the development of scalable and interoperable AI and machine learning methods that can harness the full potential of open data.

From your perspective, what do you see being the greatest problem with open data in 2021?

Although there are a variety of open data sources available (and the numbers continue to increase), the availability of open algorithmic tools to interpret and communicate open data efficiently is lagging behind. One of the greatest challenges for open data in 2021 is to demonstrate how we can maximize the potential of open data by designing smart tools for open data analytics.

What can our automated data observatories do to make open data more credible in the European economic policy community and be accepted as verified information?

The professional network backing the project, together with the prospect of critical feedback and later adoption by academic communities, will support these efforts. Transparency of the data harmonization operations is the key to credibility, and will be further supported by concrete benchmarks that highlight the critical differences in drawing conclusions based on original sources versus the harmonized high-quality data sets.

How can we ensure the long-term sustainability of the efforts?

The open data space is so extensive that no single individual or institution can address all the emerging needs in this area. The open developer networks play a huge role in the development of algorithmic methods, and strong communities have developed around specific open data analytical environments such as R, Python, and Julia. These communities support networked collaboration and provide services such as software peer review. The long-term sustainability will depend on the support that such developer communities can receive, both from individual contributors and from institutions and governments.

Join us

Reproducible Survey Harmonization: retroharmonize Is Released

Mon, 21 Sep 2020 11:31:39 +0000

Our original intention was to make surveying more accessible for music and creative industry partners by relying more on existing survey data and by better designing complementary, smaller surveys, because surveying and opinion polling are becoming increasingly expensive in the developed world. People are less and less likely to sit down for an interview in their homes. We have tried to harmonize our custom surveys, particularly those conducted with Kantar in Hungary and Focus in Slovakia, with existing EU projects. But we ended up making a part of international survey harmonization, across countries and over the years, easier to automate.

Surveys are to the social sciences what sensors are to the natural sciences and industrial production. They are essential for almost any social and economic statistical indicator: for calculating inflation, parts of GDP, or participation in education programs. Making surveys easier to harmonize and making better use of existing survey data can bring down research costs and increase research value at the same time. (See our earlier blog post Increase The Value Of Market Research With Open Data And Survey Harmonization.)

So, if you are an R user, you can use install.packages("retroharmonize") to get the released 0.1.13 version and work through tutorials with real Eurobarometer or Afrobarometer microdata. With devtools::install_github("antaldaniel/retroharmonize") you can already install the current development version 0.1.14, which handles Perl-like regular expressions; this will be necessary for our next tutorial, now in the making, for Arab Barometer.

Launching Our Demo Music Observatory

Tue, 15 Sep 2020 08:00:39 +0000

Today, on 15 September 2020, we officially launched our minimum viable product, as we promised our partners back in February. This was a particularly difficult period for everybody. When we set out to deliver by September, the environment was very different: our hopes for commissioned work went up in flames with the pandemic, and our targeted users (musicians, music entrepreneurs, talent managers and music venues) lost most of their income. The organizations helping them, such as granting authorities, export offices and collective management societies, are overwhelmed by the problem. During these troubled times, our team expanded, attracted great new talent, and kept working.

Our first product is the Demo Music Observatory, a collaborative, automated, research-based observatory for the music industry, which has been hit particularly hard by the COVID-19 crisis. Not only did great artists, composers, technicians and managers fall victim to the virus, but musicians also lost about 50–90% of their income from live music. This translates to a 100% loss for live music technicians and managers.

See our earlier blog post on what you see in the video.

The music industry was never a place for great job security. To put on a show, you usually need a network of 10–200 artists, technicians and managers working together as freelancers, without the social benefits that many people enjoy in other walks of life. We have been trying to figure out how to help this industry, built on microenterprises and freelancer networks, with research for five years. Our aim is to make them competitive when they are talking with their buyers: Google, Apple and Spotify, who are real heavyweight data and AI pros. We also want to help them better plan their tours, once they are back on the road, by understanding what sort of audiences and purchasing power await them in different European cities.

We are launching at a time when the music industry is crying out for help. Therefore, we have decided to make our demo observatory open and unfinished. Over the last 7 years, we have built up about 2000 music and creative sector indicators to be used for business KPIs, forecasting targets, grant evaluations, royalty valuations, concert demography target group analysis and other professional uses. We would like to open up, based on your needs, about 50 well-designed indicators, and we pledge to keep them refreshed daily, corrected, documented, citable and downloadable. Also, feel free to use our most valuable source code: use it for your own purposes, even modify it, as long as you keep it open.

For our smaller partners, we follow what musicians do these days on Bandcamp: name your price. We make a pledge to our small partners: if you need reliable data to plan your next grant calls, calculate royalties or compensation, or predict hit candidates, give us the job and name your price. Post-corona, you can get the best music on Bandcamp for a dollar. You can take our research products, for a limited period, for any amount you name, as long as it is for a good cause and serves the industry, musicians, technicians or managers. In return, we ask for your feedback. Help us validate whether we are on the right track, and tell us how we can cooperate after the pandemic, in better times.

Our larger and better funded partners? We ask you to pay the price we name, because we believe that it is a well-justified, fair and competitive price, set by pricing experts.

We would appreciate it if you took a look at our offering, or passed this blog post on to your colleagues in the industry. Our main target audience initially is music professionals in broader Europe, but we are planning to cover all major global markets very soon, too. Feedback from the U.S., Australia, Canada, Colombia, Brazil and Argentina is particularly welcome, as we have great plans over there!

Who are we?

We started our operations on 1 September 2020 on the basis of CEEMID, a pan-European data observatory that created about 2000 music and creative industry indicators for its users. In the coming days, we are gradually opening up about 50 music industry and 50 broader creative industry indicators in a fully reproducible workflow, with daily refreshed, reprocessed, well-formatted and documented indicators for business and policy decisions.

We would like to validate this approach in one of the world’s most prestigious university-backed incubator programs, the Yes!Delft AI/Blockchain Validation Lab. We are a finalist in their selection, and all help from our friends in the music industry before 23 September is more than appreciated. If we get there, we can rely on probably the best pros in Europe to make our offering better tailored and financially sustainable.

Get in touch!

We use the very simple and extremely secure keybase.io, a kind of mix of WhatsApp, Skype, Google Drive, OneDrive and Zoom. You can get in touch with us on that platform at any time here.

You can easily contact Daniel or Kátya on LinkedIn, and of course we have a (usually working) email contact form, too. Our email is name.surname at our main domain.

Video credits

Data acquisition and processing: Daniel Antal, CFA and Marta Kołczyńska, PhD (survey data).
Documentation automation: Sandor Budai
Video art: Line Matson
Music: Moon Moon Moon.

Creating An Automated Data Observatory

Fri, 11 Sep 2020 16:00:39 +0000

We are building data ecosystems, so-called observatories, where scientific, business, policy and civic users can find factual information, data and evidence for their domain. Our open source, open data, open collaboration approach allows us to connect various open and proprietary data sources, and our reproducible research workflows allow us to automate data collection, processing, publication, documentation and presentation.

Every day, our scripts check data sources, such as Eurostat’s Eurobase, Spotify’s API and other music industry sources, for new information. They process any data corrections or new disclosures; interpolate, backcast or forecast missing values; and make currency translations and unit conversions. This is illustrated in an earlier post.

For direct access to the file visit this link.
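
To make the interpolation and currency translation steps mentioned above more concrete, here is a minimal sketch in base R. It is not the observatory's actual code: the data frame, its column names, the values and the exchange rate are all made up for illustration, and simple linear interpolation stands in for whatever methods the real workflow applies.

```r
# Hypothetical yearly indicator with gaps (values invented for illustration).
indicator <- data.frame(
  year  = 2015:2020,
  value = c(100, NA, 120, NA, NA, 150)
)

# Remember which observations are actual and which will be estimated.
known <- !is.na(indicator$value)
indicator$method <- ifelse(known, "actual", "interpolated")

# Linear interpolation of the missing years with stats::approx().
indicator$value_filled <- approx(
  x    = indicator$year[known],
  y    = indicator$value[known],
  xout = indicator$year
)$y

# Currency translation with an assumed, fixed rate (not a real quote).
usd_per_eur <- 1.1
indicator$value_usd <- indicator$value_filled * usd_per_eur

print(indicator)
```

Keeping a `method` column next to the filled values is one way to keep estimated figures distinguishable from observed ones when the data is published.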

In the video we show how the creation of an observatory website with well-formatted statistical data dissemination, a technical document in PDF and an ebook can be automated. In our view, our technology is particularly useful in business and scientific research projects, where it is important that the most timely and correct data is always being analyzed, and that it remains automatically documented and cited. We are ready to deploy public, collaborative, or private data observatories at short notice.

Data processing can account for as much as 80% of the costs of any in-house AI deployment project. We work mainly with organizations that do not have an in-house data science team and that acquire their data from outside the organization anyway. In their case, this share can be as high as 95%, meaning that getting and processing the data for deploying AI can be about 20x more expensive than the AI solution itself.
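
The step from a cost share to a multiplier can be checked with a few lines of R. This is only a worked illustration, assuming the percentages are shares of total project cost; the function name `data_cost_ratio` is ours.

```r
# Ratio of data cost to all remaining cost, given the data share of the total.
data_cost_ratio <- function(data_share) {
  stopifnot(data_share >= 0, data_share < 1)
  data_share / (1 - data_share)
}

data_cost_ratio(0.80)  # about 4: data work costs four times the rest
data_cost_ratio(0.95)  # about 19, i.e. roughly the 20x mentioned above
```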

AI solutions require a large amount of standardized, well-processed data to learn from. We want to radically decrease the cost of data acquisition and processing for our users so that exploiting AI comes within their reach. This is particularly important in one of our target industries, the music industry, where most global sales are algorithmic and AI-driven. Artists, bands, small labels, publishers, and even the national associations of small countries cannot remain competitive if they cannot participate in this technological revolution.

We would like to validate this approach in one of the world’s most prestigious university-backed incubator programs, the Yes!Delft AI/Blockchain Validation Lab.

Video credits

Data acquisition and processing: Daniel Antal, CFA and Marta Kołczyńska, PhD (survey data).
Documentation automation: Sandor Budai
Video art: Line Matson
Music: Moon Moon Moon.

Starting-up

Mon, 24 Aug 2020 10:15:00 +0000

The big day has come: the co-founders signed off the documents at the public notary and started the registration of a reproducible research start-up in Leiden. We got a lot of support from our friends! Your encouragement gives us a lot of energy to accomplish our first milestones, and to get Reprex B.V. going!

Reprex means ‘reproducible example’ in data science. When you are stuck with a problem, creating a reproducible example allows other computer scientists, statisticians, programmers or data users to solve it. In about 80% of cases, you find the solution yourself while creating a generalized example. In the other 20%, you can easily reach out for help.
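
As a small illustration of the idea, here is what a reproducible example might look like in R. The data and the problem are invented for this sketch; the point is that the snippet uses small inline data rather than a private file, so anybody can paste and run it.

```r
# Small inline data instead of a private file, so anyone can run this.
scores <- c(4, 8, NA, 6)

# The surprising result you want to ask about: NA instead of a number.
mean(scores)

# Often found while preparing the example: tell mean() to drop missing values.
mean(scores, na.rm = TRUE)
```

If the question survives this reduction, the snippet can be shared as it is. The reprex R package can help render such a snippet together with its output for posting.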

In the coming days, we are launching demo versions of our headline products, data observatories. music.dataobservatory.eu will be a fully automated online service that every day collects, processes, cleans, and publishes scientifically valid data about European music. Very soon after we will launch two other observatories.

The creative and cultural sector, NGOs, most research institutions and data journalism teams are usually very small, and they do not have internal IT or data science capacities. We would like to provide them with a transparent, high-quality, and fully open source solution to acquire data, process it without errors, document it and make sense of it. We would like to embrace the idea of open collaboration among creative enterprises, scientific researchers, NGOs, data journalists and policymakers with our work.

Our work will comply with the Open Policy Analysis standards developed by the Berkeley Initiative for Transparency in the Social Sciences & Center for Effective Global Action and the four principles of reproducible research: reviewability, replicability, confirmability and auditability. We believe that these standards apply in reproducible finance, empirical evidence presentation in courts, or advocating sound policies and producing high-quality journalism.

Do you want to help our start?

We would like to enter the Validation Lab of one of the best artificial intelligence incubators in early September. Talented team members, letters of intent and assignments from organizations will give a lot of credibility to our start. Meet our team »

Put us in contact with people who love to write code in R and are interested in automating business and social science research and primary data collection such as surveying. Check out what sort of code we create »
Introduce us to people who need data and information to make better-informed decisions and analyses in music, film, book publishing, photography services or socially responsible finance.
Share contacts of data journalists who would like to develop stories from big survey programs like Eurobarometer, Afrobarometer and Latinobarómetro, or base their storytelling on data and its visualizations. See our survey harmonization examples »

Do you know such people? Send over this post or connect us in an email or social media message!

Thanks again for your good wishes and encouragement, and we hope to hear from you soon!
