Notes on (Software Engineering) Research
Now that I’ve finally delivered my PhD thesis (awaiting public defense), and after sharing for more than 3 years my tools, strategies, and tips for surviving in the academic world by word of mouth, I have finally got the time — and energy — to put this in a written form.
While I will try to present these tips and hints in a way that applies to most research fields in computer science, a bias will exist towards software engineering and Internet-of-Things research. This post will be split into 5 parts. Gather will focus on the means to gather knowledge, find relevant research challenges and pending issues, discover trends, and keep up-to-date with the latest publications. Systematize will describe the means by which one can be more efficient at reading and summarizing published works. Apply presents the tools for writing, evolving, and maintaining documents, papers, theses, etc. Publish and Broadcast focus on the publishing process and on broadcasting your own work.
I will try to be as concise as possible, mostly pointing to external resources. If you want some tips and tricks from a process view, I recommend the paper Towards a pattern language for the masters student.
“As knowledge grows in a specific area, solutions are captured and documented by experts in books, research papers, concrete implementations, web pages, and a myriad of other types of communication media. While we may intuitively think that any growth in the body of knowledge implies better design choices, it seems that the way (and the amount) this knowledge is being captured raises an important issue per-se.” — Hugo Sereno Ferreira
You are new, and you are lost. How can you even begin to grasp what is going on? Even if you have a research field and something closer to a research topic… how can you sail the vast sea of literature?
Newsletters and other news feeds/aggregators are a good start, a few examples:
From a technology viewpoint, knowing current market trends is fundamental:
- Gartner Hype Cycles (technology-wise) and Magic Quadrants (enterprise-wise) give a good high-level view of current market directions.
  - Example for IoT, 2020.
  - This analysis is based on internal studies and market analysis and should not be considered ground truth.
- Statista is another known provider of market and consumer data, and an acceptable source of adoption metrics and other studies.
  - Example for IoT, including market size, number of connected devices, etc. Includes forecasts and trend analyses.
From a practice viewpoint:
- Technology Radar by ThoughtWorks is a known source for finding the current adoption stages of different tools and methods.
- An alternative radar is maintained by Zalando.
- GitHub Octoverse contains data about the GitHub community (e.g., most used IDE, languages, etc.)
- JetBrains Developer Ecosystem contains data about the JetBrains developer community (e.g., most used languages, frameworks, etc.)
From a scientific literature viewpoint:
There are two big blobs of literature: (1) the scientific, published, and peer-reviewed literature and (2) the grey literature, consisting of any written resource (e.g., websites, posts, etc.).
For scientific literature, the first go-to solution is the big scientific indexers, the most well-known being:
- Web of Science, which is behind a paywall [1] but is one of the most used indexes (most universities have access to it).
- Scopus, another well-known indexer, also behind a paywall [1]. Once again, most universities have access to it.
- Google Scholar, one of the most widely used indexers, but with a dubious indexing process (any PDF publicly stored on a university network is indexed). Use with caution. Most of the time, systematic literature reviews [2] made using Google Scholar are not considered valid.
Another solution is to search directly within the digital libraries of specific publishers (being limited to the papers published by them): IEEE Xplore, ACM Digital Library, Elsevier, and others. Most of the time, access to the full papers is behind a paywall [1], with a few being open access. Once again, most universities pay for full access.
Searching pre-print servers such as arXiv also provides good insights into unpublished (or yet to be published) research.
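Since arXiv exposes a public query API, keeping an eye on new pre-prints can be scripted. Here goes a minimal sketch in Python, assuming the documented `export.arxiv.org/api/query` endpoint and its Atom response. The function names (`build_query_url`, `parse_feed`, `search_arxiv`) are my own and not part of any official client.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(terms, max_results=5):
    """Build an arXiv API URL that searches all fields for every term."""
    search = " AND ".join(f'all:"{t}"' for t in terms)
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract (title, link) pairs from an Atom feed."""
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.iter(ATOM + "entry"):
        # Titles may be wrapped over several lines, so collapse whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        link = entry.findtext(ATOM + "id", default="")
        results.append((title, link))
    return results


def search_arxiv(terms, max_results=5):
    """Query arXiv over the network and return the parsed results."""
    url = build_query_url(terms, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

Calling `search_arxiv(["internet of things", "software engineering"])` should return the most recent matching pre-prints as `(title, link)` pairs. Be gentle with the request rate if you run it on a schedule.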
From a scientific jobs market (e.g. scholarships) viewpoint:
“We are drowning in information but starved for knowledge.” — John Naisbitt
Too many times, I’ve read full articles just to find out that the paper is poorly done or that it neither provides any new insight nor raises relevant research questions. While there is no silver bullet to avoid this, one can make efforts to reduce the number of times it happens. Here goes a list of steps:
- Prioritize surveys over normal papers in the early stages of your research, given that someone has already summarized the existing works for you.
- Read papers from relevant and solid sources first. This includes highly reputed conferences and journals (more about it in Publish).
- If the paper is badly formatted — e.g., with missing figures and really badly written — avoid it (and revisit if you do not find anything else relevant to your search).
- If a paper does not have an experiments & results section, give it a low priority. Most of the time, these papers only present ideas or early-stage research that lacks validation.
Now that you have only 999+ papers to read, a reference/paper manager becomes crucial. I have been using Mendeley Desktop (old version) for a long time, as it provides reference management, PDF organization, and notes with synchronization capabilities. However, other alternatives exist, including Zotero (which appears similar), Paperpile, and Papers by ReadCube.
Tools setup done. Now to the reading part. A flowchart by Subramanyam et al.:
┌───────────────────────────────────────────────────────┐                   ┌────┐
│Is the Title related to the topic that I'm looking for?├───────────────────► NO │
│Does it have the keywords which I have in mind?        │                   └─┬──┘
└───────────────────────┬───────────────────────────────┘                     │
                        │                                                     │
                     ┌──▼──┐                                        ┌─────────▼──────────┐
                     │ YES │                                        │Skip the article and│
                     └──┬──┘                                        │go to the next one  │
                        │                                           └─────────▲──────────┘
      ┌─────────────────▼──────────────────────┐                              │
      │Read the Abstract / Summary / Conclusion│                              │
      └─────────────────┬──────────────────────┘                              │
                        │                                                     │
           ┌────────────▼─────────────────┐                                   │
           │Clear-cut aims and objectives?│                                   │
           └────────────┬─────────────────┘                                   │
                        │                                                     │
          ┌─────────────▼───────────────────┐                                 │
          │Well-defined research hypothesis?│                                 │
          └─────────────┬───────────────────┘                                 │
                        │                                                     │
          ┌─────────────▼───────────────────┐                                 │
          │Are the conclusions precise?     │                                 │
          └─────────────┬───────────────────┘                                 │
                        │                                                     │
 ┌──────────────────────▼──────────────────────────────────┐                ┌─┴──┐
 │Is the above useful or relevant to what I am looking for?├───────────────►│ NO │
 └──────────────────────┬──────────────────────────────────┘                └────┘
                        │
                        │
                     ┌──▼──┐
                     │ YES │
                     └──┬──┘
                        │
            ┌───────────▼────────────┐
            │Read the entire article.│
            └────────────────────────┘
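If you log your first-pass reading in a script or spreadsheet, the flowchart can be written down as a tiny decision function. This is only a sketch of my reading of it: the `FirstPass` fields and the `triage` function are hypothetical names, and I am assuming a strict interpretation in which the three checks on the abstract (aims, hypothesis, conclusions) all feed the final "useful or relevant?" question.

```python
from dataclasses import dataclass


@dataclass
class FirstPass:
    """Answers collected while skimming a paper (illustrative field names)."""
    title_or_keywords_match: bool
    clear_aims: bool
    well_defined_hypothesis: bool
    precise_conclusions: bool
    relevant_to_my_search: bool


def triage(p: FirstPass) -> str:
    """Return 'skip' or 'read', following the two exits of the flowchart."""
    # First exit: title and keywords do not match, so the paper is never opened.
    if not p.title_or_keywords_match:
        return "skip"
    # Abstract / summary / conclusion were read; under the strict reading
    # assumed here, every check must pass for the paper to count as useful.
    useful = (
        p.clear_aims
        and p.well_defined_hypothesis
        and p.precise_conclusions
        and p.relevant_to_my_search
    )
    # Second exit: not useful or relevant, so move on to the next paper.
    return "read" if useful else "skip"
```

A looser variant could read the paper whenever `relevant_to_my_search` is true and keep the other three answers as notes in your reference manager.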
For other useful tips and hints on reading articles, see How to Read a Paper.
Use markdown. Make backups. Maintain a logbook.
For collaborative quick notes: HackMD.
Create mindmaps of ideas, subjects, and lines of research.
“You don’t start out writing good stuff. You start out writing crap and thinking it’s good stuff, and then gradually, you get better at it. That’s why I say one of the most valuable traits is persistence.” — Octavia E. Butler
Now that you’re up-to-date on the current state-of-the-art, and you have found out that 99.9% of your ideas are already published, you take that 0.1% and put it into action. This is a multi-stage process depending on the work/idea/challenge/issue/whatever, but there are a few common points that can be grasped by analyzing a typical paper structure.
Here goes the summary of the paper: a little context, the main challenge, and the main contribution. It should be supported by numbers or other evidence in a clear and direct form.
Typically, the keywords go here too. Choose keywords that are concrete and commonly used, since indexing services rely on them to decide whether your paper gets picked out of the sea of literature.
Here goes the bedtime story. What’s the context and motivation of your work? Is the motivation supported by other authors? What is the concrete problem to be tackled? What are the methods and mechanisms that will be used? What are the main contributions and observations? And what is the paper structure?
Here is a good place to clearly state your problem, including any existing hypothesis, research questions, and others. Also, point to what metrics/aspects will be used to validate your research.
There is, for sure, a good number of works related to your very specific thing. Read them. Cite them. Criticize them. Compare the different works (use tables, charts, etc.). Find your very specific thing that makes your work valuable and differentiates it from the rest. Remember key points, and, if possible, compare the results of other works with your own observations. Sometimes Related Work can appear later in the paper, but it’s not so common…
In some cases, Related Work can be preceded by a Background / Preliminaries section where you present the key concepts that the reader needs to grasp in order to understand the following work.
What did you do? What is the architecture of your thing? If there are any mathematical formulations or algorithms that should be presented, here is the spot. Try to follow some pseudo-code conventions. If you are presenting math, clearly present the meaning of all the uncommon symbols that are used in your equations. If there are any software-related diagrams, try to follow UML if possible. Draw.io is your friend.
Always have an anonymous replication code package with some documentation and metadata, including how to set it up and how to run it. If you have used Git, there are some services that automatically anonymize your repository, e.g. anonymous.4open.science.
This is optional. But, if you faced some esoteric issues when implementing your approach, maybe it’s best to document them. Here go code snippets, technologies, and other details that were not the focus of the approach but did have an impact on your work. Always use correct code syntax and clear syntax highlighting.
Experiments and Results
Is your solution the best around? How will you validate it? What is the methodology/design of the experiments? What is the configuration of the validation testbed (computer details, etc.)? In the case of human participants, do not forget to gather information about their experience/background.
Collect as many metrics/data points as possible, even if you don’t know if they are going to be useful in the future. Store all of the information in a way that it’s easily handled (e.g., CSV files), and never trust your computer with those valuable files — always keep a cloud-based backup.
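As an illustration of the "store everything in CSV" advice, here goes a small sketch using only the Python standard library. The column names in `FIELDS` and the helper names are assumptions of mine; adapt them to whatever your experiment actually measures.

```python
import csv
from pathlib import Path

# Illustrative schema: one row per measurement, in "long" format.
FIELDS = ["run_id", "timestamp", "metric", "value", "unit"]


def append_measurements(path, rows):
    """Append measurement rows to a CSV file, writing the header only once."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def load_measurements(path):
    """Read every row back as a dict (all values come back as strings)."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Appending one row per measurement, instead of one wide row per run, makes it easy to add a new metric later without rewriting old files. The backup is still on you: sync the output folder to your cloud storage of choice.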
Use and abuse visualizations. Charts, tables, and other relevant visualizations make the data, and your results, easier to understand. Use readily available software to do this analysis, e.g. Google Colab Python Notebooks, IBM SPSS, or good old Excel or Google Sheets.
Always verify the statistical significance (p-value) of your results using the appropriate methods (more info).
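Which test is appropriate depends on your data and experimental design, so take the following only as one possible example: a two-sided permutation test on the difference of means, which makes few distributional assumptions and needs nothing beyond the standard library. The function name and its defaults (`n_resamples`, `seed`) are my own choices.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference of means of a and b.

    The estimate is the share of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A small returned value suggests that the observed difference would be unlikely if both samples came from the same distribution. Report the test you used and the number of resamples alongside the p-value.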
Here goes a critical analysis of your results and what they mean. If possible, compare the results with existing literature justifying differences in the results. Use additional visualizations if needed.
Threats to Validity
There are several factors, especially in software engineering, that might influence your results. Typically, these threats are split into the internal and external validity of the experiment. Clearly present the possible threats (e.g., bias) that exist in your research methodology and discuss the efforts made to mitigate or reduce the impact of these threats. More about threats to validity: Threats to validity of Research Design, Internal and External Validity, and Threats to Validity of your Design.
Here goes a highlight of the most important takeaways of your article. Also, point to limitations and future work / open research challenges.
References are hard to manage. Keep your Mendeley library up-to-date (Overleaf has a direct Mendeley integration). If you maintain your BibTeX manually, use a linter from time to time, e.g. Online BibTeX Tidy - Clean up BibTeX files and BibTeX tools.
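If you only want a quick sanity check before submitting, a few lines of Python can catch the most common problems. This is a naive, regex-based sketch and not a replacement for a proper linter: the `check_bibtex` name and the `REQUIRED` field list are my own assumptions, and entries with unusual formatting may slip through.

```python
import re
from collections import Counter

# Matches the start of an entry, e.g. "@article{some_key,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Illustrative minimum set of fields; not a rule imposed by BibTeX itself.
REQUIRED = ("author", "title", "year")


def check_bibtex(text):
    """Return a list of human-readable warnings for a BibTeX string."""
    warnings = []
    matches = list(ENTRY_RE.finditer(text))
    # Duplicate citation keys silently shadow each other when compiling.
    keys = Counter(m.group(2) for m in matches)
    for key, count in keys.items():
        if count > 1:
            warnings.append(f"duplicate key: {key}")
    # An entry body runs from its header to the start of the next entry.
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end]
        for field in REQUIRED:
            if not re.search(rf"\b{field}\s*=", body, re.IGNORECASE):
                warnings.append(f"{m.group(2)}: missing {field}")
    return warnings
```

Run it on the contents of your `.bib` file and fix whatever it reports. An empty list means none of these specific issues were found, not that the file is clean.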
Publish or Perish. — Authors et al.
Now that you have your research done, your paper written, it is time to publish. But where? What is the most suitable venue? Is it relevant? Is it predatory (also known as pay-to-publish in some shady and badly indexed website)?
Search for good venues, taking well-known rankings into account. For journals, Scimago Journal & Country Rank (Q1 -> Q2 -> Q3 -> Q4 -> nothing important) is the golden rule. For conferences, use CORE (A* -> A -> B -> C -> nothing important). Some conferences have good workshops and other side events, but they are not taken into account by rankings (if you care about publishing even in workshops, always check if the workshops are part of the proceedings or any companion of the conference).
If the conference/journal lets you publish a pre-print, publish it on arXiv as soon as the paper is accepted.
“If a tree falls in a forest and no one is around to hear it, does it make a sound?” — maybe George Berkeley
You have finished your paper, and it is now published. But will somebody read it? While you can’t force anyone to read it, you can increase the probability of it happening. Some key points:
- Publish a tweet with the paper title, authors, and, if possible, some key takeaways;
- Create and maintain a personal website with all your research content, with URLs to the papers and abstracts;
- Create and maintain your personal research accounts:
Participate and present your work when possible. Go to meetups and other informal gatherings. Advertise, advertise, advertise…